
Forgestream: streaming data processing pipeline

Technische Universiteit Eindhoven
  • Computer Science
  • Design
  • Musicology
  • Political Science


Popular social networks such as Facebook, Twitter, Pinterest, Tumblr, Instagram, Foursquare, and Google+ generate enormous amounts of social content every day. Consumer product providers increasingly want to rely on users' social content to learn more about them, personalize their experience, and provide rich social context within their products. However, the existing tools for integrating and leveraging social content in consumer products are very primitive: user content from the Facebook and Twitter APIs arrives as plain text and is hard to process. Consumer product providers lack both the capability to develop sophisticated algorithms for processing this text and an adequate data processing infrastructure.

Jetlore technology can extract and disambiguate important concepts from social text (e.g., references to movies, sports, and food) and classify short social snippets into topics (e.g., politics, music, sports). Jetlore's algorithms utilize social graph signals and are optimized for colloquial language and minimal textual context.

The goal of this project is to create the infrastructure to collect and process social network data streams. The Twitter Firehose stream (the full real-time Twitter stream) carries roughly 300,000 messages per minute. The infrastructure should eventually process that stream, and others as well, which imposes strict requirements on scalability and availability. The infrastructure should also be extensible, so that additional processing units (such as link resolution and statistics collection) can easily be added to the system. At the same time, the processed results should be made available to customers.

In the first part of this project, we designed and implemented a prototype of a stream processing system built on top of the Akka actor framework [1]. The prototype exposed the potential issues (such as reliability of processing and blocking input/output operations) and clarified the design choices.
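The extensibility requirement described above, where processing units such as link resolution and statistics collection can be added without touching the rest of the system, can be illustrated with a minimal sketch. This is not the Akka-based design of the prototype; it is a plain Python illustration, and every name in it (`Pipeline`, `resolve_links`, `StatsCollector`, the message fields) is a hypothetical chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A message is modelled as a plain dictionary (an assumption for this sketch).
Message = Dict[str, object]
Stage = Callable[[Message], Message]


@dataclass
class Pipeline:
    """Runs each message through an ordered list of processing units."""
    stages: List[Stage] = field(default_factory=list)

    def register(self, stage: Stage) -> "Pipeline":
        # Adding a unit does not require changing existing units.
        self.stages.append(stage)
        return self

    def process(self, message: Message) -> Message:
        for stage in self.stages:
            message = stage(message)
        return message


def resolve_links(message: Message) -> Message:
    # Hypothetical unit: only extracts URLs; a real one would follow redirects.
    words = str(message["text"]).split()
    links = [w for w in words if w.startswith("http")]
    return {**message, "links": links}


class StatsCollector:
    """Hypothetical unit: counts messages and passes them through unchanged."""

    def __init__(self) -> None:
        self.seen = 0

    def __call__(self, message: Message) -> Message:
        self.seen += 1
        return message


stats = StatsCollector()
pipeline = Pipeline().register(resolve_links).register(stats)
result = pipeline.process({"text": "Watching the game http://example.com/x"})
```

Each unit is just a callable from message to message, so a new unit is added with one more `register` call. In an actor-based system the same idea would typically map each unit to its own actor, which also keeps blocking input/output in one unit from stalling the others.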
In the second part, we designed and implemented a REST (Representational State Transfer) API [2] that allows customers to query and retrieve processed data. We implemented the main parts of the API: authentication and authorization, rate limiting, versioning, and request validation. The business requirements for data querying and retrieval influenced design decisions in the stream processing system (such as statistics accumulation). In the third part, we designed and implemented Forgestream, the Jetlore stream data processing pipeline. Forgestream and the REST API are both currently deployed in production and serve as the backbone infrastructure of Jetlore's business.
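Of the API components listed above, rate limiting is the easiest to show in isolation. The sketch below uses a token bucket, a common technique for this purpose; the abstract does not say which algorithm the Jetlore API uses, so the choice, the class name `TokenBucket`, and its parameters are all assumptions made for illustration.

```python
import time
from typing import Callable


class TokenBucket:
    """Token-bucket limiter: allows bursts up to `capacity`, then a steady rate."""

    def __init__(
        self,
        capacity: int,
        refill_per_second: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the limiter can be tested without sleeping
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self) -> bool:
        # Refill according to elapsed time, capped at the bucket capacity.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        # Spend one token per request if one is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Usage with a fake clock: two requests fit in the burst, the third is rejected,
# and one more is allowed after a second has passed.
fake_now = [0.0]
bucket = TokenBucket(capacity=2, refill_per_second=1.0, clock=lambda: fake_now[0])
first, second, third = bucket.allow(), bucket.allow(), bucket.allow()
fake_now[0] = 1.0
after_refill = bucket.allow()
```

In a real API, one bucket would typically be kept per authenticated customer, and a rejected request would be answered with HTTP status 429 (Too Many Requests).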
