Yi Pan (Apache Samza @ LinkedIn)
This talk will provide an overview of LinkedIn’s distributed stream processing platform, including Samza, Kafka, and Databus. It will first cover the high-level scenarios for stream processing at LinkedIn, followed by detailed requirements around scalability, re-processing, accuracy of results, and easy programmability. We will then focus on the requirements of stateful stream processing applications and explain how Samza’s state management allows us to build applications that meet all of these requirements. The key concepts, architecture, and usage in LinkedIn’s stream processing pipeline will be explained, including state management in Samza and the use and configuration of Kafka and Databus as input/output and as a change log. We will also discuss in detail how we leverage a reliable, replayable messaging system (i.e., Kafka) together with durable state management in Samza to build a Lambda-less stream processing platform. The key mechanism for achieving a unified processing model across batch and real-time streams is windowing. We will also dive into the detailed requirements for, and our solutions to, windowing a real-time stream.
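As a rough illustration of the change-log idea mentioned in the abstract, the following is a minimal sketch in plain Python, not Samza's actual API. It assumes a toy in-memory list standing in for a replayable log such as a Kafka topic, and it assumes the position in the input stream is tracked separately. The names `ChangelogStore`, `window_start`, and `count_per_window` are hypothetical and chosen only for this example. It shows per-key counts kept in fixed (tumbling) time windows, with every state write also appended to a log so that a restarted task can rebuild its state by replaying that log.

```python
class ChangelogStore:
    """Toy key-value store that appends every write to a replayable log.

    The list `changelog` is a stand-in for a durable, replayable log;
    it is an assumption of this sketch, not a real Kafka client.
    """

    def __init__(self, changelog=None):
        self.changelog = changelog if changelog is not None else []
        self.state = {}
        # Rebuild local state by replaying the log from the beginning.
        for key, value in self.changelog:
            self.state[key] = value

    def put(self, key, value):
        # Record the write in the log, then apply it to local state.
        self.changelog.append((key, value))
        self.state[key] = value

    def get(self, key, default=0):
        return self.state.get(key, default)


def window_start(timestamp_ms, window_ms):
    """Return the start of the tumbling window containing the timestamp."""
    return timestamp_ms - (timestamp_ms % window_ms)


def count_per_window(events, store, window_ms=60_000):
    """Count events per (window, member) pair, keeping counts in the store."""
    for timestamp_ms, member_id in events:
        key = (window_start(timestamp_ms, window_ms), member_id)
        store.put(key, store.get(key) + 1)


# Hypothetical input: (event time in milliseconds, member id).
events = [(1_000, "a"), (2_000, "a"), (61_000, "a"), (62_000, "b")]

# Process the first two events, then simulate a task restart.
store = ChangelogStore()
count_per_window(events[:2], store)

# The restarted task rebuilds its state from the log and continues.
recovered = ChangelogStore(changelog=store.changelog)
count_per_window(events[2:], recovered)
```

In this sketch the same windowing function would apply whether the events arrive live or are replayed from the start of the log, which is the sense in which windowing can unify batch-style re-processing and real-time processing.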
Yi Pan graduated from UCI with a Ph.D. in Computer Science in 2008. Since then, he has worked on distributed platforms for Internet applications for 8 years. He started at Yahoo! on its NoSQL database project, leading the development of multiple features, such as real-time notification of database updates, secondary indexes, and live migration from legacy systems to the NoSQL database. Later, he joined and led the development of the Cloud Messaging System, which is used heavily at Yahoo! as a pub-sub service and as a transaction log for distributed databases. He joined LinkedIn in 2014 and quickly became the lead of the Apache Samza team, which provides a scalable stream processing service for the whole company.
More Info: http://www.pdl.cmu.edu/SDI/2016/041416.html