Business Intelligence for the Real Time Enterprise

August 31, 2015 - Kohala Coast, Hawaii

Invited Industrial Talks

Twitter Heron: Stream Processing at Scale

By Karthik Ramasamy and Sanjeev Kulkarni (Twitter, Inc.)

Storm has long served as the main platform for real-time analytics at Twitter. However, as the scale of data being processed in real time at Twitter has increased, along with the diversity and number of use cases, many limitations of Storm have become apparent. We need a system that scales better, has better debuggability, has better performance, and is easier to manage - all while working in a shared cluster infrastructure. We considered various alternatives to meet these needs, and in the end concluded that we needed to build a new real-time stream data processing system. This talk will present the design and implementation of the new system, called Heron, which is now the de facto stream data processing engine inside Twitter, and share our experiences from running Heron in production.
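Both Storm and Heron express computations as topologies of spouts (data sources) and bolts (processing stages). As a purely illustrative sketch - plain Python generators standing in for the actual Storm/Heron API, with made-up input data - a word-count topology looks like this:

```python
from collections import Counter

# A "spout" emits a stream of tuples; here, a fixed set of sample tweets.
def tweet_spout():
    for tweet in ["heron scales", "storm to heron", "heron runs hot"]:
        yield tweet

# A "bolt" transforms the stream; this one splits tweets into words.
def split_bolt(stream):
    for tweet in stream:
        for word in tweet.split():
            yield word

# A second bolt keeps running counts - a simple stateful computation.
def count_bolt(stream):
    counts = Counter()
    for word in stream:
        counts[word] += 1
    return counts

counts = count_bolt(split_bolt(tweet_spout()))
print(counts["heron"])  # "heron" appears in all three sample tweets: 3
```

In the real systems, each spout and bolt runs as many parallel tasks distributed across a cluster, with the framework handling tuple routing and failure recovery.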

About Presenters:

Karthik Ramasamy

Karthik is the engineering manager for Real Time Analytics at Twitter. He has two decades of experience working in parallel databases, big data infrastructure and networking. He cofounded Locomatix, a company that specialized in real-time stream processing on Hadoop and Cassandra using SQL and was acquired by Twitter. Before Locomatix, he had a brief stint with Greenplum, where he worked on parallel query scheduling. Greenplum was eventually acquired by EMC for more than $300M. Prior to Greenplum, Karthik was at Juniper Networks, where he designed and delivered platforms, protocols, databases and high availability solutions for network routers that are widely deployed in the Internet. Before joining Juniper, he worked extensively at the University of Wisconsin on parallel database systems, query processing, scale-out technologies, storage engines and online analytical systems. Several of these research projects were later spun out as a company that was acquired by Teradata.

He is the author of several publications and patents, and one of the authors of the best-selling book "Network Routing: Algorithms, Protocols and Architectures." He has a Ph.D. in Computer Science from UW-Madison with a focus on large-scale databases and big data.

Sanjeev Kulkarni

Sanjeev Kulkarni is a Staff Engineer working on the next-generation streaming technologies required by the growing real-time needs of Twitter. Before Twitter he was Vice-President, Engineering at Locomatix, Inc., where he oversaw the building of the Locomatix engine - a high-performance real-time streaming engine that users could access using SQL. Before that he worked at Google, where he was an early member of the team that built the AdSense product. Sanjeev has a Bachelors in Computer Science from the Indian Institute of Technology, Guwahati and a Masters in Computer Science from the University of Wisconsin-Madison.

High-Availability at Massive Scale: Building Google's Data Infrastructure for Ads

By Ashish Gupta and Jeff Shute (Google, Inc.)

Google's Ads Data Infrastructure systems run the multi-billion dollar ads business at Google. High availability and strong consistency are critical for these systems, as they provide business-critical metrics in real time, which in turn power our revenue stream and directly impact user experiences. Over several years and multiple generations, our strategies for availability have evolved considerably. The first generation systems focused primarily on handling machine-level failures automatically. Although datacenter failures are rare, they do happen, and can be very disruptive and difficult to recover from when data consistency is lost. Our next generation systems were designed to support datacenter failovers and recover state cleanly after failover. Furthermore, we implemented automated failure detection and failover, which reduced operational complexity and risk. Failover-based approaches, however, cannot truly achieve high availability, and can have excessive cost for standby resources. Our current approach to solving these problems is to build natively multi-homed systems. Such systems run hot in multiple datacenters all the time, and adaptively move load between datacenters, with the ability to handle outages of any scale completely transparently.

We've recently published details of several of these systems with heavy read/write load (e.g. F1, Mesa, and Photon), all of which have large global state that needs to be replicated across datacenters in real time. F1 is a distributed relational database system that combines high availability, the scalability of NoSQL systems like Bigtable, and the consistency and usability of traditional SQL databases. Mesa is a petabyte-scale data warehousing system that allows real-time data ingestion and queryability, as well as high availability, reliability, fault tolerance, and scalability for large data and query volumes. Finally, Photon is a geographically distributed system for joining multiple continuously flowing streams of data in real time with high scalability and low latency. In this talk, we'll describe our general approaches for, and experiences with, availability in multi-homed data storage and processing systems. We will also share details of a new system, Ubiq, a highly available multi-homed system that scales to extremely high throughput by continuously processing events in small batches in near real time.
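The core problem Photon addresses - joining two continuously flowing streams on a shared key when events from either side may arrive first - can be illustrated with a toy single-process sketch. This is only a conceptual illustration with hypothetical data; the real system is geo-replicated, fault tolerant, and provides exactly-once semantics, none of which this sketch attempts:

```python
def stream_join(events):
    """events: iterable of (stream_name, key, payload) in arrival order.
    Buffers each unmatched event until its partner from the other
    stream arrives, then emits the joined record."""
    pending = {"query": {}, "click": {}}
    other = {"query": "click", "click": "query"}
    joined = []
    for stream, key, payload in events:
        match = pending[other[stream]].pop(key, None)
        if match is None:
            pending[stream][key] = payload  # partner not seen yet; buffer
        elif stream == "click":
            joined.append((key, match, payload))   # (key, query, click)
        else:
            joined.append((key, payload, match))   # (key, query, click)
    return joined

# Hypothetical interleaved arrivals; note the click for q2 arrives
# before its query - a common case in distributed log collection.
arrivals = [
    ("query", "q1", "search: flights"),
    ("click", "q2", "ad-7"),
    ("query", "q2", "search: hotels"),
    ("click", "q1", "ad-3"),
]
print(stream_join(arrivals))
```

A production joiner must additionally bound the buffering (events whose partner never arrives), deduplicate retries, and coordinate across datacenters - the hard parts the talk covers.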

About Presenters:

Ashish Gupta

Ashish Gupta is a Distinguished Software Engineer at Google. He leads the Ads Data Infrastructure group, where he builds data solutions powering Google's multi-billion dollar ads business. He works on numerous very large scale distributed systems problems, including near real-time continuous event processing, distributed relational database systems, a petabyte-scale geo-replicated data warehouse, distributed query engines, query optimization, experiment data analysis, and data visualization. Ashish holds a Masters in Computer Science from the University of Texas at Austin and a Bachelors in Computer Science from IIT Delhi.

Jeff Shute

Jeff Shute has been developing data storage and processing systems in the Ads Data Infrastructure team at Google since 2005. Jeff led the design, implementation and rollout of F1, a distributed relational database system built to scale like cloud-based NoSQL systems but without compromising typical RDBMS features. Jeff studied computer science and mathematics at the University of Waterloo.

Apache Flink: Scalable Stream and Batch Data Processing

By Asterios Katsifodimos and Volker Markl (TU Berlin)

Apache Flink is an open source system for expressive, declarative, fast, and efficient data analysis on both historical (batch) and real-time (streaming) data. Flink combines the scalability and programming flexibility of distributed MapReduce-like platforms with the efficiency, out-of-core execution, and query optimization capabilities found in parallel databases. At its core, Flink builds on a distributed dataflow runtime that unifies batch and incremental computations over a true-streaming, pipelined execution engine. Its programming model allows for stateful, fault-tolerant computations, flexible user-defined windowing semantics for streaming, and unique support for iterations. Flink is converging into a use-case-complete system for parallel data processing, with a wide range of top-level libraries including machine learning and graph processing. Apache Flink originates from the Stratosphere project led by TU Berlin and incorporates the results of various scientific papers published in VLDBJ, SIGMOD, (P)VLDB, ICDE, HPDC, etc.
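The windowing semantics mentioned above let users aggregate an unbounded stream in finite chunks, e.g. tumbling windows of fixed width in event time. A plain-Python sketch of the idea - not the actual Flink API, which is JVM-based and evaluates windows incrementally as events stream in - with made-up timestamps:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_size):
    """events: iterable of (timestamp, key) pairs.
    Assigns each event to the tumbling window containing its
    timestamp and counts occurrences of each key per window."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, key in events:
        window_start = (ts // window_size) * window_size
        windows[window_start][key] += 1
    return {w: dict(counts) for w, counts in sorted(windows.items())}

# Events with timestamps 1..13 fall into windows [0, 10) and [10, 20).
events = [(1, "a"), (3, "b"), (4, "a"), (11, "a"), (13, "b")]
print(tumbling_window_counts(events, window_size=10))
# {0: {'a': 2, 'b': 1}, 10: {'a': 1, 'b': 1}}
```

Flink generalizes this to sliding and session windows, user-defined triggers, and out-of-order event-time streams, with window state checkpointed for fault tolerance.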

About Presenters:

Asterios Katsifodimos

Asterios Katsifodimos is a postdoctoral researcher co-leading the Stratosphere project at the Technische Universität Berlin. His work focuses on programming models and optimisation for data-parallel analytics. Asterios received his PhD in 2013 from INRIA Saclay and Université Paris-Sud. His thesis focused on materialized-view-based techniques for the management of Web data. Asterios has been a member of the High Performance Computing Lab at the University of Cyprus, where he obtained his BSc and MSc degrees in 2009. His research interests include language models for data-parallel processing, query optimization, materialized views, and distributed systems.

Volker Markl

Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) group at the Technische Universität Berlin (TU Berlin), as well as an adjunct status-only professor at the University of Toronto. Earlier in his career, Dr. Markl led a research group at FORWISS, the Bavarian Research Center for Knowledge-based Systems in Munich, Germany, and was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 7 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been speaker and principal investigator of the Stratosphere research project that resulted in the Apache Flink big data analytics system. Dr. Markl currently serves as the secretary of the VLDB Endowment and was recently elected as one of Germany's leading "digital minds" (Digitale Köpfe) by the German Informatics Society (GI).