[DB Talk][DB Seminar] CMU/Pitt Joint DB Monthly Meetup

Date: Wednesday, October 22nd @ 4:30 pm
Location: Room 5317, Sennott Square Building, University of Pittsburgh Campus
Speaker: Spyros Blanas, Assistant Professor, The Ohio State University
Title: Scalable in situ query processing for exploratory data analysis

Abstract:
Web data are commonly processed using thousands of CPU cores, and
large-scale scientific simulations are quickly approaching the one
million CPU core mark. At this scale, the barrier to efficient data
analysis is commonly the limited bandwidth to disk. Growing
main memory capacities allow data to be intelligently reduced,
analyzed and transformed in situ, before being written to disk or
transferred over the network. This talk focuses on accelerating data
analysis by embedding in-memory processing capabilities within
existing libraries and tools.

We first present Pytheas, a prototype system that allows a scientist
to leverage sophisticated indexing and query processing capabilities
while analyzing data directly in the HDF5 array file format. We find
that by avoiding the data loading step our system can shorten the time
to insight from hours to seconds for a supernova detection workload.
When processing the same dataset in parallel, our system is 10X faster
than Apache Hive when running on 512 CPU cores. We then show
preliminary results from in situ query processing with Cloudera
Impala, an open-source, distributed SQL query engine. We find that
carefully selecting the in-memory join algorithm can improve
performance by nearly one order of magnitude. Finally, we briefly
discuss exciting opportunities to better utilize the high-performance
interconnects and the parallel file systems that can be found in the
modern data center.

Brief Bio:

  • Ph.D., University of Wisconsin-Madison
  • Worked at the Microsoft Jim Gray Systems Lab.
  • Research examines the interactions of database systems and hardware, with a focus on in-memory query execution and transaction processing.
  • Has a strong interest in seeing research ideas transition into usable products.
  • Part of his doctoral dissertation was commercialized as the "Hekaton" in-memory optimization in Microsoft SQL Server 2014.