Pitt/CMU DB Meetup – Spyros Blanas (Ohio State)
Web data are commonly processed using thousands of CPU cores, and large-scale scientific simulations are quickly approaching the one million CPU core mark. At this scale, the barrier to efficient data analysis is commonly the limited bandwidth to the disk. The growing main memory capacities allow data to be intelligently reduced, analyzed and transformed in situ, before being written to disk or transferred over the network. This talk focuses on accelerating data analysis by embedding in-memory processing capabilities within existing libraries and tools.
We first present Pytheas, a prototype system that allows a scientist to leverage sophisticated indexing and query processing capabilities while analyzing data directly in the HDF5 array file format. We find that, by avoiding the data loading step, our system can shorten the time to insight from hours to seconds for a supernova detection workload. When processing the same dataset in parallel, our system is 10x faster than Apache Hive when running on 512 CPU cores. We then show preliminary results from in situ query processing with Cloudera Impala, an open-source, distributed SQL query engine. We find that carefully selecting the in-memory join algorithm can improve performance by nearly one order of magnitude. Finally, we briefly discuss exciting opportunities to better utilize the high-performance interconnects and the parallel file systems that can be found in the modern data center.
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research examines the interactions of database systems and hardware, with a focus on in-memory query execution and transaction processing. Spyros received his Ph.D. from the University of Wisconsin–Madison, where he also worked in the Microsoft Jim Gray Systems Lab. He has a strong interest in seeing research ideas transition into usable products, and part of his doctoral dissertation was commercialized as the "Hekaton" in-memory optimization in Microsoft SQL Server 2014. http://web.cse.ohio-state.edu/~sblanas/
More Info: http://db.cs.pitt.edu/group/node/146