[Vaccination 2021] Bodo: Automatic HPC Performance and Scaling for Data Processing in Python (Ehsan Totoni)
Python is the language of choice for machine learning (ML) and AI, but SQL has been used for data processing for decades. Many data applications mix the two languages, which makes development and deployment cumbersome for data teams. BodoSQL addresses this “two-language” problem by compiling Python and SQL code together, providing type checking, error checking, end-to-end optimization, and parallelization across the two languages. Furthermore, BodoSQL uses Bodo’s high-performance computing (HPC) parallel architecture with MPI for execution, delivering high performance and scalability for SQL workloads. This design avoids the parallel overheads of distributed SQL backends, unlocking scaling for 5TB+ datasets on clusters with 500+ cores. We will explain how BodoSQL works internally, discuss some of the optimization tradeoffs, and present performance results.
This talk is part of the Vaccination Database (Second Dose) Tech Talk Seminar Series.
Ehsan is an entrepreneur, computer science researcher, and software engineer working on the democratization of high-performance computing (HPC) for data analytics, AI, and ML. He received his PhD in computer science from the University of Illinois at Urbana-Champaign, where he worked on various aspects of HPC and parallel computing. He then worked as a research scientist at Intel Labs and Carnegie Mellon University, focusing on programming systems that address the gap between programmer productivity and computing performance.
More Info: https://db.cs.cmu.edu/seminar2021-dose2#db4