Quarantine DB Talk 2020: Databricks: A Deep Dive into Spark SQL’s Catalyst Optimizer
Catalyst is the SQL query optimizer in Spark SQL. It is one of the most important components of Apache Spark, as it powers major Spark APIs such as SQL, DataFrames/Datasets, and Structured Streaming. Unlike many traditional SQL systems, Spark enables users to query data in arbitrary formats stored in arbitrary locations at scale. While powerful, this flexibility also makes query planning harder, particularly statistics collection and cost estimation, which can in turn hurt query performance.
In this talk, we will provide an overview of the Catalyst optimizer, with an emphasis on the Adaptive Query Execution feature newly introduced in Apache Spark 3.0, which aims to tackle these challenges by reoptimizing and adjusting query plans based on runtime statistics collected during query execution.
This talk is part of the Quarantine Database Tech Talk Seminar Series.
Maryann Xue is a staff software engineer at Databricks and a committer and PMC member of Apache Calcite and Apache Phoenix. Previously, she worked on a number of big data and compiler projects at Intel.
Cheng Lian is an engineering manager at Databricks, a PMC member of Apache Spark, a committer of Apache Parquet, and an exhausted new dad of an adorable baby daughter <3 He previously took an active part in building the first few versions of Spark SQL.
More Info: https://db.cs.cmu.edu/seminar2020/