Events

Databricks: A Deep Dive into Spark SQL’s Catalyst Optimizer

Speakers:
Cheng Lian, Maryann Xue
Date:
Mon Oct 12, 2020 @ 05:00pm EDT
Location:
https://cmu.zoom.us/j/562649242?pwd=djhicnFKWHdJM1o0MlFvYzg3SzB5Zz09 (Zoom)
System:
Databricks
Video:
YouTube

Talk Info:

Catalyst is the SQL query optimizer in Spark SQL. It is one of the most important components of Apache Spark because it powers major Spark APIs such as SQL, DataFrames/Datasets, and Structured Streaming. Unlike many traditional SQL systems, Spark enables users to query data in arbitrary formats stored in arbitrary locations at scale. This flexibility is powerful, but it also makes query planning harder: statistics collection and cost estimation become more difficult, which in turn can hurt performance.

In this talk, we will provide an overview of the Catalyst optimizer, with an emphasis on the Adaptive Query Execution feature newly introduced in Apache Spark 3.0, which aims to tackle these challenges by reoptimizing and adjusting query plans based on runtime statistics collected during query execution.

This talk is part of the Quarantine Database Tech Talk Seminar Series.

Zoom Link: https://cmu.zoom.us/j/562649242 (Password 264771)

Bio:

Maryann Xue is a staff software engineer at Databricks and a committer and PMC member of Apache Calcite and Apache Phoenix. Previously, she worked on a number of big data and compiler projects at Intel.

Cheng Lian is an engineering manager at Databricks, a PMC member of Apache Spark, a committer of Apache Parquet, and an exhausted new dad of an adorable baby daughter <3 He previously took an active part in building the first few versions of Spark SQL.

More Info: https://db.cs.cmu.edu/seminar2020/