[Future Data] Multi-statement Transactions in the Databricks Lakehouse
- Speaker:
- Ryan Johnson
- Date:
- Mon Nov 3, 2025 @ 04:30pm EST
- Location:
- https://cmu.zoom.us/j/96274590594?pwd=ZIhPZi8CFwaVd5kN9sS5uEiuWanTCa.1 (Zoom)
- Title:
- Multi-statement Transactions in the Databricks Lakehouse
- System:
- Delta Lake
Talk Info:
The data lake architecture originally focused on self-standing tables in cloud storage, with catalogs as mere discovery aids. Modern lakehouse architectures add an ever-growing set of data warehousing capabilities to that original value proposition. Historically, a key missing piece was multi-statement transactions: Delta Lake supported single-statement, single-table transactions, with ACID properties for changes made to that table. Sophisticated MERGE operations allowed for read-modify-write, but there was no mechanism for coordinating longer-running transactions across multiple tables. In this talk, we'll take a look "behind the curtain" at the Databricks journey to multi-statement transactions, announced at DAIS earlier this year. We'll learn how they work, but we'll also explore the "messy middle": things you probably won't ever read in a SIGMOD paper, but which take significant thought and effort when delivering a broadly cross-cutting new feature to an existing platform.
This talk is part of the Future Data Systems Seminar Series.
Bio:
Ryan Johnson (PhD'10) is a principal engineer at Databricks, working on metadata management for the Delta Lake table format. Past lives include metadata and storage management at AWS Redshift, a start-up building a Datalog-based query engine where he discovered that PODS does actually matter, and a stint as a professor of database systems at the University of Toronto, where he played with everything from HTAP to distributed log shipping to building an OLTP engine on an FPGA. He is a Distinguished CMU-DB alumnus.