Archived Events

Oct 21 2025
[Fall 2025] Astronomer / Apache AirFlow Tech Talk
Speaker:
Julian LaNeve
System:
AirFlow

Apache Airflow is the most popular data orchestration tool there is, downloaded over 40m times per month and used to power the data, ML, and AI platforms at OpenAI, Lyft, Airbnb, Uber, and Apple. At its core, Airflow allows you to define data workflows as DAGs using Python. We’ll do a deep dive on how Airflow came to be and... Read More

Oct 20 2025
[Future Data] Where We’re Going, We Don’t Need Rows: Columnar Data Connectivity with ADBC
Speaker:
Ian Cook
System:
Arrow

ADBC (Arrow Database Connectivity) is Apache Arrow’s answer to ODBC and JDBC: It’s a database access API and driver standard that delivers data in Arrow columnar format instead of a row-oriented format. ADBC is on a roll, speeding and simplifying data access for dbt, Databricks, DuckDB, Microsoft, Snowflake, and more. This talk presents the architecture of ADBC (APIs, drivers, driver... Read More

Oct 13 2025
[Future Data] Vortex: LLVM for File Formats
Speaker:
Will Manning
System:
Vortex
Video:
YouTube

Apache Parquet revolutionized columnar storage after its initial release in 2013, but has largely failed to evolve since then. As a result, nearly every Tier 1 tech company has built their own columnar format to replace Parquet. Enter Vortex, a Linux Foundation project that currently achieves 100x faster random access, 10-20x faster scans, and 5x higher write throughput, while maintaining... Read More

Oct 6 2025
[Future Data] DuckLake: Learning from Cloud Data Warehouses to Build a Robust “Lakehouse”
Speaker:
Jordan Tigani
System:
MotherDuck
Video:
YouTube

When building scalable data systems, it is easy to focus on the storage and the compute, but metadata is a critical third piece that is often overlooked. This talk will describe how metadata storage enables query performance and helps provide transactional semantics in modern data warehouses. We will then go into how the metadata story in popular open data formats take... Read More

Sep 29 2025
[Future Data] Apache Hudi: A Database Layer over Cloud Storage for Fast Mutations and Efficient Queries
Speaker:
Vinoth Chandar
System:
Hudi
Video:
YouTube

Data lakes emerged as a way to store vast amounts of data as files and objects on infinitely scalable cloud storage, with processing done on scalable distributed compute engines. However, this architecture lacks many of the capabilities of traditional databases, such as efficient mutations, indexing, and transaction management. Apache Hudi was created as the first "lakehouse" project, to bridge this... Read More

Sep 23 2025
[Fall 2025] On Holistic Database Optimization via Leveraging Similarity Across Actions, Workloads, Configurations, and Scenarios (William Zhang)
Speaker:
William Zhang

Modern database management systems (DBMSs) have evolved to support increasingly sophisticated data-intensive applications, but they have become substantially more complex to configure, for two reasons. First, DBMSs expose a vast configuration space with trillions of possibilities that encompass system knobs, physical design (e.g., indexes), and query options, amongst others. Second, these applications are constantly evolving with changes in data access... Read More

Sep 22 2025
[Future Data] An Extremely Technical Overview of how the Apache Iceberg™ Planning Implementation Actually Works
Speaker:
Russell Spitzer
System:
Iceberg
Video:
YouTube

What are you trying to tell me? That I can read data fast? No, User. I'm trying to tell you that when you are ready, you won't have to. Everyone's heard about how fast Apache Iceberg is, and maybe you've even heard a few notes about "predicate pushdown" and "file metrics", but you've been left wanting more. You want to know... Read More

Sep 16 2025
Industry Affiliates Program Visit 2025 – Day 2

The second day of Carnegie Mellon University's Database Industry Affiliate Program (IAP) Visit Day, held in the Gates-Hillman Center, shifts focus to the industry side, featuring a series of informative sessions presented by member companies. These sessions offer companies the opportunity to showcase their latest innovations, products, and challenges in the database space, while also highlighting potential career opportunities for... Read More

Sep 15 2025
Industry Affiliates Program Visit 2025 – Day 1

The first day of Carnegie Mellon University's Database Industry Affiliate Program (IAP) Visit Day takes place in the Gates-Hillman Center and is focused on showcasing cutting-edge research in the field of databases. The day is filled with a series of research talks delivered by faculty and students from the university's database group. These presentations provide an in-depth look at the... Read More

May 12 2025
DBSP: Incremental Computation on Streams and Its Applications to Databases
Speaker:
Mihai Budiu
System:
Feldera

We describe DBSP, a framework for incremental computation. Incremental computations repeatedly evaluate a function on some input values that are "changing". The goal of an efficient implementation is to "reuse" previously computed results. Ideally, when presented with a new change to the input, an incremental computation should only perform work proportional to the size of the changes of the input,... Read More