Archived Events

Nov 4 2024
04:30pm EST
[Building Blocks] Towards “Unified” Compute Engines: Opportunities and Challenges (Mehmet Ozan Kabak)

The architecture diagram of a typical data and AI infrastructure setup often features a primary compute engine (e.g., Apache Spark) alongside an array of supplementary tools for observability, AI integration, streaming support, memory management, interactivity, and more. While this modular architecture can be effective, it also introduces challenges around performance bottlenecks, maintenance costs, and integration complexity. In this talk, we will explore whether it is possible to simplify such complex architectures by addressing some of the core engine-level limitations that... Read More

Oct 28 2024
04:30pm EST
[Building Blocks] Exon: A Built for Purpose Bioinformatics Database (Trent Hauck)

Without having to implement every component of a database engine, it’s now feasible to build databases that can lean into the idiosyncrasies of specific domains to deliver a better user experience. Exon is one such database. Thanks to DataFusion, Exon can deliver a complete database and also offer capabilities that bridge the gap between bioinformatics and database systems. In this talk I’ll discuss some of the features that make Exon specially adapted to biodata and how those features come about due... Read More

Oct 21 2024
04:30pm EST
[Building Blocks] Accelerating Data and AI with Spice.ai Open-Source Software (Luke Kim)

Spice.ai OSS is an open-source, portable runtime designed to simplify building data and AI applications. It’s built on industry-leading technologies like Apache DataFusion, Apache Arrow, DuckDB and SQLite. In this talk, we tell the story from building neurofeedback systems, to operating DuckDB at cloud scale, to building Spice.ai OSS for the intersection of high-performance data query and ML inference. We introduce Spice.ai OSS, demo some of its capabilities and use cases, explore the design principles and architecture of the platform, and go... Read More

Oct 7 2024
04:30pm EST
[Building Blocks] ParadeDB – Postgres for Search and Analytics (Philippe Noël)

ParadeDB is Postgres for search and analytics. It is an alternative to Elasticsearch built on Postgres. It offers state-of-the-art full-text and vector search capabilities, as well as fast aggregations inside Postgres. ParadeDB is built in Rust via Postgres extensions on top of database building blocks like Tantivy, DuckDB, and Apache DataFusion. It is compatible with every officially supported PGDG Postgres version. In this talk, we'll discuss how we extended Postgres with these building blocks and dive into the technical details... Read More

Oct 1 2024
12:00pm EST
GHC 8115
[DB Seminar] JSON Relational Duality: Converging the worlds of Objects, Documents, and Relational

The "Object-Relational Impedance Mismatch" has been a multi-decade problem for developers, and past solutions have all had various tradeoffs that have compromised efficiency or consistency. JSON Relational Duality is a breakthrough capability that combines the best aspects of the Document and Relational models without the drawbacks of either model. This session will provide an overview and deep dive into the inner workings of JSON Relational Duality. We will also discuss some of the benefits of being able to... Read More

Sep 30 2024
04:30pm EST
[Building Blocks] Accelerating Apache Spark workloads with Apache DataFusion Comet (Andy Grove)

Apache Spark is one of the most widely used distributed data analysis frameworks. However, its JVM-based and row-oriented query execution engine limits Spark’s performance and scalability. In this talk, we will introduce DataFusion Comet, an accelerator for Apache Spark designed to improve the efficiency of Spark queries by translating them into native queries that leverage Apache Arrow and Apache DataFusion. We will explore the core architecture of Comet and explain how Spark plans are translated into native plans and talk about... Read More

Sep 23 2024
04:30pm EST
[Building Blocks] Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (Andrew Lamb)

Apache DataFusion is a fast, embeddable, and extensible query engine written in Rust that uses Apache Arrow as its memory model. In this talk we explain DataFusion in more detail and describe the types of data-centric systems it is used to build. We will also review its high-level architecture and feature set, discussing the tradeoffs and performance of DataFusion's modular design versus the more common tightly coupled designs. This talk is part of the Database Building Blocks Seminar Series. Zoom Link:... Read More

Sep 17 2024
09:30am EST
GHC 6501
Industry Affiliates Program Visit 2024 – Day 2

The second day of Carnegie Mellon University's Database Industry Affiliate Program (IAP) Visit Day, held in the Gates-Hillman Center, shifts focus to the industry side, featuring a series of informative sessions presented by member companies. These sessions offer companies the opportunity to showcase their latest innovations, products, and challenges in the database space, while also highlighting potential career opportunities for students. Attendees, including faculty, students, and other participants, can engage directly with company representatives to learn about real-world applications of... Read More

Sep 16 2024
09:30am EST
GHC 4405
Industry Affiliates Program Visit 2024 – Day 1

The first day of Carnegie Mellon University's Database Industry Affiliate Program (IAP) Visit Day takes place in the Gates-Hillman Center and is focused on showcasing cutting-edge research in the field of databases. The day is filled with a series of research talks delivered by faculty and students from the university's database group. These presentations provide an in-depth look at the latest advancements in database technologies, methodologies, and applications. Attendees, including industry partners, gain valuable insights into innovative projects, ongoing research,... Read More

Sep 12 2024
12:00pm EST
GHC 9115
[Fall 2024] Advancing Database Performance and Capabilities at Snowflake
Dan Sotolongo, Bowei Chen

This talk presents recent research and development at Snowflake aimed at pushing the boundaries of database performance and functionality. In the first section, we will introduce a series of optimizations designed to accelerate query execution within Snowflake’s platform. We will discuss the technical challenges associated with developing general-purpose optimizations and balancing performance improvements across a wide range of workloads. The second section will explore a novel database constraint we’re developing to enable continuous processing applications. A finalization constraint restricts the... Read More