Archived Events

Nov 30 2020
05:00pm EST
Quarantine DB Talk 2020: The Cascades Framework for Query Optimization at Microsoft
Nico Bruno, Cesar Galindo-Legaria

The Cascades framework was an academic project introduced 25 years ago as a foundation for modern query optimizers. It provides extensibility, memoization-based dynamic programming, an algebraic representation of logical and physical operator trees, and manipulation of such trees using transformation rules to enable cost-based query optimization. Cascades provides a clean framework/skeleton for optimizer development, but it needs to be instantiated with domain knowledge and augmented in several directions to cope with real-world workloads in an industrial setting. We will describe some...

Nov 23 2020
05:00pm EST
Quarantine DB Talk 2020: ksqlDB: A Stream-Relational Database System

ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka, a distributed event streaming platform. In this talk, we discuss ksqlDB's architecture that is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries,...

Nov 16 2020
05:00pm EST
Quarantine DB Talk 2020: Fauna: Lessons Learned Building a Real World, Calvin-based System

Fauna is a NoSQL Database-as-an-API service which supports consistent, global database access for OLTP workloads. While there are many aspects of Fauna which make it unique among similar systems, one in particular is its use of Calvin, a deterministic transaction resolution protocol which underpins its strict-serializability guarantees. This talk will give an overview of Fauna's architecture, why we chose Calvin and the benefits thereby attained, and some lessons learned evolving our system in a real-world production environment where the...

Nov 10 2020
01:00pm EST
Self-Driving Database Management Systems: Forecasting, Modeling, and Planning (Lin Ma)

Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. Furthermore, they either (1) only focus on a single aspect of the DBMS, (2) are reactive to the workload patterns and shifts, (3) require expensive exploratory testing on data copies,...

Nov 9 2020
05:00pm EST
Quarantine DB Talk 2020: EraDB: Designing Systems for Cardinality and Dimensionality

EraDB is a distributed database designed for petabyte-scale, schemaless data that leverages cloud-native object storage for global persistence. In this talk, Todd will discuss the historical origins of EraDB and delve into how it is designed to handle high-cardinality and high-dimensionality data within a flexible, horizontally-scalable architecture. This talk is part of the Quarantine Database Tech Talk Seminar Series. Zoom Link: https://cmu.zoom.us/j/562649242 (Password 264771)

Nov 2 2020
05:00pm EST
Quarantine DB Talk 2020: Refactoring Query Processing in MySQL

MySQL is often called the world's most popular open source DBMS, and it's certainly one of the most used. MySQL grew up with the open source movement and the public Internet and became a part of the famous LAMP stack. Today, MySQL servers are still powering a huge number of web sites. A lot has changed in MySQL in the 25 years since the initial release, but a lot of the core code has also remained almost unchanged. The query...

Oct 26 2020
05:00pm EST
Quarantine DB Talk 2020: Datometry Hyper-Q: Virtualizing the World’s Enterprise Data Warehouses

Enterprises worldwide are looking to move their database applications to the cloud. However, conventional migration from an on-premises data warehouse to a cloud-native one is a costly, labor-intensive task, laden with many risks. According to Gartner, the majority of these migrations are late, run over budget, or fail altogether. Datometry has developed a virtualization platform that enables applications written for an on-premises data warehouse to run on a cloud data warehouse without major rewrites and without rearchitecting. Instead, Datometry Hyper-Q...

Oct 19 2020
05:00pm EST
Quarantine DB Talk 2020: FoundationDB or: How I Learned to Stop Worrying and Trust the Database

Getting multiple entities to work nicely together is a difficult task. This is true for machines as much as it is true for humans. This is why testing and debugging distributed systems is such a hard task. Even if well-known algorithms are used, subtle bugs can introduce catastrophic failures. FoundationDB uses deterministic simulation to test these failures. This is the secret sauce that makes FoundationDB one of the most robust databases on the market. FoundationDB is a distributed key...

Oct 12 2020
05:00pm EST
Quarantine DB Talk 2020: Databricks: A Deep Dive into Spark SQL’s Catalyst Optimizer
Cheng Lian, Maryann Xue

Catalyst is the SQL query optimizer in Spark SQL. It is one of the most important components of Apache Spark, as it powers major Spark APIs like SQL, DataFrames/Datasets, as well as Structured Streaming. Unlike many traditional SQL systems, Spark enables users to query data in arbitrary formats stored in arbitrary locations at scale. While powerful, this also imposes extra query planning challenges such as statistics collection and cost estimation, which in turn negatively affect performance. In this talk, we...

Oct 5 2020
05:00pm EST
Quarantine DB Talk 2020: Apache Arrow Flight: Accelerating Columnar Dataset Transport

In this talk I will discuss the role that Apache Arrow and Arrow Flight are playing to provide a faster and more efficient approach to building data services that transport large datasets. We'll look at the technical details of why the Arrow protocol is an attractive choice and look at specific examples of where Arrow has been employed for better performance and resource efficiency. Finally, I will discuss the implications for databases and the upcoming generation of data systems. This...