News & Events
[Vaccination 2021] SLOG: Serializable, Low-latency, Geo-replicated Transactions (Daniel Abadi)
For decades, applications deployed on a worldwide scale have been forced to give up at least one of (1) strict serializability, (2) low-latency writes, or (3) high transactional throughput. This talk will give an overview of SLOG, a system that avoids this tradeoff for workloads that exhibit physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from ...
NoisePage: The Self-Driving Database Management System (Lin Ma)
Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. The goal of a self-driving DBMS is to remove the DBMS administration impediments by managing itself autonomously. In this talk, I present the design of a new self-driving DBMS (NoisePage) ...
On Automatic Database Management System Tuning Using Machine Learning (Dana Van Aken)
Database management systems (DBMSs) are an essential component of any data-intensive application. But tuning a DBMS to perform well is a notoriously difficult task because DBMSs have hundreds of configuration knobs that control aspects of their runtime behavior, such as cache sizes and how frequently to flush data to disk. Getting the right configuration for these knobs is hard because they are not standardized (i.e., sets of knobs for different DBMSs vary), not independent (i.e., changing one knob may alter ...
[Fall 2020] A Peek into Snowflake’s Scalable Architecture
Snowflake is an analytic data warehouse offered as a fully-managed service in the cloud. It is faster, easier to use, and far more scalable than traditional on-premises data warehouse offerings and is used by thousands of customers around the world. Snowflake's data warehouse is not built on an existing database or "big data" software platform such as Hadoop; it uses a new SQL database engine with a unique architecture designed for the cloud. Snowflake operates three engineering centers in San Mateo, ...
Self-Driving Database Management Systems: Forecasting, Modeling, and Planning (Lin Ma)
Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. Furthermore, they either (1) only focus on a single aspect of the DBMS, (2) are reactive to the workload patterns and shifts, (3) require expensive exploratory testing on data copies, ...
Quarantine DB Talk 2020: ksqlDB: A Stream-Relational Database System
ksqlDB is a distributed event streaming database system that allows users to express SQL queries over relational tables and event streams. The project was released by Confluent in 2017 and is hosted on GitHub and developed with an open-source spirit. ksqlDB is built on top of Apache Kafka, a distributed event streaming platform. In this talk, we discuss ksqlDB's architecture, which is influenced by Apache Kafka and its stream processing library, Kafka Streams. We explain how ksqlDB executes continuous queries, ...
Quarantine DB Talk 2020: TiDB – On the Long Journey of HTAP
Due to the rising demand for real-time analytics and insights on fresh data, the term HTAP has become popular in recent years. From the very beginning, TiDB was designed for pure TP workloads. But gradually, as we adapted to users' requirements, TiDB evolved into an HTAP database based on Raft. We will introduce TiDB's design, internals, and HTAP architectural evolution. This talk is part of the Quarantine Database Tech Talk Seminar Series. Zoom Link: https://cmu.zoom.us/j/562649242 (Password 264771)
Quarantine DB Talk 2020: The Cascades Framework for Query Optimization at Microsoft
The Cascades framework was an academic project introduced 25 years ago as a foundation for modern query optimizers. It provides extensibility, memoization-based dynamic programming, an algebraic representation of logical and physical operator trees, and manipulation of such trees using transformation rules to enable cost-based query optimization. Cascades provides a clean framework/skeleton for optimizer development, but it needs to be instantiated with domain knowledge and augmented in several directions to cope with real-world workloads in an industrial setting. We will describe some ...
Quarantine DB Talk 2020: PlanetScale: Query Planning for a Sharded System like Vitess
Traditional query planning involves parsing an input SQL query into an AST and then transforming it into primitives that can later be sent through an optimizer. However, in a sharded system, each leaf node is a full relational engine that is capable of doing its own optimizations, so the traditional approach may not work for such a system. And who knows whether the final reconstructed query would be correctly optimized by the underlying engine? The Vitess VTGate proxy uses a ...
Quarantine DB Talk 2020: Databricks: A Deep Dive into Spark SQL’s Catalyst Optimizer
Catalyst is the SQL query optimizer in Spark SQL. It is one of the most important components of Apache Spark, as it powers major Spark APIs like SQL, DataFrames/Datasets, as well as Structured Streaming. Unlike many traditional SQL systems, Spark enables users to query data in arbitrary formats stored in arbitrary locations at scale. While powerful, this flexibility also imposes extra query planning challenges, such as statistics collection and cost estimation, which can negatively affect performance. In this talk, we ...