News & Events
[Vaccination 2021] Novel Design Choices in Apache CouchDB (Adam Kocoloski)
Apache CouchDB is a JSON document store with a native HTTP API, server-side JavaScript indexing, and active/active data replication across flexible configurations of server instances that are free to come and go as they please. Under the hood the DBMS is implemented largely in Erlang and features copy-on-write B-trees, hash histories for automatic revision tracking of individual records, and a purely asynchronous index maintenance system. This novel combination of capabilities has been powering web and mobile applications of all shapes Read More
Vaccination 2021 Database Tech Talks
Pittsburgh, PA - The Carnegie Mellon Database Group is hosting a series of online database tech talks in 2021 as we start to get vaccinated. These talks will feature leading researchers and industry developers who are building state-of-the-art systems. CMU-DB's weekly meetings (Mondays @ 4:30pm EST) are available to the public on Zoom. Non-CMU affiliated members of the general public are invited to attend. See the seminar info page for the schedule of upcoming talks. The recordings are available Read More
[Vaccination 2021] Inside Apache Druid’s Storage and Query Engine (Gian Merlino)
Apache Druid is an open-source columnar database known for high performance at scale; its largest deployments comprise thousands of servers. But no matter the scale, high performance starts with good fundamentals. This talk will dive into those fundamentals by exploring the inner workings of a single data server. We'll cover how Apache Druid stores data, what kinds of compression it uses, how it indexes data, how the storage engine is linked with the query processing engine, and how the system Read More
[Vaccination 2021] Citus: Distributed PostgreSQL as an Extension (Marco Slot)
One of the defining characteristics of PostgreSQL is its extensibility, which enables developers to add new database functionality without forking from the original project. Citus is an open source PostgreSQL extension that transforms PostgreSQL into a distributed database. The goal of Citus is to make the versatile set of data processing capabilities in PostgreSQL available at any scale. Citus can scale transactional workloads by routing transactions across nodes, and analytical workloads by parallelizing operations across all cores in the cluster. Read More
[Vaccination 2021] Star-Tree Index: Space-Time Trade Off in OLAP (Kishore Gopalakrishna)
The need for real-time analytics has proliferated in the modern data landscape. The industry is moving towards providing analytics to end-users via interactive apps instead of traditional dashboards. Whether it's user-facing analytical applications such as LinkedIn's "Who Viewed My Profile" or an internal monitoring tool used by Uber's city ops team to regulate trips in a region, it is imperative that the underlying analytical database is highly performant. For instance, LinkedIn handles 170K queries per second across 70+ user-facing applications. Read More
MS Thesis Defense: An Evaluation of Compilation-Based PL/PGSQL Execution (Tanuj Nayak)
User Defined Functions (UDFs) are an important analytical feature in modern Database Management Systems (DBMSs) due to their server-side execution properties. These properties allow complex analytical queries to execute without serializing intermediate data over a network. However, query engines often incur significant overheads when executing UDFs because UDFs are non-declarative, in contrast to SQL queries. This contrast causes a lot of context switching between UDF and SQL execution. As a given UDF invokes more SQL queries, these overheads become Read More
[Vaccination 2021] Performance Testing at MongoDB (David Daly)
It is important for developers to understand the performance of a software project as they develop new features, fix bugs, and try to generally improve the product. While it is simple to state that requirement, it can be hard to do in practice. There are a lot of choices an organization faces when trying to understand the performance of the software, with a lot of opportunities to waste money (execution resources), or worse, time. We have run full speed into Read More
[Vaccination 2021] SLOG: Serializable, Low-latency, Geo-replicated Transactions (Daniel Abadi)
For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability, (2) low-latency writes, or (3) high transactional throughput. This talk will give an overview of SLOG, a system that avoids this tradeoff for workloads that exhibit physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from Read More
NoisePage: The Self-Driving Database Management System (Lin Ma)
Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. The goal of a self-driving DBMS is to remove the DBMS administration impediments by managing itself autonomously. In this talk, I present the design of a new self-driving DBMS (NoisePage) Read More
On Automatic Database Management System Tuning Using Machine Learning (Dana Van Aken)
Database management systems (DBMSs) are an essential component of any data-intensive application. But tuning a DBMS to perform well is a notoriously difficult task because it has hundreds of configuration knobs that control aspects of its runtime behavior, such as cache sizes and how frequently to flush data to disk. Getting the right configuration for these knobs is hard because they are not standardized (i.e., sets of knobs for different DBMSs vary), not independent (i.e., changing one knob may alter Read More