Archived Events

Mar 1 2021
04:30pm EST
[Vaccination 2021] Inside Apache Druid’s Storage and Query Engine (Gian Merlino)

Apache Druid is an open-source columnar database known for high performance at scale; its largest deployments comprise thousands of servers. But no matter the scale, high performance starts with good fundamentals. This talk will dive into those fundamentals by exploring the inner workings of a single data server. We'll cover how Apache Druid stores data, what kinds of compression it uses, how it indexes data, how the storage engine is linked with the query processing engine, and how the system...

Feb 22 2021
04:30pm EST
[Vaccination 2021] Citus: Distributed PostgreSQL as an Extension (Marco Slot)

One of the defining characteristics of PostgreSQL is its extensibility, which enables developers to add new database functionality without forking from the original project. Citus is an open source PostgreSQL extension that transforms PostgreSQL into a distributed database. The goal of Citus is to make the versatile set of data processing capabilities in PostgreSQL available at any scale. Citus can scale transactional workloads by routing transactions across nodes, and analytical workloads by parallelizing operations across all cores in the cluster....

Feb 15 2021
04:30pm EST
[Vaccination 2021] Star-Tree Index: Space-Time Trade Off in OLAP (Kishore Gopalakrishna)

The need for real-time analytics has proliferated in the modern data landscape. The industry is moving towards providing analytics to end-users via interactive apps instead of traditional dashboards. Whether it's user-facing analytical applications such as LinkedIn's "Who Viewed My Profile" or an internal monitoring tool used by Uber's city ops team to regulate trips in a region, it is imperative that the underlying analytical database is highly performant. For instance, LinkedIn handles 170K queries per second across 70+ user-facing applications....

Feb 8 2021
04:30pm EST
[Vaccination 2021] Performance Testing at MongoDB (David Daly)

It is important for developers to understand the performance of a software project as they develop new features, fix bugs, and try to generally improve the product. While it is simple to state that requirement, it can be hard to do in practice. There are a lot of choices an organization faces when trying to understand the performance of the software, with a lot of opportunities to waste money (execution resources), or worse, time. We have run full speed into...

Feb 3 2021
02:30pm EST
MS Thesis Defense: An Evaluation of Compilation-Based PL/PGSQL Execution (Tanuj Nayak)

User Defined Functions (UDFs) are an important analytical feature in modern Database Management Systems (DBMSs) due to their server-side execution properties. These properties allow complex analytical queries to execute without serializing intermediate data over a network. However, query engines often incur significant overheads when executing UDFs due to them being non-declarative in contrast to SQL queries. This contrast causes a lot of context switching between UDF and SQL execution. As a given UDF invokes more SQL queries, these overheads become...

Feb 1 2021
04:30pm EST
[Vaccination 2021] SLOG: Serializable, Low-latency, Geo-replicated Transactions (Daniel Abadi)

For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability (2) low latency writes (3) high transactional throughput. This talk will overview SLOG: a system that avoids this tradeoff for workloads which contain physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from...

Jan 25 2021
04:30pm EST
NoisePage: The Self-Driving Database Management System (Lin Ma)

Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. The goal of a self-driving DBMS is to remove the DBMS administration impediments by managing itself autonomously. In this talk, I present the design of a new self-driving DBMS (NoisePage)...

Dec 16 2020
12:00pm EST
On Automatic Database Management System Tuning Using Machine Learning (Dana Van Aken)

Database management systems (DBMSs) are an essential component of any data-intensive application. But tuning a DBMS to perform well is a notoriously difficult task because they have hundreds of configuration knobs that control aspects of their runtime behavior, such as cache sizes and how frequently to flush data to disk. Getting the right configuration for these knobs is hard because they are not standardized (i.e., sets of knobs for different DBMSs vary), not independent (i.e., changing one knob may alter...

Dec 14 2020
05:00pm EST
Quarantine DB Talk 2020: TiDB – On the Long Journey of HTAP

Due to the rising demand for real-time analytics and insights on fresh data, the term HTAP has become popular in recent years. From the very beginning, TiDB was designed for pure TP workloads. But gradually, as we adapted to users' requirements, TiDB evolved into an HTAP database based on Raft. We will introduce TiDB's design, internals, and HTAP architectural evolution. This talk is part of the Quarantine Database Tech Talk Seminar Series. Zoom Link: https://cmu.zoom.us/j/562649242 (Password 264771)

Dec 7 2020
03:20pm EST
[Fall 2020] A Peek into Snowflake’s Scalable Architecture
Martin Hentschel, Max Heimel

Snowflake is an analytic data warehouse offered as a fully-managed service in the cloud. It is faster, easier to use, and far more scalable than traditional on-premise data warehouse offerings and is used by thousands of customers around the world. Snowflake's data warehouse is not built on an existing database or "big data" software platform such as Hadoop; it uses a new SQL database engine with a unique architecture designed for the cloud. Snowflake operates three engineering centers in San Mateo,...