[PDL] Package Queries: Scalable Prescriptive Analytics Close to the Data (Matteo Brucato)
Decision making is central to a broad range of domains, including finance, transportation, healthcare, the travel industry, robotics, and engineering. It is often found at the very final step of business analytics--prescriptive analytics--to allow businesses to transform a rich understanding of data, typically provided by advanced predictive models, into actionable decisions. Modeling and solving these problems have relied on application-specific solutions, which are often complex, error-prone, and not generalizable. My goal is to create a domain-independent, declarative approach, supported and...
[Vaccination 2021] HarperDB’s Data Storage Journey: From File System to LMDB (Kyle Bernhardy)
HarperDB is a distributed database with hybrid SQL and NoSQL functionality in one, accessed via a REST API. It is known as a structured object store with SQL capabilities, or NewSQL. HarperDB leverages a logical structure enabling ACID-compliant, efficient storage and retrieval without inconsistency, race conditions, or the use of in-memory indexing. HarperDB is fully indexed and runs on any device from edge to cloud. In this talk we will cover HarperDB's Data Storage Journey. Kyle will review the different steps along the...
[Vaccination 2021] Novel Design Choices in Apache CouchDB (Adam Kocoloski)
Apache CouchDB is a JSON document store with a native HTTP API, server-side JavaScript indexing, and active/active data replication across flexible configurations of server instances that are free to come and go as they please. Under the hood the DBMS is implemented largely in Erlang and features copy-on-write B-trees, hash histories for automatic revision tracking of individual records, and a purely asynchronous index maintenance system. This novel combination of capabilities has been powering web and mobile applications of all shapes...
[Vaccination 2021] Inside Apache Druid’s Storage and Query Engine (Gian Merlino)
Apache Druid is an open-source columnar database known for high performance at scale; its largest deployments comprise thousands of servers. But no matter the scale, high performance starts with good fundamentals. This talk will dive into those fundamentals by exploring the inner workings of a single data server. We'll cover how Apache Druid stores data, what kinds of compression it uses, how it indexes data, how the storage engine is linked with the query processing engine, and how the system...
[Vaccination 2021] Citus: Distributed PostgreSQL as an Extension (Marco Slot)
One of the defining characteristics of PostgreSQL is its extensibility, which enables developers to add new database functionality without forking from the original project. Citus is an open source PostgreSQL extension that transforms PostgreSQL into a distributed database. The goal of Citus is to make the versatile set of data processing capabilities in PostgreSQL available at any scale. Citus can scale transactional workloads by routing transactions across nodes, and analytical workloads by parallelizing operations across all cores in the cluster....
[Vaccination 2021] Star-Tree Index: Space-Time Trade Off in OLAP (Kishore Gopalakrishna)
The need for real-time analytics has proliferated in the modern data landscape. The industry is moving towards providing analytics to end-users via interactive apps instead of traditional dashboards. Whether it's user-facing analytical applications such as LinkedIn's "Who Viewed My Profile" or an internal monitoring tool used by Uber's city ops team to regulate trips in a region, it is imperative that the underlying analytical database is highly performant. For instance, LinkedIn handles 170K queries per second across 70+ user-facing applications....
[Vaccination 2021] Performance Testing at MongoDB (David Daly)
It is important for developers to understand the performance of a software project as they develop new features, fix bugs, and try to generally improve the product. While it is simple to state that requirement, it can be hard to do in practice. There are a lot of choices an organization faces when trying to understand the performance of the software, with a lot of opportunities to waste money (execution resources), or worse, time. We have run full speed into...
MS Thesis Defense: An Evaluation of Compilation-Based PL/PGSQL Execution (Tanuj Nayak)
User Defined Functions (UDFs) are an important analytical feature in modern Database Management Systems (DBMSs) due to their server-side execution properties. These properties allow complex analytical queries to execute without serializing intermediate data over a network. However, query engines often incur significant overheads when executing UDFs because UDFs are non-declarative, in contrast to SQL queries. This contrast causes a lot of context switching between UDF and SQL execution. As a given UDF invokes more SQL queries, these overheads become...
[Vaccination 2021] SLOG: Serializable, Low-latency, Geo-replicated Transactions (Daniel Abadi)
For decades, applications deployed on a world-wide scale have been forced to give up at least one of (1) strict serializability, (2) low-latency writes, and (3) high transactional throughput. This talk will give an overview of SLOG: a system that avoids this tradeoff for workloads which contain physical region locality in data access. SLOG achieves high-throughput, strictly serializable ACID transactions at geo-replicated distance and scale for all transactions submitted across the world, all the while achieving low latency for transactions that initiate from...
NoisePage: The Self-Driving Database Management System (Lin Ma)
Database management systems (DBMSs) are an important part of modern data-driven applications. However, they are notoriously difficult to deploy and administer. There are existing methods that recommend physical design or knob configurations for DBMSs. But most of them require humans to make final decisions and decide when to apply changes. The goal of a self-driving DBMS is to remove the DBMS administration impediments by managing itself autonomously. In this talk, I present the design of a new self-driving DBMS (NoisePage)...