MS Thesis Defense: Filter Representation in Vectorized Query Execution (Amadou Ngom)
Advances in memory capacity have allowed Database Management Systems (DBMSs) to store large amounts of data in memory, thereby shifting the performance bottleneck of query execution from disk accesses to CPU efficiency (i.e., instruction count and cycles per instruction). One technique used to achieve such efficiency in analytical applications is batch-oriented processing or vectorization: it reduces interpretation overhead, improves cache locality, and allows for efficient loop optimizations (e.g., loop unrolling, SIMD vectorization). For each vector (i.e., a batch of tuples),...
[DB Seminar] Spring 2020 DB Group: Rockset: Realtime Indexing for fast queries on massive semi-structured data
Rockset is a realtime indexing database that powers fast SQL over semi-structured data such as JSON, Parquet, or XML without requiring any schematization. All data loaded into Rockset are automatically indexed and a fully featured SQL engine powers fast queries over semi-structured data without requiring any database tuning. Rockset exploits the hardware fluidity available in the cloud and automatically grows and shrinks the cluster footprint based on demand. Available as a serverless cloud service, Rockset is used by developers to...
[DB Seminar] Spring 2020 DB Group: Astra: How we built a Cassandra-as-a-Service
At DataStax, we’ve been on a multi-year journey to bring a Cassandra DBaaS to the market, culminating in the GA of Astra in May 2020. In this talk, we’ll share our successes and failures through the iterative journey to GA, our current Kubernetes-based architecture, how we built scalability and reliability into the platform, and how Cassandra’s architecture and implementation affected our design choices for current features like multi-tenancy and influence our future initiatives. Zoom Link: https://cmu.zoom.us/j/562649242 (Password 264771)
[DB Seminar] Spring 2020 DB Group: Another Relational Database, Why and How
There are a lot of relational databases, so a fair question is why we decided to create a new one. The primary reason is trade-offs. Relational databases are optimized for storing a single version of the truth and providing it or updating it with maximum efficiency. More succinctly, they are optimized for being good OLTP stores. They are not optimized to meet the increasingly common need to move structured data from one party (person or entity) to another. The existing...
[DB Seminar] Spring 2020 DB Group: Linux 4.x Tracing (Pre-Recorded)
There is no invited speaker today. We will instead watch this video together: "Linux 4.x Tracing: Performance Analysis with bcc/BPF (eBPF)" by Brendan Gregg, https://youtu.be/w8nFRoFJ6EQ Zoom Password: 264771
[DB Seminar] Spring 2020 DB Group: Testing Cloud-Native Databases with Chaos Mesh
In the world of distributed computing, faults happen to clusters unpredictably, especially when they run in the cloud. To make a distributed database like TiDB resilient enough, chaos engineering is the way to go. At PingCAP, we use Chaos Mesh®, an open-source chaos engineering platform for Kubernetes, to improve the resiliency of TiDB. Chaos Mesh adopts a cloud-native design and currently supports more than 10 chaos types. This talk will mainly introduce Chaos Mesh and how we use it to test...
[DB Seminar] Spring 2020 DB Group: Deepgreen DB: Greenplum at Speed
Greenplum is an open-source Postgres-based MPP solution that can scale to hundreds of nodes and petabytes of data. Deepgreen DB is an optimized version of Greenplum. On top of a mature, market-tested data warehouse, Deepgreen DB adds data-centric code generation for speed, a columnar external data engine, a new interconnect, and SQL-level integration with Go/Python. This talk will mainly recount the challenges of LLVM codegen on PG/GP while maintaining 100% compatibility, a necessity for market acceptance. Zoom Link: https://cmu.zoom.us/j/562649242
[DB Seminar] Spring 2020 DB Group: Finding Logic Bugs in Database Management Systems
Database Management Systems (DBMSs) are used ubiquitously for storing and retrieving data. It is critical that they function correctly; incorrectly computed result sets (e.g., by omitting a row) can cause serious loss or damage. We refer to such defects as logic bugs. Despite their importance, finding logic bugs in production DBMSs is a longstanding challenge. Existing techniques such as fuzzing and differential testing are ineffective in finding them. We have proposed a set of novel techniques to effectively detect...
[DB Seminar] Spring 2020 DB Group: Building Materialize, a Streaming SQL Database powered by Timely Dataflow
Materialize (Materialize.io, GitHub) is a streaming database. Instead of being optimized for processing ad-hoc transactional or analytical queries, it is optimized for view maintenance on an ongoing basis over streams of already processed transactions. Although OLTP and OLAP systems often have support for views, they are not architected to efficiently maintain these views as the data change. Systems designed for view maintenance can often handle substantially higher load for workloads that re-issue the same questions against changing data: they perform...
[DB Seminar] Spring 2020 DB Group: APOLLO: Automatic Detection and Diagnosis of Performance Regressions in Database Systems
The practical art of constructing database management systems (DBMSs) involves a morass of trade-offs among query execution speed, query optimization speed, standards compliance, feature parity, modularity, portability, and other goals. It is no surprise that DBMSs, like all complex software systems, contain bugs that can adversely affect their performance. The performance of DBMSs is an important metric as it determines how quickly an application can take in new information and use it to make new decisions. Both developers and users...