Master Thesis Talk: Supporting Hybrid Workloads for In-Memory Database Management Systems via a Universal Columnar Storage Format
The proliferation of modern data processing ecosystems has given rise to open-source columnar data formats. The key advantage of these formats is that they allow organizations to load data from database management systems (DBMSs) once instead of having to convert it to a new format for each usage. These formats, however, are read-only. This means that organizations must still use a heavyweight transformation process to load data from their original format into the desired columnar format. We aim to reduce or even eliminate this process by developing an in-memory storage management architecture for transactional DBMSs that is aware of the eventual usage of its data and operates directly on columnar storage blocks. We relax some common requirements of analytical formats so that data can be updated efficiently, and rely on a lightweight in-memory transformation process to convert blocks back to analytical form once they are cold. We also describe how third-party analytical tools can access the data directly with minimal serialization overhead. To evaluate our work, we implemented our storage engine based on the Apache Arrow format and integrated it into CMU’s new DBMS. Our experiments show that our approach achieves performance comparable to dedicated OLTP DBMSs while also enabling orders-of-magnitude faster data exports to external data science and machine learning libraries than existing approaches.
Thesis committee members – Andy Pavlo, David Andersen