Projects

Future File Formats

Columnar storage is a core component of a modern data analytics system. Although many database management systems have proprietary storage formats, most support open-source storage formats such as Apache Parquet and Apache ORC to facilitate cross-platform data sharing. However, these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed.

The Future File Formats project seeks to develop a next-generation open-source columnar storage format that offers both high-performance decoding on advanced hardware and high portability.

People

Publications

  1. M. Prammer, X. Zeng, R. Meng, W. McKinney, H. Zhang, A. Pavlo, and J. Patel, "Towards Functional Decomposition of Storage Formats," in CIDR 2025, Conference on Innovative Data Systems Research, 2025.
    @inproceedings{prammer25,
       author = {Prammer, Martin and Zeng, Xinyu and Meng, Ruijun and McKinney, Wes and Zhang, Huanchen and Pavlo, Andrew and Patel, Jignesh},
       title = {{Towards Functional Decomposition of Storage Formats}},
       booktitle = {{CIDR} 2025, Conference on Innovative Data Systems Research},
       year = {2025},
       url = {https://db.cs.cmu.edu/papers/2025/p19-prammer.pdf},
     }
  2. X. Zeng, R. Meng, A. Pavlo, W. McKinney, and H. Zhang, "NULLS!: Revisiting Null Representation in Modern Columnar Formats," in Proceedings of the 20th International Workshop on Data Management on New Hardware, 2024.
    @inproceedings{zeng24,
       author = {Zeng, Xinyu and Meng, Ruijun and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
       title = {{NULLS!: Revisiting Null Representation in Modern Columnar Formats}},
       year = {2024},
       doi = {10.1145/3662010.3663452},
       booktitle = {Proceedings of the 20th International Workshop on Data Management on New Hardware},
       articleno = {10},
       numpages = {10},
       series = {DaMoN '24},
       url = {https://db.cs.cmu.edu/papers/2024/zeng-damon24.pdf},
     }
  3. X. Zeng, Y. Hui, J. Shen, A. Pavlo, W. McKinney, and H. Zhang, "An Empirical Evaluation of Columnar Storage Formats," Proc. VLDB Endow., vol. 17, iss. 2, pp. 148-161, 2023.
    @article{zeng23,
       author = {Zeng, Xinyu and Hui, Yulong and Shen, Jiahong and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
       title = {{An Empirical Evaluation of Columnar Storage Formats}},
       journal = {Proc. {VLDB} Endow.},
       volume = {17},
       number = {2},
       pages = {148--161},
       year = {2023},
       url = {https://www.vldb.org/pvldb/vol17/p148-zeng.pdf},
     }