Future File Formats

Columnar storage is a core component of a modern data analytics system. Although many database management systems have proprietary storage formats, most support open-source storage formats such as Apache Parquet and Apache ORC to facilitate cross-platform data sharing. However, these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed.

The Future File Formats project seeks to develop a next-generation open-source columnar storage format that delivers high-performance decoding on advanced hardware while remaining highly portable.

Publications

  1. X. Zeng, R. Meng, A. Pavlo, W. McKinney, and H. Zhang, "NULLS!: Revisiting Null Representation in Modern Columnar Formats," in Proceedings of the 20th International Workshop on Data Management on New Hardware, 2024.

    @inproceedings{zeng24,
      author = {Zeng, Xinyu and Meng, Ruijun and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
      title = {NULLS!: Revisiting Null Representation in Modern Columnar Formats},
      year = {2024},
      doi = {10.1145/3662010.3663452},
      booktitle = {Proceedings of the 20th International Workshop on Data Management on New Hardware},
      articleno = {10},
      numpages = {10},
      series = {DaMoN '24},
      url = {https://db.cs.cmu.edu/papers/2024/zeng-damon24.pdf},
    }

  2. X. Zeng, Y. Hui, J. Shen, A. Pavlo, W. McKinney, and H. Zhang, "An Empirical Evaluation of Columnar Storage Formats," Proc. VLDB Endow., vol. 17, iss. 2, pp. 148-161, 2023.

    @article{zeng23,
      author = {Zeng, Xinyu and Hui, Yulong and Shen, Jiahong and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
      title = {An Empirical Evaluation of Columnar Storage Formats},
      journal = {Proc. {VLDB} Endow.},
      volume = {17},
      number = {2},
      pages = {148--161},
      year = {2023},
      url = {https://www.vldb.org/pvldb/vol17/p148-zeng.pdf},
    }