Projects

Future File Formats

Columnar storage is a core component of a modern data analytics system. Although many database management systems have proprietary storage formats, most also support open-source formats such as Apache Parquet and Apache ORC to facilitate cross-platform data sharing. However, these formats were developed in the early 2010s, over a decade ago, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed.

The Future File Formats project seeks to develop a next-generation open-source columnar storage format designed for high-performance decoding on advanced hardware and for high portability.

Publications

  • M. Prammer, X. Zeng, R. Meng, W. McKinney, H. Zhang, A. Pavlo, and J. Patel, "Towards Functional Decomposition of Storage Formats," in CIDR 2025, Conference on Innovative Data Systems Research, 2025.
    @inproceedings{prammer25,
       author = {Prammer, Martin and Zeng, Xinyu and Meng, Ruijun and McKinney, Wes and Zhang, Huanchen and Pavlo, Andrew and Patel, Jignesh},
       title = {{Towards Functional Decomposition of Storage Formats}},
       booktitle = {{CIDR} 2025, Conference on Innovative Data Systems Research},
       year = {2025},
       url = {https://db.cs.cmu.edu/papers/2025/p19-prammer.pdf},
     }
  • X. Zeng, R. Meng, M. Prammer, W. McKinney, J. M. Patel, A. Pavlo, and H. Zhang, "F3: The Open-Source Data File Format for the Future," Proc. ACM Manag. Data, vol. 3, iss. 4, 2025.
    @article{zeng25fff,
       author = {Zeng, Xinyu and Meng, Ruijun and Prammer, Martin and McKinney, Wes and Patel, Jignesh M. and Pavlo, Andrew and Zhang, Huanchen},
       title = {F3: The Open-Source Data File Format for the Future},
       year = {2025},
       issue_date = {September 2025},
       volume = {3},
       number = {4},
       doi = {10.1145/3749163},
       journal = {Proc. ACM Manag. Data},
       month = sep,
       articleno = {245},
       numpages = {27},
       url = {https://db.cs.cmu.edu/papers/2025/zeng-sigmod2025.pdf},
       code = {https://github.com/future-file-format/f3},
     }
  • X. Zeng, R. Meng, A. Pavlo, W. McKinney, and H. Zhang, "NULLS!: Revisiting Null Representation in Modern Columnar Formats," in Proceedings of the 20th International Workshop on Data Management on New Hardware, 2024.
    @inproceedings{zeng24,
       author = {Zeng, Xinyu and Meng, Ruijun and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
       title = {NULLS!: Revisiting Null Representation in Modern Columnar Formats},
       year = {2024},
       doi = {10.1145/3662010.3663452},
       booktitle = {Proceedings of the 20th International Workshop on Data Management on New Hardware},
       articleno = {10},
       numpages = {10},
       series = {DaMoN '24},
       url = {https://db.cs.cmu.edu/papers/2024/zeng-damon24.pdf},
     }
  • X. Zeng, Y. Hui, J. Shen, A. Pavlo, W. McKinney, and H. Zhang, "An Empirical Evaluation of Columnar Storage Formats," Proc. VLDB Endow., vol. 17, iss. 2, pp. 148-161, 2023.
    @article{zeng23,
       author = {Zeng, Xinyu and Hui, Yulong and Shen, Jiahong and Pavlo, Andrew and McKinney, Wes and Zhang, Huanchen},
       title = {An Empirical Evaluation of Columnar Storage Formats},
       journal = {Proc. {VLDB} Endow.},
       volume = {17},
       number = {2},
       pages = {148--161},
       year = {2023},
       doi = {10.14778/3626292.3626298},
       url = {https://www.vldb.org/pvldb/vol17/p148-zeng.pdf},
       code = {https://github.com/XinyuZeng/EvaluationOfColumnarFormats},
     }