Events: datafusion

Jan 21 2025, 12:00pm EDT
GHC 8115
SplitSQL: Practical Pushdown Cache for DataLake Analytics (Xiangpeng Hao)
Speaker: Xiangpeng Hao
System: DataFusion

Modern data analytics embraces a disaggregated architecture that decouples storage, cache, and compute into independent, network-connected components. With a disaggregated cache, a key design decision is whether to push down query predicates to the cache server. Without predicate pushdown, the cache must send all data to compute nodes, creating network bottlenecks. With predicate pushdown, the cache server evaluates predicates on cached...

Sep 30 2024, 04:30pm EDT
[Building Blocks] Accelerating Apache Spark workloads with Apache DataFusion Comet (Andy Grove)
Speaker: Andy Grove
System: DataFusion
Video: YouTube

Apache Spark is one of the most widely used distributed data analysis frameworks. However, its JVM-based and row-oriented query execution engine limits Spark’s performance and scalability. In this talk, we will introduce DataFusion Comet, an accelerator for Apache Spark designed to improve the efficiency of Spark queries by translating them into native queries that leverage Apache Arrow and Apache DataFusion. We...

Sep 23 2024, 04:30pm EDT
[Building Blocks] Apache Arrow DataFusion: A Fast, Embeddable, Modular Analytic Query Engine (Andrew Lamb)
Speaker: Andrew Lamb
System: DataFusion
Video: YouTube

Apache DataFusion is a fast, embeddable, and extensible query engine written in Rust that uses Apache Arrow as its memory model. In this talk we explain DataFusion in more detail and describe the types of data-centric systems it is used to build. We will also review its high-level architecture and feature set, discussing tradeoffs and performance between DataFusion's...