Events

Leveraging Optimization-Enabling Properties of User-Defined Functions for Efficient Database Query Execution (Sam Arch)

Speaker:
Sam Arch
Date:
Thu Feb 26, 2026 @ 02:00pm EDT
Location:
GHC 4405
Title:
Leveraging Optimization-Enabling Properties of User-Defined Functions for Efficient Database Query Execution

Talk Info:

After decades of research, analytical database management systems (DBMSs) have become remarkably effective at optimizing and executing SQL queries. However, many users write queries that are not expressed entirely in SQL. Instead, these queries invoke user-defined functions (UDFs): external functions written in non-SQL programming languages such as Python or PL/SQL. UDFs provide software engineering benefits by enabling code reuse and by extending the DBMS’s capabilities to include those of the UDF language. However, UDFs are inherently non-relational, which makes them challenging for DBMSs to reason about and execute efficiently. Effective optimization is also challenging because UDF languages are Turing-complete, allowing UDFs to be arbitrarily complex. Although general-purpose optimization techniques (e.g., compilation and batching) can improve UDF performance, they target arbitrary UDF code and therefore have limited effectiveness. We observe that the most beneficial UDF optimizations (e.g., memoization and inlining) leverage key optimization-enabling properties of UDFs (i.e., how users actually use them in practice).

In this proposal, we present multiple techniques that leverage optimization-enabling properties of UDFs to improve database query execution performance. First, we observe that inlining only the relevant pieces of a UDF improves performance, and leverage UDF decomposability to break UDFs into pieces and hide irrelevant pieces through outlining. Next, we observe that processing all unique UDF inputs simultaneously improves parallelism, and leverage UDF redundancy to build lightweight indexes during query processing to avoid repeated UDF invocations.
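
To make the idea of exploiting UDF redundancy concrete, here is a minimal Python sketch. It is not the speaker's implementation: the function name `apply_udf_deduplicated` is hypothetical, and a plain in-memory dictionary stands in for the lightweight index mentioned above. It assumes the UDF is deterministic and its inputs are hashable.

```python
from typing import Callable, Hashable, List, Sequence, TypeVar

K = TypeVar("K", bound=Hashable)
V = TypeVar("V")


def apply_udf_deduplicated(udf: Callable[[K], V], inputs: Sequence[K]) -> List[V]:
    """Apply `udf` to every input, invoking it once per distinct value.

    Hypothetical helper for illustration only; assumes `udf` is
    deterministic and side-effect free, so results can be reused.
    """
    # Collect the distinct inputs, preserving first-seen order.
    unique_inputs = list(dict.fromkeys(inputs))

    # Evaluate the UDF once per distinct input. Because all unique inputs
    # are known up front, this step could also be batched or parallelized.
    index = {value: udf(value) for value in unique_inputs}

    # Probe the index to produce one result per original input row.
    return [index[value] for value in inputs]
```

For a column with many repeated values, the UDF runs only as many times as there are distinct values, and every remaining row is answered by a dictionary lookup.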

We propose extending our preliminary work by observing that enabling inter-tuple parallelism of UDFs improves query execution performance. We plan to leverage UDF pipelining, the observation that UDFs operate as a pipeline of data transformations over their inputs, to enable fusion and auto-vectorization of pipeline stages. Collectively, the techniques presented in this dissertation will enable an analytical database system to execute queries that contain UDF calls efficiently.

Bio:

Sam Arch is the #1 ranked Ph.D. student in the Carnegie Mellon Database Group. He is half Australian but not single.

More Info: https://csd.cmu.edu/calendar/2026-02-26/doctoral-thesis-proposal-sam-arch