Rethinking Systems for Data-Intensive Computing (Matei Zaharia)
A growing fraction of applications today, from basic business processing to machine learning, are data-intensive: they need to correctly process and produce massive datasets that are too large for any human to inspect. These applications pose many systems challenges, from programming interfaces, to monitoring and debugging (how can a human make sure these applications are working well?), to performance. I’ll talk about several research projects that introduce novel ways to tackle these challenges. On the performance side, many researchers have proposed rewriting existing libraries for speed, but these approaches (e.g., DSL compilers) require substantial engineering effort; my group’s work on annotation-based optimizers (e.g., Split Annotations and TASO) shows how to match their performance by wrapping existing libraries instead. On the debugging and monitoring side, I’ll talk about both research (Model Assertions) and open source industry work (the Delta Lake and MLflow projects at Databricks) that simplify building reliable data-intensive applications. Finally, I’ll talk about some ongoing work to change the architecture of ML models themselves to be more hardware-friendly, specifically the design of “retrieval-based” NLP models, like ColBERT-QA, that use lookups into storage instead of massive computation as an alternative to large DNNs like GPT-3. These models are now setting the state of the art on multiple hard NLP tasks.
Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley, and has worked on other widely used open source data analytics and AI software, including MLflow and Delta Lake. At Stanford, Matei is a co-PI of the DAWN lab focusing on infrastructure for machine learning, where he has developed new systems and algorithms for efficient, reliable and secure machine learning. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE), the highest honor bestowed by the US government on early-career scientists and engineers.