[DB Seminar] Spring 2017: Wei (David) Dai
Machine Learning (ML) systems depend on data engineering – the practice of transforming a small set of raw measurements into a large number of features – to substantially increase the accuracy of their results. However, as ML problems grow in both data size (number of instances) and model size (number of dimensions), existing systems that support data engineering have not kept pace, and either fail to run or run very slowly. Sometimes, one-off code can be written to “glue” together several ML software tools, resulting in special-purpose ML systems that are time-consuming to create, brittle to maintain, and difficult for other practitioners to reproduce.
To address this, we propose a systematic data engineering system that is distributed for scale and speed, and that provides a standardized interface of built-in transformations which can be specified in Python or C++. Furthermore, our system integrates with upstream ML training systems to derive new transformations from trained ML models, such as feature outputs from deep neural networks. In our evaluation, our data engineering system achieves throughputs 4 to over 100 times higher than Spark on selected transformations.
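To make the idea of a standardized transformation interface concrete, the following is a minimal Python sketch; all class and method names (`Transform`, `Bucketize`, `Pipeline`, `apply`) are hypothetical illustrations, not the actual API of the system described in the talk.

```python
# Hypothetical sketch of a standardized feature-transformation interface.
# Names are illustrative only; they are not the system's real API.

class Transform:
    """Base class: maps a raw input row to engineered features."""
    def apply(self, row):
        raise NotImplementedError

class Bucketize(Transform):
    """Map a numeric field into a one-hot bucket feature."""
    def __init__(self, field, boundaries):
        self.field = field
        self.boundaries = boundaries  # sorted bucket edges

    def apply(self, row):
        value = row[self.field]
        # Count how many boundaries the value meets or exceeds.
        idx = sum(b <= value for b in self.boundaries)
        return {f"{self.field}_bucket_{idx}": 1.0}

class Pipeline:
    """Apply a list of transforms to each row. In a distributed
    setting, rows would be sharded across workers and each worker
    would run the same pipeline on its shard."""
    def __init__(self, transforms):
        self.transforms = transforms

    def apply(self, row):
        features = {}
        for t in self.transforms:
            features.update(t.apply(row))
        return features

pipeline = Pipeline([Bucketize("age", [18, 30, 50])])
print(pipeline.apply({"age": 42}))  # {'age_bucket_2': 1.0}
```

Because each transform is a self-contained object with a uniform `apply` contract, such an interface avoids the brittle one-off “glue” code mentioned above: pipelines can be declared once and rerun or shared without rewriting integration logic.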