Events

Speaker:: Wei (David) Dai
Date:: Mon Feb 13, 2017 @ 04:45pm EDT
Location:: GHC 8102
Title:: A Data Engineering System for Machine Learning at Scale

Talk Info:

Machine Learning (ML) systems depend on data engineering, the practice of transforming a small set of raw measurements into a large number of features, to substantially increase the accuracy of their results. However, as ML problems grow in both data size (number of instances) and model size (number of dimensions), existing systems that support data engineering have not kept pace: they either fail to run or run very slowly. Sometimes, one-off code can be written to “glue” together several ML software tools, but the resulting special-purpose ML systems are time-consuming to create, brittle to maintain, and difficult for other practitioners to reproduce.

To address this, we propose a systematic and scalable data engineering system that is distributed for scale and speed, and provides a standardized interface of built-in transformations that can be specified in Python or C++. Furthermore, our system integrates with upstream ML training systems to derive new transformations from trained ML models, such as feature outputs from deep neural networks. In our evaluation, our data engineering system achieves throughputs that are 4 to over 100 times higher than Spark on selected transformations.

[DB Seminar] Spring 2017: Wei (David) Dai