Mike Cafarella (University of Michigan)
Trained systems that apply machine learning to very large datasets, such as web search and IBM’s Watson question-answering system, are among the most important and sophisticated software systems being constructed today. Such trained systems are frequently based on supervised learning tasks that require features: signals extracted from the data that distill complicated raw data objects into a small number of salient values. For example, a good feature for a search engine’s relevance ranker might be the number of times the user’s query term is mentioned in a given Web page. The success of a modern trained system depends substantially on the quality of its features.
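As a minimal sketch of the query-term feature described above, the following Python function counts mentions of a term in a page's text. The function name, the whole-word, case-insensitive matching rule, and the sample page are illustrative assumptions, not part of any system discussed in the talk.

```python
import re


def query_term_count(query_term: str, page_text: str) -> int:
    """Hypothetical feature: number of whole-word, case-insensitive
    occurrences of query_term in page_text."""
    # re.escape keeps any punctuation in the term from being read as regex syntax.
    pattern = r"\b" + re.escape(query_term) + r"\b"
    return len(re.findall(pattern, page_text, flags=re.IGNORECASE))


# Made-up page text, for illustration only.
page = "Zombie speeds up feature engineering. Feature code runs on every page."
print(query_term_count("feature", page))  # prints 2
```

A real relevance ranker would combine many such values into one feature vector per page; this shows only a single feature.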
Unfortunately, feature engineering (the process of writing code that takes raw data objects as input and outputs feature vectors suitable for a machine learning algorithm) is a tedious, time-consuming, miserable experience. Because “big data” inputs are so diverse, feature engineering is often a trial-and-error process that requires many small iterative code changes; because the inputs are so large, each code change can entail a time-consuming data processing task, such as processing each page in a Web crawl. We introduce Zombie, a data-centric system that accelerates feature engineering by performing intelligent input selection, thereby optimizing the “inner loop” of the feature engineering process. It can evaluate a feature engineer’s code much faster than current practice, making the engineer substantially more productive.
More Info: http://www.pdl.cmu.edu/SDI/2014/042414.html