DB Seminar [Fall 2014]: Alex Beutel
As we record growing amounts of increasingly detailed user actions and complex interactions, how can we understand and make use of this vast amount of user data? Doing so requires overcoming a number of technical hurdles: we must be able to understand and model our users, we must be able to handle fraudulent data and adversarial users, and we must be able to scale our learning algorithms and models to big data. My thesis focuses on the intersection of these three research areas and the uncomfortable realities of learning from real-world user data.
First, we study how to design better models of real-world user behavior, focusing primarily on collaborative filtering for online recommendation. In particular, we design more complex models that capture the realities of online ratings, such as polarized perception of products and fraudulent reviews. While classic collaborative filtering research is necessary for modeling user behavior, handling the realities and oddities of online interactions is equally necessary for giving good recommendations.
Second, we investigate novel approaches to spam and fraud detection. As more everyday services depend on modeling user behavior, fraudsters have increasing economic incentives to manipulate these services. Therefore, in order to provide high-quality models of behavior, we must face the fact that our data are not necessarily an honest depiction of user behavior. In this thesis, we focus on new models of spam and fraud that accurately differentiate between natural and suspicious user behavior and that are difficult for spammers to evade while their attacks remain effective.
Last, we describe novel general systems for efficiently performing data mining and machine learning on huge datasets. Behavior modeling today relies on learning from massive numbers of small interactions by many users. This results in large datasets and complex models of user behavior, such as the collaborative filtering and fraud detection models mentioned above. Therefore, to make such models useful in practice, we must solve the problem of scaling their learning to real-world scenarios: massive datasets and shared cloud environments.
While each of the tasks above has a very different body of research behind it, they all fundamentally ask how to analyze and understand the ever-growing amount of user data being recorded. Therefore, we use intuition from each one to inform research decisions in the others. While most of the completed work has aimed to improve performance on each problem separately, the proposed work also focuses on blurring the lines between the problems, e.g., recommendation that is robust to spam and scaling the learning of complex behavioral models.