Neil Shah (Thesis proposal dry-run)
Given the ever-growing prevalence of online social services, usefully leveraging mas-
sive datasets has become an increasingly important challenge for businesses and end-users
alike. Online services capture a wealth of information about user behavior and platform in-
teractions, such as who-follows-whom relationships in social networks and who-rates-what-
and-when relationships in e-commerce networks. Since many of these services rely on data-
driven algorithms to recommend relevant content to their users, authenticity of user behavior
is paramount to success. But given anonymity on the internet, how do we know which users
and actions are real and which are fake? My thesis focuses on this problem and introduces
new techniques to effectively and efficiently discern anomalous and fraudulent behavior in
online social graphs. Specifically, I propose to work on three thrusts: plain graphs, dynamic
graphs and rich graphs.
Firstly, we focus on plain graphs, in which only static connectivity information is known.
We detail several proposed algorithms spanning the topics of spectral-based fraud detection in
a single graph, blame attribution between graph snapshots, and evaluation of the descriptive
power of various graph clustering and partitioning methods in identifying anomalous graph
Next, we broaden our scope to dynamic graphs, in which we are able to leverage connec-
tivity information over a span of time. Many online interactions are timestamped, and thus
time and interarrival time between user actions are powerful features which can be used to
discern the normal from the abnormal. We describe multiple relevant works which describe
how to identify and summarize anomalous temporal graph structures, model interarrival time
patterns in user queries to find anomalous search behavior, and identify “group” anomalies
comprising of users acting in lockstep.
Lastly, we expand our reach to rich graphs, in which connectivity information is supple-
mented by other attributes, such as time, rating, number of messages sent, etc. Rich graphs
are common in practice, as online services routinely track many aspects of user behavior to
gain multifaceted insights. Multimodal views of data are thus useful in identifying various
types of anomalies in different subspaces. To this end, we propose a principled means for
ranking users by their abnormality by leveraging statistical patterns in edge attributes.
The techniques described in this thesis span various disciplines including machine learn-
ing, graph theory, information theory and social science and are practically applicable to a
number of real-world fraud and general anomaly detection scenarios. They are carefully de-
signed and tuned to attain high precision and recall in practice, and be efficient and scale to
massive datasets, including social networks, telecommunication networks, e-commerce and
collaboration networks with up to millions of nodes and billions of edges.