Miguel Araújo (Thesis defense dry-run)
The identification of anomalies and communities of nodes in real-world graphs has applications in widespread domains, from the automatic categorization of wikipedia articles or websites to bank fraud detection. While recent and ongoing research is supplying tools for the analysis of simple unlabeled data, it is still a challenge to find patterns and anomalies in large labeled datasets such as time evolving networks. What do real communities identified in big networks look like? How can we find sources of infections in bipartite networks? Can we predict who is most likely to join an online discussion on a given topic?
We model interaction networks using appropriate matrix and tensor representations in order to gain insights into these questions. We incorporate edge attributes, such as timestamps in phone-call networks or airline codes in flight networks, and complex node side-information, such as who-retweets-whom in order to understand who uses a given hashtag on Twitter. We provide three major contributions:
1. Hyperbolic communities: Our exploration of real communities provides novel theoretical ideas regarding their structure, and we show how typical long-tail degree distributions can be leveraged to create efficient algorithms on problems that seem to suffer from quadratic explosion.
2. Anomaly detection: Novel distributed algorithms that identify problematic nodes whose influence can only be detected on their neighbors, validated through the analysis of data breaches in bank transactions.
3. Forecasting: New techniques that forecast network evolution by incorporating contextual side-information and the evolution of independent community structures.
Leveraging these techniques, we explore massive datasets such as networks with billions of credit card transactions, Twitter graphs with over 300 million interactions and phone-call networks with over 50 million phone-calls.