News & Events
Agma Traina and Caetano Traina (University of São Paulo)
The evolution of Relational Database Management Systems must include not only resources to handle big data, but also complex data (such as images, audio, video, graphs, multidimensional data, long texts, time series, genetic sequences, etc.), where order-based comparisons are not appropriate and identity-based comparisons are meaningless. Comparing complex data by similarity stirs much more meaning from the data. However, current RDBMSs do not yet have adequate resources to express and execute similarity comparisons. In this lecture, we will present works Read More
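As a hypothetical illustration of the kind of predicate a similarity query expresses (not the speakers' actual operators), a k-nearest-neighbor search over extracted feature vectors can be sketched in plain Python:

```python
import math

def euclidean(a, b):
    # Complex data (images, time series, etc.) are typically compared
    # through feature vectors; distance stands in for (dis)similarity.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn(query, table, k=2):
    # "The k rows most similar to `query`" -- a question that neither
    # order-based (<) nor identity-based (=) predicates can express.
    return sorted(table, key=lambda row: euclidean(query, row))[:k]

table = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.5, 0.2)]
print(knn((0.0, 0.1), table))  # → [(0.0, 0.0), (0.5, 0.2)]
```

A full-scale system would of course use metric indexes rather than a linear scan, which is part of what makes similarity support in an RDBMS a research problem.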
Miguel Araújo (Thesis defense dry-run)
The identification of anomalies and communities of nodes in real-world graphs has applications in widespread domains, from the automatic categorization of wikipedia articles or websites to bank fraud detection. While recent and ongoing research is supplying tools for the analysis of simple unlabeled data, it is still a challenge to find patterns and anomalies in large labeled datasets such as time evolving networks. What do real communities identified in big networks look like? How can we find sources of infections in bipartite networks? Can we predict who Read More
[PDL/SDI/ISTC] Derek Murray (Google)
TensorFlow is an open-source machine learning system, originally developed by the Google Brain team, which operates at large scale and in heterogeneous environments. TensorFlow trains and executes a variety of machine learning models at Google, including deep neural networks for image recognition and machine translation. The system uses dataflow graphs to represent stateful computations, and achieves high performance by mapping these graphs across clusters of machines containing multi-core CPUs, general-purpose GPUs, and custom-designed ASICs known as Tensor Processing Units (TPUs). Read More
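To make the dataflow idea concrete, here is a toy evaluator for a stateless dataflow graph; this is a sketch of the concept only, not TensorFlow's actual API:

```python
class Node:
    """One operation in a dataflow graph; edges are its inputs."""
    def __init__(self, op, *inputs):
        self.op, self.inputs = op, inputs

    def run(self):
        # Recursively evaluate inputs, then apply this node's op.
        # A real system would instead schedule ready nodes in
        # parallel across CPUs, GPUs, and other devices.
        return self.op(*(i.run() for i in self.inputs))

const = lambda v: Node(lambda: v)
add = lambda a, b: Node(lambda x, y: x + y, a, b)
mul = lambda a, b: Node(lambda x, y: x * y, a, b)

# Graph for (2 + 3) * 4: built first, executed later -- the
# separation that lets a runtime map the graph onto a cluster.
graph = mul(add(const(2), const(3)), const(4))
print(graph.run())  # → 20
```

The key point the abstract alludes to is this build-then-execute split: because the whole computation is reified as a graph, the runtime is free to place and schedule its pieces across heterogeneous hardware.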
Dan Ports (University of Washington)
Today's most popular applications are deployed as massive-scale distributed systems in the datacenter. Keeping data consistent and available despite server failures and concurrent updates is a formidable challenge. Two well-known abstractions, strongly consistent replication and serializable transactions, can free developers from these challenges by transparently masking failures and treating complex updates as atomic units. Yet the conventional wisdom is that these techniques are too expensive to deploy in high-performance systems. I will demonstrate a new approach to designing distributed systems Read More
[HCII Seminar] Michael Franklin (University of Chicago)
The “P” in AMPLab stands for “People”, and an important research thrust in the lab was on integrating human processing into analytics pipelines. Starting with the CrowdDB project on human-powered query answering and continuing into the more recent SampleClean and AMPCrowd/Clamshell projects, we have been investigating ways to maximize the benefit that can be obtained through involving people in data collection, data cleaning, and query answering. In this talk I will present an overview of these projects and discuss some Read More
[MLD Seminar] Jure Leskovec (Stanford University)
Evaluating whether machines improve on human performance is one of the central questions of machine learning. However, there are many domains where the data is selectively labeled in the sense that the observed outcomes are themselves a consequence of the existing choices of the human decision-makers. For instance, in the context of judicial bail decisions, the outcome of whether a defendant fails to return for their court appearance is observed only if the judge decides to release the defendant on Read More
[DB Seminar] Spring 2017: Alex Poms
A growing number of visual computing applications depend on the analysis of large video collections. The challenge is that scaling applications to operate on these datasets requires highly efficient systems for pixel data access and parallel processing. Few programmers have the capability to operate efficiently at these scales, limiting the field's ability to explore new applications that analyze large video data sets. Inspired by the impact of systems such as analytics databases and Spark, we are developing Scanner, a platform Read More
[DB Seminar] Spring 2017: Viktor Leis
Managing data sets that are larger than RAM has always been one of the most important tasks for database systems. Traditional systems cache fixed-size pages in an in-memory buffer pool that has complete knowledge of all page accesses and transparently loads/evicts pages from/to disk. While this approach is effective at minimizing the number of I/O operations, it is also one of the main reasons why disk-based systems are slow. For this reason, main-memory database systems abandon buffer management altogether, which Read More
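As a rough, hypothetical sketch of the classical scheme the talk revisits (not the speaker's proposed design), a fixed-size buffer pool with LRU eviction might look like:

```python
from collections import OrderedDict

class BufferPool:
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk             # page_id -> page (simulated disk)
        self.frames = OrderedDict()  # cached pages in LRU order
        self.io_reads = 0

    def fetch(self, page_id):
        # Every page access goes through the pool, giving it the
        # "complete knowledge" that makes eviction effective -- but
        # also putting bookkeeping on the hot path of every access.
        if page_id in self.frames:
            self.frames.move_to_end(page_id)     # hit: most recent
        else:
            self.io_reads += 1                   # miss: disk I/O
            if len(self.frames) >= self.capacity:
                self.frames.popitem(last=False)  # evict LRU page
            self.frames[page_id] = self.disk[page_id]
        return self.frames[page_id]

disk = {i: f"page-{i}" for i in range(4)}
pool = BufferPool(capacity=2, disk=disk)
for pid in [0, 1, 0, 2, 0, 3]:
    pool.fetch(pid)
print(pool.io_reads)  # → 4 (two hits on page 0, four misses)
```

The per-access indirection and recency bookkeeping visible even in this toy version is exactly the overhead that leads main-memory systems to abandon buffer management.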
[DB Seminar] Spring 2017: Round table discussion
We will have a round table discussion. Read More
[DB Seminar] Spring 2017: Andy Pavlo
Most of the academic papers on concurrency control published in the last five years have assumed the following two design decisions: (1) applications execute transactions with serializable isolation, and (2) applications execute most (if not all) of their transactions using stored procedures. I know this because I am guilty of writing these papers too. But results from a recent survey of database administrators indicate that these assumptions are not realistic. This survey includes both legacy deployments where the cost of Read More