# [DB Seminar] Fall 2016: Wolfgang Gatterbauer (CMU)

##### Date

##### Time

##### Location

##### Speaker

Performing inference over large uncertain data sets is becoming a central data management problem. Recent large knowledge bases, such as Yago, Nell or DeepDive, have millions to billions of uncertain tuples. Because general reasoning under uncertainty is highly intractable, many state-of-the-art systems today perform approximate inference by reverting to sampling. This talk shows an alternative approach that allows approximate ranking answers to hard probabilistic queries in guaranteed polynomial time, and by using only basic operators of existing database management systems (i.e. no sampling required).

(1) The first part of this talk develops upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds (i.e. when the new probabilities are chosen independent of the probabilities of all other variables). Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space.

(2) The second part then draws the connection to lifted inference and shows how the problem of approximate probabilistic inference can be entirely reduced to a standard query evaluation problem with aggregates. There are no iterations and no exponential blow-ups. All benefits of relational engines (such as cost-based optimizations, multi-core query processing, shared-nothing parallelization) are directly available to queries over probabilistic databases. To achieve this, we compute approximate rather than exact probabilities, with a one-sided guarantee: The probabilities are guaranteed to be upper bounds to the true probabilities, which we show is sufficient to rank the top query answers with high precision. We give experimental evidence on synthetic TPC-H data that this approach can be orders of magnitude faster and also more accurate than sampling-based approaches.

(Based on joint work with Dan Suciu from TODS 2014, VLDB 2015, and VLDBJ 2016: http://arxiv.org/pdf/1409.6052, http://arxiv.org/pdf/1412.1069, http://arxiv.org/pdf/1310.6257)