## Projects

# LaPush

Probabilistic inference over large data sets is becoming a central data management problem. Recent large knowledge bases, such as Yago, Nell, DeepDive, or Google’s Knowledge Vault, have millions to billions of uncertain tuples. Data sets with missing values are often “completed” using inference in graphical models models or sophisticated low rank matrix factorization techniques, which ultimately results in a large, probabilistic database.

However, probabilistic inference is known to be #P-hard in the size of the database, even for some very simple queries. Today’s state of the art inference engines use either sampling based methods or are based on some variant of the DPLL algorithm for Weighted Model Counting. For example Tuffy, a popular implementation of Markov Logic Networks (MLN) over relational databases, uses Markov Chain Monte Carlo methods (MCMC). Gibbs sampling can be significantly improved by adapting some classical relational optimization techniques. For another example, MayBMS and its successor Sprout use query plans to guide a DPLL-based algorithm for Weighted Model Counting. While both approaches deploy some advanced relational optimization techniques, at their core they are based on general purpose probabilistic inference techniques, which either run in exponential time (DPLL-based algorithms have been proven recently to take exponential time even for queries computable in polynomial time, or require many iterations until convergence.

In this project, we propose a different approach to approximate probabilistic inference. In our approach, *every query is*

* evaluated entirely in the database engine*. Probability computation is done at query time, using simple arithmetic operations and aggregates. Thus, probabilistic inference is entirely reduced to a standard query evaluation problem with aggregates. There are no iterations and no exponential blowups. All benefits of relational engines (such as cost-based optimizations, multi-core query processing, shared-nothing parallelization) are immediately available to queries over probabilistic databases. To achieve this, we compute *approximate* rather than exact probabilities, with a one-sided guarantee: The probabilities are guaranteed to be upper bounds to the true probabilities, which we show is *sufficient to rank the top query* *answers with high precision*. Our approach consists of approximating the true query probability by evaluating a fixed number of “safe queries” (the number depends on the query), each providing an upper bound on the true probability, then taking their minimum.

### People

### Publications

- W. Gatterbauer and D. Suciu, "Approximate Lifted Inference with Probabilistic Databases,"
*PVLDB*, vol. 8, iss. 5, pp. 629-640, 2015. [PDF] [BIBTEX]`@article{GatterbauerS15,`

author = {Wolfgang Gatterbauer and Dan Suciu},

title = {Approximate Lifted Inference with Probabilistic Databases},

journal = {{PVLDB}},

volume = {8},

number = {5},

pages = {629--640},

year = {2015},

url = {http://www.vldb.org/pvldb/vol8/p629-gatterbauer.pdf},

biburl = {http://dblp.uni-trier.de/rec/bib/journals/pvldb/GatterbauerS15},

} - W. Gatterbauer and D. Suciu, "Oblivious bounds on the probability of Boolean functions,"
*ACM Trans. Database Syst.*, vol. 39, iss. 1, p. 5, 2014. [PDF] [BIBTEX]`@article{gatterbauerS14,`

author = {Wolfgang Gatterbauer and Dan Suciu},

title = {Oblivious bounds on the probability of {B}oolean functions},

journal = {{ACM} Trans. Database Syst.},

year = {2014},

volume = {39},

number = {1},

pages = {5},

url = {http://doi.acm.org/10.1145/2532641},

url2 = {http://arxiv.org/abs/1409.6052},

doi = {10.1145/2532641},

}