[DB Seminar] Spring 2016: Jun Woo Park
Traditional sketches, such as the Bloom filter, the CountMin sketch, and the Space-Saving sketch, estimate set membership, frequency counts, or moments of scalar random variables. In this paper, we extend these approaches to a new family of sketches that approximate moments of vectorial random variables that satisfy convex polytope constraints. One application is the Semidefinite sketch, a succinct way to estimate positive semidefinite matrices obtained from a vectorial data stream. Such a sketch can be used to efficiently estimate covariance matrices for on-line machine learning applications and novelty detection for observations.
To evaluate these sketches, we implemented them in a distributed data store for machine learning and measured their performance and accuracy using different real-world data streams. Our results show that the Semidefinite sketch is able process more than two billion multi-dimensional data points at a rate of 97M operations per second and achieves 2–7× better throughput over other state-of-the-art streaming platforms.