On computing correlated aggregates over continual data streams.
Johannes Gehrke, Flip Korn and Divesh Srivastava.
In many applications from telephone fraud detection to network
management, data arrives in a stream, and there is a need to maintain
a variety of statistical summary information about a large number of
customers in an online fashion. At present, such applications maintain
basic aggregates such as running extrema values (MIN, MAX), averages,
standard deviations, etc., that can be computed over data streams with
limited space in a straightforward way. However, many applications
require knowledge of more complex aggregates relating different
attributes, so-called correlated aggregates. As an example, one might
be interested in computing the percentage of international phone calls
that are longer than the average duration of a domestic phone call.
Exact computation of this aggregate requires multiple passes over the
data stream, which is infeasible.
We propose single-pass techniques for approximate computation of
correlated aggregates over both landmark and sliding window views of
a data stream of tuples, using a very limited amount of space. We
consider both the case where the independent aggregate (average
duration in the example above) is an extrema value and the case where
it is an average value, with any standard aggregate as the dependent
aggregate; these can be used as building blocks for more sophisticated
aggregates. We present an extensive experimental study based on some
real and a wide variety of synthetic data sets to demonstrate the
accuracy of our techniques. We show that this effectiveness is
explained by the fact that our techniques exploit monotonicity and
convergence properties of aggregates over data streams.