Efficient approximation of correlated sums on data streams.
Rohit Ananthakrishna, Abhinandan Das, Johannes Gehrke, Flip Korn,
S. Muthukrishnan and Divesh Srivastava.
In many applications such as IP network management, data arrives
in streams, and queries over those streams need to be processed
online using limited storage. Correlated-sum (CS) aggregates are
a natural class of queries formed by composing basic aggregates
on (x,y) pairs, and are of the form SUM{g(y) : x <= f(AGG(x))},
where AGG(x) can be any basic aggregate and f(), g() are
user-specified functions. CS-aggregates cannot be computed
exactly in one pass through a data stream using limited storage;
hence, we study the problem of computing approximate CS-aggregates.
We guarantee a priori error bounds when AGG(x) can be computed
in limited space (e.g., MIN, MAX, AVG), using two variants of
Greenwald and Khanna's summary structure for the approximate
computation of quantiles. Using real data sets, we experimentally
demonstrate that an adaptation of the quantile summary structure
uses much lesser space, and is significantly faster, than a more
direct use of the quantile summary structure, for the same a
posteriori error bounds. Finally, we prove that, when AGG(x) is
a quantile (which cannot be computed over a data stream in limited
space), the error of a CS-aggregate can be arbitrarily large.