Fast Computation of Sparse Datacubes.
Kenneth A. Ross and Divesh Srivastava.
Datacube queries compute aggregates over database relations at
a variety of granularities, and they constitute an important
class of decision support queries. Real-world data is frequently
sparse, and hence efficiently computing datacubes over large
sparse relations is important. We show that current techniques
for computing datacubes over sparse relations do not scale well
with the number of CUBE BY attributes, especially when the
relation is much larger than main memory.
We propose a novel algorithm for the fast computation of
datacubes over sparse relations, and demonstrate the efficiency
of our algorithm using synthetic, benchmark and real-world
data sets. When the relation fits in memory, our technique
performs multiple in-memory sorts, and does not incur any I/O
beyond the input of the relation and the output of the datacube
itself. When the relation does not fit in memory, a
divide-and-conquer strategy divides the problem of computing the
datacube into several simpler computations of sub-datacubes.
Often, all but one of the sub-datacubes can be computed in
memory and our in-memory solution applies. In that case, the
total I/O overhead is linear in the number of CUBE BY attributes.
We demonstrate with an implementation that the CPU cost of our
algorithm is dominated by the I/O cost for sparse relations.