Counting twig matches in a tree.
Zhiyuan Chen, H. V. Jagadish, Flip Korn, Nick Koudas, S. Muthukrishnan,
Raymond Ng and Divesh Srivastava.
We describe efficient algorithms for accurately estimating
the number of matches of a small node-labeled tree, i.e.,
a twig, in a large node-labeled tree, using a summary data
structure. This problem is of interest for queries on XML
and other hierarchical data, to provide query feedback and
for cost-based query optimization. Our summary data structure
scalably represents approximate frequency information about
twiglets (i.e., small twigs) in the data tree. Given a twig
query, the number of matches is estimated by creating a set
of query twiglets, and combining two complementary approaches:
Set Hashing, used to estimate the number of matches of each
query twiglet, and Maximal Overlap, used to combine the query
twiglet estimates into an estimate for the twig query. We
propose several estimation algorithms that apply these
approaches on query twiglets formed using variations on
different twiglet decomposition techniques. We present an
extensive experimental evaluation using several real XML data
sets, with a variety of twig queries. Our results demonstrate
that accurate and robust estimates can be achieved, even with
limited space.