The data warehouse of newsgroups.
Himanshu Gupta and Divesh Srivastava.
Electronic newsgroups are one of the primary means for the
dissemination, exchange and sharing of information. We
argue that the current newsgroup model is unsatisfactory,
especially when posted articles are relevant to multiple
newsgroups. We demonstrate that considerable additional
flexibility can be achieved by managing newsgroups in a
data warehouse, where each article is a tuple of
attribute-value pairs, and each newsgroup is a view on the
set of all posted articles. Supporting this paradigm for
a large set of newsgroups makes it imperative to efficiently
support a very large number of views: this is the key
difference between newsgroup data warehouses and
conventional data warehouses.
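
As an illustrative sketch of this model (not the authors' implementation), an article can be represented as a dictionary of attribute-value pairs and a newsgroup as a selection predicate over the set of all posted articles. The attribute names (`topic`, `year`), the sample articles, and the helper `view` are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List

Article = Dict[str, object]            # an article: attribute-value pairs
Newsgroup = Callable[[Article], bool]  # a newsgroup: a view (selection predicate)

# Hypothetical articles; the attribute names are assumptions, not from the paper.
articles: List[Article] = [
    {"id": 1, "topic": "databases", "year": 1999},
    {"id": 2, "topic": "databases", "year": 1995},
    {"id": 3, "topic": "networks", "year": 1999},
]


def view(newsgroup: Newsgroup, all_articles: List[Article]) -> List[Article]:
    """Evaluate a newsgroup lazily as a selection over all posted articles."""
    return [a for a in all_articles if newsgroup(a)]


# Two hypothetical newsgroup definitions over the same article set.
recent_db: Newsgroup = lambda a: a["topic"] == "databases" and a["year"] >= 1998
all_1999: Newsgroup = lambda a: a["year"] == 1999

# Article 1 satisfies both definitions, so it appears in both newsgroups
# without having to be cross-posted.
recent_db_ids = [a["id"] for a in view(recent_db, articles)]
all_1999_ids = [a["id"] for a in view(all_1999, articles)]
```

Evaluating every view on demand in this way becomes expensive as the number of newsgroups grows, which is what motivates deciding which views to materialize.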
We identify two complementary problems concerning the
design of such a newsgroup data warehouse. An important
design decision that the system needs to make is which
newsgroup views to eagerly maintain (i.e., materialize).
We demonstrate the intractability of the general
newsgroup-selection problem, consider various natural
special cases of the problem, and present efficient
exact and approximation algorithms, together with hardness
results, for them. A second important task concerns the
efficient incremental maintenance of the eagerly
maintained newsgroups. The newsgroup-maintenance problem
for our model of newsgroup definitions is a more general
version of the classical point-location problem, and we
design an I/O- and CPU-efficient algorithm for this problem.
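
To make the newsgroup-maintenance problem concrete, the following is a naive baseline sketch, assuming newsgroup definitions are arbitrary predicates over an article's attributes. It is not the algorithm of the paper: it tests each new article against every materialized definition in turn, which is exactly the linear cost an efficient algorithm is meant to avoid. The class name `MaterializedNewsgroups` and the attribute names are hypothetical.

```python
from typing import Callable, Dict, List

Article = Dict[str, object]
Newsgroup = Callable[[Article], bool]


class MaterializedNewsgroups:
    """Eagerly maintained newsgroups, updated as each article is posted."""

    def __init__(self, definitions: Dict[str, Newsgroup]) -> None:
        self.definitions = dict(definitions)
        # One stored article list per materialized newsgroup.
        self.contents: Dict[str, List[Article]] = {n: [] for n in self.definitions}

    def post(self, article: Article) -> List[str]:
        """Add the article to every newsgroup whose definition it satisfies.

        Naive baseline: a linear scan over all definitions per article.
        """
        matched = []
        for name, predicate in self.definitions.items():
            if predicate(article):
                self.contents[name].append(article)
                matched.append(name)
        return matched


# Hypothetical definitions and article, for illustration only.
store = MaterializedNewsgroups({
    "recent_db": lambda a: a["topic"] == "databases" and a["year"] >= 1998,
    "all_1999": lambda a: a["year"] == 1999,
    "networks": lambda a: a["topic"] == "networks",
})
matched = store.post({"id": 1, "topic": "databases", "year": 1999})
```

Viewing the article as a point in attribute space and each definition as a region of that space shows the connection to point location: maintenance amounts to finding all regions that contain the new point.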