Data Squashing: Constructing Summary Data Sets

William DuMouchel

A "large dataset" is defined here as one that cannot be analyzed with a particular desired combination of hardware and software because of computer memory constraints. DuMouchel et al. (1999) defined "data squashing" as the construction of a smaller substitute dataset that leads to approximately the same analysis results as the large dataset. Formally, data squashing is a type of lossy compression that attempts to preserve statistical information. To be worthwhile, squashing must improve upon the common strategy of taking a random sample from the large dataset. Three recent papers on data squashing are summarized and their results compared.

Keywords: Sampling, Substitute dataset, Moment matching.
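
To make the idea of a substitute dataset concrete, the following is a minimal sketch in Python. It is not the algorithm of DuMouchel et al. (1999) or of any of the papers summarized here; it is a deliberately simplified one-dimensional illustration of moment matching. The function names (squash_1d, weighted_mean), the equal-count binning, the simulated Gaussian data, and the choice of 50 pseudo-points are all assumptions made for this example. Each bin of the sorted data is replaced by a single weighted pseudo-point placed at the bin mean, so the bin counts and bin means (and hence the overall mean) are preserved, whereas a random sample of the same size only approximates them.

```python
import random


def squash_1d(values, n_bins):
    """Replace the data by one weighted pseudo-point per equal-count bin.

    Each pseudo-point is (bin mean, bin count), so the count and the
    first moment of every bin are preserved. Illustrative only.
    """
    ordered = sorted(values)
    n = len(ordered)
    squashed = []
    for b in range(n_bins):
        lo = b * n // n_bins
        hi = (b + 1) * n // n_bins
        chunk = ordered[lo:hi]
        if chunk:  # skip empty bins when n_bins exceeds the data size
            squashed.append((sum(chunk) / len(chunk), len(chunk)))
    return squashed


def weighted_mean(points):
    """Mean of (value, weight) pairs."""
    total_weight = sum(w for _, w in points)
    return sum(x * w for x, w in points) / total_weight


# Simulated "large" dataset (an assumption for the example).
rng = random.Random(0)
full = [rng.gauss(10.0, 2.0) for _ in range(100_000)]

squashed = squash_1d(full, 50)            # 50 weighted pseudo-points
random_sample = rng.sample(full, 50)      # baseline: 50 raw points

full_mean = sum(full) / len(full)
squashed_mean = weighted_mean(squashed)
sample_mean = sum(random_sample) / len(random_sample)

print("full mean:     ", full_mean)
print("squashed mean: ", squashed_mean)
print("sample mean:   ", sample_mean)
```

Any analysis run on the squashed data must accept case weights, since each pseudo-point stands in for many original records. Matching only the first moment per bin is the crudest possible choice; preserving variances or cross-moments would require more pseudo-points per region or higher-order constraints.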