pin induces a pzip(1) column partition on data files of fixed-length
rows (records) and columns (fields). If a partition file is specified, that partition is refined.
A partition file suitable for use by pin, pzip(1), or pop(1) is listed on the standard output.
The input file is referred to as the training data.
--size allows more than one file argument; otherwise exactly one file must be specified.
Partitions are induced in a three-step process:
- (1) If --partition is not specified, a subset of columns is filtered from the training data
for partitioning. This filtering usually gathers high frequency columns from the data. A high frequency column has data with a high
rate of change between rows. High frequency cutoff rates, specified by --high, typically range from 10% to 50%, depending on
the data. pzip(1) compresses low frequency columns much more efficiently than gzip(1).
- (2) A heuristic search determines an initial ordering of high frequency columns to present to step (3).
Finding an optimal solution for both ordering and partitioning is NP-complete.
- (3) The optimal partition for the ordering from step (2) is determined by a dynamic program that
computes the compressed size for all partitions that preserve order. Faster greedy methods are also available,
but they do not guarantee optimality.
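The dynamic program of step (3) can be sketched as follows. This is only an illustration under stated assumptions, not pin's actual implementation: rows are taken to be equal-length byte strings, zlib stands in for the real compressor, each group is compressed in column-major order, and the names group_cost and optimal_partition are hypothetical.

```python
import zlib


def group_cost(rows, cols):
    """Compressed size of the given columns, laid out column by column.

    zlib is a stand-in compressor chosen for this sketch (an assumption).
    """
    data = bytes(row[c] for c in cols for row in rows)
    return len(zlib.compress(data))


def optimal_partition(rows, order):
    """Split `order` into contiguous groups with minimal total compressed size.

    best[j] holds the smallest total cost for the first j columns of `order`;
    cut[j] records where the last group of that best solution starts.
    """
    n = len(order)
    best = [0] + [None] * n
    cut = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):
            total = best[i] + group_cost(rows, order[i:j])
            if best[j] is None or total < best[j]:
                best[j] = total
                cut[j] = i
    # Walk the recorded cut points backwards to recover the groups.
    groups = []
    j = n
    while j > 0:
        i = cut[j]
        groups.append(order[i:j])
        j = i
    groups.reverse()
    return best[n], groups
```

The sketch evaluates every contiguous group of the ordering, so the number of trial compressions grows quadratically with the number of columns. That is consistent with the long run times noted for large --high values, although the real cost model may differ.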
See pzip(1) for a detailed description of file partitions and column frequencies.
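As a rough illustration of the filtering in step (1), the sketch below treats a column's frequency as the percentage of consecutive row pairs in which that column's byte changes. This definition, the inclusive cutoff comparison, and the function names are assumptions made for the example; pzip(1) remains the authoritative description.

```python
def column_frequencies(rows):
    """Percentage of adjacent row pairs in which each column's byte differs.

    Assumes at least two rows, all of the same length.
    """
    width = len(rows[0])
    pairs = len(rows) - 1
    changes = [0] * width
    for prev, cur in zip(rows, rows[1:]):
        for c in range(width):
            if prev[c] != cur[c]:
                changes[c] += 1
    return [100.0 * n / pairs for n in changes]


def high_frequency_columns(rows, cutoff_percent):
    """Indices of columns whose frequency meets the cutoff (hypothetical rule)."""
    freqs = column_frequencies(rows)
    return [c for c, f in enumerate(freqs) if f >= cutoff_percent]
```

For example, with four two-byte rows in which the first byte changes on every row and the second byte changes once, a cutoff of 50 selects only the first column.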
pin can run for a long time on some data (e.g., 10 minutes on a 400MHz Pentium with
--high 80 --window 4M).
Use --verbose, possibly with --test=010, to monitor progress.