pin - induce a pzip partition on fixed record data
pin [ options ] file ...
pin induces a pzip(1)
column partition on data files of fixed length
rows (records) and columns (fields). If a partition file is specified then that partition is refined. A partition file, suitable
for use by pin,
is listed on the standard output. The input file
is referred to as the training data.
--size allows more than one file
argument; otherwise exactly one file must be specified.
Partitions are induced in a three step process:
- If --partition is not specified then a subset of columns is filtered from the training data
for partitioning. This filtering usually gathers high frequency columns from the data. A high frequency column has data with a high
rate of change between rows. High frequency cutoff rates, specified by --high, typically range from 10% to 50%, depending on
the data. pzip(1) compresses low frequency columns much more efficiently than gzip(1).
- A heuristic search determines an initial ordering of high frequency columns to present to step (3).
Finding an optimal solution for both ordering and partitioning is NP-complete.
- The optimal partition for the ordering from step (2) is determined by a dynamic program that
computes the compressed size for all partitions that preserve order. Alternative greedy methods are also available that work more
quickly but do not guarantee optimality.
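Step (3) can be illustrated with a small sketch. This is not pin's implementation: it assumes each group is compressed as a row-major sub-table, uses zlib as a stand-in for gzip(1), and the names group_cost, optimal_partition, and max_group (mirroring --group) are hypothetical.

```python
import zlib

def group_cost(rows, cols, level=6):
    """Compressed size of one group: the selected columns, row by row.

    Assumption: a group is compressed as a row-major sub-table.
    """
    data = b"".join(bytes(r[c] for c in cols) for r in rows)
    return len(zlib.compress(data, level))

def optimal_partition(rows, order, max_group=None, level=6):
    """Best split of `order` into contiguous groups (order is preserved)."""
    n = len(order)
    max_group = max_group or n
    best = [0] + [None] * n      # best[i]: minimum cost for order[:i]
    cut = [0] * (n + 1)          # cut[i]: start index of the last group
    for i in range(1, n + 1):
        for j in range(max(0, i - max_group), i):
            cost = best[j] + group_cost(rows, order[j:i], level)
            if best[i] is None or cost < best[i]:
                best[i], cut[i] = cost, j
    # Walk the cut points backwards to recover the groups.
    groups = []
    i = n
    while i > 0:
        groups.append(order[cut[i]:i])
        i = cut[i]
    groups.reverse()
    return best[n], groups
```

Because every contiguous split is considered, the result is optimal for the given ordering under this cost model. A smaller max_group reduces the number of compressions at the risk of a worse total.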
See pzip(1) for a detailed description of file partitions and column frequencies.
pin can run for a long time on some data (e.g., 10 minutes on a 400 MHz Pentium with --high 80 --window 4M).
Use --verbose, possibly with --test=010,
to monitor progress.
- -b, --bzip
Use bzip(1) compression instead of the default
gzip(1). bzip is not yet fully supported by pzip.
- -c, --cache
Generate some information on file that can be reused on another invocation; this
information is saved in pin-base, where base is the base name (no directory) of file. Saved information
includes column frequencies and singleton and pairwise column gzip rates.
- -g, --group=columns
Sets the maximum number of columns in any partition group. Lower values speed up
the dynamic program but may also produce sub-optimal solutions. The default value is row-size.
- -h, --high=columns
Select this number of columns with the highest frequencies for the columns in the
partition. If columns is followed by `%' then columns with frequencies larger than this percentage are selected. The default
value is 10%.
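The two forms of --high can be sketched as follows. The exact frequency measure pin uses is not given here, so this sketch assumes that a column's frequency is the fraction of adjacent row pairs in which its byte changes. The names column_frequencies and select_high are hypothetical.

```python
def column_frequencies(rows):
    """Fraction of adjacent row pairs in which each column's byte differs."""
    width = len(rows[0])
    changes = [0] * width
    for prev, cur in zip(rows, rows[1:]):
        for c in range(width):
            if prev[c] != cur[c]:
                changes[c] += 1
    pairs = max(len(rows) - 1, 1)   # avoid dividing by zero for one row
    return [n / pairs for n in changes]

def select_high(freqs, high="10%"):
    """Return the column offsets chosen by a --high style argument."""
    if high.endswith("%"):
        # Percentage form: keep columns whose frequency exceeds the cutoff.
        cutoff = float(high[:-1]) / 100.0
        return [c for c, f in enumerate(freqs) if f > cutoff]
    # Count form: keep the given number of highest frequency columns.
    count = int(high)
    ranked = sorted(range(len(freqs)), key=lambda c: (-freqs[c], c))
    return sorted(ranked[:count])
```

For example, with rows whose first column changes on every row and whose second column never changes, the first column has frequency 1.0 and the second 0.0, so only the first is selected at any cutoff.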
- -l, --level=level
Sets the gzip compression level to level. Levels range from 1 (fastest,
worst compression) to 9 (slowest, best compression). The default value is 6.
- -m, --maxhigh=maxhigh
Exit with exit code 3 if the number of high frequency columns exceeds maxhigh.
If maxhigh is followed by `%' then the limit is maxhigh percent of the total number of columns. The default
value is 40%.
- -o, --sort
Sort the window data by row before inducing the partition.
- -p, --partition=file
Specifies the data row size and the high frequency column partition groups and
permutation. The partition file is a sequence of lines. Comments start with # and continue to the end of the line. The first
non-comment line specifies the optional name string in "...". The next non-comment line specifies the row size. The remaining lines
operate on column offset ranges of the form: begin[-end] where begin is the beginning column offset (starting
at 0), and end is the ending column offset for an inclusive range. The operators are:
- range [...]
places all columns in the specified range list in the same high
frequency partition group. Each high frequency partition group is processed as a separate block by the underlying compressor
(gzip(1) by default).
specifies that each column in range has the fixed character value value.
C-style character escapes are valid for value.
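The partition file layout described above can be read with a short parser. This is a minimal sketch, not pin's own reader: it assumes ranges in a group line are separated by white space, does not allow # inside the name string, and handles only group lines (any other operator raises ValueError). The name parse_partition is hypothetical.

```python
import re

RANGE = re.compile(r"^(\d+)(?:-(\d+))?$")   # begin[-end]

def parse_partition(text):
    """Return (name, row_size, groups) from partition file text."""
    name, row_size, groups = None, None, []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        if row_size is None:
            # Optional "..." name line, then the row size line.
            if name is None and len(line) >= 2 and line[0] == line[-1] == '"':
                name = line[1:-1]
            else:
                row_size = int(line)
            continue
        cols = []
        for tok in line.split():
            m = RANGE.match(tok)
            if m is None:
                raise ValueError("unsupported token: " + tok)
            begin = int(m.group(1))
            end = int(m.group(2)) if m.group(2) is not None else begin
            if end < begin or end >= row_size:
                raise ValueError("bad column range: " + tok)
            cols.extend(range(begin, end + 1))   # end is inclusive
        groups.append(cols)
    return name, row_size, groups
```

Each returned group is a list of 0-based column offsets in the order given, so a line such as `0-3 8` yields the group [0, 1, 2, 3, 8].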
- -r, --row=row-size
Specifies the input row size (number of byte columns). The row size is determined by
sampling the input if not specified.
- -v, --verbose
List partition search details on the standard error.
- -w, --window=window-size
Limit the number of training data rows used to induce the partition. The
window size may be decreased to accommodate an integral number of complete rows. The default value is 4M.
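The rounding can be shown with a little arithmetic. This sketch assumes the window size is a byte count and that 4M means 4 * 1024 * 1024 bytes. The name effective_window is hypothetical.

```python
def effective_window(window_bytes, row_size):
    """Shrink the window to a whole number of rows.

    Returns (complete rows, bytes actually used).
    """
    if row_size <= 0:
        raise ValueError("row size must be positive")
    rows = window_bytes // row_size
    return rows, rows * row_size

rows, size = effective_window(4 * 1024 * 1024, 100)
print(rows, size)   # 41943 4194300
```

With 100 byte rows, the 4M window holds 41943 complete rows, so 4 trailing bytes are left out of the training data.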
- -O, --optimize=method
Choose the optimization (partitioning) method for step (3) above. The methods are:
- dynamic program optimal partition
- greedy approximation partition
- no partition
- transitive greedy approximation partition
The default value is dynamic.
- -Q, --regress
Generate output for regression testing, such that identical invocations with identical input
files will generate the same output.
- -R, --reorder=method
Choose the reordering method for step (2) above. The methods are:
- no reordering
- tsp reordering
The default value is heuristic.
- -S, --size
Ignore --row, determine the fixed record size based on a window of sampled data, print it on
the standard output, and exit. If more than one file is specified then the record size and name are printed for each file.
If the sample is insufficient, or if --verify is specified, then all of the data is read to determine the row size. A 0
size means the record size could not be determined.
- -T, --test=test-mask
Enable implementation-specific tests and tracing.
- Enable reorder keep trace.
- Enable reorder skip/cost trace.
- Enable reorder permutation trace.
- Enable reorder level 2 merge prune.
- Disable reorder merge prune.
- Partition using initial tsp cycles.
- -V, --verify
Verify --size by reading all data instead of the window sample.
- -X, --prefix=count[*terminator]
Uncompressed data contains a prefix that is defined by count and
an optional terminator. This data is not pzip compressed. terminator may be one of:
- count newline terminated records.
- count char terminated records.