pzip - fixed record partition compress/decompress
pzip [ options ] file
pzip compresses and decompresses data files of fixed length rows (records) and columns (fields). It performs better than
gzip(1) in space/time on data that has many (typically > 50%) columns that
change at a low rate (columns with a low rate of change are low frequency; columns with a high rate of change are high frequency).
The pzip compressed format is itself gzipped; uncompressed data is reorganized according to the user-specified partition
file (see the --partition option below) before being passed to gzip. Low frequency columns are difference
encoded and high frequency column groups are transposed to column-major order. The gzip tables are flushed between each
column partition group. This has a positive space/time effect on the gzip string match and Huffman tables.
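The two reorganizations named above can be illustrated with a minimal sketch. It is not pzip's actual encoding or on-disk format, which this page does not describe; the function names transpose_columns and diff_encode, and the (row, column, byte) change records, are assumptions made only for illustration.

```python
from typing import List, Sequence, Tuple


def transpose_columns(data: bytes, row_size: int, columns: Sequence[int]) -> bytes:
    """Return the selected byte columns of fixed-length rows in column-major order."""
    if row_size <= 0 or len(data) % row_size != 0:
        raise ValueError("data length must be a multiple of row_size")
    out = bytearray()
    for col in columns:
        if not 0 <= col < row_size:
            raise ValueError("column offset out of range")
        # A stride of row_size picks this column's byte from every row in turn.
        out += data[col::row_size]
    return bytes(out)


def diff_encode(data: bytes, row_size: int, columns: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Record (row, column, byte) only where a column differs from the previous row."""
    if row_size <= 0 or len(data) % row_size != 0:
        raise ValueError("data length must be a multiple of row_size")
    changes = []
    prev = None
    for r in range(len(data) // row_size):
        row = data[r * row_size:(r + 1) * row_size]
        for c in columns:
            # The first row has no predecessor, so every selected column is recorded.
            if prev is None or row[c] != prev[c]:
                changes.append((r, c, row[c]))
        prev = row
    return changes
```

With three 4-byte rows such as AAx1, AAy2, ABz3, columns 2 and 3 change on every row and transpose into two contiguous runs, while columns 0 and 1 rarely change and yield only a few change records.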
If a partition file is specified then pzip compresses the input file to the standard output, otherwise pzip
decompresses the input file to the standard output. If file is omitted then the standard input is used. If the
standard input is a tty then /dev/null is silently used.
file may be pzip compressed, gzip compressed, or raw. pzip files self-identify; the row size and
partition can be determined from the pzip header. For gzip and raw data, the following are done to determine the row
size:
- The row size is taken from the --row option if specified.
- If the --partition option is specified then the row size is taken from the partition
file.
- If the data is newline-terminated and if it contains at least two lines and if the first two data
lines have the same length then that length is taken to be the row size.
- Otherwise the row size cannot be determined and pzip exits with a diagnostic.
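The rules above can be sketched as a small function. This is an illustration, not pzip's code: it assumes the rules are tried in the order listed, and that the detected length counts the terminating newline, which this page does not state. The name guess_row_size and its parameters are hypothetical.

```python
from typing import Optional


def guess_row_size(data: bytes,
                   row_option: Optional[int] = None,
                   partition_row_size: Optional[int] = None) -> int:
    """Pick a row size for gzip or raw input, trying each rule in turn."""
    if row_option is not None:
        return row_option              # value given with --row
    if partition_row_size is not None:
        return partition_row_size      # value read from the partition file
    if data.endswith(b"\n"):
        lines = data.split(b"\n")[:-1]  # drop the empty piece after the final newline
        if len(lines) >= 2 and len(lines[0]) == len(lines[1]):
            # Assumption: the row size includes the newline byte.
            return len(lines[0]) + 1
    raise ValueError("cannot determine row size")
```

For example, guess_row_size(b"abc\ndef\n") yields 4 under the newline assumption, while input whose first two lines differ in length raises an error, mirroring the diagnostic exit.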
- -a, --append
Sets the PZ_APPEND flag that may be used by some disciplines.
- -b, --bzip
Use bzip(1) compression instead of the default
gzip(1). bzip is not fully supported, pending further investigation.
- -c, --comment=comment
Place comment in the output pzip file header when compressing. The
comment is listed by the --header option.
- -x, --crc
Enable gzip crc32 cyclic redundancy checking for decompress. On some systems this can double
the execution wall time. Most data corruption errors are still caught even with --nocrc.
- -d, --dump
Enable detailed tracing.
- -B, --bufsize=size
Set the output buffer size to size -- for debugging.
- -D, --debug=level
Set the debug trace level to level. Higher levels produce more output.
- -O, --dio
Push the sfdcdio(3) direct io discipline on
the input streams. Silently ignored on systems that do not support direct io.
- -G, --gzip
--nogzip disables gzip compression. Most often used for conversion or debugging. On
by default; -G means --nogzip.
- -h, --header
List header information on the input pzip file and exit. This output is compatible with
the --partition file format.
- -l, --library=library
Loads the dll library via the pzlib() call. library must
contain the function int pz_init(Pz_t* pz, Pzdisc_t* disc) which is called during pzip stream initialization. pz_init
allows run time modification to disc: most often it supplies alternate discipline functions. Runtime libraries may
interpret options specific to the library; library usage and description will be appended to online help output if the help options
appear after the --library option. Runtime libraries may provide additional diagnostics and tracing when --summary,
--verbose or --dump are specified. In general, runtime libraries are not needed for decompression. The --header
option lists the runtime libraries used to compress the input file.
- -z, --lzw
Use compress(1) lzw compression instead of
the default gzip(1). lzw is not fully supported, pending further
investigation.
- -o, --override=begin[-end]=value
Override the column partition. Currently only fixed value columns may
be specified. The syntax is begin[-end]='value' where begin is the beginning column offset (starting at
0), end is the ending column offset for an inclusive range, and value is the fixed column value. Uncompress time is
improved when high frequency columns are given fixed values (see the --partition option).
- -p, --partition=file
file specifies the data row size and the high frequency column partition
groups and permutation. file may contain URL-like components: path?name=part or path#part
reads the partition file path and uses the partition named part. Other options may be set by separating each
with , or space. The partition file is a sequence of lines. Comments start with # and continue to the end of the line. The first
non-comment line specifies the optional name string in "...". The next non-comment line specifies the row size. The remaining lines
operate on column offset ranges of the form: begin[-end] where begin is the beginning column offset (starting
at 0), and end is the ending column offset for an inclusive range. The file name // or /gzip/ disables pzip
partitioning and applies only gzip compression. The operators are:
- range [...]
places all columns in the specified range list in the same high
frequency partition group. Each high frequency partition group is processed as a separate block by the underlying compressor
(gzip(1) by default).
- range='value'
specifies that each column in range has the fixed character value value.
C-style character escapes are valid for value.
- -Z, --push
Push the sfdcpzip(3) io discipline rather
than direct library calls. Used for debugging and performance testing.
- -P, --pzip
--nopzip disables pzip compression. Most often used for conversion or debugging. On
by default; -P means --nopzip.
- -Q, --regress
Generate output for regression testing, such that identical invocations with identical input
files will generate the same output.
- -r, --row=row-size
Specifies the input row size (number of byte columns) for data that does not
- -S, --split[=pattern]
Instead of compressing, the input split discipline, which must be specified by a
subsequent --library option, splits the input data into files named id. id is determined by the split
discipline namef function. The optional pattern is a ksh(1) file
match pattern that limits the split to id's matching pattern (e.g., --split='1234|98765'.) If --append
is also specified then the data is appended to any pre-existing id files; otherwise each file is truncated when the first
record containing id data is read. If there are no records with id data then the id file is not modified. --split
should be used in a separate directory, and the directory should be cleared when --append is not specified to avoid
mixing inconsistent data. No records will be written to a split file with size >= --window bytes. The option value may be
- -s, --summary
Enable summary tracing to the standard error. Runtime libraries may add additional information to
the default pzip(3) library summary output. Compression summary includes the
compression rate, bytes per record, and compression wall time. This option also enables split discipline warnings about id
partitions that should be generated by pin(1) and added to the partition file to
improve compression. The pin(1) output, with an additional "id" line
manually prepended, can then be appended to an existing partition file.
- -T, --test=mask
Enable implementation-specific tests and tracing.
- -v, --verbose
Enable intermediate tracing.
- -w, --window=window-size
Each chunk of window bytes is compressed separately. The window size
may be silently decreased to accommodate an integral number of complete rows. The default value is 4M.
- -W, --write-test=group
Loop on sfread()/pzwrite(3) in chunks of group records rather than a single pzdeflate()
call for compression. Used for debugging and performance testing.
- -X, --prefix=count[*terminator]
Uncompressed data contains a prefix that is defined by count and
an optional terminator. This data is preserved but is not pzip compressed. If count is 0 on uncompress
then the header is not copied to the output. terminator may be one of:
- count newline terminated records.
count char terminated records.
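The partition file layout described under --partition can be read with a short parser. This is a sketch under stated assumptions, not the pzip(3) library's parser: it assumes the ranges on a group line are separated by whitespace, and the names parse_range and parse_partition are hypothetical. Fixed-value lines and the path?name=part selection are not handled, and a # inside the quoted name would be mistaken for a comment.

```python
from typing import List, Optional, Tuple

Range = Tuple[int, int]


def parse_range(text: str) -> Range:
    """Parse 'begin[-end]' into an inclusive (begin, end) pair of column offsets."""
    begin, sep, end = text.partition("-")
    first = int(begin)
    last = int(end) if sep else first
    if last < first:
        raise ValueError("range end precedes range begin: " + text)
    return first, last


def parse_partition(text: str) -> Tuple[Optional[str], Optional[int], List[List[Range]]]:
    """Return (name, row_size, groups) from partition file text."""
    name: Optional[str] = None
    row_size: Optional[int] = None
    groups: List[List[Range]] = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # comments run to the end of the line
        if not line:
            continue
        if row_size is None:
            # The optional "..." name may come before the row size.
            if name is None and len(line) >= 2 and line[0] == '"' and line[-1] == '"':
                name = line[1:-1]
                continue
            row_size = int(line)
            continue
        group = [parse_range(tok) for tok in line.split()]
        for _, last in group:
            if last >= row_size:
                raise ValueError("column offset beyond row size")
        groups.append(group)
    return name, row_size, groups
```

Each inner list in groups corresponds to one high frequency partition group; columns not named in any group are left out of the result.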
pzip decompress currently fails if the standard input is a pipe. This will be addressed in a future release.
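The --window adjustment to a whole number of rows amounts to simple arithmetic, sketched below. The name effective_window is hypothetical, "4M" is assumed to mean 4 MiB, and raising an error when a single row exceeds the window is an assumption; this page does not say how pzip handles that case.

```python
DEFAULT_WINDOW = 4 * 1024 * 1024  # assumption: the default "4M" means 4 MiB


def effective_window(row_size: int, window: int = DEFAULT_WINDOW) -> int:
    """Round the window size down to a whole number of rows."""
    if row_size <= 0:
        raise ValueError("row_size must be positive")
    if row_size > window:
        # Assumption: a row larger than the window is treated as an error here.
        raise ValueError("a single row does not fit in the window")
    return window - window % row_size
```

With 100-byte rows and the assumed default, the window shrinks by 4 bytes to 4194300 so that every chunk ends on a row boundary.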