PZIP(3)                    C LIBRARY FUNCTIONS                    PZIP(3)


NAME

pzip - fixed record partition compress/decompress library

SYNOPSIS

#include   <pzip.h>

DATA TYPES
Pz_t;
Pzpart_t;
Pzdisc_t;
int        (*Pzerror_f)(Pz_t*, Pzdisc_t*, int, const char*, ...);
int        (*Pzcheck_f)(Pz_t*, unsigned char*, Pzdisc_t*);

CONSTANTS
PZ_VERSION
PZ_WINDOW
PZ_MAGIC_1
PZ_MAGIC_2
PZ_TEST

FLAGS
PZ_READ
PZ_WRITE
PZ_FORCE
PZ_STAT
PZ_STREAM
PZ_CRC
PZ_DUMP
PZ_VERBOSE
PZ_TEST_1
PZ_TEST_2
PZ_TEST_3
PZ_TEST_4
SFPZ_VERIFY
SFPZ_CRC

OPENING/CLOSING STREAMS
Pz_t*      pzopen(Pzdisc_t* disc, const char* path, int flags);
int        pzclose(Pz_t* pz);

INPUT/OUTPUT OPERATIONS
int        pzdeflate(Pz_t* pz, Sfio_t* out);
int        pzinflate(Pz_t* pz, Sfio_t* out);
int        pzgethdr(Pz_t* pz);
int        pzputhdr(Pz_t* pz, Sfio_t* out);

STREAM INFORMATION
Pzpart_t*  pzgetpart(Pz_t* pz, const char* name);
Pzpart_t*  pzsetpart(Pz_t* pz, Pzpart_t* part);
void       pzdump(Pz_t* pz, Sfio_t* out);

SFIO DECOMPRESS DISCIPLINE
int        sfdcpzip(Sfio_t* stream, const char* options, int flags);

DESCRIPTION

pzip is a library of functions that support compression/decompression of data files of fixed length rows (records) and columns (fields). Each pzip stream represents a file of compressed or decompressed data that may be decompressed or compressed to an sfio(3) output file stream.

pzip format performs better than gzip in space/time on data in which many (> 50%) of the columns change at a low rate. Columns with a low rate of change are called low frequency; columns with a high rate of change are called high frequency.
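The distinction between low and high frequency columns can be made concrete by counting how often each column changes from one row to the next. The following is a minimal sketch, not part of the pzip library: column_changes(), is_low_frequency() and the percentage threshold are hypothetical names chosen for illustration, and the rows are assumed to be held in one row-major buffer.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count, for each column of a fixed-width row-major buffer, how many
 * times the byte changes from one row to the next. counts must have
 * room for row entries.
 */
static void
column_changes(const unsigned char* data, size_t row, size_t nrows, size_t* counts)
{
    size_t r, c;

    assert(data && counts && row > 0);
    for (c = 0; c < row; c++)
        counts[c] = 0;
    for (r = 1; r < nrows; r++)
        for (c = 0; c < row; c++)
            if (data[r * row + c] != data[(r - 1) * row + c])
                counts[c]++;
}

/*
 * Classify a column as low frequency when it changes in at most
 * max_percent percent of the row-to-row transitions. The threshold
 * is an illustrative assumption, not a pzip parameter.
 */
static int
is_low_frequency(size_t changes, size_t nrows, unsigned int max_percent)
{
    if (nrows < 2)
        return 1;
    return changes * 100 <= (size_t)max_percent * (nrows - 1);
}
```

A tool that proposes a partition file might run such a count over a sample of rows and place the columns that fail the threshold into high frequency groups.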

The pzip compressed format is itself gzipped using the sfdcgzip() sfio discipline. Decompressed data is reorganized according to the user-specified partition-file (see Pzdisc_t.partition below) before being passed to gzip. Low frequency columns are run-length encoded and high frequency column groups are transposed to column-major order. The gzip tables are flushed, via the sfsync() discipline function, between column partition groups. This has a positive space/time effect on the gzip string match and Huffman tables.
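The transposition of high frequency columns to column-major order can be sketched as follows. This is an illustration of the idea only, under the assumption that a window of rows is held in one row-major buffer; transpose_columns() is a hypothetical helper and does not reproduce the library's buffering or its on-disk layout.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the columns listed in map[0..nmap-1] from nrows row-major
 * records of row bytes each into out in column-major order: all
 * values of column map[0] first, then all values of column map[1],
 * and so on. out must hold nmap * nrows bytes.
 */
static void
transpose_columns(const unsigned char* in, size_t row, size_t nrows,
    const size_t* map, size_t nmap, unsigned char* out)
{
    size_t i, r;

    assert(in && out && map && row > 0);
    for (i = 0; i < nmap; i++)
    {
        assert(map[i] < row);
        for (r = 0; r < nrows; r++)
            *out++ = in[r * row + map[i]];
    }
}
```

After the transpose, the values of one column are adjacent in memory, which tends to give a string matching compressor such as gzip longer and more frequent matches than the interleaved row-major layout.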

pzip format files self-identify by encoding partition information in a header.

DATA TYPES

Pz_t
This is the stream handle returned by pzopen. None of the fields should be changed by the application. The fields are:

const char* id
The library identification string used by the libast errorf function to identify the source of error and warning messages.
Pzdisc_t* disc
A pointer to the user discipline structure from pzopen.
unsigned int flags
The inclusive-or of the flags described below.
int major
The pzip file format major number. For PZ_READ it is the value when the file was compressed, otherwise it is the major number of the current implementation. In general the file header and implementation major numbers must match.
int minor
The pzip file format minor number. For PZ_READ it is the value when the file was compressed, otherwise it is the minor number of the current implementation. In general, all implementations and data files with the same file header major number are compatible. The minor number allows the implementation to make runtime adjustments.
size_t win
The high frequency window size. This is the adjusted value that is divisible by both the row size and number of high frequency columns.
const char* path
The input file pathname.
Sfio_t* io
The input file sfio stream.
Vmalloc_t* vm
The vmalloc(3) region handle for the entire stream. All allocations associated with the stream, except for sfio initiated allocations, are done in this region. The user may also allocate and free individual chunks of memory from this region, but must not call vmclear(). The region is freed by pzclose().
size_t npart
The number of partitions.
Pzpart_t* part
The current active partition.
Pzpart_t* parts
A table of all partitions.
unsigned char* buf
The high frequency window buffer with win elements.
unsigned char* wrk
The PZ_WRITE high frequency window buffer with win elements.
unsigned char* pat
The low frequency pattern buffer with row elements.
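The win field is described above as an adjusted value divisible by both the row size and the number of high frequency columns. One way to obtain such a value is to round the requested window down to a multiple of the least common multiple of the two. The sketch below assumes that approach; adjust_window() and gcd_size() are hypothetical helpers, and the library's actual rounding may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t
gcd_size(size_t a, size_t b)
{
    while (b)
    {
        size_t t = a % b;

        a = b;
        b = t;
    }
    return a;
}

/*
 * Round window down to the nearest multiple of lcm(row, nhigh) so
 * that the result is divisible by both. Returns 0 if window is
 * smaller than that multiple. Overflow of the product is not
 * checked in this sketch.
 */
static size_t
adjust_window(size_t window, size_t row, size_t nhigh)
{
    size_t unit;

    assert(row > 0 && nhigh > 0);
    unit = row / gcd_size(row, nhigh) * nhigh;
    return window - window % unit;
}
```

For example, with the default window of 4194304, a row size of 100 and 30 high frequency columns, the multiple is 300 and the adjusted window is 4194300.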

Pzpart_t

Pzpart_t defines one partition. The fields are:

char* name
The partition name.
int index
The partition index. May be used to access a partition given its index: Pz_t.parts[Pzpart_t.index-1].
size_t row
The partition fixed row size.
size_t col
The number of rows that can fit into the high frequency window column buffer.
size_t* map
An array with nmap elements that lists the high frequency column indexes in order.
size_t* grp
An array with ngrp elements that lists the sizes of each high frequency column partition group in the same order as map.
size_t nmap
The number of elements in map.
size_t ngrp
The number of elements in grp.
unsigned char* low
An array with row elements.
low[ i ]
is 1 if column i is low frequency, otherwise it is 0 (and column i is high frequency).
int* value
If there are no fixed-value columns the value is NULL. Otherwise it is an array with row elements.
value[ i ]
is non-negative if column i has a fixed value (and
value[ i ]
is the fixed column value).
size_t* fix
An array with nfix elements that lists fixed value columns.
size_t nfix
The number of elements in fix.
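The relationship between map, grp and value can be sketched with two small helpers. They operate on plain arrays shaped like the fields above rather than on a real Pzpart_t, and both function names are hypothetical. The sketch assumes that the entries of grp sum to nmap, so that group g covers the next grp[g] entries of map.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Return the index of the partition group that contains map[k],
 * or ngrp if k lies beyond the last group.
 */
static size_t
group_of(const size_t* grp, size_t ngrp, size_t k)
{
    size_t g;
    size_t end = 0;

    assert(grp);
    for (g = 0; g < ngrp; g++)
    {
        end += grp[g];
        if (k < end)
            return g;
    }
    return ngrp;
}

/*
 * Return 1 if every fixed-value column of data holds its fixed
 * value, 0 otherwise. value follows the convention described
 * above: NULL means no fixed-value columns, and value[i] is
 * non-negative only for fixed-value columns.
 */
static int
row_matches_fixed(const unsigned char* data, const int* value, size_t row)
{
    size_t i;

    assert(data);
    if (!value)
        return 1;
    for (i = 0; i < row; i++)
        if (value[i] >= 0 && data[i] != (unsigned char)value[i])
            return 0;
    return 1;
}
```

With grp set to { 2, 3, 1 }, map entries 0 and 1 belong to group 0, entries 2 through 4 to group 1, and entry 5 to group 2.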

Pzdisc_t

Pzdisc_t defines a stream discipline structure to the pzopen() function. The discipline fields are:

unsigned long version
Must be initialized to PZ_VERSION.
const char* comment
An optional string that is placed in the pzip output file header; this string can be viewed by the pzip(1) command. Ignored for PZ_READ and PZ_STAT streams.
const char* options
An optional string of run-time options of the form name=value. Currently only fixed value columns may be specified. The syntax is begin[-end]='value' where begin is the beginning column offset (starting at 0), end is the ending column offset for an inclusive range, and value is the fixed column value. Decompress time is improved when high frequency columns are given fixed values.
const char* partition
The name of the partition-file that contains a sequence of lines that specify the data row size and the high frequency column partition groups. This entry must be specified for PZ_WRITE and is ignored for PZ_READ streams. Comments start with # and continue to the end of the line. The first non-comment line specifies the row size. The remaining lines operate on column offset ranges of the form begin[-end], where begin is the beginning column offset (starting at 0) and end is the ending column offset for an inclusive range. The operations are:

range [ ... ]
places all columns in the specified range list in the same high frequency partition group. Each high frequency partition group is processed as a separate block by gzip.
range='value'
specifies that each column in range has the fixed character value value. C-style character escapes are valid for value.
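The begin[-end] range syntax used in partition files and in Pzdisc_t.options can be parsed along the following lines. This is a sketch only: parse_range() is a hypothetical helper, not a pzip function, and rejecting a range whose end is less than begin is an assumption about the intended behavior.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Parse one column range of the form begin[-end] at s. On success
 * store the inclusive bounds in *begin and *end, point *next just
 * past the range, and return 0. Return -1 if the text is not a
 * well formed range or if end is less than begin.
 */
static int
parse_range(const char* s, size_t* begin, size_t* end, const char** next)
{
    char* e;
    unsigned long b;
    unsigned long n;

    assert(s && begin && end && next);
    if (!isdigit((unsigned char)*s))
        return -1;
    b = strtoul(s, &e, 10);
    n = b;
    if (*e == '-')
    {
        s = e + 1;
        if (!isdigit((unsigned char)*s))
            return -1;
        n = strtoul(s, &e, 10);
        if (n < b)
            return -1;
    }
    *begin = (size_t)b;
    *end = (size_t)n;
    *next = e;
    return 0;
}
```

A caller can inspect the character at *next to tell the two operations apart: an = introduces a fixed value, while white space or the end of the line continues or ends a range list.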

const char* lib
The library name used by pathfind(3) to locate partition files. The default is 'pzip', and the default partition file suffix is .prt.
size_t window
Low frequency columns are processed one row at a time; high frequency columns are processed across many rows at a time. The space/time tradeoff is controlled by the number of high frequency columns that can be processed in one step. window sets this limit. The high frequency columns are transposed from row-major order to column-major order, which may bring on inefficient paging behavior on some systems when the window size is too large. The default of 4M (4194304) provides reasonable behavior across most paging implementations. Note that compression requires two window buffers whereas decompression requires one. window is shortened to be divisible by both the row size and the number of high frequency columns.
int (*errorf)(Pz_t* pz, Pzdisc_t* disc, int lev, const char* fmt, ...)
An optional function that is called to emit error and warning messages. It is most often set to the libast errorf():
     disc.errorf = (Pzerror_f)errorf;
int (*eventf)(Pz_t* pz, int event, void* value, Pzdisc_t* disc)
An optional function that is called when events occur during stream processing. event is set to the event and value is an event specific value. The events are:

PZ_OPEN
Called just before pzopen() returns successfully. value is 0. A -1 return value causes pzopen() to fail.
PZ_CLOSE
Called before pzclose() releases any resources. value is 0. The return value is used as the pzclose() return value.
PZ_CHECK
Called as each row in a PZ_WRITE stream is processed. value is a pointer to the row data before compression, and eventf may modify the contents up to the row size. The return value determines the disposition of the row: -1 terminates all processing; 0 ignores the row; otherwise the row is processed as usual.

CONSTANTS

PZ_VERSION
This is a macro value of type long int that defines the current version number of the pzip library interface. The form is an eight digit date YYYYMMDD.
PZ_WINDOW
The default window size, 4M (4194304).
PZ_MAGIC_1
The first character of the two character pzip header magic number.
PZ_MAGIC_2
The second character of the two character pzip header magic number.

BIT FLAGS

A number of bit flags control stream operations. They are set by the flags argument to pzopen(). The flags are:

PZ_READ
The input file is opened for decompression to the output file.
PZ_WRITE
The input file, which contains decompressed data, is opened for compression to the output file.
PZ_FORCE
For PZ_READ, if the input file is not in pzip format, then pzinflate() operates in transparent mode. If the input file is in gzip format then gzip inflate is applied. Otherwise PZ_READ input files must be in pzip format.
PZ_STAT
The input file must be in pzip format; the handle may be used to retrieve header information, but pzinflate() is disabled.
PZ_STREAM
The path argument to pzopen() is treated as an sfio SF_READ stream. This is a hack used by sfdcpzip().
PZ_CRC
Enables decompress crc checking. crc checking is a performance wart in the otherwise respectable libz(3) gzip library implementation. Decompression crc checking can increase decompression user time by as much as a factor of 2. pzip uses a version of libz that disables decompression crc checking and replaces it with a few sanity checks. The pzip format also has its own checks.
PZ_DUMP
Calls pzdump() just before a successful return from pzopen().
PZ_VERBOSE
Enables a verbose trace of internal actions.
PZ_TEST_1
Enables the implementation defined test #1. PZ_TEST_2, PZ_TEST_3, and PZ_TEST_4 are also provided.

OPENING/CLOSING STREAMS

Pz_t*      pzopen(Pzdisc_t* disc, const char* path, int flags);

This function opens a stream on the file named by path. It returns a new stream handle on success and NULL on error. disc and flags are described above. If flags contains PZ_READ then pzinflate() may be called to decompress the file; otherwise, if flags contains PZ_WRITE, pzdeflate() may be called to compress the file.

int        pzclose(Pz_t* pz);
This function closes the stream handle pz returned by a previous call to pzopen(). It returns 0 on success and -1 on error. All resources allocated on behalf of the stream are released.

INPUT/OUTPUT OPERATIONS

int        pzdeflate(Pz_t* pz, Sfio_t* out);
This function compresses the entire PZ_WRITE pzip stream pz to the output sfio stream out. It returns 0 on success and -1 on error.

int        pzinflate(Pz_t* pz, Sfio_t* out);
This function decompresses the entire PZ_READ pzip stream pz to the output sfio stream out. It returns 0 on success and -1 on error.

int        pzgethdr(Pz_t* pz);
This function reads the header from the PZ_READ pzip stream pz and fills in the appropriate fields in pz. It returns 0 on success and -1 on error.

int        pzputhdr(Pz_t* pz, Sfio_t* out);
This function writes the header from the PZ_WRITE pzip stream pz to the sfio output stream out. It returns 0 on success and -1 on error.

STREAM INFORMATION

Pzpart_t*  pzgetpart(Pz_t* pz, const char* name);
This function returns a partition given its name. 0 is returned if the partition is not found. The default partition name is the empty string ("").

Pzpart_t*  pzsetpart(Pz_t* pz, Pzpart_t* part);
This function sets the current active partition to part. The previous active partition is returned. The current active partition is initialized to the default partition ("").

void       pzdump(Pz_t* pz, Sfio_t* out);
This function writes the header information from the pzip stream pz to the sfio output stream out in partition file format. A file containing this information is suitable for the Pzdisc_t.partition field. It returns 0 on success and -1 on error.

SFIO DECOMPRESS DISCIPLINE

int        sfdcpzip(Sfio_t* sp, const char* options, int flags);
This function pushes a pzip decompress sfio discipline on the sfio stream sp by calling pzopen() with PZ_READ|PZ_STREAM on sp and setting Pzdisc_t.options to options. Because of the extra information involved, pzip compression is not supported by the discipline. This is better handled by the pzip(1) command interface. This means that sp must be an SF_READ sfio stream. sfdcpzip() stacks the gzip decompress discipline sfdcgzip() on sp if necessary. 0 is returned on success.

AUTHOR

Glenn Fowler, gsf@research.att.com.


1998-08-11 November 07, 2006