PZIP(3)                    C LIBRARY FUNCTIONS                    PZIP(3)


NAME

pzip - fixed record partition compress/decompress library

SYNOPSIS

#include   <pzip.h>

DATA TYPES
Pz_t;
Pzpart_t;
Pzdisc_t;

int        (*Pzerror_f)(Pz_t*, Pzdisc_t*, int, const char*, ...);
int        (*Pzcheck_f)(Pz_t*, unsigned char*, Pzdisc_t*);

CONSTANTS
PZ_VERSION

PZ_WINDOW

PZ_MAGIC_1
PZ_MAGIC_2

PZ_TEST

FLAGS
PZ_READ
PZ_WRITE
PZ_FORCE
PZ_STAT
PZ_STREAM
PZ_CRC
PZ_DUMP
PZ_VERBOSE
PZ_TEST_1
PZ_TEST_2
PZ_TEST_3
PZ_TEST_4

SFPZ_VERIFY SFPZ_CRC

OPENING/CLOSING STREAMS
Pz_t*      pzopen(Pzdisc_t* disc, const char* path, int flags);
int        pzclose(Pz_t* pz);

INPUT/OUTPUT OPERATIONS
int        pzdeflate(Pz_t* pz, Sfio_t* out);
int        pzinflate(Pz_t* pz, Sfio_t* out);

int        pzgethdr(Pz_t* pz);
int        pzputhdr(Pz_t* pz, Sfio_t* out);

STREAM INFORMATION
Pzpart_t*  pzgetpart(Pz_t* pz, const char* name);
Pzpart_t*  pzsetpart(Pz_t* pz, Pzpart_t* part);

void pzdump(Pz_t* pz, Sfio_t* out);

SFIO DECOMPRESS DISCIPLINE
int        sfdcpzip(Sfio_t* stream, const char* options, int flags);

DESCRIPTION

pzip is a library of functions that support compression/decompression of data files of fixed length rows (records) and columns (fields). Each pzip stream represents a file of compressed or decompressed data that may be decompressed or compressed to an sfio(3) output file stream.

pzip format performs better than gzip in both space and time on data in which many (> 50%) of the columns change at a low rate. Columns with a low rate of change are called low frequency; columns with a high rate of change are called high frequency.
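The notion of column frequency can be made concrete with a small sketch. The code below is illustrative only and is not part of the pzip library: the helper names count_column_changes and is_low_frequency are hypothetical, and the percentage threshold is an assumed parameter, not a value taken from pzip.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Count, for each column of a fixed-width table, how many rows differ
 * from the previous row in that column.
 * data holds nrows rows of `row` bytes each in row-major order;
 * changes must have room for `row` counters.
 */
static void count_column_changes(const unsigned char* data, size_t nrows,
                                 size_t row, size_t* changes)
{
    size_t r, c;

    assert(data && changes && row > 0);
    for (c = 0; c < row; c++)
        changes[c] = 0;
    for (r = 1; r < nrows; r++)
        for (c = 0; c < row; c++)
            if (data[r * row + c] != data[(r - 1) * row + c])
                changes[c]++;
}

/*
 * Classify column c as low frequency when it changes in at most
 * `percent` percent of the row-to-row transitions.
 * The threshold is an illustrative choice, not a pzip constant.
 */
static int is_low_frequency(const size_t* changes, size_t c, size_t nrows,
                            unsigned percent)
{
    if (nrows < 2)
        return 1;
    return changes[c] * 100 <= (nrows - 1) * (size_t)percent;
}
```

A survey like this over a data sample is one way to decide which columns to list in a partition file as high frequency.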

The pzip compress format is itself gzipped using the sfdcgzip() sfio discipline. Decompressed data is reorganized according to the user-specified partition-file (see Pzdisc_t.partition below) before being passed to gzip. Low frequency columns are run-length encoded and high frequency column groups are transposed to column-major order. The gzip tables are flushed, via the sfsync() discipline function, between each column partition group. This has a positive space/time effect on the gzip string match and huffman tables.
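The transposition of high frequency columns to column-major order can be sketched as follows. This is a minimal standalone illustration, not the library's implementation: the function name transpose_columns is hypothetical, and the map argument merely mirrors the meaning of Pzpart_t.map (a list of high frequency column indexes).

```c
#include <assert.h>
#include <stddef.h>

/*
 * Copy the columns listed in map[0..nmap-1] out of nrows row-major rows
 * (each `row` bytes wide) into out in column-major order: all bytes of
 * column map[0] first, then all bytes of column map[1], and so on.
 * out must have room for nmap * nrows bytes.
 */
static void transpose_columns(const unsigned char* rows, size_t nrows,
                              size_t row, const size_t* map, size_t nmap,
                              unsigned char* out)
{
    size_t i, r;

    assert(rows && map && out);
    for (i = 0; i < nmap; i++) {
        assert(map[i] < row);
        for (r = 0; r < nrows; r++)
            *out++ = rows[r * row + map[i]];
    }
}
```

After the transpose, the bytes of one column sit next to each other, so similar values are adjacent when the buffer is handed to a general-purpose compressor such as gzip.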

pzip format files self-identify by encoding partition information in a header.

DATA TYPES

Pz_t
This is the stream handle returned by pzopen. None of the fields should be changed by the application. The fields are:
const char* id

The library identification string used by the libast errorf function to identify the source of error and warning messages.
Pzdisc_t* disc

A pointer to the user discipline structure from pzopen.
unsigned int flags

The inclusive-or of the flags described below.
int major

The pzip file format major number. For PZ_READ it is the value when the file was compressed, otherwise it is the major number of the current implementation. In general the file header and implementation major numbers must match.
int minor

The pzip file format minor number. For PZ_READ it is the value when the file was compressed, otherwise it is the minor number of the current implementation. In general, all implementations and data files with the same file header major number are compatible. The minor number allows the implementation to make runtime adjustments.
size_t win

The high frequency window size. This is the adjusted value that is divisible by both the row size and number of high frequency columns.
const char* path

The input file pathname.
Sfio_t* io

The input file sfio stream.
Vmalloc_t* vm

The vmalloc(3) region handle for the entire stream. All allocations associated with the stream, except for sfio initiated allocations, are done in this region. The user may also allocate and free individual chunks of memory from this region, but must not call vmclear(). The region is freed by pzclose().
size_t npart

The number of partitions.
Pzpart_t* part

The current active partition.
Pzpart_t* parts

A table of all partitions.
unsigned char* buf

The high frequency window buffer with win elements.
unsigned char* wrk

The PZ_WRITE high frequency window buffer with win elements.
unsigned char* pat

The low frequency pattern buffer with row elements.
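The win field above is described as adjusted to be divisible by both the row size and the number of high frequency columns. One way to satisfy that constraint is to round down to a multiple of their least common multiple, as in the sketch below. This is an assumption about how such an adjustment could be made, not a description of the library's actual rule; adjust_window and gcd_size are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Round window down to the nearest multiple of lcm(row, nhigh), so the
 * result is divisible by both the row size and the number of high
 * frequency columns. Returns 0 when window is smaller than that lcm.
 */
static size_t adjust_window(size_t window, size_t row, size_t nhigh)
{
    size_t unit;

    assert(row > 0 && nhigh > 0);
    unit = row / gcd_size(row, nhigh) * nhigh;
    return window - window % unit;
}
```

For example, with the default window of 4194304, a row size of 100 and 12 high frequency columns, the least common multiple is 300 and the sketch yields 4194300.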

Pzpart_t

Pzpart_t defines one partition. The fields are:

char* name

The partition name.
int index

The partition index. May be used to access a partition given its index: Pz_t.parts[Pzpart_t.index-1].
size_t row

The partition fixed row size.
size_t col

The number of rows that can fit into the high frequency window column buffer.
size_t* map

An array with nmap elements that lists the high frequency column indexes in order.
size_t* grp

An array with ngrp elements that lists the sizes of each high frequency column partition group in the same order as map.
size_t nmap

The number of elements in map.
size_t ngrp

The number of elements in grp.
unsigned char* low

An array with row elements. low[i] is 1 if column i is low frequency, otherwise it is 0 (and column i is high frequency.)
int* value

If there are no fixed-value columns the value is NULL. Otherwise it is an array with row elements. value[i] is non-negative if column i has a fixed value (and value[i] is the fixed column value).
size_t* fix

An array with nfix elements that lists fixed value columns.
size_t nfix

The number of elements in fix.
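The relationship between map, grp and low described above can be illustrated with a standalone sketch. It does not use the library's types; mark_low_columns and groups_cover_map are hypothetical helpers. The check that the group sizes sum to nmap is an inference from the field descriptions, and the sketch does not model fixed-value columns.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Fill low[0..row-1] from a list of high frequency column indexes:
 * low[i] is 0 for every column listed in map and 1 for all others.
 */
static void mark_low_columns(const size_t* map, size_t nmap, size_t row,
                             unsigned char* low)
{
    size_t i;

    assert(low && (map || nmap == 0));
    for (i = 0; i < row; i++)
        low[i] = 1;
    for (i = 0; i < nmap; i++) {
        assert(map[i] < row);
        low[map[i]] = 0;
    }
}

/*
 * Return 1 when the group sizes in grp[0..ngrp-1] add up to nmap,
 * that is, when the groups account for every entry of map exactly once.
 */
static int groups_cover_map(const size_t* grp, size_t ngrp, size_t nmap)
{
    size_t i, total = 0;

    for (i = 0; i < ngrp; i++)
        total += grp[i];
    return total == nmap;
}
```

With an 8 byte row, map = {2, 3, 6} and grp = {2, 1}, columns 2 and 3 form the first group, column 6 forms the second, and the remaining five columns are marked low frequency.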

Pzdisc_t

Pzdisc_t defines a stream discipline structure to the pzopen() function. The discipline fields are:

unsigned long version

Must be initialized to PZ_VERSION.
const char* comment

An optional string that is placed in the pzip output file header; this string can be viewed by the pzip(1) command. Ignored for PZ_READ and PZ_STAT streams.
const char* options

An optional string of run-time options of the form name=value. Currently only fixed value columns may be specified. The syntax is begin[-end]='value' where begin is the beginning column offset (starting at 0), end is the ending column offset for an inclusive range, and value is the fixed column value. Decompress time is improved when high frequency columns are given fixed values.
const char* partition

The name of the partition-file that contains a sequence of lines that specify the data row size and the high frequency column partition groups. This entry must be specified for PZ_WRITE and is ignored for PZ_READ streams. Comments start with # and continue to the end of the line. The first non-comment line specifies the row size. The remaining lines operate on column offset ranges of the form begin[-end], where begin is the beginning column offset (starting at 0) and end is the ending column offset for an inclusive range. The operations are:
range [ ... ]

places all columns in the specified range list in the same high frequency partition group. Each high frequency partition group is processed as a separate block by gzip.
range='value'

specifies that each column in range has the fixed character value value. C-style character escapes are valid for value.

const char* lib

The library name used by pathfind(3) to locate partition files. The default is 'pzip', and the default partition file suffix is .prt.
size_t window

Low frequency columns are processed one row at a time; high frequency columns are processed across many rows at a time. The space/time tradeoff is controlled by the number of high frequency columns that can be processed in one step. window sets this limit. The high frequency columns are transposed from row-major order to column-major order, which may bring on inefficient paging behavior on some systems when the window size is too large. The default of 4M (4194304) provides reasonable behavior across most paging implementations. Note that compression requires two window buffers whereas decompression requires one. window is shortened to be divisible by both the row size and the number of high frequency columns.
int (*errorf)(Pz_t* pz, Pzdisc_t* disc, int lev, const char* fmt, ...)

An optional function that is called to emit error and warning messages. It is most often set to the libast errorf():
disc.errorf = (Pzerror_f)errorf;
int (*eventf)(Pz_t* pz, int event, void* value, Pzdisc_t* disc)

An optional function that is called when events occur during stream processing. event is set to the event and value is an event specific value. The events are:
PZ_OPEN

Called just before pzopen() returns successfully. value is 0. A -1 return value causes pzopen() to fail.
PZ_CLOSE

Called before pzclose() releases any resources. value is 0. The return value is used as the pzclose() return value.
PZ_CHECK

Called as each row in a PZ_WRITE stream is processed. value is a pointer to the row data before compression, and eventf may modify the contents up to the row size. The return value determines the disposition of the row: -1 terminates all processing; 0 ignores the row; otherwise the row is processed as usual.
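The begin[-end] range syntax used by the partition and options fields above can be parsed with a few lines of C. The sketch below is a hypothetical helper, not the library's parser; the name parse_range and its return convention are assumptions made for illustration. The comment shows a made-up partition file that follows the format described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Example partition file (illustrative only):
 *
 *   # 16 byte rows
 *   16
 *   0-3 8-11
 *   15='\n'
 *
 * Parse one range token of the form begin[-end] at the start of s.
 * On success store the inclusive bounds in *begin and *end, store the
 * position just past the token in *next (when next is not NULL), and
 * return 0. Return -1 on a malformed token or when end < begin.
 */
static int parse_range(const char* s, size_t* begin, size_t* end,
                       const char** next)
{
    char* e;
    unsigned long b, n;

    assert(s && begin && end);
    if (*s < '0' || *s > '9')
        return -1;
    b = strtoul(s, &e, 10);
    n = b;                      /* a lone offset is a one-column range */
    if (*e == '-') {
        s = e + 1;
        if (*s < '0' || *s > '9')
            return -1;
        n = strtoul(s, &e, 10);
        if (n < b)
            return -1;
    }
    *begin = (size_t)b;
    *end = (size_t)n;
    if (next)
        *next = e;
    return 0;
}
```

A caller can inspect the character at *next to tell the two operations apart: '=' introduces a fixed value, while white space or end of line continues or ends a range list.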

CONSTANTS

PZ_VERSION

This is a macro value of type long int that defines the current version number of the pzip library interface. The form is an eight digit date YYYYMMDD.
PZ_WINDOW

The default window size (4M, 4194304 bytes).
PZ_MAGIC_1

The first character of the two character pzip header magic number.
PZ_MAGIC_2

The second character of the two character pzip header magic number.

BIT FLAGS

A number of bit flags control stream operations. They are set by the flags argument to pzopen(). The flags are:

PZ_READ

The input file is opened for decompression to the output file.
PZ_WRITE

The input file is opened for compression to the output file.
PZ_FORCE

For PZ_READ, if the input file is not in pzip format, then pzinflate() operates in transparent mode. If the input file is in gzip format then gzip inflate is applied. Otherwise PZ_READ input files must be in pzip format.
PZ_STAT

The input file must be in pzip format; the handle may be used to retrieve header information, but pzinflate() is disabled.
PZ_STREAM

The path argument to pzopen() is treated as an sfio SF_READ stream. This is a hack used by sfdcpzip().
PZ_CRC

Enables decompress crc checking. Crc checking is a performance wart in the otherwise respectable libz(3) gzip library implementation. Decompression crc checking can increase decompression user time by as much as a factor of 2. pzip uses a version of libz that disables decompression crc checking and replaces it with a few sanity checks. The pzip format also has its own checks.
PZ_DUMP

Calls pzdump() just before a successful return from pzopen().
PZ_VERBOSE

Enables a verbose trace of internal actions.
PZ_TEST_1

Enables the implementation defined test #1. PZ_TEST_2, PZ_TEST_3, and PZ_TEST_4 are also provided.

OPENING/CLOSING STREAMS

Pz_t*      pzopen(Pzdisc_t* disc, const char* path, int flags);

This function opens a stream on the file named by path. It returns a new stream handle on success and NULL on error. disc and flags are described above. If flags contains PZ_READ then pzinflate() may be called to decompress the file; otherwise, if flags contains PZ_WRITE, pzdeflate() may be called to compress it.

int        pzclose(Pz_t* pz);
This function closes the stream handle pz returned by a previous call to pzopen(). It returns 0 on success and -1 on error. All resources allocated on behalf of the stream are released.

INPUT/OUTPUT OPERATIONS

int        pzdeflate(Pz_t* pz, Sfio_t* out);
This function compresses the entire PZ_WRITE pzip stream pz to the output sfio stream out. It returns 0 on success and -1 on error.

int        pzinflate(Pz_t* pz, Sfio_t* out);
This function decompresses the entire PZ_READ pzip stream pz to the output sfio stream out. It returns 0 on success and -1 on error.

int        pzgethdr(Pz_t* pz);
This function reads the header from the PZ_READ pzip stream pz and fills in the appropriate fields in pz. It returns 0 on success and -1 on error.

int        pzputhdr(Pz_t* pz, Sfio_t* out);
This function writes the header from the PZ_WRITE pzip stream pz to the sfio output stream out. It returns 0 on success and -1 on error.

STREAM INFORMATION

Pzpart_t*  pzgetpart(Pz_t* pz, const char* name);
This function returns a partition given its name. 0 is returned if the partition is not found. The default partition name is the empty string ("").

Pzpart_t*  pzsetpart(Pz_t* pz, Pzpart_t* part);
This function sets the current active partition to part. The previous active partition is returned. The current active partition is initialized to the default partition ("").

void       pzdump(Pz_t* pz, Sfio_t* out);
This function writes the header information from the pzip stream pz to the sfio output stream out in partition file format. A file containing this information is suitable for the Pzdisc_t.partition field.

SFIO DECOMPRESS DISCIPLINE

int        sfdcpzip(Sfio_t* sp, const char* options, int flags);
This function pushes a pzip decompress sfio discipline on the sfio stream sp by calling pzopen() with PZ_READ|PZ_STREAM on sp and setting Pzdisc_t.options to options. Because of the extra information involved, pzip compression is not supported by the discipline. This is better handled by the pzip(1) command interface. This means that sp must be an SF_READ sfio stream. sfdcpzip() stacks the gzip decompress discipline sfdcgzip() on sp if necessary. 0 is returned on success.

AUTHOR

Glenn Fowler, glenn.s.fowler@gmail.com.


1998-08-11 November 07, 2006