#include <pzip.h>
Pz_t;
Pzpart_t;
Pzdisc_t;
int (*Pzerror_f)(Pz_t*, Pzdisc_t*, int, const char*, ...);
int (*Pzcheck_f)(Pz_t*, unsigned char*, Pzdisc_t*);
PZ_VERSION
PZ_WINDOW
PZ_MAGIC_1
PZ_MAGIC_2
PZ_TEST
PZ_READ
PZ_WRITE
PZ_FORCE
PZ_STAT
PZ_STREAM
PZ_CRC
PZ_DUMP
PZ_VERBOSE
PZ_TEST_1
PZ_TEST_2
PZ_TEST_3
PZ_TEST_4
SFPZ_VERIFY
SFPZ_CRC
Pz_t* pzopen(Pzdisc_t* disc, const char* path, int flags);
int pzclose(Pz_t* pz);
int pzdeflate(Pz_t* pz, Sfio_t* out);
int pzinflate(Pz_t* pz, Sfio_t* out);
int pzgethdr(Pz_t* pz);
int pzputhdr(Pz_t* pz, Sfio_t* out);
Pzpart_t* pzgetpart(Pz_t* pz, const char* name);
Pzpart_t* pzsetpart(Pz_t* pz, Pzpart_t* part);
void pzdump(Pz_t* pz, Sfio_t* out);
int sfdcpzip(Sfio_t* stream, const char* options, int flags);
pzip
is a library of functions that support compression/decompression
of data files of fixed length rows (records) and columns (fields).
Each
pzip
stream represents a file of compressed or decompressed data
that may be decompressed or compressed to an
sfio(3)
output file stream.
pzip
format performs better than
gzip
in space/time on data that has many (> 50%) columns
that change at a low rate
(columns with a low rate of change are low frequency;
columns with a high rate of change are high frequency).
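As a rough illustration of the low/high frequency distinction, the sketch below counts, for each column of a buffer of fixed length rows, how many rows differ from the row before. The helper name column_changes and the adjacent-row comparison are assumptions made for this example; how the library or its tools actually classify columns is not described here.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of pzip: for each of the `row` columns,
 * count how many of the `nrows` rows differ from the previous row.
 * A count that is small relative to nrows suggests a low frequency column.
 * `changes` must have room for `row` counters.
 */
void column_changes(const unsigned char* data, size_t nrows, size_t row,
                    size_t* changes)
{
    size_t r, c;

    assert(data && changes && row > 0);
    for (c = 0; c < row; c++)
        changes[c] = 0;
    for (r = 1; r < nrows; r++)
        for (c = 0; c < row; c++)
            if (data[r * row + c] != data[(r - 1) * row + c])
                changes[c]++;
}
```

For three 4-byte rows "AAx1", "AAy2", "ABy3" the counts are 0, 1, 1 and 2, so the last column changes most often.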
The
pzip
compressed format is itself
gzipped
using the
sfdcgzip()
sfio
discipline.
Uncompressed data is reorganized according to the user-specified
partition-file
(see
Pzdisc_t.partition
below) before being passed to
gzip.
Low frequency columns are run-length encoded and high frequency column groups
are transposed to column-major order.
The
gzip
tables are flushed, via the
sfsync()
discipline function, between each column partition group.
This has a positive space/time effect on the
gzip
string match and huffman tables.
pzip
format files self-identify by encoding partition information in a header.
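The row-major to column-major transposition of high frequency columns described above can be sketched as follows. The helper name transpose_columns and its argument layout are assumptions for illustration only; the sketch does not reproduce the library's windowing, grouping, or run-length encoding.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper, not part of pzip: copy the columns listed in
 * map[0..nmap-1] out of nrows row-major records of size row, writing
 * them to out in column-major order (every row's value for the first
 * listed column, then every row's value for the second, and so on).
 * out must hold nmap * nrows bytes.
 */
void transpose_columns(const unsigned char* data, size_t nrows, size_t row,
                       const size_t* map, size_t nmap, unsigned char* out)
{
    size_t i, r;

    assert(data && map && out);
    for (i = 0; i < nmap; i++) {
        assert(map[i] < row);
        for (r = 0; r < nrows; r++)
            out[i * nrows + r] = data[r * row + map[i]];
    }
}
```

Grouping the bytes of one column together puts similar values next to each other, which is the property the gzip string matcher benefits from.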
Pz_t
is the stream handle returned by
pzopen.
None of the fields should be changed by the application.
The fields are:
- const char* id
The library identification string used by the
libast
errorf
function to identify the source of error and warning messages.
- Pzdisc_t* disc
A pointer to the user discipline structure from
pzopen.
- unsigned int flags
The inclusive-or of the flags described below.
- int major
The
pzip
file format major number.
For
PZ_READ
it is the value when the file was compressed,
otherwise it is the major number of the current implementation.
In general the file header and implementation major numbers must match.
- int minor
The
pzip
file format minor number.
For
PZ_READ
it is the value when the file was compressed,
otherwise it is the minor number of the current implementation.
In general, all implementations and data files with the same
file header major number are compatible.
The minor number allows the implementation to make runtime adjustments.
- size_t win
The high frequency window size.
This is the adjusted value that is divisible by both the row size and
number of high frequency columns.
- const char* path
The input file pathname.
- Sfio_t* io
The input file
sfio
stream.
- Vmalloc_t* vm
The
vmalloc(3)
region handle for the entire stream.
All allocations associated with the stream, except for
sfio
initiated allocations, are done in this region.
The user may also allocate and free individual chunks of memory from
this region, but must not call
vmclear().
The region is freed by
pzclose().
- size_t npart
The number of partitions.
- Pzpart_t* part
The current active partition.
- Pzpart_t* parts
A table of all partitions.
- unsigned char* buf
The high frequency window buffer with
win
elements.
- unsigned char* wrk
The PZ_WRITE high frequency window buffer with
win
elements.
- unsigned char* pat
The low frequency pattern buffer with
row
elements.
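The win field is described as adjusted to be divisible by both the row size and the number of high frequency columns. One way to obtain such a value, sketched below, is to round the requested window down to a multiple of the least common multiple of the two. The names adjust_window and gcd_size are hypothetical, and the library's actual rounding rule may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Greatest common divisor by Euclid's algorithm. */
static size_t gcd_size(size_t a, size_t b)
{
    while (b) {
        size_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/*
 * Hypothetical helper, not part of pzip: return the largest value not
 * exceeding window that is divisible by both row and nhigh, or 0 if
 * window is smaller than their least common multiple.
 */
size_t adjust_window(size_t window, size_t row, size_t nhigh)
{
    size_t unit;

    assert(row > 0 && nhigh > 0);
    unit = row / gcd_size(row, nhigh) * nhigh;  /* lcm(row, nhigh) */
    return window / unit * unit;
}
```

With the default window of 4194304, a row size of 100 and 30 high frequency columns, this sketch yields 4194300.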
Pzpart_t
defines one partition.
The fields are:
- char* name
The partition name.
- int index
The partition index.
May be used to access a partition given its index:
Pz_t.parts[Pzpart_t.index-1].
- size_t row
The partition fixed row size.
- size_t col
The number of rows that can fit into the high frequency window column buffer.
- size_t* map
An array with
nmap
elements that lists the high frequency column
indexes in order.
- size_t* grp
An array with
ngrp
elements that lists the sizes of each
high frequency column partition group in the same order as
map.
- size_t nmap
The number of elements in
map.
- size_t ngrp
The number of elements in
grp.
- unsigned char* low
An array with
row
elements.
low[i]
is
1
if column
i
is low frequency,
otherwise it is
0
(and column
i
is high frequency).
- int* value
If there are no fixed-value columns the
value
is
NULL.
Otherwise it is an array with
row
elements.
value[i]
is non-negative if column
i
has a fixed value
(and
value[i]
is the fixed column value).
- size_t* fix
An array with
nfix
elements that lists fixed value columns.
- size_t nfix
The number of elements in
fix.
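The sketch below shows one reading of how the map, grp and value arrays relate. It assumes that the sizes in grp consume consecutive entries of map, and that a negative value[i] marks a column without a fixed value; both helpers are hypothetical and take plain arrays rather than a Pzpart_t.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the high frequency group
 * that contains column col, or -1 if col is not listed in map.
 * Assumes grp[g] counts consecutive entries of map.
 */
int group_of_column(const size_t* map, size_t nmap,
                    const size_t* grp, size_t ngrp, size_t col)
{
    size_t g, i, k = 0;

    assert(map && grp);
    for (g = 0; g < ngrp; g++)
        for (i = 0; i < grp[g] && k < nmap; i++, k++)
            if (map[k] == col)
                return (int)g;
    return -1;
}

/*
 * Hypothetical helper: return 1 if every fixed value column of one
 * row holds its fixed value, otherwise 0. value may be NULL when
 * there are no fixed value columns.
 */
int row_matches_fixed(const unsigned char* data, size_t row, const int* value)
{
    size_t c;

    if (!value)
        return 1;
    for (c = 0; c < row; c++)
        if (value[c] >= 0 && data[c] != (unsigned char)value[c])
            return 0;
    return 1;
}
```

For map {0,1,5,2,3} and grp {2,1,2}, column 5 falls in group 1 and column 4 is not a high frequency column.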
Pzdisc_t
defines a stream discipline structure to the
pzopen()
function.
The discipline fields are:
- unsigned long version
Must be initialized to
PZ_VERSION.
- const char* comment
An optional string that is placed in the
pzip
output file header; this string can be viewed by the
pzip(1)
command.
Ignored for
PZ_READ
and
PZ_STAT
streams.
- const char* options
An optional string of run-time options of the form
name=value.
Currently only fixed value columns may be specified.
The syntax is
begin[-end]='value'
where
begin
is the beginning column offset (starting at 0),
end
is the ending column offset for an inclusive range,
and
value
is the fixed column value.
Decompression time is improved when high frequency columns are given fixed values.
- const char* partition
The name of the
partition-file
that contains a sequence of lines that
specify the data row size and the high frequency
column partition groups.
This entry must be specified for
PZ_WRITE
and is ignored for
PZ_READ
streams.
Comments start with # and continue to the end of the line.
The first non-comment line specifies the row size.
The remaining lines operate on column offset ranges of the form:
begin[-end]
where
begin
is the beginning column offset (starting at 0),
end
is the ending column offset for an inclusive range.
The operations are:
- range [ ... ]
places all columns in the specified
range
list in the same high frequency partition group.
Each high frequency partition group is processed as a separate block by
gzip.
- range='value'
specifies that each column in
range
has the fixed character value
value.
C-style character escapes are valid for
value.
- const char* lib
The library name used by
pathfind(3)
to locate partition files.
The default is
'pzip',
and the default partition file suffix is
.prt.
- size_t window
Low frequency columns are processed one row at a time;
high frequency columns are processed across many rows at a time.
The space/time tradeoff is controlled by the number of
high frequency columns that can be processed in one step.
window
sets this limit.
The high frequency columns are transposed from row-major order to
column-major order, which may bring on inefficient paging behavior
on some systems when the window size is too large.
The default of 4M (4194304) provides reasonable behavior across
most paging implementations.
Note that compression requires two
window
buffers whereas decompression requires one.
window
is shortened to be divisible by both the row size and the number
of high frequency columns.
- int (*errorf)(Pz_t* pz, Pzdisc_t* disc, int lev, const char* fmt, ...)
An optional function that is called to emit error and warning messages.
It is most often set to the
libast
errorf():
disc.errorf = (Pzerror_f)errorf;
- int (*eventf)(Pz_t* pz, int event, void* value, Pzdisc_t* disc)
An optional function that is called when events occur during stream processing.
event
is set to the event and
value
is an event specific value.
The events are:
- PZ_OPEN
Called just before
pzopen()
returns successfully.
value
is
0.
A
-1
return value causes
pzopen()
to fail.
- PZ_CLOSE
Called before
pzclose()
releases any resources.
value
is
0.
The return value is used as the
pzclose()
return value.
- PZ_CHECK
Called as each row in a
PZ_WRITE
stream is processed.
value
is a pointer to the row data before compression, and
eventf
may modify the contents up to the row size.
The return value determines the disposition of the row:
-1
terminates all processing;
0
ignores the row;
otherwise the row is processed as usual.
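The begin[-end] column ranges used by the options string and the partition file can be parsed along the following lines. This is a minimal sketch: parse_range is a hypothetical name, it reads only the leading range of the string and ignores whatever follows (such as ='value'), and it does not reflect the library's own parser.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not part of pzip: parse a leading "begin" or
 * "begin-end" column range (offsets start at 0, end is inclusive).
 * A lone "begin" is treated as the one-column range begin-begin.
 * Returns 0 and stores the offsets on success, -1 on a malformed
 * range or when end is less than begin.
 */
int parse_range(const char* s, size_t* begin, size_t* end)
{
    char* e;
    unsigned long b, n;

    assert(s && begin && end);
    if (!isdigit((unsigned char)*s))
        return -1;
    b = strtoul(s, &e, 10);
    n = b;
    if (*e == '-') {
        if (!isdigit((unsigned char)e[1]))
            return -1;
        n = strtoul(e + 1, &e, 10);
        if (n < b)
            return -1;
    }
    *begin = (size_t)b;
    *end = (size_t)n;
    return 0;
}
```

For example, "12-15" gives begin 12 and end 15, "7" gives the single column 7, and "5-3" is rejected.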
- PZ_VERSION
This is a macro value of type
long int
that defines
the current version number of the
pzip
library interface.
The form is an eight digit date YYYYMMDD.
- PZ_WINDOW
The default window size (4M, 4194304).
- PZ_MAGIC_1
The first character of the two character
pzip
header magic number.
- PZ_MAGIC_2
The second character of the two character
pzip
header magic number.
A number of bit flags control stream operations.
They are set by the
flags
argument to
pzopen().
The flags are:
- PZ_READ
The input file is opened for decompression to the output file.
- PZ_WRITE
The input file is opened for compression to the output file.
- PZ_FORCE
For
PZ_READ,
if the input file is not in
pzip
format, then
pzinflate()
operates in transparent mode.
If the input file is in
gzip
format then
gzip
inflate is applied.
Otherwise
PZ_READ
input files must be in
pzip
format.
- PZ_STAT
The input file must be in
pzip
format; the handle may be used
to retrieve header information, but
pzinflate()
is disabled.
- PZ_STREAM
The
path
argument to
pzopen()
is treated as an
sfio
SF_READ
stream.
This is a hack used by
sfdcpzip().
- PZ_CRC
Enables decompression CRC checking.
CRC checking is a performance wart in the otherwise respectable
libz(3)
gzip
library implementation.
Decompression crc checking can increase decompression user time by as much
as a factor of 2.
pzip
uses a version of
libz
that disables decompression crc checking
and replaces it with a few sanity checks.
The
pzip
format also has its own checks.
- PZ_DUMP
Calls
pzdump()
just before a successful return from
pzopen().
- PZ_VERBOSE
Enables a verbose trace of internal actions.
- PZ_TEST_1
Enables the implementation defined test #1.
PZ_TEST_2,
PZ_TEST_3,
and
PZ_TEST_4
are also provided.
Pz_t* pzopen(Pzdisc_t* disc, const char* path, int flags);
This function opens a stream on
path.
It returns a new stream handle on success and
NULL
on error.
disc
and
flags
are described above.
If
flags
contains
PZ_READ
then
pzinflate()
may be called to decompress
path,
otherwise if
flags
contains
PZ_WRITE
then
pzdeflate()
may be called to compress
path.
int pzclose(Pz_t* pz);
This function closes the stream handle
pz
returned by a previous call to
pzopen().
It returns
0
on success and
-1
on error.
All resources allocated on behalf of the stream are released.
int pzdeflate(Pz_t* pz, Sfio_t* out);
This function compresses the entire
PZ_WRITE
pzip
stream
pz
to the output
sfio
stream
out.
It returns
0
on success and
-1
on error.
int pzinflate(Pz_t* pz, Sfio_t* out);
This function decompresses the entire
PZ_READ
pzip
stream
pz
to the output
sfio
stream
out.
It returns
0
on success and
-1
on error.
int pzgethdr(Pz_t* pz);
This function reads the header from the
PZ_READ
pzip
stream
pz
and fills in the appropriate fields in
pz.
It returns
0
on success and
-1
on error.
int pzputhdr(Pz_t* pz, Sfio_t* out);
This function writes the header from the
PZ_WRITE
pzip
stream
pz
to the
sfio
output stream
out.
It returns
0
on success and
-1
on error.
Pzpart_t* pzgetpart(Pz_t* pz, const char* name);
This function returns a partition given its name.
0 is returned if the partition is not found.
The default partition name is the empty string ("").
Pzpart_t* pzsetpart(Pz_t* pz, Pzpart_t* part);
This function sets the current active partition to part.
The previous active partition is returned.
The current active partition is initialized to the default partition ("").
void pzdump(Pz_t* pz, Sfio_t* out);
This function writes the header information from the
pzip
stream
pz
to the
sfio
output stream
out
in partition file format.
A file containing this information is suitable for the
Pzdisc_t.partition
field.
It returns
0
on success and
-1
on error.
int sfdcpzip(Sfio_t* sp, const char* options, int flags);
This function pushes a
pzip
decompress
sfio
discipline on the
sfio
stream
sp
by calling
pzopen()
with
PZ_READ|PZ_STREAM
on
sp
and setting
Pzdisc_t.options
to
options.
Because of the extra information involved,
pzip
compression is not supported by the discipline.
This is better handled by the
pzip(1)
command interface.
This means that
sp
must be an
SF_READ
sfio
stream.
sfdcpzip()
stacks the
gzip
decompress discipline
sfdcgzip()
on
sp
if necessary.
0 is returned on success.
Glenn Fowler, gsf@research.att.com.