Vcodex: Data Transformation


Kiem-Phong Vo

Vcodex is a software package for data transformation. Examples of data transformation include compression, delta compression, encryption, portable encodings of binary data, etc. The software architecture of Vcodex is layered as follows:

  • Libvcodex: This base library implements the methods for transforming data. The library assumes that there is sufficient main memory to process given data. Different transforms can be composed to perform particular tasks. For example, a Burrows-Wheeler compressor can be built by composing together the Burrows-Wheeler transform, a move-to-front transform, a run-length encoder, and some entropy encoder.
  • Vcsfio(): This single function is a part of Libvcodex. It provides an interface to the Sfio library to deal with file data. Vcsfio() constructs an Sfio discipline for the particular data transformation task and inserts it to a given Sfio stream. After that, for encoding, raw data can be simply written to the stream and the Sfio discipline will take care of calling the Vcodex functions to encode. Likewise, for decoding, the Sfio discipline will automatically call Vcodex functions to decode data so that the application will see the data in plaintext.
  • Vczip: This single executable command can be used to run the data transforms for both encoding and decoding. With the right combination of transforms, Vczip can perform compression tasks with performance similar to or better than other well-known compressors such as Gzip or Bzip2. Vczip is, in fact, a simple driver to collect command line arguments and make the appropriate invocation to vcsfio() to condition the standard output or input stream as the case required.

Libvcodex is based on a Discipline and Method architecture (see the references below) enabling simple data processing as well as extensibility when new transforms are invented. Data are transformed via handles created with particular transformation methods. Below are the methods distributed in this package:

  • Vcdelta: This performs delta compression and outputs results in the Vcdiff data encoding format. Since this method is a generalization of the 1977 Lempel-Ziv compression strategy, it can also be used to compress single data sets.
  • Vchuffman: This performs static Huffman coding. It implements fast algorithms to compute the Huffman code table and to decode encoded data.
  • Vchuffpart: This partitions data into contiguous segments with consistent statistics so that they compress better with different Huffman code tables than if a single table is used for all data.
  • Vchuffgroup: This is similar to the above. However, it first partitions data into small chunks with the same size, then groups chunks with consistent statistics for compression.
  • Vcbwt: This performs the famed Burrows-Wheeler transform.
  • Vcmtf: This performs the move-to-front transform. The method can be parameterized to select between two different move-to-front strategies. The default strategy uses a simple prediction method to aggressively move data to the front of the queue. This works well with data produced by Vcbwt. The alternate strategy is the plain move-to-front where a data byte is moved to the front of the queue after it is accessed.
  • Vcrle: This performs run-length encoding. The method can be parameterized to select between two different strategies. The default strategy finds runs and encodes the lengths and the data separately. The alternate strategy only encodes runs of zeros using a special binary encoding. This works well with data produced by the move-to-front transform after a Burrows-Wheeler transform.

The external Vcodex release does not include a number of powerful data transforms for compressing various types of table data as these have not been released by AT&T. However, the package does include efficient and reusable functions for suffix array construction, Lempel-Ziv parsing and various types of integer encodings.

Below are papers related to Vcodex:


Comments/Questions/Problems

Contact Phong Vo at: kpv@research.att.com.

Except for configuration problems in building the package, a bug report should be in the form of a small program that I can compile and run. Without such a program, I will likely ignore the report since it often takes too much time for me to tell if a reported bug is real or if it's just a misuse of the code.