RECSORT(3)C LIBRARY FUNCTIONSRECSORT(3)


NAME

Recsort - sorting records

SYNOPSIS

#include    <recsort.h>
#define     uchar unsigned char
#define     ulong unsigned long

RECSORT TYPES
Rs_t;
Rsdisc_t;
Rsmethod_t;
Rsobj_t;

OPENING/CLOSING SORT CONTEXTS
Rs_t*            rsopen(Rsdisc_t* disc, Rsmethod_t* meth, ssize_t c_max, int type);
int              rsclose(Rs_t* rs);
void             rsclear(Rs_t* rs);

SORT METHODS
Rsmethod_t*      Rsradix;
Rsmethod_t*      Rssplay;
Rsmethod_t*      Rsrasp;
Rsmethod_t*      Rsverify;
Rsmethod_t*      rsmethod(Rs_t* rs, Rsmethod_t* meth);

SORT DISCIPLINE
Rsdisc_t*        rsdisc(Rs_t* rs, Rsdisc_t* disc);
typedef ssize_t  (*Rsdefkey_f)(Rs_t* rs, Void_t* data, size_t s_data,
                                Void_t* key, size_t s_key, Rsdisc_t* disc);
typedef int (*Rsevent_f)(Rs_t* rs, int type, Void_t* data, Rsdisc_t* disc); typedef struct { int type; /* RS_KSAMELEN|RS_DSAMELEN */
       ssize_t        data;    /* data length or separator       */
       ssize_t        key;     /* key offset or expansion factor */
       ssize_t        keylen;  /* key length or key end offset   */
       Rsdefdey_f     defkeyf; /* function to define a key       */
} Rsdisc_t;

OBJECT OPERATIONS
ssize_t          rsprocess(Rs_t* rs, Void_t* data, ssize_t size);
ssize_t          rscount(rs);
Rsobj_t*         rslist(Rs_t* rs);
int              rswrite(Rs_t* rs, Sfio_t* f, int type);
int              rsmerge(Rs_t* rs, Sfio_t* f, Sfio_t** files, int n, int type);

DESCRIPTION

Recsort provides functions to sort strings in main memory and to merge sorted data from different files. These functions form a sort engine suitable for constructing programs such as /bin/sort(1).

RECSORT TYPES

Rs_t
This defines the type of a sorting context handle.

Rsdisc_t
This defines the type of a discipline structure which describes how keys are defined.

Rsmethod_t
This defines the type of a sorting method.

Rsobj_t
This is the type of a record to be sorted. It contains the following fields:
           ulong    order;
           Rsobj_t* left;
           Rsobj_t* right;
           Rsobj_t* equal;
           uchar*   key;
           ssize_t  keylen;
           uchar*   data;
           ssize_t  datalen;
order, left:

These fields are only significant in calls to event functions raised by method Rsverify.
right, equal:

The rslist() call returns a linked list of sorted records. obj->right defines the forward link for obj. obj->equal, if not NULL, points to a NULL-terminated linked list of records compared equal to obj.
key, keylen, data, datalen:

These fields define the key and data of the given record.

OPENING/CLOSING SORT CONTEXTS

Rs_t* rsopen(Rsdisc_t* disc, Rsmethod_t* meth, ssize_t c_max, int type)
This function creates a new sort context based on discipline disc and sort method meth. rsopen() returns the new sort context or NULL on error.
c_max:

Due to memory cache behavior, it is sometimes worthwhile to limit the amount of data sorted by a sort method at any given time. Data across calls to rsprocess() accumulated for a sort method is limited as follows. If c_max < 0, there is no limit. If c_max = 0, each rsprocess() call sorts data separately. If c_max > 0, data will be accumulated up to c_max before sorted. Thus, when c_max >= 0, many sorted list of records may result. These lists will be merged together in a rslist() or rswrite() call.
type:

This is a bit combination of: (1) RS_UNIQ, keeping only one in a set of duplicated records, (2) RS_REVERSE, sorting records in reverse order, and (3) RS_DATA, comparing records by keys, then by data.

int rsclose(Rs_t* rs)
This deletes rs and associated memory. If the discipline has an event function, the event RS_CLOSE will be raised first. If the event funtion returns a negative value, rsclose() will return the same value immediately. rsclose() returns 0 on success.

void rsclear(Rs_t* rs)
This frees memory in rs without closing rs.

SORT METHODS

Sort methods are of type Rsmethod_t*. The fields int Rsmethod_t.type and char* Rsmethod_t.name contain the method type and name respectively.

Rsradix
This is radix sort. Records are partitioned in phases by byte values.

Rssplay
Records are inserted into a top-down splay tree.

Rsrasp
This is a combination of radix sort and splay trees. Records are partitioned by first bytes of their keys. Then, groups with short keys are sorted using radix sort while groups with long keys are sorted in splay trees. A final merge phase collects everything together.

Rsverify
This method is used to verify if data is sorted. When a record is out of order, the event RS_VERIFY will be announced (see Rsdisc_t.eventf).

Rsmethod_t* rsmethod(Rs_t* rs, Rsmethod_t* meth)
If meth is not NULL, rsmethod() changes the sort method of rs to meth. This should be done only when there are no partially sorted records in rs, i.e., before any rsprocess() call or immediately after a rsclear() call. rsmethod() returns the old method on success and NULL on failure.

SORT DISCIPLINE

Object key management is defined in the type Rsdisc_t:

         typedef struct
         { int        type;
           ssize_t    data;
           ssize_t    key;
           ssize_t    keylen;
           Rsdefkey_f defkeyf;
           Rsevent_f  eventf;
         } Rsdisc_t;

Rsdisc_t.type
This should be a bit combination of RS_KSAMELEN and RS_DSAMELEN to indicate respectively that keys and data have constant lengths.

Rsdisc_t.data
If type&RS_DSAMELEN, Rsdisc_t.data is the length of the data of a record. Otherwise, Rsdisc_t.data defines the byte that separates records in a byte stream (see rsprocess().)

Rsdisc_t.key, Rsdisc_t.keylen

Rsdisc_t.defkeyf(Rs_t* rs, Void_t* data, ssize_t s_data, Void_t* key, ssize_t s_key, Rsdisc_t* disc)
Assume a record data of length s_data. If Rsdisc_t.defkeyf is NULL, Rsdisc_t.key and Rsdisc_t.keylen together define a key. data+Rsdisc_t.key is the key's starting address. If Rsdisc_t.keylen is positive, it is the length of the key. Otherwise, the length of the key is the quantity s_data+Rsdisc_t.keylen-Rsdisc_t.key.

If Rsdisc_t.defkeyf is not NULL, it is called to define a key. The key should be constructed in the given area key whose size is s_key. s_key is guaranteed to be at least Rsdisc_t.key*s_data. Rsdisc_t.defkeyf should return the length of the key or a negative value on error.

Rsdisc_t.eventf(Rs_t* rs, int type, Void_t* data, Rsdisc_t* disc)
If eventf is not NULL, it is called to announce certain events and associated data. If the return value of eventf is negative, the operation causing the announcement will immediately return an error condition. Following are supported events:
RS_CLOSE:

This event is raised when rs is about to be closed. The associated data is NULL.
RS_DISC:

This event is raised when the discipline is about to be changed. (Sfdisc_t*)data is the new discipline.
RS_METHOD:

This event is raised when the method is about to be changed. (Sfmethod_t*)data is the new method.
RS_VERIFY:

This event is raised when method Rsverify detects a record, r=(Rsobj_t*)data, out of order with respect to a previous record p. There are three cases. If r compares equal with p and RS_UNIQ is on, then r->equal is p. If r compares less than p and RS_REVERSE is off, then r->right is p. Finally, if r compares larger than p and RS_REVERSE is on, then r->left is p.

r->order and p->order are the ordinal positions of r and p in the stream of records.

The return value of eventf is significant as follows. If it is negative, the verification process is aborted. If it is zero, the verification process continues as if r did not exist. If it is positive, the verification process continues as if r was in order.

Rsdisc_t* rsdisc(Rs_t* rs, Rsdisc_t* disc)
If disc is not NULL, rsdisc() changes the discipline of rs to disc. This should be done only when there are no partially sorted records in the context, i.e., before any rsprocess() call or immediately after a rsclear() call. rsdisc() returns the old discipline on success and NULL on failure.

OBJECT OPERATIONS

ssize_t rsprocess(Rs_t* rs, Void_t* data, ssize_t s_data)

ssize_t rscount(Rs_t* rs)
rsprocess() partitions data into records and inserts them into rs. If s_data is non-positive, data is a single record of size -s_data. Otherwise, data is partitioned into records according to the rules defined in the discipline of rs. rsprocess() returns the number of bytes actually processed or -1 on error. After an rsprocess() call, the number of processed records can be retrieved with rscount().

Rsobj_t* rslist(Rs_t* rs)
This returns a sorted list of records previously inserted into rs. After rslist() has been called, no new records can be inserted until rs is cleared with rsclear(rs).

int rswrite(Rs_t* rs, Sfio_t* f, int type)
This writes sorted records in rs to stream f then clears rs. If type is any non-zero combination of RS_ITEXT and RS_OTEXT, data is output in a plain format. Otherwise, data is encoded for fast merging (see rsmerge().) rswrite() returns 0 on success and -1 on failure.

int rsmerge(Rs_t* rs, Sfio_t* f, Sfio_t** files, int n, int type)
This merges the given n files and writes the result to f. If type contains RS_ITEXT, the input data is assumed to be in a plain format. Otherwise, data is assumed to be encoded for fast merging. In addition, if type is any non-zero combination of RS_ITEXT and RS_OTEXT, data is output in a plain format. Otherwise, data is encoded for fast file merging. rsmerge() returns 0 on success and -1 on failure.

AUTHORS

Kiem-Phong Vo, kpv@research.att.com, and Glenn S. Fowler, gsf@research.att.com


November 07, 2006