Recsort provides functions to sort strings in main memory and
to merge sorted data from different files.
These functions form a sort engine suitable for constructing
programs such as /bin/sort(1).
This defines the type of a sorting context handle.
This defines the type of a discipline structure which describes
how keys are defined.
This defines the type of a sorting method.
This is the type of a record to be sorted.
It contains the following fields:
ulong order;
Rsobj_t* left;
Rsobj_t* right;
Rsobj_t* equal;
uchar* key;
ssize_t keylen;
uchar* data;
ssize_t datalen;
- order, left:
These fields are only significant in calls to event functions
raised by method Rsverify.
- right, equal:
The rslist() call returns a linked list of sorted records.
obj->right defines the forward link for obj.
obj->equal, if not NULL,
points to a NULL-terminated linked list of records
compared equal to obj.
- key, keylen, data, datalen:
These fields define the key and data of the given record.
This function creates a new sort context based on discipline
disc
and sort method
meth.
rsopen() returns the new sort context or
NULL on error.
- c_max:
Due to memory cache behavior, it is sometimes worthwhile to limit
the amount of data sorted by a sort method at any given time.
Data across calls to rsprocess() accumulated for a sort method
is limited as follows.
If c_max < 0, there is no limit.
If c_max = 0, each rsprocess() call sorts data separately.
If c_max > 0, data will be accumulated up to c_max before sorted.
Thus, when c_max >= 0, many sorted list of records may result.
These lists will be merged together in a rslist() or rswrite() call.
- type:
This is a bit combination of:
(1) RS_UNIQ, keeping only one in a set of duplicated records,
(2) RS_REVERSE, sorting records in reverse order, and
(3) RS_DATA, comparing records by keys, then by data.
This deletes
rs and associated memory.
If the discipline has an event function, the event
RS_CLOSE will
be raised first. If the event funtion returns a negative value,
rsclose()
will return the same value immediately.
rsclose() returns 0 on success.
This frees memory in
rs without closing
rs.
Sort methods are of type Rsmethod_t*.
The fields int Rsmethod_t.type and char* Rsmethod_t.name
contain the method type and name respectively.
This is radix sort.
Records are partitioned in phases by byte values.
Records are inserted into a top-down splay tree.
This is a combination of radix sort and splay trees.
Records are partitioned by first bytes of their keys.
Then, groups with short keys are sorted using radix sort
while groups with long keys are sorted in splay trees.
A final merge phase collects everything together.
This method is used to verify if data is sorted.
When a record is out of order,
the event
RS_VERIFY will be announced (see
Rsdisc_t.eventf).
If
meth is not
NULL,
rsmethod() changes the sort method of
rs to
meth.
This should be done only when there are no partially sorted records in
rs,
i.e., before any
rsprocess() call or immediately after a
rsclear() call.
rsmethod() returns the old method on success and
NULL on failure.
Object key management is defined in the type Rsdisc_t:
typedef struct
{ int type;
ssize_t data;
ssize_t key;
ssize_t keylen;
Rsdefkey_f defkeyf;
Rsevent_f eventf;
} Rsdisc_t;
This should be a bit combination of
RS_KSAMELEN and
RS_DSAMELEN to
indicate respectively that keys and data have constant lengths.
If
type&RS_DSAMELEN,
Rsdisc_t.data is the length of the data of a record.
Otherwise,
Rsdisc_t.data defines the byte that
separates records in a byte stream (see
rsprocess().)
Assume a record
data of length
s_data.
If
Rsdisc_t.defkeyf is
NULL,
Rsdisc_t.key and
Rsdisc_t.keylen together define a key.
data+
Rsdisc_t.key is the key's starting address.
If
Rsdisc_t.keylen is positive, it is the length of the key.
Otherwise, the length of the key is the quantity
s_data+
Rsdisc_t.keylen-Rsdisc_t.key.
If Rsdisc_t.defkeyf is not NULL, it is called to define a key.
The key should be constructed in the given area key whose size is s_key.
s_key is guaranteed to be at least Rsdisc_t.key*s_data.
Rsdisc_t.defkeyf should return the length of the key or a negative value on error.
If
eventf is not
NULL, it is called to announce certain
events and associated data. If the return value of
eventf is negative,
the operation causing the announcement will immediately return an error condition.
Following are supported events:
- RS_CLOSE:
This event is raised when rs is about to be closed.
The associated data is NULL.
- RS_DISC:
This event is raised when the discipline is about to be changed.
(Sfdisc_t*)data is the new discipline.
- RS_METHOD:
This event is raised when the method is about to be changed.
(Sfmethod_t*)data is the new method.
- RS_VERIFY:
This event is raised when method Rsverify detects
a record, r=(Rsobj_t*)data, out of order with respect to
a previous record p. There are three cases.
If r compares equal with p and RS_UNIQ is on,
then r->equal is p.
If r compares less than p and RS_REVERSE is off,
then r->right is p.
Finally, if r compares larger than p and RS_REVERSE is on,
then r->left is p.
r->order and p->order are the ordinal positions of r and p
in the stream of records.
The return value of eventf is significant as follows.
If it is negative, the verification process is aborted.
If it is zero, the verification process continues as if r did not exist.
If it is positive, the verification process continues as if r was in order.
If
disc is not
NULL,
rsdisc() changes the discipline of
rs to
disc.
This should be done only when there are no partially sorted records in the context,
i.e., before any
rsprocess() call or immediately after a
rsclear() call.
rsdisc() returns the old discipline on success and
NULL on failure.
rsprocess() partitions
data into records and inserts them into
rs.
If
s_data is non-positive,
data is a single record of size
-s_data.
Otherwise,
data is partitioned into records
according to the rules defined in the discipline of
rs.
rsprocess() returns the number of bytes actually processed or
-1 on error.
After an
rsprocess() call,
the number of processed records can be retrieved with
rscount().
This returns a sorted list of records previously inserted into
rs.
After
rslist() has been called, no new records can be inserted
until
rs is cleared with
rsclear(rs).
This writes sorted records in
rs to stream
f then clears
rs.
If
type is any non-zero
combination of
RS_ITEXT and
RS_OTEXT, data is output in a plain format.
Otherwise, data is encoded for fast merging (see
rsmerge().)
rswrite() returns 0 on success and -1 on failure.
This merges the given
n files and writes the result to
f.
If
type contains
RS_ITEXT,
the input data is assumed to be in a plain format.
Otherwise, data is assumed to be encoded for fast merging.
In addition, if
type is any non-zero combination of
RS_ITEXT and
RS_OTEXT, data is output in a plain format.
Otherwise, data is encoded for fast file merging.
rsmerge() returns 0 on success and -1 on failure.
Kiem-Phong Vo, kpv@research.att.com, and
Glenn S. Fowler, gsf@research.att.com