Overview

The AT&T SWIFT Open Source software is a collection of libraries and commands for processing, analyzing, and visualizing very large datasets. Using the tools included in SWIFT, one can build systems that collect data, either passively (e.g. monitoring directories) or actively (e.g. using SSH to copy data from remote servers), process this data to clean it up and put it in a format that is fast to query, and finally display this data so that users can explore it.

The tools in SWIFT aid this work by providing mechanisms to reliably transfer files, to perform incremental processing of new data so that the cost of adding data to a big dataset is low, and to perform very fast processing of data using efficient algorithms.


Background

The AT&T SWIFT toolkit has been in use internally at AT&T since 1997 and was released as open source in 2013. It relies heavily on the AT&T AST open source software, both to build and to run. The AST components that it uses include KSH, NMAKE, VCODEX, SFIO, and VMALLOC. SFIO and VMALLOC are used to implement efficient I/O and memory management mechanisms. VCODEX is used as an embedded file format to transparently compress large datasets much faster and better than standard compression tools. KSH is used extensively to implement all the data handling except for the actual processing of the big datasets. NMAKE is used to manage the data processing component. NMAKE, like all other make tools, was designed to compile software. However, NMAKE keeps state information about its actions and does not rely only on the state of the filesystem. This makes it a good candidate for implementing data processing steps that must be performed in a certain sequence and where each step must be known to have completed successfully, a situation in which just checking for the existence of files is not enough.


Components

The main data processing component of SWIFT is DDS. This includes a library and a set of tools. It implements the DDS file format, which is a self-describing file format. Each DDS file consists of a header that describes its schema and a data section. The schema maps directly to a C data structure. The DDS tools read this schema and use it to compile (on the fly) a shared module that is then used to efficiently perform the requested operation on the file. For example, the command:
ddsfilter -fe '{ DROP; if (code > 10 && strcmp (source, "NYC") == 0) KEEP; }'
which looks like an AWK command, wraps this expression in a set of C functions, compiles them into a shared object, loads the object, and executes it for each record. This combines the ease of use of AWK with the speed of C code. It is better than AWK in the sense that the fields are specified by name (code and source are defined as fields in the schema for this file) and operations on them are type-checked by the compiler. The toolkit includes several such commands, which together provide most of the data processing operations of a database system. DDS tools can be composed into pipelines, either using the standard UNIX pipe mechanism or a one-writer-many-readers mechanism included in the toolkit. This makes it possible to read a very large dataset from disk once and perform multiple processing actions in sequence or in parallel without having to copy data to or from disk at every step.
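
To make the wrapping step more concrete, here is a minimal sketch of the kind of C code that could be produced for the filter expression in the ddsfilter example, assuming its two conditions are combined with &&. Everything in it is an assumption made for illustration: the record layout, the field types, the DROP and KEEP macros, and the function names filter_record and filter_records. The code that DDS actually generates is not shown in this document and may look quite different.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical record layout. In DDS the real layout would come from
 * the schema stored in the file header. */
struct record {
    int  code;
    char source[4];   /* NUL-terminated, e.g. "NYC" */
};

/* Hypothetical stand-ins for DROP and KEEP: they set a per-record verdict. */
#define DROP (keep = 0)
#define KEEP (keep = 1)

/* The user's expression pasted into a function body.
 * Returns 1 if the record should be kept, 0 if it should be dropped. */
int filter_record(const struct record *rec)
{
    int keep = 1;
    int code;
    const char *source;

    assert(rec != NULL);

    /* Expose the schema fields under their own names so the
     * expression can refer to them directly. */
    code = rec->code;
    source = rec->source;

    { DROP; if (code > 10 && strcmp(source, "NYC") == 0) KEEP; }

    return keep;
}

/* Run the filter over an array of records, compacting the kept ones
 * to the front. Returns the number of records kept. */
size_t filter_records(struct record *recs, size_t n)
{
    size_t kept = 0;
    size_t i;

    for (i = 0; i < n; i++)
        if (filter_record(&recs[i]))
            recs[kept++] = recs[i];
    return kept;
}
```

Because the expression is compiled as ordinary C, a misspelled field name or a comparison between a string field and an integer would be reported by the compiler before any data is read, which is the type-checking benefit described above.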

Another component of SWIFT is AGGR. This is useful for building and maintaining large data cubes of metrics. It can be used to process plain text files (e.g. CSV files) or it can be added to the end of DDS pipelines. It is incremental, so if data is arriving in small chunks, they can be added into a summary dataset at a cost proportional to the size of the new data, not the size of the overall dataset. This component also includes several tools that can manipulate these datasets.
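
The incremental property can be illustrated with a small sketch. It does not use AGGR's file format or API, neither of which is described here. The summary layout (a fixed array of count and sum cells), the sample structure, and the names summary_add_chunk and cell_mean are all hypothetical. The point is only that folding in a new chunk visits the new samples and nothing else.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_KEYS 4   /* hypothetical number of cells in the summary */

/* One cell of a hypothetical summary: running count and sum of a metric. */
struct cell {
    long   count;
    double sum;
};

/* One incoming observation: the cell it belongs to and its value. */
struct sample {
    int    key;      /* 0 .. NUM_KEYS-1 */
    double value;
};

/* Fold a chunk of new samples into the existing summary.
 * Only the n new samples are visited, so the cost depends on the
 * size of the chunk, not on how much data was summarized before. */
void summary_add_chunk(struct cell summary[NUM_KEYS],
                       const struct sample *chunk, size_t n)
{
    size_t i;

    assert(chunk != NULL || n == 0);
    for (i = 0; i < n; i++) {
        int k = chunk[i].key;
        assert(k >= 0 && k < NUM_KEYS);
        summary[k].count += 1;
        summary[k].sum   += chunk[i].value;
    }
}

/* Mean of a cell, or 0.0 if the cell has no samples yet. */
double cell_mean(const struct cell *c)
{
    return c->count ? c->sum / (double)c->count : 0.0;
}
```

Calling summary_add_chunk once per arriving chunk keeps the summary current without re-reading older data. Metrics such as the mean can be derived from the stored count and sum at query time.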

A third component of SWIFT provides tools for setting up a web site that can be used to run queries on the data generated by the processing tools. The toolkit does not include a finished web site environment, just the tools to set one up. No web server is included either; it is assumed that the Apache web server is available on the system.

All SWIFT commands use the AST option handling mechanism, so each of them can produce its man page when run with the --man option.


May 21, 2013