Computing A Network Traffic Matrix

This query's task is to compute the number of flows, total bytes sent, and total packets sent between pairs of networks by appropriately aggregating flows read from a gzip'd flat file of Cisco NetFlow data. Since each data record describes a flow between IP_address-port pairs, in order to determine the associated networks, it is necessary to apply the appropriate netmasks to the given IP addresses.

Fred True's flowdump program formats each binary NetFlow record into a flat file data record having the following fields:

# flowdump format:
# Srcaddr, Dstaddr, Nexthop, Input, Output, DPkts, DOctets, First, Last, Srcport, Dstport, 
# Pad, Tcp_Flags, Protocol, Tos, Src_As, Dst_As, Src_Mask, Dst_Mask, Reserved

There are two versions of the Perl query. The first, Perltypi, is written in a way typically used for writing these kind of sequentially-scan-and-aggregate queries. The second, Perlwref, is written using Perl5 references.

Although it is not evident in the Cymbal query, part of the reason for its speed is its behind-the-scene use of pointers, i.e., the equivalent of references in Perl and what are called VBL VBLS in Cymbal. The intent behind writing Perlwref is to use references to try to achieve Daytona's speed. In Perlwref, explicit references are used to support computing common subexpressions just once. This created a more lengthy and obscure-looking Perl and while it provided a 10% improvement over Perltypi, the resulting execution speed was still nowhere near that of Daytona. Perlwref also makes use of Perl's builtin sort capabilities, which are not particularly convenient to use when sorting arrays of tuples. Once again, Perl references provide the means to a (cumbersome) solution. This is why it is more common to pipe non-scalar-valued Perl arrays to sort(1).

The queries were run on a single gzip'd 52MB input data file for the Cymbal, Perl, gawk, nawk, and awkcc language processors.

Performance
  Elapsed Time Relative Speed
Cymbal                   1m55s                 1.00       
Perlwref                 15m37s                 8.14
Perltypi                 16m3s                 8.37
gawk                 18m30s                 9.65
awkcc                 20m3s                 10.46
nawk                 34m25s                 17.96


Correctness

Note that Cymbal now has a built-in function masked_ip which will do what the network function does here. The network function may not even be correct. The point of this exercise is to take real-life code, correct or incorrect, efficient or inefficient, written by whomever and express the encoded algorithm equivalently in multiple languages so as to compare language constructs and performance for pedagogical purposes.

Latest Best Version Of Query

Actually, as of Dec 2002, new constructs added to Cymbal enable this query to be written much more concisely:

for_each_time [ .src_net, .dest_net, .tot_flows, .tot_pkts, .tot_bytes ]
  Is_In { [ masked_ip( (IP).src_ip, (INT).src_mask), 
        masked_ip( (IP).dest_ip, (INT).dest_mask), 
        count(), sum( over (INT).pkts ), sum( over (INT).bytes ) ] :
    [ .src_ip, .dest_ip, 3?, .pkts, .bytes, 10?, .src_mask, .dest_mask, ? ]
      = ptokens( for "gzcat /rootbak/jack/data/192.168.31.97.0354.gz" upto "|\n" ) }
sorted_by_spec[ -5 ]
{
  do Write_Words( .src_net, .dest_net, .tot_pkts, .tot_bytes, .tot_flows );
}

Daytona processes this by translating it into pretty much the same code as the previous/original Cymbal version of this query.

Click to see a side-by-side comparison with commentary of the Perltypi and (previous) Cymbal queries.

Click to see the query in awk, Perltypi, Perlwref, and Cymbal.

Back