Sunday May 10, 2008


jcol: A multi-purpose column parser/formatter/processor

Update: (July 23, 2008)
New additions include:

On most of the projects I have been involved with, I have had a heavy hand in the testing and automating of the tests. Along with that, I generally have to post-process large amounts of formatted data and produce other data (be it graphs, logs, archives, ...).

Anyone who has worked in this setting has undoubtedly had to string together many ad-hoc scripting solutions to fit the job. I generally find myself writing the same thing for some project when I can recall just how close it is to the last one I wrote... Many times I do not have access to the code of the last project due to confidentiality constraints and contracts.

jcol represents an effort to produce a consolidated tool that is flexible enough to do both the easy and more involved processing.

Here are the inputs we will be working with:

[20:53]:[0]:jcol > cat four.dat 
1,2,3,4
2,3,4,5
3,4,5,6
4,5,6,7
5,6,7,8
6,7,8,9
7,8,9,0
[20:53]:[0]:jcol > cat squares.dat 
1       1
2       4
3       9
4       16
...
10      100
[20:53]:[0]:jcol > cat strange.dat 
1iii7iii42iii600
2iii6iii24iii602
4iii5iii42iii604
8iii4iii24iii606
[20:53]:[0]:jcol > 

By default, jcol is simply a passthrough mechanism; that is, it prints to stdout what it gets from stdin. One of the more common tasks that I have to do is limit the output fields to a subset of the input. Generally, awk is a good tool for this. However, when you want to start selecting more than 10 columns or you have ranges of columns you want to select awk syntax becomes cumbersome.

jcol supports a limited range syntax for column selection where ':' is the range selector (as in 1:5) and comma is the separator.

[20:53]:[0]:jcol > jcol sep=, field=2:* < four.dat 
2       3       4
3       4       5
4       5       6
...
[20:53]:[0]:jcol > jcol sep=, field=1,4 < four.dat 
1       4
2       5
3       6
...

The syntax allows for ranges (N:M, where N or M can be * which mean all columns from N or all columns up to M). Single columns can be separated by commas. Ranges are treated as separated objects, so the following would be a valid field command:

field=1,3:7,9:*

In order to support wider range of input files, the separator can be set to just about anything. Full regular expression syntax is supported.

[20:53]:[0]:jcol > jcol sep=iii < strange.dat 
1       7       42      600
2       6       24      602
4       5       42      604
...

It is also common for me to have to modify the data as I process it. For instance, needing to convert from Mbit values to Byte values. To do this in jcol, I can simply use the apply= directive. Any number of apply= directives can be applied to a single parse.

[20:53]:[0]:jcol > jcol apply=@1**3 apply=@2/100 < squares.dat 
#col1   col2    "(col1)**3"     "(col2)/100"
1.0     1.0     1.0     0.01
2.0     4.0     8.0     0.04
3.0     9.0     27.0    0.09
...

The apply= directive works by allowing a user to do computation on a particular field of the input. To reference input fields in the computation, use the @N character where N is the field you want to use. Notice that the columns have headers that correspond to the data. This is a default action due to the fact that the order in which columns are added is not always preserved. As with most things in jcol this is an option that can be turned off.

Any valid ruby expression can be placed into an apply= directive.

Using an apply= directive on a column to be generated (from an avg=, sum= or apply= directive) is undefined.

As well as manipulating data on a column basis, it is also possible to limit processing on a row evaluation. Most logs I deal with have more that the required information in them. In fact, I usually have to do a bit of selection before getting down to the information I am really after.

For example, I may have a log that has 3 lines of header information and intermittent failure messages throughout.

[20:53]:[0]:jcol > cat sample.log 
# These first three lines are intended
# to be comments to the file. They are in no 
# way useful to a computational parse
1       7       string  42      100
1       8       string  42      100
1       9       string  42      100
1       7       string  42      100
Failure Message!!!
1       7       string  42      100
1       8       string  42      100
Failure Message!!!
1       7       string  42      100
1       8       string  42      100
1       9       string  42      100
Failure Message!!!
1       7       string  42      100
1       8       string  42      100

This is not an uncommon situation. Now we are getting into some complicated awk or command-line scripting. We have to wait until we pass the header then we have to be aware of the fact that there is extraneous data throughout the file.

jcol makes short work of this. It has nice support for regular expression-based elimination of input lines. You also have the option to specify a number of lines to be ignored as header info.

[20:53]:[0]:jcol > jcol skip=3 ignore=^Fail < sample.log 
1       7       string  42      100
1       8       string  42      100
1       9       string  42      100
...

Knowing that the comments were always preceded by a # character, we could have allowed for any number of header lines to be skipped by combining two ignore= directives.

[20:53]:[0]:jcol > jcol ignore=^# ignore=^Fail < sample.log 
1       7       string  42      100
1       8       string  42      100
1       9       string  42      100
...

For times when you need to produce somewhat commented output once the process is done, jcol allows for column headers to be applied to each column. This is especially important when extra processing has been done on the input by way of an apply= directive since extra columns will be created for each apply=.

[20:53]:[0]:jcol > jcol apply=@1*@2 title=2,Square < squares.dat 
#col1   Square  "(col1)*(Square)"
1.0     1.0     1.0
2.0     4.0     8.0
...

Titles can be applied to columns to be generated as part of an apply= directive as well.

[20:53]:[0]:jcol > jcol apply=@1*@2 title=2,Square title=3,Cube < squares.dat 
#col1   Square  Cube
1.0     1.0     1.0
2.0     4.0     8.0
...

There are many calculations that can be done on input, but two of the most common [for me] are sum and average. I chose to add these two as running values based on the input instead of adding a row or two at the end. Doing so allows for some creative processing like plotting the CDF of any input.

[20:53]:[0]:jcol > jcol avg=1 < squares.dat 
#col1   col2    "avg(col1)"
1       1       1.0
2       4       1.5
3       9       2.0
...
[20:53]:[0]:jcol > jcol sum=2 < squares.dat 
#col1   col2    "sum(col2)"
1       1       1.0
2       4       5.0
3       9       14.0
...

Another powerful feature of jcol is the filter= directive. Any input line can be filtered out based on a set of rules. This is extremely helpful when dealing with log files that seem to follow no particular format whatsoever. Looking back at the sample.log from earlier, suppose we want to ensure that the only lines we see match having a 7 in the second column.

[20:53]:[0]:jcol > jcol ignore=^# filter=2,7 < sample.log 
1       7       string  42      100
1       7       string  42      100
1       7       string  42      100
...

Ok, enough of the small tidbits here and there, lets see a real example. Suppose we want to monitor the output of ping to see how latencies change over time. ping output presents several challenges for handling the output inline on the command line. First of all, there is header information:

PING www.l.google.com (64.233.169.99): 56 data bytes

Then there are blank lines and statistics trailing the informational output:

--- www.l.google.com ping statistics ---
2 packets transmitted, 2 packets received, 0% packet loss
round-trip min/avg/max/stddev = 17.379/17.545/17.711/0.166 ms

And, finally, an output that looks like:

64 bytes from 64.233.169.99: icmp_seq=0 ttl=243 time=18.886 ms

Let's see what we can do with jcol.

	[20:53]:[0]:jcol > ping -c10 google.com | jcol sep="[^\d+]+" skip=1 filter=1,64 field=8,9 delim=.
32.059
32.583
33.375
34.600
33.335
32.310
33.184
31.766
43.898
32.981

That is all it takes to handle ping output. Lets break down the example to see what I did:

sep="[^\d+]+" Any string of non-digit input serves as the separator
skip=1 Skip the header info
filter=1,64 Require that the first field match 64. Note: This handles the empty lines and stat lines as it discards the input before it is evaluated internally.
field=8,9 Only print the 8th and 9th columns
delim=. Join those two fields with the '.'