Utah: an open-source semi-structured text parser

A good proportion of the software we develop at Sonalake is used to crunch data for analytics and visualisation. That’s the sexy stuff. Behind the scenes there are usually numerous data feeds with varying degrees of ‘niceness’. Some feeds are well structured and easy for software to decode and ingest, while others aren’t.

Parsing well-structured record data is pretty straightforward. Things get trickier when handling records that are almost structured.

Take, for example, the BGP summary output from a Juniper router.

Groups: 3 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               947        310          0          0          0          0
inet6.0              849        807          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.247.68.182         65550     131725   28179233       0      11     6w3d17h Establ
  inet.0: 4/5/1
  inet6.0: 0/0/0
10.254.166.246        65550     136159   29104942       0       0      6w5d6h Establ
  inet.0: 0/0/0
  inet6.0: 7/8/1
192.0.2.100           65551    1269381    1363320       0       1      9w5d6h 1/2/3 4/5/6

Semi-structured output is designed to be read by people, not machines. As such, the format tends not to be strictly enforced and can change subtly over time. Even slight variations in the format can mean reworking the parser.
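To make that fragility concrete, here is a minimal sketch of a hand-rolled parser for the peer section of the sample above. It uses only Python's standard `re` module. The names `PEER_LINE`, `TABLE_LINE` and `parse_peers` are hypothetical, chosen for illustration; this is not how Utah or TextFSM work internally, just the kind of one-off code the output tends to invite.

```python
import re

# A few lines taken from the peer section of the sample output.
SAMPLE = """\
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.247.68.182         65550     131725   28179233       0      11     6w3d17h Establ
  inet.0: 4/5/1
  inet6.0: 0/0/0
192.0.2.100           65551    1269381    1363320       0       1      9w5d6h 1/2/3 4/5/6
"""

# Assumption: a peer record starts with an IPv4 address, followed by five
# integer columns, an uptime token, and then free-form state text.
PEER_LINE = re.compile(
    r"^(?P<peer>\d+\.\d+\.\d+\.\d+)\s+(?P<asn>\d+)\s+(?P<in_pkt>\d+)\s+"
    r"(?P<out_pkt>\d+)\s+(?P<out_q>\d+)\s+(?P<flaps>\d+)\s+"
    r"(?P<last_up_down>\S+)\s+(?P<state>.+)$"
)

# Continuation lines carry per-table route counts, e.g. "  inet.0: 4/5/1".
TABLE_LINE = re.compile(
    r"^\s*(?P<table>[\w.]+):\s+(?P<active>\d+)/(?P<received>\d+)/(?P<damped>\d+)$"
)


def parse_peers(text):
    """Return one dict per peer, with any per-table counts attached."""
    peers = []
    for line in text.splitlines():
        m = PEER_LINE.match(line)
        if m:
            record = m.groupdict()
            record["state"] = record["state"].strip()
            record["tables"] = {}
            peers.append(record)
            continue
        m = TABLE_LINE.match(line)
        if m and peers:
            # Attach the counts to the most recently seen peer.
            peers[-1]["tables"][m["table"]] = (
                int(m["active"]),
                int(m["received"]),
                int(m["damped"]),
            )
    return peers


peers = parse_peers(SAMPLE)
```

Note that the last peer reports its counts inline in the state column rather than on continuation lines, so this sketch simply leaves them as raw text in `state`. Handling that case properly needs another branch, and each new variation in the output needs one more.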

Whenever we’ve had the need to process this type of data, and Python was an option, we’ve leaned on Google’s wonderful TextFSM. In cases where we needed to use Java, though, we ended up writing custom parsers – in simple cases by hand, in more complex cases using a parser generator like ANTLR.

All good until slight variations occurred and we had to tweak the parsers. Pretty quickly we decided to write something that could handle any semi-structured file we threw at it. The result was Utah, a small Java library that can handle most semi-structured record formats.

We have been using it internally for quite a while and have decided to release it as open-source under the Apache 2.0 license. You can find it on GitHub at https://github.com/sonalake/utah-parser.

We hope you find a use for it and if you have any ideas on how we can improve on it, please let us know – just raise an issue. Even better, fork it, fix it and create a pull request!

As they say, every good work of software starts by scratching a developer’s personal itch.