Utah: an open-source semi-structured text parser
A good proportion of the software we develop at Sonalake is used to crunch data for analytics and visualisation. That’s the sexy stuff. Behind the scenes there are usually numerous data feeds with varying degrees of ‘niceness’. Some feeds are well structured and easy for software to decode and ingest, while others aren’t.
Parsing well-structured record data is pretty straightforward. Things get trickier when handling records that are almost structured.
Take, for example, the BGP summary output from a Juniper router.
Groups: 3 Peers: 3 Down peers: 0
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0               947        310          0          0          0          0
inet6.0              849        807          0          0          0          0
Peer                     AS      InPkt     OutPkt    OutQ   Flaps Last Up/Dwn State|#Active/Received/Damped...
10.247.68.182         65550     131725   28179233       0      11     6w3d17h Establ
  inet.0: 4/5/1
  inet6.0: 0/0/0
10.254.166.246        65550     136159   29104942       0       0      6w5d6h Establ
  inet.0: 0/0/0
  inet6.0: 7/8/1
192.0.2.100           65551    1269381    1363320       0       1      9w5d6h 1/2/3 4/5/6
Semi-structured output has been targeted for consumption by people, not machines. As such the format tends not to be strictly enforced and it can change subtly over time. Slight variations in data format mean having to rework the parser.
Whenever we’ve had the need to process this type of data, and Python was an option, we’ve leaned on Google’s wonderful TextFSM. In cases where we needed to use Java, though, we ended up writing custom parsers: by hand in simple cases, and with a parser generator like ANTLR in more complex ones.
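To make the hand-written approach concrete, here is a minimal sketch of the kind of one-off Java parser that can be written for the peer lines in the BGP summary above. The class name, the field names and the regular expression are illustrative assumptions, not code from Utah or from any parser we shipped. It handles IPv4 peers only and keeps just a few of the columns.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical hand-rolled parser for the peer lines of the BGP summary.
final class HandRolledBgpParser {

    // IPv4 address, AS, InPkt, OutPkt, OutQ, Flaps, Last Up/Dwn,
    // then everything else on the line is treated as the state.
    private static final Pattern PEER_LINE = Pattern.compile(
        "^(\\d+\\.\\d+\\.\\d+\\.\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)"
        + "\\s+(\\d+)\\s+(\\d+)\\s+(\\S+)\\s+(.*)$");

    static List<Map<String, String>> parsePeers(String output) {
        List<Map<String, String>> peers = new ArrayList<>();
        for (String line : output.split("\\r?\\n")) {
            Matcher m = PEER_LINE.matcher(line.trim());
            if (!m.matches()) {
                // Headers, the table summary and the indented
                // "inet.0: 4/5/1" continuation lines are all skipped.
                continue;
            }
            Map<String, String> peer = new LinkedHashMap<>();
            peer.put("address", m.group(1));
            peer.put("as", m.group(2));
            peer.put("flaps", m.group(6));
            peer.put("uptime", m.group(7));
            peer.put("state", m.group(8).trim());
            peers.add(peer);
        }
        return peers;
    }

    public static void main(String[] args) {
        String sample = String.join("\n",
            "Peer AS InPkt OutPkt OutQ Flaps Last Up/Dwn State",
            "10.247.68.182 65550 131725 28179233 0 11 6w3d17h Establ",
            "  inet.0: 4/5/1",
            "192.0.2.100 65551 1269381 1363320 0 1 9w5d6h 1/2/3 4/5/6");
        for (Map<String, String> peer : parsePeers(sample)) {
            System.out.println(peer);
        }
    }
}
```

Even this small sketch shows where the trouble starts. The third peer in the sample puts its route counts where the other peers have a state name, and the per-table counts on the continuation lines are silently dropped. Each of those cases needs another special rule in the code.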
All good, until slight variations occurred and we had to tweak the parsers. Pretty quickly we decided to write something that could handle any of the semi-structured files we threw at it. The result was Utah, a small Java library that can handle most semi-structured record formats.
We have been using it internally for quite a while and have decided to release it as open-source under the Apache 2.0 license. You can find it on GitHub at https://github.com/sonalake/utah-parser.
We hope you find a use for it and if you have any ideas on how we can improve on it, please let us know – just raise an issue. Even better, fork it, fix it and create a pull request!
As they say, every good work of software starts by scratching a developer’s personal itch.