tl;dr: LazyJSON can now compress JSON using templates that separate structure from payload data. Combining this with a general-purpose compression algorithm gets real-world JSON data down to 1.8% of its original size.
We initially wrote LazyJSON to solve a very specific use case in our open source data processing framework, StroomData. You can read more about the concepts behind StroomData here and the initial announcement of LazyJSON here. While we were working on optimizing the internal syntax tree that is being generated by our JSON parser, we had a moment of insight. We could use that syntax tree to separate a JSON document into structure and payload. Let me show you what I mean—have a look at the following JSON document.
Now let’s split it up into the parts that are purely structural and the parts that are the actual payload data. In the following table, structure is shown in blue and payload in green.
If we have a stream of JSON documents that share the same structure, we can save the structure (the blue cells) as a template and just write out the payload (the green cells) as our actual data. We would need to specify to which template we are referring, but if we imagine that we hold the template id in a two byte short, we would be able to reduce the above document from 30 bytes to 12 bytes.
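To make the split concrete, here is a minimal sketch of the idea in Java. It is not LazyJSON's actual implementation or API: the class name, the `<str>`/`<int>` type tags, the one-byte length prefix and the sample document are all assumptions made for illustration, and it only handles flat objects with string and small integer values.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class TemplateSplitSketch {

    /** Structure only: field names in order, with a type tag in place of each value. */
    static String templateOf(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            if (sb.length() > 1) sb.append(',');
            String tag = (e.getValue() instanceof String) ? "str" : "int";
            sb.append('"').append(e.getKey()).append("\":<").append(tag).append('>');
        }
        return sb.append('}').toString();
    }

    /** Payload only: a two-byte template id followed by the raw values. */
    static byte[] encode(int templateId, Map<String, Object> doc) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        out.write((templateId >> 8) & 0xFF);
        out.write(templateId & 0xFF);
        for (Object v : doc.values()) {
            if (v instanceof String) {
                byte[] s = ((String) v).getBytes(StandardCharsets.UTF_8);
                out.write(s.length);                  // one-byte length prefix (assumes < 256 bytes)
                out.write(s, 0, s.length);
            } else {
                out.write(((Integer) v).intValue());  // assumes the value fits in one byte
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Hypothetical document, not the sample from this post.
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("name", "Kasper");
        doc.put("age", 38);
        System.out.println(templateOf(doc));
        System.out.println(encode(1, doc).length + " payload bytes");
    }
}
```

With this made-up layout, the 26-byte JSON text `{"name":"Kasper","age":38}` becomes a 10-byte record: two bytes of template id, seven bytes for the length-prefixed string and one byte for the number.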
That’s a compression ratio of 2.5:1 for the above sample. Not a lot by modern text compression standards, but it does have a very desirable feature for our data storage needs in StroomData. As long as we hold the templates in memory, we can still do random access reads into our data files to pick out a single document without depending on earlier data in the file.
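The random access property can be sketched as follows. This is an illustration under simplifying assumptions, not LazyJSON's file format: the template table is a hard-coded array, every value is a single unsigned byte, and the class and method names are hypothetical. The point is that decoding a record needs only its offset and the in-memory templates.

```java
import java.io.ByteArrayOutputStream;

final class RandomAccessSketch {

    // In-memory template table: template id -> ordered field names.
    // Every value is a single unsigned byte to keep the sketch short.
    static final String[][] TEMPLATES = { {"a", "b"}, {"x"} };

    /** Appends one record (two-byte template id, then one byte per field) and returns its offset. */
    static int append(ByteArrayOutputStream file, int templateId, int... values) {
        int offset = file.size();
        file.write((templateId >> 8) & 0xFF);
        file.write(templateId & 0xFF);
        for (int v : values) file.write(v);
        return offset;
    }

    /** Rebuilds the JSON text of the record at the given offset, reading nothing before it. */
    static String decodeAt(byte[] file, int offset) {
        int templateId = ((file[offset] & 0xFF) << 8) | (file[offset + 1] & 0xFF);
        String[] fields = TEMPLATES[templateId];
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < fields.length; i++) {
            if (i > 0) sb.append(',');
            sb.append('"').append(fields[i]).append("\":").append(file[offset + 2 + i] & 0xFF);
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        ByteArrayOutputStream file = new ByteArrayOutputStream();
        append(file, 0, 1, 2);
        append(file, 1, 9);
        int third = append(file, 0, 3, 4);
        // Jump straight to the third record without decoding the first two.
        System.out.println(decodeAt(file.toByteArray(), third));
    }
}
```

A stream compressor such as gzip applied to the whole file generally cannot do this, because decoding a given position depends on the data that came before it.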
We can optimize this compression scheme a bit further by also keeping a lookup dictionary of repeated string values along with the list of templates. In the above sample, this would further decrease the final size from 12 to 8 bytes assuming that we use a two byte short for the dictionary lookups and that “Kasper” is a repeated value worth storing.
We have implemented this compression scheme as described above in the latest version of LazyJSON (v1.2.2). You can specify how many times a template must be seen before we store it and start compressing with it, as well as how many times a string value must be seen before we add it to the dictionary. This lets you reap the compression benefits even if your stream of documents contains a lot of structure that isn’t repeated and would otherwise result in one-off templates.
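The threshold idea for the string dictionary could look roughly like the sketch below. The class, its constructor parameters and the choice to hand out an id on the sighting that reaches the threshold are all assumptions for illustration; LazyJSON's real configuration options may be named and behave differently.

```java
import java.util.HashMap;
import java.util.Map;

final class ThresholdDictionarySketch {
    private final int minRepeats;   // sightings required before a string is stored
    private final int maxEntries;   // capacity, so that ids fit in a two-byte short
    private final Map<String, Integer> seen = new HashMap<>();
    private final Map<String, Integer> ids = new HashMap<>();

    ThresholdDictionarySketch(int minRepeats, int maxEntries) {
        this.minRepeats = minRepeats;
        this.maxEntries = maxEntries;
    }

    /** Returns the dictionary id for s, or -1 if s must still be written out in full. */
    int idFor(String s) {
        Integer id = ids.get(s);
        if (id != null) return id;
        int count = seen.merge(s, 1, Integer::sum);
        if (count >= minRepeats && ids.size() < maxEntries) {
            int newId = ids.size();
            ids.put(s, newId);
            seen.remove(s);      // no need to keep counting a promoted string
            return newId;
        }
        return -1;
    }

    public static void main(String[] args) {
        ThresholdDictionarySketch dict = new ThresholdDictionarySketch(3, 32767);
        for (int i = 0; i < 4; i++) {
            System.out.println(dict.idFor("Kasper"));
        }
    }
}
```

The same counting approach would apply to templates: a structure that only ever shows up once never earns a slot, so one-off documents do not crowd out the structures that actually repeat.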
As you can probably imagine, it would be very easy to craft an imaginary piece of JSON data that showed insane compression abilities for this approach. All you would have to do is create JSON objects with insanely large field names that each have a single byte as their payload value. So instead I have chosen to demonstrate the compression abilities of LazyJSON using an actual data file containing DoubleDutch metrics from our mobile clients. The file takes up 2.7 GB of space and contains 4,740,448 JSON objects.
While the data set probably doesn’t contain more than 10–20 truly unique templates, the compression engine sees even the slightest change as a completely different template. For example, two objects with the same integer field can end up with different templates if one holds the value 99 and the other 32,000. This is because the compression engine stores numeric values in the smallest possible representation, so the field is typed as a byte in the first object and as a two-byte short in the second. As a result, we end up with 2,910 unique templates for this data file and a template utilization ratio of 0.99. The metrics also contain a lot of repeated string values, such as GUIDs for application IDs and data items. With the dictionary size capped at 32,767 entries, we end up with a full dictionary and a 0.78 utilization rate.
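The effect of numeric width on template identity can be illustrated with a small sketch. The set of signed types and the `<byte>`/`<short>` tag notation are assumptions; the post only tells us that 99 is stored as a byte and 32,000 as a two-byte short.

```java
final class NumericWidthSketch {

    /** Names the narrowest signed integer type that can hold v. */
    static String narrowestType(long v) {
        if (v >= Byte.MIN_VALUE && v <= Byte.MAX_VALUE) return "byte";
        if (v >= Short.MIN_VALUE && v <= Short.MAX_VALUE) return "short";
        if (v >= Integer.MIN_VALUE && v <= Integer.MAX_VALUE) return "int";
        return "long";
    }

    /** Template signature for a single-field object: the field name plus its type tag. */
    static String templateFor(String field, long v) {
        return "{\"" + field + "\":<" + narrowestType(v) + ">}";
    }

    public static void main(String[] args) {
        System.out.println(templateFor("value", 99));     // typed as a byte
        System.out.println(templateFor("value", 32000));  // typed as a short
    }
}
```

Because the type tag is part of the template, the two objects above produce two distinct templates even though a human reader would call their structure identical.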
So, what does all of this mean in terms of actual compression? Let’s have a look at some numbers in the following table. “Size x raw” is the ratio of the output data compared to the original raw data file. “Size x gzip” is compared to gzip as a baseline compression.
| Name | Size x raw | Size x gzip | Time* | Weissman score** |
|---|---|---|---|---|
| LazyJSON + gzip | 0.02972 | 0.628 | 146 | 1.20 |
| LazyJSON + xz | 0.01791 | 0.378 | 413 | 1.65 |
* Time is measured in seconds
** Weissman score calculated using alpha=1 and gzip as base score
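For readers who want to reproduce the last column, the Weissman score is commonly defined as W = alpha * (r / r_base) * (log T_base / log T), where r is the compression ratio and T the compression time. Here is a small sketch of that arithmetic; the method names are hypothetical and the inputs in `main` are made up rather than taken from the measurements above.

```java
final class WeissmanSketch {

    /**
     * W = alpha * (r / rBase) * (log(tBase) / log(t)),
     * where r is a compression ratio (raw size / compressed size) and t a time in seconds.
     */
    static double weissman(double alpha, double r, double rBase, double t, double tBase) {
        return alpha * (r / rBase) * (Math.log(tBase) / Math.log(t));
    }

    /** Converts a "Size x raw" fraction into a compression ratio, e.g. 0.25 -> 4.0. */
    static double ratioFromFraction(double sizeFraction) {
        return 1.0 / sizeFraction;
    }

    public static void main(String[] args) {
        // Hypothetical inputs: a candidate that compresses twice as well as the
        // baseline and takes the same 100 seconds to do so.
        System.out.println(weissman(1.0, 8.0, 4.0, 100.0, 100.0));
    }
}
```

A "Size x raw" value converts to a compression ratio by taking its reciprocal, so 0.01791 corresponds to roughly 56:1.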
As expected, LazyJSON on its own offers horrible compression in comparison to even our baseline compressor, gzip. Remember, though, that LazyJSON still allows for random access into the stream, so for some use cases its compression ratio of 4.28:1 would still be very meaningful. However, have a look at the last two rows in the table. Since LazyJSON doesn’t attempt to compress the actual payload data (except for the very simple substitution of string values using a dictionary lookup), there is still a lot of opportunity for compression. In this case, LazyJSON followed by xz gets us down to two-thirds of what xz is capable of on its own. On top of that, it also gets us down to around two-thirds of the total time spent compressing data.
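Chaining a general-purpose compressor behind the template encoding is straightforward. The sketch below uses the JDK's `java.util.zip` gzip streams (xz is not part of the JDK and would need a third-party library). The byte array in `main` is a synthetic stand-in for template-encoded output, not something produced by LazyJSON.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class SecondStageSketch {

    /** Compresses already-encoded bytes with gzip as a second stage. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();   // read after close so the gzip trailer is included
    }

    /** Reverses gzip(), returning the encoded bytes for template decoding. */
    static byte[] gunzip(byte[] data) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(data))) {
            return gz.readAllBytes();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Synthetic stand-in for a stream of template-encoded records.
        byte[] encoded = new byte[3000];
        for (int i = 0; i < encoded.length; i++) encoded[i] = (byte) (i % 3);
        byte[] packed = gzip(encoded);
        System.out.println(encoded.length + " -> " + packed.length + " bytes");
    }
}
```

The trade-off is that a gzip stream over the whole output has to be decompressed from the start, so this second stage suits archival better than workloads that need random access.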
When we wrote the compression functionality in LazyJSON, it was never our intention to use it for general-purpose compression of JSON data. But being able to turn 2,792 MB of raw JSON into just 51 MB (that’s just 1.8% of the original size!) sure makes it a very tempting option for long-term archival of JSON data. We have not yet implemented any command-line tools to make this functionality available, but its usage is pretty simple, as shown in the following sample.
LazyJSON is available as an open source project released under the Apache 2.0 license. You can follow the project on GitHub and find its artifacts in Maven Central. Follow the StroomData project on Twitter to get updates on the LazyJSON code base!