Sunday, June 26, 2011

JSON Compression algorithms

About

JSON (JavaScript Object Notation) is a lightweight data-interchange format. It is easy for humans to read and write, and easy for machines to parse and generate. It can be used as a data interchange format, just like XML, and it has several advantages over the latter: JSON is really simple, it has a self-documenting format, and it is much shorter because there is no data configuration overhead. That is why JSON is considered a fat-free alternative to XML.

However, the purpose of this post is not to discuss the pros and cons of JSON over XML. Though JSON is one of the most widely used data interchange formats, there is still room for improvement. For instance, JSON uses quotes excessively and key names are very often repeated. This problem can be addressed by JSON compression algorithms, and more than one is available. Here you'll find an analysis of two JSON compression algorithms and a conclusion on whether JSON compression is useful and when it should be used.

Compressing JSON with CJSON algorithm

CJSON compresses JSON with automatic type extraction. It tackles the most pressing problem: the need to repeat key names over and over. With this compression algorithm, the following JSON:

[
  { // This is a point
    "x": 100, 
    "y": 100
  }, { // This is a rectangle
    "x": 100, 
    "y": 100,
    "width": 200,
    "height": 150
  },
  {} // an empty object
]

can be compressed as:

{
  "templates": [ 
    [0, "x", "y"], [1, "width", "height"] 
  ],
  "values": [ 
    { "values": [ 1,  100, 100 ] }, 
    { "values": [2, 100, 100, 200, 150 ] }, 
    {} 
  ]
}
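
To make the structure above concrete, here is a minimal JavaScript sketch that expands it back into the original objects. It is not the CJSON source code: the function names `expandTemplateKeys` and `unpackCjsonLike` are hypothetical, and the sketch assumes, based only on the example, that the first element of a template is the 1-based index of its parent template (0 meaning no parent) and that the first element of each `values` array is the 1-based index of the template to apply. It handles only a flat array of flat objects.

```javascript
// Hypothetical helper: collects the key names of a template, including the
// keys inherited from its parent chain. Index 0 is assumed to mean "no template".
function expandTemplateKeys(templates, index) {
  if (index === 0) return [];
  const [parentIndex, ...ownKeys] = templates[index - 1];
  return expandTemplateKeys(templates, parentIndex).concat(ownKeys);
}

// Hypothetical helper: rebuilds the original objects from the packed form.
function unpackCjsonLike(packed) {
  return packed.values.map((entry) => {
    if (!entry.values) return {}; // an empty object carries no template
    const [templateIndex, ...vals] = entry.values;
    const keys = expandTemplateKeys(packed.templates, templateIndex);
    const obj = {};
    keys.forEach((key, i) => { obj[key] = vals[i]; });
    return obj;
  });
}

const packedShapes = {
  templates: [[0, "x", "y"], [1, "width", "height"]],
  values: [
    { values: [1, 100, 100] },
    { values: [2, 100, 100, 200, 150] },
    {},
  ],
};

console.log(JSON.stringify(unpackCjsonLike(packedShapes)));
// prints [{"x":100,"y":100},{"x":100,"y":100,"width":200,"height":150},{}]
```

The second template only lists `width` and `height`; the `x` and `y` keys come from its parent, which is where the saving on repeated key names comes from.
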

A more detailed description of the compression algorithm, along with the source code, can be found here:

Compressing JSON with HPack algorithm

JSON.hpack is a lossless, cross-language, performance-focused data set compressor. It can reduce by up to 70% the number of characters used to represent a generic homogeneous collection. The algorithm provides several levels of compression (from 0 to 4). Level 0 performs the most basic compression: it removes the keys (property names) from the structure and creates a header at index 0 listing each property name. The higher levels reduce the size of the JSON even further by assuming that there are duplicated entries.

For the following JSON:

[{
  "name": "Andrea",
  "age": 31,
  "gender": "Male",
  "skilled": true
}, {
  "name": "Eva",
  "age": 27,
  "gender": "Female",
  "skilled": true
}, {
  "name": "Daniele",
  "age": 26,
  "gender": "Male",
  "skilled": false
}]

the HPack algorithm produces a compressed version that looks like this:

[["name","age","gender","skilled"],["Andrea",31,"Male",true],["Eva",27,"Female",true],["Daniele",26,"Male",false]]

More details about the HPack algorithm can be found on the project home page.
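
As an illustration of the level 0 idea (key names moved into a header at index 0), here is a minimal JavaScript sketch. It is not the JSON.hpack library or its API: `packHomogeneous` and `unpackHomogeneous` are hypothetical names, and the sketch assumes every object in the collection has the same keys as the first one.

```javascript
// Hypothetical helper: moves the property names into a header row at index 0
// and keeps only the values of each record, in header order.
function packHomogeneous(records) {
  if (records.length === 0) return [];
  const header = Object.keys(records[0]); // assumption: all records share these keys
  const rows = records.map((record) => header.map((key) => record[key]));
  return [header, ...rows];
}

// Hypothetical helper: the inverse operation, rebuilding one object per row.
function unpackHomogeneous(packed) {
  if (packed.length === 0) return [];
  const [header, ...rows] = packed;
  return rows.map((row) => {
    const record = {};
    header.forEach((key, i) => { record[key] = row[i]; });
    return record;
  });
}

const people = [
  { name: "Andrea", age: 31, gender: "Male", skilled: true },
  { name: "Eva", age: 27, gender: "Female", skilled: true },
  { name: "Daniele", age: 26, gender: "Male", skilled: false },
];

const packedPeople = packHomogeneous(people);
console.log(JSON.stringify(packedPeople));
console.log(JSON.stringify(unpackHomogeneous(packedPeople)) === JSON.stringify(people));
```

For the sample collection the first line printed has the same shape as the compressed version shown above, and the second line confirms that unpacking restores the original objects.
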

Analysis

The purpose of this analysis is to compare the two JSON compression algorithms described above. For this purpose we will use 5 files with JSON content of different sizes, varying from 50K to 1MB. Each JSON file will be served to a browser using a servlet container (Tomcat) with the following transformations:

  • Unmodified JSON - no change on the server side
  • Minimized JSON - whitespace and new lines removed (the most basic js optimization)
  • Compressed JSON using CJSON algorithm
  • Compressed JSON using HPack algorithm
  • Gzipped JSON - the unmodified JSON, served gzipped
  • Gzipped and minimized JSON
  • Gzipped and compressed using CJSON algorithm
  • Gzipped and compressed using HPack algorithm
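
The size of a JSON text under the unmodified, minimized and gzipped variants can be measured with a few lines of code. The sketch below uses Node.js and its built-in `zlib` module as a stand-in; it is not the Tomcat setup used for this benchmark, `measureSizes` is a hypothetical name, and the gzip settings are the Node.js defaults, so the numbers it gives will not match the results exactly.

```javascript
const zlib = require("zlib");

// Hypothetical helper: byte sizes of a JSON text as-is, minimized, gzipped,
// and minimized then gzipped.
function measureSizes(jsonText) {
  // Parsing and re-serializing drops whitespace and new lines.
  const minimized = JSON.stringify(JSON.parse(jsonText));
  const bytes = (text) => Buffer.byteLength(text, "utf8");
  const gzippedBytes = (text) => zlib.gzipSync(Buffer.from(text, "utf8")).length;
  return {
    original: bytes(jsonText),
    minimized: bytes(minimized),
    gzipped: gzippedBytes(jsonText),
    gzippedAndMinimized: gzippedBytes(minimized),
  };
}

// Made-up sample data: 200 pretty-printed records with repeated key names.
const sampleRecords = Array.from({ length: 200 }, (_, i) => ({
  name: "user" + i,
  age: 20 + (i % 50),
  skilled: i % 2 === 0,
}));
console.log(measureSizes(JSON.stringify(sampleRecords, null, 2)));
```

The compressed variants can be measured the same way by passing the packed JSON text to the same function.
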

Results

This table contains the results of the benchmark. Each row contains one of the transformations mentioned earlier, and there are 5 columns, one for each JSON file we process.

Transformation                       json1    json2    json3    json4    json5
Original JSON size (bytes)           52966   104370   233012   493589  1014099
Minimized                            33322    80657   180319   382396   776135
Compressed with CJSON                24899    48605   108983   231760   471230
Compressed with HPack                 5727    10781    23162    49099    99575
Gzipped                               2929     5374    11224    23167    43550
Gzipped and minimized                 2775     5035    10411    21319    42083
Gzipped and compressed with CJSON     2568     4605     9397    19055    37597
Gzipped and compressed with HPack     1982     3493     6981    13998    27358

Relative size of transformations (%)

The relative size of transformations graph shows each result as a percentage of the original file size. It is useful to see whether the size of the JSON to compress affects the efficiency of compression or minimization. Using the figures from the results table, you can notice the following:
  • minimization is much more efficient for the smallest file (~63% of the original size)
  • for large and very large JSON files, minimization has a constant efficiency (~77%)
  • the compression algorithms have the same efficiency for any size of JSON file
  • the CJSON compression algorithm is less efficient (~47%) than the HPack algorithm (~10%)
  • the CJSON compression algorithm is slower than the HPack algorithm
  • gzipped content alone (~4-6%) is even smaller than the HPack-compressed content
  • combining compression or minimization with gzip doesn't improve efficiency significantly (only about 1-2 percentage points of the original size)
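
The relative sizes can be recomputed from the results table. The short JavaScript sketch below does this for the smallest and the largest file; `relativeSize` is a hypothetical helper name, and the byte counts are copied from the table.

```javascript
// Sizes in bytes for json1 and json5, copied from the results table.
const tableSizes = {
  json1: { original: 52966, minimized: 33322, cjson: 24899, hpack: 5727, gzipped: 2929 },
  json5: { original: 1014099, minimized: 776135, cjson: 471230, hpack: 99575, gzipped: 43550 },
};

// Hypothetical helper: a size as a percentage of the original, rounded to one decimal.
function relativeSize(size, original) {
  return Math.round((size / original) * 1000) / 10;
}

for (const [file, s] of Object.entries(tableSizes)) {
  for (const key of ["minimized", "cjson", "hpack", "gzipped"]) {
    console.log(file, key, relativeSize(s[key], s.original) + "%");
  }
}
```

For json1, for example, this gives 62.9% for the minimized file, 47% for CJSON, 10.8% for HPack and 5.5% for gzip.
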

Conclusion

Both JSON compression algorithms are supported by wro4j since version 1.3.8 through the following processors: CJsonProcessor and JsonHPackProcessor. Both of them provide the following methods: pack and unpack. The underlying implementation uses the Rhino engine to run the JavaScript code on the server side.

JSON compression algorithms considerably reduce JSON file size. There are several compression algorithms; we have covered two of them: CJSON and HPack. HPack seems to be much more efficient than CJSON and also significantly faster. When two entities exchange JSON and the source compresses it before it reaches the target, the client (target) has to apply the inverse operation (unpacking), otherwise the JSON cannot be used. This introduces a small overhead, which must be taken into account when deciding whether JSON compression should be used.

When gzipping of content is allowed, it is more efficient than any other compression algorithm. In conclusion, it is not worth compressing JSON on the server if the client accepts gzipped content. Compression on the server side does make sense when the client doesn't know how to work with gzipped content and it is important to keep the traffic volume as low as possible (due to cost and time).
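
This decision can be sketched as a small check on the request's `Accept-Encoding` header. The JavaScript below is only an illustration: `acceptsGzip` and `chooseTransformation` are hypothetical names, the parsing is simplified (it looks only for an explicit `gzip` entry and ignores the `*` wildcard), and picking HPack as the fallback simply reflects the results above.

```javascript
// Hypothetical helper: true when the header lists gzip with a non-zero quality.
// Simplification: the "*" wildcard is ignored.
function acceptsGzip(acceptEncoding) {
  if (!acceptEncoding) return false;
  return acceptEncoding.split(",").some((part) => {
    const [coding, ...params] = part.trim().split(";");
    if (coding.trim().toLowerCase() !== "gzip") return false;
    const quality = params
      .map((p) => p.trim())
      .find((p) => p.toLowerCase().startsWith("q="));
    // No q parameter means the coding is acceptable; q=0 means it is refused.
    return quality === undefined || parseFloat(quality.slice(2)) > 0;
  });
}

// Hypothetical helper: gzip when the client accepts it, JSON compression otherwise.
function chooseTransformation(acceptEncoding) {
  return acceptsGzip(acceptEncoding) ? "gzip" : "hpack";
}

console.log(chooseTransformation("gzip, deflate"));
console.log(chooseTransformation("deflate"));
```

If the fallback is chosen, the client still has to unpack the JSON before using it, as noted above.
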

Another use case for JSON compression algorithms is sending large JSON content from the client to the server (which is sent ungzipped). In this case, it is important to unpack the JSON content on the server before consuming it.