Apache JAQL: Efficient querying mechanism for JSON data format

So far we have seen data crawling, data transfer from various sources, data mining and data querying. Today we will look at JAQL, a query language made specifically for JSON data. It was created by IBM researchers, released to the open source community, and can be used under the Apache License 2.0. JAQL is primarily meant for JSON-style documents, but it can also be used to query XML, CSV, flat files and structured SQL data, so it can handle structured, semi-structured and unstructured data.

JSON is of particular interest in social media mining and content extraction because it is one of the formats that Facebook and Twitter offer for extracting data through their APIs. Instead of relying on sed and awk, JAQL lets you query JSON documents as though they were structured data.

JAQL is Hadoop compatible, meaning it can query big data stored in HDFS. It automatically generates MapReduce jobs and supports parallel processing of queries in a Hadoop cluster, almost like a hybrid of Pig and Hive.

There are other tools and add-ons for querying JSON documents, but you either have to migrate all your JSON to a different database (like MongoDB) or use add-ons alongside SQL (like JSON Path or JSON query), which may not work on data stored in HDFS. One should be able to query data inside HDFS without bringing in a new NoSQL or SQL store. I would like to believe this is where JAQL emerges as the better option. Not to forget, JAQL can be used from the shell or as part of Eclipse development.

We will now see how to set up JAQL on our system and start querying JSON documents. The steps are very simple.

wget -O jaql-0.5.1_12_07_2010.tgz "http://code.google.com/p/jaql/downloads/detail?name=jaql-0.5.1_12_07_2010.tgz"

tar xzvf jaql-0.5.1_12_07_2010.tgz
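Since the wget link above appears to point at a download page rather than at the file itself, it may be worth checking that what you saved is really a gzip tarball before extracting it. The following is only a sketch: is_valid_tgz is a hypothetical helper name, and the archive is assumed to be in the current directory.

```shell
# Hypothetical helper: succeeds only if the file is a readable gzip-compressed tar archive.
is_valid_tgz() {
    # gzip -t tests the compressed data; tar tzf confirms a tar listing can be read from it.
    gzip -t "$1" 2>/dev/null && tar tzf "$1" >/dev/null 2>&1
}

# Assumption: the archive was saved under this name in the current directory.
archive="jaql-0.5.1_12_07_2010.tgz"

if is_valid_tgz "$archive"; then
    tar xzvf "$archive"
else
    echo "$archive is missing or is not a gzip tarball; download it manually from the project page" >&2
fi
```

If the check fails, the saved file is most likely an HTML page, and downloading through the browser is the easier route.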

Add the following in your bashrc or your customized runtime config file:

export JAQL_HOME=/home//jaql-0.5.1



Note 1: Jaql does not seem to support Hadoop YARN. Either run it on traditional (pre-YARN) Hadoop or run it in local mode. To run in local mode, I suggest unsetting all Hadoop-related environment variables, setting only JAQL_HOME, and starting bin/jaqlshell from the jaql folder.
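The local-mode setup from Note 1 could look roughly like the sketch below. The list of Hadoop variables is an assumption about what a typical installation exports, prepare_jaql_local is a hypothetical helper name, and $HOME/jaql-0.5.1 is only an assumed install location, so adjust all three to your own system.

```shell
# Hypothetical helper: clear Hadoop settings and set JAQL_HOME for a local-mode session.
prepare_jaql_local() {
    # Assumed list of Hadoop-related variables; add or remove names to match your setup.
    for var in HADOOP_HOME HADOOP_CONF_DIR HADOOP_CLASSPATH HADOOP_COMMON_HOME \
               HADOOP_HDFS_HOME HADOOP_MAPRED_HOME HADOOP_YARN_HOME YARN_CONF_DIR; do
        unset "$var"
    done
    # Assumed install location; pass the directory you extracted as the first argument.
    export JAQL_HOME="${1:-$HOME/jaql-0.5.1}"
}

prepare_jaql_local

# Start the Jaql shell from the jaql folder if it is where we expect it.
if [ -x "$JAQL_HOME/bin/jaqlshell" ]; then
    cd "$JAQL_HOME" && bin/jaqlshell
else
    echo "jaqlshell not found under $JAQL_HOME" >&2
fi
```

The function has to run in your current shell (paste it or source it) for the unset to take effect; running it as a separate script would leave the variables in your session untouched.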

Note 2: If the wget link does not work for you, go to the website, right-click the download link and select “Save as”.