Apache Pig: for ETL functions, data cleansing and munging

Pig is one of the add-ons in the Hadoop framework that supports and simplifies MapReduce data processing in Hadoop. Yes, Hadoop supports MapReduce natively, but Pig makes it easier: you don't have to write a complex Java program defining mapper and reducer classes.

Pig includes a scripting language called Pig Latin that provides many of the standard database operations we normally perform on data, as well as UDFs (user-defined functions, so you can write your own query methods to extract relevant data or munge it). Data munging is commonly done with Pig, but is not limited to it (look up sed and awk).
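As a point of comparison with Pig, here is a small munging sketch using sed and awk on a hypothetical comma-separated hit log; the file name and fields (user, url, status) are illustrative only:

```shell
# A hypothetical sample log to munge outside of Pig; fields: user,url,status
printf 'u1,/home,200\r\nu2,/login,500\nu3,/home,200\n' > hits.csv

# sed: strip carriage returns left over from Windows-edited logs
sed -i 's/\r$//' hits.csv

# awk: keep only successful (HTTP 200) requests and count hits per URL
awk -F',' '$3 == 200 { count[$2]++ } END { for (url in count) print url, count[url] }' hits.csv
# prints: /home 2
```

The same filter-and-aggregate shape maps directly onto Pig's FILTER and GROUP operators, but sed/awk run on a single machine rather than across the cluster.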

A common use case for a Pig Latin script is a web company bringing in logs from its web servers, cleansing the data, and precomputing common aggregates before loading it into the data warehouse. In this case, the data is loaded onto the grid, and Pig is used to clean out records from bots and records with corrupt data. It is also used to join web event data against user databases so that user cookies can be connected with known user information.
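A sketch of what that pipeline might look like in Pig Latin; the paths, schema, and filter condition here are hypothetical, chosen only to illustrate the shape of the script:

```pig
-- Load raw web server logs (path and schema are illustrative)
raw_logs = LOAD '/data/logs/access_log' USING PigStorage('\t')
           AS (cookie:chararray, url:chararray, agent:chararray);

-- Drop corrupt records (missing cookie) and records from bots
clean = FILTER raw_logs BY cookie IS NOT NULL AND NOT (agent MATCHES '.*bot.*');

-- Join against the user database to connect cookies to known users
users = LOAD '/data/users' USING PigStorage('\t') AS (cookie:chararray, name:chararray);
joined = JOIN clean BY cookie, users BY cookie;

-- Precompute a common aggregate: hits per URL
grouped = GROUP joined BY clean::url;
counts = FOREACH grouped GENERATE group AS url, COUNT(joined) AS hits;

STORE counts INTO '/data/warehouse/url_counts';
```

Each statement names a relation, and Pig compiles the whole chain into one or more MapReduce jobs when STORE (or DUMP) is reached.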


Pig does not need to be installed across the multi-node cluster, only on the machine where you launch your Hadoop jobs: the master. Remember that a multi-node cluster can have more than one master; Pig can be installed on those machines too. You can also run Pig in local mode if you want to use it locally on your Linux operating system.
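The execution mode is chosen with the -x flag when launching Pig; the script name below is hypothetical:

```
pig -x mapreduce script.pig   # MapReduce mode (the default): jobs run on the Hadoop cluster
pig -x local script.pig       # local mode: runs against the local filesystem on one machine
```

Local mode is handy for testing a script on a small sample of data before submitting it to the cluster.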

1. Download Apache Pig

$ wget http://www.eu.apache.org/dist/pig/pig-latest-version/pig-latest-version.tar.gz

2. Extract the Tarball

$ tar -zxvf pig-latest-version.tar.gz

3. Add the following to your previously created hadoop configuration bash script:

export PIG_HOME=/data/pig-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop
export PATH=$PATH:$PIG_HOME/bin

export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$PIG_HOME/pig-latest-version.jar

4. Source the configuration file

$ . hadoopconfiguration.sh
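Before launching Pig, it can help to confirm that the variables exported by the script actually made it into the environment; a small sketch using the variable names from the script above:

```shell
# Print any of the expected variables that are missing from the environment
for v in PIG_HOME JAVA_HOME HADOOP_HOME; do
  printenv "$v" >/dev/null || echo "$v is not set"
done
```

If any variable is reported missing, re-check the script and make sure you sourced it (with `.` or `source`) rather than executing it in a subshell, since a subshell's exports do not persist.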

5. Now launch Pig:

$ pig

You will see a set of startup messages followed by the grunt shell prompt, like this:
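From the grunt prompt you can type Pig Latin statements interactively; a hypothetical session (the file name and schema are illustrative):

```
grunt> lines = LOAD '/data/sample.txt' AS (line:chararray);
grunt> first = LIMIT lines 5;
grunt> DUMP first;
```

DUMP triggers execution and prints the relation to the console, which makes the grunt shell convenient for trying out a pipeline step by step.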


Remember, this whole setup needs to be done on the master (or masters) of a multi-node cluster, or on a single-node cluster where the Hadoop framework is up and running.