ETL

Apache Hive: querying structured data from HDFS

Before we talk about Hive, we need to understand the difference between Pig and Hive, since people can easily confuse the two and ask "why do we need this when we already have that?"

The Apache Hive software provides a SQL-like query dialect called Hive Query Language (HQL) that can be used to query data stored in a Hadoop cluster. Once again, Hive eliminates the need for writing mappers and reducers; we can use HQL to query the data instead. I will not go deep into why we need to query at all; look up SQL tutorials in a search engine if you need a refresher.
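
To give a taste of HQL before we go further, here is a minimal sketch run at the Hive prompt. The logs table, its columns, and the input path are all hypothetical, just for illustration:

hive> CREATE TABLE logs (ip STRING, url STRING, ts STRING)
    >     ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
hive> LOAD DATA INPATH '/data/weblogs.tsv' INTO TABLE logs;
hive> -- one line of HQL instead of a hand-written mapper and reducer
hive> SELECT url, COUNT(*) AS hits FROM logs GROUP BY url;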

The importance of Hive can only be understood when we have the right kind of data for it to process: static data that does not change, where quick response time is not the priority. Hive is only software on top of the Hadoop framework; it is not a database package. We are querying data from HDFS, and we will use Hive wherever the need for SQL-like querying arises.
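
A concrete way to see the "not a database" point: Hive can project a table onto files that already sit in HDFS without moving them. A sketch, assuming comma-separated sales files already live under a hypothetical /data/sales directory:

hive> CREATE EXTERNAL TABLE sales (item STRING, price DOUBLE)
    >     ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    >     LOCATION '/data/sales';
hive> -- the files stay where they are in HDFS; Hive only reads them
hive> SELECT item, SUM(price) AS revenue FROM sales GROUP BY item;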

HCatalog exposes the table metadata that Hive keeps in its metastore to other tools (it just popped into my mind as I write this post; we'll look into it later). Hive is ideal for data warehousing applications, where data is just stored and mined when necessary for report generation, visualization, and so on.

Following is the Apache definition of Hive:

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
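
That last sentence refers to Hive's TRANSFORM clause, which streams rows through an external script, much like Hadoop Streaming. A hedged sketch: the parse.py script and the raw_lines table are made up for illustration:

hive> ADD FILE /data/scripts/parse.py;
hive> -- each row of raw_lines is piped through the script as a custom "mapper"
hive> SELECT TRANSFORM (line)
    >     USING 'python parse.py'
    >     AS (word STRING, freq INT)
    > FROM raw_lines;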

Pig is a procedural data flow language: you write Pig Latin scripts (think of Python or Perl). Hive is more like a SQL query language. Just like they say, a pig can eat anything: Pig can take both structured and unstructured data, whereas Hive can only process structured data. Data representation in Pig is through variables, while in Hive it is through tables. Hive also supports UDFs, but writing and using them is more complex than in Pig, as shown in the sketch below.
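
For example, using a Java UDF in Hive takes a jar plus a registration step, whereas Pig only needs a single REGISTER statement. A sketch with a hypothetical jar and class name:

hive> ADD JAR /data/udfs/my-udfs.jar;
hive> CREATE TEMPORARY FUNCTION clean_url AS 'com.example.hive.CleanUrlUDF';
hive> SELECT clean_url(url) FROM logs LIMIT 10;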

Installation:

The steps are very similar to Pig's.

1. Download Apache Hive (substitute the actual release number wherever you see hive-latest-version):
$ wget http://www.eu.apache.org/dist/hive/hive-latest-version/hive-latest-version.tar.gz

2. Extract the Tarball

$ tar -zxvf hive-latest-version.tar.gz

3. Add the following to the Hadoop configuration bash script you created earlier:

export HIVE_HOME=/data/hive-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop

export PATH=$HIVE_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$HIVE_HOME/hive-latest-version.jar

4. Source the configuration file (the leading dot runs it in the current shell so the exports take effect):

$ . hadoopconfiguration.sh

5. Now launch Hive:

$ hive

Note that Hadoop should already be up and running, since Hive works on top of it.
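
If everything is set up correctly you will land at the hive> prompt. A quick sanity check is to list tables and create and drop a throwaway one:

hive> SHOW TABLES;
hive> CREATE TABLE smoke_test (id INT);
hive> SHOW TABLES;
hive> DROP TABLE smoke_test;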