
Big Data Logistics: data transfer using Apache Sqoop from RDBMS

Apache Sqoop is a connectivity tool for transferring data between Hadoop and traditional relational databases (RDBMS), which hold structured data. Using sqoop, one can import data into the Hadoop Distributed File System from an RDBMS like Oracle, Teradata or MySQL, and also export data from Hadoop back to any RDBMS, either as CSV files or directly into database tables.

It is possible to write mapreduce programs that use jdbc connectors to export/import data from/to Hadoop, but sqoop automates most of that process. The advantage is that the operations are parallelized and fault tolerant.

One caveat though: you have to download the appropriate jdbc driver for your database. I will use MySQL, so I need the MySQL jdbc driver. These drivers already exist, so not to worry; look up Cloudera and Hortonworks for the Teradata, PostgreSQL and Oracle jdbc drivers.

A JDBC (Java DataBase Connectivity) driver is basically an API that lets you access data in relational databases and data warehouses. Each RDBMS vendor ships its own JDBC driver implementation.
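For reference, the MySQL flavor of the connection URL that sqoop's --connect option takes in the commands below looks like this (the host, port and database name here are placeholders, not values from this post):

jdbc:mysql://hostname:3306/databasename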

One can import an entire table from an RDBMS and specify the location in HDFS where it should be stored, or incorporate a query so that only a subset of the table is written to HDFS.

Installation:

Sqoop is very easy to install. If you are a Red Hat/CentOS user, you can simply do a yum install of sqoop; otherwise, download and unpack a release tarball:

$ wget http://name of the mirror/sqoop/1.4.4/sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

$ tar -zxvf sqoop-1.4.4.bin__hadoop-1.0.0.tar.gz

Add SQOOP_HOME to your runtime configuration file:
export SQOOP_HOME=/data/sqoop-1.4.4
export PATH=$SQOOP_HOME/bin:$PATH

Hadoop should be up and running.
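At this point a quick sanity check helps: sqoop should report the release you unpacked, and jps should list the Hadoop daemons (NameNode, DataNode, etc.):

$ sqoop version
$ jps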

Now, download the MySQL connector jar from http://dev.mysql.com/downloads/connector/j/, extract it, and copy mysql-connector-java-5.1.22-bin.jar into $SQOOP_HOME/lib.
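In terms of actual commands, that step looks roughly like this (the archive name comes from the 5.1.22 connector mentioned above; adjust it to whatever version you download):

$ tar -zxvf mysql-connector-java-5.1.22.tar.gz
$ cp mysql-connector-java-5.1.22/mysql-connector-java-5.1.22-bin.jar $SQOOP_HOME/lib/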

I do not have MySQL server, so I am going to install it.

$ sudo apt-get install mysql-server

$ sudo service mysql status

It should show the mysql process running.
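The examples below assume a database called firstsqoop containing an employee table. Here is a rough sketch of creating one; the columns and sample rows are my own assumptions, chosen to line up with the commands later in this post (the empty root password matches the --password "" used with sqoop):

mysql -u root <<'SQL'
CREATE DATABASE IF NOT EXISTS firstsqoop;
USE firstsqoop;
-- columns below are assumed for illustration
CREATE TABLE IF NOT EXISTS employee (
  EmpID   INT PRIMARY KEY,
  EmpName VARCHAR(100),
  Dept    VARCHAR(50),
  Salary  DECIMAL(10,2)
);
INSERT INTO employee VALUES
  (1, 'Alice', 'Accounting', 6500.00),
  (2, 'Bob', 'Operations Research', 7200.00);
SQL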

Now the sqoop tool has been set up. Say I created a table called employee in a database called firstsqoop, holding employee details, salary and department, and I wanted to import that table into HDFS. Then we could enter:

sqoop import \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--table employee
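When the job completes, the rows land as files in HDFS. By default sqoop writes them into a directory named after the table under your HDFS home directory, so a quick look at the result:

$ hadoop fs -ls employee
$ hadoop fs -cat employee/part-m-00000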

The import itself runs as a MapReduce job that pulls the data out of the MySQL database through the MySQL jdbc connector. You can also import just a subset of an existing table:

sqoop import \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--table employee \
--where "Dept = 'Accounting' && Salary >= "5000""

Or even query the RDBMS and import only what you want. With --query, sqoop requires the literal $CONDITIONS token in the WHERE clause so it can split the work across mappers (here a single mapper is forced with -m 1), and --query cannot be combined with --table:

sqoop import \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--query 'SELECT EmpRecordsData, ContractRecordsData
  FROM ORC_Solutions
  JOIN EmpName USING (EmpDept)
  WHERE EmpDept = "Operations Research" AND $CONDITIONS' \
-m 1 \
--target-dir TargetHDFSDirectory

Now let's look at how we can export data from Hadoop/HDFS back into the RDBMS:

sqoop export \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--table employee \
--export-dir EmployeeDepartment
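By default the export inserts new rows into the target table. If you would rather update rows that already exist, sqoop export accepts an --update-key flag naming the column(s) to match on. A minimal sketch, assuming EmpID is the employee table's key column:

sqoop export \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--table employee \
--export-dir EmployeeDepartment \
--update-key EmpID  # EmpID is an assumed key column, not named elsewhere in this post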

So you can update the existing database if you prefer, since sqoop provides that facility. Sqoop also lets you import RDBMS data directly into Hive. Here is how that's done:

sqoop import \
--connect jdbc:mysql://localhost/firstsqoop \
--username root \
--password "" \
--table employee \
--hive-import
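Once that job finishes, the table should show up in Hive under the same name. A quick check from the shell (hive -e runs a HiveQL statement without opening the interactive prompt):

$ hive -e 'SHOW TABLES;'
$ hive -e 'SELECT * FROM employee LIMIT 10;'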


Apache Hive: to query structured data from HDFS

If we want to talk about Hive, we need to understand the difference between Pig and Hive, since people easily confuse the two or ask why we need one when we already have the other.

The Apache Hive software provides a SQL-like querying dialect called the Hive Query Language (HQL) that can be used to query data stored in a Hadoop cluster. Once again, Hive eliminates the need to write mappers and reducers; we can simply use HQL to query the data. I will not go deep into why we need to query at all; look up SQL tutorials in a search engine.

The importance of Hive is best understood when we have the right kind of data for it to process: static data that is not changing, where quick response time is not the priority. Hive is just software on top of the Hadoop framework; it is not a database package. We are querying data that lives in HDFS, and we use Hive wherever the need for SQL-like querying arises.
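To give a flavor of that, here is a rough sketch of pointing Hive at data already sitting in HDFS and querying it. The column layout matches the employee table used earlier, but the HDFS path and the comma delimiter are assumptions (comma-delimited text is sqoop's default import format):

hive -e "
-- the LOCATION path below is an assumption about where the earlier import landed
CREATE EXTERNAL TABLE IF NOT EXISTS employee_hdfs (
  EmpID INT,
  EmpName STRING,
  Dept STRING,
  Salary DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/hadoop/employee';
SELECT Dept, AVG(Salary) FROM employee_hdfs GROUP BY Dept;"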

HCatalog exposes the table metadata that Hive keeps in its metastore to other tools; it just popped into my mind as I write this post, so we'll look into it later. Hive is ideal for data warehousing applications, where data is simply stored and mined when necessary for report generation, visualization, etc.

Following is the Apache definition of Hive:

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Pig is a procedural data flow language, the Pig Latin script (think of Python or Perl), while Hive is a SQL-like querying language. Just like they say, Pig can eat anything: Pig can take both structured and unstructured data, whereas Hive can only process structured data. Data representation in Pig is through variables, whereas in Hive it is through tables. Hive also supports UDFs, but writing them is more involved than in Pig.

Installation:

Very similar to Pig

1. Download Apache Hive
$ wget http://www.eu.apache.org/dist/hive/hive-latest-version/hive-latest-version.tar.gz

2. Extract the Tarball

$ tar -zxvf hive-latest-version.tar.gz

3. Add the following to your previously created hadoop configuration bash script:

export HIVE_HOME=/data/hive-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop

export PATH=$HIVE_HOME/bin:$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$HIVE_HOME/hive-latest-version.jar

4. Run the configuration file

$ . hadoopconfiguration.sh

5. Now launch Hive:

$ hive

Hadoop should be live.
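If the hive> prompt appears, a couple of throwaway statements confirm that Hive can talk to the cluster (the table name here is just a scratch name):

hive> SHOW DATABASES;
hive> CREATE TABLE sanity_check (id INT);
hive> SHOW TABLES;
hive> DROP TABLE sanity_check;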