Month: January 2014

Book Review: Apache Mahout Cookbook

Quick summary:
Very well written for developers who are new to both Mahout and machine learning, with walk-throughs and screenshots. However, if you have experience writing heuristics or expertise in machine learning, you can skip this book. It is concise and to the point, though it has a few clerical errors and typos. This book certainly makes a wonderful academic companion for anyone who plans to use Mahout in an academic research project.

Detailed Review:
When I was asked to review this book, I was skeptical because the TSP recipe it includes is no longer supported by Mahout. I feel a technical cookbook should have real-world use cases, and here was a recipe that cannot be practically implemented and hence misrepresents Mahout’s capabilities. However, when I read the book right from chapter 1, I found it so well written that anyone can understand setting up and working with Mahout. Caveat: you should have some knowledge of software development and Java programming.

I disagree with comments that most of the recipes in this book can be obtained through a Google search. The book carefully explains each concept with output screenshots and also walks through how to implement it in Netbeans. I am glad to see the author using Netbeans; I personally support that, and it is easy to work with. Recipes such as importing/exporting data from HDFS/RDBMS and spectral clustering are a highlight. The author does not assume that the reader is familiar with MySQL, so there is a walkthrough on installing it. Topic modeling and pattern mining are good to see.

There is an entire chapter walking through classification (binary and multi-level) in Mahout, a topic for which plenty of tutorials are available on the web and which is well covered in Mahout in Action (MiA). The same goes for k-means. Also, based on discussions with developers, it is pretty conclusive that the MapReduce version of genetic programming may not see the light of day in a future Mahout release. My personal recommendation is not to get too involved with chapter 10. Also, the TSP example is basically a sample and not a real-life one. For those who want to learn more, I would suggest looking up the Watchmaker project. Instead of the outdated TSP demo, I would have liked to see a Hidden Markov Model case study, even though it is only partially parallelized.

I personally would like to see a second edition with more in-depth recipes where data is extracted and cleansed using Pig/Hive, then fed to Mahout to produce meaningful results. I would like to see detailed coverage of building recommendation engines, building a fraud detection engine based on a large amount of data that is transformed using Pig, and finding hidden patterns where Hadoop ecosystem tools are put to use. The author’s choice of preferred NoSQL database in the Mahout context would also be good to see.

You can buy this book at Packt Publishing

HBase and NoSQL Architecture: Part I

HBase is the NoSQL database for Hadoop. It is a distributed database. A NoSQL database can be understood as a database that is not relational in nature and does not use SQL as its primary access language. HBase is ideal if you have huge amounts of data. It supports massively parallelized processing via MapReduce and is designed to handle billions of rows and columns, in other words big data.

HBase is quite hard to understand, especially if you have used an RDBMS earlier. HBase is modeled after Google’s BigTable. According to the Google research paper, a Bigtable is a sparse, distributed, persistent multidimensional sorted map, and the map is indexed by a row key, column key, and a timestamp. A map is simply an abstract data type with a collection of keys and values, where each key is associated with one value. One of HBase’s many features is timestamping/versioning. HBase does not overwrite a cell value in place. Instead, it keeps versions of cell values, and every cell value maintains its own timestamp. The number of versions retained is configurable per column family; the default was 3 in older releases and has been 1 since HBase 0.96.
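
To make the "multidimensional sorted map" idea concrete, here is a minimal sketch in plain Java. It is only a toy in-memory model and not the HBase client API: the class name SortedCellMap, its methods, and the maxVersions parameter are all made up for illustration. It assumes a cell is addressed by row key, column key, and timestamp, and that old versions are dropped once a configured limit is exceeded.

```java
import java.util.Collections;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy in-memory model of the BigTable/HBase data model; NOT the HBase client API. */
class SortedCellMap {
    // row key -> column key -> timestamp (newest first) -> value
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();
    // Hypothetical per-table version limit (HBase configures this per column family).
    private final int maxVersions;

    SortedCellMap(int maxVersions) {
        this.maxVersions = maxVersions;
    }

    /** A write never replaces an existing cell; it adds a new timestamped version. */
    void put(String row, String column, long timestamp, String value) {
        NavigableMap<Long, String> versions = rows
                .computeIfAbsent(row, r -> new TreeMap<>())
                .computeIfAbsent(column, c -> new TreeMap<>(Collections.reverseOrder()));
        versions.put(timestamp, value);
        // Drop the oldest versions once the limit is exceeded.
        while (versions.size() > maxVersions) {
            versions.pollLastEntry();
        }
    }

    /** Returns the newest value of a cell, or null if the cell was never written. */
    String get(String row, String column) {
        NavigableMap<Long, String> versions = versions(row, column);
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    /** Returns all retained versions of a cell, newest first. */
    NavigableMap<Long, String> versions(String row, String column) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(row);
        if (columns == null || !columns.containsKey(column)) {
            return new TreeMap<>();
        }
        return columns.get(column);
    }

    public static void main(String[] args) {
        SortedCellMap table = new SortedCellMap(2);
        table.put("row1", "info:name", 1L, "alice");
        table.put("row1", "info:name", 2L, "alicia");
        table.put("row1", "info:name", 3L, "alyssa");
        System.out.println(table.get("row1", "info:name"));      // alyssa
        System.out.println(table.versions("row1", "info:name")); // {3=alyssa, 2=alicia}
    }
}
```

With a limit of two versions, the third write pushes out the oldest value, so a read returns the newest version while the previous one is still available by timestamp. The map is also sparse in the same sense as the paper: a row only holds the columns that were actually written.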

Let’s look at how HBase can be installed on Hadoop. Hadoop should be live with all daemon processes running.

1. Download the tarball of the latest HBase
$ wget http://hbase-latest-version.tar.gz

2. Extract the tarball
$ tar -zxvf hbase-latest-version.tar.gz

3. Go to hbase-latest-version and cd into conf
$ cd hbase-latest-version/conf/

4. Replace the hbase-site.xml with the following information
<!-- This is the location in HDFS that HBase will use to store its files -->

5. Update the regionservers file with ‘localhost’
$ vi regionservers --> localhost --> :wq

6. Exit the conf directory and do bin/
$ bin/
You should have four additional processes running: HMaster, ZooKeeper, HRegionServer, HQuorumPeer

7. Launch HBase
$ bin/hbase shell

you will get the command line like this,


It is actually very difficult to understand HBase in one go because of its NoSQL architecture, so let’s look at it separately in detail.

Data Crawling using Jsoup

Data crawling, in simple terms, is extracting data from websites. You need such information to analyze and derive meaningful results. The web is filled with a variety of information, and how we use it to optimize our business decisions is part of a data scientist’s work. So let’s dive in:

We are looking at Jsoup, a Java API that will be used to extract information from websites. Today we will use a very simple example to demonstrate how we can use it. Jsoup is available at:

The following is the simplest Java program that uses the Jsoup API:

import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
public class crawl {
    public static void main(String[] args) {
        try {
            // Put the URL of the page to crawl between the quotes.
            Document doc = Jsoup.connect("").get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            System.out.println("Exception occurred");
        }
    }
}


Download the core jar file so that we can use it with our Java program. There are two ways to use it:

  • Add the jar to Netbeans (or Eclipse) classpath
  • Directly run from terminal without adding to classpath

Go to Netbeans->Tools->Libraries->Class Library->Add New->Select the jar file and press OK. You can also add it as a dependency in your project.

In terminal the following two commands will be useful to compile and run. If our program is named, then the following should be performed.

user@ubuntu:~/$ javac -cp jsoup-1.7.3.jar

user@ubuntu:~/$ java -cp .:jsoup-1.7.3.jar crawl

We will get output in which all the text and the title of the story are crawled. The output is too ugly to post here and needs formatting, but one can see the functionality in this example.

Apache Hive: to query structured data from HDFS

If we want to talk about Hive, we need to understand the difference between Pig and Hive, as people can easily confuse the two or ask questions like "why do we need this when we have that?"

The Apache Hive software provides a SQL-like querying dialect called Hive Query Language (HQL) that can be used to query data stored in a Hadoop cluster. Once again, Hive eliminates the need for writing mappers and reducers, and we can use HQL to query the data. I will not go deep into why we need to query; look up SQL tutorials in a search engine.

The importance of Hive can only be understood when we have the right kind of data for Hive to process: static data that is not changing, where quick response time is not the priority. Hive is only software on top of the Hadoop framework; it is not a database package. We are querying the data from HDFS, and we use Hive where the need for SQL-like querying arises.

HCatalog stores the metadata from HQL. It just popped into my mind as I write this post. We’ll look into it later. Hive is ideal for data warehousing applications, where data is just stored and mined when necessary for report generation, visualization, etc.

Following is the Apache definition of Hive:

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
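
The phrase "project structure onto this data" can be sketched in a few lines of plain Java. This is only a toy illustration and not how Hive is implemented: the class SchemaOnReadSketch, its method names, the tab delimiter, and the sample records are all assumptions made up for this example. The idea it tries to show is that the raw lines stay untouched in storage, and column names are applied only at the moment a query reads them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of projecting a schema onto raw text at read time; NOT Hive internals. */
class SchemaOnReadSketch {
    /** Applies column names to one raw tab-delimited line (delimiter is an assumption). */
    static Map<String, String> project(String[] columns, String rawLine) {
        String[] fields = rawLine.split("\t", -1);
        Map<String, String> row = new LinkedHashMap<>();
        for (int i = 0; i < columns.length; i++) {
            // Missing trailing fields become null instead of failing the read.
            row.put(columns[i], i < fields.length ? fields[i] : null);
        }
        return row;
    }

    /** Roughly "SELECT * FROM lines WHERE column = value", evaluated over raw text. */
    static List<Map<String, String>> selectWhere(
            String[] columns, List<String> rawLines, String column, String value) {
        List<Map<String, String>> result = new ArrayList<>();
        for (String line : rawLines) {
            Map<String, String> row = project(columns, line);
            if (value.equals(row.get(column))) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String[] columns = {"id", "name", "city"};
        List<String> rawLines = Arrays.asList(
                "1\talice\tparis", "2\tbob\tberlin", "3\tcarol\tparis");
        // Prints [{id=1, name=alice, city=paris}, {id=3, name=carol, city=paris}]
        System.out.println(selectWhere(columns, rawLines, "city", "paris"));
    }
}
```

In Hive itself the table definition plays the role of the columns array and the query is written in HiveQL, so none of this looping is written by hand.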

Pig is a procedural data flow language, scripted in Pig Latin (think of Python or Perl). Hive is like a SQL querying language. As they say, Pig can eat anything, which means Pig can take structured and unstructured data. Hive, on the other hand, can only process structured data. Data representation in Pig is through variables, whereas in Hive it is through tables. Hive also supports UDFs, but they are more complex than in Pig.


Very similar to Pig

1. Download Apache Hive
$ wget

2. Extract the Tarball

$ tar -zxvf hive-latest-version.tar.gz

3. Add the following to your previously created hadoop configuration bash script:

export HIVE_HOME=/data/hive-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop

export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$HIVE_HOME/hive-latest-version.jar

4. Run the configuration file

$ .

5. Now launch Hive,

$ hive

Hadoop should be live.