Machine Learning

H2O and Machine Learning

Working with H2O has been quite an experience so far. Let's look at how to set it up. We can set up H2O as a standalone server, install it in R, or install it on Hadoop. Setting it up standalone is quite simple:

download the zip file
cd into the directory
java -jar h2o.jar

Go to http://localhost:54321/ to see the H2O web UI.


Installing in R requires slightly more complex steps, especially if you are working with Ubuntu or another Linux distribution.

Install the package by the following command:

install.packages("h2o", repos=(c("", getOption("repos"))))

Initialize the package and verify that H2O installed properly:


localH2O = h2o.init()


Installing on Hadoop requires you to have a Cloudera, Hortonworks, or MapR distribution of Hadoop running on your system; the h2o/hadoop directory ships drivers only for these distributions.



Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large scale Data Science applications.

Apache Mahout is the older project for distributed machine learning and executes on the MapReduce framework, whereas Apache Spark, built in Scala, is a comparatively recent project that performs distributed large-scale machine learning and outperforms MapReduce in computational speed. It also succeeds where MapReduce struggles, such as iterative algorithms. We even saw how Apache Mahout has been migrating from the MapReduce framework to the Mahout-Spark binding framework for faster data processing.

Let us look at another fast, large scale data processing engine that runs on Hadoop (hence, distributed) and also has possible binding framework with Mahout.

Behold, H2O. According to its developers, H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is distributed, scalable (of course), and licensed as open-source software that can run across many nodes.

H2O can work with HDFS data, natively supports Java and Scala, and can interact with Hadoop ecosystem tools (such as Apache Pig). It also works with Apache Spark: Sparkling Water is a package that integrates H2O’s fast, scalable machine learning engine with Spark.

Keep watching this space for more information on H2O in this continuation of the large scale data processing series.


Market Basket Analysis with Mahout

Also known as Affinity Analysis/Frequent Pattern Mining.
Finding patterns in huge amounts of customer transactional data is called market basket analysis. This is useful wherever a store’s transactional data is readily available. Using market basket analysis, one can find purchasing patterns. Market basket analysis is commonly implemented with association rule mining (strictly speaking, association rule mining is the technique and market basket analysis is the application), and it is also called affinity analysis or frequent pattern mining. This technique is behind customer promotional offers like buy 1 get 1 free, discounts, complimentary products, etc. that we see in departmental stores/supermarket chains.

MBA is one of the ways of recommending products to the customer. If we have customer transaction data, that is, records of the items bought by each customer (think of the receipt we get when buying products), and imagine we have a million transaction records like this, then we can find buying patterns. For example, let's assume that people in a certain locality eat chips while drinking beer. Then whoever comes into the supermarket with the sole purpose of buying beer is likely to also pick up a packet of chips (crisps if you are from the UK); this is a pattern we know from local knowledge.

But what about other patterns that we don't know or don't speak about? MBA helps us find such patterns. Why do we need this pattern information? We can use it for purchase planning and for introducing new products (not based on MBA results alone, but also with the help of statistical hypothesis testing and inference).
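To make the counting behind MBA concrete, here is a minimal pure-Python sketch using toy transactions and made-up item names. It is naive pair counting, not Mahout's FP-Growth, but it computes the same kind of support and confidence numbers:

```python
from itertools import combinations
from collections import Counter

# Toy transactions: each row is the set of items on one receipt.
transactions = [
    {"beer", "chips", "bread"},
    {"beer", "chips"},
    {"beer", "milk"},
    {"bread", "milk"},
    {"beer", "chips", "milk"},
]

# Count how often each item and each item pair occurs (its "support").
item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter()
for t in transactions:
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# Confidence of the rule beer -> chips: P(chips | beer).
support_beer = item_counts["beer"]             # 4 transactions contain beer
support_pair = pair_counts[("beer", "chips")]  # 3 contain both
confidence = support_pair / support_beer       # 0.75
print(support_beer, support_pair, confidence)
```

Real frequent-pattern miners prune the search space instead of enumerating all pairs, but the output they report (itemset, support count) has exactly this meaning.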

Let's perform market basket analysis using Apache Mahout. You should have Apache Mahout and Hadoop installed, up and running.

For the input dataset, I found anonymized Belgian supermarket transaction data, available courtesy of Dr Tom Briggs from the University of Hasselt, Belgium. We will use retail.dat as the input dataset. Each record in the dataset contains the date of purchase, the receipt number, the article number, the number of items purchased, the article price in Belgian Francs, and the customer number.

The data were collected over three non-consecutive periods. The first period runs from mid-December 1999 to mid-January 2000. The second period runs until the beginning of June 2000. The third and final period runs from the end of August 2000 to the end of November 2000. In between these periods, no data is available, unfortunately. This amounts to approximately 5 months of data. A total of 88,163 receipts were collected, and 5,133 customers purchased at least one product in the supermarket during the data collection period.

The following steps are required to perform the market basket analysis with Apache Mahout:

1. Copy the data from local disk to HDFS:
hadoop fs -copyFromLocal /home/path to your file/retail.dat retail.dat

2. Execute Mahout’s FPG procedure
mahout fpg -i retail.dat -o patternone -method mapreduce

Note: the number of top items with the most frequent patterns defaults to 50, meaning the output will give you the top 50 items.
You can also run this procedure in sequential mode; just make sure the data file is in the directory where you execute the command.

3. Check the output after processing. It will be in the folder ‘patternone’ in your HDFS.

hadoop fs -ls /user/hduser/patternone

4. Identify the output, which is a sequence file under the frequentpatterns folder. Use the sequence dumper utility to extract it.
mahout seqdumper -i hdfs://localhost:54310/user/hduser/patternone/frequentpatterns/part-r-00000

5. Learn to interpret the output. It will likely be in the following key-value form:

FPM Output

The item is called the key here. ([99],32) means that item 99 appears in 32 transactions. ([142, 99],22) means items 142 and 99 appear together in 22 transactions. We don't know what those items actually are, but in real-life situations items are indexed against their names, so you can get output with the item names.

For example, item 99 could be beer; 32 means beer was bought in 32 transactions. Items 142 and 99 appearing in 22 transactions means people bought beer and chips together in those 22 instances. Maybe the remaining people already had chips at home :).

This data can be used for further analysis to determine the nature of promotion that could be offered to customers.

Building Recommender Engines with Apache Mahout: Part II

In my previous post we looked at recommendations based on user ratings. Here, we are going to build a recommender engine based on items.

Item-based Recommendation:
In item-based analysis, items related to a particular item are determined. When a user likes a particular item, items related to that item are recommended. As shown in the figure, if items A and C are highly similar and a user likes item A, then item C is recommended to the user.

Item-based recommendation calculates recommendations based on items, not users.

The objective is the same: recommend items to users. The approach, however, is different.

High-level logic (for a user u):
– for every item i that u has no preference for yet
– for every item j that u has a preference for
– compute a similarity s between i and j
– add u’s preference for j, weighted by s, to a running average
– return the top items, ranked by weighted average
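The pseudocode above can be sketched in plain Python. The preference data and the similarity measure here are invented for illustration; Mahout's actual implementation is distributed and supports proper similarity metrics:

```python
# Toy preference data: user -> {item: rating}. All names are made up.
prefs = {
    "u1": {"A": 5.0, "B": 3.0, "C": 4.0},
    "u2": {"A": 4.0, "B": 2.0, "C": 5.0},
    "u3": {"A": 5.0, "B": 1.0},
}

def item_similarity(i, j):
    """Crude similarity: inverse of the mean absolute rating difference
    over users who rated both items (1.0 when ratings are identical)."""
    diffs = [abs(r[i] - r[j]) for r in prefs.values() if i in r and j in r]
    if not diffs:
        return 0.0
    return 1.0 / (1.0 + sum(diffs) / len(diffs))

def recommend(user, top_n=3):
    rated = prefs[user]
    all_items = {i for r in prefs.values() for i in r}
    scores = {}
    for i in all_items - set(rated):        # every item i with no preference yet
        num = den = 0.0
        for j, pref_j in rated.items():     # every item j the user has rated
            s = item_similarity(i, j)
            num += s * pref_j               # preference for j, weighted by s
            den += s
        if den > 0:
            scores[i] = num / den           # running weighted average
    # return the top items, ranked by weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u3"))  # the only unseen item for u3 is C
```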

We are going to use MovieLens data (similar to the IMDb database) with 1,000,209 ratings of 3,900 movies by 6,040 users.
Format of the data set:
User ID :: Item ID :: Rating :: Time Stamp

  • The format is similar to that of the 100,000-rating dataset, but the approach here focuses on the item (and not the user)
  • Ratings are anonymous.
  • Item-based recommender in Mahout supports a distributed execution model and it can be computed using MapReduce

The following steps are required to build an item-based recommendation in Mahout:

#1 Pre-processing the data
awk -F"::" '{print $1","$2","$3}' ml-1m/ratings.dat > ratings.csv
we are using awk to convert the input data file into comma-separated-value format and write it to a new CSV file
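If you prefer to stay in one language, the same conversion can be done in a few lines of Python (a sketch equivalent to the awk one-liner; the sample lines follow the MovieLens "::" format):

```python
# Convert "::"-delimited MovieLens ratings into "userID,itemID,rating" rows,
# dropping the timestamp column (same effect as the awk command above).
def ratings_to_csv(lines):
    out = []
    for line in lines:
        user, item, rating, _timestamp = line.strip().split("::")
        out.append(f"{user},{item},{rating}")
    return out

sample = ["1::1193::5::978300760", "1::661::3::978302109"]
print(ratings_to_csv(sample))
```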

#2 Copy the Files to HDFS
hadoop fs -copyFromLocal ratings.csv users.txt hdfs://localhost:54310/user/hduser/itemreco

simple command to copy the files into hadoop distributed file system

#3 Use the following mahout function to compute the recommendation based on product
mahout recommenditembased \
-Dmapred.reduce.tasks=10 \
--input ratings.csv \
--output item-rec-output \
--tempDir item-rec-tmp \
--similarityClassname SIMILARITY_PEARSON_CORRELATION \
--usersFile user-ids.txt

Parameter detail:

–Dmapred.reduce.tasks → assigning the number of map/reduce tasks
–similarityClassname → specifying the similarity distance metric. In this case, Pearson Correlation.
–input → path to input file
–output → path to output directory
–tempDir → path to temporary directory (optional parameter, can be used to store temp data)
–usersFile → path to user-ids

The output will look like this:


Building Recommender Engines with Apache Mahout: Part I

Introduction to recommendation:
Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user.
Two broad types of recommendation:
– User-based recommendation
– Item-based recommendation
Recommendation engines aim to show items of interest to a user. Recommendation engines in essence are matching engines that take into account the context of where the items are being shown and to whom they’re being shown.
Recommendation engines are one of the best ways of utilizing collective intelligence in your application.
Collaborative Filtering: the process of producing recommendations based on, and only based on, knowledge of users’ relationships to items is called collaborative filtering.

User-based recommendation:

In user-based analysis, users similar to a given user are first determined. As shown in the figure, if a user likes item A, then the same item can be recommended to other users who are similar to that user. Similar users can be obtained using
profile-based information about the user: for example, cluster the users based on their attributes, such as age, gender, geographic location, net worth, and so on. Alternatively, you can find similar users using a collaborative approach, by analyzing the users’ actions.


  • Recommendation based on historical ratings and reviews by users
  • Answers the question: how similar are users to other users?
  • Mahout does not use MapReduce to compute user-based recommendation because the user-based recommender is only designed to work within a single JVM

High-level logic (for a user u)

– for every item i that u has no preference for yet
– for every other user v that has a preference for i
– compute a similarity s between u and v
– incorporate v's preference for i, weighted by s, into a running average
– return the top items, ranked by weighted average
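This pseudocode can also be sketched in plain Python with toy data (the ratings and the similarity measure are invented for illustration; Mahout's recommender supports configurable similarity metrics such as Pearson correlation):

```python
# Toy ratings: user -> {item: rating} (made-up data for illustration).
ratings = {
    "u": {"m1": 4.0, "m2": 5.0},
    "v": {"m1": 5.0, "m2": 4.0, "m3": 5.0},
    "w": {"m1": 1.0, "m2": 2.0, "m3": 1.0},
}

def user_similarity(a, b):
    """Similarity from the mean absolute difference on co-rated items."""
    common = set(ratings[a]) & set(ratings[b])
    if not common:
        return 0.0
    mad = sum(abs(ratings[a][i] - ratings[b][i]) for i in common) / len(common)
    return 1.0 / (1.0 + mad)

def recommend(u, top_n=3):
    scores = {}
    seen = set(ratings[u])
    for i in {i for r in ratings.values() for i in r} - seen:
        num = den = 0.0
        for v in ratings:
            if v != u and i in ratings[v]:   # other users with a preference for i
                s = user_similarity(u, v)
                num += s * ratings[v][i]     # v's preference for i, weighted by s
                den += s
        if den:
            scores[i] = num / den            # running weighted average
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

print(recommend("u"))  # u has not rated m3; similar user v rated it highly
```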

We are going to use MovieLens data (similar to the IMDb database) with 100,000 ratings of 1,682 movies by 943 users
Format of the data set:
User ID tab Item ID tab Rating tab Time Stamp
– Not everyone has seen all the movies
– Each user rated at least 20 movies
– The objective is to recommend movies to the 943 users
The following steps are required to build a user-based recommendation in Mahout:

#1 Pre-processing the Data
sed 's/\t/,/g' ua.base > final.csv
→ using sed to convert the tab separators into commas and write the result to a CSV file.

#2 Split the input dataset into training data and test data
mahout splitDataset \
--input reco/final.csv \
--output reco/ \
--trainingPercentage 0.9 \
--probePercentage 0.1

–input → is the input file directory
–output → is the output file directory
–trainingPercentage → data split for training data
–probePercentage → data split for test Data
Using mahout’s splitDataset function, we are splitting final.csv into two datasets with predefined training and test data percentages.
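Conceptually, the split is just a shuffled partition of the records. A toy Python sketch (not Mahout's distributed implementation) over a small made-up ratings list:

```python
import random

def split_dataset(records, training_pct=0.9, seed=42):
    """Randomly split records into a training set and a probe (test) set,
    mirroring what Mahout's splitDataset does at scale."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * training_pct)
    return shuffled[:cut], shuffled[cut:]

# Ten made-up "user,item,rating" records.
records = [f"{u},{i},{r}" for u, i, r in [(1, 10, 4), (1, 20, 5), (2, 10, 3),
                                          (2, 30, 2), (3, 20, 4), (3, 30, 5),
                                          (4, 10, 1), (4, 40, 5), (5, 40, 3),
                                          (5, 20, 2)]]
train, probe = split_dataset(records, training_pct=0.9)
print(len(train), len(probe))  # 9 1
```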

#3 Use Mahout’s parallelALS to compute the ratings matrix
Use Mahout’s Alternating Least Squares with Weighted Lambda-Regularization to find the decomposition. This method is faster than singular value decomposition.
mahout parallelALS \
--input hdfs://localhost:54310/user/hduser/reco/trainingSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/out \
--numFeatures 20 \
--numIterations 10 \
--lambda 0.065

–input → path to input directory
–output → path to output directory
–numFeatures → dimensions of the feature space
–numIterations → number of iterations
–lambda → the regularization parameter
The above procedure computes the ratings matrix factorization required for the recommendation computation.

#4 Use Mahout’s evaluateFactorization to compute the RMSE and MAE of a rating matrix factorization against the probe set
RMSE – Root Mean Square Error
MAE – Mean Absolute Error
These metrics help measure the model’s accuracy.
mahout evaluateFactorization \
--input hdfs://localhost:54310/user/hduser/reco/probeSet/ \
--output hdfs://localhost:54310/user/hduser/reco/als/rmse/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/

–input → path to the input directory
–output → path to the output directory
–userFeatures → path to the user feature matrix
–itemFeatures → path to item feature matrix
The output will be stored in RMSE.txt in the HDFS file system
– Output obtained was: 0.9760895406876606
– Lower is better; this value is acceptable, so we can proceed to run the main recommendation algorithm
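For reference, RMSE and MAE are simple to compute by hand. A pure-Python sketch with made-up ratings and predictions (lower values of both mean a better fit):

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean square error: penalizes large errors more heavily."""
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: the average size of the prediction error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy ratings vs. predictions from a factorized model (made-up numbers).
actual = [5.0, 3.0, 4.0, 1.0]
predicted = [4.5, 3.5, 4.0, 2.0]
print(rmse(actual, predicted), mae(actual, predicted))
```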

#5 Mahout’s recommendfactorized function will compute recommendations based on the ratings matrix
mahout recommendfactorized \
--input hdfs://localhost:54310/user/hduser/reco/als/out/userRatings/ \
--output hdfs://localhost:54310/user/hduser/reco/recommendations/ \
--userFeatures hdfs://localhost:54310/user/hduser/reco/als/out/U/ \
--itemFeatures hdfs://localhost:54310/user/hduser/reco/als/out/M/ \
--numRecommendations 6 --maxRating 5

–input → path to the input directory
–output → path to the output directory
–userFeatures → path to the user feature matrix
–itemFeatures → path to item feature matrix
–numRecommendations → number of recommendations made to each user
–maxRating → maximum rating on the rating scale

This will start computing recommendations based on the factorized matrix; the –numRecommendations option defines how many recommendations are produced per user.
The output will be in the following format:
{movieID1:rating, movieID2:rating,movieID3:rating,
movieID4:rating,movieID5:rating, movieID6:rating}


Multi-level classification using stochastic gradient descent

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier.
There are two types of Classification: Binary classification and Multi-level classification

  • Binary classification is the task of classifying the members of a given set of objects into two groups on the basis of whether they have some property or not.
  • Multi-level classification is the problem of classifying instances into more than two classes.

The input here consists of transcripts of months of postings made in 20 Usenet newsgroups in the early 1990s. Usenet is a worldwide Internet discussion system. Newsgroups are typically accessed with newsreaders: applications that allow users to read and reply to postings in newsgroups. These applications act as clients to one or more news servers.

The major set of worldwide newsgroups is contained within nine hierarchies, eight of which are operated under consensual guidelines that govern their administration and naming. The current Big Eight are:

  • comp.* – computer-related discussions (, comp.sys.amiga)
  • humanities.* – fine arts, literature, and philosophy (humanities.classics,
  • misc.* – miscellaneous topics (,,
  • news.* – discussions and announcements about news (meaning Usenet, not current events) (news.groups, news.admin)
  • rec.* – recreation and entertainment (, rec.arts.movies)
  • sci.* – science related discussions (sci.psychology, sci.research)
  • soc.* – social discussions (, soc.culture.african)
  • talk.* – talk about various controversial topics (talk.religion, talk.politics,

The input data set consists of 20 such newsgroups; under each newsgroup we have postings. We divide the data set into training data and test data (as this is a supervised learning algorithm). The objective is to first obtain parameters from the training data set and then test the model’s accuracy by running it on the test data set.

Listed below is the “format” of the data set:
From: (anonymous)
Subject: Re: about the bible quiz answers
Organization: AT&T
Distribution: na
Lines: 18
In article <>, (Anonymous Person1) writes:
> #12) the quick fox jumped over a lazy frog
> .
the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog the quick fox jumped over a lazy frog

The following steps are required to perform a multi-class classification task:
#1 Create a new directory named 20news-all and copy the complete data set into the directory

$ mkdir 20news-all
$ cp -R 20news-bydate/*/* 20news-all

#2 Create sequence files from the 20 Newsgroups data set. The sequence file format is the intermediate data format used by Mahout.

$ mahout seqdirectory \
> -i 20news-all \
> -o 20news-seq

#3 Convert the sequence files into sparse TF-IDF vectors:

$ mahout seq2sparse \
> -i 20news-seq \
> -o 20news-vectors -lnorm -nv -wt tfidf
-i → input directory
-o → output directory
-lnorm → log-normalize the output (a method of standardization)
-nv → create named vectors
-wt → weighting scheme (TF-IDF weights to use in the algorithm)
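To give a feel for what the TF-IDF weighting computes, here is a toy pure-Python sketch. Mahout's seq2sparse applies its own normalization and smoothing, so treat this only as an illustration of the idea:

```python
from math import log
from collections import Counter

# Toy corpus standing in for the newsgroup documents.
docs = [
    "the quick fox jumped over a lazy frog".split(),
    "the quick brown fox".split(),
    "a lazy lazy frog".split(),
]

n_docs = len(docs)
df = Counter()          # document frequency: in how many docs a term appears
for d in docs:
    df.update(set(d))

def tfidf(doc):
    """Weight each term by term frequency times inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * log(n_docs / df[t]) for t in tf}

weights = tfidf(docs[2])
print(weights["lazy"] > weights["frog"])  # "lazy" occurs twice in this doc
```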

#4 Split the generated vector data set to create two sets:
Training set: data used to train the classification algorithm and produce the model
Test set: data used to test the classification algorithm
In this case we take 20% of the documents in each category and write them to the test directory; the rest of the documents are written to the training data directory.

mahout split \
-i 20news-vectors/tfidf-vectors \
-tr 20news-train-vectors \
-te 20news-test-vectors \
-rp 20 \
-ow \
-seq -xm sequential
-tr → training vectors
-te → test vectors
-rp → random selection percentage
-ow → overwrite
-seq -xm → method of writing the vectors

#5 Train the vectors using Mahout’s naïve Bayes classifier:

mahout trainnb \
-i hdfs://localhost:54310/user/hduser/20news-train-vectors -el \
-o hdfs://localhost:54310/user/hduser/model \
-li hdfs://localhost:54310/user/hduser/labelindex \
-ow -c
-i → input directory
-el → extract the label index (to prepare named vectors)
-o → output directory
-li → the path to store the extracted label index
-ow → overwrite
-c → train complementary
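The core of what trainnb learns can be illustrated with a tiny multinomial naive Bayes in pure Python (the documents and labels are invented for the example; Mahout's implementation is distributed and adds refinements such as complementary training):

```python
from math import log
from collections import Counter, defaultdict

# Tiny training corpus: (newsgroup label, tokenized document).
train = [
    ("rec.sport", "great game win team".split()),
    ("rec.sport", "team season game".split()),
    ("sci.space", "orbit launch rocket".split()),
    ("sci.space", "rocket engine launch team".split()),
]

word_counts = defaultdict(Counter)   # label -> word frequencies
label_counts = Counter()
vocab = set()
for label, words in train:
    label_counts[label] += 1
    word_counts[label].update(words)
    vocab.update(words)

def predict(words):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    best_label, best_score = None, float("-inf")
    for label in label_counts:
        total = sum(word_counts[label].values())
        score = log(label_counts[label] / len(train))         # log prior
        for w in words:                                       # log likelihoods
            score += log((word_counts[label][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(predict("rocket orbit".split()))
```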

#6 Test the model using Mahout’s naïve Bayes:

$ mahout testnb \
> -i hdfs://localhost:54310/user/hduser/20news-test-vectors \
> -m hdfs://localhost:54310/user/hduser/model \
> -l hdfs://localhost:54310/user/hduser/labelindex \
> -ow -o hdfs://localhost:54310/user/hduser/20newstesting2
-i → input directory
-m → model used to test the dataset
-l → path to the label index (required when using named vectors)
-ow → overwrite
-o → output directory

Upon successful execution we can observe the following output: the confusion matrix. (A confusion matrix contains information about the actual and predicted classifications produced by a classification system. Performance of such systems is commonly evaluated using the data in the matrix.)
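Computing a confusion matrix from predictions is straightforward; a small Python sketch with made-up labels:

```python
from collections import Counter

def confusion_matrix(actual, predicted, labels):
    """Rows are the actual class, columns the predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]

labels = ["sport", "space"]
actual    = ["sport", "sport", "space", "space", "space"]
predicted = ["sport", "space", "space", "space", "sport"]

m = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, m):
    print(label, row)

# Accuracy is the sum of the diagonal over the total count.
accuracy = sum(m[i][i] for i in range(len(labels))) / len(actual)
print(accuracy)  # 0.6
```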

It's the confusion matrix! (aptly named, isn't it?)

Performing document clustering using Apache Mahout k-means

Introduction to k-means clustering:
Clustering is all about organizing items from a given collection into groups of similar items. These clusters could be thought of as sets of items similar to each other in some ways but dissimilar from the items belonging to other clusters.

Clustering a collection involves three things:

  • An algorithm—This is the method used to group the items together.
  • A notion of both similarity and dissimilarity—which items belong in an existing stack and which should start a new one.
  • A stopping condition—this might be the point beyond which items can’t be added to a cluster anymore, or when the clusters are already quite dissimilar.
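The classic k-means algorithm (Lloyd's iterations) behind Mahout's implementation can be sketched in a few lines of pure Python; the points below are made up so that the two clusters are obvious:

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on tuples of floats."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                        # assignment step
            idx = min(range(k), key=lambda c: dist2(p, centroids[c]))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):  # centroid update step
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return centroids, clusters

# Two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

Mahout adds a convergence delta (-cd) as the stopping condition and distributes both steps over MapReduce.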

Running k-means clustering in Apache Mahout:
The k-means implementation in Apache Mahout is mainly geared towards text processing; if you need to process numerical data, you need to write utility functions that convert the numbers into sequence-file vector format. For the standard “Reuters” example, the first few Mahout steps actually do this data preparation.
To be explicit, for the Reuters example the original downloaded files are in SGML format, which is similar to XML. So we first need to parse these files into document-id and document-text pairs. For this preprocessing job, a much quicker way is to use the Reuters parser provided in the Lucene benchmark JAR file. After that we can convert the files into sequence files. A sequence file has a key-value structure: the key is the document id and the value is the document content. This step is done using ‘seqdirectory’, a built-in Mahout method that converts input files into sequence files. We then convert the sequence files into term-frequency vectors using Mahout’s ‘seq2sparse‘ built-in method. The generated vectors directory should contain the following items:

  • reuters-vectors/df-count
  • reuters-vectors/dictionary.file-0
  • reuters-vectors/frequency.file-0
  • reuters-vectors/tf-vectors
  • reuters-vectors/tfidf-vectors
  • reuters-vectors/tokenized-documents
  • reuters-vectors/wordcount


We will then use the tfidf-vectors to run k-means. You can give a placeholder initial-centroid path; given the argument k, Mahout will randomly select k points to initialize the clustering. After successfully running the k-means algorithm, we can use Mahout’s ‘clusterdump‘ utility to export the data from HDFS to a text or CSV file.

The following are the actual commands used to perform this task.

Get the data from:

Place it within the examples folder under the Mahout home directory:
mkdir reuters
cd reuters
mkdir reuters-out
mv reuters21578.tar.gz reuters-out
cd reuters-out
tar -xzvf reuters21578.tar.gz
cd ..

Mahout Commands

#1 Run the org.apache.lucene.benchmark.utils.ExtractReuters class
org.apache.lucene.benchmark.utils.ExtractReuters reuters-out

#2 Copy the file to your HDFS
bin/hadoop fs -copyFromLocal

#3 Generate sequence-file
mahout seqdirectory -i hdfs://localhost:54310/user/bigdata/reuters-text
-o hdfs://localhost:54310/user/bigdata/reuters-seqfiles -c UTF-8 -chunk 5
-chunk → chunk size (in MB) for the generated sequence files
-c UTF-8 → character encoding of the input files

#4 Check the generated sequence-file
mahout-0.7$ ./bin/mahout seqdumper -i /your-hdfs-path-to/reuters-seqfiles/chunk-0 | less

#5 From sequence-file generate vector file
mahout seq2sparse -i
hdfs://localhost:54310/user/bigdata/reuters-seqfiles -o
hdfs://localhost:54310/user/bigdata/reuters-vectors -ow
-ow → overwrite

#6 Take a look at the output directory; it should have 7 items
bin/hadoop fs -ls
bin/hadoop fs -ls reuters-vectors

#7 Check the vector: reuters-vectors/tf-vectors/part-r-00000
mahout-0.7$ hadoop fs -ls reuters-vectors/tf-vectors

#8 Run canopy clustering to get optimal initial centroids for k-means
mahout canopy \
-i hdfs://localhost:54310/user/bigdata/reuters-vectors/tf-vectors \
-o hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
-dm org.apache.mahout.common.distance.CosineDistanceMeasure \
-t1 1500 -t2 2000

-dm → specifying the distance measure to be used while clustering (here it is cosine distance measure)

#9 Run k-means clustering algorithm
mahout kmeans \
-i hdfs://localhost:54310/user/bigdata/reuters-vectors/tfidf-vectors \
-c hdfs://localhost:54310/user/bigdata/reuters-canopy-centroids \
-o hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters \
-cd 0.1 -ow -x 20 -k 10

-i → input
-o → output
-c → initial centroids for k-means (not defining this parameter will
trigger k-means to generate random initial centroids)
-cd → convergence delta parameter
-ow → overwrite
-x → specifying number of k-means iterations
-k → specifying number of clusters

#10 Export k-means output using Cluster Dumper tool
mahout clusterdump -dt sequencefile \
-d hdfs://localhost:54310/user/bigdata/reuters-vectors/dictionary.file-* \
-i hdfs://localhost:54310/user/bigdata/reuters-kmeans-clusters/clusters-8-final \
-o clusters.txt -b 15

-dt → dictionary type
-b → number of characters to print per entry (substring length)

The output would look like this (note: this screenshot was taken from a real-life document clustering task on big data, not the Reuters example):



Setting up Apache Mahout

Apache Mahout is a beautiful and scalable machine learning library, built with Maven, for solving large-scale machine learning problems. We use Mahout on an Apache Hadoop cluster, as the primary purpose of using Mahout is to solve large-scale (big data) problems. Refer to the Apache Hadoop website to set up a single-node/multi-node cluster. Here we look at how to install Mahout, given that Hadoop is set up and running.

Follow these simple steps:

  • Download Apache Mahout 0.7 (tried and tested, stable version) from Apache Mahout website.

$ wget

  •  Install Apache Mahout-distribution 0.7

$ tar -xzvf mahout-distribution-0.7.tar.gz

  • Add the Mahout environment settings to your bash startup script (e.g. edit it with vim) along with HADOOP_HOME and JAVA_HOME:

export MAHOUT_HOME=/home/hduser/mahout-distribution-0.7/
export MAHOUT_CONF_DIR=/home/hduser/mahout-distribution-0.7/conf
export CLASSPATH=$MAHOUT_HOME/mahout-core-0.7.jar

  • Source the configuration file so that the paths are loaded into your shell environment

$ . ~/.bashrc

  • After initializing Hadoop, type ‘mahout’ and you should obtain the following output:
Mahout screen
