Data science

Some thoughts on MongoDB

Let’s talk about MongoDB. It is a NoSQL database, which is a controversial topic these days, so I should start with a disclaimer.

NoSQL databases cannot and should not be compared with SQL databases; it is like comparing apples and oranges. Imagine you have a social media website with data about users (members): profile descriptions, messaging history, pictures, videos, and user-generated content (status updates, etc.).

In a pure SQL environment, you might use different databases and tables to store the various types of data and perform an inner join or outer join depending on what you need: separate tables for user information and status updates, and a different method of storing pictures and videos. In a NoSQL environment, you may be able to scale the database horizontally and store everything under one roof. Imagine that each row (or each document) represents the record for one member, with fields for user information, status updates, images, and so on, so the developer/data scientist no longer has to do joins. Given the large volume of user- and machine-generated data being created, this can be a good option (the schema can grow dynamically by adding more fields to a record).

MongoDB is a document-storage NoSQL database that stores records as key:value pairs. In MongoDB, a table is called a collection, a row is called a document, and a column is called a field. I am using MongoDB to store a bunch of JSON documents.
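To make the terminology concrete, here is a small Python sketch of what one member’s document might look like, with a plain list and dicts standing in for a collection (the field names are hypothetical examples, not from any real schema):

```python
# A "collection" is modeled as a list; each "document" is a dict (key:value pairs).
# Field names below are invented examples for a social media member record.
members = []

# One document holds everything about a member, so no joins are needed.
doc = {
    "_id": 1,
    "name": "alice",
    "profile": {"description": "data science fan"},
    "status_updates": ["hello world", "trying out MongoDB"],
}
members.append(doc)

# The schema is dynamic: a new field can be added to any document at any time.
doc["pictures"] = ["profile.jpg"]

print(len(members))        # number of documents in the "collection"
print(sorted(doc.keys()))  # fields of this document
```

In real MongoDB the same shape is stored as a BSON document in a collection, and new fields can likewise be added per document without any schema migration.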

You may follow these instructions to set it up, as they are easy to clean up too.


Developing text processing data products: part I

Folks, this is going to be a series of informational pieces (more than one blog post on the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take the opportunity to organize the discussion on my blog. In the past, I touched upon some text processing recipes purely from an application point of view; however, I have spent much of my career working with text content in automation and analytics, and I owe it to myself to write more on text processing. Without further ado, let’s jump into the series.

Text processing means extracting information that can be used for decision-making purposes from text, say a book, paper, or article, WITHOUT HAVING TO READ IT. I used to read the newspaper to find the weather forecast (five years ago), but these days when we need information about something, we type it into a search engine to get results (and ads). Now imagine someone who needs information on a day-to-day basis for decision making, and who must read many articles, news stories, or books in a short period of time. Information is power, but timing is important too. Hence the need for text processing data products.

Some basic things that we can do with text data:

Crawling: extracting data/content from a website using a crawler.

Tokenization: the process of splitting a string into smaller pieces (tokens), typically based on the spaces between words.

Stemming: reducing a word to a smaller root form, so the root can be used to search for all of the word’s variations.

Stop word removal: straightforward; removing frequently occurring words like a, an, the.

Parsing: the process of breaking a sentence down into a set of phrases, and further breaking each phrase into nouns, verbs, adjectives, etc. Closely related to part-of-speech tagging. Cool topic.
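The basic steps above can be sketched in a few lines of Python. Note that the stemmer here is a toy that just strips a few common suffixes; a real stemming algorithm such as Porter’s is considerably more careful:

```python
text = "The cats were running and the dog runs"

# Tokenization: split the string into tokens on whitespace.
tokens = text.lower().split()

# Stop word removal: drop frequently occurring words.
stop_words = {"a", "an", "the", "and", "were"}
content = [t for t in tokens if t not in stop_words]

# Toy stemming: strip a few common suffixes to approximate a root form.
def stem(word):
    for suffix in ("ning", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

stems = [stem(t) for t in content]
print(stems)  # "cats" and "runs" both reduce toward their roots
```

With stop words gone and words reduced to stems, “running” and “runs” collapse to the same root, which is exactly what makes stems useful for matching all variations of a word.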

These are just my experiences and my knowledge, and as always I try to write in a way that anyone can read and understand. More later.

Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large scale Data Science applications.

Apache Mahout is the earlier project for distributed machine learning and executes on the MapReduce framework, whereas Apache Spark, a comparatively recent project built on Scala, performs distributed large-scale machine learning and beats MapReduce in computational speed. It also succeeds where MapReduce fails. We even saw how Apache Mahout has been migrating from the MapReduce framework to the Mahout Spark bindings framework for faster data processing.

Let us look at another fast, large scale data processing engine that runs on Hadoop (hence, distributed) and also has possible binding framework with Mahout.

Behold, H2O. H2O, according to its developers, is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is distributed, scalable (of course), and licensed as open source software that can run on many nodes.

H2O can work with HDFS data, natively supports Java and Scala, and can interact with Hadoop ecosystem tools (like Apache Pig, for example). It also works with Apache Spark: Sparkling Water is a package that integrates the fast, scalable H2O machine learning engine with Spark.

Keep watching this space for more information on H2O in this continuation of the large scale data processing series.


Mahout Spark Shell: An overview

As we already know, Mahout is sunsetting its MapReduce algorithm support and moving to more advanced data processing systems that are significantly faster than MapReduce. Today we will look at one of Mahout’s latest systems: the Mahout Scala and Spark bindings package.

If you have had hands-on experience with either R’s command line on Linux or Julia on Linux, you will pick up this new package pretty quickly. Note: Julia is an open source scientific computing and mathematical optimization platform that runs on Linux.

Let’s look at how to set up the Mahout Spark shell on Linux without Hadoop. It is very simple and straightforward if you follow these steps.

Note: always check out the latest Mahout and Spark versions; otherwise you will end up with java.lang.AbstractMethodError (version mismatch).

First, let’s set up Spark:

wget http://path-to-spark/sparkx.x.x.tgz

It looks simple, but be careful about what you select. I would choose the latest version under Spark releases and choose “source code” under package type.

Once downloaded, build it using sbt (sbt/sbt assembly from the Spark directory). This will take close to an hour.

Second, clone Mahout 1.0 from GitHub:

git clone https://github.com/apache/mahout.git

and build Mahout using Maven.

To start the Mahout Spark shell, go to the Spark folder and run the start script under sbin/.

Obtain the Spark master URL from the web UI at http://localhost:8080 if you are on localhost (it will look like spark://localhost:7077).

Create a file and type in the following:

export MAHOUT_HOME=your-path/mahout
export SPARK_HOME=your-path/spark
export MASTER=spark://localhost:7077

Save it and source it with “.”, then go into the Mahout directory and run “bin/mahout spark-shell”.

You should then be greeted by the Mahout Spark shell prompt.


Setting up Apache Spark: Quick look at Hadoop, Scala and SBT

We are going to look at installing Spark on Hadoop. Let’s set up Hadoop YARN here once again from scratch, as I received some comments that my earlier installation guide needed more detail. In this post, we will look at creating a new user account on Ubuntu 14.04 and installing the Hadoop 2.5.x stable version.

To create a new user:

sudo passwd

enter your admin password to set up your root password

sudo adduser <new user-name>

enter the details

now provide root access to the new user:

sudo visudo

add the line

new user-name ALL = (ALL:ALL) ALL

if you want to delete the new user:

sudo deluser <new user-name> (run from an account with sudo privileges, not a guest account)

For java:

Oracle JDK is the official one. To install Oracle Java 8, add the Oracle Java 8 PPA to your package manager repositories and then do an update. Install only after these steps are completed:

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update

sudo apt-get install oracle-java8-installer

Quickest way to set up JAVA_HOME:

sudo update-alternatives --config java

copy the path of Java 8 up to java-8-oracle, for instance /usr/lib/jvm/java-8-oracle, then edit /etc/environment:


sudo nano /etc/environment


export JAVA_HOME="/usr/lib/jvm/java-8-oracle"

source /etc/environment

if you echo $JAVA_HOME, you will see the path.

Setting up passwordless ssh:

Look up my previous post introducing SSH; here we will jump directly into passwordless SSH.

Generate the key pair with ssh-keygen.

Create the ~/.ssh folder on localhost if it does not exist and permanently add the generated public key to ~/.ssh/authorized_keys.


That’s it. You are done.

Install the Hadoop 2.5 stable version:
wget http://your-path-to-hadoop-2.5x-version/hadoop-2.5.1.tar.gz
tar xzvf hadoop-2.5.1.tar.gz

mv hadoop-2.5.1 hadoop

Create the HDFS directories inside the hadoop folder:

mkdir -p data/namenode
mkdir -p data/datanode

You should now have the data/namenode and data/datanode directories inside the hadoop folder.

Go to hadoop-env.sh (it is in etc/hadoop) and update the JAVA_HOME path, HADOOP_OPTS, and HADOOP_COMMON_LIB_NATIVE_DIR.
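For a typical single-node setup, the relevant lines in etc/hadoop/hadoop-env.sh look roughly like this (assuming the Oracle Java 8 path set up earlier; adjust the paths to your own layout):

```shell
# etc/hadoop/hadoop-env.sh (excerpt); adjust paths to your own layout
export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib/native"
```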



Edit core-site.xml and add the following:
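A minimal single-node configuration, assuming HDFS runs on localhost port 9000, would be:

```xml
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>
```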


create a file called “mapred-site.xml” and add the following:
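A minimal mapred-site.xml that tells MapReduce to run on YARN:

```xml
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
```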


Edit hdfs-site.xml and add the following:
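A minimal hdfs-site.xml, assuming the namenode/datanode directories created above and a single-node replication factor of 1 (replace your-user and the paths with your own):

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///home/your-user/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///home/your-user/hadoop/data/datanode</value>
  </property>
</configuration>
```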


Edit yarn-site.xml and add the following:
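A minimal yarn-site.xml enabling the MapReduce shuffle service:

```xml
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
```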


Now, when you run the and scripts under sbin, the HDFS and YARN daemons will start; running jps should list the NameNode, DataNode, ResourceManager, and NodeManager.


Install Spark:
Obtain the latest version of Spark from the Spark downloads page. To interact with the Hadoop Distributed File System (HDFS), you need to use a Spark version that is built against the same version of Hadoop as your cluster, so choose the package type “prebuilt for Hadoop 2.4” and download Spark. Note that Spark 1.1.0 uses Scala 2.10.x, so we need to install Scala.

Let’s install Scala (Spark 1.1.0 needs Scala 2.10.x, so pick a 2.10 release):

tar -xvf scala-2.10.4.tgz
cd scala-2.10.4
Run pwd to get the path.

You will probably want to add these to your .bashrc file or equivalent:

export SCALA_HOME=`pwd`
export PATH=`pwd`/bin:$PATH


We also need something called sbt. sbt stands for “simple build tool”, but to me it seems more complicated than Maven. You can still use Maven to build; however, I would suggest getting acquainted with sbt if you are interested in exploring Scala in general.

More on the next post.

Data Crawling using Jsoup

Data crawling, in simple terms, is extracting data from websites. You need such information to analyze and derive meaningful results. The web is filled with a variety of information, and using it to optimize business decisions is part of a Data Scientist’s work. So let’s dive in:

We are looking at Jsoup, a Java API that we will use to extract information from websites. Today we will use a very simple example to demonstrate how to use it. Jsoup is available from its official website.

The following is a very simple Java program that uses the Jsoup API:

import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class crawl {
    public static void main(String[] args) {
        try {
            // the target URL was left blank in the original example
            Document doc = Jsoup.connect("").get();
            String title = doc.title();
            System.out.println(title);
        } catch (IOException e) {
            System.out.println("Exception occurred");
        }
    }
}


Download the core jar file, which we can use with our Java program. There are two ways to use it:

  • Add the jar to Netbeans (or Eclipse) classpath
  • Directly run from terminal without adding to classpath

Go to Netbeans->Tools->Libraries->Class Library->Add New->Select the jar file and press OK. You can also add it as a dependency in your project.

In the terminal, the following two commands will compile and run the program. If our program is saved as, then the following should be performed.

user@ubuntu:~/$ javac -cp jsoup-1.7.3.jar

user@ubuntu:~/$ java -cp .:jsoup-1.7.3.jar crawl

We will get output in which all the text and the title of the page are crawled. The output is too ugly to post here and needs formatting, but this example shows the functionality.

Data Science 101

The term data science was first used by Peter Naur in 1974. Let’s come back to today’s definition.
The following is how one company advertised to recruit data scientists.

-> experience in big data, hadoop, mapreduce, pig, hive, sqoop, etc…
-> experience of machine learning using Knime, Weka, RapidMiner, Pentaho, Scipy/Numpy algorithms, data structures, statistics, mathematics, optimization modeling
-> experience in Data visualization of all forms
-> experience in SQL, MS-SQL, MySQL, PostgreSQL, Oracle SQL, HBase, cassandra, mongodb, couchdb and other nosql technologies
-> proficient in Fortran, C, C++, C#, Java, J2EE, Python, Ruby, Perl, R, SAS, SPSS, Minitab, AMPL, LINGO, OPL, CPLEX, MOSEK, XPRESS, Matlab, Octave, in Eclipse, Netbeans, Maven, Ant, etc…
-> business intelligence tools, OLAP, IBM Cognos, Lavastorm, etc…
-> proficient in parallel and distributed computing
-> from IITs/MS from US
-> blah, blah, blah . . .

The one candidate I can find who can match all these requirements is . . .

wait for it . . . Google!


Of course Google is not human, but it has knowledge of all the above plus many other things. I’d like to get into the minds of HR professionals and ask them if they truly found the candidate for this job.

Going by what is on the internet, a Data Scientist is a blend of Economist, Statistician, Operational Researcher, Mathematician, Software Developer, System Administrator, Computer Scientist, and Project Manager. Twenty years of grad school and half a million US dollars of debt would make an ideal data scientist, if we go by the internet.

So who is a Data Scientist? What is Data Science?

A Data Scientist can discover what we don’t know from data, obtain predictive, actionable insight from data, develop optimal solutions that are directly aligned to minimizing costs and maximizing profits, and communicate relevant business stories from data.

The following are some things a Data Scientist should be familiar with:

  • Ask questions and create and test hypotheses using data
  • Data Munging, Data Wrestling, Data Taming
  • Tell the machine to learn from data
  • Use modeling to extract optimal insights that support business decisions

Data Munging is the process of manually converting or mapping data from one “raw” form into another format that allows more convenient consumption of the data, with the help of semi-automated tools.
Data Wrestling is the ability to turn big data into data products that generate immediate business value.
Data Taming is finding opportunities in huge data (most likely from the web).
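As a toy illustration of munging, here is a Python sketch that converts hypothetical raw lines into a structured form (the data and field names are invented for this example):

```python
# Raw, messy input: one "name,age,city" record per line, with stray spaces
# and inconsistent capitalization (invented data for illustration).
raw = [
    " Alice , 34 , new york ",
    "BOB,29, Boston",
]

def munge(line):
    # Split on commas, trim whitespace, and normalize each field.
    name, age, city = [part.strip() for part in line.split(",")]
    return {"name": name.title(), "age": int(age), "city": city.title()}

records = [munge(line) for line in raw]
print(records)
```

The point is not the specific cleanups but the pattern: raw text in, consistently typed records out, ready for convenient consumption downstream.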

In order to become a Data Scientist, you need to learn to understand the business better. Of course, how the results are achieved (with what software) is left entirely to the Data Scientist; it cannot be put in as a requirement.

All the successful entrepreneurs of our 21st century are Data Scientists!

Apache Hadoop — an efficient distributed computing framework

Apache Hadoop is an open source distributed computing framework mainly used for ETL tasks (read: Extract, Transform and Load): storing and processing data at large scale on clusters of commodity hardware.

I agree distributed computing has existed for more than a decade, but why has Hadoop gained so much publicity?

  • The entire distributed computing setup can be done on commodity hardware, which means you can set up Hadoop on the computer you are viewing this blog on (I stand corrected if you are using your phone)
  • Open source and, most important, Apache licensed. That means you are welcome to use it in industry, academia, or research, and there is no clause requiring permission from the developers (I am using layman’s terms here)
  • The need for processing large amount of data has arrived and traditional computing may not be enough
  • HBase’s NoSQL architecture supersedes Oracle and MySQL in features (explained later in another blog)
  • Built and Supported by Nerds across the globe (just like you!)

Sometimes I wonder if world peace can only be achieved with an Apache License-like philosophy! 😀

Here’s a question: what is virtually one complete file system but stored physically across different nodes in a cluster?

Distributed File System.

In our context, the Hadoop Distributed File System. It provides high throughput, and it was born from the Google File System and Google MapReduce.

Apache Hadoop is a distributed computing framework with HDFS at its base that uses the MapReduce paradigm to perform large scale processing.

MapReduce is nothing but splitting a giant task into many small chunks, sending them to all the nodes in the cluster, doing the necessary processing, and assembling the processed pieces back into one result. All of this happens inside HDFS in the background. For a technical definition of MapReduce, refer to the books. Map means split, Reduce means combine, that’s all. Of course it is not that simple, but try to look at it from this perspective; this explanation paints a picture of what typically happens in the background.
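To make the split/combine idea concrete, here is a tiny single-machine sketch of the classic MapReduce word count in Python. Real Hadoop distributes these phases across nodes; this just mimics the phases in one process:

```python
from collections import defaultdict

# Each "chunk" stands in for a split of the input sent to one node.
chunks = ["hadoop stores data", "spark and hadoop process data"]

# Map phase: each chunk independently emits (word, 1) pairs.
mapped = []
for chunk in chunks:
    for word in chunk.split():
        mapped.append((word, 1))

# Shuffle phase: group the emitted pairs by key (the word).
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: combine each group into a single count.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts["hadoop"], counts["data"])  # words seen in both chunks
```

Map splits the work, the shuffle routes matching keys together, and Reduce combines; that is the whole paradigm in miniature.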

Now, this does not mean we need to operate Hadoop only at the terminal level. There are tools that can be used within the Hadoop framework:

HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper: A high-performance coordination service for distributed applications.

We will look at each of the above in detail with a use case.

Where are we using Hadoop?

Currently, as far as I know, the Hadoop framework is used for large-scale machine learning and data mining, web crawling and text processing, processing huge amounts of relational/tabular data, and soft analytics.