Big Data

Python Driver, PyMongo and other MongoDB Operations

MongoDB has a mongo shell and drivers to connect to Java, Python, etc…

To start using MongoDB,

sudo service mongod start

For importing and exporting data, use commands at terminal. For instance:

mongoimport \
--db <name of database> \
--collection <name of collection> \ # collection is equivalent to table
--file <input file>

To create and/or use a database, simply

use <name of database>

To create and/or use a collection, simply

use <name of collection>

after database is created

To list down databases present,

show dbs

To list down collections under database,

show db.collection

A basic query would be like:


db.<collection_name>.find().pretty() # to display in JSON format

More querying with specific requests:


db.twitter1.find_one() # to list just one query

To extract just one key value in JSON,


db.twitter1.find({},{text:1,_id:0}) # to remove id

db.twitter1.find({},{text:1,_id:0}).count() # to get count

To access subfields of JSON documents in our query,


The Python driver I worked with is pretty cool. Just do a pip install for pymongo — the python MongoDB driver and you could either work with Python CLE or write a script on it.


Some thoughts on MongoDB

Lets talk about MongoDB. Well, it is a NoSQL database. Controversial topic these days. I should talk with a disclaimer.

NoSQL databases cannot and should not be compared with SQL database. It is like comparing Apples and Oranges. Imagine if you have a social media website, where you have data about users (members) — profile description, messaging history, pictures, videos, user generated content (status updates, etc…) .

In a pure SQL environment, you may use different databases and tables to store various types of data and perform an inner join or outer join depending upon what you need — seperate tables for user information, status updates, different method of storing pictures and videos. In the context of NoSQL, you may be able to horizontally scale the database in order to store everything under one roof. Imagine, each row (or each document) would represent record for one member that has fields for user information, status updates, images, blah blah blah… so the developer/data scientist does not have to do any more joins. Due to large generation of user and machine generated data, this may be a good option (as it can dynamically add schema – adding more fields in a record).

MongoDB is a document storage model based NoSQL, stores records in key:value pairs. Under MongoDB database, a table is a collection, row is called a document, column is a called a field. I am using MongoDB to store a bunch of JSON documents. 

You may follow these instructions to setup as they are easy to clean up too.


Some notes on hadoop cluster

One way Passwordless SSH from Master to worker nodes:

1. Generate Key: ssh-keygen -t rsa
2. Create folder in worker node: ssh [user]@ mkdir -p .ssh
3. Copy key to worker node: ssh-copy-id [user]@
4. Enable permissions to worker: ssh [user]@ "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

700 — user can read, write and execute. group and others have no permissions
640 — user can read and write. group can read. others have no permissions

Configuration Files: (For minimal configuration)

core-site.xml configuration should be the same for master and all worker nodes. Namenode URL ( should point to master node only.
mapred-site.xml should be edited only on master node.
yarn-site.xml should be the same configuration on master and worker nodes.
the file “slaves” should be updated only in the master node.
hdfs-site.xml is self explanatory

Disable IPV6:

IPV4 – Internet Protocol Version 4 — IP address follows this pattern:

IPV6 – Internet Protocol Version 6 — IP address follows this pattern:

Hadoop works/communicates on IPV4 within its cluster. It does not support IPV6 at the moment.

Setting up Apache Spark: Part II

Now that we have Hadoop YARN, Scala and other pre-requistes setup, we are now ready to install apache spark. If you are visiting this blog post for the first time, please do have a look at the earlier post Setting up Apache Spark: quick look at the Hadoop, Scala and SBT

You definitely need maven, sbt, scala, java, ssh and git (though git is optional). Have git incase you want to fork — didnt work out for me. I am not using my multi node cluster workspace to perform this demonstration so not much of screenshots as of now. However, I am sure these instructions will work fine.

To begin, lets download the appropriate spark binaries (select the one that corresponds to your hadoop installation from the spark site):


Assuming we have sbt, setup without any problems,

sbt/sbt package

followed by

sbt/sbt clean compile

Now, lets build spark using maven. Allocate desired amount of physical memory based on the capacity of the machine that you are working with and enter the following:

export MAVEN_OPTS="-Xmx1300M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha -Dyarn.version=2.0.5-alpha -DskipTests clean package

now, it may not be necessary to do a clean but I suggest

sbt clean

followed by

sbt/sbt assembly

This takes time. Once this is over,


and finally,


Voila! the following screen will show up with scala command line. And now folks, lets learn Scala!


Hadoop 2: YARN — Overview and Setup

We have already seen MapReduce. Now, lets dig deep into the new data processing model of Hadoop. Hadoop comes with a advanced resource management tool called YARN, and packaged as Hadoop-2.x.

What is YARN?

YARN stands for Yet Another Resource Navigator. YARN is also known as Hadoop Data Operating System — YARN enables data processing models beyond traditional mapreduce, such as Storm (real time streaming), Solr (searching) and interactive systems (Apache Tez).


Hadoop committers decided to split up resource management and job scheduling into separate daemons to overcome some deficiencies in the original MapReduce. In traditional MR terms, this is basically splitting up jobtracker to provide flexibility and improve performance.

YARN splits the functionality of a JobTracker into two separate daemons:

A global Resource Manager (RM) that consists of a Scheduler and an Applications Manager.
An Application Master (AM) that provides support for a specific application. It runs on every node and controls execution of applications. (Here application could mean a single MR job or a Directed Acyclic Graph (DAG) of jobs).


YARN, also referred to as MapReduce 2 is not an improvment over traditional MapReduce data processing model. It merely provides a resource management model that executes MapReduce jobs.

How to deploy Hadoop-2.x YARN?

We will look at how to deploy YARN in a single node cluster setup. The following are system requirements:

Java 6 or above

Download Hadoop-2.2.0:

wget http://your download link here/hadoop-2.2.0.tar.gz

unpack the tarball:

tar -zxvf hadoop-2.2.0.tar.gz

Create HDFS directory inside Hadoop-2.2.0 folder:

mkdir -p /data/namenode
mkdir -p /data/datanode

Rename hadoop-2.2.0 to hadoop but I wouldn’t suggest that.

Go to Hadoop-2.2.0/etc/hadoop/ and copy/paste the following:

export JAVA_HOME= < your java path here >
<your path for lib folder in Hadoop-2.2.0> type pwd from lib and paste it below.
export HADOOP_COMMON_LIB_NATIVE_DIR="/home/hduser/hadoop-2.2.0/lib"
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=/home/hduser/hadoop-2.2.0/lib"

Similar to Hadoop 1 (traditional hadoop), make changes in core-site, mapred-site, yarn-site, hdfs-site:

In core-site.xml, copy/paste:


In hdfs-site.xml, copy/paste:

<name>dfs.replication</name> <value>1</value>

<value> <path to your hadoop> /hadoop/data/namenode</value>

<value> <path to your hadoop> /hadoop/data/datanode</value>

In mapred-site.xml, copy/paste: (create mapred-site.xml by vim mapred-site.xml , if file is not available)


In yarn-site.xml, copy/paste:



Either update your bashrc file or create and paste the following:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk <your java path here>
export HADOOP_HOME=/home/hduser/hadoop-2.2.0 <your hadoop path here>
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

To run,

source ~/.bashrc or .

Before you run YARN, go to Hadoop-2.2.0 folder and type to format the namenode

bin/hdfs namenode -format

To start Hadoop:


You can check whether everthing went well by checking jps. Ideally you should have the following:

8144 ResourceManager
7722 DataNode
7960 SecondaryNameNode
8333 NodeManager
9636 Jps
7539 NameNode

Note: The PID might be different but the processes are the same.

Access the UI for:

Try some examples in your new YARN:

bin/hadoop jar /home/hduser/hadoop-2.2.0/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi 4 15 

Graph Database Neo4j: Part I

What are graph databases? Liitle refresher on Graph Theory (Refer Wiley book on graphs)

A Graph Database stores data in a Graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. The records in a graph database are called Nodes. Nodes are connected through typed, directed arcs, called Relationships. Each node and relationship can have named attributes referred to as Properties. A Label is a name that organizes nodes into groups.

In Neo4j, a node can be simply represented as (a) or (), for an empty node. It is not mandatory that nodes should be connected. The real advantage is to build relationships between nodes, (arcs in graph theory).

A relationship can be represented as (a)-->(b)/(a)-->()

We can query the graph database to find out matching of patterns by defining relationships. for example, if we want to know with whom all node ‘x’ has relationships with, then we can query (x)-[r]->()

If node ‘x’ has many types of relationships like worked in this company, acted in the movie, etc… we can retreive that in our query by asking, (a)-[:WORKED_AT]->(c)

We can also query with labels. Imagine Label is like a node’s property. If multiple nodes have same property, then we can group them. For example, if we have many nodes with labels like man, woman, American, British, then we can group them as American women, British women, etc…

For example,

(a:man {citizen:"American"}) list of American men in our database

(a:woman {citizen:"British"})->[WORKED_AT]->(c:ORC Solutions)

list of British women worked at ORC Solutions Pvt Ltd (just an example)

So What is Neo4j really?

Neo4j is a one of the NoSQL databases whose data model is a Graph, specifically a Property Graph. Cypher is Neo4j’s graph query language (SQL for graphs or GQL). It is a declarative query language where we can describe what we are interested in, not how it is acquired. Supposedly very readable and expressive.

NoSQL Architecture: Part III

There are different categories of NoSQL databases. These categories are based on how the data is stored. NoSQL products are optimized for insertion and retrival operations — because they usually happens in large scale and to calibrate performance most of the NoSQL products follow a horizontal structure. (as far as I know)

There are four major storage types available in NoSQL paradigm:
1. Column-oriented
2. Document Store
3. Key Value Store
4. Graph

Column-oriented: Data stored as columns. Typical RDBMS stores data as rows. You may argue that relational database displays data in two dimensional table with rows and columns. The difference here is when you query a RDBMS it will process one row at a time where as a column oriented database will have the data stored as columns. An example would enlighten the concept here:

Imagine following data needs to be stored. Lets compare RDBMS and NoSQL here:

Original Data

StudentID StudentName Mark1 Mark2 Mark3
12001 Bruce Wayne 55 65 75
12002 Peter Parker 66 77 88
12003 Charles Xavier 44 33 22

Data in RDMBS will be stored in the following way:
12001,Bruce Wayne,55,65,75
12002,Peter Parker,66,77,88
12003,Charles Xavier,44,33,22

Data in NoSQL will be stored in the following way:
Bruce Wayne, Peter Parker, Charles Xavier

Note: This is a very trivial, simple example just to demonstrate a point. One cannot take this in face value and argue insertion will be much difficult in NoSQL. Whether it is RDBMS or NoSQL they are more sophisticated and their systems are optimized enough to handle data for processing. We are just looking things at a higher level.

The advantage of column based approach is that it is computationally faster than RDBMS. Imagine if you would like to find out average, maximum or minimum of a given subject, you dont have to go through each and every row. Instead, just look at that respective column to determine the value. Also when you query the database, it does not have to scan each row for matching conditions; whichever the column is conditioned to retrive data, only those will be touched and voila, faster processing. You have to read these assuming you have a billion records in your database and need to query all of them at once just to retrieve few hundreds of it.

Examples are HBase, Cassandra, Google Big Table, etc… Oracle also has this feature introduced quite recently.

Document Store: In the previous category, we looked at structured data, students’ records to be precise. Now we are looking at how to store somewhat structure/semi-structure data. When we use Facebook API to extract posts from the given group, we would get that in JSON format. Like this:

{ "technology": "big data", "message" ":"Way of Internet of Things to Internet of Everything","permalink":"","actor_id":130571316485}
{ "technology": "big data", "message" ":"Challenges of Big Data Analysis","permalink":"","actor_id":130571316485}
{ "technology": "big data", "message" ":"Big Data'nin hayatimiza etkisi","permalink":"","actor_id":130571316485}
{ "technology": "big data", "message" ":"Etkinligimiz hazir!! Peki Ya Siz??","permalink":"","actor_id":130571316485}
{ "technology": "big data", "message" ":"30 Nisan'da 'girisimci melekler' Istanbul'u kusatiyor. Siz nerdesiniz?","permalink":"","actor_id":130571316485}

Or even imagine something like this:

{ "StudentID": "12001", "StudentName":"Bruce Wayne", "Location" : "Gotham city" }

Another record is like this:

{ "StudentID": "12002", "StudentName":"James Tiberius Kirk", "Address" :
{"Street": "2nd street","City": "Edge city", "State": "New York"} }

Imagine where records/documents does not follow even/constant schema. RDBMS cannot process this. We need something better where such semi-structured data can be indexed and queried. The document store category of NoSQL. (About the indexing part — the document ID could be the URL from where these data are crawled, or the timestamp when the data was crawled. It is even okay if these records are without document ID)

These category of database with room for changing schema or schemaless documents would provide flexibility and hence the popularity. Ideal situation would be any web-based application where the content would have varying schema. Or in cases where the data is available in JSON or XML format.

Examples are Redis (In-memory), MongoDB, CouchDB, Lotus Notes, etc…

More on the remaining categories in future posts.

NoSQL Architecture: Part II

NoSQL can be simply understood as data storage that is not just SQL but a bit more than that. We know that relational databases have been used to store the structured data. The data is sub-divided into groups, called tables which store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as column while each unit of the group is known as row.

Why are we talking about NoSQL?

We know that traditional RDBMS has limitations in terms of scalability and parallelization. Companies like eBay, Google, Yahoo would usually get billions of requests each day/week. For example, if you are searching on google search engine, using gmail services or gtalk, or gsomething, usually accessing these services from google would be a result of multiple systems but we cannot see with our eyes how many systems are being used to process our requests. Traditional RDBMS cannot be used to process these requests and they need something more robust, parallelized and scalable. So they came up with
1. GFS: Distributed filesystem (google file system)
2. Chubby: Distributed coordination system
3. MapReduce: Parallel processing system
4. Big Table: Column oriented database

Based on the idea of the above products, many NoSQL products were born like HBase, MongoDB, etc…

Traditional databases have limitations which are perceived as features like a transaction cannot leave database in an inconsistent state, one transaction cannot interfere with another, etc… These qualities would seem ideal in context of structured data but if we are talking about web-scale then performance will be compromised. Imagine, if I am looking at a book on eBay/Amazon and finalized that I am buying the book and proceeding to payment, it will lock a part of the database, specifically the inventory, and every other person in the world will have to wait to even access the book until I complete my transaction (this cannot be possible but the point is locking one web page for a secure transaction). This would be very counterproductive and hence, NoSQL gained momentum.

Why NoSQL at all?
1. Almost all NoSQL products offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure.
2. Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of milliseconds—especially over mobile/hand held devices and other intermittently connected devices—you have much higher probability of winning users over.
3. NoSQL is an elastic product and not brittle. It can handle sudden increase of load and it is also scalable — imagine billions of rows and columns and no need to delete anything as NoSQL can create timestamps.
4. Open source, Apache Licensed and online support from experienced users, developers, committers of the project and articles from bloggers like me!

More on NoSQL in future posts.

HBase and NoSQL Architecture: Part I

HBase is the NoSQL database for Hadoop. It is a distributed database. NoSQL database could be understood as a database that is not relational in nature but supports SQL as its primary access language. HBase is ideal if you have huge amounts of data. HBase supports massively parallelized processing via MapReduce. HBase is designed to handle billions of rows and columns — big data.

Hbase is quite hard to understand especially if you have used RDBMS earlier. HBase is actually from Google’s BigTable. According to the google research paper, a bigtable is a sparse, distributed, persistent multidimensional sorted map and the map is indexed by a row key, column key, and a time-stamp. A map is simply an abstract data type with collection of keys and values, where each key is associated with one value. One of its many features is time-stamping/version. Hbase will not perform soft delete. Instead it will keep versions of cell values. every cell value maintains its own time-stamps. Default threshold for time-stamps is 10.

Lets look at how Hbase can be installed in Hadoop. Hadoop should be live with all daemon processes running.

1. Download the tarball of latest Hbase
$ wget http://hbase-latest-version.tar.gz

2. Extract the tarball
$ tar -zxvf hbase-latest-version.tar.gz

3. Go to Hbase-latest-version and cd into conf
$ cd hbase-latest-version/conf/

4. Replace the hbase-site.xml with the following information
<!-- This is the location in HDFS that HBase will use to store its files -->

5. Update the regionservers file with ‘localhost’
$ vi regionservers --> localhost --> :wq

6. Exit the conf directory and do bin/
$ bin/
You should have four additional process running: HMaster, Zookeeper, HRegionserver, Hquorampeer

7. Lanuch hbase
bin/hbase shell

you will get the command line like this,


It is actually very difficult to understand Hbase in one go because of its NoSQL architecture and lets look at it seperately in detail.

Apache Hive: to query structured data from HDFS

If we want to talk about Hive, we need to understand the difference between Pig and Hive as people can easily confuse or ask why do we need this when we have that questions.

The Apache Hive software provides a SQL like querying dialect called Hive Query Language that can be used to query data that is stored in Hadoop cluster. Once again Hive eliminates the need for writing mappers and reducers and we can use the HQL to query the language. I would not go deep into why do we need to query. Look up SQL tutorials in search engine.

The importance of Hive can only be understood when we have the right kind of data for Hive to process: static data, data not changing, quick response time is not the priority. Hive is only a software on top of Hadoop framework. It is not database package. We are querying the data from HDFS and we will use Hive where the need for SQL like querying arises.

HCatalog stores the metadata from HQL. Just popped up in my mind as I write this post. We’ll look into it later. Hive is ideal for data warehousing applications, where data is just stored and mined when necessary for report generation, visualization, etc …

Following is the Apache definition of Hive:

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

Pig is a procedural data flow languages — the pig latin script, (think of python or perl). Hive is like a SQL querying language. Just like what they say, Pig can eat anything, which means Pig can take structured and unstructured data. Hive on the other hand can only process structured data. Data representation in Pig is through variables, whereas in Hive is through tables. Hive also supports UDF but it is more complex than Pig.


Very similar to Pig

1. Download Apache Hive
$ wget

2. Extract the Tarball

$ tar –zxvf hive-latest-version.tar.gz

3. Add the following to your previously created hadoop configuration bash script:

export HIVE_HOME=/data/hive-latest-version
export JAVA_HOME=/usr/java/jdk1.7.0_05
export HADOOP_HOME=/data/hadoop

export CLASSPATH=$JAVA_HOME:/data/hadoop/hadoop-core-1.0.1.jar:$HIVE_HOME/hive-latest-version.jar

4. Run the configuration file

$ .

5. Now launch Hive,

$ hive

Hadoop should be live.