Data Storage Science

NoSQL and graph databases: products in which huge amounts of varied data can be stored.

Python Driver, PyMongo and other MongoDB Operations

MongoDB has a mongo shell and drivers for Java, Python, and other languages.

To start using MongoDB,

sudo service mongod start

For importing and exporting data, use the command-line tools. For instance:

mongoimport \
--db <name of database> \
--collection <name of collection> \ # collection is equivalent to table
--file <input file>

To create and/or use a database, simply

use <name of database>

To create a collection (once the database is selected), either insert into it, which creates it implicitly, or create it explicitly:

db.createCollection("<name of collection>")

To list the databases present,

show dbs

To list the collections under the current database,

show collections

A basic query would be like:

db.<collection_name>.find()

db.<collection_name>.find().pretty() # to display formatted JSON

More querying with specific requests:

db.twitter1.find({"retweeted":false}).pretty()

db.twitter1.findOne() # to return just one document (it is find_one() in PyMongo)

To project just one field of the JSON documents,

db.twitter1.find({},{text:1})

db.twitter1.find({},{text:1,_id:0}) # to remove id

db.twitter1.find({},{text:1,_id:0}).count() # to get count

To access subfields of JSON documents in our query,

db.twitter1.find({},{"user.location":1,text:1,_id:0})

The Python driver I worked with is pretty cool. Just do a pip install for pymongo, the Python MongoDB driver, and you can either work in the Python REPL or write a script.

 


Some thoughts on MongoDB

Let's talk about MongoDB. Well, it is a NoSQL database, a controversial topic these days, so I should start with a disclaimer.

NoSQL databases cannot and should not be compared with SQL databases. It is like comparing apples and oranges. Imagine you have a social media website, where you have data about users (members): profile descriptions, messaging history, pictures, videos, and user-generated content (status updates, etc.).

In a pure SQL environment, you may use different databases and tables to store the various types of data and perform inner or outer joins depending on what you need: separate tables for user information and status updates, and a different method of storing pictures and videos. In the NoSQL context, you may be able to scale the database horizontally and store everything under one roof. Imagine that each row (or each document) represents the record for one member, with fields for user information, status updates, images, and so on, so the developer/data scientist does not have to do any more joins. Given the large volume of user- and machine-generated data, this can be a good option (as the schema can grow dynamically by adding more fields to a record).

MongoDB is a document-store NoSQL database that stores records as key:value pairs. In a MongoDB database, a table is called a collection, a row is called a document, and a column is called a field. I am using MongoDB to store a bunch of JSON documents.

You may follow these instructions to set up, as they are easy to clean up too.

 

Graph Database Neo4j: Part II

Why use Graph Database at all?

How do I know the person XXX? How do I get to YY from XX? Where should I have lunch? Where should I shop? These everyday questions can be answered using a graph database like Neo4j. Powerful recommender engines are behind Facebook and Google, recommending items so perfectly. To build such recommender engines or social routing, we can use Neo4j. To highlight some key domains:

1. Logistics industry: where short-term planning problems are solved frequently (using Dijkstra's algorithm, the Travelling Salesman Problem; read more on network theory)
2. Energy industry: where electric circuits are designed and optimized to minimize wastage (read more on Kirchhoff's laws)
3. Financial engineering: where Markov chains and Bayesian networks are utilized to solve problems (really, any domain that uses stochastic modeling techniques)
4. Social Network analysis: understanding local and global patterns of social networks and examine network dynamics of individuals who share similar interests.
5. Machine learning: building a recommender engine by collecting web analytics data and finding patterns across products, user preferences, services, etc.

Data Model Architecture of Neo4j:

Neo4j’s data model is a property graph.

A property graph consists of labeled nodes and relationships, each with properties. Nodes, as we saw earlier, are just data records; they usually carry a label and also contain their relationships to other nodes. Relationships connect two nodes. They too are explicit data records in the graph database. Think of them as containing shared information between two entities (direction, type, properties) and representing the connection to both nodes. Properties are simple key:value pairs. There is no schema, just structure.
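The property-graph model can be mimicked with plain data structures. A toy sketch, where all node IDs, labels, and property values are made up for illustration:

```python
# Nodes: data records, each with a label and key:value properties
nodes = {
    1: {"label": "Person", "props": {"name": "Alice"}},
    2: {"label": "Company", "props": {"name": "Acme"}},
}

# Relationships: explicit records connecting two nodes, carrying
# a direction (start -> end), a type, and their own properties
rels = [
    {"start": 1, "end": 2, "type": "WORKED_AT", "props": {"since": 2010}},
]

def neighbours(node_id, rel_type):
    """Follow outgoing relationships of a given type from a node."""
    return [r["end"] for r in rels
            if r["start"] == node_id and r["type"] == rel_type]
```

Note there is no schema anywhere above, just structure: any node can carry any properties.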

So how are we supposed to use graph databases?

We can query graph databases using GQL, the graph query language. Cypher is a declarative, SQL-inspired language for describing patterns in graphs. It allows us to describe what we want to select, insert, update, or delete from a graph database without specifying how to do it. For example, a query to read the relationship between two nodes looks like:

MATCH (node1)-[relationship:TYPE]->(node2)
RETURN relationship.property

Installing Neo4j on your Linux machine: installing on Windows is self-explanatory; just download the executable file, click 'OK' when asked for permission to install, and follow the on-screen procedure. Let's look at Linux:

I always prefer installing from terminal compared to rpms and make procedures. It is quite simple. Import the vendor’s signing key and create apt sources list file. Do an update and install the neo4j community edition.

wget -O - http://debian.neo4j.org/neotechnology.gpg.key | apt-key add -
echo 'deb http://debian.neo4j.org/repo stable/' > /etc/apt/sources.list.d/neo4j.list
apt-get update
apt-get install neo4j

Caveat: you need Oracle Java 7 installed, not OpenJDK. Neo4j doesn't seem to support OpenJDK.

cd /etc/neo4j/
sudo pico neo4j.properties
Uncomment the following lines:
1. #node_auto_indexing=true
2. #node_keys_indexable=name,age
Save and exit. Now,
sudo /etc/init.d/neo4j-service restart

Graph Database Neo4j: Part I

What are graph databases? A little refresher on graph theory (refer to the Wiley book on graphs):

A Graph Database stores data in a Graph, the most generic of data structures, capable of elegantly representing any kind of data in a highly accessible way. The records in a graph database are called Nodes. Nodes are connected through typed, directed arcs, called Relationships. Each node and relationship can have named attributes referred to as Properties. A Label is a name that organizes nodes into groups.

In Neo4j, a node can be represented simply as (a), or () for an empty node. Nodes do not have to be connected; the real advantage comes from building relationships between nodes (arcs in graph theory).

A relationship can be represented as (a)-->(b), or (a)-->() for an anonymous endpoint.

We can query the graph database for matching patterns by describing relationships. For example, if we want to know which nodes 'x' has relationships with, we can query (x)-[r]->()

If node 'x' has many types of relationships, such as worked at this company, acted in this movie, etc., we can retrieve a specific one in our query by asking, (a)-[:WORKED_AT]->(c)

We can also query with labels. Think of a label as a node's property: if multiple nodes share the same property, we can group them. For example, if we have many nodes with labels like man, woman, American, British, then we can group them as American women, British women, etc.

For example,

(a:man {citizen:"American"}) list of American men in our database

(a:woman {citizen:"British"})-[:WORKED_AT]->(c:`ORC Solutions`)

list of British women worked at ORC Solutions Pvt Ltd (just an example)

So What is Neo4j really?

Neo4j is one of the NoSQL databases whose data model is a graph, specifically a property graph. Cypher is Neo4j's graph query language (SQL for graphs, or GQL). It is a declarative query language in which we describe what we are interested in, not how it is acquired. Supposedly very readable and expressive.

NoSQL Architecture: Part IV

In our last post, we looked at column store and document store categories. In this post we will see the remaining two, key-value store and graph store.

Key-value store: similar to a document store in that no schema is defined; however, unlike a document store, which can create a key when a new document is inserted, a key-value store requires the key to be specified, and the key must be known in order to retrieve the value. Please note that querying is not possible: you will not get SQL or SQL-like support, and there are no relationships between entities.

Confusing?

Think of dictionary data structure in Python.

Key-value stores are mainly implemented as in-memory distributed/cache storage. Hence, the advantage is fairly straightforward: quick retrieval of data. Applications that require real-time responses, perhaps an application controlling an aircraft, where quick response times are crucial, would want to use an in-memory database.
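In Python terms, the whole interface of a key-value store is essentially the dict interface. A sketch (the key naming scheme is made up):

```python
# An in-process "key-value store": exact keys, no query language
store = {}

store["user:12001"] = {"name": "Bruce Wayne"}  # put
value = store.get("user:12001")                # get, by exact key only
store.pop("user:12001", None)                  # delete

# There is no way to ask "which keys have name == 'Bruce Wayne'?"
# without scanning every entry yourself; that is the trade-off.
```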

Examples are Redis, MemcacheDB (built on Memcached), and Berkeley DB.

Graph store: graph databases are a special case of NoSQL. A graph database uses a graph structure to denote the relationships between entities. There can be multiple links between two entities in a graph, representing the multiple relationships that the two nodes share.

The relationships represented may include social relationships between people, transport links between places, or network topologies between connected systems. The advantage of graph databases is easy representation, retrieval, and manipulation of the relationships between the entities in the system. Use them only if there are relationships in the data.

Example of such database is Neo4j

There are some databases available that support several of the categories we have seen so far. One such beauty is OrientDB, which supports document store, key-value store (quick look-ups are possible thanks to its MVRB-tree indexing algorithm), and graph store. It is written completely in Java and hence runs on Windows, Linux, or any operating system with a JVM. It provides SQL support so the user can query it, it needs no dependencies, and it is very light. It has a 'running from local' mode as well. And yes, it is Apache licensed, so free to use!

NoSQL Architecture: Part III

There are different categories of NoSQL databases, based on how the data is stored. NoSQL products are optimized for insertion and retrieval operations, because these usually happen at large scale; to maximize performance, most NoSQL products follow a horizontal structure (as far as I know).

There are four major storage types available in NoSQL paradigm:
1. Column-oriented
2. Document Store
3. Key Value Store
4. Graph

Column-oriented: data stored as columns. A typical RDBMS stores data as rows. You may argue that a relational database displays data in a two-dimensional table with rows and columns. The difference is that when you query an RDBMS it processes one row at a time, whereas a column-oriented database has the data stored as columns. An example will illustrate the concept:

Imagine the following data needs to be stored. Let's compare RDBMS and NoSQL here:

Original Data

StudentID StudentName Mark1 Mark2 Mark3
12001 Bruce Wayne 55 65 75
12002 Peter Parker 66 77 88
12003 Charles Xavier 44 33 22

Data in an RDBMS will be stored in the following way:
12001,Bruce Wayne,55,65,75
12002,Peter Parker,66,77,88
12003,Charles Xavier,44,33,22

Data in NoSQL will be stored in the following way:
12001,12002,12003
Bruce Wayne, Peter Parker, Charles Xavier
55,66,44
65,77,33
75,88,22

Note: this is a very trivial, simple example just to demonstrate a point. One cannot take it at face value and argue that insertion will be much more difficult in NoSQL. Whether RDBMS or NoSQL, real systems are sophisticated and optimized enough to handle data for processing. We are just looking at things at a high level.

The advantage of the column-based approach is that it is computationally faster than an RDBMS. Imagine you would like to find the average, maximum, or minimum mark for a given subject: you don't have to go through each and every row. Instead, just look at the respective column to determine the value. Also, when you query the database, it does not have to scan each row for matching conditions; only the columns referenced by the condition will be touched, and voila, faster processing. Read this assuming you have a billion records in your database and need to query all of them at once just to retrieve a few hundred.
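The speed-up is easy to see in a sketch using the student table above, with plain Python lists standing in for the two storage layouts:

```python
# Row store: one tuple per student, as an RDBMS lays rows out
rows = [
    (12001, "Bruce Wayne", 55, 65, 75),
    (12002, "Peter Parker", 66, 77, 88),
    (12003, "Charles Xavier", 44, 33, 22),
]

# Column store: one list per field
columns = {
    "StudentID": [12001, 12002, 12003],
    "StudentName": ["Bruce Wayne", "Peter Parker", "Charles Xavier"],
    "Mark1": [55, 66, 44],
    "Mark2": [65, 77, 33],
    "Mark3": [75, 88, 22],
}

# Average of Mark1 in the row store: touch every row, pick out field 2
avg_row = sum(r[2] for r in rows) / len(rows)

# Average of Mark1 in the column store: touch a single column
avg_col = sum(columns["Mark1"]) / len(columns["Mark1"])
```

Both give 55.0, but the column store never reads the names or the other marks.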

Examples are HBase, Cassandra, Google BigTable, etc. Oracle also introduced this feature quite recently.

Document store: in the previous category, we looked at structured data, students' records to be precise. Now we look at how to store semi-structured data. When we use the Facebook API to extract posts from a given group, we get them in JSON format. Like this:

{ "technology": "big data", "message":"Way of Internet of Things to Internet of Everything","permalink":"http://www.facebook.com/comsoc/posts/10151526753216486","actor_id":130571316485}
{ "technology": "big data", "message":"Challenges of Big Data Analysis","permalink":"http://www.facebook.com/comsoc/posts/10151494314921486","actor_id":130571316485}
{ "technology": "big data", "message":"Big Data'nin hayatimiza etkisi","permalink":"http://www.facebook.com/comsoc/posts/10151490942041486","actor_id":130571316485}
{ "technology": "big data", "message":"Etkinligimiz hazir!! Peki Ya Siz??","permalink":"http://www.facebook.com/comsoc/posts/10151325074526486","actor_id":130571316485}
{ "technology": "big data", "message":"30 Nisan'da 'girisimci melekler' Istanbul'u kusatiyor. Siz nerdesiniz?","permalink":"http://www.facebook.com/comsoc/posts/10151318889096486","actor_id":130571316485}

Or even imagine something like this:

{ "StudentID": "12001", "StudentName":"Bruce Wayne", "Location" : "Gotham city" }

Another record is like this:

{ "StudentID": "12002", "StudentName":"James Tiberius Kirk", "Address" :
{"Street": "2nd street","City": "Edge city", "State": "New York"} }

Imagine records/documents that do not follow a constant schema. An RDBMS cannot process this. We need something better, where such semi-structured data can be indexed and queried: the document store category of NoSQL. (About the indexing part: the document ID could be the URL from which the data was crawled, or the timestamp of when it was crawled. It is even okay if records come without a document ID.)

This category of database, with room for a changing schema or schemaless documents, provides flexibility, hence its popularity. The ideal situation would be any web-based application whose content has a varying schema, or cases where the data is available in JSON or XML format.
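A sketch of why varying schemas are no problem for a document store, using the two student documents above; the find helper is hypothetical, just to mimic querying on whatever fields happen to exist:

```python
# Two documents with different schemas, as in the examples above
records = [
    {"StudentID": "12001", "StudentName": "Bruce Wayne",
     "Location": "Gotham city"},
    {"StudentID": "12002", "StudentName": "James Tiberius Kirk",
     "Address": {"Street": "2nd street", "City": "Edge city",
                 "State": "New York"}},
]

def find(records, field, value):
    """Match on a field; documents lacking the field simply don't match."""
    return [r for r in records if r.get(field) == value]

gotham = find(records, "Location", "Gotham city")  # only the first matches
```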

Examples are Redis (In-memory), MongoDB, CouchDB, Lotus Notes, etc…

More on the remaining categories in future posts.

NoSQL Architecture: Part II

NoSQL can be simply understood as data storage that is not just SQL but a bit more than that. We know that relational databases have been used to store structured data. The data is subdivided into groups, called tables, which store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as a column, while each unit of the group is known as a row.

Why are we talking about NoSQL?

We know that traditional RDBMS has limitations in terms of scalability and parallelization. Companies like eBay, Google, and Yahoo get billions of requests each day or week. For example, when you use the Google search engine, Gmail, Gtalk, or g-something, each request is usually served by multiple systems, though we cannot see how many systems process it. A traditional RDBMS cannot be used to serve these requests; something more robust, parallelized, and scalable is needed. So Google came up with
1. GFS: Distributed filesystem (google file system)
2. Chubby: Distributed coordination system
3. MapReduce: Parallel processing system
4. Big Table: Column oriented database

Based on the idea of the above products, many NoSQL products were born like HBase, MongoDB, etc…

Traditional databases have limitations that are perceived as features, such as: a transaction cannot leave the database in an inconsistent state, one transaction cannot interfere with another, etc. These qualities seem ideal in the context of structured data, but at web scale they compromise performance. Imagine I am looking at a book on eBay/Amazon, decide to buy it, and proceed to payment: a strict transactional design would lock part of the database, specifically the inventory, and every other person in the world would have to wait to even access the book until I complete my transaction (it does not actually work this way, but the point is the cost of locking for a secure transaction). This would be very counterproductive, and hence NoSQL gained momentum.

Why NoSQL at all?
1. Almost all NoSQL products offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure.
2. Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of milliseconds—especially over mobile/hand held devices and other intermittently connected devices—you have much higher probability of winning users over.
3. NoSQL is elastic, not brittle. It can handle sudden increases in load, and it is scalable: imagine billions of rows and columns, with no need to delete anything, since NoSQL can keep timestamped versions.
4. Open source, Apache Licensed and online support from experienced users, developers, committers of the project and articles from bloggers like me!

More on NoSQL in future posts.

HBase and NoSQL Architecture: Part I

HBase is the NoSQL database for Hadoop. It is a distributed database. A NoSQL database can be understood as a database that is not relational in nature and does not use SQL as its primary access language. HBase is ideal if you have huge amounts of data. HBase supports massively parallelized processing via MapReduce and is designed to handle billions of rows and columns: big data.

HBase is quite hard to understand, especially if you have used an RDBMS before. HBase is modeled on Google's BigTable. According to the Google research paper, a Bigtable is a sparse, distributed, persistent, multidimensional sorted map, and the map is indexed by a row key, column key, and timestamp. A map is simply an abstract data type holding a collection of keys and values, where each key is associated with one value. One of its many features is timestamping/versioning. HBase does not perform soft deletes; instead it keeps versions of cell values, and every cell value maintains its own timestamps. The number of versions kept per cell is a configurable threshold.
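The "multidimensional sorted map with versioned cells" idea can be sketched in a few lines of Python. MAX_VERSIONS and the helper names are made up, and the real storage engine is far more sophisticated, but the shape of the model is this:

```python
# Toy BigTable/HBase data model: a map indexed by (row key, column key)
# whose cells keep timestamped versions instead of overwriting.
MAX_VERSIONS = 3  # configurable version threshold

table = {}  # (row_key, column_key) -> list of (timestamp, value)

def put(row, col, ts, value):
    versions = table.setdefault((row, col), [])
    versions.append((ts, value))
    versions.sort(reverse=True)   # newest timestamp first
    del versions[MAX_VERSIONS:]   # drop versions beyond the threshold

def get(row, col):
    """Return the newest value, as a default read does."""
    versions = table.get((row, col))
    return versions[0][1] if versions else None
```

Writing a cell four times leaves the three newest versions on disk, and a plain read returns the latest one; there is no soft delete of the older values until they age past the threshold.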

Let's look at how HBase can be installed on Hadoop. Hadoop should be live, with all daemon processes running.

1. Download the tarball of latest Hbase
$ wget http://hbase-latest-version.tar.gz

2. Extract the tarball
$ tar -zxvf hbase-latest-version.tar.gz

3. Go to Hbase-latest-version and cd into conf
$ cd hbase-latest-version/conf/

4. Replace the hbase-site.xml with the following information
<configuration>
<property>
<!-- This is the location in HDFS that HBase will use to store its files -->
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
</property>
<property>
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
<name>hbase.zookeeper.property.dataDir</name>
<value>/data/zookeeper</value>
</property>
</configuration>

5. Update the regionservers file with ‘localhost’
$ vi regionservers --> localhost --> :wq

6. Exit the conf directory and start HBase
$ bin/start-hbase.sh
You should have four additional processes running: HMaster, ZooKeeper, HRegionServer, HQuorumPeer

7. Launch the HBase shell
bin/hbase shell

You will get a command line like this:

hbase(main):001:0>

It is actually very difficult to understand HBase in one go because of its NoSQL architecture, so let's look at it separately, in detail, in future posts.