HBase

NoSQL Architecture: Part III

There are different categories of NoSQL databases, based on how the data is stored. NoSQL products are optimized for insertion and retrieval operations, because these operations usually happen at large scale; to keep performance up, most NoSQL products scale horizontally. (as far as I know)

There are four major storage types available in NoSQL paradigm:
1. Column-oriented
2. Document Store
3. Key Value Store
4. Graph

Column-oriented: Data is stored as columns, whereas a typical RDBMS stores data as rows. You may argue that a relational database also displays data in a two-dimensional table with rows and columns. The difference is that when you query an RDBMS it processes one row at a time, whereas a column-oriented database physically stores the data as columns. An example will clarify the concept:

Imagine the following data needs to be stored. Let's compare RDBMS and NoSQL here:

Original Data

StudentID   StudentName      Mark1   Mark2   Mark3
12001       Bruce Wayne      55      65      75
12002       Peter Parker     66      77      88
12003       Charles Xavier   44      33      22

Data in an RDBMS will be stored in the following way:
12001,Bruce Wayne,55,65,75
12002,Peter Parker,66,77,88
12003,Charles Xavier,44,33,22

Data in a column-oriented NoSQL store will be stored in the following way:
12001,12002,12003
Bruce Wayne, Peter Parker, Charles Xavier
55,66,44
65,77,33
75,88,22

Note: This is a very trivial, simple example, just to demonstrate a point. One cannot take it at face value and argue that insertion will be much more difficult in NoSQL. Whether RDBMS or NoSQL, real systems are far more sophisticated and optimized in how they handle data for processing. We are just looking at things at a high level.

The advantage of the column-based approach is that it is computationally faster than an RDBMS for such workloads. Imagine you would like to find the average, maximum, or minimum of a given subject: you don't have to go through each and every row, you just look at the respective column to determine the value. Also, when you query the database, it does not have to scan each row for matching conditions; only the columns your query conditions touch will be read, and voila, faster processing. Read all of this imagining you have a billion records in your database and need to query all of them at once just to retrieve a few hundred.

Examples are HBase, Cassandra, Google Bigtable, etc… Oracle also introduced this feature quite recently.
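
To see the column orientation in action, here is a minimal HBase shell sketch of the student table above (a hypothetical session; the layout, with one column family per mark, is my own choice purely for illustration). HBase physically groups data by column family, so a scan restricted to one family reads only that family's files:

hbase(main):001:0> create 'students', 'mark1', 'mark2', 'mark3'
hbase(main):002:0> put 'students', '12001', 'mark1:score', '55'
hbase(main):003:0> put 'students', '12001', 'mark2:score', '65'
hbase(main):004:0> put 'students', '12002', 'mark1:score', '66'
hbase(main):005:0> scan 'students', {COLUMNS => ['mark1']}

The scan returns only the mark1 cells; the mark2 and mark3 data is never touched.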

Document Store: In the previous category we looked at structured data, students' records to be precise. Now we are looking at how to store semi-structured data. When we use the Facebook API to extract posts from a given group, we get them in JSON format, like this:

{ "technology": "big data", "message" ":"Way of Internet of Things to Internet of Everything","permalink":"http://www.facebook.com/comsoc/posts/10151526753216486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Challenges of Big Data Analysis","permalink":"http://www.facebook.com/comsoc/posts/10151494314921486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Big Data'nin hayatimiza etkisi","permalink":"http://www.facebook.com/comsoc/posts/10151490942041486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Etkinligimiz hazir!! Peki Ya Siz??","permalink":"http://www.facebook.com/comsoc/posts/10151325074526486","actor_id":130571316485}
{ "technology": "big data", "message" ":"30 Nisan'da 'girisimci melekler' Istanbul'u kusatiyor. Siz nerdesiniz?","permalink":"http://www.facebook.com/comsoc/posts/10151318889096486","actor_id":130571316485}

Or even imagine something like this:

{ "StudentID": "12001", "StudentName":"Bruce Wayne", "Location" : "Gotham city" }

Another record is like this:

{ "StudentID": "12002", "StudentName":"James Tiberius Kirk", "Address" :
{"Street": "2nd street","City": "Edge city", "State": "New York"} }

Imagine records/documents that do not follow a constant schema. An RDBMS cannot process this. We need something better, where such semi-structured data can be indexed and queried: the document store category of NoSQL. (About the indexing part: the document ID could be the URL from which the data was crawled, or the timestamp of the crawl. It is even okay if these records have no document ID.)

This category of database, with room for a changing schema or outright schemaless documents, provides flexibility, hence its popularity. The ideal situation is any web-based application where the content has a varying schema, or cases where the data is available in JSON or XML format. For instance, a query like "find all students whose Address City is Edge city" would match the second record above and simply skip documents that have no Address field at all.

Examples are MongoDB, CouchDB, Lotus Notes, etc… (Redis, an in-memory store, is usually classified under the key-value category instead.)

More on the remaining categories in future posts.


NoSQL Architecture: Part II

NoSQL can be simply understood as data storage that is not just SQL but a bit more than that. We know that relational databases have been used to store structured data. The data is subdivided into groups, called tables, which store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as a column, while each record in a group is known as a row.

Why are we talking about NoSQL?

We know that traditional RDBMSs have limitations in terms of scalability and parallelization. Companies like eBay, Google, and Yahoo get billions of requests each day. For example, when you search on Google, or use Gmail, Gtalk, or g-anything, your request is served by multiple systems working together, though we cannot see how many systems are involved in processing it. A traditional RDBMS cannot handle requests at this scale; something more robust, parallelized, and scalable is needed. So Google came up with
1. GFS: Distributed file system (Google File System)
2. Chubby: Distributed coordination system
3. MapReduce: Parallel processing system
4. Bigtable: Column-oriented database

Based on the idea of the above products, many NoSQL products were born like HBase, MongoDB, etc…

Traditional databases have limitations that are perceived as features: a transaction cannot leave the database in an inconsistent state, one transaction cannot interfere with another, and so on. These qualities seem ideal in the context of structured data, but at web scale such strictness compromises performance. Imagine I am looking at a book on eBay/Amazon, decide to buy it, and proceed to payment: a strictly transactional design would lock part of the database, specifically the inventory, and every other person in the world would have to wait to even access that book until I complete my transaction. (Real systems do not actually behave this way; the point is the cost of locking a shared resource for one secure transaction.) That would be very counterproductive, and hence NoSQL gained momentum.

Why NoSQL at all?
1. Almost all NoSQL products offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure.
2. Even with a small amount of data, if you can deliver in milliseconds rather than hundreds of milliseconds, especially over mobile/handheld devices and other intermittently connected devices, you have a much higher probability of winning users over.
3. NoSQL products are elastic, not brittle. They can handle sudden increases in load and they scale out: imagine billions of rows and columns, with no need to delete anything, since many NoSQL stores keep timestamped versions of values instead.
4. Open source, Apache licensed, with online support from experienced users, developers, and committers of the project, plus articles from bloggers like me!

More on NoSQL in future posts.

HBase and NoSQL Architecture: Part I

HBase is the NoSQL database for Hadoop. It is a distributed database. A NoSQL database can be understood as a database that is not relational in nature and does not use SQL as its primary access language. HBase is ideal if you have huge amounts of data. HBase supports massively parallelized processing via MapReduce and is designed to handle billions of rows and columns: big data.

HBase is quite hard to understand, especially if you have used an RDBMS earlier. HBase is modeled on Google's Bigtable. According to the Google research paper, a Bigtable is a sparse, distributed, persistent, multidimensional sorted map, and the map is indexed by a row key, a column key, and a timestamp. A map is simply an abstract data type with a collection of keys and values, where each key is associated with one value. One of its many features is timestamping/versioning: HBase does not update or delete cell values in place. Writes add new timestamped versions of a cell, deletes just write marker records, and every cell value maintains its own timestamp. The number of versions kept is configurable per column family (the long-standing default was three).
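
Here is a minimal HBase shell sketch of versioning (a hypothetical session; the table name and values are mine). The column family is created to retain three versions, the same cell is written twice, and both versions can then be read back:

hbase(main):001:0> create 'test', {NAME => 'cf', VERSIONS => 3}
hbase(main):002:0> put 'test', 'row1', 'cf:a', 'first'
hbase(main):003:0> put 'test', 'row1', 'cf:a', 'second'
hbase(main):004:0> get 'test', 'row1', {COLUMN => 'cf:a', VERSIONS => 3}

The get returns both 'second' and 'first', each with its own timestamp; a plain get would return only the latest version.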

Let's look at how HBase can be installed on top of Hadoop. Hadoop should be live, with all daemon processes running.

1. Download the tarball of the latest HBase
$ wget http://hbase-latest-version.tar.gz

2. Extract the tarball
$ tar -zxvf hbase-latest-version.tar.gz

3. Go to hbase-latest-version and cd into conf
$ cd hbase-latest-version/conf/

4. Replace the contents of hbase-site.xml with the following configuration
<configuration>
<property>
<!-- This is the location in HDFS that HBase will use to store its files -->
<name>hbase.rootdir</name>
<value>hdfs://localhost:54310/hbase</value>
</property>
<property>
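<!-- Run HBase in distributed mode (daemons in separate JVMs) instead of standalone -->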
<name>hbase.cluster.distributed</name>
<value>true</value>
</property>
<property>
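<!-- Comma-separated list of hosts in the ZooKeeper quorum; just this machine here -->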
<name>hbase.zookeeper.quorum</name>
<value>localhost</value>
</property>
<property>
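<!-- Local path where the HBase-managed ZooKeeper stores its data -->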
<name>hbase.zookeeper.property.dataDir</name>
<value>/data/zookeeper</value>
</property>
</configuration>

5. Update the regionservers file so that it contains just localhost
$ vi regionservers    (set the file contents to localhost, then save and quit with :wq)

6. Exit the conf directory and start HBase
$ bin/start-hbase.sh
You should have three additional processes running: HMaster, HRegionServer, and HQuorumPeer (the ZooKeeper process managed by HBase).
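
You can verify this with jps, which lists the running Java processes:

$ jps

Alongside the Hadoop daemons (NameNode, DataNode, and so on), the three HBase processes above should appear in the output.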

7. Launch the HBase shell
$ bin/hbase shell

You will get a command prompt like this:

hbase(main):001:0>
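
From here you can try a few basic commands to get a feel for the shell; this is a hypothetical session, and the table, row, and value names are mine:

hbase(main):001:0> status
hbase(main):002:0> create 'demo', 'cf'
hbase(main):003:0> put 'demo', 'row1', 'cf:greeting', 'hello'
hbase(main):004:0> get 'demo', 'row1'
hbase(main):005:0> scan 'demo'
hbase(main):006:0> disable 'demo'
hbase(main):007:0> drop 'demo'

(A table must be disabled before it can be dropped.)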

It is actually very difficult to understand HBase in one go because of its NoSQL architecture, so let's look at it separately and in detail in future posts.