MongoDB

Python Driver, PyMongo and other MongoDB Operations

MongoDB has a mongo shell and drivers to connect to Java, Python, etc…

To start using MongoDB,

sudo service mongod start

For importing and exporting data, use commands at terminal. For instance:

mongoimport \
--db <name of database> \
--collection <name of collection> \ # collection is equivalent to table
--file <input file>

To create and/or use a database, simply

use <name of database>

To create and/or use a collection, simply

use <name of collection>

after database is created

To list down databases present,

show dbs

To list down collections under database,

show db.collection

A basic query would be like:

db.<collection_name>.find()

db.<collection_name>.find().pretty() # to display in JSON format

More querying with specific requests:

db.twitter1.find({"retweeted":false}).pretty()

db.twitter1.find_one() # to list just one query

To extract just one key value in JSON,

db.twitter1.find({},{text:1})

db.twitter1.find({},{text:1,_id:0}) # to remove id

db.twitter1.find({},{text:1,_id:0}).count() # to get count

To access subfields of JSON documents in our query,

db.twitter1.find({},{"user.location":1,text:1,_id:0})

The Python driver I worked with is pretty cool. Just do a pip install for pymongo — the python MongoDB driver and you could either work with Python CLE or write a script on it.

 

Advertisements

Some thoughts on MongoDB

Lets talk about MongoDB. Well, it is a NoSQL database. Controversial topic these days. I should talk with a disclaimer.

NoSQL databases cannot and should not be compared with SQL database. It is like comparing Apples and Oranges. Imagine if you have a social media website, where you have data about users (members) — profile description, messaging history, pictures, videos, user generated content (status updates, etc…) .

In a pure SQL environment, you may use different databases and tables to store various types of data and perform an inner join or outer join depending upon what you need — seperate tables for user information, status updates, different method of storing pictures and videos. In the context of NoSQL, you may be able to horizontally scale the database in order to store everything under one roof. Imagine, each row (or each document) would represent record for one member that has fields for user information, status updates, images, blah blah blah… so the developer/data scientist does not have to do any more joins. Due to large generation of user and machine generated data, this may be a good option (as it can dynamically add schema – adding more fields in a record).

MongoDB is a document storage model based NoSQL, stores records in key:value pairs. Under MongoDB database, a table is a collection, row is called a document, column is a called a field. I am using MongoDB to store a bunch of JSON documents. 

You may follow these instructions to setup as they are easy to clean up too.

 

NoSQL Architecture: Part III

There are different categories of NoSQL databases. These categories are based on how the data is stored. NoSQL products are optimized for insertion and retrival operations — because they usually happens in large scale and to calibrate performance most of the NoSQL products follow a horizontal structure. (as far as I know)

There are four major storage types available in NoSQL paradigm:
1. Column-oriented
2. Document Store
3. Key Value Store
4. Graph

Column-oriented: Data stored as columns. Typical RDBMS stores data as rows. You may argue that relational database displays data in two dimensional table with rows and columns. The difference here is when you query a RDBMS it will process one row at a time where as a column oriented database will have the data stored as columns. An example would enlighten the concept here:

Imagine following data needs to be stored. Lets compare RDBMS and NoSQL here:

Original Data

StudentID StudentName Mark1 Mark2 Mark3
12001 Bruce Wayne 55 65 75
12002 Peter Parker 66 77 88
12003 Charles Xavier 44 33 22

Data in RDMBS will be stored in the following way:
12001,Bruce Wayne,55,65,75
12002,Peter Parker,66,77,88
12003,Charles Xavier,44,33,22

Data in NoSQL will be stored in the following way:
12001,12002,12003
Bruce Wayne, Peter Parker, Charles Xavier
55,66,44
65,77,33
75,88,22

Note: This is a very trivial, simple example just to demonstrate a point. One cannot take this in face value and argue insertion will be much difficult in NoSQL. Whether it is RDBMS or NoSQL they are more sophisticated and their systems are optimized enough to handle data for processing. We are just looking things at a higher level.

The advantage of column based approach is that it is computationally faster than RDBMS. Imagine if you would like to find out average, maximum or minimum of a given subject, you dont have to go through each and every row. Instead, just look at that respective column to determine the value. Also when you query the database, it does not have to scan each row for matching conditions; whichever the column is conditioned to retrive data, only those will be touched and voila, faster processing. You have to read these assuming you have a billion records in your database and need to query all of them at once just to retrieve few hundreds of it.

Examples are HBase, Cassandra, Google Big Table, etc… Oracle also has this feature introduced quite recently.

Document Store: In the previous category, we looked at structured data, students’ records to be precise. Now we are looking at how to store somewhat structure/semi-structure data. When we use Facebook API to extract posts from the given group, we would get that in JSON format. Like this:

{ "technology": "big data", "message" ":"Way of Internet of Things to Internet of Everything","permalink":"http://www.facebook.com/comsoc/posts/10151526753216486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Challenges of Big Data Analysis","permalink":"http://www.facebook.com/comsoc/posts/10151494314921486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Big Data'nin hayatimiza etkisi","permalink":"http://www.facebook.com/comsoc/posts/10151490942041486","actor_id":130571316485}
{ "technology": "big data", "message" ":"Etkinligimiz hazir!! Peki Ya Siz??","permalink":"http://www.facebook.com/comsoc/posts/10151325074526486","actor_id":130571316485}
{ "technology": "big data", "message" ":"30 Nisan'da 'girisimci melekler' Istanbul'u kusatiyor. Siz nerdesiniz?","permalink":"http://www.facebook.com/comsoc/posts/10151318889096486","actor_id":130571316485}

Or even imagine something like this:

{ "StudentID": "12001", "StudentName":"Bruce Wayne", "Location" : "Gotham city" }

Another record is like this:

{ "StudentID": "12002", "StudentName":"James Tiberius Kirk", "Address" :
{"Street": "2nd street","City": "Edge city", "State": "New York"} }

Imagine where records/documents does not follow even/constant schema. RDBMS cannot process this. We need something better where such semi-structured data can be indexed and queried. The document store category of NoSQL. (About the indexing part — the document ID could be the URL from where these data are crawled, or the timestamp when the data was crawled. It is even okay if these records are without document ID)

These category of database with room for changing schema or schemaless documents would provide flexibility and hence the popularity. Ideal situation would be any web-based application where the content would have varying schema. Or in cases where the data is available in JSON or XML format.

Examples are Redis (In-memory), MongoDB, CouchDB, Lotus Notes, etc…

More on the remaining categories in future posts.

NoSQL Architecture: Part II

NoSQL can be simply understood as data storage that is not just SQL but a bit more than that. We know that relational databases have been used to store the structured data. The data is sub-divided into groups, called tables which store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as column while each unit of the group is known as row.

Why are we talking about NoSQL?

We know that traditional RDBMS has limitations in terms of scalability and parallelization. Companies like eBay, Google, Yahoo would usually get billions of requests each day/week. For example, if you are searching on google search engine, using gmail services or gtalk, or gsomething, usually accessing these services from google would be a result of multiple systems but we cannot see with our eyes how many systems are being used to process our requests. Traditional RDBMS cannot be used to process these requests and they need something more robust, parallelized and scalable. So they came up with
1. GFS: Distributed filesystem (google file system)
2. Chubby: Distributed coordination system
3. MapReduce: Parallel processing system
4. Big Table: Column oriented database

Based on the idea of the above products, many NoSQL products were born like HBase, MongoDB, etc…

Traditional databases have limitations which are perceived as features like a transaction cannot leave database in an inconsistent state, one transaction cannot interfere with another, etc… These qualities would seem ideal in context of structured data but if we are talking about web-scale then performance will be compromised. Imagine, if I am looking at a book on eBay/Amazon and finalized that I am buying the book and proceeding to payment, it will lock a part of the database, specifically the inventory, and every other person in the world will have to wait to even access the book until I complete my transaction (this cannot be possible but the point is locking one web page for a secure transaction). This would be very counterproductive and hence, NoSQL gained momentum.

Why NoSQL at all?
1. Almost all NoSQL products offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure.
2. Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of milliseconds—especially over mobile/hand held devices and other intermittently connected devices—you have much higher probability of winning users over.
3. NoSQL is an elastic product and not brittle. It can handle sudden increase of load and it is also scalable — imagine billions of rows and columns and no need to delete anything as NoSQL can create timestamps.
4. Open source, Apache Licensed and online support from experienced users, developers, committers of the project and articles from bloggers like me!

More on NoSQL in future posts.