An update on DataScience Hacks

I appreciate that you stop by to view my blog posts and my code. Some of you even send me a thank you note, which is lovely to see. This post is about my absence from blogging over the past few months.

It takes a few days of full-time work to publish a decent blog post (I am no expert like so many others). Given my current work commitments, among other things, I have not been able to find the time to publish posts recently (not to mention, my laptop crashed).

I miss talking to you, readers; this blog is my social media. I hope to touch base soon, but I might be inclined to take a different direction.

In my humble opinion, technology is growing faster than we can catch up with it. There is so much new research on machine learning and mathematical modeling happening every day; however, the ability to deploy these models as a service remains somewhat of a challenge. Project design, requirements definition, enterprise architecture and so on may not all click, and that can make the mathematical models look bad and unreliable.

I do not wish to digress; things could be argued both ways, based on one's professional experience. But the ability to ideate, conceptualize, reason, code, experiment and publish a blog post is an intellectually rewarding exercise. I thank you for reading, commenting and even merely viewing, from the bottom of my heart.

See you soon. Live long and prosper.


Python Driver, PyMongo and other MongoDB Operations

MongoDB has a mongo shell and drivers for connecting from Java, Python and other languages.

To start using MongoDB,

sudo service mongod start

For importing and exporting data, use the mongoimport and mongoexport commands at the terminal. For instance, to import a file into a collection (a collection is the equivalent of a table):

mongoimport \
--db <name of database> \
--collection <name of collection> \
--file <input file>

To create and/or use a database, simply

use <name of database>

A collection does not need a separate use command; it is created automatically the first time you insert a document into it, or explicitly with

db.createCollection("<name of collection>")

once the database is selected.

To list down databases present,

show dbs

To list down the collections under the current database,

show collections

A basic query would be like:

db.<collection_name>.find()

db.<collection_name>.find().pretty() # to display results in an indented, readable format

More querying with specific requests:

db.twitter1.find({"retweeted":false}).pretty()

db.twitter1.findOne() # to return just one document (find_one() is the PyMongo equivalent)

To extract just one field from each JSON document,

db.twitter1.find({},{text:1})

db.twitter1.find({},{text:1,_id:0}) # to remove id

db.twitter1.find({},{text:1,_id:0}).count() # to get count

To access subfields of JSON documents in our query,

db.twitter1.find({},{"user.location":1,text:1,_id:0})

The Python driver I worked with is pretty cool. Just do a pip install for pymongo, the Python MongoDB driver, and you can either work with Python interactively or write a script.
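
Here is a minimal sketch of the same kind of queries from Python with pymongo, assuming a local MongoDB instance and the twitter1 collection used above (the database name test is my own assumption):

from pymongo import MongoClient

# connect to a local MongoDB instance (adjust host/port for your setup)
client = MongoClient("mongodb://localhost:27017/")
db = client["test"]            # database name is an assumption
collection = db["twitter1"]    # the collection queried in the shell examples above

# equivalent of db.twitter1.find({}, {text: 1, _id: 0}) in the mongo shell
for doc in collection.find({}, {"text": 1, "_id": 0}):
    print(doc)

# equivalent of db.twitter1.findOne() in the mongo shell
print(collection.find_one())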

 

Some thoughts on MongoDB

Let's talk about MongoDB. It is a NoSQL database, which is a controversial topic these days, so I should start with a disclaimer.

NoSQL databases cannot and should not be compared with SQL databases; it is like comparing apples and oranges. Imagine you have a social media website with data about users (members): profile descriptions, messaging history, pictures, videos and user generated content (status updates and so on).

In a pure SQL environment, you may use different databases and tables to store the various types of data and perform inner or outer joins depending on what you need: separate tables for user information and status updates, and a different method of storing pictures and videos. In the NoSQL world, you can scale the database horizontally and store everything under one roof. Imagine that each row (each document) represents the record for one member, with fields for user information, status updates, images and so on, so the developer or data scientist no longer has to do joins. Given how much user and machine generated data is produced, this can be a good option, since the schema is dynamic and more fields can be added to a record at any time.
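
To make that concrete, here is a hedged sketch of what a single denormalized member document might look like, written as a Python dictionary; all field names and values are made up for illustration:

member = {
    "_id": 1001,
    "name": "Jane Doe",
    "profile": {"location": "Toronto", "bio": "Coffee and code"},
    "status_updates": ["Hello world", "Back to blogging soon"],
    "pictures": ["profile.jpg", "vacation.png"],
}
# everything about the member lives in one document, so no joins are needed;
# a new field can be added to any document later without changing a schema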

MongoDB is a document-store NoSQL database that keeps records as key:value pairs. In a MongoDB database, a table is called a collection, a row is called a document and a column is called a field. I am using MongoDB to store a bunch of JSON documents.

You may follow these instructions to set it up; they are easy to clean up too.

 

Some notes on hadoop cluster

One-way passwordless SSH from the master node to the worker nodes:

1. Generate Key: ssh-keygen -t rsa
2. Create folder in worker node: ssh [user]@255.255.255.255 mkdir -p .ssh
3. Copy key to worker node: ssh-copy-id -i ~/.ssh/id_rsa.pub [user]@255.255.255.255
4. Enable permissions to worker: ssh [user]@255.255.255.255 "chmod 700 .ssh; chmod 640 .ssh/authorized_keys"

700: user can read, write and execute; group and others have no permissions.
640: user can read and write; group can read; others have no permissions.

Configuration Files: (For minimal configuration)

The core-site.xml configuration should be the same on the master and all worker nodes; the namenode URL (fs.default.name) should point to the master node only.
mapred-site.xml should be edited only on the master node.
yarn-site.xml should have the same configuration on the master and worker nodes.
The "slaves" file should be updated only on the master node.
hdfs-site.xml holds HDFS-specific settings such as the replication factor and the namenode/datanode storage directories.

Disable IPV6:

IPv4 (Internet Protocol version 4) addresses follow a dotted-decimal pattern such as 192.168.0.1.

IPv6 (Internet Protocol version 6) addresses follow a colon-separated hexadecimal pattern such as 2001:0db8:85a3:0000:0000:8a2e:0370:7334.

Hadoop communicates over IPv4 within its cluster; it does not support IPv6 at the moment.

developing text processing data products: part II

I want to talk a little bit about the challenges, something I wanted to share. The challenge when working with text is that text data is usually a complex bunch of words and sentences, whose meaning and interpretation can differ based on geographic location. Not to mention the formatting and special characters that we have to deal with.

Although text can be extracted from the web in XML and JSON formats, several stages of cleaning and processing still need to be done before we even begin to analyze it. Skipping this step, we end up with "garbage in, garbage out".
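
As a small illustration of that cleaning stage, here is a hedged sketch in Python; the example string and the cleaning rules are made up, and the right rules always depend on the task at hand:

import re

raw = "Check out http://example.com!!! #awesome :)"
clean = re.sub(r"[^a-z0-9\s]", " ", raw.lower())    # strip special characters
clean = re.sub(r"\s+", " ", clean).strip()          # collapse repeated whitespace
print(clean)    # "check out http example com awesome"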

After data cleansing, one type of analysis we can perform is part-of-speech tagging. Essentially, part-of-speech tagging breaks a sentence down into nouns, verbs, adverbs, conjunctions and so on. Once the PoS analysis is done, we can use the output for various applications, such as word counts to find the most frequently occurring noun, or investigating the sentiment of a text.
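
As an example of what PoS tagging output looks like, here is a minimal sketch using NLTK, one of several packages that can do this; the downloads are only needed once:

import nltk

# one-time downloads for the tokenizer and the PoS tagger models
nltk.download("punkt")
nltk.download("averaged_perceptron_tagger")

sentence = "The quick brown fox jumps over the lazy dog"
tokens = nltk.word_tokenize(sentence)
print(nltk.pos_tag(tokens))   # a list of (word, tag) pairs, e.g. ('The', 'DT') for a determiner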

Extracting the information is one challenge. Finding the right software or package is another. I am not trying to market any package or API here, but it appears that every package has its own style of interpretation and provides different output. Let's look at some of these variations soon. More later.

developing text processing data products: part I

Folks, this is going to be a series of information pieces (more than one blog post about the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take this moment to organize the discussion on my blog. In the past, I touched upon some text processing recipes purely from an application point of view; however, I have spent more of my career working with text content in automation and analytics, and I owe it to myself to write more on text processing. Without further ado, let's jump into the series.

Text processing is extracting information that can be used for decision making from a text, say a book, paper or article, WITHOUT HAVING TO READ IT. I used to read the newspaper to find out the weather forecast (five years ago), but these days if we need information about something we type it into a search engine to get results (and ads). But imagine someone who needs information on a day-to-day basis for decision making, where he or she has to read many articles, news stories or books in a short period of time. Information is power, but timing is important too. Hence the need for text processing data products.

Some basic things that we can do with text data:

Crawling: extracting data/content from a website using a crawler.

Tokenization: the process of splitting a string into smaller pieces (tokens) stored in an array, typically based on the spaces between words.

Stemming: reducing a word to its root (stem) so that the root can be used to match all of its variations.

Stop word removal: straightforward; removing frequently occurring words like a, an and the.

Parsing: the process of breaking a sentence down into a set of phrases and further breaking each phrase into nouns, verbs, adjectives and so on. Also called part-of-speech tagging. Cool topic. A short sketch of some of these steps follows this list.
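
Here is a hedged sketch of the first few steps with NLTK; other packages such as spaCy offer similar functionality, and the example text is made up:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# one-time downloads for the tokenizer and the stop word list
nltk.download("punkt")
nltk.download("stopwords")

text = "The runners were running quickly through the park"
tokens = nltk.word_tokenize(text)                         # tokenization
stemmer = PorterStemmer()
stems = [stemmer.stem(t) for t in tokens]                 # stemming: "running" -> "run"
stop = set(stopwords.words("english"))
filtered = [t for t in tokens if t.lower() not in stop]   # stop word removal
print(stems)
print(filtered)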

These are just my experiences and my knowledge, and as always I try to write in a way that anyone can read and understand. More later.

Crawling Twitter content using R

Over the course of time, mining tweets has been made simple and easy, and it now requires fewer dependency packages.

Tweet data can be used in many ways, ranging from extracting consumer sentiment about an entity (a brand) to finding your next job. Although there is a set limit of 3000 tweets, there are workaround strategies available (dynamically changing the client auth information once the current client has exceeded the limit).

I chose R because it is easy to do post-extraction analysis, relatively easy to set up a database so extracted tweets can be stored in a table, and easy to use machine learning packages for any analysis. Python is also an excellent choice.

To extract tweets, you need to log in to https://dev.twitter.com/ and then click on “Create New App” (under managing your apps).

Follow the instructions and fill in the information to get yourself an API key, API secret and access token. Execute the following code and you are all set to extract tweets!

install.packages(c("devtools", "rjson", "bit64", "httr", "twitteR"))

library(devtools)

library(twitteR)

APIkey <- "xxxxxxxxxxxxxxxxxxxxxxx"
APIsecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstoken <- "23423432-xxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstokensecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"

setup_twitter_oauth(APIkey, APIsecret, accesstoken, accesstokensecret)


Read the twitteR manual to learn about various methods like searchTwitter() and play around a bit. Here is what I did:

[screenshot of an example twitteR session]

 

tmux, a time saving linux tool

Tmux is a Linux-based tool called a terminal multiplexer that enables many terminals to be accessed and controlled from a single terminal. Say you wanted to install a program on all the Linux machines in your network; using tmux, you could access the terminals of all your machines and install the package simultaneously.

This tool has been very beneficial to me, and I thought I would share it with my readers.

You can install it with

sudo apt-get install tmux

and when you enter

tmux

from the command line, you will get a screen like this:

[screenshot of a new tmux session]

Caveat: you need to have openssh-server installed on every machine; however, you do not need to install tmux on every machine.

Pictures speak more than words, and the following pictures show how I installed Java on multiple machines.

tmux new -s 'new session' 'ssh user@host_or_ipaddress' \; split-window -h 'ssh user@host_or_ipaddress' \; split-window -h 'ssh user@host_or_ipaddress' \; select-layout even-vertical

Once you are in these panes, press ctrl-b and then ":" (use the shift key) to open the tmux command prompt, and enter

setw synchronize-panes on

[screenshot of the tmux panes with synchronized input]

and there you go. The following picture shows how I installed Java 8 on 6 machines in one go:

[screenshot of Java 8 being installed on six machines simultaneously]

More information can be found here.

full stack development

A full stack developer is someone who has the knowledge and ability to develop and work on all layers of web application development. When a user clicks on a web page to perform some action, the web page communicates with a server, which accesses a database and retrieves information to be sent back to the user. Easier said than done.

Web application development consists of many layers. You have the server that resides on an operating system, a database where the data gets stored, a front end which the user sees when he or she opens a website, and a back-end layer that communicates between the front end and the server. There is also the work of gathering requirements, meeting with clients, creating documentation and project management.

For instance, consider a web page(s) where:

The front end is developed using HTML, CSS and JavaScript; the back end is developed using JavaScript, PHP or Python; the data is stored in MySQL or MongoDB; and it all runs on an Apache server.
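
As a hedged sketch of how those layers talk to each other, here is a tiny back end in Python using Flask and PyMongo; the framework choice, route and field names are all my own assumptions for illustration, not a prescription:

from flask import Flask, jsonify
from pymongo import MongoClient

app = Flask(__name__)
db = MongoClient("mongodb://localhost:27017/")["myapp"]   # database name is illustrative

@app.route("/users/<name>")
def get_user(name):
    # the front end calls this URL; the back end fetches a document
    # from the database and returns it to the browser as JSON
    user = db.users.find_one({"name": name}, {"_id": 0})
    return jsonify(user or {})   # empty object if the user is not found

if __name__ == "__main__":
    app.run()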

Keep watching this space for full stack development posts.