Computational Statistics

Demo using RStudio

Crawling Twitter content using R

Over time, mining tweets has become simpler and now requires fewer dependency packages.

Tweet data can be used in many ways, ranging from extracting consumer sentiment about an entity (such as a brand) to finding your next job. Although there is a set limit of 3000 tweets, workarounds are available, such as dynamically switching the client authentication details once the current client has exceeded the limit.

I chose R because it makes post-extraction analysis easy, it is relatively easy to set up a database so that extracted tweets can be stored in a table, and machine learning packages are available for further analysis. Python is also an excellent choice.
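To give an idea of how little code the database step needs, here is a minimal sketch. It assumes the DBI and RSQLite packages are installed; SQLite is used only as an illustration, since no particular database is specified here, and the two rows are made-up placeholders standing in for extracted tweets.

```r
# Sketch only: assumes the DBI and RSQLite packages are installed.
# The rows below are made-up placeholders, not real tweets.
library(DBI)

tweets_df <- data.frame(
  id   = c("1", "2"),
  text = c("placeholder tweet one", "placeholder tweet two"),
  stringsAsFactors = FALSE
)

# Open an in-memory SQLite database (pass a file path instead to persist it).
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Create the "tweets" table and fill it from the data frame.
dbWriteTable(con, "tweets", tweets_df)

# Read the rows back to confirm they were stored.
stored <- dbGetQuery(con, "SELECT id, text FROM tweets")
print(stored)

dbDisconnect(con)
```

The same DBI calls work with other database backends, so the choice of SQLite here is easy to swap out.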

To extract tweets, you need to log in to Twitter's developer site, go to the page for managing your apps, and then click on “Create New App”.

Follow the instructions and fill in the information to get an API key, an API secret, an access token and an access token secret. Execute the following code and you are all set to extract tweets!

install.packages(c("devtools", "rjson", "bit64", "httr"))



APIkey <- "xxxxxxxxxxxxxxxxxxxxxxx"
APIsecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstoken <- "23423432-xxxxxxxxxxxxxxxxxxxxxxxxxx"
accesstokensecret <- "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"


[Screenshot from 2015-08-09 19:06:19]

Read the twitteR manual to learn about functions such as searchTwitter(), and play around a bit. Here is what I did:

[Screenshot from 2015-08-09 19:11:21]
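For readers who want something to type in, here is a minimal sketch of authenticating and running a search with twitteR. It is not necessarily the exact code from the screenshot. It assumes the twitteR package is installed and that the four credentials defined earlier are valid; fetch_tweets() is a hypothetical helper name, and the query and tweet count in the example call are illustrative values.

```r
# Sketch only: assumes the twitteR package is installed and the credentials
# are valid. fetch_tweets() is a hypothetical helper, not part of twitteR.
fetch_tweets <- function(query, n, key, secret, token, token_secret) {
  # Register the OAuth credentials for this R session.
  twitteR::setup_twitter_oauth(key, secret, token, token_secret)

  # Request up to n tweets matching the query; fewer may come back.
  tweets <- twitteR::searchTwitter(query, n = n)

  # Flatten the list of status objects into a data frame.
  twitteR::twListToDF(tweets)
}

# Example call ("#rstats" and 100 are illustrative values):
# tweets_df <- fetch_tweets("#rstats", 100, APIkey, APIsecret,
#                           accesstoken, accesstokensecret)
# head(tweets_df$text)
```

Returning a data frame keeps the result ready for the usual R tools, such as subsetting, writing to a table or passing to a modelling package.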



H2O and Machine Learning

Working with H2O has been quite an experience so far. Let's look at how to set it up. We can set up H2O as a standalone server, install it in R, or install it on Hadoop. Setting it up as a standalone server is quite simple.

1. Download the zip file and extract it.
2. cd into the extracted directory.
3. Run java -jar h2o.jar

Go to http://localhost:54321/ in your browser to see the H2O web interface.


Installing in R requires slightly more complex steps, especially if you are working with Ubuntu or another Linux distribution.

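Install the package with the following command, where the first entry of repos should be the URL of the H2O release repository you are installing from: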

install.packages("h2o", repos=(c("", getOption("repos"))))

Initialize the package and verify that H2O was installed properly:


library(h2o)
localH2O <- h2o.init()
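One way to take the check a little further is to ask the running instance for its status and push a small data set to it. The sketch below assumes the h2o package and a Java runtime are installed; check_h2o() is a hypothetical helper name, and the built-in iris data set is used only as a convenient example.

```r
# Sketch only: assumes the h2o package and a Java runtime are installed.
# check_h2o() is a hypothetical helper, not part of the h2o package.
check_h2o <- function() {
  # Start a local H2O instance, or connect to one that is already running.
  h2o::h2o.init()

  # Print cluster details such as version, node count and memory.
  h2o::h2o.clusterInfo()

  # Upload a built-in data set and report how many rows H2O received.
  iris_hex <- h2o::as.h2o(iris)
  h2o::h2o.nrow(iris_hex)
}

# Example call:
# check_h2o()
```

If the row count comes back matching nrow(iris), R and the H2O instance are talking to each other. The instance keeps running afterwards, so shut it down yourself when you are done.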


Installing on Hadoop requires a Cloudera, Hortonworks or MapR distribution of Hadoop running on your system, because these are the only distributions for which I found drivers inside the h2o/hadoop directory.


Computational Statistics setup

With the demand for analytics growing fast, many professionals are looking to leverage open source systems for their analytics tasks. R (also called “R programming”) is one of the best-known statistical programming tools used for scientific computation. R can also be linked with Hadoop to perform scalable, big data analytics.

Here we are going to look at installing R and the RStudio IDE, specifically on Linux (Debian). If you follow the commands in order, you will have R installed in no time!

sudo apt-get update
sudo apt-get install r-base

If you want to compile R packages from source, also install the following package:

sudo apt-get install r-base-dev

A number of R packages are available for Debian, with names starting with r-cran-. These are usually kept up to date. A package may need some build dependencies in order to build and run smoothly. The following command installs them; append the package name after r-cran-:

sudo apt-get build-dep r-cran-

Now R is ready. Type ‘R’ in the terminal to get the command line interface.
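As a quick sanity check once the R prompt is up, something like the following can be typed in. The vector x is an arbitrary example, not data from this post.

```r
# Confirm which R version was installed.
print(R.version.string)

# A tiny computation to confirm the interpreter works.
x <- c(2, 4, 4, 4, 5, 5, 7, 9)   # arbitrary illustrative values
x_mean <- mean(x)                # arithmetic mean
x_sd <- sd(x)                    # sample standard deviation
print(c(mean = x_mean, sd = x_sd))
```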

Next, we would like to install RStudio. You can either download the latest Debian package from the RStudio website or, if you have Ubuntu, open the Software Center and search for RStudio. If you choose to download the RStudio Debian package from the site, install it with the following command:

sudo dpkg -i rstudio-0.98.495-i386.deb

In general, the command takes the following form:

sudo dpkg --install package-name-here.deb