Real-time analytics using distributed computing system Storm: Part II

How to set up a Storm cluster?
We will look at how to set up a single-node Storm cluster. The following are the prerequisites:

  1. Java 6 or above
  2. Python 2.6
  3. Zookeeper
  4. ZeroMQ
  5. JZMQ
  6. Any other dependencies (unzip, git, etc.)

Zookeeper: The Apache ZooKeeper project gives you a set of tools to build distributed applications that can safely handle partial failures.
ZeroMQ: ZeroMQ is a messaging library, which allows you to design a complex communication system without much effort. It is not a complete messaging system. Instead, it allows you to quickly build your own messaging system.
JZMQ: Java binding for ZeroMQ


Step 1: Configure Zookeeper


tar -xzvf zookeeper-3.4.5.tar.gz

cd zookeeper/conf

Create a new cfg file with the following information:

vim zoo.cfg

A minimal standalone configuration looks like this (dataDir below is just an example path):

tickTime=2000
dataDir=/var/zookeeper
clientPort=2181

Then start the server from the ZooKeeper root directory:

bin/zkServer.sh start

Note: if it says failed to write pid, then change the dataDir to someplace else where root permission is not required.

Step 2a: Configure Messaging library


tar -xzvf zeromq-2.1.7.tar.gz

cd zeromq-2.1.7

./configure

make

sudo make install


Step 2b: JZMQ

git clone https://github.com/nathanmarz/jzmq.git (the JZMQ fork referenced by the Storm documentation of this era)

cd jzmq

./autogen.sh

./configure

make

sudo make install

Note: if git is not available, then:

sudo apt-get update

sudo apt-get install git-core

Java and Python 2.6 should be up and running, and JAVA_HOME should be set.


Step 3: Install Storm


unzip storm-x.y.z.zip -d /path/to/working/directory

cd into storm/conf and modify storm.yaml with the following. Note: we are setting up a single server node, hence the references to localhost here.
storm.zookeeper.servers:
  - "localhost"
storm.zookeeper.port: 2181
nimbus.host: "localhost"
storm.local.dir: "/var/stormtmp"
java.library.path: "/usr/lib/jvm/java-1.7.0-openjdk-i386"
supervisor.slots.ports:
  - 6700
  - 6701
  - 6702
  - 6703
worker.childopts: "-Xmx768m"
nimbus.childopts: "-Xmx512m"
supervisor.childopts: "-Xmx256m"
For a multi-node cluster, you will have to fill in additional information about the masters and slaves.
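For instance, on a slave of a hypothetical two-machine cluster, the ZooKeeper and Nimbus entries would point at the master's hostname rather than localhost (the hostname storm-master below is made up):

```yaml
# Hypothetical worker-node storm.yaml fragment; "storm-master"
# is an assumed hostname for the machine running ZooKeeper and Nimbus.
storm.zookeeper.servers:
  - "storm-master"
nimbus.host: "storm-master"
```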

And now, start the daemons:

bin/storm nimbus (for the master)
bin/storm supervisor (for the slave)
bin/storm ui (to get a nice UI in your browser)

Use http://localhost:8080 to see the UI.


Real-time analytics using distributed computing system Storm: Part I

We know that using data we can produce analytics and create reports that help us in decision-making: the usual story. The constraint is that, in order to create analytics reports right now, the data should have been collected, cleansed and pre-processed earlier. Only after spending a considerable amount of time on these tasks can we use a framework like Hadoop and its ecosystem tools to analyze big data. What if I told you that you can get your analytics in a matter of minutes?

This is called Real-time analytics. For example:

In financial engineering, we have something called algorithmic trading. Imagine a computer system embedded with an AI-based program that places trading orders from pre-programmed trading instructions based on timing, price, quantity, etc., without any human stockbroker involvement.

In revenue management, there is dynamic pricing. Imagine a flight ticket that you view online changing its price every few hours based on demand for that route. That's near real-time analytics in action.

There are many such use cases where quicker availability of data helps make better decisions. Now there is a tool that performs real-time analytics in a distributed fashion, where the work is spread across multiple machines. I present to you the real-time analytics tool, Storm.

What is Storm?

Storm is a distributed, fault-tolerant system for processing streams of real-time data. Just like Hadoop's JobTracker and TaskTracker, the work is delegated to different components that are each responsible for a task.

A spout is a component that handles the input stream. A bolt is another component that either persists the data in some storage or performs some task on the stream and passes the transformed data to another bolt; a cluster can have more than one bolt. The arrangement of spouts and bolts and their connections is called a topology (just like a MapReduce job in Hadoop).
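As an illustration of this dataflow, and only that (this is plain Python, not Storm's API; all class names here are made up), a sentence spout feeding a splitter bolt feeding a word-count bolt could be sketched as:

```python
# Toy illustration of Storm's dataflow: a spout emits tuples, bolts
# transform them and pass them downstream. NOT the Storm API; the
# names SentenceSpout/SplitBolt/CountBolt are hypothetical.

class SentenceSpout:
    """Emits an input stream, one sentence at a time."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        for sentence in self.sentences:
            yield sentence

class SplitBolt:
    """Transforms each sentence into words and passes them on."""
    def process(self, sentence):
        return sentence.split()

class CountBolt:
    """Keeps a running word count (here just an in-memory dict)."""
    def __init__(self):
        self.counts = {}

    def process(self, word):
        self.counts[word] = self.counts.get(word, 0) + 1

def run_topology(sentences):
    """A 'topology': wire spout -> split bolt -> count bolt and run."""
    spout, splitter, counter = SentenceSpout(sentences), SplitBolt(), CountBolt()
    for sentence in spout.emit():
        for word in splitter.process(sentence):
            counter.process(word)
    return counter.counts

print(run_topology(["the quick brown fox", "the lazy dog"]))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}
```

In real Storm, each of these components would run as its own thread, process, or server, which is where the parallelism comes from.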

Why use Storm?

Storm lowers the complexity of doing real-time processing. Computations are done in parallel using multiple threads, processes and servers. You can use any programming language on top of Storm. It is also fast, as the system is designed so that messages are processed quickly using ØMQ.

How does Storm work?

Just like a Hadoop cluster, there are two kinds of nodes: master nodes and slave nodes. The master runs a daemon process called Nimbus that distributes code around the cluster, assigns tasks to each worker node and monitors for failures. As mentioned earlier, Storm is a fault-tolerant system: when a slave fails, Nimbus reassigns the tasks of the failed slave as necessary. Slaves run a daemon process called Supervisor that executes a portion of the topology (a set of spouts and bolts and their connections).

Storm can run in local mode (for testing the topology) or remote mode (production). It can be set up as a single-server installation or as a multi-server cluster of masters and slaves. The user designs a topology (a set of spouts and bolts). Basically, these spouts and bolts are just Java programs where you specify which bolt/spout should perform what operation. You can also write spouts and bolts in Python or Ruby using the multilang protocol. This protocol uses a standard input/output communication channel to do the job of spouts/bolts. Messages are encoded as either JSON or plain text over this channel.
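The idea behind the multilang protocol can be sketched in a few lines of Python. This is a simplified illustration only, not the real protocol, which also involves an initial handshake, "end" delimiter lines and tuple acking; the handle_tuple function and the message shape used here are assumptions made for the sketch:

```python
import json
import sys

def handle_tuple(line):
    """Decode one JSON-encoded tuple from the input channel, transform it,
    and encode the result as an 'emit' command for the output channel.
    Simplified sketch; not Storm's full multilang protocol."""
    message = json.loads(line)
    word = message["tuple"][0]  # assume a one-field tuple of text
    return json.dumps({"command": "emit", "tuple": [word.upper()]})

if __name__ == "__main__":
    # A non-Java bolt would loop like this over stdin/stdout.
    for line in sys.stdin:
        print(handle_tuple(line))
```

Because the channel is just stdin/stdout with JSON messages, any language that can read and write text can participate in a topology.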

How to develop real-time applications in Apache Storm?

At a high level, we have to:

0. Set up a Storm cluster on your machine
1. Design the topology of bolts and spouts and the connections between them (just like a graph of nodes and arcs)
2. Write a program for each bolt and spout to instruct it what to do upon receiving the streams of data
3. Pack your code and dependencies into a single jar file and run it