Docker

I recently discovered a solution (a product) called Docker. This piece of software isolates a given job and its dependencies. It is similar to a virtual machine environment, but lightweight and far more storage friendly. Docker can package an application or a service into an isolated unit. They call it a container. A… Continue reading Docker
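As a minimal sketch of the idea (assuming Docker is installed and the daemon is running; the `alpine` image is just an example of a small base image):

```shell
# Pull a small base image from the registry.
docker pull alpine
# Run an isolated process inside a container; --rm removes the
# container once the process exits, keeping storage use low.
docker run --rm alpine echo "hello from a container"
```

Unlike a full virtual machine, the container shares the host kernel, which is why startup is near-instant and the image stays small.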

Some notes on hadoop cluster

One-way passwordless SSH from master to worker nodes: 1. Generate a key: ssh-keygen -t rsa 2. Create the folder on the worker node: ssh [user]@255.255.255.255 mkdir -p .ssh 3. Copy the key to the worker node: ssh-copy-id -i id_rsa.pub [user]@255.255.255.255 4. Set permissions on the worker: ssh [user]@255.255.255.255 "chmod 700 .ssh; chmod 640 .ssh/authorized_keys" 700 -- user can read, write and execute. group… Continue reading Some notes on hadoop cluster
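The permission bits from step 4 can be verified locally before touching the cluster; a small sketch using a throwaway directory in place of the real `.ssh` folder:

```shell
# Demonstrate the octal modes used above on a temporary directory/file.
dir=$(mktemp -d)
touch "$dir/authorized_keys"
chmod 700 "$dir"                     # owner: read/write/execute; group/other: none
chmod 640 "$dir/authorized_keys"     # owner: read/write; group: read; other: none
stat -c '%a' "$dir"                  # prints 700
stat -c '%a' "$dir/authorized_keys"  # prints 640
rm -rf "$dir"
```

sshd is strict about these modes: if `.ssh` or `authorized_keys` is group- or world-writable, key-based login is silently refused.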

developing text processing data products: part I

Folks, this is going to be a series of information pieces (more than one blog post about the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take this moment to organize the discussion on my blog. In the past, I touched upon some of the text… Continue reading developing text processing data products: part I

Setting up Apache Spark: Part II

Now that we have Hadoop YARN, Scala and the other prerequisites set up, we are ready to install Apache Spark. If you are visiting this blog post for the first time, please do have a look at the earlier post Setting up Apache Spark: quick look at the Hadoop, Scala and SBT You definitely need maven,… Continue reading Setting up Apache Spark: Part II
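For reference, building Spark from source with Maven against a specific Hadoop/YARN version looks roughly like this (the Hadoop version shown is just an example; pick the one matching your cluster):

```shell
# From the unpacked Spark source directory: build with YARN support,
# pinned to a particular Hadoop version, skipping tests to save time.
./build/mvn -Pyarn -Dhadoop.version=2.7.3 -DskipTests clean package
```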

Setting up Apache Spark: Quick look at Hadoop, Scala and SBT

We are going to look at installing Spark on Hadoop. Let's set up Hadoop YARN here once again from scratch, this time with screenshots, as I received some comments that my installation posts need more screenshots. In this post, we will look at creating a new user account on… Continue reading Setting up Apache Spark: Quick look at Hadoop, Scala and SBT
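Creating the dedicated account can be sketched as follows (Debian-style commands, run with root privileges; the `hadoop` group and `hduser` names are just examples):

```shell
# Create a dedicated group and user for the Hadoop installation.
sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
# Generate an SSH key for the new user, needed later for
# passwordless SSH between cluster nodes.
sudo -u hduser ssh-keygen -t rsa -N "" -f /home/hduser/.ssh/id_rsa
```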

Apache Spark: data processing engine for cluster computing

May I present Apache Spark, another Apache-licensed top-level project that can perform large-scale data processing much faster than Hadoop (I am referring to MR 1.0 here). This speed is possible due to the Resilient Distributed Dataset (RDD) concept behind it. An RDD is basically a collection of objects, spread across a cluster, stored… Continue reading Apache Spark: data processing engine for cluster computing
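A minimal sketch of the RDD idea, assuming Spark is already installed and `spark-shell` is on the PATH:

```shell
# Feed a short Scala snippet to the interactive Spark shell:
# parallelize turns a local collection into an RDD partitioned
# across the cluster; countByValue aggregates over those partitions.
spark-shell <<'EOF'
val rdd = sc.parallelize(Seq("hadoop", "spark", "spark"))
println(rdd.countByValue())
EOF
```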

Hadoop 2: YARN — Overview and Setup

We have already seen MapReduce. Now, let's dig deep into the new data processing model of Hadoop. Hadoop comes with an advanced resource management tool called YARN, packaged as Hadoop 2.x. What is YARN? YARN stands for Yet Another Resource Negotiator. YARN is also known as the Hadoop Data Operating System -- YARN enables data processing… Continue reading Hadoop 2: YARN — Overview and Setup
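Once Hadoop 2.x is configured, bringing YARN up and checking it can be sketched as (assuming `HADOOP_HOME` is set and `yarn-site.xml` is configured):

```shell
# Start the YARN daemons: the ResourceManager on this node and
# NodeManagers on the workers listed in the slaves file.
$HADOOP_HOME/sbin/start-yarn.sh
# Confirm the worker nodes registered with the ResourceManager.
yarn node -list
```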

Book Review: Optimizing Hadoop for MapReduce

I had a chance to review another book titled “Optimizing Hadoop for MapReduce” and must say this book is a good resource for DevOps professionals who build MapReduce programs in Hadoop. The book is well organized -- it starts off by introducing basic concepts, identifying system bottlenecks and resource weaknesses, and suggesting ways to fix and optimize them, followed… Continue reading Book Review: Optimizing Hadoop for MapReduce

Big Data Logistics: data transfer using Apache Sqoop from RDBMS

Apache Sqoop is a connectivity tool for transferring data between Hadoop and traditional relational databases (RDBMS) that contain structured data. Using Sqoop, one can import data into the Hadoop Distributed File System from an RDBMS like Oracle, Teradata, MySQL, etc., and also export data from Hadoop to any RDBMS in the form of CSV files or direct… Continue reading Big Data Logistics: data transfer using Apache Sqoop from RDBMS
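A hypothetical import invocation (host, database, table, user and target directory are all placeholders, not real values from the post):

```shell
# Import a MySQL table into HDFS as comma-separated text files.
# -P prompts for the database password instead of passing it inline.
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username dbuser -P \
  --table orders \
  --target-dir /user/hadoop/orders \
  --fields-terminated-by ','
```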

Real-time analytics using distributed computing system Storm: Part II

How to set up a Storm cluster? We will look at how to set up a single-node cluster of the Storm project. The following are the prerequisites for setting up: Java 6 or above Python 2.6 Zookeeper ZeroMQ JZMQ any other dependencies (unzip, git, etc...) Zookeeper: the Apache ZooKeeper project gives you a set of tools… Continue reading Real-time analytics using distributed computing system Storm: Part II
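A minimal single-node ZooKeeper setup can be sketched as follows (assuming the release tarball has been unpacked into `$ZK_HOME`; the `dataDir` path is an example):

```shell
# Write a minimal standalone ZooKeeper configuration.
cat > "$ZK_HOME/conf/zoo.cfg" <<'EOF'
tickTime=2000
dataDir=/var/lib/zookeeper
clientPort=2181
EOF
# Start the ZooKeeper server; Storm's nimbus and supervisors
# will coordinate through it on port 2181.
"$ZK_HOME/bin/zkServer.sh" start
```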