
Setting up Apache Mahout

Apache Mahout is a beautiful and scalable machine learning library, built with Maven, for solving large-scale machine learning problems. We use Mahout on an Apache Hadoop cluster, since the primary purpose of using Mahout is to solve large-scale (big data) problems. Refer to the Apache Hadoop website to set up a single-node or multi-node cluster. Here we look at how to install Mahout, assuming Hadoop is already set up and running.

Follow these simple steps:

  • Download Apache Mahout 0.7 (tried and tested, stable version) from the Apache Mahout website.

$ wget http://apache.spinellicreations.com/mahout/0.7/mahout-distribution-0.7.tar.gz

  • Extract the Mahout 0.7 distribution

$ tar -xzvf mahout-distribution-0.7.tar.gz

  • Create a Hadoop configuration bash script (vim hconfig.sh) and add the following, alongside your HADOOP_HOME and JAVA_HOME entries:

export MAHOUT_HOME=/home/hduser/mahout-distribution-0.7/
export MAHOUT_CONF_DIR=/home/hduser/mahout-distribution-0.7/conf
export PATH=$MAHOUT_HOME/bin:$PATH
export CLASSPATH=$MAHOUT_HOME/mahout-core-0.7.jar

  • Source the configuration file so that the paths are loaded into the current shell session

$ . hconfig.sh          (in general: . your-runtime-config.sh)

  • After starting Hadoop (bin/start-all.sh), type ‘mahout’ and you should see output like the following:
[Screenshot: Mahout help screen listing the available programs]
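
As a quick sanity check (a rough sketch, assuming the hconfig.sh above has been sourced and the Hadoop daemons are running), you can ask one of the bundled drivers for its usage and confirm that HDFS is reachable from the same shell:

$ mahout kmeans --help
$ hadoop fs -ls /

If both commands respond without errors, Mahout and Hadoop are talking to each other.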


Apache Hadoop setup

I would recommend Ubuntu 12.04, and never ever (even if you are paid a big bag of money) upgrade it. Whether you prefer Ubuntu, Mint, Debian, Fedora or CentOS, always go for Long Term Support, or LTS as they call it. My personal suggestion.

Now, unfortunately, Hadoop is a distributed computing platform, and not everyone has a few hundred machines at home to get hands-on with; it is rare for people to have more than one desktop or laptop. So it is important to learn how to set up a single-node cluster on your own machine. Follow these simple steps:

$ sudo apt-get update

You need Java: the Hadoop framework is written almost entirely in Java.

$ sudo apt-get install <open jdk or oracle jdk version >= 6>
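
For example, on Ubuntu 12.04 the OpenJDK 7 package works fine (the package name is just an example and may differ on other releases or distributions):

$ sudo apt-get install openjdk-7-jdk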

It would be great to create a separate user account just for Hadoop (or not, if you are super organized).
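
For instance, the following creates the hduser account that appears in the paths later in this post (the group and user names are only conventions; pick your own if you prefer):

$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hduser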

Now we need to configure SSH. SSH creates secure, encrypted connections between machines, and Hadoop’s start/stop scripts use it to log into the nodes of the cluster (on a single-node setup that node is just localhost). Ubuntu usually comes with SSH pre-installed, but just in case:
$ sudo apt-get install ssh
$ ssh-keygen -t rsa (press Enter through the prompts and leave the passphrase empty)
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Once this is done, try the following:

ssh localhost

This step should go through without asking for a password.
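
If it still prompts for a password, the usual culprit is permissions on the key files; something along these lines normally fixes it (assuming the default ~/.ssh location):

$ chmod 700 ~/.ssh
$ chmod 600 ~/.ssh/authorized_keys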

The next step is to download Hadoop. I’d recommend Hadoop 1.2.1, the stable version.

wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz

tar -xzvf hadoop-1.2.1.tar.gz (just press Tab to autocomplete the filename, no need to type it all)

cd into the hadoop-1.2.1 directory, then into the conf directory.

Update the following files with the contents below: core-site.xml, mapred-site.xml, hdfs-site.xml, masters and slaves. For the three XML files, paste the <property> blocks inside the existing <configuration> tags.

core-site.xml:

<property>
<name>hadoop.tmp.dir</name>
<value>/data/hdfstmp</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:54310</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri’s scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri’s authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>

mapred-site.xml

<property>
<name>mapred.job.tracker</name>
<value>localhost:54311</value>
<description>The host and port that the MapReduce job tracker runs
at. If “local”, then jobs are run in-process as a single map
and reduce task.
</description>
</property>

hdfs-site.xml

<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified in create time.
</description>
</property>

masters, slaves: delete everything inside each file and just type ‘localhost’.
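
For example, from inside the conf directory:

$ echo localhost > masters
$ echo localhost > slaves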

Now update the Hadoop environment script, conf/hadoop-env.sh, with the path of your Java home:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
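
If you are not sure where Java landed on your machine, listing the JVM directory shows the installed versions (the path above is for a 32-bit OpenJDK 7 install; yours may differ):

$ ls /usr/lib/jvm/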

Also create a new runtime configuration file, <your-config-file-name>.sh, and add the following paths:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-i386
export HADOOP_HOME=/home/hduser/hadoop-1.2.1
export HADOOP_CONF_DIR=$HADOOP_HOME/conf
export PATH=$JAVA_HOME/bin:$HADOOP_HOME/bin:$PATH
export CLASSPATH=$JAVA_HOME:$HADOOP_HOME/hadoop-core-1.2.1.jar

(Append $PIG_HOME/pig-0.12.0.jar to the CLASSPATH only if you also have Pig installed and PIG_HOME set; it is not needed for this setup.)
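
To load the file into your current shell and check that everything resolves, source it and ask Hadoop for its version (the file name below is just a placeholder):

$ . your-config-file-name.sh
$ hadoop version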

Important notes:
1. I have copy-pasted my configuration here, so watch out for typos or clerical errors.
2. Some books say you need to add this information to your shell startup script (~/.bashrc or similar). I strongly disagree with those books: many of us are not expert Linux users, and it is easy to delete something important in that script (usually it is opened in the vi editor, and if you are not familiar with vi commands you can make a mess), which can break your shell environment or your login.
3. Instead, open a nano editor (nano hadoopconfiguration.sh) and add the lines there. You will have to source this config file in each session, but I highly recommend doing it that way. Stay away from the shell startup scripts.

Now we are ready to launch Hadoop! Follow the steps in the same order.

. hadoopconfiguration.sh

cd into the hadoop-1.2.1 directory, then run bin/start-all.sh

This will print output like the following:

starting namenode, logging to /home/hduser/hadoop-1.2.1/libexec/../logs/hadoop-hduser-namenode-ubuntu.out
localhost: starting datanode, logging to /home/hduser/hadoop-1.2.1/libexec/../logs/hadoop-hduser-datanode-ubuntu.out
localhost: starting secondarynamenode, logging to /home/hduser/hadoop-1.2.1/libexec/../logs/hadoop-hduser-secondarynamenode-ubuntu.out
starting jobtracker, logging to /home/hduser/hadoop-1.2.1/libexec/../logs/hadoop-hduser-jobtracker-ubuntu.out
localhost: starting tasktracker, logging to /home/hduser/hadoop-1.2.1/libexec/../logs/hadoop-hduser-tasktracker-ubuntu.out

When you type jps you should get the following:

5388 NameNode
5746 SecondaryNameNode
6076 Jps
5562 DataNode
6026 TaskTracker
5839 JobTracker

Your Hadoop setup is ready! Congratulations. One note on ordering: the namenode has to be formatted once before the daemons are started for the very first time. So if NameNode is missing from the jps output above, stop the daemons (bin/stop-all.sh), format the namenode, and run bin/start-all.sh again:

hadoop/bin> hadoop namenode -format

Should you wish to terminate your current Hadoop session, type:

bin/stop-all.sh

Hope you find this helpful. Comment here if you face any problems installing.

 

Apache Hadoop — an efficient distributed computing framework

Apache Hadoop is an open source distributed computing framework used mainly for ETL tasks (read: Extract, Transform and Load): storing and processing large amounts of data on clusters of commodity hardware.

I agree that distributed computing has existed for more than a decade, but why has Hadoop gained so much publicity?

  • The entire distributed computing setup can be done on commodity hardware, which means you can set up Hadoop on the very computer you are viewing this blog on (I stand corrected if you are using your phone)
  • Open source and, most importantly, Apache licensed. That means you are welcome to use it for industry, academia or research, and there is no clause requiring you to ask the developers for permission (I am using layman terms here)
  • The need to process large amounts of data has arrived, and traditional computing may not be enough
  • HBase’s NoSQL architecture supersedes Oracle and MySQL in some features (explained later in another blog post)
  • Built and supported by nerds across the globe (just like you!)

Sometimes I wonder whether world peace can only be achieved with an Apache License-like philosophy! 😀

Here’s a question: what looks like one complete file system but is physically stored across different nodes of a cluster?

Distributed File System.

In our context, the Hadoop Distributed File System (HDFS). It provides high throughput and was born from the Google File System and Google MapReduce papers.
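
To get a feel for it, here are a few basic HDFS commands run against the single-node setup above (the file and directory names are only examples):

$ hadoop fs -mkdir /user/hduser/input
$ hadoop fs -put localfile.txt /user/hduser/input
$ hadoop fs -ls /user/hduser/input
$ hadoop fs -cat /user/hduser/input/localfile.txt

The directory looks like one ordinary file system to you, even though the blocks are physically spread across the datanodes.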

Apache Hadoop is a distributed computing framework with HDFS at its base, and it uses the MapReduce paradigm to perform large-scale processing.

MapReduce is nothing but splitting a giant task into many small chunks, sending them to all the nodes in the cluster, doing the necessary processing, and assembling the processed pieces back into one result. All of this happens on top of HDFS in the background. For a technical definition of MapReduce, refer to the books. Map means split, Reduce means combine, that’s all. Of course it is not that simple, but try to look at it from this perspective; it paints a picture of what typically happens in the background.
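
To see this end to end, you can run the word-count example that ships with Hadoop 1.2.1 against the input uploaded above (the input and output paths are only examples, and the output directory must not already exist):

$ hadoop jar $HADOOP_HOME/hadoop-examples-1.2.1.jar wordcount /user/hduser/input /user/hduser/output
$ hadoop fs -cat /user/hduser/output/part-r-00000

The map phase splits the files into words and emits (word, 1) pairs, and the reduce phase combines the counts for each word, which is exactly the split-then-combine picture described above.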

Now, this does not mean we have to operate Hadoop only at the terminal level. There are several tools that can be used within the Hadoop framework:

HBase: A scalable, distributed database that supports structured data storage for large tables.
Hive: A data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout: A Scalable machine learning and data mining library.
Pig: A high-level data-flow language and execution framework for parallel computation.
Spark: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.
ZooKeeper: A high-performance coordination service for distributed applications.

We will look at each of the above in detail with a use case.

Where are we using Hadoop?

Currently, as far as I know, the Hadoop framework is used for large-scale machine learning and data mining, web crawling and text processing, processing huge amounts of relational/tabular data, and soft analytics.