
Setting up Apache Spark: Quick look at Hadoop, Scala and SBT

We are going to look at installing Spark on top of Hadoop. Let's set up Hadoop YARN from scratch once again, this time with screenshots, since I received some comments that my earlier installation guide needed more of them. In this post, we will create a new user account on Ubuntu 14.04 and install the Hadoop 2.5.x stable version.

To create a new user:

sudo passwd

Enter your admin password when prompted to set the root password.

sudo adduser <new user-name>

Enter the requested details.

Now give the new user root (sudo) access:

sudo visudo

add the line

<new user-name> ALL=(ALL:ALL) ALL

If you want to delete the new user later, run the following from an account with sudo privileges (not a guest account):

sudo deluser <new user-name>

For Java:

Oracle JDK is the official one. To install Oracle Java 8, add the WebUpd8 PPA to your package manager's repositories and then do an update; install only after these steps are completed.

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update

sudo apt-get install oracle-java8-installer

The quickest way to set up JAVA_HOME:

sudo update-alternatives --config java

Copy the Java 8 path up to java-8-oracle, for instance:

/usr/lib/jvm/java-8-oracle

sudo nano /etc/environment

add the line (/etc/environment takes plain KEY="value" pairs, so no export keyword and no spaces around the equals sign)

JAVA_HOME="/usr/lib/jvm/java-8-oracle"

source /etc/environment

If you echo $JAVA_HOME, you should see the path.

Setting up passwordless ssh:

Look up my previous posts for an introduction to SSH; here we will jump straight into passwordless SSH with screenshots.

Generate the key pair
[screenshot: ssh1]
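For reference, the step in the screenshot boils down to something like:

ssh-keygen -t rsa -P ""

Accept the default file location (~/.ssh/id_rsa); the empty passphrase is what makes the login passwordless.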

Create the .ssh folder on localhost (if it is not already there) and permanently add the generated public key to the local machine's authorized keys

[screenshot: ssh3]
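Again for reference, the commands behind the screenshot are roughly:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 0600 ~/.ssh/authorized_keys
ssh localhost

The last command should now log you in without prompting for a password.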

That's it. You are done.

Install the Hadoop 2.5 stable version:
wget http://your-path-to-hadoop-2.5x-version/hadoop-2.5.1.tar.gz
tar xzvf hadoop-2.5.1.tar.gz

mv hadoop-2.5.1 hadoop
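You will probably also want HADOOP_HOME on your PATH; something like this in your .bashrc, assuming you extracted into your home directory:

export HADOOP_HOME=$HOME/hadoop
export PATH=$HADOOP_HOME/bin:$HADOOP_HOME/sbin:$PATH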

Create the HDFS directories inside the hadoop folder:

mkdir -p data/namenode
mkdir -p data/datanode

You should have these:

[screenshot: hadoop hdfs directory]

Go to hadoop-env.sh (it lives in etc/hadoop) and update JAVA_HOME, HADOOP_OPTS and HADOOP_COMMON_LIB_NATIVE_DIR.

[screenshot: hadoop-env]
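If the screenshot is hard to read, the lines in question look something like this (assuming HADOOP_HOME is exported as above, and the Java path from earlier):

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="$HADOOP_OPTS -Djava.library.path=$HADOOP_HOME/lib/native"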

Edit core-site.xml and add the following:

[screenshot: core-site]
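For a single-node setup, this is typically just the default filesystem (the host and port here are the usual single-node convention; match them to your screenshot):

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>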

Create a file called "mapred-site.xml" (copying mapred-site.xml.template is the easy way) and add the following:

[screenshot: mapred-site]
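On Hadoop 2.x with YARN this is typically:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>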

Edit hdfs-site.xml and add the following:

[screenshot: hdfs-site]
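With the data directories we created above, this usually carries the replication factor and the storage paths (adjust the absolute paths to wherever your hadoop folder lives):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:/home/<new user-name>/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:/home/<new user-name>/hadoop/data/datanode</value>
  </property>
</configuration>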

Edit yarn-site.xml and add the following:

[screenshot: yarn-site]
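A minimal single-node yarn-site.xml typically just enables the MapReduce shuffle service:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>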

Format the namenode once with bin/hdfs namenode -format. Now, when you run the start-dfs.sh and start-yarn.sh scripts under sbin, you will get the following screen:

[screenshot: hadoop-complete]
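To double-check, run jps; on a single node you should see the HDFS and YARN daemons:

jps

Expect NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself) in the output.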

Install Spark:
To interact with the Hadoop Distributed File System (HDFS), you need a Spark build compiled against the same version of Hadoop as your cluster. Go to http://spark.apache.org/downloads.html, choose the package type "prebuilt for Hadoop 2.4", and download Spark. Note that Spark 1.1.0 uses Scala 2.10.x, so we also need to install Scala.
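Concretely, the download and unpack steps look something like this (the URL below is the Apache archive mirror for that build; use whichever mirror the downloads page hands you):

wget https://archive.apache.org/dist/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz
tar -xvf spark-1.1.0-bin-hadoop2.4.tgz
mv spark-1.1.0-bin-hadoop2.4 spark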

Let's install Scala (a 2.10.x release, to match Spark 1.1.0):

wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
tar -xvf scala-2.10.4.tgz
cd scala-2.10.4
pwd
to get the path

You will probably want to add these to your .bashrc file or equivalent:

export SCALA_HOME=`pwd`
export PATH=`pwd`/bin:$PATH

[screenshot: scala1]

We also need something called sbt. sbt stands for "simple build tool", though to me it seems more complicated than Maven. You can still build with Maven; however, I would suggest getting acquainted with sbt if you are interested in exploring Scala in general.
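As a teaser, a minimal build.sbt for a Spark application would look something like this (the project name and version are just examples; the Scala and Spark versions match what we installed above):

name := "spark-example"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"

Running sbt package in the project root would then produce a jar you can hand to spark-submit.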

More in the next post.