
Mahout Spark Shell: An overview

Mahout is sunsetting its MapReduce algorithm support and moving to more advanced data processing systems that are significantly faster than MapReduce. Today we will look at one of Mahout's newest pieces: the Mahout Scala and Spark bindings package.

If you have had hands-on experience with either R's command line or Julia on Linux, you will pick this new package up pretty quickly. Note: Julia is an open source scientific computing and mathematical optimization platform.

Let's look at how to set up the Mahout Spark shell on Linux without Hadoop. It is simple and straightforward if you follow these steps.

Note: always check out the latest Mahout and Spark versions, or you will end up with java.lang.AbstractMethodError (a version mismatch).

First, let's set up Spark:

wget http://path-to-spark/sparkx.x.x.tgz

Looks simple, but be careful about what you select. I would choose the latest version under Spark releases and "source code" under package type.

Once downloaded, build using sbt. This will take close to an hour.

Second, clone Mahout 1.0 from GitHub:


git clone https://github.com/apache/mahout mahout

and build Mahout using Maven.
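The exact invocation can change between snapshots, but a typical Maven build that skips the tests looks like this:

cd mahout
mvn -DskipTests clean install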

To start the Mahout Spark shell, go to the Spark folder and run sbin/start-all.sh

Obtain the Spark master URL (if you are running on localhost, browse to http://localhost:8080; the master URL at the top of the page looks like spark://your-hostname:7077).

Create a mahoutspark.sh file and type in the following:


export MAHOUT_HOME=your-path/mahout
export SPARK_HOME=your-path/spark
export MASTER=spark://localhost:7077

Save it and run ". mahoutspark.sh", then go into the Mahout directory and run "bin/mahout spark-shell".

You would get the following screen:

[Screenshot: the Mahout Spark shell starting up]
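The shell drops you at a Scala prompt with the Mahout Samsara imports and a distributed context already loaded, so a quick sanity check along these lines should work (the matrix values are just made-up sample data):

val a = dense((1, 2, 3), (3, 4, 5))   // a small in-core matrix
val drmA = drmParallelize(a)          // distribute it as a DRM on Spark
(drmA.t %*% drmA).collect             // compute A'A and bring it back in-core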


Setting up Apache Spark: Part II

Now that we have Hadoop YARN, Scala and the other prerequisites set up, we are ready to install Apache Spark. If you are visiting this blog for the first time, please have a look at the earlier post, Setting up Apache Spark: Quick look at Hadoop, Scala and SBT.

You definitely need Maven, sbt, Scala, Java, ssh and Git (though Git is optional; have it in case you want to fork the project, which didn't work out for me). I am not using my multi-node cluster workspace for this demonstration, so there aren't many screenshots for now, but these instructions should work fine.

To begin, let's download the appropriate Spark binaries (select the one that corresponds to your Hadoop installation from the Spark site):

wget http://www.apache.org/dyn/closer.cgi/spark/spark-1.1.0/spark-1.1.0-bin-hadoop2.4.tgz

Assuming sbt is set up without any problems, run

sbt/sbt package

followed by

sbt/sbt clean compile

Now, let's build Spark using Maven. Allocate the desired amount of memory based on the capacity of the machine you are working with, match the Hadoop version to your installation, and enter the following:

export MAVEN_OPTS="-Xmx1300M -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package

Now, it may not be necessary to do a clean, but I suggest

sbt/sbt clean

followed by

sbt/sbt assembly

This takes time. Once this is over,

sbin/start-all.sh

and finally,

./bin/spark-shell

Voila! The following screen will show up with the Scala command line. And now folks, let's learn Scala!

[Screenshot: the Spark shell with the scala> prompt]
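As a first taste of Scala in the Spark shell, something like the following should run as-is, since the shell already provides a SparkContext named sc (the numbers are just a toy dataset):

val nums = sc.parallelize(1 to 1000)   // distribute a range as an RDD
val evens = nums.filter(_ % 2 == 0)    // keep the even numbers
println(evens.count())                 // should print 500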

Setting up Apache Spark: Quick look at Hadoop, Scala and SBT

We are going to look at installing Spark on Hadoop. Let's set up Hadoop YARN once again from scratch, this time with screenshots, since I received comments that my earlier installation guide needed more of them. In this post, we will create a new user account on Ubuntu 14.04 and install the Hadoop 2.5.x stable version.

To create new user,

sudo passwd

Enter your admin password to set up the root password, then create the user:

sudo adduser <new user-name>

enter the details

Now grant root access to the new user:

sudo visudo

add the line

<new user-name> ALL=(ALL:ALL) ALL

If you want to delete the new user later, run the following from an account with sudo privileges (not a guest account):

sudo deluser <new user-name>

For java:

Oracle JDK is the official one. To install Oracle Java 8, add the Oracle Java 8 PPA to your package manager repositories and then do an update. Install only after these steps are completed:

sudo apt-get install python-software-properties

sudo add-apt-repository ppa:webupd8team/java

sudo apt-get update

sudo apt-get install oracle-java8-installer

The quickest way to set up JAVA_HOME:

sudo update-alternatives --config java

Copy the path of Java 8 up to java-8-oracle, for instance:

/usr/lib/jvm/java-8-oracle

sudo nano /etc/environment

add

export JAVA_HOME="/usr/lib/jvm/java-8-oracle"

source /etc/environment

If you echo $JAVA_HOME, you will see the path.

Setting up passwordless ssh:

Look up my previous post for an introduction to SSH. Here we will jump directly into passwordless SSH, with screenshots.

Generate the key pair
[Screenshot: generating the SSH key pair]

Create the .ssh folder on localhost and permanently add the generated key to the authorized keys on localhost:

[Screenshot: adding the public key to authorized_keys on localhost]
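For reference, the usual single-node sequence is roughly the following (assuming an RSA key and the default ~/.ssh location):

ssh-keygen -t rsa -P ""                            # generate the key pair
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys    # authorize the key on localhost
chmod 600 ~/.ssh/authorized_keys
ssh localhost                                      # should now log in without a password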

That's it. You are done.

Install the Hadoop 2.5.x stable version:
wget http://your-path-to-hadoop-2.5.x-version/hadoop-2.5.1.tar.gz
tar xzvf hadoop-2.5.1.tar.gz

mv hadoop-2.5.1 hadoop

Create the HDFS directories inside the hadoop folder:

mkdir -p data/namenode
mkdir -p data/datanode

You should have these:

[Screenshot: the hadoop folder with data/namenode and data/datanode]

Go to hadoop-env.sh (it is in etc/hadoop) and update the JAVA_HOME path, HADOOP_OPTS and HADOOP_COMMON_LIB_NATIVE_DIR:

 

[Screenshot: hadoop-env.sh]
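The exact values depend on where you unpacked Hadoop, but the edits typically look like this (your-path is whatever your Hadoop directory is):

export JAVA_HOME=/usr/lib/jvm/java-8-oracle
export HADOOP_COMMON_LIB_NATIVE_DIR=your-path/hadoop/lib/native
export HADOOP_OPTS="-Djava.library.path=your-path/hadoop/lib/native"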

Edit core-site.xml and add the following:

[Screenshot: core-site.xml]
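For a single-node setup, a minimal core-site.xml usually just points the default filesystem at the local NameNode; port 9000 below is the usual convention:

<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
</configuration>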

create a file called “mapred-site.xml” and add the following:

[Screenshot: mapred-site.xml]
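The essential mapred-site.xml setting is to tell MapReduce to run on YARN:

<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>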

Edit hdfs-site.xml and add the following:

[Screenshot: hdfs-site.xml]
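A minimal single-node hdfs-site.xml sets the replication factor to 1 and points HDFS at the data/namenode and data/datanode directories created earlier (adjust the paths to your installation):

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///your-path/hadoop/data/namenode</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///your-path/hadoop/data/datanode</value>
  </property>
</configuration>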

Edit yarn-site.xml and add the following:

[Screenshot: yarn-site.xml]
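For yarn-site.xml, the usual minimum is to enable the MapReduce shuffle service:

<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>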

Now, format the NameNode if this is a fresh install (bin/hdfs namenode -format), then run the start-dfs.sh and start-yarn.sh scripts under sbin, and you will get the following screen:

[Screenshot: the HDFS and YARN daemons running]
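A quick way to confirm everything came up is jps:

jps

You should see NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager in the list (plus Jps itself).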

Install spark:
Obtain the latest version of Spark from http://spark.apache.org/downloads.html. To interact with the Hadoop Distributed File System (HDFS), you need a Spark version that is built against the same version of Hadoop as your cluster, so choose the package type prebuilt for Hadoop 2.4 and download Spark. Note that Spark 1.1.0 uses Scala 2.10.x, so we also need to install Scala.

Let's install Scala (remember, we need a 2.10.x release for Spark 1.1.0):

wget http://www.scala-lang.org/files/archive/scala-2.10.4.tgz
tar -xvf scala-2.10.4.tgz
cd scala-2.10.4
pwd
to get the path

You will probably want to add these to your .bashrc file or equivalent:

export SCALA_HOME=`pwd`
export PATH=`pwd`/bin:$PATH

[Screenshot: Scala installed and the environment variables set]

We also need something called sbt. sbt stands for "simple build tool", but to me it seems more complicated than Maven. You can still use Maven to build; however, I would suggest getting acquainted with sbt if you are interested in exploring Scala in general.
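To give a flavour of sbt, a minimal build.sbt for a toy project against this Spark release might look like the following (the project name is made up, and the Spark dependency is only needed if you build standalone applications rather than use the shell):

name := "spark-sandbox"

version := "0.1"

scalaVersion := "2.10.4"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.1.0"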

More on the next post.