Month: October 2014

Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large scale Data Science applications.

Apache Mahout is the earlier project for distributed machine learning and executes on the MapReduce framework, whereas Apache Spark, a comparatively recent project built on Scala, performs distributed large scale machine learning and outperforms MapReduce in computational speed. It also succeeds where MapReduce fails. We even saw how Apache Mahout has been migrating from the MapReduce framework to the Mahout Spark bindings framework for faster data processing.

Let us look at another fast, large scale data processing engine that runs on Hadoop (hence, distributed) and may also have a binding framework with Mahout.

Behold, H2O. H2O, according to its developers, is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is distributed, scalable (of course), able to run on many nodes, and licensed as open source software.

H2O can work with HDFS data, natively supports Java and Scala, and can interact with Hadoop ecosystem tools (Apache Pig, for example). It also works with Apache Spark: Sparkling Water is a package that integrates H2O’s fast, scalable machine learning engine with Spark.

Keep watching this space for more information on H2O as we continue our large scale data processing series.



Mahout Spark Shell: An overview

As we already know, Mahout is sunsetting support for its MapReduce algorithms and moving to more advanced data processing systems that are significantly faster than MapReduce. Today we will look at one of Mahout’s latest systems: the Mahout Scala and Spark bindings package.

If you have hands-on experience with either R’s command line or Julia on Linux, you will pick up this new package pretty quickly. Note: Julia is an open source programming language for scientific and numerical computing.

Let’s look at how to set up the Mahout Spark shell on Linux without Hadoop. It is simple and straightforward if you follow these steps.

Note: Always use the latest versions of both Mahout and Spark; otherwise you may end up with a java.lang.AbstractMethodError (version mismatch).

First, let’s set up Spark:

wget http://path-to-spark/sparkx.x.x.tgz

This looks simple, but be careful about what you select. I would choose the latest version under Spark release and “source code” under package type.

Once downloaded, build using sbt. This will take close to an hour.

Second, clone Mahout 1.0 from GitHub:

git clone mahout

and build Mahout using Maven.
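
The two build steps above can be sketched roughly as follows. This assumes a Spark 1.x source tree, which ships its own sbt launcher at sbt/sbt, and a working Maven install on your PATH. The function names build_spark and build_mahout are hypothetical helpers for illustration, not part of either project, and the directory arguments are wherever you unpacked Spark and cloned Mahout.

```shell
# Hypothetical helpers; pass the directory where you unpacked Spark
# or cloned Mahout as the first argument.

# Build Spark from source using the sbt launcher bundled in the
# Spark 1.x source tree (assumption: the tree contains sbt/sbt).
build_spark() {
  (cd "$1" && sbt/sbt assembly)
}

# Build Mahout with Maven. -DskipTests skips running the test suite,
# which shortens the build considerably.
build_mahout() {
  (cd "$1" && mvn -DskipTests clean install)
}

# Usage (paths are placeholders):
#   build_spark  your-path/spark
#   build_mahout your-path/mahout
```

Each function runs in a subshell, so your current working directory is left unchanged whether the build succeeds or fails.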

To start the Mahout Spark shell, first go to the Spark folder and run sbin/

Obtain the Spark master URL. It is displayed on the master’s web UI (on localhost, that page is http://localhost:8080) and has the form spark://your-host:7077. Copy the exact URL shown there.

Create a file and type in the following:

export MAHOUT_HOME=your-path/mahout
export SPARK_HOME=your-path/spark
export MASTER=spark://your-host:7077

Save it and source it in your shell with “.” followed by the file name, then go into the Mahout directory and run “bin/mahout spark-shell”.
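
If the shell fails to start, a missing variable is a common culprit. The following is a minimal sketch of a launcher that checks the environment first; check_mahout_env and launch_mahout_shell are hypothetical helper names, and the check only verifies that the three variables are set and that the two directories exist.

```shell
# Hypothetical helper: fail early if a required variable is unset
# or if either home directory does not exist.
check_mahout_env() {
  for var in MAHOUT_HOME SPARK_HOME MASTER; do
    eval "value=\${$var:-}"
    if [ -z "$value" ]; then
      echo "missing: $var" >&2
      return 1
    fi
  done
  [ -d "$MAHOUT_HOME" ] && [ -d "$SPARK_HOME" ]
}

# Hypothetical helper: start the shell from inside the Mahout directory,
# but only when the environment check passes.
launch_mahout_shell() {
  check_mahout_env || return 1
  (cd "$MAHOUT_HOME" && bin/mahout spark-shell)
}
```

After sourcing your export file, calling launch_mahout_shell either starts the shell or reports which variable is missing.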

You would get the following screen: