Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large-scale data science applications.

Apache Mahout is the earlier project for distributed machine learning and runs on the MapReduce framework, whereas Apache Spark is a comparatively recent project, built in Scala, that performs distributed large-scale machine learning and outperforms MapReduce in computational speed. It also handles workloads that MapReduce struggles with, such as iterative algorithms that reuse the same data across many passes. We even saw how Apache Mahout has been migrating from the MapReduce framework to the Mahout Spark bindings for faster data processing.

Let us look at another fast, large-scale data processing engine that runs on Hadoop (hence, distributed) and also has a possible binding framework with Mahout.

Behold, H2O. According to its developers, H2O is the world’s fastest in-memory platform for machine learning and predictive analytics on big data. It is distributed, scalable (of course) and licensed as open source software that can run on many nodes.

H2O can work with HDFS data, natively supports Java and Scala, and can interact with Hadoop ecosystem tools (Apache Pig, for example). It also works with Apache Spark: Sparkling Water is a package that integrates H2O's fast, scalable machine learning engine with Spark.

Keep watching this space for more information on H2O as we continue this large-scale data processing series.