H2O and Machine Learning

Working with H2O has been quite an experience so far. Let's look at how to set it up. We can set up H2O as a standalone server, install it in R, or install it on Hadoop. Setting it up as a standalone server is quite simple: download the zip file, unzip h2o-version.zip, cd into the directory, and run java -jar h2o.jar. Go to… Continue reading H2O and Machine Learning

Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large-scale data science applications. Apache Mahout is the earlier project for distributed machine learning and executes on the MapReduce framework, whereas Apache Spark, built on Scala, is a comparatively recent project that performs distributed large scale… Continue reading Introduction to H2O, in-memory large scale data processing

Market Basket Analysis with Mahout

Also known as Affinity Analysis/Frequent Pattern Mining. Finding patterns in huge amounts of customer transactional data is called market basket analysis. This is useful where a store's transactional data is readily available. Using market basket analysis, one can find purchasing patterns. Market basket analysis is also called association rule mining (actually it's the other way around) or affinity… Continue reading Market Basket Analysis with Mahout

Building Recommender Engines with Apache Mahout: Part II

In my previous post we looked at recommendations based on user ratings. Here, we are going to build a recommender engine based on items. Item-based Recommendation: In item-based analysis, items related to a particular item are determined. When a user likes a particular item, items related to that item are recommended. As shown in the figure, if items… Continue reading Building Recommender Engines with Apache Mahout: Part II

Building Recommender Engines with Apache Mahout: Part I

Introduction to recommendation: Recommender systems provide personalized information by learning the user’s interests from traces of interaction with that user. There are two broad types of recommendation: user-based recommendation and item-based recommendation. Recommendation engines aim to show items of interest to a user. Recommendation engines in essence are matching engines that take into account the context of where… Continue reading Building Recommender Engines with Apache Mahout: Part I

Multi-level classification using stochastic gradient descent

Classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An algorithm that implements classification, especially in a concrete implementation, is known as a classifier. There are two types of Classification: Binary… Continue reading Multi-level classification using stochastic gradient descent

Performing document clustering using Apache Mahout k-means

Introduction to k-means clustering: Clustering is all about organizing items from a given collection into groups of similar items. These clusters could be thought of as sets of items similar to each other in some ways but dissimilar from the items belonging to other clusters. Clustering a collection involves three things: An algorithm—This is the method used… Continue reading Performing document clustering using Apache Mahout k-means

Setting up Apache Mahout

Apache Mahout is a beautiful and scalable machine learning library, built with Maven, for solving large-scale machine learning problems. We use Mahout on an Apache Hadoop cluster, as the primary purpose of using Mahout is to solve large-scale (big data) problems. Refer to the Apache Hadoop website to set up a single-node or multi-node cluster.… Continue reading Setting up Apache Mahout