Developing text processing data products: part I

Folks, this is going to be a series of informational pieces (more than one blog post about the same topic) about text processing. In this series, I intend to discuss some of my experiences and also take this moment to organize the discussion on my blog. In the past, I touched upon some text processing recipes purely from an application point of view; however, I have spent much of my career working with text content in automation and analytics, and I owe it to myself to write more on text processing. Without further ado, let's jump into the series.

Text processing is extracting information that can be used for decision making from a text, say a book, paper, or article, WITHOUT HAVING TO READ IT. I used to read the newspaper to find the weather forecast (five years ago), but these days if we need information about something, we type it into a search engine to get results (and ads). But imagine someone who needs information on a day-to-day basis for decision making, where he/she has to read many articles, news stories, or books in a short period of time. Information is power, but timing is important too. Hence the need for text processing data products.

Some basic things that we can do with text data:

Crawling: extracting data/content from a website using a crawler, a program that visits pages and pulls out their content.
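
Once a page is fetched, the crawler needs to strip the markup and keep the visible text. Here is a minimal sketch using Python's standard library `html.parser`; the hardcoded HTML string is a stand-in for a page a real crawler would download (e.g. via `urllib.request`).

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text from an HTML document, skipping script/style."""
    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.chunks.append(data.strip())

# In a real crawler the HTML would come from urllib.request.urlopen(url).read();
# a hardcoded page keeps the sketch self-contained.
html = "<html><head><style>p{color:red}</style></head><body><p>Hello crawler</p></body></html>"
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.chunks)
print(text)  # Hello crawler
```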

Tokenization: the process of splitting a string into smaller pieces (tokens), typically by breaking on the whitespace between words.
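
A quick sketch of the idea: a regex-based tokenizer handles punctuation that a plain split on spaces would leave stuck to the words.

```python
import re

def tokenize(text):
    """Split text into lowercase word tokens; \\w+ drops punctuation that
    a plain str.split() on spaces would leave attached to words."""
    return re.findall(r"\w+", text.lower())

tokens = tokenize("Text processing is fun, isn't it?")
print(tokens)  # ['text', 'processing', 'is', 'fun', 'isn', 't', 'it']
```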

Stemming: reducing a word to a shorter root form, so that a search for the root matches all of its variations.
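
To make this concrete, here is a toy suffix-stripping stemmer. A real system would use an established algorithm such as Porter or Snowball (NLTK ships implementations); this sketch only illustrates the root-matching idea.

```python
def stem(word):
    """Toy suffix-stripping stemmer. A real pipeline would use the Porter
    or Snowball algorithm (e.g. nltk.stem.PorterStemmer); this is illustrative."""
    for suffix in ("ing", "edly", "ed", "es", "s"):
        # Keep at least a 3-letter root so short words aren't mangled.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All three variations reduce to the same root, so a search
# for one of them can match the others.
print([stem(w) for w in ["processing", "processed", "processes"]])
# ['process', 'process', 'process']
```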

Stop word removal: straightforward; removing frequently occurring words like "a", "an", and "the".
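
In code this is just a set-membership filter. The tiny stop word list below is illustrative; real pipelines use larger curated lists (e.g. `nltk.corpus.stopwords`).

```python
# A tiny, illustrative stop word list -- production systems use
# larger curated lists (e.g. nltk.corpus.stopwords).
STOP_WORDS = {"a", "an", "the", "is", "of", "to", "in"}

def remove_stop_words(tokens):
    """Drop tokens that carry little meaning on their own."""
    return [t for t in tokens if t not in STOP_WORDS]

filtered = remove_stop_words(["the", "power", "of", "text", "processing"])
print(filtered)  # ['power', 'text', 'processing']
```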

Parsing: the process of breaking down a sentence into a set of phrases and further breaking each phrase into its parts of speech (noun, verb, adjective, etc.). Closely related to part-of-speech tagging. Cool topic.
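
A toy dictionary-lookup tagger shows the shape of the output. Real part-of-speech taggers are statistical (e.g. `nltk.pos_tag`); the lexicon and tag names here are made up for illustration.

```python
# Purely illustrative lexicon and tag set -- real taggers are statistical
# models trained on annotated corpora (e.g. nltk.pos_tag).
LEXICON = {
    "the": "DET",
    "lazy": "ADJ",
    "cat": "NOUN",
    "mat": "NOUN",
    "sat": "VERB",
    "on": "PREP",
}

def pos_tag(tokens):
    """Attach a part-of-speech label to each token; unknown words get 'UNK'."""
    return [(t, LEXICON.get(t, "UNK")) for t in tokens]

tagged = pos_tag(["the", "lazy", "cat", "sat", "on", "the", "mat"])
print(tagged)
# [('the', 'DET'), ('lazy', 'ADJ'), ('cat', 'NOUN'), ('sat', 'VERB'),
#  ('on', 'PREP'), ('the', 'DET'), ('mat', 'NOUN')]
```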

These are just my experiences and my knowledge, and as always I try to write in a way that anyone can read and understand. More later.

Introduction to H2O, in-memory large scale data processing

So far, in our data science series on DataScience Hacks, we have seen R, Apache Mahout and Apache Spark for large scale Data Science applications.

Apache Mahout is the earlier project: it performs distributed machine learning on the MapReduce framework. Apache Spark, a comparatively recent project built in Scala, also performs large scale distributed machine learning, but outperforms MapReduce in computational speed and succeeds in workloads where MapReduce falls short. We even saw how Apache Mahout has been migrating from the MapReduce framework to the Mahout Spark bindings for faster data processing.

Let us look at another fast, large scale data processing engine that runs on Hadoop (hence, distributed) and also has a possible binding framework with Mahout.

Behold, H2O. H2O, according to its developers, is the world's fastest in-memory platform for machine learning and predictive analytics on big data. It is distributed, scalable (of course), and licensed as open source software that can run on many nodes.

H2O can work with HDFS data, natively supports Java and Scala, and can interact with Hadoop ecosystem tools (like Apache Pig, for example). It also works with Apache Spark: Sparkling Water is a package that integrates H2O's fast, scalable machine learning engine with Spark.

Keep watching this space for more information on H2O in this continuation of the large scale data processing series.