Book Reviews

Book Review: Optimizing Hadoop for MapReduce

I had a chance to review another book, titled “Optimizing Hadoop for MapReduce”, and must say this book is a good resource for DevOps professionals who build MapReduce programs in Hadoop. The book is well organized: it starts by introducing basic concepts, moves on to identifying system bottlenecks and resource weaknesses and suggesting ways to fix and optimize them, and closes with Hadoop best practices and recommendations. Though packed with advanced concepts and information on Hadoop architecture, the author's writing is such that it could appeal to all types of audience (from novice to expert), with helpful hints in each chapter.

The first chapter on MapReduce is written for people who are new to this paradigm. It contains pictorial representations of how the “low-level” MapReduce process works. It is easy to misunderstand the low-level MapReduce process, and this chapter will clarify that.
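For readers completely new to the paradigm, the low-level flow that such diagrams illustrate (map, then shuffle/sort, then reduce) can be simulated in a few lines of plain Python. This is a local sketch for intuition only, not Hadoop code; in a real cluster the framework performs the shuffle and runs each phase in parallel across nodes:

```python
from collections import defaultdict

def map_phase(lines):
    # map: emit a (word, 1) pair for every word in every input line
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # shuffle/sort: group all emitted values by key,
    # which is what the framework does between map and reduce
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # reduce: aggregate the grouped values for each key
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["the quick brown fox", "the lazy dog"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Here `counts["the"]` ends up as 2 because the shuffle brings both `("the", 1)` pairs together before the reduce sums them.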

The second chapter discusses performance tuning parameters, such as allocating map and reduce tasks based on the number of cores in the Hadoop cluster. It also surveys widely used cluster management tools such as Ambari and Chukwa.
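In classic (MRv1) Hadoop, that per-node allocation is controlled in `mapred-site.xml`; the values below are illustrative for an 8-core slave node and would need to be sized against your own hardware:

```xml
<!-- mapred-site.xml: per-TaskTracker slot limits (MRv1-era properties) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>  <!-- concurrent map tasks on this node -->
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>2</value>  <!-- concurrent reduce tasks on this node -->
</property>
```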

The third and fourth chapters discuss identifying system bottlenecks and resource weaknesses, respectively. The author takes an organized approach, introducing a performance tuning process cycle and demystifying how each major component of a given Hadoop cluster (CPU, RAM, storage, and network bandwidth) could become a bottleneck and how to eliminate it. In the fourth chapter, I particularly liked the formulas that can be used when planning a Hadoop cluster, demonstrated with examples.
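To give a flavor of that style of calculation, here is a sketch of one common rule of thumb for sizing task slots per node. This is my own illustration, not a formula taken from the book: reserve a couple of cores for the OS and Hadoop daemons, then split the remainder roughly 2:1 between map and reduce slots:

```python
def slots_per_node(cores, reserved=2, map_fraction=2/3):
    # Illustrative rule of thumb (not the book's formula):
    # leave `reserved` cores for the OS and Hadoop daemons,
    # then divide the rest between map and reduce slots.
    usable = max(cores - reserved, 1)
    map_slots = max(round(usable * map_fraction), 1)
    reduce_slots = max(usable - map_slots, 1)
    return map_slots, reduce_slots

print(slots_per_node(8))   # 8-core node -> (4, 2)
print(slots_per_node(16))  # 16-core node -> (9, 5)
```

The point of such formulas, as the book shows with its own examples, is that slot counts should be derived from the hardware rather than left at defaults.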

The remaining three chapters focus on enhancing and optimizing map and reduce tasks, followed by best practices and recommendations. The author introduces performance metrics for map and reduce tasks, suggests ways to enhance them, and covers fine-tuning parameters to improve the performance of a MapReduce job. The final chapter on best practices is packed with valuable information on hardware tuning for optimal performance of the Hadoop cluster, along with Hadoop best practices.

A few minor points here and there should be read with caution. For instance, in the first chapter the author says each slave is called a task tracker; it would have been better to say that it assumes the responsibilities of a task tracker, while in general it is actually called a data node. That is just my suggestion. In short, this book is a compilation of MapReduce performance-related issues, along with ideas on troubleshooting and optimizing them, including best practices. A must-have book, especially for Hadoop administrators and developers. This book is available at Packt Publishing.

Book Review: Apache Mahout Cookbook

Quick summary:
Very well written for developers who are new to both Mahout and machine learning, with walk-throughs and screenshots. However, if you have experience writing heuristics or have expertise in machine learning, you can skip this book. Concise and to the point, though with a few clerical errors and typos. This book certainly makes a wonderful academic companion for anyone planning to use Mahout in an academic research project.

Detailed Review:
When I was asked to review this book, I was skeptical because the TSP recipe it includes is no longer supported by Mahout. A technical cookbook should have real-world use cases, and here was a recipe that cannot be practically implemented, which misrepresents Mahout's capabilities. However, when I read this book from chapter 1, it was written so well that anyone can understand setting up and working with Mahout. Caveat: you should have some knowledge of software development and Java programming.

I disagree with comments that most of the recipes in this book can be obtained by a Google search. The book carefully explains a given concept with output screenshots and also provides a walkthrough on how to implement it in NetBeans. I was glad to see the author using NetBeans; I personally support that, and it is easy to work with. Recipes like importing/exporting data between HDFS and an RDBMS, and spectral clustering, are a highlight. The author does not assume the reader is familiar with MySQL, so there is a walkthrough on installing it. Topic modeling and pattern mining are good to see.

There is an entire chapter on a classification walkthrough (for binary and multi-level classification) in Mahout, for which there are plenty of tutorials available on the web, and it is well written in MiA. The same goes for k-means. Also, based on discussions with the developers, it is fairly conclusive that the MapReduce version of genetic programming may not see the light of day in a future Mahout release. My personal recommendation is not to get too involved with chapter 10. Also, the TSP example is basically a sample and not a real-life one. For those who want to learn more, I would suggest looking up the Watchmaker project. Instead of the outdated TSP demo, I would have liked to see a Hidden Markov Model case study, even though it is only partially parallelized.

I personally would like to see a second edition with more in-depth recipes where data is extracted and cleansed using Pig/Hive, then fed to Mahout to produce meaningful results. I would like to see detailed coverage of building recommendation engines, building a fraud detection engine based on large amounts of data transformed using Pig, and finding hidden patterns where Hadoop ecosystem tools are put to use. The author's choice of preferred NoSQL database in a Mahout context would also be good to see.

You can buy this book at Packt Publishing.