Book Review: Optimizing Hadoop for MapReduce

I had a chance to review another book, “Optimizing Hadoop for MapReduce”, and must say it is a good resource for devops professionals who build MapReduce programs in Hadoop. The book is well organized: it starts by introducing basic concepts, moves on to identifying system bottlenecks and resource weaknesses, suggests ways to fix and optimize them, and closes with Hadoop best practices and recommendations. Though packed with advanced concepts and information on Hadoop architecture, the author's writing should appeal to every kind of reader, from novice to expert, with helpful hints in each chapter.

The first chapter, on MapReduce, is written for people who are new to this paradigm. It contains pictorial representations of how “low-level” MapReduce works. It is easy to misunderstand the low-level MapReduce process, and this chapter clears that up.

The second chapter discusses performance tuning parameters, such as allocating map and reduce tasks based on the number of cores available in the Hadoop cluster. It also suggests widely used cluster management and monitoring tools such as Ambari and Chukwa.
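To give a feel for the kind of core-based sizing the chapter talks about, here is a minimal sketch. It is not the book's formula: the function name `suggest_task_slots`, the two reserved cores, and the roughly two-thirds map share are my own illustrative assumptions. Only the two property names are real Hadoop 1.x (MRv1) TaskTracker settings.

```python
def suggest_task_slots(cores_per_node, reserved_cores=2):
    """Rough per-node slot sizing (hypothetical rule of thumb, not from the book).

    reserved_cores: cores assumed to be left for the DataNode and TaskTracker
    daemons and the operating system (an assumption, tune for your cluster).
    """
    usable = cores_per_node - reserved_cores
    if usable < 2:
        raise ValueError("need at least 2 usable cores for one map and one reduce slot")

    # Assumption: give about two thirds of the usable cores to map slots,
    # and the remainder to reduce slots (integer arithmetic, rounded down).
    map_slots = max(1, (usable * 2) // 3)
    reduce_slots = usable - map_slots

    # Keys are the MRv1 TaskTracker properties set in mapred-site.xml.
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": reduce_slots,
    }


if __name__ == "__main__":
    for name, value in suggest_task_slots(8).items():
        print(f"{name} = {value}")
```

On an 8-core node this suggests 4 map slots and 2 reduce slots. Treat the numbers as a starting point to measure against, since the right split depends on memory, disk, and the jobs being run.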

The third and fourth chapters discuss identifying system bottlenecks and resource weaknesses, respectively. The author takes an organized approach, introducing a performance tuning process cycle and demystifying how each major component of a Hadoop cluster (CPU, RAM, storage and network bandwidth) can become a bottleneck and how to eliminate it. In the fourth chapter, I particularly liked the formulas for planning a Hadoop cluster, each demonstrated with examples.

The remaining three chapters focus on enhancing and optimizing map and reduce tasks, and on best practices and recommendations. The author introduces performance metrics for map and reduce tasks, suggests ways to enhance those tasks, and covers the parameters to fine-tune in order to improve the performance of a MapReduce job. The final chapter, on best practices, is packed with valuable information on hardware tuning for optimal cluster performance and on Hadoop best practices in general.

A few minor points here and there should be read with caution. For instance, in the first chapter the author says each slave is called a task tracker. It would have been better to say that the slave assumes the responsibilities of a task tracker, since in general it is actually called a data node. That is just my suggestion. In short, this book is a compilation of MapReduce performance issues and of ideas for troubleshooting and optimizing that performance, including best practices. It is a must-have book, especially for Hadoop administrators and developers. This book is available at packtpub.