Our world is being revolutionized by data-driven methods: access to large amounts of data has generated new insights and opened exciting new opportunities in commerce, science, and computing applications. Processing the enormous quantities of data necessary for these advances requires large clusters, making distributed computing paradigms more crucial than ever. MapReduce is a programming model for expressing distributed computations on massive datasets and an execution framework for large-scale data processing on clusters of commodity servers. The programming model provides an easy-to-understand abstraction for designing scalable algorithms, while the execution framework transparently handles many system-level details, ranging from scheduling to synchronization to fault tolerance. This book focuses on MapReduce algorithm design, with an emphasis on text processing algorithms common in natural language processing, information retrieval, and machine learning. We introduce the notion of MapReduce design patterns, which represent general reusable solutions to commonly occurring problems across a variety of problem domains. This book not only intends to help the reader "think in MapReduce", but also discusses limitations of the programming model.
Table of Contents: Introduction / MapReduce Basics / MapReduce Algorithm Design / Inverted Indexing for Text Retrieval / Graph Algorithms / EM Algorithms for Text Processing / Closing Remarks
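As a rough illustration of the programming model the description refers to, here is a minimal single-process word count sketch. It is not taken from the book and does not use Hadoop; the names `map_fn`, `reduce_fn`, and `run_mapreduce` are hypothetical, and the in-memory grouping step only stands in for the shuffle that a real execution framework would perform across a cluster.

```python
from collections import defaultdict


def map_fn(doc_id, text):
    """Mapper (hypothetical name): emit a (word, 1) pair for every token."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word, counts):
    """Reducer (hypothetical name): sum the partial counts for one word."""
    return word, sum(counts)


def run_mapreduce(documents):
    """Simulate map, shuffle, and reduce in one process (illustration only)."""
    # "Shuffle": group all intermediate values by their key.
    groups = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            groups[key].append(value)
    # Reduce: one call per distinct key.
    return dict(reduce_fn(key, values) for key, values in groups.items())


docs = {"d1": "a rose is a rose", "d2": "a daisy"}
counts = run_mapreduce(docs)
print(counts)
```

The programmer writes only the mapper and the reducer; in a real deployment, the grouping, scheduling, and fault tolerance are the execution framework's job.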
This is a very concise text on the given topic. In the first half, it breaks down the basic structure of the MapReduce algorithm, often with a short glance at implementations in Apache Hadoop or at Google. This first part already contains some simple yet practical examples. The second half elaborates on some mathematically more complex problems, which are explained theoretically rather than practically. The main focus is EM (expectation maximization) on hidden Markov models, and although that topic uses some advanced mathematical notation, the presentation is still clear and easy to follow. The book is rounded off with a few hints on what MapReduce cannot do. If you want to understand the concepts first before deciding on an implementation, this is a good book for you.
Great book on developing algorithms for map/reduce-based solutions (Hadoop is mentioned, but is not required to understand the book). It describes how to adapt algorithms to map/reduce for different tasks, such as graph processing and machine learning, and covers common questions of map/reduce design, including performance optimization. If you read this book, you should also look at the Cloud9 library.
Some really good examples, but also sidetracks (about half of the book in total) into relatively complex topics that are only loosely related to text processing.