Data-Intensive Text Processing with MapReduce
January 7, 2015 at 9:22:25 AM GMT+1
MapReduce  is a programming model for expressing distributed computations on
massive amounts of data and an execution framework for large-scale data processing
on clusters of commodity servers. It was originally developed by Google and built on
well-known principles in parallel and distributed processing dating back several decades.
MapReduce has since enjoyed widespread adoption via an open-source implementation
called Hadoop, whose development was led by Yahoo (now an Apache project). Today,
a vibrant software ecosystem has sprung up around Hadoop, with signicant activity
in both industry and academia.
This book is about scalable approaches to processing large amounts of text with
MapReduce. Given this focus, it makes sense to start with the most basic question:
Why? There are many answers to this question, but we focus on two. First, \big data"
is a fact of the world, and therefore an issue that real-world systems must grapple with.
Second, across a wide range of text processing applications, more data translates into
more effective algorithms, and thus it makes sense to take advantage of the plentiful
amounts of data that surround us.