MapReduce: Simplified Data Processing On Large Clusters
Jeffrey Dean and Sanjay Ghemawat
Google, Inc.
Figure 1: Execution overview. The master (2) assigns map tasks (one per input split 0-4) and reduce tasks to workers; map workers (3) read their splits and (4) write intermediate data to local disk, while reduce workers (5) read that data remotely and (6) write output file 0 and output file 1.
Inverted Index: The map function parses each document, and emits a sequence of ⟨word, document ID⟩ pairs. The reduce function accepts all pairs for a given word, sorts the corresponding document IDs and emits a ⟨word, list(document ID)⟩ pair. The set of all output pairs forms a simple inverted index. It is easy to augment this computation to keep track of word positions.
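For concreteness, the inverted-index map and reduce steps can be sketched as a small stand-alone C++ program. This is an illustrative sketch only, not the MapReduce library interface; the function names, the whitespace tokenizer, and the in-memory grouping step (standing in for the shuffle phase) are ours.

#include <map>
#include <set>
#include <sstream>
#include <string>
#include <utility>
#include <vector>
using namespace std;

// Map: parse one document and emit a <word, document ID> pair per word.
void IndexMap(const string& docid, const string& contents,
              vector<pair<string, string>>* out) {
  istringstream words(contents);
  string w;
  while (words >> w) out->push_back({w, docid});
}

// Reduce: for a single word, sort and de-duplicate its document IDs.
vector<string> IndexReduce(const vector<string>& docids) {
  set<string> sorted_ids(docids.begin(), docids.end());
  return vector<string>(sorted_ids.begin(), sorted_ids.end());
}

int main() {
  vector<pair<string, string>> intermediate;
  IndexMap("doc1", "the quick brown fox", &intermediate);
  IndexMap("doc2", "the lazy dog", &intermediate);

  // Group the intermediate pairs by word; this grouping is the role played
  // by the MapReduce library's shuffle phase between map and reduce.
  map<string, vector<string>> grouped;
  for (const auto& kv : intermediate) grouped[kv.first].push_back(kv.second);

  // The collection of <word, list(document ID)> entries is the inverted index.
  map<string, vector<string>> index;
  for (const auto& g : grouped) index[g.first] = IndexReduce(g.second);
  return 0;
}

Under the real library, these two functions would instead be written against the Mapper and Reducer classes shown in Appendix A, and the grouping would be performed by the framework across machines.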
Distributed Sort: The map function extracts the key from each record, and emits a ⟨key, record⟩ pair. The reduce function emits all pairs unchanged. This computation depends on the partitioning facilities described in Section 4.1 and the ordering properties described in Section 4.2.

3 Implementation

Many different implementations of the MapReduce interface are possible. The right choice depends on the environment. For example, one implementation may be suitable for a small shared-memory machine, another for a large NUMA multi-processor, and yet another for an even larger collection of networked machines.

This section describes an implementation targeted to the computing environment in wide use at Google: large clusters of commodity PCs connected together with switched Ethernet [4]. In our environment:

(1) Machines are typically dual-processor x86 machines running Linux, with 2-4 GB of memory per machine.

(2) Commodity networking hardware is used – typically either 100 megabits/second or 1 gigabit/second at the machine level, but averaging considerably less in overall bisection bandwidth.

(3) A cluster consists of hundreds or thousands of machines, and therefore machine failures are common.

(4) Storage is provided by inexpensive IDE disks attached directly to individual machines. A distributed file system [8] developed in-house is used to manage the data stored on these disks. The file system uses replication to provide availability and reliability on top of unreliable hardware.

(5) Users submit jobs to a scheduling system. Each job consists of a set of tasks, and is mapped by the scheduler to a set of available machines within a cluster.

3.1 Execution Overview

The Map invocations are distributed across multiple machines by automatically partitioning the input data into a set of M splits.
Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");
The counter values from individual worker machines are periodically propagated to the master (piggybacked on the ping response). The master aggregates the counter values from successful map and reduce tasks and returns them to the user code when the MapReduce operation is completed. The current counter values are also displayed on the master status page so that a human can watch the progress of the live computation. When aggregating counter values, the master eliminates the effects of duplicate executions of the same map or reduce task to avoid double counting. (Duplicate executions can arise from our use of backup tasks and from re-execution of tasks due to failures.)

Some counter values are automatically maintained by the MapReduce library, such as the number of input key/value pairs processed and the number of output key/value pairs produced.

Users have found the counter facility useful for sanity checking the behavior of MapReduce operations. For example, in some MapReduce operations, the user code may want to ensure that the number of output pairs produced exactly equals the number of input pairs processed, or that the fraction of German documents processed is within some tolerable fraction of the total number of documents processed.
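To make the de-duplication step concrete, the following stand-alone C++ sketch shows one way a master could fold counter updates into global totals while counting each task only once. The types and field names here are invented for illustration and are not the library's internal data structures.

#include <cstdint>
#include <map>
#include <set>
#include <string>
using namespace std;

// Illustrative only: the counter updates reported by one successful
// execution (attempt) of a map or reduce task.
struct TaskCounters {
  int task_id;                      // index of the map or reduce task
  map<string, int64_t> counters;    // e.g. {"uppercase": 3121}
};

class CounterAggregator {
 public:
  // Fold in one successful attempt. Attempts for a task that has already
  // been counted are ignored, so duplicate executions (backup tasks,
  // re-execution after failures) do not inflate the totals.
  void AddSuccessfulAttempt(const TaskCounters& report) {
    if (!counted_tasks_.insert(report.task_id).second) return;
    for (const auto& kv : report.counters) totals_[kv.first] += kv.second;
  }

  // Aggregate values returned to user code when the operation completes.
  const map<string, int64_t>& totals() const { return totals_; }

 private:
  set<int> counted_tasks_;
  map<string, int64_t> totals_;
};

Keying the de-duplication on the task rather than on the machine matters because the same task may legitimately run on several machines.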
5 Performance

In this section we measure the performance of MapReduce on two computations running on a large cluster of machines. One computation searches through approximately one terabyte of data looking for a particular pattern. The other computation sorts approximately one terabyte of data.

These two programs are representative of a large subset of the real programs written by users of MapReduce – one class of programs shuffles data from one representation to another, and another class extracts a small amount of interesting data from a large data set.

5.1 Cluster Configuration

All of the programs were executed on a cluster that consisted of approximately 1800 machines. Each machine had two 2GHz Intel Xeon processors with Hyper-Threading enabled, 4GB of memory, two 160GB IDE disks, and a gigabit Ethernet link. The machines were arranged in a two-level tree-shaped switched network with approximately 100-200 Gbps of aggregate bandwidth available at the root. All of the machines were in the same hosting facility and therefore the round-trip time between any pair of machines was less than a millisecond.

Out of the 4GB of memory, approximately 1-1.5GB was reserved by other tasks running on the cluster. The programs were executed on a weekend afternoon, when the CPUs, disks, and network were mostly idle.

5.2 Grep

The grep program scans through 10^10 100-byte records, searching for a relatively rare three-character pattern (the pattern occurs in 92,337 records). The input is split into approximately 64MB pieces (M = 15000), and the entire output is placed in one file (R = 1).

Figure 2: Data transfer rate over time (input rate in MB/s versus elapsed seconds)

Figure 2 shows the progress of the computation over time. The Y-axis shows the rate at which the input data is scanned. The rate gradually picks up as more machines are assigned to this MapReduce computation, and peaks at over 30 GB/s when 1764 workers have been assigned. As the map tasks finish, the rate starts dropping and hits zero about 80 seconds into the computation. The entire computation takes approximately 150 seconds from start to finish. This includes about a minute of startup overhead. The overhead is due to the propagation of the program to all worker machines, and delays interacting with GFS to open the set of 1000 input files and to get the information needed for the locality optimization.
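The grep user code itself is not shown in this paper, but a plausible mapper, written against the Mapper/MapInput interface that appears in Appendix A, might look like the sketch below; the pattern constant and the matching logic are purely illustrative.

#include "mapreduce/mapreduce.h"

// Illustrative grep mapper: emit only the records that contain the pattern.
class GrepMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& record = input.value();      // one 100-byte record
    static const string kPattern = "xyz";      // stand-in for the rare
                                               // three-character pattern
    if (record.find(kPattern) != string::npos)
      Emit(record, "");
  }
};
REGISTER_MAPPER(GrepMapper);

With R = 1 the reduce phase can simply pass its input through, so all 92,337 matching records land in a single output file.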
5.3 Sort

The sort program sorts 10^10 100-byte records (approximately 1 terabyte of data). This program is modeled after the TeraSort benchmark [10].
Figure 3: Data transfer rates over time for different executions of the sort program: (a) normal execution, (b) no backup tasks, (c) 200 tasks killed. Each column plots input, shuffle, and output rates (MB/s) against elapsed seconds.
The sorting program consists of less than 50 lines of user code. A three-line Map function extracts a 10-byte sorting key from a text line and emits the key and the original text line as the intermediate key/value pair. We used a built-in Identity function as the Reduce operator. This function passes the intermediate key/value pair unchanged as the output key/value pair. The final sorted output is written to a set of 2-way replicated GFS files (i.e., 2 terabytes are written as the output of the program).
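Written against the Appendix A interface, the Map function just described might look like the following; this is a reconstruction for illustration, not the benchmark's actual source.

#include "mapreduce/mapreduce.h"

// Illustrative version of the sort benchmark's map function: emit the
// 10-byte sorting key, with the original text line as the value.
class SortMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& line = input.value();
    Emit(line.substr(0, 10), line);
  }
};
REGISTER_MAPPER(SortMapper);

// The reduce side is the library's built-in Identity function, so each
// intermediate <key, value> pair is written out unchanged; the sorted order
// within each output partition comes from the ordering guarantee of
// Section 4.2.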
As before, the input data is split into 64MB pieces (M = 15000). We partition the sorted output into 4000 files (R = 4000). The partitioning function uses the initial bytes of the key to segregate it into one of R pieces.

Our partitioning function for this benchmark has built-in knowledge of the distribution of keys. In a general sorting program, we would add a pre-pass MapReduce operation that would collect a sample of the keys and use the distribution of the sampled keys to compute split-points for the final sorting pass.
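That sampling idea can be sketched in stand-alone C++, independently of the library's actual partitioning hook: a pre-pass collects a key sample, evenly spaced quantiles of the sorted sample become split points, and each key is then routed to a partition by binary search. The function names are ours, and the sketch assumes a non-empty sample.

#include <algorithm>
#include <string>
#include <vector>
using namespace std;

// Choose R-1 split points as evenly spaced quantiles of a sampled key set.
vector<string> ComputeSplitPoints(vector<string> sample, int R) {
  sort(sample.begin(), sample.end());
  vector<string> splits;
  for (int i = 1; i < R; i++)
    splits.push_back(sample[i * sample.size() / R]);
  return splits;
}

// Route a key to one of R partitions: partition i receives keys k with
// splits[i-1] <= k < splits[i], so concatenating the R output files in
// order yields a globally sorted result.
int RangePartition(const string& key, const vector<string>& splits) {
  return upper_bound(splits.begin(), splits.end(), key) - splits.begin();
}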
Figure 3 (a) shows the progress of a normal execution of the sort program. The top-left graph shows the rate at which input is read. The rate peaks at about 13 GB/s and dies off fairly quickly since all map tasks finish before 200 seconds have elapsed. Note that the input rate is less than for grep. This is because the sort map tasks spend about half their time and I/O bandwidth writing intermediate output to their local disks. The corresponding intermediate output for grep had negligible size.

The middle-left graph shows the rate at which data is sent over the network from the map tasks to the reduce tasks. This shuffling starts as soon as the first map task completes. The first hump in the graph is for the first batch of approximately 1700 reduce tasks (the entire MapReduce was assigned about 1700 machines, and each machine executes at most one reduce task at a time). Roughly 300 seconds into the computation, some of the first batch of reduce tasks finish and we start shuffling data for the remaining reduce tasks. All of the shuffling is done about 600 seconds into the computation.

The bottom-left graph shows the rate at which sorted data is written to the final output files by the reduce tasks. There is a delay between the end of the first shuffling period and the start of the writing period because the machines are busy sorting the intermediate data. The writes continue at a rate of about 2-4 GB/s for a while. All of the writes finish about 850 seconds into the computation. Including startup overhead, the entire computation takes 891 seconds. This is similar to the current best reported result of 1057 seconds for the TeraSort benchmark [18].

A few things to note: the input rate is higher than the shuffle rate and the output rate because of our locality optimization – most data is read from a local disk and bypasses our relatively bandwidth-constrained network. The shuffle rate is higher than the output rate because the output phase writes two copies of the sorted data (we make two replicas of the output for reliability and availability reasons). We write two replicas because that is the mechanism for reliability and availability provided by our underlying file system. Network bandwidth requirements for writing data would be reduced if the underlying file system used erasure coding [14] rather than replication.
5.5 Machine Failures

In Figure 3 (c), we show an execution of the sort program where we intentionally killed 200 out of 1746 worker processes several minutes into the computation. The underlying cluster scheduler immediately restarted new worker processes on these machines (since only the processes were killed, the machines were still functioning properly).

The worker deaths show up as a negative input rate since some previously completed map work disappears (since the corresponding map workers were killed) and needs to be redone. The re-execution of this map work happens relatively quickly. The entire computation finishes in 933 seconds including startup overhead (just an increase of 5% over the normal execution time).
6 Experience

We wrote the first version of the MapReduce library in February of 2003, and made significant enhancements to it in August of 2003, including the locality optimization, dynamic load balancing of task execution across worker machines, etc. Since that time, we have been pleasantly surprised at how broadly applicable the MapReduce library has been for the kinds of problems we work on. It has been used across a wide range of domains within Google, including:

• large-scale machine learning problems,

• clustering problems for the Google News and Froogle products,

• extraction of data used to produce reports of popular queries (e.g. Google Zeitgeist),

• extraction of properties of web pages for new experiments and products (e.g. extraction of geographical locations from a large corpus of web pages for localized search), and

• large-scale graph computations.

Figure 4: MapReduce instances over time (new MapReduce programs checked in, 2003/03 through 2004/09)

Figure 4 shows the significant growth in the number of separate MapReduce programs checked into our primary source code management system over time, from 0 in early 2003 to almost 900 separate instances as of late September 2004. MapReduce has been so successful because it makes it possible to write a simple program and run it efficiently on a thousand machines in the course of half an hour, greatly speeding up the development and prototyping cycle. Furthermore, it allows programmers who have no experience with distributed and/or parallel systems to exploit large amounts of resources easily.

At the end of each job, the MapReduce library logs statistics about the computational resources used by the job. In Table 1, we show some statistics for a subset of MapReduce jobs run at Google in August 2004.

Number of jobs                      29,423
Average job completion time        634 secs
Machine days used                  79,186 days
Input data read                    3,288 TB
Intermediate data produced         758 TB
Output data written                193 TB
Average worker machines per job    157
Average worker deaths per job      1.2
Average map tasks per job          3,351
Average reduce tasks per job       55
Unique map implementations         395
Unique reduce implementations      269
Unique map/reduce combinations     426

Table 1: MapReduce jobs run in August 2004

6.1 Large-Scale Indexing

One of our most significant uses of MapReduce to date has been a complete rewrite of the production indexing system that produces the data structures used for the Google web search service.
[15] Erik Riedel, Christos Faloutsos, Garth A. Gibson, and David Nagle. Active disks for large-scale data processing. IEEE Computer, pages 68–74, June 2001.

[16] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 2004.

[17] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8):103–111, 1990.

[18] Jim Wyllie. Spsort: How to sort a terabyte quickly. http://alme1.almaden.ibm.com/cs/spsort.pdf.

A Word Frequency

This section contains a program that counts the number of occurrences of each unique word in a set of input files specified on the command line.

#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;

      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i-start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

// User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }

    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);

  MapReduceSpecification spec;

  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }

  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  //   ...
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");

  // Optional: do partial sums within map
  // tasks to save network bandwidth
  out->set_combiner_class("Adder");

  // Tuning parameters: use at most 2000
  // machines and 100 MB of memory per task
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);

  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();

  // Done: 'result' structure contains info
  // about counters, time taken, number of
  // machines used, etc.

  return 0;
}