Hadoop (MapReduce)


Hadoop Framework

(MapReduce)
The Outline
1 What is MapReduce?
2 Examples of MapReduce use cases
3 The map and reduce functions
4 Architecture of MapReduce
5 MapReduce workflow
6 Execution of a MapReduce job: YARN / Hadoop v1
7 More examples
Let’s explore: MAPREDUCE

Hadoop general architecture (2.0)

[Figure: the main Hadoop 2.0 components — HDFS (the storage unit of Hadoop), YARN (the resource manager), and MapReduce (the processing unit of Hadoop).]
What is MapReduce?
 It is the processing unit of Hadoop.
 It was introduced by Google in 2004.
 It is a programming model in which huge volumes of data are processed in a parallel and distributed manner.
 The objective of this module is to reduce data into information that people can understand and that is valuable.

[Figure: Big Data → MapReduce (the processing unit of Hadoop) → Output.]
MapReduce Analogy: example of counting election votes

[Figures: illustration of the election vote-counting analogy.]
MapReduce vs the traditional approach

[Figure: in the traditional approach, data is processed at the master node; in the MapReduce approach, data is processed at the slave nodes.]
What is MapReduce?
 It was designed in the early 2000s by engineers at Google Research.
 Within this module, processing is done at the slave nodes and the final output is sent to the master node.
 This overcomes the traditional approach shown previously, in which processing is done at the master node.
 In that approach, a large amount of bandwidth is consumed transferring the needed data from the DataNodes to the program running on the master node.
 In MapReduce, parallel processing is achieved because jobs are processed in parallel by the slave nodes.
What is MapReduce?
 It is an illustration of applying functional programming on a distributed cluster: several functions work independently on different chunks of data, and the results are then aggregated.
 MapReduce is the processing engine of Hadoop that processes and computes large volumes of data (terabytes, petabytes). This processing unit implements a mapping and a reducing mechanism that execute in parallel. Here is the global overview:

[Figure: MapReduce, the processing unit of Hadoop, consists of two user-defined pieces of code: Map and Reduce.]
Applications of MapReduce?
MapReduce is commonly preferred for the following applications:
 Searching, sorting, grouping.
 Simple statistics: counting, ranking.
 Complex statistics: covariance.
 Pre-processing huge datasets before applying machine learning algorithms.
 Text processing, index building, graph creation and analysis, pattern recognition, collaborative filtering, sentiment analysis.
What is MapReduce? the map function
The map function takes two arguments, a key and a value. After performing some analysis or transformation on the input data, it may output zero or more resulting key/value pairs.
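A minimal Python sketch of such a map function for the word count example (the function name and the emit callback are illustrative assumptions, not part of the Hadoop API):

def map_function(key, value, emit):
    # key: byte offset of the line in the file (unused here)
    # value: one line of input text
    # emit zero or more key/value pairs: one (word, 1) pair per word
    for word in value.lower().split():
        emit(word, 1)

pairs = []
map_function(0, "I am a big data expert", lambda k, v: pairs.append((k, v)))
# pairs == [('i', 1), ('am', 1), ('a', 1), ('big', 1), ('data', 1), ('expert', 1)]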
The parallelization of the map function
 The key point is the possibility of parallelizing these functions in order to compute much faster on a multi-core machine or on a cluster of machines.
 The map function is parallelizable because its calculations are independent.
 The map function must be a pure function of its parameters: it must have no side effects, such as modifying a global variable or memorizing its previous values.
What is MapReduce? the reduce function
The reducer aggregates the many values that are output from the map phase in order to transform a large volume of data into a smaller, more manageable set of summary data.
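A matching Python sketch of a reduce function for word counting (again, the names and the emit callback are illustrative assumptions, not Hadoop's actual API):

def reduce_function(key, values, emit):
    # key: one word; values: all the counts emitted by the mappers for that word
    # aggregate the many mapped values into a single summary value per key
    emit(key, sum(values))

out = []
reduce_function("big", [1, 1], lambda k, v: out.append((k, v)))
# out == [('big', 2)]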
The parallelization of the reduce function
 The reduce function can be partially parallelized, in a hierarchical form, for example:
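A small Python sketch of this hierarchical form, assuming the reduce operation is associative (here, a sum): partial reductions can run in parallel on separate chunks, and a final reduction combines the partial results.

from functools import reduce

values = [4, 8, 15, 16, 23, 42]

# chunks that could be reduced independently on different nodes ...
chunks = [values[0:2], values[2:4], values[4:6]]

# ... each partial reduction is parallelizable ...
partial_sums = [reduce(lambda a, b: a + b, chunk) for chunk in chunks]

# ... and one final reduce combines the partial results
total = reduce(lambda a, b: a + b, partial_sums)   # 108, same as summing directly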
Architecture of MapReduce?
MapReduce is the processing engine of Hadoop that processes and computes large volumes of data (terabytes, petabytes). This processing unit implements a mapping and a reducing mechanism. Here is the global architecture:

[Figure: Input Data → several Map() tasks → several Reduce() tasks → Output.]
What is MapReduce? (cont'd)
 Hadoop MapReduce jobs are divided into a set of map tasks and reduce tasks that run in a distributed fashion on a cluster of computers.
 Each task works on the small subset of the data it has been assigned, so that the load is spread across the cluster.
 The map tasks generally load, parse, transform, and filter data.
 Each reduce task is responsible for handling a subset of the map task output.
 Intermediate data is then copied from the mapper tasks by the reducer tasks in order to group and aggregate the data.
MapReduce: Implemented on a Cluster
 Each node gets a copy of the mapper operation and applies the mapper to the key/value pairs stored in the blocks of data on the local HDFS data nodes.
 There can be any number of mappers working independently, each on as much data as possible.
 The mappers' output depends only on the incoming values, so failed mappers can be reattempted on another node.
 Reducers require as input the output of the mappers on a per-key basis, so reducer computation can also be distributed: there can be as many reduce operations as there are keys in the mapper output.
 A shuffle and sort operation is required to coordinate the map and reduce phases, so that each reducer sees all the values for a single, unique key.
MapReduce: Implemented on a Cluster
As shown in the figure below, MapReduce is implemented as a staged framework in which a map phase is connected to a reduce phase via an intermediate shuffle and sort phase.

[Figure: the staged framework — map phase → shuffle and sort → reduce phase.]
Workflow of MapReduce? Basic example
A MapReduce job for the word count problem is defined by the following steps (simulated in the sketch after the list):

1. The input data is split into records.
2. Map functions process these records and produce a key/value pair for each word.
3. All key/value pairs output by the map functions are merged together, grouped by key, shuffled, and sorted.
4. The intermediate results are transmitted to the reduce function, which produces the final output.
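This local Python sketch models the dataflow of the four steps; splitting, shuffling and sorting are ordinary data structures here, not actual Hadoop calls:

from collections import defaultdict

lines = ["I am a big data expert",
         "I can handle big data efficiently"]        # 1. input split into records

mapped = []
for line in lines:                                   # 2. map: one (word, 1) pair per word
    for word in line.lower().split():
        mapped.append((word, 1))

grouped = defaultdict(list)                          # 3. merge, group by key, shuffle and sort
for word, count in mapped:
    grouped[word].append(count)

result = {word: sum(counts)                          # 4. reduce: aggregate the values per key
          for word, counts in sorted(grouped.items())}
print(result)   # {'a': 1, 'am': 1, 'big': 2, 'can': 1, 'data': 2, ...}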
Workflow of MapReduce? More details
 Input data is stored in HDFS and split by the InputFormat before the map operation starts.
 A map task is assigned to one or more nodes in the cluster that contain local blocks of the data specified as input to the map operation.
 A common optimization at this point is to apply a combiner: an aggregation of the mapping output for a single mapper, which facilitates the reducer's work.
 The intermediate keys are pulled from the map processes to a partitioner. The partitioner decides how to allocate the keys to the reducers; a hash function is used to divide keys evenly among them.
 The partitioner also sorts the key/value pairs, so that the full "shuffle and sort" phase is implemented.
 The reducers pull an iterator of data for each key and perform a reduce operation, such as an aggregation. Their output key/value pairs are then written back to HDFS using an OutputFormat class.
Workflow of MapReduce?
 The input to a MapReduce job is a set of files in the data store that are spread out over the Hadoop Distributed File System (HDFS).
 In Hadoop, these files are split with an input format, which defines how to separate a file into input splits. The number of splits is generally controlled in mapred-site.xml by mapred.min.split.size and mapred.max.split.size (see the sketch after this list).
 An input split is a byte-oriented view of a chunk of the file to be loaded by a map task.
 Each map task in Hadoop is broken into the following phases: record reader, mapper, combiner, and partitioner. The output of the map tasks, called the intermediate keys and values, is sent to the reducers.
 The reduce tasks are broken into the following phases: shuffle, sort, reducer, and output format.
 The map tasks optimally run on the nodes where the data rests. This way, the data typically does not have to move over the network and can be computed on the local machine.
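As a rough sketch of how the split size follows from these settings, Hadoop's FileInputFormat takes the block size clamped between the configured minimum and maximum split sizes (default values assumed below), so by default one split corresponds to one HDFS block:

def split_size(block_size, min_split=1, max_split=2**63 - 1):
    # splitSize = max(minSplitSize, min(maxSplitSize, blockSize))
    return max(min_split, min(max_split, block_size))

print(split_size(128 * 1024**2))   # 134217728 bytes: one split per 128 MB block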
MapReduce workflow: illustration of word count within a document

[Figure: end-to-end word count example, detailed step by step in the following slides.]
MapReduce workflow: I- Input Split
The input is stored in the Hadoop Distributed File System. It is split based on the input format, which specifies how the input will be read and divided. Each split will be handled by an individual mapper.

[Figure: input data stored in HDFS ("I am a big data expert / I can handle big data efficiently / ...") is divided by the text input format into the input splits "I am a big data expert" and "I can handle big data efficiently".]
MapReduce workflow: II- Record Reader
The record reader communicates with the input split and converts the data into key-value pairs that are suitable for the mapper to read.

[Figure: word count example — each input split ("I am a big data expert", "I can handle big data efficiently", ...) is passed through its own RecordReader.]
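A minimal Python sketch of what a line-oriented record reader produces for a split (a stand-in for the default text input format, where the key is the byte offset of the line and the value is the line itself):

def text_record_reader(split_text):
    # yield (key, value) pairs: key = byte offset of the line, value = the line
    offset = 0
    for line in split_text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line)

split = "I am a big data expert\nI can handle big data efficiently\n"
for key, value in text_record_reader(split):
    print(key, value)
# 0 I am a big data expert
# 23 I can handle big data efficiently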
MapReduce workflow: III- The mapper
The mapper processes the input records coming from the RecordReader and generates intermediate key-value pairs.

[Figure: each RecordReader feeds a Mapper, which turns input key/value pairs into intermediate key-value pairs.]
MapReduce workflow: IV- The combiner
The combiner is a mini reducer: there is one combiner per mapper, and it performs the aggregation of that mapper's output (see the sketch below).

[Figure: each Mapper's intermediate key-value pairs are passed to its own Combiner.]
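A sketch of the combiner's effect on a single mapper's output for word counting (the combiner reuses the reduce logic locally, which is valid here because summing is associative and commutative):

from collections import defaultdict

def combine(mapper_output):
    # local, per-mapper aggregation: fewer pairs are sent over the network
    local = defaultdict(int)
    for word, count in mapper_output:
        local[word] += count
    return list(local.items())

one_mapper = [("big", 1), ("data", 1), ("big", 1)]
print(combine(one_mapper))   # [('big', 2), ('data', 1)] instead of three pairs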
MapReduce workflow: V- The partitioner
The partitioner decides how the outputs of the combiners are distributed to the reducers.

[Figure: the intermediate key-value pairs produced by each Mapper/Combiner are handed to Partitioners, which route them to the reducers.]
MapReduce workflow: VI- Sorting
After partitioning, the intermediate key-value pairs are shuffled and sorted by key before being handed to the reducers.

[Figure: word count example — the shuffled and sorted intermediate pairs are (a,1), (am,1), (big,1), (big,1), (can,1), (data,1), (data,1), (efficiently,1), (expert,1), (handle,1), (I,1), (I,1).]
MapReduce workflow: VII- The reducer
All the intermediate values for an intermediate key are combined into a list (a tuple), which the reducer then aggregates into the final value for that key.

[Figure: word count example — after shuffling and sorting, the reducer receives (a,1), (am,1), (big,[1,1]), (can,1), (data,[1,1]), (efficiently,1), (expert,1), (handle,1), (I,[1,1]).]
MapReduce workflow: VIII- The output
The RecordWriter writes the output key-value pairs produced by the reducer to the output files, via the output format.

[Figure: word count example — the final output is (a,1), (am,1), (big,2), (can,1), (data,2), (efficiently,1), (expert,1), (handle,1), (I,2).]
Summary of MapReduce phases
 Input data: the data stored in blocks in the HDFS DataNodes.
 Input split / input format: the phase where the source data is split. (The splitting criteria are defined by developers; for example, the splitting can be done by lines.)
 RecordReader: communicates with the input split and converts the data into key-value pairs suitable for the mapper to read.
 Map tasks: read the data and process it to generate key-value pairs. Within this task, mapping functions are applied to the blocks of data stored in HDFS, depending on the file size (see the HDFS section).
 Reduce tasks: perform operations on the output generated by the mapping phase. This is the reducing function: shuffling, sorting, and aggregating the key-value pairs into a smaller set of tuples.
 Output: the smaller set of tuples is the final output and is stored in HDFS.
MapReduce workflow recap

[Figure: full workflow — input data stored in HDFS → input format → InputSplit → RecordReader → Mapper → Combiner (intermediate key-value pairs) → Partitioner → shuffling and sorting → Reducer → output format → output data stored in HDFS.]
How many mappers and reducers?
 These can be set in a job configuration file or on the command line at execution time.
 Number of maps: driven by the number of HDFS blocks in the input files. The HDFS block size can be adjusted to tune the number of maps.
 Number of reducers: can be user-defined (the default is 1).

Partition function
 The key space of the intermediate key-value pairs is evenly distributed over the reducers with a hash function, so the same key emitted by different mappers ends up at the same reducer (see the sketch below).
 The partitioning of the intermediate key-value pairs generated by the maps can be customized.
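A sketch of the default partitioning rule described above (Python's built-in hash stands in for Hadoop's HashPartitioner, which uses the key's hash code):

def partition(key, num_reducers):
    # the same key always lands on the same reducer, whichever mapper emitted it
    return hash(key) % num_reducers

keys = ["big", "data", "expert", "big"]
print([partition(k, 3) for k in keys])
# 'big' is assigned the same partition both times, so one reducer sees all its values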
MapReduce architecture: within YARN
 Input data is split into smaller pieces. These chunks of data are processed in parallel, in independent tasks, and the output is a list. The splitting is done based on the input format.
 The Resource Manager breaks the client's application into smaller tasks and assigns them to the different nodes. Each node handles its tasks through its Node Manager (responsible for resource allocation and management within each node).
 Within each node, the mapping, combining, and partitioning functions are executed on the chunk of data. The processing results at each node are stored in the RAM of the corresponding machine.
 At the level of the rack, the reduce function is called to produce the outputs, which are written back to HDFS.
Submitting a MapReduce job to YARN

[Figure: the MR client submits a job to the Resource Manager (Scheduler + Applications Manager); Node Managers on the worker nodes host an App Master and containers for the job; arrows show job submission, node status, MapReduce status, and resource requests.]
MapReduce architecture: within YARN
 The YARN layer consists of two components: the Scheduler and the Applications Manager.
 The Scheduler allocates resources and does not participate in running or monitoring an application. Resources are allocated as resource containers, with each resource container assigned a specific amount of memory.
 The Applications Manager accepts MapReduce job submissions from clients and starts processes called Application Masters to run the jobs.
 There is one Application Master slave daemon for each client application.
Submitting a MapReduce job (old version of Hadoop)

[Figure: job submission in Hadoop v1. The master component shown here is called the Resource Manager in Hadoop version 2, and the slaves are called Node Managers.]
MapReduce example 1: counting songs
Counting the songs that were played. Use case: Google Music analytics, to find the top ten trending songs.

[Figure: MapReduce job for counting song plays.]
MapReduce example 2:
Consider the following sample of a cars dataset:

tuples = [
{'id':1, 'Brand':'Renault', 'Model':'Clio', 'Price':4200},
{'id':3, 'Brand':'Fiat','Model':'500', 'Price':8840},
{'id':4, 'Brand':'Fiat','Model':'600', 'Price':10240},
{'id':5, 'Brand':'Peugeot', 'Model':'206', 'Price':4300},
…]

To calculate the maximum or the average price, the Map and Reduce functions can be written as:
• FunctionM: extracts the desired value from the tuples.
• FunctionR: a grouping or aggregation function.
MapReduce example 2: analyzing the cars' prices
To analyze the cars' prices by calculating the average, minimum and maximum, the MapReduce program should look like this:
• FunctionM extracts the price of a car from the dataset.
• FunctionR calculates the maximum of the prices.
For efficiency, the intermediate values are not stored but are transmitted between the two functions through a kind of pipe (as in Unix).

A Python implementation of the MapReduce job is sketched below.
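A minimal sketch, assuming FunctionM extracts the price, FunctionR keeps the maximum, and Python's built-in map and functools.reduce act as the pipe between them:

from functools import reduce

tuples = [
    {'id': 1, 'Brand': 'Renault', 'Model': 'Clio', 'Price': 4200},
    {'id': 3, 'Brand': 'Fiat', 'Model': '500', 'Price': 8840},
    {'id': 4, 'Brand': 'Fiat', 'Model': '600', 'Price': 10240},
    {'id': 5, 'Brand': 'Peugeot', 'Model': '206', 'Price': 4300},
]

def function_m(car):
    # extracts the desired value (the price) from one tuple
    return car['Price']

def function_r(price1, price2):
    # aggregation function: keeps the larger of two prices
    return max(price1, price2)

# map() is lazy, so the intermediate prices are streamed into reduce()
# rather than stored, as in a Unix pipe
max_price = reduce(function_r, map(function_m, tuples))
print(max_price)   # 10240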
MapReduce example 2: analyzing the cars' prices (cont'd)
The following figure depicts the MapReduce job of this example:

[Figure: tuples → map → values = [4200, 8200, 8840, 10240, 4300, 6140] → reduce → maximum price = 10240.]
MapReduce configuration file: mapred-site.xml

[Figure: the mapred-site.xml file, with properties that specify the location of the resource manager, that MapReduce will use YARN for resource management, and the location where previous jobs shall be saved (job history).]
