Definition: MapReduce is a programming model and processing technique for large-scale data processing across a distributed cluster of computers. Popularized by Google, it is widely used in big data processing.
Procedure:
Map Phase:
The input data is partitioned into smaller chunks.
Each chunk is processed by a Map function, which transforms the data into a set of key-value pairs.
Shuffle and Sort:
The system groups all the key-value pairs based on their keys.
The pairs are then sorted to assist in the reduction phase.
Reduce Phase:
The grouped key-value pairs are sent to the Reduce function (a reducer may simply emit its input unchanged, i.e. the identity reduce, or perform further computation).
The Reduce function aggregates or processes the values for each key.
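The three phases above can be sketched as a local, single-machine emulation in plain Python (word count is the usual illustration; the chunk contents here are made up, and a real run would distribute map and reduce across machines):

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit a (word, 1) key-value pair for every word in one input chunk
    return [(w, 1) for w in doc.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all counts observed for one word
    return (key, sum(values))

chunks = ["big data big", "data big"]  # input already partitioned into chunks

# Map phase: run the Map function over each chunk
pairs = [kv for chunk in chunks for kv in map_fn(chunk)]

# Shuffle and sort: group pairs by key, visit keys in sorted order
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: one reduce call per key
counts = dict(reduce_fn(k, groups[k]) for k in sorted(groups))
print(counts)  # {'big': 3, 'data': 2}
```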
Example
SQL → MapReduce
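A GROUP BY aggregation is the classic SQL-to-MapReduce translation. As a sketch (the employees table and column names are invented for illustration), the query `SELECT department, SUM(salary) FROM employees GROUP BY department` becomes: map emits (department, salary), the shuffle groups by department, and reduce sums each group:

```python
from collections import defaultdict

# Hypothetical rows of (department, salary)
rows = [("eng", 100), ("hr", 50), ("eng", 120), ("hr", 60)]

# Map phase: emit one (department, salary) pair per row
mapped = [(dept, salary) for dept, salary in rows]

# Shuffle and sort: group salaries by department
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: SUM(salary) per department
result = {k: sum(vs) for k, vs in sorted(groups.items())}
print(result)  # {'eng': 220, 'hr': 110}
```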
Data Locality:
The scheduler tries to run each mapper on a machine whose local DataNode already stores the needed input block. The mapper then reads from local disk instead of over the network, avoiding costly network transfers.
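The locality preference can be sketched as a toy scheduler (node and block names here are hypothetical, and real schedulers also consider rack locality and retries):

```python
# Which DataNodes hold a replica of each input block (hypothetical)
block_locations = {"block1": {"nodeA", "nodeB"}, "block2": {"nodeC"}}
free_nodes = ["nodeB", "nodeD"]  # machines with a free map slot

def schedule(block):
    # Prefer a free node that already stores the block: disk read, no network
    for node in free_nodes:
        if node in block_locations[block]:
            return node, "local read (disk only)"
    # Fall back to any free node: the block must cross the network
    return free_nodes[0], "remote read (network transfer)"

print(schedule("block1"))  # ('nodeB', 'local read (disk only)')
print(schedule("block2"))  # ('nodeB', 'remote read (network transfer)')
```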
Pipelines: a sequence of MapReduce jobs, where the output of one job serves as the input of the next.
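Chaining can be sketched with two toy jobs run locally (both jobs and the threshold are made-up examples; in a real pipeline each job would be a full MapReduce run reading the previous job's output from disk):

```python
from collections import Counter

def job_wordcount(docs):
    # Job 1: MapReduce-style word count, condensed into one step locally
    return dict(Counter(w for d in docs for w in d.split()))

def job_filter_frequent(counts, threshold):
    # Job 2: consumes job 1's output as its input
    return {w: c for w, c in counts.items() if c >= threshold}

out1 = job_wordcount(["a b a", "b a"])   # first job's output
out2 = job_filter_frequent(out1, 3)      # fed into the second job
print(out2)  # {'a': 3}
```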
Characteristics:
Operations (the four operations in the example below fall into two types):
Transformation: creates a new RDD (lazy, so no execution yet).
Here: parallelize, map, and filter.
Action: executes all pending operations in the graph to produce an actual result.
Here: collect.
# data, mult2, and onlyA are assumed to be defined elsewhere
table = sc.parallelize(data)     # transformation: distribute data as an RDD
double = table.map(mult2)        # transformation: lazy, only extends the graph
doubleA = double.filter(onlyA)   # transformation: still lazy
doubleA.collect()                # action: triggers execution, returns a list
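The lazy-then-execute behavior can be mimicked with plain Python's own lazy map and filter (an illustration only, with made-up data; sc, mult2, and onlyA from the PySpark snippet are not used here, and list() plays the role of collect()):

```python
data = range(10)

double = map(lambda x: x * 2, data)           # "transformation": nothing runs yet
evens = filter(lambda x: x % 4 == 0, double)  # still lazy, just chains the pipeline
result = list(evens)                          # "action": the pipeline executes now

print(result)  # [0, 4, 8, 12, 16]
```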
Optimization
Partitions
Repartitioning: if a transformation grows or shrinks the data substantially, it can pay to change the partition count so partitions stay evenly sized.
table.filter(onlyA).map(mult2).collect()                 # keeps the original partition count
table.filter(onlyA).repartition(1).map(mult2).collect()  # consolidates the shrunken data into 1 partition first
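Why this matters can be sketched with a local model of partitions (the partitioning scheme and filter here are invented for illustration): a selective filter leaves many near-empty partitions, each of which would still cost a task, so consolidating them first is cheaper.

```python
def partition(xs, n):
    # Hypothetical round-robin partitioner: element i goes to partition i % n
    return [xs[i::n] for i in range(n)]

data = list(range(100))
parts = partition(data, 10)                          # 10 partitions, ~10 elements each
filtered = [[x for x in p if x < 5] for p in parts]  # a very selective filter
sizes = [len(p) for p in filtered]
print(sizes)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] -- half the partitions are empty

# "repartition(1)": consolidate the small result into a single partition
merged = [x for p in filtered for x in p]
print(merged)  # [0, 1, 2, 3, 4]
```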