Definition: MapReduce is a programming model and processing technique for large-scale data processing across a distributed cluster of computers. Popularized by Google, it is widely used in big data processing.
Procedure:
Map Phase:
The input data is partitioned into smaller chunks.
Each chunk is processed by a Map function, which transforms the data into a set of key-value pairs.
Shuffle and Sort:
The system groups all the key-value pairs based on their keys.
The pairs are then sorted to assist in the reduction phase.
Reduce Phase:
The grouped key-value pairs are sent to the Reduce function (a reducer may simply emit its input unchanged, i.e. the identity reduce, or perform further computation).
The Reduce function aggregates or processes the values for each key.
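The three phases above can be sketched as a local, single-machine emulation in plain Python (word count is the usual illustration; the chunk contents here are made up, and a real run would distribute map and reduce across machines):

```python
from collections import defaultdict

def map_fn(doc):
    # Map: emit a (word, 1) key-value pair for every word in one input chunk
    return [(w, 1) for w in doc.split()]

def reduce_fn(key, values):
    # Reduce: aggregate all counts observed for one word
    return (key, sum(values))

chunks = ["big data big", "data big"]  # input already partitioned into chunks

# Map phase: run the Map function over each chunk
pairs = [kv for chunk in chunks for kv in map_fn(chunk)]

# Shuffle and sort: group pairs by key, visit keys in sorted order
groups = defaultdict(list)
for k, v in pairs:
    groups[k].append(v)

# Reduce phase: one reduce call per key
counts = dict(reduce_fn(k, groups[k]) for k in sorted(groups))
print(counts)  # {'big': 3, 'data': 2}
```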
Example
SQL → MapReduce
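A GROUP BY aggregation is the classic SQL-to-MapReduce translation. As a sketch (the employees table and column names are invented for illustration), the query `SELECT department, SUM(salary) FROM employees GROUP BY department` becomes: map emits (department, salary), the shuffle groups by department, and reduce sums each group:

```python
from collections import defaultdict

# Hypothetical rows of (department, salary)
rows = [("eng", 100), ("hr", 50), ("eng", 120), ("hr", 60)]

# Map phase: emit one (department, salary) pair per row
mapped = [(dept, salary) for dept, salary in rows]

# Shuffle and sort: group salaries by department
groups = defaultdict(list)
for k, v in mapped:
    groups[k].append(v)

# Reduce phase: SUM(salary) per department
result = {k: sum(vs) for k, vs in sorted(groups.items())}
print(result)  # {'eng': 220, 'hr': 110}
```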
Data Locality:
The scheduler tries to run each mapper on a machine whose local DataNode already stores the needed input block. The mapper then reads from local disk instead of over the network, avoiding costly network transfers.
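The locality preference can be sketched as a toy scheduler (node and block names here are hypothetical, and real schedulers also consider rack locality and retries):

```python
# Which DataNodes hold a replica of each input block (hypothetical)
block_locations = {"block1": {"nodeA", "nodeB"}, "block2": {"nodeC"}}
free_nodes = ["nodeB", "nodeD"]  # machines with a free map slot

def schedule(block):
    # Prefer a free node that already stores the block: disk read, no network
    for node in free_nodes:
        if node in block_locations[block]:
            return node, "local read (disk only)"
    # Fall back to any free node: the block must cross the network
    return free_nodes[0], "remote read (network transfer)"

print(schedule("block1"))  # ('nodeB', 'local read (disk only)')
print(schedule("block2"))  # ('nodeB', 'remote read (network transfer)')
```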
Pipelines: a sequence of MapReduce jobs, where the output of one job serves as the input of the next.
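Chaining can be sketched with two toy jobs run locally (both jobs and the threshold are made-up examples; in a real pipeline each job would be a full MapReduce run reading the previous job's output from disk):

```python
from collections import Counter

def job_wordcount(docs):
    # Job 1: MapReduce-style word count, condensed into one step locally
    return dict(Counter(w for d in docs for w in d.split()))

def job_filter_frequent(counts, threshold):
    # Job 2: consumes job 1's output as its input
    return {w: c for w, c in counts.items() if c >= threshold}

out1 = job_wordcount(["a b a", "b a"])   # first job's output
out2 = job_filter_frequent(out1, 3)      # fed into the second job
print(out2)  # {'a': 3}
```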
Characteristics:
Operations (the four operations in the example below fall into two types):
Transformation: creates a new RDD (lazy, so no execution yet).
Here: parallelize, map, and filter.
Action: executes all pending operations in the graph to produce an actual result.
Here: collect.
# data, mult2, and onlyA are assumed to be defined elsewhere
table = sc.parallelize(data)     # transformation: distribute data as an RDD
double = table.map(mult2)        # transformation: lazy, only extends the graph
doubleA = double.filter(onlyA)   # transformation: still lazy
doubleA.collect()                # action: triggers execution, returns a list
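The lazy-then-execute behavior can be mimicked with plain Python's own lazy map and filter (an illustration only, with made-up data; sc, mult2, and onlyA from the PySpark snippet are not used here, and list() plays the role of collect()):

```python
data = range(10)

double = map(lambda x: x * 2, data)           # "transformation": nothing runs yet
evens = filter(lambda x: x % 4 == 0, double)  # still lazy, just chains the pipeline
result = list(evens)                          # "action": the pipeline executes now

print(result)  # [0, 4, 8, 12, 16]
```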
Optimization
Partitions
Repartitioning: if a transformation grows or shrinks the data substantially, it can pay to change the partition count so partitions stay evenly sized.
table.filter(onlyA).map(mult2).collect()                 # keeps the original partition count
table.filter(onlyA).repartition(1).map(mult2).collect()  # consolidates the shrunken data into 1 partition first
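Why this matters can be sketched with a local model of partitions (the partitioning scheme and filter here are invented for illustration): a selective filter leaves many near-empty partitions, each of which would still cost a task, so consolidating them first is cheaper.

```python
def partition(xs, n):
    # Hypothetical round-robin partitioner: element i goes to partition i % n
    return [xs[i::n] for i in range(n)]

data = list(range(100))
parts = partition(data, 10)                          # 10 partitions, ~10 elements each
filtered = [[x for x in p if x < 5] for p in parts]  # a very selective filter
sizes = [len(p) for p in filtered]
print(sizes)  # [1, 1, 1, 1, 1, 0, 0, 0, 0, 0] -- half the partitions are empty

# "repartition(1)": consolidate the small result into a single partition
merged = [x for p in filtered for x in p]
print(merged)  # [0, 1, 2, 3, 4]
```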