What are the basic principles of MapReduce?
Inspired by functional programming (e.g., Lisp, Haskell)
Map
Generates intermediate results from input data
Input: aggregates
Output: key-value pairs
Each map task is independent → safely parallelizable
Reduce
Aggregates intermediate results
Input: multiple map outputs with the same key
Output: a combined value per key
Analogies:
Map ≈ SQL GROUP BY
Reduce ≈ SQL aggregate functions (e.g., SUM, COUNT)
Assume you have three nodes containing the following words:
a,a,b,c
c,d
a,a,c
What would the map and shuffle & sort steps look like?
What could the pseudocode look like?
What does the reduce step do after the map step?
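The three questions above can be answered with a minimal Python sketch of word count over the three nodes (function names are illustrative, not from a specific framework):

```python
from collections import defaultdict

# Input partitions, one list of words per node
nodes = [["a", "a", "b", "c"], ["c", "d"], ["a", "a", "c"]]

# MAP: each node independently emits (word, 1) pairs
def map_words(words):
    return [(w, 1) for w in words]

map_outputs = [map_words(words) for words in nodes]
# e.g., node 1 emits [("a", 1), ("a", 1), ("b", 1), ("c", 1)]

# SHUFFLE & SORT: group all intermediate pairs by key across nodes
groups = defaultdict(list)
for output in map_outputs:
    for key, value in output:
        groups[key].append(value)
# groups == {"a": [1, 1, 1, 1], "b": [1], "c": [1, 1, 1], "d": [1]}

# REDUCE: aggregate the value list of each key into one combined value
def reduce_counts(key, values):
    return key, sum(values)

counts = dict(reduce_counts(k, v) for k, v in sorted(groups.items()))
print(counts)  # {'a': 4, 'b': 1, 'c': 3, 'd': 1}
```

Because each map call touches only its own node's data, the map phase runs fully in parallel; only the shuffle & sort step moves data between nodes.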
How can MapReduce be implemented with respect to system architecture?
Not bound to a specific architecture
Can be realized in:
Distributed memory environments (e.g., clusters)
Often use centralized coordination with a single master
Examples: Google, Hadoop
Shared memory environments (e.g., multi-core machines)
Can use decentralized coordination, such as hash-based
Examples: Phoenix (Stanford), C++ with PThreads
What is the purpose of combinable reducers in MapReduce?
Address network traffic issues by reducing data locally first
Apply reduce function locally (pre-aggregation) before global reduction
Transfers less data across the network
Steps:
Local Reduce
Shuffle and Sort
Global Reduce
What properties must a reduce function have to be combinable?
Composability:
Output type of reduce must match its input type (the map output type), so reduce results can be fed back into reduce
Allows nesting: reduce(key, [C, reduce(key, [A, B])]) == reduce(key, [C, A, B])
Confluence:
Idempotency: Reapplying reduce does not change the result
Order-agnosticism: Result doesn’t depend on order of values:
reduce(key, [A, B]) == reduce(key, [B, A])
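For a simple sum reducer these properties can be checked directly (an illustrative sketch; the key parameter is kept only to mirror the reduce signature):

```python
def reduce_sum(key, values):
    # Output is a plain number, the same type as each input value
    return sum(values)

A, B, C = 1, 2, 3

# Composability: a reduce result can be fed back into reduce
assert reduce_sum("k", [C, reduce_sum("k", [A, B])]) == reduce_sum("k", [C, A, B])

# Idempotency: reapplying reduce to its own output changes nothing
assert reduce_sum("k", [reduce_sum("k", [A, B])]) == reduce_sum("k", [A, B])

# Order-agnosticism: the order of values does not matter
assert reduce_sum("k", [A, B]) == reduce_sum("k", [B, A])
```

A counter-example is an average reducer: avg([3, avg([1, 2])]) = avg([3, 1.5]) = 2.25, but avg([3, 1, 2]) = 2.0, so averaging is not composable and cannot be used as a combiner directly (one would ship (sum, count) pairs instead).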
What is decentralized MapReduce and how does it work?
No coordinating master needed
Works with consistent hashing
MAP: Map logic is sent to the nodes and applied locally; each input item is assigned to the next node clockwise from its hash position on the ring
REDUCE: Reduce logic is sent to the nodes and applied locally; each intermediate key is assigned to the next node clockwise from the hash of that key
Nodes act as workers, executing tasks based on hash values
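The node-assignment part can be sketched with a small consistent-hashing ring (a minimal sketch; node names, the hash space size, and the use of MD5 are illustrative assumptions):

```python
import hashlib
from bisect import bisect_right

# Map a string onto a fixed hash ring (illustrative 16-bit space)
def ring_position(s, space=2**16):
    return int(hashlib.md5(s.encode()).hexdigest(), 16) % space

# Hypothetical worker nodes placed on the ring by hashing their IDs
nodes = ["node-1", "node-2", "node-3"]
ring = sorted((ring_position(n), n) for n in nodes)

# An item is handled by the first node clockwise from its hash position
def responsible_node(key):
    pos = ring_position(key)
    idx = bisect_right([p for p, _ in ring], pos) % len(ring)
    return ring[idx][1]

# MAP: each input chunk is processed on responsible_node(chunk_id)
# REDUCE: each intermediate key is reduced on responsible_node(key)
for key in ["a", "b", "c", "d"]:
    print(key, "->", responsible_node(key))
```

Because every node can compute `responsible_node` on its own, no master is needed to decide where map or reduce work runs.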
What are the drawbacks and enhancements related to MapReduce?
Drawbacks:
No traditional RDBMS optimization (no indexes, no query optimizer)
Incompatible with common DB tools (e.g., BI, mining tools)
No high-level query languages, only low-level operations
Enhancements:
Sawzall (Google): scripting language for MapReduce generation
Pig (Yahoo): higher-level scripting with SQL-like constructs
These tools make MapReduce more accessible for data analysts