What is MapReduce?
programming model and associated implementation for processing and generating large datasets
What are the tasks of map and reduce?
map takes input key/value pairs and produces intermediate key/value pairs
reduce takes the intermediate key/value pairs and merges all values associated with the same intermediate key
What is a huge advantage of MapReduce?
can be run in parallel on a large number of machines…
Give an example of MapReduce
count the number of occurrences of each word in a large collection of documents
map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");
-> map emits one intermediate pair (word, "1") per word occurrence in a document
-> reduce then collects these pairs from all the separately processed documents and sums up the counts per word … (can also be divided…)
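The word-count example above can be sketched in runnable Python. This is a minimal single-process sketch, not a distributed implementation: `map_fn`, `reduce_fn`, `word_count` and the two sample documents are hypothetical names and data chosen for illustration, and the grouping step that a real framework performs between map and reduce is simulated with a dictionary.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit one (word, 1) pair per word occurrence in the document."""
    for word in contents.split():
        yield (word, 1)


def reduce_fn(word, counts):
    """Merge all counts that were emitted for the same word."""
    return sum(counts)


def word_count(documents):
    # Group the intermediate pairs by key (done by the framework in practice).
    intermediate = defaultdict(list)
    for doc_name, contents in documents.items():
        for word, count in map_fn(doc_name, contents):
            intermediate[word].append(count)
    # Run reduce once per intermediate key.
    return {word: reduce_fn(word, counts) for word, counts in intermediate.items()}


# Hypothetical sample input: document name -> document contents
docs = {"a.txt": "the quick fox", "b.txt": "the lazy dog the end"}
result = word_count(docs)
```

Here `result["the"]` is the sum of the three `1`s emitted for "the" across both documents.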
What are the abstract types required by MapReduce?
map(k1, v1) -> list(k2, v2)
reduce(k2, list(v2)) -> list(v2)
=> input keys/values come from a different domain than the output, while intermediate keys/values are from the same domain as the output
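The type signatures can be made concrete with Python type hints. The sketch below assumes a generic in-memory driver, `run_mapreduce` (a hypothetical helper, not part of any library), in which reduce receives a key together with the list of all values emitted for it. An inverted index serves as the example because its input domain (document name, contents) clearly differs from its intermediate and output domain (word, document names).

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")  # input key type
V1 = TypeVar("V1")  # input value type
K2 = TypeVar("K2")  # intermediate/output key type
V2 = TypeVar("V2")  # intermediate/output value type


def run_mapreduce(
    inputs: Iterable[Tuple[K1, V1]],
    map_fn: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reduce_fn: Callable[[K2, List[V2]], List[V2]],
) -> Dict[K2, List[V2]]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    grouped: Dict[K2, List[V2]] = defaultdict(list)
    for key, value in inputs:
        for k2, v2 in map_fn(key, value):
            grouped[k2].append(v2)
    return {k2: reduce_fn(k2, values) for k2, values in grouped.items()}


def index_map(doc_name: str, contents: str) -> Iterable[Tuple[str, str]]:
    # (k1, v1) = (document name, contents) -> list of (word, document name)
    for word in set(contents.split()):
        yield (word, doc_name)


def index_reduce(word: str, doc_names: List[str]) -> List[str]:
    # (k2, list(v2)) -> list(v2): the sorted documents containing the word
    return sorted(set(doc_names))


# Hypothetical sample input
index = run_mapreduce(
    [("a.txt", "red green"), ("b.txt", "green blue")],
    index_map,
    index_reduce,
)
```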
What are challenges for applications that run in e.g. datacenters?
how to parallelize application logic?
how to communicate?
how to synchronize?
how to perform load balancing?
how to handle faults?
how to schedule jobs?
What is a problem in application design for datacenters?
every application has to solve these challenges itself
=> design, implement, optimize, debug and maintain the solutions for each and every application…
How to avoid doing the same work over and over again?
create libraries that abstract the common functionality so it can be reused across many different applications
How does coordination in MapReduce work?
master/slave
-> master handles control
-> slaves handle the data part
Is the master a single entity?
logically yes
-> physically not necessarily (can also be distributed, e.g. for fault tolerance…)
Central idea of distributed systems
logically single software / master
-> physically distributed execution, storage, …
Into what parts is the master split?
job tracker (handles MapReduce jobs)
namenode (responsible for file storage -> where are the files required for MapReduce stored?)
What is part of each slave?
task tracker -> what has to be done (map reduce)
data node -> what data is stored there? (managing data as part of distributed file system)
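The split between job tracker and namenode on the master can be sketched as two small classes. This is an illustrative model only: the class layout, the method names `locate` and `assign_map_tasks`, the slave and file names, and the policy of picking the least-loaded slave that already stores a file are all assumptions, not how any particular framework does it.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NameNode:
    """Master-side storage metadata: which slaves hold which file."""
    # file name -> names of the slaves whose data node stores it (hypothetical)
    locations: Dict[str, List[str]] = field(default_factory=dict)

    def locate(self, file_name: str) -> List[str]:
        return self.locations.get(file_name, [])


@dataclass
class JobTracker:
    """Master-side job control: decides which task tracker runs which task."""
    name_node: NameNode

    def assign_map_tasks(self, input_files: List[str]) -> Dict[str, List[str]]:
        assignment: Dict[str, List[str]] = {}
        for file_name in input_files:
            holders = self.name_node.locate(file_name)
            if not holders:
                raise LookupError(f"no data node stores {file_name}")
            # Assumed policy: least-loaded slave among those storing the file.
            target = min(holders, key=lambda node: len(assignment.get(node, [])))
            assignment.setdefault(target, []).append(file_name)
        return assignment


# Hypothetical cluster layout
name_node = NameNode({
    "part-0": ["slave1", "slave2"],
    "part-1": ["slave2"],
    "part-2": ["slave1", "slave2"],
})
plan = JobTracker(name_node).assign_map_tasks(["part-0", "part-1", "part-2"])
```

The job tracker never touches file contents; it only asks the namenode where the data lives and hands out work accordingly.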
What is an essential separation in a distributed system?
data processing part (job tracker, task trackers)
file storage part (namenode, data nodes)
Example of invoking e.g. PageRank?
user program invokes PageRank at the master
-> master assigns functionality to workers and says what data is required
-> workers read input files and perform map
-> write the output as intermediate values
-> master assigns reduce and workers read the intermediate values
-> workers write the final output
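The flow above can be simulated in one process, with a thread pool standing in for the workers. The sketch assumes a common textbook formulation of one PageRank iteration as a map and a reduce step; the damping factor, the function names and the three-page graph are hypothetical, and "writing" intermediate and final values is reduced to returning Python objects.

```python
from concurrent.futures import ThreadPoolExecutor

DAMPING = 0.85  # assumed damping factor, not taken from the notes


def pagerank_map(page, rank, out_links):
    """Worker map task: share a page's rank equally among its link targets."""
    return [(target, rank / len(out_links)) for target in out_links]


def pagerank_reduce(page, contributions, num_pages):
    """Worker reduce task: combine all contributions sent to one page."""
    return (1 - DAMPING) / num_pages + DAMPING * sum(contributions)


def master_run_iteration(graph, ranks, num_workers=2):
    """Master: hand out map tasks, group intermediate values, hand out reduce tasks."""
    num_pages = len(graph)
    with ThreadPoolExecutor(max_workers=num_workers) as workers:
        # Map phase: one task per page, results come back in input order.
        map_outputs = list(
            workers.map(lambda p: pagerank_map(p, ranks[p], graph[p]), graph)
        )
        # Intermediate values grouped by target page.
        intermediate = {page: [] for page in graph}
        for pairs in map_outputs:
            for target, contribution in pairs:
                intermediate[target].append(contribution)
        # Reduce phase: one task per page.
        pages = list(intermediate)
        new_values = list(
            workers.map(
                lambda p: pagerank_reduce(p, intermediate[p], num_pages), pages
            )
        )
    return dict(zip(pages, new_values))


# Hypothetical link graph: page -> pages it links to
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = {page: 1 / len(graph) for page in graph}
new_ranks = master_run_iteration(graph, ranks)
```

A full PageRank run would repeat `master_run_iteration` until the ranks stop changing; pages without outgoing links would need extra handling that this sketch leaves out.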