Programming Models and Frameworks: Advanced Cloud Computing

Transcription

Programming Models and Frameworks
Advanced Cloud Computing 15-719/18-847b
Garth Gibson, Greg Ganger, Majd Sakr
Jan 30, 2017

Advanced Cloud Computing Programming Models

Ref 1: MapReduce: simplified data processing on large clusters. Jeffrey Dean and Sanjay Ghemawat. OSDI '04. ll papers/dean/dean.pdf

Ref 2: Spark: cluster computing with working sets. Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, Ion Stoica. USENIX Hot Topics in Cloud Computing (HotCloud '10). http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

Advanced Cloud Computing Programming Models (Optional)

Ref 3: DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey. ects/dryadlinq/dryadlinq.pdf

Ref 4: Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein (2010). "GraphLab: A New Parallel Framework for Machine Learning." Conference on Uncertainty in Artificial Intelligence. ripts/papers.cgi

Advanced Cloud Computing Programming Models (Optional)

Ref 5: TensorFlow: A system for large-scale machine learning. Martin Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeff Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al. OSDI '16. erence/osdi16/osdi16abadi.pdf

Recall the SaaS, PaaS, IaaS Taxonomy

Service, Platform or Infrastructure as a Service
o SaaS: the service is a complete application (client-server computing)
  – Not usually a programming abstraction
o PaaS: high-level (language) programming model for the cloud computer
  – E.g., rapid prototyping languages
  – Turing complete, but resource management is hidden
o IaaS: low-level (language) computing model for the cloud computer
  – E.g., assembler as a language
  – Basic hardware model with all (virtual) resources exposed

For PaaS and IaaS, cloud programming is needed
o How is this different from CS 101? Scale, fault tolerance, elasticity, …

Embarrassingly Parallel "Killer App": Web Servers

Online retail stores (like amazon.com, for example)
o Most of the computational demand is for browsing product marketing, forming and rendering web pages, and managing customer session state
  – Actual order taking and billing is not as demanding, and has separate specialized services (Amazon bookseller backend)
o One customer session needs a small fraction of one server
o No interaction between customers (unless inventory is near exhaustion)

Parallelism is more cores running identical copies of the web server
Load balancing, maybe in the name service, is the parallel programming
o Elasticity needs a template service, load monitoring, and cluster allocation
o These need not require user programming, just configuration

E.g., Obama for America Elastic Load Balancer (figure)

What About Larger Apps?

Parallel programming is hard – how can cloud frameworks help?

Collection-oriented languages (Sipelstein & Blelloch, Proc. IEEE v79, n4, 1991)
o Also known as data-parallel
o Specify a computation on one element; apply it to each element in a collection
  – Analogy to SIMD: single instruction on multiple data
o Specify an operation on the collection as a whole
  – Union/intersection, permute/sort, filter/select/map
  – Reduce-reorderable (A) / reduce-ordered (B)
    – (A) E.g., ADD(1,7,2): (1+7)+2 = (2+1)+7 = 10
    – (B) E.g., CONCAT("the ", "lazy ", "fox ") = "the lazy fox "
o Note the link to MapReduce … it's no accident
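To make the collection-oriented idea concrete, here is a minimal sketch in plain Python (the data values are hypothetical): the per-element function is written once and applied across the whole collection, and the reductions mirror the reorderable (addition) and ordered (concatenation) examples above.

```python
from functools import reduce

data = [1, 7, 2]

# Element-wise operation applied across the collection (map).
squares = list(map(lambda x: x * x, data))        # [1, 49, 4]

# Selection over the collection (filter).
odds = list(filter(lambda x: x % 2 == 1, data))   # [1, 7]

# Reduce-reorderable: addition gives the same answer in any grouping/order.
total = reduce(lambda a, b: a + b, data)          # (1+7)+2 == (2+1)+7 == 10

# Reduce-ordered: concatenation must preserve order.
sentence = reduce(lambda a, b: a + b, ["the ", "lazy ", "fox "])  # "the lazy fox "
```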

High Performance Computing Approach

HPC was almost the only home for parallel computing in the 90s

Physical simulation was the killer app – weather, vehicle design, explosions/collisions, etc. – replace "wet labs" with "dry labs"
o Physics is the same everywhere, so define a mesh on a set of particles, code the physics you want to simulate at one mesh point as a property of the influence of nearby mesh points, and iterate
o Bulk Synchronous Processing (BSP): run all updates of mesh points in parallel based on the values at the last time step, form the new set of values & repeat

Defined "weak scaling" for bigger machines – rather than making a fixed-size problem go faster (strong scaling), make a bigger problem go the same speed
o The most demanding users set the problem size to match total available memory
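A minimal single-process sketch of the BSP update pattern described above, using NumPy (an assumption; real HPC codes distribute the mesh across nodes): every mesh point is updated from the previous superstep's values, and only then are the new values installed for the next superstep.

```python
import numpy as np

# 1-D "mesh" of values; boundary points held fixed.
mesh = np.zeros(100)
mesh[0], mesh[-1] = 100.0, 0.0

for step in range(1000):
    # Compute every interior point from the *previous* step's values only.
    new = mesh.copy()
    new[1:-1] = (mesh[:-2] + mesh[2:]) / 2.0   # influence of neighboring mesh points
    # Barrier-like step: install all new values at once, then repeat.
    mesh = new
```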

High Performance Computing Frameworks

Machines cost O($10-100) million, so
o emphasis was on maximizing utilization of the machines (Congress checks)
o low-level speed and hardware-specific optimizations (esp. the network)
o preference for expert programmers following established best practices

Developed the MPI (Message Passing Interface) framework (e.g., MPICH)
o Launch N threads with library routines for everything you need:
  – Naming, addressing, membership, messaging, synchronization (barriers)
  – Transforms, physics modules, math libraries, etc.
o Resource allocators and schedulers space-share jobs on the physical cluster
o Fault tolerance by checkpoint/restart, requiring programmer save/restore
o Proto-elasticity: kill an N-node job & reschedule a past checkpoint on M nodes

Very manual, deep learning curve, few commercial runaway successes
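A minimal MPI-style sketch, assuming the mpi4py Python bindings are installed and the script is launched with something like `mpirun -np 4 python example.py` (both assumptions): it shows the launch-N-processes model, a barrier, and a reduction, which is the flavor of programming the slide describes.

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()      # this process's id, 0..N-1
size = comm.Get_size()      # N processes launched by mpirun

# Each rank works on its own slice of the problem.
local_value = float(rank)

# Synchronize all ranks (BSP-style barrier).
comm.Barrier()

# Combine partial results across all ranks.
total = comm.allreduce(local_value, op=MPI.SUM)

if rank == 0:
    print("sum over", size, "ranks:", total)
```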

Broadening HPC: Grid Computing

Grid computing started with commodity servers (it predates cloud)
o Frameworks were less specialized, easier to use (& less efficient)
o 1989 concept of "killer micros" that would kill off supercomputers
o Beowulf, PVM (Parallel Virtual Machine), Condor, Rocks, Sun Grid Engine

For funding reasons, grid emphasized multi-institution sharing
o So authentication, authorization, single sign-on, parallel FTP
o Heterogeneous workflows (run job A on machine B, then job C on machine D)

Basic model: jobs selected from a batch queue take over the cluster
Simplified "pile of work": when a core comes free, take a task from the run queue and run it to completion

Cloud Programming, Back to the Future

HPC demanded too much expertise, too many details and too much tuning

Cloud frameworks are all about making parallel programming easier
o Willing to sacrifice efficiency (too willing, perhaps)
o Willing to specialize to the application (rather than the machine)

The canonical Big Data user has data & processing needs that require lots of computer, but doesn't have CS or HPC training & experience
o Wants to learn the least amount of computer science needed to get results this week
o Might later want to learn more if the same jobs become a personal bottleneck

2005 NIST Arabic-English Translation Competition

(Chart: BLEU scores from 0.0 to 0.7 for the task of translating 100 articles with topic identification; "Expert translation" at the top of the scale and "Useless" near the bottom; entrants include Google, ISI, IBM, CMU, UMD, JHU-CU, Edinburgh, Systran, and Mitre.)

2005: Google wins!
o Qualitatively better 1st entry
o Not the most sophisticated approach; no one on the team knew Arabic
o Brute-force statistics, but more data & compute!!
  – 200M words from UN translations
  – 1 billion words of Arabic docs
  – 1000-processor cluster
o Can't compete without big data

Cloud Programming Case Studies

MapReduce
o Packages two Sipelstein91 operators, filter/map and reduce, as the base of a data-parallel programming model built around Java libraries

DryadLINQ
o Compiles workflows of different data-processing programs into schedulable processes

Spark
o Works to keep partial results in memory, and declarative programming

TensorFlow
o Specializes to iterative machine learning

MapReduce (Majd)

DryadLINQ

Simplify efficient data-parallel code
o Compiler support for imperative and declarative (e.g., database) operations
o Extends MapReduce to workflows that can be collectively optimized

Data flows on edges between processes at vertices (workflows)
Coding is processes at vertices and expressions representing workflows
The interesting part of the compiler operates on the expressions
o Inspired by traditional database query optimization – rewrite the execution plan into an equivalent plan that is expected to execute faster

DryadLINQ

Data flowing through a graph abstraction
o Vertices are programs (possibly different at each vertex)
o Edges are data channels (pipe-like)
o Requires programs to have no side-effects (no changes to shared state)
o Apply function similar to MapReduce reduce – open-ended user code

Compiler operates on expressions, rewriting execution sequences
o Exploits prior work on compiling workflows on sets (LINQ)
o Extends traditional database query planning with less type-restrictive code
  – Unlike traditional plans, it virtualizes resources (so it might spill to storage)
o Knows how to partition sets (hash, range and round-robin) over nodes
o Doesn't always know what processes do, so it has a less powerful optimizer than a database – where it can't infer what is happening, it takes hints from users
o Can auto-pipeline, remove redundant partitioning, reorder partitionings, etc.

Spark: Optimize/Generalize MapReduce for Iterative Apps

(Figure: MapReduce passes data through files on disk for input, temporary data, and output; Spark wants to use memory mostly.)

Machine learning apps iterate over the same data to "solve" something
o Way too much use of disk when the data is not giant

Spark is an MR rewrite: more general (Dryad-like graphs of work), more interactive (Scala interpreter) & more efficient (in-memory)

Spark Resilient Distributed Datasets (RDDs)

Spark programs are functional and deterministic: the same input means the same result
o Spark is a set/collection (called an RDD) oriented system
o This is the basis of selective re-execution and automated fault tolerance

Splits a set into partitions, and assigns them to workers to parallelize operations

Stores the invocation (code & args) with its inputs as a closure
o Treats this as a "future" – compute now or later at the system's choice (lazy)
o If code & inputs are already at node X, "args" is faster to send than results
  – Futures can be used as compression on the wire & in replica nodes

Spark Resilient Distributed Datasets (RDDs), con't

Many operators are builtins (well-known properties, like Dryad)
o Spark automates transforms when pipelining multiple builtin operations

Spark is lazy – only specific operators force computation
o E.g., materialize in the file system
o Build programs interactively, computing only when and what the user needs
o Lineage is the chain of invocations: a future on a future delays compute

Replication/FT: ship & cache RDDs on other nodes
o Can recompute everything there if needed, but mostly don't
o Saves memory space on replicas and network bandwidth
o Needs the entire lineage to be replicated in non-overlapping fault domains
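A small PySpark sketch of the points above, assuming a local Spark installation and a hypothetical input file `events.txt`: transformations only record lineage, `cache()` asks Spark to keep the RDD in memory, and nothing actually runs until an action such as `count()` forces computation.

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-laziness-demo")

lines = sc.textFile("events.txt")                 # lazy: just records the source
errors = lines.filter(lambda l: "ERROR" in l)     # lazy: lineage grows, nothing runs yet
errors.cache()                                    # hint: keep this RDD in memory

# Actions force computation; the second one reuses the cached partitions.
print(errors.count())
print(errors.filter(lambda l: "timeout" in l).count())
```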

Spark "Combining Python Functions" Example

rdd = x.map(foo).map(bar)
o Function foo() takes in a record x and outputs a record y
o Function bar() takes in a record y and outputs a record z
o Spark automatically creates a function foo_bar() that takes in a record x and outputs a record z
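A hedged illustration of that composition in PySpark, continuing with the SparkContext `sc` from the sketch above (the names `foo` and `bar` are hypothetical): because both are narrow `map` transformations, Spark pipelines them into a single pass over each partition rather than materializing the intermediate records.

```python
def foo(x):            # record x -> record y
    return x * 2

def bar(y):            # record y -> record z
    return y + 1

x = sc.parallelize(range(10))
rdd = x.map(foo).map(bar)     # Spark runs the fused x -> bar(foo(x)) in one stage
print(rdd.collect())          # [1, 3, 5, ..., 19]
```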

Next Day Plan

Encapsulation and virtual machines
o Guest lecturer: Michael Kozuch, Intel Labs

Programming Models: MapReduce
15-719/18-847b Advanced Cloud Computing, Spring 2017
Majd Sakr, Garth Gibson, Greg Ganger
January 30, 2017

Motivation

How do you perform batch processing of large data sets using low-cost clusters with thousands of commodity machines, which frequently experience partial failures or slowdowns?

Batch Processing of Large Datasets

Challenges
– Parallel programming
– Job orchestration
– Scheduling
– Load balancing
– Communication
– Fault tolerance
– Performance

Google MapReduce

A data-parallel framework for processing Big Data on large commodity clusters

Transparently tolerates
– Data faults
– Computation faults

Achieves
– Scalability and fault tolerance

Commodity Clusters

MapReduce is designed to efficiently process big data using regular commodity computers

A theoretical 1000-CPU machine would cost a very large amount of money, far more than 1000 single-CPU or 250 quad-core commodity machines

Premise: MapReduce ties smaller and more reasonably priced machines together into a single cost-effective commodity cluster to solve Big Data problems

Three Strategies

Parallelism
– Break down jobs into distributed, independent tasks to exploit parallelism

Scheduling
– Consider data locality and variations in overall system workload when scheduling

Fault tolerance
– Transparently tolerate data and task failures

Hadoop MapReduce

Hadoop is an open-source implementation of MapReduce (2006)
Hadoop presents MapReduce as an analytics engine and, under the hood, uses a distributed storage layer referred to as the Hadoop Distributed File System (HDFS)
HDFS mimics the Google File System (GFS)
Applications in MapReduce are represented as jobs
Each job encompasses several map and reduce tasks
Map and reduce tasks operate on data independently and in parallel

MapReduce In a Nutshell

MapReduce incorporates two phases: a Map phase and a Reduce phase
(Figure: map tasks write their output into partitions; the shuffle stage moves partitions to reduce tasks, which merge & sort them and write the reduced results to HDFS.)

Data Distribution

In a MapReduce cluster, data is distributed to all the nodes of the cluster as it is being loaded

An underlying distributed file system (e.g., GFS, HDFS) splits large data files into chunks which are managed by different nodes in the cluster
(Figure: the input data, a large file, is split into chunks stored on Node 1, Node 2, and Node 3.)

Even though the file chunks are distributed across several machines, they form a single namespace

Network Topology in MapReduce

MapReduce assumes a tree-style network topology
Nodes are spread over different racks housed in one or more data centers
The bandwidth between two nodes depends on their relative locations in the network topology
The assumption is that nodes on the same rack have higher bandwidth between them than nodes that are off-rack

Computing Units: Tasks

MapReduce divides the workload into multiple independent tasks and automatically schedules them on cluster nodes
The work performed by each task is done in isolation from the others
The amount of communication that tasks can perform is limited, mainly for scalability and fault-tolerance reasons

MapReduce Phases

In MapReduce, splits are processed in isolation by tasks called Mappers
The output from the mappers is denoted as intermediate output and is brought into a second set of tasks called Reducers
The process of reading intermediate output into a set of Reducers is known as shuffling
The Reducers produce the final output
Overall, MapReduce breaks the data flow into two phases: the map phase and the reduce phase
(Figure: input chunks C0-C3 feed mappers M0-M3; their intermediate output is shuffled, merged & sorted, and reduced by R0-R1 into the final output.)

Keys and Values

The programmer in MapReduce has to specify two functions, the map function and the reduce function, that implement the Mapper and the Reducer in a MapReduce program
In MapReduce, data elements are always structured as key-value (i.e., (K, V)) pairs
The map and reduce functions receive and emit (K, V) pairs

Input splits: (K, V) pairs → Map function → intermediate (K', V') pairs → Reduce function → final (K'', V'') pairs
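A minimal word-count sketch in the Hadoop Streaming style (Python scripts reading stdin and emitting tab-separated (K, V) pairs is an assumption here; the native Hadoop API is Java): the map function emits (word, 1) pairs, and the reduce function sums the values for each key, which arrive sorted after the shuffle.

```python
import sys
from itertools import groupby

def mapper(lines):
    """Map: (offset, line) -> (word, 1) pairs, written as 'word<TAB>1'."""
    for line in lines:
        for word in line.split():
            print(f"{word}\t1")

def reducer(lines):
    """Reduce: (word, [1,1,...]) -> (word, count); input arrives sorted by key."""
    pairs = (line.rstrip("\n").split("\t") for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        print(f"{word}\t{sum(int(v) for _, v in group)}")

if __name__ == "__main__":
    # Run the same script as the map step ("map") or the reduce step (anything else).
    (mapper if sys.argv[1] == "map" else reducer)(sys.stdin)
```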

Splits

A split can contain references to one or more HDFS blocks
– This is a configurable parameter; by default one split contains a reference to one HDFS block
Each map task processes one split, so the number of splits dictates the number of map tasks
Map tasks are scheduled in the vicinity of HDFS blocks to reduce network traffic

Partitions

Map tasks store intermediate output on local disk (not HDFS)
A subset of the intermediate key space is assigned to each Reducer: hash(key) mod R
These subsets are known as partitions
(Figure: different colors represent different keys, potentially from different Mappers; partitions are the input to Reducers.)
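The default partitioning rule above is just a hash modulo the number of reducers; a tiny illustrative sketch in Python (the keys and R = 3 are hypothetical):

```python
R = 3  # number of reducers / partitions

def partition(key, num_reducers=R):
    # Same rule as the slide: hash(key) mod R picks the reducer for this key.
    return hash(key) % num_reducers

for key in ["1950", "1951", "apple", "banana"]:
    print(key, "-> partition", partition(key))
```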

Hadoop MapReduce: A Closer Look

(Figure: on each node, files are loaded from the local HDFS store through an InputFormat and RecordReaders into input (K, V) pairs; Map tasks produce intermediate (K, V) pairs, which are exchanged by all nodes; the Partitioner, Sort, and Reduce stages then produce final (K, V) pairs, written back to the local HDFS store through an OutputFormat.)

Input Files

Input files are where the data for a MapReduce task is initially stored
The input files typically reside in a distributed file system (e.g., HDFS)
The format of input files is arbitrary:
– Line-based log files
– Binary files
– Multi-line input records
– Or something else entirely

InputFormat

How the input files are split up and read is defined by the InputFormat
InputFormat is a class that does the following:
– Selects the files that should be used for input
– Defines the InputSplits that break up a file
– Provides a factory for RecordReader objects that read the file

InputFormat Types

Several InputFormats are provided with Hadoop:
– TextInputFormat (the default format; reads lines of text files): key = the byte offset of the line, value = the line contents
– KeyValueInputFormat (parses lines into (K, V) pairs): key = everything up to the first tab character, value = the remainder of the line
– SequenceFileInputFormat (a Hadoop-specific high-performance binary format): key = user-defined, value = user-defined
– MyInputFormat (a user-specified input format): key = user-defined, value = user-defined

Input Splits

An input split describes a unit of data that a single map task in a MapReduce program will process
By dividing the file into splits, we allow several map tasks to operate on a single file in parallel
If the file is very large, this can improve performance significantly through parallelism
Each map task corresponds to a single input split

RecordReader

The input split defines a slice of data but does not describe how to access it
The RecordReader class actually loads data from its source and converts it into (K, V) pairs suitable for reading by Mappers
The RecordReader is invoked repeatedly on the input until the entire split is consumed
Each invocation of the RecordReader leads to another call of the map function defined by the programmer

Mapper and Reducer

The Mapper performs the user-defined work of the first phase of the MapReduce program
– A new instance of Mapper is created for each split

The Reducer performs the user-defined work of the second phase of the MapReduce program
– A new instance of Reducer is created for each partition
– For each key in the partition assigned to a Reducer, the Reducer is called once

Partitioner

Each mapper may emit (K, V) pairs to any partition
Therefore, the map nodes must all agree on where to send the different pieces of intermediate data
The partitioner class determines which partition a given (K, V) pair will go to
The default partitioner computes a hash value for a given key and assigns it to a partition based on this result

Sort (Merge)

Each Reducer is responsible for reducing the values associated with (several) intermediate keys
The set of intermediate keys on a single node is automatically sorted (merged) by MapReduce before it is presented to the Reducer

OutputFormat

The OutputFormat class defines the way (K, V) pairs produced by Reducers are written to output files
The instances of OutputFormat provided by Hadoop write to files on the local disk or in HDFS
Several OutputFormats are provided by Hadoop:
– TextOutputFormat (the default): writes lines in "key \t value" format
– SequenceFileOutputFormat: writes binary files suitable for reading into subsequent MapReduce jobs
– NullOutputFormat: generates no output files

Combiner Functions

MapReduce applications are limited by the bandwidth available on the cluster
It pays to minimize the data shuffled between map and reduce tasks
Hadoop allows the user to specify a combiner function (just like the reduce function) to be run on a map output, but only if the reduce function is commutative and associative
(Figure: map tasks on rack nodes emit (Year, Temperature) pairs, e.g. (1950, 0), (1950, 20), (1950, 10); the combiner reduces them locally to (1950, 20) before the shuffle to the reduce task. Legend: R = rack, N = node, MT = map task, RT = reduce task, Y = year, T = temperature.)
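A sketch of what the combiner in the (Year, Temperature) example does, written as plain Python for illustration rather than Hadoop's actual Combiner API: because `max` is commutative and associative, each map task can pre-reduce its own output before anything is shuffled.

```python
from collections import defaultdict

map_output = [(1950, 0), (1950, 20), (1950, 10)]   # one map task's (Year, Temp) pairs

def combine_max(pairs):
    """Local, per-map-task reduce: keep only the max temperature per year."""
    best = defaultdict(lambda: float("-inf"))
    for year, temp in pairs:
        best[year] = max(best[year], temp)
    return list(best.items())

print(combine_max(map_output))   # [(1950, 20)] -- only this is shuffled to the reducer
```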

MapReduce In a Nutshell

MapReduce incorporates two phases, Map and Reduce.

The Shuffle in MapReduce

Job Scheduling in MapReduce

In MapReduce, an application is represented as a job
A job encompasses multiple map and reduce tasks
Job schedulers in MapReduce are pluggable
Hadoop MapReduce uses a FIFO scheduler for jobs by default
– Schedules jobs in order of submission
– Starvation with long-running jobs
– No job preemption
– No evaluation of job priority or size

Multi-user Job Scheduling in MapReduce

Fair scheduler (Facebook)
– Pools of jobs; each pool is assigned a set of shares
– Jobs get (on average) an equal share of slots over time
– Across pools, use the Fair scheduler; within a pool, FIFO or Fair scheduler

Capacity scheduler (Yahoo!)
– Creates job queues
– Each queue is configured with a number (capacity) of slots
– Within a queue, scheduling is priority based
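A back-of-the-envelope sketch of the fair-scheduler idea in plain Python (not actual Hadoop scheduler code; the pool names and numbers are hypothetical): slots are divided among pools in proportion to their configured shares, and slots a pool does not need are redistributed to the pools that still have demand.

```python
def fair_allocation(total_slots, shares, demands):
    """Divide task slots across pools in proportion to their shares,
    never giving a pool more than it demands; leftover slots are
    redistributed among the still-hungry pools."""
    alloc = {p: 0 for p in shares}
    active = set(shares)
    remaining = total_slots
    while active and remaining > 0:
        total_share = sum(shares[p] for p in active)
        gave = 0
        for p in list(active):
            want = demands[p] - alloc[p]
            grant = min(want, int(remaining * shares[p] / total_share))
            alloc[p] += grant
            gave += grant
            if alloc[p] >= demands[p]:
                active.discard(p)
        if gave == 0:              # remainder too small to split; hand out one slot
            p = next(iter(active))
            alloc[p] += 1
            gave = 1
            if alloc[p] >= demands[p]:
                active.discard(p)
        remaining -= gave
    return alloc

# Example: 100 slots, equal shares, but the "ads" pool only needs 10 slots.
print(fair_allocation(100, {"ads": 1, "etl": 1, "ml": 1},
                      {"ads": 10, "etl": 80, "ml": 80}))
# -> {'ads': 10, 'etl': 45, 'ml': 45}
```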

Task Scheduling in MapReduce

MapReduce adopts a master-slave architecture
– The master node in MapReduce is referred to as the Job Tracker (JT)
– Each slave node in MapReduce is referred to as a Task Tracker (TT)
(Figure: the JT holds a queue of tasks T0, T1, T2, … and assigns them to task slots on the TTs.)

MapReduce adopts a pull scheduling strategy rather than a push one
– I.e., the JT does not push map and reduce tasks to TTs; rather, TTs pull them by making requests

Map and Reduce Task Scheduling

Every TT periodically sends a heartbeat message to the JT encompassing a request for a map or a reduce task to run

I. Map task scheduling: the JT satisfies requests for map tasks by attempting to schedule mappers in the vicinity of their input splits (i.e., it considers locality)

II. Reduce task scheduling: the JT simply assigns the next yet-to-run reduce task to a requesting TT, regardless of the TT's network location and its implied effect on the reducer's shuffle time (i.e., it does not consider locality)
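A conceptual sketch of that asymmetry in plain Python (not Hadoop's actual JobTracker code; the function and parameter names are hypothetical): on a TaskTracker's heartbeat, the JT prefers a map task whose input split already lives on that node, while reduce tasks are handed out with no locality check.

```python
def on_heartbeat(tt_node, pending_maps, pending_reduces, wants_map):
    """pending_maps: list of (task_id, set_of_nodes_holding_its_split)."""
    if wants_map:
        # Prefer a map task whose input split is local to this node.
        for i, (task, replica_nodes) in enumerate(pending_maps):
            if tt_node in replica_nodes:
                return pending_maps.pop(i)[0]              # data-local assignment
        return pending_maps.pop(0)[0] if pending_maps else None  # fall back: any map task
    # Reduce tasks: just hand out the next one, no locality considered.
    return pending_reduces.pop(0) if pending_reduces else None
```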

Task Scheduling

Task Scheduling in Hadoop

A golden principle adopted by Hadoop is: "Moving computation towards data is cheaper than moving data towards computation"
– Hadoop applies this principle to map task scheduling

With map task scheduling, once a slave (or TaskTracker, TT) polls for a map task M at the master node (or JobTracker, JT), the JT attempts to assign the TT an M that has its input data local to that TT
(Figure: TaskTracker1 on rack 1 requests a map task; the JobTracker schedules, at an empty map slot on TaskTracker1, a map task whose input split is local to it, chosen from pending tasks MT1, MT2, MT3.)

Task Scheduling in Hadoop

Hadoop does not apply the locality principle to reduce task scheduling
With reduce task scheduling, once a slave (or TaskTracker, TT) polls for a reduce task R at the master node (or JobTracker, JT), the JT assigns the TT any R
(Figure: a locality problem, where R is scheduled at TT1 while its partitions exist at TT4, so the partitions must be shuffled across the core switch. CS = core switch, RS = rack switch.)

Fault Tolerance in Hadoop

Data redundancy
– Achieved at the storage layer through replicas (the default is 3)
– Stored at physically separate machines
– Can tolerate corrupted files and faulty nodes
– HDFS computes checksums for all data written to it and verifies them when reading

Task resiliency (task slowdown or failure)
– Monitors to detect faulty or slow tasks
– Replicates tasks

Task Resiliency

MapReduce can guide jobs toward a successful completion even when jobs are run on a large cluster where the probability of failure increases
The primary way that MapReduce achieves fault tolerance is by restarting tasks
If a TT fails to communicate with the JT for a period of time (by default, 1 minute in Hadoop), the JT assumes that the TT in question has crashed
– If the job is still in the map phase, the JT asks another TT to re-execute all Mappers that previously ran at the failed TT
– If the job is in the reduce phase, the JT asks another TT to re-execute all Reducers that were in progress on the failed TT

Speculative Execution

A MapReduce job is dominated by its slowest task
MapReduce attempts to locate slow tasks (stragglers) and run redundant (speculative) tasks that will optimistically commit before the corresponding stragglers
This process is known as speculative execution
Only one copy of a straggler is allowed to be speculated
Whichever copy of a task commits first becomes the definitive copy, and the other copy is killed by the JT

Locating Stragglers

How does Hadoop locate stragglers?
Hadoop monitors each task's progress using a progress score between 0 and 1, based on the amount of data processed
If a task's progress score is less than (average – 0.2), and the task has run for at least 1 minute, it is marked as a straggler
(Figure: T1 with progress score 2/3 is not a straggler; T2 with progress score 1/12 is a straggler.)
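The rule above reduces to a couple of comparisons; a minimal sketch, assuming progress scores in [0, 1] and runtimes in seconds (the task list is hypothetical):

```python
def find_stragglers(tasks, threshold=0.2, min_runtime=60):
    """tasks: list of (task_id, progress_score, runtime_seconds).
    A task is a straggler if its progress is more than `threshold` below
    the average and it has run for at least `min_runtime` seconds."""
    avg = sum(p for _, p, _ in tasks) / len(tasks)
    return [tid for tid, p, run in tasks
            if p < avg - threshold and run >= min_runtime]

# T1 at 2/3 progress is fine; T2 at 1/12 after 90 seconds is a straggler.
print(find_stragglers([("T1", 2/3, 90), ("T2", 1/12, 90)]))   # ['T2']
```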

Issues with Speculative Execution

Susceptible in heterogeneous environments
– If there is transient congestion, lots of speculative tasks are launched

Launches speculative tasks without checking the speed of the TT or the load of the speculative task
– A slow TT will only become slower

Locality always trumps
– If there are 2 speculative tasks, ST1 & ST2, for stragglers T1 at 70% and T2 at 20% progress, and the free task slot is local to ST2's HDFS block, ST2 gets scheduled

The three reduce stages are treated equally
– The shuffle stage is typically slower than the merge & sort and reduce stages

MapReduce Applications

(Figure: the data pattern is map input → shuffled data → output data; the map input may come from local disk or the network, and the shuffled and output data cross the network to and from the reduce stage.)

Shuffle data / map input ratio, with examples:
– roughly 0: Sobel edge detection
– less than 1: Grep
– about 1: Sort
– greater than 1: some IR apps

Grep Example

(Figure: Gantt chart of Grep map & reduce task details. About 15 map tasks each run roughly 5-11 seconds; the single reduce task spends about 0:35 shuffling, 0:36 sorting, and 0:37 in the reduce stage; the whole job finishes in under a minute.)

TeraSort Example

(Figure: Gantt chart of TeraSort map & reduce task details. About 17 map tasks each run roughly 9-31 seconds; the single reduce task spends about 1:28 shuffling, 1:50 sorting, and 2:54 in the reduce stage.)

What Makes MapReduce Popular?

MapReduce is characterized by:
1. Its simplified programming model, which allows the user to quickly write and test distributed systems
2. Its efficient and automatic distribution of data and workload across machines
3. Its flat scalability curve; specifically, after a MapReduce program is written and functioning on 10 nodes, very little (if any) work is required to make that same program run on 1000 nodes
4. Its fault tolerance approach
