Hardware Accelerated Mappers for Hadoop MapReduce Streaming

Katayoun Neshatpour, Maria Malik, Avesta Sasan, Setareh Rafatirad, Houman Homayoun

(The authors are with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030. E-mail: {kneshatp, mmalik9, asasan, srafatir, hhomayou}@gmu.edu. This paper has been presented in part as a poster in CCGRID 2015 [1] and as an invited paper in CODES+ISSS 2016 [2].)

Abstract—Heterogeneous architectures have emerged as an effective solution to address energy-efficiency challenges. This is particularly true in data centers, where integrating FPGA hardware accelerators with general-purpose processors, such as big Xeon or little Atom cores, introduces enormous opportunities to address the power, scalability, and energy-efficiency challenges of processing emerging applications, in particular in the domain of big data. The rise of hardware accelerators in data centers therefore raises several important research questions: What is the potential for hardware acceleration in MapReduce, a de facto standard for big data analytics? What is the role of the processor after acceleration; is a big or a little core better suited to run big data applications post hardware acceleration? This paper answers these questions through methodical real-system experiments on state-of-the-art hardware acceleration platforms. We first present the implementation of four widely used big data applications on a heterogeneous CPU+FPGA architecture. We develop the MapReduce implementation of K-means, K-nearest neighbor, support vector machine, and naive Bayes in a Hadoop Streaming environment, which allows developing mapper functions in a non-Java language suited for interfacing with an FPGA-based hardware acceleration environment. We present a full implementation of the HW/SW mappers on an existing FPGA+core platform and evaluate how a cluster of CPUs equipped with FPGAs uses the accelerated mappers to enhance the overall performance of MapReduce. Moreover, we study how various parameters at the application, system, and architecture levels affect the performance and power-efficiency benefits of Hadoop Streaming hardware acceleration. This analysis helps to better understand how the presence of hardware accelerators for Hadoop MapReduce changes the choice of CPU, the tuning of optimization parameters, and scheduling decisions for performance and energy-efficiency improvement. The results show promising speedup and energy-efficiency gains of up to 5.7x and 16x, respectively, in an end-to-end Hadoop implementation using a semi-automated HLS framework. The results suggest that HW/SW acceleration yields significantly higher speedup on little cores, reducing the performance gap between little and big cores after acceleration. On the other hand, the energy-efficiency benefit of HW/SW acceleration is higher on the big cores, which reduces the energy-efficiency gap between little and big cores. Overall, the experimental results show that a low-cost embedded FPGA platform, programmed using a semi-automated HW/SW co-design methodology, brings significant performance and energy-efficiency gains for Hadoop MapReduce computing in cloud-based architectures and significantly reduces the reliance on a large number of big high-performance cores.

Index Terms—FPGA acceleration, hardware software co-design, MapReduce, Hadoop streaming, Big-little core

1 INTRODUCTION

Emerging big data analytics applications require a significant amount of server computational power.
However, these applications share many inherent characteristics that are fundamentally different from traditional desktop, parallel, and scale-out applications [3]. They rely heavily on specific deep machine learning and data mining algorithms. The characteristics of big data applications necessitate a change in the direction of server-class microarchitecture to improve their computational efficiency. However, while the demand for data center computational resources continues to grow with the size of data, the semiconductor industry has reached scaling limits and is no longer able to reduce power consumption in new chips. Thus, current server designs, based on commodity homogeneous processors, are no longer efficient in terms of performance/watt for processing big data applications [4].

To address the energy-efficiency problem, heterogeneous architectures have emerged to allow each application to run on a core that matches its resource needs better than a one-size-fits-all processing node. In the big data domain, various frameworks have been developed that allow the processing of large data sets with parallel and distributed algorithms. MapReduce [5] is an example of such a framework developed by Google, which achieves high scalability and fault-tolerance for a variety of applications. While hardware acceleration has been applied to software implementations of widely used applications, MapReduce implementations of such applications require new techniques that study their MapReduce structure and its bottlenecks, and select the most efficient functions for acceleration [6].

The rise of hardware accelerators in data centers raises several important research questions: What is the potential for hardware acceleration in MapReduce, a de facto standard for big data analytics? How much performance benefit does a semi-automated high-level synthesis framework, originally intended for conventional compute-intensive applications, bring to accelerating big data analytics applications? What is the role of the processor after acceleration; is a big or a little core better suited to run big data applications post hardware acceleration? How does tuning optimization parameters at the system, architecture, and application levels for performance and energy-efficiency improvement change before and after hardware acceleration? And how does the presence of a hardware accelerator change scheduling decisions for performance and energy-efficiency optimization? This paper answers all of the above questions through methodical real-system experiments on state-of-the-art hardware acceleration platforms.

To understand the potential performance gain of using a semi-automated standard HW/SW co-design methodology to accelerate analytics applications in a MapReduce environment, in this paper we present a MapReduce implementation of big data analytics applications on a 12-node MapReduce server cluster and evaluate a heterogeneous CPU+FPGA architecture, taking into account various communication and computation overheads in the system, including the communication overhead with the FPGA. We offload the hotspot functions in the mapper to the FPGA. We measured power and execution time on the server and the FPGA board. To the best of our knowledge, this is the first empirical work that focuses on the acceleration of Hadoop Streaming for non-Java map and reduce functions to find architectural insights. For evaluation purposes, we perform the following tasks in this paper:

- MapReduce parallel implementation of various data mining and machine-learning applications in C through Hadoop Streaming.
- Implementation of the HW/SW co-design of the mapper functions on the FPGA+core platform.
- Evaluation of the overall speedup, power, and energy-efficiency of the system, considering various hardware communication and software computation overheads in the Hadoop MapReduce environment.
- Performance and energy-efficiency analysis of the hardware acceleration based on application-level (size of input data), system-level (number of mappers running simultaneously per node and data split size), and architecture-level (big vs. little core) parameters.

Consequently, we make the following major observations:

- The optimal application, architecture, and system level parameters that maximize performance and energy-efficiency are different before and after acceleration.
- HW/SW acceleration yields higher speedup on little Atom cores, therefore significantly reducing the performance gap between little and big cores after acceleration.
- HW/SW acceleration yields higher energy-efficiency improvement on big Xeon cores, therefore significantly reducing the energy-efficiency gap between little and big cores.
- In the presence of hardware acceleration, we can reduce the number of mapper/reducer slots (active cores) and still be as energy-efficient as a case in which all the available cores are active, without significant performance loss.
- HW/SW acceleration improves scheduling decisions and provides more opportunities for fine-tuning of parameters to further enhance performance.

The rest of the paper is organized as follows: In Sec. 2 we provide a background on MapReduce. In Sec. 3 and 4 the system architecture and methodology are described, respectively. Sec. 6 introduces the studied big data applications. Sec. 5 describes the model for the calculation of the execution times after the acceleration. Sec. 7 describes the HW/SW co-design of the mapper functions. In Sec. 9 we show the results and carry out a sensitivity analysis on different configurations. In Sec. 10 and 11, we study scheduling of various workloads and the scalability of our framework, respectively. In Sec. 12, we discuss the related work. Finally, Sec. 13 concludes the paper.
Fig. 1. Hadoop MapReduce: Computational Framework Phases [7].

Fig. 2. Timing of various Hadoop phases.

2 HADOOP MAPREDUCE

MapReduce is the programming model developed by Google to handle large-scale data analysis. Fig. 1 shows the various phases in the MapReduce platform. The map functions parcel out the work to different nodes in the distributed cluster. They process key/value pairs to generate a set of intermediate key/value pairs. The reduce functions merge all the intermediate values with the same intermediate key and collate the work to resolve the results.

Apache Hadoop is an open-source Java-based framework implementing MapReduce. It assists the processing of large datasets in a distributed computing environment and stores data in the highly fault-tolerant Hadoop distributed file system (HDFS).

2.1 Timing

Fig. 2 shows the timing diagram of a MapReduce application with 15 map jobs, one reduce job, and four slots. A slot is a map/reduce computation resource unit at a node. The map phase starts with the start of the first map task and finishes when the last map task completes its execution. Shuffle starts shortly after the first map, and will not complete until all the map tasks are finished. A low-volume shuffle finishes shortly after the last map (e.g., Fig. 2), while a high-volume shuffle takes longer to complete. The sort phase finishes after the shuffle. Reduce starts after all the data is sorted. Upon the completion of all the reduce tasks, the whole MapReduce job is finished.

In the MapReduce platform, a significant portion of the execution time is devoted to the map and reduce phases, as they carry out the computation part. In this paper, we target the map phase for acceleration, as it accounts for a higher portion of the execution time across the studied machine learning kernels.
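For illustration, with the configuration of Fig. 2 (n = 15 map tasks and M = 4 slots), three slots execute ⌈15/4⌉ = 4 map tasks each and one slot executes 3; the map phase ends only when the last of these slots finishes its final task, so a single slow slot stretches the entire phase.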

3 SYSTEM ARCHITECTURE

In a general-purpose CPU, several identical cores are connected to each other through their shared-memory distributed interconnect. For hardware acceleration, however, each core is extended with a small FPGA fabric. We study how adding on-chip FPGAs to each core would enhance the performance of the architecture that runs Hadoop MapReduce. Fig. 3 shows the system architecture of the proposed multi-node platform studied in this paper. The single-node architecture is identical to the DataNode.

Fig. 3. System architecture for a multi-node cluster.

TABLE 1
Architectural parameters

Parameter           | Intel Atom C2758 | Intel Xeon E5-2420 | Intel Xeon E5-2670
Cores\Threads       | 8\8              | 6\12               | 8\16
Operating Frequency | 2.4 GHz          | 1.9 GHz            | 2.6 GHz
Micro-architecture  | Silvermont       | Sandy Bridge       | Sandy Bridge
L1i Cache           | 32 KB            | 32 KB              | 32 KB
L1d Cache           | 24 KB            | 32 KB              | 32 KB
L2 Cache            | 4 MB             | 256 KB             | 256 KB
L3 Cache            | --               | 15 MB              | 20 MB
System Memory       | 8 GB             | 32 GB              | 365 GB
TDP                 | 20 W             | 95 W               | 115 W

3.1 Single-node

While in a general-purpose CPU the mapper/reducer slots are mapped to single cores, in the heterogeneous architecture depicted in Fig. 3 each mapper/reducer slot is mapped to a core that is integrated with the FPGA. Given the tight integration between the FPGA and the CPU, the interconnection interface between the two is the main overhead in this architecture. Thus, the mapper/reducer slots are accelerated with the FPGA without any high off-chip data transfer overhead.

For implementation purposes, we compare two types of core architectures: the little-core Intel Atom C2758 and the big-core Intel Xeon E5-2420. These two types of servers represent two schools of thought in server architecture design: big Xeon cores, the conventional approach to designing a high-performance server, and Atom, which uses low-power cores to address the dark silicon challenge facing servers [8]. Table 1 shows the details of the studied servers.

Moreover, each FPGA in Fig. 3 is a low-cost Xilinx Artix-7 with 85K logic cells and 560 KB of block RAM. The integration between the core and the FPGA is compatible with the Advanced Microcontroller Bus Architecture (AMBA). Specifically, we utilize measurements for the Advanced eXtensible Interface (AXI) interconnect. AXI is an interface standard through which different components communicate with each other. The data transferred between the core and the FPGA is rearranged to create transfer streams. A direct memory access (DMA) engine is used to move streams in and out of the shared memory between the FPGA and the core, which provides high-bandwidth direct memory access between the AXI-stream and the IP interfaces implemented on the FPGA.

3.2 Multi-node

The architecture of the multi-node cluster consists of a homogeneous CPU as the NameNode, which is connected to several DataNodes with heterogeneous architectures. The architecture of each DataNode is similar to that in Fig. 3.

The NameNode is responsible for job scheduling among all the DataNodes. It is configured to distribute the computation workloads among the DataNodes. The number of mapper/reducer slots on each DataNode is based on its number of cores. The interconnection between the NameNode and the DataNodes is established through a multi-channel Gigabit switch to allow high data transfer rates. For implementation purposes, we use a 12-node cluster with E5-2670 CPUs. Table 1 shows the specifications of the E5-2670 CPUs.
4 METHODOLOGY

We develop a MapReduce version of each studied machine learning application for execution on Apache Hadoop. It should be noted that Hadoop applications are mostly implemented in Java; however, the FPGA+CPU platforms allow hardware acceleration of C-based applications. While native C/C++ injection into Hadoop is on the way, various utilities have been used to allow Hadoop to run applications whose map and reduce functions are developed in languages other than Java. Hadoop Pipes [9] and Hadoop Streaming [10] are examples of such utilities. In this paper, we use Hadoop Streaming, a utility that comes with the standard Hadoop distribution. It allows running MapReduce jobs with any executable or script as the mapper and/or the reducer. The Hadoop Streaming utility creates a MapReduce job and submits it to an appropriate cluster.
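To make the streaming interface concrete, the following is a minimal sketch of a non-Java mapper written in C, in the spirit of the K-means mapper studied in this paper; the cluster count, dimensionality, centroid values, and input format are illustrative assumptions and do not reproduce our actual implementation. Hadoop Streaming delivers input records on standard input and expects tab-separated key/value pairs on standard output.

/*
 * Sketch of a K-means-style mapper for Hadoop Streaming.
 * Reads one comma-separated feature vector per line from stdin and
 * emits "<nearest centroid id> TAB <feature vector>" on stdout.
 * K, DIM, and the centroid values are placeholders; in practice the
 * centroids would be shipped to each map task and read at startup.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <float.h>

#define K   4   /* number of clusters (assumed)        */
#define DIM 2   /* features per input record (assumed) */

static const double centroid[K][DIM] = {
    {0.0, 0.0}, {5.0, 5.0}, {10.0, 0.0}, {0.0, 10.0}
};

int main(void) {
    char line[4096];
    while (fgets(line, sizeof(line), stdin)) {
        double x[DIM];
        int d = 0;
        char *tok = strtok(line, ",\n");
        while (tok && d < DIM) {            /* parse the feature vector */
            x[d++] = atof(tok);
            tok = strtok(NULL, ",\n");
        }
        if (d < DIM)
            continue;                       /* skip malformed records */

        /* nearest-centroid search: the squared Euclidean distance loop
         * is the compute hotspot that the co-design flow offloads     */
        int best = 0;
        double best_d = DBL_MAX;
        for (int k = 0; k < K; k++) {
            double dist = 0.0;
            for (int j = 0; j < DIM; j++) {
                double diff = x[j] - centroid[k][j];
                dist += diff * diff;
            }
            if (dist < best_d) { best_d = dist; best = k; }
        }

        printf("%d\t", best);               /* key: cluster id        */
        for (int j = 0; j < DIM; j++)       /* value: the input point */
            printf(j ? ",%f" : "%f", x[j]);
        printf("\n");
    }
    return 0;
}

Such an executable is handed to the streaming utility through its -mapper option (and a corresponding reducer through -reducer); no modification of the Hadoop framework itself is needed.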

This paper aims to characterize data mining and machine learning applications. Accordingly, the C-based implementations of such applications were developed and executed on Apache Hadoop Streaming. Subsequently, two levels of profiling are carried out.

4.1 Profiling of the Application on MapReduce

As mentioned earlier, the MapReduce platform is comprised of several execution phases; however, in this study we focus on the acceleration of the map phase. In order to calculate the potential speedup on the Hadoop platform after the acceleration of the map functions for each application, we need a detailed analysis and profiling of the various phases (i.e., map, reduce, shuffle, etc.). We use the timing information to calculate the execution time of each phase before the acceleration.

4.2 Profiling of the Map Function

To accelerate the map functions through HW/SW co-design, we profile the map functions in order to find the execution time of their different sub-functions and to select which functions should be offloaded to hardware on the FPGA and which ones should remain in software on the core (i.e., the SW part).

The map function is carried out on data splits. The data split size is mostly the size of one HDFS block for data locality purposes, which varies from 64 MB and 128 MB to higher values [11]. For each application, we execute the map function on the data splits and profile it on the big and little server architectures. Profiling of the map functions on these two architectures determines the sub-functions better suited for FPGA implementation.

The execution time of the accelerated map function is comprised of three parts. First, the SW part of the map function that remains on the core (t_sw,Atom and t_sw,Xeon on Atom and Xeon, respectively), which is calculated based on the profiling of the map functions on Intel Xeon and Atom using the Perf tool. Second, the HW part of the map function that is offloaded to the FPGA (t_hw), which is calculated by measurements from the FPGA implementations. And third, the data transfer time between the FPGA and the core (t_tr). The calculation of the transfer time requires a platform that allows the integration of the FPGA with the core.

In the proposed framework, the time-intensive sub-functions within the map functions are targeted for acceleration. The memory access pattern of the map functions is highly regular, with low data dependencies. Thus, profiling and measurement of execution time using gprof for the map functions takes into account the time to bring the data from main memory. In the accelerated map function, the selected sub-function is replaced with an accelerator; its execution time is replaced with the processing time of the accelerator plus the time to stream the data between the map sub-functions on the CPU and the accelerated map sub-functions on the FPGA.

In the absence of a variety of high-performance CPU plus on-chip FPGA platforms, and to demonstrate how on-chip integration of the CPU and FPGA allows accelerating sub-functions within map functions, and subsequently the overall MapReduce applications, rather than using raw and optimistic values reported for the AXI interconnect delay and bandwidth, we use timer functions on the Zynq to measure the timing of the transmission.

To estimate the execution time after acceleration accurately, and for functional verification, we fully implement the map functions on the ZedBoard. The ZedBoard (featuring the XC7Z020 Zynq) integrates two 667 MHz ARM Cortex-A9 cores with an Artix-7 FPGA with 85K logic cells and 560 KB of block RAM. The connection between the core and the FPGA is established through the AXI interconnect [12]. To measure the time required to send the data of each map function to the accelerator and the processing time of the FPGA, we add timer IPs on the FPGA.
Using the timers, we measure the data transfer time (t_tr) and the accelerator time (t_hw). The measurements are used as an estimate for a framework in which the transmission link and the FPGA are identical to those used in the Zynq platform, while the CPU is an Intel Atom or Intel Xeon.

Based on the timings obtained from the full implementations on the Artix-7 FPGA in this platform, the execution time of the map function after acceleration is calculated as t_sw,Xeon + t_hw + t_tr and t_sw,Atom + t_hw + t_tr for Xeon and Atom, respectively, assuming a tight on-chip integration of the Intel Atom and Xeon cores with the studied FPGA. We compare the execution time of the map function before and after the acceleration to yield the speedup of the map functions through HW/SW co-design.

Finally, based on the information about the execution time of each phase and the speedup of the map functions, we perform a comprehensive analysis to find out how the acceleration of the map function contributes to the acceleration of the entire application on Hadoop MapReduce.

5 ESTIMATING HADOOP MAPREDUCE SPEEDUP

In order to calculate the potential speedup on the Hadoop platform after the acceleration of the map functions, we perform the following two steps: In the first step, the speedup of the map function through HW/SW co-design is calculated. In the second step, the speedup of the overall MapReduce job is calculated, when the map function is accelerated at the rate calculated in the first step. As mentioned earlier, given the tight integration of the FPGA and the CPU, the main overhead is the on-chip data transfer between the core and the FPGA, which is measured using the timers implemented on the Zynq platform.

5.1 Modelling Speedup of the Map Function through HW/SW Co-design

A comparison of various models of computation for hardware/software co-design has been presented in [13]. The classical FSM representation, or various extensions of it, is the most well-known model for describing control systems; it consists of a set of states, inputs, outputs, a function that defines the outputs in terms of inputs and states, and a next-state function. Since FSMs do not allow concurrency of states, and because the number of states grows exponentially as system complexity rises, they are not the optimal solution for modeling HW/SW co-design. Dataflow graphs have been quite popular in modeling data-dominated systems [14]. In such modeling, a computationally intensive system and/or considerable transportation of data is conveniently represented by a directed graph where the nodes describe computations and the arcs represent the order in which the computations are performed.

In the case of acceleration of the map phase in a MapReduce platform, what needs to be taken into account is the highly parallel nature of the map functions, which allows higher acceleration through concurrent processing of multiple computations that have no data dependencies. Most efforts on modeling hardware/software co-design have found data dependencies to be an important barrier to the extent to which a function can be accelerated; however, this is not the case for MapReduce. In the mapper part of most machine learning applications, a small function, e.g., an inner product or a Euclidean distance calculation, is the most time-consuming part of the code, and multiple instances of this small function can be executed in parallel with no data dependencies. In such cases, a simple queuing network can be deployed to model a map function with only one accelerated sub-function.
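As an illustration of this data-parallel structure, the following is a minimal sketch of the kind of distance kernel that could be offloaded, written in an HLS-friendly C style; the batch size, dimensions, and pragmas are illustrative assumptions and are not the accelerator IPs developed in this paper. Because the points in a batch are mutually independent, the outer loop can be pipelined while the short inner loops are unrolled.

#include <float.h>

#define BATCH 256   /* points streamed to the accelerator per call (assumed) */
#define K     4     /* clusters (assumed)   */
#define DIM   2     /* dimensions (assumed) */

/* Nearest-centroid labels for a batch of points; every iteration of the
 * outer loop is independent, so it can be pipelined.                    */
void nearest_centroid(const float point[BATCH][DIM],
                      const float centroid[K][DIM],
                      int label[BATCH])
{
    for (int i = 0; i < BATCH; i++) {
#pragma HLS PIPELINE II=1
        int best = 0;
        float best_d = FLT_MAX;
        for (int k = 0; k < K; k++) {
#pragma HLS UNROLL
            float dist = 0.0f;
            for (int j = 0; j < DIM; j++) {
#pragma HLS UNROLL
                float diff = point[i][j] - centroid[k][j];
                dist += diff * diff;
            }
            if (dist < best_d) { best_d = dist; best = k; }
        }
        label[i] = best;
    }
}

Fixed loop bounds and statically sized arrays are what allow a high-level synthesis tool such as Vivado HLS to report a cycle count for such a kernel, which is how t_hw is obtained below.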

Queuing system models are useful for analyzing systems where inputs arrive sporadically or the processing time for a request may vary [14]. In a queuing model, customers (in this case, the data to be processed by the accelerator) arrive at the queue at some rate; the customer at the head of the queue is immediately taken by the processing node (the accelerator hardware), but the amount of time spent by the customer in processing must be specified (the service time). Typically, both the customer arrival rate and the processing time are modeled as Poisson random variables. In our case, however, since we are using one or multiple copies of the same accelerator, the service time for all data is fixed and is determined by the maximum frequency of the accelerator on the FPGA and the number of clock cycles it takes to finish the processing of each batch of data. Moreover, we assume that the data arrives at the accelerator at a fixed rate, determined by the processing speed of the accelerator.

Let T_map be the execution time of the map function on the data splits before the acceleration. Accordingly, based on the discussion in Sec. 4, the speedup of the map function is calculated as follows:

$$S_{map} = \frac{T_{map}}{t_{sw} + t_{hw} + t_{tr}}, \quad (1)$$

where t_sw is the SW part of the map function that remains on the core (t_sw,Atom and t_sw,Xeon on Atom and Xeon, respectively), t_hw and t_tr are the HW time and the transfer time derived from the timers in the Zynq implementation, and S_map is the speedup of the map function.

To calculate t_hw, we fully optimize the accelerator IPs using Vivado HLS. Vivado HLS reports the delay of the IPs in terms of the number of cycles. Based on the frequency of the accelerator in the implementations, the processing time of the IPs is derived (t_hw). To calculate t_tr, timer IPs are added on the FPGA. In the software code, the timer is reset when a transfer to the PL (programmable logic) is started. The timer is stopped when Tlast is asserted, which happens after the last packet is processed by the PL. Thus, the timer indicates the time required for the transfer of data through DMA plus the processing time (t_hw + t_tr). Since t_hw is known from the HLS latency results and the frequency, t_tr is calculated by subtracting t_hw from the timer IP measurement.

The t_hw and t_tr measurements are used as an estimate for a framework in which the transmission link and the FPGA are identical to those used in the Zynq platform, while the timing measurements for the execution time on the ARM core are discarded.
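The bookkeeping behind equation (1) is simple; the short C sketch below walks through it with placeholder numbers (the cycle count, clock frequency, and timings are illustrative assumptions, not measured values from our experiments).

/*
 * Sketch of the speedup calculation in equation (1).
 * All constants are hypothetical placeholders.
 */
#include <stdio.h>

int main(void) {
    /* accelerator latency reported by HLS (cycles) and its clock */
    double hls_cycles  = 12000.0;
    double clk_hz      = 100e6;                    /* assumed 100 MHz       */
    double t_hw        = hls_cycles / clk_hz;      /* accelerator time (s)  */

    /* the timer IP on the FPGA measures DMA transfer plus processing */
    double timer_total = 2.0e-4;                   /* t_hw + t_tr (assumed) */
    double t_tr        = timer_total - t_hw;       /* transfer time         */

    /* SW part of the map function left on the CPU (from perf, assumed) */
    double t_sw        = 3.0e-4;

    /* map-function time on the data split before acceleration (assumed) */
    double t_map       = 2.0e-3;

    double s_map = t_map / (t_sw + t_hw + t_tr);   /* equation (1) */
    printf("t_hw = %.1e s, t_tr = %.1e s, S_map = %.2f\n", t_hw, t_tr, s_map);
    return 0;
}

With these placeholder values the map function sees a 4x speedup; the actual figures depend on the application and on which sub-functions are offloaded.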
5.2 Speedup on MapReduce

Assuming a platform with M mapper/reducer slots and n input data splits (n map tasks), each slot runs ⌈n/M⌉ or ⌊n/M⌋ map tasks. For simplicity and without loss of generality, we assume that n is a multiple of M. Thus, each mapper slot runs exactly n/M map jobs. The execution time of the map phase is calculated as follows:

$$T_{map} = \max_{1 \le m \le M} \left[ \sum_{i=1}^{n/M} \left( TF_m^i - TS_m^i \right) + \sum_{i=1}^{n/M - 1} TI_m^i \right], \quad (2)$$

where TS_m^i and TF_m^i are the start and finish times of the i-th map task on the m-th slot, TI_m^i is the time interval for the m-th slot between the end of the i-th map task and the start of the next map task, and T_map is the total time of the map phase. HW/SW acceleration will only speed up the first term in (2).

Let us assume that α is the fraction of time that is accelerated. Then the overall MapReduce speedup is derived from

$$Speedup = \frac{1}{(1-\alpha) + \frac{\alpha}{S_{map}}}, \quad (3)$$

where

$$\alpha = \frac{\max_{1 \le m \le M} \sum_{i=1}^{n/M} \left( TF_m^i - TS_m^i \right)}{T}, \quad (4)$$

and T is the total execution time.

This methodology was applied to the acceleration of the mapper functions; however, the same procedure is applicable to the acceleration of other computational phases of MapReduce, including the reduce phase.

5.3 Upper and Lower Bounds of Speedup

In this section, we analyze the upper and lower bounds of the speedup on an end-to-end Hadoop platform given a map-function speedup of S_map.

Fig. 4. Overall speedup as a function of α.

Fig. 4 shows the overall speedup as a function of α, which is always lower than 1. Different application types, and system- and architecture-level configurations, yield different values of α. Based on this figure, as α increases, the overall speedup is enhanced. It would approach infinity at α = S_map/(S_map − 1), which is higher than 1 and therefore out of the range of acceptable α. The highest acceleration is realized when all of the execution time is devoted to the accelerated phase, in which case an acceleration in the range of S_map is obtained.

Considering that not all parts of the map and/or reduce tasks are well suited for hardware acceleration, and that not all phases of MapReduce (shuffle, sort, etc.) are accelerated, the extent of the achievable speedup is limited. As a result, the acceleration can be limited to 20% as in [15], or up to 1.8x in [16], 2.6x in [17], 3x in [18], and 4.1x as shown in Table 3.
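As a numeric illustration with assumed values (not measured results): for a map-function speedup of S_map = 4 and an accelerated fraction α = 0.8, equation (3) gives

$$Speedup = \frac{1}{(1-0.8) + \frac{0.8}{4}} = \frac{1}{0.4} = 2.5,$$

well below S_map itself, while the singularity α = S_map/(S_map − 1) = 4/3 lies above 1, so the overall speedup always remains finite.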

6 MAPREDUCE PARALLEL IMPLEMENTATION IN HADOOP STREAMING

The machine-learning applications studied in this paper include various commonly used classification and clustering algorithms.
