Hadoop Performance Modeling For Job Estimation And Resource Provisioning

Transcription

DOI: 10.1109/TPDS.2015.2405552, IEEE Transactions on Parallel and Distributed Systems

Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang and Changjun Jiang

Abstract— MapReduce has become a major computing model for data intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by an increasingly growing user community. Cloud computing service providers such as Amazon EC2 Cloud offer the opportunities for Hadoop users to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97% to 95.51%, and jobs are completed within the required deadlines when following the resource provisioning scheme of the proposed model.

Index Terms— Cloud computing, Hadoop MapReduce, performance modeling, job estimation, resource provisioning

Author affiliations: Mukhtaj Khan is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK (Mukhtaj.Khan@brunel.ac.uk). Yong Jin is with the National Key Lab for Electronic Measurement Technology, North University of China, Taiyuan 030051, China, and is a Visiting Professor in the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK (Yong.Jin@brunel.ac.uk). Maozhen Li is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK, and also with the Key Laboratory of Embedded Systems and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China (Maozhen.Li@brunel.ac.uk). Changjun Jiang and Yang Xiang are with the Department of Computer Science & Technology, Tongji University, 1239 Siping Road, Shanghai 200092, China ({cjjiang, shxiangyang}@tongji.edu.cn).

I. INTRODUCTION

Many organizations are continuously collecting massive amounts of datasets from various sources such as the World Wide Web, sensor networks and social networks. The ability to perform scalable and timely analytics on these unstructured datasets is a high priority task for many enterprises. It has become difficult for traditional network storage and database systems to process these continuously growing datasets. MapReduce [1], originally developed by Google, has become a major computing model in support of data intensive applications.
It is a highly scalable, fault-tolerant and data parallel model that automatically distributes the data and parallelizes the computation across a cluster of computers [2]. Among its implementations such as Mars [3], Phoenix [4], Dryad [5] and Hadoop [6], Hadoop has received a wide uptake by the community due to its open source nature [7][8][9][10].

One feature of Hadoop MapReduce is its support of public cloud computing, which enables organizations to utilize cloud services in a pay-as-you-go manner. This facility is beneficial to small and medium size organizations where the setup of a large scale and complex private cloud is not feasible due to financial constraints. Hence, executing Hadoop MapReduce applications in a cloud environment for big data analytics has become a realistic option for both industrial practitioners and academic researchers. For example, Amazon has designed Elastic MapReduce (EMR) that enables users to run Hadoop applications across its Elastic Compute Cloud (EC2) nodes. The EC2 Cloud makes it easier for users to set up and run Hadoop applications on a large-scale virtual cluster. To use the EC2 Cloud, users have to configure the required amount of resources (virtual nodes) for their applications. However, the EC2 Cloud in its current form does not support Hadoop jobs with deadline requirements. It is purely the user's responsibility to estimate the amount of resources to complete their jobs, which is a highly challenging task. Hence, Hadoop performance modeling has become a necessity in estimating the right amount of resources for user jobs with deadline requirements. It should be pointed out that modeling Hadoop performance is challenging because Hadoop jobs normally involve multiple processing phases including three core phases (i.e. map phase, shuffle phase and reduce phase). Moreover, the first wave of the shuffle phase is normally processed in parallel with the map phase (i.e. overlapping stage), and the other waves of the shuffle phase are processed after the map phase is completed (i.e. non-overlapping stage).

To effectively manage cloud resources, several Hadoop performance models have been proposed [11][12][13][14]. However, these models do not consider the overlapping and non-overlapping stages of the shuffle phase, which leads to an inaccurate estimation of job execution.

Recently, a number of sophisticated Hadoop performance models have been proposed [15][16][17][18]. Starfish [15] collects a running Hadoop job profile at a fine granularity with detailed information for job estimation and optimization. On top of Starfish, Elasticiser [16] is proposed for resource provisioning in terms of virtual machines. However, collecting the detailed execution profile of a Hadoop job incurs a high overhead, which leads to an overestimated job execution time. The HP model [17] considers both the overlapping and non-overlapping stages and uses simple linear regression for job estimation. This model also estimates the amount of resources for jobs with deadline requirements. CRESP [18] estimates job execution and supports resource provisioning in terms of map and reduce slots. However, both the HP model and CRESP ignore the impact of the number of reduce tasks on job performance. The HP model is restricted to a constant number of reduce tasks, whereas CRESP only considers a single wave of the reduce phase. In CRESP, the number of reduce tasks has to be equal to the number of reduce slots. It is unrealistic to configure either the same number of reduce tasks or a single wave of the reduce phase for all jobs. It can be argued that in practice, the number of reduce tasks varies depending on the size of the input dataset, the type of a Hadoop application (e.g. CPU intensive, or disk I/O intensive) and user requirements. Furthermore, for the reduce phase, using multiple waves generates better performance than using a single wave, especially when Hadoop processes a large dataset on a small amount of resources. While a single wave reduces the task setup overhead, multiple waves improve the utilization of the disk I/O.

Building on the HP model, this paper presents an improved HP model for Hadoop job execution estimation and resource provisioning. The major contributions of this paper are as follows:

- The improved HP work mathematically models all the three core phases of a Hadoop job. In contrast, the HP work does not mathematically model the non-overlapping shuffle phase in the first wave.
- The improved HP model employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a Hadoop job with a varied number of reduce tasks. In contrast, the HP model employs a simple linear regression technique for job execution estimation, which restricts it to a constant number of reduce tasks.
- Based on job execution estimation, the improved HP model employs the Lagrange Multiplier technique to provision the amount of resources for a Hadoop job to complete within a given deadline.

The performance of the improved HP model is initially evaluated on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. The evaluation results show that the improved HP model outperforms both the HP model and Starfish in job execution estimation with an accuracy level in the range of 94.97% to 95.51%. For resource provisioning, 4 job scenarios are considered with a varied number of map slots and reduce slots. The experimental results show that the improved HP model is more economical in resource provisioning than the HP model.

The remainder of the paper is organized as follows. Section II models job phases in Hadoop. Section III presents the improved HP model in job execution estimation and Section IV further enhances the improved HP model for resource provisioning. Section V first evaluates the performance of the improved HP model on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. Section VI discusses a number of related works. Finally, Section VII concludes the paper and points out some future work.

II. MODELING JOB PHASES IN HADOOP

Normally a Hadoop job execution is divided into a map phase and a reduce phase. The reduce phase involves data shuffling, data sorting and user-defined reduce functions. Data shuffling and sorting are performed simultaneously. Therefore, the reduce phase can be further divided into a shuffle (or sort) phase and a reduce phase performing user-defined functions. As a result, an overall Hadoop job execution work flow consists of a map phase, a shuffle phase and a reduce phase as shown in Fig.1. Map tasks are executed in map slots at a map phase and reduce tasks run in reduce slots at a reduce phase. Every task runs in one slot at a time. A slot is allocated with a certain amount of resources in terms of CPU and RAM. A Hadoop job phase can be completed in a single wave or multiple waves. Tasks in a wave run in parallel on the assigned slots.

Fig.1. Hadoop job execution flow (input dataset, map tasks producing intermediate data, shuffle phase, reduce tasks, and final output written to HDFS).
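To illustrate the notion of waves introduced above, the following minimal sketch (our own illustration, not from the paper) groups the tasks of a phase into waves, where each wave contains at most as many tasks as there are configured slots.

```python
import math

def group_into_waves(num_tasks: int, num_slots: int):
    """Group task indices into waves; each wave runs at most num_slots tasks in parallel."""
    num_waves = math.ceil(num_tasks / num_slots)
    return [list(range(w * num_slots, min((w + 1) * num_slots, num_tasks)))
            for w in range(num_waves)]

# Example: 80 tasks on 32 slots run in 3 waves of 32, 32 and 16 tasks.
print([len(w) for w in group_into_waves(80, 32)])  # [32, 32, 16]
```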

Herodotou presented a detailed set of mathematical models on Hadoop performance at a fine granularity [19]. For the purpose of simplicity, we only consider the three core phases (i.e. map phase, shuffle phase and reduce phase) in modeling the performance of Hadoop jobs. Table 1 defines the variables used in Hadoop job performance modeling.

Table 1. Defined variables in modeling job phases.

$D_{m,input}^{avg}$: the average input data size of a map task.
$D_{m,output}^{avg}$: the average output data size of a map task.
$M_{selectivity}$: the map selectivity, which is the ratio of a map output to a map input.
$N_m$: the total number of map tasks.
$T_m^{avg}$: the average execution time of a map task.
$N_m^{slot}$: the total number of configured map slots.
$T_m^{total}$: the total execution time of a map phase.
$T_{sh}^{total}$: the total execution time of a shuffle phase.
$N_r$: the total number of reduce tasks.
$T_{sh}^{avg}$: the average execution duration of a shuffle task.
$D_{sh}^{avg}$: the average size of the shuffled data.
$N_r^{slot}$: the total number of configured reduce slots.
$N_{sh}^{w1}$: the total number of shuffle tasks that complete in the first wave.
$N_{sh}^{w2}$: the total number of shuffle tasks that complete in other waves.
$T_{w1}^{avg}$: the average execution time of a shuffle task that completes in the first wave.
$T_{w2}^{avg}$: the average execution time of a shuffle task that completes in other waves.
$D_{r,output}^{avg}$: the average output data size of a reduce task.
$T_r^{total}$: the total execution time of a reduce phase.
$D_{r,input}^{avg}$: the average input data size of a reduce task.
$R_{selectivity}$: the reduce selectivity, which is the ratio of a reduce output to a reduce input.
$T_r^{avg}$: the average execution time of a reduce task.

A. Modeling Map Phase

In this phase, a Hadoop job reads an input dataset from the Hadoop Distributed File System (HDFS), splits the input dataset into data chunks based on a specified size and then passes the data chunks to a user-defined map function. The map function processes the data chunks and produces a map output. The map output is called intermediate data. The average map output and the total map phase execution time can be computed using Eq.(1) and Eq.(2) respectively.

$D_{m,output}^{avg} = D_{m,input}^{avg} \times M_{selectivity}$   (1)

$T_m^{total} = \dfrac{T_m^{avg} \times N_m}{N_m^{slot}}$   (2)

B. Modeling Shuffle Phase

In this phase, a Hadoop job fetches the intermediate data, sorts it and copies it to one or more reducers. The shuffle tasks and sort tasks are performed simultaneously, therefore we generally consider them as a shuffle phase. The average size of the shuffled data can be computed using Eq.(3).

$D_{sh}^{avg} = \dfrac{D_{m,output}^{avg} \times N_m}{N_r}$   (3)

If $N_r \leq N_r^{slot}$, then the shuffle phase will be completed in a single wave, and its total execution time can be computed using Eq.(4). Otherwise, the shuffle phase will be completed in multiple waves and its execution time can be computed using Eq.(5).

$T_{sh}^{total} = \dfrac{T_{sh}^{avg} \times N_r}{N_r^{slot}}$   (4)

$T_{sh}^{total} = \dfrac{(T_{w1}^{avg} \times N_{sh}^{w1}) + (T_{w2}^{avg} \times N_{sh}^{w2})}{N_r^{slot}}$   (5)

C. Modeling Reduce Phase

In this phase, a job reads the sorted intermediate data as input and passes it to a user-defined reduce function. The reduce function processes the intermediate data and produces a final output. In general, the reduce output is written back into HDFS. The average output of the reduce tasks and the total execution time of the reduce phase can be computed using Eq.(6) and Eq.(7) respectively.

$D_{r,output}^{avg} = D_{r,input}^{avg} \times R_{selectivity}$   (6)

$T_r^{total} = \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (7)
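Putting Eq.(1) to Eq.(7) together, the sketch below is a minimal illustration (our own, not the authors' code) of how the three phase durations could be computed from the Table 1 variables; the function and parameter names are ours.

```python
def map_phase_time(t_m_avg, n_m, n_m_slot):
    # Eq.(2): total map phase time over all map waves
    return t_m_avg * n_m / n_m_slot

def shuffle_phase_time(n_r, n_r_slot, t_sh_avg=None,
                       t_w1_avg=None, n_sh_w1=None, t_w2_avg=None, n_sh_w2=None):
    if n_r <= n_r_slot:
        # Eq.(4): single-wave shuffle phase
        return t_sh_avg * n_r / n_r_slot
    # Eq.(5): multi-wave shuffle phase with first-wave and other-wave task durations
    return (t_w1_avg * n_sh_w1 + t_w2_avg * n_sh_w2) / n_r_slot

def reduce_phase_time(t_r_avg, n_r, n_r_slot):
    # Eq.(7): total reduce phase time
    return t_r_avg * n_r / n_r_slot

def shuffled_data_per_reducer(d_m_input_avg, m_selectivity, n_m, n_r):
    # Eq.(1) and Eq.(3): average map output per task, then shuffled data per reduce task
    d_m_output_avg = d_m_input_avg * m_selectivity
    return d_m_output_avg * n_m / n_r
```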
III. AN IMPROVED HP PERFORMANCE MODEL

As mentioned before, Hadoop jobs have three core execution phases: the map phase, the shuffle phase and the reduce phase. The map phase and the shuffle phase can have overlapping and non-overlapping stages. In this section, we present an improved HP model which takes into account both the overlapping stage and the non-overlapping stage of the shuffle phase during the execution of a Hadoop job. We consider single Hadoop jobs without logical dependencies.

A. Design Rationale

A Hadoop job normally runs with multiple phases in a single wave or in multiple waves. If a job runs in a single wave then all the phases will be completed without overlapping stages as shown in Fig.2.

Fig.2. A Hadoop job running in a single wave (16 map tasks and 16 reduce tasks).

However, if a job runs in multiple waves, then the job will progress through both overlapping (parallel) and non-overlapping (sequential) stages among the phases as shown in Fig.3.

In the case of multiple waves, the first wave of the shuffle phase starts immediately after the first map task completes. Furthermore, the first wave of the shuffle phase continues until all the map tasks complete and all the intermediate data is shuffled and sorted. Thus, the first wave of the shuffle phase progresses in parallel with the other waves of the map phase as shown in Fig.3. After completion of the first wave of the shuffle phase, the reduce tasks start running and produce output. Afterwards, these reduce slots become available to the shuffle tasks running in other waves. It can be observed from Fig.3 that the shuffle phase takes longer to complete in the first wave than in other waves. In order to estimate the execution time of a job in multiple waves, we need to estimate two sets of parameters for the shuffle phase: the average and the maximum durations of the first wave, together with the average and the maximum durations of the other waves. Moreover, there is no significant difference between the durations of the map tasks running in non-overlapping and overlapping stages due to the equal size of data chunks. Therefore, we only estimate one set of parameters for the map phase, which are the average and the maximum durations of the map tasks. The reduce tasks run in a non-overlapping stage, therefore we only estimate one set of parameters for the reduce phase, which are the average and the maximum durations of the reduce tasks. Finally, we aggregate the durations of all the three phases to estimate the overall job execution time. This is reflected in the mathematical equations of the improved HP model, which are different from those of the HP model.

Fig.3. A Hadoop job running in multiple waves (80 map tasks, 32 reduce tasks). The figure contrasts the HP model with the improved HP model, showing the non-overlapping map phase in the first wave, the map phase overlapping with the first wave of the shuffle phase, the non-overlapping shuffle phase in the first wave, and the shuffle and reduce phases in other waves.

It should be pointed out that Fig.3 also shows the differences between the HP model and the improved model in Hadoop job modeling. The HP work mathematically models the whole map phase, which includes the non-overlapping stage of the map phase and the stage overlapping with the shuffle phase, but it does not provide any mathematical equations to model the non-overlapping stage of the shuffle phase in the first wave. In contrast, the improved HP work mathematically models the non-overlapping map phase in the first wave, and the shuffle phase in the first wave which includes both the stage overlapping with the map phase and the non-overlapping stage.

B. Mathematical Expressions

In this section, we present the mathematical expressions of the improved HP work in modeling a Hadoop job which completes in multiple waves. Table 2 defines the variables used in the improved model.

Table 2. Defined variables in the improved HP model.

$T_{m,w1}^{low}$: the lower bound duration of the map phase in the first wave (non-overlapping).
$T_{m,w1}^{up}$: the upper bound duration of the map phase in the first wave (non-overlapping).
$N_m^{w1}$: the number of map tasks that complete in the first wave of the map phase.
$N_m^{w2}$: the number of map tasks that complete in other waves of the map phase.
$T_m^{max}$: the maximum execution time of a map task.
$T_{sh,w1}^{low}$: the lower bound duration of the shuffle phase in the first wave (overlapping with the map phase).
$T_{sh,w1}^{up}$: the upper bound duration of the shuffle phase in the first wave (overlapping with the map phase).
$T_{sh,w1}^{avg}$: the average execution time of a shuffle task that completes in the first wave of the shuffle phase.
$T_{sh,w1}^{max}$: the maximum execution time of a shuffle task that completes in the first wave of the shuffle phase.
$T_{sh,w2}^{low}$: the lower bound duration of the shuffle phase in other waves (non-overlapping).
$T_{sh,w2}^{up}$: the upper bound duration of the shuffle phase in other waves (non-overlapping).
$T_{sh,w2}^{avg}$: the average execution time of a shuffle task that completes in other waves of the shuffle phase.
$T_{sh,w2}^{max}$: the maximum execution time of a shuffle task that completes in other waves of the shuffle phase.
$T_r^{low}$: the lower bound duration of the reduce phase.
$T_r^{up}$: the upper bound duration of the reduce phase.
$T_r^{max}$: the maximum execution time of a reduce task.
$T_{job}^{low}$: the lower bound execution time of a Hadoop job.
$T_{job}^{up}$: the upper bound execution time of a Hadoop job.
$T_{job}^{avg}$: the average execution time of a Hadoop job.

In practice, job tasks in different waves may not complete at exactly the same time due to varied overhead in disk I/O operations and network communication. Therefore, the improved HP model estimates the lower bound and the upper bound of the execution time for each phase to cover the best-case and the worst-case scenarios respectively.
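Before presenting the expressions, a small sketch may help clarify where the per-phase parameters come from. Assuming a job profile that records individual task durations per phase (an assumed layout, not one specified in the paper), the average and maximum durations used below can be extracted as follows.

```python
from statistics import mean

def phase_parameters(durations):
    """Return the (average, maximum) duration for one group of tasks."""
    return mean(durations), max(durations)

def profile_parameters(profile):
    """profile is assumed to be a dict of task-duration lists keyed by phase,
    e.g. {'map': [...], 'shuffle_w1': [...], 'shuffle_w2': [...], 'reduce': [...]}."""
    return {phase: phase_parameters(durs) for phase, durs in profile.items()}

# Hypothetical job profile (durations in seconds).
profile = {'map': [12.1, 13.4, 11.8], 'shuffle_w1': [40.2, 42.7],
           'shuffle_w2': [15.3, 16.1], 'reduce': [22.5, 21.9]}
print(profile_parameters(profile))
```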

We consider a job that runs in both non-overlapping and overlapping stages. The lower bound and the upper bound of the map phase in the first wave, which is a non-overlapping stage, can be computed using Eq.(8) and Eq.(9) respectively.

$T_{m,w1}^{low} = \dfrac{T_m^{avg} \times N_m^{w1}}{N_m^{slot}}$   (8)

$T_{m,w1}^{up} = \dfrac{T_m^{max} \times N_m^{w1}}{N_m^{slot}}$   (9)

In the overlapping stage of a running job, the map phase overlaps with the shuffle phase. Specifically, the tasks running in other waves of the map phase run in parallel with the tasks running in the first wave of the shuffle phase. As the shuffle phase always completes after the map phase, i.e. the shuffle phase takes longer than the map phase, we use the duration of the shuffle phase in the first wave to compute the lower bound and the upper bound of the overlapping stage of the job using Eq.(10) and Eq.(11) respectively.

$T_{sh,w1}^{low} = \dfrac{T_{sh,w1}^{avg} \times N_{sh}^{w1}}{N_r^{slot}}$   (10)

$T_{sh,w1}^{up} = \dfrac{T_{sh,w1}^{max} \times N_{sh}^{w1}}{N_r^{slot}}$   (11)

In other waves of the shuffle phase, the tasks run in a non-overlapping stage. Hence, the lower bound and the upper bound of the non-overlapping stage of the shuffle phase can be computed using Eq.(12) and Eq.(13) respectively.

$T_{sh,w2}^{low} = \dfrac{T_{sh,w2}^{avg} \times N_{sh}^{w2}}{N_r^{slot}}$   (12)

$T_{sh,w2}^{up} = \dfrac{T_{sh,w2}^{max} \times N_{sh}^{w2}}{N_r^{slot}}$   (13)

The reduce tasks start after completion of the shuffle tasks. Therefore, the reduce tasks complete in a non-overlapping stage. The lower bound and the upper bound of the reduce phase can be computed using Eq.(14) and Eq.(15) respectively.

$T_r^{low} = \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (14)

$T_r^{up} = \dfrac{T_r^{max} \times N_r}{N_r^{slot}}$   (15)

As a result, the lower bound and the upper bound of the execution time of a Hadoop job can be computed by combining the execution durations of all the three phases using Eq.(16) and Eq.(17) respectively.

$T_{job}^{low} = T_{m,w1}^{low} + T_{sh,w1}^{low} + T_{sh,w2}^{low} + T_r^{low}$   (16)

$T_{job}^{up} = T_{m,w1}^{up} + T_{sh,w1}^{up} + T_{sh,w2}^{up} + T_r^{up}$   (17)

By substituting the values into Eq.(16) and Eq.(17), we have

$T_{job}^{low} = \dfrac{T_m^{avg} \times N_m^{w1}}{N_m^{slot}} + \dfrac{T_{sh,w1}^{avg} \times N_{sh}^{w1}}{N_r^{slot}} + \dfrac{T_{sh,w2}^{avg} \times N_{sh}^{w2}}{N_r^{slot}} + \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (18)

$T_{job}^{up} = \dfrac{T_m^{max} \times N_m^{w1}}{N_m^{slot}} + \dfrac{T_{sh,w1}^{max} \times N_{sh}^{w1}}{N_r^{slot}} + \dfrac{T_{sh,w2}^{max} \times N_{sh}^{w2}}{N_r^{slot}} + \dfrac{T_r^{max} \times N_r}{N_r^{slot}}$   (19)

Finally, we take the average of Eq.(18) and Eq.(19) to estimate the execution time of a Hadoop job using Eq.(20).

$T_{job}^{avg} = \dfrac{T_{job}^{low} + T_{job}^{up}}{2}$   (20)
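As a worked summary of Eq.(8) to Eq.(20), the following sketch (our own minimal illustration under the paper's notation, not the authors' implementation) computes the lower bound, the upper bound and the average job execution time.

```python
def estimate_job_time(t_m_avg, t_m_max, n_m_w1, n_m_slot,
                      t_sh_w1_avg, t_sh_w1_max, n_sh_w1,
                      t_sh_w2_avg, t_sh_w2_max, n_sh_w2,
                      t_r_avg, t_r_max, n_r, n_r_slot):
    # Eq.(8)/(9): non-overlapping map phase in the first wave
    t_map_low = t_m_avg * n_m_w1 / n_m_slot
    t_map_up = t_m_max * n_m_w1 / n_m_slot
    # Eq.(10)/(11): first wave of the shuffle phase (overlapping stage)
    t_sh1_low = t_sh_w1_avg * n_sh_w1 / n_r_slot
    t_sh1_up = t_sh_w1_max * n_sh_w1 / n_r_slot
    # Eq.(12)/(13): other waves of the shuffle phase (non-overlapping stage)
    t_sh2_low = t_sh_w2_avg * n_sh_w2 / n_r_slot
    t_sh2_up = t_sh_w2_max * n_sh_w2 / n_r_slot
    # Eq.(14)/(15): reduce phase
    t_red_low = t_r_avg * n_r / n_r_slot
    t_red_up = t_r_max * n_r / n_r_slot
    # Eq.(16)-(20): combine the three phases and average the two bounds
    t_low = t_map_low + t_sh1_low + t_sh2_low + t_red_low
    t_up = t_map_up + t_sh1_up + t_sh2_up + t_red_up
    return t_low, t_up, (t_low + t_up) / 2
```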
C. Job Execution Estimation

In the previous section, we presented the mathematical expressions of the improved HP model. The lower bound and the upper bound of the map phase can be computed using Eq.(8) and Eq.(9) respectively. However, the durations of the shuffle phase and the reduce phase have to be estimated based on the running records of a Hadoop job.

When a job processes an increasing size of input dataset, the number of map tasks increases proportionally while the number of reduce tasks is specified by a user in the configuration file. The number of reduce tasks can vary depending on the user's configurations. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the increasing size of the input dataset, as considered in the HP model. This is because the volume of an intermediate data block equals the total volume of the generated intermediate data divided by the number of reduce tasks. As a result, the volume of an intermediate data block also increases linearly with the increasing size of the input dataset. However, when the number of reduce tasks varies, the execution durations of both the shuffle tasks and the reduce tasks are not linear in the increasing size of the input dataset.

In either the shuffle phase or the reduce phase, we consider the tasks running in both overlapping and non-overlapping stages. Unlike the HP model, the improved model considers a varied number of reduce tasks. As a result, the durations of both the shuffle tasks and the reduce tasks are nonlinear in the size of the input dataset. Therefore, instead of using a simple linear regression as adopted by the HP model, we apply Locally Weighted Linear Regression (LWLR) [20][21] in the improved model to estimate the execution durations of both the shuffle tasks and the reduce tasks.

LWLR is an instance-based nonparametric method which assigns a weight to each training instance $x$ according to its Euclidean distance from the query instance $x_q$. LWLR assigns a high weight to an instance $x$ which is close to the query instance $x_q$ and a low weight to instances that are far away from the query instance $x_q$. The weight of an instance can be computed using a Gaussian function as illustrated in Eq.(21).

$w_k = \exp\left(-\dfrac{(distance(x_k, x_q))^2}{2h^2}\right), \quad k = 1, 2, 3, \ldots, m$   (21)

where:
- $w_k$ is the weight of the training instance at location $k$.
- $x_k$ is the training instance at location $k$.
- $m$ is the total number of training instances.
- $h$ is a smoothing parameter which determines the width of the local neighborhood of the query instance.

The value of $h$ is crucial to LWLR. Users have the option of using a new value of $h$ for each estimation or a single global value of $h$. However, finding an optimal value for $h$ is a challenging issue in itself [22]. In the improved HP model, a single global value of $h$ is used to minimize the estimated mean square errors.

In the improved HP model, LWLR is used to estimate the durations of both the shuffle tasks and the reduce tasks. First, we estimate $T_{sh,w1}^{avg}$, which is the average duration of the shuffle tasks running in the first wave of the shuffle phase. To estimate $T_{sh,w1}^{avg}$, we define a matrix $X_{m \times n}$ whose rows contain the training instances $x_1, x_2, x_3, \ldots, x_m$, where $n$ is the number of feature variables, which is set to 2 (i.e. the size of an intermediate dataset and the number of reduce tasks). We define a vector $Y = (y_1, y_2, \ldots, y_m)$ of dependent variables that are the average durations of the shuffle tasks. For example, $y_i$ represents the average execution time of the shuffle task that corresponds to the training instance $x_i$. We define another matrix $X_q$ whose rows are query instances. Each query instance $x_q$ contains both the size of the intermediate dataset $d_{new}$ and the number of reduce tasks $r_{new}$ of a new job. We calculate $d_{new}$ based on the average input data size of a map task, the total number of map tasks and the map selectivity metric, i.e. $d_{new} = D_{m,input}^{avg} \times N_m \times M_{selectivity}$.

For the estimation of $T_{sh,w1}^{avg}$, we calculate the weight of each training instance using Eq.(21) and then compute the parameter $\beta$, which is the coefficient of LWLR, using Eq.(22).

$\beta = (X^T W X)^{-1} (X^T W Y)$   (22)

Here $W = diag(w_k)$ is the diagonal matrix where all the non-diagonal cells are 0. The value of a diagonal cell increases when the distance between a training instance and the query instance decreases.

Finally, the duration of a new shuffle task running in the first wave of the shuffle phase can be estimated using Eq.(23).

$T_{sh,w1}^{avg} = X_q \times \beta$   (23)

Similarly, the durations of $T_{sh,w1}^{max}$, $T_{sh,w2}^{avg}$, $T_{sh,w2}^{max}$, $T_r^{avg}$ and $T_r^{max}$ can be estimated.
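The estimation step of Eq.(21) to Eq.(23) can be sketched as follows (a minimal NumPy illustration, not the authors' code; the example feature values and the smoothing parameter are hypothetical).

```python
import numpy as np

def lwlr_estimate(X, Y, x_q, h):
    """Locally Weighted Linear Regression for a single query instance.

    X   : (m, n) matrix of training instances, e.g. [intermediate data size, reduce tasks]
    Y   : (m,) vector of observed durations, e.g. average first-wave shuffle task times
    x_q : (n,) query instance describing the new job
    h   : smoothing parameter controlling the width of the local neighborhood
    """
    # Eq.(21): Gaussian weights based on Euclidean distance to the query instance
    dists = np.linalg.norm(X - x_q, axis=1)
    w = np.exp(-(dists ** 2) / (2 * h ** 2))
    W = np.diag(w)
    # Eq.(22): locally weighted least-squares coefficients
    beta = np.linalg.inv(X.T @ W @ X) @ (X.T @ W @ Y)
    # Eq.(23): predicted duration for the query instance
    return x_q @ beta

# Hypothetical training data: [intermediate data size (GB), reduce tasks] -> duration (s).
X = np.array([[10.0, 16], [20.0, 16], [20.0, 32], [40.0, 32]])
Y = np.array([35.0, 62.0, 40.0, 70.0])
print(lwlr_estimate(X, Y, np.array([30.0, 32]), h=8.0))
```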
The estimated values of both the shuffle phase and the reduce phase are used in the improved HP model to estimate the overall execution time of a Hadoop job when processing a new input dataset. Fig.4 shows the overall architecture of the improved HP model, which summarizes the work of the improved HP model in job execution estimation. The boxes in gray represent the same work presented in the HP model. It is worth noting that the improved HP model works in an offline mode and estimates the execution time of a job based on the job profile.

Fig.4. The architecture of the improved HP model (job profile, map phase estimation, first-wave and other-wave shuffle estimation with Locally Weighted Linear Regression, reduce time estimation, and overall job estimation).

IV. RESOURCE PROVISIONING

The improved HP model presented in Section III can estimate the execution time of a Hadoop job based on the job execution profile, allocated resources (i.e. map slots and reduce slots), and the size of an input dataset. The improved H
