Hadoop Performance Modeling For Job Estimation And Resource Provisioning

Transcription

DOI: 10.1109/TPDS.2015.2405552, IEEE Transactions on Parallel and Distributed Systems

Mukhtaj Khan, Yong Jin, Maozhen Li, Yang Xiang and Changjun Jiang

Abstract— MapReduce has become a major computing model for data intensive applications. Hadoop, an open source implementation of MapReduce, has been adopted by an increasingly growing user community. Cloud computing service providers such as Amazon EC2 Cloud offer the opportunities for Hadoop users to lease a certain amount of resources and pay for their use. However, a key challenge is that cloud service providers do not have a resource provisioning mechanism to satisfy user jobs with deadline requirements. Currently, it is solely the user's responsibility to estimate the required amount of resources for running a job in the cloud. This paper presents a Hadoop job performance model that accurately estimates job completion time and further provisions the required amount of resources for a job to be completed within a deadline. The proposed model builds on historical job execution records and employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a job. Furthermore, it employs the Lagrange Multipliers technique for resource provisioning to satisfy jobs with deadline requirements. The proposed model is initially evaluated on an in-house Hadoop cluster and subsequently evaluated in the Amazon EC2 Cloud. Experimental results show that the accuracy of the proposed model in job execution estimation is in the range of 94.97% to 95.51%, and jobs are completed within the required deadlines when following the resource provisioning scheme of the proposed model.

Index Terms— Cloud computing, Hadoop MapReduce, performance modeling, job estimation, resource provisioning

Author affiliations: Mukhtaj Khan is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK (Mukhtaj.Khan@brunel.ac.uk). Yong Jin is with the National Key Lab for Electronic Measurement Technology, North University of China, Taiyuan 030051, China, and is a Visiting Professor in the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK (Yong.Jin@brunel.ac.uk). Maozhen Li is with the Department of Electronic and Computer Engineering, Brunel University, Uxbridge, UB8 3PH, UK, and also with the Key Laboratory of Embedded Systems and Service Computing, Ministry of Education, Tongji University, Shanghai 200092, China (Maozhen.Li@brunel.ac.uk). Changjun Jiang and Yang Xiang are with the Department of Computer Science & Technology, Tongji University, 1239 Siping Road, Shanghai 200092, China ({cjjiang, shxiangyang}@tongji.edu.cn).

I. INTRODUCTION

Many organizations are continuously collecting massive amounts of datasets from various sources such as the World Wide Web, sensor networks and social networks. The ability to perform scalable and timely analytics on these unstructured datasets is a high priority task for many enterprises. It has become difficult for traditional network storage and database systems to process these continuously growing datasets. MapReduce [1], originally developed by Google, has become a major computing model in support of data intensive applications.
It is a highly scalable, fault-tolerant and data parallel model that automatically distributes the data and parallelizes the computation across a cluster of computers [2]. Among its implementations such as Mars [3], Phoenix [4], Dryad [5] and Hadoop [6], Hadoop has received a wide uptake by the community due to its open source nature [7][8][9][10].

One feature of Hadoop MapReduce is its support of public cloud computing, which enables organizations to utilize cloud services in a pay-as-you-go manner. This facility is beneficial to small and medium size organizations where the setup of a large scale and complex private cloud is not feasible due to financial constraints. Hence, executing Hadoop MapReduce applications in a cloud environment for big data analytics has become a realistic option for both industrial practitioners and academic researchers. For example, Amazon has designed Elastic MapReduce (EMR) that enables users to run Hadoop applications across its Elastic Compute Cloud (EC2) nodes. The EC2 Cloud makes it easier for users to set up and run Hadoop applications on a large-scale virtual cluster. To use the EC2 Cloud, users have to configure the required amount of resources (virtual nodes) for their applications. However, the EC2 Cloud in its current form does not support Hadoop jobs with deadline requirements. It is purely the user's responsibility to estimate the amount of resources to complete their jobs, which is a highly challenging task. Hence, Hadoop performance modeling has become a necessity in estimating the right amount of resources for user jobs with deadline requirements. It should be pointed out that modeling Hadoop performance is challenging because Hadoop jobs normally involve multiple processing phases including three core phases (i.e. map phase, shuffle phase and reduce phase). Moreover, the first wave of the shuffle phase is normally processed in parallel with the map phase (i.e. overlapping stage), and the other waves of the shuffle phase are processed after the map phase is completed (i.e. non-overlapping stage).

To effectively manage cloud resources, several Hadoop performance models have been proposed [11][12][13][14]. However, these models do not consider the overlapping and non-overlapping stages of the shuffle phase, which leads to an inaccurate estimation of job execution.

Recently, a number of sophisticated Hadoop performance models have been proposed [15][16][17][18]. Starfish [15] collects a running Hadoop job profile at a fine granularity with detailed information for job estimation and optimization. On top of Starfish, Elasticiser [16] is proposed for resource provisioning in terms of virtual machines. However, collecting the detailed execution profile of a Hadoop job incurs a high overhead, which leads to an overestimated job execution time. The HP model [17] considers both the overlapping and non-overlapping stages and uses simple linear regression for job estimation. This model also estimates the amount of resources for jobs with deadline requirements. CRESP [18] estimates job execution and supports resource provisioning in terms of map and reduce slots. However, both the HP model and CRESP ignore the impact of the number of reduce tasks on job performance. The HP model is restricted to a constant number of reduce tasks, whereas CRESP only considers a single wave of the reduce phase. In CRESP, the number of reduce tasks has to be equal to the number of reduce slots. It is unrealistic to configure either the same number of reduce tasks or a single wave of the reduce phase for all jobs. It can be argued that in practice, the number of reduce tasks varies depending on the size of the input dataset, the type of a Hadoop application (e.g. CPU intensive, or disk I/O intensive) and user requirements. Furthermore, for the reduce phase, using multiple waves generates better performance than using a single wave, especially when Hadoop processes a large dataset on a small amount of resources. While a single wave reduces the task setup overhead, multiple waves improve the utilization of the disk I/O.

Building on the HP model, this paper presents an improved HP model for Hadoop job execution estimation and resource provisioning. The major contributions of this paper are as follows:

- The improved HP work mathematically models all the three core phases of a Hadoop job. In contrast, the HP work does not mathematically model the non-overlapping shuffle phase in the first wave.
- The improved HP model employs the Locally Weighted Linear Regression (LWLR) technique to estimate the execution time of a Hadoop job with a varied number of reduce tasks. In contrast, the HP model employs a simple linear regression technique for job execution estimation, which restricts it to a constant number of reduce tasks.
- Based on job execution estimation, the improved HP model employs the Lagrange Multiplier technique to provision the amount of resources for a Hadoop job to complete within a given deadline.

The performance of the improved HP model is initially evaluated on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. The evaluation results show that the improved HP model outperforms both the HP model and Starfish in job execution estimation with an accuracy level in the range of 94.97% to 95.51%. For resource provisioning, 4 job scenarios are considered with a varied number of map slots and reduce slots. The experimental results show that the improved HP model is more economical in resource provisioning than the HP model.

The remainder of the paper is organized as follows. Section II models job phases in Hadoop. Section III presents the improved HP model in job execution estimation and Section IV further enhances the improved HP model for resource provisioning. Section V first evaluates the performance of the improved HP model on an in-house Hadoop cluster and subsequently on the Amazon EC2 Cloud. Section VI discusses a number of related works. Finally, Section VII concludes the paper and points out some future work.

II. MODELING JOB PHASES IN HADOOP

Normally a Hadoop job execution is divided into a map phase and a reduce phase. The reduce phase involves data shuffling, data sorting and user-defined reduce functions. Data shuffling and sorting are performed simultaneously. Therefore, the reduce phase can be further divided into a shuffle (or sort) phase and a reduce phase performing user-defined functions. As a result, an overall Hadoop job execution work flow consists of a map phase, a shuffle phase and a reduce phase as shown in Fig.1. Map tasks are executed in map slots at a map phase and reduce tasks run in reduce slots at a reduce phase. Every task runs in one slot at a time. A slot is allocated with a certain amount of resources in terms of CPU and RAM. A Hadoop job phase can be completed in a single wave or multiple waves. Tasks in a wave run in parallel on the assigned slots.

Fig.1. Hadoop job execution flow (input dataset, map tasks producing intermediate data, shuffle phase, reduce tasks, and final output written to HDFS).
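To illustrate the notion of waves introduced above, the following minimal sketch (our own illustration, not from the paper) groups the tasks of a phase into waves, where each wave contains at most as many tasks as there are configured slots.

```python
import math

def group_into_waves(num_tasks: int, num_slots: int):
    """Group task indices into waves; each wave runs at most num_slots tasks in parallel."""
    num_waves = math.ceil(num_tasks / num_slots)
    return [list(range(w * num_slots, min((w + 1) * num_slots, num_tasks)))
            for w in range(num_waves)]

# Example: 80 tasks on 32 slots run in 3 waves of 32, 32 and 16 tasks.
print([len(w) for w in group_into_waves(80, 32)])  # [32, 32, 16]
```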

Herodotou presented a detailed set of mathematical models on Hadoop performance at a fine granularity [19]. For the purpose of simplicity, we only consider the three core phases (i.e. map phase, shuffle phase and reduce phase) in modeling the performance of Hadoop jobs. Table 1 defines the variables used in Hadoop job performance modeling.

Table 1. Defined variables in modeling job phases.

$D_{m,input}^{avg}$: the average input data size of a map task.
$D_{m,output}^{avg}$: the average output data size of a map task.
$M_{selectivity}$: the map selectivity, which is the ratio of a map output to a map input.
$N_m$: the total number of map tasks.
$T_m^{avg}$: the average execution time of a map task.
$N_m^{slot}$: the total number of configured map slots.
$T_m^{total}$: the total execution time of a map phase.
$T_{sh}^{total}$: the total execution time of a shuffle phase.
$N_r$: the total number of reduce tasks.
$T_{sh}^{avg}$: the average execution duration of a shuffle task.
$D_{sh}^{avg}$: the average size of the shuffled data.
$N_r^{slot}$: the total number of configured reduce slots.
$N_{sh}^{w1}$: the total number of shuffle tasks that complete in the first wave.
$N_{sh}^{w2}$: the total number of shuffle tasks that complete in other waves.
$T_{w1}^{avg}$: the average execution time of a shuffle task that completes in the first wave.
$T_{w2}^{avg}$: the average execution time of a shuffle task that completes in other waves.
$D_{r,output}^{avg}$: the average output data size of a reduce task.
$T_r^{total}$: the total execution time of a reduce phase.
$D_{r,input}^{avg}$: the average input data size of a reduce task.
$R_{selectivity}$: the reduce selectivity, which is the ratio of a reduce output to a reduce input.
$T_r^{avg}$: the average execution time of a reduce task.

A. Modeling Map Phase

In this phase, a Hadoop job reads an input dataset from the Hadoop Distributed File System (HDFS), splits the input dataset into data chunks based on a specified size and then passes the data chunks to a user-defined map function. The map function processes the data chunks and produces a map output. The map output is called intermediate data. The average map output and the total map phase execution time can be computed using Eq.(1) and Eq.(2) respectively.

$D_{m,output}^{avg} = D_{m,input}^{avg} \times M_{selectivity}$   (1)

$T_m^{total} = \dfrac{T_m^{avg} \times N_m}{N_m^{slot}}$   (2)

B. Modeling Shuffle Phase

In this phase, a Hadoop job fetches the intermediate data, sorts it and copies it to one or more reducers. The shuffle tasks and sort tasks are performed simultaneously, therefore we generally consider them as a shuffle phase. The average size of the shuffled data can be computed using Eq.(3).

$D_{sh}^{avg} = \dfrac{D_{m,output}^{avg} \times N_m}{N_r}$   (3)

If $N_r \leq N_r^{slot}$, then the shuffle phase will be completed in a single wave, and its total execution time can be computed using Eq.(4). Otherwise, the shuffle phase will be completed in multiple waves and its execution time can be computed using Eq.(5).

$T_{sh}^{total} = \dfrac{T_{sh}^{avg} \times N_r}{N_r^{slot}}$   (4)

$T_{sh}^{total} = \dfrac{(T_{w1}^{avg} \times N_{sh}^{w1}) + (T_{w2}^{avg} \times N_{sh}^{w2})}{N_r^{slot}}$   (5)

C. Modeling Reduce Phase

In this phase, a job reads the sorted intermediate data as input and passes it to a user-defined reduce function. The reduce function processes the intermediate data and produces a final output. In general, the reduce output is written back into HDFS. The average output of the reduce tasks and the total execution time of the reduce phase can be computed using Eq.(6) and Eq.(7) respectively.

$D_{r,output}^{avg} = D_{r,input}^{avg} \times R_{selectivity}$   (6)

$T_r^{total} = \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (7)
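Putting Eq.(1) to Eq.(7) together, the sketch below is a minimal illustration (our own, not the authors' code) of how the three phase durations could be computed from the Table 1 variables; the function and parameter names are ours.

```python
def map_phase_time(t_m_avg, n_m, n_m_slot):
    # Eq.(2): total map phase time over all map waves
    return t_m_avg * n_m / n_m_slot

def shuffle_phase_time(n_r, n_r_slot, t_sh_avg=None,
                       t_w1_avg=None, n_sh_w1=None, t_w2_avg=None, n_sh_w2=None):
    if n_r <= n_r_slot:
        # Eq.(4): single-wave shuffle phase
        return t_sh_avg * n_r / n_r_slot
    # Eq.(5): multi-wave shuffle phase with first-wave and other-wave task durations
    return (t_w1_avg * n_sh_w1 + t_w2_avg * n_sh_w2) / n_r_slot

def reduce_phase_time(t_r_avg, n_r, n_r_slot):
    # Eq.(7): total reduce phase time
    return t_r_avg * n_r / n_r_slot

def shuffled_data_per_reducer(d_m_input_avg, m_selectivity, n_m, n_r):
    # Eq.(1) and Eq.(3): average map output per task, then shuffled data per reduce task
    d_m_output_avg = d_m_input_avg * m_selectivity
    return d_m_output_avg * n_m / n_r
```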
III. AN IMPROVED HP PERFORMANCE MODEL

As mentioned before, Hadoop jobs have three core execution phases: the map phase, the shuffle phase and the reduce phase. The map phase and the shuffle phase can have overlapping and non-overlapping stages. In this section, we present an improved HP model which takes into account both the overlapping stage and the non-overlapping stage of the shuffle phase during the execution of a Hadoop job. We consider single Hadoop jobs without logical dependencies.

A. Design Rationale

A Hadoop job normally runs with multiple phases in a single wave or in multiple waves. If a job runs in a single wave then all the phases will be completed without overlapping stages as shown in Fig.2.

Fig.2. A Hadoop job running in a single wave (16 map tasks and 16 reduce tasks).

However, if a job runs in multiple waves, then the job will progress through both overlapping (parallel) and non-overlapping (sequential) stages among the phases as shown in Fig.3.

In the case of multiple waves, the first wave of the shuffle phase starts immediately after the first map task completes. Furthermore, the first wave of the shuffle phase continues until all the map tasks complete and all the intermediate data is shuffled and sorted. Thus, the first wave of the shuffle phase progresses in parallel with the other waves of the map phase as shown in Fig.3. After completion of the first wave of the shuffle phase, the reduce tasks start running and produce output. Afterwards, these reduce slots become available to the shuffle tasks running in other waves. It can be observed from Fig.3 that the shuffle phase takes longer to complete in the first wave than in other waves. In order to estimate the execution time of a job in multiple waves, we need to estimate two sets of parameters for the shuffle phase: the average and the maximum durations of the first wave, together with the average and the maximum durations of the other waves. Moreover, there is no significant difference between the durations of the map tasks running in non-overlapping and overlapping stages due to the equal size of data chunks. Therefore, we only estimate one set of parameters for the map phase, which are the average and the maximum durations of the map tasks. The reduce tasks run in a non-overlapping stage, therefore we only estimate one set of parameters for the reduce phase, which are the average and the maximum durations of the reduce tasks. Finally, we aggregate the durations of all the three phases to estimate the overall job execution time. This is reflected in the mathematical equations of the improved HP model, which are different from those of the HP model.

Fig.3. A Hadoop job running in multiple waves (80 map tasks, 32 reduce tasks). The figure contrasts the HP model with the improved HP model, showing the non-overlapping map phase in the first wave, the map phase overlapping with the first wave of the shuffle phase, the non-overlapping shuffle phase in the first wave, and the shuffle and reduce phases in other waves.

It should be pointed out that Fig.3 also shows the differences between the HP model and the improved model in Hadoop job modeling. The HP work mathematically models the whole map phase, which includes the non-overlapping stage of the map phase and the stage overlapping with the shuffle phase, but it does not provide any mathematical equations to model the non-overlapping stage of the shuffle phase in the first wave. In contrast, the improved HP work mathematically models the non-overlapping map phase in the first wave, and the shuffle phase in the first wave which includes both the stage overlapping with the map phase and the non-overlapping stage.

B. Mathematical Expressions

In this section, we present the mathematical expressions of the improved HP work in modeling a Hadoop job which completes in multiple waves. Table 2 defines the variables used in the improved model.

Table 2. Defined variables in the improved HP model.

$T_{m,w1}^{low}$: the lower bound duration of the map phase in the first wave (non-overlapping).
$T_{m,w1}^{up}$: the upper bound duration of the map phase in the first wave (non-overlapping).
$N_m^{w1}$: the number of map tasks that complete in the first wave of the map phase.
$N_m^{w2}$: the number of map tasks that complete in other waves of the map phase.
$T_m^{max}$: the maximum execution time of a map task.
$T_{sh,w1}^{low}$: the lower bound duration of the shuffle phase in the first wave (overlapping with the map phase).
$T_{sh,w1}^{up}$: the upper bound duration of the shuffle phase in the first wave (overlapping with the map phase).
$T_{sh,w1}^{avg}$: the average execution time of a shuffle task that completes in the first wave of the shuffle phase.
$T_{sh,w1}^{max}$: the maximum execution time of a shuffle task that completes in the first wave of the shuffle phase.
$T_{sh,w2}^{low}$: the lower bound duration of the shuffle phase in other waves (non-overlapping).
$T_{sh,w2}^{up}$: the upper bound duration of the shuffle phase in other waves (non-overlapping).
$T_{sh,w2}^{avg}$: the average execution time of a shuffle task that completes in other waves of the shuffle phase.
$T_{sh,w2}^{max}$: the maximum execution time of a shuffle task that completes in other waves of the shuffle phase.
$T_r^{low}$: the lower bound duration of the reduce phase.
$T_r^{up}$: the upper bound duration of the reduce phase.
$T_r^{max}$: the maximum execution time of a reduce task.
$T_{job}^{low}$: the lower bound execution time of a Hadoop job.
$T_{job}^{up}$: the upper bound execution time of a Hadoop job.
$T_{job}^{avg}$: the average execution time of a Hadoop job.

In practice, job tasks in different waves may not complete at exactly the same time due to varied overhead in disk I/O operations and network communication. Therefore, the improved HP model estimates the lower bound and the upper bound of the execution time for each phase to cover the best-case and the worst-case scenarios respectively.
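Before presenting the expressions, a small sketch may help clarify where the per-phase parameters come from. Assuming a job profile that records individual task durations per phase (an assumed layout, not one specified in the paper), the average and maximum durations used below can be extracted as follows.

```python
from statistics import mean

def phase_parameters(durations):
    """Return the (average, maximum) duration for one group of tasks."""
    return mean(durations), max(durations)

def profile_parameters(profile):
    """profile is assumed to be a dict of task-duration lists keyed by phase,
    e.g. {'map': [...], 'shuffle_w1': [...], 'shuffle_w2': [...], 'reduce': [...]}."""
    return {phase: phase_parameters(durs) for phase, durs in profile.items()}

# Hypothetical job profile (durations in seconds).
profile = {'map': [12.1, 13.4, 11.8], 'shuffle_w1': [40.2, 42.7],
           'shuffle_w2': [15.3, 16.1], 'reduce': [22.5, 21.9]}
print(profile_parameters(profile))
```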

We consider a job that runs in both non-overlapping and overlapping stages. The lower bound and the upper bound of the map phase in the first wave, which is a non-overlapping stage, can be computed using Eq.(8) and Eq.(9) respectively.

$T_{m,w1}^{low} = \dfrac{T_m^{avg} \times N_m^{w1}}{N_m^{slot}}$   (8)

$T_{m,w1}^{up} = \dfrac{T_m^{max} \times N_m^{w1}}{N_m^{slot}}$   (9)

In the overlapping stage of a running job, the map phase overlaps with the shuffle phase. Specifically, the tasks running in other waves of the map phase run in parallel with the tasks running in the first wave of the shuffle phase. As the shuffle phase always completes after the map phase, i.e. the shuffle phase takes longer than the map phase, we use the duration of the shuffle phase in the first wave to compute the lower bound and the upper bound of the overlapping stage of the job using Eq.(10) and Eq.(11) respectively.

$T_{sh,w1}^{low} = \dfrac{T_{sh,w1}^{avg} \times N_{sh}^{w1}}{N_r^{slot}}$   (10)

$T_{sh,w1}^{up} = \dfrac{T_{sh,w1}^{max} \times N_{sh}^{w1}}{N_r^{slot}}$   (11)

In other waves of the shuffle phase, the tasks run in a non-overlapping stage. Hence, the lower bound and the upper bound of the non-overlapping stage of the shuffle phase can be computed using Eq.(12) and Eq.(13) respectively.

$T_{sh,w2}^{low} = \dfrac{T_{sh,w2}^{avg} \times N_{sh}^{w2}}{N_r^{slot}}$   (12)

$T_{sh,w2}^{up} = \dfrac{T_{sh,w2}^{max} \times N_{sh}^{w2}}{N_r^{slot}}$   (13)

The reduce tasks start after completion of the shuffle tasks. Therefore, the reduce tasks complete in a non-overlapping stage. The lower bound and the upper bound of the reduce phase can be computed using Eq.(14) and Eq.(15) respectively.

$T_r^{low} = \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (14)

$T_r^{up} = \dfrac{T_r^{max} \times N_r}{N_r^{slot}}$   (15)

As a result, the lower bound and the upper bound of the execution time of a Hadoop job can be computed by combining the execution durations of all the three phases using Eq.(16) and Eq.(17) respectively.

$T_{job}^{low} = T_{m,w1}^{low} + T_{sh,w1}^{low} + T_{sh,w2}^{low} + T_r^{low}$   (16)

$T_{job}^{up} = T_{m,w1}^{up} + T_{sh,w1}^{up} + T_{sh,w2}^{up} + T_r^{up}$   (17)

By substituting the values into Eq.(16) and Eq.(17), we have

$T_{job}^{low} = \dfrac{T_m^{avg} \times N_m^{w1}}{N_m^{slot}} + \dfrac{T_{sh,w1}^{avg} \times N_{sh}^{w1}}{N_r^{slot}} + \dfrac{T_{sh,w2}^{avg} \times N_{sh}^{w2}}{N_r^{slot}} + \dfrac{T_r^{avg} \times N_r}{N_r^{slot}}$   (18)

$T_{job}^{up} = \dfrac{T_m^{max} \times N_m^{w1}}{N_m^{slot}} + \dfrac{T_{sh,w1}^{max} \times N_{sh}^{w1}}{N_r^{slot}} + \dfrac{T_{sh,w2}^{max} \times N_{sh}^{w2}}{N_r^{slot}} + \dfrac{T_r^{max} \times N_r}{N_r^{slot}}$   (19)

Finally, we take the average of Eq.(18) and Eq.(19) to estimate the execution time of a Hadoop job using Eq.(20).

$T_{job}^{avg} = \dfrac{T_{job}^{low} + T_{job}^{up}}{2}$   (20)
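As a worked summary of Eq.(8) to Eq.(20), the following sketch (our own minimal illustration under the paper's notation, not the authors' implementation) computes the lower bound, the upper bound and the average job execution time.

```python
def estimate_job_time(t_m_avg, t_m_max, n_m_w1, n_m_slot,
                      t_sh_w1_avg, t_sh_w1_max, n_sh_w1,
                      t_sh_w2_avg, t_sh_w2_max, n_sh_w2,
                      t_r_avg, t_r_max, n_r, n_r_slot):
    # Eq.(8)/(9): non-overlapping map phase in the first wave
    t_map_low = t_m_avg * n_m_w1 / n_m_slot
    t_map_up = t_m_max * n_m_w1 / n_m_slot
    # Eq.(10)/(11): first wave of the shuffle phase (overlapping stage)
    t_sh1_low = t_sh_w1_avg * n_sh_w1 / n_r_slot
    t_sh1_up = t_sh_w1_max * n_sh_w1 / n_r_slot
    # Eq.(12)/(13): other waves of the shuffle phase (non-overlapping stage)
    t_sh2_low = t_sh_w2_avg * n_sh_w2 / n_r_slot
    t_sh2_up = t_sh_w2_max * n_sh_w2 / n_r_slot
    # Eq.(14)/(15): reduce phase
    t_red_low = t_r_avg * n_r / n_r_slot
    t_red_up = t_r_max * n_r / n_r_slot
    # Eq.(16)-(20): combine the three phases and average the two bounds
    t_low = t_map_low + t_sh1_low + t_sh2_low + t_red_low
    t_up = t_map_up + t_sh1_up + t_sh2_up + t_red_up
    return t_low, t_up, (t_low + t_up) / 2
```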
C. Job Execution Estimation

In the previous section, we presented the mathematical expressions of the improved HP model. The lower bound and the upper bound of the map phase can be computed using Eq.(8) and Eq.(9) respectively. However, the durations of the shuffle phase and the reduce phase have to be estimated based on the running records of a Hadoop job.

When a job processes an increasing size of input dataset, the number of map tasks increases proportionally while the number of reduce tasks is specified by a user in the configuration file. The number of reduce tasks can vary depending on the user's configurations. When the number of reduce tasks is kept constant, the execution durations of both the shuffle tasks and the reduce tasks increase linearly with the increasing size of the input dataset, as considered in the HP model. This is because the volume of an intermediate data block equals the total volume of the generated intermediate data divided by the number of reduce tasks. As a result, the volume of an intermediate data block also increases linearly with the increasing size of the input dataset. However, when the number of reduce tasks varies, the execution durations of both the shuffle tasks and the reduce tasks are not linear in the increasing size of the input dataset.

In either the shuffle phase or the reduce phase, we consider the tasks running in both overlapping and non-overlapping stages. Unlike the HP model, the improved model considers a varied number of reduce tasks. As a result, the durations of both the shuffle tasks and the reduce tasks are nonlinear in the size of the input dataset. Therefore, instead of using a simple linear regression as adopted by the HP model, we apply Locally Weighted Linear Regression (LWLR) [20][21] in the improved model to estimate the execution durations of both the shuffle tasks and the reduce tasks.

LWLR is an instance-based nonparametric method which assigns a weight to each training instance $x$ according to its Euclidean distance from the query instance $x_q$. LWLR assigns a high weight to an instance $x$ which is close to the query instance $x_q$ and a low weight to instances that are far away from the query instance $x_q$. The weight of an instance can be computed using a Gaussian function as illustrated in Eq.(21).

$w_k = \exp\left(-\dfrac{(distance(x_k, x_q))^2}{2h^2}\right), \quad k = 1, 2, 3, \ldots, m$   (21)

where:
- $w_k$ is the weight of the training instance at location $k$.
- $x_k$ is the training instance at location $k$.
- $m$ is the total number of training instances.
- $h$ is a smoothing parameter which determines the width of the local neighborhood of the query instance.

The value of $h$ is crucial to LWLR. Users have the option of using a new value of $h$ for each estimation or a single global value of $h$. However, finding an optimal value for $h$ is a challenging issue in itself [22]. In the improved HP model, a single global value of $h$ is used to minimize the estimated mean square errors.

In the improved HP model, LWLR is used to estimate the durations of both the shuffle tasks and the reduce tasks. First, we estimate $T_{sh,w1}^{avg}$, which is the average duration of the shuffle tasks running in the first wave of the shuffle phase. To estimate $T_{sh,w1}^{avg}$, we define a matrix $X_{m \times n}$ whose rows contain the training instances $x_1, x_2, x_3, \ldots, x_m$, where $n$ is the number of feature variables, which is set to 2 (i.e. the size of an intermediate dataset and the number of reduce tasks). We define a vector $Y = (y_1, y_2, \ldots, y_m)$ of dependent variables that are the average durations of the shuffle tasks. For example, $y_i$ represents the average execution time of the shuffle task that corresponds to the training instance $x_i$. We define another matrix $X_q$ whose rows are query instances. Each query instance $x_q$ contains both the size of the intermediate dataset $d_{new}$ and the number of reduce tasks $r_{new}$ of a new job. We calculate $d_{new}$ based on the average input data size of a map task, the total number of map tasks and the map selectivity metric, i.e. $d_{new} = D_{m,input}^{avg} \times N_m \times M_{selectivity}$.

For the estimation of $T_{sh,w1}^{avg}$, we calculate the weight of each training instance using Eq.(21) and then compute the parameter $\beta$, which is the coefficient of LWLR, using Eq.(22).

$\beta = (X^T W X)^{-1} (X^T W Y)$   (22)

Here $W = diag(w_k)$ is the diagonal matrix where all the non-diagonal cells are 0. The value of a diagonal cell increases when the distance between a training instance and the query instance decreases.

Finally, the duration of a new shuffle task running in the first wave of the shuffle phase can be estimated using Eq.(23).

$T_{sh,w1}^{avg} = X_q \times \beta$   (23)

Similarly, the durations of $T_{sh,w1}^{max}$, $T_{sh,w2}^{avg}$, $T_{sh,w2}^{max}$, $T_r^{avg}$ and $T_r^{max}$ can be estimated.
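The estimation step of Eq.(21) to Eq.(23) can be sketched as follows (a minimal NumPy illustration, not the authors' code; the example feature values and the smoothing parameter are hypothetical).

```python
import numpy as np

def lwlr_estimate(X, Y, x_q, h):
    """Locally Weighted Linear Regression for a single query instance.

    X   : (m, n) matrix of training instances, e.g. [intermediate data size, reduce tasks]
    Y   : (m,) vector of observed durations, e.g. average first-wave shuffle task times
    x_q : (n,) query instance describing the new job
    h   : smoothing parameter controlling the width of the local neighborhood
    """
    # Eq.(21): Gaussian weights based on Euclidean distance to the query instance
    dists = np.linalg.norm(X - x_q, axis=1)
    w = np.exp(-(dists ** 2) / (2 * h ** 2))
    W = np.diag(w)
    # Eq.(22): locally weighted least-squares coefficients
    beta = np.linalg.inv(X.T @ W @ X) @ (X.T @ W @ Y)
    # Eq.(23): predicted duration for the query instance
    return x_q @ beta

# Hypothetical training data: [intermediate data size (GB), reduce tasks] -> duration (s).
X = np.array([[10.0, 16], [20.0, 16], [20.0, 32], [40.0, 32]])
Y = np.array([35.0, 62.0, 40.0, 70.0])
print(lwlr_estimate(X, Y, np.array([30.0, 32]), h=8.0))
```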
The estimated values of both the shuffle phase and the reduce phase are used in the improved HP model to estimate the overall execution time of a Hadoop job when processing a new input dataset. Fig.4 shows the overall architecture of the improved HP model, which summarizes the work of the improved HP model in job execution estimation. The boxes in gray represent the same work presented in the HP model. It is worth noting that the improved HP model works in an offline mode and estimates the execution time of a job based on the job profile.

Fig.4. The architecture of the improved HP model (job profile, map phase estimation, first-wave and other-wave shuffle estimation with Locally Weighted Linear Regression, reduce time estimation, and overall job estimation).

IV. RESOURCE PROVISIONING

The improved HP model presented in Section III can estimate the execution time of a Hadoop job based on the job execution profile, allocated resources (i.e. map slots and reduce slots), and the size of an input dataset. The improved H
