A Brief Review of Scheduling Algorithms of the MapReduce Model Using Hadoop


International Journal of Engineering Trends and Technology (IJETT) – Volume 45, Number 1 – March 2017
ISSN: 2231-5381 | http://www.ijettjournal.org

A Brief Review of Scheduling Algorithms of the MapReduce Model Using Hadoop

Adhishtha Tyagi#1, Sonia Sharma*2
#Computer Science Department, JMIT Radaur
1 M.Tech Scholar, 2 Assistant Professor

Abstract — Scheduling has been an active area of research in computing systems since their inception. The Hadoop framework has become very popular and is widely used in distributed data processing. Hadoop has become a central platform to store big data through its Hadoop Distributed File System (HDFS), as well as to run analytics on this stored big data using its MapReduce component. The main objective of this paper is to study the MapReduce framework, the MapReduce model, scheduling in Hadoop, various scheduling algorithms and various optimization techniques in job scheduling. Scheduling algorithms of the MapReduce model using Hadoop vary in design and behaviour, and are used to handle issues such as data locality, resource awareness, energy and time.

Keywords — Big Data, MapReduce and Hadoop framework, MapReduce model.

I. INTRODUCTION

Big Data, according to McKinsey, refers to "datasets whose size is beyond the ability of typical database software tools to capture, store, manage and analyze". Big Data has become very popular in Information Technology, where data is becoming complex and growing day by day.

According to IBM, Big Data has three attributes:
1. Variety: the generated data are of different types. It generally refers to structured, semi-structured or unstructured data in large amounts.
2. Volume: the large quantities of data that are generated continuously; they may reach hundreds or thousands of terabytes.
3. Velocity: the processing and analysis of data must be done quickly, to extract the value of the data in an appropriate time[1].

The advent of Big Data brings many challenges to the existing data processing architecture. Hence there arises a need for big data frameworks that complement companies' existing infrastructure to support the analysis of the new data. There are many Big Data processing frameworks currently available in the market, such as Dryad, Apache Hadoop MapReduce, Apache Hadoop YARN and Apache Spark, to name a few[2].

A. MAPREDUCE AND HADOOP

MapReduce is a framework for processing huge amounts of data in parallel by distributing a job into various independent tasks. Hadoop is Apache's open-source implementation of the MapReduce framework[3].

Hadoop Distributed File System (HDFS) and Hadoop MapReduce are the key services of the Hadoop platform. A Hadoop system can be described based on three factors: cluster, workload and user. Each of these factors is either homogeneous or heterogeneous, which reflects the heterogeneity level of the Hadoop system[4].

B. MAPREDUCE MODEL

The MapReduce programming model consists of two functions: Map and Reduce. Input data is divided into fixed-size blocks and fed into parallel Map tasks. Each Map task processes its data chunk and produces intermediate output in the form of key-value pair tuples. These tuples are shuffled across different reduce nodes based on their key values. Each Reduce task performs three steps:
- copy: the output of the map tasks is copied to the reducer nodes,
- sort: the collection of outputs is sorted based on their key values, and
- reduce: aggregation is applied to the data.

Fig. 1: Phases of execution in a Map task
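The map/shuffle/reduce flow described above can be illustrated with a toy word count in plain Python. This is a sketch of the programming model only, not Hadoop API code; the function names are illustrative.

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

# Map: process one input split and emit (key, value) tuples.
def map_task(chunk):
    return [(word, 1) for word in chunk.split()]

# Shuffle: sort all intermediate tuples by key and group the values,
# mimicking the copy/sort steps performed before each Reduce task runs.
def shuffle(map_outputs):
    tuples = sorted((kv for out in map_outputs for kv in out), key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(tuples, key=itemgetter(0))}

# Reduce: aggregate the values collected for one key.
def reduce_task(key, values):
    return key, sum(values)

chunks = ["big data big", "data big"]          # stand-in for fixed-size input splits
intermediate = shuffle([map_task(c) for c in chunks])
result = dict(reduce_task(k, v) for k, v in intermediate.items())
print(result)                                  # {'big': 3, 'data': 2}
```

In real Hadoop the same three roles are played by the Mapper, the framework's shuffle/sort phase, and the Reducer, distributed across TaskTrackers instead of running in one process.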

Fig. 2: Phases of execution in a Reduce task

The MapReduce architecture consists of one master (JobTracker) and many workers or slaves (TaskTrackers). The JobTracker receives each job submitted by a user and divides it into map and reduce tasks. It then assigns the tasks to TaskTrackers, monitors their progress and, after the completion of the entire job, reports the status of the job to the user. Each TaskTracker has a fixed number of map and reduce task slots, which determines how many map and reduce tasks it can run at a time[5].

C. SCHEDULING IN HADOOP

Hadoop is a multi-user data warehouse used to support a variety of different types of processing jobs, with a pluggable scheduler framework providing greater control. The main objective of scheduling is to maximize throughput and minimize completion time; overhead and available resources must be balanced by allocating jobs to processors[6].

- Hadoop uses FIFO as its default scheduler. Jobs submitted earlier get preference over jobs submitted later: the JobTracker pulls the oldest jobs first from the job queue. This scheduler is mostly used when the execution order of jobs is not important.
- The Capacity Scheduler, developed by Yahoo, allows multiple tenants to securely share a large cluster. It supports hierarchical queues to ensure that resources are shared among the sub-queues of an organisation before other queues are allowed to use those resources.
- The Fair Scheduler was developed by Facebook. Its core idea is to assign resources to jobs such that, on average, each job gets an equal share of the available resources.
- Delay scheduling aims to address the conflict between locality and fairness. It temporarily relaxes fairness to improve locality by allowing jobs to wait for scheduling on a node with local data: if the data for a task is not present on a node, the TaskTracker waits for some time. If a local node requests a task assignment, the scheduler checks the size of the job; if the job is too short, the scheduler skips it and looks for subsequent jobs available to run.
- Dynamic Proportional Scheduling provides more job sharing and prioritization, resulting in increased shares of cluster resources and more differentiation in the service levels of different jobs. This improves response time in multi-user Hadoop environments.

II. LITERATURE SURVEY

Xiangming Dai and Brahim Bensaou in 2016 proposed a novel task scheduling algorithm for Hadoop MapReduce called the dynamic priority multi-queue scheduler (DPMQS). DPMQS i) increases the data locality of jobs, and ii) dynamically increases the priority of jobs that are close to completing their Map phase, to bridge the time gap between the start of the reduce tasks and the execution of the reduce function for these jobs. The authors discuss the details of DPMQS and its practical implementation, then assess its performance in a small physical cluster and in large-scale simulated clusters to compare it with the other schedulers in Hadoop. Both real experiments and simulation results show that DPMQS decreases the response time, and demonstrate that DPMQS is not flexible with respect to changes in the cluster geometry[7].

Divya M. and Annappa B. in 2015 focused on optimizing job scheduling in Hadoop. The Workload Characteristic and Resource Aware (WCRA) Hadoop scheduler classifies jobs as CPU bound or Disk I/O bound. Likewise, with respect to performance, nodes in the cluster are classified as CPU busy or Disk I/O busy. The amount of primary memory available in a node should be more than 25% before a job is scheduled on it. The performance parameters of Map tasks (the time required to parse the data and to map, sort and merge the result) and of Reduce tasks (the time to merge, parse and reduce) are used to categorize a job as CPU bound or Disk I/O bound. Tasks are assigned priorities based on their minimum Estimated Completion Time. Jobs are scheduled on a compute node in such a way that jobs already running are not affected. Experimental results showed a 30% improvement in performance compared to Hadoop's FIFO, Fair and Capacity schedulers[4].

Qinghua Lu and Shanshan Li in 2015 proposed a genetic-algorithm-based job scheduling model to improve the efficiency of Big Data Analytics (BDA). To implement the scheduling model, they leverage an estimation module to predict the performance of clusters when processing jobs, and they evaluated the proposed job scheduling model in terms of feasibility and performance[8].

YongLiang Xu and Wentong Cai in 2015 proposed the Dynamic Task Splitting Scheduler (DTSS) to mitigate the tradeoff between fairness and data locality during job scheduling. DTSS dynamically splits a task and executes the split part immediately on a non-data-local node to improve fairness. Analysis shows that it is possible to improve both fairness and performance by adjusting the proportion of the split task. DTSS is shown to improve the makespan of different users in a cluster by 2% to 11% compared to delay scheduling in cases where it is difficult to obtain data-local nodes. The experiments also show that DTSS is not a suitable scheduler under conditions where jobs can obtain data-local nodes easily[2].

Ke Wang and Iman Sadooghi in 2015 aimed to address the YARN scaling issues through a distributed task execution framework, MATRIX, originally designed to schedule the execution of data-intensive, many-task scientific applications on supercomputers. They also ran and simulated MATRIX with fine-grained sub-second workloads, achieving an efficiency of 86.8% at 64 thousand cores for the 150 ms workload[9].

Qutaibah Althebyan and Omar ALQudah in 2014 proposed a new scheduling algorithm based on a multi-threading principle. The proposed algorithm divides the cluster into multiple blocks, each of which is scheduled by a special thread.
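The block-per-thread principle just described can be sketched as follows. This is a hypothetical illustration only, not the authors' implementation; all names and the partitioning rule are invented for the sketch.

```python
import threading
from queue import Queue

# Each scheduler thread is responsible for dispatching the tasks of one block.
# Here "dispatching" is simulated by putting (block_id, task) on a shared queue.
def schedule_block(block_id, tasks, dispatched):
    for task in tasks:
        dispatched.put((block_id, task))

def run_multithreaded_scheduler(tasks, num_blocks):
    dispatched = Queue()
    # Partition the pending tasks into blocks (round-robin, as an assumption).
    blocks = [tasks[i::num_blocks] for i in range(num_blocks)]
    threads = [threading.Thread(target=schedule_block, args=(i, b, dispatched))
               for i, b in enumerate(blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [dispatched.get() for _ in range(dispatched.qsize())]

scheduled = run_multithreaded_scheduler([f"task{i}" for i in range(6)], num_blocks=3)
print(len(scheduled))   # 6: every task was dispatched by some block's thread
```

The point of the design is that no single scheduling loop serializes the whole cluster: each block's thread makes its decisions independently.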
Two major factors were used to test the algorithm: the simulation time and the energy consumption. The scheduler was then compared with existing schedulers, and the results showed the superiority of the proposed scheduler over the existing ones[1].

Chen He and David Swanson in 2013 created a novel real-time scheduler for MapReduce to overcome the deficiencies of existing schedulers. It avoids accepting jobs that would lead to deadline misses and improves cluster utilization. They implemented the scheduler in a Hadoop system, and experimental results show that it provides deadline guarantees for accepted jobs and achieves good cluster utilization[3].

Li Liu and Yuan Zhou in 2012 formulated the preemptive scheduling problem under a deadline constraint and proposed algorithms for it. The experimental results showed that the preemptive scheduling approach is more efficient than the non-preemptive one for executing jobs under a given deadline[6].

Hong Mao and Shengqiu Hu in 2011 focused on optimizing the scheduler of the MapReduce framework at the task level. They took into account the hardware configuration and real-time workload of the nodes in a Hadoop cluster, aiming to shorten the time cost of MapReduce jobs and improve the hardware resource utilization rate. They put forward a load-driven task scheduler that assigns tasks to TaskTrackers according to the workload of the slave nodes. It is based on a Dynamic Slot Controller (DSC), which adaptively adjusts the Map task Slots (MS) and Reduce task Slots (RS) of the TaskTrackers running on the slave nodes.
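The adaptive slot adjustment idea behind the DSC can be sketched as follows. The load thresholds, slot bounds and single-metric load model here are illustrative assumptions, not values from the paper.

```python
# Toy sketch of a Dynamic Slot Controller: adjust the number of map slots (MS)
# and reduce slots (RS) on a slave node according to its measured load.
def adjust_slots(ms, rs, cpu_load, min_slots=1, max_slots=8):
    if cpu_load > 0.8:      # node overloaded: shrink both slot pools
        ms, rs = max(ms - 1, min_slots), max(rs - 1, min_slots)
    elif cpu_load < 0.3:    # node underused: grow both slot pools
        ms, rs = min(ms + 1, max_slots), min(rs + 1, max_slots)
    return ms, rs           # moderate load: leave the slot counts unchanged

print(adjust_slots(4, 2, cpu_load=0.9))   # (3, 1)
print(adjust_slots(4, 2, cpu_load=0.1))   # (5, 3)
```

A slot count that tracks real node load, rather than a fixed configuration value, is what lets the load-driven scheduler place more tasks on lightly loaded slaves.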
The load-driven task scheduler reduces the time consumption of a MapReduce job by 14% and improves the rate of CPU utilization of the Hadoop cluster by 34% when processing 10 GB of data[10].

Kamal Kc and Kemafor Anyanwu in 2010 extended the real-time cluster scheduling approach to account for the two-phase computation style of MapReduce, developed criteria for scheduling jobs based on user-specified deadline constraints, and discussed the implementation and preliminary evaluation of a Deadline Constraint Scheduler for Hadoop, which ensures that only jobs whose deadlines can be met are scheduled for execution[5].

III. TABLE: OVERVIEW OF VARIOUS TECHNIQUES AND FINDINGS IN JOB SCHEDULING

Kamal Kc, Kemafor Anyanwu (2010)
Technique: Scheduling Hadoop jobs to meet deadlines.
Findings: Extended the real-time cluster scheduling approach to derive minimum map and reduce task count criteria for performing task scheduling with deadline constraints.

Hong Mao, Shengqiu Hu, Zhenzhong Zhang, Limin Xiao, Li Ruan (2011)
Technique: A load-driven task scheduler with adaptive DSC for MapReduce.
Findings: Optimizes the scheduler of the MapReduce framework at the task level, taking into account the hardware configuration
and the dynamic workload of the nodes in a Hadoop cluster.

Li Liu, Yuan Zhou, Ming Liu, Guandong Xu, Xiwei Chen, Dangping Fan, Qianru Wang (2012)
Technique: Developing deadline-based MapReduce schedulers by using preemption.
Findings: Employed preemption in the deadline-constraint scheduling model in order to overcome issues such as the starvation caused by non-preemptive scheduling.

Chen He, Ying Lu, David Swanson (2013)
Technique: A real-time scheduler for MapReduce, to overcome the deficiencies of an existing scheduler.
Findings: The scheduler overcomes the deficiencies of an existing algorithm and achieves good cluster utilization and a 100% job success ratio, ensuring the real-time property for all admitted MapReduce jobs.

Qutaibah Althebyan, Omar ALQudah, Yaser Jararweh, Qussai Yaseen (2014)
Technique: A new scheduling algorithm based on a multi-threading principle.
Findings: The cluster is divided into blocks, every block is scheduled by a special thread, and this scheduling is achieved in a synchronous time.

YongLiang Xu, Wentong Cai (2015)
Technique: Dynamic Task Splitting Scheduler (DTSS) to mitigate the tradeoffs between fairness and data locality during job scheduling.
Findings: DTSS dynamically splits a task and schedules one part of the original task for processing at a non-data-local node immediately. It is possible to improve both the performance and the fairness of the users.

Divya M., Annappa B. (2015)
Technique: Workload Characteristics and Resource Aware (WCRA) Hadoop scheduler that classifies jobs into CPU bound and Disk I/O bound.
Findings: A scheduler is designed that considers the job characteristics and node status before making the scheduling decisions. Using the performance of map tasks and reduce tasks in the classification of workload improves the performance.

Qinghua Lu, Shanshan Li, Weishan Zhang (2015)
Technique: Genetic-algorithm-based job scheduling model to improve the efficiency of Big Data Analytics (BDA).
Findings: Uses a performance estimation module for obtaining an optimized job scheduling scheme with optimized time and cost consumption. The optimized solutions can be used to enable effective scheduling strategies; then, in the actual running, the system can make use of the chosen scheduling scheme to execute data processing jobs.

Ke Wang, Ning Liu, Iman Sadooghi, Xi Yang, Xiaobing Zhou, Tonglin Li, Michael Lang, Xian-He Sun, Ioan Raicu (2015)
Technique: Distributed task execution to overcome Hadoop scaling limitations.
Findings: Proposed to leverage the distributed design wisdoms of the MATRIX task execution framework to overcome the scaling limitations of Hadoop towards extreme scales. MATRIX addressed the scaling issues of YARN through distributed management using key-value stores.

Xiangming Dai, Brahim Bensaou (2016)
Technique: A novel task scheduling algorithm for Hadoop MapReduce, the dynamic priority multi-queue scheduler (DPMQS).
Findings: DPMQS firstly increases the data locality rate of map tasks by launching them as near as possible to their input data, and secondly increases the priority of jobs that are near to completing their map phase, reducing the time reduce tasks spend waiting for the completion of the map phase.

IV. CONCLUSIONS

Various research papers have been reviewed for the survey presented above. Each scheduler considers resources such as CPU, memory and job deadlines. To overcome the problems of big data storage and processing, the open-source Hadoop framework can be used. Starvation caused by non-preemptive scheduling is reduced by adding preemption to the deadline-constraint scheduling model. The real-time MapReduce scheduler achieves good cluster utilization. The Dynamic Task Splitting Scheduler improves the performance and fairness of users by dynamically splitting tasks.
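The deadline-constraint admission idea recalled above (accept a job only if its deadline can be met) can be sketched with a toy estimate. The wave-based completion-time model and all parameter names are assumptions for illustration, not the published schedulers' actual criteria.

```python
import math

# Toy admission test: estimate the job's completion time and accept the job
# only if the estimate fits within the user-specified deadline.
def admit_job(map_tasks, reduce_tasks, avg_task_secs, free_slots, deadline_secs):
    # Naive two-phase estimate: tasks run in "waves" of `free_slots` at a time,
    # and the reduce phase starts only after the map phase finishes.
    waves = math.ceil(map_tasks / free_slots) + math.ceil(reduce_tasks / free_slots)
    return waves * avg_task_secs <= deadline_secs

print(admit_job(map_tasks=40, reduce_tasks=10, avg_task_secs=30,
                free_slots=10, deadline_secs=200))   # True: 5 waves * 30 s = 150 s
print(admit_job(map_tasks=40, reduce_tasks=10, avg_task_secs=30,
                free_slots=10, deadline_secs=120))   # False: 150 s misses 120 s
```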
Future work will consider enhancements in scheduling using a particular scheduler.

REFERENCES

[1] Qutaibah Althebyan, Omar ALQudah, Yaser Jararweh, Qussai Yaseen, "Multi-Threading Based Map Reduce Tasks Scheduling", 5th International Conference on Information and Communication Systems (ICICS), 2014.
[2] YongLiang Xu, Wentong Cai, "Hadoop Job Scheduling with Dynamic Task Splitting", International Conference on Cloud Computing Research and Innovation, 2015.
[3] Chen He, Ying Lu, David Swanson, "Real-Time Scheduling in Map Reduce Clusters", IEEE International Conference on High Performance Computing and Communications & IEEE International Conference on Embedded and Ubiquitous Computing, 2013.
[4] Divya M., Annappa B., "Workload Characteristics and Resource Aware Hadoop Scheduler", IEEE 2nd International Conference on Recent Trends in Information Systems (ReTIS), 2015.
[5] Kamal Kc, Kemafor Anyanwu, "Scheduling Hadoop Jobs to Meet Deadlines", 2nd IEEE International Conference on Cloud Computing Technology and Science, 2010.
[6] Li Liu, Yuan Zhou, Ming Liu, Guandong Xu, Xiwei Chen, Dangping Fan, Qianru Wang, "Preemptive Hadoop Jobs Scheduling under a Deadline", Eighth International Conference on Semantics, Knowledge and Grids, 2012.
[7] Xiangming Dai, Brahim Bensaou, "Scheduling for Response Time in Hadoop MapReduce", IEEE ICC 2016 SAC Cloud Communications and Networking, 2016.
[8] Qinghua Lu, Shanshan Li, Weishan Zhang, "Genetic Algorithms Based Job Scheduling for Big Data Analytics", International Conference on Identification, Information, and Knowledge in the Internet of Things, 2015.
[9] Ke Wang, Ning Liu, Iman Sadooghi, Xi Yang, Xiaobing Zhou, Tonglin Li, Michael Lang, Xian-He Sun, Ioan Raicu, "Overcoming Hadoop Scaling Limitations through Distributed Task Execution", IEEE International Conference on Cluster Computing, 2015.
[10] Hong Mao, Shengqiu Hu, Zhenzhong Zhang, Limin Xiao, Li Ruan, "A Load-Driven Task Scheduler with Adaptive DSC for MapReduce", IEEE/ACM International Conference on Green Computing and Communications, 2011.
