Job Schedulers for Machine Learning and Data Mining Algorithms Distributed in Hadoop


VI Jornadas de Cloud Computing & Big Data (JCC&BD 2018)

1 Cornejo, Félix Martin; 2 Zunino, Alejandro; 3 Murazzo, María

1,3 Departamento de Informática, Universidad Nacional de San Juan; 2 Instituto Superior de Ingeniería de Software de Tandil (ISISTAN), Universidad Nacional del Centro de la Provincia de Buenos Aires

1 fmartin.cornejo@gmail.com; 2 azunino@gmail.com; 3 maritemurazzo@gmail.com

Abstract

The standard Hadoop scheduler does not consider job characteristics such as computational demand, inputs/outputs, dependencies, or the location of the data, even though these could be a valuable source of information for allocating resources to jobs and optimizing their use. The objective of this research is to take advantage of this information for scheduling, limiting the scope to ML/DM algorithms, in order to improve execution times with respect to existing schedulers. The aim is to improve Hadoop job schedulers, seeking to optimize the execution times of machine learning and data mining algorithms on Clusters.

Keywords: Big Data, Hadoop, Hadoop schedulers, ML/DM algorithms, machine learning.

1. Introduction

Recently, the valuable knowledge that can be extracted from the analysis of large-scale data sets has given rise to the so-called "Big Data". The term "Big Data" [1,2] refers to a collection of data sets so large that they cannot be processed using traditional database administration tools.
This generated numerous scientific challenges, such as how to provide support for storing, manipulating and then retrieving information. In this sense, infrastructures capable of processing large volumes of data in acceptable times and with limited computational resources are necessary. Solutions typically rely on concepts of parallel and distributed computing, where Clusters and Computational Grids [3] allow processing Big Data at relatively low costs [4,5].

Although parallel and distributed computing appears as one of the most interesting alternatives to store and manipulate Big Data, some of its characteristics prevent its use by common users [6]. Aspects such as data dependencies, integrity, load balancing, scheduling and fault tolerance, although extremely difficult for average programmers, are crucial for Big Data solutions to be operational. For this reason, several frameworks have been proposed that abstract these aspects and provide high-level solutions for users and programmers [5,6,7,8,9,10,11]. Some of those frameworks are based on specific programming paradigms such as Fork-Join, MapReduce and MPI.

Recently, the MapReduce paradigm has attracted attention due to its applicability in parallel computing, being widely used by Google on top of its distributed file system GFS [12]. The great novelty of the paradigm, and of its framework implementations, is that they conceal from the programmer a great part of the complexity of parallelization and distribution. Thus, programmers can focus on the functionality of their programs, while the framework abstracts that complexity and controls the underlying computational infrastructure.
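
To illustrate this division of labor, the classic word-count example below shows essentially all the code a programmer writes under Hadoop's Java MapReduce API. It is a minimal sketch: the class names are our own and the input/output paths are taken from the command line, while input splitting, shuffling, distribution and fault tolerance are handled entirely by the framework.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every token of the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts received for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Note that nothing in the program says how many machines run it or what happens when a task fails; those decisions belong to the framework and, in particular, to the scheduler discussed throughout this paper.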

One of the most successful implementations of MapReduce is Apache Hadoop, which in addition provides solutions to store, manipulate and extract information from Big Data in different ways. Around Hadoop, an ecosystem of solutions for specific problems has flourished. Some examples are Pig, which facilitates the analysis of large datasets; Spark, which speeds up MapReduce using in-memory data structures; and Mahout, which aims to scale data mining and machine learning algorithms.

While these efforts have had a significant impact on research and industrial environments, the diversity of uses of Hadoop has led to underperforming results [13]. This is due, among other factors, to the scheduler, the component responsible for assigning the computational resources of a Cluster to jobs, which is one of the aspects that most influences the performance of Hadoop. Originally, Hadoop includes three schedulers: FIFO, Fair and Capacity. The default scheduler is FIFO, which queues jobs in order of arrival and executes them. The Fair scheduler, developed by Facebook, seeks to allocate Cluster resources equitably among tasks. The Capacity scheduler was developed by Yahoo! to ensure equitable allocation among a large number of Cluster users.

Although these Hadoop schedulers give users the flexibility to optimize the execution of their jobs on the Cluster, the resulting performance is often lower than expected [14], resulting not only in delayed execution times, but also in low utilization of Cluster resources. Some efforts have proposed ad-hoc schedulers for Hadoop that consider different aspects of both the tasks and the computational resources available.

In this paper we analyze and discuss previous efforts on improving Hadoop schedulers and present our current research in this direction. The rest of this paper is organized as follows. Section 2 provides background on Hadoop and its built-in schedulers. Section 3 overviews and discusses the most representative research on Hadoop scheduling. Section 4 proposes current and future research towards improving Hadoop scheduling for machine learning and data mining tasks.

2. Background

A significant amount of research has been done in the field of Hadoop job scheduling; however, further research is still needed to overcome some of the remaining challenges. Industry estimates suggest that 20% of data is in structured form while the remaining 80% is in semi-structured form. This is a big challenge not only in terms of volume and variety but also for data processing, which can lead to problems for I/O processing and job scheduling. Fig. 1 shows the architecture of a Hadoop distributed system.

Fig. 1. Hadoop distributed system architecture.

As we know, this is an era of Big Data, in which machines process significant amounts of data, in the range of terabytes or petabytes, using various applications in fields such as science, business and commerce. Such applications require a considerable amount of I/O processing and spend most of their time doing it, which makes I/O a major concern for job scheduling. It is reported that at the Facebook and Microsoft Bing data centers, I/O processing accounts for 79% of the jobs' duration and 69% of the resources. Therefore, here we present a comprehensive study of the most representative Hadoop schedulers.

2.1. Apache Hadoop Ecosystem

Apache Hadoop is a framework that allows the processing of large volumes of data on Cluster systems using a simple programming model. Its design allows scaling from a few nodes to thousands of nodes in an agile way, achieving high efficiency through horizontal scaling.

Hadoop is a distributed system that uses a Master-Slave architecture. It provides a distributed file system for storing data, called the Hadoop Distributed File System (HDFS), and the MapReduce programming paradigm to perform the computations.

Hadoop is a very diverse ecosystem, which grows day after day. It has become a large family of solutions for the storage, management, interaction and analysis of large data sets, integrated into a rich open source ecosystem created by the community of the Apache Software Foundation (Fig. 2).

Fig. 2. Hadoop Ecosystem.
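
To make the Master-Slave storage layer concrete, the fragment below sketches how a client interacts with HDFS through the standard org.apache.hadoop.fs API. It is a sketch under assumptions: the NameNode URI and the paths are illustrative, and in a real deployment the file system address comes from the cluster configuration rather than being hard-coded.

```java
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The NameNode address below is illustrative; in a real cluster it
    // comes from core-site.xml (fs.defaultFS).
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);

    // Write a file; HDFS splits it into blocks and replicates them
    // across DataNodes without further intervention from the client.
    try (FSDataOutputStream out = fs.create(new Path("/data/sample.txt"))) {
      out.writeUTF("hello hdfs");
    }

    // List the directory; metadata operations are served by the NameNode,
    // while block reads and writes go to the DataNodes.
    for (FileStatus status : fs.listStatus(new Path("/data"))) {
      System.out.println(status.getPath() + " " + status.getLen() + " bytes");
    }
  }
}
```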

2.2. Job schedulers in Hadoop

The aim of the job schedulers included in Hadoop [15] is to enable faster processing of jobs and to reduce response times as much as possible, by using scheduling techniques suited to the jobs together with the best possible utilization of resources. FIFO scheduling is the default mode in Hadoop: jobs arriving first get higher priority than those arriving later. In some situations this has a disadvantage: when long jobs are scheduled before short ones, the latter suffer starvation.

Fair scheduling shares the resources equally among all jobs. Capacity scheduling, introduced by Yahoo!, maximizes resource utilization and throughput in Clusters. The LATE [16] scheduling policy was developed to optimize job performance and minimize job response time by detecting slow-running processes in a Cluster and launching equivalent backup processes in the background. Facebook uses Delay scheduling, which applies changes to MapReduce to achieve better performance and lower response times for map tasks. In the deadline scheduler, deadline constraints are specified by the user before the jobs are scheduled, in order to increase system utilization. Resource-aware scheduling improves resource utilization, using the master node and the worker nodes to complete job scheduling. In matchmaking scheduling [17], each node is marked with a locality marker, which ensures that every node gets an equitable chance to seize a local task; through this policy, high data locality and better cluster utilization are achieved.

There have already been a few review papers on job scheduling algorithms for Big Data processing. [14] presented a comparative study of job scheduling methods and discussed their strengths and weaknesses. [18] presented a review of scheduling algorithms and provided guidelines for their improvement in Hadoop MapReduce. [19] presented a survey on Big Data management and discussed various scheduling algorithms in Hadoop, along with the latest advancements related to them. [20] presented a paper on the scheduling policies for Hadoop and performed a comparative study of MapReduce optimization techniques.

There is a clear need for efficient scheduling algorithms for the management of Big Data across the nodes of Hadoop clusters, in particular to support the execution of machine learning and data mining algorithms. Various factors affect the performance of scheduling policies, such as data volume (storage), format of data sources (data variety), speed (data velocity), security and privacy, cost, connectivity, and data sharing. To achieve better resource utilization and Big Data management, new scheduling policies should be designed. The following subsections describe some of the main challenges faced by job scheduling.
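
Before turning to those challenges, it is worth showing how the schedulers described above surface to users. The active scheduler is a cluster-side setting (in YARN-based Hadoop, the yarn.resourcemanager.scheduler.class property selects, for instance, the Fair or Capacity scheduler), while individual jobs choose a scheduler queue at submission time. The fragment below is a hedged sketch: the queue name "research" is hypothetical and must exist in the cluster's scheduler configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class QueueSubmission {
  public static Job configureForQueue() throws Exception {
    Configuration conf = new Configuration();
    // Direct the job to a specific scheduler queue. Under the Capacity
    // or Fair schedulers, the queue determines the share of Cluster
    // resources the job competes for; "research" is a hypothetical queue.
    conf.set("mapreduce.job.queuename", "research");
    return Job.getInstance(conf, "ml-training-job");
  }
}
```
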
2.3. Major challenges for job scheduling in Big Data

2.3.1. Data volume (storage)

An important problem of Big Data is that it is very large and includes data in unstructured formats, which makes it difficult to organize for analysis. Data includes both structured and unstructured records, and storing unstructured data is not an easy task.

2.3.2. Format of data sources (data variety)

Data comes into Big Data systems from homogeneous as well as heterogeneous sources, and this causes many problems due to the heterogeneity of data sources, data formats, and infrastructure.

2.3.3. Speed (data velocity)

Today, everybody expects everything to be done instantaneously, so speed is an important issue in Big Data. The speed of Big Data processing is restricted by various problems, such as query/retrieval, import/export, real-time versus offline processing, and statistical analysis.

2.3.4. Connectivity and data sharing

Data sharing and connectivity are still issues that need to be considered: at present, a majority of data points are not yet connected. Issues in Big Data related to connectivity and data sharing include data standards and interfaces, access permissions, and shared protocols.

2.3.5. Cost

Cost is also an issue in Big Data. Cost and upgrade concerns include the cost comparison between master and slave nodes, and the upgrade or modification of nodes.

3. Custom Hadoop Schedulers

Some works have studied the behavior of the schedulers included with Hadoop and proposed improvements. [14,21] proposed an analytical model for the FIFO and Fair schedulers based on experimental measurements and source code analysis. For a class of Map jobs with heavy-tailed service times, the authors found starvation problems with the Fair scheduler due to the way Reduce tasks are launched. To mitigate this, the Coupling scheduler was proposed, which couples the progress of Mappers and Reducers and optimizes the placement of both, mitigating the starvation problem and improving data locality.

[18] proposed an improvement for the FIFO scheduler that takes into account the location of the data and the characteristics of the jobs. In essence, it adopts different policies for CPU-bound or I/O-bound jobs based on data locality. As a result, it is possible to reduce data transfers and job execution times. Similarly, [19] designed a scheduler based on dynamic priorities whose objective is to reduce latency for jobs of variable length. That work focuses on data-intensive jobs, so it tries to maintain data locality during execution.

Other authors have investigated how to deal with Clusters formed by heterogeneous hardware. [20] presented a scheduler that uses the heterogeneity and mix of workloads to optimize jobs that use the same data sets. Although the scheduler has only been simulated, it is based on the idea of looking for the best node of the Cluster to run each job, exploiting the observation that a large percentage of MapReduce jobs have the same characteristics in terms of CPU, network and disk usage. The scheduler classifies Cluster jobs as CPU-bound or I/O-bound, classifies nodes analogously as good for CPU or for I/O, and then performs the assignment. [22] proposed a self-adaptive scheduler that takes into account that different nodes require different amounts of time to complete the same tasks due to differences in processing power, communication, architecture and memory. The scheduler uses historical information about each Cluster node to adjust execution parameters and to detect slow tasks; backup tasks can then be launched on other nodes, also taking data locality into account. [23] develops criteria for scheduling based on deadline restrictions specified by the user, and discusses the implementation and preliminary evaluation of a Deadline Constraint scheduler for Hadoop that ensures that only jobs whose deadlines can be met are scheduled for execution.

In [24], on the contrary, the authors focus on the heterogeneity of the tasks, proposing a scheduler that uses information such as job arrival rates and average execution times to make better decisions.

In [25], a quantitative approach is adopted, starting with a detailed study of the behavior of several Hadoop applications running on four different hardware configurations, to ensure that the data associated with the jobs are locally available to a cluster in a multi-cluster deployment. On the other hand, the work of [26] observes changes in the load of the nodes to assign jobs more intelligently.

In [27], the conflict between locality and fairness is addressed, and a simple algorithm called delay scheduling is proposed: when the job that should be scheduled next according to fairness cannot start a local task, it waits a small amount of time, letting other jobs start tasks in its place. It was verified that delay scheduling achieves nearly optimal data locality on a variety of workloads and can increase throughput while preserving fairness. In addition, the simplicity of delay scheduling makes it applicable under a wide variety of scheduling policies beyond fair sharing.

Finally, [28] studied the interactions between jobs and their intermediate results in order to group multiple jobs into one and thus reduce the number of queries needed to access shared intermediate data.
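
Among the ideas above, delay scheduling [27] is simple enough to sketch compactly. The fragment below is illustrative pseudocode in Java, not Hadoop's actual implementation; the JobState type, the preferredNodeOf map and the MAX_SKIPS bound are all assumptions made for the example.

```java
import java.util.List;
import java.util.Map;

/** Illustrative sketch of the delay-scheduling decision; every type
 *  and field here is hypothetical, not Hadoop's internal API. */
public class DelaySchedulerSketch {

  static final int MAX_SKIPS = 3; // how long a job may wait for locality

  static class JobState {
    List<String> pendingTasks; // tasks waiting to run
    int skipCount;             // scheduling opportunities already skipped
  }

  /** Called when a node has a free slot; returns the task to launch
   *  (the caller removes it from pendingTasks), or null to let the
   *  next job in the fair-share order try. */
  static String assignTask(JobState job, String freeNode,
                           Map<String, String> preferredNodeOf) {
    // Prefer a task whose input block lives on the free node.
    for (String task : job.pendingTasks) {
      if (freeNode.equals(preferredNodeOf.get(task))) {
        job.skipCount = 0;
        return task;
      }
    }
    if (job.skipCount < MAX_SKIPS) {
      job.skipCount++; // wait: another job may use this slot instead
      return null;
    }
    // Waited long enough: accept a non-local assignment to avoid starvation.
    return job.pendingTasks.isEmpty() ? null : job.pendingTasks.get(0);
  }
}
```

The single MAX_SKIPS bound is what keeps the policy fair: a job trades at most a few scheduling opportunities for locality, never an unbounded wait.
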
4. Current and Future Research

The purpose of this research is to study how to improve the time needed to explore and analyze data sets through an improvement over the schedulers implemented in Hadoop, targeting machine learning and data mining jobs. New scheduling algorithms will also be implemented to make a better assignment of jobs to the cluster, obtaining advantages in terms of resource usage and execution times with respect to schedulers such as FIFO. Machine learning and data mining algorithms have a distinctive characteristic: they involve executing the same algorithm many times over different parts of an input dataset.

CPU load, network usage and input/output will be analyzed by executing jobs generated by machine learning and data mining algorithms on Hadoop with large, publicly available datasets. In a first stage, we intend to derive models or profiles of the resource usage of such jobs, considering a space of variability in the data sets and using two Clusters with different hardware characteristics.

The methodology will be similar to that of [29,30,31]: once the characteristics of the jobs have been outlined, a simulator developed by the working group will be adapted in order to analyze scheduling alternatives that improve execution times and resource utilization. Similarly to [32], where the scheduler uses estimated information on resource availability to achieve better performance in the execution of tasks on desktop Grids, in this plan estimates about the tasks will be used, starting from the premise that the algorithms have well-known behavior and temporal/spatial complexity.

The starting point for obtaining job data, and then outlining job characteristics, will be existing implementations of machine learning and data mining algorithms. A good alternative for this is Mahout (http://mahout.apache.org), which offers Hadoop implementations of Naive Bayes classification, k-means clustering, collaborative filtering, and recommendation algorithms based on ALS (Alternating Least Squares), among others. Executions of these algorithms using data sets such as those available in the UC Irvine Machine Learning Repository (http://archive.ics.uci.edu/ml/index.php) will provide data on resource usage and performance. From these data, jobs will be grouped according to their demands, resulting performance, input data characteristics, etc.

In a first stage, schedulers optimized for the profiled jobs will be simulated. Once results are obtained in this stage, the improvements will be verified through executions on the Clusters available at the UNSJ in San Juan and at the ISISTAN in Tandil. Simulations and tests will be carried out iteratively until satisfactory solutions are achieved.
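
A natural starting point for the profiling stage outlined above is the counter set that Hadoop already records for every job. The sketch below assumes a handle to a completed org.apache.hadoop.mapreduce.Job; the CPU-per-MiB ratio at the end is our own illustrative metric for separating CPU-bound from I/O-bound jobs, not a Hadoop API.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.FileSystemCounter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class JobProfiler {

  /** Extracts a coarse resource profile from a finished job. */
  public static void printProfile(Job job) throws Exception {
    Counters counters = job.getCounters();

    long cpuMillis = counters
        .findCounter(TaskCounter.CPU_MILLISECONDS).getValue();
    long hdfsBytesRead = counters
        .findCounter("HDFS", FileSystemCounter.BYTES_READ).getValue();
    long hdfsBytesWritten = counters
        .findCounter("HDFS", FileSystemCounter.BYTES_WRITTEN).getValue();

    // Illustrative metric (our own, not Hadoop's): CPU milliseconds per
    // MiB of HDFS traffic. High values suggest a CPU-bound job; low
    // values suggest an I/O-bound one.
    double mib = (hdfsBytesRead + hdfsBytesWritten) / (1024.0 * 1024.0);
    double cpuPerMib = mib > 0 ? cpuMillis / mib : Double.POSITIVE_INFINITY;

    System.out.printf("cpu=%d ms, read=%d B, written=%d B, cpu/MiB=%.1f%n",
        cpuMillis, hdfsBytesRead, hdfsBytesWritten, cpuPerMib);
  }
}
```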

VI Jornadas de Cloud Computing & Big Data (JCC&BD 2018)the scheduler uses estimated information onavailability of resources to achieve betterperformance in the execution of tasks in desktopGrids, in this plan estimates will be used on the tasks,starting from the base that the algorithms have wellknown behavior and temporal/spatial complexity.The starting point to obtain job data and thenoutline its characteristics will be existingimplementations of machine learning and data miningalgorithms. A good alternative for this is Mahout2 thatoffers Hadoop implementations of Naive Bayesclassification, k-means clustering, collaborativefiltering and recommendation algorithms based onALS (Alternating Least Squares), among others.Executions of these algorithms using data sets such asthose available in the UC Irvine Machine LearningRepository3 will provide data on the use of resourcesand performance. From these data, jobs will begrouped according to their demands, resultingperformance, input data characteristics, etc.In a first stage, optimized schedulers for theprofiled jobs will be simulated. Once obtained resultsin this stage, the improvements in executions in theCluster available in the UNSJ of San Juan and in theISISTAN of Tandil will be verified. Simulations /tests will be carried out again until satisfactorysolutions are achieved.5.analysis. Engineering Applications of ArtificialIntelligence, 51: 24-36, 2016.[5] A. Corbellini, C. Mateos, D. Godoy, A. Zunino,and S. Schiaffino. An architecture and platformfor developing distributed recommendationalgorithms on large-scale social networks.Journal of Information Science, 41(5):686-704,2015.[6] Cristian Mateos, Alejandro Zunino, and MarceloCampo. Jgrim: An approach for easygridification of applications. Future GenerationComputer Systems, 24(2):99 - 118, 2008a. ence/article/pii/S0167739X07000854.[7] Pieter Hijma, Rob V. van Nieuwpoort, CerielJ.H. Jacobs, and Henri E. Bal. Generatingsynchronization statements in divide-andconquer programs. Parallel Computing, . Extensions for NextGeneration Parallel Programming ModelsReferences[8] Alejandro Corbellini, Daniela Godoy, CristianMateos, Silvia Schiaffino, and Alejandro Zunino.DPM: A novel distributed large-scale socialgraph processing framework for link predictionalgorithms. Future Generation ComputerSystems, 2017a. En prensa.[1] Paul C. Zikopoulos, Chris Eaton, Drik deRoos,ThomasDeutsch,andGeorgeLapis.Understanding Big Data - Analytics forEnterprise Class Hadoop and Streaming Data.2012.[9] M. Arroqui, J. Rodriguez Alvarez, H. Vazquez,C. Machado, Cristian Mateos, and AlejandroZunino. JASAG: A gridification tool foragriculturalsimulationapplications.Concurrency and Computation: Practice andExperience, 27(17):4716-4740, 2015.[2] Alejandro Corbellini, Cristian Mateos, AlejandroZunino, Daniela Godoy, and Silvia N.Schiaffino.Persisting big-data: The nosqllandscape. Inf. Syst., 63:1-23, 2017b. .1016/j.is.2016.07.009.[10] Rob V. van Nieuwpoort, Jason Maassen, RutgerHofman, Thilo Kielmann, and Henri E. Bal. Ibis:An efficient java-based grid programmingenvironment. In Proceedings of the 2002 JointACMISCOPE Conference on Java Grande, JGI’02, pages 18-27, New York, NY, USA, 13.URLhttp://doi.acm.org/10.1145/583810.583813.[3] Daniel S. Katz and Xiaobo Zhou. Leading-edgeresearch in cluster, cloud, and grid computing:Best papers from the ieee/acm {CCGrid} 2015conference. Future Generation ComputerSystems, 72:78 - 80, 2017. 
[9] M. Arroqui, J. Rodriguez Alvarez, H. Vazquez, C. Machado, Cristian Mateos, and Alejandro Zunino. JASAG: A gridification tool for agricultural simulation applications. Concurrency and Computation: Practice and Experience, 27(17):4716-4740, 2015.

[10] Rob V. van Nieuwpoort, Jason Maassen, Rutger Hofman, Thilo Kielmann, and Henri E. Bal. Ibis: An efficient Java-based grid programming environment. In Proceedings of the 2002 Joint ACM-ISCOPE Conference on Java Grande (JGI '02), pages 18-27, New York, NY, USA. doi: 10.1145/583810.583813.

[11] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, and Nathan Weizenbaum. FlumeJava: Easy, efficient data-parallel pipelines. SIGPLAN Notices, 45(6):363-375, June 2010. doi: 10.1145/1809028.1806638.

[12] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03), pages 29-43, New York, NY, USA, 2003. doi: 10.1145/945445.945450.

[13] Ivanilton Polato, Reginaldo Ré, Alfredo Goldman, and Fabio Kon. A comprehensive view of Hadoop research: A systematic literature review. Journal of Network and Computer Applications, 46:1-25, 2014.

[14] D. Yoo and K. M. Sim. A comparative review of job scheduling for MapReduce. In 2011 IEEE International Conference on Cloud Computing and Intelligence Systems, pages 353-358, September 2011. doi: 10.1109/CCIS.2011.6045089.

[15] J.V. Gautam, H.B. Prajapati, V.K. Dabhi, and S. Chaudhary. A survey on job scheduling algorithms in big data processing. In IEEE International Conference on Electrical, Computer and Communication Technologies (ICECCT '15), Coimbatore, pages 1-11, 2015.

[16] M. Zaharia, A. Konwinski, A.D. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI 2008), San Diego, USA, December 2008.

[17] C. He, Y. Lu, and D. Swanson. Matchmaking: A new MapReduce scheduling technique. In 2011 IEEE Third International Conference on Cloud Computing Technology and Science, 2011. doi: 10.1109/CloudCom.2011.16.

[18] Y. Tao, Q. Zhang, L. Shi, and P. Chen. Job scheduling optimization for multi-user MapReduce clusters. In 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming, pages 213-217, December 2011. doi: 10.1109/PAAP.2011.33.

[19] P. Nguyen, T. Simon, M. Halem, D. Chapman, and Q. Le. A hybrid scheduling algorithm for data intensive workloads in a MapReduce environment. In 2012 IEEE Fifth International Conference on Utility and Cloud Computing, pages 161-167, November 2012. doi: 10.1109/UCC.2012.32.

[20] C. Sreedhar, N. Kasiviswanath, and P. Chenna Reddy. Performance enhancement of Hadoop for Big Data using multilevel queue migration (MQM) technique. Pages 331-342, 2018. doi: 10.1007/978-981-10-8237-5_32.

[21] Jyoti V. Gautam, Harshadkumar B. Prajapati, Vipul K. Dabhi, and Sanjay Chaudhary. A survey on job scheduling algorithms in big data processing. doi: 10.1109/ICECCT.2015.7226035.

[22] Jian Tan, Xiaoqiao Meng, and Li Zhang. Delay tails in MapReduce scheduling. In Proceedings of the 12th ACM SIGMETRICS/PERFORMANCE Joint International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS '12), pages 5-16, New York, NY, USA, 2012. doi: 10.1145/2254756.2254761.

[23] Q. Chen, D. Zhang, M. Guo, Q. Deng, and S. Guo. SAMR: A self-adaptive MapReduce scheduling algorithm in heterogeneous environment. In 2010 10th IEEE International Conference on Computer and Information Technology, pages 2736-2743, June 2010. doi: 10.1109/CIT.2010.458.

[24] K. Kc and K. Anyanwu. Scheduling Hadoop jobs to meet deadlines. In 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pages 388 ff., 2010.

[25] Aysan Rasooli and Douglas G. Down. An adaptive scheduling algorithm for dynamic heterogeneous Hadoop systems. In Proceedings of the 2011 Conference of the Center for Advanced Studies on Collaborative Research (CASCON '11), pages 30-44, Riverton, NJ, USA, 2011.

[26] K. R. Krish, A. Anwar, and A. R. Butt. [phi]Sched: A heterogeneity-aware Hadoop workflow scheduler. In 2014 IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems, 2014.

[27] Hsin-Han You, Chun-Chung Yang, and Jiun-Long Huang. A load-aware scheduler for MapReduce framework in heterogeneous cloud environments. In Proceedings of the 2011 ACM Symposium on Applied Computing (SAC '11), pages 127-132, New York, NY, USA, 2011.

[28] M. Zaharia, D. Borthakur, J. Sen Sarma, K. Elmeleegy, S. Shenker, and I. Stoica. Delay scheduling. In Proceedings of the 5th European Conference on Computer Systems (EuroSys '10), page 265, New York, NY, USA, 2010. doi: 10.1145/1755913.1755940.

[29] Tomasz Nykiel, Michalis Potamias, Chaitanya Mishra, George Kollios, and Nick Koudas. MRShare: Sharing across multiple queries in MapReduce. Proceedings of the VLDB Endowment, 3(1-2):494-505, September 2010. doi: 10.14778/1920841.1920906.

[30] Juan Manuel Rodriguez, Cristian Mateos, and Alejandro Zunino. Energy-efficient job stealing for CPU-intensive processing in mobile devices. Computing, 96(2):87-117, February 2014. ISSN 1436-5057. doi: 10.1007/s00607-012-0245-5.

[31] Matías Hirsch, Juan Manuel Rodriguez, Cristian Mateos, and Alejandro Zunino. A two-phase energy-aware scheduling approach for CPU-intensive jobs in mobile grids. Journal of Grid Computing, 15(1):55-80, 2017. doi: 10.1007/s10723-016-9387-6.

[32] Sergio Ariel Salinas, Carlos García Garino, and Alejandro Zunino. PFS: A productivity forecasting system for desktop computers to improve grid applications performance in enterprise desktop grid. Computing and Informatics. http://www.cai.sk/ojs/index.php/cai/article/view/878.
