Enhance Hadoop Performance Using BigData With MapReduce Technology


Journal of Information and Computational Science, ISSN: 1548-7741, Volume 10, Issue 3, 2020.

Ms. M. Florence Dayana, M.C.A., M.Phil., (Ph.D.), Head, Department of Computer Applications, Bon Secours College for Women, Vilar Bypass, Thanjavur-613006, Tamilnadu, India.
N. Kamalasunthari, M.Sc. Computer Science, Department of Computer Science, Bon Secours College for Women, Vilar Bypass, Thanjavur-613006, Tamilnadu, India.
T. Prithiviya, M.Sc. Computer Science, Department of Computer Science, Bon Secours College for Women, Vilar Bypass, Thanjavur-613006, Tamilnadu, India.

Abstract— Cloud Computing leverages the Hadoop framework to process BigData in parallel. Hadoop has certain limitations that could be exploited to execute jobs more efficiently. These limitations are mostly due to data locality in the cluster, job and task scheduling, and resource allocation in Hadoop. Efficient resource allocation remains a challenge in Cloud Computing MapReduce platforms. We propose Enhanced Hadoop, an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. Enhanced Hadoop provides a better solution for "text data", such as finding a DNA sequence and the motif of a DNA sequence. Enhanced Hadoop also provides an efficient data mining approach for Cloud Computing environments. The architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding control features to the NameNode, Enhanced Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data without sending the job to the whole cluster. Compared with native Hadoop, Enhanced Hadoop reduces CPU time, the number of read operations, and other Hadoop factors.

Index Terms— BigData, Cloud Computing, Enhanced Hadoop, Hadoop, Hadoop Performance, MapReduce, Text Data.

I. INTRODUCTION

Parallel processing in Cloud Computing has emerged as an interdisciplinary research area due to the heterogeneous nature and large size of data. Translating sequential data into meaningful information requires substantial computational power and efficient algorithms to identify the degree of similarity among multiple sequences [1]. Sequential pattern mining and data analysis applications, such as DNA sequence alignment and motif finding, usually require large and complex amounts of data processing and computational capability [2]. Efficient targeting and scheduling of computational resources is required to solve such complex problems [3]. Although some of these data sets are readable by humans, they can be too complex to understand and process using traditional processing techniques [3, 4]. The availability of open source and commercial Cloud Computing parallel processing platforms has opened new avenues to explore structured, semi-structured, or unstructured data [5].
Before we go any further, it is necessary to define certain terms related to BigData and Hadoop. There are different ways of defining and comparing BigData with traditional data, such as data size, content, collection, and processing. BigData has been defined as large data sets that cannot be processed using traditional processing techniques, such as Relational Database Management Systems, in a tolerable processing time [6]. BigData is either a relational database (structured), such as stock market data, or a non-relational database (semi-structured or unstructured), such as social media data or DNA data sets [7]. The 4 V's of BigData are: 1) Volume, which means the data size; some companies' data stores are on the order of zettabytes. 2) Velocity, which means the speed at which the data is generated. 3) Variety, which means the data forms that different applications deal with, such as sequence data, numeric data, or binary data. 4) Veracity, which means the uncertainty of the status of the data, or how clear the data is to these applications [8]. Different challenges in BigData have been discussed in previous research [9]; they are described as technical challenges, such as the physical storage that stores the BigData and reduces redundancy. There are also many challenges in the process of extracting information, cleaning data, data integration, data aggregation, and data representation.

Since BigData has these issues, it needs an environment or framework to work through these challenges. Hadoop, which works with BigData sets, is the framework that most organizations use to process BigData and overcome these data challenges.

Fig. 1. Overall MapReduce word count job.

The native Hadoop compiler processes a MapReduce job by dividing the job into multiple tasks and then distributing these tasks to multiple nodes in the cluster. Studying Hadoop performance, the authors of [18] discussed a Hadoop MapReduce model that estimates the cost of a MapReduce job from a set of parameters that a job needs in order to be executed efficiently. These parameters are:
- Hadoop Parameters: a set of predefined configuration parameters stored in the Hadoop setting files.
- Profile Statistics: a set of user-defined properties of the input data and of functions such as Map, Reduce, or Combine.
- Profile Cost Factors: the I/O, CPU, and network cost parameters of job execution.
We will focus on the third category of parameters, the Profile Cost Factors. In this section we explain the job execution cost in detail, and in particular the relationship between the number of blocks and the cost of reading the data from HDFS.

NumberOfBlocks = DataSize / BlockSize    (1)

Where DataSize is the size of the input data that we want to upload to HDFS, and BlockSize is the predefined size of a data block (64 MB by default). A compression ratio is applied to each block to make it smaller before it is stored in HDFS; we do not discuss the compression ratio here because it is not one of our concerns and it has been discussed clearly in [18]. A MapReduce job reads data from HDFS, where the cost of reading a single data block from HDFS is HdfsReadCost. The cost of reading the whole data set from HDFS is IOCostRead, calculated as:

IOCostRead = NumberOfBlocks × HdfsReadCost    (2)

The cost of writing a single data block to HDFS is HdfsWriteCost. The cost of writing any data, such as MapReduce job results or raw data, is IOCostWrite, calculated as:

IOCostWrite = NumberOfBlocks × HdfsWriteCost    (3)

From the above equations we can clearly see that the total costs of reading from and writing to HDFS depend on the number of blocks, which reflects the data size. So, by reducing the data size, we reduce the costs of these processes, which leads to improved Hadoop performance. The same holds for every Hadoop process whose cost is related to the number of blocks. For example, the CPU cost of reading is CPUCostRead, calculated as:

CPUCostRead = NumberOfBlocks × InUncompCPUCost + InputMapPairs × MapCPUCost    (4)

Where InUncompCPUCost relates to the compression ratio of the blocks, InputMapPairs is the number of input pairs for the mapping process, and MapCPUCost is the cost of mapping one pair. Readers can find more details about the Hadoop performance analysis model in [18], which was published by Duke University and is considered the most common reference for the Hadoop performance model.
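To make the cost model concrete, the following sketch evaluates equations (1)-(4) for an assumed workload. All numeric values (data size, block size, and the per-block and per-pair cost factors) are illustrative assumptions, not measurements from this paper.

```java
// Worked example of the Hadoop cost model in equations (1)-(4).
// Every input value below is an assumption chosen for illustration only.
public class HadoopCostModelExample {
    public static void main(String[] args) {
        double dataSizeMB      = 6976;      // assumed input size in MB (~109 blocks of 64 MB)
        double blockSizeMB     = 64;        // default HDFS block size mentioned in the paper
        double hdfsReadCost    = 1.0;       // assumed cost of reading one block from HDFS
        double hdfsWriteCost   = 1.2;       // assumed cost of writing one block to HDFS
        double inUncompCPUCost = 0.5;       // assumed per-block factor tied to the compression ratio
        double inputMapPairs   = 1_000_000; // assumed number of input key-value pairs
        double mapCPUCost      = 0.000_01;  // assumed CPU cost of mapping one pair

        double numberOfBlocks = Math.ceil(dataSizeMB / blockSizeMB);   // equation (1)
        double ioCostRead     = numberOfBlocks * hdfsReadCost;         // equation (2)
        double ioCostWrite    = numberOfBlocks * hdfsWriteCost;        // equation (3)
        double cpuCostRead    = numberOfBlocks * inUncompCPUCost
                              + inputMapPairs * mapCPUCost;            // equation (4)

        System.out.printf("NumberOfBlocks = %.0f%n", numberOfBlocks);
        System.out.printf("IOCostRead     = %.2f%n", ioCostRead);
        System.out.printf("IOCostWrite    = %.2f%n", ioCostWrite);
        System.out.printf("CPUCostRead    = %.2f%n", cpuCostRead);
    }
}
```

Because every term scales with NumberOfBlocks, any mechanism that lets a related job read fewer blocks lowers all of these costs, which is the effect the proposed architecture targets.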
II. SYSTEM ANALYSIS

A. Existing System Summary

In the existing Hadoop MapReduce architecture, multiple jobs over the same data set work completely independently of each other. We also noticed that searching any text-format data for the same sequence of characters requires the same amount of time each time we execute the same job. Likewise, searching for a supersequence of a sequence that has already been searched requires the same amount of time.

Disadvantages:
- Every job deals with the same data every time it gets processed.
- The same job is executed more than one time.
- It reads all the data every time.

III. RELATED WORKS

Hadoop is considered a new technology that provides processing services for BigData problems in Cloud Computing, so research in this field is considered a hot topic. Many studies have discussed and developed different ways to improve Hadoop MapReduce performance from different considerations or aspects. Many studies have discussed optimizing Hadoop and MapReduce jobs, such as job scheduling and execution time, to improve Hadoop performance, whereas many other studies have discussed data locality in Cloud Computing.

One of the important features of Hadoop is the process of job scheduling [30] [31] and job execution time. Different studies have provided improvements and have come up with positive results based on their assumptions [32] [33]. Others focus on the time of the initialization and termination phases of MapReduce jobs [34].

System memory has many issues that could be addressed to improve system performance. In Hadoop, Apache implements a centralized memory approach to control caching and resources [35]. Apache Hadoop supports centralized data caching; however, some studies utilize a distributed caching approach to improve Hadoop performance [36] [37]. There are different approaches that discuss the memory issue. ShmStreaming [38] introduces a shared-memory streaming schema that provides a lockless FIFO queue connecting Hadoop and external programs.

In current Hadoop, the input data is located on different nodes in the cluster. Since there is a default replication factor for the data, which is 3, Hadoop distributes the replicated data onto different nodes in different network racks. This strategy helps for various reasons, one of which is fault tolerance, giving more reliability and scalability. However, the default data placement strategy causes some poor performance in terms of mapping and reducing tasks. Different studies have proposed solutions to improve Hadoop performance by developing data locality improvements [12] [39]. Others focus on the type of data to improve Hadoop performance [16] [40]. In addition, a few studies discuss other issues regarding the improvement of Hadoop performance [41-45].

A. Proposed System Summary

We propose Enhanced Hadoop, which is an enhanced Hadoop architecture that reduces the computation cost associated with BigData analysis. The proposed architecture also addresses the issue of resource allocation in native Hadoop. Enhanced Hadoop provides a better solution for text data, such as finding a DNA sequence and the motif of a DNA sequence. Enhanced Hadoop provides an efficient data mining approach for Cloud Computing environments. The architecture leverages the NameNode's ability to assign jobs to the TaskTrackers (DataNodes) within the cluster. By adding control features to the NameNode, Enhanced Hadoop can intelligently direct and assign tasks to the DataNodes that contain the required data without sending the job to the whole cluster.

Fig. 2. Hadoop MapReduce architecture.

Advantages:
- Reduces the computation cost.
- Reduces the number of read operations.
- Reduces CPU time.

IV. SYSTEM IMPLEMENTATION

A. Common Job Blocks Table

The Enhanced Hadoop MapReduce workflow is the same as the original Hadoop in terms of hardware, network, and nodes; however, the software level has been enhanced. We added features to the NameNode that allow it to save specific data in a lookup table named the Common Job Blocks Table (CJBT). The CJBT stores information about jobs and the blocks associated with specific data and features. This enables related jobs to get their results from specific blocks without checking the entire cluster. Each time a sequence is aligned using dynamic programming and conventional alignment algorithms, a common feature, which is a sequence or subsequence, is identified and updated in the CJBT. Common features in the CJBT can be compared and updated each time clients submit a new job. Consequently, the size of this table should be controlled and limited to a specific size to keep the architecture reliable and efficient. A typical CJBT entry consists of three main components:
- Common Job Name
- Common Feature
- Block Name
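The paper lists the three components of a CJBT entry but does not give a concrete schema; the sketch below is one hypothetical way to represent an entry, with illustrative class and field names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of one Common Job Blocks Table (CJBT) entry.
// It records a shared job name, the common feature (e.g., a DNA subsequence),
// and the names of the HDFS blocks known to contain that feature.
public class CJBTEntry {
    private final String commonJobName;   // shared name used by related jobs
    private final String commonFeature;   // shared data, e.g., a sequence or subsequence
    private final List<String> blockNames = new ArrayList<>(); // blocks holding the feature

    public CJBTEntry(String commonJobName, String commonFeature) {
        this.commonJobName = commonJobName;
        this.commonFeature = commonFeature;
    }

    // Called when a job discovers the feature in a block, so that later
    // related jobs can be limited to these blocks only.
    public void addBlock(String blockName) {
        if (!blockNames.contains(blockName)) {
            blockNames.add(blockName);
        }
    }

    public String getCommonJobName() { return commonJobName; }
    public String getCommonFeature() { return commonFeature; }
    public List<String> getBlockNames() { return blockNames; }
}
```

Limiting the table to a bounded number of such entries, as the paper suggests, keeps the NameNode-side lookup cheap.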
B. MapReduce

MapReduce is a promising parallel and scalable programming model for data-intensive applications and scientific analysis. A MapReduce program expresses a large distributed computation as a sequence of parallel operations on data sets of key/value pairs. A MapReduce computation has two phases, namely the Map and Reduce phases. The Map phase splits the input data into a large number of fragments, which are evenly distributed to Map tasks across the nodes of a cluster to process. Each Map task takes in a key-value pair and then generates a set of intermediate key-value pairs.

C. Common Job Name

The Common Job Name represents a shared name of a job that each MapReduce client must use when submitting a new job in order to get the benefit of the proposed architecture. We define a library, which contains a list of pre-coded jobs, that is made available to the user through a Job API. The Job APIs provide a brief job description and access to job data. The user selects a job name from the list of jobs already identified for a shared MapReduce job. This feature helps the NameNode to identify and match a job to the DataNode(s) containing the block(s) recorded in the CJBT.
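As an illustration of the Map phase and of submitting a job under a shared Common Job Name, the following sketch uses the standard Hadoop MapReduce API. It is not the paper's implementation: the mapper simply counts occurrences of sequence2 from the results section, and the job name string stands in for an entry from the hypothetical pre-coded job library.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class SequenceSearchJob {

    // Map phase: each call receives one line of the input split as a
    // (byte offset, line) pair and emits an intermediate (sequence, 1) pair
    // for every occurrence of the target subsequence. Occurrences that span
    // line boundaries are ignored in this simplified sketch.
    public static class SequenceMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final String TARGET = "AAGACGGTGGTAAGG"; // sequence2 from the paper
        private final Text outKey = new Text(TARGET);
        private final IntWritable one = new IntWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            int idx = line.indexOf(TARGET);
            while (idx >= 0) {
                context.write(outKey, one);
                idx = line.indexOf(TARGET, idx + 1);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Hypothetical shared Common Job Name taken from the pre-coded job library,
        // so the NameNode could match this job against CJBT entries.
        Job job = Job.getInstance(conf, "DNA-Sequence-Search");
        job.setJarByClass(SequenceSearchJob.class);
        job.setMapperClass(SequenceMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In the proposed architecture the only intended difference is on the NameNode side: a job submitted under a known Common Job Name would be matched against the CJBT and limited to the recorded blocks instead of being sent to the whole cluster.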

D. Common Feature

Common Features are defined as the data shared between jobs. Hadoop supports caching, which enables output to be written to the CJBT during the Reduce step. We use Common Features to identify the DataNodes, or the blocks, with shared data entries. The JobTracker directs any new job that shares a common feature to the block names recorded in the CJBT. Suppose J1 and J2 are sequence-search jobs, and J1 uses MapReduce to find its sequence in a DataNode or a block. If J2 contains the common feature of J1, it is logical to map the task to, and allocate, the same data resources as J1.

If the common feature exists in all source files, then Enhanced Hadoop will not improve the performance, as the job reads all files that contain the common feature. From TABLE I, sequence1 is located in all chromosomes, which means it is located in all data blocks. So, Enhanced Hadoop will read the whole data set again if the common feature is sequence1; in this case there is no benefit to having Enhanced Hadoop. However, all other sequences show better performance when we use them as common features in Enhanced Hadoop rather than native Hadoop, since they are not present in all data files. The above example gives us an indication of positive results from the implementation in terms of the number of blocks that are read from HDFS. Figure 4 shows one of the results, which is the number of read operations in native Hadoop compared with Enhanced Hadoop. The number of read operations is one component of Hadoop MapReduce; it is the number of times that MapReduce reads blocks from HDFS. Based on the data size we can determine the number of blocks that should be read by the MapReduce job. As we mentioned before, by reducing the number of read operations we can improve the performance.

Up to this point, there are indications that we obtained positive results compared with the native Hadoop MapReduce environment. By implementing the proposed solution, we have less data to be read by the related jobs. Reducing the number of reads has a direct effect on the performance of Hadoop [29]. As expected, we also noticed that the performance of Enhanced Hadoop MapReduce depends upon the length of the common features and the likelihood of finding the common features in the source files and DataNodes.
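The paper does not show how the JobTracker consults the CJBT, so the following is only a hypothetical sketch of that lookup, reusing the illustrative CJBTEntry class from above: a new job whose feature contains an already-recorded common feature (the J1/J2 case) is limited to the recorded blocks, and an empty result means the job falls back to the whole cluster as in native Hadoop.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical NameNode/JobTracker-side helper: given the common feature of a
// newly submitted job, return the blocks already known to contain that feature
// so the job can be directed to those DataNodes only.
public class CJBTLookup {
    private final List<CJBTEntry> table = new ArrayList<>();

    public void record(CJBTEntry entry) {
        table.add(entry);
    }

    // Returns the recorded block names if the new feature contains a known
    // common feature (e.g., J2 searching a supersequence of J1's sequence).
    // An empty list means no match, so the job is sent to the whole cluster.
    public List<String> blocksFor(String newFeature) {
        for (CJBTEntry entry : table) {
            if (newFeature.contains(entry.getCommonFeature())) {
                return entry.getBlockNames();
            }
        }
        return new ArrayList<>();
    }
}
```

When the feature is present in every block, as with sequence1 in TABLE I, the returned list covers the whole data set and no read operations are saved, which matches the behaviour reported in the results below.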

Fig. 3. Hadoop MapReduce workflow flowchart.

V. RESULTS AND DISCUSSION

Fig. 4. Number of read operations in native Hadoop and Enhanced Hadoop for the same jobs.

Figure 4 shows the improvement in Hadoop performance obtained by reducing the number of read operations from HDFS. In native Hadoop, the number of read operations remains the same in every job because it reads all data files again during each job, while in Enhanced Hadoop the number of read operations differs based on how frequently the sequence occurs in the DNA. When we implemented native Hadoop, the number of read operations was 109. By using Enhanced Hadoop, the number of read operations was reduced to 15, which increases the efficiency by 86.2%. On the other hand, since sequence1 exists in every chromosome, the number of read operations in Enhanced Hadoop remains the same as in native Hadoop, 109. One additional point that we should mention is the length of the sequence: finding short sequences takes less time than finding longer ones. However, the chance of having a common feature that is very long is minute, as we explained in TABLE II. Another Hadoop MapReduce component is the CPU processing time.

Figure 5 shows the processing time for each feature in the DNA data files used for the sequence-finding jobs in both native Hadoop and Enhanced Hadoop. We can see a large difference: the CPU processing time for Enhanced Hadoop is less than for native Hadoop, since Enhanced Hadoop does not read all data blocks from HDFS. For example, the CPU processing time in native Hadoop to process the search job for sequence2 is 397 seconds, whereas it is 50 seconds in Enhanced Hadoop. Figure 5 shows that Enhanced Hadoop reduces the CPU processing time by 87.4% compared with native Hadoop. However, for sequence1 the CPU processing time in native Hadoop is less than in Enhanced Hadoop: since sequence1 exists in all chromosomes, Enhanced Hadoop reduces the efficiency by 3.9%. This reflects an overhead time in Enhanced Hadoop, which is the process of looking for related jobs in the lookup table (CJBT). Although this might happen, it rarely occurs, based on our study shown in TABLE II. This overhead exists in all jobs because it is the processing time of checking the lookup table; however, it costs a very small amount of time compared with the benefit that can be gained by using Enhanced Hadoop.

Fig. 5. CPU processing time in native Hadoop and Enhanced Hadoop for the same jobs.

There are different factors in native Hadoop that we can study and then compare with Enhanced Hadoop. Figure 8 shows the processing results when finding the job sequence for sequence2, which is (AAGACGGTGGTAAGG), in the DNA data blocks. We can say that all operations or factors related to the output of MapReduce remain the same in both native Hadoop and Enhanced Hadoop, because our improvement reduces the input to MapReduce, not its output. So, the number of write operations is the same in both native Hadoop and Enhanced Hadoop, which is 1, since the result is the same and its size is very small. Finding the location of the data blocks with the common features can result in latency during the reading process; however, the benefits of the proposed system far outweigh this disadvantage. The advantages of the proposed system go beyond the number of read operations and the performance of the system: the proposed system further reduces data transfer within the network and reduces the cost of executing the MapReduce job, as the number of DataNodes active during a job is reduced.
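The percentages quoted above follow from the reported raw numbers; the short check below recomputes them, assuming the improvement is measured as the relative reduction (native − enhanced) / native.

```java
// Recomputing the reported improvements from the raw numbers in Section V.
public class ImprovementCheck {
    static double reduction(double nativeValue, double enhancedValue) {
        return (nativeValue - enhancedValue) / nativeValue * 100.0;
    }

    public static void main(String[] args) {
        // Read operations: 109 in native Hadoop vs. 15 in Enhanced Hadoop.
        System.out.printf("Read-operation reduction: %.1f%%%n", reduction(109, 15)); // ~86.2%
        // CPU time for sequence2: 397 s in native Hadoop vs. 50 s in Enhanced Hadoop.
        System.out.printf("CPU-time reduction:       %.1f%%%n", reduction(397, 50)); // ~87.4%
    }
}
```

The 3.9% loss reported for sequence1 goes the other way, because the CJBT lookup adds overhead while saving no read operations.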
VI. CONCLUSION

In this work we present the Enhanced Hadoop framework, which allows a NameNode to identify the blocks in the cluster where certain information is stored. We discussed the proposed workflow in Enhanced Hadoop and compared its expected performance to native Hadoop. In Enhanced Hadoop we read less data, so Hadoop factors such as the number of read operations are reduced, because the DataNodes carrying the source data blocks are identified prior to sending a job to the TaskTracker. The maximum number of data blocks that the TaskTracker will assign to the job is equal to the number of blocks that carry the source data related to a specific common job.

References

[1] P. A. Zandbergen, "Accuracy of iPhone locations: A comparison of assisted GPS, WiFi and cellular positioning," Transactions in GIS, vol. 13(s1), pp. 5-25, 2009.
[2] A. Thiagarajan, L. Ravindranath, H. Balakrishnan, S. Madden, and L. Girod, "Accurate, low-energy trajectory mapping for mobile devices," in Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI), 2011.
[3] Y. Tsuda, Q. Kong, and T. Maekawa, "Detecting and correcting WiFi positioning errors," in Proceedings of the 2013 ACM International Joint Conference on Pervasive and Ubiquitous Computing, pp. 777-786, 2013.
[4] M. A. Quddus, W. Y. Ochieng, and R. B. Noland, "Current map matching algorithms for transport applications: State-of-the-art and future research directions," Transportation Research Part C: Emerging Technologies, vol. 15(5), pp. 312-328, 2007.
[5] S. Brakatsoulas, D. Pfoser, R. Salas, and C. Wenk, "On map matching vehicle tracking data," in 31st International Conference on Very Large Data Bases, pp. 853-864, 2005.
[6] M. Rahmani and H. N. Koutsopoulos, "Path inference from sparse floating car data for urban networks," Transportation Research Part C: Emerging Technologies, vol. 30, pp. 41-54, 2013.
[7] T. Miwa, D. Kiuchi, T. Yamamoto, and T. Morikawa, "Development of map matching algorithm for low frequency probe data," Transportation Research Part C: Emerging Technologies, vol. 22, pp. 132-145, 2012.
[8] P. Newson and J. Krumm, "Hidden Markov map matching through noise and sparseness," in Proceedings of the 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 336-343, 2009.
[9] A. Thiagarajan, L. S. Ravindranath, K. LaCurts, S. Toledo, J. Eriksson, S. Madden, and H. Balakrishnan, "VTrack: Accurate, energy-aware traffic delay estimation using mobile phones," in 7th ACM Conference on Embedded Networked Sensor Systems (SenSys), 2009.
[10] C. Y. Goh, J. Dauwels, N. Mitrovic, M. T. Asif, A. Oran, and P. Jaillet, "Online map-matching based on Hidden Markov model for real-time traffic sensing applications," in Proceedings of the 15th IEEE Intelligent Transportation Systems Conference, pp. 776-781, 2012.
[11] G. R. Jagadeesh and T. Srikanthan, "Robust real-time route inference from sparse vehicle position data," in Proceedings of the 17th IEEE Intelligent Transportation Systems Conference, pp. 296-301, 2014.
[12] T. Hunter, P. Abbeel, and A. M. Bayen, "The path inference filter: model-based low-latency map matching of probe vehicle data," IEEE Transactions on Intelligent Transportation Systems, vol. 15(2), pp. 507-529, 2014.
[13] J. Yang and L. Meng, "Feature selection in conditional random fields for map matching of GPS trajectories," in Progress in Location-Based Services 2014, Springer International Publishing, pp. 121-135, 2015.
[14] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IEEE Transactions on Information Theory, vol. 13(2), pp. 260-269, 1967.
[15] Y. Lou, C. Zhang, Y. Zheng, X. Xie, W. Wang, and Y. Huang, "Map-matching for low-sampling-rate GPS trajectories," in 17th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, pp. 352-361, 2009.
[16] L. Trailovic and L. Y. Pao, "Position error modeling using Gaussian mixture distributions with application to comparison of tracking algorithms," in Proceedings of the American Control Conference, vol. 2, pp. 1272-1277, 2003.
[17] J. Krumm, J. Letchner, and E. Horvitz, "Map matching with travel time constraints," in Society of Automotive Engineers World Congress, 2007.
[18] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin, "HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads," Proceedings of the VLDB Endowment, vol. 2, no. 1, pp. 922-933, 2009.
[19] Y. Xu, P. Kostamaa, and L. Gao, "Integrating Hadoop and parallel DBMS," in Proceedings of the 2010 International Conference on Management of Data. ACM, 2010, pp. 969-974.
[20] J. Dittrich, J. Quiané-Ruiz, A. Jindal, Y. Kargin, V. Setty, and J. Schad, "Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing)," Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 515-529, 2010.
[21] D. Logothetis and K. Yocum, "Ad-hoc data processing in the cloud," Proceedings of the VLDB Endowment, vol. 1, no. 2, pp. 1472-1475, 2008.
[22] C. Ji, T. Dong, Y. Li, Y. Shen, K. Li, W. Qiu, W. Qu, and M. Guo, "Inverted grid-based kNN query processing with MapReduce," in ChinaGrid, 2012 Seventh ChinaGrid Annual Conference on. IEEE, 2012, pp. 25-33.
[23] S. Das, Y. Sismanis, K. Beyer, R. Gemulla, P. Haas, and J. McPherson, "Ricardo: integrating R and Hadoop," in Proceedings of the 2010 International Conference on Management of Data. ACM, 2010, pp. 987-998.
[24] A. Stupar, S. Michel, and R. Schenkel, "RankReduce: processing k-nearest neighbor queries on top of MapReduce," in Proceedings of the 8th Workshop on Large-Scale Distributed Systems for Information Retrieval, 2010, pp. 13-18.
[25] R. Ferreira Cordeiro, C. Traina Junior, A. Machado Traina, J. López, U. Kang, and C. Faloutsos, "Clustering very large multi-dimensional datasets with MapReduce," in Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2011, pp. 690-698.
[26] C. Wang, J. Wang, X. Lin, W. Wang, H. Wang, H. Li, W. Tian, J. Xu, and R. Li, "MapDupReducer: detecting near duplicates over massive datasets," in Proceedings of the 2010 International Conference on Management of Data. ACM, 2010, pp. 1119-1122.
[27] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis, "Evaluating MapReduce for multi-core and multiprocessor systems," in High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th International Symposium on. IEEE, 2007, pp. 13-24.
[28] B. He, W. Fang, Q. Luo, N. Govindaraju, and T. Wang, "Mars: a MapReduce framework on graphics processors," in Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2008, pp. 260-269.
[29] H. Yang, A. Dasdan, R. Hsiao, and D. Parker, "Map-reduce-merge: simplified relational data processing on large clusters," in Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data. ACM, 2007, pp. 1029-1040.
[30] D. Jiang, A. Tung, and G. Chen, "Map-Join-Reduce: Toward scalable and efficient data analysis on large clusters," Knowledge and Data Engineering, IEEE Transactions on, vol. 23, no. 9, pp. 1299-1311, 2011.
