Eliminating Homogeneous Cluster Setup for Efficient Parallel Data Processing


International Journal of Computer Applications (0975 – 8887), Volume 64, No. 17, February 2013

Piyush Saxena, M.Tech (CS&E), Amity University, Noida, India
Satyajit Padhy, M.Tech (CS&E), Amity University, Noida, India
Praveen Kumar, Assistant Professor, Amity University, Noida, India

ABSTRACT
This project proposes to eliminate the homogeneous cluster setup in a parallel data processing environment. A homogeneous cluster setup imposes a static mode of processing, which is a serious disadvantage when optimising the response time towards clients. Parallel data processing is performed constantly on today's internet, and it is very important for the server to deliver its services to clients in optimal time. In order to achieve the highest client satisfaction, the server needs to eliminate the homogeneous cluster setup that is usually encountered in parallel data processing. A homogeneous cluster setup is static in nature, and dynamic allocation of resources is not possible in this kind of environment. The project also makes sure that the user gets its entire requirement fulfilled in optimal time. This improves the overall resource utilization and, consequently, reduces the processing cost.

General Terms
MapReduce algorithm, homogeneous cluster setup

Keywords
Data mining, data warehousing, parallel data processing.

1. INTRODUCTION
In today's digital generation, a huge amount of data is processed in parallel on the internet. Providing optimal data processing with a good response time improves the output of parallel data processing. Many users try to access the same data over the web, and it is a challenging task for the server to deliver optimal results. The vast amount of data that has to be dealt with every day has made traditional database solutions prohibitively expensive. Instead, companies have popularized an architectural paradigm based on a large number of commodity servers. Typical problems include processing large documents that are split into several independent subtasks, distributed among the available nodes, and computed in parallel.

Parallel data processing is a key feature in accessing and operating on huge sets of data [2]. Several approaches are available for processing data in parallel, which improve both processing time and response. Today's frameworks suffer from a major disadvantage that can be termed the homogeneous cluster setup. A homogeneous cluster setup is a cluster of nodes consisting of a cluster head node and many sensor nodes connected to it. The cluster head node is responsible for directing tasks to the sensor nodes for execution. In a homogeneous cluster setup, all the sensor nodes have uniform battery energy and all of them terminate at the same instant of time. The main disadvantage of a homogeneous cluster setup is that it is static in nature: once all the sensor nodes are created and have started to execute, no extra sensor nodes can be added to that cluster.

It is quite evident that the static nature of a homogeneous cluster setup is a serious disadvantage in parallel data processing. Once the nodes are created and have started executing, no further nodes can be added dynamically, even when they are required. Dynamic behaviour is essentially absent wherever a homogeneous cluster setup exists. In this proposed project, the job manager acts like a cluster head node and all the task managers act like sensor nodes.
The objective of this project is to enable dynamic allocation of resources, which can be achieved more efficiently if the homogeneous cluster setup is eliminated [4]. To be more precise, when the job manager (the main server) is allocated a job, it divides that job into many sub-jobs and allocates them to the task managers. Once this cluster is set up and parallel data processing begins, there is no way to add more task managers, or to remove task managers that have finished, until all of them have executed. This is an awkward situation: no resource allocation is possible in the middle of data processing, which makes it difficult for the server (the job manager) to deliver complete results to its users. Parallel data processing is more efficient if it can be executed dynamically, and such a dynamic environment improves the response time. If the homogeneous cluster setup is eliminated, any number of sensor nodes can be created at any instant of time.

2. BACKGROUND
An enormous amount of data is processed on today's web, and parallel processing of this huge amount of data poses several challenges. During our research [5] we came across several ambiguities regarding parallel data processing:
1) Delayed response time due to the homogeneous cluster setup.
2) Resources cannot be allocated dynamically once the task managers have been created (static cluster).
3) High traffic when data is exchanged with many peers in action.
4) Slower data access when the required data is unavailable on the server.
While all of these challenges are generic, the most important problem is that processing becomes slower when the data to be accessed by the user is of huge size. Facing these challenges in parallel data processing, a straightforward approach was deployed in our proposed project.
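The dynamic behaviour aimed at above can be made concrete with a short Java sketch. This is only an illustration under assumed names (DynamicCluster, registerTaskManager); it is not the project's actual code. The set of task managers is kept in a thread-safe list, so a new task manager can join at any instant, even while packets are already being dispatched, in contrast to the static homogeneous setup.

import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Minimal sketch of a cluster whose set of task managers is not fixed at start-up.
// The class and method names here are illustrative, not taken from the paper's code.
public class DynamicCluster {

    // Thread-safe list: new task managers can register while packets are being dispatched.
    private final List<String> taskManagers = new CopyOnWriteArrayList<String>();

    // A task manager (identified here simply by "host:port") may join at any instant.
    public void registerTaskManager(String address) {
        taskManagers.add(address);
        System.out.println("Registered " + address + "; cluster size = " + taskManagers.size());
    }

    // Round-robin selection over whatever task managers exist right now.
    public String pick(int packetIndex) {
        return taskManagers.get(packetIndex % taskManagers.size());
    }

    public static void main(String[] args) {
        DynamicCluster cluster = new DynamicCluster();
        cluster.registerTaskManager("localhost:5001");
        cluster.registerTaskManager("localhost:5002");
        // A third worker joins later, which a static homogeneous setup would not permit.
        cluster.registerTaskManager("localhost:5003");
        System.out.println("Packet 7 goes to " + cluster.pick(7));
    }
}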

The first scenario that was applied in our research [1] is the use of the MapReduce algorithm. This is the most effective algorithm that can be used for parallel processing of large data. The MapReduce algorithm imposes a divide-and-conquer structure on the work with the data, which makes it easier for the host server (the job manager) to handle a job efficiently: the job is broken down into many sub-jobs, which are executed individually with the help of the task managers.

The second scenario is about eliminating the homogeneous cluster setup of the network [19]. This allows resources to be allocated dynamically to the host server at any instant of time. In order to eliminate the homogeneous cluster setup, we need to avoid the static nature of the cluster of nodes and allow task managers to be added to the cluster at any instant of time. Transforming the homogeneous cluster into a heterogeneous cluster network avoids the static nature of the cluster and even reduces the overall hardware cost. Dynamic allocation of resources makes it possible to optimize parallel data processing in a new way. This methodology offers a new way of viewing the given challenges and moulding them so that they are handled efficiently.

The experiment is analysed over various time slots, which makes it possible to follow the whole operation between job manager, task managers and user.

3. PROJECT SYSTEM OVERVIEW
In recent years a variety of systems has been proposed to facilitate web warehousing. Although these systems typically share common goals (e.g. to hide issues of parallelism or fault tolerance), they aim at different fields of application. The MapReduce algorithm [7] is designed to run data analysis jobs on a large amount of data, i.e. to improve parallel data processing between a large number of nodes (users) and servers.

The proposed system also demonstrates the discrete allocation of resources that are not available on the host server but are available on remote servers, in parallel with the processing of the present data. The proposed framework provides a platform for the server and the users for efficient and optimized parallel data processing. It allows the job manager to allocate resources [4] at any instant of time, which improves the response time. Hence the problem of a homogeneous cluster network is eliminated, and the approach is more optimized.

Once a user has fit his data into the required map and reduce pattern [13], the execution framework takes care of splitting the job into subtasks and of distributing and executing them. A single MapReduce job always consists of a distinct map and reduce program. The mapping is done by the job manager, which assigns individual subtasks to all of its task managers; every task manager executes its tasks, the results are reduced to one single solution, and that solution is returned to the user. The MapReduce algorithm works as a divide-and-conquer approach and is very efficient in parallel data processing. The proposed system also offers dynamic allocation of resources to any of the task managers during execution. The allocated resources then remain available on the host server and can be used later.
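As an illustration of the divide-and-conquer pattern just described, the following single-process Java sketch splits a job into packets, processes each packet independently (the "map" step a task manager would perform) and merges the partial results into one solution (the "reduce" step). The class and method names are assumptions made for illustration; the real system distributes the packets to task managers over the network.

import java.util.ArrayList;
import java.util.List;

// Simplified single-process illustration of the map/reduce flow described above:
// the job is split into packets, each packet is processed independently ("map"),
// and the partial results are combined into one solution ("reduce").
public class MapReduceSketch {

    // Split the job (here, a string of data) into fixed-size packets.
    static List<String> splitJob(String job, int packetSize) {
        List<String> packets = new ArrayList<String>();
        for (int i = 0; i < job.length(); i += packetSize) {
            packets.add(job.substring(i, Math.min(i + packetSize, job.length())));
        }
        return packets;
    }

    // "Map": the work a task manager would do on one packet (here, a trivial transformation).
    static String mapPacket(String packet) {
        return packet.toUpperCase();
    }

    // "Reduce": merge the partial results back into a single answer for the client.
    static String reduce(List<String> partials) {
        StringBuilder result = new StringBuilder();
        for (String p : partials) {
            result.append(p);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<String> packets = splitJob("parallel data processing example", 8);
        List<String> partials = new ArrayList<String>();
        for (String packet : packets) {
            partials.add(mapPacket(packet)); // each packet could run on a different task manager
        }
        System.out.println(reduce(partials));
    }
}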
Server-client computing, or client-server networking, is a distributed application architecture that partitions tasks or workloads between service providers, called servers, and service requesters, called clients. Often clients [8] and servers communicate over a computer network on separate hardware, but both client and server may reside in the same system. A server machine is a high-performance host that runs one or more server programs which share their resources with clients. A client does not share any of its resources, but requests a server's content or service function. Clients therefore initiate communication sessions with servers, which await incoming requests; the servers take the requests from the clients and try to fulfil them by providing the resources the clients need.

We propose the network-based anonymization and processing (NAP) framework, the first system for K-anonymous query processing in road networks [11]. NAP relies on a global user ordering that satisfies reciprocity and guarantees K-anonymity. Processing is based on an implementation that uses (network-based) search operations as off-the-shelf building blocks; the NAP query evaluation methodology is therefore readily deployable on existing systems and can easily be adapted to different network storage schemes. In this case, the queries are evaluated in a batch. We identify the ordering characteristics that affect subsequent processing, qualitatively compare alternatives, and then propose query evaluation techniques that exploit these characteristics. In addition to user privacy, NAP achieves low computational and communication costs and quick responses overall; it is readily deployable, requiring only basic network operations.

4. FUNCTIONAL REQUIREMENTS
In software engineering, a functional requirement defines a function of a software system or one of its components. A function is described as a set of inputs, the behaviour, and outputs. Functional requirements may be calculations, technical details, data manipulation and processing, and other specific functionality that define what a system is supposed to accomplish.

A job manager is a computer application for controlling, managing and splitting the requests for resources/files from the client. The job manager accepts the job to be executed as a request from the client and splits it into a large number of packets, which are allocated to the task managers for execution in optimal time.

Task managers, generally large in number, are the part of the software that handles the work assigned by the job manager: they execute the packets allocated to them and return the result of this set of packets to the client. Each task manager is responsible for executing the individual packets allocated to it by the job manager.
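The client-server interaction underlying these components can be sketched with plain Java sockets. The sketch below shows only the client side and assumes a server is already listening on localhost port 9000; the host, port and file name are illustrative, not the project's actual protocol.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.Socket;

// Minimal client side of the model described above: the client initiates the session,
// names the file it wants, and reads the server's reply. Host, port and file name are assumed.
public class MiniClient {
    public static void main(String[] args) throws Exception {
        try (Socket socket = new Socket("localhost", 9000);
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true);
             BufferedReader in = new BufferedReader(new InputStreamReader(socket.getInputStream()))) {
            out.println("report.pdf");          // the client requests a file
            System.out.println(in.readLine());  // and waits for the server's response
        }
    }
}

A matching server would simply accept connections on the same port with ServerSocket.accept() and write a reply back to the requesting client.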

5. OBJECTIVE OF THE PROJECT
The objective and purpose of this project is to improve and optimize parallel data processing by eliminating the homogeneous cluster setup [12]. Millions of data items are accessed on the web by users, and it is the utmost responsibility of the server to satisfy the user. It is always easier to be on the other side, but dealing with such a huge amount of data every day makes the situation more complicated. There are only a few loopholes in the current framework, but they are enough to degrade the performance of parallel data processing.

Therefore the main objective and purpose of this project is to optimize parallel data processing [2] by avoiding the homogeneous cluster setup. In order to achieve the highest client satisfaction, the host server needs to be upgraded with the latest technology to fulfil all requirements. Since the homogeneous cluster setup is static in nature, the proposed MapReduce algorithm is used within a generic framework that can be deployed in this scenario. Another important goal of this project is to allocate resources or data dynamically to the host server (job manager) so that every requirement for resources can be fulfilled at any instant of time. The current problem of the formation of a homogeneous cluster setup is eliminated, so that any number of sensor nodes or task managers can be created at any instant of time. The static nature of the existing framework for parallel data processing is removed. This allows a higher and sharper response time and avoids delays in transfer. The figure below shows that the job manager, upon receiving a job, divides it into many packets (files), and the task managers execute those packets at given instants of time. The times are stated in the figure, and the actual response time is highly optimised with the proposed framework.

The job manager should be aligned [17] with all its task managers to achieve maximum optimization. The task managers are mapped by the job manager to many jobs, which they all solve individually; the results are later reduced and returned to the client. This allows the execution load to be shared, and the overall execution of data of huge size becomes feasible in less time. Typically, data of huge size is the toughest challenge to be dealt with on the web in parallel data processing. This project makes sure that the user gets its entire requirement fulfilled in optimal time.

We discussed the pros and cons of MapReduce and classified its improvements. MapReduce is simple [1] but provides good scalability and fault-tolerance for massive data processing. The performance evaluation gives a first impression of how the ability to assign specific tasks of a processing job to specific task managers, as well as the possibility to automatically allocate and de-allocate virtual machines in the course of a job execution, can help to improve the overall resource utilization.

Fig 1: Architecture of proposed system
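The time instances reported in the command-line output (see Fig 2 below) can be reproduced with simple timestamp logging around packet dispatch and receipt. The sketch below is purely illustrative; the sleep call merely stands in for a task manager executing a packet.

// Illustrative timestamp logging around packet dispatch and receipt,
// similar in spirit to the command-line time instances shown in Fig 2.
public class PacketTimer {
    public static void main(String[] args) throws InterruptedException {
        long jobStart = System.currentTimeMillis();
        for (int packet = 1; packet <= 4; packet++) {
            long sent = System.currentTimeMillis();
            System.out.println("Packet " + packet + " sent at +" + (sent - jobStart) + " ms");
            Thread.sleep(50); // stand-in for the task manager executing the packet
            long received = System.currentTimeMillis();
            System.out.println("Packet " + packet + " received at +" + (received - jobStart) + " ms");
        }
        System.out.println("Total response time: " + (System.currentTimeMillis() - jobStart) + " ms");
    }
}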

Fig 2: Command line output showing time instances of packet retrievals

6. MODULES OF THE PROJECTED SYSTEM
Client
This module deals with the client, or customer, whose needs are to be fulfilled. The client always requests [10] the server to execute a particular operation and to send a response back accordingly. Nevertheless, a client is always volatile about its operation. In our proposed project, the client selects the file that it wants to download and then clicks the download button; in this scenario the client always requests the download of a file. After clicking the download button, the client waits for the server to send a response back. The status of the download is shown in the client interface so that the client can check its progress. The client is demonstrated by building a simple interface that consists of simple components. The client interface is event driven, and the Java Swing toolkit is used extensively. The interface consists of a text area which displays all the files that are currently available in the database [6]; the text area is updated dynamically as resources are uploaded. The title bar of the interface is named "Select File", which indicates that a file should be selected for download. There are two Swing buttons, named Download and Cancel. Both buttons are event driven, and a specified event takes place when a button is clicked: the Download button starts the download of a file, and the Cancel button closes the client interface.
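A minimal version of the Swing client described above might look as follows. The component layout and the file names shown in the text area are assumptions, and the actual download logic is omitted.

import java.awt.BorderLayout;
import javax.swing.JButton;
import javax.swing.JFrame;
import javax.swing.JPanel;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;
import javax.swing.SwingUtilities;

// Minimal sketch of the event-driven Swing client: a text area listing the files
// currently available, plus Download and Cancel buttons. File names are placeholders.
public class ClientInterface {
    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Select File");                     // title bar text as described
            JTextArea fileList = new JTextArea("file1.txt\nfile2.txt\n"); // would be filled from the database
            fileList.setEditable(false);

            JButton download = new JButton("Download");
            JButton cancel = new JButton("Cancel");
            download.addActionListener(e -> System.out.println("Download requested")); // real client sends the request here
            cancel.addActionListener(e -> frame.dispose());               // Cancel closes the client interface

            JPanel buttons = new JPanel();
            buttons.add(download);
            buttons.add(cancel);

            frame.add(new JScrollPane(fileList), BorderLayout.CENTER);
            frame.add(buttons, BorderLayout.SOUTH);
            frame.setSize(320, 240);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}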

Fig 3: Client Interface

Server
A server is a computer system that is responsible for servicing the requests made by clients. The server is normally located remotely [3] and is used to service requests from multiple clients. The server is always responsible for maintaining resources and allocating them as required by the clients, and it also manages and controls the data processing between server and client. In today's modern multiprocessor architectures, parallel data processing plays an important role. In order to have efficient parallel data processing in which many clients participate to execute certain tasks, the server needs to execute the data processing quickly and efficiently. In our proposed project the server is the entity that services the requests made by the client: the client requests the download of a file, and the server makes sure that the file is downloaded and opened for the client at the end of the download. The server interface consists of two important components, the job manager and the task manager. The server interface also contains a menu bar with one menu, named File, which contains two menu items named Resource Allocation and Exit. The Resource Allocation option should be selected only if the server needs to allocate resources dynamically at the same time that a file is being downloaded. The server interface also has a drop-down menu whose default value is "parallel", which states that the data distribution type is parallel [9] and that the data processing will be parallel in nature. Upon receiving a request from the client, the server displays certain parameters in its command prompt output: the name of the file being downloaded, the port numbers involved during the download, the total number of packets sent and received, and so on. The server also shows pictorially how resources are allocated dynamically from the job manager to the task managers.

Fig 4: Client Download Interface

Job Manager
The job manager is an essential component of the server; it is the master component of the entire client-server layout. The job manager [2] accepts the request that comes from the client and is responsible for processing it. The job manager follows a divide-and-conquer approach for executing the job: it schedules and controls the execution of the jobs and returns a valid response to the client as per its request.

In the first scenario, upon receiving a request from the client, the job manager divides the job into many sub-jobs, or packets. It distributes all the sub-jobs evenly and randomly and allocates them to the task managers. After the task managers finish executing their individual sub-jobs, they return the resulting data to the job manager. The sub-jobs or packets returned by the task managers are not in sorted order, so the job manager sorts all the packets into the order in which they were allocated initially. After sorting all the packets, the job manager merges all the executed packets into one piece of data so that it can return a single solution to the client. The command prompt output shows the time instances at which the job manager sends a packet to the task managers; it also shows the sorting and merging of the packets that is carried out so that a single solution can be returned to the client.
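The sorting and merging step of the job manager can be sketched as follows. The Packet class, its fields and the sample data are assumptions made for illustration; only the idea of ordering the returned packets by their original index and concatenating them is taken from the description above.

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of the job manager's final step: packets returned by the task managers arrive
// out of order, so they are sorted by their original index and merged into one result.
public class PacketMerger {

    static class Packet {
        final int index;   // position assigned when the job was split
        final String data; // result produced by a task manager
        Packet(int index, String data) { this.index = index; this.data = data; }
    }

    static String sortAndMerge(List<Packet> returned) {
        Collections.sort(returned, new Comparator<Packet>() {
            public int compare(Packet a, Packet b) { return Integer.compare(a.index, b.index); }
        });
        StringBuilder merged = new StringBuilder();
        for (Packet p : returned) {
            merged.append(p.data);
        }
        return merged.toString();
    }

    public static void main(String[] args) {
        List<Packet> returned = new ArrayList<Packet>();
        returned.add(new Packet(2, "processing")); // task managers reply in arbitrary order
        returned.add(new Packet(0, "parallel "));
        returned.add(new Packet(1, "data "));
        System.out.println(sortAndMerge(returned)); // prints "parallel data processing"
    }
}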
During dynamic resource allocation to the task managers, the job manager itself uploads the file to the task managers; it is responsible for uploading files to the task managers whenever resources need to be allocated.

Fig 5: Job Manager

Task Manager
The task manager is an essential part of the server. It is a basic block of execution that helps the job manager execute the sub-tasks and return the results to the job manager. The task manager's responsibility is to execute the individual packets allocated to it and return them to the job manager. When a client requests the download of a file, the task managers are responsible for executing the operation and performing efficient parallel data processing. With the task managers working under the job manager, parallel data processing takes much less time, and dynamic resource allocation is also supported.

The task manager interface is a simple representation of the operation it performs. It shows that whenever a client makes a request to download a file, the server, with the help of the task managers, executes the request and returns an optimal solution to the client. It also represents the distributed data processing, which is parallel in nature, and indicates that file uploading is done through the task manager. After uploading the necessary file to the back-end database, the task manager is ready to return the response to the client.
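A task manager can be sketched as a small worker that listens on its own port, executes each packet it receives, and returns the result to the job manager. The port number, the line-based exchange and the trivial "execution" are all assumptions for illustration, not the project's actual protocol.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Sketch of one task manager: it listens on its own port (each task manager would use a
// different, configurable port), executes the packet it receives, and returns the result
// to the job manager. The "execution" here is a trivial placeholder transformation.
public class TaskManagerWorker {
    public static void main(String[] args) throws Exception {
        int port = args.length > 0 ? Integer.parseInt(args[0]) : 6001; // port chosen per task manager
        try (ServerSocket server = new ServerSocket(port)) {
            System.out.println("Task manager listening on port " + port);
            while (true) {
                try (Socket jobManager = server.accept();
                     BufferedReader in = new BufferedReader(new InputStreamReader(jobManager.getInputStream()));
                     PrintWriter out = new PrintWriter(jobManager.getOutputStream(), true)) {
                    String packet = in.readLine();          // packet allocated by the job manager
                    if (packet != null) {
                        out.println(packet.toUpperCase());  // placeholder for the real packet execution
                    }
                }
            }
        }
    }
}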

In our proposed project we have chosen four task managers, and each task manager is represented by a unique port number. This port number can be initialized by us, and it identifies the participation of each task manager in the parallel data processing. The task managers also play a major part in allocating resources dynamically.

Fig 6: Task Manager

7. CONCLUSION AND FUTURE SCOPE
Today's digital generation works with its key ingredient, data, on a daily basis [1]. Every day there are many occasions for parallel data processing. Search engines such as Google or Yahoo have to process a lot of data simultaneously in order to return responses [3] to their users, and to do so at a fast rate. Reliability and feasibility should not be hampered during this parallel data execution. Presently, the mechanisms used for parallel data execution create a homogeneous cluster setup within the network. In such a setup, when a parallel downloading environment [8] is in progress between the client and the server, and the client at the same time requests the download of a particular file that the server does not currently have in its back-end database, a serious problem arises: the file cannot be uploaded until all the downloads in progress have finished executing.

In order to avoid this kind of scenario and to decrease the delay in the server's response time, we propose a framework for efficient parallel data processing with [10] no homogeneous cluster setup. It makes sure that when a client requests a file that is not present on the server, the server can dynamically allocate that resource or file to the client even while all the parallel downloads are still in progress. This improves reliability and response time, since the client no longer has to wait for its response. This framework defines a new level of parallel data processing that is not encountered in today's world.

The experimental results comparing the response time of the existing framework and the proposed framework are shown in Fig 8. They clearly show that the existing framework takes more time to respond to a client request than the proposed framework. Due to the obvious problem of the homogeneous cluster setup, the existing framework suffers delays because of its static nature, whereas the dynamic nature of the proposed framework improves the response time. The response time of both frameworks is shown as the size of the data increases.

Fig 8: Experimental analysis of existing v/s proposed framework

8. REFERENCES
[1] Kyong-Ha Lee and Yoon-Joon Lee. "Parallel Data Processing with MapReduce: A Survey." Department of Computer Science, KAIST, December 2011.
[2] Sai Wu, Feng Li, Sharad Mehrotra, and Beng Chin Ooi. "Query Optimization for Massively Parallel Data Processing." School of Computing, National University of Singapore, March 2012.
[3] S. Babu. Towards automatic optimization of MapReduce programs. In Proceedings of the 1st ACM Symposium on Cloud Computing, pages 137–142, 2010.
[4] .html
[5] H. chih Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In SIGMOD '07: Proceedings of the 2007 ACM SIGMOD International Conference on Management of Data, pages 1029–1040, New York, NY, USA, 2007. ACM.
[6] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04: Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation, pages 10–10, Berkeley, CA, USA, 2004. USENIX Association.
[7] E. Deelman, G. Singh, M.-H. Su, J. Blythe, Y. Gil, C. Kesselman, G. Mehta, K. Vahi, G. B. Berriman, J. Good, A. Laity, J. C. Jacob, and D. S. Katz. Pegasus: A Framework for Mapping Complex Scientific Workflows onto Distributed Systems. Sci. Program., 13(3):219–237, 2005.
[8] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the Data: Parallel Analysis with Sawzall. Sci. Program., 13(4):277–298, 2005.
[9] B. Li et al. A Platform for Scalable One-Pass Analytics using MapReduce. In Proceedings of the 2011 ACM SIGMOD, 2011.
[10] D. Jiang et al. Map-Join-Reduce: Towards scalable and efficient data analysis on large clusters. IEEE Transactions on Knowledge and Data Engineering, 2010.
[11] Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, David E. Culler, Joseph M. Hellerstein, and David A. Patterson. High-performance sorting on networks of workstations. In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, Tucson, Arizona, May 1997.
[12] William Gropp, Ewing Lusk, and Anthony Skjellum. Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, Cambridge, MA, 1999.
[13] Douglas Thain, Todd Tannenbaum, and Miron Livny. Distributed computing in practice: The Condor experience. Concurrency and Computation: Practice and Experience, 2004.
[14] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wychoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. In VLDB, 2009.
[15] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 1992.
[16] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine GRACE. In VLDB '86: Proceedings of the 12th International Conference on Very Large Data Bases, pages 209–219. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1986.
[17] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Sci. Program., 13(4):277–298, 2005.
[18] M. Ziane, M. Zaït, and P. Borla-Salamet. Parallel query processing with zigzag trees. The VLDB Journal, 2(3):277–302, 1993.
[19] Vivek Mhatre and Catherine Rosenberg. "Homogeneous vs Heterogeneous Clustered Sensor Networks: A Comparative Study." School of Electrical and Computer Eng., Purdue University, West Lafayette, IN 47907-1285.

AUTHOR'S PROFILE
Praveen Kumar: Assistant Professor, Amity University, India. Qualified as M.Sc., M.Tech., Ph.D. (pursuing), with areas of interest in e-governance, Java programming and data mining.
Satyajit Padhy: B.Tech, M.Tech (CSE) pursuing. Has published two research papers, on data processing in data warehousing and on services of cloud computing. Areas of interest: cloud computing, data warehousing and data mining.
Piyush Saxena: B.Tech, M.Tech (CSE) pursuing. Has published two research papers, on data processing in data warehousing and on services of cloud computing. Areas of interest: cloud computing, data warehousing and data mining.
