A Review Paper on Big Data and Hadoop

International Journal of Scientific and Research Publications, Volume 4, Issue 10, October 2014, ISSN 2250-3153

Harshawardhan S. Bhosale (1), Prof. Devendra P. Gadekar (2)
(1) Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, Bhosale.harshawardhan186@gmail.com
(2) Department of Computer Engineering, JSPM's Imperial College of Engineering & Research, Wagholi, Pune, devendraagadekar84@gmail.com

Abstract: The term "Big Data" describes innovative techniques and technologies for capturing, storing, distributing, managing, and analyzing petabyte-scale or larger datasets that arrive with high velocity and varied structure. Big data can be structured, unstructured, or semi-structured, which conventional data management methods are unable to handle. Data is generated from many different sources and can arrive in the system at various rates. To process these large amounts of data inexpensively and efficiently, parallelism is used. Big Data is data whose scale, diversity, and complexity require new architectures, techniques, algorithms, and analytics to manage it and to extract value and hidden knowledge from it. Hadoop is the core platform for structuring Big Data and solves the problem of making it useful for analytics. Hadoop is an open source software project that enables the distributed processing of large data sets across clusters of commodity servers. It is designed to scale up from a single server to thousands of machines, with a very high degree of fault tolerance.

Keywords: Big Data, Hadoop, MapReduce, HDFS, Hadoop Components

1. Introduction

A. Big Data: Definition

Big data is a term that refers to data sets, or combinations of data sets, whose size (volume), complexity (variability), and rate of growth (velocity) make them difficult to capture, manage, process, or analyze with conventional technologies and tools, such as relational databases and desktop statistics or visualization packages, within the time necessary to make them useful. While the size used to determine whether a particular data set is considered big data is not firmly defined and continues to change over time, most analysts and practitioners currently refer to data sets from 30-50 terabytes (10^12 bytes, or 1,000 gigabytes per terabyte) to multiple petabytes (10^15 bytes, or 1,000 terabytes per petabyte) as big data. Figure 1 gives the layered architecture of a Big Data system; it can be decomposed into three layers, the Infrastructure Layer, the Computing Layer, and the Application Layer, from top to bottom.

Figure 1: Layered Architecture of Big Data System

B. 3 Vs of Big Data

Volume of data: Volume refers to the amount of data. The volume of data stored in enterprise repositories has grown from megabytes and gigabytes to petabytes.

Variety of data: Variety refers to the different types and sources of data. Data variety has exploded from structured and legacy data stored in enterprise repositories to unstructured and semi-structured data such as audio, video, and XML.

Velocity of data: Velocity refers to the speed of data processing. For time-sensitive processes such as catching fraud, big data must be used as it streams into the enterprise in order to maximize its value.

C. Problems with Big Data Processing

i. Heterogeneity and Incompleteness

When humans consume information, a great deal of heterogeneity is comfortably tolerated. In fact, the nuance and richness of natural language can provide valuable depth. However, machine analysis algorithms expect homogeneous data and cannot understand nuance.
In consequence, data must be carefully structured as a first step in (or prior to) data analysis. Computer systems work most efficiently if they can store multiple items that are all identical in size and structure. Efficient representation, access, and analysis of semi-structured data therefore require further work.

ii. Scale

Of course, the first thing anyone thinks of with Big Data is its size. After all, the word "big" is there in the very name. Managing large and rapidly increasing volumes of data has been a challenging issue for many decades. In the past, this challenge was mitigated by processors getting faster, following Moore's law, which provided the resources needed to cope with increasing volumes of data. But there is a fundamental shift underway now: data volume is scaling faster than compute resources, and CPU speeds are static.

iii. Timeliness

The flip side of size is speed. The larger the data set to be processed, the longer it takes to analyze. The design of a system that effectively deals with size is likely also to result in a system that can process a given size of data set faster. However, it is not just this processing speed that is usually meant when one speaks of velocity in the context of Big Data; there is also an acquisition-rate challenge, since data must often be captured and filtered as it streams in.

iv. Privacy

The privacy of data is another huge concern, and one that increases in the context of Big Data. For electronic health records, there are strict laws governing what can and cannot be done. For other data, regulations, particularly in the US, are less forceful. However, there is great public fear regarding the inappropriate use of personal data, particularly through the linking of data from multiple sources. Managing privacy is effectively both a technical and a sociological problem, which must be addressed jointly from both perspectives to realize the promise of big data.

v. Human Collaboration

In spite of the tremendous advances made in computational analysis, there remain many patterns that humans can easily detect but computer algorithms have a hard time finding. Ideally, analytics for Big Data will not be purely computational; rather, it will be designed explicitly to keep a human in the loop. The new sub-field of visual analytics is attempting to do this, at least with respect to the modeling and analysis phases of the pipeline. In today's complex world, it often takes multiple experts from different domains to really understand what is going on. A Big Data analysis system must support input from multiple human experts and shared exploration of results. These experts may be separated in space and time when it is too expensive to assemble an entire team in one room. The data system has to accept this distributed expert input and support their collaboration.

2. Hadoop: Solution for Big Data Processing

Hadoop is a programming framework used to support the processing of large data sets in a distributed computing environment. Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts. The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, HDFS, and a number of related components such as Apache Hive, HBase, and ZooKeeper. HDFS and MapReduce are explained in the following sections.

Figure 2: Hadoop Architecture

A. HDFS Architecture

Hadoop includes a fault-tolerant storage system called the Hadoop Distributed File System, or HDFS. HDFS is able to store huge amounts of information, scale up incrementally, and survive the failure of significant parts of the storage infrastructure without losing data. Hadoop creates clusters of machines and coordinates work among them. Clusters can be built with inexpensive computers. If one fails, Hadoop continues to operate the cluster without losing data or interrupting work, by shifting work to the remaining machines in the cluster. HDFS manages storage on the cluster by breaking incoming files into pieces, called "blocks," and storing each of the blocks redundantly across the pool of servers. In the common case, HDFS stores three complete copies of each file by copying each block to three different servers.

Figure 3: HDFS Architecture
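The block-and-replica behaviour described above is exposed to applications through Hadoop's Java FileSystem API. The sketch below is illustrative only: it assumes a reachable HDFS cluster, the Hadoop client libraries on the classpath, and a hypothetical path /tmp/hdfs-demo.txt. It writes a small file and then reads back the replication factor and block size that HDFS assigned to it.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReplicationDemo {
    public static void main(String[] args) throws Exception {
        // Connect to the cluster described by the core-site.xml / hdfs-site.xml
        // files found on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Write a small file; HDFS splits large files into fixed-size blocks
        // behind the scenes and replicates each block across the cluster.
        Path file = new Path("/tmp/hdfs-demo.txt");   // hypothetical path
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.writeUTF("hello hdfs");
        }

        // Inspect how many copies of each block the cluster keeps
        // (three by default) and how large each block is.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication factor: " + status.getReplication());
        System.out.println("block size: " + status.getBlockSize());
    }
}
```

The default replication factor of three can be tuned per cluster through the dfs.replication property, trading storage overhead against fault tolerance.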
B. MapReduce Architecture

The processing pillar in the Hadoop ecosystem is the MapReduce framework. The framework allows the specification of an operation to be applied to a huge data set, divides the problem and the data, and runs the parts in parallel. From an analyst's point of view, this can occur on multiple dimensions. For example, a very large dataset can be reduced into a smaller subset where analytics can be applied. In a traditional data warehousing scenario, this might entail applying an ETL operation on the data to produce something usable by the analyst. In Hadoop, these kinds of operations are written as MapReduce jobs in Java, and a number of higher-level languages, such as Hive and Pig, make writing these programs easier. The outputs of these jobs can be written back to HDFS or placed in a traditional data warehouse. MapReduce is built around two functions:

map - takes key/value pairs as input and generates an intermediate set of key/value pairs;

reduce - merges all the intermediate values associated with the same intermediate key.

Figure 4: MapReduce Architecture
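To make the two functions concrete, the sketch below is the canonical word-count job written against the org.apache.hadoop.mapreduce API: the mapper emits a (word, 1) pair for every token in its input split, and the reducer sums the counts for each word. Class and path names are illustrative; the two command-line arguments are expected to be HDFS directories.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map: (byte offset, line of text) -> (word, 1) for every word in the line
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce: (word, [1, 1, ...]) -> (word, total count)
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);   // optional local aggregation on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

A job like this is packaged into a jar and submitted with, for example, hadoop jar wordcount.jar WordCount /input /output; the framework handles splitting the input, scheduling map and reduce tasks across the cluster, and re-running tasks on failed nodes.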

3. Literature Review

S. Vikram Phaneendra and E. Madhusudhan Reddy [1] illustrate that in earlier days data was small and easily handled by RDBMS, but that it has recently become difficult to handle huge data through RDBMS tools; such data is referred to as "big data". They state that big data differs from other data in five dimensions: volume, velocity, variety, value, and complexity. They describe the Hadoop architecture, consisting of the name node, data nodes, edge nodes, and HDFS, for handling big data systems. The Hadoop architecture handles large data sets and scalable algorithms for log management; applications of big data can be found in the financial, retail, health-care, mobility, and insurance industries. The authors also focus on the challenges that enterprises face when handling big data, such as data privacy and search analysis [1].

Kiran Kumara Reddi and Dnvsl Indira [2] explain that Big Data is a combination of structured, semi-structured, and unstructured homogeneous and heterogeneous data. The authors suggest using the Nice model to handle the transfer of huge amounts of data over the network. Under this model, transfers are relegated to low-demand periods when there is ample idle bandwidth available; this bandwidth can then be repurposed for big data transmission without impacting other users in the system. The Nice model uses a store-and-forward approach by utilizing staging servers, and it is able to accommodate differences in time zones and variations in bandwidth. They suggest that new algorithms are required to transfer big data and to address issues such as security, compression, and routing [2].

Jimmy Lin [3] observes that Hadoop is currently the large-scale data analysis "hammer" of choice, but that there exist classes of algorithms that are not "nails", in the sense that they are not particularly amenable to the MapReduce programming model. He focuses on the simple solution of finding alternative non-iterative algorithms that solve the same problem. The standard MapReduce model is well known and described in many places; each iteration of PageRank, for example, corresponds to a separate MapReduce job. The author discusses iterative graph algorithms, gradient descent, and EM iterations, which are typically implemented as Hadoop jobs with a driver that sets up each iteration and checks for convergence, and he suggests, half in jest, that if all you have is a hammer, you should throw away everything that is not a nail [3].

Wei Fan and Albert Bifet [4] introduce Big Data mining as the capability of extracting useful information from large datasets or streams of data that, due to their volume, variability, and velocity, could not be mined before. The authors also note that there is some controversy surrounding Big Data. Tools for processing Big Data include Hadoop, Storm, and Apache S4, while specific tools for big-graph mining include PEGASUS and GraphLab. Challenges that still need to be dealt with include compression and visualization [4].

Albert Bifet [5] states that streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. The huge amount of data created every day is termed "big data". The tools used for mining big data include Apache Hadoop, Apache Pig, Cascading, Scribe, Storm, Apache HBase, Apache Mahout, MOA, R, and others. He notes that our ability to handle many exabytes of data depends mainly on the existence of a rich variety of datasets, techniques, and software frameworks [5].

Bernice Purcell [6] states that Big Data comprises large data sets that cannot be handled by traditional systems. Big data includes structured, semi-structured, and unstructured data. Storage techniques used for big data include multiple clustered network-attached storage (NAS) and object-based storage. The Hadoop architecture is used to process unstructured and semi-structured data, using MapReduce to locate all relevant data and then select only the data directly answering the query. The advent of Big Data has posed opportunities as well as challenges to business [6].

Sameer Agarwal et al. [7] present BlinkDB, an approximate, massively parallel query engine for running interactive SQL queries on large volumes of data. BlinkDB uses two key ideas: (1) an adaptive optimization framework that builds and maintains a set of multi-dimensional stratified samples from the original data over time, and (2) a dynamic sample selection strategy that selects an appropriately sized sample based on a query's accuracy or response-time requirements [7].

Yingyi Bu et al. [8] present HaLoop, a modified version of the Hadoop MapReduce framework, motivated by the fact that MapReduce lacks built-in support for iterative programs. HaLoop allows iterative applications to be assembled from existing Hadoop programs without modification and significantly improves their efficiency by providing inter-iteration caching mechanisms and a loop-aware scheduler that exploits these caches. The paper presents the design, implementation, and evaluation of HaLoop, a novel parallel and distributed system that supports large-scale iterative data-analysis applications. HaLoop is built on top of Hadoop and extends it with a new programming model and several important optimizations, including (1) a loop-aware task scheduler, (2) loop-invariant data caching, and (3) caching for efficient fixpoint verification [8].

Shadi Ibrahim et al. [9] show that the presence of partitioning skew causes a large amount of data transfer during the shuffle phase and leads to significant unfairness in the reduce inputs among different data nodes. The authors develop a novel algorithm named LEEN for locality-aware and fairness-aware key partitioning in MapReduce. LEEN embraces an asynchronous map and reduce scheme and has been integrated into Hadoop. Their experiments demonstrate that LEEN efficiently achieves higher locality, reduces the amount of shuffled data, and, more importantly, guarantees fair distribution of the reduce inputs. As a result, LEEN achieves a performance improvement of up to 45% on different workloads [9].

Kenn Slagter et al. [10] propose an improved partitioning algorithm that improves load balancing and memory consumption, achieved through an improved sampling algorithm and partitioner. To evaluate the proposed algorithm, its performance was compared against a state-of-the-art partitioning mechanism employed by TeraSort, since the performance of MapReduce depends strongly on how evenly it distributes the workload. This can be a challenge, especially in the presence of data skew. In MapReduce, workload distribution depends on the algorithm that partitions the data; one way to avoid problems arising from data skew is to use data sampling. How evenly the partitioner distributes the data depends on how large and representative the sample is, and on how well the samples are analyzed by the partitioning mechanism. The authors use this improved partitioning mechanism to distribute the workload evenly when analyzing massive data with MapReduce [10].
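Both LEEN [9] and the sampling-based partitioner [10] revolve around how intermediate keys are assigned to reduce tasks. Neither algorithm is reproduced here; the sketch below only shows the standard Hadoop Partitioner hook that such schemes override, using hash partitioning equivalent to Hadoop's default behaviour. The class name is hypothetical.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hash partitioning sends every occurrence of a key to the same reducer, so a
// few very frequent keys can overload a handful of reducers (partitioning skew).
// Skew-aware schemes replace exactly this method with locality- and
// load-aware assignments.
public class SkewProneHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask the sign bit so the modulo result is always a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

A job opts into a custom partitioner with job.setPartitionerClass(SkewProneHashPartitioner.class); a skew-aware implementation can be swapped in this way without touching the map or reduce code.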
Ahmed Eldawy et al. [11] present SpatialHadoop, the first full-fledged MapReduce framework with native support for spatial data. SpatialHadoop pushes spatial constructs into all layers of Hadoop, namely the language, storage, MapReduce, and operations layers. In the language layer, a simple high-level language is provided to simplify spatial data analysis for non-technical users. In the storage layer, a two-layered spatial index structure is provided, in which a global index partitions data across nodes while a local index organizes data within each node; this structure is used to build a grid index, an R-tree, or an R+-tree. SpatialHadoop is a comprehensive extension to Hadoop that pushes spatial data inside the core functionality of Hadoop; it runs existing Hadoop programs as-is, yet achieves order(s)-of-magnitude better performance than Hadoop when dealing with spatial data. SpatialHadoop employs a simple high-level spatial language, a two-level spatial index structure, basic spatial components built inside the MapReduce layer, and three basic spatial operations: range queries, k-NN queries, and spatial join [11].

Jeffrey Dean et al. [12] describe MapReduce as implemented at Google: it runs on a large cluster of commodity machines and is highly scalable, with a typical MapReduce computation processing many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented, and upwards of one thousand MapReduce jobs are executed on Google's clusters every day. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program's execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system, an approach the authors call simplified data processing on large clusters [12].

Chris Jermaine et al. [13] propose online aggregation (OLA) for large-scale computing. Given the potential for OLA to be newly relevant, and given the current interest in very large-scale, data-oriented computing, the paper considers the problem of providing OLA in a shared-nothing environment. While the authors concentrate on implementing OLA on top of a MapReduce engine, many of their most basic contributions are not specific to MapReduce and should apply broadly. Given the MapReduce paradigm's close relationship with cloud computing (one might expect a large fraction of MapReduce jobs to be run in the cloud), online aggregation is a very attractive technology: since large-scale cloud computations are typically pay-as-you-go, a user can monitor the accuracy obtained in an online fashion and save money by killing the computation early once sufficient accuracy has been obtained [13].

Tyson Condie et al. [14] propose a modified MapReduce architecture in which intermediate data is pipelined between operators, while preserving the programming interfaces and fault-tolerance models of other MapReduce frameworks. To validate this design, the authors developed the Hadoop Online Prototype (HOP), a pipelining version of Hadoop. Pipelining provides several important advantages to a MapReduce framework but also raises new design challenges, since in standard Hadoop the output of each MapReduce task and job is materialized to disk before it is consumed in order to simplify fault tolerance. The modified architecture allows data to be pipelined between operators, extends the MapReduce programming model beyond batch processing, and can reduce completion times and improve system utilization for batch jobs as well. HOP supports online aggregation, which allows users to see "early returns" from a job as it is being computed, and continuous queries, which enable MapReduce programs to be written for applications such as event monitoring and stream processing [14].

Jonathan Paul Olmsted [15] derives the results necessary to apply variational Bayesian inference to the ideal point model. This deterministic, approximate solution is shown to produce results comparable to those from standard estimation strategies. However, unlike other estimation approaches, solving for the (approximate) posterior distribution is rapid and easily scales to "big data". Inferences from the variational Bayesian approach to ideal point estimation are shown to be equivalent to standard approaches on modestly sized roll-call matrices from recent sessions of the US Congress; the ability of variational inference to scale to big data is then demonstrated and contrasted with the performance of standard approaches [15].

Jonathan Stuart Ward et al. [16] survey definitions of big data. Anecdotally, big data is predominantly associated with two ideas: data storage and data analysis. Despite the sudden interest in big data, these concepts are far from new and have long lineages. This raises the question of how big data differs notably from conventional data processing techniques. For rudimentary insight into the answer, one need look no further than the term itself: "big" implies significance, complexity, and challenge, but it also invites quantification, and therein lies the difficulty in furnishing a definition. The lack of a consistent definition introduces ambiguity and hampers discourse relating to big data. This short paper attempts to collate the various definitions that have gained some degree of traction and to furnish a clear and concise definition of an otherwise ambiguous term [16].

Albert Bifet [17] discusses the current and future trends of mining evolving data streams and the challenges the field will have to overcome in the coming years. Real-time analytics on data streams is needed to manage the data currently generated, at an ever-increasing rate, by applications such as sensor networks, measurements in network monitoring and traffic management, log records and click-streams in web exploration, manufacturing processes, call detail records, email, blogging, and Twitter posts. In fact, all generated data can be considered streaming data, or a snapshot of streaming data, since it is obtained from an interval of time. Streaming data analysis in real time is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends that help improve their performance. Evolving data streams are contributing to the growth of data created over the last few years; we now create the same quantity of data every two days as was created from the dawn of time up until 2003. Evolving data stream methods are becoming a low-cost, green methodology for real-time online prediction and analysis [17].

Mrigank Mridul, Akashdeep Khajuria, Snehasish Dutta, and Kumar N. [18] analyze big data, noting that data is generated through many sources such as business processes, transactions, social networking sites, and web servers, and remains in structured as well as unstructured form. Today's business applications have enterprise features such as large scale, data intensity, web orientation, and access from diverse devices, including mobile devices. Processing and analyzing this huge amount of data and extracting meaningful information from it is a challenging task. The term "big data" is used for large data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. Big data sizes are a constantly moving target, currently ranging from a few dozen terabytes to many petabytes of data in a single data set. Difficulties include capture, storage, search, sharing, analytics, and visualization. Typical examples of big data in the current scenario include web logs, RFID-generated data, sensor networks, satellite and geo-spatial data, social data from social networks, Internet text and documents, Internet search indexing, call detail records, astronomy, atmospheric science, genomics, biogeochemical, biological, and other complex and/or interdisciplinary scientific data, military surveillance, medical records, photography archives, video archives, and large-scale e-commerce [18].

Kyong-Ha Lee and Hyunsik Choi [19] survey MapReduce, a prominent parallel data processing tool, with the intention of assisting the database and open source communities in understanding various technical aspects of the MapReduce framework. The survey characterizes the MapReduce framework, discusses its inherent pros and cons, and introduces optimization strategies reported in the recent literature. The authors also discuss the open issues and challenges raised in parallel data analysis with MapReduce [19].

Chen He, Ying Lu, and David Swanson [20] develop a new MapReduce scheduling technique to enhance map tasks' data locality. The technique has been integrated into both the Hadoop default FIFO scheduler and the Hadoop fair scheduler. To evaluate the technique, the authors compare not only MapReduce scheduling algorithms with and without the technique but also an existing data locality enhancement technique (the delay algorithm developed by Facebook). Experimental results show that the technique often leads to the highest data locality rate and the lowest response time for map tasks. Furthermore, unlike the delay algorithm, it does not require an intricate parameter-tuning process [20].

4. Other Components of Hadoop

Table 1 gives details of different Hadoop-ecosystem components in common use today: HBase, Hive, MongoDB, Redis, Cassandra, and Drizzle. The comparison is made on the basis of the description, implementation language, database model, and replication method of each component.

Table 1: Comparison among Components of Hadoop

Component | Description | Implementation Language | Database Model | Replication Method
HBase | Wide-column store based on Apache Hadoop and on concepts of BigTable | Java | Wide-column store | Selectable replication factor
Hive | Data warehouse software for querying and managing large datasets | Java | Relational DBMS | Selectable replication factor
MongoDB | One of the most popular document stores | C++ | Document store | Master-slave replication
Redis | In-memory database with configurable options for performance vs. persistency | C | Key-value store | Selectable replication factor
Cassandra | Wide-column store based on ideas of BigTable and DynamoDB | Java | Wide-column store | Master-master replication, master-slave replication
Drizzle | MySQL fork with a pluggable micro-kernel | C++ | Relational DBMS | Master-slave replication

5. Conclusion

We have entered an era of Big Data. This paper describes the concept of Big Data along with its 3 Vs: volume, velocity, and variety. The paper also focuses on the problems of Big Data processing. These technical challenges must be addressed for efficient and fast processing of Big Data. The challenges include not just the obvious issues of scale, but also heterogeneity, lack of structure, error handling, privacy, timeliness, provenance, and visualization, at all stages of the analysis pipeline from data acquisition to result interpretation. These technical challenges are common across a large variety of application domains and are therefore not cost-effective to address in the context of one domain alone. Finally, the paper describes Hadoop, an open source software framework used for processing Big Data.

REFERENCES

[1] S. Vikram Phaneendra and E. Madhusudhan Reddy, "Big Data - Solutions for RDBMS Problems - A Survey," in 12th IEEE/IFIP Network Operations & Management Symposium (NOMS 2010), Osaka, Japan, Apr. 19-23, 2013.
[2] Kiran Kumara Reddi and Dnvsl Indira, "Different Techniques to Transfer Big Data: A Survey," IEEE Transactions 52(8), Aug. 2013, pp. 2348-2355.
[3] Jimmy Lin, "MapReduce Is Good Enough?," The Control Project, IEEE Computer 32, 2013.
[4] Umasri M.L., Shyamalagowri D., and Suresh Kumar S., "Mining Big Data: Current Status and Forecast to the Future," Volume 4, Issue 1, January 2014, ISSN 2277-128X.
[5] Albert Bifet, "Mining Big Data in Real Time," Informatica 37 (2013), pp. 15-20, Dec. 2012.
[6] Bernice Purcell, "The Emergence of 'Big Data' Technology and Analytics," Journal of Technology Research, 2013.
[7] Sameer Agarwal, Barzan Mozafari, Aurojit Panda, Henry Milner, Samuel Madden, and Ion Stoica, "BlinkDB: Queries with Bounded Errors and Bounded Response Times on Very Large Data," ACM, 2013, 978-1-4503-1994-2/13/04.
[8] Yingyi Bu, Bill Howe, Magdalena Balazinska, and Michael D. Ernst, "HaLoop: Efficient Iterative Data Processing on Large Clusters," VLDB 2010.
[9] Shadi Ibrahim, Hai Jin, and Lu Lu, "Handling Partitioning Skew in MapReduce using LEEN," ACM 51 (2008), pp. 107-113.
[10] Kenn Slagter and Ching-Hsien Hsu, "An Improved Partitioning Mechanism for Optimizing Massive Data Analysis using MapReduce," published online 11 April 2013.
