Big Data Analytics: Hadoop-MapReduce & NoSQL Databases


Abinav Pothuganti
Computer Science and Engineering, CBIT, Hyderabad, Telangana, India

Abstract— Today, we are surrounded by data like oxygen. The exponential growth of data first presented challenges to cutting-edge businesses such as Google, Yahoo, Amazon, Microsoft, Facebook and Twitter. Data volumes to be processed by cloud applications are growing much faster than computing power, and this growth demands new strategies for processing and analysing information. The large volume of unstructured (or semi-structured) and structured data created by applications, emails, web logs and social media is known as "Big Data". This kind of data exceeds the processing capacity of conventional database systems. In this paper we provide basic knowledge about Big Data, which is largely being generated because of cloud computing, and explain in detail two widely used Big Data analytics techniques: Hadoop MapReduce and NoSQL databases.

Keywords— Big Data, Big Data Analytics, Hadoop, NoSQL

I. INTRODUCTION

Cloud computing has been driven fundamentally by the need to process an exploding quantity of data, measured in exabytes, as we approach the zettabyte era. One critical trend that shines through the cloud is Big Data. Indeed, it is the core driver in cloud computing and will define the future of IT. When a company needed to store and access more data, it traditionally had one of two choices. One option was to buy a bigger machine with more CPU, RAM, disk space and so on; this is known as scaling vertically. Of course, there is a limit to how big a machine you can actually buy, and this does not work at Internet scale. The other option was to scale out, which usually meant contacting a database vendor to buy a bigger solution; such solutions do not come cheap and therefore required a significant investment. Today, data is generated not only by users and applications but is also "machine-generated", and such data is exponentially driving the change in the Big Data space.

Big Data processing is performed through a programming paradigm known as MapReduce. Typically, implementation of the MapReduce paradigm requires network-attached storage and parallel processing. The computing needs of MapReduce programming are often beyond what small and medium sized businesses are able to commit.

Cloud computing is on-demand network access to computing resources provided by an outside entity. Common deployment models for cloud computing include platform as a service (PaaS), software as a service (SaaS), infrastructure as a service (IaaS), and hardware as a service (HaaS). Platform as a Service (PaaS) is the use of cloud computing to provide platforms for the development and use of custom applications. Software as a Service (SaaS) provides businesses with applications that are stored and run on virtual servers in the cloud. In the IaaS model, a client business pays on a per-use basis for equipment to support computing operations, including storage, hardware, servers and networking equipment. HaaS is a cloud service based upon the model of time-sharing on minicomputers and mainframes.

The three types of cloud computing are the public cloud, the private cloud and the hybrid cloud. A public cloud provides pay-as-you-go services.
A private cloud is an internal data centre of a business, not available to the general public but based on cloud structure. The hybrid cloud is a combination of the public cloud and the private cloud. Three major reasons for small and medium sized businesses to use cloud computing for big data technology implementation are hardware cost reduction, processing cost reduction, and the ability to test the value of big data.

II. BIG DATA

Big data is a collection of data sets so large and complex that it exceeds the processing capacity of conventional database systems. The data is too big, moves too fast, or does not fit the structures of our current database architectures. Big Data is typically a large volume of unstructured (or semi-structured) and structured data created by various organized and unorganized applications, activities and channels such as emails, Twitter, web logs and Facebook. The main difficulties with Big Data include capture, storage, search, sharing, analysis and visualization. At the core of Big Data is Hadoop, a platform for distributing computing problems across a number of servers. First developed and released as open source by Yahoo!, it implements the MapReduce approach pioneered by Google in compiling its search indexes. Hadoop's MapReduce involves distributing a dataset among multiple servers and operating on the data (the "map" stage); the partial results are then recombined (the "reduce" stage). To store data, Hadoop utilizes its own distributed file system, HDFS, which makes data available to multiple computing nodes. The big data explosion is a result not only of increasing Internet usage by people around the world, but also of the connection of billions of devices to the Internet. Eight years ago, for example, there were only around 5 exabytes of data online. Just two years ago, that amount of data passed over the Internet in the course of a single month.

Recent estimates put monthly Internet data flow at around 21 exabytes. This explosion of data, in both its size and form, causes a multitude of challenges for both people and machines.

III. HADOOP MAPREDUCE

Hadoop is a batch processing system for a cluster of nodes that provides the underpinnings of most Big Data analytic activities, because it bundles the two sets of functionality most needed to deal with large unstructured datasets: a distributed file system and MapReduce processing. It is a project of the Apache Software Foundation, written in Java to support data-intensive distributed applications. Hadoop enables applications to work with thousands of nodes and petabytes of data. The inspiration comes from Google's MapReduce and Google File System papers. Hadoop's biggest contributor has been the search giant Yahoo, where Hadoop is used extensively across the business platform.

A. High Level Architecture of Hadoop

Fig 1. High level Architecture of Hadoop

Pig: Pig is a dataflow processing (scripting) language. Apache Pig is a platform for analysing large data sets that consists of a high-level language for expressing data analysis programs. The main characteristic of Pig programs is that their structure can be substantially parallelized, enabling them to handle very large data sets; the simple syntax and advanced built-in functionality provide an abstraction that makes development of Hadoop jobs quicker and easier than writing traditional Java MapReduce jobs.

Hive: Hive is a data warehouse infrastructure built on top of Hadoop. Hive provides tools for easy data summarization, ad-hoc querying and analysis of large datasets stored in Hadoop files. It provides a mechanism to put structure on this data, and a simple query language called HiveQL, based on SQL, that enables users familiar with SQL to query the data.

HCatalog: HCatalog is a storage management layer for Hadoop that allows users with different data processing tools to share and access the same data. HCatalog's table abstraction presents users with a relational view of data in the Hadoop Distributed File System (HDFS) and ensures that users need not worry about where or in what format their data is stored.

MapReduce: Hadoop MapReduce is a programming model and software framework for writing applications that rapidly process vast amounts of data in parallel on large clusters of compute nodes. MapReduce uses HDFS to access file segments and to store reduced results.

HBase: HBase is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It maps HDFS data into a database-like structure and provides Java API access to this database. It supports batch-style computations using MapReduce as well as point queries (random reads). HBase is used in Hadoop when random, real-time read/write access is needed. Its goal is the hosting of very large tables running on top of clusters of commodity hardware.
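As an illustration of the Java API access mentioned above (not taken from the paper), the sketch below writes and then reads back a single cell with the standard HBase client API. The ZooKeeper host, table name, column family and row key are hypothetical placeholders.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBasePointQuery {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            conf.set("hbase.zookeeper.quorum", "zk-host");   // hypothetical ZooKeeper address

            try (Connection connection = ConnectionFactory.createConnection(conf);
                 Table table = connection.getTable(TableName.valueOf("weblog"))) {  // hypothetical table

                // Random write: one row keyed by a user id, one column in family "d".
                Put put = new Put(Bytes.toBytes("user42"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("lastPage"), Bytes.toBytes("/home"));
                table.put(put);

                // Random (point) read of the same cell.
                Result result = table.get(new Get(Bytes.toBytes("user42")));
                byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("lastPage"));
                System.out.println(Bytes.toString(value));
            }
        }
    }

The row-keyed Put/Get calls are exactly the "random, real-time read/write access" use case the paper attributes to HBase, in contrast to the batch scans that MapReduce jobs perform.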
HDFS: The Hadoop Distributed File System (HDFS) is the primary storage system used by Hadoop applications. HDFS is, as its name implies, a distributed file system that provides high-throughput access to application data by creating multiple replicas of data blocks and distributing them on compute nodes throughout a cluster to enable reliable and rapid computations.

Core: The Hadoop core consists of a set of components and interfaces which provide access to the distributed file systems and general I/O (serialization, Java RPC, persistent data structures). The core components also provide "rack awareness", an optimization which takes into account the geographic clustering of servers, minimizing network traffic between servers in different geographic clusters.

B. Architecture of Hadoop

Hadoop is a Map/Reduce framework that works on HDFS or on HBase. The main idea is to decompose a job into several identical tasks that can be executed closer to the data (on the DataNodes). In addition, each task is parallelized: the Map phase. Then all the intermediate results are merged into one result: the Reduce phase. In Hadoop, the JobTracker (a Java process) is responsible for monitoring the job, managing the Map/Reduce phases and managing retries in case of errors. The TaskTrackers (Java processes) run on the different DataNodes; each TaskTracker executes the tasks of the job on the locally stored data.

The core of the Hadoop cluster architecture is given below:

HDFS (Hadoop Distributed File System): HDFS is the basic file storage, capable of storing a large number of large files.

MapReduce: MapReduce is the programming model by which data is analyzed using the processing resources within the cluster.

Each node in a Hadoop cluster is either a master or a slave. Slave nodes always run both a Data Node and a Task Tracker, while a single master node may act as both Name Node and JobTracker.

Name Node: Manages file system metadata and access control. There is exactly one Name Node in each cluster.

Secondary Name Node: Downloads periodic checkpoints from the Name Node for fault tolerance. There is exactly one Secondary Name Node in each cluster.

Job Tracker: Hands out tasks to the slave nodes. There is exactly one Job Tracker in each cluster.

Data Node: Holds file system data. Each Data Node manages its own locally attached storage and stores a copy of some or all blocks in the file system. There are one or more Data Nodes in each cluster.

Task Tracker: Slaves that carry out map and reduce tasks. There are one or more Task Trackers in each cluster.

C. Hadoop Distributed File System (HDFS)

An HDFS cluster has two types of node operating in a master-worker pattern: a NameNode (the master) and a number of DataNodes (workers). The NameNode manages the filesystem namespace; it maintains the filesystem tree and the metadata for all the files and directories in the tree. The NameNode also knows the DataNodes on which all the blocks for a given file are located. DataNodes are the workhorses of the filesystem: they store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing. The NameNode decides about the replication of data blocks. In a typical HDFS deployment the block size is 64 MB and the replication factor is 3 (second copy on the local rack and third on a remote rack). Figure 2 shows the architecture of the HDFS distributed file system.

Hadoop MapReduce applications use storage in a manner that is different from general-purpose computing. To read an HDFS file, client applications simply use a standard Java file input stream, as if the file were in the native filesystem. Behind the scenes, however, this stream is manipulated to retrieve data from HDFS instead. First, the NameNode is contacted to request access permission. If granted, the NameNode translates the HDFS filename into a list of the HDFS block IDs comprising that file and a list of DataNodes that store each block, and returns the lists to the client. Next, the client opens a connection to the "closest" DataNode (based on Hadoop rack awareness, optimally the same node) and requests a specific block ID. That HDFS block is returned over the same connection, and the data is delivered to the application.

To write data to HDFS, client applications see the HDFS file as a standard output stream. Internally, however, stream data is first fragmented into HDFS-sized blocks (64 MB) and then into smaller packets (64 kB) by the client thread. Each packet is enqueued into a FIFO that can hold up to 5 MB of data, thus decoupling the application thread from storage system latency during normal operation. A second thread is responsible for dequeuing packets from the FIFO, coordinating with the NameNode to assign HDFS block IDs and destinations, and transmitting blocks to the DataNodes (either local or remote) for storage. A third thread manages acknowledgements from the DataNodes that data has been committed to disk.

Fig 2. Hadoop Distributed Cluster File System Architecture
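To make the read path above concrete, here is a minimal sketch (not from the paper) of reading a file through Hadoop's Java FileSystem API. The NameNode address and file path are hypothetical; the FSDataInputStream hides the NameNode lookup and the block fetches from nearby DataNodes described above.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadExample {
        public static void main(String[] args) throws Exception {
            // fs.defaultFS normally points at the NameNode; this address is hypothetical.
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode:9000");
            FileSystem fs = FileSystem.get(conf);

            // Behind this stream, the client asks the NameNode for block locations
            // and then streams each block from a nearby DataNode.
            Path file = new Path("/data/sample.txt");   // hypothetical path
            try (FSDataInputStream in = fs.open(file);
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
            fs.close();
        }
    }

Writing is symmetric: fs.create(path) returns an output stream whose data is split into blocks and packets on the client side, as described in the write path above.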
D. Map Reduce Architecture & Implementation

MapReduce is a data processing and parallel programming model introduced by Google. In this model, a user specifies the computation with two functions, Map and Reduce. In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper. In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result. In simple terms, the mapper is meant to filter and transform the input into something that the reducer can aggregate over. The underlying MapReduce library automatically parallelizes the computation and handles complicated issues like data distribution, load balancing and fault tolerance. With massive input spread across many machines, the computation has to be parallelized; the framework moves the data and provides scheduling and fault tolerance. The original MapReduce implementation by Google, as well as its open-source counterpart Hadoop, is aimed at parallelizing computation on large clusters of commodity machines. MapReduce has gained great popularity because it gracefully and automatically achieves fault tolerance: it handles the gathering of results across the multiple nodes and returns a single result or set. The main advantage of the MapReduce model is the easy scaling of data processing over multiple computing nodes.

Fig 3. High Level view of MapReduce Programming Model

Fault tolerance: MapReduce is designed to be fault tolerant because failures are a common phenomenon in large-scale distributed computing; this covers both worker failure and master failure.

Worker failure: The master pings every mapper and reducer periodically. If no response is received for a certain amount of time, the machine is marked as failed. The ongoing task and any tasks completed by this mapper are re-assigned to another mapper and executed from the very beginning. Completed reduce tasks do not need to be re-executed because their output is stored in the global file system.

Master failure: Since the master is a single machine, the probability of master failure is very small. MapReduce will restart the entire job if the master fails. There are currently three popular implementations of the MapReduce programming model, namely Google MapReduce, Apache Hadoop and Stanford Phoenix.
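In Hadoop's implementation, the number of times a failed map or reduce task is retried before the whole job is declared failed is configurable. The short sketch below is illustrative only; the property names are the usual Hadoop 2.x/3.x ones and 4 is their customary default, not values taken from the paper.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class RetryConfigExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // How many times a failed map or reduce task is re-attempted on another
            // node before the whole job is failed (4 is the customary default).
            conf.setInt("mapreduce.map.maxattempts", 4);
            conf.setInt("mapreduce.reduce.maxattempts", 4);

            Job job = Job.getInstance(conf, "fault-tolerant job");
            // ... mapper, reducer, input and output paths would be set here as usual ...
        }
    }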

E. Execution Process in MapReduce Programming Model

In the MapReduce programming model, a job consists of a map function, a reduce function and the input data. When a job is launched, the following steps take place:

- MapReduce first divides the input data into N partitions, with sizes varying from 16 MB to 64 MB.
- It then starts many copies of the program on a cluster of different machines. One of them is the master program; the others are workers, which execute the work assigned by the master. The master can distribute a map task or a reduce task to an idle worker.
- If a worker is assigned a map task, it parses the input data partition, outputs key/value pairs and passes each pair to a user-defined Map function. The map function buffers the temporary key/value pairs in memory. The pairs are periodically written to local disk and partitioned into P pieces. After that, the local machine informs the master of the location of these pairs.
- If a worker is assigned a reduce task and is informed about the location of these pairs, the reducer reads the buffered data using remote procedure calls and then sorts the temporary data based on the key.
- The reducer then processes all of the records. For each key and its corresponding set of values, the reducer passes the key/value pairs to a user-defined Reduce function. The output is the final output of this partition.
- After all of the mappers and reducers have finished their work, the master returns the result to the user's program. The output is stored in individual files, one per reduce partition.

Fig 4. Architecture of MapReduce

F. A MapReduce Programming Model Example

In essence, MapReduce is just a way to take a big task and split it into discrete tasks that can be done in parallel. A simple problem that is often used to explain how MapReduce works in practice consists in counting the occurrences of single words within a text. This kind of problem can easily be solved by launching a single MapReduce job, as outlined below (a code sketch follows this list):

- Input data is partitioned into smaller chunks of data.
- For each chunk of input data, a "map task" runs which applies the map function; the resulting output of each map task is a collection of key-value pairs.
- The output of all map tasks is shuffled: for each distinct key in the map output, a collection is created containing all corresponding values from the map output.
- For each key-collection resulting from the shuffle phase, a "reduce task" runs which applies the reduce function to the collection of values; the resulting output is a single key-value pair.
- The collection of all key-value pairs resulting from the reduce step is the output of the MapReduce job.

Fig 5. A MapReduce Programming Model Example
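The word-count job outlined above is the canonical Hadoop example. Below is a hedged sketch of how it is commonly written against the Hadoop MapReduce Java API; the class names and the command-line input/output paths are illustrative, not taken from the paper.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map: emit (word, 1) for every word in the input split.
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce: sum the counts collected for each word after the shuffle.
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // optional map-side pre-aggregation
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory must not exist yet
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

The mapper emits (word, 1) pairs, the shuffle groups them by word, and the reducer sums the counts; the reducer class is reused as a combiner for local aggregation on the map side.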
IV. NOSQL DATABASES

NoSQL systems are distributed, non-relational databases designed for large-scale data storage and for massively parallel data processing across a large number of commodity servers. They also use non-SQL languages and mechanisms to interact with data (though some now feature APIs that convert SQL queries into the system's native query language or tools). NoSQL database systems arose alongside major Internet companies, such as Google, Amazon and Facebook, which faced challenges in dealing with huge quantities of data that conventional RDBMS solutions could not cope with.

A. Evolution of NoSQL Databases

Of the many different data models, the relational model has been dominant since the 80s, with implementations like Oracle databases, MySQL and Microsoft SQL Server, also known as Relational Database Management Systems (RDBMS). Lately, however, in an increasing number of cases the use of relational databases leads to problems, both because of deficits and problems in the modelling of data and because of constraints on horizontal scalability across several servers and on big amounts of data.

There are two trends that are bringing these problems to the attention of the international software community:

1. The exponential growth of the volume of data generated by users, systems and sensors, further accelerated by the concentration of a large part of this volume on big distributed systems like Amazon, Google and other cloud services.

2. The increasing interdependency and complexity of data, accelerated by the Internet, Web 2.0, social networks and open, standardized access to data sources from a large number of different systems.

Fig 6. Big Data Transactions with Interactions and Observations

Organizations that collect large amounts of unstructured data are increasingly turning to non-relational databases, now frequently called NoSQL databases. NoSQL databases focus on analytical processing of large-scale datasets, offering increased scalability over commodity hardware. The computational and storage requirements of applications such as Big Data analytics, business intelligence and social networking over petabyte datasets have pushed SQL-like centralized databases to their limits. This led to the development of horizontally scalable, distributed non-relational data stores, called NoSQL databases.

B. Characteristics of NoSQL Databases

In order to guarantee the integrity of data, most classical database systems are based on transactions. This ensures consistency of data in all situations of data management. These transactional characteristics are also known as ACID (Atomicity, Consistency, Isolation and Durability). However, scaling out ACID-compliant systems has shown to be a problem. Conflicts arise between the different aspects of high availability in distributed systems that are not fully solvable, known as the CAP theorem:

Strong Consistency: all clients see the same version of the data, even on updates to the dataset, e.g. by means of the two-phase commit protocol (XA transactions), and ACID.

High Availability: all clients can always find at least one copy of the requested data, even if some of the machines in a cluster are down.

Partition-tolerance: the total system keeps its characteristics even when deployed on different servers, transparently to the client.

The CAP theorem postulates that only two of these three aspects of scaling out can be achieved fully at the same time.

Fig 7. Characteristics of NoSQL Databases

C. Classification of NoSQL Databases

We classify NoSQL databases into four basic categories, each suited to different kinds of tasks:

- Key-Value stores
- Document databases (or stores)
- Wide-Column (or Column-Family) stores
- Graph databases

Key-Value stores: Typically, these DMS store items as alphanumeric identifiers (keys) and associated values in simple, standalone tables (referred to as "hash tables"). The values may be simple text strings or more complex lists and sets. Data searches can usually only be performed against keys, not values, and are limited to exact matches.

Fig 8. Key/Value Store NoSQL Database
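As a language-level illustration of the key-value model just described (purely conceptual, not tied to any particular product), the sketch below stores values under alphanumeric keys and supports only exact-match lookups by key; distributed key-value stores such as Redis, Riak or Amazon DynamoDB expose an equivalent get/put/delete interface over a cluster.

    import java.util.Map;
    import java.util.Optional;
    import java.util.concurrent.ConcurrentHashMap;

    // A toy in-memory key-value "store": lookups are by exact key only,
    // and the value is opaque (it could equally be a list or a set).
    public class TinyKeyValueStore {
        private final Map<String, String> table = new ConcurrentHashMap<>();

        public void put(String key, String value) {
            table.put(key, value);
        }

        public Optional<String> get(String key) {   // exact match on the key
            return Optional.ofNullable(table.get(key));
        }

        public void delete(String key) {
            table.remove(key);
        }

        public static void main(String[] args) {
            TinyKeyValueStore store = new TinyKeyValueStore();
            store.put("session:42", "{\"user\":\"abinav\",\"page\":\"/home\"}");
            System.out.println(store.get("session:42").orElse("<missing>"));
            // There is no way to ask "which keys have page == /home":
            // searches against values are not part of the key-value model.
        }
    }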

Document Databases: Inspired by Lotus Notes, document databases were, as their name implies, designed to manage and store documents. These documents are encoded in a standard data exchange format such as XML, JSON (JavaScript Object Notation) or BSON (Binary JSON). Unlike the simple key-value stores described above, the value column in document databases contains semi-structured data, specifically attribute name/value pairs. A single column can house hundreds of such attributes, and the number and type of attributes recorded can vary from row to row.

Fig 9. Document Store NoSQL Database
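A small conceptual sketch of the document model follows (again illustrative, not a specific product's API): each document is a set of attribute name/value pairs, and two documents in the same collection need not share the same attributes. Document stores such as MongoDB or CouchDB provide this model with JSON/BSON documents plus indexing and query languages.

    import java.util.HashMap;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Two "documents" in the same collection with different attribute sets,
    // stored under document ids: the schema can vary from document to document.
    public class TinyDocumentStore {
        private final Map<String, Map<String, Object>> collection = new HashMap<>();

        public void insert(String id, Map<String, Object> document) {
            collection.put(id, document);
        }

        public Map<String, Object> findById(String id) {
            return collection.get(id);
        }

        public static void main(String[] args) {
            TinyDocumentStore orders = new TinyDocumentStore();

            Map<String, Object> doc1 = new LinkedHashMap<>();
            doc1.put("customer", "Alice");
            doc1.put("total", 120.50);
            doc1.put("giftWrap", true);                 // attribute only this document has

            Map<String, Object> doc2 = new LinkedHashMap<>();
            doc2.put("customer", "Bob");
            doc2.put("total", 80.00);
            doc2.put("coupon", "WELCOME10");            // a different optional attribute

            orders.insert("order-1", doc1);
            orders.insert("order-2", doc2);
            System.out.println(orders.findById("order-2")); // {customer=Bob, total=80.0, coupon=WELCOME10}
        }
    }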
Wide-Column (or Column-Family) Stores: Like document databases, Wide-Column (or Column-Family) stores (hereafter WC/CF) employ a distributed, column-oriented data structure that accommodates multiple attributes per key. While some WC/CF stores have a Key-Value DNA (e.g. the Dynamo-inspired Cassandra), most are patterned after Google's Bigtable, the petabyte-scale internal distributed data storage system Google developed for its search index and other collections like Google Earth and Google Finance. These generally replicate not just Google's Bigtable data storage structure, but Google's distributed file system (GFS) and MapReduce parallel processing framework as well, as is the case with Hadoop, which comprises the Hadoop Distributed File System (HDFS, based on GFS), HBase (a Bigtable-style storage system) and MapReduce.

Graph Databases: Graph databases replace relational tables with structured relational graphs of interconnected key-value pairings. They are similar to object-oriented databases in that the graphs are represented as an object-oriented network of nodes (conceptual objects), node relationships ("edges") and properties (object attributes expressed as key-value pairs). They are the only one of the four NoSQL types discussed here that concerns itself with relations, and their focus on visual representation of information makes them more human-friendly than other NoSQL DMS.

Fig 10. Graph NoSQL Database

V. CONCLUSION

Progress in cloud technology and increased use of the Internet are creating very large new datasets of increasing value to businesses, while making the processing power needed to analyse them affordable. Big Data is still in its infancy, but it is already having a profound effect on technology companies and on the way we do business. The size of these datasets suggests that their exploitation may well require a new category of data storage and analysis systems with different architectures. Both the Hadoop-MapReduce programming paradigm and NoSQL databases have a substantial base in the Big Data community, and they are still expanding. HDFS, the Hadoop Distributed File System, is a distributed file system designed to hold very large amounts of data (terabytes or even petabytes) and to provide high-throughput access to this information. The ease of use of the MapReduce method has made the Hadoop MapReduce technique more popular than other data analytics techniques.
