Making a Case for Hadoop in Your Organization


A MetaScale Whitepaper

Making a Case for Hadoop in Your Organization

Author: Andrew McNalis

With so much noise around Hadoop and big data, it can be difficult to find real answers on the everyday uses that make Hadoop a viable component of the modern enterprise data architecture. By understanding the uses, benefits and risks of Hadoop in the enterprise, business and information technology professionals can develop a plan for gaining acceptance for implementing it in their organization.

Introduction

Hadoop and big data have been all over the press lately. With all the hype, Hadoop does look like an attractive, low-cost solution that your company could leverage. So, you'd like to get approval to implement Hadoop within your company or organization, but you need to sell it to skeptical upper management.

This whitepaper will provide tools to do just that by covering uses of Hadoop in the organization, the benefits of using Hadoop, and the risks and risk responses. Additionally, a companion piece will outline points that you can use in a plan for selling Hadoop within your organization.

Topics to be covered:

- Uses for Hadoop in the Enterprise
- How Hadoop Lowers Costs and Reduces Processing Times
- Reliability of Hadoop in the Enterprise
- Benefits to Implementing Hadoop
- Impedances, Barriers and Road Blocks to Implementing Hadoop in the Enterprise
- Risk Management of Hadoop in the Enterprise
- The Plan (see companion piece)

Through candid discussion of these topics, anyone who needs to obtain executive, management or funding approval should be able to make a compelling business case to implement Hadoop in the enterprise. Both business and information technology professionals will gain valuable insight into what Hadoop can do in an enterprise and how to leverage that information to gain acceptance.

Uses for Hadoop in the Enterprise

Like most IT professionals, you probably receive some kind of promotional email about big data on a daily basis. You might find yourself asking: So, what's it really all about? How can big data and Hadoop be used in the enterprise, specifically in my organization?

Let's start with a couple of definitions.

Big Data: In information technology, big data is a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications (source: Wikipedia).

Hadoop: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license. It supports the running of applications on large clusters of commodity hardware. The Hadoop framework transparently provides both reliability and data motion to applications (source: Wikipedia).

Unstructured vs. Structured Data

Many of the published use cases for Hadoop involve analysis or processing of unstructured data. In most cases, traditional relational database management systems (RDBMS) are not effective tools for processing unstructured data, especially large volumes of it. Processing large datasets of unstructured data is an area where Hadoop excels. However, that does not mean it can't process structured data. As you'll see from topics later in this paper, many of the characteristics of Hadoop enable it to process structured data as well.

Processing Big Data

The traditional use case for Hadoop is storing and processing large, complex datasets. This was the genesis of Hadoop, and this functionality has been proven many times.

Enterprise Data Hub

Many organizations have targeted data warehouse platforms for their enterprise data hub. In the past, these platforms were the best choice available. However, traditional data warehouse platforms come with an expensive initial price tag along with ongoing maintenance and subscription costs. Plus, depending on IT budgets, the size of the data warehouse might be constrained by costs, forcing transformation of the data, elimination of some data elements, and/or purging of old data to save space. With Hadoop, these measures can be avoided, enabling an organization to keep all raw operational data potentially forever.

[Figure: Hadoop as an Enterprise Data Hub]

Data Archive (Where the Data Is Accessible)

Many organizations have retention periods for data driven by compliance and legal requirements. In some cases, someone in the organization simply thinks the data might be valuable in the future, so they take measures to keep it.

But where is the data usually kept? In Excel spreadsheets; in Access, SQL Server and MySQL databases; on corporate shared drives; or on individuals' hard drives? Each of these types of media has its own problems, but generally, accessibility by the greater organization is a problem. Additionally, space constraints drive behavior and actions that might not be the best course of action for the data.

In many cases, application-specific data is kept on physical magnetic tape. Problems with tape include accessibility, keeping track of the location of the tapes, the time it takes to retrieve the data, and the reliability of the tape media. Using Hadoop as a low-cost storage solution enables the organization to keep data for compliance and other purposes, and to make that data available at any time.

Analytics and Deep Analytics

Hadoop can be used to analyze data using complex criteria. Deep analytics refers to processing large quantities of data, in the hundreds-of-terabytes to petabyte range, for this analysis. Available tools include MapReduce (written in Java or Ruby), Pig, Hive, Hue, and a variety of commercial off-the-shelf BI tools. The open source R statistical package can also be installed to leverage the parallel computing power of Hadoop. A minimal Pig aggregation is sketched below.
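To make that concrete, here is a minimal sketch of a deep-analytics style aggregation in Pig. The input path, field names and tab delimiter are hypothetical placeholders, not from this paper; the point is that each step is parallelized across the cluster without the author of the script doing anything special.

```
-- Hypothetical example: count page views per page across a large
-- clickstream extract stored in HDFS (path and schema are assumptions).
clicks  = LOAD '/data/raw/clickstream' USING PigStorage('\t')
          AS (user_id:chararray, page:chararray, view_time:long);
by_page = GROUP clicks BY page;
counts  = FOREACH by_page GENERATE group AS page, COUNT(clicks) AS views;
ranked  = ORDER counts BY views DESC;
STORE ranked INTO '/data/analytics/page_view_counts';
```

The same five-line pattern runs unchanged whether the input is gigabytes or hundreds of terabytes; only the cluster size determines how long it takes.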

Operational Processing

Operational processing is a bit of a departure from the type of work that is usually targeted for the Hadoop environment. While Hadoop got its start processing large amounts of data from web sites, it also excels at traditional IT batch processing. Examples follow.

ETL Processes

Many companies spend large sums on computing equipment and software to meet their Extract, Transform and Load (ETL) needs. Once the technical infrastructure is in place, the organization needs to develop scripts that read the inputs, perform the transformations and create the outputs. If organizations are not using formal ETL tools, then they've written these processes in general-purpose programming languages.

Pain points with ETL environments include:

- capacity issues (too much to process and not enough computing resources);
- processing congestion and data latency;
- up-front and ongoing expense for a single-use environment;
- complex business logic coded in the proprietary language of the ETL tool;
- locking the organization into an expensive technology solution for the foreseeable future;
- data lineage (tracking transformations back through already-transformed data).

Hadoop addresses these problems inherently. Ways Hadoop can ease ETL pain points include:

- the ability to keep growing your cluster, thanks to Hadoop's scalability and low-cost hardware;
- mitigated processing congestion and data latency by leveraging Hadoop's parallelization;
- reduced costs through commodity hardware, open source software and ease of scaling the environment;
- coding in the easy-to-use Pig language (see the sketch below).
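As a hedged illustration of this ELT-style approach, here is roughly what a re-runnable transformation step looks like in Pig once the raw, atomic source data already lives in Hadoop. The file paths, schema and filter condition are hypothetical, not taken from this paper.

```
-- Hypothetical ELT-style step: the raw extract already sits in HDFS
-- untransformed; every transformation reads from that base data.
sales    = LOAD '/data/raw/sales' USING PigStorage(',')
           AS (store_id:int, sku:chararray, amount:double, sale_date:chararray);
-- Transform: keep current-year rows, then summarize by store.
recent   = FILTER sales BY sale_date MATCHES '2013-.*';
by_store = GROUP recent BY store_id;
summary  = FOREACH by_store GENERATE group AS store_id,
           SUM(recent.amount) AS total_sales;
STORE summary INTO '/data/transformed/sales_by_store_2013';
```

Because the untransformed base data stays in Hadoop, lineage is clear: every derived dataset traces back to one raw source, and the step can be re-run whenever it is needed.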

[Figure: Transformation of ETL to ELTx (T-T-T) with Hadoop]

Mainframe and Distributed Batch Processing

Hadoop excels at batch processing. By its very nature, Hadoop processes files sequentially, just like all batch processing systems: whenever data is read in Hadoop, a file is opened and read sequentially from beginning to end. The Hadoop framework handles the parallelization "under the covers" and abstracts developers and users from the parallelization layer.

One of the major benefits of doing batch processing on Hadoop is speed. Because of the parallelization, batch processes can run ten times faster than on a single-threaded server or the mainframe.

High Disk and CPU Consumption Applications and Grid Computing

Because of the multiple machines and parallelization within Hadoop, and the local disk on each of the data nodes, Hadoop is an excellent environment for solving big data and heavy central processing unit (CPU) consumption problems. Once again, because of the nature of the Hadoop ecosystem, coders are insulated from the specifics of parallel processing and can instead focus on functionality.

Lower Costs and Reduced Processing Times

For new projects within your organization, Hadoop is a less expensive alternative to traditional computing equipment and software. Targeting traditional platforms (such as the mainframe, robust servers, purchased software, and purchased distributed relational databases) is costly. In addition to the initial up-front purchase costs, most come with ongoing maintenance and subscription fees. With open source Hadoop software running on stripped-down commodity hardware, expenses are kept to a minimum.

Hadoop is also being used to convert legacy applications to a lower cost, open source parallel computing model. There have been many successes converting COBOL and IBM BAL to Pig and Java-based UDFs (user-defined functions); the UDFs handle the logic that is too complex for Pig, as sketched below. Converting mainframe and distributed batch processes to Hadoop reduces MIPS (million instructions per second) consumption, which leads to a smaller mainframe footprint and lower mainframe operating and licensing costs (which are often tied to MIPS).
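For illustration, this is roughly how a Pig script hands complex logic off to a Java UDF. REGISTER and DEFINE are standard Pig statements; the jar name, package and AdjustPrice class are hypothetical stand-ins for whatever converted COBOL rule applies in your shop.

```
-- Hypothetical example: a pricing rule too complex for plain Pig is
-- reimplemented as a Java UDF and invoked from the Pig script.
REGISTER 'legacy-rules.jar';
DEFINE AdjustPrice com.example.udf.AdjustPrice();  -- assumed Java UDF class

items    = LOAD '/data/raw/items' USING PigStorage('|')
           AS (sku:chararray, list_price:double, category:chararray);
-- The UDF encapsulates the converted legacy pricing logic.
adjusted = FOREACH items GENERATE sku,
           AdjustPrice(list_price, category) AS final_price;
STORE adjusted INTO '/data/transformed/adjusted_prices';
```

The split keeps the data flow readable in Pig while isolating the hard-to-express business rules in a small, testable Java class.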

Another area where Hadoop excels is the ETL space. Using Hadoop as an enterprise data hub and keeping data at its most atomic level, direct from its source system, means that your base data is always in Hadoop. This base data can be used for any of the transformations, and it makes data lineage very clear (as opposed to transforming data that has already been transformed once or multiple times). Transformations can be coded in Pig and run when needed. Because of the parallel processing power of Hadoop, these transformation jobs run very quickly, much faster than on a traditional ETL platform.

Data warehouse analytics that involve large quantities of data, and that do not require high-speed SQL access with sub-second response times, also run extremely well on Hadoop. This strategy frees up computing resources (and perhaps disk space) on the more expensive data warehouse platform by offloading the kinds of applications that can easily run on a Hadoop platform.

Massively Parallel Processing

Parallelization is a key benefit of Hadoop. Because of the parallel horsepower in a Hadoop cluster, your application has many computers working on the problem simultaneously. As a result, processing times are dramatically reduced compared to sequentially processed, single-threaded environments. Processes coming from the mainframe, ETL and distributed environments are usually single-threaded.

Reliability of Hadoop

Hadoop is a highly available, fault-tolerant environment. With many machines serving as data nodes, local disk on each, and three replicas of every data block, there are many options for where to process the data.

In a Hadoop cluster, the data nodes are designed to fail (one power supply and one NIC card, as opposed to two of each). The Hadoop framework is set up so that when a data node fails, another picks up for it; the end user or batch process does not even know the node failed, and processing continues on another replica of the data in the cluster.

There are some single points of failure in the Hadoop ecosystem. Failure points include the various master nodes, such as the name node, secondary name node, job tracker, Oozie and HBase Master. If one of these machines fails, the users of Hadoop overall, or of that component, will experience a disruption of service.

Because of this, the master machines are built with redundant equipment to make them more fault tolerant (dual power supplies and dual NICs).

Benefits to Implementing Hadoop in the Enterprise

There are many direct benefits to using Hadoop.

Cost Effectiveness. The first benefit is generally lower costs than traditional computing solutions. Open source software and low-cost commodity hardware help drive the cost of a Hadoop installation down.

Faster Processing. Parallel processing on multiple data nodes drives processing times down versus traditional solutions.

Ease of Scalability. With Hadoop, you can add more servers to the cluster without an outage: just add them, update the cluster configuration files, and Hadoop starts using them immediately. There is also a rebalance option (the Hadoop balancer utility) for redistributing data across the enlarged cluster; the rebalance process should be run periodically as part of your operational procedures.

Fault Tolerance. Built-in fault tolerance, data nodes that are designed to fail, three replicas of the data, and beefed-up servers at the single points of failure (the masters) all contribute to making a Hadoop environment fault tolerant, reliable and always up.

Single Version of Truth. Combine operational processing on the same platform as your enterprise data hub. With the flexibility of Hadoop, your organization has the opportunity to create its single version of the truth on Hadoop as an Enterprise Data Hub. Once data is in the Enterprise Data Hub, it can be consumed by others, including downstream operational processes. And if you move operational processes to Hadoop, the data brought into Hadoop for those processes can be used to build and populate the Enterprise Data Hub.

There are a few indirect benefits to implementing Hadoop as well. Having a sizable Hadoop cluster
