Technical Guide: Unleashing the Power of Hadoop with Informatica


Technical Guide: Unleashing the Power of Hadoop with Informatica
A Data Integration Platform Approach to Turn Big Data into Big Opportunity with Hadoop
WHITE PAPER

This document contains Confidential, Proprietary and Trade Secret Information (“Confidential Information”) of Informatica Corporation and may not be copied, distributed, duplicated, or otherwise reproduced in any manner without the prior written consent of Informatica.

While every attempt has been made to ensure that the information in this document is accurate and complete, some typographical errors or technical inaccuracies may exist. Informatica does not accept responsibility for any kind of loss resulting from the use of information contained in this document. The information contained in this document is subject to change without notice.

The incorporation of the product attributes discussed in these materials into any release or upgrade of any Informatica software product, as well as the timing of any such release or upgrade, is at the sole discretion of Informatica.

Protected by one or more of the following U.S. Patents: 6,032,158; 5,794,246; 6,014,670; 6,339,775; 6,044,374; 6,208,990; 6,850,947; 6,895,471; or by the following pending U.S. Patents: 09/644,280; 10/966,046; 10/727,700.

This edition published September 2011

Table of Contents

Executive Summary
Hadoop Coming of Age
    The “Force” of Hadoop
    Challenges with Hadoop
    Coexistence and Interoperability Between Hadoop and an EDW
    Integrated Information Architecture with Hadoop
    Hadoop Components and the Role of Data Integration
Hadoop Use Cases and Architecture Today
    Top Use Cases for Hadoop
    Hadoop Deployment Scenarios
    High-Level Data Flows
Considerations for Data Integration in Hadoop
    Using a Data Integration Platform for Hadoop
    Six Guiding Principles for Platform-Based Data Integration in Hadoop
Hadoop and Data Integration in Action
Getting Started with Hadoop Projects and Data Integration
Conclusion

Executive Summary

This technical guide outlines how to take advantage of Hadoop to extend your enterprise information architecture in the era of Big Data. Hadoop deployments are maturing as part of an integrated environment to store, process, and analyze large-scale data that is complex in sources and formats. Organizations are now considering Hadoop as part of an overall loosely coupled architecture to handle all types of data cost-effectively.

First, we describe the scope, challenges, and opportunities of evaluating, designing, and deploying Hadoop and its subprojects in the context of building and evolving your enterprise information architecture. We explore and clarify the role of data integration for these projects. Second, we illustrate typical use cases of Hadoop projects and introduce a high-level reference architecture to support business requirements involving semi- and unstructured data sources over petabytes of data. Third, we provide guidance on enterprise data integration for typical use cases, including behavioral modeling, fraud analytics, Web log analysis, and network monitoring. Finally, we summarize the benefits of an integrated environment with Hadoop, including how to get started.

This technical guide addresses the growing market demand for understanding why the ability to leverage Hadoop is critical for an organization seeking to extend its information management practice, including what an integrated environment with Hadoop and the existing infrastructure will look like and what considerations must be given to supporting Hadoop from the data integration perspective.

Hadoop Coming of Age

The “Force” of Hadoop

Interest in Hadoop is intensifying rapidly. Organizations are keen on understanding how and when to take advantage of Hadoop to monetize the value of Big Data, improve competitive advantage, and transition toward the ideal of the data-driven enterprise. For Big Data projects, Hadoop offers two important services: it can store any kind of data from any source, inexpensively and at very large scale, and it can do very sophisticated analysis of that data easily and quickly. With Hadoop, organizations are discovering and putting into practice new data analysis and mining techniques that were previously impractical for performance, cost, and technological reasons.

Hadoop, the Big Data processing platform, is gaining attention and sponsorship from IT executives. Stephen Prentice of Gartner advised, “Entrepreneurial CIOs should aggressively embrace the concepts of Big Data, because it represents a perfect example of the type of technology-enabled strategic business opportunity that plays to their strengths and could deliver significant new revenue or unique competitive differentiation to the business.”1

Unlike traditional relational platforms, Hadoop is able to store any kind of data in native data format and perform a wide variety of analyses and transformations on that data. As a result, for areas in which traditional IT infrastructures are not meeting the demands of Big Data, organizations are considering or using Hadoop as an extension to their environments to tackle the volume, velocity, and variety of Big Data.

The market of Hadoop-based offerings is evolving rapidly. Trying to understand these diverse Hadoop offerings can be daunting when evaluating your next data platform. As Hadoop expert Tom White writes in his book Hadoop: The Definitive Guide, “The good news is that Big Data is here. The bad news is that we are struggling to store and analyze it.”2 Hadoop decisively addresses some of these pains.

Organizations are considering or deploying Hadoop because of its unique strengths:

• Complex data analytics. Not all data fits into the structured rows and columns of a traditional database. Hadoop is ideal for diverse and complex data such as videos, images, text, logs from applications, networks, and the Web, real-time feeds, and call detail records (CDRs), as well as information from sensors and devices, especially when you want to store data in its native format. Many organizations perform sentiment analysis or fraud analysis, combining transaction data with textual and other Big Data from a variety of applications. Extensive data exploration beyond data samples is part of the analytics requirements.

• Storage of large amounts of data. Hadoop stores data without a need to change it into a different format, as the majority of traditional data warehousing processes require. Value of data is not lost in the translation process. When you need to accommodate new data sources and formats but don’t want to lock into a single format, Hadoop is often a good framework that allows flexibility for a data analyst to choose how and when to perform data analysis. Many organizations use Hadoop as a temporary data store or staging area before deciding what to do with data, an opportunity to keep data that used to be discarded for future mining or data preparation tasks.

1 Stephen Prentice, VP and Gartner Fellow, Gartner, “CEO Advisory: Big Data Equals Big Opportunity,” March 2011.
2 Tom White, Hadoop: The Definitive Guide, O’Reilly, 2009.

• Scaling through distributed processing. With scalability from terabytes to petabytes, Hadoop offers a distributed framework capable of processing massive volumes of diverse data as organizational demand changes. Organizations that run Hadoop, from very large clusters to smaller, terabyte-scale systems, characterize Hadoop’s ability to scale both up and down as a definitive advantage in performance and cost-efficiency. With its MapReduce framework, Hadoop can abstract the complexity of running distributed, shared-nothing data processing functions across multiple nodes in the cluster, making it easier to gain the benefits of scaling.

• Cost advantage. Hadoop is open source software that runs on commodity hardware. You can add or retire nodes based on project demand. This combination can reduce the per-terabyte cost for storage and data processing. The ability to store and process data cost-effectively in Hadoop is enabling organizations to harness more data, or even all data, without the need for summarization, for projects that did not previously make business sense or were not economically feasible.

• Power of open source community. Hadoop and its subprojects are supported by a global and growing network of developers and some of the world’s largest Web-centric companies, such as Yahoo! and Facebook. Organizations that choose Hadoop benefit from the open source characteristics of sharing best practices, implementing enhancements and fixes in software and documentation, and evolving the overall community.

Many traditional data-related vendors are now providing solutions that help combine data from Hadoop with the rest of the infrastructure, and this interoperability with Hadoop is also starting to add to the list of benefits above.

Hadoop usage has certainly passed the initial hype period.
In a benchmark research report of large-scale data users, Ventana Research discovered that 54 percent of respondents have deployed or are considering Hadoop.3 The Ventana Research benchmark also revealed that Hadoop users realized substantially greater cost savings, ability to create new products and services, analytic speed, and resource utilization compared to non-Hadoop users.

According to James Kobielus of Forrester, “What’s clear is that Hadoop has already proven its initial footprint in the enterprise data warehousing (EDW) arena: as a petabyte-scalable staging cloud for unstructured content and embedded execution of advanced analytics.”4

3 David Menninger, Ventana Research, “Hadoop and Information Management Benchmark Research Project,” August 2011.
4 James Kobielus, Forrester, “Hadoop: Future of Enterprise Data Warehousing? Are You Kidding?” June 2011.
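The MapReduce framework described above can be sketched in miniature. The following is an illustrative, single-process simulation in Python, not the Hadoop API itself: in a real cluster, the framework runs the map and reduce calls in parallel across nodes and performs the shuffle over the network.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    """Apply the mapper to each input record independently; each call
    may emit any number of (key, value) pairs. Because calls share no
    state, they could run on different nodes."""
    for record in records:
        yield from mapper(record)

def shuffle(pairs):
    """Group intermediate pairs by key, as the framework does between
    the map and reduce phases."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped, reducer):
    """Apply the reducer to each (key, values) group."""
    return {key: reducer(key, values) for key, values in grouped}

# The canonical example: counting page hits in Web logs.
logs = ["/home", "/products", "/home", "/home", "/cart"]
mapper = lambda url: [(url, 1)]
reducer = lambda url, counts: sum(counts)
hits = reduce_phase(shuffle(map_phase(logs, mapper)), reducer)
# hits == {"/cart": 1, "/home": 3, "/products": 1}
```

The same three-phase shape (map, shuffle, reduce) is what Hadoop distributes across a cluster, which is why adding nodes scales the work with little change to the user's logic.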

Challenges with Hadoop

Hadoop is an evolving data processing platform, and market confusion often exists among prospective user organizations. Based on our research and input from Informatica customers, the following list summarizes the challenges in Hadoop deployment.

• Difficulty finding resources and keeping them productive. Data scientists and developers adept at the types of tasks and projects selected for Hadoop are often difficult to find. Hadoop projects create additional silos of data and data-related activities that can be duplicative and hard to manage. It’s also difficult to repurpose assets such as data quality rules and mappings used outside Hadoop for Hadoop projects.

• Pre- and post-processing of data in Hadoop. It is becoming clearer that the data tasks performed in Hadoop need to be further integrated with the rest of the IT systems. Scripting these tasks can often cause problems when organizations want to move data in and out of Hadoop with reliability and efficiency.

• Challenges in effectively tackling the diversity of data and deriving meaning. Hadoop excels at storing a diversity of data, but the ability to derive meaning and make sense of it across all relevant data types can be a major challenge.

• Lack of transparency and auditability over development tasks. Hadoop lacks metadata management and data auditability, and presents difficulties with standardization and reuse.

• Limited data quality and governance. While some data in Hadoop is kept for storage or experimental tasks that do not require a high level of data quality, many organizations are using Hadoop for end-user reporting and analytics, and it is hard to trust the underlying data in such scenarios.

• Maintainability of data integration tasks in Hadoop. When organizations perform data transformation tasks in Hadoop, they typically use a scripting approach that limits the ability to separate transformation logic from a physical execution plan. This often leads to maintainability issues due to the emerging nature of the various Hadoop subcomponents that these scripting tasks may depend upon.

• Other technical challenges. Hadoop has challenges in managing mixed workloads according to user service-level agreements (SLAs), and it’s not capable of readily providing “right time” response to complex queries or real-time data integration support for front-line workforces using both historical and real-time data.
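One way to mitigate the maintainability and reuse problems above is to keep transformation logic in plain, engine-independent functions that can be unit tested locally and then invoked from a Hadoop job, for example as the mapper of a Hadoop Streaming script. The sketch below is hypothetical: the standardization rule and the pipe-delimited CDR field layout are invented for illustration, not taken from any Informatica or Hadoop component.

```python
import sys

def standardize_phone(raw):
    """Normalize a phone number to bare digits; return None if it does
    not contain exactly 10 digits. (A hypothetical data quality rule.)"""
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits if len(digits) == 10 else None

def transform_record(line):
    """Transform one pipe-delimited CDR line: caller|callee|duration.
    Pure function: no Hadoop dependency, so it is testable on its own."""
    caller, callee, duration = line.rstrip("\n").split("|")
    caller = standardize_phone(caller)
    callee = standardize_phone(callee)
    if caller is None or callee is None:
        return None  # in a real pipeline, route to a rejects file
    return f"{caller}|{callee}|{int(duration)}"

if __name__ == "__main__":
    # When run as a Hadoop Streaming mapper, records arrive on stdin
    # and results are emitted on stdout; the same functions can be
    # reused unchanged in a non-Hadoop batch job.
    for line in sys.stdin:
        out = transform_record(line)
        if out is not None:
            print(out)
```

Because the logic lives apart from the execution mechanism, changes in the underlying Hadoop subcomponents affect only the thin driver at the bottom, not the transformation rules themselves.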
