Data Virtualization For Dummies

Transcription

These materials are © 2019 John Wiley & Sons, Ltd. Any dissemination, distribution, or unauthorized use is strictly prohibited.

Data Virtualization

Denodo Special Edition

by Lawrence C. Miller

Data Virtualization For Dummies®, Denodo Special Edition

Published by: John Wiley & Sons, Ltd., The Atrium, Southern Gate, Chichester, West Sussex, www.wiley.com

© 2019 by John Wiley & Sons, Ltd., Chichester, West Sussex

Registered Office

John Wiley & Sons, Ltd., The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, United Kingdom

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, scanning or otherwise, except as permitted by the UK Copyright, Designs and Patents Act 1988, without the prior written permission of the Publisher. For information about how to apply for permission to reuse the copyright material in this book, please see our website http://www.wiley.com/go/permissions.

Trademarks: Wiley, For Dummies, the Dummies Man logo, The Dummies Way, Dummies.com, Making Everything Easier, and related trade dress are trademarks or registered trademarks of John Wiley & Sons, Inc. and/or its affiliates in the United States and other countries, and may not be used without written permission. Denodo and the Denodo logo are trademarks or registered trademarks of Denodo Technologies. All other trademarks are the property of their respective owners. John Wiley & Sons, Ltd., is not associated with any product or vendor mentioned in this book.

LIMIT OF LIABILITY/DISCLAIMER OF WARRANTY: WHILE THE PUBLISHER AND AUTHOR HAVE USED THEIR BEST EFFORTS IN PREPARING THIS BOOK, THEY MAKE NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE ACCURACY OR COMPLETENESS OF THE CONTENTS OF THIS BOOK AND SPECIFICALLY DISCLAIM ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. IT IS SOLD ON THE UNDERSTANDING THAT THE PUBLISHER IS NOT ENGAGED IN RENDERING PROFESSIONAL SERVICES AND NEITHER THE PUBLISHER NOR THE AUTHOR SHALL BE LIABLE FOR DAMAGES ARISING HEREFROM.
IF PROFESSIONAL ADVICE OR OTHER EXPERT ASSISTANCE IS REQUIRED, THE SERVICES OF A COMPETENT PROFESSIONAL SHOULD BE SOUGHT.

For general information on our other products and services, or how to create a custom For Dummies book for your business or organization, please contact our Business Development Department in the U.S. at 877-409-4177, contact info@dummies.biz, or visit www.wiley.com/go/custompub. For information about licensing the For Dummies brand for products or services, contact BrandedRights&Licenses@Wiley.com.

ISBN 978-1-119-55849-1 (pbk); ISBN 978-1-119-55851-4 (ebk)

Printed in Great Britain

10 9 8 7 6 5 4 3 2 1

Publisher’s Acknowledgments

We’re proud of this book and of the people who worked on it. For details on how to create a custom For Dummies book for your business or organization, contact info@dummies.biz or visit www.wiley.com/go/custompub. For details on licensing the For Dummies brand for products or services, contact BrandedRights&Licenses@Wiley.com.

Some of the people who helped bring this book to market include the following:

Project Editor: Martin V. Minner
Associate Publisher: Katie Mohr
Editorial Manager: Rev Mengle
Business Development Representative: Frazer Hossack
Production Editor: Tamilmani Varadharaj
Denodo Review Team: Paul Moxon, Pablo Alvarez, Ravi Shankar, Lakshmi Randall, Becky Smith, Amy Flippant

Introduction

Organizations are challenged by ever-growing data volumes, as well as data types that are increasingly diverse. With the advent of big data and the proliferation of multiple information channels, organizations must store, discover, access, and share massive volumes of traditional and new data sources. At the same time, more business opportunities can be realized only if large and diverse sources can be integrated in near-real time, if not in real time. In today’s complex data landscape, it is no longer feasible to replicate data from myriad sources into a central repository because of the associated costs and delays in accessing the data. Cloud storage architectures have helped, but they still establish independent data silos that cannot be seamlessly integrated with other systems, such as traditional data warehouses.

Data virtualization is a modern approach to data integration. It transcends the limitations of traditional techniques by delivering a simplified, unified, and integrated view of trusted business data in real time or near-real time, as needed by consuming applications, processes, analytics, or business users.

About This Book

Data Virtualization For Dummies, Denodo Special Edition, consists of seven chapters that explore

»» The challenges of data silos, data overload, and regulatory compliance (Chapter 1)
»» What data virtualization is and how it helps businesses (Chapter 2)
»» Data virtualization use cases (Chapter 3)
»» How data virtualization enables big data solutions (Chapter 4)
»» Data virtualization in the cloud (Chapter 5)
»» How to get started with data virtualization (Chapter 6)
»» Key things to know about data virtualization (Chapter 7)

Foolish Assumptions

It’s been said that most assumptions have outlived their uselessness, but I assume a few things nonetheless. Mainly, I assume that you are someone who uses or manages data in your organization, such as:

»» A data warehouse manager, data engineer, or database administrator responsible for making data available to the business in an agile, cost-effective, and secure manner
»» A data analyst or data scientist needing fast and reliable access to large and diverse data sets
»» A business user who regularly needs to access data to make informed and timely decisions with all the best available data

Icons Used in This Book

Throughout this book, I occasionally use special icons to call attention to important information. Here’s what to expect:

This icon points out information you should commit to your nonvolatile memory, your gray matter, or your noggin — along with anniversaries and birthdays.

You won’t find a map of the human genome here, but if you seek to attain the seventh level of NERD-vana, perk up! This icon explains the jargon beneath the jargon.

Tips are appreciated, never expected — and I sure hope you’ll appreciate these tips. This icon points out useful nuggets of information.

These alerts point out the stuff your mother warned you about (well, probably not), but they do offer practical advice to help you avoid potentially costly or frustrating mistakes.

IN THIS CHAPTER
»» Eliminating siloed data in the enterprise
»» Dealing with different data sources and types
»» Understanding regulatory compliance requirements
»» Learning the basics of data virtualization

Chapter 1
Data, Data Everywhere

In this chapter, you learn about modern data challenges including data silos, disparate data sources and types, and regulatory compliance. You also learn what data virtualization is — and what it isn’t.

Unlocking Data Silos

Data silos — data sources that can’t easily be shared across systems and applications — have plagued the IT and business landscape for many years. These silos exist within organizations for a variety of reasons, such as:

»» Older, legacy systems have trouble communicating with more modern systems.
»» On-premises systems have difficulty communicating with cloud-based systems.
»» Multiple disparate storage systems have been deployed over the years as existing systems near storage capacity or performance degrades.
»» Some systems work only with specific applications.

»» Some systems are configured to be accessed only by specified individuals or groups.
»» Companies acquire other companies with systems that are configured differently.

Data silos make it challenging for business users to access and analyze all of the available data within an organization. Data silos can lead to inaccurate results or conclusions and delayed decision making with incomplete or imperfect data. The lack of a single “source of truth” also creates doubt in the veracity of the data.

Managing the Data Swamp

Managing the deluge of digital data is challenging for every business today. In addition to the sheer volume of data, businesses must manage multiple different data types — including structured, unstructured, and semi-structured data — from multiple data sources. These different data types must often be extracted from their sources, transformed into a different format, and loaded into the consuming application (a process known as Extract, Transform, and Load, or ETL) before they can be used by the business. ETL processes (discussed in Chapter 2) are often scripted or manual processes that require IT assistance, run in scheduled batches, and are inflexible, which introduces further complexity and delay.

Navigating the Compliance Landscape

New legislation and regulations mandating data protection requirements are a constant and costly challenge for organizations in practically every industry. Regulations such as the U.S. Health Insurance Portability and Accountability Act (HIPAA), the U.S. Gramm-Leach-Bliley Act (GLBA), and Canada’s Personal Information Protection and Electronic Documents Act (PIPEDA) establish data privacy, protection, and retention requirements for certain businesses and industries.

More recently, the European Union’s (EU) General Data Protection Regulation (GDPR) went into effect on May 25, 2018. All

businesses that serve EU citizens, regardless of where the business is located, are required to comply. The GDPR details how companies must protect personal information. Companies that fail to comply with the GDPR will not only be subject to hefty fines, but may also face lawsuits and additional audits. To comply with the GDPR, companies will have to demonstrate that personal data is

»» Processed lawfully and fairly, and in a transparent way.
»» Collected for specific, explicit, and legitimate purposes.
»» Limited to what is necessary for processing.
»» Kept accurate and up to date.
»» Stored so that the subject is identified only when necessary.
»» Processed in a secure manner so it does not fall into the wrong hands or become lost, damaged, or destroyed.
»» Protected “by design.” All new systems must be developed with privacy in mind.

Companies need a bird’s-eye view into all of their data, as well as a way to establish security controls over the entire infrastructure from a single point. Data virtualization provides this capability, enabling companies to quickly and easily comply with data protection mandates without investing in new hardware or re-building existing systems from the ground up.

What Is Data Virtualization?

Data virtualization delivers a simplified, unified, and integrated view of trusted business data in real time or near-real time as needed by the consuming applications, processes, analytics, or business users. Data virtualization integrates data from disparate sources, locations, and formats, without replicating the data, to create a single, virtual data layer that delivers unified data services to support multiple applications and users (see Figure 1-1). The result is faster access to all data, less replication and cost, and more agility to change.

FIGURE 1-1: Data virtualization integrates data from disparate sources, locations, and formats to support multiple applications and users.

Whereas most data integration solutions move a copy of the data to a new, consolidated source, data virtualization offers a completely different approach. Rather than moving the data, data virtualization provides a view of the integrated data, leaving the source data exactly where it is. Companies do not have to pay the costs of moving and housing the data, and yet they gain the benefits of data integration.

Data virtualization performs many of the same transformation and quality functions as traditional data integration — such as ETL, data replication, data federation, Enterprise Service Bus (ESB), and others — but leverages modern technology to deliver real-time data integration at a lower cost, with more speed and agility. It can replace traditional data integration and reduce the need for replicated data marts and data warehouses in many cases.

Data virtualization is also an abstraction layer and a data services layer. In this sense, it is highly complementary to use between original and derived data sources, ETL, ESB and other middleware, applications, and devices, whether on-premises or cloud-based.
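The “view, don’t move” idea is easy to see in miniature. The following Python sketch is purely illustrative — the source names (`crm`, `billing`) and the view function are invented for this example, not part of any real platform — but it captures the key point: the integrated view is assembled at query time, so nothing is copied and the sources stay exactly where they are.

```python
# Two independent "silos" that stay exactly where they are.
# (Hypothetical in-memory stand-ins for real databases.)
crm = {101: {"name": "Acme Corp", "region": "EMEA"}}
billing = {101: {"balance": 2500.0}}

def customer_view(customer_id):
    """An integrated view assembled on demand -- no copy is made ahead of time."""
    return {**crm[customer_id], **billing[customer_id]}

# Because the view reads the live sources, a change in a silo is
# visible immediately, with no refresh or reload step.
billing[101]["balance"] = 1800.0
print(customer_view(101))
# → {'name': 'Acme Corp', 'region': 'EMEA', 'balance': 1800.0}
```

Contrast this with an ETL-style copy: had the balance been copied into a warehouse before the update, the copy would now be stale until the next batch run.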

Data virtualization delivers the following key capabilities:

»» Logical abstraction and decoupling: Disparate data sources, middleware, and consuming applications that use or expect specific platforms and interfaces, formats, schema, security protocols, query paradigms, and other idiosyncrasies can now interact easily through data virtualization.

»» Data federation on steroids: Data federation is a subset of data virtualization, but enhanced with more intelligent real-time query optimization, caching, in-memory, and hybrid strategies that are automatically (or manually) chosen, based on source constraints, application need, and network awareness.

»» Semantic integration of structured and unstructured data: Data virtualization is one of the few technologies that bridge the semantic understanding of unstructured and web data with the schema-based understanding of structured data to enable integration and data quality improvements.

»» Agile data services provisioning: Data virtualization promotes the application programming interface (API) economy. Any primary, derived, integrated, or virtual data source can be made accessible in a different format or protocol than the original, with controlled access, in a matter of minutes.

»» Unified data governance and fine-grained security with complete auditability: Data virtualization enables fine-grained control over sensitive customer information stored across multiple systems by establishing a single, unified access layer across on-premises and off-premises systems. All data is made discoverable and can be easily integrated through a single virtual layer that exposes redundancy and quality issues faster. Data virtualization imposes data model governance and security from source to output data services, and consistency in integration and data quality rules.
When data consumers need to access a source, they do so through the data virtualization layer, which contains the metadata for accessing each source, and returns a secure, virtualized view of the data to the consumer in real time. These views are traceable and auditable, and will be delivered only to authorized consumers.

»» Eliminates unnecessary data movement: With a data virtualization layer in place, no data replication is required for reporting purposes, and no Extract, Transform, and Load (ETL) scripts must be rewritten. A data virtualization layer operates with a company’s existing infrastructure, configured exactly as it is. The data virtualization layer merely abstracts the access functions so that users perceive the data as existing in a single virtual repository. However, if for performance reasons data must be persisted, data virtualization tools also offer simple ways to persist a data set by simply enabling some settings in the model. Replication becomes just another option, not a necessity.

»» Complete data lineage and agile business rules: At any point in time, companies can understand and report on the full lineage of any sensitive data set, including its original source, all views, and all modifications. In addition, through the data virtualization layer, companies can establish sophisticated rules for automating compliance, such as masking data on the fly so it cannot be viewed by users who lack the requisite credentials. Such rules can be applied quickly and effectively across diverse systems because they are applied in the data virtualization layer.

»» Secures data-at-rest and data-in-motion: The data virtualization layer can perform role-based authentication at any level, such as guest, employee, or corporate; apply data-specific permissions, including row- and column-level masking; and define schema-wide permissions and policy-based security.
The virtualization layer secures data in transit via Secure Sockets Layer/Transport Layer Security (SSL/TLS) protocols and authenticates users via industry-proven protocols such as Lightweight Directory Access Protocol (LDAP), pass-through with Kerberos, Windows Single Sign-On (SSO), Open Authorization (OAuth), Simple and Protected GSS-API Negotiation Mechanism (SPNEGO), Security Assertion Markup Language (SAML), and Java Database Connectivity/Open Database Connectivity (JDBC/ODBC) security.

»» Facilitates privacy by design: Data virtualization is also particularly well suited to helping companies comply with the GDPR’s “protected by design” requirement. By definition, a data virtualization layer does not require a source to be of a prescribed type, or to be accessed in a certain way. New sources can easily be added to the infrastructure by connecting them to the data virtualization layer, where they are immediately subject to the same security controls and auditability as any other source on the system, irrespective of the data source technology.

Data virtualization delivers abstracted and integrated information in real time from disparate sources to multiple applications and users. It is also easy to build, consume, and maintain. To build virtual data services, the user follows three simple steps (see Figure 1-2):

»» Connect and virtualize any source. Quickly access disparate structured and unstructured data sources using included connectors. Introspect their metadata and expose them as normalized source views in the data virtualization layer.

»» Combine and integrate into business data views. Combine, integrate, transform, and cleanse source views into canonical, model-driven business views of data in a graphical user interface (GUI) or through documented scripting.

»» Connect and secure data services. Any of the virtual data views can be secured and published as SQL views or dozens of other data services formats.

FIGURE 1-2: Building virtual data services.
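The three build steps above can be sketched in miniature. Everything in this Python sketch — the `connect`, `combine`, and `publish` functions, the source and view names, and the role check — is invented for illustration; it is not Denodo’s actual API, just a toy model of connect → combine → publish.

```python
sources = {}   # step 1: registered, normalized source views
views = {}     # step 2: combined business views
services = {}  # step 3: published, secured data services

def connect(name, rows):
    """Step 1: register a source and expose it as a source view."""
    sources[name] = rows

def combine(name, fn):
    """Step 2: define a business view as a transformation over sources."""
    views[name] = fn

def publish(name, view_name, allowed_roles):
    """Step 3: publish a view as a service behind a simple role check."""
    def service(role):
        if role not in allowed_roles:
            raise PermissionError("not authorized")
        return views[view_name](sources)  # evaluated at request time
    services[name] = service

connect("orders", [{"id": 1, "amount": 40}, {"id": 2, "amount": 60}])
combine("total_sales", lambda s: sum(r["amount"] for r in s["orders"]))
publish("sales_api", "total_sales", allowed_roles={"analyst"})

print(services["sales_api"]("analyst"))  # → 100
```

Note that the published service holds only a reference to the view, and the view holds only logic — the data itself lives in the registered sources until a consumer asks for it.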

WHAT DATA VIRTUALIZATION IS NOT

Some vendors use buzzwords for marketing other products to capitalize on the popularity of data virtualization. To help dispel any confusion, remember that data virtualization is not:

»» Data visualization: It sounds similar, but visualization refers to the display of data to end-users graphically as charts, graphs, maps, reports, and so on. Data virtualization is middleware that provides data services to other data visualization tools and applications. Although it has some data visualization for users and developers, that is not the main use.

»» A replicated data store: Data virtualization does not normally persist or replicate data from source systems to itself. It only stores metadata for the virtual views and integration logic. If caching is enabled, it stores some data temporarily in a cache or in-memory database. Virtual data can be persisted if desired by simply invoking it as a source using ETL. Thus, data virtualization is a powerful, but lightweight and agile, solution.

»» A logical data warehouse: A logical data warehouse is an architectural concept and not a platform. Data virtualization is an essential technology used in creating a logical data warehouse by combining multiple data sources, data warehouses, and big data stores like Hadoop.

»» Data federation: Data virtualization is a superset of capabilities that includes advanced data federation.

»» Virtualized data storage: Some companies and products use the same term, data virtualization, to describe virtualized database software or storage hardware virtualization solutions. They do not provide real-time data integration and data services across disparate structured and unstructured data sources.

»» Virtualization: When the term virtualization is used alone, it typically refers to hardware virtualization — servers, storage disks, networks, and so on.

IN THIS CHAPTER
»» Assessing the pros and cons of ETL, ESB, and data virtualization
»» Using traditional data integration techniques and data virtualization together
»» Enabling business agility with data virtualization
»» Empowering business users with self-service access to real-time data

Chapter 2
Introducing Data Virtualization

In this chapter, you look at traditional data integration techniques such as Extract, Transform, and Load (ETL) processes and Enterprise Service Bus (ESB) architectures, as well as how data virtualization complements these techniques, enables business agility, and makes self-service a reality for business users.

Going Beyond Traditional Data Integration

The problem with data silos (discussed in Chapter 1) is that no one can easily query all of the available data. Instead, each data silo must be queried separately, and the results then must be manually consolidated. This process is costly, time-consuming, and

inefficient. To bring the data together, companies typically use one or more of the following data integration strategies:

»» Extract, Transform, and Load (ETL) processes, which copy the data from the different silos and move it to a central location, such as a data warehouse.
»» Enterprise Service Buses (ESBs), which establish a communication system for applications, enabling them to share information.
»» Data virtualization, which creates real-time, integrated views of the data in data silos and makes them available to applications, analysts, and business users.

Extract, Transform, and Load (ETL) processes

Extract, Transform, and Load (ETL) processes were the first data integration strategies, introduced as early as the 1970s. The basic ETL process follows these steps:

1. The data is extracted from the source.
2. The extracted copy of the data is transformed into the format and structure required by its final destination.
3. The transformed copy of the data is loaded into its final destination, such as an operational data store, a data mart, or a data warehouse.

Some processes do the transformation in the final step and are therefore called “ELT” processes, but the basic concept is the same.

Pros and cons of ETL processes include

»» Pros:
• ETL processes are efficient and effective at moving data in bulk.
• The technology is well understood and supported by established vendors.

• ETL tools have features that sufficiently support bulk/batch data movement.
• Most organizations have in-house ETL capabilities.

»» Cons:
• Moving data is not always the best approach because this results in a new repository that must be maintained.
• Large organizations can have thousands of ETL processes running each night, synchronized by scripts that are difficult to modify.
• Typically, ETL processes are not collaborative; the end-users must wait until the data is ready.
• ETL processes cannot handle today’s data volumes and complex data types.

Enterprise Service Bus (ESB)

ESBs, introduced in 2002, use a message bus to exchange information between applications. The message bus essentially acts as a translator between the applications, enabling the applications to communicate via the bus. An ESB decouples systems and allows them to communicate without depending on, or even knowing about, other systems on the bus (see Figure 2-1).

FIGURE 2-1: An ESB decouples systems and applications and enables communication between them.
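The decoupling in Figure 2-1 boils down to a publish/subscribe pattern: producers and consumers know only the bus, never each other. This minimal Python sketch (the class, topic names, and payload are all invented for illustration; real ESB products add routing, transformation, and reliability on top) shows the core mechanic.

```python
from collections import defaultdict

class MessageBus:
    """Toy message bus: applications interact only with the bus."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        # A consuming application registers interest in a topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic, message):
        # The producer needs no reference to any consumer.
        for handler in self._subscribers[topic]:
            handler(message)

bus = MessageBus()
received = []
bus.subscribe("order.created", received.append)

bus.publish("order.created", {"order_id": 42})
print(received)  # → [{'order_id': 42}]
```

Because the producer publishes to a topic rather than calling a consumer directly, either side can be replaced without touching the other — which is exactly why ESBs displaced brittle point-to-point integrations.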

ESBs form the underpinnings of service-oriented architecture (SOA), in which applications can easily share services across an organization. ESBs were born out of the need to move away from point-to-point integrations which, like ETL scripts, are hard to maintain over time.

Pros and cons of ESBs include

»» Pros:
• Applications are decoupled.
• They can be used to orchestrate business logic using message flows.
• ESB technology is mature and is supported by established vendors.
• ESBs can address operational scenarios by using messages to trigger events.

»» Cons:
• ESBs cannot integrate application data to deliver on analytical use cases.
• Queries are static and can only be scheduled; ESBs do not easily support ad hoc queries.
• Database queries are restricted to one source at a time. Joins and other multiple-source functions are performed in memory, which drains resources.
• ESBs are suitable only for operational use cases that involve small result sets.

Data virtualization

Data virtualization creates integrated views of data drawn from disparate sources, locations, and formats, without replicating the data, and delivers these views, in real time, to multiple applications and users. Data virtualization can draw from a wide variety of structured, semi-structured, and unstructured sources, and can deliver to a wide variety of consumers.

Because no replication is involved, the data virtualization layer contains no source data; it contains only the metadata required to access each of the applicable sources, as well as any global instructions that the organization may want to implement, such as security or governance controls.
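That metadata-only design can be illustrated with a toy sketch. Everything here is hypothetical — the `SOURCES` registry, the catalog entries, and the masking rule are invented for this example — but it shows the shape of the idea: the layer stores only where to fetch data and which global rules to apply, and the data itself is retrieved from the source at query time.

```python
# Hypothetical stand-in for live source systems (a real layer would
# hold connection metadata for databases, APIs, files, and so on).
SOURCES = {
    "hr_db": lambda: [{"employee": "Ada", "ssn": "123-45-6789"}],
}

class VirtualLayer:
    """Holds no source data -- only metadata and governance rules."""

    def __init__(self):
        # Metadata: which source backs each view, which columns to mask.
        self.catalog = {"employees": {"source": "hr_db", "mask": ["ssn"]}}

    def query(self, view):
        meta = self.catalog[view]
        rows = SOURCES[meta["source"]]()  # fetched from the source at query time
        # Global governance rule applied on the fly, not stored anywhere.
        return [
            {k: ("***" if k in meta["mask"] else v) for k, v in row.items()}
            for row in rows
        ]

layer = VirtualLayer()
print(layer.query("employees"))  # → [{'employee': 'Ada', 'ssn': '***'}]
```

Because the masking rule lives in the layer’s catalog rather than in any source system, it applies uniformly no matter where the sensitive column physically resides.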

Users and applications query the data virtualization layer which, in turn, gets the data from various sources (see Figure 2-2). The data virtualization layer abstracts users and applications from the complexities of access. The data virtualization layer appears as a single, unified repository to all consumers.

FIGURE 2-2: Data virtualization connects disparate data sources, combines related data into views, and publishes the data to applications for all data consumers.

Fundamental attributes that define the capabilities of a true data virtualization platform include the following:

»» Universal data access to any source or data type: Data engines automatically connect, navigate, and extract data from any internal or external source and all data types, including structured, unstructured, and web.

»» Unified virtual data layer: Powerful transformations and relationships are built using an integrated modeling and execution environment to normalize, transform, improve quality, and relate data across heterogeneous source types using common metadata and semantics. An extended relational data model allows disparate data types to be represented natively in the virtual layer, thereby minimizing effort and maximizing performance.

»» Universal data publishing: Combined information is published as reusable data services in multiple formats such as SQL query, Simple Object Access Protocol (SOAP), Representational State Transfer (REST), and Open Data Protocol (OData) web services, messaging, mobile feeds, keyword-based search, and so on. Hybrid delivery modes (such as virtual real time, cache, batch, and message-based) to consuming applications are also supported.

»» Agile high performance: Advanced real-time dynamic optimization is supplemented by intelligent caching and scheduled batches for flexible mixed workloads. Read/write access with enterprise-class reliability and scalability — even for web and unstructured sources — is supported.

»» Unified data governance: An enterprise-wide single entry point for data and metadata management, security, audit, logging, and monitoring is enabled through built-in tools and instrumentation, as well as integration to external data management tools.

»» Agile development of pervasive, self-service data services: Complexity is hidden from application developers and business users. Consuming applications and data sourc
