Data Governance For The Data Lake

Transcription

Data Governance for the Data Lake
Improving Agility, Flexibility, and Value
Donna Burbank
Global Data Strategy Ltd.
Nov 16th, 2016

Donna Burbank

Donna is a recognised industry expert in information management with over 20 years of experience in data strategy, information management, data modeling, metadata management, and enterprise architecture. Her background is multifaceted across consulting, product development, product management, brand strategy, marketing, and business leadership.

She is currently the Managing Director at Global Data Strategy, Ltd., an international information management consulting company that specialises in the alignment of business drivers with data-centric technology. In past roles, she has served in key brand strategy and product management roles at CA Technologies and Embarcadero Technologies for several of the leading data management products in the market.

As an active contributor to the data management community, she is a long time DAMA International member and is the President of the DAMA Rocky Mountain chapter. She was also on the review committee for the Object Management Group’s Information Management Metamodel (IMM) and a member of the OMG’s Finalization Taskforce for the Business Process Modeling Notation (BPMN).

She has worked with dozens of Fortune 500 companies worldwide in the Americas, Europe, Asia, and Africa and speaks regularly at industry conferences. She has co-authored two books: Data Modeling for the Business and Data Modeling Made Simple with ERwin Data Modeler and is a regular contributor to industry publications such as DATAVERSITY, EM360, & TDAN. She can be reached at donna.burbank@globaldatastrategy.com. Donna is based in Boulder, Colorado, USA.

Follow on Twitter @donnaburbank

Global Data Strategy, Ltd. 2016

Agenda
What we’ll cover today
- Data Lakes & Big Data
  - Big Data – A Technical & Cultural Paradigm Shift
  - Big Data in the Larger Information Management Landscape
- Data Governance for the Data Lake
  - To Govern or Not to Govern: Identifying which data assets it makes sense to control (and what to leave alone)
  - Rollout & Value: Delivering “quick wins” to the organization
  - Rules of Engagement: Identifying a practical framework & operating model for the Data Lake environment
  - Stakeholder Engagement: Working with various roles within the organization in a way that makes sense for each, from business users, to data architects, to data scientists, and more
- Summary & Questions

Big Data – A Technical & Cultural Paradigm Shift

Traditional Relational Technologies and “Big Data”: a Paradigm Shift

Traditional (Design, then Implement)
- Top-Down, Hierarchical
- Design, then Implement
- “Passive”, Push technology
- “Manageable” volumes of information
- “Stable” rate of change
- Business Intelligence

Big Data (Discover, then Analyze)
- Distributed, Democratic
- Discover and Analyze
- Collaborative, Interactive
- Massive volumes of information
- Rapid and Exponential rate of growth
- Statistical Analysis

“Traditional” Way of Looking at the World: Hierarchies
- Carolus Linnaeus in 1735 established a hierarchy/taxonomy for organizing and identifying biological ...

“New” Way of Looking at the World - Emergence
“In philosophy, systems theory, science, and art, emergence is the way complex systems and patterns arise out of a multiplicity of relatively simple interactions.” - Wikipedia

Sample messages:
- I love my new Levis jeans.
- Is Levi coming to my party?
- Sale #LEVIS 20% at Macys.
- LOL. TTYL. Leving soon.

Data Warehouse vs. Data Lake

Data Warehouse: A Data Warehouse is a storage repository that holds current and historical data used for creating analytical reports. Data structures & requirements are pre-defined, and data is organized & stored according to these definitions.

Data Lake: A Data Lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure & requirements are not defined until the data is needed.

What is Big Data?
- Big Data is often characterised by the “3 Vs”:
  - Volume: Is there a high volume of data? (e.g. terabytes per day)
  - Velocity: Is data generated or changed at a rapid pace? (e.g. per second, sub-second)
  - Variety: Is data stored across multiple formats? (e.g. machine data, OSS data, log files)
- Understanding and managing these sources, and integrating them into the larger Business Intelligence ecosystem, makes it possible to gain valuable insights from data.
  - Social Media Sentiment Analysis – e.g. What are customers saying about our products?
  - Web Browsing Analytics – Customer usage patterns
  - Internet of Things (IoT) Analysis – e.g. Sensor data, Machine log data
  - Customer Support – e.g. Call log analysis
- This ability leads to the “4th V” of Big Data – Value.
  - Value: Valuable insights gained from the ability to analyze and discover new patterns and trends from high-volume and/or cross-platform systems.

Figure: Volume, Velocity, and Variety lead to Value.

The Business Case is Similar

Business request: “Tell me what customers are saying about our product.”

Big Data
- Sample raw messages: “I love my new Levis jeans.” / “I want to return these Levis – they don’t look like the ad.” / “Is Levi coming to my party?” / “Sale #LEVIS 20% at Macys.” / “LOL. TTYL. Leving soon.”
- Data Scientist: “I’ll need to input the raw data from thousands of sources, and write a program to parse and analyze the relevant information.”

Traditional Databases
- DBA: “Which customer database do you want me to pull this from? We have a ...”
- Data Architect: “And, by the way, the databases all store customer information in a different format. “CUST NM” on DB2, “cust last nm” on Oracle, etc. It’s a mess.”
- Systems shown: SAP, DB2, Oracle
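The naming mismatch on this slide (“CUST NM” on DB2, “cust last nm” on Oracle) is commonly handled by mapping each physical column name to one agreed business term. The following is a minimal sketch of that idea, not the presenter's method. The mapping table, the function names, and the choice to treat both columns as the same term are assumptions made for illustration.

```python
import re
from typing import Optional

# Hypothetical mapping from (platform, normalised column name) to a business term.
# Treating both columns as "Customer Last Name" is an assumption for illustration.
COLUMN_TO_TERM = {
    ("db2", "cust_nm"): "Customer Last Name",
    ("oracle", "cust_last_nm"): "Customer Last Name",
}


def normalise(column_name: str) -> str:
    """Lower-case a column name and collapse spaces/punctuation to underscores."""
    return re.sub(r"[^a-z0-9]+", "_", column_name.strip().lower()).strip("_")


def business_term(platform: str, column_name: str) -> Optional[str]:
    """Return the agreed business term for a physical column, or None if unmapped."""
    return COLUMN_TO_TERM.get((platform.lower(), normalise(column_name)))


print(business_term("DB2", "CUST NM"))
print(business_term("Oracle", "cust last nm"))
```

An unmapped column returns None, which gives a data steward a simple list of names that still need a definition.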

The 5th “V” - Veracity
- Only through proper Governance, Data Quality Management, Metadata Management, etc., can organizations achieve the 5th “V” – Veracity.
  - Veracity: Trust in the accuracy, quality and content of the organization’s information assets.
  - i.e. The hard work doesn’t go away with Big Data.

Data Science: Raw data used in Self-Service Analytics and BI environments is often so poor that many data scientists and BI professionals spend an estimated 50 – 90% of their time cleaning and reformatting data to make it fit for purpose. (Source: DataCenterJournal.com)

Data Lakes: The absence of commonly understood and shared metadata and data definitions is cited as one of the main impediments to the success of Data Lakes. (Source: Radiant Advisors)

Digitization & Data Quality: 71% of interviewees expect digitization to grow their business. But 70% say the biggest barrier is finding the right data; 62% cite inconsistent data. (Source: Stibo Systems)

Data Science: Correcting poor data quality is a Data Scientist’s least favorite task, consuming on average 80% of their working day. (Source: Forbes 2016)

Combining DW & Big Data Can Provide Valuable Information
- There are numerous ways to gain value from data
- Relational Database and Data Warehouse systems are one key source of value
  - Customer information
  - Product information
- Big Data can offer new insights from data
  - From new data sources (e.g. social media, IoT)
  - By correlating multiple new and existing data sources (e.g. network patterns & customer data)
- Integrating DW and Big Data can provide valuable new insights. Examples include:
  - Customer Experience Optimization
  - Churn Management
  - Products & Services Innovation

Figure: the Data Warehouse combined with Big Data yields New Insights.

Big Data is Part of a Larger Enterprise Landscape
A Successful Data Strategy Requires Many Inter-related Disciplines
- “Top-Down” alignment with business priorities
- Managing the people, process, policies & culture around data
- Leveraging & managing data for strategic advantage
- Coordinating & integrating disparate data sources
- “Bottom-Up” management & inventory of data sources

Data Governance for the Data Lake

Applying a Structured Data Governance Framework
- Data Issues & Challenges
- Business Goals & Objectives
- Vision & Strategy
- Organization & People
- Process & Workflows
- Data Management & Measures
- Tools & Technology
- Culture & Communication

DATA GOVERNANCE
- What my friends think I do
- What my mom thinks I do
- What society thinks I do
- What my coworkers think I do
- What I think I do
- What I actually do
Caption: Driving the Success of the Business

How can we Transform our Business through Data?

Business Optimization: Becoming a Data-Driven Company (“How do we do what we do better?”)
- Making the Business More Efficient
- Better Marketing Campaigns
  - Higher quality customer data, 360 view of customer, competitive info, etc.
- Better Products
  - Data-Driven product development, Customer usage monitoring, etc.
- Better Customer Support
  - Linking customer data with support logs, network outages, etc.
- Lower Costs
  - More efficient supply chain
  - Reduced redundancies & manual effort

Business Transformation: Becoming a Data Company (“How do we do something different?”)
- Changing the Business Model via Data – data becomes the product
- Monetization of Information: examples across multiple industries including:
  - Telecom: location information, usage & search data, etc.
  - Retail: Click-stream data, purchasing patterns
  - Social Media: social & family connections, purchasing trends & recommendations, etc.
  - Energy: Sensor data, consumer usage patterns, smart metering, etc.

Data Lakes can support both of these paradigms.

Mapping Business Drivers to Data Management Capabilities
Business-Driven Prioritization

Business Drivers
- External Drivers: Digital Self Service; Online Community & Social Media; Increasing Regulation Pressures; Customer Demand for Instant Provision
- Internal Drivers: Brand Reputation; Revenue Growth; Community Building; Cost Reduction; 360 View of Customer; Targeted Marketing

Stakeholder Challenges
- Lack of Business Alignment: Data spend not aligned to Business Plans; Business users not involved with data Strategy
- 360 View of Customer Needed: Aligning data from many sources; Geographic distribution across regions
- Data Quality: Bad customer info causing Brand damage; Completeness & Accuracy Needed
- Cost of Data Management: Manual entry increases costs; Data Quality rework; Software License duplication
- Integrating Data: Siloed systems; Time-to-Solution; Historical data
- No Audit Trails: No lineage of changes; Fines had been levied in past for lack of compliance
- New Data Sources: Exploiting Unstructured Data; Access to External & Social Data

Data Management Capabilities: Data Governance; Master Data Management; Data Warehousing; Business Intelligence; Big Data Analytics; Data Quality; Data Architecture & Modeling; Data Asset Planning & Inventory; Data Integration; Metadata Mgt

Figure: a “Heat Map” of priorities that tags each capability with the numbered stakeholder challenges (1 to 7) it addresses.

Identify What Data Needs to Be Governed
And What to Leave Alone
- Why? Identify Key Business Driver: Launch of New Product – Marketing Campaign requires better customer information
- What? Filter Data Elements Aligned with Business Driver
- How? Focus Governance Efforts on Key Vendor

Examples:
- Exploratory Analytics & Discovery, e.g. Social Media Sentiment Analysis: Lightly governed
- Structured Warehouse for Financial Reporting: Highly governed

Defining an Actionable Roadmap
Maximize the Benefit to the Organization
- Develop a detailed roadmap that is both actionable and realistic
- Show quick-wins, while building to a longer-term goal
- Include both Data Lake exploration & Data Warehouse reporting
- Focus on projects that benefit multiple stakeholders
- You can’t manage & govern everything – pick your priorities.

Figure: roadmap of Initiatives across H1 '16, H2 '16, H1 '17, and H2 '17: Strategy Development; Social Media Sentiment Analysis; Business Glossary Population & Publication; Data Warehouse Population (Customer, Product, Location); Call Log Analysis; Open Data Publication; IoT Integration; Integrated Customer View; Ongoing Communication & Collaboration. Stakeholders shown: Marketing, Customer Support, Sales, Executive Team.

Integrating the Data Lake & Traditional Data Sources
- The Data Lake has a different architecture & purpose than traditional data sources such as data warehouses.
- But the two environments can co-exist to share relevant information.
- Data Governance is different for each environment.

Figure:
- Reporting & Analytics: Advanced Analytics; Standard BI Reports; Self-Service BI
- Data Governance & Collaboration
- Data Analysis & Discovery – Data Lake (Lightly governed): Sandbox; Lightly Modeled Data; Data Exploration
- Enterprise Systems of Record (Highly governed): Master & Reference Data; Data Warehouse; Operational Data; Data Marts
- Security & Privacy

Roles & Culture

DBAs
- Analytical
- Structured
- Project & Task focused
- Cautious – identifies risks
- “Just let me code!”

Data Scientist
- Looks for opportunities
- Likes to explore
- Seen as “modern”
- Seen as “hip” & “sexy”

Data Architects
- Analytical
- Structured
- “Big Picture” focused
- Can be considered “old school”
- “Let me tell you about my data model!”

Business Executive
- Results-Oriented
- Optimistic – Identifies opportunities
- “Big Picture” focused
- “I’m busy.”
- “What’s the business opportunity?”

Big Data Vendors
- It’s magic! It’s easy! No modeling needed!

Organizational Siloes
- Too often, there are organizational & cultural silos that limit the sharing between the Data Lake and Data Warehouse

Data Lake & Data Scientist
- Exploratory projects
- Quick wins
- Little documentation & governance

Data Warehouse & Data Architects
- Enterprise reporting
- Long-term projects
- Data Standards
- Metadata & Governance

Breaking Down Organizational Siloes
- Good Communication & Governance help break down siloes and encourage information sharing.

Data Lake & Data Scientist
- Exploratory projects
- Quick wins
- Little documentation

Data Warehouse & Data Architects
- Enterprise reporting
- Long term project
- Data standards & documentation

New Operating Model: Interactions Between New & Existing Roles
Figure: Alignment between Existing Roles and New Roles. Roles shown: Privacy Analyst, Data Steward, Data Architect, Data Scientist, ETL Developer, Hadoop Administrator.

Sample Data Governance Operating Model

Executive Level: Executive Sponsor
- Executive Support & Direction
- Budget & resource approval

Strategic Level: Data Governance Steering Committee
- Strategic direction
- Prioritization
- Issue escalation
- Communication
- Both Business & IT
- Areas represented: Finance; Product Development; Marketing; Human Resources; IT; Customer Service; Distribution & Channels; Business Reporting & Analytics; Predictive Modeling & Analytics; IM Architecture

Tactical Level: Data Governance Working Group
- Data Governance Lead
- Functional Data Area Leads (Data Stewards)
- Business and IT
- Builds & manages policies, procedures & standards
- Data Definition
- Works with Stewards & SMEs to enforce at a tactical level

Execution
- Business Operations: Data Stewards & SMEs from Finance, Marketing, Customer Service, etc.
- Information Management & IT: Data Architects, Data Scientists, etc. (Data Architecture, Metadata Management, Data Provisioning)
- Executes data management activities (data publication, integration, etc.)
- Both Business & IT

Data Governance Processes & Workflows
Customize for the environment
- Data Governance Processes & Workflows are different for Data Lakes & Data Warehouses
  - Data Lake & Big Data Exploration
    - Light governance
    - “Tell me what you’re working on”
    - “Post some sample code”
  - Data Warehousing
    - Heavily governed
    - Structured data models, metadata lineage, etc.
- Some things remain the same
  - Data Stewardship
    - Who is the expert for Product data?
    - Who wrote this code?
  - Data Definitions, Standard Metrics & Business Glossary
    - What’s the definition for “Total Earned Revenue”?
    - Is a customer considered active if their payment is over 30 days overdue?
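A business glossary entry becomes easier to share when its definition is written once and the matching rule is kept next to it. The sketch below illustrates this with the slide's “active customer” question. The answer it encodes (a customer more than 30 days overdue is not active), the steward address, and all names are hypothetical assumptions, since the slide only poses the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GlossaryTerm:
    """One business glossary entry: what the term means and who to ask about it."""
    name: str
    definition: str
    steward: str


# Hypothetical entry. The 30-day threshold echoes the slide's question,
# but the chosen answer ("not active once over 30 days overdue") is an assumption.
MAX_DAYS_OVERDUE = 30

ACTIVE_CUSTOMER = GlossaryTerm(
    name="Active Customer",
    definition=f"A customer with no payment more than {MAX_DAYS_OVERDUE} days overdue.",
    steward="customer-data-steward@example.com",
)


def is_active_customer(days_overdue: int) -> bool:
    """Apply the glossary definition above to one customer."""
    return days_overdue <= MAX_DAYS_OVERDUE


print(ACTIVE_CUSTOMER.definition)
print(is_active_customer(31))
```

Keeping the threshold in one constant means the written definition and the rule cannot drift apart, whether the rule runs in the Data Lake or the Data Warehouse.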

Data Management & Measures
Suit the Method to the Environment
- Metadata Management & Governance is different with a Data Lake vs. a Data Warehouse
  - Data Lake
    - Metadata is not non-existent!
    - Exploration & discovery doesn’t mean lack of any documentation
    - Consider other exploratory and rapidly changing environments – e.g. Open Source Development, Open Data, etc.
  - Data Warehouse
    - More Traditional metadata management applies
    - Data Lineage
    - Data Models
- Business Metadata is a constant
  - What does this term mean? (business glossary)
  - Who is the owner or steward of the data?
  - Who can I go to to ask a question?

Data Warehousing Metadata & Lineage
Robust Documentation & Lineage
- Data warehouses are typically governed by a robust and well-documented data lineage.

Figure: lineage diagram linking the Business Glossary (CUSTOMER), Logical Data Model (CUSTOMER), Dimensional Data Model, Physical Data Models, ETL Tools, Database Tables (CUSTOMER, CUST, TBL C1), BI Tool, and Sales Report.

Big Data Platform Metadata
Weaker Metadata & Lineage
- Big Data platforms (e.g. Hadoop-based) are typically based on a system of files (HDFS)
  - As a result, the detailed structure that is found in a relational database platform does not exist
- Metadata still exists for these platforms.
  - Technical Metadata
    - Tree structure of HDFS directories
    - Directory and file attributes (ownership, permissions, quotas, replication factor, etc.)
    - Metadata about logical data sets (e.g. format, statistics, etc.)
    - Data ingest & transformation lineage
  - Business Metadata
    - Description of file
    - Tags
- There are components that allow you to add structure within the Hadoop ecosystem (e.g. Hive)

The Industry is Advancing
- There is an Apache incubator project to address a Data Governance & Metadata framework for Hadoop.

Data Lake Big Data Model - “Schema on Read”
- With the Big Data and NoSQL paradigm, “Schema-on-Read” means you do not need to know how you will use your data when you are storing it.
- You do need to know how you will use your data when you are using it and model accordingly. i.e. it’s not magic.
- For example, you may first place the data on HDFS in files, then apply a table structure in Hive.
- Apache Hive provides a mechanism to project structure onto the data in Hadoop and to query that data using a SQL-like language called HiveQL (HQL).

Figure:
- HDFS (File system): hdfs dfs -put /local/path/userdump /hdfs/path/data/users
- Exploration: Analyze & understand the data.
- Hive (Table Structures): Create table. Build a data structure to suit your needs.
- Analysis
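The “Schema on Read” idea can be illustrated without a Hadoop cluster. The sketch below uses plain Python rather than Hive: raw lines are kept exactly as they arrived, and a structure is projected onto them only when they are read. The sample records, the schema, and the function name are hypothetical, chosen only to show the pattern.

```python
import json

# Raw records are stored exactly as they arrived; no schema is enforced on write.
RAW_LINES = [
    '{"user": "alice", "age": "34", "city": "Boulder"}',
    '{"user": "bob", "age": 41}',
    'not json at all',
]

# The schema is chosen by the reader at query time: field name -> type converter.
USER_SCHEMA = {"user": str, "age": int}


def read_with_schema(lines, schema):
    """Yield one dict per parseable line, projected onto the given schema."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that do not parse as JSON
        if not isinstance(record, dict):
            continue  # skip JSON values that are not objects
        row = {}
        for field_name, convert in schema.items():
            value = record.get(field_name)
            row[field_name] = convert(value) if value is not None else None
        yield row


rows = list(read_with_schema(RAW_LINES, USER_SCHEMA))
print(rows)
```

A different reader could apply a different schema (for example, one that includes `city`) to the same raw lines, which is the flexibility the slide describes. The modeling work still happens; it just moves to read time.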

Data Modeling in the Big Data Ecosystem
Figure: Data Sources (Structured Data; Semi-structured Data such as JSON / XML; Unstructured Data) feed the Hadoop Framework, which comprises the HDFS File System, Hive (queried with HQL), HBase, and MapReduce / Analytics.

GitHub Metadata
Open Source Development
- Data Lake exploration typically is code-driven with little formal data structure.
- In the Open Source development environment, metadata still exists.
- Just enough information for another developer to be able to re-use the code.
- Similar documentation can be provided for Data Lake exploration & associated data science models & code.

Figure callouts:
- What is the purpose of the code?
- Who published it?
- What are the data structures?
- What are helpful comments?

Open Data Metadata
Publicly-available data
- With Open Data, metadata provides the context that makes information usable & credible.
- Data Lakes can use a similar method.

Figure callouts:
- When was it created or updated?
- When was it Published?
- Who published it?
- What is the intended usage?
- How often is it refreshed?
- What are the security or usage restrictions?
- What keywords categorize this data?
- Feedback loop
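The questions on this slide map naturally onto a small metadata record that travels with each data set. The sketch below is one possible shape for such a record, with a helper that reports which questions are still unanswered. The class, field names, and sample values are hypothetical and do not follow any particular Open Data standard.

```python
from dataclasses import dataclass, field, fields
from datetime import date
from typing import List, Optional


@dataclass
class DatasetMetadata:
    """Lightweight descriptive metadata for one data set (hypothetical field names)."""
    title: str
    publisher: Optional[str] = None           # Who published it?
    published_on: Optional[date] = None       # When was it published?
    last_updated: Optional[date] = None       # When was it created or updated?
    intended_usage: Optional[str] = None      # What is the intended usage?
    refresh_frequency: Optional[str] = None   # How often is it refreshed?
    usage_restrictions: Optional[str] = None  # Security or usage restrictions?
    keywords: List[str] = field(default_factory=list)  # Categorizing keywords


def missing_fields(meta: DatasetMetadata) -> List[str]:
    """Return the names of fields that are still empty (None or an empty list)."""
    missing = []
    for f in fields(meta):
        value = getattr(meta, f.name)
        if value is None or value == []:
            missing.append(f.name)
    return missing


meta = DatasetMetadata(
    title="Call log extract",
    publisher="Customer Support",
    keywords=["call logs"],
)
print(missing_fields(meta))
```

A check like `missing_fields` could run when a data set is published to the lake, which keeps governance light while still capturing just enough context for reuse.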

Business Definitions are Critical
Putting information i
