Lecture @Dhbw: Data Warehouse Part L: Data Catalog

Transcription

Ein Unternehmen der Daimler AGLECTURE @DHBW: DATA WAREHOUSEPART L: DATA CATALOGANDREAS BUCKENHOFER, DAIMLER TSS

ABOUT MEAndreas enior DB hemen/datenbank/in-memory/Since 2009 at Daimler TSSDepartment: Big DataBusiness Unit: Analyticshttp://wwwlehre.dhbw-stuttgart.de/ buckenhofer/https://www.xing.com/profile/Andreas Buckenhofer2

ANDREAS BUCKENHOFER, DAIMLER TSS GMBH“Forming good abstractions and avoiding complexityis an essential part of a successful data architecture”Data has always been my main focus during my long-time occupation in the area ofdata integration. I work for Daimler TSS as Database Professional and Data Architectwith over 20 years of experience in Data Warehouse projects. I am working withHadoop and NoSQL since 2013. I keep my knowledge up-to-date - and I learn newthings, experiment, and program every day.I share my knowledge in internal presentations or as a speaker at internationalconferences. I'm regularly giving a full lecture on Data Warehousing and a seminar onmodern data architectures at Baden-Wuerttemberg Cooperative State UniversityDHBW. I also gained international experience through a two-year project in GreaterLondon and several business trips to Asia.I’m responsible for In-Memory DB Computing at the independent German OracleUser Group (DOAG) and was honored by Oracle as ACE Associate. I hold currentcertifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big DataArchitect“, „Oracle Database 12c Administrator Certified Professional“, “IBMInfoSphere Change Data Capture Technical Professional”, etc.Daimler TSSContact/ConnectxingDOAGDHBWData Warehouse / DHBW3

NOT JUST AVERAGE: OUTSTANDING.As a 100% Daimler subsidiary, we give100 percent, always and never less.We love IT and pull out all the stops toaid Daimler's development with ourexpertise on its journey into the future.Our objective: We make Daimler themost innovative and digital mobilitycompany.Daimler TSSData Warehouse / DHBW4

INTERNAL IT PARTNER FOR DAIMLER Holistic solutions according to the Daimler guidelines IT strategy Security Architecture Developing and securing know-how TSS is a partner who can be trusted with sensitive dataAs subsidiary: maximum added value for Daimler Market closeness Independence Flexibility (short decision making process,ability to react quickly)Daimler TSSData Warehouse / DHBW5

LOCATIONSDaimler TSS Germany7 locations1000 employees*Ulm (Headquarters)Daimler TSS ChinaHub Beijing10 employeesStuttgartBerlinKarlsruhe* as of August 2017Daimler TSSDaimler TSS IndiaHub Bangalore22 employeesDaimler TSS MalaysiaHub Kuala Lumpur42 employeesData Warehouse / DHBW6

WHAT YOU WILL LEARN TODAY After the end of this lecture you will be able to Explain metadata Technical Business operational Explain data catalogs Understand use cases for metadataDaimler TSSData Warehouse / DHBW7

MAKING IT EASIER TO DISCOVER DATASETSAVAILABLE SINCE 05-SEP-2018Source: Google announcement easier-discover-datasets/Daimler TSSData Warehouse / DHBW8

MAKING IT EASIER TO DISCOVER DATASETSAVAILABLE SINCE 05-SEP-2018Daimler TSSData Warehouse / DHBW9

FIND THE RIGHT DATA With data science and analytics on the rise and under way to beingdemocratized, the importance of being able to find the right data toinvestigate hypotheses and derive insights is paramountSource: h-for-datasets-first-research-then-the-world Google Dataset search helps to find external data Schema.org defines open metadata format; dataset itself may not be open/free Search engines can interpret the format Ranking of data Help users discover where the data is and user can access it directly from the sourceWhat about internal data?

WHAT IS METADATA?Dataaboutother dataDaimler TSSData Warehouse / DHBW11

TYPES OF METADATA (1)Business Metadata Definition of business vocabulary and relationships Definition of the value range Linkage to physical representationDaimler TSSData Warehouse / DHBW12

TYPES OF METADATA (2)Report and ETL metadata Report definitions Data sources Column definitions ComputationsLogical and physical metadata of data model Table structure Definition of columns Relationships between tables and columnsDimension hierarchy

BENEFITS OF METADATA MANAGEMENT Data Lineage and dependenciesGenerating and controllingDWH processesImprove SW developmentqualityIncrease comprehensibility ofKPIsSource: Detlef Apel: Datenqualität erfolgreich steuern, dpunkt 2015, chapter 14Daimler TSSData Warehouse / DHBW14

TECHNICAL METADATA MANAGEMENTVERY OFTEN NOT SUCCESSFULWho enrichestechnical metadatawithbusiness relevantinformation?Metadata RepositoryOLTP-1OLTP-2Data oservice-1DWH

OVER 75% Of time is spent for Say they least enjoyDATA PREPARATIONData ConsumersDaimler TSSData Warehouse / DHBW16

DOES METADATA MANAGEMENT PROVIDE ANSWERS TOSUCH QUESTIONS ACROSS THE WHOLE WORKFLOW?What table containsproduction dates?FindHow is thiscolumncalculated?UnderstandSearch for dataWhat tables areimportant?Who knows aboutthe data?Daimler TSSIs the datareliable?TrustHow to get accessto the data?Is FIN unique?AccessWriteWork with dataWhat is the differencebetween production dateHow to join theand prod dt?tables?Data Warehouse / DHBW17

DATA CATALOG A HOT TOPIC New Data Catalog vendors are entering the marketEstablished vendors rebrand and enrich their existing toolsDaimler TSSData Warehouse / DHBW18

EVALUATION CRITERIATechnical MetadataData accessBusiness Metadata incl. GlossaryLineageTagging (Linkage)APICollective ctureSecurityComponentsSource connectorsPrerequisitesData profilingLicencingDaimler TSSData Warehouse / DHBW19

ALATION ARCHITECTURENot just an RDBMS for structuredmetadata, but also storage engines for textdataDaimler TSSData Warehouse / DHBW20

CATALOGING SOURCE SYSTEMSMANY FORMATS MANY CONNECTORS RDBMS (Oracle, Db2, SQL Server, Teradata, ) Hadoop (HDFS, Hive, ; on-premises, Cloud) NoSQL DBs Files (Excel, csv, ) Powerdesigner, Erwin, and other data modeling tools

METADATA IMPORT USED TO BE SIMPLE WITH RDBMSWhere is the dataand where is themetadata in thislogfile?Data Lake:decentralized controlof the data

DATA LAKE / HADOOP Easy approach: Access Hive Metastore and import metadata Prerequisite: all data/files in HDFS require Hive access But unrealistic prerequisite Many logs are just dumped into the file system Interpreting ALL files by catalog SW unrealistic, too. Huge computing power Huge number of variations (Cloud, on-premises, SW versions) lacks support of vendors forcatalog SW Sources should deliver metadata

CATALOGING @GOOGLEHeavy usage ofAutomationandMachine LearningSource: https://ai.google/research/pubs/pub45390Daimler TSSData Warehouse / DHBW24

CATALOGING AT NETFLIX, TWITTER, LINKEDIN, ETC.CompanyLinkNetflix acatTwitterhttps://blog.twitter.com/engineering/en of-analytics-data-at-twitter.htmlLinkedIn tps://github.com/linkedin/WhereHows/wikiGoogle r-for-apache-atlas-use-cases/

CATALOGING @UBERSource: https://eng.uber.com/databook/Daimler TSSData Warehouse / DHBW26

CATALOGING @TWITTERSource: https://blog.twitter.com/engineering/en of-analytics-data-at-twitter.htmlDaimler TSSData Warehouse / DHBW27

CATALOGING @LINKEDIN (OPEN SOURCE)Source: https://github.com/LinkedIn/Wherehows/wikiDaimler TSSData Warehouse / DHBW28

CATALOGS ARE EVERYWHERE GOOGLE, AMAZONINVENTORYDaimler TSSUSER EXPERIENCEData Warehouse / DHBW29

INVENTORY VS USER EXPERIENCESuppliers provide inventory A catalog should list everything that is actually availableConsumers require user experience A catalog should provide data usage statistics, ratings, datasamples, statistical profiles, lineage, lists of users andstewards, and tips on how the data should be interpreted

AUTOMATION, CROWD KNOWLEDGE, AND EXPERTS limitation of permissions to a trusted group A trusted group documents few datasets very well But most of the metadata is not documented Failure of many past approaches Automation, crowd knowledge and experts required Automation to get a broad coverage and use existing information like query logs Crowd to increase broad coverage Experts to confirm or reject „guesses“- Combination of coverage and accuracy

DATA CATALOG – AMAZON FOR GovernanceData AccessDaimler TSSData Warehouse / DHBW32

CATALOG SEARCHDaimler TSSData Warehouse / DHBW33

SCHEMA AND ITS TABLESDaimler TSSData Warehouse / DHBW34

TABLE AND ITS COLUMNS WITH SAMPLE DATADaimler TSSData Warehouse / DHBW35

COLUMNS AND RELATIONSHIPSDaimler TSSData Warehouse / DHBW36

LEGAL TAGSGDPR AND OTHER REGULATIONSAssociate legal tags Articles 16-21 Identify data Right to erasure Right to be forgottenDaimler TSSData Warehouse / DHBW37

Daimler TSSSource 3Source 1Source 2Source 3DatacatalogLocal data catalogs (reality) Legal requirements Feasibility Tool support very weak Source 2DatacatalogCentral data catalog Integrated views Mammoth task No redundancySource 1Data catalogCENTRAL VS LOCAL DATA CATALOGSData Warehouse / DHBW38

BIMA-STUDIE 2018 (BARC SOPRA STERIA CONSULTING)DIGITIZATION HOT SPOTSData quality and meta data managementDomain knowledgeData culture

IS THE DATA CATALOG A “METADATA MANAGEMENTRELOADED”?Name it as you like, but there are some critical developments Automation, Collective intelligence and expert knowledge Enable crowd sourcing and get help from other usersHelp to understand quality of data and usage of datasets Rating of information Web application for search / collaboration and API to access metadataGovernance and legal framework for e.g. GDPR scenarios Capture metadata for security and end-user data consumption Identify the owner of the dataset and get access to source data

Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon 49 731 505-06 / Fax 49 731 505-65 99tss@daimler.com / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSSitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Chairperson), Steffen Bäuerle

InfoSphere hange Data apture Technical Professional", etc. DOAG DHBW xing Contact/Connect. As a 100% Daimler subsidiary, we give . Business Metadata incl. Glossary Tagging (Linkage) Collective Intelligence(Collaboration) Search Security Source connectors Data profiling Data access Lineage API Versioning Architecture