Transcription
Ein Unternehmen der Daimler AGLECTURE @DHBW: DATA WAREHOUSEPART L: DATA CATALOGANDREAS BUCKENHOFER, DAIMLER TSS
ABOUT MEAndreas enior DB hemen/datenbank/in-memory/Since 2009 at Daimler TSSDepartment: Big DataBusiness Unit: Analyticshttp://wwwlehre.dhbw-stuttgart.de/ buckenhofer/https://www.xing.com/profile/Andreas Buckenhofer2
ANDREAS BUCKENHOFER, DAIMLER TSS GMBH“Forming good abstractions and avoiding complexityis an essential part of a successful data architecture”Data has always been my main focus during my long-time occupation in the area ofdata integration. I work for Daimler TSS as Database Professional and Data Architectwith over 20 years of experience in Data Warehouse projects. I am working withHadoop and NoSQL since 2013. I keep my knowledge up-to-date - and I learn newthings, experiment, and program every day.I share my knowledge in internal presentations or as a speaker at internationalconferences. I'm regularly giving a full lecture on Data Warehousing and a seminar onmodern data architectures at Baden-Wuerttemberg Cooperative State UniversityDHBW. I also gained international experience through a two-year project in GreaterLondon and several business trips to Asia.I’m responsible for In-Memory DB Computing at the independent German OracleUser Group (DOAG) and was honored by Oracle as ACE Associate. I hold currentcertifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big DataArchitect“, „Oracle Database 12c Administrator Certified Professional“, “IBMInfoSphere Change Data Capture Technical Professional”, etc.Daimler TSSContact/ConnectxingDOAGDHBWData Warehouse / DHBW3
NOT JUST AVERAGE: OUTSTANDING.As a 100% Daimler subsidiary, we give100 percent, always and never less.We love IT and pull out all the stops toaid Daimler's development with ourexpertise on its journey into the future.Our objective: We make Daimler themost innovative and digital mobilitycompany.Daimler TSSData Warehouse / DHBW4
INTERNAL IT PARTNER FOR DAIMLER Holistic solutions according to the Daimler guidelines IT strategy Security Architecture Developing and securing know-how TSS is a partner who can be trusted with sensitive dataAs subsidiary: maximum added value for Daimler Market closeness Independence Flexibility (short decision making process,ability to react quickly)Daimler TSSData Warehouse / DHBW5
LOCATIONSDaimler TSS Germany7 locations1000 employees*Ulm (Headquarters)Daimler TSS ChinaHub Beijing10 employeesStuttgartBerlinKarlsruhe* as of August 2017Daimler TSSDaimler TSS IndiaHub Bangalore22 employeesDaimler TSS MalaysiaHub Kuala Lumpur42 employeesData Warehouse / DHBW6
WHAT YOU WILL LEARN TODAY After the end of this lecture you will be able to Explain metadata Technical Business operational Explain data catalogs Understand use cases for metadataDaimler TSSData Warehouse / DHBW7
MAKING IT EASIER TO DISCOVER DATASETSAVAILABLE SINCE 05-SEP-2018Source: Google announcement easier-discover-datasets/Daimler TSSData Warehouse / DHBW8
MAKING IT EASIER TO DISCOVER DATASETSAVAILABLE SINCE 05-SEP-2018Daimler TSSData Warehouse / DHBW9
FIND THE RIGHT DATA With data science and analytics on the rise and under way to beingdemocratized, the importance of being able to find the right data toinvestigate hypotheses and derive insights is paramountSource: h-for-datasets-first-research-then-the-world Google Dataset search helps to find external data Schema.org defines open metadata format; dataset itself may not be open/free Search engines can interpret the format Ranking of data Help users discover where the data is and user can access it directly from the sourceWhat about internal data?
WHAT IS METADATA?Dataaboutother dataDaimler TSSData Warehouse / DHBW11
TYPES OF METADATA (1)Business Metadata Definition of business vocabulary and relationships Definition of the value range Linkage to physical representationDaimler TSSData Warehouse / DHBW12
TYPES OF METADATA (2)Report and ETL metadata Report definitions Data sources Column definitions ComputationsLogical and physical metadata of data model Table structure Definition of columns Relationships between tables and columnsDimension hierarchy
BENEFITS OF METADATA MANAGEMENT Data Lineage and dependenciesGenerating and controllingDWH processesImprove SW developmentqualityIncrease comprehensibility ofKPIsSource: Detlef Apel: Datenqualität erfolgreich steuern, dpunkt 2015, chapter 14Daimler TSSData Warehouse / DHBW14
TECHNICAL METADATA MANAGEMENTVERY OFTEN NOT SUCCESSFULWho enrichestechnical metadatawithbusiness relevantinformation?Metadata RepositoryOLTP-1OLTP-2Data oservice-1DWH
OVER 75% Of time is spent for Say they least enjoyDATA PREPARATIONData ConsumersDaimler TSSData Warehouse / DHBW16
DOES METADATA MANAGEMENT PROVIDE ANSWERS TOSUCH QUESTIONS ACROSS THE WHOLE WORKFLOW?What table containsproduction dates?FindHow is thiscolumncalculated?UnderstandSearch for dataWhat tables areimportant?Who knows aboutthe data?Daimler TSSIs the datareliable?TrustHow to get accessto the data?Is FIN unique?AccessWriteWork with dataWhat is the differencebetween production dateHow to join theand prod dt?tables?Data Warehouse / DHBW17
DATA CATALOG A HOT TOPIC New Data Catalog vendors are entering the marketEstablished vendors rebrand and enrich their existing toolsDaimler TSSData Warehouse / DHBW18
EVALUATION CRITERIATechnical MetadataData accessBusiness Metadata incl. GlossaryLineageTagging (Linkage)APICollective ctureSecurityComponentsSource connectorsPrerequisitesData profilingLicencingDaimler TSSData Warehouse / DHBW19
ALATION ARCHITECTURENot just an RDBMS for structuredmetadata, but also storage engines for textdataDaimler TSSData Warehouse / DHBW20
CATALOGING SOURCE SYSTEMSMANY FORMATS MANY CONNECTORS RDBMS (Oracle, Db2, SQL Server, Teradata, ) Hadoop (HDFS, Hive, ; on-premises, Cloud) NoSQL DBs Files (Excel, csv, ) Powerdesigner, Erwin, and other data modeling tools
METADATA IMPORT USED TO BE SIMPLE WITH RDBMSWhere is the dataand where is themetadata in thislogfile?Data Lake:decentralized controlof the data
DATA LAKE / HADOOP Easy approach: Access Hive Metastore and import metadata Prerequisite: all data/files in HDFS require Hive access But unrealistic prerequisite Many logs are just dumped into the file system Interpreting ALL files by catalog SW unrealistic, too. Huge computing power Huge number of variations (Cloud, on-premises, SW versions) lacks support of vendors forcatalog SW Sources should deliver metadata
CATALOGING @GOOGLEHeavy usage ofAutomationandMachine LearningSource: https://ai.google/research/pubs/pub45390Daimler TSSData Warehouse / DHBW24
CATALOGING AT NETFLIX, TWITTER, LINKEDIN, ETC.CompanyLinkNetflix acatTwitterhttps://blog.twitter.com/engineering/en of-analytics-data-at-twitter.htmlLinkedIn tps://github.com/linkedin/WhereHows/wikiGoogle r-for-apache-atlas-use-cases/
CATALOGING @UBERSource: https://eng.uber.com/databook/Daimler TSSData Warehouse / DHBW26
CATALOGING @TWITTERSource: https://blog.twitter.com/engineering/en of-analytics-data-at-twitter.htmlDaimler TSSData Warehouse / DHBW27
CATALOGING @LINKEDIN (OPEN SOURCE)Source: https://github.com/LinkedIn/Wherehows/wikiDaimler TSSData Warehouse / DHBW28
CATALOGS ARE EVERYWHERE GOOGLE, AMAZONINVENTORYDaimler TSSUSER EXPERIENCEData Warehouse / DHBW29
INVENTORY VS USER EXPERIENCESuppliers provide inventory A catalog should list everything that is actually availableConsumers require user experience A catalog should provide data usage statistics, ratings, datasamples, statistical profiles, lineage, lists of users andstewards, and tips on how the data should be interpreted
AUTOMATION, CROWD KNOWLEDGE, AND EXPERTS limitation of permissions to a trusted group A trusted group documents few datasets very well But most of the metadata is not documented Failure of many past approaches Automation, crowd knowledge and experts required Automation to get a broad coverage and use existing information like query logs Crowd to increase broad coverage Experts to confirm or reject „guesses“- Combination of coverage and accuracy
DATA CATALOG – AMAZON FOR GovernanceData AccessDaimler TSSData Warehouse / DHBW32
CATALOG SEARCHDaimler TSSData Warehouse / DHBW33
SCHEMA AND ITS TABLESDaimler TSSData Warehouse / DHBW34
TABLE AND ITS COLUMNS WITH SAMPLE DATADaimler TSSData Warehouse / DHBW35
COLUMNS AND RELATIONSHIPSDaimler TSSData Warehouse / DHBW36
LEGAL TAGSGDPR AND OTHER REGULATIONSAssociate legal tags Articles 16-21 Identify data Right to erasure Right to be forgottenDaimler TSSData Warehouse / DHBW37
Daimler TSSSource 3Source 1Source 2Source 3DatacatalogLocal data catalogs (reality) Legal requirements Feasibility Tool support very weak Source 2DatacatalogCentral data catalog Integrated views Mammoth task No redundancySource 1Data catalogCENTRAL VS LOCAL DATA CATALOGSData Warehouse / DHBW38
BIMA-STUDIE 2018 (BARC SOPRA STERIA CONSULTING)DIGITIZATION HOT SPOTSData quality and meta data managementDomain knowledgeData culture
IS THE DATA CATALOG A “METADATA MANAGEMENTRELOADED”?Name it as you like, but there are some critical developments Automation, Collective intelligence and expert knowledge Enable crowd sourcing and get help from other usersHelp to understand quality of data and usage of datasets Rating of information Web application for search / collaboration and API to access metadataGovernance and legal framework for e.g. GDPR scenarios Capture metadata for security and end-user data consumption Identify the owner of the dataset and get access to source data
Daimler TSS GmbHWilhelm-Runge-Straße 11, 89081 Ulm / Telefon 49 731 505-06 / Fax 49 731 505-65 99tss@daimler.com / Internet: www.daimler-tss.com/ Intranet-Portal-Code: @TSSSitz und Registergericht: Ulm / HRB-Nr.: 3844 / Geschäftsführung: Martin Haselbach (Chairperson), Steffen Bäuerle
InfoSphere hange Data apture Technical Professional", etc. DOAG DHBW xing Contact/Connect. As a 100% Daimler subsidiary, we give . Business Metadata incl. Glossary Tagging (Linkage) Collective Intelligence(Collaboration) Search Security Source connectors Data profiling Data access Lineage API Versioning Architecture