MDM For The Modern Data Architecture

Transcription

MDM for the Modern Data Architecture
September 2014

Purpose of MDM
Create correct and consistent data across the enterprise that fosters trust in information and acceleration of growth.
RedPoint Global Inc. 2014 Confidential

Why it matters
“Without data you’re just another person with an opinion.”
W. Edwards Deming

Vicious Cycle of Unmanaged Data
[Cycle diagram centered on "Unmanaged Data"]
1 Master data issues remain unaddressed or unresolved
2 Garbage in/garbage out creates process confusion
3 Lack of process trust slows business momentum
4 Data conflicts reinforce siloed operations

A Data Architecture Under Pressure
Existing sources (CRM, ERP, clickstream, logs)
Data system: RDBMS, EDW, MPP, repositories
Packaged applications: OLTP, ERP, CRM
Data types shown: transactional data; master data; hierarchical data; unstructured documents, emails; server logs; sentiment, web data; sensor, machine data; geolocation; clickstream
Data growth (source: IDC): 2.8 ZB in 2013; 40 ZB by 2020; 85% from new data types; 15x machine data by 2020
Hortonworks Inc. 2014

Broad Spectrum of Benefits Across Industries

Financial Services
- New account risk screens
- Fraud prevention
- Trading risk
- Maximize deposit spread
- Insurance underwriting
- Accelerate loan processing

Retail
- 360 view of the customer
- Analyze brand sentiment
- Localized, personalized promotions
- Website optimization
- Optimal store layout

Telecom
- Call detail records (CDRs)
- Infrastructure investment
- Next product to buy (NPTB)
- Real-time bandwidth allocation
- New product development

Manufacturing
- Supplier consolidation
- Supply chain and logistics
- Assembly line quality assurance
- Proactive maintenance
- Crowdsourced quality assurance

Healthcare
- Genomic data for medical trials
- Monitor patient vitals
- Reduce re-admittance rates
- Store medical research data
- Recruit cohorts for pharmaceutical trials

Utilities, Oil & Gas
- Smart meter stream analysis
- Slow oil well decline curves
- Optimize lease bidding
- Compliance reporting
- Proactive equipment repair
- Seismic image processing

Public Sector
- Analyze public sentiment
- Protect critical networks
- Prevent fraud and waste
- Crowdsource reporting for repairs to infrastructure
- Fulfill open records requests

Gartner’s Nexus of Forces Making Things Worse

Business Benefits of MDM

Today IT data mgmt. pros focus on:
- Eliminating duplicate/orphaned data
- Standardizing and centralizing data/metadata
- Meeting operational SLAs
- Data enrichment
- Data integration and synchronization

Business leaders really care about:
- Increasing revenue
- Decreasing costs
- Increasing operational efficiencies
- Reducing risk
- Improving customer experiences

Use business-value driven KPIs to evangelize MDM benefits:
- Reduction in direct marketing postage costs
- Reduction in average handle time in call center
- Increase in customer self-service for order management, technical support and customer service
- Increase in campaign response rates
- Reduction in customer privacy compliance risk exposure
- Delivering a consistent cross-channel customer experience

How About MDM on a Data Lake?

Benefits of a Hadoop Data Lake
- Data is ingested in its raw state regardless of format, structure or lack of structure
- Raw data can be used and reused for differing purposes across the enterprise
- Beyond inexpensive storage, Hadoop is an extremely powerful, scalable and segmentable computational platform
- Master data can be fed across the enterprise, and deep analytics on clean data is immediately enabled

Challenges to Data Lake Approach
- Severe shortage of MapReduce-skilled resources
- Inconsistent skills lead to inconsistent results of code-based solutions
- Nascent technologies require multiple point solutions
- Technologies are not enterprise grade
- Some functionality may not be possible within these frameworks

Key Functions for Master Data Management

ETL & ELT
- Profiling, reads/writes, transformations
- Single project for all jobs

Master Key Management
- Create keys
- Track changes
- Maintain matches over time

Data Quality
- Cleanse data
- Parsing, correction
- Geo-spatial analysis

Integration & Matching
- Grouping
- Fuzzy match

Web Services Integration
- Consume and publish
- HTTP/HTTPS protocols
- XML/JSON/SOAP formats

Process Automation & Operations
- Job scheduling, monitoring, notifications
- Central point of control
- Meta Data Management
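
As a rough illustration of how cleansing, fuzzy matching and master key management fit together, here is a minimal Java sketch. It is not RedPoint's implementation: the class name MasterKeySketch, the MAX_DISTANCE threshold and the choice of Levenshtein edit distance are all assumptions made for this example. Production matching would typically compare several attributes and use blocking to avoid scanning every master record.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch: cleanse a name, fuzzy-match it against known masters,
// and return a persistent master key (here simply the list index).
class MasterKeySketch {
    // Assumed threshold: names within this edit distance count as the same entity.
    static final int MAX_DISTANCE = 2;

    // Cleansed master names; the position in the list serves as the master key.
    private final List<String> masterNames = new ArrayList<>();

    // Cleanse step: lowercase, drop punctuation, collapse whitespace, trim.
    static String cleanse(String raw) {
        return raw.toLowerCase(Locale.ROOT)
                  .replaceAll("[^a-z0-9 ]", "")
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    // Levenshtein edit distance computed with two rolling rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns the existing key on a fuzzy match; otherwise creates a new key.
    int resolve(String rawName) {
        String name = cleanse(rawName);
        for (int key = 0; key < masterNames.size(); key++) {
            if (levenshtein(name, masterNames.get(key)) <= MAX_DISTANCE) {
                return key;
            }
        }
        masterNames.add(name);
        return masterNames.size() - 1;
    }
}
```

Calling resolve with "John Smith", then "Jon Smith", then "JOHN  SMITH." returns the same key each time, while an unrelated name such as "Mary Jones" receives a new key. That is the "maintain matches over time" behavior in miniature.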

Data Lake is the Center of Your MDM Strategy
- Ingestion of all data available from any source, format, cadence, structure or non-structure
- ELT and data transformation, refinement, cleansing, completion, validation and standardization
- Geospatial processing and geocoding
- Data profiling, lineage and metadata management
- Identity resolution and persistent keying and entity profile management

Data Lake Architecture for MDM
[Diagram labels: Data Sources; Clickstream; CRM; Compete; Field Feedback; Manuf. Field Feedback]

How Can That Possibly Work?
More MapReduce!
YARN!

Overview: What is Hadoop/Hadoop 2.0

Hadoop 1.0
- All operations based on MapReduce
- Intrinsic inconsistency of code-based solutions
- Highly skilled and expensive resources needed
- 3rd party applications constrained by the need to generate code

Hadoop 2.0
- Introduction of YARN: “a general-purpose, distributed, application management framework that supersedes the classic Apache Hadoop MapReduce framework for processing data in Hadoop clusters.”
- Mature applications can now operate directly on Hadoop
- Reduced skill requirements and increased consistency

RedPoint Data Management on Hadoop
[Diagram labels: Parallel Section; Data I/O; Key / Split Analysis; Partition; Dataserver; Execution AM / Tasks; Partitioning AM / Tasks; YARN; MapReduce]

Reference Hadoop Architecture
[Diagram labels: Monitoring and Management (Ambari); Tools and Apps; Interactive (Hive Server2); Data Refinement (Hive, Pig); Structure (HCatalog metadata services); MapReduce; HDFS; Load (Sqoop/Hive, WebHDFS, REST, JMS queues); DBs; Files; RDBMS; EDW; RedPoint Functional Footprint]
Data Sources: Sensor Logs, Clickstream, Flat Files, Unstructured, Sentiment, Customer

Benchmarks – Project Gutenberg

Sample MapReduce (small subset of the entire code, which totals nearly 150 lines):

    public static class MapClass
        extends Mapper<WordOffset, Text, Text, IntWritable> {

      // Characters treated as word separators (as they survive in the transcript)
      private final static String delimiters =
          "',./ ?;:\"[]{}- ()&*% # !@ \\ «»¡ ¶·¿";
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // Emit (word, 1) for every token in the input line
      public void map(WordOffset key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer itr = new StringTokenizer(line, delimiters);
        while (itr.hasMoreTokens()) {
          word.set(itr.nextToken());
          context.write(word, one);
        }
      }
    }

Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

                 MapReduce                       Pig                                      RedPoint
    Code         150 lines of MR code            50 lines of script code                  0 lines of code
    Development  6 hours                         3 hours                                  15 min.
    Runtime      6 minutes                       15 minutes                               3 minutes
    Notes        Extensive optimization needed   User Defined Functions required          No tuning or optimization required
                                                 prior to running script
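
For readers who want to see the whole word-count logic in one place, the following is a minimal single-machine Java sketch of what the MapReduce and Pig samples compute: tokenize, upper-case, count per word, then order by occurrences descending. It does not use Hadoop and is not the benchmarked code. The class name LocalWordCount and the DELIMITERS value are assumptions for illustration, since the slide's delimiter string did not survive extraction intact.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical local equivalent of the word-count benchmark logic.
class LocalWordCount {
    // Assumed delimiter set for this example only.
    static final String DELIMITERS = " \t\n\r',./?;:\"[]{}-()&*%#!@\\";

    // Map and reduce in one pass: tokenize each line, upper-case, sum per word.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                String word = itr.nextToken().toUpperCase(Locale.ROOT);
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Order by occurrences descending; ties are broken alphabetically
    // so the result is deterministic.
    static List<Map.Entry<String, Integer>> ranked(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> rows = new ArrayList<>(counts.entrySet());
        rows.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return rows;
    }
}
```

On a cluster the counting step is split between mappers, which emit (word, 1), and reducers, which sum the values. The sketch collapses both into a single HashMap, which is only practical when the corpus fits on one machine.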

Data Lake Architecture for MDM
[Diagram labels: Data Sources; CRM; Clickstream; ERP; Online Chat; Billing; Sensor Data; Subscriber; Social Media; Product; Compete; Field Feedback; Manuf. Field Feedback]
