Transcription
From Cost Center to Profit Center – Data Management Best Approaches
July 13, 2016
RedPoint Global Inc. 2016 Confidential
Overview of RedPoint Global
Launched in 2006
Founded and staffed by industry veterans
Headquarters: Wellesley, Massachusetts
Offices in US, UK, Australia, Philippines
Global customer base
Serves most major industries
RedPoint Data Management Ranks High in Gartner Critical Capabilities Report
Product or Service Scores for Operational/Transactional Data Quality
Product or Service Scores for Data Integration
Scores: 4.21, 4.41
Big Data Can Become Big Information
What Needs to [...]
[...] of data
Organize
Link
Information
[...] execution
Attributes of Information
RELEVANT
Information must l[...] information is often [...]ll.
ACCURATE
This one is obvious. Ina[...]curacy of a clear cost benefit. [...]formation but this is also what drives the use if successful
Current State of Data [...]
[...]DUCTION MODE
Denormalizing
[...]filing
Normalizing
[...]value
When Data Prep is [...]
[...]ODE
Time spent tuning algorithm: 80%
The Elephant in the Room
Skills Gap
  Severe shortage of MR or Spark skilled resources
  Very expensive resources and hard to retain
  Inconsistent skills lead to inconsistent results
  Underutilizes existing resources
Maturity & Governance
Data Into Information
  A nascent technology ecosystem around Hadoop
  [...]tionality
  New applications are not enterprise class
  Legacy applications have built short term capabilities
  [...]ormation
  [...]pectives
  [...]ended use of the data
Key Data Mastering Functionality Needed for Fast Data Prep
ETL & ELT
  Profiling, reads/writes, transformations
  Single project for all jobs
Data Quality
  Cleanse data
  Parsing, correction
  Geo-spatial analysis
Master Key Management
  Create keys
  Track changes
  Maintain matches over time
Web Services Integration
  Consume and publish
  HTTP/HTTPS protocols
  XML/JSON/SOAP formats
Integration & Matching
  Grouping
  Fuzzy match
Process Automation & Operations
  Job scheduling, monitoring, notifications
  Central point of control
  Meta Data Management
Hadoop Integration
  Pure YARN integration into Hadoop
  No coding data quality
Java SDK Layer
  Java SDK for rapid development
  Public project incubator for project sharing
Benchmarks – Project [...]
[...] entire code, which totals nearly 150 lines):

    public static class MapClass
            extends Mapper<WordOffset, Text, Text, IntWritable> {
        // Punctuation characters treated as token separators
        private final static String delimiters = "',./ ?;:\"[]{}- ()&*% # !@ \\ «»¡ ¶·¿";
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Emits (token, 1) for every token in the input line
        public void map(WordOffset key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer itr = new StringTokenizer(line, delimiters);
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

Pig
Sample Pig script without the UDF:

    SET pig.maxCombinedSplitSize 67108864
    SET pig.splitCombination true
    A = LOAD '/testdata/pg/*/*/*';
    B = FOREACH A GENERATE FLATTEN(TOKENIZE((chararray)$0)) AS word;
    C = FOREACH B GENERATE UPPER(word) AS word;
    D = GROUP C BY word;
    E = FOREACH D GENERATE COUNT(C) AS occurrences, group;
    F = ORDER E BY occurrences DESC;
    STORE F INTO '/user/cleonardi/pg/pig-count';

150 Lines of MR code          50 Lines of script code                              0 Lines of code
6 hours of development        3 hours of development                               15 minutes of development
6 minutes runtime             15 minutes runtime                                   3 minutes runtime
Needs extensive optimization  User-defined functions needed before running script  No tuning or optimization required
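The benchmark job on this slide is a word count: split each line into tokens, upper-case them (as the Pig script does), count occurrences per word, and order by count descending. A minimal single-machine sketch of that same logic in plain Java may help when reading the MapReduce and Pig versions. The class name WordCountSketch, its method names, and the delimiter set are assumptions made for illustration; none of this is Hadoop, Pig, or RedPoint code, and it says nothing about the benchmark timings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;

// Hypothetical helper class for illustration only.
class WordCountSketch {

    // Assumed delimiter set: whitespace plus common ASCII punctuation.
    static final String DELIMITERS = " \t\r\n',./<>?;:\"[]{}-()&*%#!@\\";

    // Tokenizes every line, upper-cases each token, and counts occurrences.
    static Map<String, Integer> count(List<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line, DELIMITERS);
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken().toUpperCase(), 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the entries ordered by descending count, ties broken by word.
    static List<Map.Entry<String, Integer>> sortedByCount(Map<String, Integer> counts) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return entries;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("the quick brown fox", "The lazy dog; the end.");
        for (Map.Entry<String, Integer> e : sortedByCount(count(lines))) {
            System.out.println(e.getValue() + "\t" + e.getKey());
        }
    }
}
```

In the sketch, count plays the role of the map and group steps, and sortedByCount corresponds to the ORDER ... DESC step of the Pig script. A distributed job spreads the same work across a cluster, which is where the extra code and tuning effort on the slide come from.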
Intel’s POV on Data Quality Outside [...]
[...]r Data Quality and MDM
Costly process in time and money
RedPoint’s Marketing Data Lake
Data Ingestion
Data Lake
Specialized Analytic Databases & Caches
Production RDBMS Databases
Persistent Entity Resolution, Linkage and Keying
YARN
Matching, MDM in cluster
Process native document or tabular data
[...]n Purpose Built Data Structures
For Additional [...] can:
[...]uality report
View Customer Case studies
Request a Free Trial