Modernizing The Data Warehouse - CODUG

Transcription

Modernizing the Data WarehouseThe Marriage of Big Data and Relational TechnologiesDirk deRoosdderoos@ca.ibm.com@Dirk deRoosWorld-Wide Technical Sales, Big DataJune 19, 2014 2014 IBM Corporation

The Evolution of Analytics 1960s: Navigational DBMS– IMS (hierarchical) 1970s-1980s: Relational DBMS– SQL– System R, System Z, DB2 1990s: Data Warehouse– Dimensional model, ETL, MDM Today: Big Data/NoSQLTed Codd22 2013 IBMCorporation 2014IBM Corporation

Pressures on Traditional Relational StoresBudgetary constraintsTechnical change/Different forms of data33Regulatory pressures(SLAs, Archive, Governance) 2013 IBMCorporation 2014IBM Corporation

The NoSQL Revolution Different requirements require different tools––––Document storesKey/value storesGoogle BigTable implementationsGraph databases Values (there are exceptions)– Huge data volumes – easy scale-out– Semi-structured data– Extreme performance4 2013 IBMCorporation 2014IBM Corporation

Database GenresA High-level ViewKey/ValueColumnarData SizeDocumentGraphRelationalData Complexity55 2014 IBM Corporation

Traditional Warehousing vs. NoSQLACID vs. BASE Atomicity Consistency Isolation Durability6 Basically Available Soft state Eventually consistent 2013 IBMCorporation 2014IBM Corporation

Hadoop – Architecture Master / Slave architectureNameNodeFile1abcd Master: NameNode– Manages the file system namespaceand metadata FsImage EditLog– Regulates access by files by clients Slave: DataNode–––––Many DataNodes per clusterManages storage attached to the nodesPeriodically reports status to NameNodeData is stored across multiple nodesNodes and components will fail, so forreliability data is replicated acrossmultiple nodesabdbacadccbdDataNodes7 2014 IBM Corporation

Hadoop Distributed File System HDFS is designed to support very large files Each file is split into blocks– Hadoop default: 64MB– BigInsights default: 128MB Blocks reside on different physical DataNode Behind the scenes, 1 HDFS block is supported by multiple operatingsystem blocks64 MBHDFS blocksOS blocks If a file or a chunk of the file is smaller than the block size, onlyneeded space is used. E.g.: a 210MB file is split as follows:64 MB864 MB64 MB18 MB 2014 IBM Corporation

MapReduce Explained Hadoop computation model– Data stored in a distributed file system spanning many inexpensive computers– Bring function to the data– Distribute application to the compute resources where the data is stored Scalable to thousands of nodes and petabytes of dataHadoop Data Nodespublic static class TokenizerMapperextends Mapper Object,Text,Text,IntWritable {private final static IntWritableone new IntWritable(1);private Text word new Text();public void map(Object key, Text val, ContextStringTokenizer itr new StringTokenizer(val.toString());while (itr.hasMoreTokens()) {word.set(itr.nextToken());context.write(word, one);}}1. Map Phase(break job into small parts)}public static class IntSumReducerextends Reducer Text,IntWritable,Text,IntWritaprivate IntWritable result new IntWritable();public void reduce(Text key,Iterable IntWritable val, Context context){int sum 0;for (IntWritable v : val) {sum v.get();Distribute maptasks to cluster. . .2. Shuffle(transfer interim outputfor final processing)3. Reduce Phase(boil all output down toa single result set)MapReduce ApplicationShuffleResult Set9Return a single result set 2014 IBM Corporation

Next Generation Hadoop Beyond MapReduce General purpose storage and processing framework10 2014 IBM Corporation

Complementary AnalyticsTraditional ApproachNew ApproachStructured, analytical, logicalCreative, holistic thought, n DataSocial DataInternal App ame DataLinearLinearMonthly sales reportsProfitability analysisOLTP System DataCustomer surveysERP data1111TraditionalSourcesWeb xploratoryExploratoryIterative Text Data: emailsIterativeBrand sentimentProduct strategySensor data: imagesMaximum asset utilizationNewSourcesRFID 2013 IBMCorporation 2014IBM Corporation

Traditional Data Mining and Exploratory Analysis1212 2013 IBMCorporation 2014IBM Corporation

Data Governance Maturity Disciplines 13Organizational awarenessStewardshipPolicyValue creationData risk managementSecurity/Privacy/Compliance Data architectureData qualityBusiness glossary/metadataInformation lifecycle managementAudit and reporting 2013 IBMCorporation 2014IBM Corporation

Data Governance Maturity DisciplinesNoSQL Challenges 14Organizational awarenessStewardshipPolicyValue creationData risk managementSecurity/Privacy/Compliance Data architectureData qualityBusiness glossary/metadataInformation lifecycle managementAudit and reporting 2013 IBMCorporation 2014IBM Corporation

Traditional AnalyticsData typesActionable insightOperationalsystemsStaging areaPredictive analyticsand modelingEnterpriseWarehouseData MartsTransaction andapplication dataReporting andanalysisArchiveInformation Movement and Transformation15 2014 IBM Corporation

IBM Big Data Architecture VisionAll Data SourcesBig Data EcosystemAnalytic ApplicationsReal-timeAnalytic n andOperationalInformation StreamProcessing DataIntegration Master Data Video/AudioNetwork/SensorEntity AnalyticsPredictiveLanding Area,Analytics Zoneand Archive Exploration,IntegratedWarehouse, andMart Zones DecisionManagementDiscoveryDeep ReflectionOperationalPredictiveRaw DataStructured DataText AnalyticsData MiningEntity AnalyticsMachineLearningBI and PredictiveAnalyticsAnalyticApplicationsInformation Governance, Security andBusiness Continuity16 2014 IBM Corporation

Analytics for Data-in-MotionReal time delivery Scale-out architecturefor massive linear scalabilityICUMonitoringAlgorithmicTrading Sophisticated analyticswith pre-built toolkits & accelerators Comprehensive development toolsto build applications with minimallearningCyberSecurityMillions ofevents ernment /LawenforcementTelco nal / Non-traditionaldata sourcesVideo, audio, networks, social media, etc17 2014 IBM Corporation

BigInsights: IBM’s Hadoop Distribution BigInsights Pure OpenSource Code Analysis–––––Native SQL interfaceNative R interfaceText analysis toolkitSocial analysis toolkitSpreadsheet style analysis GUI Development lifecycle– Cluster aware Eclipse plug-ins– App Store for Hadoop Data Exploration– Indexing and faceted search– Search-based applications18 Opt-in EnterpriseClass ExtensionsIBM SupportInfrastructure Management– Enterprise file system Advanced replication Multi-temp storage POSIX controls– Grid management Mature resource manager Multi-tenant workload support Baked-in security– LDAP– Role-based authorization– Perimeter security with reverse-proxy 2014 IBM Corporation

Big SQLMaster NodesBig SQLCoordinator nodeCatalogLocal fs (catalog tables)Hadoop DataNode(s)Map ReduceDB Runtime n 1Map ReduceDB Runtime n 2syncMap ReduceDB Runtime n nsyncsyncCluster networkHDFSUser Datatemp(s)HDFSHost 1User Datatemp(s)HDFSUser DataHost 2temp(s)Host n Architecture––––IBM Optimizer IBM Compiler IBM Runtime Ported to HadoopNodes integrated in Hadoop cluster, direct access to Hadoop dataQueries Hadoop data – no proprietary data formatMapReduce run-time also available for query execution Benefits––––19Extensive SQL support (ANSI, IBM, Oracle, Teradata)Performance: Maturity – 30 years of engineeringFederated joins between relational systems and HadoopSecurity: Row and column access control 2014 IBM Corporation

Deep Statistical Analysis: Big R Fit-for-purpose architecture for deep statistical analysis– Problems involving small data sets (10GB): R– Problems involving partitioned data sets (e.g. 32 x 10GB): BigR– Problems involving large data sets: (TB range): BigR using SystemML R integration in BigInsights– R code can be deployed againstdata stored in BigInsights– Big R: partitioning larger data setsand executing R code against them– Seamless access to data in BigInsights– Enterprise friendly license (no GPL)R ClientsData Sources SystemML– Some data sets cannot belogically partitioned: too big for R– Engine designed for massive scaleon Hadoop– Numerically accurate results– Provide an R interface for SystemML20SystemMLIBM ConfidentialEmbedded RParallelExecution 2014 IBM Corporation

Big MatchFind and Integrate Master Data in Big Data Sources How It Works– Probabilistic matching on big data platform (BigInsights-Hadoop)– Matching at a higher volume– Matching of a wider variety of data setsMDM BigInsights Client Value– Find master data within big data sources– Get an answer faster – enable real-timematching at big data volumesBig Match Engine Building Big Data Confidence– Provides more context by detectingmaster entities faster21 2014 IBM Corporation

Unique Data Matching Capabilities for HadoopProbabilistic matching engine and pre-built algorithms integrated into BigInsights for linking alldata related to a customer natively within HadoopInternal / plyChainSupportTicketingC. Johnson125 Main Street512-554-1234C. JohnsonMain Street512-554-1234Christine. Johnson123 Main StreetCall lengthSemi-structured notesSatisfactionOrderMgmt.C. JohnsonChris Johnston123 Main Street 123 Main Street512-545-1234512-554-1234Shipping:456 Pine AveIncreased Value of Customer only if Big Matchmatches allthese recordsExternal / �� Clothes,Camping Gear22Christine JohnsonMarried1 child4/15/743rd Party@ChristyJohnson65Christy65Mail Order responderSpecialty ApparelPartner Sales dataChristy65Circle / Network dataVIP: GoldCustomer Sat: 80%Influence Score: 8/10 2014 IBM Corporation

Match and Search Differentiators – Fuzzy Matching Comprehensive library of fuzzy matching techniques Scored against probabilistic weights based on value frequencies in your ammed vs.MahmoudAndrew AndyGeorge Jorge1st FirstAIG AmericanInternationalGroupRoad RdVan de Velde VandeveldeEdit DistanceTransliterationDate SimilarityProximity867-5309 8765309Toyota トヨダ01/01/1973 01/02/1973Geocodes andgreat-circledistanceTypographicalErrorsNoise WordsMisalignmentInitiate Inc. InitiateKim Jung-il Kim il JungJohn Smith vs.John Snith23 2014 IBM Corporation

Logical Data Warehouse – Schema AreasExploration ZoneData Sandbox AreasVisualization,Data Mining& ExplorationDetailedSystem ofRecordData (Y)LandingZone24Data DeltaDetailedSystem ofRecord (Years)(More Refined Data)(Years)(ELT)(Modeled,Years)(Mixture: Raw& Modeled)Data ScientistsAnalytical orPredictiveModelsData Prediction(Years)Self-Provisioning Data(Raw,Years)Data ct Data UsersUser Reports&DashboardsUser Guided& AdvancedAnalyticsIntegrated Warehouse & Marts ZoneDeep Data Zone 2014 IBM Corporation

THINK25 2013 IBMCorporation 2014IBM Corporation

BigInsights Enterprise Edition ComponentsIBM InfoSphere BigInsightsData IngestToolsApplication Support and Development ToolingVisualization & DiscoveryBigSheetsGovernanceCatalogData ExplorerEclipseApp infrastructureBig SQLJaqlDashboard ieJDBCTeradataNetezzaAdvanced Analytics EnginesText Processing Engine andExtractor Library (AQL HIL)Big R / SystemMLRDB2StreamsCluster Optimization and ManagementIntegrated InstallerAdmin ConsoleEnhanced MonitoringZooKeeperAvroDerbyData ClickSplittable tive MapReducePlatform SymphonySecurityPrivate firewallData StoreFile SystemNutchHBaseHDFSLDAP or KerberosGPFS-FPOSqoopPAMOpen Source26FlumeIBM 2014 IBM Corporation

Find and Integrate Master Data in Big Data Sources How It Works - Probabilistic matching on big data platform (BigInsights-Hadoop) - Matching at a higher volume - Matching of a wider variety of data sets Client Value - Find master data within big data sources - Get an answer faster - enable real-time matching at big data volumes