The Role of Hadoop in BI and Data Warehousing

Transcription

The Role of Hadoop in BI and Data Warehousing
Colin White, BI Research
Sept. 16, 2014

Sponsor

Speakers
Colin White, President, BI Research
Chad Meley, VP of Product and Services Marketing, Teradata

The Role of Hadoop in BI and Data Warehousing
Colin White, President, BI Research
TDWI-Teradata Web Seminar, September 2014

Webinar Objectives
- Review Hadoop trends and directions
- Look at the role of Hadoop in BI and data warehousing
- Discuss approaches for integrating Hadoop into the existing BI/DW environment
Copyright BI Research, 2014

Hadoop Origins: Apache
"A framework for running applications on a large hardware cluster built of commodity hardware." (wiki.apache.org/hadoop/)
[Architecture diagram; source: Microsoft]
The focus was on programmatic and batch-oriented applications that processed large amounts of multi-structured data (the original "big data").
Systems were deployed by assembling Apache components or by using Hadoop distributions from companies such as Cloudera, Hortonworks and MapR.
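The deck shows no code, but the batch processing model the slide describes is MapReduce. A minimal, illustrative sketch of the map/shuffle/reduce flow in pure Python (no Hadoop required; the word-count task and all names are assumptions for illustration):

```python
from collections import defaultdict

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in the input split
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the counts for each word
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data big cluster", "commodity hardware cluster"]
pairs = [p for doc in documents for p in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
# counts["big"] == 2, counts["cluster"] == 2
```

In a real Hadoop cluster the map and reduce phases run in parallel across the commodity nodes the slide mentions, with the framework handling the shuffle and fault tolerance.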

Hadoop Today: Cloudera & Hortonworks Examples
[Distribution architecture diagrams; sources: Cloudera and Hortonworks]

Hadoop Today: MapR Example
[MapR distribution architecture diagram]

Hadoop Today: The Component Wars
Components and their sponsors (the original table marks which distributions ship each one; source: Cloudera):
- Impala (Cloudera)
- Hue (Cloudera)
- Sentry (Cloudera)
- Flume (Cloudera)
- Parquet (Cloudera/Twitter)
- Sqoop (Cloudera)
- Ambari (Hortonworks)
- Knox (Hortonworks)
- Tez (Hortonworks)
- Drill (MapR)

Hadoop Today: Enterprise Integration Example
[Integration architecture diagram; source: Hortonworks]

Hadoop Today: Summary
- The Hadoop ecosystem is growing rapidly; it has moved beyond batch MapReduce processing to support a wide range of different application use cases
- Classic software and hardware vendors have joined the race to support the use of Hadoop both on-premises and in the cloud
- Many of these classic vendors use distributions from "open source" suppliers
- Many leading-edge and traditional businesses have Hadoop projects in evaluation mode, and some in production; most of these projects are focused on specific LOB solutions

Hadoop Today: Key Questions
- Why use Hadoop? To replace existing enterprise systems, or to enhance them?
- What are the use cases for Hadoop?
- What are the TCO considerations for Hadoop?
- How mature is the Hadoop ecosystem?
- What are the skill requirements for Hadoop?
- Which Hadoop solution should we use?
- How do we integrate Hadoop with existing systems?

Driving Forces Behind Big Data and Hadoop
[Diagram: business DRIVERS and new TECHNOLOGIES behind big data]

New Business Insights: Customer Marketing
- Situational 1-to-1 marketing: reach individual customers with the right messages and offers
- Micro-segmentation
- Analyze all channels: web, stores, call centers, purchases, buying patterns
- Analyze other information for influential factors: geography, weather
- Customer experience management: make all experiences beneficial to the customer and the business
- Customer perception management: analyze trends in social channels and respond appropriately
- In all cases, analysts need to be able to move from analyzing past events to predicting future outcomes

New Business Insights: Fraud Detection
[Illustrative diagram]

New Business Insights: The Internet of Things
Further reading: GE document, "Industrial Internet: Pushing the Boundaries of Minds and Machines"

New Technologies: eXtended Data Warehouse
[Architecture diagram comprising: analytic tools & applications; investigative computing platform; traditional EDW environment; data integration platform; operational systems; data refinery; other internal & external structured & multi-structured data; real-time streaming data; RT analysis platform; RT BI services; operational real-time environment]

Two Important New Components
Investigative Computing Platform
- Used for exploring data and developing new analyses and analytic models
- Output is used by an enterprise DW, a real-time analysis engine, or a standalone LOB application
- May employ an RDBMS or Hadoop
Data Refinery
- Ingests raw detailed data in batch and/or real time into a managed data store
- Distills the data into useful information and distributes the results to other systems
- The primary use of Hadoop today
[Diagram inputs include EDW data, operational data, other internal & external data and real-time streaming data; outputs include analyses, models & rules and applications]
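The refinery pattern above (ingest raw data, distill it, pass results downstream) can be sketched in a few lines. This is a toy illustration, not Teradata's or any vendor's implementation; the event fields and filtering rules are invented for the example:

```python
import json

# Raw, multi-structured input as it might land in a refinery's landing zone
raw_events = [
    '{"user": "u1", "action": "purchase", "amount": "19.99"}',
    'corrupted line',
    '{"user": "u2", "action": "view"}',
    '{"user": "u1", "action": "purchase", "amount": "5.00"}',
]

def refine(lines):
    # Distill raw events into clean purchase records; discard anything
    # that fails parsing or lacks the fields downstream systems need.
    refined = []
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip (or quarantine) malformed input
        if event.get("action") == "purchase" and "amount" in event:
            refined.append({"user": event["user"],
                            "amount": float(event["amount"])})
    return refined

records = refine(raw_events)
# Only the two well-formed purchase events survive refinement
```

The distilled `records` would then be distributed to the EDW or an LOB application, which is the hand-off the slide describes.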

The Role of Investigative Computing
- Enables data scientists and analysts to blend new types of data with existing information to discover ways of improving business processes
- Allows data scientists and analysts to experiment with different types of data and analytics before committing to a particular solution
- May employ an analytic sandbox, an analytic platform or a data refinery
- Results may include data schemas, analyses, analytic models, business rules, decision workflows, dashboards, LOB applications, etc.
- Represents a shift in the way organizations build analytic solutions:
  - Increases flexibility and provides faster time to value, because data does not have to be modeled or integrated into an EDW before it can be analyzed
  - Extends traditional business decision making with solutions that increase the use and business value of analytics throughout the enterprise

Teradata Example: Identify/Retain "At Risk" Users
[Architecture diagram; source: Teradata. Recoverable elements:]
- Multi-structured raw data: call center voice records, customer feedback, call data, social feeds, web and mobile clickstream
- Hadoop: capture, retain and refine layer for raw sentiment data; stores and transforms social, image and call-record data
- Traditional data flow: POS, web sale, mobile sale and customer data flow through ETL tools into the integrated DW (dimensional data, item master)
- Aster Discovery Platform: web sessionization, path and basic sentiment analysis with multi-structured data
- Aster pre-built operators: sessionization, nPath, many-to-many basket and affinity, collaborative filtering for recommendations

Hadoop Today: Key Questions Revisited - 1
Why use Hadoop?
- Replace existing enterprise systems
- Enhance existing enterprise systems
What are the use cases for Hadoop?
- Data refinery (including archiving)
- Investigative computing platform for analyzing large volumes of raw data (especially multi-structured data) for specific LOB solutions
What are the TCO considerations for Hadoop?
- Need to consider more than just hardware and software costs
- Other factors include training, development, administration and support costs, and floor space and utility requirements

Hadoop Today: Key Questions Revisited - 2
How mature is the Hadoop ecosystem?
- Still immature (especially in the areas of governance and systems management), but improving rapidly
What are the skill requirements for Hadoop?
- Despite increasing SQL support, Hadoop still requires highly technical skills in areas such as large-scale Linux and Java
Which Hadoop solution should we use?
- Hadoop is not a single product but a set of different components that satisfy a variety of requirements
- The choice is between traditional and "open source" vendors
How do we integrate Hadoop with existing systems?
- A key issue: build an eXtended data warehouse infrastructure

Bottom Line: A Lot Has Changed in a Year!
- Cloudera: "Enterprise Data Hub Complements the Ecosystem" [diagram: data sources feed an enterprise data hub alongside relational and NoSQL databases, serving data applications and custom applications]
- Hortonworks: "HDP is Deeply Integrated in the Data Center"
- MapR: "Optimized Data Architecture"

The Role of Hadoop in BI and Data Warehousing
Chad Meley, VP of Product and Services Marketing
chad.meley@teradata.com

Key Trends
- Economics have changed, increasing the amount of data you can capture
- Tools have changed, expanding the types of analyses
- The framework has evolved so you can use the right tool for the right job
9/15/2014, Teradata Confidential

TERADATA UNIFIED DATA ARCHITECTURE
[System conceptual view. SOURCES (SCM, CRM, images, audio and video, machine logs, text, web and social) feed the DATA PLATFORM (Hadoop), the INTEGRATED DATA WAREHOUSE (Teradata Database) and the INTEGRATED DISCOVERY PLATFORM (Aster Database); ANALYTIC TOOLS & APPS (data mining, math and stats, languages) serve USERS such as customers, partners, data scientists and engineers]

NoSchema Advantages in Hadoop
- Load data first, and figure it out later
- Raw data format provides complete flexibility
- Non-traditional data types are easily supported (graph, text, weblog, etc.)
- The NoETL approach provides agility
- Late binding gives more power to the data scientist

NoSQL Advantages
- Flexibility in choice of programming languages
- Leverage existing programming skill sets
- Not constrained to the SQL set processing model
- More natural framework for manipulating non-traditional data types
- Efficiency for parallelization of complex processing (e.g., image processing, text parsing, etc.)
Example: Pattern Matching Analysis - discover patterns in rows of sequential data
- Example inputs: weblogs {user, page, time}; smart meters {device, value, time}; sales transactions {user, product, time}; stock tick data {stock, price, time}; call data records {user, number, time}
- MapReduce approach: single pass of the data, time series analysis, gap recognition
- Traditional SQL approach: full table scans, self-joins for sequencing, limited operators for ordered data
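The "single pass over ordered data" advantage is easiest to see with sessionization, one of the operators the deck mentions. A minimal illustrative sketch (the timeout value, field layout and function name are assumptions, not a real Aster or Hadoop operator):

```python
SESSION_TIMEOUT = 1800  # 30 minutes, a common sessionization cutoff

# (user, page, timestamp) clickstream rows, assumed sorted by (user, time)
clicks = [
    ("alice", "/home", 100),
    ("alice", "/cart", 400),
    ("alice", "/home", 5000),   # gap > timeout, so a new session starts
    ("bob",   "/home", 200),
]

def sessionize(rows, timeout=SESSION_TIMEOUT):
    # Single pass: compare each click with the previous one for the
    # same user; a gap larger than the timeout starts a new session.
    sessions, last_seen, counters = [], {}, {}
    for user, page, ts in rows:
        if user not in last_seen or ts - last_seen[user] > timeout:
            counters[user] = counters.get(user, -1) + 1
        last_seen[user] = ts
        sessions.append((user, page, ts, counters[user]))
    return sessions

result = sessionize(clicks)
# alice's third click lands in session 1; bob starts in session 0
```

Expressing the same gap logic in classic SQL requires self-joining the table to pair each row with its predecessor, which is the "full table scans, self-joins for sequencing" cost the slide contrasts this against.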

Unlock New Insights with Late Binding
Early Binding
- Data providers: evaluate data (quality, structure, source(s), meaning); define data structure (data model, data type, rules); collect data (ingest); apply structure (transform to defined structure)
- Data consumers: author questions (translate questions into scripts)
- Ideal for: reused & known data, consistent results, the masses
Late Binding
- Data providers: collect data (ingest)
- Data consumers: evaluate data (quality, structure, source(s), meaning); define data structure (data model, data type, rules); apply structure & author questions (transform to the defined structure and translate questions into a single script)
- Ideal for: unfamiliar & unknown data, infrequent usage, unstable source schema
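Late binding is schema-on-read: store the raw bytes, and let each consumer impose structure at query time. A small illustrative sketch (the schema format and function are invented for the example, not a real product API):

```python
import csv
import io

# Early binding would transform or reject this data at load time.
# With late binding we keep the raw text and apply structure on read.
raw = "alice,2014-09-01,19.99\nbob,2014-09-02,5.00\n"

def read_with_schema(raw_text, schema):
    # Apply the consumer's schema at query time (schema-on-read):
    # each (name, cast) pair names and types one positional column.
    for row in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, row)}

# Two consumers could impose different structures on the same bytes;
# this one reads the data as sales records.
sales_schema = [("user", str), ("date", str), ("amount", float)]
rows = list(read_with_schema(raw, sales_schema))
```

If the source schema changes (a new column appears, say), only the consumer's schema definition changes; nothing has to be re-ingested, which is why the slide recommends late binding for unstable source schemas.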

MPP RDBMS and Hadoop - Right Tool for the Right Job
MPP RDBMS is the right tool for the job when there are increases in the number of:
- Analyses (concurrency, throughput, SLAs, ANSI SQL ease and maturity)
- Integrated data sources (high IO, access complexity for joins, groupings, seeks)
- Reuse of data (schema-on-write, business rule changes, governance)
And with needs for: fine-grain security, data quality and integrity, high availability, fast response times
MPP RDBMS is cost advantaged when the above is true: development costs, maintenance costs, usage costs
Hadoop is the right tool for the job when there are decreases in analyses, integrated data sources and reuse of data, plus increases in:
- Data variety (no schema, evolving schema, sparse data)
- High-intensity batch computation (high CPU)
- Logic complexity (procedural language processing in parallel)
And with needs for: extreme data ingest rates, open source development
Hadoop is cost advantaged when the above is true: acquisition costs, development costs, usage costs
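The decision criteria above can be restated as a toy scoring heuristic. This is purely illustrative of the slide's trait lists, not a real sizing tool or a Teradata method, and every trait name is an assumption:

```python
def suggest_engine(workload):
    """Toy heuristic mirroring the slide's criteria: count how many
    workload traits favor an MPP RDBMS versus Hadoop."""
    rdbms_traits = {"high_concurrency", "many_integrated_sources",
                    "high_data_reuse", "fine_grain_security",
                    "fast_response_times", "high_availability"}
    hadoop_traits = {"no_schema", "evolving_schema", "sparse_data",
                     "batch_cpu_heavy", "procedural_logic",
                     "extreme_ingest_rates", "open_source_development"}
    rdbms_score = len(workload & rdbms_traits)
    hadoop_score = len(workload & hadoop_traits)
    return "MPP RDBMS" if rdbms_score >= hadoop_score else "Hadoop"

print(suggest_engine({"high_concurrency", "fast_response_times"}))
print(suggest_engine({"no_schema", "batch_cpu_heavy", "sparse_data"}))
```

A real evaluation would weigh the cost dimensions the slide lists (acquisition, development, maintenance, usage) rather than simply counting traits.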

Teradata QueryGrid
[Diagram: business users query from Teradata with push-down processing to other databases and to NoSQL databases; run SAS, Perl, Ruby, Python and R; some capabilities marked as Future]

Architectural Principles
Technology continues to evolve, but the principles remain the same.
Key principles:
- Data is more valuable when integrated
- Atomic data yields more insights than summary data
- Results increase as analytical capabilities mature from reporting to analyzing to predicting to operationalizing
- Amazing things happen when data is democratized throughout the enterprise
Teradata EDW: full parallelism, data modeling IP, experience, JSON data type, scale-out architecture, Intelligent Memory, AJIs & PPIs, hybrid columnar, parallel set processing, in-database analytics, temporal, geospatial, high availability / dual active, tactical queries, ANSI SQL, workload management, optimizer, Data Labs, high concurrency
Teradata UDA: orchestration between EDW and Hadoop (QueryGrid push-down processing, Unity), EDW 1:1 interactions in Hadoop, Aster SQL-MapReduce, parallel set processing, parallel procedural programming, streaming with analytics, HBase, MongoDB, value-add Hadoop engineering for reliability

Questions and Answers

Contact Information
If you have further questions or comments:
Colin White, BI Research: info@bi-research.com
Chad Meley, Teradata: chad.meley@teradata.com
