OCITA Spring Event

Transcription

OCITA Spring Event
Mike King, Enterprise Technologist, Big Data
Wright Patterson AFB; May 19, 2016

Acronym Key - Part 1
VLDB – Very Large Database
CDH – Cloudera Distribution for Hadoop
PK – Primary Key
EDH – Enterprise Data Hub
AK – Alternate Key
EDW – Enterprise Data Warehouse
COTS – Commercial Off-the-Shelf
ETL – Extract, Transform & Load
KV – Key Value
ELK – Elasticsearch, Logstash & Kibana
JSON – JavaScript Object Notation
XML – eXtensible Markup Language
BSON – Binary JSON
SQL – Structured Query Language
IoT – Internet of Things
CRM – Customer Relationship Management
JDBC – Java DataBase Connectivity
TPC – Transaction Processing Performance Council

Acronym Key - Part 2
SOA – Service Oriented Architecture
BDE – Big Data Extensions (VMware)
API – Application Programming Interface
FTE – Full Time Equivalent
CSV – Comma Separated Values
SIEM – Security Information and Event Management
RDBMS – Relational DataBase Management System
MQ – Message Queuing
MPP – Massively Parallel Processing
ERP – Enterprise Resource Planning
ML – Machine Learning
HA – High Availability
CoE – Center of Excellence
DBA – DataBase Administrator
HTTP – HyperText Transfer Protocol
DWFT – Data Warehouse Fast Track
HDFS – Hadoop Distributed File System
*aaS – anything as a Service

Big Data
Dell - Internal Use - Confidential


Trends Affecting Big Data
Technology
Virtualization: App, CM, Mgt, Client Tools
Consumption pattern
Cloud
Automation
Integration Tools
Data Science
Skills demand
*aaS
Data tsunami
IoT
Data, data & more data
Mobile
Analytics for all, & all
  – Varying needs
  – Three types
  – I, a, p, s, DB
  – You Name It
The profession
  – Needs
  – Roles
  – How to fill
Training

Big Data is really complex data, with needs that extend beyond the existing tool chain
Relational data (database)
Application data
MS Excel and MS Access
PDF, Word and text files
Facebook
Twitter
LinkedIn
Sensor data
Photos
Videos
Different data types, large volumes, varying speeds


Customer Success Stories

Customer lifetime value of Big Data
UK – online services
  – Jan 2013: 200 nodes
  – Always ran Hadoop; saw mega growth from Jan 2013: 200 nodes
  – Primary use case: Web 2.0 as core
  – Growth: Jan 2015: 800 nodes
G500 SI-Telco
  – Jan 2013: 150 nodes
  – Primary use case: Top Secret Government Work
  – Growth: Jan 2014: 150 nodes
US-based Telco
  – Feb 2013: 42 node POCs
  – Primary use case: Log Files, BDaaS, & Churn Analysis
  – Growth: March 2015: 2200 nodes
Financial Services
  – Nov 2013: 12 nodes POC
  – Primary use case: Log Files, Fraud Analysis, 360 Customer View
  – Growth: March 2015: 220 nodes
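The growth figures on this slide can be turned into simple multiples. The sketch below is only arithmetic on the node counts quoted above; the labels and the `growth_factor` helper are illustrative names, not part of the original deck.

```python
# Node counts as quoted on the slide: (label, starting nodes, later nodes).
clusters = [
    ("UK online services", 200, 800),
    ("G500 SI-Telco", 150, 150),
    ("US-based Telco", 42, 2200),
    ("Financial Services", 12, 220),
]


def growth_factor(start_nodes, end_nodes):
    """Return how many times larger the cluster became."""
    return end_nodes / start_nodes


# Round to one decimal place for a slide-friendly figure.
growth = {name: round(growth_factor(s, e), 1) for name, s, e in clusters}

for name, factor in growth.items():
    print(f"{name}: {factor}x")
```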

Why Dell?

Dell Differentiators
Services

Dell, A Very Different Provider
Why Dell is different
Modular
  – Plug ‘N play
Happy to fill in the gaps
Complete when we need to be
  – Servers, storage, networking, software & services
Products enhanced to work with Big Data
Solutions
  – Engineered
  – Custom

Dell & Hadoop – Performance Matters
#1 TPCx-HS Hadoop Price/Performance in the industry at scale factors of 1TB, 3TB, 10TB, and 30TB
#1 TPCx-HS Hadoop Performance in the industry at scale factor of 10TB (SF10)
  – The Dell Cloudera Reference Architecture for Hadoop provides the #1 TPCx-HS Hadoop Price/Performance in the industry at scale factors of 1TB, 3TB, 10TB, and 30TB
  – PowerEdge R730XD provides the #1 TPCx-HS Hadoop Performance in the industry at scale factor of 10TB
  – Up to 64% better TPCx-HS Price/Performance compared to Cisco at scale factor of 10TB
  – Up to 13% better price/performance compared to Huawei at scale factor of 1TB

Support & Services
DSC (for free)
  – Briefing
  – Architectural design session
  – POC
Prof Services (for fee)
  – Jumpstart
  – Select Use Cases
  – Custom engagements

BI, Analytics & Big Data Capabilities
Dell Blueprints

Services Offer Matrix
Columns: Service Offer, Team, Format, Service, Est Cost / Duration

1. Hadoop H/W Install
  – Team: EDT
  – Format: SKU for Quickstart; Custom SOW for RA
  – Service: Rack & Stack; Label / Cable
  – Est Cost / Duration: Priced by size; 4k; days
2. Cloudera Deployment
  – Team: EDT
  – Format: SKU for Quickstart; Custom SOW for RA
  – Service: O/S Install; Foundation services install; Configuration
  – Est Cost / Duration: 9k; days
3. Cloudera Basic Jumpstart
  – Team: GICS
  – Format: Repeatable SOW/SKU coming Q1; ALL Cloudera (QS & RA)
  – Service: Training; As-is / To-Be; Hands on labs; Roadmap Deliverable
  – Est Cost / Duration: 18k FF; 2 weeks onsite; 1 FTE
4. Cloudera Health Check
  – Team: GICS
  – Format: Repeatable SOW/SKU coming Q1; ALL Cloudera (QS & RA)
  – Service: Time-boxed cluster certification; Up to 2 clusters, 100 nodes; Cloudera best practice
  – Est Cost / Duration: 15k FF; 1 week onsite; 1 FTE
5. Hadoop Active Archive Proof of Concept
  – Team: GICS
  – Format: Repeatable SOW/SKU coming Q1; ALL Cloudera (QS & RA)
  – Service: Real world PoC using native tools (e.g., Hive, Sqoop, Flume) to demonstrate effective use case of Active Archive; Design, Development and non-prod deployment
  – Est Cost / Duration: 50k FF; 5 weeks; 1 FTE on/off site
6. Hadoop ETL/DW Offload Proof of Concept
  – Team: GICS
  – Format: Repeatable SOW/SKU coming Q1; ALL Cloudera (QS & RA)
  – Service: Real world PoC using native tools (e.g., Hive, Sqoop, Flume) to demonstrate effective use case of Active Archive; Design, Development and non-prod deployment
  – Est Cost / Duration: 50k FF; 5 weeks; 1 FTE on/off site
7. Custom Workload
  – Team: GICS
  – Format: Custom SoW
  – Service: Custom workload specific to Cloudera/Hadoop; Any deviation in scope from SKU offers
  – Est Cost / Duration: Custom Quote

Use Case Taxonomy

Use Cases by …
Pharmaceutical, Manufacturing
Anticipating customer needs
Reducing risk and detecting fraud
Improving patient care and reducing cost
Ensuring regulatory compliance and validation
Continuous ty and shelf s regression
Price optimization
SOX compliance
Quality of care
Manufacturing
Improved operations

Columns, in order: FSI; Healthcare; Manufacturing; Oil & Gas; Retail
Row 1: Fraud prevention in credits and payments; Quality of care optimization; Proactive quality assurance; Horizontal drilling enablement and optimization; Enablement of a 360 degree customer view
Row 2: Risk modeling in investments banking; Clinical quality and cost analysis; Analysis of demand for new products and services; Seismic data processing; Generation of personalized offers
Row 3: Cross-selling and upselling in retail banking; Genome processing and DNA sequencing; Product research guided by machine-generated data; Predicting where best to drill next; Enablement of first inbasket analysis
Row 4: Insurance policy personalization; Population health management; Detection of supply chain issues; Which leases do I sell?; Merchandising and supply chain analysis
Row 5 (four entries, in source order): Mortgage lending portfolio valuation; Identification of cross sell and upsell opportunities; Which sections should I acquire?; Isolation of products and mixes indicative of larger baskets
Row 6 (one entry): Detection of fraud and suspicious transactions

IT, Common, Finance, Banking
ETL offload / acceleration
Customer profile
Risk modeling
Risk arbitrage
Active archive
Log aggregation
Customer 360
Churn analysis
Agile data explore / self service BI / data lake
Quality improvement
SIEM
Security - intrusion detection, others
Cross-sell, upsell
Householding / matching
Reporting
Fraud analytics
Loyalty analysis
Profitability analytics
Cannibalization analysis
Currency hedging
Insurance policy personalization
Mortgage lending portfolio valuation
Interesting stuff done by competitors
My Bank
Your cool UC
Your even cooler UC

Skeleton Process
Define goals & objectives
Brainstorm use cases
Assess
  – Complete
  – Data
Cull
Assemble overall tech arch
  – RA
Gaps
  – Skills
  – Process
POC
  – One UC at a time
Rank › Learn › Adjust
Solution architecture
Improve
Feedback
  – How
  – What tools
  – Next UC
  – Repeat

Use Cases

Use Cases
Archive
  – Active
  – As needed
  – Platform retirement
ETL
  – COTS replacement
  – Performance, Parallelism
  – Enhancement
  – Functionality
  – Offload
  – Performant
  – License redux
Data Warehousing
Log
  – ELK, Flume
Messaging, Streaming
  – Kafka, Spark, Flink, Storm
Integration
  – Structured, Multi-structured, Variable
  – RDBMS, NoSQL, files
  – Public, private & hybrid cloud

Hadoop & Big Data Solutions

About Hadoop
SOA
  – Omnipresent
  – REST APIs
Logs
  – By product, program or not at all
  – CDH Ent – integrated for many
One-offs
  – Doable
  – Minimize
Plethora of choices
Mostly Open Source
Languages
  – Java
  – Python
  – R
  – Scala
Growth
Evolving
Contenders and pretenders
Customizable

About Hadoop Continued
Store tons of data
MPP SQL
  – All is now feasible
Scale
  – Horizontal
Mix disparate sources
Ingest
  – Bulk
  – Small batches
  – Real-time
ML
Predictive analytics
Architecture
  – Enterprise
  – Technical
  – Solution
Structure
  – Strongly typed
  – Semi
  – Multi

COTS Replacements with Hadoop
ETL
  – Informatica
  – Ab Initio
  – DataStage
Data Archiving
  – Strong ERP
  – Informatica ILM
  – Applimation
  – IBM Optim
  – eton Softech
  – Solix
Messaging
  – Tibco EMS
  – IBM MQ
  – MSMQ
SIEM

Data Sources
Types
  – Public, private, purchased
Sources & Sinks
  – Flume to HDFS
  – Flume to Kafka to HDFS
  – HTTP to HBase
Channels
  – JDBC
  – Memory
  – File
  – Custom
Sources
  – Databases
  – Apps: ERP, CRM, other purchased, custom
  – Files
  – Messages
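The source, channel and sink idea on this slide can be illustrated without any Hadoop tooling. The following is a toy sketch in plain Python, not Flume itself: `MemoryChannel`, `run_pipeline` and the list standing in for HDFS are hypothetical names used only to show how a bounded memory channel buffers events between a source and a sink.

```python
from collections import deque


class MemoryChannel:
    """Bounded in-memory buffer that sits between a source and a sink."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.events = deque()

    def put(self, event):
        """Accept an event, or return False when the channel is full."""
        if len(self.events) >= self.capacity:
            return False
        self.events.append(event)
        return True

    def take(self, batch_size):
        """Remove and return up to batch_size events, oldest first."""
        batch = []
        while self.events and len(batch) < batch_size:
            batch.append(self.events.popleft())
        return batch


def run_pipeline(lines, channel, sink, batch_size=2):
    """Feed lines into the channel and drain them to the sink in batches."""
    for line in lines:
        # When the channel is full, drain a batch to make room, then retry.
        while not channel.put(line):
            sink.extend(channel.take(batch_size))
    # Flush whatever is still buffered once the source is exhausted.
    while channel.events:
        sink.extend(channel.take(batch_size))


log_lines = ["GET /a 200", "GET /b 404", "POST /c 500"]
hdfs_stand_in = []  # hypothetical sink; a real one would write to HDFS
run_pipeline(log_lines, MemoryChannel(capacity=2), hdfs_stand_in)
print(hdfs_stand_in)
```

Swapping the memory channel for a file or JDBC backed one changes durability, not the shape of the pipeline, which is the point of listing channels separately from sources and sinks.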

Ingestion
File transfer
HDFS client
Sqoop
Flume
Kafka
Custom
SharePlex Connector for Hadoop
Boomi

Skills, Training & Languages
Skills
  – Inventory
  – Needs
  – Gaps
  – Buy, rent, grow
  – CoE
  – Mentor
ials
  – For free
  – For small fee
Drivers license
Cheat sheets
Languages
  – Not just one
  – Which one(s)?
    › Java
    › R
    › Python
    › Scala
  – Shape usage
  – Justify choices

Big Data and Analytics Blueprint Portfolio
Blueprint for Big Data & Analytics
Dell Software Suite
  – Statistica Data Analytics Suite
  – Dell Boomi Integration Tools
  – Dell Toad Data Management
  – Dell SharePlex Replication Connector for Hadoop
Reference Architectures
  – Dell Cloudera Apache Hadoop Solution on R730XD: Start and up to 15 Nodes, Scales to 445 nodes, Scales 45 nodes
  – SQL DWFT: Start with 730/PS6210S to 17TB, Scales on 730xd to 21TB, Scales on 730/PS6210S to 26 TB, Scales on 730/SC4020 to 55TB
  – Dell Cloudera Syncsort Data Warehouse Optimization for ETL Offload RA (June 19, 2015)
Engineered Solutions
  – Microsoft APS Appliance: PDW: 3 nodes, Scales PDW Hadoop to 6 nodes, Scales PDW Hadoop 9 – 54 nodes
  – Dell QuickStart 5.5 for Cloudera Hadoop: 5 nodes
  – SAP HANA Appliance: Single Server configurations scale from 128GB – 1.5 TB RAM; Scale Out cluster configurations scale from 2-16TB RAM (up to 24TB w/R930 – due September, 2015)
Services
  – Consulting
  – Deployment
  – Custom Solution Architecture
  – Training: Bundled
  – ProSupport Plus
RA Implementations: Engage your Big Data Overlay Sales Team

Dell Hadoop Solution Offerings Summary
Dell QuickStart 5.6 for Cloudera
  – Includes all hardware/software/services
  – Cloudera Enterprise Support
  – 5 Nodes & NW: Full PoC for 150K
  – PoC easily upgraded to Production
Dell Cloudera 5.6 Solution
  – Proven & tested Reference Architecture
Dell Cloudera Syncsort Data Warehouse Optimization for ETL Offload
  – Foundational design with customizable components
  – Enables organizations to lower data transformation costs
  – Robust, Enterprise-ready solution
  – Builds operational efficiencies for laying a strong, cost-effective, secure, scalable and robust solution for managing data
  – Massive, modular scalability
  – Builds foundation to mature into advanced data analytics

Dell QuickStart for Cloudera Hadoop
Easy starting point for a complete Big Data solution
Dell QuickStart for Cloudera Hadoop delivers a full Hadoop cluster to start you on the pathway to taking control of Big Data
  – Brings a full Hadoop proof of concept into organizations to allow them to begin to develop expertise
  – Delivers Hadoop capabilities for a low entry price
  – Incorporates full support from the experts as you take the first steps with Hadoop
  – Teaches how to implement data collection, data management and data analytics to enable sophisticated strategies to build value for business
  – Includes professional services to help you get started
  – Ideal for pre-production use cases
Get started today with Dell QuickStart for Cloudera Hadoop for a fully supported Hadoop solution with hardware, software, training and services
Key Benefits
  – Easy: Dell QuickStart for Cloudera Hadoop includes all hardware, software, training and services
  – Affordable: Build a full Hadoop environment for under 110K
  – Flexible: Easily upgrades to a full production cluster

Dell Cloudera Apache Hadoop 5.5 Solution, accelerated by Intel
Proven Hadoop Distribution for the Enterprise
Dell Cloudera Hadoop Solution for Big Data
Key differentiation & innovations
  – A robust end-to-end Hadoop solution
  – A solution built on experience, partnership, and innovation and tested and validated Reference Architectures
Value proposition
  – A secure end-to-end data management solution
  – To collect, mine, manage and analyze data
  – Gain valuable business insights for unique competitive advantages
Target market
  – All organizations from small, to medium and large enterprises, across all verticals
Better Together
  – Dell Cloudera Intel for industry-leading, secure, infrastructure-optimized Hadoop solutions
  – Streamlined to search, process, manage, and analyze all data
Important updates in Cloudera 5.6 on 13G
  – Running on the PowerEdge R730xd
  – Updates to Cloudera Search
  – The release of Impala 2.0 that integrates Apache Spark into the platform and drives better batch processing with Spark 2.1 as the processing engine

Blueprint for Big Data & Analytics
Dell Cloudera Syncsort Data Warehouse Optimization for ETL Offload Reference Architecture
The first and only reference architecture for ETL offload with Hadoop
Scalable ETL with the flexibility of a Reference Architecture
  – Scale Out hardware architecture: PowerEdge R730, R730xd, and high performance Dell S-Series Networking.
  – Tight integration between Dell, Cloudera and Syncsort provides ease of deployment and maintenance with no performance impact or hurdles down the road.
  – Close the Skills Gap by eliminating the need to develop expertise on MapReduce, Pig, Hive, and Sqoop.
  – Fast Track Projects with automated conversion of legacy SQL scripts into efficient ETL processes in Hadoop without any coding.
  – Comprehensive and collaborative service and support for the entire solution through its complete lifecycle.
The Dell Difference
  – Faster time to value through an optimized solution jointly designed by three market leaders.
  – Detailed Reference Architecture Documentation
  – Deployment guidelines detail best practices based on extensive experience with production deployments
Cloudera Enterprise
DMX-h
Link to Dell Cloudera Syncsort DWO – ETL Offload RA
Dell Blueprints

NoSQL

NoSQL Database Types
Four types
  – Columnar: HBase, Cassandra
  – Document: MongoDB, Couchbase
  – KV: Riak, Redis
  – Graph: Neo4j, Titan
How many do you need?
  – By type
  – Within type
Who will manage them?
  – DBAs
How do you access them?
  – SQL, NoSQL
  – Sequential
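To make the four types concrete, here is one small customer record expressed in each model using plain Python structures. This is an illustrative sketch only: the record, its field names and the `products_bought` helper are made up for the example, and no real database client is involved.

```python
import json

# Key-value (Riak, Redis style): an opaque value fetched by a single key.
kv_store = {"customer:42": json.dumps({"name": "Ada", "city": "Dayton"})}

# Document (MongoDB, Couchbase style): a nested structure that can be queried.
document = {"_id": 42, "name": "Ada", "orders": [{"sku": "R730", "qty": 2}]}

# Columnar / wide-column (HBase, Cassandra style):
# row key -> column family -> column -> value.
wide_column = {"42": {"profile": {"name": "Ada"}, "orders": {"R730": 2}}}

# Graph (Neo4j, Titan style): nodes plus typed edges between them.
nodes = {42: {"label": "Customer", "name": "Ada"}, "R730": {"label": "Product"}}
edges = [(42, "BOUGHT", "R730")]


def products_bought(customer_id):
    """Follow BOUGHT edges from a customer node to product nodes."""
    return [dst for src, rel, dst in edges
            if src == customer_id and rel == "BOUGHT"]


print(json.loads(kv_store["customer:42"])["city"])
print(document["orders"][0]["qty"])
print(wide_column["42"]["orders"]["R730"])
print(products_bought(42))
```

The same facts are stored four ways; what differs is which access path is cheap, which is why the slide asks how many types are actually needed.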

NoSQL background, issues and considerations
History
  – Google Bigtable, Amazon Dynamo
What does schema-less mean?
  – On read
  – Still structured
  – Embedded
  – Can vary between records
Languages & formats used
  – Java, Python
  – JSON, BSON, XML, CSV
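The "schema-less" points above (schema on read, still structured, embedded, can vary between records) can be shown with a few JSON lines. This is a minimal sketch with invented sample records; `read_with_schema` is a hypothetical helper, not an API from any NoSQL product.

```python
import json

# Three stored records; the fields vary between records and one is nested.
raw_records = [
    '{"id": 1, "name": "sensor-a", "temp_c": 21.5}',
    '{"id": 2, "name": "sensor-b", "location": {"site": "lab"}}',
    '{"id": 3}',
]


def read_with_schema(line):
    """Apply structure at read time; absent fields fall back to defaults."""
    doc = json.loads(line)
    return {
        "id": int(doc["id"]),
        "name": doc.get("name", "unknown"),
        "temp_c": doc.get("temp_c"),                  # None when absent
        "site": doc.get("location", {}).get("site"),  # embedded field
    }


rows = [read_with_schema(line) for line in raw_records]
for row in rows:
    print(row)
```

Nothing was enforced when the records were written; the reader decides which fields matter and how to handle the gaps.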

NoSQL background, issues and considerations continued
Eric Brewer’s CAP theorem
  – Can’t do all three.
What does NoSQL really mean?
  – Distributed, shared-nothing aggregate oriented database
  – “Not only SQL” versus “No”
What are the factors for the various choices?
  – Best fit
  – Use case(s)
  – KV
  – HA, Multi-site
  – Network
  – Sharding
  – Kevin Bacon
  – Partitioning
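Sharding and partitioning usually come down to mapping a key to one of N nodes. The sketch below shows one common approach, hash-based partitioning, under the assumption of a fixed shard count; `shard_for` and `partition` are illustrative names, and real products differ in how they hash and rebalance.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard using a stable hash.

    Python's built-in hash() is salted per process for strings,
    so a cryptographic digest is used to keep placement repeatable.
    """
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def partition(keys, num_shards):
    """Group keys by the shard they are assigned to."""
    shards = {i: [] for i in range(num_shards)}
    for key in keys:
        shards[shard_for(key, num_shards)].append(key)
    return shards


keys = [f"user:{i}" for i in range(1000)]
shards = partition(keys, 4)
for shard_id, members in shards.items():
    print(shard_id, len(members))
```

A plain modulo scheme like this moves most keys when the shard count changes, which is one reason distributed stores often use consistent hashing or range partitioning instead.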

RDBMS versus NoSQL
RDBMSs
  – Large user populations
  – Structured
  – Static schema
  – Strong typing
  – Access by PK, AK, indexes
  – Complex structures
  – Feature rich
  – Multi-purpose, shared by apps
  – OLTP
  – Complex queries
  – Small to medium sized DBs
  – 3 way joins
  – ACID
  – Challenging, costly scalability
  – SQL
  – COTS packages
  – Datamarts
NoSQL DBs
  – Small user populations
  – Multi-structured, Semi-structured
  – Schema evolution
  – Weak typing
  – Mostly random access by PK
  – Simple structures
  – Bare bones functionality
  – Single purpose/use case, not shared by apps
  – Not transactional
  – Simple queries
  – VLDB, XL DB size
  – Few or no joins
  – Horizontal scalability
  – Proprietary, different access verbs/methods
  – Custom applications
  – BASE
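The contrast between "joins over a static schema" and "random access by PK with few or no joins" can be seen side by side. This sketch uses Python's built-in `sqlite3` module for the relational half and a plain dict as a stand-in for a key-value store; the table names and sample rows are invented for illustration.

```python
import sqlite3

# Relational style: a static, typed schema and a join across two tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, sku TEXT);
    INSERT INTO customer VALUES (1, 'Ada');
    INSERT INTO orders VALUES (10, 1, 'R730');
    INSERT INTO orders VALUES (11, 1, 'S4048');
""")
joined = conn.execute(
    "SELECT c.name, o.sku FROM customer c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
).fetchall()

# Aggregate-oriented style: the whole customer is stored under one key,
# so a single lookup by primary key replaces the join.
kv = {"customer:1": {"name": "Ada", "orders": ["R730", "S4048"]}}
aggregate = kv["customer:1"]

print(joined)
print(aggregate["orders"])
conn.close()
```

The relational form answers many different questions from the same tables; the aggregate form answers one access pattern quickly, which matches the "single purpose, per database" point.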

NoSQL Commonalities
Mostly open source
Weak typing
Multi-structured
Horizontal scale
No standardization
VLDB
Single purpose, per database

NoSQL Differences
Access APIs
Formats supported
Security
Features
Persistence
Management
Programmability
Administration
Schemas?
VLDB
Performance & tuning
Resource consumption
Language bindings

How are NoSQL databases typically used?
As an adjunct to Hadoop
As a partial replacement for some RDBMS workloads
To scale linearly
As a data store for semi-structured and multi-structured data

Enterprise Architecture

EA - TOGAF
Assessment
Current State
Future State
Transition
Gaps
Challenges
Issues

Fixtures & Architecture Definition
Examples
  – Oracle DB
  – Oracle EBS
  – ELA
Architecture
  – Modular
  – Solution
  – Reference
Guidelines
Engineered Solutions
Blueprints

Solution Architecture

[Figure: solution architecture diagram. Recoverable labels: Analytical, Execution, End-Point; Business Intelligence, PoS, Business Reporting, Data Marts, Patient Records, Email; HDFS, HBase, Flume; Analytics, Discovery; Search Queries (Research, Marketing), Natural Language Search, Ad-Hoc Reports, Advanced Analytics]

Customer Churn Analysis
[Figure: three-stage flow. 1. Source, integrate, aggregate & transform; 2. Analyze; 3. Act. Sources: Cloud Data, Calendar Events, Stock Market Data, Marketing Campaigns, Sales Campaigns, Transactional Patterns, Point-of-Sale. Tools: Dell Boomi (integrate and correlate), Toad Intelligence Central, Toad Data Point (integrate and cleanse), Dell Statistica, Dell Statistica Big Data (crawl and save). Actions: Targeted e-mails, Offer Redemption, Customized Product Offerings, Coupons. Layers: Sources, Services, Management, Security, Design/Deploy. Dell Blueprints]


SIEM
  – ArcSight
  – Logility
  – Splunk
  – LogRhythm