Big Data Platforms, Tools, And Research At IBM

Transcription

IBM ResearchBig Data Platforms, Tools, and Research at IBMEd PednaultCTO, Scalable AnalyticsBusiness Analytics and Mathematical Sciences, IBM Research 2011 IBM Corporation

Please Note:IBM’s statements regarding its plans, directions, and intent are subject to change orwithdrawal without notice at IBM’s sole discretion.Information regarding potential future products is intended to outline our general productdirection and it should not be relied on in making a purchasing decision.The information mentioned regarding potential future products is not a commitment,promise, or legal obligation to deliver any material, code or functionality. Informationabout potential future products may not be incorporated into any contract. Thedevelopment, release, and timing of any future features or functionality described for ourproducts remains at our sole discretion.Performance is based on measurements and projections using standard IBM benchmarks in acontrolled environment. The actual throughput or performance that any user willexperience will vary depending upon many factors, including considerations such as theamount of multiprogramming in the user's job stream, the I/O configuration, the storageconfiguration, and the workload processed. Therefore, no assurance can be given that anindividual user will achieve results similar to those stated here.

IBM Big Data Strategy: Move theAnalytics Closer to the DataNew analytic applications drive therequirements for a big data platform Integrate and manage the fullvariety, velocity and volume of data Apply advanced analytics toinformation in its native formAnalytic ApplicationsBI /Exploration / Functional Industry Predictive ContentBI /Reporting VisualizationAppAppAnalytics AnalyticsReportingIBM Big Data PlatformVisualization& Discovery Visualize all available data for adhoc analysis Development environment forbuilding new analytic use Workload optimization andscheduling Security and Governance3Information Integration & Governance 2012 IBM Corporation

Big Data Platform - Hadoop SystemInfoSphere BigInsights Augments open source Hadoopwith enterprise capabilities– Enterprise-class storage– Security– Performance Optimization– Enterprise integration– Development tooling– Analytic Accelerators– Application and industry accelerators– VisualizationHadoopSystem

Enterprise-Class Storage and Security IBM GPFS-SNC (Shared-Nothing Cluster) parallel file systemcan replace HDFS to provide Enterprise-ready storage– Better performance– Better availability No single point of failure– Better management Full POSIX compliance, supports multiple storage technologies– Better security Kernel-level file system that can exploits OS-level security Security provided by reducing the surface area and securingaccess to administrative interfaces and key Hadoop services– LDAP authentication and reverse-proxy support restricts access toauthorized users Clients outside the cluster must use REST HTTP access– Defines 4 roles not available in Hadoop for finer grained security: System Admin, Data Admin, Application Admin, and User Installer automatically lets you map these roles to LDAP groups and users– GPFS-SNC means the cluster is aware of the underlying OS securityservices without added complexity

Workload OptimizationOptimized performance for big data analytic workloadsAdaptive MapReduceTaskHadoop System Scheduler Algorithm to optimize execution time ofmultiple small and large jobs Identifies small and large jobs fromprior experience Performance gains of 30% reduceoverhead of task startup Sequences work to reduce overheadMapAdaptive MapReduce(break task into small parts)(optimization —order small units of work)(many results to asingle result set)

Big Data Platform - Stream Computing Built to analyze data inmotion Multiple concurrent input streams Massive scalabilityStreamComputing Process and analyze avariety of data Structured, unstructured content,video, audio Advanced analytic operators

InfoSphere Streams exploits massive pipelineparallelism for extreme throughput with low latencytupledirectory: directory: directory: directory:”/img" ”/img"”/opt" ”/img"filename: filename: filename: filename:“farm” “bird” “java” 1024data:height:640width:480data:

Video analyticsExample – Contour DetectionOriginal PictureContour Detection Use B&W threshold pictures tocompute derivatives of pixels Used as a first step for other moresophisticated processingWhy InfoSphere Streams? Scalability Composable with other analytic streams Very low overhead from Streams –pass 200-300 fps per core – onceanalysis added processing overheadis high but can get 30fps on 8 cores

Big Data Platform - Data Warehousing Workload optimized systems– Deep analytics appliance– Configurable operational analytics appliance– Data warehousing software Capabilities Massive parallel processing engine High performance OLAP Mixed operational and analytic workloadsDataWarehouse

Deep Analytics Appliance - Revolutionizing AnalyticsPurpose-built analytics applianceSpeed:10-100x faster thantraditional systemsSimplicity:Minimal administrationand tuningScalability:Peta-scale user datacapacitySmart:High-performanceadvanced analyticsDedicated HighPerformanceDisk StorageBlades WithCustom FPGAAccelerators

Netezza is architected for high performance onBusiness Intelligence (OLAP) workloads Designed to processes data it at maximum disk transfer rates Queries compiled into C and FPGAs to minimize overheadR ticsPartnerADE or Blades PartnerADE or IDENetworkFabricClients

Discovering Patterns in Big Data using InDatabase Analytic Model BuildingIBM Netezzadata warehouse applianceModelLARGEDATA SETAnalyticsModelAnalyticWorkbenchBuilding ModelBuilding HostModelLARGEDATA SETAnalyticsModelLARGEDATA SETAnalyticsS-Blades Disk Enclosures

Big Data Platform - Information Integrationand Governance Integrate any type of data tothe big data platform– Structured– Unstructured– Streaming Governance and trust forbig data– Secure sensitive data– Lineage and metadata of new bigdata sources– Lifecycle management to controldata growth– Master data to establish singleversion of the truthInformation Integration & Governance

Leverage purpose-built connectors formultiple data sourcesConnect any type of data through optimized connectorsand information integration capabilitiesStructuredUnstructuredStreaming Massive volume of structured data movement 2.38 TB / Hour load to data warehouse High-volume load to Hadoop file system Ingest unstructured data into Hadoop file system Integrate streaming data sourcesBig DataPlatform

InfoSphere DataStage for structured dataRequirementsIntegrate, transform and deliverdata on demand across multiplesources and targets includingdatabases and enterpriseapplicationsDataStage Integrate and transformmultiple, complex, anddisparate sources ofinformation Demand for data isdiverse – DW, MDM,Analytics, Applications,and real timeBenefits Transform and aggregateany volume of information Deliver data in batch or realtime through visuallydesigned logic Hundreds of built-intransformation functions Metadata-drivenproductivity, enablingcollaboration16Hutchinson 3G (3) in UK Up to 50% reduction intime to create ETL jobs.

The Orchestrate engine originally developed by TorrentSystems with funding from NIST provides parallel processingDataStage process definitionData sourceDataflows can bearbitrary DAGsTargetClean 1ImportAnalyzeMergeClean 2Centralized Error Handlingand Event LoggingConfiguration FileDeployment and ExecutionParallel access to targetsParallel pipeliningClean 1ImportMergeClean 2Parallel access to sourcesAnalyzeInter-node communicationsParallelization of operationsInstances of operators run in OS-level processes interconnected by shared memory/sockets

We connect to EVERYTHINGRDBMSGeneral AccessStandards & Real TimeLegacyDB2 (on Z, I, P or X series)OracleInformix (IDS and XPS)MySQLNetezzaProgressRedBrickSQL ServerSybase (ASE & IQ)TeradataHP NeoViewUniverseUniDataGreenplumPostresSQLAnd more .Sequential FileComplex Flat FileFile / Data SetsNamed PipeFTPExternal Command CallParallel/wrapped 3rd party appsWebSphere MQJava Messaging Services (JMS)JavaDistributed TransactionsXML & XSL-TWeb Services (SOAP)Enterprise Java Beans /DBBold / Italics indicatesAdditional charge item Enterprise ApplicationsCDC / ReplicationJDE/PeopleSoftEnterpriseOneOracle eBusiness SuitePeopleSoft EnterpriseSASSAP R/3 & BISAP XISiebelSalesforce.comHyperion EssbaseAnd more DB2 (on Z, I, P, X series)OracleSQL ServerSybaseInformixIMSVSAMADABASIDMS3rd party adapters:Allbase/SQLC-ISAMD-ISAMDS MumpsEnscribeFOCUSImageSQLInfomanKSAMM204MS AnalysisNomadNonStopSQLRMSS2000And many more .

InfoSphere Metadata Workbench See all the metadata repositorycontent with InfoSphere MetadataWorkbench It is a key enabler to regulatorycompliance and the IBM DataGovernance Maturity Model It provides one of the mostimportant view to business andtechnical people: Data Lineage Understand the impact of achange with Impact Analysis Cross-tool reporting on:– Data movement– Data lineage– Business meaning– Impact of changes– Dependencies– Data lineage for BusinessIntelligence Reports19Web-based exploration of Information Assetsgenerated and used by InfoSphere InformationServer components

Data Lineage Traceability across business and IT domains Show relationships between business terms, data modelentities, and technical and report fields Allows business term relationships to be understood Shows stewardship relationships on business terms Lineage for DataStage Jobs is always displayed initially at asummary “Job” level20

Data Lineage Extender Support governance requirements for business provenance Extended visibility to enterprise data integration flows outside of InfoSphereInformation Server Comprehensive understanding of data lineage for trusted information Popular business use cases– Non-IBM ETL tools and applications– Mainframe COmmon Business-Oriented Language (COBOL) programs– External scripts, Java programs, or web services– Stored procedures– Custom transformationsExtendedData Source21ExtendedMappingInfoSphere DataStageJob Design

Lineage tracking with BigInsights Extension Points easy to define for BigInsights sources and targets InfoSphere Metadata Workbench can show lineage/impact of attributes and jobsfrom end-to-end. For this scenario, the current Roadmap includes Better characterization of the metadata of BigInsights data sets and jobs Import of the metadata into Information Server Complete metadata analysis features

Big Data Platform - User Interfaces Business Users Visualization of a large volume and widevariety of data Developers Similarity in tooling and languages Mature open source tools withenterprise capabilities Integration among environments Administrators Consoles to aid in systems managementVisualization& DiscoveryApplicationDevelopmentSystemsManagement

Visualization - Spreadsheet-style user interface Ad-hoc analytics for LOB user Analyze a variety of data unstructured and structured Browser-based Spreadsheet metaphor forexploring/ visualizing dataGatherExtractExploreIterateCrawl – gather statisticallyAdapter–gather dynamicallyDocument-level infoCleanse, normalizeAnalyze, annotate, filterVisualize resultsIterate through any priorstep

Object-oriented APIsfor implementing dataparallel algorithmsclass MyAlgorithm rgeTasks()endIteration()}MapperWith a Netezza-based control ObjectObjectResultstheparallelfilesystemwould bereplaced with database tables, Can use eitherSerializedK-MeansExample:objectdefaultis Javaeachk sthe and gatingreturnpartitionstruecentermappersif thenotandnew-centerconvergedupdate objectsthe wouldremainthe acrossaccumulatorsHadoop-based and Netezzabased implementationsMapperHDFS / GPFSclass MyAlgorithm {initializeTask()initializeTask()class MyAlgorithm teration()endIteration()}class MyAlgorithm rgeTasks()endIteration()}class MyAlgorithm rgeTasks()endIteration()}Mapperclass MyAlgorithm rgeTasks()endIteration()}Mapper and both themappers class MyAlgorithm rgeTasks()endIteration()}Reducer and reducerswould be replacedwith UDAPs (UserDefined AnalyticProcesses), class MyAlgorithm rgeTasks()endIteration()}Reducer

Objects can be connected into workflows with theirdeployment optimized using semantic properties D 5*(B’*A A*C)B– Transpose BasicOnePassTask Can execute in Mapper or Reducer– Add (matrix add) BasicOnePassKeyedTask Executes in Reducer and can bepiggybackedCTranspose– MM (matrix multiply) BasicOnePassMergeTask Has Map and Reduce componentsAB’MAPMMMMB’*AA*CREDUCE– Multiply (scalar multiply) BasicOnePassTask Can execute in Mapper or Reducer Entire computation can beexecuted in one map-reduce jobdue to differentiation of BasicTasksAddMultiply

SystemML compiles an R-like language intoMapReduce jobs and database jobsExample OperationsX*Y cell-wise multiplication: zij xij yijA B * (C / D)X/Y cell-wise division: zij xij/yijBinary hopMultiplyBBinary hopDivideBinary lopMultiplyCDR1Group lopBinary lopBM1DivideMR JobGroup lopCLanguageInput DML parsedinto statement blockswith typed variablesHOP ComponentEach high-level operatoroperates on matrices,vectors and scalarsDLOP ComponentEach low-level operatoroperates on key-value pairsand scalarsRuntimeMultiple low-leveloperators combined in aMapReduce job

Approximately thirty data-parallel algorithms have beenimplemented to date using these and related APIs Simple Statistics– CrossTab– Descriptive Statistics Clustering––––K-Means ClusteringKernel K-MeansFuzzy K-MeansIclust Dimensionality Reduction––––Principal Components AnalysisKernel PCANon-Negative Matrix FactorizationDoubly-sparse NMF Graph Algorithms–––––Connected Graph AnalysisPage RankHubs and AuthoritiesLink DiffusionSocial Network Analysis(Leadership) Regression Modeling––––––Linear RegressionRegularized Linear ModelsLogistic RegressionTransform RegressionConjugate Gradient SolverConjugate Gradient Lanczos Solver Support Vector Machines– Support Vector Machines– Ensemble SVM Trees and Rules–––––Adaptive Decision TreesRandom Decision TreesFrequent Item Sets - AprioriFrequent Item Sets - FP-GrowthSequence Mining Miscellaneous– k-Nearest Neighbors– Outlier Detection

MARIO incorporates AI planning technology to enable ease of useGoal-based Automated Composition (Inquiry Compilation) In response to a simpleprocessing request MARIO automatically assembles analyticsinto a variety of real-time situationalapplications29

MARIO incorporates AI planning technology to enable ease of useGoal-based Automated Composition (Inquiry Compilation) In response to a simpleprocessing request MARIO automatically assembles analyticsinto a variety of real-time situationalapplications Deploys application components acrossmultiple platforms, establishes interplatform dataflow connections Initiates continuous processing of flowingdata Manages crossplatform operation

Big Data Platform - Accelerators Analytic accelerators– Analytics, operators, rule sets Industry and HorizontalApplication Accelerators– Analytics– Models– Visualization / user interfaces– AdaptersAccelerators

Analytic Accelerators Designed for Variety

Accelerators Improve Time to ValueTelecommunicationsCDR streaming analyticsDeep Network AnalyticsOver 100 sampleapplicationsRetail CustomerIntelligenceCustomer Behavior andLifetime Value AnalysisFinanceSocial Media AnalyticsStreaming options tradingInsurance and bankingDW modelsSentiment Analytics, Intentto purchasePublic transportationData miningReal-time monitoring androuting optimizationStreaming statisticalanalysisUser Defined Toolkits Standard ToolkitsIndustry Data ModelsBanking, Insurance, Telco,Healthcare, Retail

Big Data Platform - Analytic ApplicationsBig Data Platform is designed foranalytic application development andintegrationBI/Reporting – Cognos BI, AttivioPredictive Analytics – SPSS, G2, SASExploration/Visualization – BigSheets, DatameerInstrumentation Analytics – Brocade, IBM GBSContent Analytics – IBM Content AnalyticsFunctional Applications – Algorithmics, CognosConsumer Insights, Clickfox, i2, IBM GBSIndustry Applications – TerraEchos, Cisco, IBMGBSAnalytic ApplicationsBI /Exploration / Functional Industry Predictive ContentReporting VisualizationAppAppAnalytics Analytics

IBM Big Data Platform Systems Management Application Development Visualization & Discovery Accelerators Information Integration & Governance Hadoop System Stream Computing Data Warehouse New analytic applications drive the requirements for a big data platform Integrate and manage the f