IBM’s InfoSphere BigInsights: Smart Analytics For Big Data

Transcription

IBM’s InfoSphere BigInsights: Smart Analytics for Big Data
Claus Samuelsen
csa@dk.ibm.com
November 7, 2011
© 2011 IBM Corporation

IBM Disclaimer
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision. The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Agenda
The “Big Data” challenge: smarter analytics for a smarter planet
IBM’s approach
– The big picture
– Details on BigInsights
– How BigInsights fits in your software stack (with data warehouses, DBMSs, streams, etc.)
How IBM can help you get off to a quick start

The “Big Data” Challenge

Information is at the Center of a New Wave of Opportunity
44x as much Data and Content Over Coming Decade
– 2009: 800,000 petabytes
– 2020: 35 zettabytes
80% of world’s data is unstructured
And Organizations Need Deeper Insights
– 1 in 3 business leaders frequently make decisions based on information they don’t trust, or don’t have
– 1 in 2 business leaders say they don’t have access to the information they need to do their jobs
– 83% of CIOs cited “Business intelligence and analytics” as part of their visionary plans to enhance competitiveness
– 60% of CEOs need to do a better job capturing and understanding information rapidly in order to make swift business decisions

Example: The Perception Gap Surrounding Social Media . . . .
IBM 2010 CEO Study: 88 percent of CEOs said “getting closer to customers” was top priority over next 5 years and viewed social media as a core part of that strategy
However, a March 2011 IBM study identified that companies fail to understand what customers want from social advertising and outreach
Social media and social networking will increase customer advocacy?
Source: “Capitalizing on complexity, Insights from the Global Chief Executive Officer Study,” IBM Institute for Business Value, 2010
Source: “What Customers Want,” first in a two-part series, IBM Institute for Business Value, published March 2011

Big Data Presents Big Opportunities
Extract insight from a high volume, variety and velocity of data in a timely and cost-effective manner
Variety: Manage and benefit from diverse data types and data structures
Velocity: Analyze streaming data and large volumes of persistent data
Volume: Scale from terabytes to zettabytes

What we hear from customers . . . .
– Lots of potentially valuable data is dormant or discarded due to size/performance considerations
– Large volume of unstructured or semi-structured data is not worth integrating fully (e.g. Tweets, logs, . . .)
– Not clear what should be analyzed (exploratory, iterative)
– Information distributed across multiple systems and/or Internet
– Some information has a short useful lifespan
– Volumes can be extremely high
– Analysis needed in the context of existing information (not stand alone)

Merging the Traditional and Big Data Approaches
Traditional Approach: Structured & Repeatable Analysis
– Business users determine what question to ask
– IT structures the data to answer that question
– Examples: monthly sales reports, profitability analysis, customer surveys
Big Data Approach: Iterative & Exploratory Analysis
– IT delivers a platform to enable creative discovery
– Business explores what questions could be asked
– Examples: brand sentiment, product strategy, maximum asset utilization

Big Data Scenarios Span Many Industries
– Multi-channel customer sentiment and experience analysis
– Detect life-threatening conditions at hospitals in time to intervene
– Predict weather patterns to plan optimal wind turbine usage, and optimize capital expenditure on asset placement
– Make risk decisions based on real-time transactional data
– Identify criminals and threats from disparate video, audio, and data feeds

Information Management
Vestas (European Energy Company)
Business challenge:
– Analyze large volumes of public and private weather data for alternative energy business
– Existing high-performance computing hardware, limited staff
Project objectives:
– Leverage large volume (2 PB) of weather data to optimize placement of turbines.
– Reduce modeling time from weeks to hours.
– Optimize ongoing operations.
Solution components:
IBM InfoSphere BigInsights Enterprise Edition:
- Scalability (data volumes)
- Jaql (query support and extensibility)
- IBM-provided file system (support existing hardware & apps)
- Strong runtime performance
The benefits:
– Reliability, security, scalability, and integration needs fulfilled
– Standard enterprise software support
– IBM xSeries hardware
– Single-vendor solution for software, hardware, storage, nsights: videos/interviews

Global Technology Firm
Business challenge:
– Analyze & correlate log records across to improve service
– Detect & predict failure patterns; initiate automated or manual preventive actions
Project objectives:
– Process variety of logs generated by multiple systems, devices in distinct formats (XML, text, . . .)
– Accommodate large data volumes growing at 1 TB/day
– Parse logs, identify/extract entities of interest, index as needed, cluster data by sessions, detect & visualize patterns through GUI
– Report on Top X, Bottom X patterns; support exploratory queries
Solution components:
IBM InfoSphere BigInsights Enterprise Edition including:
– Spreadsheet data discovery and visualization
– Text analytics runtime and tooling
– Flexible query support
– Scalability
IBM InfoSphere Streams
The benefits:
– IBM analytics and tooling simplify development and speed time-to-value.
– “You have done in 2 weeks what I have been trying to generalize for the past 6 months.” -- Customer project leader

Global Media Firm
Business challenge:
– Identify unauthorized content streaming (piracy)
– Quantify annual revenue loss, analyze trends
– Monitor social media sites (e.g., Twitter, Facebook) to identify dissemination of pirated content. Time sensitive!
Project objectives:
– Analyze high variety of data. Volumes unclear.
– Start with social media data for 1 year.
– Use text analytics to qualify & classify info of interest (complex, custom set of rules)
– Search for URLs with live streaming of target data, sentiment, . . .
– Future potential for video analysis
Solution components:
IBM InfoSphere BigInsights Enterprise Edition including:
– Text analytics runtime and tooling
– Custom text annotators
– Flexible query support
– Scalability
The benefits:
– Improved understanding of business exposures through advanced analytics
– Improved decision-making process
– Scalable, flexible infrastructure for handling future analytic needs

Customer Engagements
Use patterns
– Customer sentiment analysis (cross-sell, up-sell, campaign management)
– Integrated retail and web customer behavior modeling
– Predictive modeling (credit card fraud)
– System log analytics (reduce operational risk)
Common requirements
– Extract business insight from large volumes of raw data (often outside operational systems)
– Integrate with other existing software
– Ready for enterprise use
Diagram labels: Consumer Insight; Multi-channel sales; Next Gen Fraud Models; New Business Development; Text Analytics; Statistical Model Building; Text, Blog, Weblog; Click streams; Log & transactions; Biological Sequences; Operational system & streams data sources

IBM’s approach

Big Data: an integral part of an enterprise data platform
– Manage Big Data from the instant it enters the enterprise
– High fidelity – no changes to original format
– Available for new uses, analyses, and l
Diagram labels: Data Store; Warehouse; Big Data Applications; Big Data Platform; IBM Big Data Solutions; Client and Partner Solutions; Big Data User Environment (Developers, End Users, Admin.); Traditional data sources (ERP, CRM, databases, etc.); Big Data Enterprise s; Source data (Web, sensors, logs, media, etc.)

IBM’s Platform Addresses Key Requirements
1. Platform for V3 – Variety, Velocity, Volume
– Variety - manage data & content “As Is”
– Handle any velocity - low-latency streams and large volume batch
– Volume - huge volumes of at-rest or streaming data
2. Analytics for V3
– Analyze sources in their native format - text, data, rich content
– Analyze all of the data - not just a subset
– Dynamic analytics - automatic adjustments and actions
3. Ease of Use for Developers and Users
– Developer UIs, common languages & automatic optimization
– End-user UIs & visualization
4. Enterprise Class
– Failure tolerance, Security and Privacy
– Scale Economically
5. Extensive Integration Capabilities
– Integrate wide variety of sources
– Leverage enterprise integration technologies

Platform Vision
Diagram labels: IBM Big Data Solutions; Client and Partner Solutions; Big Data Accelerators (Text, Geospatial, Time Series, Acoustic, Mathematical); Blue Prints; Applications; Big Data Enterprise Engines (InfoSphere Streams, InfoSphere BigInsights); Productivity Tools & Optimization; Workload Management; Activity Monitor; Identity & Access Mgmt; Data Protection; Information Server
Integration: Rules / BPM (iLog & Lombardi); Data Warehouse (InfoSphere Warehouse); Warehouse Appliances (IBM & non-IBM); Master Data Mgmt (InfoSphere MDM); Database (DB2 & non-IBM); Content Analytics (ECM); Business Analytics (Cognos & SPSS); Marketing (Unica); Data Growth Management (InfoSphere Optim)

BigInsights Summary
BigInsights analytical platform for persistent “Big Data”
– Based on open source & IBM technologies
– Managed like a start-up . . . . Emphasis on deep customer engagements, product plan flexibility
Distinguishing characteristics
– Built-in analytics . . . . Enhances business knowledge
– Enterprise software integration . . . . Complements and extends existing capabilities
– Production-ready platform . . . . Speeds time-to-value; simplifies development and maintenance
IBM advantage
– Combination of software, hardware, services and advanced research

InfoSphere BigInsights
Platform for volume, variety, velocity -- V3
– Enhanced Hadoop foundation
Analytics for V3
– Text analytics & tooling
Usability
– Web console
– Integrated install
– Spreadsheet-style tool
– Ready-made “apps”
Enterprise Class
– Storage, security, cluster management
Integration
– Connectivity to DB2, Netezza
Editions (diagram axes: breadth of capabilities, enterprise class)
Apache Hadoop
Basic Edition (free download): Integrated install; Online InfoCenter; BigData Univ.
Enterprise Edition (licensed): Business process accelerators (“Apps”); Text analytics; Spreadsheet-style analysis tool; RDBMS, warehouse connectivity; Integrated Web-based console; Flexible job scheduler; Performance enhancements; Eclipse-based tooling; LDAP authentication

BigInsights Content
Function | Version | Basic Edition | Enterprise Edition
Integrated install | | Inc | Inc
Hadoop (including common utilities, HDFS, MapReduce framework) | 0.20.2 | Inc | Inc
Jaql (programming / query language) | 0.5.2 | Inc | Inc
Pig (programming / query language) | 0.7 | Inc | Inc
Flume (data collection/aggregation) | 0.9.1 | Inc | Inc
Hive (data summarization/querying) | 0.5 | Inc | Inc
Lucene (text search)* | 3.1.0 | Inc | Inc
Zookeeper (process coordination) | 3.2.2 | Inc | Inc
Avro (data serialization)* | 1.5.1 | Inc | Inc
HBase (real time read/write) | 0.20.6 | Inc | Inc
Oozie (workflow / job orchestration) | 2.2.2 | Inc | Inc
Online documentation | | Inc | Inc
Capability to integrate with JDBC sources through general-purpose Jaql module* | | Inc | Inc
Capability to integrate with DB2, InfoSphere Warehouse (DB2 UDF samples to submit jobs, and read results from BigInsights) | | Inc | Inc
*New or upgraded in 1.2

BigInsights Content
Function | Version | Basic Edition | Enterprise Edition
Capability to integrate with R (Jaql module to invoke R statistical capabilities from BigInsights) | | n/a | Inc
Capability to integrate with Netezza, DB2 LUW with DPF from Jaql | | n/a | Inc
LDAP Authentication | | n/a | Inc
Integrated Web Console* | | n/a | Inc
Integrated workflow capabilities | | n/a | Inc
Integrated flexible scheduler | | n/a | Inc
Platform performance enhancements (Adaptive MapReduce, efficient processing of compressed files)* | | n/a | Inc
Text analytics capability | | n/a | Inc
Eclipse support for text analytic development, Jaql, Hive, Java* | | n/a | Inc
Spreadsheet-like analytical tool (BigSheets)* | | n/a | Inc
IBM Optim Development Studio | V2.2.1.0 | n/a | Inc
*New or upgraded

Announcing BigInsights V1.3
Enhanced Web Console:
Administration tools
– View cluster health
– Manage cluster access
– Manage/install cluster instances.
Tools for big data – Web tools to:
– Run big data applications
– View progress
– Graph results
– Integrate with BigSheets
– Manage and schedule workflows, jobs, tasks, and files
Greater Efficiency:
Adaptive MapReduce – Improve performance for small jobs (without altering how jobs are created)
Compression – Decrease disk space & storage infrastructure requirements.
Better Manageability:
Development tools for:
– Text analytics
– Java map reduce development
– Cluster file browsing
– Job submission
– Jaql and Hive development
– Developing and publishing applications to the web console
Web Secure online REST access to cluster to automatically leverage applications and access data
Web applications for:
– Securely importing and exporting data with relational databases
– Importing and exporting files to the cluster
– Importing data from web crawlers and social media.

BigInsights: Value Beyond Open Source
Technical differentiators
– Built-in analytics: text processing engine, annotators, Eclipse tooling; interface to project R (statistical platform)
– Enterprise software integration (DBMS, warehouse)
– Simplified programming / query interface (Jaql)
– Integrated installation of supported open source and IBM components
– Web-based management console
– Platform enrichment: additional security, job scheduling options, performance features, . . .
– Standard IBM licensing agreement and world-class support
– More to come in future releases!
Business benefits
– Quicker time-to-value due to IBM technology and support
– Reduced operational risk
– Enhanced business knowledge with flexible analytical platform
– Leverages and complements existing software assets

BigInsights and the data warehouse
Diagram labels: Big Data analytic applications; Traditional analytic tools; Data warehouse; BigInsights

BigInsights and the data warehouse
Diagram labels: Traditional analytic tools; Big Data analytic applications; BigInsights; Data Warehouse
Query-ready archive for “cold” warehouse data

Growing Ecosystem of Solutions
IBM BigInsights Solutions
– Cognos Consumer Insights: Social media analytics solution that uses BigInsights. Available now.
– IBM Content Analytics: Unlock valuable business insight from unstructured data. Proof of technology completed. Production offering due soon.
. . . with more to come
Partner Solutions
Diagram labels: IBM Big Data User Environments; IBM Big Data Platform

A Closer Look at BigInsights . . . .

About the BigInsights Platform
Flexible, enterprise-class support for processing large volumes of data
– Based on Google’s MapReduce technology
– Inspired by Apache Hadoop; compatible with its ecosystem and distribution
– Well-suited to batch-oriented, read-intensive applications
– Supports wide variety of data
Enables applications to work with thousands of nodes and petabytes of data in a highly parallel, cost effective manner
– CPU + disks = “node”
– Nodes can be combined into clusters
– New nodes can be added as needed without changing data formats, how data is loaded, or how jobs are written

The MapReduce Programming Model
"Map" step:
– Input split into pieces
– Worker nodes process individual pieces in parallel (under global control of the Job Tracker node)
– Each worker node stores its result in its local file system where a reducer is able to access it
"Reduce" step:
– Data is aggregated (“reduced” from the map steps) by worker nodes (under control of the Job Tracker)
– Multiple reduce tasks can parallelize the aggregation
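The two steps above can be sketched on a single machine. This is only an illustration of the control flow, not how Hadoop or BigInsights is implemented: threads stand in for worker nodes, the driver function stands in for the Job Tracker, and the names `run_job`, `partition`, `severity_map` and `severity_sum`, as well as the log lines, are hypothetical.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
import zlib


def partition(key, num_reducers):
    """Deterministically assign an intermediate key to one reduce task."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(splits, map_fn, reduce_fn, num_reducers=2):
    # "Map" step: every input split is processed independently, in parallel.
    with ThreadPoolExecutor() as pool:
        map_outputs = list(pool.map(lambda split: list(map_fn(split)), splits))

    # Shuffle: route each intermediate (key, value) pair to one reducer bucket,
    # so all values for a given key end up in the same reduce task.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in map_outputs:
        for key, value in pairs:
            buckets[partition(key, num_reducers)][key].append(value)

    # "Reduce" step: each bucket is aggregated by its own reduce task.
    def reduce_bucket(bucket):
        return {key: reduce_fn(key, values) for key, values in bucket.items()}

    with ThreadPoolExecutor() as pool:
        partial_results = list(pool.map(reduce_bucket, buckets))

    # Merge the outputs of the reduce tasks (keys never overlap across buckets).
    result = {}
    for partial in partial_results:
        result.update(partial)
    return result


# Hypothetical job: count log lines per severity level.
def severity_map(split):
    for line in split:
        yield line.split()[0], 1


def severity_sum(key, values):
    return sum(values)


splits = [
    ["ERROR disk full", "INFO started"],
    ["INFO stopped", "ERROR timeout", "WARN slow"],
]
counts = run_job(splits, severity_map, severity_sum)
print(counts)
```

Partitioning by a hash of the key is one common way to let several reduce tasks work in parallel without two tasks ever seeing the same key.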

Logical MapReduce Example: Word Count

    map(String key, String value):
        // key: document name
        // value: document contents
        for each word w in value:
            EmitIntermediate(w, "1");

    reduce(String key, Iterator values):
        // key: a word
        // values: a list of counts
        int result = 0;
        for each v in values:
            result += ParseInt(v);
        Emit(AsString(result));

Content of Input Documents:
Hello World Bye World
Hello IBM
Map 1 emits: Hello, 1; World, 1; Bye, 1; World, 1
Map 2 emits: Hello, 1; IBM, 1
Reduce (final output): Bye, 1; IBM, 1; Hello, 2; World, 2

MapReduce Processing
Input Documents:
Hello World Bye World
Hello IBM
Map 1 emits: Hello, 1; World, 1; Bye, 1; World, 1
Map 2 emits: Hello, 1; IBM, 1
Reduce (final output): Bye, 1; IBM, 1; Hello, 2; World, 2
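The word count walk-through can be reproduced with a few lines of ordinary Python. This is a minimal single-process sketch that mirrors the map and reduce functions of the word count example. It does not use Hadoop, and the document names `doc1` and `doc2` are assumptions added for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, contents):
    """Emit an intermediate (word, "1") pair for every word in the document."""
    for word in contents.split():
        yield word, "1"


def reduce_fn(word, values):
    """Sum the string counts collected for one word and return the total as a string."""
    result = 0
    for v in values:
        result += int(v)
    return str(result)


# The two input documents from the slide; the names are hypothetical.
documents = {
    "doc1": "Hello World Bye World",
    "doc2": "Hello IBM",
}

# Map phase followed by grouping of intermediate values by key (the shuffle).
intermediate = defaultdict(list)
for name, text in documents.items():
    for word, count in map_fn(name, text):
        intermediate[word].append(count)

# Reduce phase: one call per distinct word.
word_counts = {
    word: reduce_fn(word, values)
    for word, values in sorted(intermediate.items())
}
print(word_counts)
```

The totals match the final output on the slide: Bye 1, IBM 1, Hello 2, World 2.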

So What Does This Result In?
– Easy To Scale
– Fault Tolerant and Self-Healing
– Data Agnostic
– Extremely Flexible

Web-based Installation, Management Consoles
Integrated installation
– Seamless process for single node and cluster environments
– Post-install validation of IBM and open source components
Integrated management console
– System health management
– Add / drop nodes
– Start / stop services
– Run / monitor jobs (applications)
– Explore / modify file system
– . . .

BigInsights and Text Analytics
Distill structured info from unstructured data
– Sentiment analysis
– Consumer behavior
– Illegal or suspicious activities
– . . .
Pre-built library of text annotators for common business entities
Rich language and tooling to build custom annotators
Support for Western languages (English, Dutch/Flemish, French, German, Italian, Portuguese, or Spanish) plus select Asian languages (Japanese, ...
Annotator names shown: ...son", "PhoneNumber", "StateOrProvince", "URL", "ZipCode"
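To make "distilling structured info from unstructured data" concrete, here is a rough sketch using Python regular expressions. It is not the BigInsights text analytics engine or its annotator language, neither of which is shown in this deck. The labels reuse three of the annotator names from the slide, while the patterns (US-style phone numbers and ZIP codes) and the sample sentence are assumptions made for illustration.

```python
import re

# Hypothetical, deliberately simple patterns; production annotators are far richer.
ANNOTATORS = {
    "PhoneNumber": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
    "URL": re.compile(r"https?://[^\s,;]+"),
    "ZipCode": re.compile(r"\b\d{5}(?:-\d{4})?\b"),
}


def annotate(text):
    """Return structured annotations (type, matched text, span) found in free text."""
    found = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            found.append(
                {"type": label, "text": match.group(0), "span": match.span()}
            )
    # Order annotations by their position in the text.
    return sorted(found, key=lambda a: a["span"])


sample = "Call 555-123-4567 or visit http://example.com/offers before shipping to 95141."
annotations = annotate(sample)
for a in annotations:
    print(a["type"], a["text"])
```

Each annotation carries its type and character span, which is the kind of structured record that downstream queries or spreadsheet-style tools can then filter and aggregate.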

BigInsights Text Analytics Development

Example Analysis: Extraction from Twitter messages
Extract intent, interests, life events and micro segmentation attributes
While accounting for less relevant messages
– Subtle Spam, Advertising
– Sarcasm, Wishful Thinking
Figure: sample tweets annotated with labels such as "Monetizable Intent". Recoverable fragments include "moving to miami in 3 months", "Buy them on itunes", "Cell Phones, Windows Mobile", @purplepleather and http://4sq.com/gbsaYR

IBM & non-IBM InfoSphere MDM DB2 & non-IBM Cognos & SPSS Unica ECM Data Growth Management InfoSphere Optim Rules / BPM iLog & Lombardi Data Warehouse InfoSphere Warehouse IBM Big Data Solutions Client and Partner Solutions Big Data Enterprise Engines Big Data Accelerators Text Image/Vi