New Trends In Data Integration, Analytics, And BI - Missouri S&T

Transcription

New Trends in Data Integration,Analytics, and BIMukesh MohaniaIBM India ResearchAcknowledgement: Thanks to all members of IM Research group at NewDelhi for their contribution.

Agenda Information Integration – Definition and architectures Existing solutions Context Oriented Information Integration SCORE approach EROCS approach Integrating Audio Streams with Structured Data Data Analytics and BI Applications Improving Semantic SearchPreventing Customer AttritionPreventing Information Leakage from Text DocumentsImproving Cross/Up-SaleSocial Network Analysis for Telecom BI Conclusions2

MarketMarket Insights:Insights: InformationInformation ManagementManagement ChallengesChallengesOnly 1/3rd of CFOs believe that theinformation is easy to use, tailored,cost effective or integrated.60% of CEOs: Need to do a better jobcapturing and understanding informationrapidly in order to make swift businessdecisions.85% ofCustomersinformation isunstructured.Trx.30-50% of applicationEmployeesdesign time is spent oncopy management.PartnersProductsOrgs.Financials42% oftransactions arestill paper-based.e-MailsDatabasesReportsThe average billion dollar company:48 disparate financial systems2.7 ERP systemsWebContent30% of people’s time:searching forrelevant information.MediaDocuments79% of companies have40% of IT budgetsmay be spent onintegration.more than two repositoriesand 25% have more than 15Sources: IBM & Industry Studies, Customer Interviews, Forrester3

Valuable Business Information is Buried Under LargeAmounts of Unstructured DocumentsStructured data:Unstructured data ContentQueryContributes to 20% of businessthat is conducted - Gartner GroupContenResu tltUnstructured data: Doubles every three months Gartner GroupStructured dataDatabaseQueryManagementasetabDa esultRHow do Icorrelate?Enterprises are realizing the need to bridge this separation and are demandingINTEGRATED RETRIEVAL, MANAGEMENT AND ANALYSIS of both thestructured and unstructured content.Existing systems do not enable automatic association of the two disparatesources.4

Improved BI with Email Analysis CRM analysis tends to tell you “what” happened not “why” it happened Customer attrition “ATM usage down by 30% in last quarter” Likely to leave New product sales Sales 20% less than target Marketing campaigns target selection People who have purchased similar products in the past Email analysis may tell you “why” “very high service charges for loan processing” customer attrition “very unhappy with the product quality ” product sales down “my wife had called your call center yesterday.” cross sellopportunity for family products. Better target selection for marketing campaigns “unable to change my password while traveling ” customer satisfaction5

Know your Customers Better To realize full potential of customer we have to answer certain questions abouthim/her Increase share of wallet: What does a person need? What are hisproduct affinities? What is his opinion? What hinders him from doing morebusiness with us? Cross-sell/Up-sell : What products sell best? (Cognos) Will he buy it?When is the right time to sell so that his likelihood of buying is high? Product Extension: What features does he like? How can I improve hisproduct experience? Reduce Churn: Who is likely to churn? (SPSS) Why is he churning? Reduce cost to serve: What is his problem? How can I solve it efficiently? Structured Data Analysis and Surveys are employed to answer these questions Structured data analysis can answer only some of these questions Unstructured data can answer more questions which cannot be answeredby structured data.6

Existing Solutions7

II Architecture: A Data Warehousing Approach8

II Architecture: Virtualization Layer Approach9

Structured and Unstructured Information Integration:A Brief Background on Existing SolutionsExisting solutions can be classified in terms of the query paradigm used: Keyword Query Based Solutions (DB2 ESE, DbXplorer/BANKS) Relational data exposed to search engine as virtual textdocuments Query both structured and unstructured information usingkeywords SQL Query Based Solutions (SQL LIKE predicate, DB2 NetSearchExtender) Text data exposed to relational engine as virtual tables with textcolumns Query both structured and unstructured information using SQL Provide SQL primitives to search text in table columns using a set ofkeywords10

Keyword Query Based Solution: DB2 ESEKeyword rExtender11

Keyword Query Based Solution: rd QueryC1C2C3AX1AX2AY3BX4AX5BY6BX7BX8Keyword QuerySearchSearchEngineEngine12

Keyword Query Based Solutions: Summary Advantage: Simplicity! Disadvantages Less expressive (as compared to SQL) How to ask for the information related to the five best performing stocks inthe past week? Need to specify a set of keywords that succinctly encodes theinformation need Not always easy13

SQL Query Based Solution:Standard SQL LIKE PredicateSELECT stocks.price, docs.textFROM stocks, docsWHERE (stocks.name ‘IBM’AND docs.text LIKE ‘% IBM %’)OR (stocks.name ‘ORCL’AND docs.text LIKE ‘% ORCL %’)C1C2C3AX1AX2AY3BX4AX5BY6BX7BX8DB2DB2UDBUDB/ 4

SQL Query Based Solution:Net Search ExtenderSELECT stocks.price, docs.textFROM stocks, docsWHERE (stocks.name ‘IBM’AND CONTAINS(docs.text, “IBM”))OR (stocks.name ‘ORCL’AND CONTAINS(docs.text, BUDB/ rIntegratorCONTAINS( )NetNetSearchSearchExtenderExtender15

SQL Query Based Solutions: Summary Advantages: More expressive – can specify more involved and sophisticated queries Disadvantages: The unstructured data is still queried using keywords Need to specify a set of keywords that succinctly encodes theinformation need Not always easy The SQL query and the embedded keyword query encode the sameinformation need Redundant effort Association of documents with tuples (local context), not with the entireresult (global context) Same documents get attached to “IBM” when “IBM” is queried with “ORCL”as when “IBM” is queried with “DELL”16

SCORE Approach –Associating text Documents with StructuredQuery Results17

Problem Statement: Enhance structureddata retrieval by associating additionaldocuments relevant to the user context withthe query result.Structured data relations, schema-based(XML) documentsUnstructured data schema-less (free-flow)documents, web-pages18

SCORE Overview“Get the 3 companies withmax price variation”SELECT name, max(price) -min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS ONLYSELECT name, max(price) - min(price)FROM stocksGROUP BY nameORDER BY 2FETCH FIRST 3 ROWS X7BX8“IBM” “ORCL” “MSFT”“Database” eEnterpriseSearchSearchExtenderExtenderCIKM 2005 – Best Paper19

IBM India Research LabOverall ArchitectureUser Interface/ApplicationQuery Result Relevant DocumentsQuery DirectivesMetadata sQuery Result ContextQuerySCOREQuery HandlerContext HandlerMetadataModifiedQueryModifiedQuery ResultContext DirectivesMetadataDB2 IICMCMStructured Data SourceEnterprise SearchCMCMUnstruct Data SourceMetadata20 2004 IBM Corporation

EROCS Approach –Associating Relevant Structured Data withText Documents21

Linkage Discovery (EROCS): Efficiently Linking Diverse Data Exploit partial information contained in a document to automatically identify andlink relevant structured dataMain Idea View the structured data as a set of pre-defined “entities” Identify the entities from this set that best match the document, and also findembeddings of the identified entities in the documentValue Proposition Metadata Extraction (linkingdocuments with structureddata)Enhance semantic Search byexploiting this MetadataEnable BI across Structuredand Unstructured DataProviding more metadata andricher text search in CMUIMA AnnotatorVLDB 2006, SIGMOD 2007, PODS 2007, ICDE 2008 (Demo)22

ExampleC1C1Find the transaction thatbest matches the contextAAABABBBC2C3C1 X C2 1AAABABBXYXXYXXBXXYXXYXX2345678C3123C2C3A C1A AX C2X X1 C32 ACTIONI am Name John Smith /Name . bought a Company Sony /Company product DVD player /product .from Company JK Electronics /Company .Additional “sidebar”information availableas a result of rd (AMEX)Promo# 1236NOREFNDCustIdNameLoyaltyClubAddrA756K9John WSmithPlatinumIBMChicago, IL23

Linkage Discovery ArchitectureAnnotation RulesOAEMIMLTextTextEntity DefinitionEntity ViewUIMA AnnotatorCRMLinkage Discovery(Text-Entity 3BX4AX5BY6BX7BX8Data MiningTextTextBI ToolsAttrition AnalysisCust. Segmentation24

Issues: Connecting People, Products, etc in a Database To Textdocument Learn the key differences between finding names of peopleand products versus connecting those names to databaseID's Understand the three major difficulties encountered inlinking: Ambiguity of reference, unknown aliases, and ambiguity of type Explore state-of-the-art approaches to resolving theproblem which span machine learning to rule basedsystems25

CallAssist: Integrating Real-time Audio with DatabasesCallAssist A novel system for linking audio conversations with relevant structured data automatically in real-time,Suggests informative queries to narrow down the contextFLAIVR basedRouterAudioSuggest Entities &Informative queriesValue PropositionCost Reduction: the end-to-end time taken to support acustomer will be reduced. Reduced call escalation rates Lower average time for call resolution Reduction in agent training costs-- Agents need not be have in-depthunderstanding of business processes-- System will prompt agent with informativequeries that can narrow contextPotential Sales: By suggesting opportunities for upselling/cross-selling products and services to customer.ContextRouteControllerAheadASR ial ApplicationsAutomate Q&A for Call CenterAutomated Solution ProviderDynamic Learning/Self-help ProgramIntegrated Trouble Ticket cum Diagnosis GeneratorExtend Customer Data Integration tools to support real-time audioDatabasesTemplatesVLDB 2007 (Demo), SIGMOD 200826

27

Data Analytics and BI Applications

Example Scenario: Semantic SearchEnables Semantic Search onText External DBImproves precision/recall More types (e.g. transaction) Attributes from structured data(e.g. customer profile)Search Application Facilitates enhanced searchComplaint, platinum member, Los AngelesIndexLinkage DiscoveryIdentify RelationshipsNamed-entity extractionCategorizationFind Words & RootsTextIdentify LanguageUIMA Compliant Collection Processing EngineRDBMSEnhancedMetadataInformation about customerbeing a “Platinum” memberin “Los Angeles” addedusing Linkage DiscoveryThird PartyAnnotators External DB29

Semantic Search ExampleI have an account in your bank in TX (# 0214-452). I am currently facing problems inaccessing my net-banking account. Whenever I try to login, I get a message "account locked".I cannot go to the branch to reset my passwords as I am currently traveling and outside theUS. Can you please reset my password to my old one?Mail # 1I am indeed privilege to get your response to quickly. However, I have not got any service outof your net banking on the given dates. I was following up the wrongful debits to my account onaccount of the car loan which you have refunded now. That being the case, there is nojustification for you to charge me the extra 40.Mail # 2Find Emails from Privilege Customers complaining about net-bankingRegular (non-linked) SearchI am indeed privilege to get yourresponse to quickly. However, Ihave not got any service out ofyour net banking on the givendates. I was following up thewrongful debits to my account onaccount of the car loan whichyou have refunded now. Thatbeing the case, there is nojustification for you to charge methe extra 40.LD SearchI have an account in your bank in TX (#0214-452).I am currently facingproblems in accessing my net-bankingaccount. Whenever I try to login, I get amessage "account locked". I cannot goto the branch to reset my passwords asI am currently traveling and outside theUS. Can you please reset my passwordto my old AX2BY6AY3BX7BX4BX8AX5BY6BX7BX8CRM30

Example: Complaint Routingxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx xxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx xx xxxxxxxxxxxxxxxxxxx x xxxx x ABABBBC2C3C1 X C2 1AAABABBBXYXXYXXXXYXXYXX2345678C31234C2C3A C1A AX C2X X1 C32 xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx RouterRouterIdentified transaction with relevant context(Payment details, promotion utilized,customer profile, products)Customer care85678Transactions dataLinkage Discovery can also help proactively identifythe minimal additional information needed from thecustomer to identify the relevant transactionPODS 2006 (Entity Completion)31

Improved BI spanning both on Content and DataShow me the top 4 pain points of my most privileged customers from North region whohave reduced their balance by more than 50% in the last quarter.Information fromEmailsInformation fromCRM Data45 Traditional BI systems cannot answer this type of hybridquery requiring manual analysis40353025 OmniFind Analytics Edition with Linkage Discovery can.20151050NetBankingForexAQBLoanICDE 2008 (Demo)32

Discovering Customer Needs to MaximizeCustomer ValueVoiceASRTranscribed Noisy CallsDataStage, EROCSQualityStageStructuredDBAnnotated and linked TextRecords and structured dataLinguistic AnalysisClaims . not processedTaking too much timeRefund not givenMining Views/FunctionsData CleaningData linkingAnalysisUnStructuredDBSIGKDD 2008, ICDM 2008 Upsell and Cross-Sell First call resolution Reducing turn around time Identifying pain points Churn prediction Relating business metrics tocustomer perceptions Improving service levelsOAEBusiness Value:feedback toimprove CustomerValue33

Customer Relationship Management: Churn Prediction Attrition Prevention Use SCORE to deduce common features of the set of customers who havecancelled their credit card Prioritize customer retention campaign for remaining customers exhibiting thesecharacteristics“List all customers cancelled theircredit cards in last 3 XXY45123BB64BA XX75BB XY86BX7BX8XOther customershaving the samecontextC3“CardType: Gold”,“Category: High Interest”“State: CA”,“No of Complaints: 2”,“Sentiments: Unhappy”,“Band: 5”,“CardType: Silver”,“Category: Late Payment”“State: CA”,“No of Complaints: 2”,“Sentiments: Unhappy”,“Band: itory34

IBM Research India Research LabDocument Sanitization: Preventing Information LeakageSanitization involves removing sensitiveinformation from a document.Problem Statement Given a document D and a parameter K,delete a minimum number of terms so thatthe remaining document T is K-safe. K-safety: A set of terms T is K-safe, iffor any entity e, at least K other entitiescontain T I C(e) in their contextCIKM 200835

IBM Research India Research LabProblem Scenario 1: Document Sanitization Based on AccessControl Policies Dynamically sanitize a document for a specified user, based on his/her accessprivileges defined on a structured database. Sensitive information hidden from the user in the database should also beremoved from the documentSanitized DocumentInput DocumentDocumentDocumentSanitizerSanitizerJohn Ralphson ofIRL is suffering fromcancer C1AAC2C1AC1C2C3AX1AC1 XC2 2C3BAA YA XX 3X 41BA XB YY 5X 63B4BA XB XX 7Y n Ralphson ofIRL is suffering from*********cancerC3C2XXYXXYXX12345678C312345678 Access ControlPoliciesDoctor36

IBM Research India Research LabProblem Scenario 2: Document Sanitization forSecuring Entities Database contains a set of entities Each entity e has context C(e) : a set of terms associated with e. Sanitization Hide information from a document so that the entitymentioned in the document cannot be identified. Identification: Happens by searching the database and hn Ralphson ofIRL is suffering fromcancer C1AAC2C1ABAAC1C2C3AX1AC1 XC2 2C3BAA YA XX 3X 41B2BA XB YY 5X 63B4BA XB XX 7Y 85BX7BXBABBBA6DB8ABABXXYXXYXXJohn --------- of------ is suffering fromcancer C3C2XXYXXYXX12345678C312345678 Entities Contexts37

SNAzzy (Social Network Analysis forTelecom Business Intelligence)TKDE 2008, WWW 200738

Technology Overview Goal Augment the traditional analysis generally utilized by Telcos with Social Network Analyses for improvedCRM and Business Intelligence Methodology Analysis of call and SMS patterns to create a graph where the vertices in the graph represent phone #s,individuals, geographical areas, or communities and the edges in the graph represent their relationships(call duration, friend, acquaintance, ) Focus Global Structure Analysis: Analyzing Call Graphs to help you better understand the underlyingbehavior of users, in a global context and accordingly plan the marketing services Customer Social Analysis: Analyzing the graph to identify the customers of high social influence whoshould be retained Churn Analysis: To predict churn and perform targeted marketing campaigns on potential churnersand to identify potential acquisition targets from competitors Psychographic Analysis: Analyze the calling patterns to guess the subscriber profile to enableeffective customer segmentation leading to targeted campaign management. Community Discovery: Analyze the graph to detect communities for improved group targeting andretention39

Case Study Customer: A major Telecom Operator of the world Data: 1 month CDRs (both SMS & Calls) of multiple regionsHuge amount of data Graph of 7 million nodes and 35 million edges for 1 region The CDR graph showed various insights A heavy tailed distribution – very few people know a lot of people, and most peopleknow a few peopleSMS is an important medium of communication among certain customer segmentsSMS is a more social phenomenon than Voice CallsBoth SMS graph and Call graph has a very large strongly connected components Can reach a majority of people by traversing links Identified various types of interesting communities (see next slide) Cliques (everyone knows everyone else)Clique connectors (people connecting multiple cliques)Competitor’s customers as part of cliques signifying that they are external membersof a strong communityStars Analysis shows that people called by churners are more likely to churn In consultation with the Telecom Operator BI and Marketing team to gain more insights on theidentified communities and devise campaigns GOAL: Integrate SNA with Operator BI System40

Some Identified CommunitiesKey customers?Connecting social groupsHow do we retain this customer ?(Why is he calling so many competitors ?)Potential conversiontargets fromcompetitors?Platinum CustomerRegular CustomerCDMA CompetitorGSM Competitor41

Conclusions Information Integration has become widely popular – However a lotneeds to be done As per Gartner: 80% of data in enterprise is unstructured. Datawhich is not integrated with the structured data in the enterprise! Huge need for new ways of doing information integration Context Oriented Information Integration SCORE: Automatically finds relevant unstructured data for a SQL Query EROCS: Finds links between structured and unstructured data Social Network Analysis for Telecom BI Improves churn prediction by analyzing the social behavior of callers,that is, who is calling whom and their calling patterns. Need to consider SNA fact particularly in churn analysis and BI42

Thank You43

SELECT stocks.price, docs.text FROM stocks, docs WHERE (stocks.name 'IBM' AND CONTAINS(docs.text, "IBM")) OR (stocks.name 'ORCL' AND CONTAINS(docs.text, "ORCL")) SQL Query Based Solution: Net Search Extender DB2 UDB / WebSphere Information Integrator DB2 UDB / WebSphere Information Integrator Net Search Extender Net Search .