Accomplishments & Future Directions - IEEE

Transcription

Web Mining : Accomplishments& Future DirectionsJaideep SrivastavaUniversity of du/faculty/srivasta.htmlMr. Prasanna Desikan’s help in preparing these slides is acknowledged Jaideep Srivastava1

OverviewWeb Structure MiningIntroduction to data miningDefinitionInteresting Web StructuresOverview of Hyperlink AnalysisMethodologyKey ConceptsData mining processData Mining techniquesClassificationClusteringTopic AnalysisConcept HierarchyContent RelevancePageRankHubs and AuthoritiesWeb CommunitiesInformation ScentWeb miningConclusionsWeb mining definitionWeb mining taxonomyWeb Usage MiningWeb Content MiningDefinitionPre-processing of contentCommon Mining techniquesDefinitionPreprocessing of usage dataClassificationClusteringTopic AnalysisConcept HierarchyContent RelevanceApplications of Content Mining Jaideep SrivastavaSession IdentificationCGI DataCachingDynamic PagesRobot Detection and FilteringTransaction IdentificationIdentify Unique UsersIdentify Unique Usertransaction2

OverviewRelated ConceptsWeb Usage Mining (contd.)Web VisualizationTopic DistillationWeb Page CategorizationSemantic Web MiningDistributed Web MiningPath and Usage Pattern DiscoveryPattern AnalysisApplicationsConclusionsWeb mining applicationsAmazon.comGoogleDouble ClickAOLeBayMyYahooCiteSeeri-MODEv-TAG Web Mining ServerWeb services & Web miningDefinitionsWhat they provideService Oriented ArchitectureSOAPWSDLUDDIHow WM can help WSWeb Services Optimization Jaideep Srivastava3

OverviewResearch DirectionsProcess MiningTemporal Evolution of the WebWeb Services OptimizationFraud at E-tailerFraud at online AuctioneerOther threatsWeb Mining and PrivacyPublic Attitude towards PrivacyWhy this attitudeDoes understanding implicationshelpWhat needs to be doneConclusions Jaideep Srivastava4

Introduction to Data Mining Jaideep Srivastava5

Why Mine Data?Computerization and automated data gathering hasresulted in extemely large data repositories.E.g., Walmart: 2000 stores, 20 M transactions/dayRaw DataPatternsKnowledgeScalability issues and desire for more automationmakes more traditional techniques less effective:Statistical MethodsRelational Query SystemsOLAP Jaideep Srivastava6

The Data Mining (KDD) Processinterpretationdata nPatternsDATATarget DataPreprocessedDataTransformedData Jaideep Srivastava7

Data Mining TechniquesClassificationClusteringAssociation RulesSequential PatternsRegressionDeviation Detection Jaideep SrivastavaPrimarytechniques8

Classification: DefinitionGiven a collection of records (training set )Each record contains a set of attributes, one of the attributesis the class.Find a model for class attribute as a functionof the values of other attributes.Goal: previously unseen records should beassigned a class as accurately as possible.A test set is used to determine the accuracy of the model.Usually, the given data set is divided into training and testsets, with training set used to build the model and test setused to validate it. Jaideep Srivastava9

Classification ExamplealaluscciiouororniggnttetessaaoalccccTid Refund MaritalStatusTaxableIncome CheatRefund MaritalStatusTaxableIncome ed120KNoYesDivorced 90K?5NoDivorced esDivorced KYesTrainingSet Jaideep SrivastavaLearnClassifierTestSetModel10

Classification TechniquesDecision Tree based MethodsRule-based MethodsMemory based reasoningNeural NetworksGenetic AlgorithmsNaïve Bayes and Bayesian Belief NetworksSupport Vector Machines Jaideep Srivastava11

What is Cluster Analysis?Finding groups of objects such that the objects in agroup will be similar (or related) to one another anddifferent from (or unrelated to) the objects in othergroups.Based on information found in the data that describes the objectsand their relationships.Also known as unsupervised classification.Many applicationsUnderstanding: group related documents for browsing or to findgenes and proteins that have similar functionality.Summarization: Reduce the size of large data sets.Web Documents are divided into groups based on asimilarity metric.Most common similarity metric is the dot product betweentwo document vectors. Jaideep Srivastava12

What is not Cluster Analysis?Supervised classification.Have class label information.Simple segmentation.Dividing students into different registration groupsalphabetically, by last name.Results of a query.Groupings are a result of an external specification.Graph partitioningSome mutual relevance and synergy, but areas are notidentical. Jaideep Srivastava13

Notion of a Cluster is AmbiguousInitial points.Six ClustersTwo ClustersFour Clusters Jaideep Srivastava14

Types of ClusteringsA clustering is a set of clusters.One important distinction is betweenhierarchical and partitional sets of clusters.Partitional ClusteringA division data objects into non-overlapping subsets (clusters)such that each data object is in exactly one subset.Hierarchical clusteringA set of nested clusters organized as a hierarchical tree. Jaideep Srivastava15

Partitional Clustering664455221133Original PointsA Partitional Clustering Jaideep Srivastava16

Hierarchical Clustering(agglomerative clustering)560.243420.15521310.10.050Traditional Hierarchical Clustering132546Traditional Dendrogram Jaideep Srivastava17

Other Distinctions Between Sets ofClustersExclusive versus non-exclusiveIn non-exclusive clusterings, points may belong to multipleclusters.Can represent multiple classes or ‘border’ pointsFuzzy versus non-fuzzyIn fuzzy clusterings, a point belongs to every cluster withsome weight between 0 and 1.Weights must sum to 1.Probabilistic clustering has similar characteristics.Partial versus complete.In some cases, we only want to cluster some of the data. Jaideep Srivastava18

Mining AssociationsGiven a set of records, find rules that will predict theoccurrence of an item based on the occurrences ofother items in the recordMarket-Basket transactionsExample:TIDItems1Bread, Milk2345Bread, Diaper, Beer, EggsMilk, Diaper, Beer, CokeBread, Milk, Diaper, BeerBread, Milk, Diaper, CokeTID Bread Milk Diaper Beer Eggs Coke11100002101110301110141111005111001 Jaideep Srivastava19

Definition of Association Rules ,cTIDItems1Bread, Milk2345Bread, Diaper, Beer, EggsMilk, Diaper, Beer, CokeBread, Milk, Diaper, BeerBread, Milk, Diaper, CokeGoal:Discover all rules havingsupport minsup andconfidence minconfthresholds.Association Rule:Support: s X yσ ( X y) T ( s P(X, y))σ ( X y)(c P ( y X))Confidence: c σ (X)Example:s c {Milk , Diaper} Beerσ ( Milk , Diaper, Beer ) T 2 0 .45σ (Milk, Diaper, Beer) 2 0.67σ (Milk, Diaper)3 Jaideep Srivastava20

Association Rule MiningExample of Rules:TIDItems1Bread, Milk2345Bread, Diaper, Beer, EggsMilk, Diaper, Beer, CokeBread, Milk, Diaper, BeerBread, Milk, Diaper, Coke{Milk,Diaper} {Beer} (s 0.4, c 0.67){Milk,Beer} {Diaper} (s 0.4, c 1.0){Diaper,Beer} {Milk} (s 0.4, c 0.67){Beer} {Milk,Diaper} (s 0.4, c 0.67){Diaper} {Milk,Beer} (s 0.4, c 0.5){Milk} {Diaper,Beer} (s 0.4, c 0.5)Observations: All the rules above correspond to thesame itemset: {Milk, Diaper, Beer} Rules obtained from the sameitemset have identical support butcan have different confidence Jaideep Srivastava21

Association Rule MiningTwo-step approach:1.2.Generate all frequent itemsets (sets of items whosesupport minsup)Generate high confidence association rules fromeach frequent itemsetEach rule is a binary partitioning of a frequent itemsetFrequent itemset generation is the moreexpensive operation Jaideep Srivastava22

Sequential Pattern DiscoveryGiven a set of objects, with each object associated withits own timeline of events, find rules that predict strongdependencies among different events.(A B) (C)(D E)Examples:In point-of-sale transaction sequences(Intro to visual C)(C -Primer) (Perl for dummies)(TCL TK)In Telecommunication alarm logs:(Inverter Problem Excessive Line Current) (Rectifier Alarm)(Fire Alarm) Jaideep Srivastava23

RegressionPredict a value of a given continuous valued variablebased on the values of other variables based onlinear or non-linear model of dependency.Greatly studied in statistics and neural network fieldsExamples:Predicting sales amount of a new product based onadvertising expenses.Time Series prediction of stock market indices Jaideep Srivastava24

Deviation DetectionDiscovering most significant changes in data frompreviously measured or normative data.Usually categorized separately from other datamining tasksDeviations are often infrequent.Modifications of classification, clustering and timeseries analysis can be used as means to achievethe goal.Outlier Detection in Statistics Jaideep Srivastava25

Web Mining Jaideep Srivastava26

Web MiningWeb is a collection of inter-related files on oneor more Web servers.Web mining isthe application of data mining techniques to extract knowledgefrom Web dataWeb data isWeb content – text, image, records, etc.Web structure – hyperlinks, tags, etc.Web usage – http logs, app server logs, etc. Jaideep Srivastava27

Web Mining – HistoryTerm first used in [E1996], defined in a ‘task oriented’ mannerAlternate ‘data oriented’ definition given in [CMS1997]1st panel discussion at ICTAI 1997 [SM1997]Continuing forumWebKDD workshops with ACM SIGKDD, 1999, 2000, 2001,2002, ; 60 – 90 attendeesSIAM Web analytics workshop 2001, 2002, Special issues of DMKD journal, SIGKDD ExplorationsPapers in various data mining conferences & journalsSurveys [MBNL 1999, BL 1999, KB2000] Jaideep Srivastava28

Web Mining Taxonomy Jaideep Srivastava29

Pre-processing Web DataWeb ContentExtract “snippets” from a Web document thatrepresents the Web DocumentWeb StructureIdentifying interesting graph patterns or preprocessing the whole web graph to come up withmetrics such as PageRankWeb UsageUser identification, session creation, robot detectionand filtering, and extracting usage path patterns Jaideep Srivastava30

Web Content Mining Jaideep Srivastava31

DefinitionWeb Content Mining is the process of extractinguseful information from the contents of Webdocuments.Content data corresponds to the collection of facts a Webpage was designed to convey to the users. It may consistof text, images, audio, video, or structured records suchas lists and tables.Research activities in this field also involve usingtechniques from other disciplines such asInformation Retrieval (IR) and natural languageprocessing (NLP). Jaideep Srivastava32

Pre-processing ContentContent PreparationExtract text from HTML.Perform Stemming.Remove Stop Words.Calculate Collection Wide Word Frequencies (DF).Calculate per Document Term Frequencies (TF).Vector CreationCommon Information Retrieval Technique.Each document (HTML page) is represented by a sparse vector of termweights.TFIDF weighting is most common.Typically, additional weight is given to terms appearing as keywords or intitles. Jaideep Srivastava33

Common Mining TechniquesThe more basic and popular data miningtechniques include:ClassificationClusteringAssociationsThe other significant ideas:Topic Identification, tracking and drift analysisConcept hierarchy creationRelevance of content. Jaideep Srivastava34

Document Classification“Supervised” techniqueCategories are defined and documents areassigned to one or more existing categoriesThe “definition” of a category is usually in theform of a term vector that is produced during a“training” phaseTraining is performed through the use ofdocuments that have already been classified(often by hand) as belonging to a category Jaideep Srivastava35

Document Clustering“Unsupervised” techniqueDocuments are divided into groups based on asimilarity metricNo pre-defined notion of what the groups shouldbeMost common similarity metric is the dot productbetween two document vectors Jaideep Srivastava36

Topic Identification and TrackingCombination of Clustering and ClassificationAs new documents are added to a collectionAn attempt is made to assign each document to anexisting topic (category)The collection is also checked for the emergenceof new topicsThe drift in the topic(s) are also identified Jaideep Srivastava37

Concept Hierarchy CreationCreation of concept hierarchies is importantto understand the category and subcategories a document belongs toKey FactorsOrganization of categories; e.g. Flat, Tree, orNetworkMaximum number of categories per document.Category Dimensions; e.g. Subject, Location,Time, Alphabetical, Numerical Jaideep Srivastava38

Relevance of ContentRelevance can be measured with respect to any ofthe following criteriaDocumentQuery basedUser BasedRole/Task Based Jaideep Srivastava39

Document RelevanceMeasure of how useful a given document is in agiven situationCommonly seen in the context of queries results are ordered by some measure ofrelevanceIn general, a query is not necessary to assign arelevance score to a document Jaideep Srivastava40

Query Based RelevanceMost commonWell established in Information RetrievalSimilarity between query keywords anddocument is calculatedCan be enhanced through additional informationsuch as popularity (Google) or term positions(AltaVista) Jaideep Srivastava41

User Based RelevanceOften associated with personalizationProfile for a particular user is createdSimilarity between a profile and document iscalculatedNo query is necessary Jaideep Srivastava42

Role/Task Based RelevanceSimilar to User Based RelevanceProfile is based on a particular role or task,instead of an individualInput to profile can come from multiple users Jaideep Srivastava43

Web Content Mining ApplicationsIdentify the topics represented by a Web DocumentsCategorize Web DocumentsFind Web Pages across different servers that are similarApplications related to relevanceQueries – Enhance standard Query Relevance with User,Role, and/or Task Based RelevanceRecommendations – List of top “n” relevant documents ina collection or portion of a collection.Filters – Show/Hide documents based on relevance score Jaideep Srivastava44

Web Structure Mining Jaideep Srivastava45

What is Web Structure Mining?The structure of a typical Web graphconsists of Web pages as nodes,and hyperlinks as edges connectingbetween two related pagesHyperlinkWebDocumentWeb Graph StructureWeb Structure Mining can be is the process ofdiscovering structure information from the WebThis type of mining can be performed either at the(intra-page) document level or at the (inter-page)hyperlink levelThe research at the hyperlink level is also calledHyperlink Analysis Jaideep Srivastava46

Motivation to study Hyperlink StructureHyperlinks serve two main purposes.Pure Navigation.Point to pages with authority* on the same topic ofthe page containing the link.This can be used to retrieve useful informationfrom the web.* - a set of ideas or statements supporting a topic Jaideep Srivastava47

Web Structure Terminology(1)Web-graph: A directed graph that represents theWeb.Node: Each Web page is a node of the Web-graph.Link: Each hyperlink on the Web is a directed edgeof the Web-graph.In-degree: The in-degree of a node, p, is thenumber of distinct links that point to p.Out-degree: The out-degree of a node, p, is thenumber of distinct links originating at p that point toother nodes. Jaideep Srivastava48

Web Structure Terminology(2)Directed Path: A sequence of links, starting from pthat can be followed to reach q.Shortest Path: Of all the paths between nodes pand q, which has the shortest length, i.e. number oflinks on it.Diameter: The maximum of all the shortest pathsbetween a pair of nodes p and q, for all pairs ofnodes p and q in the Web-graph. Jaideep Srivastava49

Interesting Web Structures[ERC 2000]Mutual ReinforcementEndorsementSocial ChoiceCo-CitationTransitive Endorsement Jaideep Srivastava50

The Bow-Tie Model of the Web[BKM 2000] Jaideep Srivastava51

Hyperlink Analysis Techniques [DSKT2002]Knowledge Models: The underlying representations thatforms the basis to carry out the application specific taskAnalysis Scope and Properties: The scope of analysisspecifies if the task is relevant to a single node or set of nodesor the entire graph. The properties are the characteristics ofsingle node or the set of nodes or the entire webMeasures and Algorithms: The measures are the standardsfor the properties such as quality, relevance or distancebetween the nodes. Algorithms are designed to for efficientcomputation of these measuresThese three areas form the fundamental blocks for buildingvarious Applications based on hyperlink analysis Jaideep Srivastava52

Hyperlink Analysis Techniques Jaideep Srivastava53

Google’s PageRank [BP1998]dP1P21OutDeg ( P1)1OutDeg ( P 2)1OutDeg ( P3)P3PNKey ideaRank of a web pagedepends on the rank ofthe web pages pointingto it PR( P1)PR( P 2)PR( P3) PR( P ) d N (1 d ) OutDeg ( P1) OutDeg ( P 2) OutDeg ( P3) Jaideep Srivastava54

The PageRank Algorithm [BP1998]Set PR [r1, r2, .rN], where r-i is some initial rank ofpage I, and N the number of Web pages in thegraph;d 0.15; D [1/N .1/N]T;A is the adjacency matrix as described above;doPRi 1 AT*PRi ;PRi 1 (1-d)* PRi 1 d*D;δ PRi 1 - PRi 1while δ ε, where ε is a small number indicating theconvergence thresholdreturn PR. Jaideep Srivastava55

Hubs and Authorities [K1998]Key ideasHubs and authorities are‘fans’ and ‘centers’ in abipartite core of a web graphA good hub page is one thatpoints to many good authoritypagesA good authority page is onethat is pointed to by manygood hub pagesHubsAuthoritiesBipartite Core Jaideep Srivastava56

HITS Algorithm [K1998]Let a is the vector of authority scores and h be the vector of hubscoresa [1,1, .1], h [1,1, .1] ;doa ATh;h Aa;Normalize a and h;while a and h do not converge(reach a convergence threshold)a* a;h* h;return a*, h*The vectors a* and h*represent the authority and hub weights Jaideep Srivastava57

Information Scent [CPCP2001]Key ideaa user at a given page “foraging” for information would follow alink which “smells” of that informationthe probability of following a link depends on how strong the“scent” is on that linkDistal Scent(content from pageat other end of link)Proximal Cues(snippets, graphics)P1Scent Jaideep SrivastavaP258

Web Communities [FLG2000]DefinitionWeb communities can bedescribed as a collection ofweb pages such that eachmember node has morehyperlinks ( in either direction)within the community thanoutside the community.Approach Maximal-flow model Graph substructureidentificationWeb Communities Jaideep Srivastava59

Max Flow- Min Cut AlgorithmDetermine minimal cutDetermine the Communityof this node (Source)CentralPage LikeYahoo (Sink)CommunityCommunity Jaideep Srivastava60

ConclusionsWeb Structure is a useful source for extractinginformation such asQuality of Web Page- The authority of a page on a topic- Ranking of web pagesInteresting Web Structures-Graph patterns like Co-citation, Social choice,Complete bipartite graphs, etc.Web Page Classification- Classifying web pages according to various topics Jaideep Srivastava61

Conclusions (Cont )Which pages to crawl- Deciding which web pages to add to the collection ofweb pagesFinding Related Pages- Given one relevant page, find all related pagesDetection of duplicated pages- Detection of neared-mirror sites to eliminateduplication Jaideep Srivastava62

Web Usage Mining Jaideep Srivastava63

What is Web Usage Mining?A Web is a collection of inter-related files on one or moreWeb serversWeb Usage MiningDiscovery of meaningful patterns from data generated byclient-server transactions on one or more Web localitiesTypical Sources of Dataautomatically generated data stored in server accesslogs, referrer logs, agent logs, and client-side cookiesuser profilesmeta data: page attributes, content attributes, usage data Jaideep Srivastava64

The Web Usage Mining Process Jaideep Srivastava65

Preprocessing ArchitecturePathCompletionServer Session FileData CleaningUser/SessionIdentificationPage ViewIdentificationEpisodeIdentificationRaw UsageDataUsage StatisticsSite Structureand ContentEpisode File Jaideep Srivastava66

ECLF Log File FormatIP Address128.101.35.92rfc931 authuser Date and time of request--[09/Mar/2002:00:03:18 -0600]requeststatus"GET / harum/ HTTP/1.0"IP address: IP address of theremote hostRfc931: the remote login name ofthe userAuthuser: the username as whichthe user has authenticatedhimselfDate: date and time of the requestRequest:the request line exactlyas it came from the clientStatus: the HTTP response codereturned to the r agentMozilla/4.7 [en] (X11; I; SunOS 5.8 sun4u)Bytes: The number of bytestransferredReferer: The url the client was onbefore requesting your urlUser agent: The software theclient claims to be using Jaideep Srivastava67

Issues in Usage DataSession IdentificationCGI DataCachingDynamic PagesRobot Detection and FilteringTransaction IdentificationIdentify Unique UsersIdentify Unique User transaction Jaideep Srivastava68

Session Identification Problems“AOL Effect”: Single IP Address/ Multiple UsersISP Proxy ServersPublic Access Machines“WebTV Effect”: Multiple IP Addresses/ SingleSessionRotating IP for load balancingPrivacy tools Jaideep Srivastava69

Session Identification SolutionsCookies - small piece of code that is saved onthe client machineUser Login – Require user to use login IDwith passwordEmbedded SessionID.IP Agent.Client-side tracking Jaideep Srivastava70

CGI DataCommon Gateway Interface (CGI): Methodused to pass variables and user entered datato Content ServerSet of name/value pairs that are attached toend of a URI Jaideep Srivastava71

Example URIBase URI/cgi-bin/templates?BV EngineID falfiffkdgfbemmcfnnckcgl.0&BVOperation Dyn RawSmartLink&BV SessionID 2131083763.936854172&BV ServiceName MyStore&form%25destination mysite/logo.tmplCGI Data Jaideep Srivastava72

CGI Data ProblemsHidden Values: POST requests have a “hidden”option that removes the name/value pairs fromthe URIContent Servers can maintain “state” in the formof session variables. The relevant data fordetermining what page was accessed may notbe in the current CGI pairs Jaideep Srivastava73

CGI Data SolutionsPull data directly from the HTTP traffic instead of theServer logAdvantages: Generic, works for any Web server/Contentserver configurationDisadvantages: No access to secure data. No access tointernal Content server variablesHave Content server create an “access log”Advantages: All relevant information is always availableClean log of page views instead of file accesses iscreated. No sessionID “first access” problemsDisadvantages: Content server performance may bedegraded. Not automatic like Server logs Jaideep Srivastava74

Caching ProblemsClients and Proxy Servers save local copies ofpages that have been accessedUses of the “back” and “forward” buttons on abrowser may access local copy instead ofrequesting a new one from the server Jaideep Srivastava75

Server Log Incompleteness due ccess pattern:Record in server log::page2.htmlmissed!index, page1, index, page2index, page1, page2 Jaideep Srivastava76

Wrong Access Timings Recordedat ServerServerTime capturedby server logClientge 1Request pat1t2Sent paget01t3ge 2Request pat4}Actual viewing timet5 Jaideep Srivastava77

Missed Page Views at Server Viewing time for cached pagesClientt1-3CacheServert2-0Page 2 viewing timet2-3}t1-0}Page 1 viewing timet3-0t3-3t2-1 t2-2t1-1 t1-2t3-1t3-2Viewing time calculated from server log Jaideep Srivastava78

Caching SolutionsDynamic content greatly reduces the number ofcached page accessesAdvantages: Fewer “missed” page viewsDisadvantages: Increased Server traffic“Negative” expiration dates for pages forcebrowsers to request a new version Jaideep Srivastava79

Robot Detection and Filtering[TK2002]Web robots are software programs that automaticallytraverse the hyperlink structure of world wide web inorder to locate and retrieve informationMotivation for distinguishing web robot visits from otherusersUnauthorized gathering of business information at ecommerce web sitesConsumption of considerable network bandwidthDifficulty in performing click-stream analysiseffectively on web data Jaideep Srivastava80

Transaction IdentificationMain Questions:how to identify unique usershow to identify/define a user transactionProblems:user ids are often suppressed due to security concernsindividual IP addresses are sometimes hidden behind proxy serversclient-side & proxy caching makes server log data less reliableStandard Solutions/Practices:user registrationnot full-proofclient-side cookiescache bustingincreases network traffic}} Jaideep Srivastava81

Heuristics for TransactionIdentificationIdentifying User Sessionsuse IP, agent, and OS fields as key attributesuse client-side cookies & unique user ids, if availableuse session time-outsuse synchronized referrer log entries and time stamps to expanduser paths belonging to a sessionpath completion to infer cached referencesEX: expanding a session A B C by an access pair(B D) results in: A B C B Dto disambiguate paths, sessions are expanded based on pageattributes (size, type), reference length, and no. of back referencesrequired to complete the path Jaideep Srivastava82

Inferring User Transactions fromSessionsStudies show that reference lengthsfollow an exponential distribution.Page types: navigational, content, mixed.Page types correlate with reference lengths.Can automatically classify pages asnavigational or content using % ofnavigational pages (based on sitetopology) and a normal estimate of Chisquared distribution.A transaction is an intra-session path endingin a content page.Histogram ofpage referencelengths (secs)navigationalpages Jaideep Srivastavacontentpages83

Associations in Web TransactionsAssociation Rules:discovers affinities among sets of items across transactionsα, σX Ywhere X, Y are sets of items, α confidence, σ supportExamples:60% of clients who accessed /products/, also accessed/products/software/webminer.htm.30% of clients who accessed /special-offer.html, placedan online order in /products/software/.(Actual Example from IBM official Olympics Site){Badminton, Diving} {Table Tennis} (α 69.7%, σ 0.35%) Jaideep Srivastava84

Other Patterns from WebTransactionsSequential Patterns:30% of clients who visited /products/software/, had done a search inYahoo using the keyword “software” before their visit60% of clients who placed an online order for WEBMINER, placed anotheronline order for software within 15 daysClustering and Classificationclients who often access /products/software/webminer.html tendto be from educational institutions.clients who placed an online order for software tend to be students in the20-25 age group and live in the United States.75% of clients who download software from /products/software/demos/ visitbetween 7:00 and 11:00 pm on weekends. Jaideep Srivastava85

Path and Usage Pattern DiscoveryTypes of Path/Usage InformationMost Frequent paths traversed by usersEntry and Exit PointsDistribution of user session durations / User AttritionExamples:60% of clients who accessed /home/products/file1.html,followed the path /home /home/whatsnew /home/products /home/products/file1.html(Olympics Web site) 30% of clients who accessed sport specificpages started from the Sneakpeek page.65% of clients left the site after 4 or less references. Jaideep Srivastava86

Pattern AnalysisPattern Analysis Tools/TechniquesKnowledge Query MechanismOLAP / Visualization ToolsIntelligent Agents / Expert SystemsWEBMINER: SQL-like Knowledge Query MechanismSELECT association-rules (A*B*C*)FROM “rules.out”WHERE time 970101 AND domain “edu” ANDsupport 0.01 AND confidence .85 Jaideep Srivastava87

Implications of Web Usage Mining forE-commerceElectronic Commercedetermine lifetime value of clientsdesign cross marketing strategies across productsevaluate promotional campaignstarget electronic ads and coupons at user groupsbased on their access patternspredict user behavior based on previously learnedrules and users’ profilepresent dynamic information to users based ontheir interests and profiles Jaideep Srivastava88

Implications for Other ApplicationsEffective and Efficient Web Presencedetermine the best way to structure the Web siteidentify “weak links” for elimination or enhancementA “site-specific” web design agentPre-fetch files that are most likely to be accessedIntra-Organizational Applicationsenhance workgroup management & communicationevaluate Intranet effectiveness and identify structuralneeds & requirements Jaideep Srivastava89

What’s Round-the-Corner for WUMData Integration and meta-level schemasEnhanced knowledge query mechanism, user interface,and visualization modules (OLAP)Intelligent agent to extract the most “interesting” rulesfrom among the discovered rulesBetter models of user behavior (e.g. InformationForaging)Rule-based expert system to provide “suggestions”based on discovered rules Jaideep Srivastava90

Related Concepts Jaideep Srivastava91

Interestingness Measure [PT1998,C2000]A measurement of patterns that are subjectively differentfrom what is expected and above a certain supportthresholdIn the World Wide Web, there are two sources ofinformationWeb Structure: Reflects the aut

Jaideep Srivastava 1 Web Mining : Accomplishments & Future Directions Jaideep Srivastava University of Minnesota USA srivasta@cs.umn.edu http://www.cs.umn.edu .