Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data


Jaideep Srivastava*†, Robert Cooley‡, Mukund Deshpande, Pang-Ning Tan
Department of Computer Science and Engineering
University of Minnesota
200 Union St SE
Minneapolis, MN 55455
{srivasta,cooley,deshpande,ptan}@cs.umn.edu

*Can be contacted at jaideep@amazon.com. †Supported by NSF grant NSF/EIA-9818338. ‡Supported by NSF grant EHR-9554517.

ABSTRACT

Web usage mining is the application of data mining techniques to discover usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Web usage mining consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. This paper describes each of these phases in detail. Given its application potential, Web usage mining has seen a rapid increase in interest, from both the research and practice communities. This paper provides a detailed taxonomy of the work in this area, including research efforts as well as commercial offerings. An up-to-date survey of the existing work is also provided. Finally, a brief overview of the WebSIFT system as an example of a prototypical Web usage mining system is given.

Keywords: data mining, world wide web, web usage mining.

1. INTRODUCTION

The ease and speed with which business transactions can be carried out over the Web has been a key driving force in the rapid growth of electronic commerce. Specifically, e-commerce activity that involves the end user is undergoing a significant revolution. The ability to track users' browsing behavior down to individual mouse clicks has brought the vendor and end customer closer than ever before. It is now possible for a vendor to personalize his product message for individual customers at a massive scale, a phenomenon that is being referred to as mass customization.

The scenario described above is one of many possible applications of Web Usage mining, which is the process of applying data mining techniques to the discovery of usage patterns from Web data, targeted towards various applications. Data mining efforts associated with the Web, called Web mining, can be broadly divided into three classes, i.e. content mining, usage mining, and structure mining. Web Structure mining projects such as [34; 54] and Web Content mining projects such as [47; 21] are beyond the scope of this survey. An early taxonomy of Web mining is provided in [29], which also describes the architecture of the WebMiner system [42], one of the first systems for Web Usage mining. The proceedings of the recent WebKDD workshop [41], held in conjunction with the KDD-1999 conference, provide a sampling of some of the current research being performed in the area of Web Usage Analysis, including Web Usage mining.

This paper provides an up-to-date survey of Web Usage mining, including both academic and industrial research efforts, as well as commercial offerings. Section 2 describes the various kinds of Web data that can be useful for Web Usage mining. Section 3 discusses the challenges involved in discovering usage patterns from Web data. The three phases are preprocessing, pattern discovery, and pattern analysis. Section 4 provides a detailed taxonomy and survey of the existing efforts in Web Usage mining, and Section 5 gives an overview of the WebSIFT system [31], as a prototypical example of a Web Usage mining system. Finally, Section 6 discusses privacy concerns and Section 7 concludes the paper.

2. WEB DATA

One of the key steps in Knowledge Discovery in Databases [33] is to create a suitable target data set for the data mining tasks. In Web Mining, data can be collected at the server side, client side, proxy servers, or obtained from an organization's database (which contains business data or consolidated Web data). Each type of data collection differs not only in terms of the location of the data source, but also the kinds of data available, the segment of population from which the data was collected, and its method of implementation.

There are many kinds of data that can be used in Web Mining. This paper classifies such data into the following types:

- Content: The real data in the Web pages, i.e. the data the Web page was designed to convey to the users. This usually consists of, but is not limited to, text and graphics.

- Structure: Data which describes the organization of the content. Intra-page structure information includes the arrangement of various HTML or XML tags within a given page. This can be represented as a tree structure, where the <html> tag becomes the root of the tree.

  The principal kind of inter-page structure information is hyper-links connecting one page to another.

- Usage: Data that describes the pattern of usage of Web pages, such as IP addresses, page references, and the date and time of accesses.

- User Profile: Data that provides demographic information about users of the Web site. This includes registration data and customer profile information.

2.1 Data Sources

The usage data collected at the different sources will represent the navigation patterns of different segments of the overall Web traffic, ranging from single-user, single-site browsing behavior to multi-user, multi-site access patterns.

2.1.1 Server Level Collection

A Web server log is an important source for performing Web Usage Mining because it explicitly records the browsing behavior of site visitors. The data recorded in server logs reflects the (possibly concurrent) access of a Web site by multiple users. These log files can be stored in various formats such as Common log or Extended log formats. An example of Extended log format is given in Figure 2 (Section 3). However, the site usage data recorded by server logs may not be entirely reliable due to the presence of various levels of caching within the Web environment. Cached page views are not recorded in a server log. In addition, any important information passed through the POST method will not be available in a server log. Packet sniffing technology is an alternative to collecting usage data through server logs. Packet sniffers monitor network traffic coming to a Web server and extract usage data directly from TCP/IP packets. The Web server can also store other kinds of usage information such as cookies and query data in separate logs. Cookies are tokens generated by the Web server for individual client browsers in order to automatically track the site visitors. Tracking of individual users is not an easy task due to the stateless connection model of the HTTP protocol. Cookies rely on implicit user cooperation and thus have raised growing concerns regarding user privacy, which will be discussed in Section 6. Query data is also typically generated by online visitors while searching for pages relevant to their information needs. Besides usage data, the server side also provides content data, structure information and Web page meta-information (such as the size of a file and its last modified time).

The Web server also relies on other utilities such as CGI scripts to handle data sent back from client browsers. Web servers implementing the CGI standard parse the URI¹ of the requested file to determine if it is an application program. The URI for CGI programs may contain additional parameter values to be passed to the CGI application. Once the CGI program has completed its execution, the Web server sends the output of the CGI application back to the browser.

¹Uniform Resource Identifier (URI) is a more general definition that includes the commonly referred to Uniform Resource Locator (URL).
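To make the log formats concrete, the following is a minimal sketch of parsing one entry of an NCSA combined-style access log (Common Log Format plus referrer and agent fields) into a dictionary. The regular expression, the field names, and the sample line (loosely adapted from Figure 2) are illustrative assumptions; actual Extended log formats are configurable and vary between servers.

```python
import re
from datetime import datetime

# Illustrative pattern for a combined-format log line; field names are our own.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<uri>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of fields for one log entry, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Sample entry loosely adapted from the paper's Figure 2 (fictitious IP address).
line = ('123.456.78.9 - - [25/Apr/1998:03:04:41 -0500] "GET A.html HTTP/1.0" '
        '200 3290 "-" "Mozilla/3.04 (Win95, I)"')
print(parse_log_line(line)["uri"])  # -> A.html
```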
2.1.2 Client Level Collection

Client-side data collection can be implemented by using a remote agent (such as Javascripts or Java applets) or by modifying the source code of an existing browser (such as Mosaic or Mozilla) to enhance its data collection capabilities. The implementation of client-side data collection methods requires user cooperation, either in enabling the functionality of the Javascripts and Java applets, or to voluntarily use the modified browser. Client-side collection has an advantage over server-side collection because it ameliorates both the caching and session identification problems. However, Java applets perform no better than server logs in terms of determining the actual view time of a page. In fact, they may incur some additional overhead, especially when the Java applet is loaded for the first time. Javascripts, on the other hand, consume little interpretation time but cannot capture all user clicks (such as reload or back buttons). These methods will collect only single-user, single-site browsing behavior. A modified browser is much more versatile and will allow data collection about a single user over multiple Web sites. The most difficult part of using this method is convincing the users to use the browser for their daily browsing activities. This can be done by offering incentives to users who are willing to use the browser, similar to the incentive programs offered by companies such as NetZero [9] and AllAdvantage [2] that reward users for clicking on banner advertisements while surfing the Web.

2.1.3 Proxy Level Collection

A Web proxy acts as an intermediate level of caching between client browsers and Web servers. Proxy caching can be used to reduce the loading time of a Web page experienced by users as well as the network traffic load at the server and client sides [27]. The performance of proxy caches depends on their ability to predict future page requests correctly. Proxy traces may reveal the actual HTTP requests from multiple clients to multiple Web servers. This may serve as a data source for characterizing the browsing behavior of a group of anonymous users sharing a common proxy server.

2.2 Data Abstractions

The information provided by the data sources described above can all be used to construct/identify several data abstractions, notably users, server sessions, episodes, click-streams, and page views. In order to provide some consistency in the way these terms are defined, the W3C Web Characterization Activity (WCA) [14] has published a draft of Web term definitions relevant to analyzing Web usage. A user is defined as a single individual that is accessing files from one or more Web servers through a browser.
While this definition seems trivial, in practice it is very difficult to uniquely and repeatedly identify users. A user may access the Web through different machines, or use more than one agent on a single machine. A page view consists of every file that contributes to the display on a user's browser at one time. Page views are usually associated with a single user action (such as a mouse-click) and can consist of several files such as frames, graphics, and scripts. When discussing and analyzing user behaviors, it is really the aggregate page view that is of importance. The user does not explicitly ask for "n" frames and "m" graphics to be loaded into his or her browser; the user requests a "Web page." All of the information to determine which files constitute a page view is accessible from the Web server.

A click-stream is a sequential series of page view requests. Again, the data available from the server side does not always provide enough information to reconstruct the full click-stream for a site. Any page view accessed through a client or proxy-level cache will not be "visible" from the server side. A user session is the click-stream of page views for a single user across the entire Web. Typically, only the portion of each user session that is accessing a specific site can be used for analysis, since access information is not publicly available from the vast majority of Web servers. The set of page-views in a user session for a particular Web site is referred to as a server session (also commonly referred to as a visit). A set of server sessions is the necessary input for any Web Usage analysis or data mining tool. The end of a server session is defined as the point when the user's browsing session at that site has ended. Again, this is a simple concept that is very difficult to track reliably. Any semantically meaningful subset of a user or server session is referred to as an episode by the W3C WCA.

3. WEB USAGE MINING

As shown in Figure 1, there are three main tasks for performing Web Usage Mining or Web Usage Analysis. This section presents an overview of the tasks for each step and discusses the challenges involved.

3.1 Preprocessing

Preprocessing consists of converting the usage, content, and structure information contained in the various available data sources into the data abstractions necessary for pattern discovery.

3.1.1 Usage Preprocessing

Usage preprocessing is arguably the most difficult task in the Web Usage Mining process due to the incompleteness of the available data. Unless a client-side tracking mechanism is used, only the IP address, agent, and server side click-stream are available to identify users and server sessions. Some of the typically encountered problems are:

- Single IP address/Multiple Server Sessions - Internet service providers (ISPs) typically have a pool of proxy servers that users access the Web through. A single proxy server may have several users accessing a Web site, potentially over the same time period.

- Multiple IP address/Single Server Session - Some ISPs or privacy tools randomly assign each request from a user to one of several IP addresses. In this case, a single server session can have multiple IP addresses.

- Multiple IP address/Single User - A user that accesses the Web from different machines will have a different IP address from session to session. This makes tracking repeat visits from the same user difficult.

- Multiple Agent/Single User - Again, a user that uses more than one browser, even on the same machine, will appear as multiple users (a simple IP/agent grouping heuristic is sketched below).
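Before cookies or logins are considered, a rough first cut at user identification is to treat each distinct (IP address, agent) pair as a separate user. The helper below is a hypothetical sketch of that heuristic, reusing the parsed-entry dictionaries from the log-parsing sketch above; it does not attempt the path-analysis refinements just mentioned.

```python
from collections import defaultdict

def identify_users(entries):
    """Group parsed log entries by (IP, agent) as a crude proxy for distinct users.

    This under-counts users behind a shared proxy running identical browsers and
    over-counts a single user who switches machines or browsers; resolving those
    cases requires cookies, logins, or path analysis.
    """
    users = defaultdict(list)
    for entry in entries:
        users[(entry["ip"], entry["agent"])].append(entry)
    # Keep each pseudo-user's requests in time order for later sessionizing.
    for requests in users.values():
        requests.sort(key=lambda e: e["time"])
    return users
```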
Assuming each user has now been identified (through cookies, logins, or IP/agent/path analysis), the click-stream for each user must be divided into sessions. Since page requests from other servers are not typically available, it is difficult to know when a user has left a Web site. A thirty minute timeout is often used as the default method of breaking a user's click-stream into sessions. The thirty minute timeout is based on the results of [23]. When a session ID is embedded in each URI, the definition of a session is set by the content server.

While the exact content served as a result of each user action is often available from the request field in the server logs, it is sometimes necessary to have access to the content server information as well. Since content servers can maintain state variables for each active session, the information necessary to determine exactly what content is served by a user request is not always available in the URI. The final problem encountered when preprocessing usage data is that of inferring cached page references. As discussed in Section 2.2, the only verifiable method of tracking cached page views is to monitor usage from the client side. The referrer field for each request can be used to detect some of the instances when cached pages have been viewed.

Figure 2 shows a sample log that illustrates several of the problems discussed above (the first column would not be present in an actual server log, and is for illustrative purposes only). IP address 123.456.78.9 is responsible for three server sessions, and IP addresses 209.456.78.2 and 209.456.78.3 are responsible for a fourth session. Using a combination of referrer and agent information, lines 1 through 11 can be divided into three sessions of A-B-F-O-G, L-R, and A-B-C-J. Path completion would add two page references to the first session A-B-F-O-F-B-G, and one reference to the third session A-B-A-C-J. Without using cookies, an embedded session ID, or a client-side data collection method, there is no method for determining that lines 12 and 13 are actually a single server session.
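The timeout rule translates directly into code. Below is a minimal sketch of timeout-based sessionizing for one user's time-ordered click-stream, assuming the parsed-entry format from the earlier sketches; the thirty minute default follows the discussion above.

```python
from datetime import timedelta

def sessionize(requests, timeout=timedelta(minutes=30)):
    """Split one user's time-ordered requests into server sessions.

    A new session starts whenever the gap between consecutive requests exceeds
    the timeout; thirty minutes is the common default based on [23].
    """
    sessions, current = [], []
    last_time = None
    for request in requests:
        if last_time is not None and request["time"] - last_time > timeout:
            sessions.append(current)
            current = []
        current.append(request)
        last_time = request["time"]
    if current:
        sessions.append(current)
    return sessions
```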
3.1.2 Content Preprocessing

Content preprocessing consists of converting the text, image, scripts, and other files such as multimedia into forms that are useful for the Web Usage Mining process. Often, this consists of performing content mining such as classification or clustering. While applying data mining to the content of Web sites is an interesting area of research in its own right, in the context of Web Usage Mining the content of a site can be used to filter the input to, or output from, the pattern discovery algorithms. For example, results of a classification algorithm could be used to limit the discovered patterns to those containing page views about a certain subject or class of products. In addition to classifying or clustering page views based on topics, page views can also be classified according to their intended use [50; 30]. Page views can be intended to convey information (through text, graphics, or other multimedia), gather information from the user, allow navigation (through a list of hypertext links), or some combination of uses. The intended use of a page view can also filter the sessions before or after pattern discovery.

In order to run content mining algorithms on page views, the information must first be converted into a quantifiable format. Some version of the vector space model [51] is typically used to accomplish this. Text files can be broken up into vectors of words. Keywords or text descriptions can be substituted for graphics or multimedia. The content of static page views can be easily preprocessed by parsing the HTML and reformatting the information or running additional algorithms as desired.
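As a small illustration of the vector space model mentioned above, the sketch below reduces page text to term-frequency vectors and compares two pages with cosine similarity. The naive tokenizer and the absence of stemming or stop-word removal are simplifying assumptions.

```python
import re
from collections import Counter

def term_vector(text):
    """Break a page's text into a bag-of-words term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine_similarity(u, v):
    """Cosine similarity between two sparse term vectors."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm = lambda w: sum(count * count for count in w.values()) ** 0.5
    return dot / (norm(u) * norm(v)) if u and v else 0.0

page_a = term_vector("Electronic products: cameras, players, and accessories")
page_b = term_vector("Cameras and camera accessories on sale this week")
print(round(cosine_similarity(page_a, page_b), 3))
```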

Figure 1: High Level Web Usage Mining Process

Figure 2: Sample Web Server Log

Dynamic page views present more of a challenge. Content servers that employ personalization techniques and/or draw upon databases to construct the page views may be capable of forming more page views than can be practically preprocessed. A given set of server sessions may only access a fraction of the page views possible for a large dynamic site. Also, the content may be revised on a regular basis. The content of each page view to be preprocessed must be "assembled", either by an HTTP request from a crawler, or a combination of template, script, and database accesses. If only the portion of page views that are accessed are preprocessed, the output of any classification or clustering algorithms may be skewed.

3.1.3 Structure Preprocessing

The structure of a site is created by the hypertext links between page views. The structure can be obtained and preprocessed in the same manner as the content of a site. Again, dynamic content (and therefore links) poses more problems than static page views. A different site structure may have to be constructed for each server session.

3.2 Pattern Discovery

Pattern discovery draws upon methods and algorithms developed from several fields such as statistics, data mining, machine learning and pattern recognition. However, it is not the intent of this paper to describe all the available algorithms and techniques derived from these fields. Interested readers should consult references such as [33; 24]. This section describes the kinds of mining activities that have been applied to the Web domain. Methods developed from other fields must take into consideration the different kinds of data abstractions and prior knowledge available for Web Mining. For example, in association rule discovery, the notion of a transaction for market-basket analysis does not take into consideration the order in which items are selected. However, in Web Usage Mining, a server session is an ordered sequence of pages requested by a user. Furthermore, due to the difficulty in identifying unique sessions, additional prior knowledge is required (such as imposing a default timeout period, as was pointed out in the previous section).

3.2.1 Statistical Analysis

Statistical techniques are the most common method to extract knowledge about visitors to a Web site. By analyzing the session file, one can perform different kinds of descriptive statistical analyses (frequency, mean, median, etc.) on variables such as page views, viewing time and length of a navigational path. Many Web traffic analysis tools produce a periodic report containing statistical information such as the most frequently accessed pages, average view time of a page or average length of a path through a site. This report may include limited low-level error analysis such as detecting unauthorized entry points or finding the most common invalid URI. Despite lacking in the depth of its analysis, this type of knowledge can be potentially useful for improving the system performance, enhancing the security of the system, facilitating the site modification task, and providing support for marketing decisions.
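A few of the descriptive statistics named above can be computed directly from sessionized data. The sketch below assumes the session representation from the earlier sketches (a list of parsed requests per session) and is deliberately far simpler than a commercial traffic-analysis report.

```python
from collections import Counter
from statistics import mean, median

def session_report(sessions):
    """Compute a handful of the descriptive statistics discussed above."""
    lengths = [len(session) for session in sessions]
    page_counts = Counter(req["uri"] for session in sessions for req in session)
    return {
        "sessions": len(sessions),
        "mean_session_length": mean(lengths),
        "median_session_length": median(lengths),
        "most_frequent_pages": page_counts.most_common(5),
    }
```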
3.2.2 Association Rules

Association rule generation can be used to relate pages that are most often referenced together in a single server session. In the context of Web Usage Mining, association rules refer to sets of pages that are accessed together with a support value exceeding some specified threshold. These pages may not be directly connected to one another via hyperlinks. For example, association rule discovery using the Apriori algorithm [18] (or one of its variants) may reveal a correlation between users who visited a page containing electronic products and those who accessed a page about sporting equipment. Aside from being applicable for business and marketing applications, the presence or absence of such rules can help Web designers to restructure their Web site. The association rules may also serve as a heuristic for prefetching documents in order to reduce user-perceived latency when loading a page from a remote site.
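To ground the discussion, here is a sketch of the support-counting core of an Apriori-style search over sessions treated as unordered page sets. It stops at frequent page sets and leaves confidence-based rule generation aside; the minimum-support threshold is an arbitrary example value, and the brute-force candidate enumeration is only practical for small sessions.

```python
from itertools import combinations

def frequent_page_sets(sessions, min_support=0.1, max_size=3):
    """Find page sets whose support meets min_support, Apriori style.

    A size-k candidate is counted only if all of its (k-1)-subsets were
    frequent at the previous level (the Apriori pruning property).
    """
    baskets = [frozenset(req["uri"] for req in session) for session in sessions]
    min_count = min_support * len(baskets)
    frequent, prev_level = {}, None
    for k in range(1, max_size + 1):
        counts = {}
        for basket in baskets:
            for cand in combinations(sorted(basket), k):
                if prev_level is not None and any(
                    sub not in prev_level for sub in combinations(cand, k - 1)
                ):
                    continue  # prune: some subset was already infrequent
                counts[cand] = counts.get(cand, 0) + 1
        level = {c: n for c, n in counts.items() if n >= min_count}
        if not level:
            break
        frequent.update(level)
        prev_level = set(level)
    return frequent
```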
3.2.3 Clustering

Clustering is a technique to group together a set of items having similar characteristics. In the Web Usage domain, there are two kinds of interesting clusters to be discovered: usage clusters and page clusters. Clustering of users tends to establish groups of users exhibiting similar browsing patterns. Such knowledge is especially useful for inferring user demographics in order to perform market segmentation in E-commerce applications or provide personalized Web content to the users. On the other hand, clustering of pages will discover groups of pages having related content. This information is useful for Internet search engines and Web assistance providers. In both applications, permanent or dynamic HTML pages can be created that suggest related hyperlinks to the user according to the user's query or past history of information needs.
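To make the notion of usage clusters concrete, the sketch below represents each session as a binary vector over the site's pages and groups the vectors with a plain k-means. The binary representation and Euclidean distance are simplifying assumptions; practical systems use similarity measures better suited to clickstream data.

```python
import random

def session_vectors(sessions, pages):
    """Represent each session as a binary visit vector over the site's pages."""
    index = {page: i for i, page in enumerate(pages)}
    vectors = []
    for session in sessions:
        v = [0.0] * len(pages)
        for req in session:
            v[index[req["uri"]]] = 1.0
        vectors.append(v)
    return vectors

def kmeans(vectors, k, iterations=20, seed=0):
    """Plain k-means; returns a cluster label for each session vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each session to its nearest centroid (squared Euclidean distance).
        for i, v in enumerate(vectors):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels
```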
3.2.4 Classification

Classification is the task of mapping a data item into one of several predefined classes [33]. In the Web domain, one is interested in developing a profile of users belonging to a particular class or category. This requires extraction and selection of features that best describe the properties of a given class or category. Classification can be done by using supervised inductive learning algorithms such as decision tree classifiers, naive Bayesian classifiers, k-nearest neighbor classifiers, Support Vector Machines, etc. For example, classification on server logs may lead to the discovery of interesting rules such as: 30% of users who placed an online order in /Product/Music are in the 18-25 age group and live on the West Coast.

3.2.5 Sequential Patterns

The technique of sequential pattern discovery attempts to find inter-session patterns such that the presence of a set of items is followed by another item in a time-ordered set of sessions or episodes. By using this approach, Web marketers can predict future visit patterns, which will be helpful in placing advertisements aimed at certain user groups. Other types of temporal analysis that can be performed on sequential patterns include trend analysis, change point detection, or similarity analysis.

3.2.6 Dependency Modeling

Dependency modeling is another useful pattern discovery task in Web Mining. The goal here is to develop a model capable of representing significant dependencies among the various variables in the Web domain. As an example, one may be interested to build a model representing the different stages a visitor undergoes while shopping in an online store based on the actions chosen (i.e. from a casual visitor to a serious potential buyer). There are several probabilistic learning techniques that can be employed to model the browsing behavior of users. Such techniques include Hidden Markov Models and Bayesian Belief Networks. Modeling of Web usage patterns will not only provide a theoretical framework for analyzing the behavior of users but is potentially useful for predicting future Web resource consumption. Such information may help develop strategies to increase the sales of products offered by the Web site or improve the navigational convenience of users.

3.3 Pattern Analysis

Pattern analysis is the last step in the overall Web Usage mining process as described in Figure 1. The motivation behind pattern analysis is to filter out uninteresting rules or patterns from the set found in the pattern discovery phase. The exact analysis methodology is usually governed by the application for which Web mining is done. The most common form of pattern analysis consists of a knowledge query mechanism such as SQL. Another method is to load usage data into a data cube in order to perform OLAP operations. Visualization techniques, such as graphing patterns or assigning colors to different values, can often highlight overall patterns or trends in the data. Content and structure information can be used to filter out patterns containing pages of a certain usage type, content type, or pages that match a certain hyperlink structure.

4. TAXONOMY AND PROJECT SURVEY

Since 1996 there have been several research projects and commercial products that have analyzed Web usage data for a number of different purposes. This section describes the dimensions and application areas that can be used to classify Web Usage Mining projects.

4.1 Taxonomy Dimensions
