Sequential Pattern Mining: A Comparison Between GSP, SPADE And . - IJEDR

Transcription

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Sequential Pattern Mining: A Comparison betweenGSP, SPADE and Prefix SPAN1Manika Verma, 2Dr. Devarshi Mehta1Assistant Professor, 2Associate ProfessorDepartment of Computer Science, Kadi Sarva Vishwavidyalaya,2GLS Institute of Computer Technology, Ahmedabad, India1Abstract - This paper presents a comparison between basically three kinds of algorithm GSP (Generalized SequentialPattern), SPADE (An efficient Algorithm for mining Frequent Sequences) and Prefix Span (Prefix-projected SequentialPattern Mining). GSP is the Apriori based Horizontal formatting method, SPADE is the Apriori based vertical formattingmethod and Prefix-SPAN is Projection-based pattern growth method. This paper elaborate step wise explanation of eachalgorithm demonstrating number of iterations required in each algorithm. Later a comparison is made between Totaltime required to execute algorithm, count of frequent sequences found and Max memory (in mb) required by algorithmsGSP, SPADE and Prefix-SPAN. The above stated attributes i.e. total time; frequent sequences and Max Memory areobtained using SPMF (A sequential Pattern Mining Framework).Keywords - GSP, SPADE, Prefix-Span, Apriori-based, Projection-basedI. INTRODUCTIONA sequential pattern is a series of item-sets; item-sets in sequences are in specific order.Sequential pattern mining helps toextract the sequences which are most frequent in the sequence database, which in turn can be interpreted as domain knowledge forseveral purposes[6]. Sequential pattern mining is used in various areas for different purposes. It can be used for identifyingCustomer Shopping Sequence [8][7]. Various algorithms have been implemented for identifying frequent sequences fromsequence database. One of the approaches used to identify frequent sequences is apriori approach. “The apriori approach is basedon the apriori property, as introduced in the context of association rule mining. This property states that if a pattern a is notfrequent then any pattern bthat contains acannot be frequent. Two of the most successful algorithm that takes this approach areGSP and SPADE. The major difference between the two is that GSP uses a horizontal data format while SPADE uses vertical one.The sequence growth approach does not require candidate generation; it gradually grows the frequent sequences. Prefix spanwhich originated from FP-growth, uses this approach as follows: first it finds the frequent single items, then it generates a set ofprojected database, one database for each frequent item. Each of these databases is then mined recursively while concatenating thefrequent single item, then it generates a set of projected databases, one database for each frequent item. Each of these database isthen mined recursively while concatenating the frequent single items into a frequent sequence. These algorithms perform well indatabase consisting of short frequent sequences. However, when mining databases consisting of long frequent sequences e.g.stocks values, DNA chains, or machine monitoring Data, their overall performance worsens by an order of magnitude.” [8]Original Input Sequence DataBase[1]IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3016

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Figure 1To demonstrate step by step explanation of the algorithms GSP, SPADE and Prefixspan the above database shown in Figure 1is used as input. For all the algorithms, minimum support is 2. The Candidate sets are denoted by C. Ckspecifies candidate setshaving k items. For Eg C2 mean Candidate sets having 2 items. The Candidate sets that satisfy minimum support belongs to Lk. InLk ,Kdenotes number of items in sequence.II. ALGORITHMS OF SEQUENTIAL PATTERN MININGA. Generalized Sequential Pattern (GSP) Algorithm[2]GSP discovers sequential pattern. GSP scales linearly with the number of data sequences, and has very good scale-upproperties with respect to the average data sequence size. [2]Step by step explanation of how does GSP works:The database in figure 1 is taken in following format to demonstrate step by step explanation of GSP.Table 1.1SIDTIME(EID)ITEMS110,15,20,25 (CD)(ABC)(ABF)(ACDF) 215,20 (ABF)(E) 310 (ABF) 410,20,25 (DGH)(BF)(AGH) Table 1.1 To determine frequent items using GSP, consider minimum support as 2 in this example.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3017

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Table 1.2ItemsNo. of occurrence of ItemsA4B4C1D2E1F4G1H1The items that satisfy minimum support are, A,B,D,F . So A,B,D,F are Frequent 1- Sequence1.2.Frequent 1- itemsets are A,B,D,FUsing Apriori property, frequent itemsets obtained are shown in table below. C is used for denoting Candidate sets (Itemsets). Ck denotes candidates having k items. For eg C1 denotes item sets (or candidate sets) having 1 item.Table 1.3 C1ItemsNo. of occurrence of ItemsA4B4C1D2E1F4G1H1The items that satisfies minimum support belongs to L1. Lk denotes candidate having k items. For eg L1 denotes items sets (orcandidate sets) having 1 item.Table 1.4 L1ItemsNo. of occurrence of ItemsA4B4D2F43.From above Table 1.4 we can see that a,b,c,d are the that satisfy minimum support.C2 (where k 2 ) can be obtained by joining L1 X L1 but the condition is k-2 items must be common, so here while joiningfor obtaining items sets for C2 (2-2 0) items must be commonTable 1.5 C2ItemsNo. of occurrence of ItemsA- A1A- B1A- D1A- F1AB3IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3018

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939ADAFB- AB- BB- DB- FBDBFD- AD- BD- FD- DDFF- AF- BF- DF- F13211114222112111Candidates that satisfy minimum support belongs to L24.Table 1.6 L2ItemsNo. of occurrence of ItemsAB3AF3B- A2BF4D- A2D- B2D- F2F- A2Now, in this step we have to achieve 3-sequence item-sets, i.e item-sets having 3 items. C3(where k 3) can be obtained byjoining L2 X L2 but the condition is k-2 items must be common, so here while joining for obtaining items sets for C3 (32 1) items must be common.Table 1.7 C3ItemsNo. of occurrence of ItemsABF3AB- A1AF- A1BF- A2D- B- A2D- F- A2D- BF2F- AF1D- FA1D- AB1F- AB0IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3019

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939B- ABB- AFCandidates having minimum support belong to L3ItemsABFBF- AD- B- AD- F- AD- BF11Table 1.8 L3No. of occurrence of Items322225.In this step we have to achieve 4-sequence item-sets, i.e item-sets having 4 items. C4(where k 4) can be obtained byjoining L3 X L3 but the condition is k-2 items must be common, so here while joining for obtaining itemsets for C4 (42 2) items must be common.Table 1.9 C4ItemsNo. of occurrence of ItemsD- BF- A2Candidates having minimum support qualify for moving into L4.Table 1.10 L4ItemsNo. of occurrence of ItemsD- BF- A2Here, we have obtained only 1, 4-sequence(Sequence of item set having 4 items) item set.B. SPADE(An efficient Algorithm for mining Frequent Sequences)[1]SPADE is the algorithm used for fast discovery of Sequential pattern . Most of the sequential pattern mining algorithms assumehorizontal database layout .Spade uses vertical database format, where we maintain a disk based id-list for each item, show infollowing figure. [1]Figure 2 ID-List for ATOMSTo determine frequent items using SPADE, consider minimum support as 2 in this example.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3020

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Consider the database shown in figure 1 above. For generating 1-sequence (single items) the occurrence of each item is beingcounted thus those items that satisfy the minimum support (here minimum support 2) belongs to L1.1-Sequence items arespecified in above figure. Above figure specifies Id-List for the atoms.Step by step explanation of how does SPADE works:1. Items and count of their occurrence is given in table belowTable 2.1 C1ItemsNo. of occurrence of ItemsA4B4C1D2E1F4G1H1Table 2.2 L1ItemsNo. of occurrence of ItemsA4B4D2F42. Following steps shows how using temporal join we can identify frequent items.Here we wish to compute the support of sequence AB. The set is J {A,B,D,F}. We can perform temporal join one at a timeto obtain final-id list as shown in figure below. Here the relationship between AB is non-temporal one, which we call equalityjoin, since they must occur at the same time. We thus find all occurrences of A and B with the same e-id and store them in idlist shown below.[7,page 40]Table 2.3 Id-list for AB (Non-temporal join )ABSIDEID AEIDB11515120202151531010In above table, SID is same mean the sequence occurred two times but belongs to samecustomer, so in this case we ignore 1count for sequence SID 1, so total number of occurrence of AB is 3. Thus minimum support is satisfied.Table 2.4 Id-list for AD (Non-temporal join )ADSIDEID AEIDD12525For AD, the number of occurrence is 1, which does not satisfy minimum support.Table 2.5 Id-list for AF (Non-temporal join )IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3021

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939AFSIDEID AEIDF12020125252151531010For AF, the number of occurrence is 3, which satisfies minimum support. While counting number of occurrences here twosequences are for same SID-1, so one count is ignored and hence total number of occurrence is 3.Table 2.6 Id-list for BD (Non-temporal join )BDSIDEID BEIDDFor BD, the number of occurrence is 0, which does not satisfy minimum support.Table 2.7 Id-list for BF (Non-temporal join )BFSIDEID BEIDF12020215153101042020For BF, the number of occurrence is 4, which satisfies min support.Table 2.8 Id-list for DF (Non-temporal join )DFSIDEID DEIDF12525For DF, the number of occurrence is 1, which does not satisfy minimum support.Table 2.9 Id-list for A- A (Temporal join )A- ASIDEID AEIDA1152012025For A- A, the number of occurrence is 2, but both occurrence belongs to same SID so number of occurrence is actually 1, whichdoes not satisfy minimum support.Table 2.10 Id-list for A- B (Temporal join )A- BSIDEID AEIDB11520For A- B, the number of occurrence is 1, which does not satisfy minimum support.Table 2.11 Id-list for A- D (Temporal join)IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3022

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939A- DSIDEID AEIDD1152512025For A- D, the number of occurrence is 1 and not 2 because both occurrences are for same SID so we would ignore one count andoccurrence of A- D is 1 which does not satisfy minimum support.Table 2.12 Id-list for A- D (Temporal join)A- FSIDEID AEIDF115201152512025For A- F, the number of occurrence is 1 and not 3 because all the three occurrences are for same SID so we would ignore 2counts and occurrence of A- F is 1 which does not satisfy minimum support.Table 2.13 Id-list for B- A (Temporal join)B- ASIDEID BEIDA1152042025For B- A, the number of occurrence is 2, which satisfies the minimum support.SID11Table 2.14 Id-list for B- B (Temporal join)B- BEID BEIDD15152020For B- B, the number of occurrence is 2, but both occurrence belongs to same SID so number of occurrence is actually 1, whichdoes not satisfy minimum support.Table 2.15 Id-list for B- D (Temporal join )B- DSIDEID BEIDD1152512025For B- D, the number of occurrence is 1 and not 2 because both the occurrences are for same SID so we would ignore 1 countand occurrence of B- D is 1 which does not satisfy minimum support.Table 2.16 Id-list for B- F (Temporal join )B- FSIDEID BEIDF12025For B- F, the number of occurrence is 1 which does not satisfy minimum support.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3023

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Table 2.17 Id-list for D- D (Temporal join )D- DSIDEID DEIDD11025For D- D, the number of occurrence is 1 which does not satisfy minimum support.Table 2.18 Id-list for D- A (Temporal join )D- ASIDEID DEIDA110201102541025For D- A, the number of occurrence is 3, but 2 occurrences are for same SID so number of occurrence for D- A is 2, whichsatisfies the minimum support.Table 2.19 Id-list for D- B (Temporal join)D- BSIDEID FEIDD110151102041020For D- B, the number of occurrence is 2, which satisfies the minimum support.Table 2.20 Id-list for D- F (Temporal join)D- FSIDEID DEIDF110201102541020For D- F, the number of occurrence is 3, but two occurrences are for same SID, so the number of occurrence of D- F is 2 whichsatisfy the minimum support.Table 2.21 Id-list for F- F (Temporal join)F- FSIDEID FEIDF12025For F- F, the number of occurrence is 1, which does not satisfy the minimum support.Table 2.22 Id-list for F- D (Temporal join)F- DSIDEID FEIDD12025For F- D, the number of occurrence is 1, which does not satisfy the minimum support.Table 2.23 Id-list for F- B (Temporal join)F- BSIDEID FEIDB12025IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3024

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939For F- B, the number of occurrence is 1, which does not satisfy the minimum support.Table 2.24 Id-list for F- A (Temporal join)F- ASIDEID FEIDA1202542025For F- A, the number of occurrence is 2, which satisfies the minimum support.From non-temporal and temporal joins obtained above, below is the table showing count of occurrence of sequences of items.ItemsABADAFBDBFDFA- AA- BA- DA- FB- AB- BB- DB- FD- DD- AD- BD- FF- FF- DF- BF- AL2 (Sequences that satisfy minimum support in C2)ItemsABAFBFB- AD- AD- BD- FF- AIJEDR1403022C2No. of occurrence3130411111211112221111No. Of occurrence33422222International Journal of Engineering Development and Research (www.ijedr.org)3025

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-99393.This step generates 3-sequenceTable 3.1 list for ABF (Non-Temporal join)ABFSIDEID AEID BEIDF120202021515153101010For ABF, the number of occurrence is 3, which satisfies the minimum support.Table 3.2 Id-list for AB- A (Temporal join)AB- ASIDEID AEID BEIDA115152012020252151531010For AB- A, the number of occurrence is 2, but as both sequence belongs to SID-1which does not satisfy the minimum support. Inthis case number of occurrence is 2 because we have sequence AB- A for SID 1, but for SID 2 and 3 we have AB together but ABis not followed by A, thus – (dash) is specified.Table 3.3 Id-list for AF- A (Temporal join)AF- ASIDEID AEID FEID A1202025125252151531010For AF- A, the number of occurrence is 1, which does not satisfy the minimum support.Table 3.4 Id-list for BF- A (Temporal join)BF- ASIDEID BEID FEID A120202521515310104202025For BF- A, the number of occurrence is 2, which satisfies the minimum support.Table 3.5 Id-list for D- B- A (Temporal join)D- B- ASIDEID DEID BEID A11015204102025For D- B- A, the number of occurrence is 2, which satisfies the minimum support.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3026

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Table 3.6 Id-list for D- F- A (Temporal join)D- F- ASIDEID DEID FEID A11020254102025For D- F- A, the number of occurrence is 2, which satisfies the minimum support.Table 3.7 Id-list for D- BF (Temporal join)D- BFSIDEID DEID BEIDF110202021515310104102020For D- BF, the number of occurrence is 2, which satisfies the minimum support.Table 3.8 Id-list for F- AF (Temporal join)F- AFSIDEID FEID AEID F12020125252151531010For F- AF, the number of occurrence is 0, which does not satisfy the minimum support.Table 3.9 Id-list for D- AF (Temporal join)D- AFSIDEID DEID AEID F110202011025252151531010For D- AF, the number of occurrence is 2, but both occurrences belongs to same SID-1, so actually number of occurrence is 1,which does not satisfy the minimum support.Table 3.10 Id-list for D- AB (Temporal join)D- ABSIDEID DEID AEID B110151511020202151531010For D- AB, the number of occurrence is 2, but both occurrence belongs to same SID-1, so actually number of occurrence is 1,which does not satisfy the minimum support.Table 3.11 Id-list for F- AB (Temporal join)F- ABIJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3027

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939SIDEID FEID AEID B11515120202151531010For F- AB, the number of occurrence is 0, which does not satisfy the minimum support.Table 3.12 Id-list for B- AB (Temporal join)B- ABSIDEID BEID AEID B1151511520202151531010For B- AB, the number of occurrence is 1, which does not satisfy the minimum support.Table 3.13 Id-list for B- AF (Temporal join)B- AFSIDEID BEID AEIDF1152020115252512025252151531010For B- AF, the number of occurrence is 3, but all the three sequences belongs to same SID, so number of occurrence is 1,whichdoes not satisfy the minimum support.From temporal joins obtained above, below is the table showing count of occurrence of sequences of items.C3ItemsNo. of occurrenceABF3AB- A1AF- A1BF- A2D- B- A2D- F- A2D- BF2F- AF0D- AF1D- AB1F- AB0B- AB1B- AF1L3IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3028

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939ItemsABFBF- AD- B- AD- F- AD- BF4.No. of occurrence32222This step generates 3-sequenceTable 4.1 Id-list for ABF- A (Temporal join)ABF- ASIDEID AEIDEID FEID AB12020202521515153101010For ABF- A, the number of occurrence is 1, which does not satisfy the minimum support.Table 4.2Id-list for D- BF- A (Temporal join)D- BF- ASIDEID D EID BEID F EID A1102020252151531010410202025For D- BF- A, the number of occurrence is 2, which satisfy minimum support.C. Prefix-projected Sequential Pattern Mining (PREFIX SPAN) Algorithm[3]PrefixSpan explores prefix-projection in sequential pattern mining. PrefixSpan mines the complete set of patterns but greatlyreduces the efforts of candidate sequence generation.Prefix-projection substantially reduces the size of projected databases andleads to efficient processing. [3]From the figure 1 of database, following given table is taken.SIDTIME(EID)ITEMS110,15,20,25 (CD)(ABC)(ABF)(ACDF) 215,20 (ABF)(E) 310 (ABF) 410,20,25 (DGH)(BF)(AGH) Figure 3For below given example, consider minimum support as 2.Step by step explanation of how does SPADE works:1. Count number of occurrence for each itemaItemsABCDEFGHNo.of frequent44121411itemsThe items that satisfy minimum support are, A,B,D,F. So A,B,D,F are Frequent 1- Sequence.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3029

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-99392.Divide Search Space A ( BC)(ABF)(ACDF)( BF)(E)( BF)( GH) B ( BC)(ABF)(ACDF)( F)(E)( F)( F)(AGH)3. Find Subsets3.1- Subsets for A A ( BC)(ABF)(ACDF)( BF)(E)( BF)( GH)3.1.1 A ABFCDEBC11111131The items that satisfy minimum support are shown in table belowBF32 D (ABC)(ABF)(ACDF) F (ACDF)(E)( GH)(BF)(AGH)(AGH)F2G1H1 AB AF ( C)(ABF)(ACDF)(ACDF)( F)EE( F)The items Frequent 2-Sequence are AB and AF, as they comes under 2 sequences and thus satisfy minimum support.Considering prefix AB for ( C)(ABF)(ACDF) for AB AB ( C)(ABF)(ACDF)( F)E( F) AB A1B1C1D1E1F2F1C1F2 ABF (ACDF)EIJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3030

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939The items Frequent 3-Sequence is ABF, as it comes under 2 sequence and thus satisfy minimum support.Considering prefix AB for sequence ( F)(ACDF) AB ( F)(ACDF)( F)E( F)A1C1D1F1E1F3 ABF ABF (ACDF)EThe items Frequent 3-Sequence is ABF, as it comes under 2 sequence and thus satisfy minimum support. Same is being detected inabove example AF ACD E F11111No items frequent.3.2 –Subsets for B B ( BC)(ABF)(ACDF)( F)(E)( F)( F)(AGH)3.2.1 B A BC D E F G B FH2111131111Items that satisfy minimum support i.e. 2 are A, FA2C1F3Considering prefix B- A for sequence ( BF)(ACDF)B- A( BF)(ACDF)( GH)B- A is frequent as it comes under two sequencesA B CDF G H F1 1111111From above table no more items are frequentIJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3031

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Considering prefix B- A for sequence ( CDF)B- A( CDF)( GH)B- A is frequent as it comes under two sequenceC DF GH11111From above table no more items are frequentBF(ACDF)EAGHBF is frequent as it comes under two sequencesA C D E F G H21 1 1 1 1 1BF- A( CDF)( GH)BF- A is frequent as it occurs in two sequences.C D F G H11111From above table no more items are frequent3.1.3 Find Subsets D (ABC)(ABF)(ACDF)( GH)(BF)(AGH)A2B2C1D1F2G1G1H1H1D- A( BC)(ABF)(ACDF)( GH)D- A is frequent as it comes under 2 sequences.A B B C C D F G H1 1 11111 11No item satisfy min supportConsidering prefix D- B for for sequence ( C)(ABF)(ACDF)D- B( C)(ABF)(ACDF)( F)(AGH)D- B is frequent as it comes under 2 sequences.IJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3032

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939A2B1C1C1D1F1F2G1H1D- B- A( BF)(ACDF)( GH)D- B - A is frequent as it comes under 2 sequences.A B C D FG H F H1 11 1 111 11No more item from above table satisfy min supportConsidering prefix D- B for for sequence ( F)(ACDF)D- B( F)(ACDF)( F)(AGH)F2A2C1D1F1G1H1D- BFACDFAGHD- BF is frequent as it comes under two sequencesA C D F G H21 1 1 1 1D- BF-A( CDF)( GH)D- BF- A is frequent as it comes under two sequencesC D F G H11111No more item from above table satisfy min supportD- B- A( BF)(ACDF)( GH)D- B- A is frequent.A C D F G H B F1 1 1 1 1111D- F(ACDF)(AGH)D- F comes under 2 sequences so it is frequentA C D F G HIJEDR1403022International Journal of Engineering Development and Research (www.ijedr.org)3033

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939211111D- F- A( CDF)( GH)D- F- A comes under 2 02sequences so it is frequentC D F G H11111No more item from above table satisfy min support3.1.4 Find Subsets F (ACDF)(E)(AGH)A2C1D1E1F1G1H1F- A( CDF)( GH)F- A is frequent as it comes under 2 sequences so it is frequentCD F G H11111No more item from above table satisfy min support.III. COMPARISON BETWEEN GSP,SPADE AND PREFIXSPANPurposePPerformanceGSPGSP algorithm is used forextracting frequentlyoccurring sequences and isproposed by Agrawal andSrikant[2]GSPisaniterativealgorithm. It scans thedatabase number of timesdepending on the length ofthelongestfrequentsequences in database. TheI/O cost is substantial(large) if database containsverylongfrequentsequences. [2]IJEDR1403022SPADESPADE algorithm is used for fast(faster than GSP) discovery ofsequential pattern and is proposed byMOHAMMED J. ZAKI[1]Spade outperforms the GSP by afactor two, and by an order ofmagnitude with precomputed supportof 2-sequences. It has excellentscaleup properties with respect to anumber of parameters such as numberof input-sequences, the number ofevents per-input sequence, the eventsize and the size of potential maximalfrequent events and sequences. [1]PrefixSpanPrefixSpan algorithm is used fordiscovery of sequential pattern andperforms better in mining largesequences.It is proposed by Jian Peiet.al[3]PrefixSpan mines the complete set ofpatterns but greatly reduces the tionsubstantially reduces the size ofprojectedDatabases and leads to tudyshowsthatPrefixSpan, in most cases, outperformsthe apriori-based algorithm GSP, FreeSpan,International Journal of Engineering Development and Research (www.ijedr.org)3034

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Format ofdatabaseApproachUses Horizontal FormatDatabase[2]Apriori Based[2]Uses Vertical Format Database [1]CandidateSequenceCandidate Sequencegeneration required[2]Candidate Sequence generationrequired.[1]Apriori Based[1]and SPADE[5]Uses Horizontal Format Database [3]Prefix-Projecte Pattern-GrowthBased[3]Candidate Sequence generation is notrequired. It uses prefix projection.[3]IV. ANAYSISUsing the SPMF (A Sequential Pattern mining framework) [9] algorithms, the above mentioned three algorithms GSP,SPADEand PrefixSpan were executed on the same dataset. The result of analysis is displayed below:The Table below shows the comparison among algorithms. Algorithms were executed on a text file named ContextSpade. Size ofthe file was 145 bytes.Attribute GSPSPADEPrefixSpans(Max PatternLength 4)Total 140 ms 78 ms 15msTimeFrequent39173917700Sequences CountMax8.1845855 8.397941586.29127502441Memory( 71289062935546940625700mb)The following chart shows comparison among attributes of GSP,SPADE and PrefixSpan100001000Total Time100FrequentSequences tional Journal of Engineering Development and Research (www.ijedr.org)3035

2014 IJEDR Volume 2, Issue 3 ISSN: 2321-9939Figure 3V. CONCLUSIONThis paper made the comparison among GSP, SPADE and PrefixSpan algorithms. This paper presented the concept andexplanation of the above mentioned algorithms. Using SPMF it is determined that PrefixSpan is better than GSP and SPADE inperformance as it takes lesser time than GSP and SPADE. SPADE takes less time than GSP but it takes quite more time thanPrefixSpan. The memory required by PrefixSpan is much less than GSP and SPADE while the memory required by GSP andSPADE is same. Various attributes like Total Time required for executing algorithm, frequent sequences found by algorithms andMaximum memory used by mmed J Zaki. (2001). SPADE: An efficient algorithm for Mining Frequent Sequences.In Journal MachineLearning, Volume 42 Issue 1-2, January-February 2001 Pages 31-60.Ramakrishna Srikant, Rakesh Agarwal (1996). Mining Sequential Patterns: Generalizations and PerformanceImprovements. In proceedings of the 5th International Conference on EDT:Advances in Database Technology Pages 3-17Jian Pei Jiawei Han Behzad Mortazavi-Asl Helen Pinto, Qiming Chen Umeshwar Dayal Mei-ChunHsu.(2001).PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth, , published in DataEngineering, 2001. Proceedings. 17th International Conference on , Page(s):215 - 224A GSP based efficient algorithm for mining frequent sequences, Minghua Zang, Ben Kao, Chi-Lap Yip, David CheungJian Pei, Behzad Mortazavi-Asl, Jianyong Wang, Helen Pinto, Qiming Chen,Umeshwar Dayal. (2004). MiningSequential Patterns by Pattern-Growth:The PrefixSpan Approach. In IEEE TRANSACTIONS ON KNOWLEDGE ANDDATA ENGINEERING, VOL. 16, NO. 10, OCTOBER 2004ThabetSlimani and Amor Lazzez.(2013). SEQUENTIAL MINING:PATTERNS AND ALGORITHMS ANALYSIS.InInternational Journal of Computer and Electronics Research [Volume 2, Issue 5, October 2013]R. Srikant et.al. Data Mining:Concepts and Techniques Mining sequence patterns in transactional 22.3033-003/8timeseries.pptYaron Gonen,Nurit Gal-Oz,Ran Yahalom, and Ehud Gudes. (2010). CAMLS: A constraint based Apriori Algorithm formining Long Sequences. In proceedings of Database Systems for Advanced Applications, 15th International ConferenceSPMF: A Sequential Pattern Mining FrameWork 403022International Journal of Engineering Development and Research (www.ijedr.org)3036

algorithm demonstrating number of iterations required in each algorithm. Later a comparison is made between Total time required to execute algorithm, count of frequent sequences found and Max memory (in mb) required by algorithms GSP, SPADE and Prefix-SPAN. The above stated attributes i.e. total time; frequent sequences and Max Memory are