582364 Data Mining, 4 cu, Lecture 4: Finding Frequent Itemsets

Transcription

582364 Data Mining, 4 cu
Lecture 4: Finding frequent itemsets - concepts and algorithms
Spring 2010
Lecturer: Juho Rousu
Teaching assistant: Taru Itäpelto
Data mining, Spring 2010 (Slides adapted from Tan, Steinbach Kumar)

Itemset lattice

Itemsets that can be constructed from a set of items have a partial order with respect to the subset operator, i.e. a set is larger than its proper subsets. This induces a lattice where nodes correspond to itemsets and arcs correspond to the subset relation. The lattice is called the itemset lattice. For d items, the size of the lattice is 2^d.
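
As a small illustration (code not from the slides), the lattice nodes are exactly the power set of the items, so for d items there are 2^d of them:

```python
from itertools import combinations

def itemset_lattice(items):
    """All itemsets over `items` (the power set), ordered by size."""
    return [frozenset(c)
            for k in range(len(items) + 1)
            for c in combinations(items, k)]

# 3 items give a lattice of 2^3 = 8 nodes, from {} up to {a,b,c}
lattice = itemset_lattice(['a', 'b', 'c'])
print(len(lattice))  # 8
```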

Frequent itemsets on the itemset lattice

The Apriori principle is illustrated on the itemset lattice: the subsets of a frequent itemset are frequent. They span a sublattice of the original lattice (the grey area).

Frequent itemsets on the itemset lattice

Conversely, the supersets of an infrequent itemset are infrequent. They also span a sublattice of the original lattice (the crossed-out nodes). If we know that {a,b} is infrequent, we never need to check any of its supersets. This fact is used in support-based pruning.
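
A minimal sketch of support-based pruning as used during candidate generation (the itemsets below are illustrative, not the slide's example): a size-(k+1) candidate is discarded as soon as one of its size-k subsets is known to be infrequent.

```python
from itertools import combinations

def prune_candidates(candidates, frequent_k):
    """Support-based (Apriori) pruning: keep a candidate only if all of
    its immediate (one-item-smaller) subsets are frequent."""
    frequent_k = set(frequent_k)
    return [c for c in candidates
            if all(frozenset(s) in frequent_k
                   for s in combinations(c, len(c) - 1))]

# {a,b} is infrequent, so {a,b,c} is pruned without ever counting its
# support; all immediate subsets of {b,c,d} are frequent, so it is kept.
frequent_2 = {frozenset(p) for p in [('a', 'c'), ('b', 'c'), ('b', 'd'), ('c', 'd')]}
kept = prune_candidates([frozenset('abc'), frozenset('bcd')], frequent_2)
# only {b,c,d} survives
```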

Compact Representation of Frequent Itemsets

In practice, the number of frequent itemsets produced from transaction data can be very large: when the database is dense, i.e. there are many items per transaction on average, when the number of transactions is high, or when the minimum support level is set too low. We will look at methods that use the properties of the itemset lattice and the support function to compress the collection of frequent itemsets to a more manageable size, so that all frequent itemsets can be derived from the compressed representation.

Maximal Frequent Itemsets

The minimum support threshold induces a partition of the itemset lattice into frequent and infrequent itemsets. Frequent itemsets that cannot be extended with any item without making them infrequent are called maximal frequent itemsets. We can derive all frequent itemsets from the set of maximal itemsets, using the Apriori principle "backwards".

Maximal Frequent Itemsets

{A,C} is not maximal, as it can be extended to the frequent itemset {A,C,E}, although its supersets {A,B,C} and {A,C,D} are infrequent. {A,D} is maximal, as all its immediate supersets {A,B,D}, {A,C,D} and {A,D,E} are infrequent. {B,D} is not maximal, as it can be extended to the frequent itemsets {B,C,D} and {B,D,E}.
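
The definition can be checked mechanically: a frequent itemset is maximal iff the collection of frequent itemsets contains no proper superset of it. A small sketch with an illustrative (hypothetical) frequent collection:

```python
def maximal_frequent(frequent):
    """Keep only the itemsets that have no proper superset in `frequent`."""
    frequent = set(frequent)
    return {f for f in frequent if not any(f < g for g in frequent)}

frequent = {frozenset(x) for x in ['a', 'b', 'c', 'ac', 'bc']}
# {a} is a subset of {a,c}, and {b}, {c} are subsets of {b,c},
# so only {a,c} and {b,c} are maximal
print(maximal_frequent(frequent))
```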

Maximal frequent itemsets

The number of maximal frequent itemsets is typically considerably smaller than the number of all frequent itemsets. In the worst case, the number can still be exponential in the number of items: e.g. consider the case where all itemsets of size d/2 are frequent and no itemset of size d/2 + 1 is frequent. We still need efficient algorithms.

Maximal frequent itemsets

Exact support counts of the subsets cannot be directly derived from the support of the maximal frequent itemset. From the Apriori principle we only know that the subsets must be frequent, but not how frequent. We need to do support counting for the subsets of the maximal frequent itemset to create association rules.

Closed itemsets

An alternative approach is to try to retain some of the support information in the compacted representation. A closed itemset is an itemset none of whose immediate supersets has the same support count. A closed frequent itemset is a closed itemset that satisfies the minimum support threshold. Maximal frequent itemsets are closed by definition.

Example: Closed frequent itemsets

Assume a minimum support threshold of 40%. {b} is frequent: σ({b}) = 3, but not closed: σ({b}) = σ({b,c}) = 3. {b,c} is frequent: σ({b,c}) = 3, and closed: σ({a,b,c}) = 2, σ({b,c,d}) = 1, σ({b,c,e}) = 1. {b,c,d} is not frequent: σ({b,c,d}) = 1, and not closed: σ({a,b,c,d}) = 1.
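
Closedness can be checked directly from the support counts: an itemset is closed iff none of its immediate supersets has the same support. A sketch, assuming the supports have already been counted into a dictionary (supports shown are the ones from the example above):

```python
def is_closed(itemset, support, items):
    """An itemset is closed iff no immediate superset (the itemset plus
    one extra item) has the same support count."""
    return all(support.get(itemset | {x}, 0) != support[itemset]
               for x in items if x not in itemset)

support = {frozenset('b'): 3, frozenset('bc'): 3, frozenset('abc'): 2,
           frozenset('bcd'): 1, frozenset('bce'): 1}
print(is_closed(frozenset('b'), support, 'abcde'))   # False: σ({b}) = σ({b,c})
print(is_closed(frozenset('bc'), support, 'abcde'))  # True
```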

Maximal vs Closed Itemsets

[Figure: itemset lattice annotated with the transaction ids supporting each itemset; itemsets not supported by any transactions are marked.]

Maximal vs Closed Frequent Itemsets

Minimum support = 2. [Figure: lattice marking itemsets that are closed but not maximal, and itemsets that are both closed and maximal.] # Closed = 9, # Maximal = 4.

Maximal vs Closed Itemsets

[Figure: summary of the relationship between frequent, closed frequent and maximal frequent itemsets.]

Determining the support of non-closed frequent itemsets

Consider a non-closed frequent itemset {a,d}; assume we have not stored its support count. By definition, there must be at least one immediate superset that has the same support count. It must be that σ({a,d}) = σ(X) for some immediate superset X of {a,d}.

Determining the support of non-closed frequent itemsets

From the Apriori principle we know that no superset can have higher support than {a,d}. It must be that the support equals the support of the most frequent superset: σ({a,d}) = max(σ({a,b,d}), σ({a,c,d}), σ({a,d,e})).

Determining the support of non-closed frequent itemsets

Algorithm sketch:
1. kmax = size of the largest closed frequent itemset
2. Fkmax = closed frequent itemsets of size kmax
3. for k = kmax - 1 downto 1 do
4.   Fk = {f | f is an immediate subset of some f' in Fk+1, or f is closed and |f| = k}
5.   for every f in Fk do
6.     if f is not closed
7.       f.support = max(f'.support | f' in Fk+1, f' is a superset of f)
8.     endif
9.   endfor
10. endfor
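
The sketch above can be turned into runnable code. A minimal version (implementation details are my own), assuming the closed frequent itemsets and their support counts are given as a dictionary:

```python
from itertools import combinations

def supports_from_closed(closed):
    """Derive the support of every frequent itemset from the closed
    frequent itemsets, working top-down as in the algorithm sketch.
    `closed` maps frozenset -> support count."""
    kmax = max(len(c) for c in closed)
    F = {kmax: {c: s for c, s in closed.items() if len(c) == kmax}}
    for k in range(kmax - 1, 0, -1):
        # closed itemsets of size k already carry their own support
        Fk = {c: s for c, s in closed.items() if len(c) == k}
        # every immediate subset of a frequent (k+1)-itemset is frequent;
        # a non-closed one inherits the max support among its supersets
        for sup_set in F[k + 1]:
            for sub in combinations(sup_set, k):
                f = frozenset(sub)
                if f not in Fk:
                    Fk[f] = max(s for g, s in F[k + 1].items() if f < g)
        F[k] = Fk
    return {f: s for level in F.values() for f, s in level.items()}

# Illustration: {b,c} with support 3 and {a,b,c} with support 2,
# assumed to be the closed frequent itemsets for this example.
result = supports_from_closed({frozenset('bc'): 3, frozenset('abc'): 2})
print(result[frozenset('b')])  # 3, recovered without rescanning the data
```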

Characteristics of Apriori algorithm

Breadth-first search algorithm: all frequent itemsets of a given size are kept in the algorithm's processing queue. General-to-specific search: start with itemsets with large support, work towards the lower-support region. Generate-and-test strategy: generate candidates, test by support counting. [Figure: itemset lattice levels 0 to 5.]

Weaknesses of Apriori

Apriori is one of the first algorithms that successfully tackled the exponential size of the frequent itemset space. Nevertheless, Apriori suffers from two main weaknesses: high I/O overhead from the generate-and-test strategy, as several passes are required over the database to find the frequent itemsets; and performance that can degrade significantly on dense databases, as a large portion of the itemset lattice becomes frequent.

Alternative methods for generating frequent itemsets: Traversal of itemset lattice

Apriori uses general-to-specific search: start from the most highly supported itemsets, work towards the lower-support region. This works well if the frequent itemset border is close to the top of the lattice.

Alternative methods for generating frequent itemsets: Traversal of itemset lattice

Specific-to-general search: look first for the most specific frequent itemsets, work towards the higher-support region. This works well if the border is close to the bottom of the lattice, e.g. for dense databases.

Alternative Methods for Frequent Itemset Generation: Breadth-first vs Depth-first

Apriori traverses the itemset lattice in a breadth-first manner. Alternatively, the lattice can be searched in a depth-first manner: extend a single itemset until it cannot be extended. This is often used to find maximal frequent itemsets, as it hits the border of frequent itemsets quickly.

Alternative Methods for Frequent Itemset Generation: Breadth-first vs Depth-first

Depth-first search allows a different kind of pruning of the search space. Example: if {b,c,d,e} is found maximal frequent by the search algorithm, the region of the lattice consisting of subsets of {b,c,d,e} does not need to be traversed: those itemsets are known to be frequent but non-maximal.

Alternative methods for generating frequent itemsets: Equivalence classes

Many search algorithms can be seen to conceptually partition the itemset lattice into equivalence classes. The itemsets in one equivalence class are processed before moving on to the next. There are several ways of defining equivalence classes: levels defined by itemset size (used by Apriori); prefix labels, where two itemsets that share a prefix of length k belong to the same class, e.g. {a,c,d} and {a,c,e} if k = 2; and suffix labels, where two itemsets that share a suffix of length k belong to the same class.
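
For illustration, a grouping of itemsets into prefix-based equivalence classes (with items compared in sorted order) might look like:

```python
from collections import defaultdict

def prefix_classes(itemsets, k):
    """Group itemsets into equivalence classes by their length-k prefix."""
    classes = defaultdict(list)
    for s in itemsets:
        classes[tuple(sorted(s))[:k]].append(s)
    return dict(classes)

itemsets = [frozenset('acd'), frozenset('ace'), frozenset('bde')]
# {a,c,d} and {a,c,e} share the length-2 prefix (a, c), as in the example
classes = prefix_classes(itemsets, 2)
print(sorted(classes))  # [('a', 'c'), ('b', 'd')]
```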

Prefix and suffix trees

Left: prefix tree and equivalence classes defined by prefixes of length k = 1. Right: suffix tree and equivalence classes defined by suffixes of length k = 1.

FP-growth algorithm

FP-growth avoids the repeated database scans of Apriori by using a compressed representation of the transaction database, stored in a data structure called an FP-tree. Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets.

FP-tree

An FP-tree is a compressed representation of the transaction database. Each transaction is mapped onto a path in the tree. Each node contains an item and the support count corresponding to the number of transactions with the prefix corresponding to the path from the root. Nodes having the same item label are cross-linked: this helps in finding the frequent itemsets ending with a particular item.

FP-tree construction

[Figure: after reading TID 1 the tree has the path null → A:1 → B:1; after reading TID 2 a second path null → B:1 → C:1 → D:1 is added.]
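
A minimal sketch of the insertion step (class and field names are my own): each transaction, with its items in a fixed order, is walked from the root, reusing existing child nodes and incrementing their counts so that shared prefixes share paths.

```python
class FPNode:
    def __init__(self, item, parent=None):
        self.item = item        # item label (None for the root)
        self.count = 0          # transactions passing through this node
        self.parent = parent
        self.children = {}      # item -> FPNode

def insert_transaction(root, transaction):
    """Insert one transaction (items already in the tree's fixed order)
    as a path from the root, sharing prefixes with earlier paths."""
    node = root
    for item in transaction:
        if item not in node.children:
            node.children[item] = FPNode(item, node)
        node = node.children[item]
        node.count += 1

root = FPNode(None)
insert_transaction(root, ['A', 'B'])       # TID 1
insert_transaction(root, ['B', 'C', 'D'])  # TID 2
# The tree now has paths null -> A:1 -> B:1 and null -> B:1 -> C:1 -> D:1
```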

FP-Tree Construction

[Figure: the complete FP-tree for the example transaction database, together with a header table of items.] Pointers (the cross-links between nodes with the same item label) are used to assist frequent itemset generation.

FP-Tree vs. original database

If the transactions share a significant number of items, the FP-tree can be considerably smaller, as the common subsets of items are likely to share paths. There is a storage overhead from the links as well as from the support counts, so in the worst case the FP-tree may even be larger than the original database.

Frequent itemset generation in FP-growth

FP-growth uses a divide-and-conquer approach to find frequent itemsets. It searches frequent itemsets ending with item E first, then itemsets ending with D, C, B, A, i.e. it uses equivalence classes based on length-1 suffixes. Paths corresponding to different suffixes are extracted from the FP-tree. [Figure: the example FP-tree with the paths for each suffix.]

Frequent itemset generation in FP-growth

To find all frequent itemsets ending with a given last item (e.g. E), we first need to compute the support of the item. This is given by the sum of the support counts of all nodes labeled with the item (σ(E) = 3), found by following the cross-links connecting the nodes with the same item. If the last item is found frequent, FP-growth next iteratively looks for all frequent itemsets ending with a given length-2 suffix (DE, CE, BE, and AE), and recursively with length-3 suffixes, length-4 suffixes, and so on, until no more frequent itemsets are found. A conditional FP-tree is constructed for each different suffix to speed up the computation.
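
As a small sketch of the first step (the header table contents below are hypothetical): with the cross-links kept in a header table mapping each item to the counts of its tree nodes, the item's support is just their sum.

```python
# Hypothetical header table: for each item, the support counts of the
# cross-linked tree nodes carrying that label (here E appears in three
# nodes with count 1 each, matching σ(E) = 3 in the text).
header = {'E': [1, 1, 1], 'D': [1, 1, 1, 1, 1]}

def item_support(header, item):
    """Support of an item = sum of the counts of its cross-linked nodes."""
    return sum(header.get(item, []))

print(item_support(header, 'E'))  # 3
```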
