The Comparison of Apriori Algorithm with Preprocessing and FP-Growth


Advances in Intelligent Systems Research, volume 172
Sriwijaya International Conference on Information Technology and Its Applications (SICONIAN 2019)

The Comparison of Apriori Algorithm with Preprocessing and FP-Growth Algorithm for Finding Frequent Data Pattern in Association Rule

Deo WICAKSONO 1, Muhammad Ihsan JAMBAK 2,*, and Danny Matthew SAPUTRA 1
1 Informatics Department, Sriwijaya University, Palembang, Indonesia
2 Informatics Management Department, Sriwijaya University, Palembang, Indonesia
* Corresponding author: jambak@unsri.ac.id
Copyright 2020 The Authors. Published by Atlantis Press SARL. This is an open access article distributed under the CC BY-NC 4.0 license.

ABSTRACT
Association rule mining is a data mining method for finding relations between items, called rules. Finding rules with the association method can be divided into two phases: the first phase finds the frequent patterns that satisfy a specified minimum frequency, and the second phase finds strict rules from those frequent patterns that satisfy the minimum support and confidence. The main problem of association rule mining lies in the algorithm used, as this method consumes a large amount of memory and time. This study aims to add a preprocessing step using an aggregate function to the Apriori Algorithm and thereby improve memory and time consumption when finding a large number of rules.

Keywords: data mining, Association Rules, FP-Growth, apriori, rules

Introduction

Association Rules is one of the data mining methods for mining frequent patterns from large databases to find relations between data patterns; these patterns are called frequent itemsets and are used to find the rules. Rules are frequent itemsets that satisfy the specified minimum support and minimum confidence, where support is the percentage of transactions in the database containing an itemset and confidence measures how strong the relationship between the items of an association rule is. Association rule mining uses an algorithm to carry out these processes, such as the Apriori and FP-Growth algorithms.

The Apriori Algorithm is one of the traditional and simple algorithms. It uses a brute-force strategy to find data patterns by scanning the database repeatedly. The advantage of the Apriori algorithm is that it is easy to implement and to study because the data structure is straightforward [1]. However, its disadvantage is that the algorithm needs to scan the database repeatedly to generate candidates; this process takes a lot of time and memory, especially if the patterns are numerous and long [2].

The FP-Growth Algorithm uses a rooted tree data structure and a divide-and-conquer strategy to find candidates, which makes it an efficient algorithm for finding rules [3]. The FP-Growth Algorithm can reduce the memory and time used to find association rules because it only needs to scan the database twice to find rule candidates [4]. Its disadvantage is that, if the transactions are too long, the complexity of the data structure can reduce its performance [1].

However, the disadvantages of the Apriori Algorithm do not mean that the FP-Growth algorithm is superior to it. The Apriori Algorithm works better with a big dataset, while the FP-Growth Algorithm works better with a small dataset [1]. One of the advantages of the Apriori algorithm is that it is easy to learn, which makes it easy to use and to develop further by researchers [5]. The preprocessing to be applied to Apriori in this research uses an aggregate function to group the database.
Because the Apriori algorithm needs to scan the database repeatedly, the preprocessing aims to reduce the database each time it needs to be scanned, so that the time and memory used can be reduced.

Apriori Algorithm

The Apriori algorithm is one of the most basic and popular algorithms for association rule mining. Agrawal and Srikant proposed the Apriori algorithm in 1994. Until now, this algorithm has remained the most used and further developed by researchers [5].

Apriori is designed to operate on databases containing transactions. In the Apriori algorithm, every transaction is seen as an itemset; given a threshold, the algorithm identifies the itemsets that are contained in at least the minimum threshold number of transactions as new frequent itemsets.

Figure 1. Flowchart of the Apriori Algorithm

The Apriori algorithm uses a "bottom-up" approach, where the itemsets are extended one item at a time; this step is known as candidate generation. The algorithm uses a breadth-first search and a hash tree structure to count candidate itemsets efficiently. A group of candidates is tested against the data and is pruned if the candidates contain an infrequent subpattern. This process repeats until no further successful extensions are found. The Apriori algorithm is considered a brute-force method because it considers every k-itemset as a candidate frequent itemset [6]. Considering that the computation needed for every candidate is O(k), the complexity of the whole algorithm is O(d * 2^(d-1)) [6].

Figure 2. Example of how the Apriori Algorithm works [9]

Because of the simple process of this algorithm, its advantage is that it is easier to learn, understand, and implement; this is the reason it is called the most basic algorithm for association rules. However, this also leads to its disadvantage: candidate generation, the primary process of the algorithm itself, is computationally costly. As the algorithm scans the database each time it tries to generate new candidates, this process takes a lot of memory and processing time.

FP-GROWTH ALGORITHM

The FP-Growth Algorithm is an efficient and scalable method for mining a complete set of frequent patterns by pattern fragment growth. The FP-Growth algorithm was proposed by Han in 2000 and uses an extended prefix-tree structure, named the frequent-pattern tree, for storing compressed information about frequent patterns. His study shows that this method outperforms other methods for mining frequent patterns.

The FP-Growth algorithm is an efficient algorithm for association rules. It uses an alternative way to find frequent itemsets without the candidate generation that takes a lot of memory and processing time, which makes its performance better than Apriori. As this alternative, the algorithm uses a divide-and-conquer strategy and a data structure called the frequent-pattern tree to store the frequent patterns being mined.

Figure 3. Flowchart of the FP-Growth Algorithm

The run-time of the FP-Growth Algorithm depends on the compaction factor of the dataset [6]. If the resulting conditional FP-Trees have many branches, the algorithm's performance drops drastically because it has to process a large number of conditional FP-Trees. The complexity of FP-Growth depends heavily on the path finding in the FP-Tree for every item in the header table, which in turn depends on how deep the tree is. The maximum depth of the tree is bounded by d for every conditional FP-Tree, so the complexity of the algorithm is O(number of unique items in the header table * maximum depth of the tree), i.e., O(d * d) [6].

Figure 4. Example of an FP-Tree [10]

Compared to the Apriori Algorithm, the FP-Growth algorithm is more complex to implement and understand. However, on the other hand, this algorithm is more efficient. Therefore, the FP-Growth algorithm has much potential for further development and optimization.
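To make the level-wise candidate generation and repeated support counting described for the Apriori Algorithm concrete, the following is a minimal Python sketch (not the authors' implementation, and without the hash-tree optimization); the toy transaction list and threshold are illustrative assumptions, not the dataset used in this study.

```python
from itertools import combinations

# Toy transaction database and threshold (illustrative assumptions,
# not the dataset used in this study).
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
min_count = 3  # minimum frequency (support count)

def apriori(transactions, min_count):
    """Level-wise ("bottom-up") Apriori sketch: join frequent (k-1)-itemsets
    into k-item candidates, prune candidates with an infrequent subset, then
    rescan the whole database to count support -- one scan per level."""
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = sum(itemset <= t for t in transactions)
        k += 1
        # Candidate generation by joining frequent itemsets of the previous level.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune candidates that contain an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        # Support counting: another full pass over the database.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) >= min_count}
    return frequent

for itemset, count in sorted(apriori(transactions, min_count).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The repeated database pass inside the while loop is exactly the cost that the preprocessing proposed later in this paper tries to reduce.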

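Likewise, the following is a minimal sketch of the frequent-pattern tree that FP-Growth builds in its two database scans. The node layout and ordering rule follow a common textbook formulation and are assumptions, not necessarily the exact structure used in this study; the usage line reuses the toy data from the Apriori sketch above.

```python
from collections import defaultdict

class FPNode:
    """One node of the frequent-pattern tree: an item, a count, and child links."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First database scan: count item frequencies, keep only frequent items.
    freq = defaultdict(int)
    for t in transactions:
        for item in t:
            freq[item] += 1
    frequent = {i: c for i, c in freq.items() if c >= min_count}

    # Second database scan: insert each transaction with its frequent items
    # ordered by descending frequency, so shared prefixes share branches.
    root = FPNode(None)
    header = defaultdict(list)  # header table: item -> nodes containing it
    for t in transactions:
        ordered = sorted((i for i in t if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            if item not in node.children:
                node.children[item] = FPNode(item, parent=node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

# Reusing the toy `transactions` and `min_count` from the Apriori sketch above:
root, header = build_fp_tree(transactions, min_count)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
```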
MEASUREMENT METRIC

Measurement metrics on association rules determine whether the generated rules are strict rules or useless rules. There are three metrics used in this research: Support, Confidence, and Dependency Factor.

1) Support

Support is the percentage of transactions that contain the rule [7]. That means support is a value that indicates how frequent an itemset is across the whole set of transactions.

Support(X) = P(X) = Frequency(X) / Total Transactions    (1)

Support(X → Y) = Support(X ∪ Y)    (2)

2) Confidence

Confidence is a value that determines how frequently the data pattern appears among the frequent itemsets as a rule; confidence in association rules gauges how accurate a rule is.

Confidence(X → Y) = Frequency(X, Y) / Frequency(X)    (3)

3) Dependency Factor

The dependency factor is a metric modified from the certainty factor; the difference is that the dependency factor is based on the real maximum and minimum values of Conf(X → Y) for given values of Support(X) and Support(Y) [8]. The dependency factor also expresses by how much the lift of a rule X → Y differs from the value one, relative to how much it could have differed. It ranges within [-1, 1].

DF(X → Y) = (Conf(X → Y) − P(Y)) / (MaxConf(X → Y) − P(Y))    if Conf(X → Y) > P(Y)
DF(X → Y) = 0                                                  if Conf(X → Y) = P(Y)
DF(X → Y) = (Conf(X → Y) − P(Y)) / (P(Y) − MinConf(X → Y))    if Conf(X → Y) < P(Y)    (4)

where MaxConf(X → Y) = min(P(X), P(Y)) / P(X) and MinConf(X → Y) = max(0, P(X) + P(Y) − 1) / P(X) are the maximum and minimum values that Conf(X → Y) can take for the given Support(X) and Support(Y).
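As a brief worked illustration of Eqs. (1)-(4), the following sketch uses hypothetical frequency counts, not the experimental dataset; the MaxConf/MinConf formulation follows the reconstruction of Eq. (4) above and should be read as an assumption rather than the authors' exact computation.

```python
# Worked example of Eqs. (1)-(4) on hypothetical counts (not the study's data).
total = 100          # total transactions
freq_x = 40          # transactions containing X
freq_y = 50          # transactions containing Y
freq_xy = 30         # transactions containing both X and Y

p_x, p_y = freq_x / total, freq_y / total
support_rule = freq_xy / total    # Eq. (2): Support(X -> Y)
confidence = freq_xy / freq_x     # Eq. (3): Confidence(X -> Y)

# Eq. (4): dependency factor, using the maximum and minimum attainable
# confidence for the given Support(X) and Support(Y) (assumed formulation).
max_conf = min(p_x, p_y) / p_x
min_conf = max(0.0, p_x + p_y - 1.0) / p_x
if confidence > p_y:
    df = (confidence - p_y) / (max_conf - p_y)
elif confidence < p_y:
    df = (confidence - p_y) / (p_y - min_conf)
else:
    df = 0.0

print(support_rule, confidence, df)   # 0.3 0.75 0.5
```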
Proposed Preprocessing on Apriori Algorithm

In this research, the proposed preprocessing on the Apriori Algorithm is an aggregate function that groups the database and counts how many items are transacted in a single transaction. The Apriori algorithm scans the database repeatedly, so the time needed increases as the database gets larger. The database reduction is made by comparing k (the itemset size) with the count of items in a single transaction ID; if the count is lower than k, the ID is ignored. The proposed preprocessing consists of the following steps:
- Scan the database to determine the count of items in each single transaction.
- When the k-itemset candidates are generated, record k.
- If the item count of an ID is less than k, ignore that transaction.

Figure 5. The proposed method of preprocessing

Figure 5 can be explained as follows: first, scan the database to create a table (i) that shows how many items are transacted in a single ID. When finding k-itemset candidates, ignore IDs that have a lower item count than k; for example, in Figure 5 (iii), when finding 3-itemsets, IDs 2, 4, and 5 (shown with a red background) are ignored from scanning because their item counts are lower than k, which is 3. This preprocessing is applied because a k-itemset cannot be found in an ID that contains fewer than k items. With this preprocessing, the database to be scanned decreases for every k-itemset.
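A minimal sketch of these preprocessing steps follows: one aggregate pass counts the items per transaction ID, and each k-itemset scan then skips IDs whose item count is below k. The function names and the in-memory dictionary representation are illustrative assumptions; the toy database mirrors the structure of Figure 5 rather than reproducing its exact contents.

```python
# Aggregate step: count how many items each transaction ID contains (Figure 5 (i)).
def item_counts(transactions):
    return {tid: len(items) for tid, items in transactions.items()}

# During the k-itemset scan, ignore IDs whose item count is lower than k,
# since a k-itemset can never occur in a transaction with fewer than k items.
def reduced_database(transactions, counts, k):
    return {tid: items for tid, items in transactions.items() if counts[tid] >= k}

# Toy database keyed by transaction ID (illustrative assumption).
db = {
    1: {"a", "b", "c", "d"},
    2: {"a", "b"},
    3: {"b", "c", "e"},
    4: {"c"},
    5: {"a", "e"},
}
counts = item_counts(db)
print(reduced_database(db, counts, 3))   # IDs 2, 4, and 5 are skipped when k = 3
```

Because the counts are computed once and reused for every level k, the per-level scan touches fewer transactions as k grows, which is where the time and memory savings described above come from.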

EXPERIMENT

The dataset used for the experiment in this research is a total of 1000 transaction IDs from CV. Sukses Inti Prima, dated July 2016 to December 2016. The dataset is run with three algorithms: the Apriori Algorithm (A), the FP-Growth Algorithm (FP), and the Apriori Algorithm with preprocessing (AP). The dataset is then experimented with under four different scenarios:
- Using different minimum frequency
- Using different minimum support
- Using different minimum confidence
- Using different amounts of the dataset
As Figures 6 to 9 show, under different scenarios each algorithm has its own characteristics and advantages compared to the other algorithms.

The FP-Growth Algorithm can be considered the superior algorithm for a small amount of dataset. For a small dataset and few rules, its memory and time usage are lower than those of Apriori and Apriori with preprocessing because of the tree data structure. However, for a large amount of dataset and rules, the FP-Growth algorithm's time and memory usage get worse as the tree structure becomes more complex.

The Apriori Algorithm can be considered the superior algorithm for a large amount of dataset. Because generating itemsets in the Apriori Algorithm requires scanning the database repeatedly, this algorithm is not well suited for finding a small number of rules. However, because of the simple data structure, this algorithm performs better for finding rules in a large dataset.

The Apriori Algorithm with preprocessing generally needs more time and memory than the Apriori and FP-Growth Algorithms to find a small number of rules, because the algorithm spends more time and memory on the preprocessing than it saves when finding the rules themselves. However, for finding a large number of rules, this algorithm performs better than the other two, as can be seen in the threshold-based scenarios (Figures 6 and 7), which show that the Apriori Algorithm with preprocessing performs better as the threshold is lowered (a lower threshold means more rules are generated).

Figure 6. First Scenario, different set of minimum frequency
Figure 7. Second Scenario, different set of minimum support
Figure 8. Third Scenario, different set of minimum confidence
Figure 9. Fourth Scenario, different set of total IDs

SUMMARY

This paper described a preprocessing stage that can be used in the implementation of the Apriori Algorithm, which uses an aggregate function to group data in order to reduce the amount of time spent scanning data. As the experimental results show, each algorithm has its advantages for different amounts of dataset and rules generated. The FP-Growth algorithm is superior for a small amount of dataset and rules; the Apriori algorithm is superior for a large amount of dataset; and the Apriori algorithm with preprocessing is superior for finding a large number of rules.

REFERENCES

[1] Kavitha, M., & Selvi, M. S. T. T., Comparative Study on Apriori Algorithm and Fp Growth Algorithm with Pros and Cons, 4(4), 161–164. (2016)

[2] Shweta, M., & Garg, K., Mining Efficient Association Rules through Apriori Algorithm Using Attributes and Comparative Analysis of Various Association Rule Algorithms. International Journal of Advanced Research in Computer Science and Software Engineering, 3(6), 306–312. (2013)

[3] Borgelt, C., An Implementation of the FP-Growth Algorithm, pp. 759–770. (2005)

[4] Han, J., Pei, J., & Yin, Y., Mining frequent patterns without candidate generation. Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data - SIGMOD '00, pp. 1–12. (2000)

[5] Anggraeni, R. M., "Perbandingan Algoritma Apriori dan Algoritma FP-Growth Untuk Perekomendasi Pada Transaksi Peminjaman Buku di Perpustakaan Universitas Dian Nuswantoro." Teknik Informatika, 1–5. (2014)

[6] Tan, P.-N., Steinbach, M., Karpatne, A., & Kumar, V., Introduction to Data Mining, pp. 327–414. (2018)

[7] Berzal, F., et al., Measuring the accuracy and interest of association rules: A new framework. Intelligent Data Analysis, 6, pp. 221–235. (2002)

[8] Kryszkiewicz, M., Dependence Factor for Association Rules, pp. 135–145. (2015)

[9] Zaiane, O. R., Principles of Knowledge Discovery in Databases. (1999)

[10] Chee, C.-H., Jaafar, J., Aziz, I. A., Hasan, M. H., & Yeoh, W., Algorithms for frequent itemset mining: a literature review. Artificial Intelligence Review. (2018)
