SEQUENTIAL PATTERN MINING THE

Transcription

SEQUENTIAL PATTERN MININGTHE MICRON'S AUTOMATA PROCESSORKe Wang, Elaheh Sadredini, Kevin SkadronCenter for Automata ComputingDepartment of Computer ScienceUniversity of Virginia

Outlinev Micron’s Automata Processor (AP)v SPM on the AP: Opportunities and Challengesv AP Accelerated SPM and optimizationsv Performance Evaluationv Conclusions and the Future Work

Sequential Pattern MiningSequential Pattern Mining(SPM):Ø discovers frequent sequence of transactionsØ the order of items within a transaction doesn’t matter, but theorder of transaction is importanto Customer purchase patterning analysiso Correlation analysis of storage systemo Web log analysiso Software bug trackingo Software API usage trackingTrans. Items1 {Bread, Milk}, {Coke} 2 {Bread, Milk, Diaper}{Beer, Eggs}{Diaper} 3 {Milk} {Diaper} {Beer, Coke} 4 {Bread, Milk, Diaper}{Beer, Diaper}{Beer, Coke, Eggs} 5 {Bread, Milk}{Coke}{Diaper}{Eggs} BreadEggsMilk

Seq. Pat. Mining vs. Freq. Set. Miningq SimilarityØ Can be solved by Apriori-based algorithmØ Massive pattern matching is the bottleneckq Differencesv Hierarchical patterns in SPMØ The sequence of itemsets (transactions) are consideredØ The items within an itemset could be unordered and discontinuousØ The itemsets within an a sequence could be discontinuous but the order should becheckedv Larger search spaceØ For a given number of total items in a sequential pattern - k, the totalnumber of structures is 2 (k-1)Ø Search space needs to be reduced by Apriori-based algorithm (GSP)Ø Automaton design space needs to be reduced to save reconfiguration time

AP Accelerated SPM - FlowchartData preprocessing:1) Filter out infrequent items2) Recode - 8-bit / 16-bitsymbols3) Recode transactions4) Sort items in transactions5) Connect transactions by anitemset delimiter(\x253)6) Connect sequences by asequence delimiter(\x254)Encoding:freq item# 254: 8-bit253 freq item# 64009:16-bit5ms45ms

Automata Design for SPM: Flattened

Automata Design for SPM: Multi-entry

Performance Evaluation – AP Target hardware: D480 X 32 8-bit encoding:§ Speed: 7.5ns/ item\ 16-bit encoding:§ Speed: 15ns/item Connection reconfiguration time: 5ms for an entire boardSymbol replacement time: 45ms for an entire board

Performance Evaluation - Comparisonq Compare with other implementations1.2.3.4.5.6.Java multi-threading implementation of GSP: GSP-JavaC sequential implementation of GSP: GSP-1COpenMP multi-threading implementation of GSP: GSP-6CGPU implementation of GSP: GSP-1GJava multi-threading implementation of PrefixSpanJava multi-threading implementation of SPADEq Testing platformo CPU: Intel CPU i7-5820K (6 physical cores, 3.30GHz)o Mem: 32GB, 1.333GHzo GPU: Nvidia Kepler K40, 706 MHz clock, 2888 CUDA cores, 12 GBglobal memory

Performance Evaluation - Datasetsq Six real-world datasets

Performance Evaluation – GSP implementations 1010.10.0100.0090.008 0.007 0.006 0.005Relative Minimum SupportBM20.004Computation Time (s)1001000 GSP-JAVA GSP-1C GSP-6C GSP-1G GSP-AP Computation Time (s)1000100 GSP-JAVA GSP-1C GSP-6C GSP-1G GSP-AP1010.10.0300.0250.0200.0150.010Relative Minimum SupportKosarak0.005

Performance Evaluation – GSP implementations (2) 1001000 GSP-JAVA GSP-1C GSP-6C GSP-1G GSP-AP 10Computation Time (s)Computation Time (s)100010.10.100.080.060.040.02Relative Minimum SupportBible0.00100 GSP-JAVA GSP-1C GSP-6C GSP-1G GSP-AP1010.10.300.250.200.150.10Relative Minimum SupportFIFA0.05

110100300200 GSP-6C M&C Speedup GSP-1G M&C Speedup GSP-AP M&C Speedup9000.010200801000.0090.0080.007 0.006Relative Minimum SupportBM20.005700.004 GSP-1C M&C percentage GSP-6C M&C percentage GSP-1G M&C percentage GSP-AP M&C percentage GSP-AP AP conf percentage 400 Speedup400 GSP-1C M&C percentage GSP-6C M&C percentage GSP-1G M&C precentage GSP-AP M&C precentage GSP-AP AP conf precentage600 Percentage50012000.030 GSP-6C M&C Speedup GSP-1G M&C Speedup GSP-AP M&C Speedup0.0250.020 0.0150.010Relative Minimum 00.005 Percentage600 SpeedupPerformance Evaluation – Analysis

Performance Evaluation – Analysis (2) Relative Minimum SupportBible1200800600 Speedup1000 GSP-1C M&C percentage GSP-6C M&C percentage GSP-1G M&C percentage GSP-AP M&C percentage GSP-AP AP conf percentage400 GSP-6C M&C Speedup GSP-1G M&C Speedup GSP-AP M&C Speedup20000.300.250.20 ive Minimum SupportFIFAv As sup decreasing, un-parallelized candidate generation becomes a new bottleneckv The STE symbol replacement dominates the AP processing time Percentage Percentage Speedup150 GSP-1C M&C percentage140 GSP-6C M&C percentage130600 GSP-1G M&C percentage120 GSP-AP M&C percentage110500 GSP-AP AP conf percentage100904008070 GSP-6C M&C Speedup300 GSP-1G M&C Speedup60 GSP-AP M&C Speedup5020040301002010000.10 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02 0.011400700

Performance Evaluation – Effect of Sym. Rep. time BM20.500.400.350.30 1X 2X 5X 10X 1X 2X 5X 10X0.25 200180160140120100806040200 Percentage0.45Total Computation Time (s) Percentage300280 1X 1X 0.35260 2X 2X0.30240 5X 5X220 10X 0-0.0520-0.1000.010 0.009 0.008 0.007 0.006 0.005 0.004Relative Minimum Support Total Computation Time (s)0.400.0250.0200.0150.010Relative Minimum Support Kosarak0.005

Performance Evaluation – Compare with PrefixSpanand SPADE PrefixSpan SPADE GSP-6C GSP-1G GSP-AP GSP-1G-MTCG GSP-AP-MTCG10 PrefixSpan SPADE GSP-6C GSP-1G GSP-AP GSP-1G-MTCG GSP-AP-MTCG 2100Computation Time (s)Computation Time (s)3100.0100.0090.008 0.007 0.006 0.005Relative Minimum SupportBM20.00410.10.10 0.09 0.08 0.07 0.06 0.05 0.04 0.03 0.02Relative Minimum SupportLeviathan

Performance Evaluation – Compare withPrefixSpan and SPADE (2) 100 PrefixSpan SPADE GSP-6C GSP-1G GSP-AP GSP-1G-MTCG GSP-AP-MTCG 101000Computation Time (s)Computation Time (s)1000 10.10.100.080.060.040.02Relative Minimum SupportBible0.00100 PrefixSpan SPADE GSP-6C GSP-1G GSP-AP GSP-1G-MTCG GSP-AP-MTCG1010.10.300.250.200.150.10Relative Minimum SupportFIFA0.0

Performance Evaluation – input size 605040SPADE runs out of memory30 Total computation time (s)70100000Rel. min sup 1.05% PrefixSpan SPADE GSP-1G GSP-AP20100010203040Input file size (MB)Kosarak506070Rel. min sup 1.5% PrefixSpan SPADE GSP-1G GSP-AP10000Rel. min sup 2.5% PrefixSpan SPADE GSP-1G GSP-AP1000 Rel. min sup 0.45% PrefixSpan SPADE GSP-1G GSP-APTotal computation time (s)80 100101010Input file size (MB)Leviathan20

Conclusions and Future Work We present a hardware-accelerated solution for sequential pattern mining (SPM),using Micron’s AP Our proposed solution adopts the algorithm framework of the GeneralizedSequential Pattern (GSP ) We derive a compact automaton design for matching and counting frequentsequences.1) flatten sequences into strings by using delimiters and place-holders2) multiple-entry NFA strategy is proposed to accommodate variable-structuredsequencesTogether, this allows a single, compact template to match any candidate sequence ofa given length, so this template can be replicated to make full use of the capacity andmassive parallelism of the AP Up to 430X, 90X, and 29X speedups are achieved by the AP-accelerated GSP onsix real-world datasets, when compared with the single-threaded CPU, multicoreCPU, and GPU GSP implementations, respectively By parallelizing candidate generation, up to 452X and 49X over PrefixSpan andSPADE More speedup on larger datasets

Backups

GSP Algorithm Srikant, Ramakrishnan, and Rakesh Agrawal. Miningsequential patterns: Generalizations and performanceimprovements. Springer Berlin Heidelberg, 1996.

GSP-JAVA GSP-1C GSP-6C GSP-1 GSP-A Bible FIFA. PerformanceEvaluation–Analysis 0.0100.0090.0080.0070.0060.00