An Integrated, Generic Approach To Pattern Mining: Data Mining Template . PDF Free Download

1y ago

22 Views

1 Downloads

1.10 MB

39 Pages

Report/dmca

Download PDF

Transcription

Data Min Knowl DiscDOI 10.1007/s10618-008-0098-xAn integrated, generic approach to pattern mining:data mining template libraryVineet Chaoji · Mohammad Al Hasan ·Saeed Salem · Mohammed J. ZakiReceived: 8 May 2007 / Accepted: 29 April 2008Springer Science Business Media LLC 2008Abstract Frequent pattern mining (FPM) is an important data mining paradigm toextract informative patterns like itemsets, sequences, trees, and graphs. However, nopractical framework for integrating the FPM tasks has been attempted. In this paper,we describe the design and implementation of the Data Mining Template Library(DMTL) for FPM. DMTL utilizes a generic data mining approach, where all aspectsof mining are controlled via a set of properties. It uses a novel pattern propertyhierarchy to define and mine different pattern types. This property hierarchy can bethought of as a systematic characterization of the pattern space, i.e., a meta-patternspecification that allows the analyst to specify new pattern types, by extending thishierarchy. Furthermore, in DMTL all aspects of mining are controlled by a set ofdifferent mining properties. For example, the kind of mining approach to use, thekind of data types and formats to mine over, the kind of back-end storage managerto use, are all specified as a list of properties. This provides tremendous flexibility tocustomize the toolkit for various applications. Flexibility of the toolkit is exemplifiedby the ease with which support for a new pattern can be added. Experiments on synthetic and public dataset are conducted to demonstrate the scalability provided by theResponsible editor: Hannu Toivonen.V. Chaoji · M. Al Hasan · S. Salem · M. J. Zaki (B)Computer Science Department, Rensselaer Polytechnic Institute, Troy, NY 12180, USAe-mail: zaki@cs.rpi.eduV. Chaojie-mail: chaojv@cs.rpi.eduM. Al Hasane-mail: alhasan@cs.rpi.eduS. Saleme-mail: salems@cs.rpi.edu123

V. Chaoji et al.persistent back-end in the library. DMTL been publicly released as open-sourcesoftware (http://dmtl.sourceforge.net/), and has been downloaded by numerousresearchers from all over the world.Keywords Frequent pattern mining · Itemset mining · Sequence mining · Treemining · Graph mining · Generic programming1 IntroductionFrequent pattern mining (FPM) is an important data mining paradigm that helps todiscover patterns that conceptually represent relations among discrete entities (oritems). Depending on the complexity of these relations, different types of patterns arise.The most common type of patterns are sets, where the relation is the co-occurrence ofitems. A well-known example of a set pattern is a market-basket, the set of items thatare bought together by a customer, say at a supermarket. Next, there are sequence patterns, where we require an ordering (temporal or positional) between items. Examplesinclude time-series data in financial markets, genome sequence data in bioinformatics,etc. Data mining researchers also work with tree and graph patterns. In tree patternsthe item relationship takes a hierarchical form, and in graph patterns the relationshipis mostly arbitrary. Mining web log data, XML, or semi-structured data are examplesof tree mining, and mining chemical compounds for drug discovery, or web communities in a web graph, are examples of graph mining. It is thus clear that differentapplications require the ability to define and mine different types of patterns; somemay even require the ability to define custom pattern types (Horváth et al. 2006).All of these scenarios require efficient and flexible FPM algorithms and supportingdata/index structures, which can be reused in a variety of domains.It is worth noting that there exist open source data mining suites such as Weka(Witten and Frank 1999), which span commonly used data mining methods like association rules, clustering, classification, and regression. For the specific case of itemsetmining, there also exist repositories of separate methods, such as Frequent Itemset Mining Implementations (FIMI) (Goethals and Zaki 2003). However, no unifiedframework for various FPM tasks currently exists.The Data Mining Template Library (DMTL) is the first such effort in realizing aunified framework for mining various pattern types. DMTL is implemented in C ,which was chosen due to its support for generic programming. These features includetemplate classes and functions, partial specialization, and so on. The most appealingpart of C template features is that the binding is resolved at compile time, hence noruntime overhead is incurred. In contrast, while an object-oriented approach providesre-usability through inheritance, sometimes it suffers from runtime inefficiency, dueto virtual table lookup for dynamic binding. In addition, most data mining librariesdo not provide support for persistence. As a result they are restricted by the size ofavailable main memory. For instance, Weka (Witten and Frank 1999) does not provideany mechanism to store intermediate results so that it can scale to much larger datasets.However, there has been some recent work to address this limitation (Zou et al. 2006).123

An integrated, generic approach to pattern mining: data mining template libraryThe major contributions of our work are as follows: DMTL adopts the generic software development approach using C templates,inspired by the state-of-the-art generic libraries such as the C Standard TemplateLibrary (STL) (Musser et al. 2001) and Boost Graph Library (Siek et al. 2002),and hence provides widespread usability without compromising on efficiency. DMTL offers algorithms for different pattern mining tasks in a unified platform.Currently the library has implementations for mining four key patterns—itemsets,sequences, trees, and graphs. DMTL offers flexible interfaces for each of the algorithms, including each of itssub-tasks so that it is very simple for end users to use it as a library component intheir software development. DMTL is extensible; new patterns can be mined with minimal effort from the enduser. Users need to define some template parameters to ensure that the libraryselects the proper mining algorithm to mine that pattern successfully. Some additional specialized code may be required for efficiency reasons. In DMTL all aspects of mining are controlled by properties. Pattern-propertiesand mining-properties are used in DMTL to specify the following aspects of themining process: (a) pattern to be mined, (b) input data source and format, (c) datastructure to be used in the mining algorithm, (d) storage management, and (e)mining algorithm/approach. Support for multiple back-ends, which enables the library to scale to much largerdatasets. The current implementation includes a file-based back-end that providestransparent persistency for mining out-of-core datasets.The DMTL is available as open-source software from the world-wide Sourceforgerepository (http://dmtl.sourceforge.net/). It has already been downloaded by numerous researchers and practitioners from all over the world.1 Below, we describe themain elements of the design and implementation of DMTL. We explicitly show viaan example, how the DMTL can be easily extended to mine custom defined patterns.For instance, we extend DMTL to mine multiset patterns. Finally, we perform extensive experimental studies to demonstrate the scalability of DMTL for mining variouspattern types, and for mining very large, out-of-core datasets.2 Related workFrequent structure mining refers to an important class of exploratory mining tasks,namely those dealing with extracting patterns in massive databases representing complex interactions between entities. It encompasses mining techniques like itemsets(Agrawal et al. 1996), sequences (Agrawal and Srikant 1995), trees (Zaki 2002), andgraphs (Inokuchi et al. 2003; Kuramochi and Karypis 2004). As one increases thecomplexity of the structures to be discovered, one extracts more informative patterns.Here we briefly review the existing methods for FPM.1 As of Nov 2007, DMTL has been downloaded by around 3,000 researchers from the Sourceforge site (itis averaging about 2,000 hits and 100 downloads a month).123

V. Chaoji et al.Itemset mining The itemset mining problem is to discover frequently co-occurring setsof items (or attributes). Since its introduction by Agrawal et al. (1993), over the pastdecade many interesting algorithms have been proposed for mining frequent itemsets(Agrawal et al. 1996; Zaki et al. 1997; Savasere et al. 1995; Brin et al. 1997; Han et al.2000a; Zaki and Gouda 2003). It continues to be an active area of research. Methodsfor mining maximal (those that have no frequent superset) and closed patterns (thosethat have no superset with the same frequency) have appeared in (Pasquier et al. 1999;Zaki and Hsiao 2002; Zaki and Hsiao 2005; Wang et al. 2003; Burdick et al. 2001a).Recent advances are described in the comparative study on Frequent Itemset MiningImplementations (FIMI) by Goethals and Zaki (2003).Sequence mining Sequence mining helps discover frequent sequential patterns acrosstime or positions in a given data set. Within data mining, the problem of mining sequential patterns was introduced by Agrawal and Srikant (1995). Many other approacheshave followed since then (Srikant and Agrawal 1996; Mannila et al. 1995; Mannilaand Toivonen 1996; Oates et al. 1997; Zaki 2001; Ayres et al. 2002; Pei et al. 2001).Methods that consider constraints like maximum/minimum gaps, sliding windows,regular expressions, and taxonomies have also been proposed (Srikant and Agrawal1996; Zaki 2000b; Garofalakis et al. 1999). Methods for mining closed sequencesappear in Wang and Han (2004), Balcazar and Casas-Garriga (2005).Tree mining Several algorithms for tree mining have been proposed recently, startingwith the earlier work in Wang and Liu (1998), Asai et al. (2002), and Zaki (2002).The new methods mine different kinds of tree patterns, such as ordered/unorderedembedded trees (Zaki 2005b; Wang et al. 2004; Termier et al. 2002; Termier et al.2004; Zaki 2005a) or induced trees (Chi et al. 2003; Shasha et al. 2004; Xiao et al.2003; Nijssen and Kok 2003; Asai et al. 2003; Chi et al. 2004a). Methods for miningclosed and maximal tree patterns appear in Chi et al. (2004b).Graph mining Given a database of graph objects, the goal of graph mining is to find allthe commonly occurring sub-graph patterns. Some of the early work in graph mininginclude Cook and Holder (1994), Yoshida and Motoda (1995), and Dehaspe et al.(1998). Many recent methods have been proposed which improve the efficiency ofmining, these include Inokuchi et al. (2000), Kramer et al. (2001), Kuramochi andKarypis (2001), Yan and Han (2002a), Huan et al. (2003a), and Nijssen and Kok(2004). Closed graph mining methods have also been proposed (Yan and Han 2003).As reviewed above, there have been many stand-alone algorithms to mine different types of patterns. On closer examination, certain common themes and commonalgorithmic paradigms permeate all of the existing methods. The goal of the DMTLis to abstract these common elements into generic primitives both in terms of the datastructures used and in terms of the algorithms. In addition, DMTL explicitly handlesthe issue of persistency, i.e., the ability to mine out-of-core datasets. An initial versionof DMTL was described in Zaki et al. (2004), however that design was not based on theproperty-based framework we present here. In a recent paper on DMTL (Hasan et al.2005), we focused on the software design; neither did we emphasize the data miningaspects, nor did we present any experiments results. This paper describes the novelproperty-based DMTL framework, it shows how DMTL can be extended to mine newpattern types, and it presents a comprehensive experimental evaluation demonstratingDMTL’s scalability to large datasets.123

An integrated, generic approach to pattern mining: data mining template libraryWe would like to point out that the work by the authors in Mannila and Toivonen(1997), is similar in concept to that of ours. In Mannila and Toivonen (1997), theauthors provide an abstract theoretical analysis for the problem of traversing the partialorder in a levelwise fashion. In the pattern space, they mainly focus on the itemsetand sequence patterns and their variations. The authors also discuss bounds on thenumber of evaluations of the “interestingness criterion” in terms of the size of theborder. Our work lends an empirical grounding for their theoretical work. Moreover,our framework supports both levelwise and depthwise traversal techniques. We alsoexplore more complex patterns such as trees and graphs.3 PreliminariesThe problem of mining frequent patterns can be stated as follows: Let N {x1 , x2 , . . . ,xnv } be a set of nv distinct nodes or vertices. A pair of nodes (xi , xj ) is called an edge.Let L {l1 , l2 , . . . , lnl }, be a set of nl distinct labels. Let Ln : N L, be a nodelabeling function that maps a node to its label Ln (xi ) li , and let Le : N N Lbe an edge labeling function, that maps an edge to its label Le (xi , xj ) lk .It is intuitive to represent a pattern P as a graph (PV , PE ), with labeled vertexset PV N and labeled edge set PE {(xi , xj ) xi , xj PV }. The number ofnodes in a pattern P is called its size. A pattern of size k is called a k-pattern, and theclass of frequent k-patterns is referred to as Fk . In some applications P is comprisedof undirected edges, i.e., the edges define a symmetric relation: (xi , xj ) (xj , xi ),while in other applications P is comprised of directed edges, i.e., the edges definean anti-symmetric relation: (xi , xj ) (xj , xi ). A path in P is a set of distinct nodes{xi0 , xi1 , . . . , xin }, such that (xij , xij 1 ) is an edge in PE for all j 0, . . . , n 1. xi0is the start node and xin is the end node. The number of edges gives the length of thepath. If xi and xj are connected by a path of exactly length n we denote it as xi n xj .Thus the edge (xi , xj ) can also be written as xi 1 xj .Given two patterns P (PV , PE ) and Q (QV , QE ), let f : PV QV be aninjective function. We say that P is a sub-pattern of Q (or Q is a super-pattern of P ),denoted P Q if and only if (iff) for all xi , xj PV :(i) Nodes labels are preserved by f , i.e., Ln (xi ) Ln (f (xi )).(ii) Edge labels are preserved by f , i.e., Le (xi , xj ) Le (f (xi ), f (xj )).(iii) (xi , xj ) PE (f (xi ), f (xj )) QE , i.e., PE QE .If P Q we also say that P is contained in Q or Q contains P . Note that with theexception of itemsets, we are generally interested only in connected sub-patterns,where we require that there exists a path between xi and xj for all xi , xj PV . Furthermore, in some data mining applications, we desire embedded sub-patterns; P iscalled an embedded sub-pattern of Q iff:(i) Nodes labels are preserved by f , i.e., Ln (xi ) Ln (f (xi )).(ii) (xi , xj ) PE f (xi ) l f (xj ) in QE , where l 1. In other words, anedges (xi , xj ) exists in P if f (xi ) is connected to f (xj ) by a path in Q. Notethat if l 1, then f (xi ) 1 f (xj ) (f (xi ), f (xj )) QE . In this case123

V. Chaoji et al.PE QE , and if we require that edge labels be preserved, then P becomes aregular sub-pattern of Q.A database D is just a collection (a multi-set) of patterns. A database pattern isalso called an object or a record. Let R {t1 , t2 , . . . , tnr }, be a set of nr distincttransaction identifiers (tid). An object has a unique identifier, given by the functionR(di ) tj , where di D and tj R. The number of objects in D is given as D .The absolute support of a pattern P in a database D is defined as the number ofobjects in D that contain P , given as π a (P , D) {P d d D} . The (relative)a,D )support of P is given as π(P , D) π (P D . A pattern is frequent if its support is morethan some user-specified minimum threshold π min . A frequent pattern is maximal ifit is not a sub-pattern of any other frequent pattern. A frequent pattern is closed if ithas no super-pattern with the same support. The frequent pattern mining problem is toenumerate all the patterns that satisfy the user-specified π min frequency requirement(and any other user-specified conditions).3.1 FPM instancesSome common types of patterns include itemsets, sequences, trees, and graphs, asshown in Fig. 1. Every pattern can be modeled as a graph; the nodes (xi ) are shownunder each circle and the node labels (Ln (xi )) are shown inside the circle, whereasedge labels have been omitted.In an itemset no two nodes have the same label. Let V {x1 , x2 , . . . , xk } be a nodeset such that Ln (xi ) Ln (xj ) for all xi , xj V , and without loss of generality, wemay assume that Ln (xi ) Ln (xi 1 ) for all 1 i k 1. There are several possiblegraph representations for itemset patterns: (i) vertex-only: An itemset pattern P is justa set of vertices, i.e., PV V and PE , this is shown in Fig. 1, (ii) linear: in anotherformulation the itemset is defined as PV V , and PE {(xi , xi 1 ) xi , xi 1 PV },SEQUENCE (A AB C)ITEMSET (ABCD)A1BC3D451AA3TREE11CB34ACB34AA2Fig. 1 FPM instances1234GRAPHACB22

An integrated, generic approach to pattern mining: data mining template library(iii) clique: A third alternative is to represent itemset P as a clique, i.e., PV V andPE {(xi , xj ) i j and xi , xj PV }. All edges in itemsets are undirected.A simple sequence pattern P is easily modeled as a set of directed edges PE {(xi , xi 1 ) (xi , xi 1 ) is ordered, and xi , xi 1 PV }. Moreover, for sequences, thenode labeling function Ln need not be injective, i.e., two nodes can have the same label.The sequential patterns defined by Agrawal and Srikant (1995) are ordered lists of itemsets. We can model a sequential pattern P as being made up of a sequence of n itemsetsP i , i 1, . . . n, using the linear representation, with directed edges from one itemsetto the next. For example, Fig. 1 shows the sequential pattern A {A, B} C. Notethat using the vertex-only formulation for the itemsets in a sequential pattern would beproblematic, since it would result in a disconnected pattern. In summary,a sequential pattern P has a vertex set made up of n disjoint subsets PV ni 1 PVi , and theedge set PE contains all the edges within P i (consecutive and undirected), and it alsocontains a directed edge for every pair of consecutive itemsets, i.e., from the last nodeof P i to the first node of P i 1 .A tree pattern P consists of the vertex set PV {r, x1 , x2 , . . .}, where r is a specialnode called root. A tree pattern must satisfy the following properties, namely i) theroot has no parent, i.e., (xi , r) PE for any xi PV , ii) the edges are directed, i.e., if(xi , xj ) PE , then (xj , xi ) PE , iii) a node has only one parent, i.e., if (xi , xj ) PE ,then (xk , xj ) PE for any xk xi , iv) the tree is connected, i.e., for all xi PV ,there exists a path from the root r to xi . Furthermore for ordered trees the order ofa nodes’ children matters. This means that there is an ordering of edges in PE , suchthat (xi , xj ) comes before (xi , xk ) in PE only if xj is before xk in the ordering ofxi ’s children. Embedded trees can be defined by following the definition of embeddedpatterns introduced earlier.Finally, by definition a general pattern can be modeled as a graph, along with anyspecial constraints that typically arise in graph mining (e.g., connected graphs, orinduced subgraphs). It is also possible to model other patterns such as directed acyclicgraphs (DAGs) or free trees (undirected acyclic graphs). DMTL currently implementsdirect support for the mining of (i) itemsets, (ii) regular or embedded simple sequences,(iii) regular/embedded, rooted trees with ordered/unordered edges, and (iv) undirectedgraphs with no self loops or multiple edges. As we shall describe below, the toolkitcan be extended to incorporate mining of other user defined patterns as well.3.2 Database formatIn a typical FPM task, the database is in a horizontal format, i.e. a set of transactions,where each transaction is an object of the pattern type being mined (Agrawal et al.1996). Recently, vertical database formats have been proposed for mining itemsets,sequences, and trees (Zaki 2000a, 2001, 2002). The vertical format is an attractivealternative since it enables fast computation of support by avoiding repeated databaseaccesses. It does so by associating an entity called Vertical attribute table (VAT) witheach pattern. For an itemset, the VAT is the list of tids in which it is contained; VATsfor sequences and trees are more complex and are described later. Currently a verticalscheme does not exist for graphs; the introduction of a new and efficient VAT scheme123

V. Chaoji et al.for graphs is one of the contributions of this paper. DMTL introduces two modes ofpersistence: (i) the collection of frequent patterns itself may be too large to fit in mainmemory, and hence persistent containers are provided to hold them, and (ii) persistentstorage and access to VATs. Both these modes of persistence are entirely transparentto the user.4 DMTL design and implementation4.1 Design motivationWhile implementing mining algorithms for different patterns, we noticed that theyexhibit considerable similarity, which suggested developing a common frameworkfor implementing them. Figure 2 outlines a generic pattern mining algorithm (pseudocode) that applies to all commonly explored patterns in the literature. The algorithmcan be broken down into key sub-tasks, which include candidate generation, isomorphism checking, and support counting. By implementing generic algorithms for thesesub-tasks, we retain the abstraction shown in this pseudo-code. The overall idea ofthe algorithm is as follows: the mining process searches incrementally in the patternspace by iteratively applying these sub-tasks in each iteration to enumerate patterns ofsize one, two, and so on. Each iteration discovers frequent patterns extended by onemore item (or edge) than in the previous step, until no further frequent patterns existin the database.The sub-tasks of a generic mining algorithm can be developed using generic methods (expressed using C function templates). For example, the candidate generationmethod generates candidate patterns of generic type T , by combining two parent patterns of type T . The algorithm strictly requires that both the input arguments, togetherwith the output argument, are of the same type T (e.g., we cannot join a set patternwith a tree pattern to produce a tree candidate pattern). The isomorphism checkingmethod takes as input a pattern P of type T and produces a boolean value to indicatewhether P is generated from a branch of the candidate generation tree where its canonical code is minimum. A minimum canonical code is a unique signature of a pattern,thus it guarantees that each candidate is enumerated only once. The support countingFig. 2 Generic frequent patternmining algorithm123

An integrated, generic approach to pattern mining: data mining template librarymethod takes as input a pattern of type T , counts its frequency in the entire database (via sub-pattern, i.e., subgraph isomorphism, checking). Typically, the supportof entire sets of candidates are determined in one database access.In all the above three generic methods, the type T models a pattern concept. It hasthe following requirements: (i) T defines an object that relates some elements. (ii) Tmust adhere to a structure that is defined by a collection of relational properties. (iii) Tdefines a comparison ( ) operator. (iv) Associated with type T there exists a patterniterator, which is used to iterate through the elements of the pattern. All commonlyknown patterns in data mining, like set, sequence, tree, or graph are refinements ofa pattern concept. The entire pattern mining process can be represented in terms ofabstract objects and operations, that can be captured easily using the C templatemechanism.Figure 3 shows the key architectural components of DMTL. The components arepartitioned into two main segments—the front end and the back end. The front endmanages the core mining process while the back end provides the necessary storage support. The dotted rectangular block in the front end corresponds to the genericpattern mining algorithm shown in Fig. 2 and the three generic subtasks (candidategeneration, isomorphism checking, and support counting) are represented by solidboxes inside it. The arrows in this figure show the data/control flow among differentcomponents. For instance, the down arrows among the tasks in the dotted rectangleindicate the order in which they were called and the loop back arrow (count supportCANDIDATEGENERATIONFRONT NDATASETBACK ENDFILEWITH BUFFERPSTLMEMORYFig. 3 High-level architecture diagram of the Data Mining Template Library. The boxes denote differentfunctional units and the arrows connecting different boxes shows the data flow123

V. Chaoji et al.to candidate generation) illustrates the recursive enumerative nature of a genericfrequent pattern mining algorithm. The count support task delegates its responsibilityto the back-end which manages the storage and can return the support of candidatepatterns by efficient back-end operations (intersecting the VAT of the parent patterns).The separation between the front-end and the back-end, as shown in this figure, isexplicitly retained in the implementation, which makes DMTL flexible, extensible,and also widely usable. Since the enumeration based algorithm can generate a largenumber of frequent patterns, efficient data structures are required such that the desiredpatterns and their corresponding VATs can be accessed efficiently from the containerwhere they are stored. These data structures are implemented in the back-end, whoseimplementation is hidden to the front-end. The VAT for a pattern is accessed fromthe back-end via the Pattern Structure module. Note that direct access to VATs is notallowed (for modularity and abstraction). Instead pattern identifiers are passed to theback-end, which in turn computes the VAT for a candidate pattern and returns thesupport value to the front-end. The candidate pattern, if found frequent, is saved inthe back-end along with its VAT. The Database Parser initiates the mining process byreading all frequent patterns of length one from the input source which could eitherbe a database, a flat-file or another process generating the data. The Database Parsergenerates two objects for each level-one pattern—pattern object and the VAT objectand stores those in the back-end storage manager.4.2 The front-end mining engineThe front-end consists of the core mining engine which implements the enumerativemining process.4.3 Data typesThe key data structure in DMTL is the pattern, along with its associated VAT. Thereader can more or less consider that the pattern is associated with the front-end andthe VAT is associated with the back-end. This is because majority of the operations onthe pattern structure are performed in the front-end, while the VAT is operated uponprimarily in the back-end.4.3.1 Expressing patterns as propertiesAny pattern is conceptualized as a group of elements that can be represented as vertices of a graph. The relationship between the elements is captured by the presenceof edges between the vertices. For instance, a set is a specialized graph which has noedges between its vertices. Whereas graphs provide an effective pattern abstraction,using a general purpose graph mining implementation to mine simpler patterns (itemsets, sequences or trees) is inefficient. The concept of pattern property provides aneffective solution to this dilemma. With this idea of expressing patterns in terms ofprimitive properties, we can ensure that a generic algorithm can pick the most appropriate sub-tasks. In Sect. 4.3.3, we explain the use of these pattern properties to ensure123

An integrated, generic approach to pattern mining: data mining template librarya generic algorithm that does not compromise efficiency via the use of appropriatepattern-specific specializations. Below we explain the different pattern properties thatwe used.The pattern properties constrain the graph to form the desired pattern. We analyzedthe pattern space and found that the following properties are sufficient to describe themost common patterns, but nevertheless, additional properties can be added to DMTLseamlessly. The properties are themselves categorized depending on the elements(nodes, edges, etc.) of a graph on which the constraints are imposed.1.Edge properties: The edge set PE is defined as PE PV PV , where PV isthe set of vertices in the pattern. Under edge relation category we consider thefollowing properties: no-edge means that elements in the patterns are not connected with any edge. directed means that elements in the patterns are connected with directed(asymmetric) edges. undirected means that elements in the patterns are connected with undirected(symmetric) edges. cyclic means that at least one vertex is reflexive on the edge relation in thetransitive closure of the pattern; otherwise, the pattern possesses the acyclicproperty.2.3.Vertex properties: Here we consider only one property, ordered, which imposesan ordering on the neighbors of a vertex, or else the pattern is said to be unordered.Ordering is usually relevant for only tree patterns.Degree-related properties: These relate to the degree constraints placed on thenodes of the pattern. indegree lte one constrains all vertices of a graph to have indegree 1. outdegree lte one constrains all vertices of a graph to have outdegree 1.4.Label properties: Here the unique label property requires the labeling functionto be

An integrated, generic approach to pattern mining: . pattern types, and for mining very large, out-of-core datasets. 2 Related work Frequent structure mining refers to an important class of exploratory mining tasks, namely those dealing with extracting patterns in massive databases representing com-