Using Text Mining and Link Analysis for Software Mining


Miha Grcar, Marko Grobelnik, Dunja Mladenic
Jozef Stefan Institute, Dept. of Knowledge Technologies, Jamova 39, 1000 Ljubljana, Slovenia
{miha.grcar, marko.grobelnik, dunja.mladenic}@ijs.si

Abstract. Many data mining techniques are these days in use for ontology learning – text mining, Web mining, graph mining, link analysis, relational data mining, and so on. In the current state-of-the-art bundle there is a lack of "software mining" techniques. This term denotes the process of extracting knowledge out of source code. In this paper we approach the software mining task with a combination of text mining and link analysis techniques. We discuss how each instance (i.e. a programming construct such as a class or a method) can be converted into a feature vector that combines the information about how the instance is interlinked with other instances with the information about its (textual) content. The so-obtained feature vectors serve as the basis for the construction of the domain ontology with OntoGen, an existing system for semi-automatic, data-driven ontology construction.

Keywords: software mining, text mining, link analysis, graph and network theory, feature vectors, ontologies, OntoGen, machine learning

1 Introduction and Motivation

Many data mining (i.e. knowledge discovery) techniques are these days in use for ontology learning – text mining, Web mining, graph mining, network analysis, link analysis, relational data mining, stream mining, and so on [6]. In the current state-of-the-art bundle, mining of software code and the associated documentation is not explicitly addressed. With the growing amounts of software, especially open-source software libraries, we argue that mining such data is worth considering as a new methodology. Thus we introduce the term "software mining" to refer to such a methodology. The term denotes the process of extracting knowledge (i.e. useful information) out of data sources that typically accompany an open-source software library.

The motivation for software mining comes from the fact that the discovery of reusable software artifacts is just as important as the discovery of documents and multimedia contents. According to recent Semantic Web trends, contents need to be semantically annotated with concepts from the domain ontology in order to be discoverable by intelligent agents. Because legacy content repositories are relatively large, cheaper semi-automatic means for semantic annotation and domain ontology construction are preferred to expensive manual labor. Furthermore, when dealing with software artifacts it is possible to go beyond discovery and also support other user tasks such as composition, orchestration, and execution. The need for ontology-based systems has yielded several EU-supported research and development projects that deal with this issue. One of these projects is TAO (http://www.tao-project.eu), which stands for Transitioning Applications to Ontologies. In this paper we present work in the context of software mining for domain ontology construction. We illustrate the proposed approach on a software mining case study based on GATE [3], an open-source software library for natural language processing written in the Java programming language.

We interpret "software mining" as a combination of methods for structure mining and for content mining. To be more specific, we approach the software mining task with techniques used for text mining and link analysis. The GATE case study serves as a perfect example in this respect. Using concrete examples, we discuss how each instance (i.e. a programming construct such as a class or a method) can be represented as a feature vector that combines the information about how the instance is interlinked with other instances with the information about its (textual) content. The so-obtained feature vectors serve as the basis for the construction of the domain ontology with OntoGen [4], a system for semi-automatic, data-driven ontology construction, or by using traditional machine learning algorithms such as clustering, classification, regression, or active learning.

2 Related Work

When studying the literature we did not limit ourselves to ontology learning in the context of software artifacts – the reason being that more general techniques also have the potential to be adapted for software mining.

Several knowledge discovery (mostly machine learning) techniques have been employed for ontology learning in the past. Unsupervised learning, classification, active learning, and feature space visualization form the core of OntoGen [4]. OntoGen employs text mining techniques to facilitate the construction of an ontology out of a set of textual documents. Text mining seems to be a popular approach to ontology learning because many textual sources are available (one of the largest is the Web). Furthermore, text mining techniques have been shown to produce relatively good results. In [8], the authors provide a great deal of insight into ontology learning in the context of the Text-To-Onto ontology learning architecture. The authors employ a multi-strategy learning approach and result combination (i.e. they combine the outputs of several different algorithms) to produce a coherent ontology definition. The same work also presents a comprehensive survey of ontology learning approaches.

Marta Sabou's thesis [13] provides valuable insights into ontology learning for Web Services. It summarizes ontology learning approaches, ontology learning tools, and the acquisition of software semantics, and describes – in detail – her framework for learning Web Service domain ontologies.

There are basically two approaches to building tools for software component discovery: the information retrieval (IR) approach and the knowledge-based approach. The first approach is based on the natural language documentation of the software components. With this approach no interpretation of the documentation is made – the information is extracted via statistical analyses of the word distribution. On the other hand, the knowledge-based approach relies on pre-encoded, manually provided information (the information is provided by a domain expert). Knowledge-based systems can be "smarter" than IR systems but they suffer from a scalability issue (extending the repository is not "cheap").

In [9], the authors present techniques for retrieving classes from object-oriented libraries and for browsing among functionally related classes (rather than browsing the inheritance hierarchy). They chose the IR approach, which they believe is advantageous in terms of cost, scalability, and ease of posing queries. They extract information from the source code (a structured data source) and its associated documentation (an unstructured data source). First, the source code is parsed and relations such as derived-from or member-of are extracted. A hierarchical clustering technique is then used to form a browse hierarchy that reflects the degree of similarity between classes (the similarity is drawn from the class documentation rather than from the class structure). The similarity between two classes is inferred from the browse hierarchy with respect to the distance of the two classes from their common parent and the distance of their common parent from the root node.

In this paper we adopt some ideas from [9]. However, the purpose of our methodology is not to build browse hierarchies but rather to describe programming constructs with feature vectors that can be used for machine learning. In other words, the purpose of our methodology is to transform a source code repository into a feature space. The exploration of this feature space enables domain experts to build a knowledge base in a "cheaper", semi-automatic, interactive fashion.

3 Mining Content and Structure of Software Artifacts

In this section we present our approach and give an illustrative example of data preprocessing from documented source code using the GATE software library. In the context of the GATE case study the content is provided by the reference manual (textual descriptions of Java classes and methods), source code comments, the programmer's guide, the annotator's guide, the user's guide, the forum, and so on. The structure is provided implicitly by these same data sources, since a Java class or method is often referenced from the context of another Java class or method (e.g. a Java class name is mentioned in the comment of another Java class). Additional structure can be harvested from the source code (e.g. a Java class contains a member method that returns an instance of another Java class), code snippets, and usage logs (e.g. one Java class is often instantiated immediately after another). In this paper we limit ourselves to the source code, which also represents the reference manual (the so-called JavaDoc), since the reference manual is generated automatically out of the source code comments by a documentation tool.

A software-based domain ontology should provide two views on the corresponding software library: the view on the data structures and the view on the functionality [13]. In GATE, these two views are represented by Java classes and their member methods – these are evident from the GATE source code. In our examples we limit ourselves to Java classes (i.e. we deal with the part of the domain ontology that covers the data structures of the system). This means that we will use the GATE Java classes as text mining instances (and also as graph vertices when dealing with the structure).

Let us first take a look at a typical GATE Java class. It contains the following bits of information relevant for the understanding of this example (see also Fig. 1):

- Class comment. It should describe the purpose of the class. It is used by the documentation tool to generate the reference manual (i.e. JavaDoc). It is mainly a source of textual data but also provides structure – two classes are interlinked if the name of one class is mentioned in the comment of the other class.

- Class name. Each class is given a name that uniquely identifies it. The name is usually a composed word that captures the meaning of the class. It is mainly a source of textual data but also provides structure – two classes are interlinked if they share a common substring in their names.

- Field names and types. Each class contains a set of member fields. Each field has a name (which is unique within the scope of the class) and a type. The type of a field corresponds to a Java class. Field names provide textual data. Field types mainly provide structure – two classes are interlinked if one class contains a field that instantiates the other class.

- Field and method comments. Fields and methods can also be commented. The comment should explain the purpose of the field or method. These comments are a source of textual data. They can also provide structure in the same sense as class comments do.

- Method names and return types. Each class contains a set of member methods. Each method has a name, a set of parameters, and a return type. The return type of a method corresponds to a Java class. Each parameter has a name and a type which corresponds to a Java class. Methods can be treated similarly to fields with respect to taking their names and return types into account. Parameter types can be taken into account similarly to return types, but there is a semantic difference between the two pieces of information: parameter types denote classes that are "used/consumed" for processing, while return types denote classes that are "produced" in the process.

- Information about inheritance and interface implementation. Each class inherits (fields and methods) from a base class. Furthermore, a class can implement one or more interfaces. An interface is merely a set of methods that need to be implemented in the derived class. The information about inheritance and interface implementation is a source of structural information.

3.1 Textual Content

Textual content is taken into account by assigning a textual document to each unit of the software code – in our illustrative example, to each GATE Java class. Suppose we focus on a particular, arbitrary class – there are several ways to form the corresponding document.
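Since Fig. 1 itself is not reproduced in this transcription, the following condensed Java sketch captures its relevant parts. It is modeled on the DocumentFormat example discussed below, but it is an illustration only, not the actual GATE source; the placeholder types at the top merely keep the sketch self-contained.

    // Placeholder types, declared only to make the sketch compile on its own.
    interface LanguageResource {}
    class MimeType {}
    class Document {}

    /**
     * Class comment: describes the purpose of the class and is used to
     * generate the reference manual (JavaDoc). Mentioning XmlDocumentFormat,
     * RtfDocumentFormat, MpegDocumentFormat, or MimeType here creates
     * comment references to those classes.
     */
    public abstract class DocumentFormat implements LanguageResource {  // interface implementation

        /** Field comment: the MIME type of this format (field name and type). */
        protected MimeType mimeType;

        /** Method comment: unpacks the markup of the given document
            (method name and parameter type). */
        public abstract void unpackMarkup(Document doc);

        /** Method comment: returns the MIME type (method return type). */
        public MimeType getMimeType() { return mimeType; }
    }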

Fig. 1. Relevant parts of a typical Java class.

It is important to include only those bits of text that are not misleading for the text mining algorithms. At this point the details of these text mining algorithms are fairly irrelevant, provided that we can somehow evaluate the domain ontology that we build in the end.

Another thing to consider is how to include the composed names of classes, fields, and methods into a document. We can insert each of these as:

- a composed word (i.e. in its original form, e.g. "XmlDocumentFormat"),
- separate words (i.e. by inserting spaces, e.g. "Xml Document Format"), or
- a combination of both (e.g. "XmlDocumentFormat Xml Document Format").

The text mining algorithms perceive two documents that have many words in common as more similar than those that share only a few or no words. Breaking composed names into separate words therefore results in a greater similarity between documents that do not share full names but do share some parts of these names.
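As a sketch, the third option (combining both forms) can be implemented with a simple regular expression that splits a composed name at lower-to-upper case boundaries; handling of all-caps acronyms (e.g. "XML") is left out for brevity.

    import java.util.regex.Pattern;

    public class NameSplitter {
        // Splits at positions where a lowercase letter or digit is
        // followed by an uppercase letter, e.g.
        // "XmlDocumentFormat" -> "Xml Document Format".
        private static final Pattern CAMEL_BOUNDARY =
            Pattern.compile("(?<=[a-z0-9])(?=[A-Z])");

        public static String toWords(String composedName) {
            return CAMEL_BOUNDARY.matcher(composedName).replaceAll(" ");
        }

        // Option three from the list above: the original form followed
        // by the separated words.
        public static String toCombined(String composedName) {
            return composedName + " " + toWords(composedName);
        }

        public static void main(String[] args) {
            System.out.println(toCombined("XmlDocumentFormat"));
            // prints: XmlDocumentFormat Xml Document Format
        }
    }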

3.2 Determining the Structure

The basic units of the software code – in our case the Java classes – that we use as text-mining instances are interlinked in many ways. In this section we discuss how this structure, which is often implicit, can be determined from the source code.

As already mentioned, when dealing with the structure, we represent each class (i.e. each text-mining instance) by a vertex in a graph. We can create several graphs – one for each type of association between classes. This section describes several graphs that can be constructed out of object-oriented source code.

Comment Reference Graph. Every comment found in a class can reference another class by mentioning its name (for whatever the reason may be). In Fig. 1 we can see four such references, namely the class DocumentFormat references the classes XmlDocumentFormat, RtfDocumentFormat, MpegDocumentFormat, and MimeType (denoted with "Comment reference" in the figure). A section of the comment reference graph for the GATE case study is shown in Fig. 2. The vertices represent GATE Java classes found in the "gate" subfolder of the GATE source code repository (we limited ourselves to a subfolder merely to reduce the number of vertices for the purpose of the illustrative visualization). An arc that connects two vertices is directed from the source vertex towards the target vertex (these two vertices represent the source and the target class, respectively). The weight of an arc (at least 1) denotes the number of times the name of the target class is mentioned in the comments of the source class. The higher the weight, the stronger the association between the two classes. In the figure, the thickness of an arc is proportional to its weight.

Fig. 2. A section of the GATE comment reference graph.

Name Similarity Graph. A class usually represents a data structure and a set of methods related to it. Not every class is a data structure – it can merely be a set of (static) methods. The name of a class is usually a noun denoting either the data structure that the class represents (e.g. Boolean, ArrayList) or a "category" of the methods contained in the class (e.g. System, Math). If the name is composed (e.g. ArrayList), it is reasonable to assume that each of the words bears a piece of information about the class (e.g. an ArrayList is some kind of List with the properties of an Array). Therefore it is also reasonable to say that two classes that have more words in common are more similar to each other than two classes that have fewer words in common. According to this intuition we can construct the name similarity graph. This graph contains edges (i.e. undirected links) instead of arcs. Two vertices are linked when the two classes share at least one word. The strength of the link (i.e. the edge weight) can be computed by using the Jaccard similarity measure, which is often used to measure the similarity of two sets of items (see http://en.wikipedia.org/wiki/Jaccard_index).
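As a sketch, assuming each class name has already been split into its set of constituent words (Section 3.1), the edge weight can be computed as follows.

    import java.util.HashSet;
    import java.util.Set;

    public class NameSimilarity {
        // Jaccard similarity of two word sets: |A ∩ B| / |A ∪ B|.
        public static double jaccard(Set<String> a, Set<String> b) {
            if (a.isEmpty() && b.isEmpty()) return 0;
            Set<String> intersection = new HashSet<>(a);
            intersection.retainAll(b);
            Set<String> union = new HashSet<>(a);
            union.addAll(b);
            return (double) intersection.size() / union.size();
        }

        public static void main(String[] args) {
            // Word sets of the composed names "ArrayList" and "LinkedList".
            Set<String> a = Set.of("Array", "List");
            Set<String> b = Set.of("Linked", "List");
            System.out.println(jaccard(a, b)); // 1/3 ≈ 0.333
        }
    }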

The name similarity graph for the GATE case study is presented in Fig. 3. The vertices represent GATE Java classes found in the "gate" subfolder of the GATE source code repository. The Jaccard similarity measure was used to weight the edges. Edges with weights lower than 0.6 and vertices of degree 0 were removed to simplify the visualization. In Fig. 3 we have removed class names and weight values to clearly show the structure. The evident clustering of vertices is the result of the Kamada-Kawai graph drawing algorithm [14] employed by Pajek [1], which was used to create the graph drawings in this paper. The Kamada-Kawai algorithm positions vertices that are highly interlinked closer together.

Type Reference Graph. Field types and method return types are a valuable source of structural information. A field type or a method return type can correspond to a class in the scope of the study (i.e. a class that is also found in the source code repository under consideration) – hence an arc can be drawn from the class to which the field or the method belongs towards the class represented by the type.

Inheritance and Interface Implementation Graph. Last but not least, structure can also be determined from the information about inheritance and interface implementation. This is the most obvious structural information in object-oriented source code and is often used to arrange classes into a browsing taxonomy. In this graph, an arc that connects two vertices is directed from the vertex that represents a base class (or an interface) towards the vertex that represents a class that inherits from the base class (or implements the interface). The weight of an arc is always 1.

Fig. 3. The GATE name similarity graph. The most common substrings in names are shown for the most evident clusters.
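To make the last two graph definitions concrete, the following sketch prints the inheritance, interface implementation, and type reference arcs for a single class. Note that in our setting this information is parsed from the source code; the sketch uses runtime reflection over compiled classes instead, merely to keep the example short and self-contained, and it omits the filtering step that keeps only target classes within the scope of the study.

    import java.lang.reflect.Field;
    import java.lang.reflect.Method;

    public class StructureExtractor {
        public static void printArcs(Class<?> c) {
            // Inheritance: arc from the base class to the derived class.
            if (c.getSuperclass() != null)
                System.out.println(c.getSuperclass().getName() + " -> " + c.getName());
            // Interface implementation: arc from the interface to the class.
            for (Class<?> i : c.getInterfaces())
                System.out.println(i.getName() + " -> " + c.getName());
            // Type references: arcs from the class to its field types
            // and to its method return types.
            for (Field f : c.getDeclaredFields())
                System.out.println(c.getName() + " -> " + f.getType().getName());
            for (Method m : c.getDeclaredMethods())
                System.out.println(c.getName() + " -> " + m.getReturnType().getName());
        }

        public static void main(String[] args) {
            printArcs(java.util.ArrayList.class);
        }
    }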

4 Transforming Content and Structure into Feature Vectors

Many data mining algorithms work with feature vectors. This is true also for the algorithms employed by OntoGen and for traditional machine learning algorithms such as clustering or classification. Therefore we need to convert the content (i.e. documents assigned to text-mining instances) and the structure (i.e. several graphs of interlinked vertices) into feature vectors. Potentially we also want to include other explicit features (e.g. the in- and out-degree of a vertex).

4.1 Converting Content into Feature Vectors

To convert textual documents into feature vectors we resort to a well-known text mining approach. We first apply stemming to all the words in the document collection (i.e. we normalize words by stripping them of their suffixes, e.g. stripping → strip, suffixes → suffix). We then search for n-grams, i.e. sequences of n consecutive words that occur in the document collection more than a certain number of times [11]. Discovered n-grams are perceived just like all the other (single) words. After that, we convert documents into their bag-of-words representations. To weight words (and n-grams), we use the TF-IDF weighting scheme ([6], Section 1.3.2).
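As an illustration, a minimal sketch of the bag-of-words weighting follows. It computes raw term frequencies and combines them with the standard inverse document frequency, tfidf(t, d) = tf(t, d) × log(N / df(t)); stemming and n-gram discovery are assumed to have been applied beforehand, and the exact TF-IDF variant of ([6], Section 1.3.2) may differ in normalization.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class BagOfWords {
        // N is the collection size and df(t) the number of documents
        // containing term t. The document is assumed to be a member of
        // the collection, so df(t) >= 1 for every term t it contains.
        public static Map<String, Double> tfIdf(List<String> doc,
                                                List<List<String>> collection) {
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) tf.merge(term, 1, Integer::sum);

            Map<String, Double> weights = new HashMap<>();
            for (Map.Entry<String, Integer> e : tf.entrySet()) {
                long df = collection.stream()
                                    .filter(d -> d.contains(e.getKey()))
                                    .count();
                weights.put(e.getKey(),
                            e.getValue() * Math.log((double) collection.size() / df));
            }
            return weights;
        }
    }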

4.2 Converting Structure into Feature Vectors

Let us repeat that the structure is represented in the form of several graphs in which vertices correspond to text-mining instances. If we consider a particular graph, the task is to describe each vertex in the graph with a feature vector.

For this purpose we adopt the technique presented in [10]. First, we convert arcs (i.e. directed links) into edges (i.e. undirected links). The edges adopt weights from the corresponding arcs. If two vertices are directly connected with more than one arc, the resulting edge weight is computed by summing, maximizing, minimizing, or averaging the arc weights (we propose summing the weights as the default option). Then we represent a graph on N vertices as an N × N sparse matrix. The matrix is constructed so that
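The excerpt breaks off at this point, so the exact matrix construction from [10] is not reproduced here. A plausible reading, consistent with the procedure just described, is that the matrix holds the aggregated edge weights, so that the i-th row can serve as the (sparse) structural feature vector of the i-th vertex. The following sketch implements the arc-to-edge conversion with the proposed default of summing weights, under that assumption.

    import java.util.HashMap;
    import java.util.Map;

    public class StructureFeatures {
        // Sparse N x N matrix stored row by row; row i is assumed here to
        // serve as the structural feature vector of vertex i (the exact
        // construction from [10] is not reproduced in this excerpt).
        private final Map<Integer, Map<Integer, Double>> rows = new HashMap<>();

        // Converts an arc i -> j of weight w into an undirected edge,
        // summing the weights of parallel arcs (the proposed default).
        public void addArc(int i, int j, double w) {
            rows.computeIfAbsent(i, k -> new HashMap<>()).merge(j, w, Double::sum);
            rows.computeIfAbsent(j, k -> new HashMap<>()).merge(i, w, Double::sum);
        }

        // Sparse feature vector of vertex i: its non-zero matrix row.
        public Map<Integer, Double> featureVector(int i) {
            return rows.getOrDefault(i, Map.of());
        }
    }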