A Semantic Similarity Approach to Electronic Document Modeling and Integration

William W. Song*, David Cheung, and CJ Tan
E-Business Technology Institute, The University of Hong Kong
Hong Kong, PRC
Email: {wsong, dcheung, ctan}@eti.hku.hk

* Corresponding author.
0-7695-0577-5/00 $10.00 © 2000 IEEE

Abstract

The World Wide Web is an enormous collection of information resources serving various purposes. However, the diversity of Web information and of its related formats makes it very difficult for users to efficiently search for and obtain the information they require. The difficulty arises because most of the information uploaded to the Web is unstructured or poorly structured. Many metadata models have been proposed in response to this problem. These metadata models attempt to provide a general description of Web information in order to improve its structuredness. Although ill-structured Web documents constitute the largest portion of Web information or Web resources, few metadata models deal with them by analyzing their semantic relations with each other. In this paper we consider this large portion of Web information, called electronic documents. We propose a metadata model, called EDM (Electronic Document Metadata Model). Using the metadata model we can extract semantic characteristics from electronic documents and then use the characteristics to form a semantic electronic document model. This model, in turn, provides a basis for the analysis of semantic similarity between electronic documents and for electronic document integration. Document modeling and integration will support further manipulations of electronic documents, such as exchange, search, and evolution.

1 Introduction

1.1 Problems

It is well known that the largest part of the Web resources consists of electronic documents. Electronic documents are also the major targets of search. However, it is extremely difficult to effectively find the single electronic document that a user asks for among an explosively large number of Web documents. This difficulty arises because there is no well-defined structure to represent these documents. Although a Web document is designed and uploaded to the Web to convey certain information to Web readers, the Web document per se provides little information that is effective for search. In other words, a Web document cannot be directly used to uniquely identify the Web document to be searched for.

Most of the available search engines use a keyword (textual string) matching mechanism to search Web documents and other resources. This keyword match method implies that the Web documents must contain a list of presumed keywords. Because of the subjectivity of assigning keywords and the multivocality of the keywords selected, search results by keywords are not satisfactory; for example, they are low in search precision and difficult to comprehend [7].

The introduction of labels (or ratings) [15] to describe a Web document was considered to be a better way to identify Web documents. The labels assigned to Web documents were pre-defined, to some extent, with a fixed meaning. More importantly, these labels were related to each other according to a semantic model. For example, the RSACi rating standard for recreation materials suggested by the Recreation Software Advisory Council [17] provides a tree-like semantic structure for labels and ratings. In this semantic model, Web documents carrying the same label belong to the same class. In other words, the documents in the same class have the same label. The motivation for deploying labeling techniques is to group Web documents into classes. Significant relationships between the classes are also defined according to the semantic data model adopted. Consequently, by using Web browsers capable of rating Web documents, Web users can rapidly access or ignore the Web documents in a certain class identified by the label [6]. Furthermore, the labeling system (i.e. the semantic model) may provide classifications for labels as well. We should note that such classifications or labeling methods can be defined by the Web resource providers, by some authorities (e.g. a rating bureau), or by both.

However, at least two problems occur in the labeling methods. First, a labeling method is not really a well-defined and theoretically sound semantic model. It is only an empirical model. Second, many Web-authoring tools do not provide document authors with any support for document labeling.

1.2 Conceptual modeling

Grouping electronic documents according to some prerequisite criteria is an important issue, which has recently received increasing attention from both researchers and users. Its aim is to collect electronic documents related to a certain topic for a group of people having similar interests. These people sharing the same interest in a certain subject form a newsgroup [1]. Finding the resemblance among these electronic documents on the topic and computing their similarities is a crucial step in grouping the electronic documents. The similarity

computation is based on characteristics or attributes that we assume can be captured from the information or data hidden in the electronic documents.

Through observing keywords and labels (considered to be metadata) in electronic documents, we found it important to suggest a description framework that can capture as many characteristics or metadata about electronic documents as possible. In other words, we need a conceptual model for the description of metadata, or a metadata model as it is often called in publications, to support modeling and managing electronic documents. Such a metadata model can define attributes for electronic documents and relationships between electronic documents, and hence assist electronic document similarity computation [9]. The techniques of using the characteristics of conceptual objects for computing object similarities and then integrating them to form a new conceptual object have been well developed in the conceptual database design area [10]. Therefore, in this paper, we use these similarity techniques for electronic document clustering.

We have noticed that many efforts have been put into improving various search mechanisms and techniques in order to improve search quality. Search quality measures, such as precision and comprehension [7], are used to evaluate search engines on their search results [3, 8]. However, little work has been done on analyzing electronic documents, extracting their characteristics (metadata formula), and hence defining a conceptual metadata model for describing and modeling electronic documents. A conceptual metadata model is fundamental and essential for making full use of electronic resources for the following reasons:

1. A conceptual metadata model captures the basic features of electronic documents rather than just superficial observations.
2. Conceptual modeling organizes the attributes of documents and the relationships between documents for electronic document integration.
3. The modeling process helps us to obtain an abstract view of the electronic document structure and a detailed description of the document characteristics.
4. More importantly, more than a decade of research and development on conceptual modeling has formed a sound basis for metadata research and application.

1.3 Related work

Many efforts have been put into various aspects of electronic information applications, such as electronic document modeling, document similarity computation, and the study of electronic information search quality. Electronic document modeling aims to extract characteristics, or attributes, from existing electronic documents and thereby to form a metadata model. Conversely, the metadata model can then support the formal definition of inter-document relationships and of the relationships within electronic documents.

A number of metadata models have been proposed, and some of them have become W3C recommendations. Among them are XML Schema (extensible markup language) [18], RDF (Resource Description Framework) [16], and MCF (Meta Content Framework) [14]. These proposed metadata models share a common set of features. Their aims include describing the structure of Web sites, distributing annotation and authoring, and exchanging formats of information. For example, RDF contains a set of directed labeled graphs consisting of a set of nodes, labeled arcs, and attribute values, corresponding respectively to Web resources, the relationships among the resources, and the attributes that describe the resources. RDF can be viewed as a very general data model for the description of electronic documents.

However, we maintain that a conceptual metadata model should first of all take electronic document structures into consideration. Starting from the document structures, a metadata model can be defined in a relatively general way, both to represent the common features or attributes of electronic documents effectively and to make the data model easy to apply for different purposes, such as document clustering computation.

In [13], a set of document structures is defined, including sequence structure, grid structure, tree structure, and Web structure. This description of Web document structures is mainly based on how the documents are related to each other and follows the criteria of predictability, information richness, and modifiability. Predictability indicates that it is easy to find related resources and that users will not get lost in a chain of search processes. Information richness requires that a Web resource be linked to many other resources to gain more information on a subject. Modifiability means that changes to a Web document do not cause substantial loss of related information, i.e. links to other Web documents.

Electronic document resemblance computation groups the documents of a common interest into one class having one or several common characteristics. For example, news and articles about Intranet will be put together in a special interest group, e.g. IntranetSig. Some approaches to similarity computation [3, 8] have been suggested, and the basic process can be described as follows.

1) A profile recording the user's interests is collected and organized in a certain form.
2) One profile is picked as the original and compared with the other profiles, and the similarity distance between the picked profile and each of the other profiles is weighed. The shorter the similarity distance, the more similar the original and the other profile are.
3) Given a fixed distance value, all the profiles whose similarity distance to the original is less than the value are considered to have the same profile items.
4) This set of profiles is used as the identifying standard for electronic documents and recommended to the users.

Searching Web information can be seen as finding a set of Web information items with one or several common features, given one or several search attributes. There exist a number of commonly used search engines with which Web readers can find the information they need. However, the shortcomings of search engines, such as imprecision and incomprehensibility, as well as the search strategies they deploy, are debated from time to time. There are also quite a number of search methods or tools that have sprung up, declared to be an improvement or enhancement of those

commonly used search engines. The improvements can be simply summarized as: 1) adding more search capabilities than merely textual ones; 2) taking into account hyperlinks, which are viewed as just text by ordinary search mechanisms; 3) considering information clustering (grouping Web documents) according to some predefined profiles [2, 4, 12].

1.4 Paper Structure

As we discussed previously, conceptual modeling is key to the analysis and formation of the structure of electronic documents and their relationships. The conceptual model for electronic documents can also be used for the management of electronic documents, such as searching and analysis. In the next section, we propose and discuss an electronic document metadata model, called EDM, and its constructs, where two supporting conceptual models, the Basic Metadata Model (BDM) and the Path Tree Structure (PTS), are described. In section 3, based on the metadata model, WDM, we analyze various relations between electronic documents and define a set of semantic relatedness relations and a set of semantic similarity relations. In section 4, we suggest a process of electronic document clustering and discuss the major steps of the process. Finally, in section 5, we conclude the paper and suggest our future work.

2 Conceptual Metadata Modeling

Previously we briefly discussed the necessity of introducing a conceptual metadata model for electronic document clustering or integration. Due to the diversity of electronic document descriptions, searching, exchanging, and managing electronic documents are difficult. The diversity also makes it difficult to use existing conceptual modeling methods directly for electronic document modeling. It is indispensable to build a conceptual metadata model that is able to take into account the various characteristics of electronic documents. So we propose a conceptual metadata model for the description of electronic and Web documents.

The metadata model, called EDM (electronic document metadata model), is a conceptual model. It is intended to formalize various characteristics of electronic documents and various relations (attributes) between electronic documents. The metadata model is intended to serve a few purposes, including:

- to build better Web documentation languages for Web document authoring,
- to use the attributes for the classification of electronic document schemas, generated from the Web data model, and
- to define similarity relations between electronic documents for clustering Web information.

In the following, we first discuss what information or knowledge we observe in electronic documents. We suggest a layer structure for the description of electronic documents. Then in section 2.2, we propose a basic metadata model, on which the rest of the electronic document metadata can be built, and a path tree structure for URLs. In section 2.3 we discuss EDM, the conceptual metadata model for the description of electronic documents. A multiple-layer (tree-like) structure is often used in document management, so we also consider this kind of structure as part of the metadata model. Finally, we propose a metadata model, EDM, which is to support electronic document grouping, classification, filtering, and intelligent searching.

2.1 Meta-information on electronic documents

Look at an electronic document. It usually contains a sequence of textual paragraphs separated by a number of textual headings. Scattered within the paragraphs are some underlined words, called hyperlinks. The hyperlinks connect the words to other electronic documents, which are supposed to provide further detailed information about the words. In addition, the window containing the electronic document may be split into two or three frames to show a certain kind of content cohesion.

In the HEAD part, if the Web document is written in HTML, we see some “meta” information items and other information items, such as the “title” of the page and the “keywords” used for searching the page. Some Web documents, like http://www.w3.org/PICS/, contain PICS labels in their head part. PICS (Platform for Internet Content Selection) is a metadata model for rating Web documents in order to filter them. Whether or not we can use PICS labels for Web information filtering depends on whether the Web browsers support interpreting and executing a pre-defined label taxonomy structure. From these descriptions we can see that an electronic document, together with other information not visible to the end users, provides meta-information for the identification of the electronic document in one way or another. In the following we summarize our observations and put forward a four-layer structure as meta-information for electronic documents.

We maintain that there are four types of information (metadata) that can be used to describe an electronic document. These descriptive information collections form four layers for describing electronic documents or resources along different dimensions, from different views, and for different purposes. The first layer is the information describing the object content, called Content Information. For example, consider the Web page of CNN. The first, major goal of the Web page is to convey a variety of news. The descriptive information serving this goal in general includes headlines, subjects, and introductory paragraphs, together with images, movies, and so forth. Within an article, there are subtitles, section titles, keywords, review comments, etc. By browsing the descriptive information, Web readers can quickly figure out what the content is about.

The second layer contains the data items used to describe matters relevant to an electronic document, such as author, creator, creation date, etc. We may say that most of the attributes defined in the metadata model Dublin Core [9] to some extent belong to this layer. We call them Managerial Information. One of the important attributes within the managerial information is version. Because versioning is now an important measure of the evolutionary process of an object, this attribute is a dynamic factor of the document. The managerial

information is essential when we want to know some “publication” information about the objects of interest. Most of these items provide classification and categorization information for the management of objects.

Referencing Information, the third layer, comes from the “hyperlinks” appearing in an electronic document. We extend “links” to a more general concept to represent “reference links” to any Web information, documents, and resources. So the environment information can also be called reference information, which means that there may be other objects or resources associated with the focused object and used for detailed descriptions of the object. This information also contains, for example, the structure of an electronic document. For example, a paper’s structure is presented with a set of links to its various chapters, which appear in other Web resources. The referencing structure for an electronic document can be hierarchical, where an upper object has several links to its child objects, or neighboring, as in an electronic map where a number of links lead from a spot to its neighbors in the four directions.

Fig. 1 Related information about an electronic document

The final layer is the Carrier Information, which in general provides the physical attributes of an electronic document. The information includes, for example, fonts, color, size, and so on. These pieces of information become important when some semantic meanings are assigned to them. For example, “bold face” of a text may mean that the text is emphasized. In addition, in email systems, people may wish to control the size of emails. Template (layout) information of electronic documents is also included in the carrier information, because it can help to manage different formats for documents.

2.2 Basic Metadata Model/Natural Tree Model

2.2.1 Basic Metadata Model

Through analysis of electronic documents and the existing metadata modeling methods, we find that there are some elements or element types in electronic documents. First, any document can be seen as a resource. Second, documents are linked together by some relationships, for example, the button in PowerPoint documents. Third, each document can be described by a set of descriptive data, usually called metadata. Therefore, we can define a basic metadata model, which contains primitive constructs for electronic documents. The primitive constructs of this model are object, relationship, and attribute. Now we define the basic metadata model, denoted BDM.

Def 1 A basic metadata model is a triple, denoted as $\mathrm{BDM}\ \langle O, R, A \rangle$, where $O$ is a set of object types, $R$ is a set of relationship types which relate one document to another, and $A$ is a set of attribute types which describe a document.

Def 2 An instance of the basic metadata model BDM is called a BDM schema, denoted as $\mathrm{BDMS}\ \langle \mathit{sn}, o, r, a \rangle$, where $\mathit{sn}$ is the unique name of the schema, $o$ is a set of objects, $r$ is a set of relationships, and $a$ is a set of attributes.

The aim of defining BDM schemas is to support the analysis of instance documents.

Def 3 A BDM object is a triple, denoted as $\mathrm{BDM}\ O\ \langle \mathit{on}, \mathit{or}, \mathit{oa} \rangle$, where $\mathit{on}$ is the name, $\mathit{or}$ is the set of relationships from and to the object $O$, and $\mathit{oa}$ is a set of attributes describing the object $O$.

Def 4 A BDM relationship is a triple, denoted as $\mathrm{BDM}\ R\ \langle \mathit{rn}, \mathit{ro}_1, \mathit{ro}_2 \rangle$, where $\mathit{rn}$ is the name, and $\mathit{ro}_1$ and $\mathit{ro}_2$ are BDM objects.

Def 5 A BDM attribute is a triple, denoted as $\mathrm{BDM}\ A\ \langle \mathit{an}, \mathit{ao}, \mathit{av} \rangle$, where $\mathit{an}$ is the name, $\mathit{ao}$ is the BDM object that $A$ describes, and $\mathit{av}$ is a set of values.

2.2.2 Natural Tree Structure Model

In information analysis and representation, the information distribution structure is very important. Usually, people tend to organize documents according to certain criteria, for example, addressing the same subject or belonging to the same type. In organizing electronic pages, the “identifier” for a page is usually its URL (Uniform Resource Locator), giving a path (usually globally unique) to the page. Along the path, there may be more documents, each having its own URL but sharing, for example, the same domain name. In this sense, we consider a tree-like structure associated with such path-based URLs or, more generally, URIs (Uniform Resource Identifiers). When we search for an electronic page, we may actually receive the other pages on the path to the page we require. More importantly, these pages may give us a better understanding of the meaning of the required page. In other words, the electronic documents on the same path or referencing link to the required document tend to have stronger semantic relations and tend to be clustered in one class.

For example, this paper, “A Semantic Similarity Approach to Electronic Document Modeling and Integration”, is found at http://www.eti.hku.hk/pu bdmeta-data/electronicdocuments/WebDocModel.html. We may reasonably assume that the other documents found in the directory electronicdocuments/ deal with similar problems related to metadata, electronic documents, etc. Here we deduce a group of documents having a closely related meaning or subject from the keywords or subjects of one of the documents.

Now we define this natural tree structure using BDM. It is obvious that the natural tree structure is a sub-model of BDM, because in the natural tree structure, although the objects can be any Web resources, the

relationships between the objects follow the grouping semantics given by the paths or the fragments of the paths to the Web objects.

Def 6 A path tree structure, denoted as $\mathrm{PTS}\ \langle O, R', A \rangle$, is a sub-model of BDM, where $O$ and $A$ are defined as in BDM while $R'$ is a subset of $R$. There is a particular object in PTS called the root object. There is a set of objects in PTS called leaf objects.

Def 7 A branch relationship type $R'$ in PTS is defined as $R'\ \langle \mathit{rn}, \mathit{ro}_1, \mathit{ro}_2 \rangle$, where $\mathit{rn}$ is the name cut from the path name, and $\mathit{ro}_1$ and $\mathit{ro}_2$ are respectively a parent object and a child object.

2.3 WDM: A Conceptual Data Model for Electronic Documents

2.3.1 Metadata elements

In section 2.1, we described the four layers of metadata information for electronic documents, i.e., content information, managerial information, referencing information, and carrier information. These four layers of metadata information constitute the fundamental components of a conceptual metadata model, which is considered to be an extension of the basic metadata model, BDM. The four layers of metadata information describing electronic documents can be either relationship types between electronic documents or attribute types describing electronic documents.

Now let us have a detailed look at what is included within the four layers of metadata information.

Metadata on content (Content Information) include topic, title, subject, abstract, keyword, heading, sub-heading, content-type¹, etc. These data are considered to be attribute types.

Metadata on managerial elements (Managerial Information) include author, date, creator, version, edition, publisher (if any), number of pages, etc. These metadata are considered to be attribute types.

Metadata on referencing information include various links, connections, and other relationships. So these metadata will mainly be part of the relationships in BDM.

Metadata on carriers (Carrier Information) include document types (e.g., Word), fonts, faces, media, etc. These are also considered to be attribute types.

In reality, we divide the metadata information into two levels. The level-one metadata information about an object provides direct evidence of what the object is, while the level-two metadata only provides information about what the object could be. For example, for the metadata on content, we may say that titles, subjects, or keywords are level-one attributes, whereas content-type and sub-headings are level-two attributes because their contribution to the semantic identity of the Web object² is not as significant as that of the level-one attributes.

¹ Content-type indicates what kind of narrative type is related to the document, such as report, memo, minutes, etc.
² When no obvious misunderstanding arises we use the terms electronic document, electronic object, and electronic resource interchangeably.

Similarly, the two-level division is also valid for the metadata on managerial information and the metadata on carriers. In reality, whether a metadata item belongs to the first level or the second level depends on the users’ judgement of how important a role the attribute plays in the semantic identification of the object. Therefore such a division is highly subjective. The division serves the purpose of electronic document clustering and integration. It is true in reality that some characteristics are more important than other metadata in identifying objects.

2.3.2 Metadata Model: EDM

As we discussed previously, an electronic document contains many descriptive items, called metadata. Some metadata are more important and useful in identifying an electronic document than other metadata. Therefore the former metadata are the level-one attributes and the latter the level-two attributes. In the following, we give formal definitions of EDM, the electronic document metadata model. First of all, we can say roughly that EDM is a sub-model of BDM, because the set of objects in EDM (only electronic documents) is a subset of the set of objects in BDM (any objects).

Def 8 (WDM) The metadata model WDM, denoted as $\mathrm{WDM}\ \langle \mathit{DO}, \mathit{DR}, \mathit{DA} \rangle$, is a sub-model of BDM, where $\mathit{DO}$ is a subset of $O$, $\mathit{DR}$ a subset of $R$, and $\mathit{DA}$ a subset of $A$.

Fig. 2 Meta-model of WDM, PTS, and BDM.

In the context of the Web application, $\mathit{DO}$ is a set of electronic documents, $\mathit{DR}$ a set of relationships linking one electronic document to another, and $\mathit{DA}$ a set of attributes describing electronic documents.

Def 9 A WDM document is a triple, denoted as $\mathrm{WDM}\ \mathit{DO}\ \langle \mathit{dn}, \mathit{dr}, \mathit{da} \rangle$, where $\mathit{dn}$ is the document identifier, $\mathit{dr}$ is a set of references from and to the document, and $\mathit{da}$ is a set of attributes describing the document.

Def 10 A WDM reference is a triple, denoted as $\mathrm{WDM}\ R\ \langle \mathit{rn}, \mathit{rd}_1, \mathit{rd}_2 \rangle$, where $\mathit{rn}$ is the name, and $\mathit{rd}_1$ and $\mathit{rd}_2$ are WDM documents.

Def 11 A WDM attribute is a triple, denoted as $\mathrm{WDM}\ A\ \langle \mathit{an}, \mathit{ad}, \mathit{av} \rangle$, where $\mathit{an}$ is the name, $\mathit{ad}$ is the WDM document that $A$ describes, and $\mathit{av}$ is a set of values associated with the document $\mathit{ad}$ through $A$.

These definitions are illustrated graphically by the (self-explanatory) meta-model in Fig. 2.

2.3.3 Example of WDM

In this section, we consider an example to illustrate the concepts proposed in the previous section. The example is a fragment of a Web document displaying news. Here we try to extract the metadata from the document for describing the document.

In our opinion, the main requirements that should be met by a conceptual metadata model such as EDM include: 1) easy to understand and to use, 2) capable of representing other metadata models, and 3) well-defined for electronic document analysis and classification. One of the reasons for proposing these requirements is that the metadata model should be general and expressive enough to translate the various object types and relationships existing in electronic documents, and hence make it easier to represent them for purposes of analysis and classification [11].

3 Semantic Relations and Similarities

The EDM described in the last section provides an important modeling basis for clustering electronic documents. In this section, we discuss various relatedness and similarity relations between electronic documents. As we know, electronic documents can be related to each other in a variety of ways. Possibly two documents have exactly the same content, or their headings indicate that one document is a follower of another (like section 1 and section 2), or their links reveal a referencing relation (such as a detailed explanation of a phrase). These situations are considered to be a motivation for electronic document clustering or integration.

Our consideration of electronic document relatedness and similarity relations is based on the assumption that two electronic documents are more strongly related (probably more similar) to each other if their components have more in common. In other words, two electronic documents are similar if their attributes and relationships are respectively similar. In order to cluster electronic documents together based on their relatedness relations, we also assume that WDM schemas can be included in the same cluster if some of their attributes are partially the same.
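The assumption that documents are more similar when their attributes and relationships have more in common can be sketched in Python as follows. This is only an illustration: the text above does not fix a particular measure, so the Jaccard overlap used here, the equal weighting of attributes and references, and the dictionary keys `"attributes"` and `"references"` are all assumptions introduced for the example.

```python
def jaccard(a, b):
    """Share of common items between two collections.

    Returns 0.0 when both are empty, on the assumption that documents
    with nothing to compare give no evidence of relatedness.
    """
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def document_similarity(doc_a, doc_b, attribute_weight=0.5):
    """Combine attribute overlap and reference overlap into one score.

    Each document is a dict with two hypothetical keys:
      "attributes": set of (name, value) pairs
      "references": set of identifiers of referenced documents
    attribute_weight is an illustrative choice, not a value from the paper.
    """
    attribute_sim = jaccard(doc_a["attributes"], doc_b["attributes"])
    reference_sim = jaccard(doc_a["references"], doc_b["references"])
    return attribute_weight * attribute_sim + (1 - attribute_weight) * reference_sim
```

Under this sketch, two documents could be placed in the same cluster when their score exceeds a chosen threshold, which corresponds to the case where some attributes are partially the same. A weighting that favours level-one attributes over level-two attributes would be a natural refinement.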
