The MATE Markup Framework


Laila DYBKJÆR and Niels Ole BERNSEN
Natural Interactive Systems Laboratory, University of Southern Denmark
Science Park 10, 5230 Odense M, Denmark
laila@nis.sdu.dk, nob@nis.sdu.dk

Abstract

Since early 1998, the European Telematics project MATE has worked towards facilitating re-use of annotated spoken language data, addressing theoretical issues and implementing practical solutions which could serve as standards in the field. The resulting MATE Workbench for corpus annotation is now available as licensed open source software.

This paper describes the MATE markup framework which bridges between the theoretical and the practical activities of MATE and is proposed as a standard for the definition and representation of markup for spoken dialogue corpora. We also present early experience from use of the framework.

1 Introduction

[...] proliferate in the market, commercial and research applications constantly increasing in variety and sophistication. These developments generate a growing need for tools and standards which can help improve the quality and efficiency of product development and evaluation. In the case of spoken language dialogue systems (SLDSs), for instance, the need is obvious for standards and standard-based tools for spoken dialogue corpus annotation and automatic information extraction. Information extraction from annotated corpora is used in SLDS engineering for many different purposes. For several years, annotated speech corpora have been used to train and test speech recognisers. More recently, corpus-based approaches are being applied regularly to other levels of processing, such as syntax and dialogue. For instance, annotated corpora can be used to construct lexicons and grammars or train a grammar to acquire preferences for frequently used rules. Similarly, programs for dialogue act recognition and prediction tend to be based on annotated corpus data. Evaluation of user-system interaction and dialogue success is also based on annotated corpus data.
As SLDSs and other language products become more sophisticated, the demand will grow for corpora with multilevel and cross-level annotations, i.e. annotations which capture information in the raw data at several different conceptual levels or mark up phenomena which refer to more than one level. These developments will inevitably increase the demand for standard tools in support of the annotation process.

The production (recording, transcription, annotation, evaluation) of corpus data for spoken language applications continues to be time consuming and costly. So is the construction of tools which facilitate annotation and information extraction. It is therefore desirable that already available annotated corpora and tools be used whenever possible. Re-use of annotated data and tools, however, confronts systems developers with numerous problems which basically derive from the lack of common standards. So far, language engineering projects usually have either developed the needed resources from scratch using homegrown formalisms and tools, or painstakingly adapted resources from previous projects to novel purposes.

In recent years, several projects have addressed annotation formats and tools in support of annotation and information extraction (for [...]). Some projects have addressed the issue of markup standardisation from different perspectives. Examples are the Text Encoding Initiative (TEI) [...], the Corpus Encoding Standard (CES) (http://www.cs.vassar.edu/CES/), and the European Advisory Group for Language Engineering Standards (EAGLES) (http://www.ilc.pi.cnr.it/EAGLES96/home.html). Whilst these initiatives have made good progress on written language and current coding practice, none of them have focused on the creation of standards and tools for cross-level spoken language corpus annotation. It is only recently that there has been a major effort in this domain.

The project Multi-level Annotation Tools Engineering (MATE) (http://mate.nis.sdu.dk) was launched in March 1998 in response to the need for standards and tools in support of creating, annotating, evaluating and exploiting spoken language resources. The central idea of MATE has been to work on both annotation theory and practice in order to connect the two through a flexible framework which can ensure a common and user-friendly approach across annotation levels. On the tools side, this means that users are able to use level-independent tools and an interface representation which is independent of the internal coding file representation.

This paper presents the MATE markup framework and its use in the MATE Workbench. In the following, Section 2 briefly reviews the MATE approach to annotation and tools standardisation. Section 3 presents the MATE markup framework. Section 4 concludes the paper by reporting on early experiences with the practical use of the markup framework and discussing future work.

2 The MATE Approach

This section first briefly describes the creation of the MATE markup framework and a set of example best practice coding schemes in accordance with the markup framework. Then it describes how a toolbox (the MATE Workbench) has been implemented to support the markup framework by enabling annotation on the basis of any coding scheme expressed according to the framework.

2.1 Theory

The theoretical objectives of MATE were to specify a standard markup framework and to identify or, when necessary, develop a series of best practice coding schemes for implementation in the MATE Workbench.
To these ends, we began by collecting information on a large number of existing annotation schemes for the levels addressed in the project, i.e. prosody, (morpho-)syntax, co-reference, dialogue acts, communication problems, and cross-level issues. Cross-level issues are issues which relate to more than one annotation level. Thus, for instance, prosody may provide clues for a variety of phenomena in semantics and discourse. The resulting report (Klein et al., 1998) describes more than 60 coding schemes, giving details per scheme on its coding book, the number of annotators who have worked with it, the number of annotated dialogues/segments/utterances, evaluation results, the underlying task, a list of annotated phenomena, and the markup language used. Annotation examples are provided as well.

We found that the amount of pre-existing work varies enormously from level to level. There was, moreover, considerable variation in the quality of the descriptions of the individual coding schemes we analysed. Some did not include a coding book, others did not provide appropriate examples, some had never been properly evaluated, etc. The differences in description made it extremely difficult to compare coding schemes even for the same annotation level, and constituted a rather confused and incomplete basis for the creation of standard re-usable tools within, as well as across, levels.

The collected information formed the starting point for the development of the MATE markup framework which is a proposal for a standard for the definition and representation of markup for spoken dialogue corpora (Dybkjær et al., 1998). We analysed both the information which came with the existing coding schemes and the information which was found missing. This analysis provided input to our proposal for a minimal set of information items which should be provided for a coding scheme to make it generally comprehensible and re-usable by others.
For instance, a prescriptive coding procedure was included among the information items in the MATE markup framework despite the fact that most existing coding schemes did not come with this information. This list of information items, which we call a coding module, is the core concept of the MATE markup framework and extends and formalises the concept of a coding scheme. The ten entries which constitute a coding module are shown in Figure 4. Roughly speaking, a coding module includes or describes everything that is needed in order to perform a certain kind of markup of spoken language corpora. A coding module prescribes what constitutes a coding, including the representation of markup and the relations to other codings. Thus, the MATE coding module is a proposal for a standardised description of coding schemes.

The above-mentioned five annotation levels and the issues to do with cross-level annotation were selected for consideration in MATE because they pose very different markup problems. If a common framework can be established and shown to work for those levels and across them, it would seem likely that the framework will work for other levels as well.

For each annotation level, one or more existing coding schemes were selected to form the basis of the best practice coding modules implemented in the MATE Workbench (Mengel et al., 2000). The selected coding schemes are among the most widely used for their respective levels in current practice, each having been used by several annotators and for the annotation of many dialogues. Since all MATE best practice coding schemes are expressed in terms of coding modules, they should contain sufficient information for use by other annotators. Their uniform description in terms of coding modules makes it easy for the annotator to work on multiple coding schemes and/or levels, and to compare schemes since these all contain the same categories of information. The use of coding modules also facilitates use of the same set of software tools and enables the same interface look-and-feel independently of level.

2.2 Tooling

The engineering objective of MATE has been to specify and implement a generic annotation tool in support of the markup framework and the selected best practice coding schemes. Several existing annotation tools were reviewed early on to gather input for the MATE Workbench specification (Isard et al., 1998).
Building on this specification, the MATE markup framework and the selected coding schemes, a Java-based workbench has been implemented (Isard et al., 2000) which includes the following major functionalities:

- The MATE best practice coding modules are included as working examples of the state of the art. Users can add new coding modules via the easy-to-use interface of the MATE coding module editor.
- An audio tool enables listening to speech files and having sound files displayed as a waveform.
- For each coding module, a default style sheet defines how output to the user is presented visually. Phenomena of interest in the corpus may be shown in, e.g., a certain colour or in boldface. Users can modify style sheets and define new ones.
- The workbench enables information extraction of any kind from annotated corpora. Query results are shown as sets of references to the queried corpora. Extraction of statistical information from corpora, such as the number of marked-up nouns, is also supported.
- Computation of important reliability measures, such as kappa values, is enabled.
- Import of files from XLabels and BAS Partitur to XML format is supported in order to demonstrate the usefulness of importing widely used annotation formats for further work in the Workbench. Similarly, a converter from Transcriber format (http://www.etca.fr/CTA/gip/Projets/Transcriber/) to MATE format enables transcriptions made using Transcriber to be annotated using the MATE Workbench. Other converters can easily be added. Export to file formats other than XML can be achieved by using style sheets. For example, information extracted by the query tool may be exported to HTML to serve as input to a browser.
- On-line help is available at any time.

The first release of the MATE Workbench appeared in November 1999 and was made available to the 80 members of the MATE Advisory Panel from across the world. Since then, several improved versions have appeared and in May 2000 access to executable versions of the Workbench was made public.
The MATE Workbench is now publicly available both in an executable version and as open source software at http://mate.nis.sdu.dk. The Workbench is still being improved by the MATE consortium, so new versions will continue to appear. A discussion forum has recently been set up at the MATE web site where colleagues are invited to ask questions and provide information from their experience with the Workbench, including the new tools they have added to the MATE Workbench to enhance its functionality.

We have no exact figures on how many users the Workbench now has, but we know that it is already being used in, or being considered for use in, several European and national research projects.

3 The MATE Markup Framework

The MATE markup framework is a conceptual model which basically prescribes (i) how files are structured, for instance to enable multi-level annotation, (ii) how tag sets are represented in terms of elements and attributes, and (iii) how to provide essential information on markup, semantics, coding purpose etc.

3.1 Files, elements and attributes

When a coding module has been applied to a corpus, the result is a coding file. The coding file has a header which documents the coding context, such as who annotated the file, when, and the experience of the annotator, and a body which lists the coded elements. Figure 1 shows an example of how annotated communication problems are displayed to the user in the MATE Workbench. Figure 2 shows an excerpt of the internal representation of the file.

Figure 1. A screen shot from the MATE Workbench showing a dialogue annotated with communication problems (top left-hand panel). Guidelines for cooperative dialogue behaviour are shown in the top right-hand panel. Communication problems are categorised as types of violations of the cooperativity guidelines. Violation types are shown in the bottom right-hand panel. Notes may be added as part of the annotation. Notes are shown in the bottom left-hand panel.

Figure 2. Excerpt of the internal XML representation of the annotated dialogue shown in Figure 1. The tags will be explained in Section 3.1.1 below.

As shown in Figure 2, the annotated file representation is simply a list of references to the transcription file. The underlying file structure idea is depicted in Figure 3 which shows how coding files (bottom layer) refer to a transcription file and possibly to other coding files, cf.
entry 5 in the coding module in Figure 4. A transcription (which is also regarded as a coding file) refers to a resource file listing the raw data resources behind the corpus, such as sound files and log files. The resource file includes a description of the corpus: purpose of the dialogues, dialogue participants, experimenters, recording conditions, etc. A basic, sequential timeline representation of the spoken language data is defined. The timeline may be expressed as real time, e.g. in milliseconds, or as numbers indicating, e.g., the sequence of utterance starts.

[Figure 3 diagram. Labels: raw data (sound files, log files, text files, other data, e.g. video or pictures); resource file; transcriptions (coding files); coding files (phonetic transcription, dialogue acts, prosody, ...).]

Figure 3. The raw corpus data are listed in the resource file to which transcriptions refer. Coding files at levels other than transcription refer to a transcription and only indirectly to the raw data. Coding files may refer to each other.

Given a coding purpose, such as to identify all communication problems in a particular corpus, and a coding module, the actual coding consists in using syntactic markup to encode the relevant phenomena found in the data. A coding is defined in terms of a tag set. The tag set is conceptually specified by, and presented to, the user in terms of elements and attributes, cf. entry 6 in the coding module in Figure 4. Importantly, workbench users can use this markup directly without having to know about complex formal standards, such as SGML, XML or TEI.

3.1.1 Elements

The basic markup primitive is the element (a term inherited from TEI and SGML) which represents a phenomenon such as a particular phoneme, word, utterance, dialogue act, or communication problem. Elements have attributes and relations to each other both within the current coding module and across coding modules. Considering a coding module M, the markup specification language is described as:

E_1 ... E_n: The non-empty list of tag elements.

For each element E_i, the following properties may be defined:

1. N_i: The name of E_i.
   Example: <u>
2. E_i may contain a list of elements E_j from M.
   Example: <u> may contain <t>: <u><t>Example</t></u>
3. E_i has m_i attributes A_ij, where j = 1 ... m_i.
   Example: <u> has attributes who and id, among others.
4. E_i may refer to elements in coding module M_j, implying that M references M_j.
   Example: a dialogue act coding may refer to phonetic or syntactic cues.

A concrete example is the coding module for communication problems which, i.a., has the element <comprob>, cf. the XML representation in Figure 2. <comprob> has, i.a., the attributes id, uref and vtype. uref is a reference to an utterance in the transcription coding. vtype is a reference to a type of violation of a guideline in the violation type coding. Due to the inflexibility of XML, this logical structure has to be represented slightly differently internally in the workbench. Thus, the uref corresponds to the first href in

Figure 2 while vtype is wrapped up in a new element and corresponds to the second href.

3.1.2 Attributes

Attributes are assigned values during coding. For each attribute A_ij the type of its values must be defined. There are standard attributes, user-defined attributes, and hidden attributes, as follows.

Standard attributes are attributes prescribed by MATE.

id [mandatory]: ID. The element id is composed of the element name and a machine-generated number.
Example: id=n_123

Time start and end are optional. Elements must have time information, possibly indirectly by referencing other elements (in the same coding module or in other modules) which have time information.

TimeStart [optional]: TIME. Start of event.
TimeEnd [optional]: TIME. End of event.

User-defined attributes are used to parametrise and extend the semantics of the elements they belong to. For instance, who is an attribute of element <u> designating by whom the utterance is spoken. There will be many user-defined attributes (and elements), cf., e.g., the uref and vtype mentioned above.

Hidden attributes are attributes which the user will neither define nor see but which are used for internal representation purposes. An example is the following attribute, which lets coding elements refer to utterances in a transcription. Its form depends on the technical programming choices made for the underlying representation, which is not exposed to the user:

ModuleRefs CDATA 'href:transcription#u'

See Figure 2 for a concrete example from the MATE Workbench.

3.1.3 Attribute standard types

The MATE markup framework proposes a set of predefined attribute value types (attributes are typed) which are supported by the workbench. By convention, types are written in capitals. The included standard types are:

TIME: in milliseconds, as a sequence of numbers, or as named points on the timeline. Values are numbers or identifiers, and the declaration of the timeline states how to interpret them.
Example: time=123200 dur=1280 (these are derived values, with time = TimeStart, and dur = TimeEnd - TimeStart).
HREF[MODULE, ELEMENTLIST]: Here MODULE is the name of another coding module, and ELEMENTLIST is a list of names of elements from MODULE. When applied as concrete attribute values, two parameters must be specified:
- The name of the referenced coding file which is an application of the declared MODULE coding module.
- The id of the element occurrence that is referred to.
The values of this attribute are of the [...]
Example: [...]ion, u) allows an attribute used as, e.g., OccursIn="base#u_123", where base is a coding file using the transcription module and u_123 is the value of the id attribute of a u element in that file.
Example: For the declaration who: HREF[transcription, participant] an actual occurrence may look like who="#participant2", where the omitted coding file name by convention generically means the current coding file.
The concept of hyper-references together with parameters referencing coding modules (see point 5 in Figure 4) is what enables coding modules to handle cross-level markup.

ENUM: A finite closed set of values. Values are of the form: "(" Identifier ( "|" Identifier )* ")"
Example: time: (year|month|day|hour) allows attributes such as time=day.
The user may be allowed to extend the set, but never to change or delete values from the set.

TEXT: Any text not containing '"' (which is used to delimit the attribute value).
Example: The declaration desc: TEXT allows uses such as: <event desc="Door is slammed">.

ID: Automatically generated id for the element. Only available in the automatically added attribute id.
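The rules stated above for ENUM, TEXT and TIME can be illustrated with a small Python sketch. It is not taken from the MATE Workbench: the names EnumType, is_valid_text and derived_time are hypothetical, the TEXT delimiter is assumed to be the plain double quote, and time values are assumed to be given in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnumType:
    """ENUM: a finite set of identifiers that may be extended but never reduced."""
    values: tuple

    def accepts(self, value: str) -> bool:
        # A value is legal only if it is one of the declared identifiers.
        return value in self.values

    def extended(self, new_value: str) -> "EnumType":
        # Extension returns a larger set; existing values are never
        # changed or deleted, as the framework requires.
        if new_value in self.values:
            return self
        return EnumType(self.values + (new_value,))


def is_valid_text(value: str) -> bool:
    """TEXT: any text that does not contain the attribute value delimiter."""
    return '"' not in value


def derived_time(time_start: int, time_end: int) -> dict:
    """Derive time and dur from TimeStart and TimeEnd (assumed milliseconds)."""
    if time_end < time_start:
        raise ValueError("TimeEnd must not precede TimeStart")
    return {"time": time_start, "dur": time_end - time_start}


# Usage with the values from the examples in this section.
unit = EnumType(("year", "month", "day", "hour"))
print(unit.accepts("day"))
print(unit.extended("week").accepts("week"))
print(is_valid_text("Door is slammed"))
print(derived_time(123200, 124480))
```

Modelling ENUM as an immutable value with an extended method is one way to make the "extend but never change or delete" rule hard to violate by accident. Other designs are equally possible.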

3.2 Coding modules

In order for a coding module and the dialogues annotated using it to be usable and understandable by people other than its creator, some key information must be provided. The MATE coding module, which is the central part of the markup framework, serves to capture this information. A coding module consists of the ten items shown in Figure 4.

1. Name of the module.
2. Coding purpose of the module.
3. Coding level.
4. The type of data source scoped by the module.
5. References to other modules, if any. For transcriptions, the reference is to a resource.
6. A declaration of the markup elements and their attributes. An element is a feature, or type of phenomenon, in the corpus for which a tag is being defined.
7. A supplementary informal description of the elements and their attributes, including:
   a. Purpose of the element, its attributes, and their values.
   b. Informal semantics describing how to interpret the element and attribute values.
   c. Example of each element and attribute.
8. An example of the use of the elements and their attributes.
9. A coding procedure.
10. Creation notes.

Figure 4. Main items of the MATE coding module.

Some coding module items have a formal role, i.e. they can be interpreted and used by the MATE Workbench. Thus, items (1) and (5) specify the coding module as a named parametrised module or class which builds on certain other predefined modules (no cycles allowed). Item (6) specifies the markup to be used in the coding. All elements, attribute names, and ids have a name space restricted to their module and its coding files, but are publicly referable by prefixing the name of the coding module or coding file in which they occur. Other items provide directives and explanations to users. Thus, (2), (3) and (4) elaborate on the module itself, (7) and (8) elaborate on the markup, and (9) recommends coding procedure and quality measures.
(10) provides information about the creation of the coding module, such as by whom and when.

In the following, we show an abbreviated version of a coding module for communication problems to illustrate the 10 coding module entries.

Name: Communication problems.

Coding purpose: Records the different ways in which generic and specific guidelines are violated in a given corpus. A communication problems coding file refers to a problem type coding file as well as to a transcription.

Coding level: Communication problems.

Data sources: Spoken human-machine dialogue corpora.

Module references: Module Basic orthographic transcription; Module Violation types.

Markup declaration:
    ELEMENT comprob
    ATTRIBUTES
        vtype: REFERENCE(Violation types, vtype)
        wref: REFERENCE(Basic orthographic transcription, (w,w) )
        uref: REFERENCE(Basic orthographic transcription, u )
        caused_by: REFERENCE(this, comprob)
        temp: TEXT
    ELEMENT note
    ATTRIBUTES
        wref: REFERENCE(Basic orthographic transcription, (w,w) )
        uref: REFERENCE(Basic orthographic transcription, u )

[...]s produced by inadequate system utterance design we use the element comprob. It refers to some kind of violation of one of the guidelines listed in Figure 1, top right-hand panel. The comprob element may be used to mark up any part of the dialogue which caused, or might cause, a communication problem. Thus, comprob may be used to annotate one or more words, an entire utterance, or even several utterances in which an actual or potential communication problem was detected. The comprob element has five attributes in addition to the automatically added id.

The attribute vtype is mandatory. vtype is a reference to a description of a guideline violation in a file which contains the different kinds of violations of the individual guidelines. Either wref or uref must be indicated. Both these attributes refer to an orthographic transcription. wref delimits the word(s) which caused or might cause a communication problem, and uref refers to one or more entire utterances which caused or might cause a problem.

We stop the illustration here due to space limitations. The full description is available in (Mengel et al. 2000).

Example:
In the following snippet of a transcription from the Sundial corpus:

    <u id="S1:7-1-sun" who="S">flight information british airways good day can I help you</u>

communication problems are marked up as follows:

    <comprob id="3" vtype="Sundial_problems#SG4-1" uref="Sundial#S1:7-1-sun"/>

We do not exemplify note here and do not show the violation type coding file due to space limitations. However, note that once a coding module is specified in the MATE Workbench, the user does not have to bother about the markup shown in the example above. The user just selects the utterance to mark up and then clicks on the violation type palette. If the violation is of a new type, the user instead clicks on the violated cooperativity guideline, which adds a new violation type for which a textual description can be entered, cf. Figure 1.

Coding procedure: We recommend using the same coding procedure for markup of communication problems as for violation types since the two actions are tightly connected. As a minimum, the following procedure should be followed:
1. Encode by coders 1 and 2.
2. Check and merge codings (performed by coders 1 and 2 until consensus).

Creation notes:
Authors: Hans Dybkjær and Laila Dybkjær.
Version: 1 (25 November 1998), 2 (19 June 1999).
Comments: For guidance on how to identify communication problems and for a collection of examples the reader is invited to look at (Dybkjær 1999).
Literature: (Bernsen et al. 1998).

The MATE Workbench allows its users to specify a coding module via a coding module editor. A screen shot of the coding module editor is shown in Figure 5.

Figure 5. The MATE coding module editor.
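To illustrate how a coding file points into a transcription, the following Python sketch resolves the uref of the comprob example from the Sundial corpus. It is a minimal sketch under stated assumptions, not the Workbench implementation: the XML punctuation of the two snippets is reconstructed, the underscore in the violation type file name is assumed, the root element names (transcription, problems) are invented for the illustration, and the dictionary that maps coding file names to parsed files is a hypothetical stand-in for the real file handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical coding file contents; only the <u> and <comprob>
# elements follow the example in Section 3.2.
TRANSCRIPTION = """<transcription>
  <u id="S1:7-1-sun" who="S">flight information british airways good day can I help you</u>
</transcription>"""

PROBLEMS = """<problems>
  <comprob id="3" vtype="Sundial_problems#SG4-1" uref="Sundial#S1:7-1-sun"/>
</problems>"""


def split_href(value, current_file):
    """Split 'file#id' into (file, id).

    An omitted file name refers to the current coding file,
    following the convention described in Section 3.1.3.
    """
    file_name, separator, element_id = value.partition("#")
    if not separator or not element_id:
        raise ValueError(f"not a hyper-reference: {value!r}")
    return (file_name or current_file, element_id)


def resolve(value, current_file, files):
    """Return the element that a hyper-reference value points to."""
    file_name, element_id = split_href(value, current_file)
    root = files[file_name]
    for element in root.iter():
        if element.get("id") == element_id:
            return element
    raise KeyError(f"no element with id {element_id!r} in {file_name!r}")


# Usage: follow the uref of the comprob element to the utterance it annotates.
files = {"Sundial": ET.fromstring(TRANSCRIPTION)}
comprob = ET.fromstring(PROBLEMS).find("comprob")
utterance = resolve(comprob.get("uref"), "Sundial_comprob", files)
print(utterance.get("who"), utterance.text)
```

The vtype reference could be followed in the same way once a violation type coding file is available. It is left unresolved here because that file is not shown in the paper.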

4 Early Experience and Future Work

The MATE markup framework has been well received for its transparency and flexibility by the colleagues on the MATE Advisory Panel. The framework has been used to ensure a common description of coding modules at the MATE coding levels and has turned out to work well for all these levels. We therefore conclude that the framework is likely to work for other annotation levels not addressed by MATE. The use of a common representation and a common information structure in all coding modules at the same level as well as across levels facilitates within-level comparison, creation and use of new coding modules, and working at multiple levels.

On the tools side, the markup framework has not been fully exploited as intended, i.e. as an intermediate layer between the user interface and the internal representation. This means that the user interface for adding new coding modules, in particular for the declaration of markup, and for defining new visualisations is still sub-optimal from a usability point of view.

The coding module editor which is used for adding new coding modules represents a major step forward compared to requiring users to write DTDs. The coding module editor automatically generates a DTD from the specified markup declaration. However, the XML format used for the underlying file representation has not been hidden completely from the editor's interface. Peculiarities and lack of flexibility in XML have been allowed to influence the way in which users must specify elements and attributes, making the process less logical and flexible than it could have been. It is high on our wish list to repair this shortcoming.

As regards coding visualisation, XSLT-like style sheets are used to define how codings are displayed to the user. Writing style sheets, however, is cumbersome and definitely not something users should be asked to do to define how codings based on a new coding module should be displayed.
We either need a style sheet editor comparable to the coding module editor as regards ease of use, or, alternatively, a completely new interface concept should be implemented to replace the style sheets and enable users to easily define new visualisations. It is high on our wish list to better exploit the markup framework in the Workbench implementation in order to achieve a better user interface.

Other frameworks have been proposed but to our knowledge the MATE markup framework is still the most comprehensive framework around. One such framework is the annotation framework recently proposed by Bird and Liberman (1999) which is based on annotation graphs. These are now being used in the ATLAS project (Bird et al., 2000) and in the Transcriber tool (Geoffrois et al., 2000). The annotation graphs serve as an intermediate representation layer in agreement with the argument above for having an intermediate layer of representation between the user interface and the internal representation. Whilst Bird and Liberman do not consider coding modules or discuss the interface from a usability point of view, they present [...]tion and time line reference. The two frameworks may, indeed, turn out to complement each other nicely.

Acknowledgements

We gratefully acknowledge the support for the MATE project provided by the [...]ng Programme. We would also like to thank all MATE partners. Without the very considerable joint efforts of the project consortium it would not have been possible to build the MATE Workbench.

References

Note on MATE deliverables: like the MATE Workbench, these are all ob[...]
