Longman Dictionary Of Contemporary English To Longman .

Transcription

TOWARDS A DICTIONARY S U P P O R T E N V I R O N M E N TFOR R E A L T I M E PARSINGHiyan Alshawi, Bran Boguraev, Ted BriscoeComputer Laboratory, Cambridge UniversityCorn Exchange StreetCambridge CB2 3QG, U.K.ABSTRACTbatch processing (eg. Amsler, 1981; Walker &Amsler, 1983), or on developing dictionary servers foroffice automation systems (Kay, 1984b). Few parsingsystems have substantial lexicons and even thosewhich employ very comprehensive grammars (eg.Robinson, 1982; Bobrow, 1978) consult relativelysmall lexicons, typically generated by hand. Twoexceptions to this generalisation are the LinguisticString Project (Sager, 1981) and the Epistle Project(Heidorn et al., 1982); the former employs adictionary of less than 10,000 words, most of whichare specialist medical terms, the latter has well over100,000 entries, gathered from machine-readablesources, however, their grammar formalism and thelimited grammatical information supplied by thedictionary makethisachievement,thoughimpressive, theoretically less interesting.In this article we describe research on thedevelopment of large dictionaries for naturallanguage processing. We detail the development of adictionarysupportenvironmentlinkingarestructrured version of the Longman Dictionary ofContemporary English to natural languageprocessing systems. We describe the process ofrestructuring the information in the dictionary andour use of the Longman grammar code system toconstruct dictionary entries for the PATR-II parsingsystem and our use of the Longman word definitionsfor automated word sense classification.INTRODUCTIONRecent developments in linguistics, andespecially on grammatical theory - for example,Generalised Phrase Structure Grammar' (GPSG)(Gazdar et al., In Press), Lexical FunctionalGrammar (LFG) (Kaplan & Bresnan, 1982) - and onnatural language parsing frameworks - for example,Functional Unification Grammar (FUG) (Kay,1984a), PATR-II (Shieber, 1984) - make it feasible toconsider the implementation of efficient systems forthe syntactic analysis of substantial fragments ofnaturallanguage. These developments alsodemonstrate that if natural language processingsystems are to be able to handle the grammatical andlogical idiosyncracies of individual lexical itemselegantly and efficiently, then the lexicon must be acentral component of the parsing system. Real-timeparsing imposes stringent requirements on adictionary support environment; at the very least itmust allow frequent and rapid access to theinformation in the dictionary via the dictionary headwords.We chose to employ the Longman Dictionaryof Contemporary English (Procter 1978, henceforthLDOCE) as the machine-readable source for ourdictionary environment because this dictionary hasseveral properties which make it uniquelyappropriate for use as the core knowledge base of anatural language processing system. Most isations of the 60,000 entries, the largeamount of information concerning phrasal verbs,noun compounds and idioms, the individual subject,collocational and semantic codes for the entries andthe consistent use of a controlled 'core' vocabulary indefining the words throughout the dictionary.(Michiels (1982) gives further description anddiscussion of LDOCE from the perspective of naturallanguage processing.)The problem of utilising LDOCE in naturallanguage processing falls into two areas. Firstly, wemust provide a dictionary environment which linksthe dictionary to our existing natural languageprocessing systems in the appropriate fashion andsecondly, we must restructure the information in thedictionary in such a way that these systems are ableto utilise it effectively. These two tasks form thesubject matter of the next two sections.The idea of using the machine-readablesource of a published dictionary has occurred to awide range of researchers - for spelling correction,lexical analysis, thesaurus construction, machinetranslation, to name but a few applications - very fewhowever have used such a dictionary to support anatural language parsing system. Most of the workon automated dictionaries has concentrated onextracting lexical or other information in, essentially,171

THE ACCESS ENVIRONMENTWe have implemented an efficient dictionaryaccess system which services requests for sexpression entries made by client Cambridge Lispprograms. The lispified file was sorted and convertedinto a random access file together with indexinginformation from which the disc addresses ofdictionary entries for words and compounds can berecovered. Standard database indexing techniqueswere used for this purpose. The current access systemis implemented in the programming language C. Itruns under U N I X and makes use of the random fileaccess and inter-process communication facilitiesprovided by this operating system. ( U N I X is a TradeM a r k of Bell Laboratories.) To the Lisp programmer,the creation of a dictionary process and subsequentrequests for information from the dictionary appearsimply as Lisp function calls.To link the machine-readable version ofLDOCE to existing natural language processingsystems we need to provide fast access from Lisp todata held in secondary storage. Furthermore, thecomplexity of the data structures stored on discshould not be constrained in any way by the methodof access, because we have little idea what form therestructured dictionary may eventually take.Our first task in providing an environmentwas therefore the creation ofa 'lispifed' version o f t h emachine-readable LDOCE file. A batch programwritten in a general editing facility was used toconvert the entrire LDOCE typesetting tape into asequence of Lisp s-expressions without any loss ofgenerality or information. Figure 1 illustrates part ofan entry as it appears in the published dictionary, onthe typesetting tape and after lispification.W e have provided for access to the dictionaryvia head words and the first words of compounds andphrasal verbs, either through the spelling orpronunciation fields. R a n d o m selection of dictionaryentries is also provided to allow the testing ofsoftware on an unbiased sample. This access issufficient to supportourcurrentparsingrequirements but could be supplemented with theaddition of further indexing files if required.Eventually access to dictionary entries will need to beconsiderably more intelligent and flexible than asimple left-to-fight sequential pass through thelexical items to be parsed, if our processing systemsare to m a k e full use of the information concerningcompounds and idioms stored in L D O C E .ul[Tl;X9]tocauseto sten with RIVETsI:. vet228289801 RO154300 rlvet28289902 02 28290005 v 28290107 0100 TI;X9 NAZV H XS28290208 to cause to fasten with28290318 1 R0154300 ! rivet)(2 2 ! ! )(5v! )(7 100 ! T1 !; X9 ! NAZV ! . H---XS)(8 to cause to fasten with*CA RIVET *CB *46 s *44 *8A :.RESTRUCTURING THE DICTIONARYThe lispified LDOCE file retains the broadstructure of the typesetting tape and divides eachentry into a n u m b e r of f e l d shead word,pronunciation, grammar codes, definitions, examplesand so forth. However, each of these fields requiresfurther decoding and restructuring to provide clientprograms with easy access to the information theyrequire (Calzolari (1984) discusses this need). For thispurpose the formatting codes on the typesetting tapeare crucial since they provide clues to the correctstructure of this information. For example, wordsenses are largely defined in terms of the 2000 wordcore vocabulary, however, in some cases other words(themselves defined elsewhere in terms of thisvocabulary) are used. These words always appear insmall capitals and can therefore be recognisedbecause they will be preceded by a font change controlcharacter. In Figure 1 above the definition o f " r i v e t "includes the noun definition of"RIVETI", a s signalledby the font change and the numerical superscriptwhich indicates that it is the noun entry homograph;additional notation exists for word senses withinhomograhps. On the typesetting tape, font control))Figure IThis still leaves the problem of access, fromLisp, to the dictionary entry s-expressions held onsecondary storage. A d hoc solutions, such assequential scanning of files on disc or extractingsubsets of such files which will fit in main m e m o r yare not adequate as an efficient interface to a parser.(Exactly the same problem would occur if our naturallanguage systems were implemented in Prolog, sincethe Prolog 'database facility',refers to the knowledgebase that Prolog maintains in main memory.) Inprinciple, given that the dictionary is now in a Lispreadable format, a powerful virtual m e m o r y systemmight be able to manage access to the internal Lispstructures resulting from reading the entiredictionary; we have, however, adopted an alternativesolution as outlined below.172

characters are indicated within curly brackets byhexadecimal numbers. In addition, there is a furthercomplication because this sense is used in the pluraland the plural morpheme must be removed before"RIVET" can be associated with a dictionary entry.However, the restructuring program can achieve thisbecause such morphology is always italicised,so theprogram knows that in the context of non-corevocabulary items the italic font control charactersignals the occurrence of a morphological variant of aL D O C E head entry.((pair)(1 P0008800 pair)(2 1 )(3 peER)(7 200 C9 !, esp ! "46 of CD-- . . . . J - - - Y )(8 "45 a *44 2 things that are alike or of the samekind !, and are usu ! used together : *46 a pair ofshoes tJ a beautiful pair of legs *44 "63 compare*CA COUPLE "CB *8B *45 b *44 2 playing cards of thesame value but of different *CA SUIT *CB *46 s *8A*44 (3) : *46 a pair of kings)(7 300 GC --- --S-U---Y)(8 *45 a "44 2 people closely connected : *46 a pairof dancers *45 b *CA COUPLE *CB "88 *44 (2)(esp t. in the phr !. *45 the happy pair *44) "45 c*46 sl "44 2 people closely connected who causeannoyance or displeasure : *46 You !'re a fine paircoming as late as this !!)A suite of programs to unscramble andrestructure all the fields in LDOCE entries has beenwritten which is capab e of decoding all the fieldsexcept those providing cross-reference and usageinformation for complete homographs. Figure 2illustrates a simple lexical entry before and after theapplication of these programs.)(Word-sense (Number 2)((Sub-definition(Item a) (Label NIL)(Definition 2 things that are alike or of the samekind !, and are usually used together)((Example NIL (a pair of shoes))(Example NIL (a beautiful pair of legs)))(Cross-referencecompare-with(Ldoce-entry (Lexical COUPLE)(Morphology NIL )(Homograph-number 2)(Word-sense-number NIL)))(Sub-definition(item b) (Label NIL)(Definition 2 playing cards of the same valuebut of different(Ldoce-entry (SUIT)(Morphology s)(Homograph-number 1)(Word-sense-number 3))((Example NIL (a pair of kings))))))(Word-sense (Number 3)((Sub-definition(Item a) (Label NIL)(Definition 2 people closely connected)((Example NIL (a pair of dancers))))(Sub-definition(Item b) (Label NIL)(Definition(Ldoce-entry (Lexical COUPLE )(Morphology NIL)(Homograph-number 2)(Word-sense-number 2))(Gloss: especiat y in the phrase the happy pair )))(Sub-definition(Item c) (Label slang)(Definition 2 people closely connected whocause annoyance or displeasure)((Example NIL(You!' re a fine pair coming as/ate as this!))))))The development of the restructuringprograms is a non-trivial task because theorganisation of information on the typesetting tapepresupposes its'visual presentation, and the ability ofhuman users to apply common sense, utilise basicmorphological knowledge, ignore minor notationalinconsistencies, and so forth. To provide a test-bed forthese programs we have implemented an interactivedictionary browser capable of displaying therestructured information in a variety of ways andrepresenting it in perspicuous and expanded form.To illustrate the problems involved in therestructuring process we will discuss therestructuring of the grammar codes in some detail,however, the reader should bear in mind that thisrepresents only one comparatively constrained fieldof an LDOCE entry and therefore, a small proportionof the overall restructuring task. Figure 3 (Illustratesthe grammar code field for the third word sense of theverb "believe" as it appears in the publisheddictionary, on the typesetting tape and afterrestructuring.Multiple grammar codes are elided andabbreviated in the dictionary to save space andrestructuring must reconstruct the full set of codes.This can be done with knowledge of the syntax of thegrammar code system and the significance ofpunctuation and font changes. For example, semicolons indicate concatenated codes and commasindicate concatenated, elided codes. However,discovering the syntax of the system is dimcult sinceno explicit description is available from Longman andthe code is geared more towards visual presentationthan formal precision; for example, words whichqualify codes, such as "to be" in Figure 3, appear initalics and therefore, will be preceded by the fontcontrol character "45'. But sometimes the thin spaceFigure 2173

believer3complements. Clearly, the restructuring p r o g r a m scannot correct this last type of error, however, wehave developed a system which is sufficiently robustto handle the other problems described above. R a t h e rt h a n apply these programs to the dictionary andcreate a new restructured file, they are applied on ademand basis, as required by the dictionary browseror the other client programs described in the nextsection; this allows us to continue to refine therestructuring programs incrementally as furtherproblems emerge.[TSa,b,V3;X (to be) 1, (to be) 7](7 300 ! T5a ! , b ! ; V3 l; X (*46 to be "44)i !, (*46 to be *44) 7 ! . . . . . . . . )word sense3head: X7xhead: X l xhead: V3head:TSahead:TSbFigure 3USINGcontrol character " 6 4 ' also appears; the insertion ofthis code is based solely on visual criteria, r a t h e rt h a n the informational structure of the dictionary.Similarly, choice of font can be varied for reasons ofappearance and occasionally information normallyassociated with one field of an entry is shifted intoanother to create a more compact or elegant printedentry. In addition to the 'noise' generated by the factt h a t we are working with a typesetting tape geared tovisual presentation, r a t h e r t h a n a database, there areerrors in the use of the g r a m m a r code system; forexample, Figure 4 illustrates the code for the firstsense of the noun "promise".IprOmisenlDICTIONARYOnce the information ia L D O C E has beenrestructured into a format suitable for accessing byclient programs, it stillremains to be shown that thisinformation is of use to our natural languageprocessing systems. In this section, we describe theuse that we have m a d e of the g r a m m a r codes andword sense definitions.GrammarcodesThe g r a m m a r code system used in LDOCE isbased quite closely on the descriptive g r a m m a t i c a lframework of Quirk et al. (1972). The codes aredoubly articulated; capital letters represent theg r a m m a t i c a l relations which hold between a verb n frames which a verb can appear in.(The small letters which appear with some codesrepresent a variety of less important information, forexample, whether a sentential complement will takean obligatory or optional complementiser.) Most ofthe subcategorisation frames are specified bysyntactic category, but some are very ill-specified;forinstance, 9 is defined as "needs a descriptive word orphrase". In practice anything functioning as anadverbial will satisfy this code, w h e n attached to averb. The criteria for assignment of capital letters toverbs is not m a d e explicit, but is influenced by thesyntactic and semantic relations which hold betweenthe verb and its arguments; for example, 15, L5 andT5 can all be assigned to verbs which take a N Psubject and a sentential complement, but 15 will onlybe assigned if there is a fairly close semantic linkbetween the two arguments and T5 will be used inpreference to I5 if the verb is felt to be semanticallytwo place r a t h e r t h a n one place, such as "know"versus "appear". On the other hand, both "believe"and "promise" are assigned V3 which m e a n s theytake a NP object and infinitival complement, yetthere is a similar semantic distinction to be madebetween the two verbs; so the criteria for theassignment of the V code seem to be syntactic.[C (of},C3,5; under UIFigure 4The occurrence of the full code "C3" betweenc o m m a s is incorrect because c o m m a s are clearlyintended to delimit sequences of elided codes. Thistype of error arises because grammatical codes areconstructed by hand and no automatic checkingprocedure is attempted (see Michiels, 1982). Finally,there are errors or omissions in the use of the codes;for example, Figure 5 illustratesthe g r a m m a r codesfor the listedsenses of the verb "upset".upset:for cat vword sense 1word sense 2word sense 3word sense 4THEhead T1head Ihead T1head T1Figure 5These codes correspond to the simpletransitive and intransitive uses of "upset"; no codesare given for the uses of "upset" with sentential174

It is not possible to automatically constructPATR-II dictionary entries for verbs just by m a p p i n gone full g r a m m a r code from the restructured LDOCEentry into a set of templates. However, it turns outt h a t if we compare the full set of g r a m m a r codesassociated with a particular sense of a verb, followinga suggestion of Michiels (1982), then we can constructthe correct set of templates. T h a t is, we can extract allthe information t h a t PATR-II requires concerningthe subcategorisation and semantic type of verbs. Forexample, as we saw above, "believe" under one senseis assigned the codes T5 and V3; the presence of theT5 code tells us t h a t "believe" is a 'raising-to-object'verb and logically two-place under the V3interpretation. On the other hand, "persuade" is onlyassigned the V3 code, so we can conclude t h a t it isthree-place with object control of the infinitive. Bysystematically exploiting the collocation of differentcodes in the same field, it is possible to distinguishthe raising, equi and control properties of verbs. Ineffect, we are utilising w h a t was seen as thetransformational consequences of the semantic typeof the verb within classical generative g r a m m a r .The parsing systems we are interested in allemploy g r a m m a r s which carefully distinguishsyntactic and semantic information of this kind,therefore, if the information provided by theLongman g r a m m a r code system is to be of use weneed to be able to separate out this information andmap it into the representation scheme used for lexicalentries used by one of these parsing systems. Todemonstrate that this is possible we haveimplemented a system which constructs dictionaryentries for the PATR-II system (Shieber, 1984 andreferences therein). PATR-II was chosen because thesystem has been reimplemented in Cambridge andwas therefore, available; however, the task would benearly identical if we were constructing entries for asystem based on GPSG, F U G or LFG.The PATR-H parsing system operates byunifying directed graphs (DGs); the completed parsefor a sentence will be the result of successivelyunifying the DGs associated with the words andconstituents of the sentence according to the rules ofthe g r a m m a r . The DG for a lexical item is constructedfrom its lexical entry which will consist of a set oftemplates for each syntactically distinct variant.Templatesarethemselvesabbreviations forunifications which define the DG. For example, thebasic entry and associated DG for the verb "storm"are illustrated in Figure 6.w o r d marry:w o r d sense h e a d trans sense-no V Takes NP Dyadicw o r d sense h e a d trans sense-no V TakeslntransNP M o n a d i cw o r d sense head trans sense-no V TakesNP Dyadicw o r d sense h e a d trans sense-no V TakesNPPP Triadicw o r d storm:w o r d sense h e a d trans sense-no 1V Takes NP Dyadicw o r d persuade:w o r d sensew o r d d a g storm:[cat: vhead: [aux: falsetrans: [pred: stormsense-no: Ia r g l : D G 1 5 []arg2: D G 1 6 []]]syncat: [first : [cat: NPhead: [trans: D G 1 5 ] ]rest: [first: [cat: NPhead: [trans: D G 1 6 ] ]rest: [first: lambda]]]]w o r d sensew o r d sensew o r d sense1123 h e a d t r a n s sense-no IV Takes NP Dyadic h e a d trans sense-no IV TakesNPSbar Triadic h e a d trans sense-no 2V TakesNP Dyadic h e a d trans sense-no 2V TakesNPInf ObjectControl TriadicFigure 7The modified version of PATR-II t h a t wehave implemented contains a small dictionary andconstructs entries automatically from restructuredLDOCE entries for most verbs that it encounters. Aswell as carrying over the g r a m m a r codes, PATR-IIhas been modified to represent the word sensenumbers which particular g r a m m a r codes areassociated with. Thus, the analysis of a sentence bythe PATR-II system now represents its syntactic andlogical structure and the particular senses of thewords (as defined in LDOCE) which are relevant inthe g r a m m a t i c a l context. Figure 7 illustrates theFigure 6The template Dyadic defines the way inwhich the syntactic a r g u m e n t s to the verb contributeto the logical structure of the sentence; thus, theinformation that "storm" is transitive and t h a t it islogically a two-place predicate is kept distinct.Consequently, the system can represent the fact t h a tsome verbs which take two syntactic a r g u m e n t s arenevertheless logically one-place predicates.175

dictionary entries for " m a r r y " andconstructed by the system from LDOCE.There are various possibilities for the form ofthe output resulting from processing a definition. Thecurrent experimental system produces output t h a t isconvenient for incorporating new word senses into aknowledge base organized around classificationhierarchies, as discussed shortly. However, thesystem allows the form of output structures to bespecified in a flexible way. Alternative possibleoutput representations would be m e a n i n g postulatesand definitions based on semantic primitives."persuade"In Figure 8 we show one of the two analysesproduced by PATR-II for a sentence containing thesetwo verbs. The other analysis is syntactically andparse: uther might persuade gwen to marry cornwallanalysis 1 :As mentioned above, the i m p l e m e n t e dexperimental system is intended to enable theclassification (see e.g. Schmolze, 1983) of new wordsenses with respect to a hierarchically organizedknowledge base, for example the one described inAlshawi (1983). The proposal being made here is t h a tthe analysis of dictionary definitions can provideenough information to link a new word sense todomain knowledge already encoded in the knowledgebase of a limited domain n a t u r a l l a n g u a g eapplication such as a database query system. Given ahand-coded hierarchical organization of the r e l e v a n t(central) senses of the definition vocabulary togetherwith a classification of the relationships betweenthese senses and domain specific concepts, theLDOCE definition of a new word sense often containsenough information to enable the inclusion of theword sense in this classification, and hence allow thenew word to be handled correctly when performingthe application task.[cat: SENTENCEhead: [form: finiteagr: [per: p3 hum: sg]aux: truetrans: [pred: possiblesense-no: 1argl: [pred: persuadesense-no: 2argl : [ref: uther sense-no: 1]arg2: [ref: gwen sense-no: 1]arg3: [pred: marrysense-no: 2arg1: [ref: gwensense-no 1]arg2: [ref: cornwallsense-no: 1]]]]]]Figure 8logically identical but incorporates sense two of"marry". Thus, the system knows that furthersemantic analysis need only consider sense two of"persuade" and sense one and two of "marry"; thisrules out one further sense of each, as defined inLDOCE.The information necessary for this process ispresent, in the case of nouns, as restrictions on theclasses which subsume the new type of object, itsproperties, and predications often expressed byrelative clauses. There are also a n u m b e r of morespecific predications (such as "purpose" in theexample given below) t h a t are very common indictionary definitions, and have immediate utility forthe classification of the relationships between wordsenses. Similarly, the information relevant to theclassification of verb and adjective senses present insense definitions includes the classes of predicatest h a t subsume the new predicate corresponding to theword sense, restrictions on the a r g u m e n t s of thispredicate, and words indicating opposites as isfrequently the case with adjective definitions.Word sense definitionsThe automatic analysis of the definitiontexts of LDOCE entries is aimed at m a k i n g thesemantic information on word senses encoded inthese definitions available to n a t u r a l languageprocessing systems. LDOCE is particularly suitableto such an endeavour because of the 2000 wordrestricted definition vocabulary, and in fact only'central' senses of the words in this restrictedvocabulary occur in definition texts. It is thuspossible to process the LDOCE definition of a wordsense in order to produce some representation of thesense definition in terms of senses of words in therestricted vocabulary. This representation could thenbe combined, for the benefit of the client l a n g u a g eprocessing system, with the other semanticinformation encoded for word senses in LDOCE; inparticular the 'box codes' t h a t give simple selectionalrestrictions and the 'subject codes' t h a t classify sensesaccording to subject area usage. (These are not in thepublished version of the dictionary, but are availableon the t a p e . )Figure 9 below shows the output produced bythe implemented definition analyser for lispifiedLDOCE definitions of one of the noun senses and oneof the verb senses of the word "launch". It should beemphasized t h a t the output produced is not regardedas a formal language, but r a t h e r as an intermediatedata structure containing information relevant to theclassification process.176

considerable amount of information in LDOCE whichwe have not attempted to exploit as yet; for example,the box codes, which contain selection restrictions forverbs or the subject codes, which classify word sensesaccording to the Merriam-Webster codes for subjectmatter (see Walker & Amsler (1983) for a suggesteduse for these). The large amount of semi-formalisedinformation concerning the interpretation of nouncompounds and idioms also represents a rich andpotentially very useful source of information fornatural language processing systems. In particular,we intend to investigate the automatic generation ofphrasal analysis rules from the information onidiomatic word usage.(launch)(a large usu. motor-driven boat used for carrying peopleon rivers, lakes, harbours, etc .)((CLASS BOAT) )(OBJECTPEOPLE))))(to send (a modern weapon or instrument) into the sky orspace by means of scientific explosive apparatus)((CLASSSEND)(OBJECT((CLASS RN)))) (ADVERBIAL((CASE INTO) (FILLER(CLASSSKY)))))In the longer term, it is clear that no existingpublished dictionary can meet all the requirements ofa natural language processing system and asubstantial component of the research reported abovehas been devoted to restructuring LDOCE to make itmore suitable for automatic analysis. This suggeststhat the automatic construction of dictionaries frompublished sources intended for other purposes willhave a limited life unless lexicography is heavilyinfluenced by the requirements of automated naturallanguage analysis. In the longer term, therefore, theautomatic construction of dictionaries for naturallanguage processing systems may need to be based ontechniques for the automatic analysis of large corpora(eg. Leech et al., 1983). However, in the short term,the approach outlined in this paper will allow us toproduce a sophisticated and useful dictionary rapidly.Figure 9The analysis process is intended to extractthe most important information from definitionswithout necessarily having to produce a completeanalysis of the whole of a particular definition textsince attempting to produce complete analyses wouldbe difficult for many LDOCE definition texts. In factthe current definition analyser applies successivelymore specific phrasal analysis patterns; moredetailed analyses being possible when relativelyspecific phrasal patterns are applied successfully to adefinition. A description of the details of this analysismechanism is beyond the scope of the present paper.Currently, around fifty phrasal patterns are usedaltogether for noun, verb, and adjective definitions. Amajor difficulty encountered so far in this work stemsfrom the liberal use in LDOCE definitions ofderivational morphology and phrasal verbs whichgreatly expands the effective definition vocabulary.A CK N O WL E D G E ME N T SWe would like to thank the Longman Group Limitedfor kindly allowing us access to the LDOCEtypesetting tape for research purposes. We also thankKaren Sparck Jones and John Tait for theircomments on the first draft, which substantiallyimproved this paper. We are very grateful to theSERC for funding this research.CONCLUSIONThe research reported in this paperdemonstrates that it is both possible and useful torestructure the information contained in LDOCE foruse in natural language processing systems. Mostapplications for natural language processing systemswill require vocabularies substantially larger thanthose typically developed for theoretical ordemonstration purposes and it

Contemporary English to natural language processing systems. We describe the process of restructuring the information in the dictionary and our use of the Longman grammar code system to construct dictionary entries for the PATR-II parsing system and our use of the Longman word