Introduction To IE With GATE - Uni-due.de


Introduction to IE with GATE
based on material from Hamish Cunningham and Kalina Bontcheva (University of Sheffield)
Melikka Khosh Niat
8 December 2010

1. What is IE?
2. GATE
3. ANNIE
4. Annotation and Evaluation

Information Extraction (IE)
  IE is a process which takes unseen texts as input and produces fixed-format, unambiguous data as output. This data may be used directly for display to users, may be stored in a database or spreadsheet for later analysis, or may be used for indexing purposes in Information Retrieval (IR) applications such as Internet search engines. [Cowie and Lehnert 96, Appelt 99]

IE is not IR
  IR pulls documents from large text collections (usually the Web) in response to specific keywords or queries. You analyse the documents.
  IE pulls facts and structured information from the content of large text collections. You analyse the facts.

IE is an enabling technology for many other applications
  Text mining
  Semantic annotation
  Question answering
  Opinion mining
  and so on

Typical subtasks of IE
  Named Entity recognition (NE)
    Finds and classifies names, places, etc.
  Coreference resolution (CO)
    Identifies identity relations between entities in texts.
  Template Element construction (TE)
    Adds descriptive information to NE results (using CO).
  Template Relation construction (TR)
    Finds relations between TE entities.
  Scenario Template production (ST)
    Fits TE and TR results into specified event scenarios.

Example of IE
  "The shiny red rocket was fired on Tuesday. It is the brainchild of Dr. Big Head. Dr. Head is a staff scientist at We Build Rockets Inc."
  NE: entities (rocket, Tuesday, Dr. Head, We Build Rockets Inc.)
  CO: "it" refers to the rocket
  TE: the rocket is shiny red and is Head's brainchild
  TR: Dr. Head works for We Build Rockets Inc.
  ST: a rocket-launching event with the various entities involved in it

Two kinds of approaches: Knowledge Engineering
  rule based
  developed by experienced language engineers
  makes use of human intuition
  requires only a small amount of training data
  development can be very time consuming
  some changes may be hard to accommodate

Two kinds of approaches: Learning Systems
  use statistics or other machine learning
  developers do not need language engineering (LE) expertise
  require large amounts of annotated training data
  some changes may require re-annotation of the entire training corpus

The cornerstone of IE: Named Entity Recognition
  Identification of proper names in texts, and their classification into a set of predefined categories of interest:
    Persons
    Organisations (companies, government organisations, committees, etc.)
    Locations (cities, countries, rivers, etc.)
    Date and time expressions

Why is Named Entity Recognition important?
  NE provides a foundation from which to build more complex IE systems
  Relations between NEs can provide tracking, ontological information and scenario building
  Tracking (co-reference): Dr. Head, Joe Head, Joe, he

Typical NE pipeline
  Pre-processing: tokenisation, sentence splitting, morphological analysis, POS tagging
  Entity finding: gazetteer lookup, NE grammars
  Coreference: alias finding, orthographic coreference, etc.
  Export to database, XML, ontology

Example of IE
  "Joe lives in Cologne. He works there for IBM."
  NE Recognition
  Coreference
  Relations

General Architecture for Text Engineering (GATE)
  A framework for language processing (http://gate.ac.uk)
  Open source (LGPL licence)
  A framework for programmers: GATE is an object-oriented class library that implements the architecture
  A development environment: a graphical development environment for language engineers, computational linguists et al.
  GATE includes support for reading in various formats and converting to the internal annotation representation: HTML, XML, PDF, SGML, RTF, email, plain text
  Over ten years old, with 1000s of users at 100s of sites
  Current version: 6

GATE includes
  Components for language processing, e.g. parsers, machine learning tools, stemmers, IR tools, IE components for various languages
  Tools for visualising and manipulating text, annotations, ontologies, parse trees, etc.
  Various information extraction tools
  Evaluation and benchmarking tools

GATE Components
  GDM: the GATE Document Manager
  GGI: the GATE Graphical Interface
  CREOLE: a Collection of REusable Objects for Language Engineering, a set of LE components integrated with the system

GATE components are one of three types:
  Language Resources (LRs): lexicons, corpora, ontologies
  Processing Resources (PRs): represent entities that are primarily algorithmic
  Visual Resources (VRs): represent visualisation and editing components that participate in GUIs

GATE APIs
  Everything is a replaceable bean
  All communication via fixed APIs

ANNIE: A Nearly-New Information Extraction System
  ANNIE is a ready-made collection of algorithms that performs IE on unstructured text.
  The ANNIE application contains a set of core PRs:
    1. Tokeniser
    2. Sentence Splitter
    3. POS tagger
    4. Gazetteers
    5. Named entity tagger (JAPE transducer)
    6. Orthomatcher (orthographic coreference)

[Figure: ANNIE components]

ANNIE components
  Each PR in the ANNIE pipeline creates some new annotations or modifies existing ones:
    Document Reset: removes annotations
    Tokeniser: Token annotations
    Sentence Splitter: Sentence and Split annotations
    POS tagger: adds category features to Token annotations
    Gazetteers: Lookup annotations
    Named entity tagger (JAPE transducer): Date, Person, Location, Organisation annotations
    Orthomatcher (orthographic coreference): adds match features to NE annotations

Tokeniser
  Splits the text into very simple tokens such as numbers, punctuation and words of different types
  Tokeniser rules:
    1. The left hand side (LHS) is a regular expression which has to be matched on the input; it is separated from the RHS by ">". Operators:
         |  (OR)
         *  (0 or more occurrences)
         ?  (0 or 1 occurrences)
         +  (1 or more occurrences)
    2. The right hand side (RHS) describes the annotations to be added to the AnnotationSet and uses ";" as a separator.

RHS
    {LHS} > {Annotation type};{attribute1}={value1};...;{attribute n}={value n}
Example
    "UPPERCASE_LETTER" "LOWERCASE_LETTER"* > Token;orth=upperInitial;kind=word;
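The example rule can be imitated in a few lines of Python to show what the LHS and RHS each contribute. This is only a sketch: the regular expression, the function name `tokenise_upper_initial` and the dictionary layout are assumptions made for illustration, not part of GATE.

```python
import re

# Hypothetical stand-in for the LHS of the rule:
#   one uppercase letter followed by zero or more lowercase letters.
UPPER_INITIAL = re.compile(r"\b[A-Z][a-z]*\b")

def tokenise_upper_initial(text):
    """Return one Token annotation per word matching the upperInitial pattern."""
    annotations = []
    for match in UPPER_INITIAL.finditer(text):
        # The dictionary plays the role of the RHS: annotation type plus features.
        annotations.append({
            "type": "Token",
            "orth": "upperInitial",
            "kind": "word",
            "string": match.group(),
            "start": match.start(),
            "end": match.end(),
        })
    return annotations

for token in tokenise_upper_initial("Joe lives in Cologne"):
    print(token["string"], token["start"], token["end"])
```

Words such as "IBM" are not matched here, because an all-caps word would need its own rule with a different `orth` value.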

Gazetteers
  Gazetteers are plain text files containing lists of names
  The lists are compiled into Finite State Machines
  Each gazetteer has an index file listing all the lists, plus features of each list
  Lists can be modified either internally using the Gazetteer Editor, or externally in your favourite editor

[Figure: ANNIE Gazetteer]

Editing gazetteer lists
  The ANNIE gazetteer has about 60,000 entries arranged in 80 lists
  Each list reflects a certain category
  List entries might be entities or parts of entities, or they may contain contextual information
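A toy version of gazetteer lookup can make the idea of lists with features concrete. The sketch below is an assumption-laden illustration: the list names, the entries and the `lookup` function are invented, and it scans with regular expressions rather than compiling the lists into a finite state machine as GATE does.

```python
import re

# Hypothetical miniature gazetteer: list name -> (majorType, minorType, entries).
# In GATE these would be plain text list files described by an index file.
GAZETTEER = {
    "city.lst": ("location", "city", ["Bonn", "Cologne"]),
    "currency_unit.lst": ("currency_unit", "post_amount", ["US dollars", "pounds"]),
}

def lookup(text):
    """Return Lookup annotations for every gazetteer entry found in the text."""
    found = []
    for major, minor, entries in GAZETTEER.values():
        for entry in entries:
            pattern = r"\b" + re.escape(entry) + r"\b"
            for match in re.finditer(pattern, text):
                found.append({
                    "type": "Lookup",
                    "start": match.start(),
                    "end": match.end(),
                    "majorType": major,
                    "minorType": minor,
                })
    return sorted(found, key=lambda a: a["start"])

print(lookup("University of Bonn"))
```

Each match carries the features of the list it came from, which is what later grammar rules test against.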

Sentence Splitter
  Finds sentences based on Tokens
  Creates Sentence annotations, and Split annotations on the sentence delimiters
  Uses a gazetteer of abbreviations etc. and a set of JAPE grammars which find sentence delimiters and then annotate sentences and splits


POS tagger
  The ANNIE POS tagger is a modified version of the Brill tagger
  Uses a default lexicon and ruleset, trained on the Wall Street Journal corpus
  Default ruleset and lexicon can be modified manually
  Requires the Tokeniser and Sentence Splitter to be run first

NE transducers
  Gazetteers can be used to find terms that suggest entities
  Entities can often be ambiguous:
    "May Smith" vs "May 2010" vs "May I help you?"
    "General Motors" vs "General Smith"
  Handcrafted grammars are used to define patterns over the Lookup and other annotations
  These patterns can help disambiguate, and they can combine different annotations:
    Date = day + number + month
  Each NE transducer consists of one or more grammars written in the JAPE language

JAPE: Java Annotation Pattern Engine
  Jolly And Pleasant Experience :-)
  Specially developed pattern-matching language for GATE
  Each JAPE rule consists of:
    an LHS, which contains the patterns to match
    an RHS, which details the annotations to be created

JAPE example
  Match all university names in Germany, e.g. "University of Bonn"
  The gazetteers might contain the word "Bonn" in the list of cities
  The rule looks for specific words such as "University of" followed by the name of a city
  This wouldn't be enough to match all university names, but it's a start :-)

Rule name, LHS and RHS of the example rule

    // Rule name
    Rule: University1
    // LHS
    (
      {Token.string == "University"}
      {Token.string == "of"}
      {Lookup.minorType == city}
    ):orgName
    -->
    // RHS
    :orgName.Organisation = {kind = "university", rule = "University1"}
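To see what rule University1 does without running GATE, the same pattern can be written as a small Python loop over tokens. This is a sketch under simplifying assumptions: the `CITIES` set stands in for the gazetteer's city list, tokenisation is a crude whitespace split, and `find_universities` is a made-up name.

```python
# Hypothetical stand-in for the gazetteer's city list (Lookup.minorType == city).
CITIES = {"Bonn", "Cologne", "Duisburg"}

def find_universities(text):
    """Annotate 'University of <city>' sequences, mirroring rule University1."""
    # Crude tokenisation: split off full stops and commas, then split on whitespace.
    tokens = text.replace(".", " .").replace(",", " ,").split()
    annotations = []
    for i in range(len(tokens) - 2):
        # LHS: Token "University", Token "of", then a city Lookup.
        if (tokens[i] == "University"
                and tokens[i + 1] == "of"
                and tokens[i + 2] in CITIES):
            # RHS: create an Organisation annotation over the matched tokens.
            annotations.append({
                "type": "Organisation",
                "string": " ".join(tokens[i:i + 3]),
                "kind": "university",
                "rule": "University1",
            })
    return annotations

print(find_universities("She studied at the University of Bonn."))
```

A name such as "University of Sheffield" is not matched unless "Sheffield" is added to the city list, which is the limitation the slide alludes to.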

Matching a text string
  Everything to be matched must be specified in terms of annotations
  Each annotation is enclosed in curly braces
  To match a string of text, use the "Token" annotation and the "string" feature: {Token.string == "by"}
  You can combine sequences of annotations in a pattern

Labels on the LHS
  For every combination of patterns that you want to create an annotation for, you need a label
  The pattern combination is enclosed in round brackets, followed by a colon and the label

One or more cities or countries, in any order and combination:
    ( {Lookup.minorType == city} | {Lookup.minorType == country} )+
is not the same as one city OR one or more countries:
    ( {Lookup.minorType == city} | ({Lookup.minorType == country})+ )

Coreference
  Different expressions may refer to the same entity
  Orthographic coreference matches proper names and their variants in a document:
    "Mary Smith" and "Mrs. Smith"
    "International Business Machines Ltd." will match "IBM"
  Classification of unknown entities is very useful for surnames which match a full name, or for abbreviations:
    Smith <unknown> will match Sir John Smith <person>
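Two of the matching ideas mentioned here, acronyms and shared surnames, can be sketched in Python. These are toy rules written for illustration only; the function names and the suffix list are assumptions, both names are assumed to be non-empty, and the real Orthomatcher uses a much richer rule set.

```python
def acronym(name):
    """Initial letters of the capitalised words, ignoring company suffixes."""
    suffixes = {"Ltd.", "Ltd", "Inc.", "Inc"}  # hypothetical suffix list
    words = [w for w in name.split() if w not in suffixes]
    return "".join(w[0] for w in words if w[0].isupper())

def orthomatch(a, b):
    """True if two non-empty names plausibly refer to the same entity."""
    if a == b:
        return True
    # Toy rule 1: one name is the acronym of the other.
    if acronym(a) == b or acronym(b) == a:
        return True
    # Toy rule 2: the names share their final word, e.g. a surname.
    return a.split()[-1] == b.split()[-1]

print(orthomatch("International Business Machines Ltd.", "IBM"))
print(orthomatch("Smith", "Sir John Smith"))
print(orthomatch("General Motors", "General Smith"))
```

Once two names match, an entity of unknown type can inherit the type of its partner, as in the Smith example.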

A Walk-Through Example
  A 3-stage procedure
  Recognise the phrase "800,000 US dollars" as an entity of type "Number", with the feature "money"
  Give an example of a grammar rule for money
    Step 1: Tokenisation
    Step 2: List Lookup
    Step 3: Grammar Rules

Grammar rule for money

    Macro: MILLION_BILLION
    ({Token.string == "m"} |
     {Token.string == "million"} |
     {Token.string == "b"} |
     {Token.string == "billion"})

    Macro: AMOUNT_NUMBER
    ({Token.kind == number}
     (({Token.string == ","} | {Token.string == "."})
      {Token.kind == number})*
     (({SpaceToken.kind == space})?
      (MILLION_BILLION)?)
    )

    Rule: Money1
    // e.g. 30 pounds
    (
      (AMOUNT_NUMBER)
      ({SpaceToken.kind == space})?
      ({Lookup.majorType == currency_unit})
    ):money
    -->
    :money.Number = {kind = "money", rule = "Money1"}

Tokenisation
    Token: string = '800', kind = number, length = 3
    Token: string = ',', kind = punctuation, length = 1
    Token: string = '000', kind = number, length = 3
    SpaceToken: string = ' ', kind = space, length = 1
    Token: string = 'US', kind = word, length = 2, orth = allCaps
    SpaceToken: string = ' ', kind = space, length = 1
    Token: string = 'dollars', kind = word, length = 7, orth = lowercase
List Lookup
    Lookup: minorType = post_amount, majorType = currency_unit
Grammar Rules
    Number: kind = money, rule = Money1
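The three stages can be approximated end to end with a single regular expression in Python. This is a rough sketch, not how GATE evaluates JAPE: the `CURRENCY_UNITS` list stands in for the gazetteer lookup, and the pattern names merely echo the macros of the money rule.

```python
import re

# Hypothetical currency list standing in for Lookup.majorType == currency_unit.
CURRENCY_UNITS = ["US dollars", "dollars", "pounds", "euros"]

# Number, optional ",ddd" or ".ddd" groups, then an optional million/billion marker.
AMOUNT_NUMBER = r"\b\d+(?:[,.]\d+)*(?: ?(?:million|billion|m|b)\b)?"
CURRENCY = "|".join(re.escape(unit) for unit in CURRENCY_UNITS)
# Amount, optional space, currency unit: the shape of rule Money1.
MONEY = re.compile(rf"(?:{AMOUNT_NUMBER}) ?(?:{CURRENCY})\b")

def find_money(text):
    """Return Number annotations with kind=money for each money expression."""
    return [
        {"type": "Number", "kind": "money", "rule": "Money1",
         "string": m.group(), "start": m.start(), "end": m.end()}
        for m in MONEY.finditer(text)
    ]

print(find_money("The deal cost 800,000 US dollars."))
```

Collapsing tokenisation, lookup and grammar into one regex hides the intermediate annotations; in GATE each stage leaves its own annotations for later rules to reuse.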

System development cycle
  1. Collect corpus of texts
  2. Define what is to be extracted
  3. Manually annotate gold standard
  4. Create system
  5. Evaluate performance against gold standard
  6. Return to step 3, until desired performance is reached

Before you start annotating
  You need to think about annotation guidelines
  You need to consider what you want to annotate and then define it appropriately
  With multiple annotators it is essential to have a clear set of guidelines for them to follow
  Consistency of annotation is really important for a proper evaluation

Annotation guidelines
  People need a clear definition of what to annotate in the documents, with examples
  Typically written as a guidelines document
  Piloted first with a few annotators and improved; the "real" annotation starts once all annotators are trained
  Annotation tools require the definition of a formal DTD (e.g. XML schema)

Annotation in GATE GUI
  Adding annotation sets
  Adding annotations
  Resizing them (changing boundaries)
  Deleting
  Changing highlighting colour
  Setting features and their values

Performance Evaluation
  Evaluation metric: mathematically defines how to measure the system's performance against a human-annotated gold standard
  Scoring program: implements the metric and provides performance measures
    For each document and over the entire corpus
    For each type of annotation

Terminology Comparison
    Gold Standard IE      Gold Standard IR
    Correct               True Positive
    Missing               False Negative
    Spurious              False Positive
    Partially Correct     True Negative

Terminology Comparison
  Correct: things annotated correctly
    annotating "Norbert Fuhr" as a Person
  Missing: things not annotated that should have been
    not annotating "Duisburg" as a Location
  Spurious: things annotated wrongly
    annotating "Norbert Fuhr" as a Location
  Partially Correct: the annotation type is correct, but the span is wrong
    annotating just "Fuhr" as a Person is too short
    annotating "luckily Norbert Fuhr" as a Person is too long
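The four categories can be computed mechanically once annotations are represented as (start, end, type) triples. The sketch below is one possible reading of the definitions above, not GATE's scoring tool: the function names, the two-pass matching strategy and the example sentence "luckily Norbert Fuhr works in Duisburg" are assumptions made for illustration.

```python
def overlaps(a, b):
    """True if two (start, end, type) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def compare(gold, system):
    """Count correct, partial, missing and spurious system annotations."""
    counts = {"correct": 0, "partial": 0, "missing": 0, "spurious": 0}
    unmatched_gold = list(gold)
    leftover = []
    # Pass 1: same span and same type is Correct.
    for ann in system:
        if ann in unmatched_gold:
            unmatched_gold.remove(ann)
            counts["correct"] += 1
        else:
            leftover.append(ann)
    # Pass 2: same type with an overlapping span is Partially Correct,
    # anything else the system produced is Spurious.
    for ann in leftover:
        partner = next((g for g in unmatched_gold
                        if g[2] == ann[2] and overlaps(g, ann)), None)
        if partner is None:
            counts["spurious"] += 1
        else:
            unmatched_gold.remove(partner)
            counts["partial"] += 1
    # Gold annotations that nothing matched are Missing.
    counts["missing"] = len(unmatched_gold)
    return counts

# Offsets refer to the hypothetical text "luckily Norbert Fuhr works in Duisburg".
gold = [(8, 20, "Person"), (30, 38, "Location")]
system = [(16, 20, "Person"), (8, 20, "Location")]
print(compare(gold, system))
```

Here "Fuhr" alone counts as partially correct, "Norbert Fuhr" tagged as a Location is spurious, and the unannotated "Duisburg" is missing.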

Precision and Recall
  How many of the entities your application found were correct?
    Precision = Correct / (Correct + Spurious)
  How many of the entities that exist did your application find?
    Recall = Correct / (Correct + Missing)
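A short worked example shows the two ratios in use. This is a minimal sketch with made-up counts; the function names are assumptions, partially correct annotations are ignored as in the definitions above, and returning 0.0 for an empty denominator is a convention chosen here.

```python
def precision(correct, spurious):
    """Share of the annotations the system produced that are correct."""
    total = correct + spurious
    return correct / total if total else 0.0

def recall(correct, missing):
    """Share of the gold-standard annotations that the system found."""
    total = correct + missing
    return correct / total if total else 0.0

# Hypothetical counts: 8 correct, 2 spurious, 8 missing.
print(precision(correct=8, spurious=2))  # 8 / 10
print(recall(correct=8, missing=8))      # 8 / 16
```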

F-Measure
  Precision and Recall tend to trade off against one another:
    specifying rules precisely to improve precision may cause a lower recall
    very general rules may deliver good recall, but low precision
  This makes it difficult to compare applications, or to check whether a change has improved or worsened the results overall
  F-measure combines precision and recall into one measure
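The slides do not spell out the formula. The sketch below uses the standard weighted harmonic mean, F = (1 + beta^2) * P * R / (beta^2 * P + R), which reduces to 2PR / (P + R) for beta = 1. The function name and the zero-denominator convention are assumptions.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F1 when beta == 1)."""
    denominator = beta ** 2 * precision + recall
    if denominator == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denominator

# A precise but cautious system and a greedy one, with hypothetical scores.
print(f_measure(0.9, 0.3))
print(f_measure(0.5, 0.5))
```

Because the harmonic mean is pulled towards the smaller value, a system cannot score well by maximising only one of the two measures.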
