Discourse Analysis - IIT Bombay

Discourse Analysis
A statistical approach to coreference resolution of noun phrases
Abhijit Mishra (114056002)
Samir Janardan Sohoni (114086001)

What is 'Discourse'?
- "A mode of organizing knowledge, ideas or experience that is rooted in language and its concrete contexts" (Merriam-Webster Dictionary)
- "A continuous stretch of (especially spoken) language larger than a sentence, often constituting a coherent unit such as a sermon, argument, joke or narrative" (Crystal 1992: 25)


What happens in Discourse Analysis?
- Analyze the formal and contextual links within a discourse
- Formal links are built into the language rendering
- Contextual links rely upon world knowledge

Formal Links in language
- Substitution: use of one, do, so
  Tom produced a nice painting. I told you so long ago.
- Ellipsis: omission of words or clauses
  Prosperity is a great teacher; adversity [is] a greater [teacher].
- Conjunction: addition, temporals and causals
  The travellers had lunch, then they rested because they were tired.
- References: pronouns and articles draw meaning from other words/contexts
  Neighbours bought a new car, it is nice.
  It is raining / It is day / It is night.

Motivation
- The ability to link coreferring noun phrases (NPs) within and across sentences is important
- Coreference resolution is important because:
  - Machine translation (MT) will need it
  - Information extraction (IE), question answering (QA) and summarization systems need it
  - In a natural environment, students learning a new language need to understand the phenomenon

Problem statement
- A coreference is not always limited to a pronoun like they, it etc.
- It can be a chain of non-pronominals
  Mahatma Gandhi insisted on non-violent means for freedom. He is a key figure in Indian history. Gandhi is also known as 'father of the nation'.
  Coreferenced chain: (Mahatma Gandhi)-(He)-(Gandhi)
- Can we identify coreferenced chains of noun phrases?

Why a Statistical Approach
- Rule-based approaches take time, money and trained personnel to make and test the rules.
- Coreference resolution is a semantic-level task which requires a lot of time and effort.
- Statistical methods may not be highly accurate, but they save a lot of time and money.
- The availability of monolingual corpora motivates us to try out quick statistical systems.

Glossary
- Markables: NPs, nested NPs, pronouns etc. that are identities of reference
  ((Bill Gates)B, (the chairman)C of (Microsoft Corp)D)A
- MUC: Message Understanding Conference
  - An initiative of the US government and agencies such as DARPA
  - Standardizes the data to be used by participants

Methodology
1) The training data is a standardized corpus containing chains of coreference-annotated markables.
2) From each annotated chain, every bigram (adjacent pair) is taken as an (antecedent, anaphor) pair.
3) Based on the features possessed by such pairs, a decision tree is learnt.
4) For testing, chains of markables are created from the test data.
5) Pairs of markables are presented to the classifier and coreference chains are extracted.

Processing Pipeline
[Figure: pipeline diagram. Recoverable stage labels: Free text, Tokenization, Morphological processing, POS tagger, Noun phrase identification, Named Entity Recognition, noun phrase extraction, Classifier. Three stages carry the label "HMM".]

Features
- Properties of a discourse that help decide whether two markables corefer or not
- Should be domain independent
- Should not be too difficult to compute
- For a pair of markables (i, j) we consider 12 different kinds of features.
Consider this example:
"Separately, Clinton transition officials said that Frank Newman, 50, Vice chairman and chief financial officer of BankAmerica Corp., is expected to be nominated as assistant Treasury secretary for domestic finance."
Markable i = "Frank Newman" and markable j = "Vice chairman"

Features
Distance feature: f_dist
- Possible values: numeric (0, 1, 2, 3, ...)
- Captures the distance between i and j
- If i and j are in the same sentence, f_dist(i, j) = 0. If they are one sentence apart, f_dist(i, j) = 1, and so on.
- E.g.: f_dist(Frank Newman, Vice chairman) = 0
I-Pronoun and J-Pronoun: f_i_pron, f_j_pron
- Possible values: true, false
- If i is a pronoun then f_i_pron(i, j) = true
- Similarly, if j is a pronoun then f_j_pron(i, j) = true
- Pronouns include reflexive pronouns (herself, himself), personal pronouns (she, her, you) and possessive pronouns (her, his).
- E.g.: f_j_pron(Frank Newman, Vice chairman) = false
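The distance and pronoun features above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Markable` class, its `sentence` index field, and the pronoun word lists are hypothetical and go beyond the handful of pronouns named on the slide.

```python
from dataclasses import dataclass

# Hypothetical word lists; the slide names only a few example pronouns.
REFLEXIVE = {"herself", "himself", "itself", "themselves"}
PERSONAL = {"i", "you", "he", "him", "she", "her", "it", "we", "they", "them"}
POSSESSIVE = {"my", "your", "his", "her", "its", "our", "their"}
PRONOUNS = REFLEXIVE | PERSONAL | POSSESSIVE


@dataclass(frozen=True)
class Markable:
    """A markable with its surface text and the index of its sentence (assumed fields)."""
    text: str
    sentence: int


def f_dist(i: Markable, j: Markable) -> int:
    """Number of sentences separating i and j (0 if in the same sentence)."""
    return abs(j.sentence - i.sentence)


def is_pronoun(m: Markable) -> bool:
    """True if the whole markable is a single pronoun from the lists above."""
    return m.text.strip().lower() in PRONOUNS


def f_i_pron(i: Markable, j: Markable) -> bool:
    return is_pronoun(i)


def f_j_pron(i: Markable, j: Markable) -> bool:
    return is_pronoun(j)


frank = Markable("Frank Newman", sentence=0)
vice = Markable("Vice chairman", sentence=0)
```

With these definitions, `f_dist(frank, vice)` is 0 and `f_j_pron(frank, vice)` is false, matching the slide's examples.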

Features (contd.)
Definite and Demonstrative NP: f_def, f_dem
- If j is a definite NP (e.g. "the car") or a demonstrative NP (e.g. "that boy") then return true.
- E.g.: f_def(Frank Newman, Vice chairman) = false
Number and Gender: f_num and f_gender
- If i and j agree in number then f_num(i, j) = true
- If i and j agree in gender then f_gender(i, j) = true
- E.g.: f_num(Frank Newman, Vice chairman) = true
- f_gender can take three values: true, false, unknown
- Designators and pronouns such as "Mr", "Mrs", "she", "he" are used to determine the gender.
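A rough Python sketch of the definite/demonstrative test and the three-valued gender agreement follows. The word lists and the rule "a designator counts only as the first word, a pronoun only when it is the whole NP" are assumptions made for illustration; the slides do not specify how the cues are matched.

```python
# Hypothetical cue lists; the slide mentions only "Mr", "Mrs", "she", "he".
MALE_TITLES = {"mr"}
FEMALE_TITLES = {"mrs", "ms", "miss"}
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}
DEMONSTRATIVES = {"this", "that", "these", "those"}


def _words(np: str) -> list:
    """Lower-cased tokens with surrounding punctuation stripped."""
    return [w.strip(".,;:'\"") for w in np.lower().split()]


def f_def(j: str) -> bool:
    """True if j starts with the definite article."""
    words = _words(j)
    return bool(words) and words[0] == "the"


def f_dem(j: str) -> bool:
    """True if j starts with a demonstrative."""
    words = _words(j)
    return bool(words) and words[0] in DEMONSTRATIVES


def gender(np: str) -> str:
    """Return 'male', 'female' or 'unknown' from designators and pronouns."""
    words = _words(np)
    if not words:
        return "unknown"
    first = words[0]
    if first in MALE_TITLES or (len(words) == 1 and first in MALE_PRONOUNS):
        return "male"
    if first in FEMALE_TITLES or (len(words) == 1 and first in FEMALE_PRONOUNS):
        return "female"
    return "unknown"


def f_gender(i: str, j: str) -> str:
    """Three-valued agreement: 'true', 'false' or 'unknown'."""
    gi, gj = gender(i), gender(j)
    if "unknown" in (gi, gj):
        return "unknown"
    return "true" if gi == gj else "false"
```

Returning "unknown" whenever either side lacks a gender cue keeps the feature from asserting a disagreement it cannot justify.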

Features (contd.)
- Both-Proper-Noun: if both i and j are proper nouns, return true.
- Alias feature: if i is an alias of j, return true.
- Appositive feature: if j is in apposition to i, return true.
  E.g.: f_appositive(Frank Newman, Vice chairman) = true
Semantic Class Agreement feature: f_semclass
- Possible values are true, false, unknown
- The head word of each markable is assigned one of the following classes: person, organization, location, time, object
- Semantic class labelling is done by finding the class label closest to the first sense of the markable's head word.
- E.g.: f_semclass(Frank Newman, Vice chairman) = true, since both i and j correspond to persons
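The semantic class agreement feature can be illustrated with a toy lookup. This sketch is not the labelling procedure described on the slide: the `TOY_CLASSES` table is a hypothetical stand-in for a sense inventory, and taking the last token as the head word is a crude assumption.

```python
# Hypothetical head-word table standing in for a real sense inventory.
TOY_CLASSES = {
    "newman": "person",
    "chairman": "person",
    "committee": "organization",
    "mich": "location",
    "monday": "time",
    "car": "object",
}


def head_word(np: str) -> str:
    """Crude assumption: the last token of the markable is its head word."""
    words = [w.strip(".,;:'\"") for w in np.lower().split()]
    return words[-1] if words else ""


def sem_class(np: str) -> str:
    """Look up the class of the head word, or 'unknown' if it is not listed."""
    return TOY_CLASSES.get(head_word(np), "unknown")


def f_semclass(i: str, j: str) -> str:
    """Three-valued agreement: 'true', 'false' or 'unknown'."""
    ci, cj = sem_class(i), sem_class(j)
    if "unknown" in (ci, cj):
        return "unknown"
    return "true" if ci == cj else "false"
```

For the running example, `f_semclass("Frank Newman", "Vice chairman")` yields "true" because both head words map to the class person.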

Training Data
[Figure: markables NP1 to NP5, grouped under the labels "One chain", "Another chain" and "Not in any chain"]
(Eastern Air)1 proposes (date)2 for (talks)3 on (pay-cut plan)4. ((Eastern Airlines)5 executives)6 notified ((union)7 leaders)8 that (the carrier)9 wishes to discuss (selective (wage)10 reductions)11 on (Feb. 3)12. ((Union)13 representatives)14 who could be reached said (they)15 hadn't decided whether (they)16 would respond. By proposing (a meeting date)17 (Eastern)18 moved (one step)19 closer towards reopening (current high-cost contracts agreements)20 with ((its)21 unions)22)23.
Coreferring pairs: ((union)7, (Union)13) and ((Union)13, (its unions)22)
Non-coreferring pairs: ((the carrier)9, (Union)13) and ((wage)10, (Union)13)
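One way to generate such training pairs is sketched below. Positive pairs are the adjacent (bigram) members of each annotated chain, as in the methodology. Treating every markable that lies between an antecedent and its anaphor as a negative example is an assumption taken from the cited Soon, Ng and Lim (2001) scheme, not something the slides spell out. The function name and the use of integer markable ids are hypothetical.

```python
def training_pairs(markables, chains):
    """Build (antecedent, anaphor, label) examples.

    markables: markable ids in document order.
    chains: annotated coreference chains, each a list of ids in document order.
    """
    position = {m: k for k, m in enumerate(markables)}
    examples = []
    for chain in chains:
        # Adjacent members of a chain give one positive example each.
        for antecedent, anaphor in zip(chain, chain[1:]):
            examples.append((antecedent, anaphor, True))
            # Assumed negative examples: markables strictly between the two.
            start, end = position[antecedent] + 1, position[anaphor]
            for between in markables[start:end]:
                examples.append((between, anaphor, False))
    return examples


# Markable ids 1..23 from the passage, with the chain 7 -> 13 -> 22.
examples = training_pairs(list(range(1, 24)), [[7, 13, 22]])
```

On the passage above this produces the positive pairs (7, 13) and (13, 22), and negatives such as (9, 13) and (10, 13).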

Training
- The C5 decision tree algorithm is used to learn a decision tree from the training data.
- It is an updated version of the ID3 algorithm, in which the feature selected at each node is the one that provides the maximum information gain:

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

$$\mathrm{Entropy}(X) = -\sum_i \Pr(x_i) \log \Pr(x_i)$$

  where S_v is the subset of S in which attribute A takes the value v.
- C5 has a better pruning mechanism and also handles training data with missing attribute values.
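The gain and entropy formulas can be checked numerically with a short Python sketch. It computes plain information gain only and is not C5 itself; the function names, the dictionary-per-row data layout and the toy feature values are assumptions for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy(S) minus the size-weighted entropy of the subsets split on attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Toy data: 'appositive' predicts the label perfectly, 'j_pron' not at all.
rows = [
    {"appositive": True, "j_pron": True},
    {"appositive": True, "j_pron": False},
    {"appositive": False, "j_pron": True},
    {"appositive": False, "j_pron": False},
]
labels = [True, True, False, False]
```

A tree learner in this style would split on `appositive` first here, since its gain equals the full entropy of the labels while `j_pron` contributes nothing.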

Testing
Algorithm: (Document D, Decision Tree T) : List
    M ← get_markers_from_document(D)
    for (j ← 2; j ≤ |M|; j++):
        for (i ← 1; i < j; i++):
            F ← get_feature_vector(i, j)
            /* Get the class from the decision tree */
            corefer ← get_corefer(F, T)
            if (corefer):
                j.antecedent ← i
    for (j ← |M|; j ≥ 1; j--):
        chain ← back_track(j)
        List.add(chain)
    return List
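A runnable Python sketch of the testing loop is given below. The `corefers` callable is a hypothetical stand-in for computing the feature vector and querying the decision tree. Unlike the pseudocode, chains are assembled by a forward pass over the antecedent links rather than by back-tracking from the last markable; this is a deliberate simplification that is intended to yield the same chains.

```python
def resolve(markables, corefers):
    """Group markables into coreference chains.

    markables: markables in document order.
    corefers: callable (i_markable, j_markable) -> bool, standing in for
              the feature extraction plus decision tree lookup.
    """
    antecedent = {}
    for j in range(1, len(markables)):
        for i in range(j):
            if corefers(markables[i], markables[j]):
                # A later i overwrites an earlier one, so the closest
                # coreferring markable ends up as the antecedent.
                antecedent[j] = i

    chain_of = {}
    chains = []
    for j in range(len(markables)):
        i = antecedent.get(j)
        if i is None:
            chain_of[j] = [j]          # j starts a new chain
            chains.append(chain_of[j])
        else:
            chain_of[j] = chain_of[i]  # j joins its antecedent's chain
            chain_of[j].append(j)
    # Singletons are not coreference chains, so drop them.
    return [[markables[k] for k in c] for c in chains if len(c) > 1]


# Toy classifier: an assumed oracle for the example, not a learnt tree.
SAME_ENTITY = {"Ms. Washington", "her", "She"}


def toy_corefers(a, b):
    return a in SAME_ENTITY and b in SAME_ENTITY


mentions = ["Ms. Washington", "several lawmakers", "her", "her boss", "She"]
chains = resolve(mentions, toy_corefers)
```

Here `chains` holds a single chain linking "Ms. Washington", "her" and "She".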

Testing Example
(Ms. Washington)73's candidacy is being championed by (several lawmakers)74 including ((her)76 boss)75, (chairman John Dingell)77 (D., (Mich.)78) of (the House Energy and Commerce Committee)79. (She)80 currently is (a counsel)81 to (the committee)82. (Ms. Washington)83 and (Mr. Dingell)84 have been considered (allies)85 of (the (securities)87 exchanges)86, while (banks)88 and ((futures)90 exchanges)89 have often fought with them.

Testing (contd.)
Coreferenced chain is (Ms. Washington)73-(her)76-(She)80
Courtesy: Soon, Ng, Lim (2001)

Evaluation
[Figure courtesy: Soon, Ng, Lim (2001)]

Evaluation (contd.)
[Figure courtesy: Soon, Ng, Lim (2001)]

Precision Errors (false +ve)
No | Type of error | Example | Frequency | %
1 | Prenominal modifier string match | David Bronczek, (vice) president of ... was named senior (vice) president. | 16 | 42.1
2 | Strings match but NPs refer to different things | the House Energy and Commerce ... (the committee) ... committee ... Senate Finance Committee ... (the committee) | 11 | 28.9
3 | Errors in NP identification | "... May and June ..." should actually be (May) and (June), but it is treated as nouns in apposition | 4 | 10.5
4 | Errors in apposition determination | ... Metaphor, a company that IBM bought in 1991, also named (Chris Grejtak)1, (43 year old)2, currently a SVP, president & CEO. | 5 | 13.2
5 | Error in alias determination | Ms. Washington, a long time (House)1 staffer and an expert in securities laws, ... Ms. Washington's candidacy is championed by ... John Dingell (D., Mich.) of (the House Energy and Commerce Committee)2. | 2 | 5.3

Recall Errors (false -ve)
No | Type of error | Example | Frequency | %
1 | Inadequacy of current surface features | Mr. X, (general manager)1 of ..., was named (president)2 of ... | 38 | 63.3
2 | Errors in NP identification | The NP identification module completely misses candidate phrases in the coreference chain | 7 | 11.7
3 | Errors in semantic class determination | Error made in applying semantic class "Date": "... Losses for (fiscal 2nd period)1 faltering sales will result in (second-quarter)2 loss." | 7 | 11.7
4 | Errors in POS assignment | Nouns are not being marked correctly: Mr. X, who is ED in (Canada), succeeds Mr. Y as VP and GM (there) | 5 | 8.3
5 | Error in apposition determination | Not able to detect coreference due to apposition: (Bill Gates)1, (the chairman of Microsoft)2 | 2 | 3.3
6 | Errors due to tokenization | ... debt-to-equity ratio changed to (1-to-2)1 from (15-to-1)2 | 1 | 1.7

Conclusion and further Improvements
- Works on a small annotated corpus
- Domain and language independent
- Resolves noun phrase coreferences in general and is not limited to pronominal coreference resolution.
- We can consider verb suffixes to determine gender in morphologically rich languages. Similarly, other language-specific properties can be taken into consideration.
- This is a sequence labelling problem. We can apply techniques like HMM and CRF instead of scalar classifiers like decision trees.

References
- Soon, Wee Meng, Hwee Tou Ng, and Daniel Chung Yong Lim. 2001. A machine learning approach to coreference resolution of noun phrases. Computational Linguistics, 27(4):521–544.
- Crystal, David. 1992. An encyclopedic dictionary of language and languages. Cambridge, MA: Blackwell.
- Wiśniewski, Kamil. 2006. iscourse.htm
- Quinlan, John Ross. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Francisco, CA.

Thank you
