
INF5820 Natural Language Processing - NLP, H2009. Jan Tore Lønning, jtl@ifi.uio.no

Probabilistic parsing: Lexicalization, corpora, evaluation. INF5830, Lecture 15, Nov 11, 2009

Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation

So far
- PCFG
- Disappointing results
- (Grand-)mother annotation
- Lexicalized parsing (Collins)

Probabilistic CFG
- Each local subtree in a treebank yields a CFG rule
- The frequency of the subtree in the treebank determines the probability of the rule
- Standard parser (CKY) adapted with probabilities
- Somewhat disappointing results: .75 on local trees
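Not from the slides: a minimal sketch of the rule extraction and maximum-likelihood estimation just described, with P(A -> beta) = count(A -> beta) / count(A), assuming trees are encoded as nested (label, child, ...) tuples with words as plain strings; all names are illustrative.

    from collections import defaultdict

    def count_rules(tree, counts):
        """Count each local subtree (= one CFG rule) in a tree given as a
        (label, child, ...) tuple; leaves are plain word strings."""
        if isinstance(tree, str):              # a word: no rule here
            return
        lhs = tree[0]
        rhs = tuple(c if isinstance(c, str) else c[0] for c in tree[1:])
        counts[lhs][rhs] += 1
        for child in tree[1:]:
            count_rules(child, counts)

    def pcfg_from_treebank(treebank):
        """MLE: P(A -> beta) = count(A -> beta) / count(A)."""
        counts = defaultdict(lambda: defaultdict(int))
        for tree in treebank:
            count_rules(tree, counts)
        return {(lhs, rhs): n / sum(rules.values())
                for lhs, rules in counts.items()
                for rhs, n in rules.items()}

    # Toy treebank: two parses sharing some local trees
    treebank = [
        ("S", ("NP", "Kim"), ("VP", ("V", "slept"))),
        ("S", ("NP", "Kim"), ("VP", ("V", "saw"), ("NP", "Sandy"))),
    ]
    for (lhs, rhs), p in sorted(pcfg_from_treebank(treebank).items()):
        print(lhs, "->", " ".join(rhs), p)

On this toy treebank, P(S -> NP VP) = 1.0 while the two VP expansions each get probability 0.5, exactly the relative frequencies.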

Example PCFG

Explanations
- Non-uniform probabilities
- Lexical dependencies, e.g. PP-attachment:
  - knuste koppen med vilje ('broke the cup on purpose')
  - spiste brødskiva med syltetøy ('ate the slice of bread with jam')
  - spiste smørbrødet med kniv og gaffel ('ate the sandwich with knife and fork')

Alt. 1: (Grand-)parent annotation
- Should capture the subject-object distinction
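A sketch of what parent annotation does to a treebank tree, using NLTK's Tree class; the "^" separator and the TOP root label are conventions assumed here, not prescribed by the slides.

    from nltk import Tree

    def parent_annotate(tree, parent="TOP"):
        """Append the parent's label to every non-terminal: NP under S becomes
        NP^S.  A subject NP (NP^S) and an object NP (NP^VP) then get separate
        rule probabilities, which is the point of (grand)parent annotation."""
        if isinstance(tree, str):              # leave words unchanged
            return tree
        children = [parent_annotate(child, tree.label()) for child in tree]
        return Tree(f"{tree.label()}^{parent}", children)

    t = Tree.fromstring("(S (NP (PRP He)) (VP (VBZ sees) (NP (PRP her))))")
    print(parent_annotate(t))
    # (S^TOP (NP^S (PRP^NP He)) (VP^S (VBZ^VP sees) (NP^VP (PRP^NP her))))

Note that the transform also parent-annotates the POS tags (PRP^NP, VBZ^VP), which anticipates the tag-annotation refinement later in the lecture.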

Alt. 2: Lexicalization

History-based model

Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation

Klein and Manning (2003)
- Penn treebank, WSJ part (as usual)
- Input trees: annotated (mother, etc.) or transformed
- Local trees as CFG rules with frequency-based probabilities
- Parsing: PCKY

Markovization, aka Collins' histories as trees
- Unary rule (generate the head)
- Binary rules (generate the daughters)

CFG rules
- A rule like VP -> VBZ NP PP is binarized via intermediate categories:
  - VP: [VBZ] -> VBZ
  - VP: [VBZ] NP -> VP: [VBZ]  NP
  - VP: [VBZ] NP PP -> VP: [VBZ] NP  PP
- Context-free rules with intermediate categories (VP: [VBZ] PP, etc.), cf. a dotted item
- Count frequencies/probabilities

Dimensions
- Vertical (ancestors):
  - v = 1: only the mother
  - v = 2: grandmother, etc.
- Horizontal: how much of the already-generated RHS the category remembers
  - h = ∞: do not split the RHS
  - h = 1, h = 2, ...: VP: [VBZ] PP, VP: [VBZ] NP PP, VP: [VBZ] PP PP, VP: [VBZ] NP PP PP, etc.
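A sketch (the encoding is mine, not the paper's) of horizontal markovization: binarizing a rule around its head into the intermediate "VP: [VBZ] ..." categories of the previous slide, where h controls how many already-generated sisters the category name remembers.

    def markovize_rhs(lhs, head, sisters, h=1):
        """Binarize lhs -> head sister1 ... into unary/binary rules over
        intermediate categories "lhs: [head] <last h sisters>".  A large h
        keeps the whole history (no markovization); h = 0 forgets everything."""
        state = f"{lhs}: [{head}]"
        rules = [(state, (head,))]                  # unary rule: generate the head
        recent = []
        for sister in sisters:                      # binary rules: one daughter each
            recent = (recent + [sister])[-h:] if h > 0 else []
            new_state = (f"{lhs}: [{head}] " + " ".join(recent)).rstrip()
            rules.append((new_state, (state, sister)))
            state = new_state
        rules.append((lhs, (state,)))               # finally rewrite to plain lhs
        return rules

    # VP -> VBZ NP PP with h = 1: the PP step forgets the earlier NP
    for left, right in markovize_rhs("VP", "VBZ", ["NP", "PP"], h=1):
        print(left, "->", " | ".join(right))

With h = 1 the NP and PP expansions share the category VP: [VBZ] PP wherever their last sister matches, which is exactly how markovization pools counts across long rules.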

Results: "unsmoothed" means backoff for low frequencies

Refinements

Parent-annotate tags

More
- Split-IN: 6 different tags for the IN tag
- Split-Aux: separate tags for be and have
- Separate tag for % (like $)
- Strategy: OK to go down to word level on function words, but not in general
- Split-VP: finite vs. infinite verb; mark VP nodes with the head tag
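A toy illustration of the tag-splitting strategy (the word lists and tag names are placeholders, not Klein and Manning's exact splits): refining tags with lexical material is safe for closed-class function words, where sparseness is not a problem.

    AUX = {"be": "BE", "is": "BE", "are": "BE", "was": "BE", "were": "BE",
           "have": "HAVE", "has": "HAVE", "had": "HAVE"}

    def split_tag(word, tag):
        """Refine selected POS tags with lexical material (toy versions of
        Split-Aux, the % tag, and Split-IN).  Going to word level for the
        whole vocabulary would be far too sparse."""
        w = word.lower()
        if tag.startswith("VB") and w in AUX:   # Split-Aux: be/have stand apart
            return f"{tag}-{AUX[w]}"
        if word == "%":                         # % gets its own tag, like $
            return "%"
        if tag == "IN":                         # Split-IN: toy one-tag-per-word
            return f"IN-{w}"                    # (the real split uses 6 classes)
        return tag

    print(split_tag("has", "VBZ"), split_tag("of", "IN"), split_tag("%", "NN"))
    # VBZ-HAVE IN-of %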

Results: state of the art in 2003

Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation

Corpus variation
- Everybody uses the Penn treebank, WSJ part
- How representative are the results for:
  - other text genres?
  - other annotation schemas?

Gildea (2001), Collins parser:

Is it the lexicon?
- H: the head; C: a sister of the head
- Hhw: head word of H (and of the mother); Hht: this word's tag
- Chw: head word of the subtree dominated by C; Cht: this word's tag
- Etc.
- The relationship between Chw and Hhw is the bilexical dependency
- Expensive (there are many of them)

Simplification
- Original smoothing
- Skip the most specific part (see the schematic interpolation below)
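Schematically, in LaTeX (a reconstruction, not the slide's own formula): Collins-style smoothing interpolates increasingly general conditioning contexts with weights \lambda_i, and only the most specific level conditions on the head word Hhw.

    P(C_{hw} \mid C_{ht}, H_{hw}, H_{ht}, \dots) \approx
        \lambda_1\,\hat{P}(C_{hw} \mid C_{ht}, H_{hw}, H_{ht}, \dots)  % bilexical level
      + \lambda_2\,\hat{P}(C_{hw} \mid C_{ht}, H_{ht}, \dots)          % backoff: head tag only
      + \lambda_3\,\hat{P}(C_{hw} \mid C_{ht})                         % backoff: no head info

The simplification drops the \lambda_1 level, so no bilexical statistics need to be estimated or stored; the tag-based backoff levels remain.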

Results

Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation

Parser evaluation
- No corpus: linguistic constructions covered
- Unannotated corpus (cf. logon\lingo\erg\data\csli.items):
  - coverage on the corpus
  - average parse base (ambiguity)
  - entropy/perplexity (language model)
- Annotated corpus: (continued on the next slide)

Parser evaluation
- Annotated corpus:
  - POS-tagger accuracy
  - best-first/ranked consistency
  - tree similarity
  - GEIG: originally only PS bracketing
    - unfair: some mistakes are punished too hard
    - some mistakes are not punished at all with flat trees
  - dependency structure
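A minimal sketch (names and tree encoding are mine) of the labeled-bracketing precision, recall and F1 that the GEIG/PARSEVAL scheme computes; the usage example shows an attachment mistake costing one bracket in each direction.

    def brackets(tree, i=0):
        """Collect (label, start, end) spans from a tree given as nested
        (label, child, ...) tuples whose leaves are word strings."""
        if isinstance(tree, str):
            return set(), i + 1
        spans, j = set(), i
        for child in tree[1:]:
            s, j = brackets(child, j)
            spans |= s
        spans.add((tree[0], i, j))
        return spans, j

    def parseval(gold, test):
        """Labeled bracketing precision, recall and F1."""
        g, _ = brackets(gold)
        t, _ = brackets(test)
        correct = len(g & t)
        if correct == 0:
            return 0.0, 0.0, 0.0
        p, r = correct / len(t), correct / len(g)
        return p, r, 2 * p * r / (p + r)

    gold = ("S", ("NP", "Kim"), ("VP", ("V", "saw"), ("NP", "Sandy")))
    bad  = ("S", ("NP", "Kim"), ("VP", ("V", "saw")), ("NP", "Sandy"))  # wrong attachment
    print(parseval(gold, bad))   # roughly (0.8, 0.8, 0.8)

The flat-tree complaint from the slide falls out directly: the flatter the gold tree, the fewer brackets it has, so an attachment error inside a flat constituent may cost nothing at all.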

Purpose?
1. Regression testing of one particular grammar, vs.
2. Comparing different (approaches to) grammars: beware of the purpose of the grammar

Briscoe, Carroll, Sanfilippo
- dependent(introducer, head, depen.)
  - mod(type, h, d), e.g. mod(_, flag, red), mod(with, walk, John)
    - cmod, xmod, ncmod (type may or may not be specified)
  - arg_mod(type, head, depen., initial), e.g. arg_mod(by, kill, Brutus, subj)
  - arg(head, depen.)
    - subj(h, d, initial)
    - obj(h, d): dobj, obj2, iobj
    - clausal(h, d): more
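A hypothetical encoding of such grammatical relations as Python tuples, scored set-wise against a gold standard, which is how GR-based evaluation is typically reduced to precision and recall; the sentence and the parser error are invented for illustration.

    # GRs as tuples (relation, type/introducer, head, dependent, ...),
    # loosely following the mod(type, h, d) notation above:
    gold = {("subj", "saw", "Paul", "initial"),
            ("dobj", "saw", "flag"),
            ("mod", None, "flag", "red")}      # mod(_, flag, red): "red flag"
    test = {("subj", "saw", "Paul", "initial"),
            ("dobj", "saw", "red"),            # parser picked the wrong object
            ("mod", None, "flag", "red")}

    correct = len(gold & test)
    print("precision", correct / len(test), "recall", correct / len(gold))
    # precision 0.666... recall 0.666...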

The competition
- Similar to dependency structures, but:
  - 'Paul intends to leave IBM': Paul is an argument of both intends and leave
  - not single-head
  - not projective ('moved constituents')
- Right now: what is the best evaluation scheme? Several candidates
