INF5820 Natural Language Processing - NLP, H2009. Jan Tore Lønning, jtl@ifi.uio.no
Probabilistic parsing: lexicalization, corpora, evaluation. INF5830, Lecture 15, Nov 11, 2009
Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation
So far
- PCFG
- Disappointing results
- (Grand-)mother annotation
- Lexicalized parsing (Collins)
Probabilistic CFG
- Each local subtree in a treebank yields a CFG rule
- The frequency of the subtree in the treebank determines the probability of the rule
- Standard parser (CKY) adapted with probabilities
- Somewhat disappointing results: .75 on local trees
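The extraction step above can be sketched in a few lines. This is a minimal toy illustration, not a full treebank reader: trees are encoded as nested tuples (my own ad hoc encoding), and probabilities are plain relative frequencies, P(A → β) = count(A → β) / count(A).

```python
from collections import defaultdict

def extract_rules(tree, counts):
    """Count one CFG rule per local subtree: (mother, tuple of daughter labels).
    A tree is (label, child, ...); a leaf is a plain string (the word)."""
    if isinstance(tree, str):
        return
    label, children = tree[0], tree[1:]
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[label][rhs] += 1
    for c in children:
        extract_rules(c, counts)

def pcfg(treebank):
    """MLE probabilities: relative frequency of each rule among all rules
    sharing the same left-hand side."""
    counts = defaultdict(lambda: defaultdict(int))
    for tree in treebank:
        extract_rules(tree, counts)
    return {(lhs, rhs): n / sum(rules.values())
            for lhs, rules in counts.items()
            for rhs, n in rules.items()}

# Toy treebank: VP rewrites differently in the two trees,
# so each VP rule gets probability 0.5.
bank = [("S", ("NP", "she"), ("VP", ("V", "ate"))),
        ("S", ("NP", "she"), ("VP", ("V", "ate"), ("NP", "bread")))]
grammar = pcfg(bank)
```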
Example PCFG
Explanations
- Non-uniform probabilities
- Lexical dependencies, e.g. PP-attachment:
  - knuste koppen med vilje ('broke the cup on purpose')
  - spiste brødskiva med syltetøy ('ate the slice of bread with jam')
  - spiste smørbrødet med kniv og gaffel ('ate the sandwich with knife and fork')
Alt. 1: (Grand-)parent annotation
- Should capture the subject-object distinction
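Parent annotation can be sketched as a one-pass relabeling of the tuple-encoded toy trees used here (the `^` separator is just a convention): an NP under S becomes NP^S, an NP under VP becomes NP^VP, so subject and object NPs get distinct rules and hence distinct probabilities.

```python
def parent_annotate(tree, parent=None):
    """Relabel every non-terminal with its mother's label (NP under S
    becomes NP^S). A tree is (label, child, ...); a leaf is a string."""
    if isinstance(tree, str):   # leaf: leave the word untouched
        return tree
    label = tree[0] if parent is None else f"{tree[0]}^{parent}"
    return (label,) + tuple(parent_annotate(c, tree[0]) for c in tree[1:])

t = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "it")))
annotated = parent_annotate(t)
# subject NP becomes NP^S, object NP becomes NP^VP
```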
Alt. 2: Lexicalization
History-based model
Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation
Klein and Manning (2003)
- Penn Treebank, WSJ (as usual)
- Input trees: annotated (mother, etc.), or transformed
- Local trees as CFG rules with relative-frequency probabilities
- Parsing: probabilistic CKY
Markovization, aka Collins' histories as trees
- Unary rule (generate the head)
- Binary rules (generate the daughters)
CFG rules
- VP: [VBZ]
- VP: [VBZ] NP
- VP: [VBZ] PP
- From VP → VBZ, VP → VBZ NP, VP → VBZ NP PP, VP → VBZ PP
- Context-free rules: VP: [VBZ] PP etc. are intermediate categories, cf. dotted items
- Count frequencies/probabilities
Dimensions
- Vertical (ancestors): v = 1, only the mother; v = 2, grandmother; etc.
- Horizontal: h = ∞, do not split the RHS; h = 1; h = 2; ...
- E.g.: VP: [VBZ] PP; VP: [VBZ] NP PP; VP: [VBZ] PP PP; VP: [VBZ] NP PP PP; etc.
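The horizontal dimension can be sketched as follows. This is a simplified illustration of the intermediate categories above, not Klein and Manning's actual implementation: sisters are generated outward from the head one at a time, and each intermediate symbol remembers only the last h sisters already generated.

```python
def intermediate_symbols(parent, head_tag, sisters, h=1):
    """Intermediate categories for generating `sisters` to the right of the
    head, remembering only the last h of them (horizontal markovization).
    h=0 forgets everything but the head."""
    syms = [f"{parent}:[{head_tag}]"]        # unary step: generate the head
    recent = []
    for s in sisters:
        recent = (recent + [s])[-h:] if h else []
        syms.append(f"{parent}:[{head_tag}]" + "".join(f" {x}" for x in recent))
    return syms

# With h=1, "VP:[VBZ] NP PP" and "VP:[VBZ] PP PP" collapse into the same
# intermediate category "VP:[VBZ] PP":
syms = intermediate_symbols("VP", "VBZ", ["NP", "PP", "PP"], h=1)
```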
Results
- Unsmoothed means backoff for low frequencies
Refinements
Parent-annotate tags
More
- Split-IN: 6 different tags for the IN tag
- Split-Aux: separate tags for be and have
- Separate tag for % (like )
- Strategy: OK to go down to word level on function words, but not in general
- Split-VP: finite vs. infinitival verbs; mark VP nodes with the head tag
Results: state of the art in 2003
Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation
Corpus variation
- Everybody uses the Penn Treebank, WSJ part
- How representative are the results for:
  - other text genres?
  - other annotation schemes?
Gildea (2001), Collins parser:
Is it the lexicon?
- H – the head; C – a sister of the head
- Hhw – head word of H (and of the mother); Hht – the tag of this word
- Chw – head word of the subtree dominated by C; Cht – the tag of this word; etc.
- The relationship between Chw and Hhw gives the bilexical dependencies
- Expensive (there are many of them)
Simplification
- Original smoothing
- Skip the most specific part
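The smoothing idea can be sketched as linear interpolation of relative-frequency estimates over successively less specific conditioning contexts; dropping the most specific (bilexical) term is then just omitting the first estimate. The weights and probability values below are made-up placeholders, not Collins' or Gildea's actual numbers.

```python
def interpolate(estimates, lambdas):
    """Linear interpolation of relative-frequency estimates, ordered from
    most to least specific context. The lambdas must sum to 1."""
    assert abs(sum(lambdas) - 1.0) < 1e-6
    return sum(l * p for l, p in zip(lambdas, estimates))

# Full model: bilexical estimate (conditioning on the other head word)
# interpolated with two coarser backoff estimates.
full = interpolate([0.30, 0.20, 0.10], [0.2, 0.5, 0.3])

# Simplification: skip the most specific (bilexical) estimate and
# renormalize the weights over the remaining contexts.
simplified = interpolate([0.20, 0.10], [0.7, 0.3])
```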
Results
Today
- The story so far
- Unlexicalized probabilistic parsing
- Corpus variation
- Parser evaluation
Parser evaluation
- No corpus: linguistic constructions covered
- Unannotated corpus: cf. logon\lingo\erg\data\csli.items
  - coverage on corpus
  - average parse base (ambiguity)
  - entropy/perplexity (language model)
- Annotated corpus:
Parser evaluation
- Annotated corpus:
  - POS-tagger accuracy
  - Best-first/ranked consistency
  - Tree similarity
- GEIG: originally only PS-bracketing
  - Unfair: some mistakes are punished too hard
  - Some mistakes are not punished at all with flat trees
- Dependency structure
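The PS-bracketing measure can be sketched as labeled-span precision, recall and F1 over gold and test trees (a simplified illustration of the idea, not the official scoring tool, again using toy tuple-encoded trees). Note how the flat test tree below gets perfect precision: its few brackets all match, illustrating the "flat trees are not punished" complaint on the slide.

```python
def brackets(tree, i=0):
    """Labeled spans (label, start, end) of all constituents.
    A tree is (label, child, ...); a leaf is a string (one word)."""
    if isinstance(tree, str):
        return [], i + 1
    spans, j = [], i
    for child in tree[1:]:
        s, j = brackets(child, j)
        spans += s
    spans.append((tree[0], i, j))
    return spans, j

def parseval(gold, test):
    """Bracketing precision, recall and F1 over labeled spans."""
    g = set(brackets(gold)[0])
    t = set(brackets(test)[0])
    hits = len(g & t)
    p, r = hits / len(t), hits / len(g)
    return p, r, 2 * p * r / (p + r) if p + r else 0.0

gold = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", "it")))
flat = ("S", ("NP", "she"), ("VP", "saw", "it"))   # VP left flat, no V/NP
p, r, f = parseval(gold, flat)
# precision 1.0 (all 3 test brackets match), recall 0.6 (3 of gold's 5)
```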
Purpose?
1. Regression testing of one particular grammar, vs.
2. Comparing different (approaches to) grammars
- Beware of the purpose of the grammar
Briscoe, Carroll, Sanfilippo
- dependent(introducer, head, dependent)
- mod(type, h, d), e.g. mod( , flag, red), mod(with, walk, John)
  - subtypes: cmod, xmod, ncmod (the type may or may not be specified)
- arg_mod(type, head, dependent, initial), e.g. arg_mod(by, kill, Brutus, subj)
- arg(head, dependent): subj(h, d, initial); obj(h, d): dobj, obj2, iobj; clausal(h, d); more
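Evaluating against such relations can be sketched as set comparison over relation tuples; unlike bracketing, a single attachment mistake costs exactly one tuple. The tuples below are my own illustrative encoding (relation name, head word, dependent word), using the "Paul intends to leave IBM" example discussed on the next slide, where Paul is a dependent of both verbs.

```python
def gr_score(gold, system):
    """Precision and recall over grammatical-relation tuples."""
    gold, system = set(gold), set(system)
    hits = len(gold & system)
    return hits / len(system), hits / len(gold)

# Gold relations for "Paul intends to leave IBM": note Paul is the
# subject of both intends and leave (not a single-head structure).
gold = {("subj", "intends", "Paul"), ("xcomp", "intends", "leave"),
        ("subj", "leave", "Paul"), ("dobj", "leave", "IBM")}

# A system output that misses the second subj relation:
sys_out = {("subj", "intends", "Paul"), ("xcomp", "intends", "leave"),
           ("dobj", "leave", "IBM")}

precision, recall = gr_score(gold, sys_out)
# precision 1.0, recall 0.75
```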
The competition
- Similar to dependency structures, but:
  - in "Paul intends to leave IBM", Paul is an argument of both intends and leave
  - not single-head
  - not projective ("moved constituents")
- Right now: what is the best evaluation scheme? Several candidates