Lecture 21: Unlabeled Data For NLP

Transcription

Lecture 21: Unlabeled data for NLP
Intro to NLP, CS585, Fall 2014
http://people.cs.umass.edu/~brenocon/inlp2014/
Brendan O'Connor
Wednesday, November 19, 14

- Project scheduling
- Labeling

What to do when we only have a little bit of labeled data? (Like in the final project!)
- Get more labels
- Different forms of supervision, e.g. tag dictionaries: type-level supervision
- More sophisticated features
- Semi-supervised learning
- Active learning: intelligently choose which unlabeled data to annotate
- Exploit unlabeled data

Unlabeled data
- Labeled data: the human element is costly.
- PTB or ImageNet: the largest labeled datasets, and very successful -- but very expensive! (PTB: 1M tokens; ImageNet: 1M images)
- Small efforts and new problems: typically thousands of tokens.
- But we have huge quantities of unlabeled, raw text. Can we use them somehow?

- 45k tokens (our NER dataset)
- 1M tokens (WSJ PTB)
- 1B tokens (Gigaword: decades of news articles)
- Twitter, web: trillions of tokens

Semi-supervised learning
- Formally: given (1) a small labeled dataset of (x, y) pairs, and (2) a large unlabeled dataset of (x, ?) pairs, learn a better f(x) -> y function than from the labeled data alone.
- Two major approaches:
  1. Learn an unsupervised model on the x's. Use its clusters/vectors as features for labeled training.
  2. Learn a single model on both labeled and unlabeled data together.

Unsupervised NLP
- Can we learn lexical or grammatical structures from unlabeled text?
- Maybe lexical/structural information is a latent variable... like alignments in IBM Model 1.
- (A different use: exploratory data analysis)
- "You shall know a word by the company it keeps" (Firth, J. R. 1957:11)
- Intuition for lexical semantics: the distributional hypothesis.
- Very useful technique: learn word clusters (or other word representations) on unlabeled data, then use them as features in a supervised system.

Distributional example: what types of words can go into these blanks?
- Distributional semantics is based on the idea that words with similar context statistics have similar meaning.
- Assemble sets of words with similar context frequencies: e.g. {he, she, Mary, John}, {it, him, her}, {lol, haha}.
- There are many ways to capture this, including HMMs.
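The distributional idea can be sketched in a few lines of code. This is a toy illustration, not anything from the lecture: the corpus, window size, and function names are all made up. Words that occur in similar contexts end up with similar context-count vectors.

```python
# Toy sketch of the distributional hypothesis: "he" and "she" share
# contexts, so their context-count vectors are more similar to each
# other than to "lol"'s.
from collections import Counter
import math

corpus = ("he went home . she went home . he said yes . she said no . "
          "lol that was funny . lol so funny .").split()

def context_counts(word, tokens, window=1):
    """Count words within +/-window positions of each occurrence of `word`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = lambda c: math.sqrt(sum(v * v for v in c.values()))
    return dot / (norm(a) * norm(b))

he, she, lol = (context_counts(w, corpus) for w in ("he", "she", "lol"))
```

On this tiny corpus, cosine(he, she) comes out much higher than cosine(he, lol), which is exactly the signal that clustering methods like the HMM exploit.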

Brown HMM word clustering
- An HMM for the unlabeled dataset, with a one-class-per-word restriction! (Remember: real-world POS data kinda has this property.)
- Thus each HMM class is described by a hard clustering of words (a set of words).
- Heuristically search for word clusters that maximize likelihood.
- Notation: c is a clustering of word types; c(w) is w's cluster ID.

  c* = argmax_{c in C}  prod_i  p_MLE(c(w_i) | c(w_{i-1})) * p_MLE(w_i | c(w_i))
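The objective on this slide can be evaluated directly for a toy corpus. This is a minimal sketch with made-up names and data; it only scores a fixed clustering, whereas real Brown clustering searches over clusterings.

```python
# Sketch: log-likelihood of a token sequence under a hard clustering,
# following sum_i log p_MLE(c(w_i)|c(w_{i-1})) + log p_MLE(w_i|c(w_i)).
import math
from collections import Counter

def clustering_loglik(tokens, cluster_of):
    cs = [cluster_of[w] for w in tokens]
    trans = Counter(zip(cs, cs[1:]))   # cluster-bigram counts
    prev = Counter(cs[:-1])            # counts of each cluster as a predecessor
    c_tot = Counter(cs)                # cluster unigram counts
    emit = Counter(tokens)             # word unigram counts
    ll = 0.0
    for (c1, c2), n in trans.items():
        ll += n * math.log(n / prev[c1])              # p_MLE(c2 | c1)
    for w, n in emit.items():
        ll += n * math.log(n / c_tot[cluster_of[w]])  # p_MLE(w | c(w))
    return ll

tokens = "the dog runs the cat runs".split()
good = {"the": 0, "dog": 1, "cat": 1, "runs": 2}  # dog/cat share a cluster
bad = {"the": 0, "dog": 1, "cat": 1, "runs": 1}   # verb lumped in with the nouns
```

The "good" clustering scores higher because grouping dog with cat makes the cluster bigrams nearly deterministic, at a small cost in emission probability.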

Hierarchical clustering
- One form of Brown clustering is also hierarchical, through agglomerative clustering: iteratively merge clusters, and track the merge history.
- Initialize: greedily assign words to K clusters.
- Iterate: merge the two clusters that cause the least-worst hit to likelihood.
- (There are many other approaches to this type of HMM; see kcls.html)
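The merge step can be sketched by brute force. This is my own toy code: every pair of clusters is tried and re-scored from scratch, whereas real implementations (such as the Liang 2005 version) use incremental bookkeeping to stay tractable.

```python
# Brute-force sketch of one agglomerative Brown step: merge the pair of
# clusters whose merge loses the least class-bigram likelihood.
import itertools
import math
from collections import Counter

def loglik(tokens, cluster_of):
    # Class-bigram likelihood of the token sequence under a hard clustering
    cs = [cluster_of[w] for w in tokens]
    trans, prev = Counter(zip(cs, cs[1:])), Counter(cs[:-1])
    c_tot, emit = Counter(cs), Counter(tokens)
    ll = sum(n * math.log(n / prev[c1]) for (c1, _), n in trans.items())
    return ll + sum(n * math.log(n / c_tot[cluster_of[w]]) for w, n in emit.items())

def best_merge(tokens, cluster_of):
    """Try merging every pair of cluster IDs; keep the merge with the
    least-worst hit to likelihood."""
    ids = sorted(set(cluster_of.values()))
    scored = []
    for keep, drop in itertools.combinations(ids, 2):
        merged = {w: keep if c == drop else c for w, c in cluster_of.items()}
        scored.append((loglik(tokens, merged), merged))
    return max(scored, key=lambda pair: pair[0])[1]

tokens = "the dog runs the cat runs".split()
start = {"the": 0, "dog": 1, "runs": 2, "cat": 3}  # every word its own cluster
merged = best_merge(tokens, start)                 # one agglomerative step
```

On this toy corpus the first merge joins "dog" and "cat", since they occur in identical contexts; recording each merge yields the hierarchy (and the bit strings) of the next slide.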

Brown clusters
[Figure: an agglomerative merge tree over words, with tree paths labeled by bit strings (e.g. 10011, 10011101); "November" and "October" fall under one shared prefix, and "walk", "run", "sprint" under another.]
- Words merged according to contextual similarity.
- Clusters are equivalent to bit-string prefixes.
- Prefix length determines the granularity of the clustering.
[Slide credit: Terry Koo]

Hierarchical clusters as POS features
- 1000 leaves; cluster prefixes as features for Twitter POS tagging.
- Using the Liang 2005 version of Brown clustering.

Highest weighted clusters:

Cluster prefix | Tag | Types | Most common word in each cluster with prefix
11101010*      | !   | 8160  | lol lmao haha yes yea oh omg aww ah btw wow thanks sorry congrats welcome yay ha hey goodnight hi dear please huh wtf exactly idk bless whatever well ok
1110101100*    | E   | 2798  | x <3 :d :p :) :o :/
111110*        | A   | 6510  | young sexy hot slow dark low interesting easy important safe perfect special different random short quick bad crazy serious stupid weird lucky sad
11000*         | L   | 428   | i'm im you're we're he's there's its it's
1101*          | D   | 378   | the da my your ur our their his
01*            | V   | 29267 | do did kno know care mean hurts hurt say realize believe worry understand forget agree remember love miss hate think thought knew hope wish guess bet have
11101*         | O   | 899   | you yall u it mine everything nothing something anyone someone everyone nobody
100110*        | &   | 103   | or n & and

Cluster viewer: http://www.ark.cs.cmu.edu/TweetNLP/cluster_viewer.html
Tagger, tokenizer, and data all released at:
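A sketch of how such prefix features might be generated: emit the cluster bit string truncated at several lengths, so the tagger sees both coarse and fine granularities. The bit strings and feature names here are illustrative, loosely modeled on the prefixes in the slide.

```python
# Sketch: turn a word's Brown-cluster bit string into tagger features by
# taking prefixes of several lengths. Paths below are illustrative only.
clusters = {"lol": "11101010", "haha": "11101011", "the": "110101"}

def prefix_features(word, lengths=(2, 4, 6, 8)):
    bits = clusters.get(word)
    if bits is None:
        return ["cluster=OOV"]  # word never seen in the unlabeled corpus
    return ["cpre%d=%s" % (n, bits[:n]) for n in lengths if len(bits) >= n]
```

Because "lol" and "haha" share the prefix 111010, they fire the same coarse features even if the tagger never saw one of them in labeled data -- which is the whole point of using unlabeled text here.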

Other examples
- NER (Miller et al., NAACL 2004)
- Dependency parsing (Koo et al. 2008)
[Figure from Miller et al. 2004, "Figure 2: Impact of Word Clustering": F-measure vs. training size (1,000 to 1,000,000), comparing HMM, discriminative, and cluster-based taggers; the cluster-based curve rises from 85.3 (+/-3.3) to 93.3 (+/-0.9). The accompanying text is a learning-curve analysis of active learning: their Figure 3 shows (a) discriminative tagger performance without cluster features, (b) the same tagger using active learning, (c) the discriminative tagger with cluster features, and (d) the discriminative tagger with cluster features and active learning.]
[Table from Koo et al. 2008: training size 6,000 -> 91.1; 32,000 -> 92.1; 39,832 -> 92.4. The POS tagger uses the same training corpus as the parser.]

Brown clusters as features
- Have been shown useful for POS, NER, dependency parsing (others?).
- More generally: use automatically learned word representations. Next week: vector-valued representations.
- I think word representations are the most established use of unlabeled data for NLP systems.
- See also:

Semi-supervised learning
- Formally: given (1) a small labeled dataset of (x, y) pairs, and (2) a large unlabeled dataset of (x, ?) pairs, learn a better f(x) -> y function than from the labeled data alone.
- Two major approaches:
  1. Learn an unsupervised model on the x's. Use its clusters/vectors as features for labeled training.
  2. Learn a single model on both labeled and unlabeled data together.

EM for semi-supervised learning
- We have (1) a small labeled dataset of (x, y) pairs and (2) a large unlabeled dataset of (x, ?) pairs.
- Treat the missing labels as latent variables. Learn with EM!
  - Init: train the model on the labeled data.
  - E-step: make soft predictions on the unlabeled data.
  - M-step: maximize the labeled log-likelihood, PLUS the log-likelihood weighted according to our new soft predictions. So the entire unlabeled dataset is part of the training set.
- Issues:
  - Have to re-weight the M-step (what if the unlabeled data is 1 million times bigger?)
  - Can go off the rails.
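Here is a toy numerical sketch of this recipe. The model and data are my own invention (two 1-D Gaussian classes with known variance); `lam` is the M-step re-weighting the slide warns about, down-weighting the much larger unlabeled set.

```python
# EM-style semi-supervised learning on a toy model: init from labels,
# E-step soft-labels the unlabeled pool, M-step re-fits class means with
# labeled points at weight 1 and unlabeled points at weight lam.
import numpy as np

rng = np.random.default_rng(0)
x_lab = np.array([-2.1, -1.9, 2.0, 2.2])   # tiny labeled set
y_lab = np.array([0, 0, 1, 1])
x_unl = np.concatenate([rng.normal(-2, 0.5, 200), rng.normal(2, 0.5, 200)])

# Init: train on labeled data only (here, just the two class means)
mu = np.array([x_lab[y_lab == 0].mean(), x_lab[y_lab == 1].mean()])
sigma, lam = 1.0, 0.1

for _ in range(20):
    # E-step: soft class posteriors for every unlabeled point
    logp = -0.5 * ((x_unl[:, None] - mu[None, :]) / sigma) ** 2
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # M-step: weighted MLE over labeled (weight 1) + unlabeled (weight lam)
    for k in (0, 1):
        hard = (y_lab == k).astype(float)
        mu[k] = (((hard * x_lab).sum() + lam * (post[:, k] * x_unl).sum())
                 / (hard.sum() + lam * post[:, k].sum()))
```

With `lam = 1.0` the 400 unlabeled points would swamp the 4 labeled ones, which is the "can go off the rails" failure mode in miniature.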

Self-training
- Same setup, but only add in a small number of highly-confident examples:
  - Label all unlabeled x's. Choose the top-10 most confident (and/or those above 99% confidence).
  - Add those 10 to the labeled dataset.
  - Re-train and iterate.
- E.g., the best parsers use self-training.
- Many examples of this being useful -- you may have to limit the number of iterations and/or play with thresholds.
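A self-training loop might look like the sketch below. The data and the deliberately simple nearest-centroid classifier (with distance margin as the confidence proxy) are my own assumptions; the top-10 rule follows the slide.

```python
# Self-training sketch: repeatedly pseudo-label the unlabeled pool,
# promote the 10 most confident predictions into the labeled set, re-train.
import numpy as np

def centroids(X, y):
    # "Training" the toy classifier: one centroid per class
    return np.stack([X[y == k].mean(axis=0) for k in (0, 1)])

def predict_with_confidence(X, cents):
    d = np.linalg.norm(X[:, None, :] - cents[None, :, :], axis=2)
    return d.argmin(axis=1), np.abs(d[:, 0] - d[:, 1])  # margin = confidence

rng = np.random.default_rng(1)
X_lab = np.array([[-2.0, 0.0], [2.0, 0.0]])  # one seed example per class
y_lab = np.array([0, 1])
X_unl = np.concatenate([
    np.column_stack([rng.normal(-2, 0.4, 50), rng.normal(0, 0.4, 50)]),
    np.column_stack([rng.normal(2, 0.4, 50), rng.normal(0, 0.4, 50)]),
])

for _ in range(5):                        # limit the number of iterations
    pred, conf = predict_with_confidence(X_unl, centroids(X_lab, y_lab))
    top = np.argsort(-conf)[:10]          # the 10 most confident predictions
    X_lab = np.concatenate([X_lab, X_unl[top]])
    y_lab = np.concatenate([y_lab, pred[top]])
    X_unl = np.delete(X_unl, top, axis=0)
```

Promoting only the most confident points first is what keeps the loop from going off the rails: early pseudo-labels are the ones least likely to be wrong.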

Active learning
- You want to label more data. Use your current classifier to help choose the most useful examples to annotate.
- Uncertainty sampling: choose the example where the model is most uncertain. (If binary: closest to 50% predicted probability. If multiclass: highest entropy.)
[Figure 2.2, "Uncertainty sampling with a toy data set": (a) 400 instances sampled evenly from two class Gaussians, represented as points in a 2D input space. (b) A logistic regression model trained with 30 labeled instances randomly drawn from the problem domain; the line represents the decision boundary of the classifier. (c) A logistic regression model trained with 30 actively queried instances using uncertainty sampling.]
- My take: some people in industry swear by AL, but I haven't seen many research papers showing dramatic gains from it. Not sure why the difference. See the review at http://burrsettles.com/
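The selection rule itself is tiny: given predicted class distributions, query the example with the highest entropy (for binary problems this coincides with closest-to-50%). The probabilities below are made up for illustration.

```python
# Uncertainty sampling sketch: pick the unlabeled example whose predicted
# class distribution has the highest entropy.
import numpy as np

def pick_query(probs):
    """probs: (n_examples, n_classes) predicted distributions."""
    ent = -(probs * np.log(np.clip(probs, 1e-12, None))).sum(axis=1)
    return int(ent.argmax())

probs = np.array([[0.99, 0.01],   # model is sure -> annotating adds little
                  [0.55, 0.45],   # near 50/50 -> most informative query
                  [0.90, 0.10]])
```

The annotator's effort then goes to the example the current model knows least about, rather than to examples it already classifies confidently.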
