Chapter 5 - Categorizing And Tagging Words

Transcription

Chapter 5 - Categorizing and Tagging Words
Jianzhang Zhang
Alibaba Business School
Hangzhou Normal University
May 25, 2022

Table of Contents

1. Using a Tagger
2. Tagged Corpora
3. Mapping Words to Properties Using Python Dictionaries
4. Automatic Tagging
5. N-Gram Tagging
6. Transformation-Based Tagging
7. How to Determine the Category of a Word

zjzhang (HZNU), Text Mining, May 25, 2022

1. Using a Tagger

Part-of-speech tagging (POS tagging): the process of classifying words into their parts of speech and labeling them accordingly.

Parts of speech are also known as word classes or lexical categories.

POS tagging is the second step in the typical NLP pipeline, following tokenization.

A POS tagger processes a sequence of words and attaches a part-of-speech tag to each word.

    import nltk
    from nltk import word_tokenize

    # ['And', 'now', 'for', 'something', 'completely', 'different']
    text = word_tokenize("And now for something completely different")

    # [('And', 'CC'), ('now', 'RB'), ..., ('different', 'JJ')]
    print(nltk.pos_tag(text))

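In the output above:

- and is CC, a coordinating conjunction;
- now and completely are RB, or adverbs;
- for is IN, a preposition;
- something is NN, a noun;
- different is JJ, an adjective.

The jieba library provides POS tagging for Chinese text:

    import jieba
    import jieba.posseg as pseg

    # For the meaning of jieba's POS tags, see the jieba documentation.
    # The input sentence means roughly: "Continue to treat resolving the
    # 'three rural' issues as the top priority of the whole Party's work."
    for word, pos in pseg.cut('坚持把解决好“三农”问题作为全党工作重中之重'):
        print(word, pos)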

Let's look at another example, this time including some homonyms (words with the same spelling but different meanings, such as refuse and permit).

    # Meaning: they decline to allow us to obtain the garbage permit
    text = word_tokenize("They refuse to permit us to obtain the refuse permit")

    # [('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'),
    #  ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'),
    #  ('refuse', 'NN'), ('permit', 'NN')]
    nltk.pos_tag(text)

refUSE is a verb meaning "deny," while REFuse is a noun meaning "trash" (the noun is stressed on the first syllable, the verb on the second). The pronunciation therefore depends on the part of speech. For this reason, text-to-speech systems usually perform POS tagging.

2. Tagged Corpora

Representing Tagged Tokens

By convention in NLTK, a tagged token is represented using a tuple consisting of the token and the tag.

    tagged_token = nltk.tag.str2tuple('fly/NN')
    # ('fly', 'NN')
    print(tagged_token)

    sent = '''
    The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
    other/AP topics/NNS ,/, among/IN them/PPO the/AT Atlanta/NP and/CC
    Fulton/NP-tl County/NN-tl purchasing/VBG departments/NNS ./.
    '''
    [nltk.tag.str2tuple(t) for t in sent.split()]
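The string-to-tuple convention above can be imitated in a few lines of plain Python. This is only a sketch, not NLTK's implementation: the helper name `parse_tagged_token` is hypothetical, and it assumes the tag is whatever follows the last slash, upper-cased.

```python
def parse_tagged_token(text, sep="/"):
    """Split 'word/TAG' at the last separator into a (word, tag) tuple."""
    word, found, tag = text.rpartition(sep)
    if not found:
        # No separator present: treat the whole string as an untagged word.
        return (text, None)
    return (word, tag.upper())


sent = "The/AT grand/JJ jury/NN commented/VBD ./."
tagged = [parse_tagged_token(t) for t in sent.split()]
print(tagged)
```

Splitting at the last separator rather than the first keeps tokens such as `./.` intact.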

Reading Tagged Corpora

Several of the corpora included with NLTK have been tagged for their part of speech.

    # [('The', 'AT'), ('Fulton', 'NP-TL'), ...]
    nltk.corpus.brown.tagged_words()

    # [('The', 'DET'), ('Fulton', 'NOUN'), ...]
    nltk.corpus.brown.tagged_words(tagset='universal')

    nltk.corpus.nps_chat.tagged_words()
    nltk.corpus.conll2000.tagged_words()
    nltk.corpus.treebank.tagged_words()

A Universal Part-of-Speech Tagset

Tagged corpora use many different conventions for tagging words. To help us get started, we will be looking at a simplified tagset, shown in Figure 1.

Figure 1: Universal Part-of-Speech Tagset

Let's see which of these tags are the most common in the news category of the Brown corpus:

    from nltk.corpus import brown

    brown_news_tagged = brown.tagged_words(categories='news', tagset='universal')
    tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
    tag_fd.most_common()
    tag_fd.plot(cumulative=True)  # about 80% of the words carry one of the five most frequent tags
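What the cumulative plot shows can be computed by hand. The sketch below uses a small made-up list of (word, tag) pairs in place of `brown_news_tagged` (an assumption made so that it runs without the corpus) and accumulates the share of tokens covered by the most frequent tags.

```python
from collections import Counter

# Hypothetical toy data in (word, tag) form, standing in for brown_news_tagged.
toy_tagged = [
    ("The", "DET"), ("jury", "NOUN"), ("said", "VERB"), ("the", "DET"),
    ("election", "NOUN"), ("was", "VERB"), ("fair", "ADJ"), (".", "."),
]

tag_counts = Counter(tag for (_, tag) in toy_tagged)
total = sum(tag_counts.values())

# Walk the tags from most to least frequent, accumulating coverage.
running = 0
coverage = []
for tag, count in tag_counts.most_common():
    running += count
    coverage.append((tag, running / total))

for tag, share in coverage:
    print(f"{tag:5s} {share:.2f}")
```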

Nouns

Nouns generally refer to people, places, things, or concepts.

The simplified noun tags are N for common nouns like book, and NP for proper nouns like Scotland (the universal tagset used in the code here labels both as NOUN).

Let's inspect some tagged text to see what parts of speech occur before a noun, with the most frequent ones first.

    word_tag_pairs = list(nltk.bigrams(brown_news_tagged))
    noun_preceders = [a[1] for (a, b) in word_tag_pairs if b[1] == 'NOUN']
    fdist = nltk.FreqDist(noun_preceders)
    fdist.most_common()
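The bigram step above pairs each tagged token with its successor. A minimal sketch of the same idea in plain Python, using `zip` in place of `nltk.bigrams` and a made-up sentence (both are assumptions for illustration):

```python
from collections import Counter

# Hypothetical toy data with universal tags.
toy_tagged = [
    ("The", "DET"), ("grand", "ADJ"), ("jury", "NOUN"), ("praised", "VERB"),
    ("the", "DET"), ("city", "NOUN"), ("election", "NOUN"), (".", "."),
]

# Pair each token with its successor.
pairs = list(zip(toy_tagged, toy_tagged[1:]))

# Keep the tag of the first item whenever the second item is a noun.
noun_preceders = Counter(a[1] for (a, b) in pairs if b[1] == "NOUN")
print(noun_preceders.most_common())
```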

Verbs

Verbs are words that describe events and actions.

What are the most common verbs in news text? Let's sort all the verbs by frequency:

    # [('Pierre', 'NOUN'), ('Vinken', 'NOUN'), (',', '.'), ...]
    wsj = nltk.corpus.treebank.tagged_words(tagset='universal')
    word_tag_fd = nltk.FreqDist(wsj)
    [wt[0] for (wt, _) in word_tag_fd.most_common() if wt[1] == 'VERB']

    # Find words which can be both VBD and VBN. This needs the original
    # Penn Treebank tags, because the universal tagset merges both into VERB.
    wsj_ptb = nltk.corpus.treebank.tagged_words()
    cfd1 = nltk.ConditionalFreqDist(wsj_ptb)
    [w for w in cfd1.conditions() if 'VBD' in cfd1[w] and 'VBN' in cfd1[w]]

Adjectives and Adverbs

Adjectives describe nouns, and can be used as modifiers (e.g. large in the large pizza), or in predicates (e.g. the pizza is large).

Adverbs modify verbs to specify the time, manner, place or direction of the event described by the verb (e.g. quickly in the stocks fell quickly). Adverbs may also modify adjectives (e.g. really in Mary's teacher was really nice).

English has several categories of closed class words in addition to prepositions, such as articles (also often called determiners) (e.g., the, a), modals (e.g., should, may), and personal pronouns (e.g., she, they).

Exploring Tagged Corpora

Let's look at some larger context, and find words involving particular sequences of tags and words.

    from nltk.corpus import brown

    # The pattern is: verb, "to", verb
    def process(sentence):
        for (w1, t1), (w2, t2), (w3, t3) in nltk.trigrams(sentence):
            if (t1.startswith('V') and t2 == 'TO' and t3.startswith('V')):
                print(w1, w2, w3)

    for tagged_sent in brown.tagged_sents():
        process(tagged_sent)
    # seek to set
    # like to see
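The same pattern search can be tried without the corpus. In this sketch the sliding-window helper `windows` and the toy sentence are assumptions for illustration; the matching condition is the one used in `process` above, except that matches are collected in a list instead of printed.

```python
def windows(seq, n):
    """Yield every run of n consecutive items from seq."""
    for i in range(len(seq) - n + 1):
        yield tuple(seq[i:i + n])


def verb_to_verb(tagged_sentence):
    """Collect (w1, w2, w3) triples whose tags match verb, TO, verb."""
    hits = []
    for (w1, t1), (w2, t2), (w3, t3) in windows(tagged_sentence, 3):
        if t1.startswith("V") and t2 == "TO" and t3.startswith("V"):
            hits.append((w1, w2, w3))
    return hits


# Hypothetical sentence with Brown-style tags.
toy_sentence = [
    ("They", "PPSS"), ("like", "VB"), ("to", "TO"), ("see", "VB"),
    ("new", "JJ"), ("films", "NNS"), (".", "."),
]
print(verb_to_verb(toy_sentence))
```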

3. Mapping Words to Properties Using Python Dictionaries

Most often, we are mapping from a "word" to some structured object. For example, a document index maps from a word (which we can represent as a string) to a list of pages (represented as a list of integers).

Dictionaries in Python:

- Defining Dictionaries
- Default Dictionaries
- Incrementally Updating a Dictionary
- Complex Keys and Values
- Inverting a Dictionary

For the above topics, please refer to the content of Chapter 4 in my programming basics course.
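As a quick reminder of the topics listed above, here is a minimal sketch that touches each of them. All data in it is made up for illustration, including the document index from the example.

```python
from collections import defaultdict

# A document index: word -> list of page numbers (hypothetical data).
occurrences = [("tagger", 4), ("corpus", 8), ("tagger", 19), ("backoff", 22)]

index = defaultdict(list)        # default dictionary: missing keys start as []
for word, page in occurrences:
    index[word].append(page)     # incremental update

# A complex key: (previous tag, word) -> tag.
context_tags = {("TO", "permit"): "VB", ("DT", "permit"): "NN"}

# Inverting a dictionary: tag -> list of words carrying that tag.
word_tags = {"the": "DET", "a": "DET", "jury": "NOUN"}
tag_words = defaultdict(list)
for word, tag in word_tags.items():
    tag_words[tag].append(word)

print(dict(index))
print(dict(tag_words))
```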

4. Automatic Tagging

The Default Tagger

We can create a tagger that assigns the same POS tag (e.g. NN) to every token. It establishes an important baseline for tagger performance.

    brown_tagged_sents = brown.tagged_sents(categories='news')

    tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
    nltk.FreqDist(tags).max()  # NN, the most frequent tag

    # Use NN as the tag assigned by the default tagger
    default_tagger = nltk.DefaultTagger('NN')

    # 0.13 (recent NLTK releases prefer the name accuracy() over evaluate())
    default_tagger.evaluate(brown_tagged_sents)

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns. The patterns are tried in order, and the first one that matches determines the tag.

    patterns = [
        (r'.*ing$', 'VBG'),                # gerunds
        (r'.*ed$', 'VBD'),                 # simple past
        (r'.*es$', 'VBZ'),                 # 3rd singular present
        (r'.*ould$', 'MD'),                # modals
        (r'.*\'s$', 'NN$'),                # possessive nouns
        (r'.*s$', 'NNS'),                  # plural nouns
        (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
        (r'.*', 'NN'),                     # nouns (default)
    ]
    regexp_tagger = nltk.RegexpTagger(patterns)

    # 0.20
    regexp_tagger.evaluate(brown_tagged_sents)
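To see how the ordered pattern list behaves, the matching loop can be written out with the standard `re` module. This is a sketch of the idea rather than NLTK's `RegexpTagger` itself; the function name `regexp_tag` is hypothetical.

```python
import re

# (pattern, tag) pairs, tried in order; the first matching pattern wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd singular present
    (r".*ould$", "MD"),                # modals
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # nouns (default)
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]


def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


print(regexp_tag(["running", "jumped", "would", "boy's", "cats", "3.14", "tree"]))
```

Because the final pattern matches anything, every token receives a tag, and the order of the list decides between overlapping patterns such as the possessive and the plural.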

The Lookup Tagger

Let's find the hundred most frequent words and store their most likely tag.

    fd = nltk.FreqDist(brown.words(categories='news'))
    cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

    # The 100 most frequent words
    most_freq_words = fd.most_common(100)

    # The most likely tag for each of these words
    likely_tags = dict((word, cfd[word].max()) for (word, _) in most_freq_words)
    baseline_tagger = nltk.UnigramTagger(model=likely_tags)

    # Accuracy reaches 45%, 25 percentage points above the regular expression tagger
    baseline_tagger.evaluate(brown_tagged_sents)  # 0.45

    sent = brown.sents(categories='news')[3]
    # [('``', '``'), ('Only', None), ('a', 'AT'), ...]
    print(baseline_tagger.tag(sent))

Many words have been assigned a tag of None, because they were not among the 100 most frequent words.

In these cases we would like to assign the default tag of NN.

In other words, we want to use the lookup table first, and if it is unable to assign a tag, then use the default tagger, a process known as backoff.

    # Evaluate the accuracy of the lookup tagger combined with the default tagger
    def performance(cfd, wordlist):
        lt = dict((word, cfd[word].max()) for word in wordlist)
        baseline_tagger = nltk.UnigramTagger(model=lt,
                                             backoff=nltk.DefaultTagger('NN'))
        return baseline_tagger.evaluate(brown.tagged_sents(categories='news'))

    baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                         backoff=nltk.DefaultTagger('NN'))

    # Words sorted from most to least frequent
    words_by_freq = [w for (w, _) in fd.most_common()]

    # 13 percentage points above the lookup tagger alone
    performance(cfd, words_by_freq[:100])  # 0.58
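The lookup-then-default idea can also be sketched in plain Python. The toy training data and the helper names `build_lookup` and `tag_with_backoff` below are assumptions for illustration; with real data the lookup table would come from the Brown news words as above.

```python
from collections import Counter, defaultdict

# Hypothetical toy training data in (word, tag) form.
train = [
    ("the", "AT"), ("jury", "NN"), ("said", "VBD"), ("the", "AT"),
    ("report", "NN"), ("the", "AT"), ("said", "VBD"), ("report", "VB"),
    ("report", "NN"),
]

word_freq = Counter(word for word, _ in train)
tags_per_word = defaultdict(Counter)
for word, tag in train:
    tags_per_word[word][tag] += 1


def build_lookup(n):
    """Map each of the n most frequent words to its most frequent tag."""
    return {w: tags_per_word[w].most_common(1)[0][0]
            for w, _ in word_freq.most_common(n)}


def tag_with_backoff(tokens, lookup, default="NN"):
    """Use the lookup table first; fall back to the default tag."""
    return [(t, lookup.get(t, default)) for t in tokens]


lookup = build_lookup(2)
print(lookup)
print(tag_with_backoff(["the", "jury", "said", "report"], lookup))
```

With only two words in the table, said falls through to the default tag and is tagged NN incorrectly. Enlarging the table to three words fixes it, which mirrors the trade-off between table size and accuracy.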

5. N-Gram Tagging

Unigram Tagging

A unigram tagger assigns each token the tag that is most likely for that particular token. It behaves just like a lookup tagger.

We train a UnigramTagger by specifying tagged sentence data as a parameter when we initialize the tagger.

The training process involves inspecting the tag of each word and storing the most likely tag for each word in a dictionary.

    brown_tagged_sents = brown.tagged_sents(categories='news')
    unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

Separating the Training and Testing Data

Evaluating a tagger on the same data it was trained on overstates its accuracy. By dividing the data, we get a better picture of the usefulness of the tagger, i.e. its performance on previously unseen text.

    # 9:1 split between training and test data
    size = int(len(brown_tagged_sents) * 0.9)
    train_sents = brown_tagged_sents[:size]
    test_sents = brown_tagged_sents[size:]
    unigram_tagger = nltk.UnigramTagger(train_sents)

    # 0.81
    unigram_tagger.evaluate(test_sents)
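The effect of the split can be reproduced at toy scale. The sketch below trains a dictionary-based unigram model on the first part of a tiny made-up corpus and scores it on both parts. The helper names `train_unigram` and `accuracy` and the 3:1 split are assumptions chosen so the example stays small.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Return a dict mapping each word to its most frequent training tag."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def accuracy(model, tagged_sents):
    """Share of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in tagged_sents:
        for word, gold in sent:
            correct += model.get(word) == gold
            total += 1
    return correct / total


# Hypothetical toy corpus of tagged sentences.
corpus = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "AT"), ("dog", "NN"), ("sleeps", "VBZ")],
    [("the", "AT"), ("bird", "NN"), ("barks", "VBZ")],
]

size = int(len(corpus) * 0.75)  # 3:1 here, in the spirit of the 9:1 split
train_sents, test_sents = corpus[:size], corpus[size:]

model = train_unigram(train_sents)
print(accuracy(model, train_sents))  # scored on data the model has seen
print(accuracy(model, test_sents))   # scored on held-out data
```

The held-out score is lower because bird never occurs in the training portion, so the model has no tag for it.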

General N-Gram Tagging

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens (the original slide illustrates this context with a figure).

An n-gram tagger picks the tag that is most likely in the given context.

    brown_sents = brown.sents(categories='news')
    bigram_tagger = nltk.BigramTagger(train_sents)

    # The tagger saw this sentence during training
    print(bigram_tagger.tag(brown_sents[2007]))

    unseen_sent = brown_sents[4203]
    print(bigram_tagger.tag(unseen_sent))
    # [('The', 'AT'), ('population', 'NN'), ('of', 'IN'), ('the', 'AT'),
    #  ('Congo', 'NP'), ('is', 'BEZ'), ('13.5', None), ('million', None),
    #  (',', None), ('divided', None), ('into', None), ('at', None),
    #  ('least', None), ('seven', None), ('major', None), ('``', None),
    #  ('culture', None), ('clusters', None), ("''", None), ('and', None),
    #  ('innumerable', None), ('tribes', None), ('speaking', None),
    #  ('400', None), ('separate', None), ('dialects', None), ('.', None)]

    bigram_tagger.evaluate(test_sents)

A word that was never seen during training can only be tagged None. The words that follow it are then tagged None as well, because the training data never contained a None tag as context. As n grows, this data sparsity problem becomes more severe.

Combining Taggers

We could combine the results of a bigram tagger, a unigram tagger, and a default tagger, as follows:

1. Try tagging the token with the bigram tagger;
2. If the bigram tagger is unable to find a tag for the token, try the unigram tagger;
3. If the unigram tagger is also unable to find a tag, use a default tagger.

    t0 = nltk.DefaultTagger('NN')
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)

    # Better than using the unigram tagger alone
    t2.evaluate(test_sents)  # 0.84
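The three steps above, and the way a single None spreads through a bigram-only tagger, can be sketched without NLTK. Everything here is an illustrative assumption: the toy training sentences, the `START` marker for the beginning of a sentence, and the helper names. NLTK organizes backoff as a chain of tagger objects, whereas this sketch folds the chain into one function.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker meaning "no previous tag yet"


def most_likely(table):
    """Reduce {context: Counter of tags} to {context: most frequent tag}."""
    return {key: counts.most_common(1)[0][0] for key, counts in table.items()}


def train(tagged_sents):
    uni = defaultdict(Counter)  # word -> tag counts
    bi = defaultdict(Counter)   # (previous tag, word) -> tag counts
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            uni[word][tag] += 1
            bi[(prev, word)][tag] += 1
            prev = tag
    return most_likely(uni), most_likely(bi)


def tag(tokens, uni, bi, backoff=True, default="NN"):
    tagged, prev = [], START
    for word in tokens:
        guess = bi.get((prev, word))        # 1. bigram context
        if guess is None and backoff:
            guess = uni.get(word, default)  # 2. unigram, then 3. default
        tagged.append((word, guess))
        prev = guess                        # a None here spoils the next context
    return tagged


# Hypothetical toy training data.
train_sents = [
    [("the", "AT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "AT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
uni, bi = train(train_sents)

tokens = ["the", "bird", "sleeps"]
bigram_only = tag(tokens, uni, bi, backoff=False)
combined = tag(tokens, uni, bi)
print(bigram_only)
print(combined)
```

Without backoff, the unknown word bird is tagged None and so is sleeps, although sleeps was seen in training. With backoff, bird receives the default NN, which restores a usable context for sleeps.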

Tagging Unknown Words

A useful method to tag unknown words (out-of-vocabulary items, OOV) based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK.

During training, a unigram tagger will probably learn that UNK is usually a noun.

However, the n-gram taggers will detect contexts in which it has some other tag.

For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.

Quiz: please modify the combined tagger above based on this idea, and evaluate its performance.
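As a starting point for the quiz, the vocabulary-limiting step alone might look like the sketch below. The helper names `limit_vocabulary` and `map_tokens` and the toy data are assumptions; training and evaluating the combined tagger on the mapped data is left to the reader.

```python
from collections import Counter


def limit_vocabulary(tagged_sents, n, unk="UNK"):
    """Replace every word outside the n most frequent with the unk token."""
    freq = Counter(word for sent in tagged_sents for word, _ in sent)
    vocab = {word for word, _ in freq.most_common(n)}
    mapped = [[(w if w in vocab else unk, t) for w, t in sent]
              for sent in tagged_sents]
    return mapped, vocab


def map_tokens(tokens, vocab, unk="UNK"):
    """Apply the same replacement to untagged text before tagging it."""
    return [w if w in vocab else unk for w in tokens]


# Hypothetical toy training data.
train_sents = [
    [("to", "TO"), ("run", "VB")],
    [("to", "TO"), ("walk", "VB")],
    [("the", "AT"), ("run", "NN")],
    [("to", "TO"), ("the", "AT")],
]

mapped, vocab = limit_vocabulary(train_sents, 3)
print(mapped)
print(map_tokens(["to", "swim"], vocab))
```

The same vocabulary has to be applied at tagging time, which is why `map_tokens` reuses the `vocab` set returned by `limit_vocabulary`.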

Storing Taggers

Training a tagger on a large corpus takes time, so a trained tagger can be saved to a file and reloaded later.

    # Dump the trained tagger to a file
    from pickle import dump
    with open('t2.pkl', 'wb') as output:
        dump(t2, output, -1)

    # Load the tagger from the file
    from pickle import load
    with open('t2.pkl', 'rb') as f:
        tagger = load(f)

    text = """The board's action shows what free enterprise is up
    against in our complex maze of regulatory laws ."""
    tokens = text.split()
    print(tagger.tag(tokens))

Performance Limitations

A way to investigate the performance of a tagger is to study its mistakes (error analysis). Some tags may be harder than others to assign, and it might be possible to treat them specially by pre- or post-processing the data.

Another convenient way to look at tagging errors is the confusion matrix. It charts expected tags (the gold standard) against actual tags generated by a tagger.

Figure 2: Confusion Matrix Example
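A confusion matrix is simply a count of (gold tag, predicted tag) pairs. The sketch below builds one from two short made-up tag sequences; with a real tagger, the gold tags would come from the test sentences and the predictions from tagging the same words.

```python
from collections import Counter

# Hypothetical gold-standard tags and tagger output for the same six tokens.
gold = ["AT", "NN", "VBD", "TO", "VB", "NN"]
predicted = ["AT", "NN", "VBN", "TO", "NN", "NN"]

# Count every (gold tag, predicted tag) pair.
confusion = Counter(zip(gold, predicted))

# Off-diagonal cells are the tagging errors.
errors = {pair: n for pair, n in confusion.items() if pair[0] != pair[1]}

# Print the matrix: one row per gold tag, one column per predicted tag.
tags = sorted(set(gold) | set(predicted))
print("gold\\pred", *tags, sep="\t")
for g in tags:
    print(g, *(confusion[(g, p)] for p in tags), sep="\t")
print(errors)
```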

6. Transformation-Based Tagging

N-gram taggers have two main shortcomings: the model (the table of contexts) becomes very large, and the contexts are sparse.

Brill tagging is a kind of transformation-based learning: guess the tag of each word, then go back and fix the mistakes.

The rules used in the Brill tagger are learned from annotated training data. The rules are linguistically interpretable.

We will examine the operation of two rules: (a) replace NN with VB when the previous word is TO; (b) replace TO with IN when the next tag is NNS.

Figure 3: Steps in Brill Tagging
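The two rules above can be applied by hand to an initial guess. In this sketch the rules are hard-coded rather than learned, and the initial tagging is a made-up example of the kind a unigram tagger might produce, so it only illustrates the "guess, then fix" step, not the Brill learning procedure.

```python
def apply_rules(tagged):
    """Apply two hand-written transformation rules, one after the other."""
    words = [w for w, _ in tagged]
    tags = [t for _, t in tagged]

    # Rule (a): NN -> VB when the previous word is tagged TO.
    for i in range(1, len(tags)):
        if tags[i] == "NN" and tags[i - 1] == "TO":
            tags[i] = "VB"

    # Rule (b): TO -> IN when the next tag is NNS.
    for i in range(len(tags) - 1):
        if tags[i] == "TO" and tags[i + 1] == "NNS":
            tags[i] = "IN"

    return list(zip(words, tags))


# Hypothetical initial guess: "increase" as NN and both "to" as TO.
initial = [
    ("to", "TO"), ("increase", "NN"), ("grants", "NNS"),
    ("to", "TO"), ("states", "NNS"),
]
print(apply_rules(initial))
```

Rule (a) turns increase into a verb, and rule (b) turns the second to into a preposition, while the first to keeps its tag because a verb now follows it.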

7. How to Determine the Category of a Word

How do we decide what category a word belongs to in the first place?

In general, linguists use morphological, syntactic, and semantic clues to determine the category of a word.

- Syntactic clue: adjectives usually occur before a noun, or after be or very;
- Semantic clue: the Dutch word verjaardag means the same as the English word birthday, so it is a noun;
- Nouns are called an open class, while prepositions are regarded as a closed class.

THE END
