Transcription
Text Classification and Naïve Bayes: The Task of Text Classification
Is this spam?
Who wrote which Federalist papers?
1787-8: anonymous essays tried to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the letters in dispute.
1963: solved by Mosteller and Wallace using Bayesian methods.
Male or female author?
1. "By 1925 present-day Vietnam was divided into three parts... the Mekong delta was the colony of Cochin-China; ... was the protectorate of Annam..."
2. "...her greatest shame into one of her greatest assets..."
(S. Argamon, M. Koppel, J. Fine, A. R. Shimoni, 2003. "Gender, Genre, and Writing Style in Formal Written Texts," Text, volume 23, number 3, pp. 321-346)
Positive or negative movie review?
"unbelievably disappointing"
"Full of zany characters and richly applied satire, and some great plot twists"
"this is the greatest screwball comedy ever filmed"
What is the subject of this article? The MeSH Subject Category Hierarchy: a MEDLINE article is assigned subject categories such as Antagonists and Inhibitors, ..., Epidemiology.
Text Classification: authorship identification, language identification, sentiment analysis, ...
Text Classification: definition
Input: a document d; a fixed set of classes C = {c1, c2, ..., cJ}
Output: a predicted class c ∈ C
Classification Methods: Hand-coded rules
Rules based on combinations of words or other features
  spam: black-list-address OR ("dollars" AND "have been selected")
Accuracy can be high, if rules are carefully refined by an expert
But building and maintaining these rules is expensive
Classification Methods: Supervised Machine Learning
Input: a document d; a fixed set of classes C = {c1, c2, ..., cJ}; a training set of m hand-labeled documents (d1, c1), ..., (dm, cm)
Output: a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
Any kind of classifier: Naïve Bayes, Logistic regression, Support-vector machines, k-Nearest Neighbors
Text Classification and Naïve Bayes: The Task of Text Classification
Text Classification and Naïve Bayes: Naïve Bayes (I)
Naïve Bayes Intuition
Simple ("naïve") classification method based on Bayes rule
Relies on a very simple representation of the document: Bag of words
The Bag of Words Representation
Example review: "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun. It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"
As a bag of words (word: count):
it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, recommend 1, happy 1
Text Classification and Naïve Bayes: Naïve Bayes (I)
Formalizing the Naïve Bayes Classifier
Bayes' Rule Applied to Documents and Classes
For a document d and a class c:
P(c|d) = P(d|c) P(c) / P(d)
Naïve Bayes Classifier (I)
c_MAP = argmax_{c ∈ C} P(c|d)   (MAP is "maximum a posteriori" = most likely class)
      = argmax_{c ∈ C} P(d|c) P(c) / P(d)   (Bayes Rule)
      = argmax_{c ∈ C} P(d|c) P(c)   (dropping the denominator)
Naïve Bayes Classifier (II)
c_MAP = argmax_{c ∈ C} P(d|c) P(c)
      = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
Document d represented as features x1 ... xn
Naïve Bayes Classifier (IV)
c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
How often does this class occur? We can just count the relative frequencies in a corpus.
Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, ..., xn | c)
Bag of Words assumption: assume position doesn't matter
Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c
P(x1, ..., xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · ... · P(xn|c)
Multinomial Naïve Bayes Classifier
c_MAP = argmax_{c ∈ C} P(x1, x2, ..., xn | c) P(c)
c_NB = argmax_{c ∈ C} P(c_j) ∏_{x ∈ X} P(x|c)
Applying Multinomial Naive Bayes Classifiers to Text Classification
positions ← all word positions in test document
c_NB = argmax_{c_j ∈ C} P(c_j) ∏_{i ∈ positions} P(xi | c_j)
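As a minimal sketch of this decision rule in Python. The priors and likelihood tables below are made-up toy values for illustration, not numbers from the slides:

```python
# Multinomial NB decision rule: c_NB = argmax_c P(c) * prod_i P(x_i | c).
# Toy priors and per-class word likelihoods (illustrative values only).
priors = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.05, "boring": 0.001},
    "neg": {"great": 0.005, "boring": 0.04},
}

def classify(words):
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for w in words:
            # Tiny floor for unseen words; smoothing is covered later.
            score *= likelihood[c].get(w, 1e-6)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

label = classify(["great", "great"])   # "pos" wins on these toy numbers
```

The product runs over every word position in the test document, matching the formula above.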
Formalizing the Naïve Bayes Classifier
Text Classification and Naïve Bayes: Naïve Bayes: Learning
Learning the Multinomial Naïve Bayes Model (Sec. 13.3)
First attempt: maximum likelihood estimates, i.e. simply use the frequencies in the data:
P̂(c_j) = doccount(C = c_j) / N_doc
P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j)
Parameter estimation
P̂(w_i | c_j) = count(w_i, c_j) / Σ_{w ∈ V} count(w, c_j) = fraction of times word w_i appears among all words in documents of topic c_j
Create a mega-document for topic j by concatenating all docs in this topic; use the frequency of w in the mega-document
Problem with Maximum Likelihood (Sec. 13.3)
What if we have seen no training documents with the word "fantastic" classified in the topic positive (thumbs-up)?
P̂("fantastic" | positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
Zero probabilities cannot be conditioned away, no matter the other evidence!
c_MAP = argmax_c P̂(c) ∏_i P̂(xi | c)
Laplace (add-1) smoothing for Naïve Bayes
P̂(w_i | c) = (count(w_i, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
           = (count(w_i, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)
Multinomial Naïve Bayes: Learning
From training corpus, extract Vocabulary
Calculate P(c_j) terms: for each c_j in C do
  docs_j ← all docs with class = c_j
  P(c_j) ← |docs_j| / |total # documents|
Calculate P(w_k | c_j) terms:
  Text_j ← single doc containing all docs_j
  for each word w_k in Vocabulary:
    n_k ← # of occurrences of w_k in Text_j
    P(w_k | c_j) ← (n_k + α) / (n + α |Vocabulary|)
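The training procedure above can be sketched in a few lines of Python. The function name `train_nb` and the toy corpus are illustrative, not part of the slides:

```python
from collections import Counter, defaultdict

def train_nb(docs, alpha=1.0):
    """docs: list of (word_list, class_label) pairs.
    Returns priors P(c) and add-alpha smoothed likelihoods P(w|c),
    following the pseudocode above."""
    vocab = {w for words, _ in docs for w in words}
    by_class = defaultdict(list)
    for words, c in docs:
        by_class[c].append(words)
    priors, cond = {}, {}
    for c, class_docs in by_class.items():
        priors[c] = len(class_docs) / len(docs)        # |docs_j| / N
        # Text_j: the mega-document for class c, as a word counter.
        mega = Counter(w for words in class_docs for w in words)
        n = sum(mega.values())
        # P(w_k|c_j) = (n_k + alpha) / (n + alpha * |Vocabulary|)
        cond[c] = {w: (mega[w] + alpha) / (n + alpha * len(vocab))
                   for w in vocab}
    return priors, cond

priors, cond = train_nb([(["good", "fun"], "pos"), (["bad"], "neg")])
```

Note that with add-alpha smoothing the likelihoods for each class sum to 1 over the vocabulary, as a proper distribution should.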
Text Classification and Naïve Bayes: Naïve Bayes: Learning
Text Classification and Naïve Bayes: Naïve Bayes: Relationship to Language Modeling
Generative Model for Multinomial Naïve Bayes
c = China; X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds
Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature: URL, email address, dictionaries, network features
But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.
Each class = a unigram language model (Sec. 13.2.1)
Assigning each word: P(word|c)
Assigning each sentence: P(s|c) = Π P(word|c)
Example (class pos): "I love this fun film" with word probabilities 0.1, 0.1, .05, 0.01, 0.1 gives P(s|pos) = 0.0000005
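Recomputing the example product above (the per-word probabilities are the ones on the slide; the table name `p_pos` is just illustrative):

```python
# P(s|pos) = product of unigram probabilities from the slide's pos model.
p_pos = {"I": 0.1, "love": 0.1, "this": 0.05, "fun": 0.01, "film": 0.1}

prob = 1.0
for w in "I love this fun film".split():
    prob *= p_pos[w]
# prob is approximately 5e-07, i.e. 0.0000005 as on the slide
```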
Naïve Bayes as a Language Model (Sec. 13.2.1)
Which class assigns the higher probability to s? Compare the per-word probabilities under each class model (e.g., P(love|pos) = 0.1 vs. P(love|neg) = 0.001) and pick the class with the higher product: here P(s|pos) > P(s|neg)
Text Classification and Naïve Bayes: Naïve Bayes: Relationship to Language Modeling
Naïve Bayes: A Worked Example
P̂(c) = N_c / N;  P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|)
Training set:
  Doc 1 (class c): Chinese Beijing Chinese
  Doc 2 (class c): Chinese Chinese Shanghai
  Doc 3 (class c): Chinese Macao
  Doc 4 (class j): Tokyo Japan Chinese
Test set:
  Doc 5 (class ?): Chinese Chinese Chinese Tokyo Japan
Priors: P(c) = 3/4, P(j) = 1/4
Conditional probabilities:
  P(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
  P(Tokyo|c) = (0+1)/(8+6) = 1/14
  P(Japan|c) = (0+1)/(8+6) = 1/14
  P(Chinese|j) = (1+1)/(3+6) = 2/9
  P(Tokyo|j) = (1+1)/(3+6) = 2/9
  P(Japan|j) = (1+1)/(3+6) = 2/9
Choosing a class:
  P(c|d5) ∝ 3/4 · (3/7)^3 · 1/14 · 1/14 ≈ 0.0003
  P(j|d5) ∝ 1/4 · (2/9)^3 · 2/9 · 2/9 ≈ 0.0001
→ choose class c
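The worked example can be checked with a few lines of Python (training documents as in the example; add-1 smoothing with |V| = 6):

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan".split()

vocab = {w for doc, _ in train for w in doc.split()}   # |V| = 6
scores = {}
for c in ("c", "j"):
    docs = [doc.split() for doc, cls in train if cls == c]
    prior = len(docs) / len(train)                     # 3/4 for c, 1/4 for j
    counts = Counter(w for d in docs for w in d)
    n = sum(counts.values())                           # 8 for c, 3 for j
    score = prior
    for w in test:
        score *= (counts[w] + 1) / (n + len(vocab))    # add-1 smoothing
    scores[c] = score
# scores["c"] ≈ 0.0003 > scores["j"] ≈ 0.0001, so d5 is classified as c
```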
Naïve Bayes in Spam Filtering
SpamAssassin features: many hand-built tests (spam phrases, header patterns, ...)
See http://spamassassin.apache.org/tests_3_3_x.html
Summary: Naive Bayes is Not So Naive
Very fast, low storage requirements
Robust to irrelevant features: irrelevant features cancel each other without affecting results
Very good in domains with many equally important features: Decision Trees suffer from fragmentation in such cases, especially if little data
If the assumed independence is correct, it is the optimal classifier for the problem
A good dependable baseline for text classification
But we will see other classifiers that give better accuracy
Naïve Bayes: A Worked Example
Text Classification and Naïve Bayes: Precision, Recall, and the F measure
The 2-by-2 contingency table
              correct   not correct
selected      tp        fp
not selected  fn        tn
Precision and recall
Precision: % of selected items that are correct = tp / (tp + fp)
Recall: % of correct items that are selected = tp / (tp + fn)
A combined measure: F (weighted harmonic mean):
F = 1 / (α (1/P) + (1 − α)(1/R)) = (β² + 1) P R / (β² P + R)
The harmonic mean is a very conservative average; see IIR §8.3
People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½):
F = 2PR / (P + R)
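These formulas in a few lines (the function name `f_beta` and the counts are illustrative):

```python
def f_beta(tp, fp, fn, beta=1.0):
    """F measure via the weighted-harmonic-mean formula above."""
    p = tp / (tp + fp)                 # precision
    r = tp / (tp + fn)                 # recall
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# With beta = 1 this is the balanced F1 = 2PR/(P+R):
f1 = f_beta(tp=40, fp=10, fn=10)       # here P = R = 0.8, so F1 = 0.8
```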
Text Classification and Naïve Bayes: Precision, Recall, and the F measure
Text Classification: Evaluation
More Than Two Classes: Sets of binary classifiers (Sec. 14.5)
Dealing with any-of or multivalue classification: a document can belong to 0, 1, or >1 classes
For each class c ∈ C, build a classifier γc to distinguish c from all other classes c' ∈ C
Given test doc d, evaluate it for membership in each class using each γc
d belongs to any class for which γc returns true
More Than Two Classes: Sets of binary classifiers (Sec. 14.5)
One-of or multinomial classification: classes are mutually exclusive, each document in exactly one class
For each class c ∈ C, build a classifier γc to distinguish c from all other classes c' ∈ C
Given test doc d, evaluate it for membership in each class using each γc
d belongs to the one class with maximum score
Evaluation: Classic Reuters-21578 Data Set (Sec. 15.2.4)
Most (over)used data set: 21578 docs (each about 90 types, 200 tokens)
9603 training, 3299 test articles (ModApte/Lewis split)
118 categories: an article can be in more than one category; learn 118 binary category distinctions
Only about 10 out of 118 categories are large
Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)
Reuters Text Categorization data set (Reuters-21578) document (Sec. 15.2.4)
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TEXT>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 - </DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>
Confusion matrix c
For each pair of classes (c1, c2): how many documents from c1 were incorrectly assigned to c2?
Docs in test set; rows = true class, columns = assigned class (UK, poultry, wheat, coffee, interest, trade):
True UK:       ...
True poultry:  ...
True wheat:    ...
True coffee:   0   0   0   34   3   7
True interest: -   1   2   13   26  5
True trade:    0   0   2   14   5   10
Per class evaluation measures (Sec. 15.2.4)
Recall: fraction of docs in class i classified correctly: c_ii / Σ_j c_ij
Precision: fraction of docs assigned class i that are actually about class i: c_ii / Σ_j c_ji
Accuracy (1 − error rate): fraction of docs classified correctly: Σ_i c_ii / Σ_i Σ_j c_ij
Micro- vs. Macro-Averaging (Sec. 15.2.4)
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average
Microaveraging: collect decisions for all classes, compute contingency table, evaluate
Micro- vs. Macro-Averaging: Example (Sec. 15.2.4)
Class 1: tp=10, fn=10, fp=10, tn=970
Class 2: tp=90, fn=10, fp=10, tn=890
Micro Av. (pooled) table: tp=100, fn=20, fp=20, tn=1860
Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 = .83
Microaveraged score is dominated by score on common classes
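Reproducing the example's two averages from the per-class tables above:

```python
tables = [{"tp": 10, "fn": 10, "fp": 10, "tn": 970},   # class 1
          {"tp": 90, "fn": 10, "fp": 10, "tn": 890}]   # class 2

# Macroaveraging: compute precision per class, then average.
macro_p = sum(t["tp"] / (t["tp"] + t["fp"]) for t in tables) / len(tables)

# Microaveraging: pool the contingency tables, then compute precision once.
tp = sum(t["tp"] for t in tables)
fp = sum(t["fp"] for t in tables)
micro_p = tp / (tp + fp)
# macro_p = 0.7, micro_p = 100/120 ≈ 0.83
```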
Development Test Sets and Cross-validation
Training set / Development Test Set / Test Set
Metric: P/R/F1 or Accuracy
Unseen test set: avoids overfitting ("tuning to the test set"); more conservative estimate of performance
Cross-validation over multiple splits: handles sampling errors from different datasets; pool results over each split; compute pooled dev set performance
Text Classification: Evaluation
Text Classification: Practical Issues
The Real World (Sec. 15.3.1)
Gee, I'm building a text classifier for real, now! What should I do?
No training data? Manually written rules (Sec. 15.3.1)
If (wheat or grain) and not (whole or bread) then categorize as grain
Needs careful crafting: human tuning on development data
Time-consuming: 2 days per class
Very little data? (Sec. 15.3.1)
Use Naïve Bayes: Naïve Bayes is a "high-bias" algorithm (Ng and Jordan 2002 NIPS)
Get more labeled data: find clever ways to get humans to label data for you
Try semi-supervised training methods: bootstrapping, EM over unlabeled documents, ...
A reasonable amount of data? (Sec. 15.3.1)
Perfect for all the clever classifiers: SVM, regularized logistic regression
You can even use user-interpretable decision trees: users like to hack; management likes quick fixes
A huge amount of data? (Sec. 15.3.1)
Can achieve high accuracy! At a cost:
SVMs (train time) or kNN (test time) can be too slow
Regularized logistic regression can be somewhat better
So Naïve Bayes can come back into its own again!
Accuracy as a function of data size (Sec. 15.3.1)
With enough data, the classifier may not matter (Brill and Banko on spelling correction)
Real-world systems generally combine:
Automatic classification
Manual review of uncertain/difficult/"new" cases
Underflow Prevention: log space
Multiplying lots of probabilities can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to sum logs of probabilities than to multiply probabilities.
The class with the highest final un-normalized log probability score is still the most probable.
c_NB = argmax_{c_j ∈ C} [ log P(c_j) + Σ_{i ∈ positions} log P(xi | c_j) ]
The model is now just a max of a sum of weights.
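A log-space sketch of the decision rule (the parameters are toy values, not from the slides):

```python
import math

# Toy model parameters; in log space, products become sums.
log_prior = {"pos": math.log(0.5), "neg": math.log(0.5)}
log_like = {"pos": {"great": math.log(0.05), "boring": math.log(0.001)},
            "neg": {"great": math.log(0.005), "boring": math.log(0.04)}}

def classify_log(words):
    # c_NB = argmax_c [ log P(c) + sum_i log P(x_i|c) ]
    return max(log_prior,
               key=lambda c: log_prior[c] +
                             sum(log_like[c][w] for w in words))

# Even 500 repeated words cause no underflow, since we sum logs
# (multiplying 500 probabilities of 0.05 would underflow to 0.0):
label = classify_log(["great"] * 500)
```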
How to tweak performance (Sec. 15.3.2)
Domain-specific features and weights: very important in real performance
Sometimes need to collapse terms: part numbers, chemical formulas, ...; but stemming generally doesn't help
Upweighting: counting a word as if it occurred twice:
  title words (Cohen & Singer 1996)
  first sentence of each paragraph (Murata, 1999)
  in sentences that contain title words (Ko et al., 2002)
Text Classification: Practical Issues
Positive or negative movie review?
"unbelievably disappointing"
"Full of zany characters and richly applied satire, and some great plot twists"