Text Classification and Naïve Bayes

Transcription

Text Classification and Naïve Bayes
The Task of Text Classification

Is this spam?

Who wrote which Federalist papers?
1787–8: anonymous essays tried to convince New York to ratify the U.S. Constitution: Jay, Madison, Hamilton.
Authorship of 12 of the letters was in dispute.
1963: solved by Mosteller and Wallace using Bayesian methods.

Male or female author?
1. "By 1925 present-day Vietnam was divided into three parts under French colonial rule. The southern region embracing Saigon and the Mekong delta was the colony of Cochin-China; the central area … was the protectorate of Annam …"
2. "… had managed over the years to convert her greatest shame into one of her greatest assets …"
(From Argamon et al., "Gender, Genre, and Writing Style in Formal Written Texts," Text, volume 23, number 3, pp. 321–346.)

Positive or negative movie review?
unbelievably disappointing
Full of zany characters and richly applied satire, and some great plot twists
this is the greatest screwball comedy ever filmed

What is the subject of this article?
MEDLINE article; MeSH Subject Category Hierarchy (e.g., Antagonists and Inhibitors, Embryology, Epidemiology, …).

Text Classification tasks: authorship identification, language identification, sentiment analysis, …

Text Classification: definition
Input: a document d and a fixed set of classes C = {c1, c2, …, cJ}
Output: a predicted class c ∈ C

Classification Methods: Hand-coded rules
Rules based on combinations of words or other features. Example for spam: black-list-address OR ("dollars" AND "have been selected").
Accuracy can be high if the rules are carefully refined by an expert, but building and maintaining these rules is expensive. A minimal sketch of such a rule follows.
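
To make the rule concrete, here is a minimal sketch in Python. The blacklist contents and the exact string-matching logic are illustrative assumptions, not part of the slide:

```python
# A minimal sketch of the hand-coded spam rule from the slide:
# black-list-address OR ("dollars" AND "have been selected").
BLACKLIST = {"spammer@example.com"}  # hypothetical blacklist

def is_spam(sender: str, body: str) -> bool:
    text = body.lower()
    return sender in BLACKLIST or ("dollars" in text and "have been selected" in text)

print(is_spam("friend@example.com", "You have been selected to win 1000 dollars!"))  # True
```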

Classification Methods: Supervised Machine Learning
Input: a document d; a fixed set of classes C = {c1, c2, …, cJ}; a training set of m hand-labeled documents (d1, c1), …, (dm, cm)
Output: a learned classifier γ: d → c

Classification Methods: Supervised Machine Learning
Any kind of classifier: Naïve Bayes, logistic regression, support-vector machines, k-nearest neighbors, …

Text Classification and Naïve Bayes
The Task of Text Classification

Text Classification and Naïve Bayes
Naïve Bayes (I)

Naïve Bayes Intuition
A simple classification method based on Bayes rule.
Relies on a very simple representation of the document: bag of words.

The Bag of Words Representation
The full review text ("I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun. It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!") is reduced to a vector of word counts:
it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, recommend 1, happy 1, …
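
A bag-of-words vector is just a word-count mapping. A minimal sketch with Python's Counter (whitespace tokenization is a simplification; real systems would also strip punctuation):

```python
from collections import Counter

# A document reduced to its bag of words: word order is discarded,
# only counts remain.
review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun.")
bag = Counter(review.lower().split())
print(bag.most_common(3))  # [('the', 2), ('i', 1), ('love', 1)]
```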

Text Classification and Naïve Bayes
Naïve Bayes (I)

Text Classification and Naïve Bayes
Formalizing the Naïve Bayes Classifier

Bayes' Rule Applied to Documents and Classes
For a document d and a class c:
P(c|d) = P(d|c) P(c) / P(d)

Naïve Bayes Classifier (I)
c_MAP = argmax_{c ∈ C} P(c|d)    (MAP is "maximum a posteriori" = most likely class)
      = argmax_{c ∈ C} P(d|c) P(c) / P(d)    (Bayes rule)
      = argmax_{c ∈ C} P(d|c) P(c)    (dropping the denominator)

Naïve Bayes Classifier (II)
c_MAP = argmax_{c ∈ C} P(d|c) P(c)
      = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
Document d represented as features x1 … xn.

Naïve Bayes Classifier (IV)
c_MAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
P(x1, x2, …, xn | c): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
P(c): How often does this class occur? We can just count the relative frequencies in a corpus.

Multinomial Naïve Bayes Independence Assumptions
P(x1, x2, …, xn | c)
Bag of Words assumption: assume position doesn't matter.
Conditional Independence: assume the feature probabilities P(xi|cj) are independent given the class c:
P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)

Multinomial Naïve Bayes Classifier
c_MAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
c_NB = argmax_{c ∈ C} P(cj) ∏_{x ∈ X} P(x|c)

Applying Multinomial Naïve Bayes Classifiers to Text Classification
positions ← all word positions in test document
c_NB = argmax_{cj ∈ C} P(cj) ∏_{i ∈ positions} P(xi|cj)
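
A minimal sketch of this decision rule in Python. The priors and per-word likelihoods below are made-up placeholders, and a real system would work in log space (see the underflow-prevention slide later in the deck):

```python
# Multinomial Naive Bayes decision rule:
# c_NB = argmax_c P(c) * prod_i P(x_i | c)
priors = {"pos": 0.5, "neg": 0.5}
likelihood = {
    "pos": {"great": 0.1, "boring": 0.01},
    "neg": {"great": 0.01, "boring": 0.1},
}

def classify(words):
    def score(c):
        p = priors[c]
        for w in words:
            p *= likelihood[c].get(w, 1e-6)  # tiny floor for unseen words
        return p
    return max(priors, key=score)

print(classify(["great", "great", "boring"]))  # -> "pos"
```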

Text Classification and Naïve Bayes
Formalizing the Naïve Bayes Classifier

Text Classification and Naïve Bayes
Naïve Bayes: Learning

Sec. 13.3
Learning the Multinomial Naïve Bayes Model
First attempt: maximum likelihood estimates, simply using the frequencies in the data:
P̂(cj) = doccount(C = cj) / N_doc
P̂(wi|cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj)

Parameter estimation
P̂(wi|cj) = count(wi, cj) / Σ_{w ∈ V} count(w, cj): the fraction of times word wi appears among all words in documents of topic cj.
Create a mega-document for topic j by concatenating all docs in this topic; use the frequency of w in the mega-document.

Sec. 13.3
Problem with Maximum Likelihood
What if we have seen no training documents with the word "fantastic" classified in the topic positive (thumbs-up)?
P̂("fantastic"|positive) = count("fantastic", positive) / Σ_{w ∈ V} count(w, positive) = 0
Zero probabilities cannot be conditioned away, no matter the other evidence:
c_MAP = argmax_c P̂(c) ∏_i P̂(xi|c)

Laplace (add-1) smoothing for Naïve Bayes
P̂(wi|c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
        = (count(wi, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)

Multinomial Naïve Bayes: Learning
From the training corpus, extract Vocabulary.
Calculate P(cj) terms: for each cj in C, docsj ← all docs with class = cj; P(cj) = |docsj| / (total # documents).
Calculate P(wk|cj) terms: Textj ← single doc containing all of docsj; for each word wk in Vocabulary, nk ← # of occurrences of wk in Textj; P(wk|cj) = (nk + α) / (n + α·|Vocabulary|), where n is the total number of word tokens in Textj.
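
The pseudocode above maps directly onto a short runnable implementation. Below is a minimal sketch in Python, assuming documents arrive as (word-list, label) pairs; the function names train_nb and classify are illustrative, and the log-space scoring anticipates the underflow slide later in the deck:

```python
import math
from collections import Counter, defaultdict

# Multinomial Naive Bayes training with add-alpha smoothing
# (Laplace when alpha=1), following the slide's pseudocode.
def train_nb(docs, alpha=1.0):
    """docs: list of (list_of_words, class_label) pairs."""
    vocab = {w for words, _ in docs for w in words}
    class_docs = defaultdict(list)
    for words, c in docs:
        class_docs[c].append(words)
    priors, cond = {}, {}
    for c, doc_list in class_docs.items():
        priors[c] = len(doc_list) / len(docs)
        counts = Counter(w for words in doc_list for w in words)  # "mega-document"
        n = sum(counts.values())
        cond[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocab)) for w in vocab}
    return priors, cond, vocab

def classify(words, priors, cond, vocab):
    # Sum logs to avoid underflow; ignore words outside the vocabulary.
    def logscore(c):
        return math.log(priors[c]) + sum(math.log(cond[c][w]) for w in words if w in vocab)
    return max(priors, key=logscore)
```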

Text Classification and Naïve Bayes
Naïve Bayes: Learning

Text Classification and Naïve Bayes
Naïve Bayes: Relationship to Language Modeling

Generative Model for Multinomial Naïve Bayes
c = China; X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds

Naïve Bayes and Language Modeling
Naïve Bayes classifiers can use any sort of feature: URL, email address, dictionaries, network features.
But if, as in the previous slides, we use only word features, and we use all of the words in the text (not a subset), then Naïve Bayes has an important similarity to language modeling.

Sec. 13.2.1
Each class = a unigram language model
Assigning each word: P(word|c). Assigning each sentence: P(s|c) = Π P(word|c).
Example: "I love this fun film" with per-word probabilities 0.1, 0.1, 0.05, 0.01, 0.1 gives P(s|pos) = 0.0000005.

Sec. 13.2.1
Naïve Bayes as a Language Model
Which class assigns the higher probability to s? Compare the per-word probabilities under the pos and neg models, e.g., P(love|pos) = 0.1 vs. P(love|neg) = 0.001; P(this|pos) = 0.01 vs. P(this|neg) = 0.01; P(fun|pos) = 0.05 vs. P(fun|neg) = 0.005; P(film|pos) = 0.1 vs. P(film|neg) = 0.1. Multiplying out: P(s|pos) > P(s|neg).
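
A quick check of this comparison in Python. The probabilities for "love", "this", "fun", and "film" are from the slide; P(I|c) is an assumed placeholder, since it is not recoverable from the transcript:

```python
# Compare the sentence probability under the pos and neg unigram models.
pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.1, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}  # P(I|neg) assumed

p_pos = p_neg = 1.0
for w in "I love this fun film".split():
    p_pos *= pos[w]
    p_neg *= neg[w]
print(p_pos, p_neg, p_pos > p_neg)  # ~5e-07, ~5e-10, True -> pos wins
```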

Text Classification and Naïve Bayes
Naïve Bayes: Relationship to Language Modeling

Text Classification and Naïve Bayes
Naïve Bayes: A Worked Example

P̂(c) = Nc / N
P̂(w|c) = (count(w, c) + 1) / (count(c) + |V|)

Training set:
Doc 1, class c: Chinese Beijing Chinese
Doc 2, class c: Chinese Chinese Shanghai
Doc 3, class c: Chinese Macao
Doc 4, class j: Tokyo Japan Chinese
Test set:
Doc 5, class ?: Chinese Chinese Chinese Tokyo Japan

Priors: P(c) = 3/4, P(j) = 1/4
Conditional probabilities:
P(Chinese|c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo|c) = (0+1)/(8+6) = 1/14
P(Japan|c) = (0+1)/(8+6) = 1/14
P(Chinese|j) = (1+1)/(3+6) = 2/9
P(Tokyo|j) = (1+1)/(3+6) = 2/9
P(Japan|j) = (1+1)/(3+6) = 2/9
Choosing a class:
P(c|d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
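
The arithmetic above is easy to verify exactly. Here is a small sketch using Python's Fraction type; the variable layout is just one convenient way to mirror the slide's numbers:

```python
from fractions import Fraction as F

# Verify the worked example with exact rational arithmetic.
V = 6                                  # vocabulary: Chinese, Beijing, Shanghai, Macao, Tokyo, Japan
p_c, p_j = F(3, 4), F(1, 4)            # priors
chinese_c = F(5 + 1, 8 + V)            # add-1 smoothed likelihoods, class c (8 tokens)
tokyo_c = japan_c = F(0 + 1, 8 + V)
chinese_j = tokyo_j = japan_j = F(1 + 1, 3 + V)  # class j (3 tokens)

score_c = p_c * chinese_c**3 * tokyo_c * japan_c   # ~0.0003
score_j = p_j * chinese_j**3 * tokyo_j * japan_j   # ~0.0001
print(float(score_c), float(score_j))
print("predicted class:", "c" if score_c > score_j else "j")
```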

Naïve Bayes in Spam Filtering
SpamAssassin features, e.g., "Prestigious Non-Accredited Universities"; see the full list at http://spamassassin.apache.org/tests_3_3_x.html

Summary: Naive Bayes is Not So Naive
Very fast, low storage requirements.
Robust to irrelevant features: irrelevant features cancel each other without affecting results.
Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially if there is little data).
Optimal if the independence assumptions hold: it is then the Bayes-optimal classifier for the problem.
A good, dependable baseline for text classification, but we will see other classifiers that give better accuracy.

Text Classification and Naïve Bayes
Naïve Bayes: A Worked Example

Text Classification and Naïve Bayes
Precision, Recall, and the F measure

The 2-by-2 contingency table:
               correct   not correct
selected       tp        fp
not selected   fn        tn

Precision and recall
Precision: % of selected items that are correct = tp / (tp + fp)
Recall: % of correct items that are selected = tp / (tp + fn)

A combined measure: F
F measure (weighted harmonic mean):
F = 1 / (α(1/P) + (1−α)(1/R)) = (β² + 1)PR / (β²P + R)
The harmonic mean is a very conservative average; see IIR §8.3.
People usually use the balanced F1 measure, i.e., with β = 1 (that is, α = ½): F = 2PR / (P + R)
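
A quick sketch computing precision, recall, and balanced F1 from the 2-by-2 contingency counts; the tp/fp/fn values are made-up examples:

```python
# Precision, recall, and F1 (harmonic mean of P and R).
def precision(tp, fp): return tp / (tp + fp)
def recall(tp, fn): return tp / (tp + fn)
def f1(p, r): return 2 * p * r / (p + r)

p, r = precision(tp=80, fp=20), recall(tp=80, fn=40)
print(p, r, f1(p, r))  # 0.8, 0.666..., ~0.727
```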

Text Classification and Naïve Bayes
Precision, Recall, and the F measure

Text Classification and Naïve Bayes
Evaluation

Sec. 14.5
More Than Two Classes: Sets of binary classifiers
Dealing with any-of or multivalue classification: a document can belong to 0, 1, or more than 1 classes.
For each class c ∈ C, build a classifier γc to distinguish c from all other classes c' ∈ C.
Given test doc d, evaluate it for membership in each class using each γc; d belongs to any class for which γc returns true.

Sec. 14.5
More Than Two Classes: Sets of binary classifiers
One-of or multinomial classification: each document belongs to exactly one class.
For each class c ∈ C, build a classifier γc to distinguish c from all other classes c' ∈ C.
Given test doc d, evaluate it for membership in each class using each γc; d belongs to the one class with maximum score.
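
A minimal sketch contrasting the two regimes. The per-class scores and the decision threshold are illustrative assumptions:

```python
# 'scores' maps each class to a hypothetical classifier score for doc d.
scores = {"sports": 0.7, "politics": 0.55, "weather": 0.1}
THRESHOLD = 0.5

any_of = [c for c, s in scores.items() if s > THRESHOLD]  # 0, 1, or more classes
one_of = max(scores, key=scores.get)                      # exactly one class

print(any_of)  # ['sports', 'politics']
print(one_of)  # 'sports'
```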

Sec. 15.2.4
Evaluation: Classic Reuters-21578 Data Set
9603 training, 3299 test articles (ModApte/Lewis split); 118 categories.
An article can be in more than one category; learn 118 binary category distinctions.
Only about 10 out of 118 categories are large.
Common categories (#train, #test): Earn (2877, 1087), Acquisitions (1650, 179), Money-fx (538, 179), Grain (433, 149), Crude (389, 189), Trade (369, 119), Interest (347, 131), Ship (197, 89), Wheat (212, 71), Corn (182, 56)

Sec. 15.2.4
Reuters Text Categorization data set (Reuters-21578) document:
<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" CGISPLIT="TRAINING-SET" OLDID="12981" NEWID="798">
<DATE>2-MAR-1987 16:51:43.42</DATE>
<TOPICS><D>livestock</D><D>hog</D></TOPICS>
<TITLE>AMERICAN PORK CONGRESS KICKS OFF TOMORROW</TITLE>
<DATELINE>CHICAGO, March 2 -</DATELINE>
<BODY>The American Pork Congress kicks off tomorrow, March 3, in Indianapolis with 160 of the nation's pork producers from 44 member states determining industry positions on a number of issues, according to the National Pork Producers Council, NPPC. Delegates to the three day Congress will be considering 26 resolutions concerning various issues, including the future direction of farm policy and the tax law as it applies to the agriculture sector. The delegates will also debate whether to endorse concepts of a national PRV (pseudorabies virus) control and eradication program, the NPPC said. A large trade show, in conjunction with the congress, will feature the latest in technology in all areas of the industry, the NPPC added. Reuter
</BODY></TEXT></REUTERS>

Confusion matrix c
For each pair of classes (c1, c2): how many documents from c1 were incorrectly assigned to c2?

                 Assigned UK  Assigned poultry  Assigned wheat  Assigned coffee  Assigned interest  Assigned trade
True UK          …
True coffee      0            0                 0               34              3                  7
True interest    -            1                 2               13              26                 5
True trade       0            0                 2               14              5                  10

Sec. 15.2.4
Per class evaluation measures
Recall: fraction of docs in class i classified correctly: cii / Σj cij
Precision: fraction of docs assigned class i that are actually about class i: cii / Σj cji
Accuracy (1 − error rate): fraction of docs classified correctly: Σi cii / Σi Σj cij
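
A small sketch computing these measures from a confusion matrix c[i][j] = number of docs of true class i assigned to class j; the matrix values are adapted from the coffee/interest/trade rows of the confusion-matrix slide:

```python
c = [
    [34, 3, 7],
    [13, 26, 5],
    [14, 5, 10],
]

def recall(i):    return c[i][i] / sum(c[i])                 # row-wise
def precision(i): return c[i][i] / sum(row[i] for row in c)  # column-wise

accuracy = sum(c[i][i] for i in range(len(c))) / sum(map(sum, c))
print([round(recall(i), 2) for i in range(3)],
      [round(precision(i), 2) for i in range(3)],
      round(accuracy, 2))
```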

Sec. 15.2.4
Micro- vs. Macro-Averaging
If we have more than one class, how do we combine multiple performance measures into one quantity?
Macroaveraging: compute performance for each class, then average.
Microaveraging: collect decisions for all classes, compute the aggregate contingency table, evaluate.

Sec. 15.2.4
Micro- vs. Macro-Averaging: Example
Class 1: tp = 10, fp = 10 → precision 10/20 = 0.5
Class 2: tp = 90, fp = 10 → precision 90/100 = 0.9
Pooled (micro) table: tp = 100, fp = 20, fn = 20, tn = 1860
Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
Microaveraged precision: 100/120 ≈ 0.83
Microaveraged score is dominated by the score on common classes.
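
Reproducing the example in Python, using the per-class (tp, fp) counts from the slide's contingency tables:

```python
classes = [
    {"tp": 10, "fp": 10},   # class 1 -> precision 0.5
    {"tp": 90, "fp": 10},   # class 2 -> precision 0.9
]

macro = sum(c["tp"] / (c["tp"] + c["fp"]) for c in classes) / len(classes)
tp = sum(c["tp"] for c in classes)
fp = sum(c["fp"] for c in classes)
micro = tp / (tp + fp)
print(macro, round(micro, 2))  # 0.7, 0.83
```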

Development Test Sets and Cross-validation
Split data into a training set, a development test set, and an unseen test set.
Metric: P/R/F1 or accuracy.
Use the dev set to avoid overfitting ('tuning to the test set'); this gives a more conservative estimate of performance.
Cross-validation over multiple splits: handles sampling errors from different datasets; pool results over each split; compute pooled dev set performance.
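
A minimal sketch of k-fold cross-validation with pooled results; the train and evaluate callables are hypothetical stand-ins for a real classifier and metric:

```python
def cross_validate(data, train, evaluate, k=10):
    folds = [data[i::k] for i in range(k)]        # k roughly equal, disjoint splits
    correct = total = 0
    for i in range(k):
        dev = folds[i]
        train_data = [x for j, f in enumerate(folds) if j != i for x in f]
        model = train(train_data)
        c, t = evaluate(model, dev)               # (num correct, num examples)
        correct += c
        total += t
    return correct / total                        # pooled dev-set accuracy
```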

Text Classification and Naïve Bayes
Evaluation

Text Classification and Naïve Bayes
Practical Issues

Sec. 15.3.1
The Real World
Gee, I'm building a text classifier for real, now! What should I do?

Sec. 15.3.1
No training data? Manually written rules
If (wheat or grain) and not (whole or bread) then categorize as grain.
Needs careful crafting: human tuning on development data.
Time-consuming: 2 days per class.

Sec. 15.3.1
Very little data?
Use Naïve Bayes: Naïve Bayes is a "high-bias" algorithm (Ng and Jordan 2002 NIPS).
Get more labeled data: find clever ways to get humans to label data for you.
Try semi-supervised training methods: bootstrapping, EM over unlabeled documents, …

Sec. 15.3.1
A reasonable amount of data?
Perfect for all the clever classifiers: SVM, regularized logistic regression.
You can even use user-interpretable decision trees: users like to hack, and management likes quick fixes.

Sec. 15.3.1
A huge amount of data?
Can achieve high accuracy! At a cost: SVMs (train time) or kNN (test time) can be too slow; regularized logistic regression can be somewhat better.
So Naïve Bayes can come back into its own again!

Sec. 15.3.1
Accuracy as a function of data size
With enough data, the classifier may not matter (Brill and Banko on spelling correction).

Real-world systems generally combine:
automatic classification, and
manual review of uncertain/difficult/"new" cases.

Underflow Prevention: log space
Multiplying lots of probabilities can result in floating-point underflow.
Since log(xy) = log(x) + log(y), it is better to sum logs of probabilities than to multiply probabilities.
The class with the highest un-logged probability is still the class with the highest log probability score:
c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi|cj) ]
The model is now just a max of a sum of weights.
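
A short demonstration of why log space matters: multiplying many small probabilities underflows to 0.0, while summing logs stays finite. The probability value and count are arbitrary illustrations:

```python
import math

p = 1e-10
n = 40
product = 1.0
for _ in range(n):
    product *= p          # 1e-400 is below the double-precision range
log_sum = n * math.log(p)
print(product)   # 0.0 -> underflow
print(log_sum)   # ~ -921.03, still perfectly usable for argmax
```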

Sec. 15.3.2
How to tweak performance
Domain-specific features and weights: very important in real performance.
Sometimes need to collapse terms: part numbers, chemical formulas, …. But stemming generally doesn't help.
Upweighting: counting a word as if it occurred twice: title words (Cohen & Singer 1996), first sentence of each paragraph (Murata, 1999), in sentences that contain title words (Ko et al., 2002).

Text Classification and Naïve Bayes
Practical Issues
