Stacking Classifiers For Anti-spam Filtering Of E-mail


In Proceedings of the 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), L. Lee and D. Harman (Eds.), pp. 44–50, Carnegie Mellon University, Pittsburgh, PA, USA, 2001.

Stacking classifiers for anti-spam filtering of e-mail

Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos

Department of Informatics, University of Athens, TYPA Buildings, Panepistimiopolis, GR-157 71 Athens, Greece
e-mail: {stud0926, T.Stamatopoulos}@di.uoa.gr

Software and Knowledge Engineering Laboratory, Institute of Informatics and Telecommunications, National Centre for Scientific Research “Demokritos”, GR-153 10 Ag. Paraskevi, Athens, Greece
e-mail: {ionandr, paliourg, vangelis, costass}@iit.demokritos.gr

Abstract

We evaluate empirically a scheme for combining classifiers, known as stacked generalization, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization. Unsolicited commercial e-mail, or “spam”, floods mailboxes, causing frustration, wasting bandwidth, and exposing minors to unsuitable content. Using a public corpus, we show that stacking can improve the efficiency of automatically induced anti-spam filters, and that such filters can be used in real-life applications.

Introduction

This paper presents an empirical evaluation of stacked generalization, a scheme for combining automatically induced classifiers, in the context of anti-spam filtering, a novel cost-sensitive application of text categorization.

The increasing popularity and low cost of e-mail have led direct marketers to flood the mailboxes of thousands of users with unsolicited messages, advertising anything from vacations to get-rich schemes. These messages, known as spam or, more formally, Unsolicited Commercial E-mail, are extremely annoying, as they clutter mailboxes, prolong dial-up connections, and often expose minors to unsuitable content (Cranor & LaMacchia, 1998).

Legal and simplistic technical countermeasures, like blacklists and keyword-based filters, have had a very limited effect so far.[1] The success of machine learning techniques in text categorization (Sebastiani, 2001) has recently led to alternative, learning-based approaches (Sahami, et al. 1998; Pantel & Lin, 1998; Drucker, et al. 1999). A classifier capable of distinguishing between spam and non-spam, hereafter legitimate, messages is induced from a manually categorized learning collection of messages, and is then used to identify incoming spam e-mail. Initial results have been promising, and experiments are becoming more systematic, exploiting recently introduced benchmark corpora and cost-sensitive evaluation measures (Gómez Hidalgo, et al. 2000; Androutsopoulos, et al. 2000a, b, c).

[1] Consult www.cauce.org, spam.abuse.net, and www.junkemail.org.

Stacked generalization (Wolpert, 1992), or stacking, is an approach for constructing classifier ensembles. A classifier ensemble, or committee, is a set of classifiers whose individual decisions are combined in some way to classify new instances (Dietterich, 1997). Stacking combines multiple classifiers to induce a higher-level classifier with improved performance. The latter can be thought of as the president of a committee with the ground-level classifiers as members. Each unseen incoming message is first given to the members; the president then decides on the category of the message by considering the opinions of the members and the message itself.

Ground-level classifiers often make different classification errors. Hence, a president that has successfully learned when to trust each of the members can improve overall performance.

We have experimented with two ground-level classifiers for which results on a public benchmark corpus are available: a Naïve Bayes classifier (Androutsopoulos, et al. 2000a, c) and a memory-based classifier (Androutsopoulos, et al. 2000b; Sakkis, et al. 2001). Using a third, memory-based classifier as president, we investigated two versions of stacking and two different cost-sensitive scenarios. Overall, our results indicate that stacking improves the performance of the ground-level classifiers, and that the performance of the resulting anti-spam filter is acceptable for real-life applications.

Section 1 below presents the benchmark corpus and the preprocessing of the messages; section 2 introduces cost-sensitive evaluation measures; section 3 provides details on the stacking approaches that were explored; section 4 discusses the learning algorithms that were employed and the motivation for selecting them; section 5 presents our experimental results, followed by conclusions.

1 Benchmark corpus and preprocessing

Text categorization has benefited from public benchmark corpora. Producing such corpora for anti-spam filtering is not straightforward, since user mailboxes cannot be made public without considering privacy issues. A useful public approximation of a user’s mailbox, however, can be constructed by mixing spam messages with messages extracted from spam-free public archives of mailing lists. The corpus that we used, Ling-Spam, follows this approach (Androutsopoulos, et al. 2000a, b; Sakkis, et al. 2001). It is a mixture of spam messages and messages sent via the Linguist, a moderated list about the science and profession of linguistics. The corpus consists of 2412 Linguist messages and 481 spam messages.

Spam messages constitute 16.6% of Ling-Spam, close to the rates reported by Cranor and LaMacchia (1998) and Sahami et al. (1998). Although the Linguist messages are more topic-specific than most users’ e-mail, they are less standardized than one might expect. For example, they contain job postings, software availability announcements and even flame-like responses. Moreover, recent experiments with an encoded user mailbox and a Naïve Bayes (NB) classifier (Androutsopoulos, et al. 2000c) yielded results similar to those obtained with Ling-Spam (Androutsopoulos, et al. 2000a). Therefore, experimentation with Ling-Spam can provide useful indicative results, at least in a preliminary stage. Furthermore, experiments with Ling-Spam can be seen as studies of anti-spam filtering of open unmoderated lists.

Each message of Ling-Spam was converted into a vector $\vec{x} = \langle x_1, x_2, x_3, \ldots, x_n \rangle$, where $x_1, \ldots, x_n$ are the values of attributes $X_1, \ldots, X_n$. Each attribute shows whether a particular word (e.g. “adult”) occurs in the message. All attributes are binary: $X_i = 1$ if the word is present; otherwise $X_i = 0$. To avoid treating forms of the same word as different attributes, a lemmatizer was applied, converting each word to its base form.

To reduce the dimensionality, attribute selection was performed. First, words occurring in fewer than 4 messages were discarded. Then, the Information Gain (IG) of each candidate attribute $X$ with respect to the category $C$ was computed:

$$IG(X, C) = \sum_{x \in \{0,1\},\; c \in \{spam, legit\}} P(x, c) \cdot \log \frac{P(x, c)}{P(x) \cdot P(c)}$$

The attributes with the $m$ highest IG scores were selected, with $m$ corresponding to the best configurations of the ground-level classifiers that have been reported for Ling-Spam (Androutsopoulos, et al. 2000a; Sakkis, et al. 2001); see Section 4.
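To make this attribute-selection step concrete, here is a minimal sketch (ours, not the authors' code). It assumes each message is given as a set of lemmatized words and each label is either "spam" or "legit"; all function and variable names are our own.

```python
import math
from collections import Counter

def information_gain(messages, labels, word):
    """IG(X, C) = sum over x in {0,1}, c in {spam, legit} of
    P(x, c) * log(P(x, c) / (P(x) * P(c)))."""
    n = len(messages)
    joint = Counter()                      # counts of (x, c) pairs
    for words, label in zip(messages, labels):
        x = 1 if word in words else 0
        joint[(x, label)] += 1
    ig = 0.0
    for (x, c), count in joint.items():    # zero-count cells contribute nothing
        p_xc = count / n
        p_x = sum(v for (xx, _), v in joint.items() if xx == x) / n
        p_c = sum(v for (_, cc), v in joint.items() if cc == c) / n
        ig += p_xc * math.log(p_xc / (p_x * p_c))
    return ig

def select_attributes(messages, labels, m, min_df=4):
    """Drop words occurring in fewer than min_df messages,
    then keep the m words with the highest IG scores."""
    df = Counter(w for words in messages for w in set(words))
    candidates = [w for w, k in df.items() if k >= min_df]
    ranked = sorted(candidates,
                    key=lambda w: information_gain(messages, labels, w),
                    reverse=True)
    return ranked[:m]

def to_vector(words, attributes):
    """Binary attribute vector: 1 if the (lemmatized) word is present."""
    return [1 if a in words else 0 for a in attributes]
```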

2 Evaluation measures

Blocking a legitimate message is generally a more severe error than accepting a spam message. Let $L \rightarrow S$ and $S \rightarrow L$ denote the two error types, respectively, and let us assume that $L \rightarrow S$ is $\lambda$ times as costly as $S \rightarrow L$. Previous research has considered three cost scenarios, where $\lambda = 1$, 9, or 999 (Androutsopoulos, et al. 2000a, b, c; Sakkis, et al. 2001). In the scenario where $\lambda = 999$, blocked messages are deleted immediately. $L \rightarrow S$ is taken to be 999 times as costly as $S \rightarrow L$, since most users would consider losing a legitimate message unacceptable. In the scenario where $\lambda = 9$, blocked messages are returned to their senders with a request to resend them to an unfiltered address. In this case, $L \rightarrow S$ is penalized more than $S \rightarrow L$, to account for the fact that recovering from a blocked legitimate message is more costly (counting the sender’s extra work) than recovering from a spam message that passed the filter (deleting it manually). In the third scenario, where $\lambda = 1$, blocked messages are simply flagged as possibly spam. Hence, $L \rightarrow S$ is no more costly than $S \rightarrow L$. Previous experiments indicate that the Naïve Bayes ground-level classifier is unstable when $\lambda = 999$ (Androutsopoulos, et al. 2000a). Hence, we have considered only the cases where $\lambda = 1$ or 9.

Let $W_L(x)$ and $W_S(x)$ be the confidence of a classifier (member or president) that message $x$ is legitimate and spam, respectively. The classifier classifies $x$ as spam iff:

$$W_S(x) > \lambda \cdot W_L(x)$$

If $W_L(x)$ and $W_S(x)$ are accurate estimates of $P(legit \mid x)$ and $P(spam \mid x)$, respectively, the criterion above achieves optimal results (Duda & Hart, 1973).

To measure the performance of a filter, weighted accuracy (WAcc) and its complementary weighted error rate (WErr = 1 − WAcc) are used (Androutsopoulos, et al. 2000a, b, c; Sakkis, et al. 2001):

$$WAcc = \frac{\lambda \cdot N_{L \rightarrow L} + N_{S \rightarrow S}}{\lambda \cdot N_L + N_S}$$

where $N_{Y \rightarrow Z}$ is the number of messages in category $Y$ that the filter classified as $Z$, $N_L = N_{L \rightarrow L} + N_{L \rightarrow S}$, and $N_S = N_{S \rightarrow S} + N_{S \rightarrow L}$. That is, when a legitimate message is blocked, this counts as $\lambda$ errors; and when it passes the filter, this counts as $\lambda$ successes.

We consider the case where no filter is present as our baseline: legitimate messages are never blocked, and spam messages always pass. The weighted accuracy of the baseline is:

$$WAcc_b = \frac{\lambda \cdot N_L}{\lambda \cdot N_L + N_S}$$

The total cost ratio (TCR) compares the performance of a filter to the baseline:

$$TCR = \frac{WErr_b}{WErr} = \frac{N_S}{\lambda \cdot N_{L \rightarrow S} + N_{S \rightarrow L}}$$

Greater TCR values indicate better performance. For TCR < 1, not using the filter is better.

Our evaluation measures also include spam recall (SR) and spam precision (SP):

$$SR = \frac{N_{S \rightarrow S}}{N_{S \rightarrow S} + N_{S \rightarrow L}}, \qquad SP = \frac{N_{S \rightarrow S}}{N_{S \rightarrow S} + N_{L \rightarrow S}}$$

SR measures the percentage of spam messages that the filter blocks (intuitively, its effectiveness), while SP measures how many blocked messages are indeed spam (its safety). Despite their intuitiveness, comparing different filter configurations using SR and SP is difficult: each configuration yields a pair of SR and SP results, and without a single combining measure, like TCR, that incorporates the notion of cost, it is difficult to decide which pair is better.

In all the experiments, stratified 10-fold cross-validation was used. That is, Ling-Spam was partitioned into 10 equally populated parts, maintaining the original spam-legitimate ratio. Each experiment was repeated 10 times, each time reserving a different part $S_j$ ($j = 1, \ldots, 10$) for testing and using the remaining 9 parts as the training set $L_j$.
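As a concrete reading of the definitions above, the following sketch (ours, not the authors' code) applies the λ-threshold decision criterion and computes WAcc, TCR, SR and SP from the four counts N_{L→L}, N_{L→S}, N_{S→S} and N_{S→L}.

```python
def classify(w_spam, w_legit, lam):
    """Decision criterion: classify as spam iff W_S(x) > lambda * W_L(x)."""
    return "spam" if w_spam > lam * w_legit else "legit"

def evaluate(n_ll, n_ls, n_ss, n_sl, lam):
    """Cost-sensitive measures from the four confusion counts.

    n_ll: legitimate kept (L->L)     n_ls: legitimate blocked (L->S)
    n_ss: spam blocked (S->S)        n_sl: spam that passed (S->L)
    """
    n_l, n_s = n_ll + n_ls, n_ss + n_sl
    wacc = (lam * n_ll + n_ss) / (lam * n_l + n_s)
    werr = 1.0 - wacc
    wacc_b = lam * n_l / (lam * n_l + n_s)        # baseline: no filter at all
    werr_b = 1.0 - wacc_b
    tcr = werr_b / werr if werr > 0 else float("inf")   # = n_s / (lam*n_ls + n_sl)
    sr = n_ss / n_s if n_s else 0.0               # spam recall
    sp = n_ss / (n_ss + n_ls) if (n_ss + n_ls) else 0.0  # spam precision
    return {"WAcc": wacc, "TCR": tcr, "SR": sr, "SP": sp}
```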

3 Stacking

In the first version of stacking that we explored (Wolpert, 1992), which we call cross-validation stacking, the training set of the president was prepared using a second-level 3-fold cross-validation. Each training set $L_j$ was further partitioned into three equally populated parts, and the training set of the president was prepared in three steps. At each step, a different part $LS_i$ ($i = 1, 2, 3$) of $L_j$ was reserved, and the members were trained on the union $LL_i$ of the other two parts. Each $\vec{x} = \langle x_1, \ldots, x_m \rangle \in LS_i$ was enhanced with the members’ confidences $W_S^1(\vec{x})$ and $W_S^2(\vec{x})$ that $\vec{x}$ is spam, yielding an enhanced $LS_i'$ with vectors $\vec{x}\,' = \langle x_1, \ldots, x_m, W_S^1(\vec{x}), W_S^2(\vec{x}) \rangle$. At the end of the 3-fold cross-validation, the president was trained on $L_j' = LS_1' \cup LS_2' \cup LS_3'$. It was then tested on $S_j$, after retraining the members on the entire $L_j$ and enhancing the vectors of $S_j$ with the predictions of the members.

The second stacking version that we explored, dubbed holdout stacking, is similar to Kohavi’s (1995) holdout accuracy estimation. It differs from the first version in two ways: the members are not retrained on the entire $L_j$; and each partitioning of $L_j$ into $LL_i$ and $LS_i$ leads to a different president, trained on $LS_i'$, which is then tested on the enhanced $S_j$. Hence, there are $3 \times 10$ presidents in a 10-fold experiment, while in the first version there are only 10. In each case, WAcc is averaged over the presidents, and TCR is reported as $WErr_b$ over the average WErr.

Holdout stacking is likely to be less effective than cross-validation stacking, since its classifiers are trained on smaller sets. Nonetheless, it requires fewer computations, because the members are not retrained. Furthermore, during classification the president consults the same members that were used to prepare its training set. In contrast, in cross-validation stacking the president is tested using members that have received more training than those that prepared its training set. Hence, the model that the president has acquired, which shows when to trust each member, may not apply to the members that the president consults when classifying incoming messages.
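The following sketch (ours, not the paper's implementation) outlines cross-validation stacking as described above. It assumes the feature matrix and labels are numpy arrays, that each member exposes fit(X, y) and a confidence(X) method returning its degree of confidence that each message is spam, and that the president exposes fit/predict; these interfaces are our assumptions.

```python
import numpy as np

def crossval_stacking_train(members, president, X_train, y_train, n_parts=3):
    """Build the president's training set with a second-level 3-fold CV."""
    folds = np.array_split(np.arange(len(X_train)), n_parts)
    enhanced_X, enhanced_y = [], []
    for i in range(n_parts):
        held_out = folds[i]                                              # LS_i
        rest = np.concatenate([folds[j] for j in range(n_parts) if j != i])  # LL_i
        confs = []
        for member in members:
            member.fit(X_train[rest], y_train[rest])     # train member on LL_i
            confs.append(member.confidence(X_train[held_out]))
        # x' = <x_1, ..., x_m, W_S^1(x), W_S^2(x)>
        enhanced_X.append(np.column_stack([X_train[held_out]] + confs))
        enhanced_y.append(y_train[held_out])
    president.fit(np.vstack(enhanced_X), np.concatenate(enhanced_y))

    # Retrain the members on the entire L_j before filtering new messages.
    for member in members:
        member.fit(X_train, y_train)
    return president, members

def stacking_classify(president, members, X_test):
    """Enhance test vectors with the retrained members' confidences
    and let the president decide."""
    confs = [member.confidence(X_test) for member in members]
    return president.predict(np.column_stack([X_test] + confs))
```

Holdout stacking, the second version above, would skip the final retraining of the members and keep one president per held-out part.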

4 Inducers employed

As already mentioned, we used a Naïve Bayes (NB) and a memory-based learner as members of the committee (Mitchell 1997; Aha, et al. 1991). For the latter, we used TiMBL, an implementation of the k-Nearest Neighbor algorithm (Daelemans, et al. 2000). With NB, the degree of confidence $W_S(\vec{x})$ that $\vec{x}$ is spam is:

$$W_S^{NB}(\vec{x}) = P(spam \mid \vec{x}) = \frac{P(spam) \cdot \prod_{i=1}^{m} P(x_i \mid spam)}{\sum_{k \in \{spam, legit\}} P(k) \cdot \prod_{i=1}^{m} P(x_i \mid k)}$$

NB assumes that $X_1, \ldots, X_m$ are conditionally independent given the category (Duda & Hart, 1973).

With k-NN, a distance-weighted method is used, with a voting function analogous to the inverted cube of the distance (Dudani 1976). The $k$ nearest neighbors $\vec{x}_i$ of $\vec{x}$ are considered:

$$W_S^{kNN}(\vec{x}) = \frac{\sum_{i=1}^{k} \delta(spam, C(\vec{x}_i)) \cdot d(\vec{x}, \vec{x}_i)^{-3}}{\sum_{i=1}^{k} d(\vec{x}, \vec{x}_i)^{-3}}$$

where $C(\vec{x}_i)$ is the category of neighbor $\vec{x}_i$, $d(\vec{x}_i, \vec{x}_j)$ is the distance between $\vec{x}_i$ and $\vec{x}_j$, and $\delta(c_1, c_2) = 1$ if $c_1 = c_2$, and 0 otherwise. This formula weighs the contribution of each neighbor by its distance from the message to be classified, and the result is scaled to [0,1]. The distance is computed by an attribute-weighted function (Wettschereck, et al. 1995), employing Information Gain (IG):

$$d(\vec{x}_i, \vec{x}_j) = \sum_{t=1}^{m} IG_t \cdot \delta(x_t^i, x_t^j)$$

where $\vec{x}_i = \langle x_1^i, \ldots, x_m^i \rangle$, $\vec{x}_j = \langle x_1^j, \ldots, x_m^j \rangle$, and $IG_t$ is the IG score of $X_t$ (Section 1).

In Tables 1 and 2, we reproduce the best-performing configurations of the two learners on Ling-Spam (Androutsopoulos, et al. 2000b; Sakkis, et al. 2001). These configurations were used as members of the committee.

The same memory-based learner was used as the president. However, we experimented with several configurations, varying the neighborhood size (k) from 1 to 10 and providing the president with the m best word attributes, as in Section 1, with m ranging from 50 to 700 in steps of 50. The same attribute- and distance-weighting schemes were used for the president as for the ground-level memory-based classifier.

Table 1: Best configurations of NB per usage scenario and the corresponding performance.

λ    k    m      SR       SP       TCR
1    8    600    88.6%    97.4%    7.18
9    2    700    81.9%    98.8%    3.64

Table 2: Best configurations of k-NN per usage scenario and the corresponding performance.

λ    True class    Only one fails    Both fail
1    Legitimate    0.66%             0.08%
     Spam          12.27%            8.52%
     All           2.59%             1.49%
9    Legitimate    0.33%             0.08%
     Spam          19.12%            10.19%
     All           3.46%             1.76%

Table 3: Analysis of the common errors of the best configurations of NB and k-NN per scenario (λ) and message class.

Our motivation for combining NB with k-NN emerged from preliminary results indicating that the two ground-level learners make rather uncorrelated errors. Table 3 shows the average percentages of messages where only one, or both, ground-level classifiers fail, per cost scenario (λ) and message category. The figures are for the configurations of Tables 1 and 2. It can be seen that the common errors are always fewer than the cases where only one of the two classifiers fails. Hence, there is much space for improved accuracy, if the president can learn to select the correct member.
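For illustration, here is a small sketch (ours; not the paper's code and not TiMBL) of the two member confidence functions defined earlier in this section. It assumes binary attribute vectors, NB probabilities already estimated and smoothed from the training set, and reads the IG-weighted distance as summing the IG scores of the attributes on which two vectors disagree.

```python
import math
import numpy as np

def nb_confidence(x, prior, cond):
    """W_S^NB(x) = P(spam | x) under the Naive Bayes independence assumption.

    prior[c]   : P(c) for c in {"spam", "legit"}
    cond[c][i] : P(X_i = 1 | c), assumed smoothed so it is never 0 or 1.
    """
    log_score = {}
    for c in ("spam", "legit"):
        log_p = math.log(prior[c])
        for i, xi in enumerate(x):
            p1 = cond[c][i]
            log_p += math.log(p1 if xi == 1 else 1.0 - p1)
        log_score[c] = log_p
    top = max(log_score.values())                     # normalize in log space
    exp = {c: math.exp(s - top) for c, s in log_score.items()}
    return exp["spam"] / (exp["spam"] + exp["legit"])

def ig_weighted_distance(x1, x2, ig_weights):
    """d(x1, x2): mismatching attributes contribute their IG weight."""
    return sum(w for a, b, w in zip(x1, x2, ig_weights) if a != b)

def knn_confidence(x, train_X, train_y, ig_weights, k):
    """W_S^kNN(x): distance-weighted vote of the k nearest neighbours,
    with weights proportional to the inverse cube of the distance."""
    dists = [ig_weighted_distance(x, xi, ig_weights) for xi in train_X]
    nearest = np.argsort(dists)[:k]
    num = den = 0.0
    for i in nearest:
        w = 1.0 / (dists[i] ** 3 + 1e-9)              # guard against zero distance
        num += w * (train_y[i] == "spam")
        den += w
    return num / den if den else 0.0
```

In the paper both the memory-based member and the president are TiMBL configurations; the sketch only mirrors the confidence formulas, not that implementation.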

5 Experimental results

Tables 4 and 5 summarize the performance of the best configurations of the president in our experiments, for each cost scenario. Comparing the TCR scores in these tables with the corresponding scores of Tables 1 and 2 shows that stacking improves the performance of the overall filter. Of the two stacking versions, cross-validation stacking is slightly better than holdout stacking. It should also be noted that stacking was beneficial for most of the configurations of the president that we tested, i.e. most sub-optimal presidents outperformed the best configurations of the members. This is encouraging, since the optimum configuration is often hard to determine a priori, and may vary from one user to the other.

λ    k    m      SR       SP       TCR
1    5    100    91.7%    96.5%    8.44
9    3    200    84.2%    98.9%    3.98

Table 4: Best configurations of holdout stacking per usage scenario and the corresponding performance.

λ    k    m      SR       SP       TCR
1    7    300    89.6%    98.7%    8.60
9    3    100    84.8%    98.8%    4.08

Table 5: Best configurations of cross-validation stacking per usage scenario and the corresponding performance.

There was one interesting exception to the positive impact of stacking. The 1-NN and 2-NN (k = 1, 2) presidents were substantially worse than the other k-NN presidents, often performing worse than the ground-level classifiers. We witnessed this behavior in both cost scenarios, and with most values of m (number of attributes). In a “postmortem” analysis, we ascertained that most messages misclassified by 1-NN and 2-NN, but not by the other presidents, are legitimate, with their nearest neighbor being spam. Therefore, the additional errors of 1-NN and 2-NN, compared to the other presidents, are of the L→S type. Interestingly, in most of those cases both members of the committee classify the instance correctly, as legitimate. This is an indication that, for small values of the parameter k, the additional two features, i.e. the members’ confidences $W_S^1(\vec{x})$ and $W_S^2(\vec{x})$, do not enhance but distort the representation of instances. As a result, the close neighborhood of the unclassified instance consists of spam rather than legitimate messages. This behavior of the memory-based classifier is also noted in (Sakkis, et al. 2001). The suggested solution there was to use a larger value of k, combined with a strong distance-weighting function, such as the one presented in Section 4.

Conclusion

In this paper we adopted a stacked generalization approach to anti-spam filtering. The configuration that we examined combined a memory-based and a Naïve Bayes classifier in a two-member committee, in which a third, memory-based classifier served as the president. The classifiers that we chose as members of the committee have been evaluated individually on the same data as in our evaluation, i.e. the Ling-Spam corpus. The results of these earlier studies were used as a basis for comparing the performance of our method.

Our experiments, using two different approaches to stacking and two different misclassification cost scenarios, show that stacking consistently improves the performance of anti-spam filtering. This is explained by the fact that the two members of the committee disagree more often than they agree in their misclassification errors. Thus, the president is able to improve the overall performance of the filter by choosing the right member’s decision when they disagree.

The results presented here motivate further work in the same direction. In particular, we are interested in combining more classifiers, such as decision trees (Quinlan, 1993) and support vector machines (Drucker, et al. 1999), within the stacking framework. A larger variety of classifiers is expected to lead the president to more informed decisions, resulting in further improvement of the filter’s performance. Furthermore, we would like to evaluate other classifiers in the role of the president. Finally, it would be interesting to compare the performance of the stacked generalization approach to other multi-classifier methods, such as boosting (Schapire & Singer, 2000).

References

Aha, D.W., Kibler, D., and Albert, M.K. (1991). “Instance-based learning algorithms”. Machine Learning, 6:37–66.

Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., and Spyropoulos, C.D. (2000a). “An evaluation of naïve Bayesian anti-spam filtering”. In Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9–17.

Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Sakkis, G., Spyropoulos, C.D., and Stamatopoulos, P. (2000b). “Learning to filter spam e-mail: a comparison of a naïve Bayesian and a memory-based approach”. In Proceedings of the Workshop on Machine Learning and Textual Information Access, PKDD 2000, Lyon, France, pp. 1–13.

Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., and Spyropoulos, C.D. (2000c). “An experimental comparison of naïve Bayesian and keyword-based anti-spam filtering with encrypted personal e-mail messages”. In Proceedings of SIGIR 2000, Athens, Greece, pp. 160–167.

Cranor, L.F., and LaMacchia, B.A. (1998). “Spam!”. Communications of the ACM, 41(8):74–83.

Daelemans, W., Zavrel, J., van der Sloot, K., and van den Bosch, A. (2000). TiMBL: Tilburg Memory Based Learner, version 3.0, Reference Guide. ILK, Computational Linguistics, Tilburg University. http://ilk.kub.nl/~ilk/papers.

Dietterich, T.G. (1997). “Machine learning research: four current directions”. AI Magazine, 18(4):97–136.

Drucker, H., Wu, D., and Vapnik, V.N. (1999). “Support vector machines for spam categorization”. IEEE Transactions on Neural Networks, 10(5).

Duda, R.O., and Hart, P.E. (1973). “Bayes decision theory”. Chapter 2 in Pattern Classification and Scene Analysis, pp. 10–43, John Wiley.

Dudani, S.A. (1976). “The distance-weighted k-nearest neighbor rule”. IEEE Transactions on Systems, Man and Cybernetics, 6(4):325–327.

Gómez Hidalgo, J.M., Maña López, M., and Puertas Sanz, E. (2000). “Combining text and heuristics for cost-sensitive spam filtering”. In Proceedings of the 4th Computational Natural Language Learning Workshop, CoNLL-2000, Lisbon, Portugal, pp. 99–102.

Kohavi, R. (1995). “A study of cross-validation and bootstrap for accuracy estimation and model selection”. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 1995), Morgan Kaufmann, pp. 1137–1143.

Mitchell, T.M. (1997). Machine Learning. McGraw-Hill.

Pantel, P., and Lin, D. (1998). “SpamCop: a spam classification and organization program”. In Learning for Text Categorization – Papers from the AAAI Workshop, pp. 95–98, Madison, Wisconsin. AAAI Technical Report WS-98-05.

Quinlan, J.R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, California.

Sahami, M., Dumais, S., Heckerman, D., and Horvitz, E. (1998). “A Bayesian approach to filtering junk e-mail”. In Learning for Text Categorization – Papers from the AAAI Workshop, pp. 55–62, Madison, Wisconsin. AAAI Technical Report WS-98-05.

Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C.D., and Stamatopoulos, P. (2001). “A memory-based approach to anti-spam filtering”. NCSR “Demokritos” Technical Report, Athens, Greece.

Schapire, R.E., and Singer, Y. (2000). “BoosTexter: a boosting-based system for text categorization”. Machine Learning, 39(2/3):135–168.

Sebastiani, F. (2001). Machine Learning in Automated Text Categorization. Revised version of Technical Report IEI-B4-31-1999, Istituto di Elaborazione dell’Informazione, Consiglio Nazionale delle Ricerche, Pisa, Italy.

Wettschereck, D., Aha, D.W., and Mohri, T. (1995). A Review and Comparative Evaluation of Feature Weighting Methods for Lazy Learning Algorithms. Technical Report AIC-95-012, Naval Research Laboratory, Navy Center for Applied Research in Artificial Intelligence, Washington, D.C.

Wolpert, D. (1992). “Stacked generalization”. Neural Networks, 5(2):241–260.
