


Hands-On Pattern Recognition
Challenges in Machine Learning, Volume 1

Isabelle Guyon, Gavin Cawley, Gideon Dror, and Amir Saffari, editors
Nicola Talbot, production editor

Microtome Publishing
Brookline, Massachusetts
www.mtome.com

© 2011 Microtome Publishing, Brookline, Massachusetts, USA. Collection copyright; copyright of individual articles remains with their respective authors.
ISBN-13: 978-0-9719777-1-6

Series Foreword

During recent years, a team of motivated researchers led by Isabelle Guyon has done an admirable job conceiving and organizing performance evaluations in machine learning and inference, in the form of competitions or challenges. This book opens the series Challenges in Machine Learning. It contains papers by the top ranking challenge participants, providing instructive analyses of the results. It also includes tutorials and theoretical papers on topics addressed by the challenges.

Designing good challenges is far from trivial. The team benefitted from Isabelle's experience as a member of technical staff at Bell Laboratories, where she was part of a group that held world records in pattern recognition tasks, while at the same time employing theoreticians proving theorems about statistical learning. This group, which I fondly remember from the time I spent there as a student, always put great emphasis on benchmarking, but at the same time it avoided the trap of searching only the vicinity of local optima by tweaking existing methods — quite the contrary; in the 1990s, the group came up with Support Vector Machines, a development to which Isabelle made significant contributions.

While these methods are now part of our standard toolkit, Isabelle has moved on to design benchmarks for tasks that are harder to evaluate. This is not only a great service to the community, but it will also enable scientific progress on problems that are arguably more difficult than classical pattern recognition. In particular, the benchmarks include the fascinating problem of causal inference. Finding causal directions from observations is not only a profound issue for the philosophy of science, but it can also develop into an important area for practical inference applications. According to Hans Reichenbach, all statistical associations arise from causal mechanisms. However, machine learning has so far focused on the statistical 'surface' of things. Penetrating this surface would help us detect regularities that are more robust to issues that make our life difficult today, including nonstationarity and covariate shifts. It may also move us closer to the long term goal of building intelligent systems that learn about the structure of the world in an autonomous way.

Bernhard Schölkopf
Max Planck Institute, Tübingen, Germany

Foreword

Machine learning is about building machines that learn. Building machines is engineering. The idea is to create an artefact. The hope is that these artefacts are useful (typically to others). The machines have to solve some end-user problem. The present book grapples with a number of key issues central to this task — how to represent the data, how to select suitable models, and how to evaluate performance.

Engineers design many types of machine — flying machines, communication machines, etc. The question of how to evaluate performance arises in many areas. The question of representing data also arises (although it means something different). If one compares the state-of-the-art in performance measurement and prediction in mature engineering disciplines, Machine Learning looks primitive in comparison. Communications engineers [4] can design systems and predict their performance in messy real world situations very well. The same is true in aeronautics [2]. Why is Machine Learning lagging behind? And what can be done about it?

One thing that more mature engineering disciplines seem to have in common is a wide variety of "ways of knowing" or acceptable research practices. It is well accepted in aeronautical engineering that it is useful to have design rules of thumb [2]. In fact many scholars have argued that the traditional view of engineering as applied science is back-to-front [5]. There are a range of different categorisations of "useful knowledge" different to the tired "pure versus applied" (see for example Mokyr's [3] distinction between propositional and prescriptive knowledge). How do philosophical reflections on the nature of engineering knowledge affect the development of machine learning, and how are they relevant to the present book? Simply, different ways of knowing require different means of inquiry. The benchmark competitions summarised in this book are a different way of knowing. They are analogous to the principled (but not scientifically derived) empirical studies in many branches of engineering that complement more well-honed scientific knowledge.

But this is a starting point — a beginning rather than an end. There are in fact many profound scientific questions to be answered regarding performance evaluation. For example, a satisfactory theory of cross-validation still eludes the community. And the plethora of different performance measures needs to be brought into better order. Self-bounding learning algorithms [6] (that not only estimate an object of interest but also estimate how well it is estimated) deserve further study. There are many more questions to be answered.

Much machine learning research is driven by the interests of the researcher. It is often technique-oriented rather than problem driven. End users often neither understand nor care about the distinctions between variants of different learning algorithms. A problem-oriented perspective is rare (an exception is [1]). However, end-users do care about performance, how to represent their data and how to choose models. These topics are the focus of this book, which marks a great first step in building a richer understanding of the engineering of machines that learn.

Robert C. Williamson
Canberra, Australia.

[1] Vic Barnett, Comparative Statistical Inference, 3rd Edition, John Wiley and Sons, Chichester, 1999.
[2] Walter G. Vincenti, What Engineers Know and How They Know It: Analytical Studies from Aeronautical History, The Johns Hopkins University Press, Baltimore, 1990.
[3] Joel Mokyr, The Gifts of Athena: Historical Origins of the Knowledge Economy, Princeton University Press, Princeton, 2002.
[4] John G. Proakis and Masoud Salehi, Communication Systems Engineering, Pearson Education, 2003.
[5] Marc J. de Vries, "The Nature of Technological Knowledge: Extending Empirically Informed Studies into What Engineers Know", Techné, 6:3, 1–21, 2003.
[6] Yoav Freund, "Self bounding learning algorithms", COLT '98: Proceedings of the Eleventh Annual Conference on Computational Learning Theory, ACM Press, 247–258, 1998.

Preface

Recently organized competitions have been instrumental in pushing the state-of-the-art in machine learning, establishing benchmarks to fairly evaluate methods, and identifying techniques that really work.

This book harvests three years of effort by hundreds of researchers who participated in three competitions we organized around five datasets from various application domains. Three aspects were explored: data representation, model selection, and performance prediction.

With the proper data representation, learning becomes almost trivial. For the defenders of fully automated data processing, the search for better data representations is just part of learning. At the other end of the spectrum, domain specialists engineer data representations that are tailored to particular applications. The results of the "Agnostic Learning vs. Prior Knowledge" challenge are discussed in the book, including longer versions of the best papers from the IJCNN 2007 workshop on "Data Representation Discovery", where the best competitors presented their results.

Given a family of models with adjustable parameters, Machine Learning provides us with means of "learning from examples" and obtaining a good predictive model. The problem becomes more arduous when the family of models possesses so-called hyper-parameters or when it consists of heterogeneous entities (e.g. linear models, neural networks, classification and regression trees, kernel methods, etc.). Both practical and theoretical considerations may lead us to split the problem into multiple levels of inference. Typically, at the lower level, the parameters of individual models are optimized, and at the second level the best model is selected, e.g. via cross-validation. This problem is often referred to as model selection. The results of the "Model Selection Game" are included in this book, as well as the best papers of the NIPS 2006 "Multi-level Inference" workshop.

In most real world situations, it is not sufficient to provide a good predictor; it is important to assess accurately how well this predictor will perform on new unseen data. Before deploying a model in the field, one must know whether it will meet the specifications or whether one should invest more time and resources to collect additional data and/or develop more sophisticated models. The performance prediction challenge asked participants to provide prediction results on new unseen test data AND to predict how good these predictions were going to be on a test set for which they did not know the labels ahead of time. Therefore, participants had to design both a good predictive model and a good performance estimator. The results of the "Performance Prediction Challenge" and the best papers of the "WCCI 2006 workshop on model selection" are included in the book.

A selection of papers from the JMLR special topic on model selection, including longer contributions of the best challenge participants, is also reprinted in the book.

Isabelle Guyon, Gavin Cawley, Gideon Dror, Amir Saffari, Editors. January 2011.

Table of Contents

Series Foreword
Foreword
Preface

Part I: Introduction
  Challenges in Data Representation, Model Selection, and Performance Prediction
    I. Guyon, A. Saffari, G. Dror & G. Cawley
  Model Selection: Beyond the Bayesian/Frequentist Divide
    I. Guyon, A. Saffari, G. Dror & G. Cawley; JMLR 11(Jan):61–87, 2010.
  On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation
    G.C. Cawley & N.L.C. Talbot; JMLR 11(Jul):2079–2107, 2010.

Part II: Data Representation
  Hybrid Learning Using Mixture Models and Artificial Neural Networks
    M. Saeed
  Data Grid Models for Preparation and Modeling in Supervised Learning
    M. Boullé
  Virtual High-Throughput Screening with Two-Dimensional Kernels
    C.-A. Azencott & P. Baldi

Part III: Robust Parameter Estimation
  Unified Framework for SVM Model Selection
    M.M. Adankon & M. Cheriet
  Liknon Feature Selection: Behind the Scenes
    E. Pranckeviciene & R. Somorjai
  Model Selection in Kernel Based Regression Using the Influence Function
    M. Debruyne, M. Hubert & J.A.K. Suykens; JMLR 9(Oct):2377–2400, 2008.

Part IV: Ensemble Methods
  An Improved Random Forests Approach with Application to the Performance Prediction Challenge Datasets
    C. Dahinden
  Feature Selection with Ensembles, Artificial Variables, and Redundancy Elimination
    E. Tuv, A. Borisov, G. Runger & K. Torkkola; JMLR 10(Jul):1341–1366, 2009.
  Classification with Random Sets, Boosting and Distance-based Clustering
    V. Nikulin

Part V: Multi-level Inference
  Preventing Over-Fitting during Model Selection via Bayesian Regularisation of the Hyper-Parameters
    G.C. Cawley & N.L.C. Talbot; JMLR 8(Apr):841–861, 2007.
  Particle Swarm Model Selection
    H.J. Escalante, M. Montes & L.E. Sucar; JMLR 10(Feb):405–440, 2009.
  Bilevel Cross-validation-based Model Selection
    G. Kunapuli, J.-S. Pang & K.P. Bennett

Appendix A: Dataset Description
  Datasets for the Agnostic Learning vs. Prior Knowledge Competition
    I. Guyon

Appendix B: Fact Sheets
  B1 Performance Prediction Challenge
    B1.1 LogitBoost with trees
    B1.2 Weighted LS-SVM Leave-One-Out Cross-Validation Repeated Hold-Out
    B1.3 Bayesian Neural Networks for the Performance Prediction Challenge
    B1.4 Random Forests
    B1.5 Kernel Classifier
    B1.6 Random Linear Matching Pursuit
    B1.7 Regularized and Averaged Selective Naïve Bayes Classifier
    B1.8 Artificial Contrasts with Ensembles and Regularized Least Squares Classifiers
    B1.9 SVM-LOO
    B1.10 Model Selection in an Ensemble Framework
    B1.11 Advanced Analytical Methods, INTEL
    B1.12 Learning with Mean-Variance Filtering, SVM and Gradient-based Optimization
    B1.13 Large margin linear classifiers with bias adjustment for skewed two-class distributions
    B1.14 A Study of Supervised Learning with Multivariate Analysis on Unbalanced Datasets
    B1.15 Cross-indexing
  B2 AL vs PK Challenge
    B2.1 LogitBoost with trees
    B2.2 Feature selection with redundancy elimination gradient boosted trees
    B2.3 Cross-indexing
    B2.4 Classification with Random Sets, Boosting and Distance-based Clustering
    B2.5 PSMS for Neural Networks
    B2.6 Hybrid approach for learning
    B2.7 Linear Programming SVM (Liknon)
    B2.8 Agnostic Learning with Ensembles of Classifiers
    B2.9 Modified multi-class SVM formulation; Efficient LOO computation
    B2.10 Report on Preliminary Experiments with Data Grid Models in the Agnostic Learning vs. Prior Knowledge Challenge
    B2.11 Dimensionality Reduction Techniques
    B2.12 DoubleBoost
    B2.13 Boosting with SVM
    B2.14 High-Throughput Screening with Two-Dimensional Kernels

Appendix C: CLOP: The Challenge Learning Object Package
  Quick Start Guide for CLOP
    A.R.S.A. Alamdari & I. Guyon

Part I

Introduction

Chapter 1

Challenges in Data Representation, Model Selection, and Performance Prediction

Isabelle Guyon (isabelle@clopinet.com), ClopiNet, Berkeley, CA 94708, USA
Amir Saffari (amir@ymer.org), Graz University of Technology, Austria
Gideon Dror (gideon@mta.ac.il), Academic College of Tel-Aviv-Yaffo, Israel
Gavin Cawley (gcc@cmp.uea.ac.uk), University of East Anglia, UK

Abstract

We organized a series of challenges for the conferences IJCNN/WCCI 2006, NIPS 2006 and IJCNN 2007 to explore various aspects of machine learning, ranging from the choice of data representation to the selection of the best model. The class of problems addressed consists of classification problems encountered in pattern recognition (classification of images, speech recognition), medical diagnosis, marketing (customer categorization), and text categorization (filtering of spam). All three challenges used the same five datasets, formatted in two data representations: raw data and preprocessed data in a feature-based representation. Post-challenge submissions can still be made at: http://www.agnostic.inf.ethz.ch/. Several chapters in this volume are contributed by top ranking challenge participants, who describe their methods in detail.

Keywords: Supervised learning; Classification; Competition; Performance prediction; Model selection; Agnostic Learning; Prior Knowledge; Domain Knowledge; Boosting; Ensemble methods; Kernel methods; Support Vector Machines; SVM; LSSVM; Data Grid models.

1.1. Introduction

Challenges have proved to be a great stimulus for research in machine learning, pattern recognition, and robotics. Robotics contests seem to be particularly popular, with hundreds of events every year, the most visible ones probably being the DARPA Grand Challenges of autonomous ground vehicle navigation and RoboCup, featuring several challenges for robots including playing soccer or rescuing people. In data mining and machine learning, several conferences have regularly organized challenges over the past 10 years, including the well established Text REtrieval Conference (TREC) and the Knowledge Discovery in Databases cup (KDD cup). More specialized pattern recognition and bioinformatics conferences have also held their own contests, e.g. CASP for protein structure prediction, DREAM for reverse engineering biological networks, ICDAR for document analysis, and the PASCAL Visual Object Challenge (VOC) for object recognition. The European network of excellence PASCAL2 has actively sponsored a number of challenges around hot themes in machine learning, which have punctuated workshops at NIPS and other conferences. These contests are oriented towards scientific research, and the main reward for the winners is to disseminate the product of their research and obtain recognition.

In that respect, they play a different role than challenges like the Netflix prize, which offers large monetary rewards for solving a task of value to industry (movie referral in that particular case), but is narrower in scope. Attracting hundreds of participants and the attention of a broad audience of specialists as well as sometimes the general public, these events have been important in several respects: (1) pushing the state-of-the-art, (2) identifying techniques which really work, (3) attracting new researchers, (4) raising the standards of research, (5) giving the opportunity to non-established researchers to make themselves rapidly known.

In 2003, we organized a challenge on the theme of feature selection (Guyon et al., 2005) whose results were discussed at NIPS 2003. A book was published collecting tutorial papers and the best papers from the challenge participants (Guyon et al., 2006a). We have continued organizing challenges regularly every year, exploring various aspects of machine learning: model selection, causal discovery, and active learning (see http://clopinet.com/challenges). The present paper summarizes the results of three challenges: the IJCNN/WCCI 2006 "performance prediction challenge" (Guyon et al., 2006b), the NIPS 2006 "model selection game", and the IJCNN 2007 "agnostic learning vs. prior knowledge challenge" (Guyon et al., 2007, 2008).

1.2. Motivations for this series of challenges

Predictive modeling for classification and regression is a central problem addressed in statistics, data mining and machine learning. Examples include pattern recognition (handwriting recognition, speech recognition, object classification, text classification), medical diagnosis and prognosis, spam filtering, etc. In such problems, the goal is to predict an outcome (a category or a continuous variable), given patterns represented as vectors of features, graphs, texts, etc. The standard approach to tackle such problems is to construct a predictive model from examples of pattern/outcome pairs. After training, the model should be capable of making predictions of the outcome given new examples not used for training (generalization).

Machine learning researchers have been devoting much effort in the past few decades to inventing and improving learning algorithms. In proportion, less effort has been dedicated to problems of data representation, model selection, and performance prediction. This paper summarizes the results of challenges we organized around these topics. The questions addressed are the following:

- Data representation: The difficulty of learning from examples can be alleviated with a proper data representation. However, finding such representations relies on expert domain knowledge. Conversely, adding more training data may yield better performance without requiring such knowledge. How should human resources rather be employed: in collecting more data or in incorporating domain knowledge into the design of the data representation?

- Model selection: There are many learning machine architectures and algorithms to choose from. Given a certain amount of available training data, what strategy should be adopted to deliver the best predictive model, including choosing the model family and the model architecture, and tuning all hyper-parameters and parameters?

- Performance prediction: It is one thing to deliver the best model, but it is a different thing to know how well it will perform on new, previously unseen, data. The former problem is that of model selection; the latter is that of performance prediction. A good estimator of performance prediction is obviously a good model selection criterion. However, there may exist simpler model selection criteria that only allow models to be ranked according to predictive power without predicting their performance. The problem of performance prediction is to make the best possible use of the training data to both train the model and predict its prediction accuracy on future test data.

1.3. Datasets

In all challenges we used the same five datasets; however, the data were formatted differently and scrambled, to prevent the participants from using results of previous challenges as a head start and to give an even chance to new competitors. The tasks are five two-class classification problems spanning a variety of domains (marketing, handwriting recognition (HWR), drug discovery, text classification, and ecology) and a variety of difficulties, with sufficiently many examples to obtain statistically significant results. The input variables are continuous or binary, sparse or dense. Some raw data representations are not feature based. In some problems, the class proportions are very imbalanced. A detailed report on the data preparation is available (Guyon, 2005). The main data characteristics are summarized in Table 1.1. Non-feature based representations are supplied for HIVA (molecular structure) and NOVA (emails) and were used in the "data representation" competition.

Table 1.1: Datasets of the three challenges.

  Dataset  Domain           Number of examples       Percent      Number of features
                            (train/valid/test)       pos. class   Raw data    Preproc.
  ADA      Marketing        4147 / 415 / 41471       28.4         14          48
  GINA     HWR              3153 / 315 / 31532       49.2         784         970
  HIVA     Drug discovery   3845 / 384 / 38449       3.5          Molecules   1617
  NOVA     Text classif.    1754 / 175 / 17537       28.5         Text        16969
  SYLVA    Ecology          13086 / 1309 / 130857    6.2          108         216

1.4. Design of the challenges

1.4.1. General evaluation procedure

The design of the challenges was informed by experience gained from another challenge we organized previously on feature selection (Guyon et al., 2005). In particular, we used a system of on-line submission, which provided the competitors with immediate feed-back on a small subset of the data called the validation set. The organizers provided initial submissions to bootstrap the challenge. A toolkit including some of the methods performing best in previous challenges was also provided (the so-called Challenge Learning Object Package, CLOP (Saffari and Guyon, 2006); see Appendix). At the end of a development period, the validation set labels were revealed. The final ranking was performed on a large separate test set. The test set labels will remain hidden to permit meaningful comparison with post-challenge submissions.

Performance was measured in terms of the balanced error rate (BER), which is the average of the error rate on the positive class and the error rate on the negative class.
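As an illustration only (this snippet is not part of the challenge software; the function name and the +1/−1 label convention are assumptions of this sketch), the BER can be computed from a vector of true labels and a vector of predicted labels as follows:

```python
import numpy as np

def balanced_error_rate(y_true, y_pred):
    """Balanced error rate: average of the per-class error rates.

    Assumes binary labels coded as +1 (positive class) and -1 (negative class).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    pos, neg = (y_true == 1), (y_true == -1)
    err_pos = np.mean(y_pred[pos] != 1)   # error rate on the positive class
    err_neg = np.mean(y_pred[neg] != -1)  # error rate on the negative class
    return 0.5 * (err_pos + err_neg)
```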

As is known, for i.i.d. errors corresponding to Bernoulli trials with a probability of error $p$, the standard deviation of the error rate $E$ computed on a test set of size $m$ is $\sqrt{p(1-p)/m}$. This result can be adapted to the balanced error rate. Let us call $m^+$ the number of examples of the positive class, $m^-$ the number of examples of the negative class, $p^+$ the probability of error on examples of the positive class, $p^-$ the probability of error on examples of the negative class, and $E^+$ and $E^-$ the corresponding empirical estimates. Both processes generating errors on the positive or negative class are Bernoulli processes. By definition, the balanced error rate is $\mathrm{BER} = \frac{1}{2}(E^+ + E^-)$, and its variance is $\mathrm{var}(\mathrm{BER}) = \frac{1}{4}(\mathrm{var}(E^+) + \mathrm{var}(E^-))$. The standard deviation of the BER using $m^+$ and $m^-$ examples is therefore

$$\sigma = \frac{1}{2}\sqrt{\frac{p^+(1-p^+)}{m^+} + \frac{p^-(1-p^-)}{m^-}}. \qquad (1.1)$$

For sufficiently large test sets, we may substitute $p^+$ by $E^+$ and $p^-$ by $E^-$ to compute $\sigma$.

To rank the participants we adopted the following scheme: the entries were first ranked for each individual dataset; then a global ranking was obtained based on the average rank over all five datasets. Different ranking scores incorporating the BER were used in the different challenges.

1.4.2. Performance prediction challenge

We first ran the competition on performance prediction, which was technically the easiest to organize. In that competition, in addition to providing predicted labels on test data, the participants also had to provide an estimate of their performance on the test set. For the performance prediction challenge, the ranking score balanced classification accuracy and performance prediction accuracy. Denoting as $\mathrm{BER}$ the balanced error rate actually computed from predictions made on test examples, and $\mathrm{BER}_{guess}$ the challenger's own performance prediction, we defined our ranking score as:

$$S = \mathrm{BER} + \delta\mathrm{BER}\,\bigl(1 - e^{-\delta\mathrm{BER}/\sigma}\bigr), \qquad (1.2)$$

where $\delta\mathrm{BER} = |\mathrm{BER}_{guess} - \mathrm{BER}|$ measures in absolute value the difference between the computed BER and the predicted BER. The multiplicative factor $(1 - e^{-\delta\mathrm{BER}/\sigma})$ accounts for our uncertainty about the exact BER, since we can only estimate it on a finite test set of size $m$. If the BER error bar $\sigma$ is small compared to the error of the challenger $\delta\mathrm{BER}$, then this factor is close to one and the ranking score becomes simply $\mathrm{BER} + \delta\mathrm{BER}$. But if $\sigma$ is large relative to the error made by the challenger, we have $S \approx \mathrm{BER}$.

The challenge started September 30th, 2005 and ended March 1st, 2006 (duration: 21 weeks). Therefore, the competitors had several months to build classifiers with the provided (labeled) training data. We estimated that 145 entrants participated. We received 4228 "development entries" (entries not counting towards the final ranking). A total of 28 participants competed for the final ranking by providing valid challenge entries (results on training, validation, and test sets for all five tasks). We received 117 submissions qualifying for the final ranking (a maximum of 5 entries per participant was allowed). Participation doubled in number of participants and entry volume compared to the feature selection challenge. The results of the challenge were discussed at the IJCNN/WCCI 2006 conference. The website of the performance prediction challenge, including details on the results, is available at: http://www.modelselect.inf.ethz.ch/. Submissions to that website are closed because the same datasets were used in the follow-up ALvsPK challenge, whose website remains open.
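The two quantities defined above, the error bar of Eq. (1.1) and the ranking score of Eq. (1.2), translate directly into code. The following sketch is illustrative only; the function names and the numbers in the usage example are assumptions, not challenge results:

```python
import numpy as np

def ber_error_bar(err_pos, err_neg, m_pos, m_neg):
    """Standard deviation of the BER, Eq. (1.1), substituting the empirical
    per-class error rates for the true error probabilities p+ and p-."""
    return 0.5 * np.sqrt(err_pos * (1.0 - err_pos) / m_pos +
                         err_neg * (1.0 - err_neg) / m_neg)

def ranking_score(ber, ber_guess, sigma):
    """Ranking score of Eq. (1.2): the test BER plus the performance-prediction
    error, discounted when that error is within the uncertainty sigma."""
    delta = abs(ber_guess - ber)
    return ber + delta * (1.0 - np.exp(-delta / sigma))

# Illustrative numbers only (not taken from the challenge):
sigma = ber_error_bar(err_pos=0.10, err_neg=0.12, m_pos=1200, m_neg=4000)
print(ranking_score(ber=0.11, ber_guess=0.13, sigma=sigma))
```

When the prediction error is large relative to $\sigma$ the score approaches $\mathrm{BER} + \delta\mathrm{BER}$; when it is within $\sigma$, the penalty mostly vanishes and the score approaches $\mathrm{BER}$.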

1.4.3. ALvsPK challenge on data representation

The agnostic learning vs. prior knowledge challenge (ALvsPK) had two parallel tracks: AL and PK. For the "agnostic learning" (AL) track we supplied data preprocessed to provide a simple feature-based representation, suitable for use with any off-the-shelf machine learning or data mining package. The pre-processing used was identical to that used in the previous challenge on performance prediction, but with a new split of the data. The participants had no knowledge of the identity of the features in the agnostic track. The raw data representations were supplied in the "prior knowledge" (PK) track. They were not necessarily in the form of data tables. For instance, in the drug discovery problem the raw data consist of a representation of the three dimensional structure of the drug molecules; in the text processing problem, the raw data are messages posted to USENET newsgroups. The participants had full knowledge of the meaning of the representation of the data in the PK track. Therefore, PK competitors had the opportunity to use domain knowledge to build better predictors and beat last year's AL results, or to make new "agnostic" entries. Note that the training/test splits used are the same in both tracks, but the example ordering is different in each data subset, to hinder matching patterns in the two representations and/or submitting results with the representation prescribed for the other track.

The Balanced Error Rate (BER) was used for scoring the participants, and otherwise the modalities of evaluation were similar to those of the previous challenge on performance prediction. The challenge started on October 1st, 2006 and ended on August 1st, 2007 (duration: 10 months). Two milestone rankings of the participants were made using the test set, without revealing either the test labels or the test performance: on December 1st, for the "model selection game", and on March 1st, to allow us to publish intermediate results (Guyon et al., 2007). To be eligible for the final ranking, submissions had to include results on the test data for all the tasks of the challenge in either track. However, recognizing that domain knowledge is task specific, prizes were given for each task individually in the "prior knowledge" track. For each group, only the last five entries in either track counted towards the final ranking. The results of the ALvsPK challenge were discussed at the IJCNN 2007 conference. Details can be found on the website of the challenge: http://www.agnostic.inf.ethz.ch/.

1.4.4. Model selection game

In previous challenges, we noted that different teams using similar classification methods (even sometimes the same software package) obtained very different results. We conjectured that this variance may be due to differences in model selection strategies. To stimulate research in model selection and verify our conjecture, we organized a model selection game within the ALvsPK challenge. Using the data of the AL track, the competitors were asked to return results with the constraint of using only models from the toolkit that the organizers provided: the Challenge Learning Object Package, CLOP (Saffari and Guyon, 2006); see Appendix. The package was available for download from the web site of the challenge, and the latest version is available from http://clopinet.com/CLOP (see Appendix for a brief description). That toolkit includes classical methods and some of the algorithms that worked best in the previously organized challenges.
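CLOP itself is a MATLAB toolkit, so the following is not CLOP code; it is only a minimal sketch, using scikit-learn as a stand-in, of the two-level inference pattern the game was probing: an inner level that fits model parameters and an outer level that selects hyper-parameters by cross-validation. The dataset and the hyper-parameter grid are placeholders, not values used in the challenge:

```python
# Illustrative sketch only -- not the CLOP toolkit used in the challenge.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Placeholder data standing in for one of the challenge tasks.
X, y = make_classification(n_samples=500, n_features=48, random_state=0)

search = GridSearchCV(
    estimator=SVC(kernel="rbf"),                           # inner level: fit parameters
    param_grid={"C": [0.1, 1, 10], "gamma": [1e-3, 1e-2, 1e-1]},
    scoring="balanced_accuracy",                           # 1 - BER for binary problems
    cv=5,                                                  # outer level: cross-validation
)
search.fit(X, y)
print(search.best_params_, "CV BER estimate:", 1 - search.best_score_)
```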
