Statistical Learning Machines: Notes From Theory And Practice

Transcription

Statistical Learning Machines: Notes from Theory and Practice
James D. Malley, Ph.D.
Division of Computational Biosciences, MSCL, CIT
and
Jens R. Wendland, MD
Genetic Basis of Mood and Anxiety Disorders, NIMH

The ultimate user-friendly learning machine.
As the machine studies the data, it teaches us and we are the learners.

Outline
Background
Machine Mythology: Learning and Unlearning Machines
Strategies for Learning: Open Books and Black Boxes
Research Questions
Recent results on very large SNP data sets

Background Resources
A Probabilistic Theory of Pattern Recognition
Luc Devroye, László Györfi, Gábor Lugosi (Springer, 1996), referred to here as DGL
Statistical Learning for Biomedical Data
JD Malley, KG Malley, S Pajevic (Cambridge University Press, 2010)

The Bayes machine
The minimum Bayes loss rule is defined using the conditional probability function η(x):
    η(x) = Pr[Y = 1 | X = x]
And the Bayes machine (rule) g* is:
    g*(x) = 1 if η(x) > ½, and 0 otherwise
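
The Bayes rule is easy to state in code. The following minimal Python sketch (not from the slides) assumes a known η(x), which is possible only in simulation; the logistic form of η and all names are illustrative.

```python
import numpy as np

# A minimal sketch of the Bayes machine g*, assuming a known eta(x).
def bayes_rule(eta, X):
    """Predict 1 exactly when eta(x) = Pr[Y = 1 | X = x] exceeds 1/2."""
    return (eta(X) > 0.5).astype(int)

# Toy eta, known only because this is a simulation.
eta = lambda X: 1.0 / (1.0 + np.exp(-4.0 * X[:, 0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(100000, 2))
Y = rng.binomial(1, eta(X))

# The empirical error of g* approximates the Bayes error L = E[min(eta, 1 - eta)].
print("error of g*:", np.mean(bayes_rule(eta, X) != Y))
print("E[min(eta, 1 - eta)]:", np.mean(np.minimum(eta(X), 1 - eta(X))))
```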

The Bayes rule is best
No other rule (machine) can in principle have a lower Bayes error loss than g*(x).
But we never know the true probability function η(x), so we can't define g*(x).
And it is very hard to estimate η(x) without knowing any details about the distribution of the data.
However, it's not necessary to accurately estimate η(x) to get good binary decision rules.

Background Details
Given training data Dn with sample size n, consisting of n training vectors
    (x1, y1), (x2, y2), . . . , (xn, yn)
where each X = xi is a d-dimensional vector of features (attributes) and each Y = yi is a binary outcome, Y ∈ {0, 1}.
We want to predict a new Y given a new X, with a machine (rule) g(X), g(X) ∈ {0, 1},
generated using the training data Dn: gn(X) = g(X; Dn).
Evaluate the machine using the Bayes error (loss) Ln:
    Ln(g) = Pr[ gn(X) ≠ Y, given training data Dn ]

Unlearning Machines
Myth #1: There exists a super machine that will classify all data really well.
FACT: Given machine A there always exists a universally good machine B and a distribution D such that:
1. For the data D the true Bayes error probability is exactly zero.
2. Machine A has error probability greater than Machine B's error probability (and for all sample sizes from D).

Some Cautions
1. Given a machine A there is not necessarily a machine B that is better for every data set.
2. Given a machine A and a machine B there is not necessarily a data set such that B is better than A on that data.
3. Both of these related, but not equivalent, assertions are open research questions.
4. Basic probability and advanced combinatorial methods are needed to make progress. See DGL, Chapter 1.
(Terminology: a machine whose error converges in probability to the Bayes error is Bayes consistent; one whose error converges almost everywhere is strongly Bayes consistent.)

Unlearning Machines
Myth #2: A machine must be complex and really quite clever in order to be very good.
FACT: There are many simple, practical machines that are very good.
A key example: the 1-nearest neighbor machine (1-NN) requires no training, no tuning, no hard-won optimization, and yet for data with true Bayes error L, the 1-NN Bayes loss Lnn satisfies
    L ≤ Lnn ≤ 2L
So if the true Bayes error is small, then without any further work the Bayes loss for 1-NN is also small.

More details about Myth #2
Note that the apparent (observed) estimated error on the training data Dn for the 1-NN machine is always exactly zero.
It is not magical or mysterious for a machine to have this property. The logitboost machine can have this property.
So it does not necessarily mean we are overfitting the data and can't make good predictions on new data.
Hence how we estimate error on the training data Dn is important.
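
A small simulation (not from the slides) makes the point concrete: with scikit-learn's 1-NN the apparent training error is exactly zero, while the test error still sits near the Bayes error. The data-generating model and all parameter choices are illustrative.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)

# Simulated data with a known eta(x), so the true Bayes error is computable.
def sample(n):
    X = rng.normal(size=(n, 2))
    eta = 1.0 / (1.0 + np.exp(-3.0 * X[:, 0]))     # Pr[Y = 1 | X = x]
    return X, rng.binomial(1, eta), eta

X_train, y_train, _ = sample(2000)
X_test, y_test, eta_test = sample(20000)

nn1 = KNeighborsClassifier(n_neighbors=1).fit(X_train, y_train)

# Apparent (training) error is exactly zero: each point is its own nearest neighbor.
print("training error:", np.mean(nn1.predict(X_train) != y_train))

# Asymptotically the 1-NN error lies between L and 2L, with L = E[min(eta, 1 - eta)].
L = np.mean(np.minimum(eta_test, 1.0 - eta_test))
print("Bayes error L ~", round(L, 3))
print("1-NN test error:", round(np.mean(nn1.predict(X_test) != y_test), 3))
```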

Unlearning Machines
Myth #3 (version 1): A good machine needs very little data.
FACT: For any good machine there is a data set Dn such that its error is far above its large-sample Bayes error.
Given any small constant c > 0, there is a distribution such that, for a sample of size n, the error satisfies
    Ln ≥ ½ − c
That is, the machine gets stuck arbitrarily close to coin-tossing. See DGL, Chapter 6.

Unlearning Machines
Myth #3 (version 2): A very good machine needs very little data.
FACT: For a given good machine there exists data such that the estimated Bayes error converges arbitrarily slowly to its large-sample lower limit.
1. This is true even for universally strongly Bayes consistent machines.
2. The Bayes error can be held above any decreasing sequence, for every n.

Unlearning Machines
Myth #4: A weak machine must be abandoned.
FACT: Many weak machines (all with high error) can be combined to generate a provably very good machine (low error).
1. The basic idea goes back to Condorcet, 1785 (!)
2. Weak, provably inconsistent machines may be very strong when decisions are pooled (Gérard Biau et al., 2008).
3. These ensemble, or committee, methods include boosting.
4. Committee decisions are part of the Random Forest strategy.
5. The method of Mojirsheibani (1999) goes further . . .
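
A toy simulation of the Condorcet idea (not from the slides, and assuming independent voters, which real machines rarely are): each "machine" is barely better than a coin flip, yet the majority vote of many of them is very accurate.

```python
import numpy as np

rng = np.random.default_rng(2)
n_points = 10000

def committee_error(n_voters, p_correct=0.55):
    """Error of a majority vote over weak, independent voters that are each
    right with probability p_correct (barely better than chance)."""
    # A vote of 1 means "this voter agrees with the truth" on that point.
    votes = rng.binomial(1, p_correct, size=(n_voters, n_points))
    majority_correct = votes.sum(axis=0) > n_voters / 2
    return 1.0 - majority_correct.mean()

for m in (1, 11, 101, 1001):
    print(f"{m:5d} weak voters -> committee error {committee_error(m):.3f}")
```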

A superior committee method for these uncertain times
Mojirsheibani (1999, 2002) showed that any collection of machines can be pooled to get:
1. A group machine that is at least as good as the strongest machine in the set.
2. A group machine that is Bayes optimal if any machine in the group is Bayes optimal.
3. And we don't need to know which machine is which in (1) or (2).
4. The method is large-sample optimal; see all the cautions above about small or fixed sample sizes.
5. The method is a kind of single decision tree.
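
As a rough sketch of the combining idea (my paraphrase of Mojirsheibani's construction; consult the 1999 paper for the actual estimator): group the training points by the vector of base-machine predictions and take a majority vote of the labels within the matching group. The function name and the fallback rule for unseen prediction patterns are illustrative choices.

```python
import numpy as np

def combine_machines(base_preds_train, y_train, base_preds_new):
    """Sketch of the combining idea: index training points by the tuple of
    base-machine predictions, then classify a new point by majority vote of
    the training labels whose prediction tuple matches its own.
    base_preds_* have shape (n_machines, n_points) with 0/1 entries."""
    cells = {}
    for pattern, y in zip(map(tuple, base_preds_train.T), y_train):
        cells.setdefault(pattern, []).append(y)

    out = []
    for pattern in map(tuple, base_preds_new.T):
        labels = cells.get(pattern)
        if labels:                                  # pattern seen in training
            out.append(int(np.mean(labels) > 0.5))
        else:                                       # unseen pattern: plain majority vote
            out.append(int(np.mean(pattern) > 0.5))
    return np.array(out)
```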

Unlearning Machines
Myth #5: Finding optimal parameter estimates is the same as finding good machines.
FACT: For the binary classification problem the real issue is getting good {0, 1} predictions and low Bayes error.
1. This is not the same as finding, say, the minimum squared error for any parameters in the machine code.
2. DGL has key examples showing that MSE can be minimized and yet the resulting machine has large error probability, far from the minimum.
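
A toy numerical example (not one of the DGL examples) showing that a smaller squared error against the true probability does not guarantee a smaller classification error: here Y is Bernoulli(0.55) for every x, and the estimator with the worse MSE makes the better decision.

```python
import numpy as np

rng = np.random.default_rng(3)
eta = 0.55                       # true Pr[Y = 1], the same for every x
Y = rng.binomial(1, eta, size=100000)

# Estimator A: close to eta in squared error, but on the wrong side of 1/2.
# Estimator B: far from eta in squared error, but on the right side of 1/2.
for name, p_hat in (("A", 0.45), ("B", 0.95)):
    mse = (p_hat - eta) ** 2
    pred = int(p_hat > 0.5)
    err = np.mean(pred != Y)
    print(f"estimator {name}: MSE vs eta = {mse:.3f}, classification error = {err:.3f}")
```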

Unlearning Machines
Myth #6: A good binary machine works because it is a good estimator of the group probability.
FACT: Simple examples show otherwise.
1. So, a good probability machine is certainly a good binary decision machine, but
2. an excellent binary rule can be not very good as a probability estimator.
3. Basic idea: a good machine only has to get a probability estimate that is on the same side of the decision boundary as the Bayes probability η(x). It can be very different from the Bayes probability.

Unlearning Machines
More on Myth #6 . . .
4. Good machines like logitboost and Random Forest do not instinctively estimate the group probability.
5. Random Forest seems to be better than logitboost at estimating the group probability, but clearly isn't that good.
6. It might be possible to re-engineer logitboost (LB) or Random Forest (RF), but this is not certain.
7. Both LB and RF can be seen as forms of neural nets, so some second hidden layer, transforming output from the initial hidden layer, might work.

Unlearning Machines
Myth #7: A good machine requires careful tuning to work well.
FACT: Good machines need only some tuning, not much. Most of the tuning rules just track the sample size.
Neural nets: the number of nodes, k, for the first hidden layer (with the sigmoid threshold) should grow with n, but not too fast.
Nearest neighbors: the number of neighbors, k, should have k/n going to zero as n goes to infinity.
Single decision trees: the sample size, kn, of the smallest cell (terminal node) should have kn / log n going to infinity as n goes to infinity.
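
A small illustration of "tuning just tracks the sample size" for nearest neighbors: k is set to grow with n while k/n shrinks. The sqrt(n) rule and the simulated data are illustrative choices, not prescribed by the slides.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)

def sample(n):
    X = rng.normal(size=(n, 5))
    eta = 1.0 / (1.0 + np.exp(-2.0 * X[:, 0] + X[:, 1]))
    return X, rng.binomial(1, eta)

X_test, y_test = sample(20000)

# k grows with n (here k ~ sqrt(n), an illustrative choice), so that
# k -> infinity while k/n -> 0, the consistency condition on the slide.
for n in (200, 2000, 20000):
    X, y = sample(n)
    k = max(1, int(round(np.sqrt(n))))
    err = np.mean(KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X_test) != y_test)
    print(f"n = {n:6d}, k = {k:4d}, test error = {err:.3f}")
```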

Unlearning Machines
More on Myth #7 . . .
1. Sharper results and improved tuning require more technique, especially the Vapnik-Chervonenkis (VC) dimension.
2. VC upper and lower bounds provide optimal probability statements about the Bayes error and are known exactly for many machines.
3. The VC dimension is a measure of the flexibility of a machine, and needs to be high but not too high. Otherwise the machine will overfit: do well on the sample, but not do well on test data.

Still more on Myth #7 . . .
We ignore the VC dimension at our own peril.
It is not a theoretical swamp to be avoided by statisticians.
It is as important for practical reasons as is the bootstrap.

Unlearning Machines
Myth #8: A machine must see all the data and act as a global device in order to be good.
FACT: A Bayes consistent machine must be local, and need be only weakly global.
1. This is recent work by Zakai and Ritov, 2008.
2. This seems obvious for a nearest neighbor machine or a decision tree, but it also holds for Support Vector Machines (known to be consistent) and boosting (also consistent).
3. The technical definition of local must be made precise, but it basically means that the machine doesn't need to see data far from a test point.

Unlearning Machines
Myth #9: There must exist some unique small set of most predictive features.
FACT: Numerous simple examples show the non-uniqueness of important features.
1. Good models need not be unique. Or, it might take infinite amounts of data to detect any difference between them.
2. Biological processes are typically not unique or singular.
3. Relentless use of univariate methods over large feature sets is provably mistaken; see DGL for examples.

Unlearning Machines
Myth #10: Competing sets of important features should be ranked and combined.
FACT: Distinct lists of important features cannot always be combined while maintaining logical self-consistency.
1. Known to Condorcet, 1785 (again!).
2. Relates to the voting paradox of Arrow, 1951.
3. Carefully studied by Saari and Haunsperger, 1991, 1992, 2003.
4. It might be possible to use a kind of probabilistic ranking instead; this is a research question.

Summarizing what we have unlearned
1. Benchmarking over a set of machines on several data sets, hoping to find a grand super machine for all data, is provably mistaken.
2. It is informative to look at a set of machines on a single data set to see how they behave on that data.
3. Choosing a winner is often just unnecessary, since a committee will usually do better even with very weak machines: Random Forest and Random Jungle are excellent committee machines.

Summarizing what we have unlearned
4. A good binary rule is not the same as a good group probability estimator, nor does it have to be.
5. Machines need some tuning, but not much if there is any signal in the data.
6. Predictive feature sets can be identified, but not uniquely so, and usually not with univariate screening.

Our Academy of Higher Learning
1. The Rosetta Stone Principle: if Nature has anything to tell us, She will say the same thing in at least three different languages.
2. All information is local and collective, only rarely global or singular.
3. Validating a machine should occupy 85% of your brain power; actually running or tuning the machine should be routine.
4. Nature may be mysterious or random, but She is not malicious.
5. Committee methods give us hope in a dark universe.

Practical strategies for learning
1. Use a small set of machines to detect signal in the data.
2. Pre-balance the data using over- (under-) sampling.
3. Extensively validate the results (out of bag, permutation).
4. Have the machines declare some small set of features as most predictive.
5. Send the small feature set to a familiar method, such as logistic regression: move from black box to open book.
6. Check for error drift from the large machine to the small machine: use Agresti-Tango methods for paired outcomes.
7. Freely acknowledge that many different machines can do equally well.
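
A sketch of this workflow using scikit-learn on simulated data, as one possible reading of the list above. The balancing step, forest settings, and the choice of the top ten features are illustrative; the Agresti-Tango check for paired outcomes is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(5)

# Toy stand-in for a feature matrix X (rows = subjects) and binary outcome y.
X = rng.integers(0, 3, size=(600, 500)).astype(float)
y = rng.binomial(1, 1 / (1 + np.exp(-(X[:, 0] - X[:, 1]))))

# 2. Pre-balance by randomly down-sampling the majority class.
idx0, idx1 = np.where(y == 0)[0], np.where(y == 1)[0]
m = min(len(idx0), len(idx1))
keep = np.concatenate([rng.choice(idx0, m, replace=False),
                       rng.choice(idx1, m, replace=False)])
Xb, yb = X[keep], y[keep]

# 1 & 3. Fit a committee machine and validate with the out-of-bag error.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0).fit(Xb, yb)
print("out-of-bag accuracy:", round(rf.oob_score_, 3))

# 4. Let the machine declare a small set of most predictive features.
top = np.argsort(rf.feature_importances_)[::-1][:10]

# 5. Send the small feature set to an open-book method (logistic regression).
lr_acc = cross_val_score(LogisticRegression(max_iter=1000), Xb[:, top], yb, cv=5).mean()
print("logistic regression on top features, CV accuracy:", round(lr_acc, 3))
```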

Research Questions
1. Find solutions for merging multiple lists of features.
2. Develop methods for active network and clique detection among entangled, weakly predictive features.
3. Find good probability estimating machines; re-engineer Random Forest? logitboost? nearest neighbor?
4. Find a good risk attribution machine: how do changes in the feature outputs change the probability?
5. Find a good machine that handles missing data without imputation; see Mojirsheibani and Montazeri, 2007.
6. Research is moving forward, but it is easy to get overwhelmed.


Random Forests and Jungles
These are extremely efficient ensemble learning machines:
using many single decision trees;
sampling randomly from the feature list at each node;
using bootstrap sampling and permutation testing.
Results from theory and practice underwrite their performance as nonlinear, fully nonparametric schemes: they are Bayes consistent.
More recently they have been shown to be sparse engines of discovery: they can find a small set of active features in extremely large lists, so thousands of noisy features do not confuse the code (G. Biau et al., 2008, 2010).
Now running fully parallel on the biowulf cluster.
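
A quick illustration of the sparsity point (not the actual GWAS analysis): a forest fit to thousands of noise features plus a handful of informative ones still ranks the informative ones at the top of its importance list. All sizes and settings are invented for the demo.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(6)

n, p, signal = 1000, 2000, 5                          # 5 informative features among 2000
X = rng.integers(0, 3, size=(n, p)).astype(float)     # additive 0/1/2 coding
logit = (X[:, :signal] - 1.0).sum(axis=1)             # only the first 5 features matter
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

rf = RandomForestClassifier(n_estimators=1000, max_features="sqrt",
                            n_jobs=-1, random_state=0).fit(X, y)

# The informative features should dominate the importance ranking even
# though they are buried among thousands of noise features.
ranked = np.argsort(rf.feature_importances_)[::-1]
print("top 10 features by importance:", ranked[:10])
```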

[Photo slide: Jens, Jim, Dede]

Random Jungle and Three Bipolar Disorder GWAS data sets
Our trusted guide in this jungle: Jens R. Wendland (NIMH)

The newest learning machine
User-friendly? Who is the teacher? Who the student?

Initial goals
Use an ultra-speedy, robust, Bayes consistent learning machine to make predictions of Bipolar Disorder on each of several case/control GWAS data sets.
The machine of choice is Random Jungle, running under openMPI on the biowulf cluster.
Find a tiny set of most predictive features (SNPs) from the initial huge list in each GWAS.

Next Goals
Compare lists of top features across the available GWAS:
compare error rates using the three top lists;
find the most recurrent features across the data sets;
merge lists?
Uncover subsets and strata in the data sets using Random Jungle and MDS proximity plots.

Quality control, balancing cases/controls, and LD-pruning

Starting data sets (founders only; observed & imputed, autosomal markers only; McMahon et al., Nat Genet, 2010):
GAIN: 2,179 cases, 1,434 controls, 2.12m SNPs
Germany (Bonn): 645 cases, 1,310 controls, 2.14m SNPs
WTCCC: 1,856 cases, 2,945 controls, 2.09m SNPs

Balance the number of cases and controls in each data set (dropping randomly):
GAIN: 1,434 cases, 1,434 controls, 2,077,279 SNPs
Germany: 645 cases, 645 controls, 2,090,183 SNPs
WTCCC: 1,856 cases, 1,856 controls, 2,036,046 SNPs

First-pass QC (leaving 1,944,441 SNPs in each data set):
HWE exact test P ≥ .0001 in controls only
Per-SNP missingness ≤ .02 in cases and controls separately
MAF ≥ 0.05

First-pass-QC'ed SNP overlap across all three data sets: 1,889,476 SNPs

Second-pass QC (QC2): successfully updated 1,909,452 SNPs to hg18 and removed 19,976 SNPs within 3,000 CNV regions known to be polymorphic in 180 CEU individuals (DGV).

LD pruning in the largest sample only (WTCCC, 1,856 cases, 1,856 controls), using pairwise r² < 0.5: 245,592 SNPs

Extract the set of LD-pruned SNPs from all three data sets; recode with additive coding (0, 1, 2) using the same reference allele; exclude sex.
Result: 245,592 SNPs each for GAIN (1,434 cases, 1,434 controls), Germany (645 cases, 645 controls), and WTCCC (1,856 cases, 1,856 controls).
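
For concreteness, a rough Python sketch of the per-SNP filters (in practice this kind of QC is normally done with dedicated GWAS software, which the slide does not name). A chi-squared Hardy-Weinberg test stands in for the exact test used above, and all function and variable names are hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2

def snp_qc(genotypes, is_control, maf_min=0.05, miss_max=0.02, hwe_p_min=1e-4):
    """genotypes: DataFrame of 0/1/2 additive codes (NaN = missing),
    rows = subjects, columns = SNPs; is_control: boolean array per subject.
    Returns the list of SNPs passing the three filters."""
    keep = []
    for snp in genotypes.columns:
        g = genotypes[snp]
        # Per-SNP missingness, checked in cases and controls separately.
        if g[is_control].isna().mean() > miss_max or g[~is_control].isna().mean() > miss_max:
            continue
        # Minor allele frequency over all subjects.
        p = g.mean(skipna=True) / 2.0
        if min(p, 1 - p) < maf_min:
            continue
        # Hardy-Weinberg test in controls only (chi-squared approximation).
        gc = g[is_control].dropna()
        p_c = gc.mean() / 2.0
        exp = np.array([(1 - p_c) ** 2, 2 * p_c * (1 - p_c), p_c ** 2]) * len(gc)
        obs = np.array([(gc == k).sum() for k in (0, 1, 2)])
        stat = ((obs - exp) ** 2 / np.maximum(exp, 1e-12)).sum()
        if chi2.sf(stat, df=1) < hwe_p_min:
            continue
        keep.append(snp)
    return keep
```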

GAIN: 1,434 cases, 1,434 controls, 245,592 SNPs
German: 645 cases, 645 controls, 245,592 SNPs
WTCCC: 1,856 cases, 1,856 controls, 245,592 SNPs

Random Jungle details
OpenMPI RJ run on biowulf, 80 cores
mtry = ½ of the SNPs
ntree = 5,000
terminal node size = 40
Iterative run with backward elimination, discarding the bottom third of SNPs in each iteration.
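
Random Jungle is its own MPI program, so the following is only an illustrative Python analogue of the iterative backward-elimination loop described above, using scikit-learn; mtry and the terminal node size are mapped onto max_features and min_samples_leaf, and the stopping rule is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def backward_elimination(X, y, n_trees=500, drop_frac=1/3, min_features=100):
    """Iteratively fit a forest, rank features by importance, discard the
    bottom third, and track the out-of-bag error at each step."""
    features = np.arange(X.shape[1])
    history = []
    while len(features) > min_features:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_features=0.5,            # analogue of mtry = 1/2 of the SNPs
            min_samples_leaf=40,         # analogue of terminal node size 40
            oob_score=True, n_jobs=-1, random_state=0,
        ).fit(X[:, features], y)
        history.append((len(features), 1.0 - rf.oob_score_))
        order = np.argsort(rf.feature_importances_)          # ascending importance
        features = features[order[int(len(features) * drop_frac):]]
    return history   # list of (number of features kept, OOB error)
```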

Error Analysis
Bootstrap analysis and Monte Carlo sampling for the three bipolar disorder datasets using Random Jungle.
For each of the three collections, we generated new datasets using the set of markers resulting in the lowest error rate from our initial Random Jungle analysis with backward elimination (N = 840 SNPs for GAIN, N = 1,261 SNPs for German and WTCCC). We then re-ran Random Jungle 1,000 times to estimate the error rate sampling distribution (black bars) and an additional 1,000 times with permuted affection status to estimate the reference distribution (grey bars). The graphs show approximately normally distributed error rates and good separation.
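
A minimal Python analogue (not the actual Random Jungle runs) of the two distributions described above: repeated refits give a sampling distribution of the out-of-bag error, and refits with permuted affection status give the reference distribution.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def error_distributions(X, y, n_runs=1000):
    """Repeatedly refit a forest to get a sampling distribution of the
    out-of-bag error, plus a reference distribution with permuted labels."""
    rng = np.random.default_rng(0)
    observed, reference = [], []
    for i in range(n_runs):
        rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                                    n_jobs=-1, random_state=i)
        observed.append(1.0 - rf.fit(X, y).oob_score_)
        y_perm = rng.permutation(y)                  # permuted affection status
        reference.append(1.0 - rf.fit(X, y_perm).oob_score_)
    return np.array(observed), np.array(reference)
```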

No overlap of best-predicting markers across all three data sets.
50% error rate when cross-predicting with the best set of SNPs from one dataset to the other two datasets.

MDS plots of RJ-calculated subject-subject proximities, using the SNP sets with the lowest error rates.

Marginal effects: comparing chi-squared values for the GAIN dataset with top RJ importance scores
Pearson correlation between marginal outcomes (RJ importance vs. CHISQ):
t = 11.2094, df = 838, p-value < 2.2e-16
95% confidence interval: [0.3008, 0.4185]
estimated correlation = 0.3611
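
A sketch, on hypothetical 0/1/2 genotype data, of how such a comparison can be computed: per-SNP chi-squared statistics from genotype-by-status tables, correlated with forest importance scores. The contingency-table layout and forest settings are illustrative and do not reproduce the reported numbers.

```python
import numpy as np
from scipy.stats import chi2_contingency, pearsonr
from sklearn.ensemble import RandomForestClassifier

def marginal_vs_importance(X, y):
    """Per-SNP chi-squared from a genotype-by-case/control contingency
    table, compared with Random Forest importance scores."""
    chisq = []
    for j in range(X.shape[1]):
        table = np.zeros((3, 2))
        for g in (0, 1, 2):
            for c in (0, 1):
                table[g, c] = np.sum((X[:, j] == g) & (y == c))
        table = table[table.sum(axis=1) > 0]       # drop empty genotype rows
        chisq.append(chi2_contingency(table)[0])
    rf = RandomForestClassifier(n_estimators=500, n_jobs=-1,
                                random_state=0).fit(X, y)
    r, p = pearsonr(rf.feature_importances_, np.array(chisq))
    return r, p
```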

Next Challenge
Find predictive and recurrent markers and risk factors common to all three data sets, where risk factors can be SNPs, genomic loci, genes, pathways, or other clinical data.
Our central dogma is that such overlap does exist and that it can explain at least some of the phenotypic variability across data sets.
But this may be simply mistaken: the data sets need not be equivalent, and probably aren't.

Acknowledgements
Daniel Schwarz (University of Lübeck, Germany)
Deanna Greenstein (NIMH, NIH)
Gérard Biau (Université Pierre et Marie Curie, Paris)
Luc Devroye (McGill University, Montreal)
Steven Stanhope (University of Delaware)
Joan Bailey-Wilson (NHGRI, NIH)
Majid Mojirsheibani (Carleton University, Ottawa)
Andreas Ziegler (University of Lübeck, Germany)
Inke König (University of Lübeck, Germany)
Lawrence Brody (NHGRI, NIH)
Abhijit Dasgupta (NIAMS, NIH)
and with special thanks to
CR Rao (Penn State University; University of Hyderabad, India)
