Statistical Modelling

Transcription

Statistical Modelling
Anthony Davison and Jon Forster
© 2010
http://stat.epfl.ch, http://www.s3ri.soton.ac.uk

Contents

1. Model Selection
   Overview
   Basic Ideas: Why model? · Criteria for model selection · Motivation · Setting · Logistic regression · Nodal involvement · Log likelihood · Wrong model · Out-of-sample prediction · Information criteria · Nodal involvement · Theoretical aspects · Properties of AIC, NIC, BIC
   Linear Model: Variable selection · Stepwise methods · Nuclear power station data · Stepwise Methods: Comments · Prediction error · Example · Cross-validation · Other criteria · Experiment
   Sparse Variable Selection: Motivation · Desiderata · Example: Lasso · Soft thresholding · Example · Penalties · Threshold functions · Properties of penalties · Oracle
   Bayesian Inference: Thomas Bayes (1702–1761) · Bayesian inference · Encompassing model · Inference · Lindley’s paradox · Model averaging · Cement data · DIC · MDL
   Bayesian Variable Selection: Variable selection · Example: NMR data · Wavelets · Posterior · Shrinkage · Empirical Bayes · Example: NMR data · Comments

2. Beyond the Generalised Linear Model
   Overview
   Generalised Linear Models: GLM recap · GLM failure
   Overdispersion: Example 1 · Quasi-likelihood · Reasons · Direct models · Random effects
   Dependence: Example 1 revisited · Reasons · Random effects · Marginal models · Clustered data · Example 2: Rat growth
   Random Effects and Mixed Models: Linear mixed models · Discussion · LMM fitting · REML · Estimating random effects · Bayesian LMMs · Example 2 revisited · GLMMs · GLMM fitting · Bayesian GLMMs · Example 1 revisited
   Conditional independence and graphical representations: Conditional independence · Graphs · DAGs · Undirected graphs

3. Missing Data and Latent Variables
   Overview
   Missing Data: Examples · Introduction · Issues · Models · Ignorability · Inference · Nonignorable models
   Latent Variables: Basic idea · Galaxy data · Other latent variable models
   EM Algorithm: EM algorithm · Toy example · Example: Mixture model · Example: Galaxy data · Exponential family · Comments

1. Model Selection    slide 2

Overview
1. Basic ideas
2. Linear model
3. Sparse variable selection
4. Bayesian inference
5. Bayesian variable selection

APTS: Statistical Modelling    April 2010 – slide 3

Basic Ideas    slide 4

Why model?
George E. P. Box (1919–): All models are wrong, but some models are useful.
• Some reasons we construct models:
  – to simplify reality (efficient representation);
  – to gain understanding;
  – to compare scientific, economic, . . . theories;
  – to predict future events/data;
  – to control a process.
• We (statisticians!) rarely believe in our models, but regard them as temporary constructs subject to improvement. Often we have several and must decide which is preferable, if any.

APTS: Statistical Modelling    April 2010 – slide 5

Criteria for model selection
• Substantive knowledge, from prior studies, theoretical arguments, dimensional or other general considerations (often qualitative)
• Sensitivity to failure of assumptions (prefer models that are robustly valid)
• Quality of fit—residuals, graphical assessment (informal), or goodness-of-fit tests (formal)
• Prior knowledge in Bayesian sense (quantitative)
• Generalisability of conclusions and/or predictions: same/similar models give good fit for many different datasets
• . . . but often we have just one dataset . . .

APTS: Statistical Modelling    April 2010 – slide 6

Motivation
Even after applying these criteria (but also before!) we may compare many models:
• linear regression with p covariates: there are 2^p possible combinations of covariates (each in/out), before allowing for transformations, etc.—if p = 20 then we have a problem;
• choice of bandwidth h > 0 in smoothing problems;
• the number of different clusterings of n individuals is a Bell number (starting from n = 1): 1, 2, 5, 15, 52, 203, 877, 4140, 21147, 115975, . . . ;
• we may want to assess which among 5 × 10^5 SNPs on the genome may influence reaction to a new drug;
• . . .
For reasons of economy we seek ‘simple’ models.

APTS: Statistical Modelling    April 2010 – slide 7

Albert Einstein (1879–1955)
‘Everything should be made as simple as possible, but no simpler.’

APTS: Statistical Modelling    April 2010 – slide 8

William of Occam (?1288–?1348)
Occam’s razor: Entia non sunt multiplicanda sine necessitate: entities should not be multiplied beyond necessity.

APTS: Statistical Modelling    April 2010 – slide 9

Setting
• To focus and simplify discussion we will consider parametric models, but the ideas generalise to semi-parametric and non-parametric settings
• We shall take generalised linear models (GLMs) as example of moderately complex parametric models:
  – Normal linear model has three key aspects:
    · structure for covariates: linear predictor η = x^T β;
    · response distribution: y ~ N(µ, σ²); and
    · relation η = µ between µ = E(y) and η.
  – GLM extends last two to:
    · y has density
        f(y; θ, φ) = exp{ (yθ − b(θ))/φ + c(y; φ) },
      where θ depends on η, and the dispersion parameter φ is often known; and
    · η = g(µ), where g is a monotone link function.

APTS: Statistical Modelling    April 2010 – slide 10
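
A minimal Python sketch (not part of the original slides): the Poisson distribution written in the exponential-family form above, with θ = log µ, b(θ) = exp(θ), φ = 1 and c(y; φ) = −log y!, checked against scipy's Poisson log-pmf.

```python
# Sketch: Poisson log-density via the GLM exponential-family form
# f(y; theta, phi) = exp[{y*theta - b(theta)}/phi + c(y; phi)].
import numpy as np
from scipy.special import gammaln
from scipy.stats import poisson

def poisson_expfam_logpdf(y, mu):
    theta, phi = np.log(mu), 1.0   # canonical parameter theta = log(mu), dispersion phi = 1
    b = np.exp(theta)              # cumulant function b(theta) = exp(theta)
    c = -gammaln(y + 1)            # c(y; phi) = -log(y!)
    return (y * theta - b) / phi + c

y, mu = 3, 2.5
print(poisson_expfam_logpdf(y, mu), poisson.logpmf(y, mu))  # the two values agree
```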

Logistic regression
• Commonest choice of link function for binary responses:
    Pr(Y = 1) = π = exp(x^T β) / {1 + exp(x^T β)},    Pr(Y = 0) = 1 / {1 + exp(x^T β)},
  giving linear model for log odds of ‘success’,
    log{ Pr(Y = 1) / Pr(Y = 0) } = log{ π / (1 − π) } = x^T β.
• Log likelihood for β based on independent responses y_1, . . . , y_n with covariate vectors x_1, . . . , x_n is
    ℓ(β) = Σ_{j=1}^n y_j x_j^T β − Σ_{j=1}^n log{ 1 + exp(x_j^T β) }
• Good fit gives small deviance D = 2{ ℓ(β̃) − ℓ(β̂) }, where β̂ is model fit MLE and β̃ is unrestricted MLE.

APTS: Statistical Modelling    April 2010 – slide 11

Nodal involvement data
Table 1: Data on nodal involvement: 53 patients with prostate cancer have nodal involvement (r), with five binary covariates (age, st, gr, xr, ac). [The individual patient records in the table could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 12
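
A short Python sketch (not from the original slides; it uses simulated 0/1 data rather than the actual nodal involvement records): the logistic log likelihood ℓ(β) above maximised numerically, and the deviance D = 2{ℓ(β̃) − ℓ(β̂)}, which for binary data reduces to −2ℓ(β̂) because the saturated (unrestricted) log likelihood is zero.

```python
# Sketch: logistic regression log likelihood and deviance on hypothetical data.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n = 53
X = np.column_stack([np.ones(n), rng.integers(0, 2, (n, 2))])  # intercept + two binary covariates
y = rng.integers(0, 2, n)                                      # hypothetical binary responses

def negloglik(beta):
    eta = X @ beta
    return -(y @ eta - np.logaddexp(0.0, eta).sum())           # -l(beta)

beta_hat = minimize(negloglik, np.zeros(X.shape[1])).x
l_hat = -negloglik(beta_hat)
deviance = 2 * (0.0 - l_hat)    # saturated log likelihood is 0 for 0/1 responses
print(beta_hat, deviance)
```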

Nodal involvement deviances
Deviances D for the 32 logistic regression models for the nodal involvement data; ‘+’ denotes a term (age, st, gr, xr, ac) included in the model. [The table of deviances and the accompanying plot of deviance against number of parameters could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 13

• Adding terms
  – always increases the log likelihood ℓ̂ and so reduces D,
  – increases the number of parameters,
  so taking the model with highest ℓ̂ (lowest D) would give the full model
• We need to trade off quality of fit (measured by D) and model complexity (number of parameters)

APTS: Statistical Modelling    April 2010 – slide 14

Log likelihood
• Given (unknown) true model g(y), and candidate model f(y; θ), Jensen’s inequality implies that
    ∫ log g(y) g(y) dy ≥ ∫ log f(y; θ) g(y) dy,    (1)
  with equality if and only if f(y; θ) = g(y).
• If θ_g is the value of θ that maximizes the expected log likelihood on the right of (1), then it is natural to choose the candidate model that maximises
    ℓ̄(θ̂) = n^{−1} Σ_{j=1}^n log f(y_j; θ̂),
  which should be an estimate of ∫ log f(y; θ̂) g(y) dy. However as ℓ̄(θ̂) ≥ ℓ̄(θ_g), by definition of θ̂, this estimate is biased upwards.
• We need to correct for the bias, but in order to do so, need to understand the properties of likelihood estimators when the assumed model f is not the true model g.

APTS: Statistical Modelling    April 2010 – slide 15

Wrong model
Suppose the true model is g, that is, Y_1, . . . , Y_n ~iid g, but we assume that Y_1, . . . , Y_n ~iid f(y; θ). The log likelihood ℓ(θ) will be maximised at θ̂, and
    ℓ̄(θ̂) = n^{−1} ℓ(θ̂) → ∫ log f(y; θ_g) g(y) dy  almost surely as n → ∞,
where θ_g minimizes the Kullback–Leibler discrepancy
    KL(f_θ, g) = ∫ log{ g(y) / f(y; θ) } g(y) dy.
θ_g gives the density f(y; θ_g) closest to g in this sense, and θ̂ is determined by the finite-sample version of ∂KL(f_θ, g)/∂θ, i.e.
    0 = n^{−1} Σ_{j=1}^n ∂ log f(y_j; θ̂)/∂θ.

APTS: Statistical Modelling    April 2010 – slide 16
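
A small simulation sketch (an assumed setting, not from the original slides): the data are really Gamma(2, 1) but we fit an exponential model f(y; λ) = λ exp(−λy), so f is wrong. The in-sample average log likelihood ℓ̄(θ̂) systematically overestimates the out-of-sample quantity ∫ log f(y; θ̂) g(y) dy, which is the upward bias described above.

```python
# Sketch: upward bias of the in-sample average log likelihood under a wrong model.
import numpy as np

rng = np.random.default_rng(2)
n, reps = 50, 2000
gaps = []
for _ in range(reps):
    y = rng.gamma(shape=2.0, scale=1.0, size=n)        # sample from the true model g
    lam_hat = 1.0 / y.mean()                           # exponential MLE (wrong model f)
    insample = np.log(lam_hat) - lam_hat * y.mean()    # n^{-1} l(theta_hat)
    ynew = rng.gamma(shape=2.0, scale=1.0, size=10**5) # fresh draws from g
    outsample = np.log(lam_hat) - lam_hat * ynew.mean()   # Monte Carlo estimate of the integral
    gaps.append(insample - outsample)
print(np.mean(gaps))   # positive on average: roughly tr{I(theta_g)^{-1} K(theta_g)}/n (see slide 19)
```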

Wrong model II
Theorem 1. Suppose the true model is g, that is, Y_1, . . . , Y_n ~iid g, but we assume that Y_1, . . . , Y_n ~iid f(y; θ). Then under mild regularity conditions the maximum likelihood estimator θ̂ satisfies
    θ̂ ·∼ N_p( θ_g, I(θ_g)^{−1} K(θ_g) I(θ_g)^{−1} ),    (2)
where f_{θ_g} is the density minimising the Kullback–Leibler discrepancy between f_θ and g, I is the Fisher information for f, and K is the variance of the score statistic. The likelihood ratio statistic
    W(θ_g) = 2{ ℓ(θ̂) − ℓ(θ_g) } ·∼ Σ_{r=1}^p λ_r V_r,
where V_1, . . . , V_p ~iid χ²_1, and the λ_r are eigenvalues of K(θ_g)^{1/2} I(θ_g)^{−1} K(θ_g)^{1/2}. Thus
    E{W(θ_g)} = tr{ I(θ_g)^{−1} K(θ_g) }.
Under the correct model, θ_g is the ‘true’ value of θ, K(θ) = I(θ), λ_1 = · · · = λ_p = 1, and we recover the usual results.

APTS: Statistical Modelling    April 2010 – slide 17

Out-of-sample prediction
• We need to fix two problems with using ℓ(θ̂) to choose the best candidate model:
  – upward bias, as ℓ(θ̂) ≥ ℓ(θ_g) because θ̂ is based on Y_1, . . . , Y_n;
  – no penalisation if the dimension of θ increases.
• If we had another independent sample Y_1⁺, . . . , Y_n⁺ ~iid g and computed
    ℓ̄⁺(θ̂) = n^{−1} Σ_{j=1}^n log f(Y_j⁺; θ̂),
  then both problems disappear, suggesting that we choose the candidate model that maximises
    E_g[ E_g⁺{ ℓ̄⁺(θ̂) } ],
  where the inner expectation is over the distribution of the Y_j⁺, and the outer expectation is over the distribution of θ̂.

APTS: Statistical Modelling    April 2010 – slide 18
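
A sketch (continuing the Gamma/exponential illustration above, not from the original slides): estimating the sandwich variance I(θ_g)^{−1} K(θ_g) I(θ_g)^{−1} in Theorem 1 from the data, using the observed information Ĵ and the sum of squared scores K̂ defined on slide 19, and comparing it with the naive model-based variance.

```python
# Sketch: sandwich (model-robust) versus naive variance for the exponential MLE
# when the data are really Gamma(2, 1).
import numpy as np

rng = np.random.default_rng(3)
n = 200
y = rng.gamma(shape=2.0, scale=1.0, size=n)
lam_hat = 1.0 / y.mean()            # MLE under the (wrong) exponential model

score = 1.0 / lam_hat - y           # per-observation score d log f(y_j; lam)/d lam at lam_hat
J_hat = n / lam_hat**2              # observed information, summed over observations
K_hat = np.sum(score**2)            # estimate of the score variance, summed over observations

sandwich_var = K_hat / J_hat**2     # J^{-1} K J^{-1}
naive_var = 1.0 / J_hat             # variance if the exponential model were true
print(sandwich_var, naive_var)      # the sandwich value is about half the naive one here
```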

Information criteria
• Previous results on wrong model give
    E_g[ E_g⁺{ ℓ̄⁺(θ̂) } ] ≈ ∫ log f(y; θ_g) g(y) dy − tr{ I(θ_g)^{−1} K(θ_g) } / (2n),
  where the second term is a penalty that depends on the model dimension.
• We want to estimate this based on Y_1, . . . , Y_n only, and get
    E_g{ ℓ̄(θ̂) } ≈ ∫ log f(y; θ_g) g(y) dy + tr{ I(θ_g)^{−1} K(θ_g) } / (2n).
• To remove the bias, we aim to maximise
    ℓ̄(θ̂) − tr( Ĵ^{−1} K̂ ) / n,
  where
    K̂ = Σ_{j=1}^n ∂ log f(y_j; θ̂)/∂θ · ∂ log f(y_j; θ̂)/∂θ^T,    Ĵ = − Σ_{j=1}^n ∂² log f(y_j; θ̂)/∂θ ∂θ^T;
  the latter is just the observed information matrix.

APTS: Statistical Modelling    April 2010 – slide 19

Information criteria
• Let p = dim(θ) be the number of parameters for a model, and ℓ̂ the corresponding maximised log likelihood.
• For historical reasons we choose models that minimise similar criteria
  – 2(p − ℓ̂)  (AIC—Akaike Information Criterion)
  – 2{ tr(Ĵ^{−1} K̂) − ℓ̂ }  (NIC—Network Information Criterion)
  – 2( ½ p log n − ℓ̂ )  (BIC—Bayes Information Criterion)
  – AICc, AICu, DIC, EIC, FIC, GIC, SIC, TIC, . . .
  – Mallows Cp = RSS/s² + 2p − n, commonly used in regression problems, where RSS is residual sum of squares for candidate model, and s² is an estimate of the error variance σ².

APTS: Statistical Modelling    April 2010 – slide 20
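
A sketch (simulated data, not the course example): AIC = 2(p − ℓ̂) and BIC = 2(½ p log n − ℓ̂) computed for a sequence of nested normal linear models, using the profile log likelihood ℓ̂ = −(n/2){log(2π σ̂²) + 1} with σ̂² = RSS/n.

```python
# Sketch: AIC and BIC for nested normal linear models fitted to simulated data.
import numpy as np

rng = np.random.default_rng(4)
n = 50
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 7))])
beta_true = np.r_[1.0, 2.0, -1.5, 1.0, np.zeros(4)]      # only four nonzero coefficients
y = X_full @ beta_true + rng.normal(size=n)

for p in range(1, X_full.shape[1] + 1):                  # model uses the first p columns
    X = X_full[:, :p]
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
    llmax = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1) # profile log likelihood
    npar = p + 1                                         # regression coefficients plus sigma^2
    aic = 2 * (npar - llmax)
    bic = 2 * (0.5 * npar * np.log(n) - llmax)
    print(p, round(aic, 1), round(bic, 1))               # both usually dip at p = 4
```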

Nodal involvement data
AIC and BIC for the 2^5 = 32 models for the binary logistic regression model fitted to the nodal involvement data. Both criteria pick out the same model, with the three covariates st, xr, and ac, which has deviance D = 19.64. Note the sharper increase of BIC after the minimum. [The plots of AIC and BIC against number of parameters could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 21

Theoretical aspects
• We may suppose that the true underlying model is of infinite dimension, and that by choosing among our candidate models we hope to get as close as possible to this ideal model, using the data available.
• If so, we need some measure of distance between a candidate and the true model, and we aim to minimise this distance. A model selection procedure that selects the candidate closest to the truth for large n is called asymptotically efficient.
• An alternative is to suppose that the true model is among the candidate models. If so, then a model selection procedure that selects the true model with probability tending to one as n → ∞ is called consistent.

APTS: Statistical Modelling    April 2010 – slide 22

Properties of AIC, NIC, BIC
• We seek to find the correct model by minimising IC = c(n, p) − 2ℓ̂, where the penalty c(n, p) depends on sample size n and model dimension p
• Crucial aspect is behaviour of differences of IC. We obtain IC for the true model, and IC⁺ for a model with one more parameter. Then
    Pr(IC⁺ < IC) = Pr{ c(n, p+1) − 2ℓ̂⁺ < c(n, p) − 2ℓ̂ } = Pr{ 2(ℓ̂⁺ − ℓ̂) > c(n, p+1) − c(n, p) },
  and in large samples
  – for AIC, c(n, p+1) − c(n, p) = 2
  – for NIC, c(n, p+1) − c(n, p) ≈ 2
  – for BIC, c(n, p+1) − c(n, p) = log n
• In a regular case 2(ℓ̂⁺ − ℓ̂) ·∼ χ²_1, so as n → ∞,
    Pr(IC⁺ < IC) → 0.16 for AIC and NIC,    Pr(IC⁺ < IC) → 0 for BIC.
  Thus AIC and NIC have non-zero probability of over-fitting, even in very large samples, but BIC does not.

APTS: Statistical Modelling    April 2010 – slide 23

Linear Model    slide 24

Variable selection
• Consider normal linear model
    Y_{n×1} = X†_{n×p} β_{p×1} + ε_{n×1},    ε ∼ N_n(0, σ² I_n),
  where design matrix X† has full rank p ≤ n and columns x_r, for r ∈ X = {1, . . . , p}. Subsets S of X correspond to subsets of columns.
• Terminology
  – the true model corresponds to subset T = {r : β_r ≠ 0}, and |T| = q ≤ p;
  – a correct model contains T but has other columns also: the corresponding subset S satisfies T ⊂ S ⊆ X and T ≠ S;
  – a wrong model has subset S lacking some x_r for which β_r ≠ 0, and so T ⊄ S.
• Aim to identify T.
• If we choose a wrong model, have bias; if we choose a correct model, increase variance—seek to balance these.

APTS: Statistical Modelling    April 2010 – slide 25
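
A one-line check (added to this transcription) of the limiting over-fitting probability quoted on slide 23: under the χ²_1 approximation, Pr(IC⁺ < IC) → Pr(χ²_1 > 2) for AIC.

```python
# Sketch: the limiting probability that AIC prefers a model with one extra useless parameter.
from scipy.stats import chi2
print(chi2.sf(2, df=1))   # about 0.157, the "0.16" quoted on slide 23
```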

Stepwise methods
• Forward selection: starting from model with constant only,
  1. add each remaining term separately to the current model;
  2. if none of these terms is significant, stop; otherwise
  3. update the current model to include the most significant new term; go to 1
• Backward elimination: starting from model with all terms,
  1. if all terms are significant, stop; otherwise
  2. update current model by dropping the term with the smallest F statistic; go to 1
• Stepwise: starting from an arbitrary model,
  1. consider 3 options—add a term, delete a term, swap a term in the model for one not in the model;
  2. if model unchanged, stop; otherwise go to 1
(A simple AIC-driven version of forward selection is sketched after the data below.)

APTS: Statistical Modelling    April 2010 – slide 26

Nuclear power station data
[The table of the nuclear power station data—costs for 32 stations together with covariates including cap, pr, ne, ct, bw, cum.n and pt—could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 27
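
A sketch of forward selection (a simplified version on simulated data, not the course code, using AIC rather than F tests as the objective in the spirit of the comments on slide 29): at each step add the candidate column that most reduces AIC, and stop when no addition improves it.

```python
# Sketch: AIC-driven forward selection for a normal linear model.
import numpy as np

def aic_linear(X, y):
    n = len(y)
    rss = np.sum((y - X @ np.linalg.lstsq(X, y, rcond=None)[0])**2)
    return n * np.log(rss / n) + 2 * X.shape[1]          # AIC up to an additive constant

def forward_select(X_all, y):
    selected = [0]                                       # column 0 is the constant
    remaining = list(range(1, X_all.shape[1]))
    best = aic_linear(X_all[:, selected], y)
    while remaining:
        aic_new, r = min((aic_linear(X_all[:, selected + [r]], y), r) for r in remaining)
        if aic_new >= best:
            break                                        # no term improves AIC: stop
        best = aic_new
        selected.append(r)
        remaining.remove(r)
    return selected

rng = np.random.default_rng(5)
n = 60
X_all = np.column_stack([np.ones(n), rng.normal(size=(n, 9))])
y = X_all[:, [1, 2, 3]] @ np.array([2.0, -1.0, 1.5]) + rng.normal(size=n)
print(forward_select(X_all, y))    # typically the constant plus columns 1, 2, 3
```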

Nuclear power station data
[The table of estimates (standard errors) and t statistics for the full model and for the models chosen by backward elimination and forward selection could not be recovered from the transcription; the residual standard errors (degrees of freedom) are 0.164 (21) for the full model, 0.159 (25) after backward elimination and 0.195 (28) after forward selection.]
Backward selection chooses a model with seven covariates; this model is also chosen by minimising AIC.

APTS: Statistical Modelling    April 2010 – slide 28

Stepwise Methods: Comments
• Systematic search minimising AIC or similar over all possible models is preferable—not always feasible.
• Stepwise methods can fit models to purely random data—main problem is no objective function.
• Sometimes used by replacing F significance points by (arbitrary!) numbers, e.g. F = 4.
• Can be improved by comparing AIC for different models at each step—uses AIC as objective function, but no systematic search.

APTS: Statistical Modelling    April 2010 – slide 29

Prediction error
• To identify T, we fit candidate model
    Y = Xβ + ε,
  where columns of X are a subset S of those of X†.
• Fitted value is
    Xβ̂ = X{ (X^T X)^{−1} X^T Y } = HY = H(µ + ε) = Hµ + Hε,
  where H = X(X^T X)^{−1} X^T is the hat matrix and Hµ = µ if the model is correct.
• Following reasoning for AIC, suppose we also have independent dataset Y⁺ from the true model, so Y⁺ = µ + ε⁺.
• Apart from constants, previous measure of prediction error is
    ∆(X) = n^{−1} E[ E⁺{ (Y⁺ − Xβ̂)^T (Y⁺ − Xβ̂) } ],
  with expectations over both Y and Y⁺.

APTS: Statistical Modelling    April 2010 – slide 30

Prediction error II
• Can show that
    ∆(X) = n^{−1} µ^T (I − H) µ + (1 + p/n) σ²,  wrong model,
    ∆(X) = (1 + q/n) σ²,                         true model,
    ∆(X) = (1 + p/n) σ²,                         correct model;    (6)
  recall that q ≤ p.
• Bias: n^{−1} µ^T (I − H) µ > 0 unless model is correct, and is reduced by including useful terms
• Variance: (1 + p/n) σ² is increased by including useless terms
• Ideal would be to choose covariates X to minimise ∆(X): impossible—depends on unknowns µ, σ². Must estimate ∆(X)

APTS: Statistical Modelling    April 2010 – slide 31
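
A simulation sketch (a check on simulated data, not from the original slides) of the decomposition (6): the prediction error ∆(X) is estimated by Monte Carlo for a wrong model and for a correct model, and compared with the bias-plus-variance formula.

```python
# Sketch: Monte Carlo estimate of Delta(X) versus the formula in (6).
import numpy as np

rng = np.random.default_rng(6)
n, sigma2 = 40, 1.0
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 9))])
beta = np.r_[1.0, 2.0, -1.0, 1.5, np.zeros(6)]            # q = 4 nonzero coefficients
mu = X_full @ beta

def delta_mc(X, reps=4000):
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    total = 0.0
    for _ in range(reps):
        y = mu + rng.normal(scale=np.sqrt(sigma2), size=n)       # training data
        y_plus = mu + rng.normal(scale=np.sqrt(sigma2), size=n)  # independent replicate
        resid = y_plus - H @ y                                   # Y+ - X beta_hat
        total += resid @ resid / n
    return total / reps

def delta_formula(X):
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    p = X.shape[1]
    return mu @ (np.eye(n) - H) @ mu / n + (1 + p / n) * sigma2

for cols in ([0, 1, 2], [0, 1, 2, 3, 4, 5]):     # a wrong model, then a correct model
    X = X_full[:, cols]
    print(cols, round(delta_mc(X), 3), round(delta_formula(X), 3))
```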

Example
∆(X) as a function of the number of included variables p for data with n = 20, q = 6, σ² = 1. [The plot could not be recovered from the transcription.] The minimum is at p = q = 6:
• there is a sharp decrease in bias as useful covariates are added;
• there is a slow increase with variance as the number of variables p increases.

APTS: Statistical Modelling    April 2010 – slide 32

Cross-validation
• If n is large, can split data into two parts (X′, y′) and (X*, y*), say, and use one part to estimate the model, and the other to compute the prediction error; then choose the model that minimises
    ∆̂_CV = n*^{−1} (y* − X* β̂′)^T (y* − X* β̂′) = n*^{−1} Σ_{j=1}^{n*} (y*_j − x*_j^T β̂′)².
• Usually dataset is too small for this; use leave-one-out cross-validation sum of squares
    CV = n ∆̂_CV = Σ_{j=1}^n (y_j − x_j^T β̂_{−j})²,
  where β̂_{−j} is the estimate computed without (x_j, y_j).
• Seems to require n fits of the model, but in fact
    CV = Σ_{j=1}^n (y_j − x_j^T β̂)² / (1 − h_jj)²,
  where h_11, . . . , h_nn are the diagonal elements of H, and so can be obtained from one fit.

APTS: Statistical Modelling    April 2010 – slide 33
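
A quick numerical check (simulated data, added to this transcription) of the leave-one-out identity above: refitting the model n times gives the same CV sum of squares as the single-fit formula Σ_j (y_j − x_j^T β̂)²/(1 − h_jj)².

```python
# Sketch: brute-force leave-one-out CV equals the one-fit hat-matrix shortcut.
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
H = X @ np.linalg.inv(X.T @ X) @ X.T
shortcut = np.sum((y - X @ beta_hat)**2 / (1 - np.diag(H))**2)

brute = 0.0
for j in range(n):                              # n separate fits, each leaving out one case
    keep = np.arange(n) != j
    beta_j = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    brute += (y[j] - X[j] @ beta_j)**2
print(shortcut, brute)                          # equal up to rounding error
```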

Cross-validation II
• Simpler (more stable?) version uses generalised cross-validation sum of squares
    GCV = Σ_{j=1}^n (y_j − x_j^T β̂)² / {1 − tr(H)/n}².
• Can show that
    E(GCV) ≈ µ^T (I − H) µ / (1 − p/n)² + n σ² / (1 − p/n) ≈ n ∆(X),    (7)
  so try and minimise GCV or CV.
• Many variants of cross-validation exist. Typically find that model chosen based on CV is somewhat unstable, and that GCV or k-fold cross-validation works better. Standard strategy is to split data into 10 roughly equal parts, predict for each part based on the other nine-tenths of the data, and find model that minimises this estimate of prediction error.

APTS: Statistical Modelling    April 2010 – slide 34

Other selection criteria
• Corrected version of AIC for models with normal responses:
    AICc = n log σ̂² + n (1 + p/n) / {1 − (p + 2)/n},
  where σ̂² = RSS/n. Related (unbiased) AICu replaces σ̂² by S² = RSS/(n − p).
• Mallows suggested
    Cp = SSp/s² + 2p − n,
  where SSp is RSS for the fitted model and s² estimates σ².
• Comments:
  – AIC tends to choose models that are too complicated; AICc cures this somewhat
  – BIC chooses true model with probability → 1 as n → ∞, if the true model is fitted.

APTS: Statistical Modelling    April 2010 – slide 35
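
A sketch (simulated data, not the course example) computing several of the criteria above—leave-one-out CV, GCV, AICc and Mallows Cp—for a sequence of nested normal linear models, to compare the model sizes they favour.

```python
# Sketch: CV, GCV, AICc and Mallows Cp across nested linear models.
import numpy as np

rng = np.random.default_rng(8)
n, q = 40, 4
X_full = np.column_stack([np.ones(n), rng.normal(size=(n, 9))])
y = X_full[:, :q] @ np.array([1.0, 2.0, -1.0, 1.5]) + rng.normal(size=n)

results = []
for p in range(1, X_full.shape[1] + 1):
    X = X_full[:, :p]
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    e = y - H @ y                                    # residuals for the p-column model
    rss = e @ e
    cv = np.sum(e**2 / (1 - np.diag(H))**2)          # leave-one-out CV via the hat matrix
    gcv = rss / (1 - p / n)**2
    aicc = n * np.log(rss / n) + n * (1 + p / n) / (1 - (p + 2) / n)
    results.append((p, rss, cv, gcv, aicc))

s2 = results[-1][1] / (n - X_full.shape[1])          # s^2 from the largest model, for Cp
for p, rss, cv, gcv, aicc in results:
    cp = rss / s2 + 2 * p - n
    print(p, round(cv, 1), round(gcv, 1), round(aicc, 1), round(cp, 1))
```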

Simulation experiment
Number of times models were selected using various model selection criteria in 50 repetitions using simulated normal data for each of 20 design matrices. [The table of selection counts by number of covariates, and the size of the true model, could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 36

Simulation experiment
Twenty replicate traces of AIC, BIC, and AICc, for data simulated with n = 20, p = 1, . . . , 16, and q = 6. [The plots could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 37
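
A sketch (a reconstruction of this kind of experiment with assumed settings, not the original code or design): simulate normal data with q = 6 active covariates, fit nested models with p = 1, . . . , 16 columns, and count how often AIC, BIC and AICc select each model size.

```python
# Sketch: how often AIC, BIC and AICc pick each model size over repeated simulations.
import numpy as np

def criteria(rss, n, p):
    llmax = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)   # profile log likelihood
    aic = 2 * (p - llmax)
    bic = 2 * (0.5 * p * np.log(n) - llmax)
    aicc = n * np.log(rss / n) + n * (1 + p / n) / (1 - (p + 2) / n)
    return aic, bic, aicc

rng = np.random.default_rng(9)
n, pmax, q, reps = 40, 16, 6, 200
counts = {name: np.zeros(pmax, dtype=int) for name in ("AIC", "BIC", "AICc")}
for _ in range(reps):
    X = np.column_stack([np.ones(n), rng.normal(size=(n, pmax - 1))])
    y = X[:, :q] @ rng.uniform(1.0, 2.0, size=q) + rng.normal(size=n)
    vals = np.array([criteria(np.sum((y - X[:, :p] @ np.linalg.lstsq(X[:, :p], y, rcond=None)[0])**2), n, p)
                     for p in range(1, pmax + 1)])
    for k, name in enumerate(("AIC", "BIC", "AICc")):
        counts[name][np.argmin(vals[:, k])] += 1            # index 0 corresponds to p = 1
print({name: c.tolist() for name, c in counts.items()})     # BIC typically concentrates most sharply at p = q
```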

Simulation experiment
Twenty replicate traces of AIC, BIC, and AICc, for data simulated with n = 40, p = 1, . . . , 16, and q = 6. [The plots could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 38

Simulation experiment
Twenty replicate traces of AIC, BIC, and AICc, for data simulated with n = 80, p = 1, . . . , 16, and q = 6. [The plots could not be recovered from the transcription.]
As n increases, note how AIC and AICc still allow some over-fitting, but BIC does not, and AICc approaches AIC.

APTS: Statistical Modelling    April 2010 – slide 39

Sparse Variable Selection    slide 40

Motivation
• ‘Traditional’ analysis methods presuppose that p < n, so the number of observations exceeds the number of covariates: tall thin design matrices
• Many modern datasets have design matrices that are short and fat: p ≫ n, so the number of covariates (far) exceeds the number of observations—e.g., survival data (n a few hundred) with genetic information on individuals (p many thousands)
• Need approaches to deal with this
• Only possibility is to drop most of the covariates from the analysis, so the model has many fewer active covariates
  – usually impracticable in fitting to have p > n
  – anyway impossible to interpret when p too large
• Seek sparse solutions, in which coefficients of most covariates are set to zero, and only covariates with large coefficients are retained. One way to do this is by thresholding: kill small coefficients, and keep the rest.

APTS: Statistical Modelling    April 2010 – slide 41

Desiderata
Would like variable selection procedures that satisfy:
• sparsity—small estimates are reduced to zero by a threshold procedure;
• near unbiasedness—the estimators almost provide the true parameters, when these are large and n → ∞;
• continuity—the estimator is continuous in the data, to avoid instability in prediction.
None of the previous approaches is sparse, and stepwise selection (for example) is known to be highly unstable. To overcome this, we consider a regularised (or penalised) log likelihood of the form
    ½ Σ_{j=1}^n ℓ_j(x_j^T β; y_j) − n Σ_{r=1}^p p_λ(|β_r|),
where p_λ(|β|) is a penalty discussed below.

APTS: Statistical Modelling    April 2010 – slide 42

Example: Lasso
• The lasso (least absolute selection and shrinkage operator) chooses β to minimise
    (y − Xβ)^T (y − Xβ)  such that  Σ_{r=1}^p |β_r| ≤ λ,
  for some λ > 0; call the resulting estimator β̃_λ.
• λ = 0 implies β̃_λ = 0, and λ → ∞ implies β̃_λ → β̂ = (X^T X)^{−1} X^T y.
• Simple case: orthogonal design matrix X^T X = I_p gives
    β̃_{λ,r} = 0 if |β̂_r| ≤ γ,    β̃_{λ,r} = sign(β̂_r)(|β̂_r| − γ) otherwise,    r = 1, . . . , p.    (8)
• Call this soft thresholding.
• Computed using least angle regression algorithm (Efron et al., 2004, Annals of Statistics).

APTS: Statistical Modelling    April 2010 – slide 43

Soft thresholding
[Plots of g(β) and g′(β) for γ = 0.5, in the two cases β̂ = 0.9 and β̂ = 0.4, could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 44
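
A minimal sketch (added to this transcription) of the soft-thresholding rule (8): each coefficient is shrunk towards zero by γ, and any coefficient smaller in magnitude than γ is set exactly to zero.

```python
# Sketch: the soft-thresholding rule of (8).
import numpy as np

def soft_threshold(beta_hat, gamma):
    return np.sign(beta_hat) * np.maximum(np.abs(beta_hat) - gamma, 0.0)

print(soft_threshold(np.array([-2.0, -0.3, 0.1, 0.9]), gamma=0.5))
# [-1.5 -0.  0.  0.4]: the small estimates are killed, the large ones shrunk by 0.5
```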

Graphical explanation
In each case aim to minimise the quadratic function subject to remaining inside the shaded region. [The two panels, labelled ‘Ridge’ and ‘Lasso’ and showing the constraint regions in the (x, y) plane, could not be recovered from the transcription.]

APTS: Statistical Modelling    April 2010 – slide 45

Lasso: Nuclear power data
Left: traces of coefficient estimates β̂_λ as constraint λ is relaxed, showing points at which the different covariates enter the model. Right: behaviour o… [the transcription breaks off here]
