
Some Advanced Model Selection Topics for Nonparametric/Semiparametric Models with High-Dimensional Data

Zaili Fang

Dissertation submitted to the faculty of the Virginia Polytechnic Institute and State University in partial fulfillment of the requirements for the degree of

Doctor of Philosophy
in
Statistics

Inyoung Kim, Committee Chair
Eric P. Smith
George R. Terrell
Pang Du
Scotland C. Leman

November 3, 2012
Blacksburg, Virginia

KEYWORDS: Additive Model; Cluster Algorithm; Gaussian Random Process; Global-Local Shrinkage; Graphical Model; Ising Model; Kernel Machine; KM Model; LASSO; Long Tail Prior; Mixture Normals; Model Selection; Multivariate Smoothing Function; Nonnegative Garrote; Nonparametric Model; Pathway Analysis; Semiparametric Model; Sparsistency; Smoothing Splines; Variable Selection.

Copyright 2012, Zaili Fang

Some Advanced Model Selection Topics for Nonparametric/Semiparametric Models with High-Dimensional Data

Zaili Fang

(ABSTRACT)

Model and variable selection have attracted considerable attention in areas of application where datasets usually contain thousands of variables. Variable selection is a critical step for reducing the dimension of high-dimensional data by eliminating irrelevant variables. The general objective of variable selection is not only to obtain a cost-effective set of selected predictors but also to improve prediction accuracy and reduce prediction variance. We have made several contributions to this issue through a range of advanced topics: providing a graphical view of Bayesian variable selection (BVS), recovering sparsity in multivariate nonparametric models, and proposing a testing procedure for evaluating the nonlinear interaction effect in a semiparametric model.

To address the first topic, we propose a new Bayesian variable selection approach via the graphical model and the Ising model, which we refer to as the "Bayesian Ising Graphical Model" (BIGM). Our BIGM has several advantages: it is easy to (1) employ the single-site updating and cluster updating algorithms, both of which are suitable for problems with small sample sizes and a large number of variables, (2) extend this approach to nonparametric regression models, and (3) incorporate graphical prior information.

In the second topic, we propose a Nonnegative Garrote on a Kernel machine (NGK) to recover the sparsity of input variables in smoothing functions. We model the smoothing function by a least squares kernel machine and construct a nonnegative garrote on the kernel model as a function of the similarity matrix. An efficient coordinate descent/backfitting algorithm is developed.

The third topic involves a specific genetic pathway dataset in which the pathways interact with environmental variables. We propose a semiparametric method to model the pathway-environment interaction. We then employ a restricted likelihood ratio test and a score test to evaluate the main pathway effect and the pathway-environment interaction.

To Father and Mother

Acknowledgments

I would like to express my sincere gratitude and special thanks to my supervisor, Dr. Inyoung Kim, who has provided guidance, support, and encouragement throughout my graduate studies.

Thanks also to my committee members, Dr. Eric P. Smith, Dr. George Terrell, Dr. Pang Du, and Dr. Scotland Leman, who offered valuable advice and support.

I have also had the privilege of talking with a number of talented individuals in the Department of Statistics at Virginia Tech. Thank you to all the professors for your inspiring courses and guidance. Thank you to all my friends for your generous collaboration and discussion of statistical issues, which widened my vision. I would also like to thank all the staff of the Department of Statistics for their selfless support of the students.

Lastly, I would like to thank my beloved parents for their great and never-ending support, accompanying me all the way. None of this would be possible without the love and support of my family.

Contents

1 Outline of this Dissertation

2 Bayesian Ising Graphical Model for Variable Selection
  2.1 Introduction
  2.2 Model Description
    2.2.1 Bayesian Variable Selection with Normal Mixture Priors
    2.2.2 Bayesian Ising Graphical Model
  2.3 Algorithm for Updating γ
    2.3.1 Single-site Algorithm
    2.3.2 Cluster Algorithm
  2.4 Extensions
    2.4.1 Incorporating Graph Prior Information
    2.4.2 Extension to Nonparametric Regression Models: Bayesian Sparse Additive Model (BSAM)
  2.5 Understanding the Mechanism of Bayesian Variable Selection
    2.5.1 General Profile of the Marginal Selection Probability
    2.5.2 Dynamic Properties of the Odds with Different Priors
    2.5.3 Expressions for π_j^b
  2.6 Connection of BIGM to Simulated Tempering and Generalization by Lévy Process
  2.7 Proofs of the Lemmas and Theorems
    2.7.1 Proof of Theorem 2.3.1
    2.7.2 Proof of Theorem 2.5.1
    2.7.3 Proof of Theorem 2.5.2
    2.7.4 Proof of Theorem 2.5.3
    2.7.5 The Calculation of π_j^b
  2.8 Simulation Study
    2.8.1 Case One: Comparison of Three Priors
    2.8.2 Case Two: Three Regions of Global Shrinkage Parameter b
    2.8.3 Case Three: Comparison of Cluster and Single-site Algorithms
    2.8.4 Case Four: Bayesian Sparse Additive Model
    2.8.5 Case Five: Linear Chain Prior
  2.9 Real Data Analysis
    2.9.1 Ozone Data
    2.9.2 Gene Selection in Pathway Data
  2.10 Discussion

3 Sparsity Recovery From Multivariate Nonparametric Models
  3.1 Introduction
  3.2 Flexible Multivariate Nonparametric Model
    3.2.1 Multivariate Nonparametric Model Using Kernel Machine
    3.2.2 Nonnegative Garrote on Kernel (NGK)
    3.2.3 Connection with the Linear Nonnegative Garrote Estimator
    3.2.4 Connection with Kernel Machine Learning
    3.2.5 Some Notation and Regularity Conditions
  3.3 An Efficient Algorithm
    3.3.1 Backfitting Algorithm to Update ξ_j's
    3.3.2 Model Selection Criterion
  3.4 Some Theoretical Properties
    3.4.1 Necessary and Sufficient Conditions for the Consistency of the NGK Estimator
    3.4.2 Recovery of Sparsity
  3.5 Proofs of the Lemmas and Theorems
    3.5.1 Proof of Lemma 3.2.2
    3.5.2 Proof of Theorem 3.4.1
    3.5.3 Proof of Lemma 3.4.2
    3.5.4 Proof of Theorem 3.4.3
  3.6 Simulation Results
    3.6.1 Comparison with Linear LASSO
    3.6.2 Simulation Example 1
    3.6.3 Simulation Example 2
    3.6.4 Simulation Example 3
  3.7 Applications
    3.7.1 Key Selection in Cryptography Data
    3.7.2 Gene Selection in Pathway Data
  3.8 Discussion

4 Semiparametric Mixed Model for Evaluating Pathway-Environment Interaction
  4.1 Introduction
  4.2 Construction of Semiparametric Linear Mixed Effects Models
    4.2.1 Model Description and the Kernel of the Interaction Function Space
    4.2.2 Linear Mixed Model Representation
    4.2.3 Estimating Pathway and Interaction Effects
  4.3 REML Estimation of the Variance Components
    4.3.1 REML Approach for Estimating Variance Components
    4.3.2 Profile REML Approach for Estimating Variance Components
  4.4 Tests for Pathway Effects
    4.4.1 Test for Two Zero Variance Components
    4.4.2 Test for the P-E Interaction Effect
  4.5 Simulation Study
    4.5.1 Parameter Estimation
    4.5.2 Test Study
  4.6 Application to Type II Diabetes Data
  4.7 Discussion

Bibliography

A Lancaster and Šalkauskas Basis for Natural Cubic Spline

B The Representation of the Natural Cubic Spline

List of Figures

2.1 Diagram of variable selection as a random graph model with selected nodes (filled circles), excluded nodes (circles), edges of positive interaction (black lines), and edges of negative interaction (red lines). Independent variable selection: no interactions among nodes (a). General variable selection: a complete graph (b).

2.2 Diagram of the cluster algorithm. Forming the cluster (a-c). Flipping clustered nodes (d-e).

2.3 Selection probability curves against κ_j (a). Curves of the marginal selection probability against the global shrinkage parameter b (b). Marginal selection probabilities with baseline subtracted (c). All plots are under orthogonal designs.

2.4 Marginal prior density functions of β_j and density functions of κ_j for different b.

2.5 The derivative of the log odds with respect to a_j given different b (a); note that the curves of the Cauchy and horseshoe priors overlap. The derivative of the log odds with respect to b given different a_j (b).

2.6 Profile curves of the selection probability of the simulation model (2.47) with different priors for the large signal setting (a-c) and the small signal setting (d-f).

2.7 Profile curves of the selection probability of Models I A and B (a-b). Selection probability at two b values for Model I A (c-d).

2.8 Profile curves of the selection probability of Models II A and B (a-b). Selection probability at two b values for Model II A (c-d).

2.9 The sum of the absolute ACF against variable number p for the cluster algorithm and the single-site algorithm at different b values.

2.10 Profile curves of the selection probability of the simulation model (2.48) with p = 80, n = 100 for the independent setting (t = 0) (a) and the correlated setting (t = 1) (b).

2.11 True functions f_j (blue dashed lines) and estimated functions f̂_j (blue solid lines) with 95% credible intervals (red dashed lines) for the 4 true nodes (a-d) and a noise node (e) from a run of the simulation model (2.48) with independent setting t = 0 and p = 80. The marginal selection probability at b = 26 (f). Note that we reordered the first 4 true node numbers to (2, 20, 50, 70) for a better view.

2.12 True functions f_j (blue dashed lines) and estimated functions f̂_j (blue solid lines) with 95% credible intervals (red dashed lines) for the 4 true nodes (a-d) and a noise node (e) from a run of the simulation model (2.48) with correlated setting t = 1 and p = 80. The marginal selection probability at b = 26 (f). Note that we reordered the first 4 true node numbers to (2, 20, 50, 70) for a better view.

2.13 The graph of a linear chain prior with 20 nodes, with nodes 1 through 10 "in" (a). Profile curves of the selection probability of the case four model calculated by the cluster algorithm with a noninformative prior (b) and with the linear chain prior for γ (c).

2.14 Estimated functions f̂_j (blue solid lines) with 95% credible intervals (red dashed lines) for the 8 predictors of the ozone data, labeled by the marginal selection probability P = p(γ_j = 1 | y) at b = 1.6.

2.15 Profile curves of the selection probability of the genetic pathway data with a noninformative prior for γ (a) and with the informative prior (2.50) (b).

2.16 Summary of the results for the genetic pathway data. Top left: genetic network structure of the data. Top right: frequency matrix of two nodes aligned in the cluster over total iterations. Bottom left: frequency matrix of two nodes anti-aligned in the cluster over total iterations. Bottom right: selection probability with the cluster algorithm at b = 8.5 with the informative prior (2.50).

3.1 Incoherence condition values vs. λ with λ_0 fixed at 0.0026 (a), and vs. λ_0 with λ fixed at 1.516 (b). Both use the initial α̃(ξ̃) with ξ̃ = (1, 1, 1)^T.

3.2 Solution paths of the β_i's calculated by the linear LASSO (a) and of the ξ_i's calculated by NGK with a linear kernel (b). The NGK solution paths use the initial α̃ and λ_0 = 0.0026 estimated by REML.

3.3 Selected example of the NGK solution path for example 1 using the Gaussian kernel, (a) and (b), and the linear kernel, (c) and (d). Left side: ξ_j's vs. the L1 norm of the ξ_j's. Right side: ξ_j's and BIC vs. log λ.

3.4 Selected example of the NGK solution path for example 2 using the Gaussian kernel, (a) and (b), and the linear kernel, (c) and (d). Left side: ξ_j's vs. the L1 norm of the ξ_j's. Right side: ξ_j's and BIC vs. log λ.

3.5 Selected example of the NGK solution path for example 3 using the Gaussian kernel, (a) and (b), and the linear kernel, (c) and (d). Left side: ξ_j's vs. the L1 norm of the ξ_j's. Right side: ξ_j's and BIC vs. log λ.

3.6 Selection probability of each predictor in example 3 for 400 runs.

3.7 (a) Diagram of a side-channel attack; (b) data structure.

3.8 NGK solution path for the SCA data using the linear kernel. (a) ξ_j's vs. the L1 norm of the ξ_j's; (b) ξ_j's and BIC vs. log λ.

3.9 Selection probability of each key guess for the SCA data using the m-out-of-n resampling procedure, with m = 2048, n = 5120, and 1200 total runs.

3.10 NGK solution path for diabetes data pathway 133 using the Gaussian kernel. (a) ξ_j's vs. the L1 norm of the ξ_j's; (b) ξ_j's and BIC vs. log λ.

3.11 NGK solution path for diabetes data pathway 4 using the Gaussian kernel. (a) ξ_j's vs. the L1 norm of the ξ_j's; (b) ξ_j's and BIC vs. log λ.

3.12 NGK solution path for diabetes data pathway 140 using the Gaussian kernel. (a) ξ_j's vs. the L1 norm of the ξ_j's; (b) ξ_j's and BIC vs. log λ.

3.13 Selection probability of each gene using the residual permutation method for pathway 133 (a), and pathways 140 and 4 (b); 3000 total runs for each pathway.

4.1 Diagram of the parameter space of the RLRT for testing two zero variance components (a) and for testing the P-E interaction effect (b).

4.2 Selected example of the fitting results of setting 1. Because of the high dimensionality, r_z, r_xz and f are plotted vs. the observation index only.

4.3 The estimated variance components σ̂², τ̂_x, τ̂_z, τ̂_xz for 251 pathways ordered by the p-values of the test of the overall pathway effect. The dashed lines separate the significant and insignificant pathways at the 5% level.

4.4 The p-values of the tests of the overall pathway effect (RLRT D) and the P-E interaction effect (RLRT d) for 251 pathways. The vertical dashed line divides the significant and insignificant pathways for the overall pathway effect test, and the horizontal dashed line indicates the 5% significance level. Some p-values of RLRT d are missing because the information matrix is not positive definite.

List of Tables

2.1 Summary of the Cauchy, Laplace, and horseshoe priors for the marginal prior of the β_j's, the corresponding priors for p(τ_j), and the density functions of the shrinkage parameter κ_j.

2.2 Simulation results of the sparse additive model (2.48) for 500 runs.

2.3 Parameter estimation for the ozone data under the Bayesian sparse additive model at b = 1.6.

3.1 Selection frequency of each predictor in example 1 for 200 runs.

3.2 Simulation results of example 1 for 200 runs.

3.3 Selection frequency of each predictor in example 2 for 200 runs.

3.4 Simulation results of example 2 for 200 runs.

3.5 Simulation results of example 3 for 400 runs.

4.1 Assessments of estimating f_x, f_z and f_xz simulated by (4.34) using the REML and p-REML procedures, with ρ estimated from initial value 2 or fixed at 2. There are 200 total runs for each scenario, and average values are reported.

4.2 Type I error and power of the RLRT of the overall pathway effect with ρ fixed at different values and estimated. Simulated sample size n = 100, and both the used and true gene numbers equal p = 30.

4.3 Type I error and power of the RLRT of the overall pathway effect with the fitted gene number p equal to or larger than the true one, p = 30. Simulated sample sizes n = 60 and n = 35. The parameter ρ is fixed at 2.

4.4 Type I error and power of the RLRT and score test of the P-E interaction with ρ fixed at different values. Fitted and used gene numbers equal p = 5, and n = 100.

4.5 Estimated parameters of the top 20 pathways obtained from p-REML and ranked by the p-values of the RLRT D test. The numbers in round brackets are standard errors.

4.6 P-values of different tests for the top 20 pathways significant in the overall pathway effect. Columns 2 and 3 are labels indicating appearance in the top 50 lists of other methods. Missing values in column 6 occur because the information matrix is not positive definite.

Chapter 1

Outline of this Dissertation

Model selection is a very general statistical term describing methods for selecting a statistical model from a set of candidates, which typically involves a set of regressor variables or perhaps model terms (transforms of the natural variables). One very popular model selection problem is variable and feature selection, i.e., selecting significant predictors from tens, hundreds, or thousands of candidates, which has become the focus of much research in recent years. The most popular variable selection technique based on linear regression is the Least Absolute Shrinkage and Selection Operator (LASSO) (Efron et al., 2004; Tibshirani, 1996); a short illustrative sketch of the LASSO appears after the chapter outline below. However, there are several variable selection settings where the LASSO does not work well, such as selecting the function components of an additive model, or selecting the input variables of a Gaussian process or neural network. We usually apply nonparametric regression methods to model such nonlinear and nonadditive functions. To address variable selection in nonparametric regression models, smoothing spline and kernel machine techniques have been employed in many ways. Developing a variable selection procedure that is practical for these situations is still an interesting and challenging topic. We explore some advanced variable selection problems from both the frequentist and Bayesian points of view. On the frequentist side, penalizing a norm of the nonlinear function space is the standard approach to reducing the dimension, while on the Bayesian side, finding an efficient updating sampler is the typical research goal for the stochastic search over the binary random variables corresponding to the inclusion or exclusion of the predictors. Nevertheless, the model selection task can also involve hypothesis testing of the main effects and interaction effects in a mixed model (Goeman et al., 2004; Zhang et al., 1998; Zhang and Lin, 2003), which becomes complicated when semi-/nonparametric mixed models are considered to establish the relationship between the response and the predictors.

Hence, this dissertation focuses on these three major problems in model selection. The outline of the dissertation is sketched as follows:

• In Chapter 2, we discuss a new Bayesian variable selection (BVS) approach via the graphical model and the Ising model, which we refer to as the "Bayesian Ising Graphical Model" (BIGM). In this chapter, we are mainly interested in the following issues:
  – Model Description: we connect the regular linear regression model with the graphical model for the purpose of variable selection.
  – Methodology: we generalize the existing cluster algorithm for the Ising model with fixed interactions to a cluster algorithm applicable to a complete binary random graphical model with random interactions.
  – Extension: we provide an extension of BIGM to Bayesian additive models for function component selection, and we improve the performance of BIGM by incorporating the network information of the graphical model's nodes.
  – Understanding BVS: we study the selection probability profile curves and evaluate the performance of different priors for the model coefficients, ranging from priors with heavy mass around zero to long-tailed priors. We also discuss the connection between shrinkage parameters and tempering parameters.
  – Simulation studies and applications.

• In Chapter 3, we introduce a flexible variable selection approach for recovering the sparsity in nonadditive or additive multivariate nonparametric models. In this chapter, the following major issues are discussed:
  – Model Description: we extend the nonnegative garrote method to a nonlinear nonparametric model.
  – Methodology: we propose a coordinate descent or backfitting algorithm to solve the problem.
  – Theoretical Properties: we provide theoretical results showing the sparsistency (sparsity consistency) of our method.
  – Simulation studies and applications.

• In Chapter 4, we discuss a semiparametric mixed model for evaluating pathway-environment interaction. In this chapter, we focus on the following issues:
  – Model Description: we establish the semiparametric model describing the environmental variable, the pathway covariates, and their interaction.
  – Estimation: we estimate the parameters with two methods, Restricted Maximum Likelihood (REML) and profile REML.
  – Hypothesis Testing: a profile Restricted Likelihood Ratio Test (RLRT) and a REML score test are discussed.
  – Simulation studies and applications.
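As a point of reference for the linear baseline mentioned above, here is a minimal sketch of LASSO-based variable selection in a p > n setting. This is our illustration, not code from the dissertation; the data, dimensions, seed, and use of scikit-learn's LassoCV are all invented for demonstration.

```python
# Minimal LASSO variable-selection sketch (illustrative only; the data,
# dimensions, and penalty selection below are not from the dissertation).
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 100, 200                          # more predictors than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:4] = [3.0, -2.0, 1.5, 1.0]    # only 4 relevant predictors
y = X @ beta_true + rng.standard_normal(n)

# LassoCV picks the L1 penalty by cross-validation; the L1 penalty shrinks
# most coefficients exactly to zero, doing selection and estimation at once.
fit = LassoCV(cv=5).fit(X, y)
selected = np.flatnonzero(fit.coef_ != 0)
print("selected predictors:", selected)
```

The L1 penalty works well here because the true model is linear and sparse; the chapters that follow address the settings where this breaks down, such as additive function components and kernel machines.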

Chapter 2

Bayesian Ising Graphical Model for Variable Selection

2.1 Introduction

Let us start from the standard multiple linear regression model [y | β, φ] ~ N(Xβ, φ^{-1} I), where y is an n × 1 vector of response variables, X = (x_1, ..., x_p) is an n × p matrix of predictors, β = (β_1, ..., β_p)^T is the p × 1 coefficient vector of the full model with β_j corresponding to the jth predictor, j = 1, ..., p, and φ is the precision parameter. The inclusion or exclusion of the jth predictor in the model is represented by a binary indicator random variable γ_j ∈ {0, 1}. We denote the inclusion of predictor x_j by γ_j = 1; otherwise we exclude it from the model. In recent years, incorporating prior network information about the predictors into BVS models has received substantial attention (Li and Zhang, 2010; Monni and Li, 2010; Stingo et al., 2011; Tai et al., 2010). In all these papers, the network information about the predictors is introduced through an informative prior for the γ_j's, which is a binary random graph. However, none of these papers discuss treating variable selection as a graphical model with a noninformative prior for the γ_j's. A binary random graphical model for the random vector γ = (γ_1, ..., γ_p)^T is represented by an undirected graph G = (V, E), where V is the set of p vertices or nodes corresponding to the p predictors and E is a set of edges connecting neighboring nodes. In this dissertation, we base our approach on a reparameterized BVS model known as the KM model (Kuo and Mallick, 1998). We develop the new BVS approach via the graphical model and the Ising model, which we refer to as the "Bayesian Ising Graphical Model" (BIGM) for variable selection. We demonstrate that, with a noninformative prior for γ, the linear regression model is essentially a complete graphical model. A nice review of the Ising model can be found in Iba (2001) and Newman and Barkema (1999).

Our contributions to this topic are several: (1) we connect BVS in the linear regression model with the Ising model with random interaction terms and propose the BIGM, (2) we develop an efficient cluster algorithm, (3) we extend the BIGM to the Bayesian sparse additive model (BSAM) for nonparametric function component selection, (4) we study the selection probability profiles under different shrinkage, and (5) we connect the BIGM with tempering algorithms. In the following we explain in detail why these are important contributions by itemizing each contribution separately.

• First, by revealing that the binary Markov chain random process for γ on a graph can be modeled by the Ising model conditional on β and φ, we propose the BIGM. In a BIGM, the interactions between nodes are random and long-range (each node is the neighbour of every other node). To obtain flexible interactions between nodes, we adopt the "shrink globally, act locally" strategy (Polson and Scott, 2011, 2012), which assigns scale normal mixture priors to the β_j's (Barndorff-Nielsen et al., 1982; West, 1987); a toy numerical illustration of this strategy appears after this list. To the best of our knowledge, our work is the first to directly connect BVS with the binary graphical model via the Ising model.

• Second, we develop a generalized cluster algorithm in which the cluster is formed under the random interactions among nodes. Possible approaches to exploring the configuration space of γ in an Ising model are the cluster algorithm and the family of exchange Monte Carlo, parallel tempering, and simulated tempering algorithms (Iba, 2001). However, the current cluster algorithms, such as the Swendsen-Wang algorithm (Swendsen and Wang, 1987) and the Wolff algorithm (Wolff, 1989), are constructed based on the graph prior for γ and only consider fixed interactions; a sketch of this classical fixed-interaction setting appears at the end of this section. Therefore, neither is applicable to the more general random complete graphical model. Furthermore, in our BIGM it is straightforward to incorporate the graphical prior information about γ.

• Third, we extend our BIGM to the BSAM. There are only a few papers discussing BVS under nonparametric regression (Reich et al., 2009; Scheipl, 2011; Smith and Kohn, 1996). Based on the KM model, our BIGM is easily extended to the BSAM. We employ the Lancaster and Šalkauskas (LS) spline basis (Chib and Greenberg, 2010; Lancaster and Šalkauskas, 1986) to express the nonparametric function components. To the best of our knowledge, our method is the first to connect the graphical model with nonparametric regressors such that we can simultaneously select an appropriate subset of the function components and estimate the flexible functions.
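As a rough numerical illustration of the "shrink globally, act locally" idea from the first contribution above (our sketch, not the dissertation's code; the global scale b = 0.1 and the sample sizes are arbitrary choices): a global scale b shrinks every coefficient toward zero, while heavy-tailed local scales τ_j let individual signals escape. Mixing a normal over a half-Cauchy local scale yields the horseshoe marginal for β_j.

```python
# Sketch: global-local scale mixture of normals (illustrative; b and the
# sample size are arbitrary). beta_j | tau_j ~ N(0, b^2 * tau_j^2) with
# tau_j ~ Half-Cauchy(0, 1) gives the horseshoe marginal for beta_j.
import numpy as np

rng = np.random.default_rng(1)
b = 0.1                                       # strong global shrinkage
tau = np.abs(rng.standard_cauchy(100_000))    # heavy-tailed local scales
beta = rng.normal(0.0, b * tau)               # horseshoe draws

# The heavy local tails let a few coefficients stay large despite strong
# global shrinkage -- compare with a plain normal of the same global scale.
beta_normal = rng.normal(0.0, b, size=beta.size)
print("P(|beta| > 1), horseshoe:", np.mean(np.abs(beta) > 1))
print("P(|beta| > 1), normal   :", np.mean(np.abs(beta_normal) > 1))
```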

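For background on the second contribution, the following is a compact sketch of the classical Wolff single-cluster update for a two-dimensional Ising model with a fixed ferromagnetic coupling, i.e., the textbook setting referenced above. This is our illustration under assumed parameters (lattice size, coupling, and iteration count are arbitrary); it is not the generalized random-interaction cluster algorithm developed in Section 2.3.2.

```python
# Classical Wolff cluster update for a 2D Ising model with fixed coupling
# (illustrative; the BIGM cluster algorithm generalizes this setting to
# random, long-range interactions).
import numpy as np

def wolff_step(spins, beta_J, rng):
    """Grow one cluster from a random seed site and flip it as a block."""
    L = spins.shape[0]
    p_add = 1.0 - np.exp(-2.0 * beta_J)        # bond-activation probability
    seed = (rng.integers(L), rng.integers(L))
    s0 = spins[seed]
    cluster = {seed}
    frontier = [seed]
    while frontier:
        i, j = frontier.pop()
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            ni, nj = ni % L, nj % L            # periodic boundary conditions
            if (ni, nj) not in cluster and spins[ni, nj] == s0 \
                    and rng.random() < p_add:
                cluster.add((ni, nj))
                frontier.append((ni, nj))
    for i, j in cluster:                        # flip the whole cluster at once
        spins[i, j] *= -1
    return spins

rng = np.random.default_rng(2)
spins = rng.choice([-1, 1], size=(32, 32))
for _ in range(100):
    spins = wolff_step(spins, beta_J=0.44, rng=rng)   # near the critical point
print("magnetization:", spins.mean())
```

Because the whole cluster flips in one move, this update mixes far faster than single-site flips near the critical point; the difficulty our BIGM addresses is that the bond-activation probability above assumes a fixed, known coupling, whereas in variable selection the interactions are themselves random.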