Notes for Machine Learning: A Probabilistic Perspective by Kevin P. Murphy
Notes by Kevin O'Connor

Chapter 13: Sparse Linear Models

Goal: select sets of relevant variables simultaneously.

13.2: Bayesian Variable Selection

Suppose we have latent variables $\gamma_j = \mathbb{1}(\text{feature } j \text{ is relevant})$. Then compute the posterior of $\gamma = (\gamma_1, \ldots, \gamma_D)$,
\[
p(\gamma \mid \mathcal{D}) = \frac{\exp\{-f(\gamma)\}}{\sum_{\gamma'} \exp\{-f(\gamma')\}}, \qquad
f(\gamma) \triangleq -\left[\log p(\mathcal{D} \mid \gamma) + \log p(\gamma)\right].
\]
Two ways of estimating $\gamma$:
1. MAP: $\hat{\gamma} = \operatorname{argmax}_\gamma p(\gamma \mid \mathcal{D}) = \operatorname{argmin}_\gamma f(\gamma)$. However, this is often not representative of the full posterior mass.
2. Median model: $\hat{\gamma} = \{j : p(\gamma_j = 1 \mid \mathcal{D}) > 0.5\}$.

13.2.1: Spike and Slab Model

Let $\pi_0$ be the probability that a feature is relevant. Then define the hyper-prior on $\gamma$,
\[
p(\gamma) = \prod_{j=1}^{D} \mathrm{Ber}(\gamma_j \mid \pi_0) = \pi_0^{\|\gamma\|_0} (1 - \pi_0)^{D - \|\gamma\|_0}.
\]
Define the prior on $w_j$,
\[
p(w_j \mid \sigma^2, \gamma_j) =
\begin{cases}
\delta_0(w_j) & \gamma_j = 0 \\
\mathcal{N}(w_j \mid 0, \sigma^2 \sigma_w^2) & \gamma_j = 1
\end{cases}
\]
Then, assuming Gaussianity, we can write the likelihood in terms of only the features with $\gamma_j = 1$ (denoted by the subscript $\gamma$),
\[
p(\mathcal{D} \mid \gamma) = \int\!\!\int \mathcal{N}(y \mid X_\gamma w_\gamma, \sigma^2 I_N)\, \mathcal{N}(w_\gamma \mid 0_\gamma, \sigma^2 \sigma_w^2 I_\gamma)\, p(\sigma^2)\, dw_\gamma\, d\sigma^2.
\]
If the marginal likelihood is intractable, we can approximate it with BIC:
\[
\log p(\mathcal{D} \mid \gamma) \approx \log p(y \mid X, \hat{w}_\gamma, \sigma^2) - \frac{\|\gamma\|_0}{2} \log N,
\]
\[
\log p(\gamma \mid \mathcal{D}) \approx \log p(y \mid X, \hat{w}_\gamma, \sigma^2) - \frac{\|\gamma\|_0}{2} \log N - \lambda \|\gamma\|_0 + \text{const},
\]
where $\hat{w}_\gamma$ is the ML or MAP estimate based on $X_\gamma$ and $\lambda = \log((1 - \pi_0)/\pi_0)$ controls the sparsity.
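A minimal numerical sketch of the BIC-style score above (not from the book): fit $\hat{w}_\gamma$ by least squares on the selected columns $X_\gamma$, plug in the ML estimate of the noise variance, and subtract the complexity and sparsity penalties. Function and variable names are illustrative.

```python
import numpy as np

def bic_score(X, y, gamma, pi0=0.1):
    """Approximate log p(gamma | D) up to a constant, using the BIC surrogate
    for the marginal likelihood: least-squares fit on the selected columns,
    then penalize ||gamma||_0."""
    N = X.shape[0]
    k = int(gamma.sum())
    lam = np.log((1 - pi0) / pi0)            # sparsity penalty from the prior
    if k == 0:
        resid = y
    else:
        Xg = X[:, gamma.astype(bool)]
        w_hat, *_ = np.linalg.lstsq(Xg, y, rcond=None)
        resid = y - Xg @ w_hat
    sigma2 = max(resid @ resid / N, 1e-12)   # ML estimate of the noise variance
    loglik = -0.5 * N * (np.log(2 * np.pi * sigma2) + 1)  # Gaussian log-likelihood at the MLE
    return loglik - 0.5 * k * np.log(N) - lam * k
```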

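For small $D$ the posterior over $\gamma$ can be enumerated exactly over all $2^D$ subsets, which makes the MAP and median models above concrete; a sketch reusing the bic_score helper from the previous block.

```python
from itertools import product
import numpy as np

def enumerate_posterior(X, y, pi0=0.1):
    """Enumerate all 2^D subsets, normalize exp(score) to get p(gamma | D),
    and return the MAP model, the median model, and the inclusion probabilities."""
    D = X.shape[1]
    gammas = np.array(list(product([0, 1], repeat=D)))
    scores = np.array([bic_score(X, y, g, pi0) for g in gammas])
    probs = np.exp(scores - scores.max())    # subtract max for numerical stability
    probs /= probs.sum()
    map_model = gammas[np.argmax(probs)]
    incl_prob = probs @ gammas               # p(gamma_j = 1 | D) for each j
    median_model = (incl_prob > 0.5).astype(int)
    return map_model, median_model, incl_prob
```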
13.2.2: Bernoulli-Gaussian Model

Also called the binary mask model:
\[
y_i \mid x_i, w, \gamma, \sigma^2 \sim \mathcal{N}\Big(\sum_j \gamma_j w_j x_{ij},\ \sigma^2\Big), \qquad
\gamma_j \sim \mathrm{Ber}(\pi_0), \qquad
w_j \sim \mathcal{N}(0, \sigma_w^2).
\]
Differences between this and spike and slab:
1. We don't integrate out the irrelevant coefficients.
2. Here $\gamma_j$ and $w_j$ are independent a priori, whereas in the spike and slab model the prior on $w_j$ depends on $\gamma_j$.

13.3: l1 Regularization

Using discrete priors like $\gamma_j \in \{0, 1\}$ can cause computational problems when finding the posterior mode. So we replace such priors with continuous ones that encourage sparsity, for example the Laplace prior
\[
p(w \mid \lambda) \propto \prod_{j=1}^{D} \exp\{-\lambda |w_j|\},
\]
which gives the penalized negative log-likelihood
\[
\mathrm{PNLL}(w) = \mathrm{NLL}(w) + \lambda \|w\|_1.
\]
LASSO solution for regression vs. the Ridge solution (assuming an orthonormal design):
\[
\hat{w}_k^{\mathrm{LASSO}} = \mathrm{sign}(\hat{w}_k^{\mathrm{OLS}}) \left( |\hat{w}_k^{\mathrm{OLS}}| - \frac{\lambda}{2} \right)_+, \qquad
\hat{w}_k^{\mathrm{RIDGE}} = \frac{\hat{w}_k^{\mathrm{OLS}}}{1 + \lambda}.
\]

13.3.4: Regularization Path

As we increase $\lambda$, $\hat{w}(\lambda)$ gets sparser. We can visualize this by plotting $\hat{w}_j(\lambda)$ vs. $\lambda$, called the regularization path. Note that the active set of non-zero coefficients only changes at a finite number of critical points. This gives rise to the LARS algorithm, which is roughly as efficient as OLS.

13.3.5: Model Selection

We call a method that can recover the true model as $N \to \infty$ model selection consistent. We usually choose $\lambda$ for the LASSO via cross validation to maximize predictive accuracy, but this does not result in model selection consistency. We can make l1 regularization more robust to small perturbations of the data by using Bayesian l1 regularization or by running the LASSO on bootstrapped data.

13.4: l1 Regularization: Algorithms

13.4.1: Coordinate Descent

Idea: optimize the coordinates one-by-one,
\[
\hat{w}_j = \operatorname{argmin}_z f(w + z e_j).
\]
We can either cycle through each coordinate or select each at random.
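A sketch of cyclic coordinate descent for the lasso objective $\tfrac{1}{2}\|y - Xw\|^2 + \lambda\|w\|_1$, where each one-dimensional subproblem has a closed-form soft-thresholding solution; this is one standard instantiation of the idea above, not the book's exact pseudocode.

```python
import numpy as np

def soft_threshold(a, b):
    """sign(a) * max(|a| - b, 0)"""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def lasso_cd(X, y, lam, n_iters=100):
    """Cyclic coordinate descent for 0.5 * ||y - Xw||^2 + lam * ||w||_1.
    Each coordinate update solves its 1-D subproblem exactly."""
    N, D = X.shape
    w = np.zeros(D)
    col_sq = (X ** 2).sum(axis=0)              # precompute ||x_j||^2
    for _ in range(n_iters):
        for j in range(D):
            r_j = y - X @ w + X[:, j] * w[j]   # partial residual excluding feature j
            a_j = X[:, j] @ r_j
            w[j] = soft_threshold(a_j, lam) / col_sq[j]
        # (a convergence check on the change in w could be added here)
    return w
```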

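The regularization path of 13.3.4 can be traced by simply re-fitting over a grid of $\lambda$ values (LARS does this far more efficiently by tracking the critical points); a toy sketch on synthetic data, reusing lasso_cd from the block above. The ground-truth weights and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 8
X = rng.standard_normal((N, D))
w_true = np.array([3.0, -2.0, 0, 0, 1.5, 0, 0, 0])   # sparse ground truth (illustrative)
y = X @ w_true + 0.1 * rng.standard_normal(N)

lams = np.logspace(-2, 2, 20)
path = np.array([lasso_cd(X, y, lam) for lam in lams])
for lam, w in zip(lams, path):
    # the active set shrinks as lambda grows
    print(f"lambda={lam:8.3f}  active set = {np.flatnonzero(np.abs(w) > 1e-8)}")
```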
13.5: l1 Regularization: Extensions

13.5.1: Group LASSO

Idea: encourage sparsity over groups of coefficients.
\[
J(w) = \mathrm{NLL}(w) + \sum_{g=1}^{G} \lambda_g \|w_g\|_2, \qquad
\|w_g\|_2 = \sqrt{\sum_{j \in g} w_j^2}.
\]
Note that this differs from Ridge regression because we use $\|w_g\|_2$ rather than $\|w_g\|_2^2$.

13.5.2: Fused LASSO

Idea: we want neighboring coefficients to be close to one another.
\[
J(w, \lambda_1, \lambda_2) = \sum_{i=1}^{N} (y_i - w_i)^2 + \lambda_1 \sum_{i=1}^{N} |w_i| + \lambda_2 \sum_{i=1}^{N-1} |w_{i+1} - w_i|.
\]
This can also be generalized beyond simple chain neighbor structures.

13.5.3: Elastic Net

Idea: combine Ridge and LASSO.
\[
J(w, \lambda_1, \lambda_2) = \|y - Xw\|^2 + \lambda_2 \|w\|_2^2 + \lambda_1 \|w\|_1.
\]

Chapter 14: Kernels

14.1: Introduction

For some complex objects, it is not clear how to model them without assuming some generative model on their features. But if we have a notion of similarity between objects, we can use kernel methods.

14.2: Kernel Functions

A kernel is some function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, typically symmetric and non-negative.

14.2.1: RBF Kernels

A radial basis function (RBF) is one that only depends on $\|x - x'\|$. This gives rise to the RBF kernel,
\[
\kappa(x, x') = \exp\left( -\frac{\|x - x'\|^2}{2\sigma^2} \right),
\]
and the squared-exponential or Gaussian kernel,
\[
\kappa(x, x') = \exp\left( -\frac{1}{2} (x - x')^T \Sigma^{-1} (x - x') \right).
\]

14.2.2: Kernels for Comparing Documents

Cosine similarity,
\[
\kappa(x_i, x_{i'}) = \frac{x_i^T x_{i'}}{\|x_i\|_2 \|x_{i'}\|_2}.
\]
We could also transform the counts with TF-IDF and then take the cosine similarity.
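A sketch of the document kernel of 14.2.2: TF-IDF reweighting (one common convention among several) followed by cosine similarity. The toy count matrix is made up for illustration.

```python
import numpy as np

def tfidf(counts):
    """Reweight a (documents x terms) count matrix by TF-IDF (one common variant)."""
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    df = (counts > 0).sum(axis=0)                     # document frequency per term
    idf = np.log(counts.shape[0] / np.maximum(df, 1))
    return tf * idf

def cosine_kernel(Xa, Xb):
    """kappa(x, x') = x^T x' / (||x||_2 ||x'||_2), computed for all pairs of rows."""
    na = np.linalg.norm(Xa, axis=1, keepdims=True)
    nb = np.linalg.norm(Xb, axis=1, keepdims=True)
    return (Xa @ Xb.T) / np.maximum(na * nb.T, 1e-12)

counts = np.array([[3, 0, 1, 0],
                   [0, 2, 0, 1],
                   [2, 1, 1, 0]])                     # toy term counts
K = cosine_kernel(tfidf(counts), tfidf(counts))
```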

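A sketch of the RBF kernel from 14.2.1: build the Gram matrix for a batch of inputs and check numerically that its eigenvalues are nonnegative, which is the property the Mercer condition in the next subsection formalizes. The data are random placeholders.

```python
import numpy as np

def rbf_kernel(Xa, Xb, sigma=1.0):
    """kappa(x, x') = exp(-||x - x'||^2 / (2 sigma^2)) for all pairs of rows."""
    sq_dists = (
        (Xa ** 2).sum(axis=1)[:, None]
        + (Xb ** 2).sum(axis=1)[None, :]
        - 2 * Xa @ Xb.T
    )
    return np.exp(-sq_dists / (2 * sigma ** 2))

rng = np.random.default_rng(1)
X = rng.standard_normal((20, 3))
K = rbf_kernel(X, X)
eigvals = np.linalg.eigvalsh(K)
print(eigvals.min() >= -1e-10)   # the Gram matrix of a Mercer kernel is positive semidefinite
```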
14.2.3: Mercer (Positive Definite) Kernels

The Gram matrix $K$ has entries $K_{ij} = \kappa(x_i, x_j)$. If $K$ is positive definite for any choice of inputs, $\kappa$ is called a Mercer kernel. Mercer kernels are nice because we can use Mercer's theorem, which uses the eigendecomposition
\[
K = U^T \Lambda U = (\Lambda^{1/2} U)^T (\Lambda^{1/2} U)
\]
to write $\kappa(x, x') = \phi(x)^T \phi(x')$. Thus there is some function $\phi$ such that $\kappa$ is an inner product in the embedded space.

14.2.6: String Kernels

\[
\kappa(x, x') = \sum_{s \in \mathcal{A}^*} w_s \phi_s(x) \phi_s(x'),
\]
where $\phi_s(x)$ is the number of times $s$ occurs as a substring of $x$, $w_s \geq 0$, and $\mathcal{A}^*$ is the set of all strings from the alphabet $\mathcal{A}$.

14.3: Using Kernels inside GLMs

14.3.1: Kernel Machines

A kernel machine is a GLM where the input vector has the form
\[
\phi(x) = (\kappa(x, \mu_1), \ldots, \kappa(x, \mu_K))
\]
for some $K$ centroids $\mu_1, \ldots, \mu_K$.

14.4: Kernel Trick

Kernel trick: instead of defining a feature vector in terms of kernels as above, just work with the original inputs and replace inner products with kernels.

14.4.1: Kernelized Nearest-Neighbors

Observe that
\[
\|x_i - x_j\|_2^2 = \langle x_i, x_i \rangle + \langle x_j, x_j \rangle - 2 \langle x_i, x_j \rangle,
\]
so we can replace all of these inner products with kernels.

14.4.4: Kernel PCA

Find principal component projections in a lower dimensional non-linear embedding.

14.5: SVMs

Consider the l2-regularized empirical risk function,
\[
J(w, \lambda) = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \lambda \|w\|^2.
\]
Idea: promote sparsity via the loss $L$ rather than the penalty term.
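The identity in 14.4.1 gives squared distances purely in terms of kernel evaluations, so nearest-neighbor classification can run in the implicit feature space; a minimal 1-NN sketch (any Mercer kernel, e.g. the rbf_kernel defined earlier, can be passed in).

```python
import numpy as np

def kernel_1nn_predict(X_train, y_train, X_test, kernel):
    """1-nearest-neighbor in feature space, using
    ||phi(x) - phi(x_i)||^2 = k(x, x) + k(x_i, x_i) - 2 k(x, x_i)."""
    K_tt = kernel(X_test, X_test)
    K_rr = kernel(X_train, X_train)
    K_tr = kernel(X_test, X_train)
    d2 = np.diag(K_tt)[:, None] + np.diag(K_rr)[None, :] - 2 * K_tr
    return y_train[np.argmin(d2, axis=1)]   # label of the closest training point
```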

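The kernel-machine feature vector of 14.3.1 just evaluates the kernel against $K$ centroids; a sketch where the centroids are a random subsample of the inputs, which is one common (but not the only) choice, reusing rbf_kernel from above.

```python
import numpy as np

def kernel_machine_features(X, centroids, kernel):
    """phi(x) = (kappa(x, mu_1), ..., kappa(x, mu_K)) for every row of X."""
    return kernel(X, centroids)

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 5))
mu = X[rng.choice(100, size=10, replace=False)]    # K = 10 centroids from the data
Phi = kernel_machine_features(X, mu, rbf_kernel)   # feed Phi into any GLM
```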
14.5.1: SVMs for Regression

(Vapnik) The epsilon-insensitive loss function is
\[
L_\epsilon(y, \hat{y}) =
\begin{cases}
0 & |y - \hat{y}| < \epsilon \\
|y - \hat{y}| - \epsilon & \text{otherwise}
\end{cases}
\]
so it doesn't penalize you if you are within $\epsilon$ of the target. We can formulate a constrained optimization problem with this loss as
\[
J = C \sum_{i=1}^{N} (\xi_i^+ + \xi_i^-) + \frac{1}{2} \|w\|^2,
\]
where $\xi_i^+, \xi_i^-$ are slack variables such that
\[
y_i \leq f(x_i) + \epsilon + \xi_i^+, \qquad y_i \geq f(x_i) - \epsilon - \xi_i^-,
\]
with $\hat{y} = f(x_i) = w^T x_i + w_0$, constrained to $\xi_i^+ \geq 0$, $\xi_i^- \geq 0$. This gives a standard quadratic program with $2N + D + 1$ variables, yielding the solution
\[
\hat{w} = \sum_i \alpha_i x_i, \qquad \alpha_i \geq 0,
\]
with sparse $\alpha$. Then we can kernelize as
\[
\hat{y}(x) = \hat{w}_0 + \sum_i \alpha_i \kappa(x_i, x).
\]

14.5.2: SVMs for Classification

Hinge loss: let $y \in \{-1, +1\}$ and define
\[
L_{\mathrm{hinge}}(y, f(x)) = (1 - y f(x))_+.
\]
Then write the constrained optimization problem as
\[
\min_{w, w_0, \xi} \ \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{N} \xi_i \quad \text{s.t.} \quad \xi_i \geq 0,\ \ y_i (x_i^T w + w_0) \geq 1 - \xi_i,\ \ i = 1, \ldots, N.
\]
Again we get a solution of the form
\[
\hat{w} = \sum_i \alpha_i x_i,
\]
and thus, after kernelizing,
\[
\hat{y}(x) = \mathrm{sign}\Big( \hat{w}_0 + \sum_{i=1}^{N} \alpha_i \kappa(x_i, x) \Big).
\]

14.5.2.2: The Large Margin Principle

It can be shown that the SVM objective maximizes the distance of the decision boundary from the support vectors.

14.7: Kernels for Building Generative Models

A smoothing kernel $\kappa : \mathcal{X} \to \mathbb{R}$ satisfies
\[
\int \kappa(x)\, dx = 1, \qquad \int x \kappa(x)\, dx = 0, \qquad \int x^2 \kappa(x)\, dx > 0.
\]
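The two SVM losses above in vectorized form, with labels in $\{-1, +1\}$ for the hinge loss; a small numpy sketch with illustrative defaults.

```python
import numpy as np

def epsilon_insensitive_loss(y, y_hat, eps=0.1):
    """L_eps(y, yhat) = max(|y - yhat| - eps, 0): no penalty inside the eps-tube."""
    return np.maximum(np.abs(y - y_hat) - eps, 0.0)

def hinge_loss(y, f):
    """L_hinge(y, f(x)) = max(1 - y * f(x), 0), with labels y in {-1, +1}."""
    return np.maximum(1.0 - y * f, 0.0)
```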

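Once a QP solver has produced the (sparse) dual weights $\alpha$ and offset $\hat{w}_0$, prediction only needs kernel evaluations against the support vectors; a sketch of the kernelized decision rule, where alpha and w0 are assumed to come from elsewhere.

```python
import numpy as np

def svm_predict(X_test, X_support, alpha, w0, kernel):
    """Kernelized SVM classifier: sign(w0 + sum_i alpha_i kappa(x_i, x)).
    alpha and w0 are assumed to have been obtained by solving the dual QP."""
    K = kernel(X_test, X_support)     # (n_test, n_support) kernel evaluations
    return np.sign(w0 + K @ alpha)
```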
14.7.2: Kernel Density Estimator (KDE)

Also known as the Parzen window estimator. Use a kernel with bandwidth $h$ to estimate the density as
\[
\hat{p}(x) = \frac{1}{N} \sum_{i=1}^{N} \kappa_h(x - x_i).
\]

14.7.4: Kernel Regression

Goal: compute $f(x) = \mathbb{E}[y \mid x]$. We can use KDE to approximate the joint density,
\[
p(x, y) \approx \frac{1}{N} \sum_{i=1}^{N} \kappa_h(x - x_i) \kappa_h(y - y_i).
\]
Then
\[
f(x) = \frac{\sum_{i=1}^{N} \kappa_h(x - x_i)\, y_i}{\sum_{i=1}^{N} \kappa_h(x - x_i)}.
\]
This is also called the Nadaraya-Watson model.
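A sketch of the Parzen-window density estimate and the Nadaraya-Watson formula above for one-dimensional data, using a Gaussian smoothing kernel; the bandwidth $h$ is a free parameter here.

```python
import numpy as np

def gaussian_kernel(u, h):
    """kappa_h(u) = N(u | 0, h^2), a valid smoothing kernel."""
    return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))

def kde(x_query, x_train, h=0.5):
    """Parzen window estimate p_hat(x) = (1/N) sum_i kappa_h(x - x_i), 1-D inputs."""
    return gaussian_kernel(x_query[:, None] - x_train[None, :], h).mean(axis=1)

def nadaraya_watson(x_query, x_train, y_train, h=0.5):
    """f(x) = sum_i kappa_h(x - x_i) y_i / sum_i kappa_h(x - x_i), 1-D inputs."""
    W = gaussian_kernel(x_query[:, None] - x_train[None, :], h)
    return (W @ y_train) / W.sum(axis=1)
```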
