Machine Learning - Google Research

Transcription

Machine Learning
A Probabilistic Perspective

Kevin P. Murphy

The MIT Press
Cambridge, Massachusetts
London, England

© 2012 Massachusetts Institute of Technology

All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.

For information about special quantity discounts, please email special_sales@mitpress.mit.edu

This book was set in the LaTeX programming language by the author. Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Information

Murphy, Kevin P.
Machine learning : a probabilistic perspective / Kevin P. Murphy.
p. cm. — (Adaptive computation and machine learning series)
Includes bibliographical references and index.
ISBN 978-0-262-01802-9 (hardcover : alk. paper)
1. Machine learning. 2. Probabilities. I. Title.
Q325.5.M87 2012
006.3’1—dc23
2012004558

10 9 8 7 6 5 4 3 2 1

This book is dedicated to Alessandro, Michael and Stefano, and to the memory of Gerard Joseph Murphy.

Contents

Preface

1 Introduction
   1.1 Machine learning: what and why?
      1.1.1 Types of machine learning
   1.2 Supervised learning
   1.3 Unsupervised learning
      1.3.1 Discovering clusters
      1.3.2 Discovering latent factors
      1.3.3 Discovering graph structure
      1.3.4 Matrix completion
   1.4 Some basic concepts in machine learning
      1.4.1 Parametric vs non-parametric models
      1.4.2 A simple non-parametric classifier: K-nearest neighbors
      1.4.3 The curse of dimensionality
      1.4.4 Parametric models for classification and regression
      1.4.5 Linear regression
      1.4.6 Logistic regression
      1.4.7 Overfitting
      1.4.8 Model selection
      1.4.9 No free lunch theorem

2 Probability
   2.1 Introduction
   2.2 A brief review of probability theory
      2.2.1 Discrete random variables
      2.2.2 Fundamental rules
      2.2.3 Bayes rule
      2.2.4 Independence and conditional independence
      2.2.5 Continuous random variables

      2.2.7 Mean and variance
   2.3 Some common discrete distributions
      2.3.1 The binomial and Bernoulli distributions
      2.3.2 The multinomial and multinoulli distributions
      2.3.3 The Poisson distribution
      2.3.4 The empirical distribution
   2.4 Some common continuous distributions
      2.4.1 Gaussian (normal) distribution
      2.4.2 Degenerate pdf
      2.4.3 The Student t distribution
      2.4.4 The Laplace distribution
      2.4.5 The gamma distribution
      2.4.6 The beta distribution
      2.4.7 Pareto distribution
   2.5 Joint probability distributions
      2.5.1 Covariance and correlation
      2.5.2 The multivariate Gaussian
      2.5.3 Multivariate Student t distribution
      2.5.4 Dirichlet distribution
   2.6 Transformations of random variables
      2.6.1 Linear transformations
      2.6.2 General transformations
      2.6.3 Central limit theorem
   2.7 Monte Carlo approximation
      2.7.1 Example: change of variables, the MC way
      2.7.2 Example: estimating π by Monte Carlo integration
      2.7.3 Accuracy of Monte Carlo approximation
   2.8 Information theory
      2.8.1 Entropy
      2.8.2 KL divergence
      2.8.3 Mutual information

3 Generative models for discrete data
   3.1 Introduction
   3.2 Bayesian concept learning
      3.2.4 Posterior predictive distribution
      3.2.5 A more complex prior
   3.3 The beta-binomial model

      3.3.4 Posterior predictive distribution
   3.4 The Dirichlet-multinomial model
      3.4.4 Posterior predictive
   3.5 Naive Bayes classifiers
      3.5.1 Model fitting
      3.5.2 Using the model for prediction
      3.5.3 The log-sum-exp trick
      3.5.4 Feature selection using mutual information
      3.5.5 Classifying documents using bag of words

4 Gaussian models

      8.4.5 Residual analysis (outlier detection) *
   8.5 Online learning and stochastic optimization
      8.5.1 Online learning and regret minimization

      8.5.2 Stochastic optimization and risk minimization
      8.5.3 The LMS algorithm
      8.5.4 The perceptron algorithm
      8.5.5 A Bayesian view
   8.6 Generative vs discriminative classifiers
      8.6.1 Pros and cons of each approach
      8.6.2 Dealing with missing data
      8.6.3 Fisher’s linear discriminant analysis (FLDA) *

9 Generalized linear models and the exponential family
   9.1 Introduction
   9.2 The exponential family
      9.2.3 Log partition function
      9.2.4 MLE for the exponential family
      9.2.5 Bayes for the exponential family *
      9.2.6 Maximum entropy derivation of the exponential family *
   9.3 Generalized linear models (GLMs)
      9.3.1 Basics
      9.3.2 ML and MAP estimation
      9.3.3 Bayesian inference
   9.4 Probit regression
      9.4.1 ML/MAP estimation using gradient-based optimization
      9.4.2 Latent variable interpretation
      9.4.3 Ordinal probit regression *
      9.4.4 Multinomial probit models *
   9.5 Multi-task learning
      9.5.1 Hierarchical Bayes for multi-task learning
      9.5.2 Application to personalized email spam filtering
      9.5.3 Application to domain adaptation
      9.5.4 Other kinds of prior
   9.6 Generalized linear mixed models *
      9.6.1 Example: semi-parametric GLMMs for medical data
      9.6.2 Computational issues
   9.7 Learning to rank *
      9.7.1 The pointwise approach
      9.7.2 The pairwise approach
      9.7.3 The listwise approach
      9.7.4 Loss functions for ranking

10 Directed graphical models (Bayes nets)
   10.1 Introduction
      10.1.1 Chain rule
      10.1.2 Conditional independence

      10.1.3 Graphical models
      10.1.4 Graph terminology
      10.1.5 Directed graphical models
   10.2 Examples
      10.2.1 Naive Bayes classifiers
      10.2.2 Markov and hidden Markov models
      10.2.3 Medical diagnosis
      10.2.4 Genetic linkage analysis *
      10.2.5 Directed Gaussian graphical models *
   10.3 Inference
   10.4 Learning
      10.4.1 Plate notation
      10.4.2 Learning from complete data
      10.4.3 Learning with missing and/or latent variables
   10.5 Conditional independence properties of DGMs
      10.5.1 d-separation and the Bayes Ball algorithm (global Markov properties)
      10.5.2 Other Markov properties of DGMs
      10.5.3 Markov blanket and full conditionals
   10.6 Influence (decision) diagrams *

11 Mixture models and the EM algorithm
   11.1 Latent variable models
   11.2 Mixture models
      11.2.1 Mixtures of Gaussians
      11.2.2 Mixture of multinoullis
      11.2.3 Using mixture models for clustering
      11.2.4 Mixtures of experts
   11.3 Parameter estimation for mixture models
      11.3.1 Unidentifiability
      11.3.2 Computing a MAP estimate is non-convex
   11.4 The EM algorithm
      11.4.1 Basic idea
      11.4.2 EM for GMMs
      11.4.3 EM for mixture of experts
      11.4.4 EM for DGMs with hidden variables
      11.4.5 EM for the Student distribution *
      11.4.6 EM for probit regression *
      11.4.7 Theoretical basis for EM *
      11.4.8 Online EM
      11.4.9 Other EM variants *
   11.5 Model selection for latent variable models
      11.5.1 Model selection for probabilistic models
      11.5.2 Model selection for non-probabilistic methods
   11.6 Fitting models with missing data

      11.6.1 EM for the MLE of an MVN with missing data

12 Latent linear models
   12.1 Factor analysis
      12.1.1 FA is a low rank parameterization of an MVN
      12.1.2 Inference of the latent factors
      12.1.3 Unidentifiability
      12.1.4 Mixtures of factor analysers
      12.1.5 EM for factor analysis models
      12.1.6 Fitting FA models with missing data
   12.2 Principal components analysis (PCA)
      12.2.1 Classical PCA: statement of the theorem
      12.2.2 Proof *
      12.2.3 Singular value decomposition (SVD)
      12.2.4 Probabilistic PCA
      12.2.5 EM algorithm for PCA
   12.3 Choosing the number of latent dimensions
      12.3.1 Model selection for FA/PPCA
      12.3.2 Model selection for PCA
   12.4 PCA for categorical data
   12.5 PCA for paired and multi-view data
      12.5.1 Supervised PCA (latent factor regression)
      12.5.2 Partial least squares
      12.5.3 Canonical correlation analysis
   12.6 Independent Component Analysis (ICA)
      12.6.1 Maximum likelihood estimation
      12.6.2 The FastICA algorithm
      12.6.3 Using EM
      12.6.4 Other estimation principles *

13 Sparse linear models
   13.1 Introduction
   13.2 Bayesian variable selection
      13.2.1 The spike and slab model
      13.2.2 From the Bernoulli-Gaussian model to ℓ0 regularization
      13.2.3 Algorithms
   13.3 ℓ1 regularization: basics
      13.3.1 Why does ℓ1 regularization yield sparse solutions?
      13.3.2 Optimality conditions for lasso
      13.3.3 Comparison of least squares, lasso, ridge and subset selection
      13.3.4 Regularization path
      13.3.5 Model selection
      13.3.6 Bayesian inference for linear models with Laplace priors
   13.4 ℓ1 regularization: algorithms
      13.4.1 Coordinate descent

      13.4.2 LARS and other homotopy methods
      13.4.3 Proximal and gradient projection methods
      13.4.4 EM for lasso
   13.5 ℓ1 regularization: extensions
      13.5.1 Group Lasso
      13.5.2 Fused lasso
      13.5.3 Elastic net (ridge and lasso combined)
   13.6 Non-convex regularizers
      13.6.1 Bridge regression
      13.6.2 Hierarchical adaptive lasso
      13.6.3 Other hierarchical priors
   13.7 Automatic relevance determination (ARD)/sparse Bayesian learning (SBL)
      13.7.1 ARD for linear regression
      13.7.2 Whence sparsity?
      13.7.3 Connection to MAP estimation
      13.7.4 Algorithms for ARD *
      13.7.5 ARD for logistic regression
   13.8 Sparse coding *
      13.8.1 Learning a sparse coding dictionary
      13.8.2 Results of dictionary learning from image patches
      13.8.3 Compressed sensing
      13.8.4 Image inpainting and denoising

14 Kernels
   14.1 Introduction
   14.2 Kernel functions
      14.2.1 RBF kernels
      14.2.2 Kernels for comparing documents
      14.2.3 Mercer (positive definite) kernels
      14.2.4 Linear kernels
      14.2.5 Matérn kernels
      14.2.6 String kernels
      14.2.7 Pyramid match kernels
      14.2.8 Kernels derived from probabilistic generative models
   14.3 Using kernels inside GLMs
      14.3.1 Kernel machines
      14.3.2 L1VMs, RVMs, and other sparse vector machines
   14.4 The kernel trick
      14.4.1 Kernelized nearest neighbor classification
      14.4.2 Kernelized K-medoids clustering
      14.4.3 Kernelized ridge regression
      14.4.4 Kernel PCA
   14.5 Support vector machines (SVMs)
      14.5.1 SVMs for regression
      14.5.2 SVMs for classification

      14.5.3 Choosing C
      14.5.4 Summary of key points
      14.5.5 A probabilistic interpretation of SVMs
   14.6 Comparison of discriminative kernel methods
   14.7 Kernels for building generative models
      14.7.1 Smoothing kernels
      14.7.2 Kernel density estimation (KDE)
      14.7.3 From KDE to KNN
      14.7.4 Kernel regression
      14.7.5 Locally weighted regression

15 Gaussian processes
   15.1 Introduction
   15.2 GPs for regression
      15.2.1 Predictions using noise-free observations
      15.2.2 Predictions using noisy observations
      15.2.3 Effect of the kernel parameters
      15.2.4 Estimating the kernel parameters
      15.2.5 Computational and numerical issues *
      15.2.6 Semi-parametric GPs *
   15.3 GPs meet GLMs
      15.3.1 Binary classification
      15.3.2 Multi-class classification
      15.3.3 GPs for Poisson regression
   15.4 Connection with other methods
      15.4.1 Linear models compared to GPs
      15.4.2 Linear smoothers compared to GPs
      15.4.3 SVMs compared to GPs
      15.4.4 L1VM and RVMs compared to GPs
      15.4.5 Neural networks compared to GPs
      15.4.6 Smoothing splines compared to GPs *
      15.4.7 RKHS methods compared to GPs *
   15.5 GP latent variable model
   15.6 Approximation methods for large datasets

16 Adaptive basis function models
   16.1 Introduction
   16.2 Classification and regression trees (CART)

      Handling non-Gaussian data using copulas *

   26.8 Learning undirected discrete graphical models
      26.8.1 Graphical lasso for MRFs/CRFs
      26.8.2 Thin junction trees

27 Latent variable models for discrete data
   27.1 Introduction
   27.2 Distributed state LVMs for discrete data
      27.2.1 Mixture models
      27.2.2 Exponential family PCA
      27.2.3 LDA and mPCA
      27.2.4 GaP model and non-negative matrix factorization
   27.3 Latent Dirichlet allocation (LDA)
      27.3.1 Basics
      27.3.2 Unsupervised discovery of topics
      27.3.3 Quantitatively evaluating LDA as a language model
      27.3.4 Fitting using (collapsed) Gibbs sampling
      27.3.5 Example
      27.3.6 Fitting using batch variational inference
      27.3.7 Fitting using online variational inference
      27.3.8 Determining the number of topics
   27.4 Extensions of LDA
      27.4.1 Correlated topic model
      27.4.2 Dynamic topic model
      27.4.3 LDA-HMM
      27.4.4 Supervised LDA
   27.5 LVMs for graph-structured data
      27.5.1 Stochastic block model
      27.5.2 Mixed membership stochastic block model
      27.5.3 Relational topic model
   27.6 LVMs for relational data
      27.6.1 Infinite relational model
      27.6.2 Probabilistic matrix factorization for collaborative filtering
   27.7 Restricted Boltzmann machines (RBMs)
      27.7.1 Varieties of RBMs
      27.7.2 Learning RBMs
      27.7.3 Applications of RBMs

28 Deep learning
   28.1 Introduction
   28.2 Deep generative models
      28.2.1 Deep directed networks
      28.2.2 Deep Boltzmann machines
      28.2.3 Deep belief networks
      28.2.4 Greedy layer-wise learning of DBNs
   28.3 Deep neural networks

      28.3.1 Deep multi-layer perceptrons
      28.3.2 Deep auto-encoders
      28.3.3 Stacked denoising auto-encoders
   28.4 Applications of deep networks
      28.4.1 Handwritten digit classification using DBNs
      28.4.2 Data visualization and feature discovery using deep auto-encoders
      28.4.3 Information retrieval using deep auto-encoders (semantic hashing)
      28.4.4 Learning audio features using 1d convolutional DBNs
      28.4.5 Learning image features using 2d convolutional DBNs

Indexes
   Index to code
   Index to keywords

Preface

Introduction

With the ever increasing amounts of data in electronic form, the need for automated methods for data analysis continues to grow. The goal of ma
