Bayesian Machine Learning - Lecture 1 - Dipartimento Di Informatica

Transcription

Bayesian Machine Learning - Lecture 1
Guido Sanguinetti
Institute for Adaptive and Neural Computation
School of Informatics, University of Edinburgh
gsanguin@inf.ed.ac.uk
February 23, 2015

Welcome
- Broad introduction to statistical machine learning concepts within the Bayesian probabilistic framework
- Focus on using statistics as a modelling tool, and on algorithms for efficient inference
- Key objective: theoretical and practical familiarity with some fundamental ML methods
- Structure: four two-hour lectures and one two-hour lab each week
- Assessment: coinciding with the labs. NB: I believe PhD students have already demonstrated their ability to pass exams.

Main refs
- D. Barber, Bayesian Reasoning and Machine Learning, CUP 2010
- Some slides also taken from the teaching material attached to the book (thanks David!)
- Other good books: C.M. Bishop, Pattern Recognition and Machine Learning (Springer 2006); K. Murphy, Machine Learning: A Probabilistic Perspective (MIT Press 2012)
- Rasmussen and Williams, Gaussian Processes for Machine Learning (MIT Press 2006) for Lecture 3
- Lecture 4 (Active Learning and Bayesian Optimisation): B. Settles, Active Learning Literature Survey, sections 2 and 3, and Brochu et al., http://arxiv.org/abs/1012.2599
- Wikipedia also has good pages for most of the material
- IMPORTANT FACT: these slides are not a book!!!!

Today's lecture
1. Some facts
2. Philosophy and road map
3. Basics of probability theory
4. Some probability distributions
5. Fitting distributions
6. Basics of learning

A few things worth considering
- Mobile traffic in 2013 was 18 times the total internet traffic in 2000
- The UK National Health Service plans to sequence the genomes of 750,000 cancer patients in the next ten years
- Google purchased DeepMind (after 1 year of operation) for 450M GBP
- The number of job postings for data scientists increased globally by 15,000% between 2011 and 2012

The problem
- Vast amounts of quantitative data arising from every aspect of life
- Advanced informatics tools are necessary just to handle the data
- Widespread belief that data is valuable, yet worthless without analytic tools
- Converting data to knowledge is the challenge

A memorable quote
"If you ignore philosophy, it comes back and bites your bottom" (Dr R. Shillcock, Informatics, Edinburgh)
What is a model? Discuss for 5 minutes and provide 3 examples.

My own answer
- A model is a hypothesis that certain features of a system of interest are well replicated in another, simpler system.
- A mathematical model is a model where the simpler system consists of a set of mathematical relations between objects (equations, inequalities, etc.).
- A stochastic model is a mathematical model where the objects are probability distributions.
- All modelling usually starts by defining a family of models indexed by some parameters, which are tweaked to reflect how well the feature of interest is captured.
- Machine learning deals with algorithms for automatic selection of a model from observations of the system.

Course content
References are to the Barber 2010 book.
- Lecture 1: Statistical basics. Probability refresher, probability distributions, entropy and KL divergence (Ch 1, Ch 8.2, 8.3). Multivariate Gaussian (8.4). Estimators and maximum likelihood (8.6 and 8.7.3). Supervised and unsupervised learning (13.1).
- Lecture 2: Linear models. Regression with additive noise and logistic regression (probabilistic perspective): maximum likelihood and least squares (18.1 and 17.4.1). Duality and kernels (17.3).
- Lecture 3: Bayesian regression models and Gaussian Processes. Bayesian models and hyperparameters (18.1.1, 18.1.2). Gaussian Process regression (19.1-19.4).
- Lecture 4: Active learning and Bayesian optimisation. Active learning: basic concepts and types of active learning; Bayesian optimisation.

Course content cont'd
- Lecture 5: Latent variables and mixture models. Latent variables and the EM algorithm (11.1 and 11.2.1). Gaussian mixture models and mixture of experts (20.3, 20.4).
- Lecture 6: Graphical models. Belief networks and Markov networks (3.3 and 4.2). Factor graphs (4.4).
- Lecture 7: Exact inference in trees. Message passing and belief propagation (5.1 and 28.7.1).
- Lecture 8: Approximate inference in graphical models. Variational inference: Gaussian and mean-field approximations (28.3, 28.4). Sampling methods and Gibbs sampling (27.4 and 27.3).
- Lab 2: Bayesian Gaussian Mixture Models

Definitions
- Random variables: results of experiments that are not exactly reproducible
- Either intrinsically random (e.g. quantum mechanics), or the system is incompletely known and cannot be controlled precisely
- The probability p_i of an experiment taking a certain value i is the frequency with which that value is taken in the limit of infinitely many experimental trials
- Alternatively, we can take probability to be our belief that a certain value will be taken

More definitions
- Let x be a random variable; the set of possible values of x is the sample space Ω
- Let x and y be two random variables; p(x = i, y = j) is the joint probability of x taking value i and y taking value j (with i and j in the respective sample spaces). This is often written simply as p(x, y) to indicate the function (as opposed to its evaluation at the outcomes i and j)
- p(x|y) is the conditional probability, i.e. the probability of x if you know that y has taken a certain value

Rules
- Normalisation: the sum of the probabilities of all possible experimental outcomes must be 1,
  \[ \sum_{x \in \Omega} p(x) = 1 \]
- Sum rule: the marginal probability p(x) is given by summing the joint p(x, y) over all possible values of y,
  \[ p(x) = \sum_{y \in \Omega} p(x, y) \]
- Product rule: the joint is the product of the conditional and the marginal,
  \[ p(x, y) = p(x|y)\, p(y) \]
- Bayes' rule: the posterior is the ratio of the joint and the marginal,
  \[ p(y|x) = \frac{p(x|y)\, p(y)}{p(x)} \]
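A minimal sketch, not from the slides: the four rules checked numerically on a small, made-up discrete joint distribution. All numbers here are invented for illustration.

```python
# Checking normalisation, sum rule, product rule and Bayes' rule on a toy joint.
import numpy as np

# Hypothetical joint table: rows index x in {0, 1}, columns index y in {0, 1, 2}.
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
assert np.isclose(p_xy.sum(), 1.0)           # normalisation

p_x = p_xy.sum(axis=1)                       # sum rule: marginal over y
p_y = p_xy.sum(axis=0)                       # sum rule: marginal over x

p_x_given_y = p_xy / p_y                     # conditional p(x|y); columns sum to 1
assert np.allclose(p_x_given_y * p_y, p_xy)  # product rule: p(x,y) = p(x|y) p(y)

# Bayes' rule: p(y|x) = p(x|y) p(y) / p(x)
p_y_given_x = (p_x_given_y * p_y) / p_x[:, None]
assert np.allclose(p_y_given_x.sum(axis=1), 1.0)
print(p_y_given_x)
```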

Distributions and expectations
- A probability distribution is a rule associating a number 0 ≤ p(x) ≤ 1 to each state x ∈ Ω, such that \( \sum_{x \in \Omega} p(x) = 1 \)
- For a finite state space it can be given by a table; in general it is given by a functional form
- Probability distributions (over numerical objects) are useful to compute expectations of functions,
  \[ \langle f \rangle = \sum_{x \in \Omega} f(x)\, p(x) \]
- Important expectations are the mean ⟨x⟩ and the variance var(x) = ⟨(x − ⟨x⟩)²⟩. For more variables, also the covariance cov(x, y) = ⟨(x − ⟨x⟩)(y − ⟨y⟩)⟩, or its scaled relative, the correlation
  \[ \mathrm{corr}(x, y) = \frac{\mathrm{cov}(x, y)}{\sqrt{\mathrm{var}(x)\,\mathrm{var}(y)}} \]
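A minimal sketch of these expectations for a small discrete joint distribution; the values and the joint table are made up, reusing the toy joint from the previous example.

```python
# Mean, variance, covariance and correlation as expectations under a toy joint.
import numpy as np

x_vals = np.array([0.0, 1.0])                # values taken by x (rows)
y_vals = np.array([-1.0, 0.0, 2.0])          # values taken by y (columns)
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])        # joint table, sums to 1

p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
mean_x, mean_y = x_vals @ p_x, y_vals @ p_y
var_x = ((x_vals - mean_x) ** 2) @ p_x
var_y = ((y_vals - mean_y) ** 2) @ p_y

# cov(x, y) = <(x - <x>)(y - <y>)> under the joint distribution
cov_xy = np.sum(np.outer(x_vals - mean_x, y_vals - mean_y) * p_xy)
corr_xy = cov_xy / np.sqrt(var_x * var_y)
print(mean_x, var_x, cov_xy, corr_xy)
```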

Computing expectations
- If you know the probability distribution analytically and can compute the sums (integrals): no problem
- If you know the distribution but cannot compute the sums (integrals), enter the magical realm of approximate inference (fun, but out of scope)
- If you know nothing but have N_S samples, use a sample approximation
- Approximate the probability of an outcome with its frequency in the sample,
  \[ \langle f(x) \rangle \simeq \sum_{x} \frac{n_x}{N_S} f(x) = \frac{1}{N_S} \sum_{i=1}^{N_S} f(x_i) \]
  (prove the last equality)
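A minimal sketch (my own example, not from the slides) showing the two equivalent forms of the sample approximation; the distribution and the choice f(x) = x² are arbitrary.

```python
# Sample approximation of <f(x)>: frequency-weighted sum vs plain sample average.
import numpy as np

rng = np.random.default_rng(0)
x_vals = np.array([0, 1, 2, 3])
p = np.array([0.1, 0.4, 0.3, 0.2])
f = lambda x: x ** 2

exact = np.sum(f(x_vals) * p)                # exact expectation

N_S = 100_000
samples = rng.choice(x_vals, size=N_S, p=p)

# Form 1: replace probabilities by frequencies n_x / N_S
counts = np.bincount(samples, minlength=len(x_vals))
freq_estimate = np.sum(f(x_vals) * counts / N_S)

# Form 2: plain average of f over the samples (the same number)
avg_estimate = f(samples).mean()

print(exact, freq_estimate, avg_estimate)
```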

Independence
- Two random variables x and y are independent if their joint probability factorises into marginals,
  \[ p(x, y) = p(x)\, p(y) \]
- Using the product rule, this is equivalent to the conditional being equal to the marginal,
  \[ p(x, y) = p(x)\, p(y) \iff p(x|y) = p(x) \]
- Exercise: if two variables are independent, then their correlation is zero. The converse is NOT true (zero correlation does not imply independence)
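A minimal sketch of the last point, using my own standard counterexample (not from the slides): x uniform on {−1, 0, 1} and y = x² are uncorrelated but clearly dependent.

```python
# Zero correlation without independence.
import numpy as np

x_vals = np.array([-1.0, 0.0, 1.0])
p_x = np.array([1/3, 1/3, 1/3])
y_vals = x_vals ** 2                       # y is a deterministic function of x

mean_x = x_vals @ p_x                      # = 0 by symmetry
mean_y = y_vals @ p_x
cov_xy = ((x_vals - mean_x) * (y_vals - mean_y)) @ p_x
print(cov_xy)                              # 0: uncorrelated

# But not independent: p(y=1 | x=1) = 1, while the marginal p(y=1) = 2/3.
```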

Continuous states
- If the state space Ω is continuous, some of the previous definitions must be modified
- The general case is mathematically difficult; we restrict ourselves to Ω = R^n and to distributions which admit a density, i.e. a function
  \[ p : \Omega \rightarrow \mathbb{R} \quad \text{s.t.} \quad p(x) \geq 0 \;\; \forall x \quad \text{and} \quad \int_{\Omega} p(x)\, dx = 1 \]
- It can be shown that the rules of probability distributions also hold for probability densities
- Notice that p(x) is NOT the probability of the random variable being in state x (that is always zero for bounded densities); probabilities are only defined as integrals over subsets of Ω

Entropy and divergence
- Probability theory is the basis of information theory (interesting, but not the topic of this course)
- An important quantity is the entropy of a distribution,
  \[ H[p] = -\sum_i p_i \log_2 p_i \]
- Entropy measures the level of disorder of a distribution; for discrete distributions it is always ≥ 0, and 0 only for deterministic distributions
- The relative entropy or Kullback-Leibler (KL) divergence between two distributions is
  \[ \mathrm{KL}[q \| p] = \sum_i q_i \log \frac{q_i}{p_i} \]
- Fact: KL is convex and ≥ 0
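A minimal sketch (distributions invented for illustration): entropy in bits and KL divergence for discrete distributions, following the definitions above.

```python
# Entropy and KL divergence for discrete distributions.
import numpy as np

def entropy(p):
    """H[p] = -sum_i p_i log2 p_i (terms with p_i = 0 contribute 0)."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def kl(q, p):
    """KL[q || p] = sum_i q_i log(q_i / p_i); requires p_i > 0 wherever q_i > 0."""
    q, p = np.asarray(q, dtype=float), np.asarray(p, dtype=float)
    nz = q > 0
    return np.sum(q[nz] * np.log(q[nz] / p[nz]))

uniform = np.array([0.25, 0.25, 0.25, 0.25])
skewed  = np.array([0.70, 0.10, 0.10, 0.10])
delta   = np.array([1.00, 0.00, 0.00, 0.00])

print(entropy(uniform), entropy(skewed), entropy(delta))   # 2.0 > ... > 0.0
print(kl(skewed, uniform), kl(uniform, skewed))            # both >= 0, asymmetric
```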

Basic distributions
- Discrete distribution: a random variable can take N distinct values with probabilities p_i, i = 1, ..., N. Formally,
  \[ p(x = i) = \prod_j p_j^{\delta_{ij}} \]
  where δ_ij is the Kronecker delta and the p_i's are parameters
- Poisson distribution: a distribution over non-negative integers,
  \[ p(n|\mu) = \exp[-\mu] \frac{\mu^n}{n!} \]
  The parameter μ is often called the rate of the distribution. The Poisson distribution is often used for rare events, e.g. decay of particles or binding of DNA fragments to a probe (more later!)
- Exercise: compute the mean and variance of a Poisson distribution
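A minimal sketch of a numerical sanity check for the exercise (the rate is made up; the analytical derivation is still yours to do): sample from a Poisson and look at the sample mean and variance.

```python
# Sampling from a Poisson distribution and inspecting mean and variance.
import numpy as np

rng = np.random.default_rng(1)
mu = 3.5
samples = rng.poisson(lam=mu, size=200_000)

print(samples.mean(), samples.var())   # both should be close to mu
```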

Basic distributions
- Multivariate normal: distribution over vectors x, with density
  \[ p(x|\mu, \Sigma) = \frac{1}{\sqrt{\det(2\pi\Sigma)}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right] \]
  μ is the mean and Σ is the covariance matrix. It is often useful to parametrise it in terms of the precision matrix Σ⁻¹. How many parameters does a multivariate normal have?
- Gamma distribution: distribution over positive real numbers, with density
  \[ p(x|k, \theta) = \frac{x^{k-1} \exp(-x/\theta)}{\theta^k \Gamma(k)} \]
  with shape parameter k and scale parameter θ
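A minimal sketch with invented parameters: draw from a 2-D multivariate normal and check that the sample mean and covariance recover μ and Σ. (On the parameter-count question: d entries for the mean plus d(d+1)/2 for the symmetric covariance.)

```python
# Sampling a multivariate normal and recovering its parameters empirically.
import numpy as np

rng = np.random.default_rng(2)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])               # symmetric, positive definite

X = rng.multivariate_normal(mean=mu, cov=Sigma, size=100_000)
print(X.mean(axis=0))                        # ~ mu
print(np.cov(X, rowvar=False))               # ~ Sigma

# The precision matrix is simply the inverse of the covariance.
print(np.linalg.inv(Sigma))
```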

Interesting exercise
This exercise illustrates the pitfalls of working in high dimensions.
1. Curse of dimensionality: suppose you want to explore a region uniformly by gridding it. How many grid points do you need?
2. Even worse: suppose you sample from a spherical Gaussian distribution. Where do the points lie as the dimension increases?
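A minimal sketch (my own illustration) for part 2: the distance from the origin of samples from a d-dimensional spherical Gaussian concentrates around √d, i.e. the mass sits in a thin shell rather than near the mode.

```python
# Radii of spherical-Gaussian samples as the dimension grows.
import numpy as np

rng = np.random.default_rng(3)
for d in [1, 2, 10, 100, 1000]:
    X = rng.standard_normal(size=(10_000, d))
    r = np.linalg.norm(X, axis=1)
    print(f"d={d:5d}  mean radius={r.mean():8.2f}  (sqrt(d)={np.sqrt(d):7.2f})  "
          f"std={r.std():.2f}")
```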

Mixtures: how to build more distributions
- More general distributions can be built via mixtures, e.g.
  \[ p(x \mid \mu_{1,\dots,n}, \sigma^2_{1,\dots,n}) = \sum_i \pi_i\, \mathcal{N}(\mu_i, \sigma_i^2) \]
  where the mixing coefficients π_i are discretely distributed
- You can interpret this as a two-stage hierarchical process: choose one component from a discrete distribution, then sample from that component's distribution
- IMPORTANT CONCEPT: this is an example of a latent variable model, with a latent class variable and an observed continuous value. The mixture is the marginal distribution of the observations
- The probability of the latent variables given the observations can be obtained using Bayes' theorem: see next week
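A minimal sketch (all parameters invented): sampling from a Gaussian mixture via the two-stage hierarchical process described above, first drawing the latent component, then the observation from that component.

```python
# Two-stage (ancestral) sampling from a Gaussian mixture.
import numpy as np

rng = np.random.default_rng(4)
pi = np.array([0.3, 0.5, 0.2])               # mixing coefficients (sum to 1)
mus = np.array([-4.0, 0.0, 5.0])             # component means
sigmas = np.array([1.0, 0.5, 2.0])           # component standard deviations

N = 10_000
z = rng.choice(len(pi), size=N, p=pi)        # latent class variable (unobserved)
x = rng.normal(loc=mus[z], scale=sigmas[z])  # observed continuous value

# The empirical mean matches the mean of the marginal (mixture) distribution.
print(x.mean(), np.sum(pi * mus))
```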

Continuous mixtures: some cool distributions
- There is no need for the mixing distribution (latent variable) to be discrete
- Suppose you are interested in the means of normally distributed samples (possibly with different variances/precisions): marginalising the precision in a Gaussian using a Gamma mixing distribution yields a Student t-distribution
- Suppose you have multiple rare-event processes happening with slightly different rates: marginalising the rate in a Poisson distribution using a Gamma mixing distribution yields a negative binomial distribution
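A minimal sketch (parameters invented) of the Gamma-Poisson construction: each count gets its own rate drawn from a Gamma distribution, and the resulting marginal is overdispersed (variance exceeds the mean), as the negative binomial is.

```python
# Gamma-Poisson mixture: counts with Gamma-distributed rates are overdispersed.
import numpy as np

rng = np.random.default_rng(5)
shape, scale = 2.0, 1.5                     # Gamma(k, theta) for the rate
N = 200_000

rates = rng.gamma(shape=shape, scale=scale, size=N)   # latent, per-sample rates
counts = rng.poisson(lam=rates)                       # observed counts

mean_rate = shape * scale
print("mean:", counts.mean(), "expected:", mean_rate)
# For a plain Poisson the variance would equal the mean; here it is inflated
# by the spread of the rates: Var = E[lambda] + Var[lambda].
print("var :", counts.var(), "expected:", mean_rate + shape * scale**2)
```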

Parameters?
- Many distributions are written as conditional probabilities given the parameters
- Often the values of the parameters are not known
- Given independent and identically distributed (i.i.d.) observations, we can estimate them; e.g., we pick θ by maximum likelihood,
  \[ \hat{\theta} = \operatorname{argmax}_{\theta} \left[ \prod_i p(x_i|\theta) \right] \]
- Alternatively, you could have a prior over the parameters p(θ) and take the maximum a posteriori (MAP) estimate,
  \[ \hat{\theta}_{MAP} = \operatorname{argmax}_{\theta} \left[ p(\theta) \prod_i p(x_i|\theta) \right] \]
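A minimal sketch (data and prior made up) contrasting ML and MAP for the mean of a Gaussian with known variance σ² and a Gaussian prior N(m, v²) on the mean; both estimates have closed forms, so no numerical optimiser is needed.

```python
# ML vs MAP estimation of a Gaussian mean with known variance.
import numpy as np

rng = np.random.default_rng(6)
true_mu, sigma = 2.0, 1.0
x = rng.normal(loc=true_mu, scale=sigma, size=10)   # a small i.i.d. sample

# Maximum likelihood: the sample mean.
mu_ml = x.mean()

# MAP with prior mean m and prior variance v^2: a precision-weighted average
# of the prior mean and the data.
m, v = 0.0, 1.0
precision = 1 / v**2 + len(x) / sigma**2
mu_map = (m / v**2 + x.sum() / sigma**2) / precision

print(mu_ml, mu_map)   # MAP is pulled from the ML estimate towards the prior mean
```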

Justification for maximum likelihood
- Given a data set {x_i}, i = 1, ..., N, let the empirical distribution be
  \[ p_{\mathrm{emp}}(x) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}_{\{x_i\}}(x) \]
  with 𝕀 the indicator function of a set
- To find a suitable distribution q to model the data, one may wish to minimise the Kullback-Leibler divergence
  \[ \mathrm{KL}[p_{\mathrm{emp}} \| q] = -H[p_{\mathrm{emp}}] - \langle \log q(x) \rangle_{p_{\mathrm{emp}}}, \qquad \langle \log q(x) \rangle_{p_{\mathrm{emp}}} = \frac{1}{N} \sum_i \log q(x_i) \]
- Maximum likelihood is equivalent to minimising the KL divergence with the empirical distribution

Exercise: fitting a discrete distribution
We have independent observations x_1, ..., x_N, each taking one of D possible values, giving a likelihood
\[ L = \prod_{i=1}^{N} p(x_i|\mathbf{p}) \]
Compute the maximum likelihood estimate of p. What is the intuitive meaning of the result? What happens if one of the D values is not represented in your sample?

Exercise II: fitting a Gaussian distribution
We have independent, real-valued observations x_1, ..., x_N. Find the parameters of the optimal Gaussian fit by maximum likelihood.

Bayesian estimation
- The Bayesian approach quantifies uncertainty at every step
- The parameters are treated as additional random variables with their own prior distribution p(θ)
- The observation likelihood is combined with the prior to obtain a posterior distribution via Bayes' theorem,
  \[ p(\theta|x_I) = \frac{p(x_I|\theta)\, p(\theta)}{p(x_I)} \]
  where I is the set indexing the observations
- The distribution of the observable x (the predictive distribution) is obtained as
  \[ p(x|x_I) = \int d\theta\, p(x|\theta)\, p(\theta|x_I) \]
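A minimal sketch (my own example, not from the slides): Bayesian estimation of the bias θ of a coin from Bernoulli observations, done numerically on a grid so that Bayes' theorem and the predictive integral appear explicitly; the Beta(a, b) prior gives a closed form used as a check.

```python
# Grid-based posterior and predictive for a coin bias, checked against the
# conjugate Beta-Bernoulli closed form.
import numpy as np

rng = np.random.default_rng(7)
theta_true = 0.7
x = rng.binomial(1, theta_true, size=20)        # observed coin flips
k, N = x.sum(), len(x)

theta = np.linspace(1e-6, 1 - 1e-6, 2001)       # grid over the parameter
a, b = 2.0, 2.0                                 # Beta(a, b) prior
prior = theta**(a - 1) * (1 - theta)**(b - 1)
likelihood = theta**k * (1 - theta)**(N - k)

posterior = prior * likelihood
posterior /= np.trapz(posterior, theta)         # normalise: divide by p(x_I)

# Predictive probability of the next flip being 1: integrate p(x=1|theta)
# against the posterior.
pred = np.trapz(theta * posterior, theta)
print(pred, (a + k) / (a + b + N))              # grid vs closed-form answer
```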

Exercise: Bayesian fitting of Gaussians
- Let the data x_i, i = 1, ..., N, be distributed according to a Gaussian with mean μ and variance σ²
- Let the prior distribution over the mean μ be a Gaussian with mean m and variance v²
- Compute the posterior and predictive distributions

Estimators
- A procedure to calculate an estimate is called an estimator; e.g., fitting a Gaussian to data by maximum likelihood provides the ML estimator for the mean and variance, as does the Bayesian posterior mean
- An estimator will be a noisy estimate of the true value, due to finite sample effects
- An estimator f̂ is unbiased if its expectation (under the joint distribution of the data set) coincides with the true value
- Exercise: show that the ML estimator of the variance is biased
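A minimal sketch of a numerical check for the exercise (the derivation is still the exercise): averaging the ML variance estimator over many simulated data sets shows its expectation sits near σ²(N−1)/N rather than σ².

```python
# Empirical bias of the ML (divide-by-N) variance estimator.
import numpy as np

rng = np.random.default_rng(8)
sigma2, N, trials = 4.0, 5, 200_000

data = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, N))
ml_var = data.var(axis=1, ddof=0)        # ML estimator: divide by N
unbiased = data.var(axis=1, ddof=1)      # divide by N-1

print(ml_var.mean(), sigma2 * (N - 1) / N)   # biased low
print(unbiased.mean(), sigma2)               # unbiased
```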

Learning as fitting distributions
- The world-view of this course is that a model consists of a set of random variables and probabilistic relationships describing their interactions
- Learning refers to computing the conditional distributions of subsets of the model given some observations
- Predictions are then carried out using the Bayesian predictive distribution

Supervised and unsupervised learning
- Slightly meaningless terms that are still heavily used
- The focus is on models involving more than one random variable
- In particular, supervised learning applies when the data come in the form of input-output pairs
- Supervised learning aims at learning the (probabilistic) functional relationship between the output and the input
- Unsupervised learning refers to learning purely the structure of the probability distribution underlying the data

Generative and discriminative models
- Supervised learning comes in two flavours, corresponding to two different types of question:
  - What is the joint probability of input/output pairs?
  - Given a new input, what will be the output?
- The first question requires a model of the population structure of the inputs and of the conditional probability of the output given the input: generative modelling
- The second question is more parsimonious but less explanatory: discriminative learning
- Notice that the difference between generative supervised learning and unsupervised learning is moot
