Transcription
Lecture 11: Bayesian Modeling (Part 1)
(Based on slides by Dr. Tom Rainforth, HT 2020)
Jiarui Gan (jiarui.gan@cs.ox.ac.uk)
Last Lecture: the Bayesian Pipeline
Last Lecture: Bayes' Rule

    P(B | A) = P(A | B) P(B) / P(A)

where P(A | B) is the likelihood, P(B) the prior, P(B | A) the posterior, and P(A) the evidence.

*Image Credit: Paul Epps
Last Lecture: Coin Flipping Example

Prior belief: P(biased) = 0.7, P(fair) = 0.3

Likelihoods:
- Biased coin: P(H | biased) = 0.2, P(T | biased) = 0.8
- Fair coin: P(H | fair) = 0.5, P(T | fair) = 0.5

Posterior after observing one head (H):
- P(biased | H) = (0.2 · 0.7) / (0.2 · 0.7 + 0.5 · 0.3) = 0.48
- P(fair | H) = (0.5 · 0.3) / (0.2 · 0.7 + 0.5 · 0.3) = 0.52

Posterior predictive for the next flip:
- P(H | H) = P(H | biased) P(biased | H) + P(H | fair) P(fair | H) ≈ 0.36
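The arithmetic on this slide can be checked in a few lines; a minimal sketch of the Bayes' rule update, using only the numbers given above (variable names are illustrative):

```python
# Prior belief over the two hypotheses
p_biased, p_fair = 0.7, 0.3
# Likelihood of observing a head (H) under each hypothesis
p_h_biased, p_h_fair = 0.2, 0.5

# Evidence: total probability of observing H
p_h = p_h_biased * p_biased + p_h_fair * p_fair

# Bayes' rule gives the posterior over hypotheses
p_biased_given_h = p_h_biased * p_biased / p_h   # ~0.48
p_fair_given_h = p_h_fair * p_fair / p_h         # ~0.52

# Posterior predictive: probability the next flip is also H,
# averaging each hypothesis's prediction by its posterior weight
p_h_next = p_h_biased * p_biased_given_h + p_h_fair * p_fair_given_h  # ~0.36
```

Note that the posterior predictive already has the "weighted sum over hypotheses" form discussed later in this lecture.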
Outline of This Lecture
- What is a Bayesian model?
- Bayesian modeling through the eyes of multiple hypotheses
- Example: Bayesian linear regression
What is a Bayesian Model?
What is a Model?
- Models are mechanisms for reasoning about the world
  - E.g. Newtonian mechanics, simulators, internal models our brain constructs
- Good models balance fidelity, predictive power and tractability
  - E.g. Quantum mechanics is a more accurate model than Newtonian mechanics, but it is actually less useful for everyday tasks

[Figure: example modeling domains, including particle physics, nuclear physics, weather, material design, drug discovery, cosmology]
Example Model: Poker Players' Reasoning about Each Other
What is a Bayesian Model?

A probabilistic generative model p(θ, D) over latents θ and data D
- It forms a probabilistic "simulator" for generating data that we might have seen
- Almost any stochastic simulator can be used as a Bayesian model (we will return to this idea in more detail when we cover probabilistic programming)
Example Bayesian Model: Captcha Simulator
- p(image | letters): the simulator generates an image from the letters
- p(letters | image): inference recovers the letters from the image
Example Bayesian Model: Gaussian Mixture Model
Example Bayesian Model: Gaussian Mixture Model

Gaussian 1: μ₁ = [−3, 3], Σ₁ = [[1, 0.7], [0.7, 1]]
Gaussian 2: μ₂ = [3, 3], Σ₂ = [[1, 0], [0, 1]]

Generative model:
    z ~ Categorical(0.5, 0.5)
    x ~ N(μ_z, Σ_z)

with the likelihood of a dataset factorizing over points: p(D | θ) = ∏_{n=1}^N p(x_n | θ)
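Because the model is generative, we can run it forwards as a simulator. A minimal NumPy sketch of sampling from this mixture (the helper name `sample_gmm` is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Mixture parameters from the slide
mus = [np.array([-3.0, 3.0]), np.array([3.0, 3.0])]
sigmas = [np.array([[1.0, 0.7], [0.7, 1.0]]), np.eye(2)]

def sample_gmm(n):
    """Draw n points from the two-component mixture:
    z ~ Categorical(0.5, 0.5), then x ~ N(mu_z, Sigma_z)."""
    zs = rng.choice(2, size=n, p=[0.5, 0.5])
    xs = np.stack([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return zs, xs

zs, xs = sample_gmm(1000)
```

The empirical means of each cluster should land near [−3, 3] and [3, 3] respectively.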
A Fundamental Assumption
- An assumption made by virtually all Bayesian models is that data points are conditionally independent given the parameter values.
- In other words, if our data is given by D = {x_n}_{n=1}^N, we assume that the likelihood factorizes as:

      p(D | θ) = ∏_{n=1}^N p(x_n | θ)

- Effectively equates to assuming that our model captures all information relevant to prediction
- For more details, see the lecture notes
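In log space this factorization means the log-likelihood is simply a sum of per-point terms. A small sketch, assuming a 1-D Gaussian likelihood with unit variance (the data values are illustrative):

```python
import numpy as np

def gaussian_logpdf(x, mu, sigma):
    """Log density of N(x; mu, sigma^2)."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu) ** 2 / (2 * sigma**2)

def log_likelihood(data, theta):
    # Conditional independence given theta means
    # log p(D | theta) = sum_n log p(x_n | theta)
    return sum(gaussian_logpdf(x, theta, 1.0) for x in data)

data = [0.1, -0.3, 0.5]
ll = log_likelihood(data, 0.0)
```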
"All models are wrong, but some are useful"
George Box (1919–2013)
"All models are wrong, but some are useful"
- The purpose of a model is to help provide insights into a target problem or data, and sometimes to further use these insights to make predictions
- Its purpose is not to try and fully encapsulate the "true" generative process or perfectly describe the data
- There are infinitely many different ways to generate any given dataset. Trying to uncover the "true" generative process is not even a well-defined problem
- In any real-world scenario, no Bayesian model can be "correct". The posterior is inherently subjective
- It is still important to criticize models, as they can be very wrong! E.g. we can use frequentist methods to falsify the likelihood
Bayesian Modeling Through the Eyes of Multiple Hypotheses
Bayesian Modeling as Multiple Hypotheses

Bayesian models are rooted in hypotheses:
- Each instance of our parameters θ is a hypothesis. Given a θ, we can simulate data using the likelihood model p(D | θ)
- Bayesian inference allows us to reason about these hypotheses, giving the probability that each is true given the actual data we observe
- The posterior predictive is a weighted sum of the predictions from all possible hypotheses, where the weights are how likely each hypothesis is to be true
Recap: Coin Flipping

Hypotheses (with prior belief P(biased) = 0.7, P(fair) = 0.3):
- Biased coin: P(H | biased) = 0.2, P(T | biased) = 0.8
- Fair coin: P(H | fair) = 0.5, P(T | fair) = 0.5

Posterior:
- P(biased | H) = (0.2 · 0.7) / (0.2 · 0.7 + 0.5 · 0.3) = 0.48
- P(fair | H) = (0.5 · 0.3) / (0.2 · 0.7 + 0.5 · 0.3) = 0.52

Posterior predictive:
- P(H | H) = P(H | biased) P(biased | H) + P(H | fair) P(fair | H) ≈ 0.36
Example: Density Estimation
- Suppose that we decide to use an isotropic Gaussian likelihood with unknown mean θ to model the data on the right:

      p(x | θ) = N(x; θ, I)

  where I is a two-dimensional identity matrix
Example: Density Estimation

- Hypothesis 1: θ = [−2, 0], with p(D | θ = [−2, 0]) = 0.00059 × 10⁻²⁷
- Hypothesis 2: θ = [0, 0], with p(D | θ = [0, 0]) = 0.99 × 10⁻²⁷ (highest likelihood)
- Hypothesis 3: θ = [2, 0], with p(D | θ = [2, 0]) = 0.021 × 10⁻²⁷
The Posterior Predictive Averages over Hypotheses (1)
- The posterior predictive distribution allows us to average over each of our hypotheses, weighting each by its posterior probability.
- For example, in our density estimation example, let's introduce a (rather unusual but demonstrative) prior that places probability mass only on the three hypothesis values above.
The Posterior Predictive Averages over Hypotheses (2)
- Then we have:

      p(x* | D) = Σ_θ p(x* | θ) p(θ | D)

  where the sum runs over the three hypothesis values of θ.
The Posterior Predictive Averages over Hypotheses (3)
- Inserting our likelihoods from earlier and trawling through the algebra gives the posterior weights on the three hypotheses.
- We thus have that the posterior predictive is a weighted sum of the three possible predictive distributions.
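This weighting is easy to carry out numerically. A sketch in NumPy: the uniform prior over the three hypotheses is an assumption for illustration (the slides use a different, deliberately unusual prior), and the likelihoods are used as relative weights since the shared scale factor cancels in normalization:

```python
import numpy as np

# Hypothesised means and their relative likelihoods from the slides
thetas = [np.array([-2.0, 0.0]), np.array([0.0, 0.0]), np.array([2.0, 0.0])]
likelihoods = np.array([0.00059, 0.99, 0.021])
prior = np.array([1 / 3, 1 / 3, 1 / 3])  # illustrative assumption

# Posterior weights: p(theta | D) proportional to p(D | theta) p(theta)
posterior = likelihoods * prior
posterior /= posterior.sum()

def predictive_density(x):
    """Posterior predictive: a posterior-weighted mixture of the
    per-hypothesis likelihoods N(x; theta, I) in 2-D."""
    densities = [np.exp(-0.5 * np.sum((x - t) ** 2)) / (2 * np.pi) for t in thetas]
    return float(np.dot(posterior, densities))
```

With these numbers almost all posterior mass lands on θ = [0, 0], so the predictive is dominated by that component.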
The Posterior Predictive Averages over Hypotheses
Some Subtleties
- Even though we average over θ, a Bayesian model is still implicitly assuming that there is a single true θ
  - The averaging over hypotheses comes from our own uncertainty as to which one is correct
  - This can be problematic with lots of data, given that our model is an approximation
- In the limit of large data, the posterior is guaranteed to collapse to a point estimate:

      p(θ | x_{1:N}) → δ(θ − θ*)  as  N → ∞

- The value of θ* and the exact nature of this convergence are dictated by the Bernstein–von Mises theorem (see the lecture notes)
- Note that, subject to mild assumptions, θ* is independent of the prior: with enough data, the likelihood always dominates the prior
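This collapse can be illustrated with a conjugate Beta-Bernoulli model (an example not taken from the slides; the true bias 0.3 and the sample sizes are illustrative assumptions). The posterior standard deviation shrinks roughly as 1/sqrt(N):

```python
import numpy as np

rng = np.random.default_rng(0)

def posterior_sd(a, b):
    """Standard deviation of a Beta(a, b) posterior."""
    return np.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Prior Beta(1, 1); data from a coin with assumed true bias 0.3.
# Posterior after observing the flips is Beta(1 + heads, 1 + tails).
sds = []
for n in [10, 100, 10000]:
    flips = rng.random(n) < 0.3
    a, b = 1 + flips.sum(), 1 + (~flips).sum()
    sds.append(posterior_sd(a, b))
```

As N grows the posterior mean approaches the true bias and the posterior concentrates towards a point mass, as the Bernstein-von Mises theorem predicts.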
Example:Bayesian Linear Regression
Linear Regression
- House size is a good linear predictor for price (ignore the colors)
- Learn a function that maps size to price
Linear Regression
- Inputs: x ∈ ℝ^D (where D = 1 for this example)
- Outputs: y ∈ ℝ
- Data: D = {(x_n, y_n)}_{n=1}^N
- Regression model: y = xᵀw + b, where w ∈ ℝ^D and b ∈ ℝ
- We can simplify this notation by redefining x ← [1, xᵀ]ᵀ and w ← [b, wᵀ]ᵀ, so that the model becomes y = xᵀw
- Classical least squares linear regression is a discriminative method aiming to minimize the empirical mean squared error:

      L(w) = (1/N) Σ_{n=1}^N (y_n − x_nᵀw)²
Linear Regression*Image credit: Pier Palamara
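A minimal least-squares sketch using NumPy, with synthetic data as an illustrative assumption. It uses the "absorb the bias" trick from the slide, prepending a constant-1 feature so that w[0] plays the role of b:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): y = 2x + 1 plus small Gaussian noise
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(0, 0.1, size=50)

X = np.column_stack([np.ones_like(x), x])   # x_n <- [1, x_n]
w = np.linalg.lstsq(X, y, rcond=None)[0]    # minimises the squared-error loss L(w)
```

The fitted w should recover roughly [1, 2], i.e. the intercept and slope used to generate the data.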
Bayesian Linear Regression
- Least squares provides a point estimate without uncertainty:

      ŵ = argmin_w L(w)

- The Bayesian method introduces uncertainty by building a probabilistic generative model based around linear regression and then being Bayesian about the weights
Bayesian Linear Regression: Prior and Likelihood
- For example, the prior on w is a zero-mean Gaussian with a fixed covariance C:

      p(w) = N(w; 0, C)

- And given input x, the output is y = xᵀw plus Gaussian noise, with data points independent of each other:

      p(y | x, w) = ∏_{n=1}^N p(y_n | x_n, w) = ∏_{n=1}^N N(y_n; x_nᵀw, σ²)

  where σ is a fixed standard deviation

*Image credit: Roger Grosse
Bayesian Linear Regression: Posterior
- Using Bayes' rule (and some math), the posterior can be derived in closed form. With X the matrix of stacked inputs and y the vector of outputs, it is again a Gaussian:

      p(w | X, y) = N(w; μ, Σ),  where  Σ = (C⁻¹ + σ⁻² XᵀX)⁻¹  and  μ = σ⁻² Σ Xᵀy

- See Bishop, Pattern Recognition and Machine Learning, 2006, Chapter 3
Bayesian Linear Regression: Posterior
- Note here that the fact the prior and posterior share the same form is a highly special case. This is known as a conjugate prior, and it is why we were able to find an analytic solution for the posterior.

[Figures: posterior after 1 observation; posterior after 2 observations]
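The conjugate update is a few lines of linear algebra. A sketch, assuming synthetic data and illustrative values for C and σ (the closed-form expressions follow the standard conjugate Gaussian result, as in Bishop, Chapter 3):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, C = 0.2, np.eye(2)   # illustrative noise level and prior covariance

# Synthetic data (illustrative): true weights [b, w] = [0.5, -1.0]
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([0.5, -1.0]) + rng.normal(0, sigma, size=30)

# Conjugate posterior over w:
#   Sigma_N = (C^{-1} + sigma^{-2} X^T X)^{-1}
#   mu_N    = sigma^{-2} Sigma_N X^T y
Sigma_N = np.linalg.inv(np.linalg.inv(C) + X.T @ X / sigma**2)
mu_N = Sigma_N @ X.T @ y / sigma**2
```

With 30 observations the posterior mean is already close to the generating weights, and the posterior covariance quantifies the remaining uncertainty.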
Bayesian Linear Regression: Posterior*Bishop, Pattern recognition and machine learning.
Bayesian Linear Regression: Posterior Predictive
- Some more math derives the posterior predictive, where the result is again a consequence of Gaussian identities:

      p(y* | x*, X, y) = N(y*; x*ᵀμ, σ² + x*ᵀΣx*)

  and μ and Σ are as before
Bayesian Linear Regression: Posterior Predictive*Image credit: https://www.dataminingapps.com/2017/09/simple- linear- regression- do- it- the- bayesian- way/
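The predictive mean and variance are cheap to evaluate once the posterior is known. A self-contained sketch, again with synthetic data and illustrative hyperparameters; note how the predictive variance grows away from the observed inputs:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, C = 0.2, np.eye(2)   # illustrative noise level and prior covariance

# Synthetic data (illustrative) and the conjugate posterior over w
x = rng.uniform(-1, 1, size=30)
X = np.column_stack([np.ones_like(x), x])
y = X @ np.array([0.5, -1.0]) + rng.normal(0, sigma, size=30)
Sigma_N = np.linalg.inv(np.linalg.inv(C) + X.T @ X / sigma**2)
mu_N = Sigma_N @ X.T @ y / sigma**2

def predict(x_star):
    """Posterior predictive mean and variance at a new input x_star."""
    phi = np.array([1.0, x_star])
    mean = phi @ mu_N
    var = sigma**2 + phi @ Sigma_N @ phi   # noise plus parameter uncertainty
    return mean, var

m0, v0 = predict(0.0)   # inside the data range
m2, v2 = predict(3.0)   # far outside it: larger predictive variance
```

The variance never drops below σ², since the observation noise is irreducible; the x*ᵀΣx* term adds the uncertainty from not knowing w exactly.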
Further Reading
- Information on non-parametric models and Gaussian processes in the course notes
- Bishop. Pattern Recognition and Machine Learning. Chapters 1-3
- K. P. Murphy. Machine Learning: A Probabilistic Perspective. 2012, Chapter 5
- D. Barber. Bayesian Reasoning and Machine Learning. 2012, Chapter 12
- T. P. Minka. "Bayesian model averaging is not model combination". 2000
- Zoubin Ghahramani on Bayesian machine learning (there are various alternative variations of this talk): https://www.youtube.com/watch?v=y0FgHOQhG4w
- Iain Murray on Probabilistic Modeling: https://www.youtube.com/watch?v=pOtvyVYAuW4