Machine Learning: A Probabilistic Perspective Solution Manual


Machine Learning: A Probabilistic Perspective
Solution Manual, Version 1.2
Fangqi Li, SJTU

Contents

1 Introduction
    1.1 A Brief Introduction
    1.2 On Machine Learning: A Probabilistic Perspective
    1.3 Constitutions of this Document
    1.4 Updating log

2 Probability
    2.1 Probabilities are sensitive to the form of the question that was used to generate the answer
    2.2 Legal reasoning
    2.3 Variance of a sum
    2.4 Bayes rule for medical diagnosis
    2.5 The Monty Hall problem (The dilemma of three doors)
    2.6 Conditional Independence
    2.7 Pairwise independence does not imply mutual independence
    2.8 Conditional independence iff joint factorizes
    2.9 Conditional independence
    2.10 Deriving the inverse gamma density
    2.11 Normalization constant for a 1D Gaussian
    2.12 Expressing mutual information in terms of entropies
    2.13 Mutual information for correlated normals
    2.14 A measure of correlation
    2.15 MLE minimizes KL divergence to the empirical distribution
    2.16 Mean, mode, variance for the beta distribution
    2.17 Expected value of the minimum

3 Generative models for discrete data
    3.1 MLE for the Bernoulli/binomial model
    3.2 Marginal likelihood for the Beta-Bernoulli model
    3.3 Posterior predictive for Beta-Binomial model
    3.4 Beta updating from censored likelihood
    3.5 Uninformative prior for log-odds ratio
    3.6 MLE for the Poisson distribution
    3.7 Bayesian analysis of the Poisson distribution
    3.8 MLE for the uniform distribution
    3.9 Bayesian analysis of the uniform distribution
    3.10 Taxicab problem
    3.11 Bayesian analysis of the exponential distribution
    3.12 MAP estimation for the Bernoulli with non-conjugate priors
    3.13 Posterior predictive distribution for a batch of data with the dirichlet-multinomial model
    3.14 Posterior predictive for Dirichlet-multinomial
    3.15 Setting the hyper-parameters I
    3.16 Setting the beta hyper-parameters II
    3.17 Marginal likelihood for beta-binomial under uniform prior
    3.18 Bayes factor for coin tossing*
    3.19 Irrelevant features with naive Bayes
    3.20 Class conditional densities for binary data
    3.21 Mutual information for naive Bayes classifiers with binary features
    3.22 Fitting a naive Bayesian spam filter by hand*

4 Gaussian models
    4.1 Uncorrelated does not imply independent
    4.2 Uncorrelated and Gaussian does not imply independent unless jointly Gaussian
    4.3 Correlation coefficient is between -1 and 1
    4.4 Correlation coefficient for linearly related variables is 1 or -1
    4.5 Normalization constant for a multidimensional Gaussian
    4.6 Bivariate Gaussian
    4.7 Conditioning a bivariate Gaussian
    4.8 Whitening vs standardizing*
    4.9 Sensor fusion with known variances in 1d
    4.10 Derivation of information form formulae for marginalizing and conditioning
    4.11 Derivation of the NIW posterior
    4.12 BIC for Gaussians
    4.13 Gaussian posterior credible interval
    4.14 MAP estimation for 1d Gaussians
    4.15 Sequential (recursive) updating of covariance matrix
    4.16 Likelihood ratio for Gaussians
    4.17 LDA/QDA on height/weight data*
    4.18 Naive Bayes with mixed features
    4.19 Decision boundary for LDA with semi tied covariances
    4.20 Logistic regression vs LDA/QDA
    4.21 Gaussian decision boundaries*
    4.22 QDA with 3 classes*
    4.23 Scalar QDA*

5 Bayesian statistics
    5.1 Proof that a mixture of conjugate priors is indeed conjugate
    5.2 Optimal threshold on classification probability
    5.3 Reject option in classifiers
    5.4 More reject options*
    5.5 Newsvendor problem
    5.6 Bayes factors and ROC curves*
    5.7 Bayes model averaging helps predictive accuracy
    5.8 MLE and model selection for a 2d discrete distribution
    5.9 Posterior median is optimal estimate under L1 loss
    5.10 Decision rule for trading off FPs and FNs

6 Frequentist statistics*

7 Linear regression
    7.1 Behavior of training set error with increasing sample size
    7.2 Multi-output linear regression
    7.3 Centering and ridge regression
    7.4 MLE for σ² for linear regression
    7.5 MLE for the offset term in linear regression
    7.6 MLE for simple linear regression*
    7.7 Sufficient statistics for online linear regression
    7.8 Bayesian linear regression in 1d with known σ²
    7.9 Generative model for linear regression
    7.10 Bayesian linear regression using the g-prior

8 Logistic regression
    8.1 Spam classification using logistic regression*
    8.2 Spam classification using naive Bayes*
    8.3 Gradient and Hessian of log-likelihood for logistic regression
    8.4 Gradient and Hessian of log-likelihood for multinomial logistic regression
    8.5 Symmetric version of l2 regularized multinomial logistic regression
    8.6 Elementary properties of l2 regularized logistic regression
    8.7 Regularizing separate terms in 2d logistic regression*

9 Generalized linear models and the exponential family
    9.1 Conjugate prior for univariate Gaussian in exponential family form
    9.2 The MVN is in the exponential family

10 Directed graphical models (Bayes nets)

11 Mixture models and the EM algorithm
    11.1 Student T as infinite mixture of Gaussian
    11.2 EM for mixture of Gaussians
    11.3 EM for mixtures of Bernoullis
    11.4 EM for mixture of Student distributions
    11.5 Gradient descent for fitting GMM
    11.6 EM for a finite scale mixture of Gaussians
    11.7 Manual calculation of the M step for a GMM*
    11.8 Moments of a mixture of Gaussians
    11.9 K-means clustering by hand*
    11.10 Deriving the K-means cost function
    11.11 Visible mixtures of Gaussians are in exponential family
    11.12 EM for robust linear regression with a Student t likelihood
    11.13 EM for EB estimation of Gaussian shrinkage model
    11.14 EM for censored linear regression*
    11.15 Posterior mean and variance of a truncated Gaussian

12 Latent linear models
    12.1 M-step for FA
    12.2 MAP estimation for the FA model
    12.3 Heuristic for assessing applicability of PCA*
    12.4 Deriving the second principal component
    12.5 Deriving the residual error for PCA
    12.6 Derivation of Fisher's linear discriminant*
    12.7 PCA via successive deflation*
    12.8 Latent semantic indexing*
    12.9 Imputation in a FA model*
    12.10 Efficiently evaluating the PPCA density
    12.11 PPCA vs FA*

13 Sparse linear models
    13.1 Partial derivative of the RSS
    13.2 Derivation of M-step for EB for linear regression
    13.3 Derivation of fixed point updates for EB for linear regression*
    13.4 Marginal likelihood for linear regression*
    13.5 Reducing elastic net to lasso
    13.6 Shrinkage in linear regression
    13.7 Prior for the Bernoulli rate parameter in the spike and slab model
    13.8 Deriving E step for GSM prior
    13.9 EM for sparse probit regression with Laplace prior
    13.10 GSM representation of group lasso*
    13.11 Projected gradient descent for l1 regularized least squares
    13.12 Subderivative of the hinge loss function
    13.13 Lower bounds to convex functions

14 Kernels

15 Gaussian processes
    15.1 Reproducing property

16 Adaptive basis function models*
    16.1 Nonlinear regression for inverse dynamics

17 Markov and hidden Markov models
    17.1 Derivation of Q function for HMM
    17.2 Two filter approach to smoothing in HMMs
    17.3 EM for HMMs with mixture of Gaussian observations
    17.4 EM for HMMs with tied mixtures

18 State space models
    18.1 Derivation of EM for LG-SSM
    18.2 Seasonal LG-SSM model in standard form

19 Undirected graphical models (Markov random fields)
    19.1 Derivation of the log partition function
    19.2 CI properties of Gaussian graphical models
    19.3 Independencies in Gaussian graphical models
    19.4 Cost of training MRFs and CRFs
    19.5 Full conditional in an Ising model

20 Exact inference for graphical models
    20.1 Variable elimination*
    20.2 Gaussian times Gaussian is Gaussian
    20.3 Message passing on a tree
    20.4 Inference in 2D lattice MRFs

21 Variational inference
    21.1 Laplace approximation to p(µ, log σ | D) for a univariate Gaussian
    21.2 Laplace approximation to normal-gamma
    21.3 Variational lower bound for VB for univariate Gaussian
    21.4 Variational lower bound for VB for GMMs
    21.5 Derivation of E[log πk]
    21.6 Alternative derivation of the mean field updates for the Ising model
    21.7 Forwards vs reverse KL divergence
    21.8 Derivation of the structured mean field updates for FHMM
    21.9 Variational EM for binary FA with sigmoid link
    21.10 VB for binary FA with probit link

1 Introduction

1.1 A Brief Introduction

Here we should have demonstrated the solutions to the problems of Chapter One in Machine Learning: A Probabilistic Perspective (MLAPP). Since the number of problems in that chapter is zero, we use this section as an introduction to this document, i.e. a solution manual.

This document provides detailed solutions to almost all problems of the textbook MLAPP, from Chapter One to Chapter Fourteen (Chinese version) / Twenty-one (English version). We generally leave the restatement of the problems to the readers themselves.

There are two classes of problems in MLAPP: theoretical derivations and practical projects. We provide solutions to most derivation problems, apart from those which are nothing but straightforward algebra (and a few which we failed to solve). Practical problems, which are based on a Matlab toolbox, are beyond the scope of this document.

1.2 On Machine Learning: A Probabilistic Perspective

Booming studies and literature have made the boundary of "machine learning" vague.

On one hand, the rapid development of AI technology has kept society astonished, which has also resulted in a sharp increase in the number of students trying to take related courses in colleges. On the other hand, some scholars remain uncertain about learning-related theories, especially deep learning.

The extraordinary achievements of machine learning in recent years often make one forget that this discipline has undergone a long evolution, whose establishment dates back at least to the studies of "electronic brains" in the 1940s. Be that as it may, machine learning has not yet been settled as a "closed" theory. Even within some research communities, machine learning is dismissed as metaphysics or alchemy. Personally, I believe that being called metaphysics is a common experience shared by many branches of theory undergoing their most rapid development.

To become a complete theory, machine learning is still looking for a way to organize itself into a closed system. The most successful attempt so far has been the one based on probability. As commented by David Blei from Princeton on the back cover of MLAPP: "In Machine Learning, the language of probability and statistics reveals important connections between seemingly disparate algorithms and strategies. Thus, its readers will become articulate in a holistic view of the state-of-art and poised to build the next generation of machine learning algorithms."

The crucial idea in MLAPP is that machine learning is tantamount to Bayesian statistics, which draws connections between numerous "independent" algorithms. But the history of Bayesian statistics (which can be traced back to the days of Laplace) far outlengths that of machine learning. MLAPP is not novel in holding such a view: C. M. Bishop's Pattern Recognition and Machine Learning (PRML) is another typical example. Both are considered classical textbooks on elementary machine learning.

In general, MLAPP reduces the difficulty of the entire book at the expense of some completeness in the derivations (for the first seven chapters). It covers a wider range of models and is suitable for those with a background in mathematical tools. The chapters concerning classical probabilistic models (e.g. chapters 2, 3, 4, 5, 7, 8, 11, 12) are comparable to PRML, but owing to the reordering and the additional details, they are worth a read even for one who has finished reading PRML.

1.3 Constitutions of this Document

The motivation for writing this document is that I needed to read the textbook MLAPP after selecting a machine learning course, but I failed to find any freely available compiled solution manual. Although several GitHub projects have started working on one, progress has been slow. I also wanted to focus more on the theoretical part of the text rather than on the implementation code.

Hence I began working on this document. The first version (Chapter One to Chapter Fourteen) was completed within the two weeks before the official semester. Because of the hurried process, readers are advised to read from a critical perspective and not to believe everything I have written down without checking. In the end, I hope that readers can provide comments and suggest revisions. Apart from correcting wrong answers, those who are good at using MATLAB or LaTeX typesetting, or who are willing to participate in improving the document, are always welcome to contact me.

22/10/2017
Fangqi Li
Munich, Germany
solour_lfq@sjtu.edu.cn
ge72bug@tum.de

1.4 Updating log

22/10/2017 (First Chinese compilation)
02/03/2018 (English compilation)
06/03/2018 (Begin revising)
24/03/2018 (First revision)

2 Probability

2.1 Probabilities are sensitive to the form of the question that was used to generate the answer

Denote the two children by A and B. We use the following notation:

E1: A is a boy, B is a girl
E2: B is a boy, A is a girl
E3: A is a boy, B is a boy

In question a:

P(E1) = P(E2) = P(E3) = 1/4

P(one girl, one boy | at least one boy) = (P(E1) + P(E2)) / (P(E1) + P(E2) + P(E3)) = 2/3

For question b, w.l.o.g., assume child A is the one observed:

P(one girl, one boy) = P(B is a girl | A is a boy) = 1/2

2.2 Legal reasoning

Let E1 denote the event "the defendant committed the crime" and E2 the event "the defendant has the special blood type". Thus:

p(E1 | E2) = p(E1, E2) / p(E2) = p(E2 | E1) p(E1) / p(E2) = (1 · 1/800000) / (1/100) = 1/8000

2.3 Variance of a sum

By straightforward calculation:

var[X + Y] = E[(X + Y)²] − E²[X + Y]
           = E[X²] − E²[X] + E[Y²] − E²[Y] + 2E[XY] − 2E[X]E[Y]
           = var[X] + var[Y] + 2cov[X, Y]
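The identity in exercise 2.3 is easy to spot-check by simulation. Below is a minimal Python/NumPy sketch; the particular correlated pair (a normal plus a gamma term) is an arbitrary illustrative choice, not part of the exercise.

```python
import numpy as np

# Monte Carlo check of var[X + Y] = var[X] + var[Y] + 2 cov[X, Y]
# on an arbitrary correlated pair (illustrative choice of distributions).
rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y = 0.6 * x + rng.gamma(shape=2.0, scale=1.0, size=x.size)  # correlated with x

lhs = np.var(x + y)
rhs = np.var(x) + np.var(y) + 2 * np.cov(x, y, bias=True)[0, 1]
print(lhs, rhs)  # the two numbers agree up to Monte Carlo noise
```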

2.4 Bayes rule for medical diagnosis

We use Ei, Eh and Ep to denote the events that one is ill, that one is healthy, and that one is detected as positive. Applying Bayes' rule:

P(Ei | Ep) = P(Ei) P(Ep | Ei) / (P(Ei) P(Ep | Ei) + P(Eh) P(Ep | Eh)) = 0.0098

2.5 The Monty Hall problem (The dilemma of three doors)

The answer is b. We use Ea,i to denote the event that something happens to the ith box, where a can be p (the prize is in the ith box), c (the gamer picks the ith box), or o (the host opens the ith box). We assume the participant chooses the first box and the host reveals the third one. Applying Bayes' rule:

P(Ep,1 | Ec,1, Eo,3) = P(Ec,1) P(Ep,1) P(Eo,3 | Ep,1, Ec,1) / (P(Ec,1) P(Eo,3 | Ec,1))
                     = P(Ep,1) P(Eo,3 | Ep,1, Ec,1) / P(Eo,3 | Ec,1)
                     = (1/3 · 1/2) / (1/3 · 1/2 + 1/3 · 0 + 1/3 · 1)
                     = 1/3

In the last step we sum over the potential locations of the prize. We conclude that it is always better to switch to another box after the host reveals one.

2.6 Conditional Independence

In question a, we have:

p(H | e1, e2) = p(H) p(e1, e2 | H) / p(e1, e2)

Thus the answer is (ii).

For question b, we have the further decomposition:

p(H | e1, e2) = p(H) p(e1 | H) p(e2 | H) / p(e1, e2)

So both (i) and (ii) are obviously sufficient. Since:

p(e1, e2) = Σ_H p(e1, e2, H) = Σ_H p(H) p(e1 | H) p(e2 | H)

(iii) is sufficient as well, because we can compute p(e1, e2) from it.

2.7 Pairwise independence does not imply mutual independence

Let us assume three boolean variables x1, x2, x3, where x1 and x2 independently take the values 0 or 1 with equal probability:

p(x1, x2) = p(x1) p(x2),  p(x1 = 0) = p(x2 = 0) = 1/2

and x3 = XOR(x1, x2). Now we have:

p(x3 = 1) = Σ_{x1,x2} p(x1, x2) p(x3 = 1 | x1, x2) = 1/4 · 0 + 1/4 · 0 + 1/4 · 1 + 1/4 · 1 = 1/2

Also:

p(x3 = 1, x1 = 1) = Σ_{x2} p(x2) p(x3 = 1, x1 = 1 | x2)
                  = Σ_{x2} p(x2) p(x1 = 1) p(x3 = 1 | x1 = 1, x2)
                  = 1/4 = p(x3 = 1) · p(x1 = 1)

Thus x3 is pairwise independent of x1 and of x2. But given x1 and x2, x3 is uniquely determined, so mutual independence fails.
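Because the sample space is tiny, the XOR construction of exercise 2.7 can also be verified by brute-force enumeration. The sketch below (plain Python, with an ad hoc marginal() helper) checks pairwise independence and exhibits the failure of mutual independence.

```python
import itertools

# Joint distribution of (x1, x2, x3) with x3 = XOR(x1, x2) and
# x1, x2 independent fair bits.
p = {}
for x1, x2 in itertools.product([0, 1], repeat=2):
    p[(x1, x2, x1 ^ x2)] = 0.25

def marginal(*idx_vals):
    # probability that every listed (index, value) pair holds
    return sum(prob for xs, prob in p.items()
               if all(xs[i] == v for i, v in idx_vals))

# Pairwise independence: p(xi, xj) = p(xi) p(xj) for every pair and values
for i, j in [(0, 1), (0, 2), (1, 2)]:
    for a, b in itertools.product([0, 1], repeat=2):
        assert abs(marginal((i, a), (j, b)) - marginal((i, a)) * marginal((j, b))) < 1e-12

# Mutual independence fails: p(x1, x2, x3) != p(x1) p(x2) p(x3) in general
print(p[(1, 1, 0)], marginal((0, 1)) * marginal((1, 1)) * marginal((2, 0)))  # 0.25 vs 0.125
```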

2.8 Conditional independence iff joint factorizes

We prove that 2.129 is equivalent to 2.130.

Firstly, if x ⊥ y | z, then setting

g(x, z) = p(x | z),  h(y, z) = p(y | z)

gives the first half of the proof.

Secondly, suppose p(x, y | z) = g(x, z) h(y, z). We have:

p(x | z) = Σ_y p(x, y | z) = Σ_y g(x, z) h(y, z) = g(x, z) Σ_y h(y, z)

p(y | z) = h(y, z) Σ_x g(x, z)

and:

1 = Σ_{x,y} p(x, y | z) = (Σ_x g(x, z)) (Σ_y h(y, z))

Thus:

p(x | z) p(y | z) = g(x, z) h(y, z) (Σ_x g(x, z)) (Σ_y h(y, z)) = g(x, z) h(y, z) = p(x, y | z)

2.9 Conditional independence

Viewed graphically, both statements appear correct. But in general the joint distribution need not admit the corresponding factorized form, so both statements are false.

2.10 Deriving the inverse gamma density

According to:

p(y) = p(x) |dx/dy|

we easily have:

IG(y) = Ga(x) · y⁻² = (bᵃ/Γ(a)) (1/y)^(a−1) e^(−b/y) y⁻² = (bᵃ/Γ(a)) y^(−(a+1)) e^(−b/y)

2.11 Normalization constant for a 1D Gaussian

This proof can be found in any textbook on multivariable calculus. Omitted here.

2.12 Expressing mutual information in terms of entropies

I(X; Y) = Σ_{x,y} p(x, y) log [p(x, y) / (p(x) p(y))]
        = Σ_{x,y} p(x, y) log [p(x | y) / p(x)]
        = Σ_{x,y} p(x, y) log p(x | y) − Σ_x (Σ_y p(x, y)) log p(x)
        = −H(X | Y) + H(X)

Exchanging X and Y yields the other formula.
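The identity of exercise 2.12 is easy to confirm on a concrete example. Below is a minimal NumPy sketch; the 2x3 joint table is made up purely for illustration.

```python
import numpy as np

# Check I(X;Y) = H(X) - H(X|Y) on an arbitrary discrete joint distribution.
p_xy = np.array([[0.10, 0.25, 0.15],
                 [0.20, 0.05, 0.25]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

H = lambda p: -np.sum(p * np.log(p))
mi_direct = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))
H_x_given_y = -np.sum(p_xy * np.log(p_xy / p_y))   # p(x|y) = p(x,y)/p(y), broadcast by column
print(mi_direct, H(p_x) - H_x_given_y)             # identical up to rounding
```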

2.13 Mutual information for correlated normals

We have:

I(X1; X2) = H(X1) − H(X1 | X2) = H(X1) + H(X2) − H(X1, X2)
          = (1/2) log(2πeσ²) + (1/2) log(2πeσ²) − (1/2) log((2πe)² σ⁴ (1 − ρ²))
          = −(1/2) log(1 − ρ²)

(refer to Elements of Information Theory, Example 8.5.1)

We also give the derivation of the Gaussian entropy here:

h = −∫ p(x) ln p(x) dx
  = −∫ p(x) [−(n/2) ln 2π − (1/2) ln|Σ| − (1/2)(x − µ)ᵀ Σ⁻¹ (x − µ)] dx
  = (n/2) ln 2π + (1/2) ln|Σ| + (1/2) ∫ p(x) (x − µ)ᵀ Σ⁻¹ (x − µ) dx
  = (n/2) ln 2π + (1/2) ln|Σ| + (1/2) E_p[Σ_{i,j} (x_i − µ_i) Σ⁻¹_{ij} (x_j − µ_j)]
  = (n/2) ln 2π + (1/2) ln|Σ| + (1/2) Σ_{i,j} E_p[(x_i − µ_i)(x_j − µ_j)] Σ⁻¹_{ij}
  = (n/2) ln 2π + (1/2) ln|Σ| + (1/2) Σ_{i,j} Σ_{ji} Σ⁻¹_{ij}
  = (n/2) ln 2π + (1/2) ln|Σ| + (1/2) tr(Σ Σ⁻¹)
  = (n/2) ln 2π + (1/2) ln|Σ| + n/2
  = (n/2) ln(2πe) + (1/2) ln|Σ|

The trick here is to use the definition of covariance together with the trace.
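We can cross-check the result of exercise 2.13 against the Gaussian entropy formula just derived. In the sketch below, sigma and rho are arbitrary illustrative values.

```python
import numpy as np

# Compare -0.5*log(1 - rho^2) with H(X1) + H(X2) - H(X1, X2), where each
# entropy is computed from h = 0.5*log((2*pi*e)^d * det(Sigma)).
sigma, rho = 1.3, 0.8
Sigma = sigma**2 * np.array([[1.0, rho],
                             [rho, 1.0]])

h1 = 0.5 * np.log(2 * np.pi * np.e * Sigma[0, 0])
h2 = 0.5 * np.log(2 * np.pi * np.e * Sigma[1, 1])
h12 = 0.5 * np.log((2 * np.pi * np.e) ** 2 * np.linalg.det(Sigma))

print(h1 + h2 - h12, -0.5 * np.log(1 - rho**2))   # both ≈ 0.5108
```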

2.14 A measure of correlation

In question a:

r = 1 − H(Y | X)/H(X) = (H(X) − H(Y | X))/H(X) = (H(Y) − H(Y | X))/H(X) = I(X; Y)/H(X)

where we used H(Y) = H(X).

For question b we have 0 ≤ r ≤ 1, since 0 ≤ H(Y | X) ≤ H(Y) = H(X) (properties of entropy).

r = 0 iff X and Y are independent.

r = 1 iff X is determined by Y (though not necessarily equal to Y).

2.15 MLE minimizes KL divergence to the empirical distribution

Expand the KL divergence:

θ̂ = arg min_θ KL(p_emp || q(θ))
  = arg min_θ E_{p_emp}[log(p_emp / q(θ))]
  = arg min_θ { −H(p_emp) − ∫ p_emp(x) log q(x; θ) dx }
  = arg min_θ { −H(p_emp) − (1/N) Σ_{x ∈ dataset} log q(x; θ) }
  = arg max_θ Σ_{x ∈ dataset} log q(x; θ)

and we end up with the MLE. In the fourth step we used the fact that an expectation under the empirical distribution is simply an average over the dataset, and in the last step we dropped the entropy of the empirical distribution, which does not depend on θ.

2.16 Mean, mode, variance for the beta distribution

Firstly, derive the mode of the beta distribution by differentiating the pdf:

d/dx [x^(a−1) (1 − x)^(b−1)] = [(a − 1)(1 − x) − (b − 1)x] x^(a−2) (1 − x)^(b−2)

Setting this to zero yields:

mode = (a − 1)/(a + b − 2)

Secondly, derive the moments of the beta distribution:

E[x^N] = (1/B(a, b)) ∫ x^(a+N−1) (1 − x)^(b−1) dx
       = B(a + N, b)/B(a, b)
       = [Γ(a + N) Γ(b)/Γ(a + N + b)] · [Γ(a + b)/(Γ(a) Γ(b))]

Setting N = 1, 2:

E[x] = a/(a + b)
E[x²] = a(a + 1)/((a + b)(a + b + 1))

where we have used the property of the Gamma function, Γ(z + 1) = z Γ(z). Straightforward algebra gives:

mean = E[x] = a/(a + b)
variance = E[x²] − E²[x] = ab/((a + b)²(a + b + 1))

2.17 Expected value of the minimum

Let m denote the location of the left-most point. We have:

p(m > x) = p(X > x and Y > x) = p(X > x) p(Y > x) = (1 − x)²

Therefore:

E[m] = ∫ x · p(m = x) dx = ∫ p(m > x) dx = ∫₀¹ (1 − x)² dx = 1/3
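Both results above are easy to confirm numerically. The sketch below (NumPy and SciPy assumed) checks the beta moments of exercise 2.16 against scipy.stats and estimates the expectation of exercise 2.17 by Monte Carlo; the hyper-parameters a, b are arbitrary illustrative values.

```python
import numpy as np
from scipy import stats

# Exercise 2.16: closed-form mean/variance/mode vs scipy's beta distribution.
a, b = 3.0, 5.0
mean, var = a / (a + b), a * b / ((a + b) ** 2 * (a + b + 1))
print(mean, var)
print(stats.beta.stats(a, b, moments='mv'))        # same mean and variance
print((a - 1) / (a + b - 2))                       # mode

# Exercise 2.17: E[min(X, Y)] = 1/3 for X, Y ~ U(0, 1), via Monte Carlo.
rng = np.random.default_rng(0)
xy = rng.uniform(size=(2, 1_000_000))
print(xy.min(axis=0).mean())                       # ≈ 0.3333
```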

3 Generative models for discrete data

3.1 MLE for the Bernoulli/binomial model

Likelihood:

p(D | θ) = θ^N1 (1 − θ)^N0

Log-likelihood:

ln p(D | θ) = N1 ln θ + N0 ln(1 − θ)

Setting the derivative to zero:

∂/∂θ ln p(D | θ) = N1/θ − N0/(1 − θ) = 0

This ends in 3.22:

θ̂ = N1/(N1 + N0) = N1/N

3.2 Marginal likelihood for the Beta-Bernoulli model

Likelihood:

p(D | θ) = θ^N1 (1 − θ)^N0

Prior distribution:

p(θ | a, b) = Beta(θ | a, b) ∝ θ^(a−1) (1 − θ)^(b−1)

Posterior distribution:

p(θ | D) ∝ p(D | θ) · p(θ | a, b) ∝ θ^(N1+a−1) (1 − θ)^(N0+b−1) ∝ Beta(θ | N1 + a, N0 + b)

Predictive distribution:

p(x_new = 1 | D) = ∫ p(x_new = 1 | θ) · p(θ | D) dθ = ∫ θ p(θ | D) dθ = E(θ) = (N1 + a)/(N1 + a + N0 + b)

Calculate p(D) where D = 1, 0, 0, 1, 1:

p(D) = p(x1) p(x2 | x1) p(x3 | x2, x1) ... p(x_N | x_{N−1}, ..., x1)
     = [a/(a + b)] · [b/(a + b + 1)] · [(b + 1)/(a + b + 2)] · [(a + 1)/(a + b + 3)] · [(a + 2)/(a + b + 4)]

Denoting α = a + b, α1 = a, α0 = b, we have 3.83. To derive 3.80, we make use of:

α1 (α1 + 1) ... (α1 + N1 − 1) = (α1 + N1 − 1)!/(α1 − 1)! = Γ(α1 + N1)/Γ(α1)

3.3 Posterior predictive for Beta-Binomial model

Straightforward algebra:

Bb(1 | α1', α0', 1) = B(α1' + 1, α0')/B(α1', α0')
                    = [Γ(α1' + 1)/Γ(α1')] · [Γ(α0' + α1')/Γ(α0' + α1' + 1)]
                    = α1'/(α1' + α0')

3.4 Beta updating from censored likelihood

The derivation is straightforward:

p(θ, X < 3) = p(θ) p(X < 3 | θ) = p(θ) (p(X = 1 | θ) + p(X = 2 | θ)) = Beta(θ | 1, 1) (Bin(1 | 5, θ) + Bin(2 | 5, θ))

3.5 Uninformative prior for log-odds ratio

Since:

φ = log [θ/(1 − θ)]

by using the change of variables formula:

p(θ) = p(φ) |dφ/dθ| ∝ 1/[θ(1 − θ)]

Hence:

p(θ) = Beta(θ | 0, 0)
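As a sanity check on the chain-rule computation of p(D) in exercise 3.2 above, the product should coincide with the closed form p(D) = B(a + N1, b + N0)/B(a, b). The sketch below verifies this with SciPy; the choice a = b = 2 is an arbitrary assumption.

```python
from scipy.special import beta as beta_fn

# Exercise 3.2: chain-rule product for D = 1,0,0,1,1 vs the closed form
# p(D) = B(a + N1, b + N0) / B(a, b), with N1 = 3 ones and N0 = 2 zeros.
a, b = 2.0, 2.0   # illustrative prior hyper-parameters
seq = (a / (a + b)) * (b / (a + b + 1)) * ((b + 1) / (a + b + 2)) \
      * ((a + 1) / (a + b + 3)) * ((a + 2) / (a + b + 4))
closed = beta_fn(a + 3, b + 2) / beta_fn(a, b)
print(seq, closed)  # identical up to floating point
```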

3.6 MLE for the Poisson distribution

Likelihood (we write Poi in the condition to represent the fact that we have knowledge about the form of the distribution):

p(D | Poi, λ) = Π_{n=1}^N Poi(x_n | λ) = exp(−λN) · λ^(Σ_{n=1}^N x_n) · Π_{n=1}^N (1/x_n!)

Setting the derivative of the log-likelihood to zero:

∂/∂λ ln p(D | Poi, λ) = −N + (Σ_{n=1}^N x_n)/λ = 0

Thus:

λ̂ = (Σ_{n=1}^N x_n)/N

3.7 Bayesian analysis of the Poisson distribution

We have:

p(λ | D) ∝ p(λ) p(D | λ) ∝ exp(−λ(N + b)) · λ^(Σ_{n=1}^N x_n + a − 1) ∝ Gamma(a + Σ_n x_n, N + b)

This prior amounts to introducing b prior observations with mean a/b.

3.8 MLE for the uniform distribution

The likelihood goes to zero if a < max_n |x_n|, so we must have â ≥ max_n |x_n|. The likelihood looks like:

p(D | a) = Π_{n=1}^N 1/(2a)

which is decreasing in a, so:

â = max_{i=1}^n |x_i|

This model assigns p(x_{N+1}) = 0 if |x_{N+1}| > max_i |x_i|, which causes a discontinuity in the predictive distribution and is not a desirable feature.
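The conjugate update of exercise 3.7 can be verified against a brute-force grid posterior. In the sketch below, the simulated data, the prior hyper-parameters, and the grid are all illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Exercise 3.7: Gamma(a + sum(x), N + b) should match an explicit grid
# normalisation of p(lambda) * p(D | lambda).
rng = np.random.default_rng(0)
x = rng.poisson(lam=3.5, size=20)
a, b = 2.0, 1.0

lam = np.linspace(1e-3, 10.0, 2000)
unnorm = stats.gamma.pdf(lam, a, scale=1 / b) * np.exp(
    stats.poisson.logpmf(x[:, None], lam).sum(axis=0))
grid_post = unnorm / (unnorm.sum() * (lam[1] - lam[0]))    # crude normalisation

conj_post = stats.gamma.pdf(lam, a + x.sum(), scale=1 / (b + len(x)))
print(np.abs(grid_post - conj_post).max())                 # small (grid error only)
```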

3.9 Bayesian analysis of the uniform distribution

The conjugate prior for the uniform distribution is the Pareto distribution:

p(θ) = Pareto(θ | K, b) = K b^K θ^(−(K+1)) I(θ ≥ b)

Let m = max_{i=1}^n {x_i}. The joint distribution is:

p(θ, D) = p(θ) p(D | θ) = K b^K θ^(−(K+N+1)) I(θ ≥ b) I(θ ≥ m)

The marginal likelihood/evidence is:

p(D) = ∫ p(D, θ) dθ = K b^K / [(N + K) max(m, b)^(N+K)]

Let µ = max{m, b}. The posterior distribution again has the form of a Pareto distribution:

p(θ | D) = p(θ, D)/p(D) = (N + K) µ^(N+K) I(θ ≥ µ) / θ^(N+K+1) = Pareto(θ | N + K, µ)

3.10 Taxicab problem

For question a, we have D = {100}, m = max{D} = 100, N = 1. Using a prior with K = 0, b = 0, we obtain the posterior:

Pareto(θ | 1, 100)

As for question b, the posterior mode is given below (and since k = 1, the posterior mean and variance do not exist):

mode = 100

To calculate the median:

∫_m^median k m^k x^(−(k+1)) dx = 1/2

Plugging in the figures and using the fact that ∫ x⁻² dx = −x⁻¹ + C, we solve for:

median = 200

In question c, we already have (from exercise 3.9) the predictive distribution, in this case with α = (b = 0, K = 0) and β = (c = m, N + K = 1). Plug them into the form of the evidence:

In the case when x > m (the number of the new taxi is larger than the one we saw):

p(x | D, α) = m/(2x²)

If x ≤ m:

p(x | D, α) = m/(2m²) = 1/(2m)

We plug m = 100 into the equations above to solve question d:

p(x = 50 | m = 100) = 1/200
p(x = 100 | m = 100) = 1/200
p(x = 150 | m = 100) = 1/450

(Do please reason about these results!)

We omit question e as an open question.

3.11 Bayesian analysis of the exponential distribution

The log-likelihood for an exponential distribution is:

ln p(D | θ) = N ln θ − θ Σ_{n=1}^N x_n

The derivative is:

∂/∂θ ln p(D | θ) = N/θ − Σ_{n=1}^N x_n

Thus in question a:

θ̂_ML = N / Σ_{n=1}^N x_n

We skip the other questions and state that the conjugate prior for the exponential distribution is the Gamma distribution:

p(θ | D) ∝ p(θ) p(D | θ) ∝ Gamma(θ | a, b) p(D | θ) ∝ Gamma(θ | N + a, b + Σ_n x_n)

A Gamma prior introduces a − 1 prior observations whose sum is b.
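The predictive values quoted for exercise 3.10(d) can also be reproduced by integrating p(x | θ) against the Pareto(θ | 1, 100) posterior with numerical quadrature, as in the following sketch.

```python
import numpy as np
from scipy.integrate import quad

# Exercise 3.10(d): p(x | D) = integral of p(x | theta) * Pareto(theta | 1, m)
# over theta >= max(x, m), with m = 100 and p(x | theta) = 1/theta on [0, theta].
m = 100.0
posterior = lambda theta: m / theta**2            # Pareto(theta | 1, m) density

def predictive(x):
    lower = max(x, m)                             # need theta >= x and theta >= m
    val, _ = quad(lambda t: (1.0 / t) * posterior(t), lower, np.inf)
    return val

for x in (50.0, 100.0, 150.0):
    print(x, predictive(x))                       # 1/200, 1/200, 1/450 as above
```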

3.12 MAP estimation for the Bernoulli with non-conjugate priors

In question a, we have:

p(θ = 0.5 | D) ∝ p(θ = 0.5) p(D | θ = 0.5) = (1/2) (1/2)^N1 (1/2)^N0
p(θ = 0.4 | D) ∝ p(θ = 0.4) p(D | θ = 0.4) = (1/2) (2/5)^N1 (3/5)^N0

If the MAP estimate is θ = 0.5, i.e.

ln [p(θ = 0.5 | D)/p(θ = 0.4 | D)] = N1 ln(5/4) + N0 ln(5/6) > 0

then this must hold:

N1 > 0.817 N0

Otherwise the MAP estimate is θ = 0.4.

For question b, when N is small the prior in question a is able to yield a fairly good estimate (the prior is not conjugate, yet it is close to the truth). But as N grows, the estimate under this prior can only get close to 0.4, while the Beta prior tends to yield the true parameter with less error.

3.13 Posterior predictive distribution for a batch of data with the dirichlet-multinomial model

Since we already have 3.51:

p(X = j | D, α) = (α_j + N_j)/(α_0 + N)

we can easily derive:

p(D̃ | D, α) = Π_{x ∈ D̃} p(x | D, α) = Π_{j=1}^C [(α_j + N_j^old)/(α_0 + N^old)]^(N_j^new)

3.14 Posterior predictive for Dirichlet-multinomial

The solution to this exercise can be obtained from the conclusions drawn in exercise 3.13.
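The threshold of exercise 3.12(a) and the resulting MAP behaviour can be checked directly; the two example count pairs below are arbitrary choices on either side of the threshold.

```python
import numpy as np

# Exercise 3.12(a): MAP is theta = 0.5 exactly when
# N1*ln(5/4) + N0*ln(5/6) > 0, i.e. N1/N0 > ln(6/5)/ln(5/4) ≈ 0.817.
print(np.log(6 / 5) / np.log(5 / 4))                   # ≈ 0.8173

def map_estimate(N1, N0):
    # unnormalised posteriors under the two-point prior p(0.5) = p(0.4) = 1/2
    post_05 = 0.5 * 0.5**N1 * 0.5**N0
    post_04 = 0.5 * 0.4**N1 * 0.6**N0
    return 0.5 if post_05 > post_04 else 0.4

print(map_estimate(9, 10), map_estimate(8, 10))        # 0.5 (ratio 0.9), 0.4 (ratio 0.8)
```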

3.15 Setting the hyper-parameters I

We already have:

mean m = α1/(α1 + α2)
var v = α1 α2 / ((α1 + α2)² (α1 + α2 + 1))

Using the notation A = α1 and B = α1 + α2, we have:

m = A/B
v = A(B − A)/(B²(B + 1))

Cancelling A:

B = m(1 − m)/v − 1
A = mB

For the given figures, we calculate:

B = 130, A = 91

Hence:

α1 = 91
α2 = 39

3.16 Setting the beta hyper-parameters II

The parameters α1 and α2 of a Beta distribution are connected through the given mean m:

α2 = α1 (1/m − 1) = f(α1)

Calculate the integral:

u(α1) = ∫_l^u [1/B(α1, f(α1))] θ^(α1−1) (1 − θ)^(f(α1)−1) dθ

Setting this integral u(α1) = 0.95 and solving for α1 by a numerical method will do.
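One way to carry out the numerical step of exercise 3.16 is a one-dimensional root find on α1, as sketched below. The figures m = 0.15, l = 0.05, u = 0.30 are placeholders, not the numbers given in the exercise.

```python
from scipy import stats
from scipy.optimize import brentq

# Exercise 3.16: fix the prior mean m (so alpha2 = alpha1*(1/m - 1)) and choose
# alpha1 so that the Beta prior puts 95% of its mass on [l, u].
m, l, u = 0.15, 0.05, 0.30     # placeholder figures

def coverage_gap(alpha1):
    alpha2 = alpha1 * (1 / m - 1)
    return stats.beta.cdf(u, alpha1, alpha2) - stats.beta.cdf(l, alpha1, alpha2) - 0.95

alpha1 = brentq(coverage_gap, 0.5, 50.0)   # bracket chosen so the gap changes sign
print(alpha1, alpha1 * (1 / m - 1))
```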

3.17 Marginal likelihood for beta-binomial under uniform prior

The marginal likelihood is given by:

p(N1 | N) = ∫₀¹ p(N1, θ | N) dθ = ∫₀¹ p(N1 | θ, N) p(θ) dθ

We already have:

p(N1 | θ, N) = Bin(N1 | θ, N)
p(θ) = Beta(1, 1)

Thus:

p(N1 | N) = C(N, N1) ∫₀¹ θ^N1 (1 − θ)^(N−N1) dθ
          = C(N, N1) B(N1 + 1, N − N1 + 1)
          = [N!/(N1!(N − N1)!)] · [N1!(N − N1)!/(N + 1)!]
          = 1/(N + 1)

where B is the normalizing constant of a Beta distribution:

B(a, b) = Γ(a) Γ(b)/Γ(a + b)

3.18 Bayes factor for coin tossing*
