Machine Learning: A Probabilistic Perspective (PDF online book)


Another approach is to use ARD; this is discussed in more detail in Section 16.5.7.5.

16.5.6.3 Soft weight sharing. Another way to regularize the parameters is to encourage similar weights to share statistical strength.

This justifies the common strategy of computing the point estimate θ̂ but not bothering to compute the full posterior p(θ | D). This is an example of induction.

Solving this recurrence yields the following sequence: 1, 3, 25, 543, 29281, 3781503, etc. In view of the enormous size of the hypothesis space, we are generally forced to use approximate methods, some of which we review below.

One deployed system, JamBayes, predicts traffic flow in the Seattle area; predictions are made using a graphical model whose structure was learned from data. See Algorithm 3 for the pseudocode.

In the SVD, the columns of U are the left singular vectors, and the columns of V are the right singular vectors.

For some convex sets, it is easy to compute the projection operator. Let us try to understand why this behavior is desirable. In addition, the data is quite imbalanced, with many users rating fewer than 5 movies, and a few users rating over 10,000 movies.

To understand this concept, imagine sampling many different data sets D^(s) from some true model p(· | θ*), i.e., let D^(s) = {x_i^(s)}_{i=1}^N, where x_i^(s) ~ p(· | θ*) and θ* is the true parameter.

We can combine bidirected and directed edges to get a directed mixed graphical model.

(Compare this to Newton's method in Figure 8.4(a), which repeatedly fits and then optimizes a quadratic approximation.)

11.4.7.2 EM monotonically increases the observed data log likelihood. We now prove that EM monotonically increases the observed data log likelihood until it reaches a local optimum. Now suppose the joint probability distribution p(z_i, x_i | θ) is in the exponential family, which means it can be written as follows:

p(x, z | θ) = (1/Z(θ)) exp[θ^T φ(x, z)]   (11.13)

To train this model, let y_i be the relevance scores of the documents for query i.

So we model each document as an admixture over topics. We will use a Dirichlet process prior for p(c), which allows V to grow automatically.

Fit a full or diagonal covariance matrix by MAP estimation. We can drop the log |H| term, since it is independent of N and thus will get overwhelmed by the likelihood.

(Figure 5.14(a-c), from the Bayesian decision theory chapter, is omitted here.)

Thus BFGS can be thought of as a "diagonal plus low-rank" approximation to the Hessian. (z_i is an example of a hidden or latent variable.)

4.3.4.3 Proof of Gaussian conditioning formulas. We can now return to our original goal, which is to derive Equation 4.69.

(It is always a good idea to perform exploratory data analysis, such as plotting the data, before applying a machine learning method.)

Image classification and handwriting recognition. Now consider the harder problem of classifying images directly, where a human has not preprocessed the data.

Figure 12.14 (plots omitted) shows the reconstruction error (RMSE) on MNIST versus the number of latent dimensions used by PCA, for (a) the training set and (b) the test set.
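The PCA reconstruction-error curves described in the Figure 12.14 caption are easy to reproduce. Below is a minimal NumPy sketch on a synthetic low-rank dataset standing in for MNIST; the dimensions, noise level, and the rmse helper are illustrative assumptions, not taken from the book.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: low-rank signal plus noise.
D, N_train, N_test, true_rank = 50, 500, 500, 10
W = rng.normal(size=(D, true_rank))
X_train = rng.normal(size=(N_train, true_rank)) @ W.T + 0.5 * rng.normal(size=(N_train, D))
X_test  = rng.normal(size=(N_test,  true_rank)) @ W.T + 0.5 * rng.normal(size=(N_test,  D))

mu = X_train.mean(axis=0)
# Columns of V (rows of Vt) are the right singular vectors / principal directions.
U, s, Vt = np.linalg.svd(X_train - mu, full_matrices=False)

def rmse(X, L):
    """Reconstruction RMSE using the top-L principal directions."""
    V_L = Vt[:L].T                 # D x L basis
    Z = (X - mu) @ V_L             # project
    X_hat = Z @ V_L.T + mu         # reconstruct
    return np.sqrt(np.mean((X - X_hat) ** 2))

for L in (1, 5, 10, 20, 50):
    print(L, rmse(X_train, L), rmse(X_test, L))

Training error decreases monotonically with L, while the test error flattens once L exceeds the true signal rank, which is the qualitative behavior the figure caption describes.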
The corresponding objective function is usually written in the following form:

J = C Σ_{i=1}^N L(y_i, ŷ_i) + (1/2) ||w||^2   (14.47)

For each dimensionality, we conduct 5 random trials.

18.6.3 Application: fault diagnosis. Consider the model in Figure 18.15(a). However, these are usually too slow to use inside of a search over models.

For example, consider the MRF in Figure 10.1(b). However, note that most leaves will be off, since most words do not occur in a given query; such leaves can be analytically removed, as shown in Exercise 10.7. We can also prune out unlikely hidden nodes by following the strongest links from the words that are on up to their parents to get a candidate set of concepts.

Nevertheless, let us assume we are in a simple setting where the theorem holds.

Methods such as SVMs that do not produce calibrated probabilities were post-processed using Platt's logistic regression trick (Section 14.5.2.3), or using isotonic regression.

17.2.4.2 Web spam. PageRank is not foolproof.

Figure 3.6(b) gives an example where we update a strong Beta(5,2) prior with a peaked likelihood function; now we see that the posterior is a "compromise" between the prior and likelihood.

Hence for a K-ary random variable, the entropy is maximized if p(x = k) = 1/K; in this case, H(X) = log2 K.

16.4.1 Forward stagewise additive modeling. The goal of boosting is to solve the following optimization problem:

min_f Σ_{i=1}^N L(y_i, f(x_i))   (16.25)

where L(y, ŷ) is some loss function, and f is assumed to be an ABM model as in Equation 16.3. Common choices for the loss function are listed in Table 16.1. If we use squared error loss, the optimal estimate is given by

f*(x) = argmin_{f(x)} E_{y|x} (Y − f(x))^2 = E[Y | x]   (16.26)

Eventually it will overfit.

The cumulative distribution function or cdf of the Gaussian is defined as

Φ(x; μ, σ^2) = ∫_{−∞}^x N(z | μ, σ^2) dz   (2.44)

See Figure 2.3(a) for a plot of this cdf when μ = 0, σ^2 = 1.

In (Opper 1998) a version of this algorithm is derived using a probit likelihood (see Section 9.4). (We use the hat symbol to denote an estimate.) Our main goal is to make predictions on novel inputs, meaning ones that we have not seen before (this is called generalization), since predicting the response on the training set is easy (we can just look up the answer).

For q(μ) we have

μ_N = (κ_0 μ_0 + N x̄) / (κ_0 + N),   κ_N = (κ_0 + N) a_N / b_N   (21.79-21.80)

and for q(λ) we have

a_N = a_0 + (N + 1)/2   (21.81)
b_N = b_0 + (κ_0/2)(E[μ^2] + μ_0^2 − 2 E[μ] μ_0) + (1/2) Σ_{i=1}^N (x_i^2 + E[μ^2] − 2 E[μ] x_i)   (21.82)

We see that μ_N and a_N are in fact fixed constants, and only κ_N and b_N need to be updated iteratively.

An n-gram model has O(K^n) parameters.

We can classify a feature vector using the following decision rule, derived from Equation 2.13:

ŷ(x) = argmax_c [log p(y = c | π) + log p(x | θ_c)]   (4.31)

When we compute the probability of x under each class conditional density, we are measuring the distance from x to the center of each class, μ_c, using Mahalanobis distance.

These hybrid particles are sometimes called distributional particles or collapsed particles (Koller and Friedman 2009, Sec 12.4).
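The decision rule in Equation 4.31 can be implemented directly. The sketch below, assuming NumPy and a shared covariance matrix, fits class priors, means, and a pooled covariance by MLE and classifies by maximizing log prior plus log Gaussian likelihood (equivalently, a prior-penalized Mahalanobis distance); the helper names and synthetic data are hypothetical.

import numpy as np

def fit_gda(X, y, n_classes):
    """MLE of class priors, means, and a shared (pooled) covariance."""
    priors = np.array([np.mean(y == c) for c in range(n_classes)])
    means = np.array([X[y == c].mean(axis=0) for c in range(n_classes)])
    Sigma = sum(np.cov(X[y == c].T, bias=True) * np.sum(y == c)
                for c in range(n_classes)) / len(y)
    return priors, means, Sigma

def predict(X, priors, means, Sigma):
    """y_hat(x) = argmax_c [log p(y=c) + log N(x | mu_c, Sigma)]."""
    Sinv = np.linalg.inv(Sigma)
    _, logdet = np.linalg.slogdet(Sigma)
    scores = []
    for c in range(len(priors)):
        diff = X - means[c]
        maha = np.einsum('ij,jk,ik->i', diff, Sinv, diff)  # squared Mahalanobis distance
        scores.append(np.log(priors[c]) - 0.5 * (logdet + maha))
    return np.argmax(np.stack(scores, axis=1), axis=1)

rng = np.random.default_rng(1)
X0 = rng.normal([-1, -1], 1.0, size=(100, 2))
X1 = rng.normal([+1, +1], 1.0, size=(100, 2))
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 100)
priors, means, Sigma = fit_gda(X, y, 2)
print("train accuracy:", np.mean(predict(X, priors, means, Sigma) == y))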
In this case, instead of requiring the probability vector μ to live in the marginal polytope M(G), we allow it to live in a convex outer approximation of it.

We can identify this subspace using PCA (see the above references for details). Now the only way to get a factored posterior is if the prior p(h^1 | W^1) is a complementary prior.

Also, it is relatively easy to extend to the multi-class case. However, this can be slow and difficult to implement.

The term σ_j^2 controls how much group j depends on the common parents, and the σ^2 term controls the strength of the overall prior.

Note that agglomerative and divisive clustering are both just heuristics, which do not optimize any well-defined objective function.

2.6.1 Linear transformations. Suppose f() is a linear function:

y = f(x) = Ax + b   (2.78)

In this case, we can easily derive the mean and covariance of y as follows.

We can also endow each dimension with its own characteristic length scale, M_2 = diag(ℓ)^{−2}.

27.3.3 Quantitatively evaluating LDA as a language model. In order to evaluate LDA quantitatively, we can treat it as a language model, i.e., a probability distribution over sequences of words. Eliminating the deterministic α_i, we have

π_i ~ Dir(exp(W x_i))   (27.72)

For example, "play" might refer to a verb (e.g., "to play ball" or "to play the coronet"), or to a noun (e.g., "Shakespeare's play"). Such models have applications in computational biology, natural language processing, time series forecasting, etc.

This gives equivalent results to the serial version but is less efficient when implemented on a serial machine.

To give equal weight to each class, use macro-averaging.

The reason for this is that we can maximize the likelihood by driving ||w|| to infinity (subject to being on this line), since large regression weights make the sigmoid function very steep, turning it into a step function.

17.6.3 Input-output HMMs. It is straightforward to extend an HMM to handle inputs, as shown in Figure 17.18(a).

In this case, we can derive the EB estimate in closed form, as we now show.

We have a joint model of the form

p(x, y) = p(x) p(y | x)   (21.36)

Figure 21.2 (plots omitted) illustrates forwards versus reverse KL on a symmetric Gaussian.

We may want to use a feature for all of the tasks or none of the tasks, and thus select weights at the group level. Of course, working with an "infinite" model sounds scary.

One can adopt a variety of approximate Bayesian inference techniques in this context. This can be solved optimally using an HMM filter, since we are assuming the state space is discrete.
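The linear-transformation result that follows Equation 2.78, namely E[y] = A μ + b and cov[y] = A Σ A^T, can be checked numerically. A small NumPy sketch with made-up values for μ, Σ, A, and b:

import numpy as np

rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
b = np.array([0.5, -1.0])

# Closed form: E[y] = A mu + b, cov[y] = A Sigma A^T.
mu_y = A @ mu + b
Sigma_y = A @ Sigma @ A.T

# Monte Carlo check of the closed-form moments.
x = rng.multivariate_normal(mu, Sigma, size=200_000)
y = x @ A.T + b
print(mu_y, y.mean(axis=0))
print(Sigma_y)
print(np.cov(y.T))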
Notation (symbol: meaning). s ~ t: node s is connected to node t; bel: belief function; C: cliques of a graph; ch_j: child of node j in a DAG; desc_j: descendants of node j in a DAG; G: a graph; V: vertices of a graph; E: edges of a graph; mb_t: Markov blanket of node t; nbd_t: neighborhood of node t; pa_t: parents of node t in a DAG; pred_t: predecessors of node t in a DAG with respect to some ordering; ψ_c(x_c): potential function for clique c; S: separators of a graph.

A KNN classifier with K = 1 induces a Voronoi tessellation of the points (see Figure 1.14(b)).

The fact that a_i is not constant when using the JJ bound, unlike when using the Bohning bound, means we cannot compute V_N outside of the main loop, making the method a constant factor slower.

That is, the M step is guaranteed to modify the parameters so as to increase the likelihood of the observed data (unless it is already at a local maximum).

Exercise 3.19 Irrelevant features with naive Bayes (Source: Jaakkola). Let x_iw = 1 if word w occurs in document i, and x_iw = 0 otherwise.

We can determine the size of the largest factor graphically, without worrying about the actual numerical values of the factors.

20.2.4.2 Max-product algorithm. It is possible to devise a max-product version of the BP algorithm, by replacing the sum operator with the max operator.

Compute the new posterior q^new(x) by solving

q^new(x) = argmin_{q^new(x)} KL( (1/Z_i) f_i(x) q_{−i}(x) || q^new(x) )   (22.84)

This can be done by equating the moments of q^new(x) with those of q_{−i}(x) f_i(x).

For example, consider a 2 x 2 grid MRF, with the following pairwise factors: θ_f(x_1, x_2), θ_g(x_1, x_3), θ_h(x_2, x_4), and θ_k(x_3, x_4).

9.3 Generalized linear models (GLMs). Linear and logistic regression are examples of generalized linear models, or GLMs (McCullagh and Nelder 1989).

This algorithm can be used to perform Bayesian inference in low-dimensional settings (Smith and Gelfand 1992). This is called a (first order) linear programming relaxation.

Note that the difference from standard binary classification is that we are classifying y_i based on all the data, not just based on x_i.

Hence the kernel method can be useful in high dimensional settings, even if we only use a linear kernel (c.f. the SVD trick in Equation 7.44).

Suppose that for each value of F, taking the drug is harmful, that is,

p(E | do(C), F) > p(E | do(¬C), F)   (26.56)
p(E | do(C), ¬F) > p(E | do(¬C), ¬F)   (26.57)

Then we can show that taking the drug is harmful overall:

p(E | do(C)) > p(E | do(¬C))   (26.58)

The proof is given in (Pearl 2000, p181).

The yellow circle is harder to classify, since some yellow things are labeled y = 1 and some are labeled y = 0, and some circles are labeled y = 1 and some y = 0.

(We kept λ = 1/σ^2 fixed in both methods, to make them comparable.) The principal practical advantage of the evidence procedure over CV will become apparent in Section 13.7, where we generalize the prior by allowing a different α_j for every feature.

A standard frequentist measure of uncertainty of an estimate is the standard error of the mean, defined by

se = σ̂ / sqrt(N)   (6.64)

where σ̂^2 is an estimate of the variance of the loss:

σ̂^2 = (1/N) Σ_{i=1}^N (L_i − L̄)^2,   L_i = L(y_i, f_m^{−k(i)}(x_i)),   L̄ = (1/N) Σ_{i=1}^N L_i   (6.65)

Note that σ̂ measures the intrinsic variability of L_i across samples, whereas se measures our uncertainty about the mean L̄.
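Equations 6.64 and 6.65 translate directly into code. A minimal sketch, assuming NumPy and hypothetical per-example 0-1 losses from a cross-validation run:

import numpy as np

def mean_and_se(losses):
    """Mean loss L_bar and its standard error se = sigma_hat / sqrt(N) (Eqs. 6.64-6.65)."""
    losses = np.asarray(losses, dtype=float)
    N = len(losses)
    L_bar = losses.mean()
    sigma2_hat = np.mean((losses - L_bar) ** 2)   # intrinsic variability of L_i
    se = np.sqrt(sigma2_hat / N)                  # uncertainty about the mean
    return L_bar, se

# Hypothetical per-example 0-1 losses from cross validation.
rng = np.random.default_rng(0)
losses = rng.binomial(1, 0.12, size=500)
L_bar, se = mean_and_se(losses)
print(f"estimated risk {L_bar:.3f} +/- {2 * se:.3f} (approx. 95% interval)")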
In addition to visualizing the data, a dependency network can be used for inference.

In 2008, Google said it had indexed 1 trillion (10^12) unique URLs. If we assume there are about 10 URLs per page (on average), this means there were about 100 billion unique web pages.

16.3.1 Backfitting. We now discuss how to fit the model using MLE.

21.7 Variational message passing and VIBES. We have seen that mean field methods, at least of the fully-factorized variety, are all very similar: just compute each node's full conditional, and average out the neighbors.

For example, suppose we observe the following sequence (part of a children's nursery rhyme): "Mary had a little lamb, little lamb, little lamb, Mary had a little lamb, its fleece as white as snow". Furthermore, suppose our vocabulary consists of the following words, indexed 1 to 10: mary, lamb, little, big, fleece, white, black, snow, rain, unk. Here unk stands for unknown, and represents all other words that do not appear elsewhere on the list.

If the graph is loopy, the posterior means may still be exact, but the posterior variances are often too small (Weiss and Freeman 1999).

3.5.2 Using the model for prediction. At test time, the goal is to compute

p(y = c | x, D) ∝ p(y = c | D) Π_{j=1}^D p(x_j | y = c, D)   (3.63)

The correct Bayesian procedure is to integrate out the unknown parameters:

p(y = c | x, D) ∝ [ ∫ Cat(y = c | π) p(π | D) dπ ] Π_{j=1}^D [ ∫ Ber(x_j | y = c, θ_jc) p(θ_jc | D) dθ_jc ]   (3.64-3.65)

Fortunately, this is easy to do, at least if the posterior is Dirichlet.

However, the need to use cross validation can make SVMs slower than RVMs. L1VM should be faster than an RVM, since an RVM requires multiple rounds of ℓ1 minimization (see Section 13.7.4.3).

Suppose all the D features are binary, x_j ∈ {0, 1}. A document-by-term matrix is in lsiMatrix.txt.

Minimax estimators have a certain appeal.

Since π_i is a deterministic function of the inputs, it is effectively observed, rendering the q_il (and hence the tags y_il) independent.

This problem arises, for example, when trying to perform personalized recommendation of web pages.

The advantage of performing unsupervised learning first is that the model is forced to model a high-dimensional response, namely the input feature vector, rather than just predicting a scalar response.

3.3.3.2 Posterior variance. The mean and mode are point estimates, but it is useful to know how much we can trust them.

To (hopefully) avoid confusion, I use a_0^λ, b_0^λ as the hyper-parameters for p(λ), and a_0^α, b_0^α as the hyper-parameters for p(α).

The simplest approximation is the plug-in approximation, which, in the binary case, takes the form

p(y = 1 | x, D) ≈ p(y = 1 | x, E[w])   (8.60)

where E[w] is the posterior mean.

In (Heller and Ghahramani 2005), they show how one can back-propagate gradients of the form ∂p(D_k | T_k)/∂λ through the tree, and thus perform an empirical Bayes estimate of the hyper-parameters.

So if η is chosen small enough, then f(θ + ηd) < f(θ), since the gradient in the direction d will be negative.

One can show (Exercise 4.3) that −1 ≤ corr[X, Y] ≤ 1.
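For Dirichlet and Beta posteriors, the posterior predictive in Equations 3.63-3.65 reduces to plugging in posterior-mean (smoothed) counts. A small sketch, assuming NumPy, a Bernoulli naive Bayes model, and a made-up toy bag-of-words dataset:

import numpy as np

def fit_nb(X, y, n_classes, alpha=1.0, beta=1.0):
    """Posterior-mean parameters for Bernoulli naive Bayes.
    With Dirichlet(alpha) and Beta(beta, beta) priors, integrating out the
    parameters and using posterior means gives smoothed counts."""
    pi = np.array([np.sum(y == c) + alpha for c in range(n_classes)])
    pi = pi / pi.sum()
    theta = np.array([(X[y == c].sum(axis=0) + beta) / (np.sum(y == c) + 2 * beta)
                      for c in range(n_classes)])   # theta[c, j] = p(x_j = 1 | y = c, D)
    return pi, theta

def predict_log_proba(X, pi, theta):
    """log p(y = c | x, D) up to a normalizing constant, as in Eq. (3.63)."""
    log_lik = X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
    return np.log(pi) + log_lik

# Tiny hypothetical dataset: 6 documents, 4 binary word features, 2 classes.
X = np.array([[1,0,1,0],[1,1,1,0],[1,0,0,0],[0,1,0,1],[0,0,1,1],[0,1,0,1]])
y = np.array([0,0,0,1,1,1])
pi, theta = fit_nb(X, y, 2)
print(predict_log_proba(np.array([[1, 0, 1, 0]]), pi, theta))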
One can show (Heller and Ghahramani 2005) that p(D_k | T_k) computed by the BHC algorithm is similar to p(D_k) given above, except for the fact that it only sums over partitions which are consistent with tree T_k.

24.2.3 Example: Gibbs sampling for inferring the parameters of a GMM. It is straightforward to derive a Gibbs sampling algorithm to "fit" a mixture model, especially if we use conjugate priors.

Now consider predicting a batch of new data, D̃, consisting of multinomial trials (think of predicting the next m words in a sentence, assuming they are drawn iid).

Many generic classification methods ignore any structure in the input features, such as spatial layout.

These equations make intuitive sense: the mean of cluster k is just the weighted average of all points assigned to cluster k, and the covariance is proportional to the weighted empirical scatter matrix.

The model is defined as follows:

p(x, y) = [ Π_m Π_t p(x_tm | x_{t−1,m}) ] [ Π_t p(y_t | x_t) ]   (21.53)

where p(x_tm = k | x_{t−1,m} = j) = A_mjk is an entry in the transition matrix for chain m, p(x_1m = k | x_0m) = p(x_1m = k) = π_mk is the initial state distribution for chain m, and

p(y_t | x_t) = N(y_t | Σ_{m=1}^M W_m x_tm, Σ)   (21.54)

is the observation model, where x_tm is a 1-of-K encoding of x_tm and W_m is a D x K matrix (assuming y_t ∈ R^D).

Simple MCMC methods often do not work well in such cases, and more advanced algorithms, such as parallel tempering, are necessary (see e.g., (Liu 2001)).

6.4.3 Minimum variance estimators. It seems intuitively reasonable that we want our estimator to be unbiased (although we shall give some arguments against this claim below).

An interesting alternative is to learn a mixture of trees (Meila and Jordan 2000), where each mixture component may have a different tree topology. One can use similar techniques for image denoising.

Instead, we just give a few "edited highlights". However, we will also see that these methods in fact are equivalent to maximum likelihood after all.

From Equation 20.10, the belief at node s is given by the product of incoming messages times the local evidence (Equation 20.15), and hence

bel_s(x_s) ∝ ψ_s(x_s) Π_{t ∈ nbr(s)} m_{ts}(x_s) = N(x_s | μ_s, λ_s^{−1})   (20.22)
λ_s = λ_ss + Σ_{t ∈ nbr(s)} λ_ts   (20.23)
μ_s = λ_s^{−1} ( λ_ss m_s + Σ_{t ∈ nbr(s)} λ_ts μ_ts )   (20.24)

The coordinate descent update for the lasso has the following form: for j = 1, ..., D, compute a_j = 2 Σ_{i=1}^n x_ij^2 and c_j = 2 Σ_{i=1}^n x_ij (y_i − w^T x_i + w_j x_ij), then set w_j = soft(c_j / a_j, λ / a_j); repeat until converged (a runnable version is sketched below).

LARS and other homotopy methods. The problem with coordinate descent is that it only updates one variable at a time, so it can be slow to converge.

15.3 GPs meet GLMs. In this section, we extend GPs to the GLM setting, focusing on the classification case.

How should we set the threshold τ?

22.2.1 A brief history. When applied to loopy graphs, BP is not guaranteed to give correct results, and may not even converge.

The reason is that a generative model defines a joint distribution on x and y, and hence treats both inputs and outputs symmetrically.
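The coordinate descent update described above can be implemented in a few lines. The following is a minimal NumPy version of the "shooting" algorithm for the lasso objective RSS(w) + λ||w||_1; the synthetic data and iteration count are illustrative assumptions.

import numpy as np

def soft(a, delta):
    """Soft-thresholding operator: sign(a) * max(|a| - delta, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - delta, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Coordinate descent for RSS(w) + lam * ||w||_1, one feature at a time."""
    n, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iter):
        for j in range(D):
            a_j = 2.0 * np.sum(X[:, j] ** 2)
            r_j = y - X @ w + w[j] * X[:, j]        # residual ignoring feature j
            c_j = 2.0 * np.sum(X[:, j] * r_j)
            w[j] = soft(c_j / a_j, lam / a_j)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
w_true = np.array([3.0, -2.0, 0, 0, 0, 0, 0, 0, 0, 1.5])
y = X @ w_true + 0.1 * rng.normal(size=100)
print(np.round(lasso_cd(X, y, lam=5.0), 2))   # most irrelevant weights shrink to exactly zero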
Figure 26.8 (graph omitted) shows a locally optimal DAG learned from the 20-newsgroup data; the nodes are word presence variables such as "god", "government", "car", "space", "windows", and "baseball".

Such models are equivalent to pairwise MRFs.

Figure 22.4 (omitted) shows message passing on a bipartite factor graph. That is, the new messages at iteration k + 1 are computed in parallel using m^{k+1} = (f_1(m^k), f_2(m^k), ...).

This assumes the training data is randomly sampled, and we don't just get repetitions of the same examples.

Figure 18.2 (omitted) illustrates the graphical model underlying SLAM.

The task of computing the MLE for a (non-decomposable) GGM is called covariance selection (Dempster 1972).

The reason is two-fold: first, they are widely used, so readers should know about them; second, they often contain good ideas, which can be used to speed up inference in probabilistic models.

In this case, one simple strategy is to revert to steepest descent, d_k = −g_k.

The basic idea is to start with λ large, so that the margin 1/||w(λ)|| is wide, and hence all points are inside of it and have α_i = 1.

Let E[1/τ_1], ..., E[1/τ_D] denote the result of this E step.

Since the lower bound "touches" the function, maximizing the lower bound will also "push up" on the function itself.

This corresponds to a whitening of the data.

(For notational simplicity, we assume we have already estimated μ̂.) One can show that the gradient of this is given by

∇(Ω) = Ω^{−1} − S   (26.67)

However, we have to enforce the constraints that Ω_st = 0 if G_st = 0 (structural zeros), and that Ω is positive definite.

If all the weights are positive, J_st > 0, then neighboring spins are likely to be in the same state; this can be used to model ferromagnets, and is an example of an associative Markov network.

For example, if R_ij = 0, it may be due to the fact that persons i and j have not had an opportunity to interact, or that data is not available for that interaction, as opposed to the fact that these people don't want to interact.

In addition, the JTA does not work for continuous random variables outside of the Gaussian case, nor for mixed discrete-continuous variables outside of the conditionally Gaussian case.

Therefore just averaging the performance does not necessarily give reliable conclusions.

Alternatively, we might use

f(θ, z_i) = L(y_i, h(x_i, θ))   (8.76)

where h(x_i, θ) is a prediction function, and L(y, ŷ) is some other loss function such as squared error or the Huber loss.

Exact inference in this model is obviously infeasible.

If we knew the true parameters θ*, we could generate many (say S) fake datasets, each of size N, from the true distribution, x_i^s ~ p(· | θ*), for s = 1:S, i = 1:N.

Figure 2.16 (omitted) illustrates the change of variables from polar to Cartesian coordinates.
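The idea of generating S fake datasets from a fitted model is the parametric bootstrap. A minimal sketch, assuming NumPy and a Bernoulli model, where the MLE θ̂ stands in for the unknown θ*:

import numpy as np

rng = np.random.default_rng(0)

# "True" parameter of a Bernoulli model and one observed dataset of size N.
theta_star, N, S = 0.7, 50, 2000
data = rng.binomial(1, theta_star, size=N)
theta_hat = data.mean()                      # MLE on the observed data

# Parametric bootstrap: draw S fake datasets of size N from p(. | theta_hat)
# and recompute the estimator on each one.
boot_mles = np.array([rng.binomial(1, theta_hat, size=N).mean() for _ in range(S)])

print("MLE:", theta_hat)
print("bootstrap standard error:", boot_mles.std(ddof=1))
print("theoretical standard error:", np.sqrt(theta_hat * (1 - theta_hat) / N))

The spread of boot_mles approximates the sampling distribution of the estimator, which is exactly what the thought experiment with datasets D^(s) drawn from p(· | θ*) describes.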
If each node has at most one parent (and hence the graph is a chain or simple tree), then there will be one factor per edge (root nodes can have their prior CPDs absorbed into their children's factors).

13.7.2 Whence sparsity?

17.6.7 Dynamic Bayesian networks (DBNs). A dynamic Bayesian network is just a way to represent a stochastic process using a directed graphical model. Note that the network is not dynamic (the structure and parameters are fixed); it is the system being modeled that is dynamic.

The probability that X lies in any interval a ≤ X ≤ b can be computed as follows.

This raises several questions: how can we predict when convergence will occur?

Newsgroups is extracted from the 20 newsgroups dataset (D = 500, N = 800, C = 4, binary features).

We can replace the delta function with a narrow Gaussian, and then the E step amounts to classifying w_j under the two possible Gaussian models.

One example of this is in financial portfolio management, where accurate models of the covariance between large numbers of different stocks are important.

In PCA, we assume each source is independent, and has a Gaussian distribution,

p(z_t) = Π_{j=1}^L N(z_tj | 0, 1)   (12.96)

We will now relax this Gaussian assumption and let the source distributions be non-Gaussian.

Figure 12.21 (plots omitted) illustrates ICA and PCA applied to 100 iid samples of a 2d source signal with a uniform distribution: (a) the uniform source data, (b) the data after linear mixing, (c) PCA applied to the mixed data, and (d) ICA applied to the mixed data.

Figure 11.15 (plot omitted) shows the penalized NLL versus iteration when fitting a probit regression model in 2d with an L2 regularizer of 0.1, using a quasi-Newton method or EM.

We will use a joint distribution of the form

p(μ, Σ) = p(Σ) p(μ | Σ)   (4.199)

Looking at the form of the likelihood equation, Equation 4.197, we see that a natural conjugate prior has the form of a Normal-inverse-Wishart or NIW distribution, defined as follows:

NIW(μ, Σ | m_0, κ_0, ν_0, S_0) ≜ N(μ | m_0, (1/κ_0) Σ) IW(Σ | S_0, ν_0)   (4.200)

with normalization constant

Z_NIW = 2^{ν_0 D / 2} Γ_D(ν_0 / 2) (2π / κ_0)^{D/2} |S_0|^{−ν_0 / 2}

where Γ_D(a) is the multivariate Gamma function.

Increasing the number of steps m in boosting is analogous to decreasing the regularization penalty λ.

To avoid both of these problems, it is common (in the statistics literature) to use an alternative parameterization of the IG distribution, known as the (scaled) inverse chi-squared distribution.

How might we represent p(x | y = c) in this case?
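The ICA-versus-PCA comparison described in the Figure 12.21 caption can be reproduced with scikit-learn. A small sketch, assuming NumPy and sklearn are available; the mixing matrix and the abs_corr helper are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA, FastICA

rng = np.random.default_rng(0)

# 2d source with independent, unit-variance uniform components, then a linear mixing.
S = rng.uniform(-np.sqrt(3), np.sqrt(3), size=(100, 2))
A = np.array([[2.0, 1.0],
              [1.0, 1.5]])        # hypothetical mixing matrix
X = S @ A.T

Z_pca = PCA(n_components=2, whiten=True).fit_transform(X)
Z_ica = FastICA(n_components=2, random_state=0).fit_transform(X)

def abs_corr(Z, S):
    """Max absolute correlation of each recovered component with the true sources."""
    C = np.corrcoef(Z.T, S.T)[:2, 2:]
    return np.abs(C).max(axis=1)

print("PCA components vs sources:", np.round(abs_corr(Z_pca, S), 2))
print("ICA components vs sources:", np.round(abs_corr(Z_ica, S), 2))

PCA only decorrelates the mixed data, so its components remain rotated mixtures of the sources, whereas ICA exploits the non-Gaussianity of the uniform sources and recovers them (up to sign and permutation), mirroring panels (c) and (d) of the figure.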
23.4 Importance sampling. We now describe a Monte Carlo method known as importance sampling for approximating integrals of the form

I = E[f] = ∫ f(x) p(x) dx   (23.19)

23.4.1 Basic idea. The idea is to draw samples x in regions which have high probability, p(x), but also where f(x) is large.

17.6.5 Factorial HMM. An HMM represents the hidden state using a single discrete random variable z_t ∈ {1, ..., K}.

To compute the gradient, we need to be able to solve subproblems of the following form:

argmax_{x_f} [ θ_f(x_f) + Σ_{i ∈ f} δ_{fi}^t(x_i) ]   (22.178)

The resulting model has the form

p(y | x, θ) = Cat(y | S(W z(x)))   (16.64)

16.5.1 Convolutional neural networks. The purpose of the hidden units is to learn non-linear combinations of the original inputs; this is called feature extraction or feature construction.

Figure 19.20 illustrates the difference between these two forms of prior.

This theorem shows that the best way to minimize frequentist risk is to be Bayesian! See (Bernardo and Smith 1994, p448) for further discussion of this point.

Below we discuss some other advantages and disadvantages of each approach.

(In Section 6.4.2, we show that this is an unbiased estimate of the variance.) Hence the marginal posterior for the mean is given by

p(μ | D) = T(μ | x̄, s^2 / N, N − 1)   (4.235)

and the posterior variance of μ is

var[μ | D] = ν_N σ_N^2 / (ν_N − 2) = ((N − 1)/(N − 3)) (s^2 / N) ≈ s^2 / N   (4.236)

The square root of this is called the standard error of the mean:

sqrt(var[μ | D]) ≈ s / sqrt(N)   (4.237)

Thus an approximate 95% posterior credible interval for the mean is

I_.95(μ | D) = x̄ ± 2 s / sqrt(N)   (4.238)

(Bayesian credible intervals are discussed in more detail in Section 5.2.2; they are contrasted with frequentist confidence intervals in Section 6.6.1.)

4.6.3.8 Bayesian t-test. Suppose we want to test the hypothesis that μ ≠ μ_0 for some known value μ_0 (often 0), given values x_i ~ N(μ, σ^2).
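The importance sampling estimator for Equation 23.19 is simple to code. A minimal sketch, assuming NumPy and SciPy, with target p = N(0, 1), f(x) = x^2 (so the true value of E[f] is 1), and a deliberately heavier-tailed proposal q = N(0, 2^2):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

S = 100_000
x = rng.normal(0.0, 2.0, size=S)                         # samples from the proposal q
w = stats.norm.pdf(x, 0, 1) / stats.norm.pdf(x, 0, 2)    # importance weights p(x)/q(x)
f = x ** 2

print("unnormalized IS estimate:", np.mean(w * f))       # (1/S) sum_s w_s f(x_s)
print("self-normalized estimate:", np.sum(w * f) / np.sum(w))

Both estimates converge to 1; the self-normalized form is useful when p is only known up to a constant, at the price of a small bias.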
