All of Statistics: A Concise Course in Statistical Inference - Springer


11 Bayesian Inference

11.1 The Bayesian Philosophy

The statistical methods that we have discussed so far are known as frequentist (or classical) methods. The frequentist point of view is based on the following postulates:

F1 Probability refers to limiting relative frequencies. Probabilities are objective properties of the real world.

F2 Parameters are fixed, unknown constants. Because they are not fluctuating, no useful probability statements can be made about parameters.

F3 Statistical procedures should be designed to have well-defined long run frequency properties. For example, a 95 percent confidence interval should trap the true value of the parameter with limiting frequency at least 95 percent.

There is another approach to inference called Bayesian inference. The Bayesian approach is based on the following postulates:

B1 Probability describes degree of belief, not limiting frequency. As such, we can make probability statements about lots of things, not just data which are subject to random variation. For example, I might say that "the probability that Albert Einstein drank a cup of tea on August 1, 1948" is .35. This does not refer to any limiting frequency. It reflects my strength of belief that the proposition is true.

B2 We can make probability statements about parameters, even though they are fixed constants.

B3 We make inferences about a parameter $\theta$ by producing a probability distribution for $\theta$. Inferences, such as point estimates and interval estimates, may then be extracted from this distribution.

Bayesian inference is a controversial approach because it inherently embraces a subjective notion of probability. In general, Bayesian methods provide no guarantees on long run performance. The field of statistics puts more emphasis on frequentist methods although Bayesian methods certainly have a presence. Certain data mining and machine learning communities seem to embrace Bayesian methods very strongly. Let's put aside philosophical arguments for now and see how Bayesian inference is done. We'll conclude this chapter with some discussion on the strengths and weaknesses of the Bayesian approach.

11.2 The Bayesian Method

Bayesian inference is usually carried out in the following way.

1. We choose a probability density $f(\theta)$, called the prior distribution, that expresses our beliefs about a parameter $\theta$ before we see any data.

2. We choose a statistical model $f(x \mid \theta)$ that reflects our beliefs about $x$ given $\theta$. Notice that we now write this as $f(x \mid \theta)$ instead of $f(x; \theta)$.

3. After observing data $X_1, \ldots, X_n$, we update our beliefs and calculate the posterior distribution $f(\theta \mid X_1, \ldots, X_n)$.

To see how the third step is carried out, first suppose that $\theta$ is discrete and that there is a single, discrete observation $X$.

We should use a capital letter now to denote the parameter since we are treating it like a random variable, so let $\Theta$ denote the parameter. Now, in this discrete setting,

$$P(\Theta = \theta \mid X = x) = \frac{P(X = x, \Theta = \theta)}{P(X = x)} = \frac{P(X = x \mid \Theta = \theta)\, P(\Theta = \theta)}{\sum_{\theta} P(X = x \mid \Theta = \theta)\, P(\Theta = \theta)}$$

which you may recognize from Chapter 1 as Bayes' theorem. The version for continuous variables is obtained by using density functions:

$$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{\int f(x \mid \theta) f(\theta)\, d\theta}. \qquad (11.1)$$

If we have $n$ iid observations $X_1, \ldots, X_n$, we replace $f(x \mid \theta)$ with

$$f(x_1, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} f(x_i \mid \theta) = L_n(\theta).$$

Notation. We will write $X^n$ to mean $(X_1, \ldots, X_n)$ and $x^n$ to mean $(x_1, \ldots, x_n)$. Now,

$$f(\theta \mid x^n) = \frac{f(x^n \mid \theta) f(\theta)}{\int f(x^n \mid \theta) f(\theta)\, d\theta} = \frac{L_n(\theta) f(\theta)}{c_n} \propto L_n(\theta) f(\theta) \qquad (11.2)$$

where

$$c_n = \int L_n(\theta) f(\theta)\, d\theta \qquad (11.3)$$

is called the normalizing constant. Note that $c_n$ does not depend on $\theta$. We can summarize by writing:

Posterior is proportional to Likelihood times Prior

or, in symbols,

$$f(\theta \mid x^n) \propto L(\theta) f(\theta).$$

You might wonder, doesn't it cause a problem to throw away the constant $c_n$? The answer is that we can always recover the constant later if we need to.

What do we do with the posterior distribution? First, we can get a point estimate by summarizing the center of the posterior. Typically, we use the mean or mode of the posterior. The posterior mean is

$$\overline{\theta}_n = \int \theta\, f(\theta \mid x^n)\, d\theta = \frac{\int \theta\, L_n(\theta) f(\theta)\, d\theta}{\int L_n(\theta) f(\theta)\, d\theta}. \qquad (11.4)$$
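The proportionality in (11.2) lends itself to a direct numerical illustration. The following Python sketch is mine, not from the text: it picks a Bernoulli model with a Beta(2, 2) prior (both arbitrary choices), evaluates likelihood times prior on a grid, recovers the normalizing constant $c_n$ by numerical integration, and computes the posterior mean of (11.4).

```python
import numpy as np

# A minimal grid approximation of "posterior = likelihood x prior / c_n".
rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=50)            # simulated Bernoulli data
theta = np.linspace(0.001, 0.999, 1000)      # grid over the parameter space
dtheta = theta[1] - theta[0]

log_lik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)
prior = theta * (1 - theta)                  # Beta(2, 2) density, up to a constant

unnormalized = np.exp(log_lik) * prior
c_n = unnormalized.sum() * dtheta            # normalizing constant, equation (11.3)
posterior = unnormalized / c_n

posterior_mean = (theta * posterior).sum() * dtheta   # equation (11.4)
print(posterior_mean)
```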

We can also obtain a Bayesian interval estimate. We find $a$ and $b$ such that $\int_{-\infty}^{a} f(\theta \mid x^n)\, d\theta = \int_{b}^{\infty} f(\theta \mid x^n)\, d\theta = \alpha/2$. Let $C = (a, b)$. Then

$$P(\theta \in C \mid x^n) = \int_a^b f(\theta \mid x^n)\, d\theta = 1 - \alpha$$

so $C$ is a $1 - \alpha$ posterior interval.

11.1 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Suppose we take the uniform distribution $f(p) = 1$ as a prior. By Bayes' theorem, the posterior has the form

$$f(p \mid x^n) \propto f(p) L_n(p) = p^s (1-p)^{n-s} = p^{s+1-1} (1-p)^{n-s+1-1}$$

where $s = \sum_{i=1}^{n} x_i$ is the number of successes. Recall that a random variable has a Beta distribution with parameters $\alpha$ and $\beta$ if its density is

$$f(p; \alpha, \beta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\, p^{\alpha - 1} (1-p)^{\beta - 1}.$$

We see that the posterior for $p$ is a Beta distribution with parameters $s + 1$ and $n - s + 1$. That is,

$$f(p \mid x^n) = \frac{\Gamma(n+2)}{\Gamma(s+1)\Gamma(n-s+1)}\, p^{(s+1)-1} (1-p)^{(n-s+1)-1}.$$

We write this as

$$p \mid x^n \sim \text{Beta}(s+1, n-s+1).$$

Notice that we have figured out the normalizing constant without actually doing the integral $\int L_n(p) f(p)\, dp$. The mean of a $\text{Beta}(\alpha, \beta)$ distribution is $\alpha/(\alpha + \beta)$ so the Bayes estimator is

$$\overline{p} = \frac{s+1}{n+2}. \qquad (11.5)$$

It is instructive to rewrite the estimator as

$$\overline{p} = \lambda_n \widehat{p} + (1 - \lambda_n) \widetilde{p} \qquad (11.6)$$

where $\widehat{p} = s/n$ is the mle, $\widetilde{p} = 1/2$ is the prior mean and $\lambda_n = n/(n+2) \to 1$. A 95 percent posterior interval can be obtained by numerically finding $a$ and $b$ such that $\int_a^b f(p \mid x^n)\, dp = .95$.
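A short sketch of Example 11.1 (the data vector below is made up): with a flat prior, the posterior is Beta(s + 1, n - s + 1), so the Bayes estimator (11.5) and an equal-tailed 95 percent posterior interval can be read directly off the Beta distribution, here via SciPy.

```python
import numpy as np
from scipy import stats

x = np.array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
n, s = len(x), int(x.sum())

posterior = stats.beta(s + 1, n - s + 1)         # p | x^n ~ Beta(s + 1, n - s + 1)
bayes_estimate = (s + 1) / (n + 2)               # posterior mean, equation (11.5)
interval_95 = posterior.ppf([0.025, 0.975])      # equal-tailed 95 percent interval

print(bayes_estimate, interval_95)
```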

Suppose that instead of a uniform prior, we use the prior $p \sim \text{Beta}(\alpha, \beta)$. If you repeat the calculations above, you will see that $p \mid x^n \sim \text{Beta}(\alpha + s, \beta + n - s)$. The flat prior is just the special case with $\alpha = \beta = 1$. The posterior mean is

$$\overline{p} = \frac{\alpha + s}{\alpha + \beta + n} = \left(\frac{n}{\alpha + \beta + n}\right) \widehat{p} + \left(\frac{\alpha + \beta}{\alpha + \beta + n}\right) p_0$$

where $p_0 = \alpha/(\alpha + \beta)$ is the prior mean. ∎

In the previous example, the prior was a Beta distribution and the posterior was a Beta distribution. When the prior and the posterior are in the same family, we say that the prior is conjugate with respect to the model.

11.2 Example. Let $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. For simplicity, let us assume that $\sigma$ is known. Suppose we take as a prior $\theta \sim N(a, b^2)$. In problem 1 in the exercises it is shown that the posterior for $\theta$ is

$$\theta \mid X^n \sim N(\overline{\theta}, \tau^2) \qquad (11.7)$$

where

$$\overline{\theta} = w\overline{X} + (1-w)a, \qquad w = \frac{\frac{1}{\mathrm{se}^2}}{\frac{1}{\mathrm{se}^2} + \frac{1}{b^2}}, \qquad \frac{1}{\tau^2} = \frac{1}{\mathrm{se}^2} + \frac{1}{b^2},$$

and $\mathrm{se} = \sigma/\sqrt{n}$ is the standard error of the mle $\overline{X}$. This is another example of a conjugate prior. Note that $w \to 1$ and $\tau/\mathrm{se} \to 1$ as $n \to \infty$. So, for large $n$, the posterior is approximately $N(\widehat{\theta}, \mathrm{se}^2)$. The same is true if $n$ is fixed but $b \to \infty$, which corresponds to letting the prior become very flat.

Continuing with this example, let us find $C = (c, d)$ such that $P(\theta \in C \mid X^n) = .95$. We can do this by choosing $c$ and $d$ such that $P(\theta < c \mid X^n) = .025$ and $P(\theta > d \mid X^n) = .025$. So, we want to find $c$ such that

$$P(\theta < c \mid X^n) = P\left(\frac{\theta - \overline{\theta}}{\tau} < \frac{c - \overline{\theta}}{\tau} \,\Big|\, X^n\right) = P\left(Z < \frac{c - \overline{\theta}}{\tau}\right) = .025.$$

We know that $P(Z < -1.96) = .025$. So

$$\frac{c - \overline{\theta}}{\tau} = -1.96,$$

implying that $c = \overline{\theta} - 1.96\,\tau$. By similar arguments, $d = \overline{\theta} + 1.96\,\tau$. So a 95 percent Bayesian interval is $\overline{\theta} \pm 1.96\,\tau$. Since $\overline{\theta} \approx \widehat{\theta}$ and $\tau \approx \mathrm{se}$, the 95 percent Bayesian interval is approximated by $\widehat{\theta} \pm 1.96\,\mathrm{se}$, which is the frequentist confidence interval. ∎
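The following sketch of Example 11.2 uses illustrative values (prior mean a = 0, prior standard deviation b = 1, sigma = 1, and simulated data, none of which come from the text); it computes the posterior (11.7) and shows that the 95 percent Bayesian interval is close to the frequentist interval for moderate n.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, a, b = 1.0, 0.0, 1.0                      # known sigma; prior theta ~ N(a, b^2)
x = rng.normal(loc=2.0, scale=sigma, size=25)

n = len(x)
se = sigma / np.sqrt(n)                          # standard error of the mle, X-bar
w = (1 / se**2) / (1 / se**2 + 1 / b**2)
post_mean = w * x.mean() + (1 - w) * a           # theta-bar in (11.7)
post_sd = np.sqrt(1 / (1 / se**2 + 1 / b**2))    # tau in (11.7)

bayes_interval = (post_mean - 1.96 * post_sd, post_mean + 1.96 * post_sd)
freq_interval = (x.mean() - 1.96 * se, x.mean() + 1.96 * se)
print(bayes_interval, freq_interval)             # nearly identical here
```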

11.3 Functions of Parameters

How do we make inferences about a function $\tau = g(\theta)$? Remember in Chapter 3 we solved the following problem: given the density $f_X$ for $X$, find the density for $Y = g(X)$. We now simply apply the same reasoning. The posterior cdf for $\tau$ is

$$H(\tau \mid x^n) = P(g(\theta) \le \tau \mid x^n) = \int_A f(\theta \mid x^n)\, d\theta$$

where $A = \{\theta : g(\theta) \le \tau\}$. The posterior density is $h(\tau \mid x^n) = H'(\tau \mid x^n)$.

11.3 Example. Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$ and $f(p) = 1$ so that $p \mid X^n \sim \text{Beta}(s+1, n-s+1)$ with $s = \sum_{i=1}^{n} x_i$. Let $\psi = \log(p/(1-p))$. Then

$$H(\psi \mid x^n) = P(\Psi \le \psi \mid x^n) = P\left(\log\frac{P}{1-P} \le \psi \,\Big|\, x^n\right) = P\left(P \le \frac{e^\psi}{1 + e^\psi} \,\Big|\, x^n\right) = \int_0^{e^\psi/(1+e^\psi)} f(p \mid x^n)\, dp$$

and

$$h(\psi \mid x^n) = H'(\psi \mid x^n) = \frac{\Gamma(n+2)}{\Gamma(s+1)\Gamma(n-s+1)} \left(\frac{e^\psi}{1 + e^\psi}\right)^{s} \left(\frac{1}{1 + e^\psi}\right)^{n-s} \frac{e^\psi}{(1 + e^\psi)^2}$$

for $\psi \in \mathbb{R}$. ∎
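As a check on Example 11.3, the posterior density of $\psi$ can be evaluated numerically from the change-of-variables formula: the Beta(s + 1, n - s + 1) density at $p = e^\psi/(1 + e^\psi)$ times $dp/d\psi$. The values of n and s below are illustrative, not from the text.

```python
import numpy as np
from scipy import stats

n, s = 30, 12
psi = np.linspace(-4, 4, 801)
dpsi = psi[1] - psi[0]

p = np.exp(psi) / (1 + np.exp(psi))              # inverse transformation
dp_dpsi = np.exp(psi) / (1 + np.exp(psi))**2     # Jacobian of the transformation

h = stats.beta.pdf(p, s + 1, n - s + 1) * dp_dpsi   # h(psi | x^n)
print(h.sum() * dpsi)                            # integrates to about 1, as it should
```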

11.4 Simulation

The posterior can often be approximated by simulation. Suppose we draw $\theta_1, \ldots, \theta_B \sim p(\theta \mid x^n)$. Then a histogram of $\theta_1, \ldots, \theta_B$ approximates the posterior density $p(\theta \mid x^n)$. An approximation to the posterior mean $\overline{\theta}_n = E(\theta \mid x^n)$ is $B^{-1} \sum_{j=1}^{B} \theta_j$. The posterior $1 - \alpha$ interval can be approximated by $(\theta_{\alpha/2}, \theta_{1-\alpha/2})$ where $\theta_{\alpha/2}$ is the $\alpha/2$ sample quantile of $\theta_1, \ldots, \theta_B$.

Once we have a sample $\theta_1, \ldots, \theta_B$ from $f(\theta \mid x^n)$, let $\tau_i = g(\theta_i)$. Then $\tau_1, \ldots, \tau_B$ is a sample from $f(\tau \mid x^n)$. This avoids the need to do any analytical calculations. Simulation is discussed in more detail in Chapter 24.

11.4 Example. Consider again Example 11.3. We can approximate the posterior for $\psi$ without doing any calculus. Here are the steps (a code sketch of these steps appears after Section 11.5):

1. Draw $P_1, \ldots, P_B \sim \text{Beta}(s+1, n-s+1)$.

2. Let $\psi_i = \log(P_i/(1 - P_i))$ for $i = 1, \ldots, B$.

Now $\psi_1, \ldots, \psi_B$ are iid draws from $h(\psi \mid x^n)$. A histogram of these values provides an estimate of $h(\psi \mid x^n)$. ∎

11.5 Large Sample Properties of Bayes' Procedures

In the Bernoulli and Normal examples we saw that the posterior mean was close to the mle. This is true in greater generality.

11.5 Theorem. Let $\widehat{\theta}_n$ be the mle and let $\widehat{\mathrm{se}} = 1/\sqrt{n I(\widehat{\theta}_n)}$. Under appropriate regularity conditions, the posterior is approximately Normal with mean $\widehat{\theta}_n$ and standard deviation $\widehat{\mathrm{se}}$. Hence, $\overline{\theta}_n \approx \widehat{\theta}_n$. Also, if $C_n = (\widehat{\theta}_n - z_{\alpha/2}\, \widehat{\mathrm{se}},\; \widehat{\theta}_n + z_{\alpha/2}\, \widehat{\mathrm{se}})$ is the asymptotic frequentist $1 - \alpha$ confidence interval, then $C_n$ is also an approximate $1 - \alpha$ Bayesian posterior interval:

$$P(\theta \in C_n \mid X^n) \to 1 - \alpha.$$

There is also a Bayesian delta method. Let $\tau = g(\theta)$. Then

$$\tau \mid X^n \approx N(\widehat{\tau}, \widetilde{\mathrm{se}}^2)$$

where $\widehat{\tau} = g(\widehat{\theta})$ and $\widetilde{\mathrm{se}} = \widehat{\mathrm{se}}\, |g'(\widehat{\theta})|$.
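Here is a sketch tying together Example 11.4 and Theorem 11.5 (the values of n, s, and B are my own choices). It simulates the posterior of $\psi = \log(p/(1-p))$ under a flat prior on $p$ and compares the simulated 95 percent posterior interval with the Normal interval suggested by the Bayesian delta method.

```python
import numpy as np

rng = np.random.default_rng(2)
n, s, B = 200, 130, 100_000

p_draws = rng.beta(s + 1, n - s + 1, size=B)     # step 1 of Example 11.4
psi_draws = np.log(p_draws / (1 - p_draws))      # step 2 of Example 11.4
sim_interval = np.quantile(psi_draws, [0.025, 0.975])

p_hat = s / n                                    # mle of p
psi_hat = np.log(p_hat / (1 - p_hat))            # mle of psi
se_psi = 1 / np.sqrt(n * p_hat * (1 - p_hat))    # delta method: se * |g'(p_hat)|
normal_interval = psi_hat + np.array([-1.96, 1.96]) * se_psi

print(sim_interval, normal_interval)             # close, as Theorem 11.5 predicts
```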

11.6 Flat Priors, Improper Priors, and "Noninformative" Priors

An important question in Bayesian inference is: where does one get the prior $f(\theta)$? One school of thought, called subjectivism, says that the prior should reflect our subjective opinion about $\theta$ (before the data are collected). This may be possible in some cases but is impractical in complicated problems, especially if there are many parameters. Moreover, injecting subjective opinion into the analysis is contrary to the goal of making scientific inference as objective as possible. An alternative is to try to define some sort of "noninformative prior." An obvious candidate for a noninformative prior is to use a flat prior $f(\theta) \propto \text{constant}$.

In the Bernoulli example, taking $f(p) = 1$ leads to $p \mid X^n \sim \text{Beta}(s+1, n-s+1)$ as we saw earlier, which seemed very reasonable. But unfettered use of flat priors raises some questions.

Improper Priors. Let $X \sim N(\theta, \sigma^2)$ with $\sigma$ known. Suppose we adopt a flat prior $f(\theta) \propto c$ where $c > 0$ is a constant. Note that $\int f(\theta)\, d\theta = \infty$ so this is not a probability density in the usual sense. We call such a prior an improper prior. Nonetheless, we can still formally carry out Bayes' theorem and compute the posterior density by multiplying the prior and the likelihood: $f(\theta \mid x^n) \propto L_n(\theta) f(\theta) \propto L_n(\theta)$. This gives $\theta \mid X^n \sim N(\overline{X}, \sigma^2/n)$ and the resulting point and interval estimators agree exactly with their frequentist counterparts. In general, improper priors are not a problem as long as the resulting posterior is a well-defined probability distribution.

Flat Priors are Not Invariant. Let $X \sim \text{Bernoulli}(p)$ and suppose we use the flat prior $f(p) = 1$. This flat prior presumably represents our lack of information about $p$ before the experiment. Now let $\psi = \log(p/(1-p))$. This is a transformation of $p$ and we can compute the resulting distribution for $\psi$, namely,

$$f_\Psi(\psi) = \frac{e^\psi}{(1 + e^\psi)^2}$$

which is not flat. But if we are ignorant about $p$ then we are also ignorant about $\psi$ so we should use a flat prior for $\psi$. This is a contradiction. In short, the notion of a flat prior is not well defined because a flat prior on a parameter does not imply a flat prior on a transformed version of the parameter. Flat priors are not transformation invariant.

Jeffreys' Prior. Jeffreys came up with a rule for creating priors. The rule is: take

$$f(\theta) \propto I(\theta)^{1/2}$$

where $I(\theta)$ is the Fisher information function. This rule turns out to be transformation invariant. There are various reasons for thinking that this prior might be a useful prior but we will not go into details here.
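The non-invariance of flat priors is easy to see by simulation. The following sketch (mine, not from the text) draws $p$ from a Uniform(0, 1) prior and plots the induced distribution of $\psi = \log(p/(1-p))$ against the density $e^\psi/(1 + e^\psi)^2$ derived above.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
p = rng.uniform(0, 1, size=200_000)              # flat prior on p
psi = np.log(p / (1 - p))                        # induced draws of psi

grid = np.linspace(-6, 6, 400)
induced_density = np.exp(grid) / (1 + np.exp(grid))**2

plt.hist(psi, bins=100, density=True, alpha=0.5, label="draws of psi")
plt.plot(grid, induced_density, label="e^psi / (1 + e^psi)^2")
plt.legend()
plt.show()                                       # clearly not flat
```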

11.6 Example. Consider the Bernoulli($p$) model. Recall that

$$I(p) = \frac{1}{p(1-p)}.$$

Jeffreys' rule says to use the prior

$$f(p) \propto \sqrt{I(p)} = p^{-1/2}(1-p)^{-1/2}.$$

This is a Beta(1/2, 1/2) density. This is very close to a uniform density. ∎

In a multiparameter problem, the Jeffreys' prior is defined to be $f(\theta) \propto \sqrt{|I(\theta)|}$, where $|A|$ denotes the determinant of a matrix $A$ and $I(\theta)$ is the Fisher information matrix.

11.7 Multiparameter Problems

Suppose that $\theta = (\theta_1, \ldots, \theta_p)$. The posterior density is still given by

$$f(\theta \mid x^n) \propto L_n(\theta) f(\theta). \qquad (11.8)$$

The question now arises of how to extract inferences about one parameter. The key is to find the marginal posterior density for the parameter of interest. Suppose we want to make inferences about $\theta_1$. The marginal posterior for $\theta_1$ is

$$f(\theta_1 \mid x^n) = \int \cdots \int f(\theta_1, \ldots, \theta_p \mid x^n)\, d\theta_2 \ldots d\theta_p. \qquad (11.9)$$

In practice, it might not be feasible to do this integral. Simulation can help. Draw randomly from the posterior:

$$\theta^1, \ldots, \theta^B \sim f(\theta \mid x^n)$$

where the superscripts index the different draws. Each $\theta^j$ is a vector $\theta^j = (\theta^j_1, \ldots, \theta^j_p)$. Now collect together the first component of each draw:

$$\theta^1_1, \ldots, \theta^B_1.$$

These are a sample from $f(\theta_1 \mid x^n)$ and we have avoided doing any integrals.

11.7 Example (Comparing Two Binomials). Suppose we have $n_1$ control patients and $n_2$ treatment patients and that $X_1$ control patients survive while $X_2$ treatment patients survive. We want to estimate $\tau = g(p_1, p_2) = p_2 - p_1$. Then,

$$X_1 \sim \text{Binomial}(n_1, p_1) \quad \text{and} \quad X_2 \sim \text{Binomial}(n_2, p_2).$$

If $f(p_1, p_2) = 1$, the posterior is

$$f(p_1, p_2 \mid x_1, x_2) \propto p_1^{x_1}(1-p_1)^{n_1 - x_1}\, p_2^{x_2}(1-p_2)^{n_2 - x_2}.$$

Notice that $(p_1, p_2)$ live on a rectangle (a square, actually) and that

$$f(p_1, p_2 \mid x_1, x_2) = f(p_1 \mid x_1)\, f(p_2 \mid x_2)$$

where

$$f(p_1 \mid x_1) \propto p_1^{x_1}(1-p_1)^{n_1 - x_1} \quad \text{and} \quad f(p_2 \mid x_2) \propto p_2^{x_2}(1-p_2)^{n_2 - x_2},$$

which implies that $p_1$ and $p_2$ are independent under the posterior. Also,

$$p_1 \mid x_1 \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1) \quad \text{and} \quad p_2 \mid x_2 \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1).$$

If we simulate $P_{1,1}, \ldots, P_{1,B} \sim \text{Beta}(x_1 + 1, n_1 - x_1 + 1)$ and $P_{2,1}, \ldots, P_{2,B} \sim \text{Beta}(x_2 + 1, n_2 - x_2 + 1)$, then $\tau_b = P_{2,b} - P_{1,b}$, $b = 1, \ldots, B$, is a sample from $f(\tau \mid x_1, x_2)$. ∎
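A sketch of the simulation in Example 11.7; the counts n1, x1, n2, x2 and the number of draws B below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n1, x1 = 40, 22      # control patients and survivors (illustrative)
n2, x2 = 40, 31      # treatment patients and survivors (illustrative)
B = 100_000

p1_draws = rng.beta(x1 + 1, n1 - x1 + 1, size=B)
p2_draws = rng.beta(x2 + 1, n2 - x2 + 1, size=B)
tau_draws = p2_draws - p1_draws                  # sample from f(tau | x1, x2)

print(tau_draws.mean())                          # posterior mean of tau
print(np.quantile(tau_draws, [0.025, 0.975]))    # 95 percent posterior interval
```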

11.8 Bayesian Testing

Hypothesis testing from a Bayesian point of view is a complex topic. We will only give a brief sketch of the main idea here. The Bayesian approach to testing involves putting a prior on $H_0$ and on the parameter $\theta$ and then computing $P(H_0 \mid X^n)$. Consider the case where $\theta$ is scalar and we are testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0.$$

It is usually reasonable to use the prior $P(H_0) = P(H_1) = 1/2$ (although this is not essential in what follows). Under $H_1$ we need a prior for $\theta$. Denote this prior density by $f(\theta)$. From Bayes' theorem,

$$P(H_0 \mid X^n = x^n) = \frac{f(x^n \mid H_0) P(H_0)}{f(x^n \mid H_0) P(H_0) + f(x^n \mid H_1) P(H_1)} = \frac{\frac{1}{2} f(x^n \mid \theta_0)}{\frac{1}{2} f(x^n \mid \theta_0) + \frac{1}{2} f(x^n \mid H_1)} = \frac{f(x^n \mid \theta_0)}{f(x^n \mid \theta_0) + \int f(x^n \mid \theta) f(\theta)\, d\theta} = \frac{L(\theta_0)}{L(\theta_0) + \int L(\theta) f(\theta)\, d\theta}.$$

We saw that, in estimation problems, the prior was not very influential and that the frequentist and Bayesian methods gave similar answers. This is not the case in hypothesis testing. Also, one can't use improper priors in testing because this leads to an undefined constant in the denominator of the expression above. Thus, if you use Bayesian testing you must choose the prior $f(\theta)$ very carefully.

It is possible to get a prior-free bound on $P(H_0 \mid X^n = x^n)$. Notice that $0 < \int L(\theta) f(\theta)\, d\theta \le L(\widehat{\theta})$. Hence,

$$\frac{L(\theta_0)}{L(\theta_0) + L(\widehat{\theta})} \le P(H_0 \mid X^n = x^n) < 1.$$

The upper bound is not very interesting, but the lower bound is non-trivial.
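A numerical sketch (mine, not from the text) of the formula above for a Normal model: $X_1, \ldots, X_n \sim N(\theta, 1)$, $H_0 : \theta = 0$, and a $N(0, b^2)$ prior under $H_1$. The integral in the denominator is approximated on a grid; the values of n and b and the simulated data are arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n, b = 100, 1.0
x = rng.normal(loc=0.2, scale=1.0, size=n)

def log_lik(theta):
    # log L(theta), up to an additive constant that cancels in the ratio below
    return -0.5 * np.sum((x[:, None] - theta) ** 2, axis=0)

grid = np.linspace(-6 * b, 6 * b, 2001)
dgrid = grid[1] - grid[0]

shift = log_lik(np.array([0.0]))[0]                       # common factor, cancels
lik_null = 1.0                                            # exp(log L(0) - shift)
marginal = np.sum(np.exp(log_lik(grid) - shift) * stats.norm.pdf(grid, 0, b)) * dgrid

post_H0 = lik_null / (lik_null + marginal)                # P(H0 | x^n)
print(post_H0)
```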

11.9 Strengths and Weaknesses of Bayesian Inference

Bayesian inference is appealing when prior information is available since Bayes' theorem is a natural way to combine prior information with data. Some people find Bayesian inference psychologically appealing because it allows us to make probability statements about parameters. In contrast, frequentist inference provides confidence sets $C_n$ which trap the parameter 95 percent of the time, but we cannot say that $P(\theta \in C_n \mid X^n)$ is .95. In the frequentist approach we can make probability statements about $C_n$, not $\theta$. However, psychological appeal is not a compelling scientific argument for using one type of inference over another.

In parametric models, with large samples, Bayesian and frequentist methods give approximately the same inferences. In general, they need not agree. Here are three examples that illustrate the strengths and weaknesses of Bayesian inference. The first example is Example 6.14 revisited. This example shows the psychological appeal of Bayesian inference. The second and third show that Bayesian methods can fail.

11.8 Example (Example 6.14 revisited). We begin by reviewing the example. Let $\theta$ be a fixed, known real number and let $X_1, X_2$ be independent random variables such that $P(X_i = 1) = P(X_i = -1) = 1/2$. Now define $Y_i = \theta + X_i$ and suppose that you only observe $Y_1$ and $Y_2$. Let

$$C = \begin{cases} \{Y_1 - 1\} & \text{if } Y_1 = Y_2 \\ \{(Y_1 + Y_2)/2\} & \text{if } Y_1 \neq Y_2. \end{cases}$$

This is a 75 percent confidence set since, no matter what $\theta$ is, $P_\theta(\theta \in C) = 3/4$. Suppose we observe $Y_1 = 15$ and $Y_2 = 17$. Then our 75 percent confidence interval is $\{16\}$. However, we are certain, in this case, that $\theta = 16$. So calling this a 75 percent confidence set bothers many people. Nonetheless, $C$ is a valid 75 percent confidence set. It will trap the true value 75 percent of the time.

The Bayesian solution is more satisfying to many. For simplicity, assume that $\theta$ is an integer. Let $f(\theta)$ be a prior mass function such that $f(\theta) > 0$ for every integer $\theta$. When $Y = (Y_1, Y_2) = (15, 17)$, the likelihood function is

$$L(\theta) = \begin{cases} 1/4 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Applying Bayes' theorem we see that

$$P(\Theta = \theta \mid Y = (15, 17)) = \begin{cases} 1 & \theta = 16 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, $P(\theta \in C \mid Y = (15, 17)) = 1$. There is nothing wrong with saying that $\{16\}$ is a 75 percent confidence interval. But it is not a probability statement about $\theta$. ∎

11.9 Example. This is a simplified version of the example in Robins and Ritov (1997). The data consist of $n$ iid triples

$$(X_1, R_1, Y_1), \ldots, (X_n, R_n, Y_n).$$

Let $B$ be a finite but very large number, like $B = 100^{100}$. Any realistic sample size $n$ will be small compared to $B$. Let

$$\theta = (\theta_1, \ldots, \theta_B)$$

be a vector of unknown parameters such that $0 \le \theta_j \le 1$ for $1 \le j \le B$. Let

$$\xi = (\xi_1, \ldots, \xi_B)$$

be a vector of known numbers such that

$$0 < \delta \le \xi_j \le 1 - \delta < 1, \qquad 1 \le j \le B,$$

where $\delta$ is some small, positive number. Each data point $(X_i, R_i, Y_i)$ is drawn in the following way:

1. Draw $X_i$ uniformly from $\{1, \ldots, B\}$.

2. Draw $R_i \sim \text{Bernoulli}(\xi_{X_i})$.

3. If $R_i = 1$, then draw $Y_i \sim \text{Bernoulli}(\theta_{X_i})$. If $R_i = 0$, do not draw $Y_i$.

The model may seem a little artificial but, in fact, it is a caricature of some real missing data problems in which some data points are not observed. In this example, $R_i = 0$ can be thought of as meaning "missing." Our goal is to estimate

$$\psi = P(Y_i = 1).$$

Note that

$$\psi = P(Y_i = 1) = \sum_{j=1}^{B} P(Y_i = 1 \mid X = j)\, P(X = j) = \frac{1}{B} \sum_{j=1}^{B} \theta_j \equiv g(\theta)$$

so $\psi = g(\theta)$ is a function of $\theta$.

Let us consider a Bayesian analysis first. The likelihood of a single observation is

$$f(X_i, R_i, Y_i) = f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i}.$$

The last term is raised to the power $R_i$ since, if $R_i = 0$, then $Y_i$ is not observed and hence that term drops out of the likelihood. Since $f(X_i) = 1/B$ and $Y_i$ and $R_i$ are Bernoulli,

$$f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i} = \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.$$

Thus, the likelihood function is

$$L(\theta) = \prod_{i=1}^{n} f(X_i)\, f(R_i \mid X_i)\, f(Y_i \mid X_i)^{R_i} = \prod_{i=1}^{n} \frac{1}{B}\, \xi_{X_i}^{R_i} (1 - \xi_{X_i})^{1 - R_i}\, \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i} \propto \prod_{i=1}^{n} \theta_{X_i}^{Y_i R_i} (1 - \theta_{X_i})^{(1 - Y_i) R_i}.$$

We have dropped all the terms involving $B$ and the $\xi_j$'s since these are known constants, not parameters. The log-likelihood is

$$\ell(\theta) = \sum_{i=1}^{n} \Big[ Y_i R_i \log \theta_{X_i} + (1 - Y_i) R_i \log(1 - \theta_{X_i}) \Big] = \sum_{j=1}^{B} n_j \log \theta_j + \sum_{j=1}^{B} m_j \log(1 - \theta_j)$$

where

$$n_j = \#\{i : Y_i = 1, R_i = 1, X_i = j\}, \qquad m_j = \#\{i : Y_i = 0, R_i = 1, X_i = j\}.$$

Now, $n_j = m_j = 0$ for most $j$ since $B$ is so much larger than $n$. This has several implications. First, the mle for most $\theta_j$ is not defined. Second, for most $\theta_j$, the posterior distribution is equal to the prior distribution, since those $\theta_j$ do not appear in the likelihood. Hence, $f(\theta \mid \text{Data}) \approx f(\theta)$. It follows that $f(\psi \mid \text{Data}) \approx f(\psi)$. In other words, the data provide little information about $\psi$ in a Bayesian analysis.

Now we consider a frequentist solution. Define

$$\widehat{\psi} = \frac{1}{n} \sum_{i=1}^{n} \frac{R_i Y_i}{\xi_{X_i}}. \qquad (11.10)$$

We will now show that this estimator is unbiased and has small mean-squared error. It can be shown (see Exercise 7) that

$$E(\widehat{\psi}) = \psi \quad \text{and} \quad V(\widehat{\psi}) \le \frac{1}{n\delta^2}. \qquad (11.11)$$

Therefore, the mse is of order $1/n$, which goes to 0 fairly quickly as we collect more data, no matter how large $B$ is. The estimator defined in (11.10) is called the Horwitz-Thompson estimator. It cannot be derived from a Bayesian or likelihood point of view since it involves the terms $\xi_{X_i}$. These terms drop out of the log-likelihood and hence will not show up in any likelihood-based method, including Bayesian estimators.

The moral of the story is this. Bayesian methods are tied to the likelihood function. But in high dimensional (and nonparametric) problems, the likelihood may not yield accurate inferences. ∎
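A small simulation sketch of the Horwitz-Thompson estimator (11.10); everything here (B, n, delta, and the randomly chosen theta and xi vectors) is illustrative. B is kept modest only so the code runs quickly; the accuracy of psi-hat is governed by n and delta, not by B.

```python
import numpy as np

rng = np.random.default_rng(6)
B, n, delta = 10_000, 500, 0.1
theta = rng.uniform(0, 1, size=B)              # unknown parameters theta_1, ..., theta_B
xi = rng.uniform(delta, 1 - delta, size=B)     # known selection probabilities
psi = theta.mean()                             # estimand: psi = P(Y = 1)

X = rng.integers(0, B, size=n)                 # step 1: X_i uniform on {1, ..., B}
R = rng.binomial(1, xi[X])                     # step 2: R_i ~ Bernoulli(xi_{X_i})
Y = rng.binomial(1, theta[X]) * R              # step 3: Y_i recorded only when R_i = 1

psi_hat = np.mean(R * Y / xi[X])               # Horwitz-Thompson estimator (11.10)
print(psi, psi_hat)                            # close, even though B >> n
```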

11.10 Example. Suppose that $f$ is a probability density function and that

$$f(x) = c\, g(x)$$

where $g(x) > 0$ is a known function and $c$ is unknown. In principle we can compute $c$ since $\int f(x)\, dx = 1$ implies that $c = 1/\int g(x)\, dx$. But in many cases we can't do the integral $\int g(x)\, dx$ since $g$ might be a complicated function and $x$ could be high dimensional. Despite the fact that $c$ is not known, it is often possible to draw a sample $X_1, \ldots, X_n$ from $f$; see Chapter 24. Can we use the sample to estimate the normalizing constant $c$?

Here is a frequentist solution: Let $\widehat{f}_n(x)$ be a consistent estimate of the density $f$. Chapter 20 explains how to construct such an estimate. Choose any point $x$ and note that $c = f(x)/g(x)$. Hence, $\widehat{c} = \widehat{f}_n(x)/g(x)$ is a consistent estimate of $c$.

Now let us try to solve this problem from a Bayesian approach. Let $\pi(c)$ be a prior such that $\pi(c) > 0$ for all $c > 0$. The likelihood function is

$$L_n(c) = \prod_{i=1}^{n} f(X_i) = \prod_{i=1}^{n} c\, g(X_i) = c^n \prod_{i=1}^{n} g(X_i) \propto c^n.$$

Hence the posterior is proportional to $c^n \pi(c)$. The posterior does not depend on $X_1, \ldots, X_n$, so we come to the startling conclusion that, from the Bayesian point of view, there is no information in the data about $c$. Moreover, the posterior mean is

$$\frac{\int_0^\infty c^{n+1} \pi(c)\, dc}{\int_0^\infty c^{n} \pi(c)\, dc}$$

which tends to infinity as $n$ increases. ∎

These last two examples illustrate an important point. Bayesians are slaves to the likelihood function. When the likelihood goes awry, so will Bayesian inference.

What should we conclude from all this? The important thing is to understand that frequentist and Bayesian methods are answering different questions. To combine prior beliefs with data in a principled way, use Bayesian inference. To construct procedures with guaranteed long run performance, such as confidence intervals, use frequentist methods. Generally, Bayesian methods run into problems when the parameter space is high dimensional. In particular, 95 percent posterior intervals need not contain the true value 95 percent of the time (in the frequency sense).

11.10 Bibliographic Remarks

Some references on Bayesian inference include Carlin and Louis (1996), Gelman et al. (1995), Lee (1997), Robert (1994), and Schervish (1995). See Cox (1993), Diaconis and Freedman (1986), Freedman (1999), Barron et al. (1999), Ghosal et al. (2000), Shen and Wasserman (2001), and Zhao (2000) for discussions of some of the technicalities of nonparametric Bayesian inference. The Robins-Ritov example is discussed in detail in Robins and Ritov (1997), where it is cast more properly as a nonparametric problem. Example 11.10 is due to Edward George (personal communication).

See Berger and Delampady (1987) and Kass and Raftery (1995) for a discussion of Bayesian testing. See Kass and Wasserman (1996) for a discussion of noninformative priors.

11.11 Appendix

Proof of Theorem 11.5. It can be shown that the effect of the prior diminishes as $n$ increases so that $f(\theta \mid X^n) \propto L_n(\theta) f(\theta) \approx L_n(\theta)$. Hence, $\log f(\theta \mid X^n) \approx \ell(\theta)$. Now,

$$\ell(\theta) \approx \ell(\widehat{\theta}) + (\theta - \widehat{\theta})\, \ell'(\widehat{\theta}) + \frac{(\theta - \widehat{\theta})^2}{2}\, \ell''(\widehat{\theta}) = \ell(\widehat{\theta}) + \frac{(\theta - \widehat{\theta})^2}{2}\, \ell''(\widehat{\theta})$$

since $\ell'(\widehat{\theta}) = 0$. Exponentiating, we get approximately that

$$f(\theta \mid X^n) \propto \exp\left\{ -\frac{1}{2} \frac{(\theta - \widehat{\theta})^2}{\sigma_n^2} \right\}$$

where $\sigma_n^2 = -1/\ell''(\widehat{\theta}_n)$. So the posterior of $\theta$ is approximately Normal with mean $\widehat{\theta}$ and variance $\sigma_n^2$. Let $\ell_i = \log f(X_i \mid \theta)$; then

$$\frac{1}{\sigma_n^2} = -\ell''(\widehat{\theta}_n) = \sum_i \left\{ -\ell_i''(\widehat{\theta}_n) \right\} = n\, \frac{1}{n} \sum_i \left\{ -\ell_i''(\widehat{\theta}_n) \right\} \approx n\, E_\theta\left\{ -\ell_i''(\widehat{\theta}_n) \right\} = n I(\widehat{\theta}_n)$$

and hence $\sigma_n \approx \widehat{\mathrm{se}}(\widehat{\theta})$. ∎

11.12 Exercises

1. Verify (11.7).

2. Let $X_1, \ldots, X_n \sim \text{Normal}(\mu, 1)$.

(a) Simulate a data set (using $\mu = 5$) consisting of $n = 100$ observations.

(b) Take $f(\mu) = 1$ and find the posterior density. Plot the density.

(c) Simulate 1,000 draws from the posterior. Plot a histogram of the simulated values and compare the histogram to the answer in (b).

(d) Let $\theta = e^\mu$. Find the posterior density for $\theta$ analytically and by simulation.

(e) Find a 95 percent posterior interval for $\mu$.

(f) Find a 95 percent confidence interval for $\theta$.

3. Let $X_1, \ldots, X_n \sim \text{Uniform}(0, \theta)$. Let $f(\theta) \propto 1/\theta$. Find the posterior density.

4. Suppose that 50 people are given a placebo and 50 are given a new treatment. 30 placebo patients show improvement while 40 treated patients show improvement. Let $\tau = p_2 - p_1$ where $p_2$ is the probability of improving under treatment and $p_1$ is the probability of improving under placebo.

(a) Find the mle of $\tau$. Find the standard error and 90 percent confidence interval using the delta method.

(b) Find the standard error and 90 percent confidence interval using the parametric bootstrap.

(c) Use the prior $f(p_1, p_2) = 1$. Use simulation to find the posterior mean and posterior 90 percent interval for $\tau$.

(d) Let

$$\psi = \log\left( \left(\frac{p_1}{1 - p_1}\right) \Big/ \left(\frac{p_2}{1 - p_2}\right) \right)$$

be the log-odds ratio. Note that $\psi = 0$ if $p_1 = p_2$. Find the mle of $\psi$. Use the delta method to find a 90 percent confidence interval for $\psi$.

(e) Use simulation to find the posterior mean and posterior 90 percent interval for $\psi$.

5. Consider the Bernoulli($p$) observations

0 1 0 1 0 0 0 0 0 0

Plot the posterior for $p$ using these priors: Beta(1/2, 1/2), Beta(1, 1), Beta(10, 10), Beta(100, 100).

6. Let $X_1, \ldots, X_n \sim \text{Poisson}(\lambda)$.

(a) Let $\lambda \sim \text{Gamma}(\alpha, \beta)$ be the prior. Show that the posterior is also a Gamma. Find the posterior mean.

(b) Find the Jeffreys' prior. Find the posterior.

7. In Example 11.9, verify (11.11).

8. Let $X \sim N(\mu, 1)$. Consider testing

$$H_0 : \mu = 0 \quad \text{versus} \quad H_1 : \mu \neq 0.$$

Take $P(H_0) = P(H_1) = 1/2$. Let the prior for $\mu$ under $H_1$ be $\mu \sim N(0, b^2)$. Find an expression for $P(H_0 \mid X = x)$. Compare $P(H_0 \mid X = x)$ to the p-value of the Wald test. Do the comparison numerically for a variety of values of $x$ and $b$. Now repeat the problem using a sample of size $n$. You will see that the posterior probability of $H_0$ can be large even when the p-value is small, especially when $n$ is large. This disagreement between Bayesian and frequentist testing is called the Jeffreys-Lindley paradox.
