Probabilistic Graphical Models - Daniel J. Saunders

Transcription

Probabilistic Graphical Models

Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors

Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J. Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, 2nd ed., Peter Spirtes, Clark Glymour, and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, 2nd ed., Pierre Baldi and Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Bernhard Schölkopf and Alexander J. Smola
Introduction to Machine Learning, Ethem Alpaydin
Gaussian Processes for Machine Learning, Carl Edward Rasmussen and Christopher K. I. Williams
Semi-Supervised Learning, Olivier Chapelle, Bernhard Schölkopf, and Alexander Zien, eds.
The Minimum Description Length Principle, Peter D. Grünwald
Introduction to Statistical Relational Learning, Lise Getoor and Ben Taskar, eds.
Probabilistic Graphical Models: Principles and Techniques, Daphne Koller and Nir Friedman

Probabilistic Graphical Models
Principles and Techniques

Daphne Koller
Nir Friedman

The MIT Press
Cambridge, Massachusetts
London, England

© 2009 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic or mechanical means (including photocopying, recording, or information storage and retrieval) without permission in writing from the publisher.
For information about special quantity discounts, please email special_sales@mitpress.mit.edu
This book was set by the authors in LaTeX2e.
Printed and bound in the United States of America.

Library of Congress Cataloging-in-Publication Data
Koller, Daphne.
Probabilistic Graphical Models: Principles and Techniques / Daphne Koller and Nir Friedman.
p. cm. – (Adaptive computation and machine learning)
Includes bibliographical references and index.
ISBN 978-0-262-01319-2 (hardcover : alk. paper)
1. Graphical modeling (Statistics) 2. Bayesian statistical decision theory—Graphic methods. I. Koller, Daphne. II. Friedman, Nir.
QA279.5.K65 2010
519.5'420285–dc22
2009008615
10 9 8 7 6 5

To our families

my parents Dov and Ditza
my husband Dan
my daughters Natalie and Maya
D.K.

my parents Noga and Gad
my wife Yael
my children Roy and Lior
N.F.

As far as the laws of mathematics refer to reality, they are not certain, as far as they are certain, they do not refer to reality.
Albert Einstein, 1921

When we try to pick out anything by itself, we find that it is bound fast by a thousand invisible cords that cannot be broken, to everything in the universe.
John Muir, 1869

The actual science of logic is conversant at present only with things either certain, impossible, or entirely doubtful . . . Therefore the true logic for this world is the calculus of probabilities, which takes account of the magnitude of the probability which is, or ought to be, in a reasonable man's mind.
James Clerk Maxwell, 1850

The theory of probabilities is at bottom nothing but common sense reduced to calculus; it enables us to appreciate with exactness that which accurate minds feel with a sort of instinct for which ofttimes they are unable to account.
Pierre Simon Laplace, 1819

Misunderstanding of probability may be the greatest of all impediments to scientific literacy.
Stephen Jay Gould

Contents

Acknowledgments
List of Figures
List of Algorithms
List of Boxes

1 Introduction
1.1 Motivation
1.2 Structured Probabilistic Models
1.2.1 Probabilistic Graphical Models
1.2.2 Representation, Inference, Learning
1.3 Overview and Roadmap
1.3.1 Overview of Chapters
1.3.2 Reader's Guide
1.3.3 Connection to Other Disciplines
1.4 Historical Notes

2 Foundations
2.1 Probability Theory
2.1.1 Probability Distributions
2.1.2 Basic Concepts in Probability
2.1.3 Random Variables and Joint Distributions
2.1.4 Independence and Conditional Independence
2.1.5 Querying a Distribution
2.1.6 Continuous Spaces
2.1.7 Expectation and Variance
2.2 Graphs
2.2.1 Nodes and Edges
2.2.2 Subgraphs
2.2.3 Paths and Trails

2.2.4 Cycles and Loops
2.3 Relevant Literature
2.4 Exercises

I Representation

3 The Bayesian Network Representation
3.1 Exploiting Independence Properties
3.1.1 Independent Random Variables
3.1.2 The Conditional Parameterization
3.1.3 The Naive Bayes Model
3.2 Bayesian Networks
3.2.1 The Student Example Revisited
3.2.2 Basic Independencies in Bayesian Networks
3.2.3 Graphs and Distributions
3.3 Independencies in Graphs
3.3.1 D-separation
3.3.2 Soundness and Completeness
3.3.3 An Algorithm for d-Separation
3.3.4 I-Equivalence
3.4 From Distributions to Graphs
3.4.1 Minimal I-Maps
3.4.2 Perfect Maps
3.4.3 Finding Perfect Maps ★
3.5 Summary
3.6 Relevant Literature
3.7 Exercises

4 Undirected Graphical Models
4.1 The Misconception Example
4.2 Parameterization
4.2.1 Factors
4.2.2 Gibbs Distributions and Markov Networks
4.2.3 Reduced Markov Networks
4.3 Markov Network Independencies
4.3.1 Basic Independencies
4.3.2 Independencies Revisited
4.3.3 From Distributions to Graphs
4.4 Parameterization Revisited
4.4.1 Finer-Grained Parameterization
4.5 Bayesian Networks and Markov Networks
4.5.1 From Bayesian Networks to Markov Networks
4.5.2 From Markov Networks to Bayesian Networks

4.5.3 Chordal Graphs
4.6 Partially Directed Models
4.6.1 Conditional Random Fields
4.6.2 Chain Graph Models ★
4.7 Summary and Discussion
4.8 Relevant Literature
4.9 Exercises

5 Local Probabilistic Models
5.1 Tabular CPDs
5.2 Deterministic CPDs
5.3 Context-Specific CPDs
5.4 Independence of Causal Influence
5.4.1 The Noisy-Or Model
5.4.2 Generalized Linear Models
5.4.3 The General Formulation
5.4.4 Independencies
5.5 Continuous Variables
5.5.1 Hybrid Models
5.6 Conditional Bayesian Networks
5.7 Summary
5.8 Relevant Literature
5.9 Exercises

6 Template-Based Representations
6.1 Introduction
6.2 Temporal Models
6.2.1 Basic Assumptions
6.2.2 Dynamic Bayesian Networks
6.2.3 State-Observation Models
6.3 Template Variables and Template Factors
6.4 Directed Probabilistic Models for Object-Relational Domains
6.4.1 Plate Models
6.4.2 Probabilistic Relational Models
6.5 Undirected Representation
6.6 Structural Uncertainty ★
6.6.1 Relational Uncertainty
6.6.2 Object Uncertainty
6.7 Summary
6.8 Relevant Literature
6.9 Exercises

7 Gaussian Network Models
7.1 Multivariate Gaussians
7.1.1 Basic Parameterization
7.1.2 Operations on Gaussians
7.1.3 Independencies in Gaussians
7.2 Gaussian Bayesian Networks
7.3 Gaussian Markov Random Fields
7.4 Summary
7.5 Relevant Literature
7.6 Exercises

8 The Exponential Family
8.1 Introduction
8.2 Exponential Families
8.2.1 Linear Exponential Families
8.3 Factored Exponential Families
8.3.1 Product Distributions
8.3.2 Bayesian Networks
8.4 Entropy and Relative Entropy
8.4.1 Entropy
8.4.2 Relative Entropy
8.5 Projections
8.6 Summary
8.7 Relevant Literature
8.8 Exercises

II Inference

9 Exact Inference: Variable Elimination
9.1 Analysis of Complexity
9.1.1 Analysis of Exact Inference
9.1.2 Analysis of Approximate Inference
9.2 Variable Elimination: The Basic Ideas
9.3 Variable Elimination
9.3.1 Basic Elimination
9.3.2 Dealing with Evidence
9.4 Complexity and Graph Structure: Variable Elimination
9.4.1 Simple Analysis
9.4.2 Graph-Theoretic Analysis
9.4.3 Finding Elimination Orderings ★
9.5 Conditioning ★

9.5.1 The Conditioning Algorithm
9.5.2 Conditioning and Variable Elimination
9.5.3 Graph-Theoretic Analysis
9.5.4 Improved Conditioning
9.6 Inference with Structured CPDs ★
9.6.1 Independence of Causal Influence
9.6.2 Context-Specific Independence
9.6.3 Discussion
9.7 Summary and Discussion
9.8 Relevant Literature
9.9 Exercises

10 Exact Inference: Clique Trees
10.1 Variable Elimination and Clique Trees
10.1.1 Cluster Graphs
10.1.2 Clique Trees
10.2 Message Passing: Sum Product
10.2.1 Variable Elimination in a Clique Tree
10.2.2 Clique Tree Calibration
10.2.3 A Calibrated Clique Tree as a Distribution
10.3 Message Passing: Belief Update
10.3.1 Message Passing with Division
10.3.2 Equivalence of Sum-Product and Belief Update Messages
10.3.3 Answering Queries
10.4 Constructing a Clique Tree
10.4.1 Clique Trees from Variable Elimination
10.4.2 Clique Trees from Chordal Graphs
10.5 Summary
10.6 Relevant Literature
10.7 Exercises

11 Inference as Optimization
11.1 Introduction
11.1.1 Exact Inference Revisited ★
11.1.2 The Energy Functional
11.1.3 Optimizing the Energy Functional
11.2 Exact Inference as Optimization
11.2.1 Fixed-Point Characterization
11.2.2 Inference as Optimization
11.3 Propagation-Based Approximation
11.3.1 A Simple Example
11.3.2 Cluster-Graph Belief Propagation
11.3.3 Properties of Cluster-Graph Belief Propagation
11.3.4 Analyzing Convergence ★
11.3.5 Constructing Cluster Graphs

11.3.6 Variational Analysis
11.3.7 Other Entropy Approximations ★
11.3.8 Discussion
11.4 Propagation with Approximate Messages ★
11.4.1 Factorized Messages
11.4.2 Approximate Message Computation
11.4.3 Inference with Approximate Messages
11.4.4 Expectation Propagation
11.4.5 Variational Analysis
11.4.6 Discussion
11.5 Structured Variational Approximations
11.5.1 The Mean Field Approximation
11.5.2 Structured Approximations
11.5.3 Local Variational Methods ★
11.6 Summary and Discussion
11.7 Relevant Literature
11.8 Exercises

12 Particle-Based Approximate Inference
12.1 Forward Sampling
12.1.1 Sampling from a Bayesian Network
12.1.2 Analysis of Error
12.1.3 Conditional Probability Queries
12.2 Likelihood Weighting and Importance Sampling
12.2.1 Likelihood Weighting: Intuition
12.2.2 Importance Sampling
12.2.3 Importance Sampling for Bayesian Networks
12.2.4 Importance Sampling Revisited
12.3 Markov Chain Monte Carlo Methods
12.3.1 Gibbs Sampling Algorithm
12.3.2 Markov Chains
12.3.3 Gibbs Sampling Revisited
12.3.4 A Broader Class of Markov Chains ★
12.3.5 Using a Markov Chain
12.4 Collapsed Particles
12.4.1 Collapsed Likelihood Weighting ★
12.4.2 Collapsed MCMC
12.5 Deterministic Search Methods ★
12.6 Summary
12.7 Relevant Literature
12.8 Exercises

13 MAP Inference
13.1 Overview
13.1.1 Computational Complexity

13.1.2 Overview of Solution Methods
13.2 Variable Elimination for (Marginal) MAP
13.2.1 Max-Product Variable Elimination
13.2.2 Finding the Most Probable Assignment
13.2.3 Variable Elimination for Marginal MAP ★
13.3 Max-Product in Clique Trees
13.3.1 Computing Max-Marginals
13.3.2 Message Passing as Reparameterization
13.3.3 Decoding Max-Marginals
13.4 Max-Product Belief Propagation in Loopy Cluster Graphs
13.4.1 Standard Max-Product Message Passing
13.4.2 Max-Product BP with Counting Numbers ★
13.4.3 Discussion
13.5 MAP as a Linear Optimization Problem ★
13.5.1 The Integer Program Formulation
13.5.2 Linear Programming Relaxation
13.5.3 Low-Temperature Limits
13.6 Using Graph Cuts for MAP
13.6.1 Inference Using Graph Cuts
13.6.2 Nonbinary Variables
13.7 Local Search Algorithms ★
13.8 Summary
13.9 Relevant Literature
13.10 Exercises

14 Inference in Hybrid Networks
14.1 Introduction
14.1.3 Overview
14.2 Variable Elimination in Gaussian Networks
14.2.1 Canonical Forms
14.2.2 Sum-Product Algorithms
14.2.3 Gaussian Belief Propagation
14.3 Hybrid Networks
14.3.1 The Difficulties
14.3.2 Factor Operations for Hybrid Gaussian Networks
14.3.3 EP for CLG Networks
14.3.4 An "Exact" CLG Algorithm ★
14.4 Nonlinear Dependencies
14.4.2 Expectation Propagation with Gaussian Approximation
14.5 Particle-Based Approximation Methods
14.5.1 Sampling in Continuous Spaces
14.5.2 Forward Sampling in Bayesian Networks

14.5.3 MCMC Methods
14.5.4 Collapsed Particles
14.5.5 Nonparametric Message Passing
14.6 Summary and Discussion
14.7 Relevant Literature
14.8 Exercises

15 Inference in Temporal Models
15.1 Inference Tasks
15.2 Exact Inference
15.2.1 Filtering in State-Observation Models
15.2.2 Filtering as Clique Tree Propagation
15.2.3 Clique Tree Inference in DBNs
15.2.4 Entanglement
15.3 Approximate Inference
15.3.1 Key Ideas
15.3.2 Factored Belief State Methods
15.3.3 Particle Filtering
15.3.4 Deterministic Search Techniques
15.4 Hybrid DBNs
15.4.1 Continuous Models
15.4.2 Hybrid Models
15.5 Summary
15.6 Relevant Literature
15.7 Exercises

III Learning

16 Learning Graphical Models: Overview
16.1 Motivation
16.2 Goals of Learning
16.2.1 Density Estimation
16.2.2 Specific Prediction Tasks
16.2.3 Knowledge Discovery
16.3 Learning as Optimization
16.3.1 Empirical Risk and Overfitting
16.3.2 Discriminative versus Generative Training
16.4 Learning Tasks
16.4.1 Model Constraints
16.4.2 Data Observability
16.4.3 Taxonomy of Learning Tasks
16.5 Relevant Literature

17 Parameter Estimation
17.1 Maximum Likelihood Estimation

17.1.1 The Thumbtack Example
17.1.2 The Maximum Likelihood Principle
17.2 MLE for Bayesian Networks
17.2.1 A Simple Example
17.2.2 Global Likelihood Decomposition
17.2.3 Table-CPDs
17.2.4 Gaussian Bayesian Networks ★
17.2.5 Maximum Likelihood Estimation as M-Projection ★
17.3 Bayesian Parameter Estimation
17.3.1 The Thumbtack Example Revisited
17.3.2 Priors and Posteriors
17.4 Bayesian Parameter Estimation in Bayesian Networks
17.4.1 Parameter Independence and Global Decomposition
17.4.2 Local Decomposition
17.4.3 Priors for Bayesian Network Learning
17.4.4 MAP Estimation ★
17.5 Learning Models with Shared Parameters
17.5.1 Global Parameter Sharing
17.5.2 Local Parameter Sharing
17.5.3 Bayesian Inference with Shared Parameters
17.5.4 Hierarchical Priors ★
17.6 Generalization Analysis ★
17.6.1 Asymptotic Analysis
17.6.2 PAC-Bounds
17.7 Summary
17.8 Relevant Literature
17.9 Exercises

18 Structure Learning in Bayesian Networks
18.1 Introduction
18.1.1 Problem Definition
18.1.2 Overview of Methods
18.2 Constraint-Based Approaches
18.2.1 General Framework
18.2.2 Independence Tests
18.3 Structure Scores
18.3.1 Likelihood Scores
18.3.2 Bayesian Score
18.3.3 Marginal Likelihood for a Single Variable
18.3.4 Bayesian Score for Bayesian Networks
18.3.5 Understanding the Bayesian Score
18.3.6 Priors
18.3.7 Score Equivalence ★
18.4 Structure Search
18.4.1 Learning Tree-Structured Networks

18.4.2 Known Order
18.4.3 General Graphs
18.4.4 Learning with Equivalence Classes ★
18.5 Bayesian Model Averaging ★
18.5.1 Basic Theory
18.5.2 Model Averaging Given an Order
18.5.3 The General Case
18.6 Learning Models with Additional Structure
18.6.1 Learning with Local Structure
18.6.2 Learning Template Models
18.7 Summary and Discussion
18.8 Relevant Literature
18.9 Exercises

19 Partially Observed Data
19.1 Foundations
19.1.1 Likelihood of Data and Observation Models
19.1.2 Decoupling of Observation Mechanism
19.1.3 The Likelihood Function
19.1.4 Identifiability
19.2 Parameter Estimation
19.2.1 Gradient Ascent
19.2.2 Expectation Maximization (EM)
19.2.3 Comparison: Gradient Ascent versus EM
19.2.4 Approximate Inference ★
19.3 Bayesian Learning with Incomplete Data ★
19.3.1 Overview
19.3.2 MCMC Sampling
19.3.3 Variational Bayesian Learning
19.4 Structure Learning
19.4.1 Scoring Structures
19.4.2 Structure Search
19.4.3 Structural EM
19.5 Learning Models with Hidden Variables
19.5.1 Information Content of Hidden Variables
19.5.2 Determining the Cardinality
19.5.3 Introducing Hidden Variables
19.6 Summary
19.7 Relevant Literature
19.8 Exercises

20 Learning Undirected Models
20.1 Overview
20.2 The Likelihood Function
20.2.1 An Example

20.2.2 Form of the Likelihood Function
20.2.3 Properties of the Likelihood Function
20.3 Maximum (Conditional) Likelihood Parameter Estimation
20.3.1 Maximum Likelihood Estimation
20.3.2 Conditionally Trained Models
20.3.3 Learning with Missing Data
20.3.4 Maximum Entropy and Maximum Likelihood ★
20.4 Parameter Priors and Regularization
20.4.1 Local Priors
20.4.2 Global Priors
20.5 Learning with Approximate Inference
20.5.1 Belief Propagation
20.5.2 MAP-Based Learning ★
20.6 Alternative Objectives
20.6.1 Pseudolikelihood and Its Generalizations
20.6.2 Contrastive Optimization Criteria
20.7 Structure Learning
20.7.1 Structure Learning Using Independence Tests
20.7.2 Score-Based Learning: Hypothesis Spaces
20.7.3 Objective Functions
20.7.4 Optimization Task
20.7.5 Evaluating Changes to the Model
20.8 Summary
20.9 Relevant Literature
20.10 Exercises

IV Actions and Decisions

21 Causality
21.1 Motivation and Overview
21.1.1 Conditioning and Intervention
21.1.2 Correlation and Causation
21.2 Causal Models
21.3 Structural Causal Identifiability
21.3.1 Query Simplification Rules
21.3.2 Iterated Query Simplification
21.4 Mechanisms and Response Variables ★
21.5 Partial Identifiability in Functional Causal Models ★
21.6 Counterfactual Queries ★
21.6.1 Twinned Networks
21.6.2 Bounds on Counterfactual Queries
21.7 Learning Causal Models
21.7.1 Learning Causal Models without Confounding Factors
21.7.2 Learning from Interventional Data

21.7.3 Dealing with Latent Variables ★
21.7.4 Learning Functional Causal Models ★
21.8 Summary
21.9 Relevant Literature
21.10 Exercises

22 Utilities and Decisions
22.1 Foundations: Maximizing Expected Utility
22.1.1 Decision Making Under Uncertainty
22.1.2 Theoretical Justification ★
22.2 Utility Curves
22.2.1 Utility of Money
22.2.2 Attitudes Toward Risk
22.2.3 Rationality
22.3 Utility Elicitation
22.3.1 Utility Elicitation Procedures
22.3.2 Utility of Human Life
22.4 Utilities of Complex Outcomes
22.4.1 Preference and Utility Independence ★
22.4.2 Additive Independence Properties
22.5 Summary
22.6 Relevant Literature
22.7 Exercises

23 Structured Decision Problems
23.1 Decision Trees
23.1.1 Representation
23.1.2 Backward Induction Algorithm
23.2 Influence Diagrams
23.2.1 Basic Representation
23.2.2 Decision Rules
23.2.3 Time and Recall
23.2.4 Semantics and Optimality Criterion
23.3 Backward Induction in Influence Diagrams
23.3.1 Decision Trees for Influence Diagrams
23.3.2 Sum-Max-Sum Rule
23.4 Computing Expected Utilities
23.4.1 Simple Variable Elimination
23.4.2 Multiple Utility Variables: Simple Approaches
23.4.3 Generalized Variable Elimination ★
23.5 Optimization in Influence Diagrams
23.5.1 Optimizing a Single Decision Rule
23.5.2 Iterated Optimization Algorithm
23.5.3 Strategic Relevance and Global Optimality ★
23.6 Ignoring Irrelevant Information ★

23.7 Value of Information
23.7.1 Single Observations
23.7.2 Multiple Observations
23.8 Summary
23.9 Relevant Literature
23.10 Exercises

24 Epilogue

A Background Material
A.1 Information Theory
A.1.1 Compression and Entropy
A.1.2 Conditional Entropy and Information
A.1.3 Relative Entropy and Distances Between Distributions
A.2 Convergence Bounds
A.2.1 Central Limit Theorem
A.2.2 Convergence Bounds
A.3 Algorithms and Algorithmic Complexity
A.3.1 Basic Graph Algorithms
A.3.2 Analysis of Algorithmic Complexity
A.3.3 Dynamic Programming
A.3.4 Complexity Theory
A.4 Combinatorial Optimization and Search
A.4.1 Optimization Problems
A.4.2 Local Search
A.4.3 Branch and Bound Search
A.5 Continuous Optimization
A.5.1 Characterizing Optima of a Continuous Function
A.5.2 Gradient Ascent Methods
A.5.3 Constrained Optimization
A.5.4 Convex Duality

Bibliography
Notation Index
Subject Index

Acknowledgments

This book owes a considerable debt of gratitude to the many people who contributed to its creation, and to those who have influenced our work and our thinking over the years.

First and foremost, we want to thank our students, who, by asking the right questions, and forcing us to formulate clear and precise answers, were directly responsible for the inception of this book and for any clarity of presentation.

We have been fortunate to share the same mentors, who have had a significant impact on our development as researchers and as teachers: Joe Halpern and Stuart Russell. Much of our core views on probabilistic models have been influenced by Judea Pearl. Judea, through his persuasive writing and vivid presentations, inspired us, and many other researchers of our generation, to plunge into research in this field.

There are many people whose conversations with us have helped us in thinking through some of the more difficult concepts in the book: Nando de Freitas, Gal Elidan, Dan Geiger, Amir Globerson, Uri Lerner, Chris Meek, David Sontag, Yair Weiss, and Ramin Zabih. Others, in conversations and collaborations over the years, have also influenced our thinking and the presentation of the material: Pieter Abbeel, Jeff Bilmes, Craig Boutilier, Moises Goldszmidt, Carlos Guestrin, David Heckerman, Eric Horvitz, Tommi Jaakkola, Michael Jordan, Kevin Murphy, Andrew Ng, Ben Taskar, and Sebastian Thrun.

We especially want to acknowledge Gal Elidan for constant encouragement, valuable feedback, and logistic support at many critical junctions, throughout the long years of writing this book.

Over the course of the years of work on this book, many people have contributed to it by providing insights, engaging in enlightening discussions, and giving valuable feedback. It is impossible to individually acknowledge all of the people who made such contributions. However, we specifically wish to express our gratitude to those people who read large parts of the book and gave detailed feedback: Rahul Biswas, James Cussens, James Diebel, Yoni Donner, Tal El-Hay, Gal Elidan, Stanislav Funiak, Amir Globerson, Russ Greiner, Carlos Guestrin, Tim Heilman, Geremy Heitz, Maureen Hillenmeyer, Ariel Jaimovich, Tommy Kaplan, Jonathan Laserson, Ken Levine, Brian Milch, Kevin Murphy, Ben Packer, Ronald Parr, Dana Pe'er, and Christian Shelton.

We are deeply grateful to the following people, who contributed specific text and/or figures, mostly to the case studies and concept boxes, without which this book would be far less interesting: Gal Elidan, to chapter 11, chapter 18, and chapter 19; Stephen Gould, to chapter 4 and chapter 13; Vladimir Jojic, to chapter 12; Jonathan Laserson, to chapter 19; Uri Lerner, to chapter 14; Andrew McCallum and Charles Sutton, to chapter 4; Brian Milch, to chapter 6; Kevin Murphy, to chapter 15; and Benjamin Packer, to many of the exercises used throughout the book. In addition, we are very grateful to Amir Globerson, David Sontag, and Yair Weiss, whose insights on chapter 13 played a key role in the development of the material in that chapter.

Special thanks are due to Bob Prior at MIT Press, who convinced us to go ahead with this project and was constantly supportive, enthusiastic, and patient in the face of the recurring delays and missed deadlines. We thank Greg McNamee, our copy editor, and Mary Reilly, our artist, for their help in improving this book considerably. We thank Chris Manning for allowing us to use his LaTeX macros for typesetting this book, and for providing useful advice on how to use them. And we thank Miles Davis for invaluable technical support.

We also wish to thank the many colleagues who used drafts of this book in teaching and provided enthusiastic feedback that encouraged us to continue this project at times where it seemed unending. Sebastian Thrun deserves a special note of thanks, for forcing us to set a deadline for completion of this book and to stick to it.

We also want to thank the past and present members of the DAGS group at Stanford, and the Computational Biology group at the Hebrew University, many of whom also contributed ideas, insights, and useful comments. We specifically want to thank them for bearing with us while we devoted far too much of our time to working on this book.

Finally, no one deserves our thanks more than our long-suffering families — Natalie Anna Koller Avida, Maya Rika Koller Avida, and Dan Avida; Lior, Roy, and Yael Friedman — for their continued love, support, and patience, as they watched us work evenings and weekends to complete this book. We could never have done this without you.

List of Figures

1.1 Different perspectives on probabilistic graphical models
1.2 A reader's guide to the structure and dependencies in this book

2.1 Example of a joint distribution P(Intelligence, Grade)
2.2 Example PDF of three Gaussian distributions
2.3 An example of a partially directed graph K
2.4 Induced graphs and their upward closure
2.5 An example of a

3.1 Simple Bayesian networks for the student example
3.2 The Bayesian network graph for a naive Bayes model
3.3 The Bayesian Network graph for the Student example
3.4 Student Bayesian network B_student with CPDs
3.5 The four possible two-edge trails
3.6 A simple example for the d-separation algorithm
3.7 Skeletons and v-structures in a network
3.8 Three minimal I-maps for P_Bstudent, induced by different orderings
3.9 Network for the OneLetter example
3.10 Attempted Bayesian network models for the Misconception example
3.11 Simple example of compelled edges in an equivalence class
3.12 Rules for orienting edges in PDAG
3.13 More complex example of compelled edges in an equivalence class
3.14 A Bayesian network with qualitative influences
3.15 A simple network for a burglary alarm domain
3.16 Illustration of the concept of a self-contained

4.1 Factors for the Misconception example
4.2 Joint distribution for the Misconception example
4.3 An example of factor product
4.4 The cliques in two simple Markov networks
4.5 An example of factor reduction
4.6 Markov networks for the factors in an extended Student example

4.7 An attempt at an I-map for a nonpositive distribution P
4.8 Different factor graphs for the same Markov network
4.9 Energy functions for the Misconception example
4.10 Alternative but equivalent energy functions
4.11 Canonical energy function for the Misconception example
4.12 Example of alternative definition of d-separation based on Markov networks
4.13 Minimal I-map Bayesian networks for a nonchordal Markov network
4.14 Different linear-chain graphical models
4.15 A chain graph K and its moralized version
4.16 Example for definition of c-separation in a chain graph

5.1 Example of a network with a deterministic CPD
5.2 A slightly more complex example with deterministic CPDs
5.3 The Student example augmented with a Job variable
5.4 A tree-CPD for P(J | A, S, L)
5.5 The OneLetter example of a multiplexer dependency
5.6 tree-CPD for a rule-based CPD
5.7 Example of removal of spurious edges
5.8 Two reduced CPDs for the OneLetter example
5.9 Decomposition of the noisy-or model for Letter
5.10 The behavior of the noisy-or model
5.11 The behavior of the sigmoid CPD
5.12 Example of the multinomial logistic CPD
5.13 Independence of causal influence
5.14 Generalized linear model for a thermostat
5.15 Example of encapsulated CPDs for a computer system

6.1 A highly simplified DBN for monitoring a vehicle
6.2 HMM as a DBN
6.3 Two classes of DBNs constructed from HMMs
6.4 A simple 4-state HMM
6.5 One possible world for the University example
6.6 Plate model for a set of coin tosses sampled from a single coin
6.7 Plate models and ground Bayesian networks for a simplified Student example
6.8 Illustration of probabilistic interactions in the University domain
6.9 Examples of dependency graphs

7.1 Examples of 2-dimensional Gaussians

8.1 Example of M- and I-projections into the family of Gaussian distributions
8.2 Example of M- and I-projections for a discrete distribution
8.3 Relationship between parameters, distributions, and expected sufficient statistics

9.1 Network used to prove NP-hardness of exact inference
9.2 Computing P(D) by summing out the joint distribution
9.3 The first transformation on the sum of figure 9.2

9.4 The second transformation on the sum of figure 9.2
9.5 The third transformation on the sum of figure 9.2
9.6 The fourth transformation on the sum of figure 9.2
9.7 Example of factor marginalization
9.8 The Extended-Student Bayesian network
9.9 Understanding intermediate factors in variable elimination
9.10 Variable elimination as graph transformation in the Student example
9.11 Induced graph and clique tree for the Student example
9.12 Networks where conditioning performs unnecessary computation
9.13 Induced graph for the Student example using both conditioning and elimination
9.14 Different decompositions for a noisy-or CPD
9.15 Example Bayesian network with rule-based structure
9.16 Conditioning in a network with

10.1 Cluster tree for the VE execution in table 9.1
10.2 Simplified clique tree T for the Extended Student network
10.3 Message propagations with different root cliques in the Student clique tree
10.4 An abstract clique tree that is not chain-structured
10.5 Two steps in a downward pass in the Student network
10.6 Final beliefs for the Misconception example
10.7 An example of factor division
10.8 A modified Student BN with an unambitious student
10.9 A clique tree for the modified Student BN of figure 10.8
10.10 Example of cli
