
=== Example ===
To show one situation where this model is appropriate, suppose we have a model with four Bernoulli nodes {A, B, C, D} with the independence properties $(A \perp C | \{B, D\})$ and $(B \perp D | \{A, C\})$. How would you represent this model as a Bayesian network?

[[File:Ben_img1directedA.png]] In this attempt, we get the property $(A \perp C | \{B, D\})$, but the network also implies that B and D are independent given only A.

[[File:Ben_img1directedB.png]] This network also implies $(A \perp C | \{B, D\})$, but it additionally implies that B and D are marginally independent.

[[File:Ben_img1undirected.png]] The only natural way to capture both independence properties is with an MRF, as shown in the image to the left.
*''Pairwise Markov property'': Any two non-adjacent random variables are conditionally independent given that all other variables in the network are observed.
[[File:Ben_pairwise.png]] $(A \perp Z | \{\text{all other nodes}\})$
*''Local Markov property'': A variable is conditionally independent of all other variables given its neighbors.
[[File:Ben_local.png]] $(B \perp C | \{\text{neighbors of } C\})$
*''Global Markov property'': Any two subsets of variables are conditionally independent given a separating subset through which every path between the two subsets passes.
[[File:Ben_global.png]] $(S_1 \perp S_2 | S)$
These properties also lead to the conclusion that the Markov blanket of any particular node A in an MRF is simply the set of nodes adjacent to A.

= Parametrization =
Because the node interactions are not directed, there is no reason to use a conditional probability distribution to represent them; it is more appropriate to capture "affinities" between variables. We therefore assign a general-purpose function, called a factor or potential $\phi$, to each pair of adjacent variables. In the discrete case, $\phi$ contains a value for each combination of variable states. This will be made clear in the example below.

Combining this set of factors for a network leads to a table that represents the various affinities of the network. Table values in an MRF are generally non-negative but have no other restrictions. These values do not have to represent probabilities and they do not have to sum to one; they are merely an association between variables.
With these factors, we now have the ability to find probability distributions. For a network consisting of the random variables $\{X_1, X_2, ..., X_n\}$ and parametrized by the set of factors $\Phi = \{\phi_1(D_1), \phi_2(D_2), ..., \phi_k(D_k)\}$, the joint probability distribution of the network is formally defined as $P(X_1, X_2, ..., X_n) = (1/Z)\tilde{P}(X_1, X_2, ..., X_n)$, where $\tilde{P}(X_1, X_2, ..., X_n) = \prod_{k} \phi_k(D_k)$ and Z is a normalizing constant (also called the partition function) defined by $Z = \sum_{X_1, X_2, ..., X_n} \tilde{P}(X_1, X_2, ..., X_n)$. The normalizing constant is necessary because the affinities do not represent actual probabilities.

In the discrete case, computing the unnormalized joint value for every combination of variable states creates a table. With this, we can find any conditional probability using Bayes' Law just as we do for a Bayesian network (e.g. $P(X_1 | X_2, ..., X_n) = P(X_1, X_2, ..., X_n)/P(X_2, ..., X_n)$).
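This parametrization is easy to sketch in code. The helper below is an illustrative sketch (the function name and the dictionary layout are my own, not part of the course material): it assumes discrete variables and factors stored as plain dictionaries, builds the unnormalized product $\tilde{P}$ over every assignment, and divides by the partition function Z.

```python
from itertools import product

def joint_distribution(variables, factors):
    """Normalized joint P(X_1, ..., X_n) from a set of factors.

    `variables` maps each variable name to its list of states; each factor
    is a (scope, table) pair, where `table` maps a tuple of states for the
    variables in `scope` to a non-negative affinity value.
    """
    names = list(variables)
    unnormalized = {}
    for assignment in product(*(variables[n] for n in names)):
        value = dict(zip(names, assignment))
        # P~(x): product of all factor values for this assignment
        p_tilde = 1.0
        for scope, table in factors:
            p_tilde *= table[tuple(value[v] for v in scope)]
        unnormalized[assignment] = p_tilde
    Z = sum(unnormalized.values())  # partition function
    return {a: v / Z for a, v in unnormalized.items()}

# Toy usage: a single factor over two binary variables.
variables = {"X": [0, 1], "Y": [0, 1]}
factors = [(("X", "Y"), {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0})]
P = joint_distribution(variables, factors)
```

In the toy usage, Z = 3 + 1 + 1 + 3 = 8, so $P(x^0, y^0) = 3/8$; the affinities themselves never had to sum to one.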
=== Example ===
Going back to our discrete Bernoulli case from Section 1.1 above, we could create a table of factors to represent the associations between the nodes in the network, as shown below. The values within a factor essentially weight the different cases. For $\phi_1(A,B)$, for instance, we feel that the $a^0, b^0$ case is the most likely and have weighted it accordingly.

{| border="1"
!colspan="3"|$\phi_1(A,B)$
!colspan="3"|$\phi_2(B,C)$
!colspan="3"|$\phi_3(C,D)$
!colspan="3"|$\phi_4(D,A)$
|-
|$a^0$
|$b^0$
|$30$
|$b^0$
|$c^0$
|$100$
|$c^0$
|$d^0$
|$1$
|$d^0$
|$a^0$
|$100$
|-
|$a^0$
|$b^1$
|$5$
|$b^0$
|$c^1$
|$1$
|$c^0$
|$d^1$
|$100$
|$d^0$
|$a^1$
|$1$
|-
|$a^1$
|$b^0$
|$1$
|$b^1$
|$c^0$
|$1$
|$c^1$
|$d^0$
|$100$
|$d^1$
|$a^0$
|$1$
|-
|$a^1$
|$b^1$
|$10$
|$b^1$
|$c^1$
|$100$
|$c^1$
|$d^1$
|$1$
|$d^1$
|$a^1$
|$100$
|}

With these factors, we now have the ability to find probability distributions. To find the joint probability distribution P(a, b, c, d), we combine the local models just as we do in a Bayesian network and as described above, giving us $\phi_1(a,b)\phi_2(b,c)\phi_3(c,d)\phi_4(d,a)$. Again, since there is no guarantee that this will be a normalized joint distribution, we must multiply by a normalizing constant. This gives us $P(a, b, c, d) = (1/Z)\phi_1(a,b)\phi_2(b,c)\phi_3(c,d)\phi_4(d,a)$ where $Z = \sum_{a,b,c,d} \phi_1(a,b)\phi_2(b,c)\phi_3(c,d)\phi_4(d,a)$. A breakdown of all the joint distributions for this network is given in the table below.
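As a sanity check, the factor tables can be multiplied out programmatically. This is only a sketch (the variable names and dictionary layout are my own); it computes the unnormalized products, the normalizing constant Z, and one conditional probability by summing table entries.

```python
from itertools import product

# Factor tables copied from the example above (states 0/1 index the tuple keys)
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi_1(A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_2(B, C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi_3(C, D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_4(D, A)

# Unnormalized joint: P~(a,b,c,d) = phi1(a,b) * phi2(b,c) * phi3(c,d) * phi4(d,a)
unnormalized = {
    (a, b, c, d): phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]
    for a, b, c, d in product([0, 1], repeat=4)
}
Z = sum(unnormalized.values())            # partition function
P = {k: v / Z for k, v in unnormalized.items()}

# Bayes' law by summing table entries: P(a0 | b0, c1)
p_cond = (P[0, 0, 1, 0] + P[0, 0, 1, 1]) / (
    P[0, 0, 1, 0] + P[0, 0, 1, 1] + P[1, 0, 1, 0] + P[1, 0, 1, 1]
)
```

Here $\tilde{P}(a^0,b^0,c^0,d^0) = 30 \cdot 100 \cdot 1 \cdot 100 = 300000$, and the exact value of `p_cond` is about 0.9993 (slightly different from the .99930 quoted later in the text, which is computed from rounded table entries).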
{| border="1"
!colspan="1"|$a$
!colspan="1"|$b$
!colspan="1"|$c$
!colspan="1"|$d$
!colspan="1"|Unnormalized
!colspan="1"|Normalized
|-
|$a^0$
|$b^0$
|$c^0$
|$d^0$
|$300000$
|$.04$
|-
|$a^0$
|$b^0$
|$c^0$
|$d^1$
|$300000$
|$.04$
|-
|$a^0$
|$b^0$
|$c^1$
|$d^0$
|$300000$
|$.04$
|-
|$a^0$
|$b^0$
|$c^1$
|$d^1$
|$30$
|$4.1*10^{-6}$
|-
|$a^0$
|$b^1$
|$c^0$
|$d^0$
|$500$
|$6.9*10^{-5}$
|-
|$a^0$
|$b^1$
|$c^0$
|$d^1$
|$500$
|$6.9*10^{-5}$
|-
|$a^0$
|$b^1$
|$c^1$
|$d^0$
|$5000000$
|$.69$
|-
|$a^0$
|$b^1$
|$c^1$
|$d^1$
|$500$
|$6.9*10^{-5}$
|-
|$a^1$
|$b^0$
|$c^0$
|$d^0$
|$100$
|$1.4*10^{-5}$
|-
|$a^1$
|$b^0$
|$c^0$
|$d^1$
|$1000000$
|$.14$
|-
|$a^1$
|$b^0$
|$c^1$
|$d^0$
|$100$
|$1.4*10^{-5}$
|-
|$a^1$
|$b^0$
|$c^1$
|$d^1$
|$100$
|$1.4*10^{-5}$
|-
|$a^1$
|$b^1$
|$c^0$
|$d^0$
|$10$
|$1.4*10^{-6}$
|-
|$a^1$
|$b^1$
|$c^0$
|$d^1$
|$100000$
|$.014$
|-
|$a^1$
|$b^1$
|$c^1$
|$d^0$
|$100000$
|$.014$
|-
|$a^1$
|$b^1$
|$c^1$
|$d^1$
|$100000$
|$.014$
|}

Now, using Bayes' Law, we can compute various conditional probabilities. For example, $P(a^0|b^0,c^1) = P(a^0,b^0,c^1)/P(b^0,c^1) = (.04+4.1*10^{-6})/(.04+4.1*10^{-6}+1.4*10^{-5}+1.4*10^{-5}) = .99930$.

= Markov Chain Monte Carlo (MCMC) =

== Metropolis and Gibbs Sampling ==
Of course, there are many times when we do not know how to sample from the conditional distributions. In this case, just as with a Bayesian network, we can apply Gibbs sampling with the Metropolis algorithm, in which we:
# Sample $x^*$ from a proposal distribution $g(x^*)$, where g is close to the actual distribution f, contains the support of f, and is symmetric.
# Compute $r = f(x^*)/f(x)$.
# With probability $\min(r, 1)$, update the node value to $x^*$; otherwise, leave the value the same.
The fact that we can use the same Bayesian algorithms on an undirected graph shows just how closely related the two networks are and how powerful the Bayesian methods actually are.

= Candidate Exam Question =
Given the Bayesian network below, convert the network to a Markov random field. Then, assuming that each node can take a state of zero or one, create a potential table, assigning values of your choice to the potentials. Then find the following probabilities: $P(a^0, b^1, c^1)$, $P(a^1)$, $P(b^1 | c^1)$, and $P(a^0 | b^0, c^0)$. Show your work so that it is easy to follow what you did.

[[File:Examquestion.png]]
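The three Metropolis steps listed in the sampling section above can be sketched for a single binary node of the example MRF. Everything here is illustrative, not part of the course material: the helper name `metropolis_update`, the bit-flip proposal (which is symmetric), and the choice of conditioning values. Note that f only needs to be known up to the partition function, so the product of the factors touching the node suffices.

```python
import random

def metropolis_update(value, f):
    """One Metropolis step for a binary node.

    f is the target density up to a constant (for an MRF node, the product
    of the factors that touch it, with its neighbors held at their current
    values). The proposal flips the bit, which is a symmetric proposal.
    """
    proposal = 1 - value                # step 1: sample x* from g
    r = f(proposal) / f(value)          # step 2: r = f(x*)/f(x)
    if random.random() < min(r, 1.0):   # step 3: accept with prob min(r, 1)
        return proposal
    return value

# Illustrative run on node A of the earlier example, with B = 0 and D = 1 fixed:
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi_1(A, B)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_4(D, A)
b, d = 0, 1
f = lambda a: phi1[a, b] * phi4[d, a]   # only the factors touching A matter

random.seed(0)
a, samples = 0, []
for _ in range(10000):
    a = metropolis_update(a, f)
    samples.append(a)
# Long-run fraction of a = 1 approaches f(1)/(f(0)+f(1)) = 100/130, about 0.77
```

Because the node is binary and the proposal is a flip, the chain's stationary distribution can be checked by hand: f(0) = 30 and f(1) = 100, so the empirical mean of the samples should settle near 0.77.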