##### Differences

This shows you the differences between two versions of the page.

 cs-677sp2010:undirected-models [2014/12/09 09:48]ryancha created cs-677sp2010:undirected-models [2014/12/11 16:40]ryancha 2014/12/11 16:40 ryancha 2014/12/09 09:50 ryancha 2014/12/09 09:48 ryancha created Next revision Previous revision 2014/12/11 16:40 ryancha 2014/12/09 09:50 ryancha 2014/12/09 09:48 ryancha created Line 9: Line 9: === Example === === Example === - To show one situation where this model is appropriate,​ suppose we have a model with four Bernoulli nodes {A, B, C, D} with independency properties ​<​math>​(A \perp C |  \{B, D\}) ​and <​math>​(B \perp D |  \{A, C\})​.  How would you represent this model as a Bayesian network? + To show one situation where this model is appropriate,​ suppose we have a model with four Bernoulli nodes {A, B, C, D} with independency properties ​$(A \perp C | \{B, D\})$ and $(B \perp D | \{A, C\})$.  How would you represent this model as a Bayesian network? - [[File:​Ben_img1directedA.png]] In this attempt, we get the property ​<​math>​(A \perp C |  \{B, D\})​, but this network also implies that B and D are independent given only A. + [[media:​cs-677sp10:​Ben_img1directedA.png]] In this attempt, we get the property ​$(A \perp C | \{B, D\})$, but this network also implies that B and D are independent given only A. - [[File:​Ben_img1directedB.png]] This network also implies ​<​math>​(A \perp C |  \{B, D\})​, but this network also implies that B and D are marginally independent. + [[media:​cs-677sp10:​Ben_img1directedB.png]] This network also implies ​$(A \perp C | \{B, D\})$, but this network also implies that B and D are marginally independent. - [[File:​Ben_img1undirected.png]] The only good way to naturally capture this network is with an MRF as shown in the image to the left. + [[media:​cs-677sp10:​Ben_img1undirected.png]] The only good way to naturally capture this network is with an MRF as shown in the image to the left. Now that we have shown a situation where a Bayesian network comes up lacking, we will go through a comparison of Markov random fields and Bayesian networks and how the techniques we have applied in class can also apply to MRFs. Now that we have shown a situation where a Bayesian network comes up lacking, we will go through a comparison of Markov random fields and Bayesian networks and how the techniques we have applied in class can also apply to MRFs. Line 24: Line 24: There are, however, dependencies that an MRF cannot represent. ​ One of these is the induced dependency as shown below. If the child node is observed and has multiple parents, in a Bayesian network (BN) information can flow between the parent nodes. In an MRF, no information can flow across known nodes. There are, however, dependencies that an MRF cannot represent. ​ One of these is the induced dependency as shown below. If the child node is observed and has multiple parents, in a Bayesian network (BN) information can flow between the parent nodes. In an MRF, no information can flow across known nodes. - [[File:​Ben_induceddependency.png]] + [[media:​cs-677sp10:​Ben_induceddependency.png]] No information can flow because of the three Markov properties that define conditional independence for a Markov network: No information can flow because of the three Markov properties that define conditional independence for a Markov network: Line 30: Line 30: *''​Pairwise Markov property'':​ Any two non-adjacent random variables are conditionally independent given all other variables in the network are observed. *''​Pairwise Markov property'':​ Any two non-adjacent random variables are conditionally independent given all other variables in the network are observed. - [[File:​Ben_pairwise.png]] ​<​math>​(A \perp Z | \{all other nodes\})​ + [[media:​cs-677sp10:​Ben_pairwise.png]] ​$(A \perp Z | \{all other nodes\})$ *''​Local Markov property'':​ A variable is conditionally independent of all other variables given its neighbors. *''​Local Markov property'':​ A variable is conditionally independent of all other variables given its neighbors. - [[File:​Ben_local.png]] ​<​math>​(B \perp C | \{neighbors of C\})​ + [[media:​cs-677sp10:​Ben_local.png]] ​$(B \perp C | \{neighbors of C\})$ *''​Global Markov property'':​ Any two subsets of variables are conditionally independent given a separating subset where every path between the two subsets pass through the observed subset. *''​Global Markov property'':​ Any two subsets of variables are conditionally independent given a separating subset where every path between the two subsets pass through the observed subset. - [[File:​Ben_global.png]] ​<​math>​(S1 \perp S2 | \{S\})​ + [[media:​cs-677sp10:​Ben_global.png]] ​$(S1 \perp S2 | \{S\})$ These properties also lead to the conclusion that the Markov blanket for any particular node A in an MRF is simply A and all the nodes adjacent to A. These properties also lead to the conclusion that the Markov blanket for any particular node A in an MRF is simply A and all the nodes adjacent to A. = Parametrization = = Parametrization = - Because the node interactions are not directed, there is no reason to use a conditional probability distribution to represent node interactions. ​ It is more appropriate to capture "​affinities"​ between variables. We, therefore, assign a general-purpose function called a factor or potential ​<​math>​\phi ​to each pair of adjacent variables. In the discrete case, <​math>​\phi ​contains a value for each variable state combination. This will be made clear in the example below. + Because the node interactions are not directed, there is no reason to use a conditional probability distribution to represent node interactions. ​ It is more appropriate to capture "​affinities"​ between variables. We, therefore, assign a general-purpose function called a factor or potential ​$\phi$ to each pair of adjacent variables. In the discrete case, $\phi$ contains a value for each variable state combination. This will be made clear in the example below. Combining this set of factors for a network leads to a table that represents the various affinities of the network. Table values in an MRF are generally non-negative,​ but have no other restrictions. ​ These values do not have to represent probabilities and they do not have to sum up to one. They are merely an association between variables. Combining this set of factors for a network leads to a table that represents the various affinities of the network. Table values in an MRF are generally non-negative,​ but have no other restrictions. ​ These values do not have to represent probabilities and they do not have to sum up to one. They are merely an association between variables. - With these factors, we now have the ability to find probability distributions. For a network consisting of the random variables ​<​math>​\{X_1, X_2,..., X_n\} ​and parametrized by the set of factors ​<​math>​\Phi = \{\phi_1(D_1),​\phi_2(D_2),​...,​\phi_k(D_k)\}​, the joint probability distribution of a network is formally defined as <​math>​P(X_1, X_2, ..., X_n) = (1/​Z)\tilde{P}(X_1,​ X_2, ..., X_n) ​where <​math>​\tilde{P}(X_1,​ X_2, ..., X_n) = \prod_{k} \phi_k(D_k) ​and where Z is a normalizing constant (also called a partition function) defined by <​math>​Z = \sum_{X_1, X_2,​...,​X_n}\tilde{P}(X_1,​ X_2, ..., X_n)​. The normalizing constant is necessary because the affinities do not represent actual probabilities. + With these factors, we now have the ability to find probability distributions. For a network consisting of the random variables ​$\{X_1, X_2,..., X_n\}$ and parametrized by the set of factors ​$\Phi = \{\phi_1(D_1),​\phi_2(D_2),​...,​\phi_k(D_k)\}$, the joint probability distribution of a network is formally defined as $P(X_1, X_2, ..., X_n) = (1/​Z)\tilde{P}(X_1,​ X_2, ..., X_n)$ where $\tilde{P}(X_1,​ X_2, ..., X_n) = \prod_{k} \phi_k(D_k)$ and where Z is a normalizing constant (also called a partition function) defined by $Z = \sum_{X_1, X_2,​...,​X_n}\tilde{P}(X_1,​ X_2, ..., X_n)$. The normalizing constant is necessary because the affinities do not represent actual probabilities. - In the discrete case, computing all of the joint distributions for the various variable values creates a table. With this, we can find any conditional probability using Bayes' Law just as we do for a Bayesian network (e.g. <​math>​P(X_1 | X_2, ..., X_n) = P(X_1, X_2, ..., X_n)/P(X_2, ..., X_n)​). + In the discrete case, computing all of the joint distributions for the various variable values creates a table. With this, we can find any conditional probability using Bayes' Law just as we do for a Bayesian network (e.g. $P(X_1 | X_2, ..., X_n) = P(X_1, X_2, ..., X_n)/P(X_2, ..., X_n)$). === Example === === Example === - Going back to our discrete Bernoulli case from Section 1.1 above, we could create a table of factors to represent the associations between the nodes in the network as shown below. The values within a factor essentially weight the different cases. So for <​math>​\phi_1(A,B) ​we feel that the <​math>​a^0, b^0 ​case is the most likely and we have weighted it accordingly. + Going back to our discrete Bernoulli case from Section 1.1 above, we could create a table of factors to represent the associations between the nodes in the network as shown below. The values within a factor essentially weight the different cases. So for $\phi_1(A,B)$ we feel that the $a^0, b^0$ case is the most likely and we have weighted it accordingly. {| border="​1"​ {| border="​1"​ - !colspan="​3"​|<​math>​\phi_1(A,B)​ + !colspan="​3"​|$\phi_1(A,B)$ - !colspan="​3"​|<​math>​\phi_2(B,C)​ + !colspan="​3"​|$\phi_2(B,C)$ - !colspan="​3"​|<​math>​\phi_3(C,D)​ + !colspan="​3"​|$\phi_3(C,D)$ - !colspan="​3"​|<​math>​\phi_4(D,A)​ + !colspan="​3"​|$\phi_4(D,A)$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^0​ + |$b^0$ - |<​math>​30​ + |$30$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^0​ + |$c^0$ - |<​math>​100​ + |$100$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^0​ + |$d^0$ - |<​math>​1​ + |$1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​a^0​ + |$a^0$ - |<​math>​100​ + |$100$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^1​ + |$b^1$ - |<​math>​5​ + |$5$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^1​ + |$c^1$ - |<​math>​1​ + |$1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^1​ + |$d^1$ - |<​math>​100​ + |$100$ - |<​math>​d^0​ + |$d^0$ - |<​math>​a^1​ + |$a^1$ - |<​math>​1​ + |$1$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^0​ + |$b^0$ - |<​math>​1​ + |$1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​1​ + |$1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​100​ + |$100$ - |<​math>​d^1​ + |$d^1$ - |<​math>​a^0​ + |$a^0$ - |<​math>​1​ + |$1$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​10​ + |$10$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​100​ + |$100$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​1​ + |$1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​a^1​ + |$a^1$ - |<​math>​100​ + |$100$ |} |} - With these factors, we now have the ability to find probability distributions. ​ To find the joint probability distribution P(a, b, c, d) we combine the locals models just as we do in a Bayesian network and as described above giving us <​math>​\phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a)​. Again, since there is no guarantee that this will be a normalized joint distribution,​ we must multiply this by a normalizing constant. ​ This gives us <​math>​P(a, b, c, d) = (1/​Z)*\phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a) ​where <​math>​Z = \sum_{a,​b,​c,​d} \phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a) ​ A breakdown of all the joint distributions for this network is given in the table below. + With these factors, we now have the ability to find probability distributions. ​ To find the joint probability distribution P(a, b, c, d) we combine the locals models just as we do in a Bayesian network and as described above giving us $\phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a)$. Again, since there is no guarantee that this will be a normalized joint distribution,​ we must multiply this by a normalizing constant. ​ This gives us $P(a, b, c, d) = (1/​Z)*\phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a)$ where $Z = \sum_{a,​b,​c,​d} \phi_1(a,​b)*\phi_2(b,​c)*\phi_3(c,​d)*\phi_4(d,​a)$  A breakdown of all the joint distributions for this network is given in the table below. {| border="​1"​ {| border="​1"​ Line 120: Line 120: !colspan="​1"​|Normalized !colspan="​1"​|Normalized |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^0​ + |$d^0$ - |<​math>​300000​ + |$300000$ - |<​math>​.04​ + |$.04$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^1​ + |$d^1$ - |<​math>​300000​ + |$300000$ - |<​math>​.04​ + |$.04$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​300000​ + |$300000$ - |<​math>​.04​ + |$.04$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​30​ + |$30$ - |<​math>​4.1*10^-6​ + |$4.1*10^-6$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^0​ + |$d^0$ - |<​math>​500​ + |$500$ - |<​math>​6.9*10^-5​ + |$6.9*10^-5$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^1​ + |$d^1$ - |<​math>​500​ + |$500$ - |<​math>​6.9*10^-5​ + |$6.9*10^-5$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​5000000​ + |$5000000$ - |<​math>​.69​ + |$.69$ |- |- - |<​math>​a^0​ + |$a^0$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​500​ + |$500$ - |<​math>​6.9*10^-5​ + |$6.9*10^-5$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^0​ + |$d^0$ - |<​math>​100​ + |$100$ - |<​math>​1.4*10^-5​ + |$1.4*10^-5$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^1​ + |$d^1$ - |<​math>​1000000​ + |$1000000$ - |<​math>​.14​ + |$.14$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​100​ + |$100$ - |<​math>​1.4*10^-5​ + |$1.4*10^-5$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^0​ + |$b^0$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​100​ + |$100$ - |<​math>​1.4*10^-5​ + |$1.4*10^-5$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^0​ + |$d^0$ - |<​math>​10​ + |$10$ - |<​math>​1.4*10^-6​ + |$1.4*10^-6$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^0​ + |$c^0$ - |<​math>​d^1​ + |$d^1$ - |<​math>​100000​ + |$100000$ - |<​math>​.014​ + |$.014$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^0​ + |$d^0$ - |<​math>​100000​ + |$100000$ - |<​math>​.014​ + |$.014$ |- |- - |<​math>​a^1​ + |$a^1$ - |<​math>​b^1​ + |$b^1$ - |<​math>​c^1​ + |$c^1$ - |<​math>​d^1​ + |$d^1$ - |<​math>​100000​ + |$100000$ - |<​math>​.014​ + |$.014$ |} |} - Now using Bayes' Law, we can compute various conditional probabilities. For example, ​<​math>​P(a^0|b^0,​c^1) = P(a^0,​b^0,​c^1)/​P(b^0,​c^1) = (.04+4.1*10^-6)/​(.04+4.1*10^-6+1.4*10^-5+1.4*10^-5) = .99930​. + Now using Bayes' Law, we can compute various conditional probabilities. For example, ​$P(a^0|b^0,​c^1) = P(a^0,​b^0,​c^1)/​P(b^0,​c^1) = (.04+4.1*10^-6)/​(.04+4.1*10^-6+1.4*10^-5+1.4*10^-5) = .99930$. = Monte Carlo Markov Chain (MCMC) = = Monte Carlo Markov Chain (MCMC) = Line 243: Line 243: == Metropolis and Gibbs Sampling == == Metropolis and Gibbs Sampling == Of course, there are many times where we do not know how to sample from the conditional distributions. In this case and just as with a Bayesian network, we can apply the same Gibbs sampling with Metropolis algorithm in which we: Of course, there are many times where we do not know how to sample from the conditional distributions. In this case and just as with a Bayesian network, we can apply the same Gibbs sampling with Metropolis algorithm in which we: - # Sample ​<​math>​x^* ​from a proposal distribution ​<​math>​g(x^*) ​where g is close to the actual distribution f, contains the support of f, and is symmetric. + # Sample ​$x^*$ from a proposal distribution ​$g(x^*)$ where g is close to the actual distribution f, contains the support of f, and is symmetric. - # Compute ​<​math>​r = f(x^*)/f(x)​ + # Compute ​$r = f(x^*)/f(x)$ - # With probability ​<​math>​min(r, 1)​, update the node value to <​math>​x^*​; otherwise, leave the value the same. + # With probability ​$min(r, 1)$, update the node value to $x^*$; otherwise, leave the value the same. The fact that we can use the same Bayesian algorithms on an undirected graph shows just how closely related the two networks are and how powerful the Bayesian methods actually are. The fact that we can use the same Bayesian algorithms on an undirected graph shows just how closely related the two networks are and how powerful the Bayesian methods actually are. Line 258: Line 258: # Moralize. # Moralize. - [[File:​Bayestomarkov.png‎]] + [[media:​cs-677sp10:​Bayestomarkov.png‎]] Moralizing handles the induced dependency problem that was described in the Section 2. The term moralize comes from the idea that it would be immoral for any child node to have multiple parents if the parents were not connected. Therefore, an undirected edge must be introduced between any parents nodes that share a child. Moralizing handles the induced dependency problem that was described in the Section 2. The term moralize comes from the idea that it would be immoral for any child node to have multiple parents if the parents were not connected. Therefore, an undirected edge must be introduced between any parents nodes that share a child. Line 267: Line 267: Converting from a Markov to a Bayesian network faces an interesting problem: data flow. In the Markov network below information can flow from B to C, but if this were converted to a Bayesian network, this flow of information would not exist. Converting from a Markov to a Bayesian network faces an interesting problem: data flow. In the Markov network below information can flow from B to C, but if this were converted to a Bayesian network, this flow of information would not exist. - [[File:​Markovtobayesproblem.png]] + [[media:​cs-677sp10:​Markovtobayesproblem.png]] To overcomes this, you perform three steps: To overcomes this, you perform three steps: Line 274: Line 274: # Add directionality. # Add directionality. - [[File:​Markovtobayes.png‎]] + [[media:​cs-677sp10:​Markovtobayes.png‎]] It is important that when adding directionality,​ a topological sort is performed and that directionality is set in the direction of the sort.  This guarantees the necessary dependencies. It is important that when adding directionality,​ a topological sort is performed and that directionality is set in the direction of the sort.  This guarantees the necessary dependencies. Line 281: Line 281: = Candidate Exam Question = = Candidate Exam Question = - Given the Bayesian network below, convert the network to a Markov random field. Then, assuming that each node can take a state of zero or one, create a potential table assigning values of your choice to the potentials. Then find the following probabilities: ​<​math>​P(a^0, b^1, c^1), P(a^1), P(b^1 | c^1), P(a^0 | b^0, c^0)​. Show your work so that it is easy to follow what you did. + Given the Bayesian network below, convert the network to a Markov random field. Then, assuming that each node can take a state of zero or one, create a potential table assigning values of your choice to the potentials. Then find the following probabilities: ​$P(a^0, b^1, c^1), P(a^1), P(b^1 | c^1), P(a^0 | b^0, c^0)$. Show your work so that it is easy to follow what you did. - [[File:​Examquestion.png‎]] + [[media:​cs-677sp10:​Examquestion.png‎]] = References and Further Reading = = References and Further Reading = 