A Markov random field (MRF) is a graphical model where a set of random variables are described by an undirected graph. An MRF is also called a Markov network or an undirected graphical model.

Note: Since the book avoids the continuous case because of its difficulty, I have also avoided it and will work only with the discrete case unless otherwise specified. However, most of what applies to the discrete case also applies to the continuous case.

So far in this class we have dealt with Bayesian networks, which are directed acyclic graphs. These networks are powerful and can represent many different dependence/independence relationships between nodes and, therefore, between the events those nodes represent. This makes them extremely useful for many real-world situations.

There are, however, situations involving independencies that a Bayesian network cannot represent. This leads to another important type of graphical model, based on undirected graphs, called a Markov random field (MRF). An MRF represents the joint probability distribution over events which, as in a Bayesian network, are represented by variables/nodes.

To show one situation where this model is appropriate, suppose we have a model with four Bernoulli nodes {A, B, C, D} with independency properties $(A \perp C | \{B, D\})$ and $(B \perp D | \{A, C\})$. How would you represent this model as a Bayesian network?

Ben_img1directedA.png In this attempt, we get the property $(A \perp C | \{B, D\})$, but this network also implies that B and D are independent given only A.

Ben_img1directedB.png This network also implies $(A \perp C | \{B, D\})$, but this network also implies that B and D are marginally independent.

Ben_img1undirected.png The only natural way to capture both independencies is with an MRF, as shown in the image to the left.

Now that we have shown a situation where a Bayesian network comes up lacking, we will go through a comparison of Markov random fields and Bayesian networks and how the techniques we have applied in class can also apply to MRFs.

An MRF can naturally contain cycles, since its edges are undirected and information flows both ways through the graph. This property allows us to create graphs like the one in the previous section. This is a property that cannot be naturally represented by a Bayesian network, which is an acyclic graph.

There are, however, dependencies that an MRF cannot represent. One of these is the induced dependency as shown below. If the child node is observed and has multiple parents, in a Bayesian network (BN) information can flow between the parent nodes. In an MRF, no information can flow across known nodes.

No information can flow because of the three Markov properties that define conditional independence for a Markov network:

*Pairwise Markov property*: Any two non-adjacent random variables are conditionally independent given that all other variables in the network are observed.

Ben_pairwise.png $(A \perp Z | \{all other nodes\})$

*Local Markov property*: A variable is conditionally independent of all other variables given its neighbors.

Ben_local.png $(B \perp C | \{neighbors of C\})$

*Global Markov property*: Any two subsets of variables are conditionally independent given a separating subset, i.e. an observed subset through which every path between the two subsets passes.

Ben_global.png $(S1 \perp S2 | \{S\})$

These properties also lead to the conclusion that the Markov blanket for any particular node A in an MRF is simply the set of nodes adjacent to A.
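The global Markov property can be checked mechanically: block the observed nodes and test whether the two subsets are still connected. A minimal sketch in Python (the adjacency-list representation and node names are illustrative, not from the text):

```python
from collections import deque

def separated(graph, s1, s2, observed):
    """Return True if every path from s1 to s2 passes through an
    observed node (BFS that is not allowed to enter observed nodes)."""
    frontier = deque(s1)
    visited = set(s1)
    while frontier:
        node = frontier.popleft()
        if node in s2:
            return False              # reached s2 without crossing 'observed'
        for nbr in graph[node]:
            if nbr not in visited and nbr not in observed:
                visited.add(nbr)
                frontier.append(nbr)
    return True

# The four-node diamond A-B-C-D from Section 1.1 (hypothetical adjacency lists)
g = {"A": ["B", "D"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C", "A"]}
print(separated(g, {"A"}, {"C"}, {"B", "D"}))  # A ⊥ C | {B, D} → True
print(separated(g, {"A"}, {"C"}, {"B"}))       # path A-D-C remains open → False
```

The same routine also verifies the Markov-blanket claim: observing just the neighbors of A separates A from everything else.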

Because the node interactions are not directed, there is no reason to use a conditional probability distribution to represent them. It is more appropriate to capture “affinities” between variables. We, therefore, assign a general-purpose function called a factor or potential $\phi$ to each clique of variables (in this pairwise example, to each pair of adjacent variables). In the discrete case, $\phi$ contains a value for each combination of variable states. This will be made clear in the example below.

Combining this set of factors for a network leads to a table that represents the various affinities of the network. Table values in an MRF are generally non-negative, but have no other restrictions. These values do not have to represent probabilities and they do not have to sum up to one. They are merely an association between variables.

With these factors, we now have the ability to find probability distributions. For a network consisting of the random variables $\{X_1, X_2,\ldots, X_n\}$ and parametrized by the set of factors $\Phi = \{\phi_1(D_1),\phi_2(D_2),\ldots,\phi_k(D_k)\}$, the joint probability distribution of a network is formally defined as $P(X_1, X_2, \ldots, X_n) = (1/Z)\tilde{P}(X_1, X_2, \ldots, X_n)$ where $\tilde{P}(X_1, X_2, \ldots, X_n) = \prod_{k} \phi_k(D_k)$ and where Z is a normalizing constant (also called a partition function) defined by $Z = \sum_{X_1, X_2,\ldots,X_n}\tilde{P}(X_1, X_2, \ldots, X_n)$. The normalizing constant is necessary because the affinities do not represent actual probabilities.

In the discrete case, computing the joint distribution for every combination of variable values creates a table. With this, we can find any conditional probability from the definition of conditional probability, just as we do for a Bayesian network (e.g. $P(X_1 | X_2, \ldots, X_n) = P(X_1, X_2, \ldots, X_n)/P(X_2, \ldots, X_n)$).

Going back to our discrete Bernoulli case from Section 1.1 above, we could create a table of factors to represent the associations between the nodes in the network as shown below. The values within a factor essentially weight the different cases. So for $\phi_1(A,B)$ we feel that the $a^0, b^0$ case is the most likely and we have weighted it accordingly.

| $\phi_1(A,B)$ | | | $\phi_2(B,C)$ | | | $\phi_3(C,D)$ | | | $\phi_4(D,A)$ | | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| $a^0$ | $b^0$ | $30$ | $b^0$ | $c^0$ | $100$ | $c^0$ | $d^0$ | $1$ | $d^0$ | $a^0$ | $100$ |
| $a^0$ | $b^1$ | $5$ | $b^0$ | $c^1$ | $1$ | $c^0$ | $d^1$ | $100$ | $d^0$ | $a^1$ | $1$ |
| $a^1$ | $b^0$ | $1$ | $b^1$ | $c^0$ | $1$ | $c^1$ | $d^0$ | $100$ | $d^1$ | $a^0$ | $1$ |
| $a^1$ | $b^1$ | $10$ | $b^1$ | $c^1$ | $100$ | $c^1$ | $d^1$ | $1$ | $d^1$ | $a^1$ | $100$ |

To find the joint probability distribution $P(a, b, c, d)$, we combine the local models just as we do in a Bayesian network and as described above, giving us $\phi_1(a,b)*\phi_2(b,c)*\phi_3(c,d)*\phi_4(d,a)$. Since there is no guarantee that this product is normalized, we must divide by the normalizing constant $Z$. This gives us $P(a, b, c, d) = (1/Z)*\phi_1(a,b)*\phi_2(b,c)*\phi_3(c,d)*\phi_4(d,a)$ where $Z = \sum_{a,b,c,d} \phi_1(a,b)*\phi_2(b,c)*\phi_3(c,d)*\phi_4(d,a)$. A breakdown of all the joint distributions for this network is given in the table below.
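As a sanity check, the computation just described can be reproduced with a short Python sketch (the factor values come from the example above; the variable and function names are my own):

```python
from itertools import product

# Factor tables from the example, indexed by variable states 0/1
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi4(D,A)

def unnorm(a, b, c, d):
    """Unnormalized measure P~(a,b,c,d): the product of the four factors."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function: sum over all 16 assignments
Z = sum(unnorm(*x) for x in product((0, 1), repeat=4))
print(Z)                                 # 7201840

# One joint probability
print(round(unnorm(0, 1, 1, 0) / Z, 2)) # P(a0,b1,c1,d0) → 0.69

# A conditional: Z cancels, so unnormalized values suffice
num = unnorm(0, 0, 1, 0) + unnorm(0, 0, 1, 1)               # P~(a0,b0,c1)
den = num + unnorm(1, 0, 1, 0) + unnorm(1, 0, 1, 1)         # P~(b0,c1)
print(round(num / den, 4))              # P(a0 | b0, c1) → 0.9993
```

Note that the conditional can be computed from unnormalized values alone, since the partition function cancels in the ratio.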

| Assignment | | | | Unnormalized | Normalized |
|---|---|---|---|---|---|
| $a^0$ | $b^0$ | $c^0$ | $d^0$ | $300000$ | $.04$ |
| $a^0$ | $b^0$ | $c^0$ | $d^1$ | $300000$ | $.04$ |
| $a^0$ | $b^0$ | $c^1$ | $d^0$ | $300000$ | $.04$ |
| $a^0$ | $b^0$ | $c^1$ | $d^1$ | $30$ | $4.1 \times 10^{-6}$ |
| $a^0$ | $b^1$ | $c^0$ | $d^0$ | $500$ | $6.9 \times 10^{-5}$ |
| $a^0$ | $b^1$ | $c^0$ | $d^1$ | $500$ | $6.9 \times 10^{-5}$ |
| $a^0$ | $b^1$ | $c^1$ | $d^0$ | $5000000$ | $.69$ |
| $a^0$ | $b^1$ | $c^1$ | $d^1$ | $500$ | $6.9 \times 10^{-5}$ |
| $a^1$ | $b^0$ | $c^0$ | $d^0$ | $100$ | $1.4 \times 10^{-5}$ |
| $a^1$ | $b^0$ | $c^0$ | $d^1$ | $1000000$ | $.14$ |
| $a^1$ | $b^0$ | $c^1$ | $d^0$ | $100$ | $1.4 \times 10^{-5}$ |
| $a^1$ | $b^0$ | $c^1$ | $d^1$ | $100$ | $1.4 \times 10^{-5}$ |
| $a^1$ | $b^1$ | $c^0$ | $d^0$ | $10$ | $1.4 \times 10^{-6}$ |
| $a^1$ | $b^1$ | $c^0$ | $d^1$ | $100000$ | $.014$ |
| $a^1$ | $b^1$ | $c^1$ | $d^0$ | $100000$ | $.014$ |
| $a^1$ | $b^1$ | $c^1$ | $d^1$ | $100000$ | $.014$ |

Now, using the definition of conditional probability, we can compute various conditional probabilities. For example, $P(a^0|b^0,c^1) = P(a^0,b^0,c^1)/P(b^0,c^1) = (.04+4.1 \times 10^{-6})/(.04+4.1 \times 10^{-6}+1.4 \times 10^{-5}+1.4 \times 10^{-5}) \approx .9993$.

In Bayesian networks, we used Gibbs sampling to sample from a joint distribution that was not known explicitly when we did know the conditional distribution of each variable. In a Markov network, we can use Gibbs sampling in the same way.

For any Markov network N, we can cycle through the non-evidence variables, resampling each node while holding all other node values fixed. Since each node's conditional distribution given the rest of the network is known, sampling is straightforward when we know how to sample from that distribution: we draw a value from it and update the current non-evidence node.

Of course, there are many times when we do not know how to sample from the conditional distributions. In this case, just as with a Bayesian network, we can apply Gibbs sampling with a Metropolis step, in which we:

- Sample $x^*$ from a proposal distribution $g(x^*)$ where g is close to the actual distribution f, contains the support of f, and is symmetric.
- Compute $r = f(x^*)/f(x)$
- With probability $min(r, 1)$, update the node value to $x^*$; otherwise, leave the value the same.
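For our small binary network, the steps above can be sketched directly: flipping a single bit is a symmetric proposal, and because the partition function cancels, the acceptance ratio $r$ reduces to a ratio of unnormalized factor products. A rough illustration (the sweep count and seed are arbitrary choices of mine):

```python
import random

random.seed(0)

# Factor tables from the worked example, indexed by variable states 0/1
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}      # phi1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}    # phi3(C,D)
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}    # phi4(D,A)

def unnorm(s):
    """Unnormalized measure of a full assignment [a, b, c, d]."""
    a, b, c, d = s
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

state = [0, 0, 0, 0]
samples = []
for sweep in range(50000):
    for i in range(4):                    # cycle through the variables
        proposal = state[:]
        proposal[i] ^= 1                  # symmetric proposal: flip one bit
        r = unnorm(proposal) / unnorm(state)
        if random.random() < min(r, 1):   # accept with probability min(r, 1)
            state = proposal
    samples.append(state[0])              # record A after each full sweep

est = sum(v == 0 for v in samples) / len(samples)
print(est)  # should be close to the exact marginal P(a^0) ≈ 0.82
```

The exact marginal, obtained by summing the normalized table rows with $a^0$, is about $0.82$, so the estimate gives a quick check that the sampler is behaving sensibly.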

The fact that we can use the same Bayesian algorithms on an undirected graph shows just how closely related the two networks are and how powerful the Bayesian methods actually are.

Because Bayesian and Markov networks are so related, it is important to understand the relationship between them. We see this relationship by converting from one to the other. In the conversion, it is important that the converted model can represent every dependency of the original (its independencies are a subset of the original's) and that the new structure can handle any evidence from the original model.

To convert from a Bayesian network to a Markov network we follow three steps:

- Maintain the structure of the Bayesian network.
- Eliminate directionality by making all edges undirected.
- Moralize.

Moralizing handles the induced dependency problem that was described in Section 2. The term moralize comes from the idea that it would be immoral for any child node to have multiple parents if the parents were not connected. Therefore, an undirected edge must be introduced between any parent nodes that share a child.
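The three conversion steps can be sketched in a few lines of Python; the parent-list representation below is a hypothetical encoding of the DAG, not anything from the book:

```python
from itertools import combinations

def moralize(parents):
    """Convert a DAG (node -> list of parent nodes) into its undirected
    moral graph (node -> set of neighbors)."""
    nodes = set(parents) | {p for ps in parents.values() for p in ps}
    moral = {n: set() for n in nodes}
    for child, ps in parents.items():
        for p in ps:                        # keep each edge, drop direction
            moral[child].add(p)
            moral[p].add(child)
        for p, q in combinations(ps, 2):    # "marry" co-parents of each child
            moral[p].add(q)
            moral[q].add(p)
    return moral

# v-structure A -> C <- B: moralization must connect A and B
moral = moralize({"C": ["A", "B"]})
print(moral["A"], moral["B"])  # A and B are now neighbors
```

Note how the v-structure's parents end up adjacent, which is exactly what prevents the lost induced dependency.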

The final network is a Markov network that has all of the dependencies of the original, plus a few more (some independencies of the original are lost in moralization). This shows that we can represent the original network using an undirected graph.

Converting from a Markov network to a Bayesian network faces an interesting problem: information flow. In the Markov network below, information can flow from B to C, but if this were converted directly to a Bayesian network, this flow of information would not exist.

To overcome this, we perform three steps:

- Maintain the structure of the Markov network.
- Triangulate the graph to guarantee all dependency representations.
- Add directionality.

It is important that when adding directionality, a topological sort is performed and that directionality is set in the direction of the sort. This guarantees the necessary dependencies.

The end result is a Bayesian network that contains a superset of all the properties represented in the original Markov network.

Given the Bayesian network below, convert the network to a Markov random field. Then, assuming that each node can take a state of zero or one, create a potential table assigning values of your choice to the potentials. Then find the following probabilities: $P(a^0, b^1, c^1), P(a^1), P(b^1 | c^1), P(a^0 | b^0, c^0)$. Show your work so that it is easy to follow what you did.

- Koller, D. and Friedman, N., *Probabilistic Graphical Models: Principles and Techniques*, Chapter 4.