# Introduction

Exact Inference is computationally expensive. If we are given a bayes net, calculating the marginal distribution of a variable is a lot of work.

Consider the following Bayesian network:

To compute p(D), we calculate the joint probability and sum out everything but D:

$p(D) = \sum_a \sum_b \sum_c p(A,B,C,D) = \sum_a \sum_b \sum_c p(A)p(B|A)p(C|B)p(D|C)$

It seems like this is more work than it is worth. If each of the variables have k values, then we end up having to generate $k^n$ probabilities in the joint distribution that we have to sum over. There must be a more efficient way to compute these marginal probabilities.

Enter Variable Elimination.

# The Basic Ideas

Variable Elimination fundamentally contains two main ideas that help address this exponential blowup of the joint distribution.

The first idea is about exploiting properties of arithmetic. For example, if you have an expression:

$yx_1 + yx_2 + yx_3 + yx_4$

This expression has four multiplications and three additions. However, using the well known properties of arithmetic we can rearrange the expression:

$y(x_1 + x_2 + x_3 + x_4)$

Now we have ONE multiplication and three additions.

The second idea is about exploiting the structure of the Bayesian network. Many of the sub expressions in the joint only depend on a small number of variables.

For example, lets take our Bayes net from above (which is structured as a chain), again the marginal probability of p(D) is:

$p(D) = \sum_a \sum_b \sum_c p(a)p(b|a)p(c|b)p(D|c)$

Now if we apply what we know about arithmetic and the structure of the network we can rearrange:

$p(B) = \sum_a p(a)p(B|a)$

$p(C) = \sum_b p(b)p(C|b)$

$p(D) = \sum_c p(c)p(D|c)$

Essentially what we did here was perform the computation from the inside out. If we stick the above expressions back into the equation for p(D), we get:

$p(D) = \sum_c p(D|c) \Bigg( \sum_b p(C|b) \Bigg( \sum_a p(a)p(B|a) \Bigg)\Bigg)$

The inner expression (or p(B) ) is computed first for all values of B and stored so we only compute them once (thus eliminating A). Then p(C) is computed with the values of p(B) and stored. Finally p(D) is computed with the stored values of p(C).

In general terms, lets assume a Bayesian network that has the structure of a chain with n variables $X_1 \rightarrow \ldots \rightarrow X_n$ ,where each variable has $k$ possible values. Computing $p(X_{i+1})$ can be defined recursively.

$p(X_{i+1}) = \sum_{x_{i}} p(X_{i+1} | x_i)p(x_i)$

Each recursive step is $O(k^2)$ and we recurse through all n variables in the worst case scenerio, so the total bigO is $O(nk^2)$. This is much better than generating the full $k^n$ probabilities to sum over in the joint distribution.

# The Algorithm

To define the basic variable elimination algorithm, we must first define a few terms.

• $\phi$ is called a factor,or a conditional probability from a Bayesian network. For example, from the above network, p(B|A) represents one factor.
• $Scope[\phi]$ is the set of variables that are used by the factor $\phi$.

VariableElimination(
$\Phi$  // Set of factors for a given Bayesian network
Q  // Query variable
E=e  // Set of observed variables (and their values)
Z  // Ordered set of variables not in Q$\cup$E
)

* For each $\phi \in \Phi$ set all observed variables E $\in Scope[\phi]$ to E=e

* For each $Z_i \in Z$
$\Phi \leftarrow Eliminate(\Phi,Z_i)$

* $\phi^{*} \leftarrow \prod_{\phi \in \Phi} \phi$

* $\alpha \leftarrow \sum_{q \in Val(Q)} \phi^{*}(q)$

* return $\phi^{*} / \alpha$
Eliminate(
$\Phi$ \\ Set of factors
Z \\ Variable to eliminate

* $\Phi' \leftarrow \{$ Every $\phi \in \Phi$ where $Z \in Scope[\phi]\}$
* $\Phi'' \leftarrow \Phi - \Phi'$
* $\psi \leftarrow \prod_{\phi \in \Phi'} \phi$
* $\tau \leftarrow \sum_{z} \psi$
* return $\Phi'' \cup \{\tau\}$

Note that this algorithm is based off the version presented in (Koller, Friedman, 2009, see references).

# Example walk through

For an example, lets take the following Bayesian network that you have seen before (Russel,Norvig, 2005):

For convenience let B = Burglary, E = Earthquake, A = Alarm, M = Marycalls, and J = Johncalls.

p(B) = .001
p(E) = .002

p(A|B,E):                p(M|A):             p(J|A):
----------------         ------------       ------------
| B | E | p(A) |         | A | p(M) |       | A | p(J) |
----------------         ------------       ------------
| t | t | .95  |         | t | .70  |       | t | .90  |
| t | f | .94  |         | f | .01  |       | f | .05  |
| f | t | .29  |         ------------       ------------
| f | f | .001 |
----------------

Lets say we would like to compute p(B|J,M), which is the probability that there was a Burglary given John and Mary called.

$p(B|M,J) \propto \sum_E \sum_A p(B,E,A,M,J) \propto \sum_E \sum_A p(B)p(E)p(A|B,E)p(M|A)p(J|A)$

We will now apply the Variable Elimination Algorithm:

• M and J are observed and B is the query variable, so we need to eliminate E and A

<b>Eliminate E:</b>

$\tau_1(B,A) = \sum_E p(E)*p(A|B,E)$

• Note that $\tau_1(B,A)$ is a vector of values for each combination of B and A. Specifically:

$\tau_1(B,A) = p(E)*p(A|B,E) + p(\bar{E})*p(A|B,\bar{E}) = (.002 * .95) + (.998 * .94) =$<b>.94002</b>

$\tau_1(B,\bar{A}) = p(E)*p(\bar{A}|B,E) + p(\bar{E})*p(\bar{A}|B,\bar{E}) = (.002 * .05) + (.998 * .06) =$<b>.05998</b>

$\tau_1(\bar{B},A) = p(E)*p(A|\bar{B},E) + p(\bar{E})*p(A|\bar{B},\bar{E}) = (.002 * .29) + (.998 * .001) =$<b>.00158</b>

$\tau_1(\bar{B},\bar{A}) = p(E)*p(\bar{A}|\bar{B},E) + p(\bar{E})*p(\bar{A}|\bar{B},\bar{E}) = (.002 * .71) + (.998 * .999) =$<b>.99842</b>

So we now have $p(B)\tau_1(B,A)p(M|A)p(J|A)$

<b>Eliminate A:</b>

$\tau_2(B,M,J) = \sum_A \tau_1(B,A)*p(M|A)*p(J|A)$

• Note that M and J are observed as true and so $\tau_2(B,M,J)$ is only a vector over the values of B and can be written as $\tau_2'(B)$. Specifically:

$\tau_2'(B) = \tau_1(B,A)*p(J|A)*p(M|A) + \tau_1(B,\bar{A})*p(J|\bar{A})*p(M|\bar{A}) = (.94002 * .90 * .70) + (.05998 * .05 * .01) =$<b>.59224</b>

$\tau_2'(\bar{B}) = \tau_1(\bar{B},A)*p(J|A)*p(M|A) + \tau_1(\bar{B},\bar{A})*p(J|\bar{A})*p(M|\bar{A}) = (.00158 * .90 * .70) + (.99842 * .05 * .01) =$<b>.00149</b>

<b>Finishing up</b>

So we now have $p(B)\tau_2'(B)$ and can easily compute p(B|M,J):

$p(B|M,J) = \frac{p(B)\tau_2'(B)}{\sum_B p(B)\tau_2'(B)} = \frac{p(B)\tau_2'(B)}{p(B)\tau_2'(B) + p(\bar{B})\tau_2'(\bar{B})} = \frac{(.001 * .59224)}{(.001 * .59224)+(.999 * .00149)} =$<b>.2846</b>

Note that this whole algorithm performed only 18 multiplication, 7 additions and 1 division for a total of 26 arithmetic operations. We never had to calculate the full joint distribution which would have required 128 multiplications alone.

# Other Considerations

## Graph Structure

It should be apparent that the structure of the network has a lot to do with how much advantage is gained by the Variable Elimination algorithm. More complicated networks will result in factors that depend on a large number of variables that cannot be easily eliminated. A simple example is shown here:

The resulting factors are: <b>p(B)p(C)p(D)p(E)p(A|B,C,D,E)</b>. Trying to eliminate even one variable will result in having to enumerate almost every combination of values for each variable, not much different than computing the whole joint distribution.

## Elimination Ordering

The order in which you eliminate variables also has an impact on the efficiency of the Variable Elimination Algorithm. For example, in the previous section, had we eliminated the Alarm variable first, we would have had to perform four additional multiplications. Not a big deal for a simple example, but for a more complicated network, the right ordering will make a significant difference. Unfortunately, finding the optimal elimination ordering is an NP-Hard problem. However, there are a few techniques out there that can find good elimination orderings. A couple examples are:

• <b>Chordal Graphs</b> - Uses, as a heuristic, the definition of a Chordal graph and how it relates to the resultant graph after a variable is eliminated to find a good elimination ordering.
• <b>Minimal Fill/Size/Weight Search</b> - Basically does a greedy search that orders the variables one at a time by determining which one will result in the smallest graph after that variable has been eliminated.

# Candidate Question

Given the following Bayesian Network (binary values), calculate p(B|A,E) using variable elimination (show steps).

Probabilities:   p(variable|condition) = {condition=false, condition=true}

p(A|B) = {.25, .33}   p(B|C) = {.10, .99}

p(C|D) = {.40, .70}   p(E|C) = {.20, .04}

p(F|B) = {.08, .85}   p(G|B) = {.50, .60}

p(D) = .115

# References

• Probabilistic Graphical Models Principles and Techniques, by Daphne Koller and Nir Friedman, 2009 MIT Press
• Artificial Intelligence A Modern Approach, by Stuart Russell and Peter Norvig, 2003 Pearson Education, Inc
• A Simple approach to Bayesian Network Computations, by N. Zhang and David Poole, 1994
• Efficient Reasoning in Graphical Models, by Rina Dechter, 1999