In the variable elimination algorithm, the basic computational step is the manipulation of factors. In clique trees, a factor $\phi_i$ is considered as a computational data structure, which receives messages generated by other factors and sends a message used by another factor.

A cluster graph for a set of factors F is an undirected graph with the following properties:

- Each node i is associated with a subset ( cluster ) $C_i$
- Family preserving: for each factor $f_i \in F$ , such that scope[$f_i$]$\subseteq C_i$
- Each edge i-j is associated with a sepset $S_{ij} = C_i \cap C_j$

Example: From Variable Elimination Execution to cluster tree

Fig.1 Original graph

Fig.2 Cluster tree after the variable elimination execution

Execution of variable elimination defines a cluster-graph

- Each factor used in elimination becomes a cluster-node

- An edge is drawn between two clusters if a message is passed between them in elimination

A clique tree is a cluster tree that satisfies the running intersection property

- Running Intersection Property: if cluster $C_i$ and $C_j$ both contain variable X, then all clusters in the ( unique ) path between $C_i$ and $C_j$ contain X too.
- The running intersection property implies that $S_{ij} = C_i \cap C_j$

So the tree shown in Fig.2 is also a clique tree, each box represents a clique C and on each edge between them the sepset $S_{ij} $ is labeled.

Given a clique tree as a starting point, this data structure can be used to perform variable elimination. Furthermore, the clique tree provides a data structure allowing multiple executions of variable elimination to be performed much more efficiently than simply performing each one separately.

Here is an example of using a clique tree to do the variable elimination.

Fig.3 Given a possible clique tree at the beginning

Notice that this clique tree is different from the one in Fig.2 which is obtained from the variable elimination execution. And we could verify that this clique tree satisfy the definition.

Let's define a term before we show the clique tree algorithm

The initial potential $\pi^0_i(C_i)$ is computed by multiplying the initial factors assigned to the clique $C_i$ . For example, $\pi^0_4(G,J,S,L) = P(L|G)P(J|S,L)$.

Fig.4 initial potentials

Now, assume that we want to compute the probability P(J). We do the varibale elimination process so that J is not eliminated. We select a clique containing J as our root clique, for example,$ C_4$.

Each clique $C_i$ , except for the root, performs a message passing computation and sends a message to its upstream neighbor. The clique $C_i$ multiplies all incoming messages from its other neighbors with its initial potential, resulting in a factor whose scope is the clique. It then sums out all variables except those in the sepset between $C_i$ and $C_j$ , and sends the resulting factor as a message to $C_j$ .

Fig.5 Message passing

In Fig.5, for example, the clique $C_2(G,I,D)$ generates a factor ( message ) by multiplying the message$\delta_{1 \rightarrow 2}(D)$ from $C_1$ with its initial potential $\pi^0_2(G,I,D)$ and then summing out the variable D. The resulting factor ( message ) which has a scope G, I, is then sent to clique $C_3(G,S,I)$ .

We say that a clique is *ready* when it has received all of its incoming messages. Thus, in this example, we could see that $C_4$ is *ready* at the very start of the algorithm, and the computation associated with it can be performed at any point in the execution.

If we want to compute the posterior distribution over every random variable in the network with least computation, there is a “Shafer-Shenoy” algorithm to solve this.

- Define a root
- Asynchromous implementation of of messages passing in two passes: upward ( sending towards the root ) and downward ( send from the root )
- Marginalize clique’s ancillary variables to get the marginal probability distribution of each variable

Fig.6 Messages passing in two passes

A key point is that the resulting marginal distribution does not depend on the clique we selected. That is, if variable X appears in two cliques, they must agree on its marginal. So here comes the definition of *calibrated clique tree*.

- Define a term, belief $\beta_i(C_i)=\pi^0_i \prod_{k \in Nb_i} \delta_{k \rightarrow i} $
- Two adjacent cliques $C_i$ and $C_j$ are said to be
*calibrated*if $\sum_{C_i - S_{i,j}} \beta_i(C_i) = \sum_{C_j - S_{i,j}} \beta_j(C_j)$ - A clique tree is
*calibrated*if all pairs of adjacent cliques are*calibrated*.

An alternative approach to computing the message is to multiply in all of the messages and then divide the resulting factor by returning message.

$\delta_{i \rightarrow j} = \dfrac{\sum_{C_i - S_{i,j}} \beta_i(C_i)}{ \delta_{j \rightarrow i}} $

In principle, they both use the same basic operations of multiplying factors and summing out variables. The cliques in clique tree are basically the factors in variable elimination. In practice, clique trees have:

- Advantages:
- Providesing answers to mutiple cliques using a single computation
- Allowing the same data structure to answer a broad range of queries
- Executing the required operations in advance, such as the construction of the basic data structures, the choice of eliminating ordering and the product of CPDs assigned to a clique.

- Disadvantages:
- More expensive in terms of space
- The structure of the computation is fixed and predetermined

Given a Baysian Network, construct your own clique tree and compute the posterior marginal distribution of each node in the network. (Using the calibrated clique tree algorithm)

- Probabilistic Graphical Models Principles and Techniques, by Daphne Koller and Nir Friedman, 2009 MIT Press