Many real-world problems present the dilemma of inferring the structure of a Bayesian network from data without any prior knowledge of the structure. Consider, for example, the problem of inferring the way that genes interact in a gene network given microarray data representing the expression levels of genes in various cellular states.

<i>Score-based structure learning</i> approaches this problem by defining a space of candidate structures and a score function, and then searching for high-scoring structures with respect to the training data. Though models learned from data are ultimately best verified by experimentation, Bayesian methods are particularly advantageous for these problems in that they return a distribution over possible models rather than a single best model (Murphy, 1999).

As in parameter learning, the likelihood function measures the probability of the data given a model. For any given structure, the likelihood is maximized by plugging in the maximum likelihood parameters, so maximizing the likelihood of the data $\mathcal{D}$ reduces to a maximization over graphs $\mathcal{G}$:

$\max_{\mathcal{G},\Theta_{\mathcal{G}}}L(\langle\mathcal{G},\Theta_{\mathcal{G}}\rangle : \mathcal{D}) = \max_{\mathcal{G}}\left[\max_{\Theta_{\mathcal{G}}}L(\langle\mathcal{G},\Theta_{\mathcal{G}}\rangle : \mathcal{D})\right]$

Thus the structure with the highest likelihood score using its respective maximum likelihood parameters is the maximum likelihood ($\mathcal{G}$, $\Theta_{\mathcal{G}}$) pair for the data.

The following example illustrates how the likelihood of a network ultimately measures the strength of the dependency between a node and its parents.

Consider that we want to compare the likelihood scores for two two-node Bayesian models, $\mathcal{G}_0$ and $\mathcal{G}_1$, given some data $\mathcal{D}$. In the case of $\mathcal{G}_0$, X and Y are independent and the log-likelihood function is calculated as

$score_L(\mathcal{G}_0 : \mathcal{D}) = \sum_m \log\hat{\Theta}_{x[m]} + \log\hat{\Theta}_{y[m]}$

where $m$ indexes the data instances. In the case of $\mathcal{G}_1$, X → Y. We see then that the log-likelihood for this structure given the data is

$score_L(\mathcal{G}_1 : \mathcal{D}) = \sum_m \log\hat{\Theta}_{x[m]} + \log\hat{\Theta}_{y[m]|x[m]}$

where $\hat{\Theta}_x$ is the MLE for $P(x)$ and $\hat{\Theta}_{y|x}$ is the MLE for $P(y|x)$.

In calculating the difference in likelihood scores, the first term cancels out to yield the equation

$score_L(\mathcal{G}_1 : \mathcal{D}) - score_L(\mathcal{G}_0 : \mathcal{D}) = \sum_m \log\hat{\Theta}_{y[m]|x[m]} - \log\hat{\Theta}_{y[m]}$

By grouping the instances by the value of $(x,y)$ and writing $M[x,y]$ for the number of times each combination occurs (similarly $M[y]$), we can write this as

$score_L(\mathcal{G}_1 : \mathcal{D}) - score_L(\mathcal{G}_0 : \mathcal{D}) = \sum_{x,y} M[x,y]\log\hat{\Theta}_{y|x} - \sum_{y} M[y]\log\hat{\Theta}_{y}$

If we let $\hat{P}(x,y)$ be the empirical frequency of $x,y$ we can then write $M[x,y]$ as $M\cdot\hat{P}(x,y)$ and $M[y]$ as $M\cdot\hat{P}(y)$. With some manipulation we arrive at the form

$score_L(\mathcal{G}_1 : \mathcal{D}) - score_L(\mathcal{G}_0 : \mathcal{D}) = M\sum_{x,y} \hat{P}(x,y)\log \frac{\hat{P}(y|x)}{\hat{P}(y)} = M \cdot \iota_{\hat{P}}(X;Y)$

where $\iota_{\hat{P}}(X;Y)$ is the mutual information between X and Y under the empirical distribution, which measures how much information X provides about Y.
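This identity can be checked numerically. The sketch below (in Python, with a hypothetical toy data set) computes the empirical mutual information and verifies that the difference of the two log-likelihood scores equals $M \cdot \iota_{\hat{P}}(X;Y)$:

```python
import math
from collections import Counter

def empirical_mi(pairs):
    """Empirical mutual information I(X;Y) in nats, from a list of (x, y) samples."""
    m = len(pairs)
    pxy = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    return sum((c / m) * math.log(c * m / (px[x] * py[y])) for (x, y), c in pxy.items())

def loglik(pairs, x_to_y):
    """Log-likelihood at the MLE parameters: G1 (X -> Y) if x_to_y, else G0 (independent)."""
    m = len(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    pxy = Counter(pairs)
    ll = 0.0
    for x, y in pairs:
        ll += math.log(px[x] / m)  # log theta_x
        # log theta_{y|x} under G1, log theta_y under G0
        ll += math.log(pxy[(x, y)] / px[x]) if x_to_y else math.log(py[y] / m)
    return ll

data = [(1, 1), (1, 1), (1, 0), (0, 0), (0, 0), (0, 1), (1, 1), (0, 0)]
M = len(data)
diff = loglik(data, True) - loglik(data, False)
assert abs(diff - M * empirical_mi(data)) < 1e-9  # score difference = M * I(X;Y)
```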

Generalizing this result to a score function for comparing structures with respect to the same data set $\mathcal{D}$ the likelihood score for a graph $\mathcal{G}$ is:

$score_L(\mathcal{G} : \mathcal{D}) = M\sum_{i=1}^n \iota_{\hat{P}}(X_i;Pa_{X_i}^\mathcal{G})$

where $\iota_{\hat{P}}(X_i;Pa_{X_i})$ is 0 if the variable $X_i$ has no parents. (The full log-likelihood also contains an entropy term $-M\sum_i H_{\hat{P}}(X_i)$, but since it does not depend on the graph structure it can be dropped when comparing structures.)

Note then that the likelihood score of a network reflects the strength of the dependencies of the variables upon their parents. A network in which the parents provide more information about their children scores higher.
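The decomposed score is straightforward to compute. The sketch below (toy data and the column-index representation of graphs are illustrative choices, not from the text) scores a graph as $M\sum_i \iota_{\hat{P}}(X_i;Pa_{X_i})$:

```python
import math
from collections import Counter

def mi(data, a_cols, b_cols):
    """Empirical mutual information between two groups of columns of a data set."""
    m = len(data)
    ja = Counter(tuple(r[c] for c in a_cols) for r in data)
    jb = Counter(tuple(r[c] for c in b_cols) for r in data)
    jab = Counter((tuple(r[c] for c in a_cols), tuple(r[c] for c in b_cols)) for r in data)
    return sum((c / m) * math.log(c * m / (ja[a] * jb[b])) for (a, b), c in jab.items())

def likelihood_score(data, parents):
    """score_L(G : D) = M * sum_i I(X_i; Pa_i); `parents` maps column index -> parent columns."""
    return len(data) * sum(mi(data, [i], pa) for i, pa in parents.items() if pa)

# Toy data over three binary variables; X1 tracks X0 most of the time.
data = [(0, 0, 1), (0, 0, 0), (1, 1, 1), (1, 1, 0), (1, 0, 1), (0, 1, 0)]
g_edge = {0: [], 1: [0], 2: []}   # X0 -> X1
g_empty = {0: [], 1: [], 2: []}   # no edges: score is 0
assert likelihood_score(data, g_edge) >= likelihood_score(data, g_empty) == 0.0
```

Note how decomposability shows up directly: the score is a sum of per-family terms, so a local change to one family only requires recomputing that family's term.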

Mutual information between two variables is nonnegative, so a simpler network will never score better than a more complex network under the likelihood scoring method. This is because

$\iota_P(X;Y\cup Z) \ge \iota_P(X;Y)$

or, in simple terms, if Y gives some information about X, then adding Z can only increase the mutual information and, by consequence, the likelihood score. Unless the data exhibit perfect conditional independence between X and Z given Y (which, given the noise in natural data, is practically impossible), the highest-scoring graph will include the extra dependence. Thus the likelihood scoring technique <i>overfits</i> the data, meaning that it tailors the result so closely to the training data that it performs poorly on future instances beyond those represented there.
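This monotonicity is easy to observe numerically. In the hypothetical simulation below, X, Y, and Z are sampled independently, so Z carries no true information about X; nevertheless, on the empirical distribution the larger parent set never scores worse:

```python
import math, random
from collections import Counter

def mi(data, a_cols, b_cols):
    """Empirical mutual information between two groups of columns."""
    m = len(data)
    ja = Counter(tuple(r[c] for c in a_cols) for r in data)
    jb = Counter(tuple(r[c] for c in b_cols) for r in data)
    jab = Counter((tuple(r[c] for c in a_cols), tuple(r[c] for c in b_cols)) for r in data)
    return sum((c / m) * math.log(c * m / (ja[a] * jb[b])) for (a, b), c in jab.items())

# X, Y, Z mutually independent: Z provides no real information about X.
random.seed(0)
data = [(random.randint(0, 1), random.randint(0, 1), random.randint(0, 1))
        for _ in range(200)]

# I(X; {Y,Z}) >= I(X; Y) still holds on the finite sample (usually strictly),
# which is exactly why the likelihood score favors the denser graph.
assert mi(data, [0], [1, 2]) >= mi(data, [0], [1]) - 1e-12
```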

Consider a network with the four nodes B, A, M, and J, but for which we know nothing of the dependencies between them.

Furthermore say that we gather some data from the distributions underlying the network:

B A M J
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 1
1 1 1 0
1 1 1 0
1 1 1 0
1 1 1 0
1 1 1 1
1 1 1 1
1 1 0 1
1 1 0 1
1 0 1 0
1 0 0 1
1 0 0 0
1 0 0 0
0 1 1 1
0 1 1 1
0 1 0 1
0 1 0 1
0 0 1 0
0 0 1 0
0 0 1 0
0 0 0 1
0 0 0 1
0 0 0 1
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0
0 0 0 0

The attached PDF shows how to infer the network from the data using likelihood scoring, and also illustrates the limitations of likelihood scoring.

The problem of structure search is simply an optimization problem in which the input is a training set $\mathcal{D}$, a scoring function, and a set $\mathcal{G}$ of possible network structures (reflecting prior knowledge about the system). The output is the network structure that maximizes the score.

Two properties relating to structure score are important to the structure search:

$\bullet$ Score decomposability: a score for graph $\mathcal{G}$ can be written as a sum of scores:

$score(\mathcal{G} : \mathcal{D}) = \sum_i FamScore(X_i|Pa_i^\mathcal{G} :\mathcal{D})$

$\bullet$ Score equivalence: if $\mathcal{G}$ is I-equivalent to $\mathcal{G}'$ then $score(\mathcal{G} : \mathcal{D}) = score(\mathcal{G}' : \mathcal{D})$.

In the most general case, where we have no ordering of the variables, the problem of finding $\arg\max_{\mathcal{G}\in\boldsymbol{\mathcal{G}}_d} score(\mathcal{G} : \mathcal{D})$, where $\boldsymbol{\mathcal{G}}_d$ is the set of graphs in which a node may have as many as $d$ parents, is $\mathcal{N}\mathcal{P}$-hard for any $d \ge 2$. There are special cases that make the problem easier, but the problems encountered in practice do not generally fall into these cases.

Because no algorithm can efficiently find the highest-scoring network for arbitrary data, the next best solution is to settle upon a method that surveys the breadth of the search space without necessarily visiting every possibility, and then performs <i>localized searching</i>.

Localized searching, in order to be effective, requires that the neighbors of any given graph in the search space be few in number and represent relatively small steps in the search space; this simplifies the decision of which direction the search should take. We define a neighbor of a graph $\mathcal{G}$ to be a graph that differs from $\mathcal{G}$ by the addition of an edge, the deletion of an edge, or the reversal of an edge's direction.

By defining the search space this way we not only take an intuitive approach, but we also limit the size of the possible search space given that the number of steps to take between any two graphs is bounded by the total number of possible edges, $n^2$. Additionally, the idea of making local modifications supports the idea that the score of a network is merely the sum of smaller score terms and that thus between networks in the search space there is some sense of continuity.
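Under this definition the neighborhood of a graph can be enumerated directly. The following sketch (edge sets over integer-labeled nodes; a minimal illustration, not a library API) generates all neighbors reachable by one add, delete, or reverse operation and filters out cyclic results:

```python
from collections import deque

def is_acyclic(n, edges):
    """Kahn's algorithm: True iff the directed graph on nodes 0..n-1 has no cycle."""
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    q = deque(i for i in range(n) if indeg[i] == 0)
    seen = 0
    while q:
        u = q.popleft()
        seen += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                q.append(w)
    return seen == n

def neighbors(n, edges):
    """All acyclic graphs one add, delete, or reverse operation away from `edges`."""
    out = []
    for u in range(n):
        for v in range(n):
            if u == v:
                continue
            if (u, v) in edges:
                out.append(edges - {(u, v)})                # delete u -> v
                out.append((edges - {(u, v)}) | {(v, u)})   # reverse u -> v
            elif (v, u) not in edges:
                out.append(edges | {(u, v)})                # add u -> v
    return [e for e in out if is_acyclic(n, e)]

print(len(neighbors(3, {(0, 1)})))  # 6: one deletion, one reversal, four additions
```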

Though allowing for edge deletion and reversal may seem superfluous if we start with an empty graph, both become important in the greedy search algorithm: deletion removes redundancy that may eventually arise, and a reversal operator lets the search flip an edge in a single step, bypassing the score decrease of the intermediate deletion (a reversal could otherwise be accomplished by a deletion followed by an insertion) that a greedy algorithm would reject (see Koller, pages 813-814, for more details).

The search procedure is a local search toward a local optimum, starting with a randomly chosen graph from the search space (or a network derived from some prior knowledge of the system). The algorithm then makes a series of iterations in which it makes a small local modification, checks that the newly formed network is acyclic, and then computes the score for the network.

If $K$ is the number of iterations before convergence, and given that there are $O(n^2)$ operators available at any point, where $n$ is the number of variables, the algorithm performs $O(K \cdot n^2)$ operator trials before finding a local optimum. Each trial also incurs the cost of checking that the network is acyclic (linear in the number of edges, $nd$) and of computing the score (say, $M$ steps), giving a final estimate of the cost of the algorithm of

$O(K \cdot n^2 \cdot (M+nd))$

This estimate may be a bit high, considering that many operator applications produce unfavorable networks. Additionally, we can speed up the process using <i>first-ascent hill climbing</i>, which accepts the first neighboring network with an improved score, regardless of whether it is the highest-scoring neighbor. The further the algorithm progresses, the fewer higher-scoring neighbors are found, until a local optimum is reached.
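Putting the pieces together, here is a self-contained sketch of first-ascent hill climbing over DAGs. The scoring function here is a stand-in (a toy score rewarding a hypothetical target edge set) rather than a data-based likelihood score, and the empty starting graph is one possible choice:

```python
import random
from collections import deque

def is_acyclic(n, edges):
    """Kahn's algorithm cycle check."""
    indeg = [0] * n
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        indeg[v] += 1
    q = deque(i for i in range(n) if indeg[i] == 0)
    seen = 0
    while q:
        u = q.popleft()
        seen += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                q.append(w)
    return seen == n

def first_ascent(n, score, seed=0):
    """First-ascent hill climbing: scan add/delete/reverse moves in random order
    and accept the first one that strictly improves the score."""
    rng = random.Random(seed)
    g, best = frozenset(), score(frozenset())   # start from the empty graph
    improved = True
    while improved:
        improved = False
        moves = [(u, v) for u in range(n) for v in range(n) if u != v]
        rng.shuffle(moves)
        for u, v in moves:
            if (u, v) in g:
                cands = [g - {(u, v)}, (g - {(u, v)}) | {(v, u)}]  # delete, reverse
            elif (v, u) not in g:
                cands = [g | {(u, v)}]                             # add
            else:
                continue
            for cand in cands:
                if is_acyclic(n, cand) and score(cand) > best:
                    g, best, improved = cand, score(cand), True
                    break
            if improved:
                break
    return g, best

# Toy score: +1 per target edge present, -1 per extra edge.
target = {(0, 1), (1, 2)}
g, s = first_ascent(3, lambda e: len(e & target) - len(e - target))
assert sorted(g) == [(0, 1), (1, 2)]  # the climb recovers the target edge set
```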

Tsamardinos et al. (2006) investigated a similar approach in which they first reconstruct the skeleton of a Bayesian network and then perform a greedy hill-climbing search using a Bayesian score (which, from the literature, appears to be the more popular scoring approach) to find a local optimum. Their implemented algorithm is reported to outperform many other similar algorithms.

In the calculation of network scores, the single edge $A \to C$ yields exactly the same score as the reversed edge $C \to A$. The implication is that when a local optimum is reached, there may be many neighboring networks with equivalent scores that could be reached simply by edge reversals. Neighboring networks that encode the same independencies form an <i>I-equivalence class</i>; by score equivalence they all receive the same score, creating what is termed a <i>plateau</i> in the greedy local search.

Reaching a local optimum on one of these plateaus is problematic because, though further optimization might be possible starting from one of the other graphs in the I-equivalence class, the greedy algorithm proceeds only upon finding a neighbor with a strictly higher score, not merely one with a matching score. Even adjusting for this quirk does not entirely remove the problems inherent in local search.

A technique called <i>basin flooding</i> keeps track of all previously visited networks and then considers any local modification that would produce a network not already visited. Compared to the greedy algorithm, which stores only the current network, this is very expensive. A more commonly accepted approach is <i>tabu search with random restarts</i>, which, rather than keeping a history of visited networks, keeps a history of recent operator applications and disallows reversing the effect of recently applied operators. This forces the algorithm to be more directional. Upon finding a local maximum, the algorithm restarts from a new initial network in hopes of finding a higher-scoring local maximum. This process continues until the best score remains unbeaten for a specified number of iterations.

Yet another method makes small modifications to the data rather than to the network. This can help circumvent the difficulty of local optima because a local optimum for one data set is likely not a local optimum when the data are slightly modified. The idea is to perturb a small portion of the data at the point in the search when a local maximum has been reached, allowing the algorithm to move past it toward a better solution. Consider the following algorithm for data perturbation search, where $t$ is the perturbation size and $\gamma$ is the reduction in perturbation size:

'''Procedure''' Data Perturbation Search
$\mathcal{G} \gets Search(\mathcal{G}_\emptyset, \mathcal{D}, score, \mathcal{O})$
$\mathcal{G}_{best} \gets \mathcal{G}$
$t \gets t_0$
'''for''' $i = 1, \dots$ until convergence
&nbsp;&nbsp;$\mathcal{D}' \gets Perturb(\mathcal{D}, t)$
&nbsp;&nbsp;$\mathcal{G} \gets Search(\mathcal{G}, \mathcal{D}', score, \mathcal{O})$
&nbsp;&nbsp;'''if''' $score(\mathcal{G} : \mathcal{D}) > score(\mathcal{G}_{best} : \mathcal{D})$ '''then''' $\mathcal{G}_{best} \gets \mathcal{G}$
&nbsp;&nbsp;$t \gets \gamma \cdot t$
'''return''' $\mathcal{G}_{best}$

Note that each search after a perturbation starts from the network $\mathcal{G}$ that was a local optimum for the previous data. Thus the search broadens from the point at which the previous local optimum was reached.

The Perturb function, which is at the core of the algorithm, could entail duplicating and removing instances, or alternatively it may simply reweight instances in the current data set $\mathcal{D}$. Either scenario allows the algorithm to overcome the obstacle of local optima.
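As a sketch of the first option, duplicating and removing instances, the Perturb step might resample a fraction $t$ of the data with replacement while leaving the rest intact. This is one possible realization under stated assumptions, not the book's specification:

```python
import random

def perturb(data, t, rng):
    """Replace a fraction t of the instances with draws (with replacement) from the
    full data set; this duplicates some instances and removes others, nudging the
    empirical distribution while keeping the data set size fixed."""
    m = len(data)
    k = max(1, int(t * m))
    idx = list(range(m))
    rng.shuffle(idx)
    kept = [data[i] for i in idx[k:]]                        # untouched portion
    resampled = [data[rng.randrange(m)] for _ in range(k)]   # perturbed portion
    return kept + resampled

rng = random.Random(0)
data = [(0, 0), (0, 1), (1, 0), (1, 1)] * 5
perturbed = perturb(data, 0.25, rng)
assert len(perturbed) == len(data)   # same size, slightly different empirical counts
```

Shrinking $t$ by $\gamma$ on each iteration then makes successive perturbations gentler, so the search settles down as it converges.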

Given the data set in the above example, calculate the mutual information $\iota_{\hat{P}}$(B;M).

Why does maximum likelihood scoring limit the performance of the learned network on new instances sampled from the same underlying distribution as the training data?

- Koller, D. and Friedman, N. <i>Probabilistic Graphical Models: Principles and Techniques</i>. MIT Press, 2009.