Constraint-based structure learning is one of the main classes of algorithms for Bayesian network structure learning, the process of reconstructing a Bayesian network from a data set. The approach is intuitive, but it can be very sensitive to the failure of individual independence tests.
There are two main reasons for structure learning: knowledge discovery, and density estimation. Knowledge discovery is useful for learning dependencies and relationships between variables in a problem domain. Density estimation, which is the more common reason, focuses on creating a model that can be used to perform inference on new data.
Brute-force techniques are not feasible for structure learning, and even many common formulations of the problem turn out to be NP-hard. The difficulty stems from the super-exponential growth in the number of possible directed acyclic graphs (DAGs) as nodes are added. For example, with just five nodes there are 29,281 possible DAGs. Increasing to ten nodes yields roughly $ 4.2 \times 10^{18} $ possible DAGs.
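This growth is easy to check numerically. The sketch below (a hypothetical helper, not part of the text) counts labeled DAGs using Robinson's recurrence, which runs inclusion-exclusion over the set of nodes with no incoming edges:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # inclusion-exclusion over the k nodes chosen to have no incoming edges
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(5))   # 29281 possible DAGs on five nodes
```

Evaluating `num_dags(10)` confirms the figure of roughly $4.2 \times 10^{18}$ quoted above.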
Constraint-based methods use the data to find all the conditional independence statements, which are then used to construct a Bayesian network representative of those statements. To understand the general process used by constraint-based methods, it is easiest to start with a simplifying assumption: we have an oracle that answers conditional independence queries, such as "is $A \perp B|C$?" Another simplifying assumption is that we have the skeleton (undirected graph) for the Bayesian network represented by the data set.
Given a skeleton graph, there are four rules that, used in conjunction with queries to the oracle, direct the edges to produce a DAG. Earlier rules always take precedence over later rules.
(Example taken from Jensen and Nielsen.) Suppose we have the sequence of graphs below, where the only v-structure that could be created is C $\rightarrow$ F, D $\rightarrow$ F (see graph (a)). An example application of Rules 2-4 is as follows:
The skeleton for the graph can be obtained by asking the oracle a series of independence queries. The PC algorithm takes the complete graph as input and produces the graph skeleton. The PC algorithm is as follows:
```
begin with the complete graph G
i = 0
while any node has at least i + 1 neighbors:
    foreach node A with at least i + 1 neighbors:
        foreach B ∈ ADJ_A:
            for all sets X such that |X| = i and X ⊆ ADJ_A − {B}:
                if A ⊥ B | X then
                    remove edge A – B and store the independence statement
    i = i + 1
```
(Example taken from Jensen and Nielsen.) Given the complete graph shown in (a), the first iteration (i=0) yields the following oracle queries and answers:
$A \perp B$ = yes<br> $A \perp C$ = no<br> $A \perp D$ = no<br> $A \perp E$ = no<br> $B \perp C$ = yes<br> $B \perp D$ = no<br> $B \perp E$ = no<br> $C \perp D$ = no<br> $C \perp E$ = no<br> $D \perp E$ = no<br>
Removing the edges A – B and B – C results in the graph shown in (b). Moving on to iteration (i=1):
$A \perp C|E$ = no, $B \perp C|D$ = no<br> $B \perp C|E$ = no, $B \perp D|C$ = no<br> $B \perp D|E$ = no, $B \perp E|C$ = no<br> $B \perp E|D$ = no, $C \perp B|A$ = no<br> $C \perp D|B$ = no, $C \perp D|A$ = yes<br> $C \perp E|A$ = no, $C \perp E|B$ = no<br> $D \perp B|E$ = no, $D \perp E|B$ = no<br> $E \perp A|B$ = no, $E \perp A|D$ = no<br> $E \perp B|A$ = no, $E \perp C|B$ = no<br> $E \perp C|D$ = no, $E \perp D|A$ = no<br> $E \perp D|C$ = no<br>
The only edge removed during this iteration is C – D, resulting in (c). Now some of the queries for iteration (i=2):
$A \perp C|D,E$ = no<br> $A \perp D|C,E$ = no<br> $A \perp E|C,D$ = yes<br> $B \perp E|C,D$ = yes<br>
So with edges A – E and B – E removed, we are left with (d). At the start of iteration (i=3) no node has four adjacent nodes, so the algorithm stops and the final graph skeleton is (d).
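The run above can be reproduced with a short implementation of the skeleton phase. This is only a sketch: the oracle here is just a lookup over the "yes" answers listed in the example, and the function and variable names are illustrative, not from any particular library.

```python
from itertools import combinations

def pc_skeleton(nodes, oracle):
    """Skeleton phase of the PC algorithm.

    oracle(a, b, cond) should return True iff a ⊥ b given cond.
    Returns the surviving undirected edges and the separating set
    recorded for each removed edge.
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # start from the complete graph
    sepset = {}
    i = 0
    while any(len(adj[v]) >= i + 1 for v in nodes):
        for a in nodes:
            if len(adj[a]) < i + 1:
                continue
            for b in sorted(adj[a]):
                if b not in adj[a]:        # edge already removed this sweep
                    continue
                # try every size-i conditioning set from a's other neighbors
                for cond in combinations(sorted(adj[a] - {b}), i):
                    if oracle(a, b, frozenset(cond)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = frozenset(cond)
                        break
        i += 1
    edges = {frozenset((a, b)) for a in nodes for b in adj[a]}
    return edges, sepset

# Oracle for the worked example: the only "yes" answers it ever gives
INDEPENDENT = {
    (frozenset("AB"), frozenset()),
    (frozenset("BC"), frozenset()),
    (frozenset("CD"), frozenset("A")),
    (frozenset("AE"), frozenset("CD")),
    (frozenset("BE"), frozenset("CD")),
}

def oracle(a, b, cond):
    return (frozenset((a, b)), cond) in INDEPENDENT

edges, sepset = pc_skeleton("ABCDE", oracle)
print(sorted("".join(sorted(e)) for e in edges))   # ['AC', 'AD', 'BD', 'CE', 'DE']
```

The five surviving edges match skeleton (d), and the stored separating sets (e.g. $\{A\}$ for C – D) are exactly what the later v-structure rules consume.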
Unfortunately, there is no magical oracle that can answer these independence queries. However, if the data set is truly representative of the Bayesian network, and we have enough samples, the data set itself can be used to answer them. There are many different methods for performing these queries. One is based on conditional mutual information, defined by the following equation:
$ CMI(A,B|X) = \sum_X P(X) \sum_{A,B} P(A,B|X) \log \frac{P(A,B|X)}{P(A|X)P(B|X)} $
CMI measures the degree of dependence between A and B given X; in particular, CMI = 0 exactly when A is conditionally independent of B given X. A common approach is to perform a $\chi^2$-test of the hypothesis $CMI(A,B|X) = 0$ at a user-defined significance level. The higher the significance level, the more links will be removed during the PC algorithm.
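The empirical version of the equation above is a weighted sum over observed counts. The sketch below (helper names are illustrative) computes it directly from a list of samples; under the independence hypothesis the statistic $G = 2N \cdot CMI$ (with CMI in nats) is approximately $\chi^2$-distributed, which is what the test compares against a critical value.

```python
import math
from collections import Counter

def cmi(samples, a, b, xs):
    """Empirical CMI(a, b | xs) in nats, from a list of {variable: value} dicts."""
    n = len(samples)
    key = lambda s, names: tuple(s[v] for v in names)
    c_x   = Counter(key(s, xs) for s in samples)            # counts of X
    c_ax  = Counter(key(s, [a] + xs) for s in samples)      # counts of (A, X)
    c_bx  = Counter(key(s, [b] + xs) for s in samples)      # counts of (B, X)
    c_abx = Counter(key(s, [a, b] + xs) for s in samples)   # counts of (A, B, X)
    total = 0.0
    for (va, vb, *vx), n_abx in c_abx.items():
        vx = tuple(vx)
        # P(A,B|X) / (P(A|X) P(B|X)) reduces to a ratio of counts
        ratio = (n_abx * c_x[vx]) / (c_ax[(va, *vx)] * c_bx[(vb, *vx)])
        total += (n_abx / n) * math.log(ratio)   # weighted by P(A,B,X)
    return total

# Conditionally independent by construction: given X, counts factorize
samples = [{"A": a, "B": b, "X": x}
           for x in (0, 1) for a in (0, 1) for b in (0, 1)
           for _ in range((a + 1) * (b + 1))]
print(round(cmi(samples, "A", "B", ["X"]), 10))   # 0.0
```

With real, finite data the statistic is rarely exactly zero even under true independence, which is precisely why a significance test is needed rather than an equality check.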
If the data set is truly representative of the unknown Bayesian network N, then the skeleton created using these methods will be the skeleton of N. Under the same conditions, the independence queries will also allow all the correct v-structures to be assigned. Overall, this is a very intuitive approach that, in some cases, can yield good results.
Unfortunately, these methods have some significant weaknesses, which often make score-based methods a better choice. In the real world, the data set is rarely generated exactly from the Bayesian network you are trying to reconstruct, and real-world data sets often contain missing or incorrect values. Since the process relies on accurate answers to independence queries, even small inaccuracies can lead to the incorrect removal or retention of edges, as well as incorrectly assigned or missed v-structures.
As with many learning methods, there is also the problem of overfitting. Even if the constructed Bayesian network perfectly represents the data set, it may not generalize well, especially if the data set is small or has missing or incorrect values.
Starting with a complete graph with six nodes (like the one shown below), use the PC algorithm and your own oracle to create a skeleton graph (the skeleton must meet the requirements needed for the next step). Then, using your skeleton graph and oracle, go through the steps of creating a possible DAG, applying each of the four rules at least once. List the steps and rules used to add each of your directed edges.