Constraint-based structure learning is one of the main classes of algorithms for Bayesian network structure learning, the process of reconstructing a Bayesian network from a data set. The approach is intuitive, but it can be very sensitive to the failure of individual independence tests.
There are two main reasons for structure learning: knowledge discovery, and density estimation. Knowledge discovery is useful for learning dependencies and relationships between variables in a problem domain. Density estimation, which is the more common reason, focuses on creating a model that can be used to perform inference on new data.
Brute-force techniques are not feasible for structure learning, and even many common formulations of the problem turn out to be NP-hard. The difficulty stems from the super-exponential growth in the number of possible directed acyclic graphs (DAGs) as nodes are added. For example, with just five nodes there are 29,281 possible DAGs. Increasing to ten nodes yields roughly $ 4.2 \times 10^{18} $ possible DAGs.
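This growth is easy to check numerically. The sketch below (a hypothetical helper, not part of the text) counts labeled DAGs using Robinson's recurrence, which runs inclusion-exclusion over the set of nodes with no incoming edges:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of DAGs on n labeled nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    # inclusion-exclusion over the k nodes chosen to have no incoming edges
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(5))   # 29281 possible DAGs on five nodes
```

Evaluating `num_dags(10)` confirms the figure of roughly $4.2 \times 10^{18}$ quoted above.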
Constraint-based methods use the data to find all the conditional independence statements, which are then used to construct a Bayesian network representative of those statements. To understand the general process used by constraint-based methods, it is easiest to start with a simplifying assumption: we have an oracle that answers conditional independence queries, such as "is $A \perp B|C$?" Another simplifying assumption is that we have the skeleton (undirected graph) for the Bayesian network represented by the data set.
Given a skeleton graph, there are four rules that, used in conjunction with queries to the oracle, direct the edges to produce a DAG. Earlier rules always take precedence over later rules.
(Example taken from Jensen and Nielsen.) Suppose we have the sequence of graphs below, where the only v-structure that could be created is C $\rightarrow$ F, D $\rightarrow$ F (see graph (a)). An example application of Rules 2-4 is as follows:
The skeleton for the graph can be obtained by asking the oracle a series of independence queries. The PC algorithm takes the complete graph as input and produces the graph skeleton. The PC algorithm is as follows:
```
begin with the complete graph G
i = 0
while any node has at least i + 1 neighbors:
    foreach node A with at least i + 1 neighbors:
        foreach B ∈ ADJ_A:
            for all sets X such that |X| = i and X ⊆ ADJ_A − {B}:
                if A ⊥ B | X then
                    remove edge A – B and store the independence statement
    i = i + 1
```
(Example taken from Jensen and Nielsen.) Given the complete graph shown in (a), the first iteration (i=0) yields the following oracle queries and answers:
$A \perp B$ = yes<br> $A \perp C$ = no<br> $A \perp D$ = no<br> $A \perp E$ = no<br> $B \perp C$ = yes<br> $B \perp D$ = no<br> $B \perp E$ = no<br> $C \perp D$ = no<br> $C \perp E$ = no<br> $D \perp E$ = no<br>
Removing the edges A – B and B – C results in the graph shown in (b). Moving on to iteration (i=1):
$A \perp C|E$ = no, $B \perp C|D$ = no<br> $B \perp C|E$ = no, $B \perp D|C$ = no<br> $B \perp D|E$ = no, $B \perp E|C$ = no<br> $B \perp E|D$ = no, $C \perp B|A$ = no<br> $C \perp D|B$ = no, $C \perp D|A$ = yes<br> $C \perp E|A$ = no, $C \perp E|B$ = no<br> $D \perp B|E$ = no, $D \perp E|B$ = no<br> $E \perp A|B$ = no, $E \perp A|D$ = no<br> $E \perp B|A$ = no, $E \perp C|B$ = no<br> $E \perp C|D$ = no, $E \perp D|A$ = no<br> $E \perp D|C$ = no<br>
The only edge removed during this iteration is C – D, resulting in (c). Now some of the queries for iteration (i=2):
$A \perp C|D,E$ = no<br> $A \perp D|C,E$ = no<br> $A \perp E|C,D$ = yes<br> $B \perp E|C,D$ = yes<br>
So with edges A – E and B – E removed, we are left with (d). At the start of iteration (i=3) no node has four adjacent nodes, so the algorithm stops and the final graph skeleton is (d).
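The run above can be reproduced with a short implementation of the skeleton phase. This is only a sketch: the oracle here is just a lookup over the "yes" answers listed in the example, and the function and variable names are illustrative, not from any particular library.

```python
from itertools import combinations

def pc_skeleton(nodes, oracle):
    """Skeleton phase of the PC algorithm.

    oracle(a, b, cond) should return True iff a ⊥ b given cond.
    Returns the surviving undirected edges and the separating set
    recorded for each removed edge.
    """
    adj = {v: set(nodes) - {v} for v in nodes}   # start from the complete graph
    sepset = {}
    i = 0
    while any(len(adj[v]) >= i + 1 for v in nodes):
        for a in nodes:
            if len(adj[a]) < i + 1:
                continue
            for b in sorted(adj[a]):
                if b not in adj[a]:        # edge already removed this sweep
                    continue
                # try every size-i conditioning set from a's other neighbors
                for cond in combinations(sorted(adj[a] - {b}), i):
                    if oracle(a, b, frozenset(cond)):
                        adj[a].discard(b)
                        adj[b].discard(a)
                        sepset[frozenset((a, b))] = frozenset(cond)
                        break
        i += 1
    edges = {frozenset((a, b)) for a in nodes for b in adj[a]}
    return edges, sepset

# Oracle for the worked example: the only "yes" answers it ever gives
INDEPENDENT = {
    (frozenset("AB"), frozenset()),
    (frozenset("BC"), frozenset()),
    (frozenset("CD"), frozenset("A")),
    (frozenset("AE"), frozenset("CD")),
    (frozenset("BE"), frozenset("CD")),
}

def oracle(a, b, cond):
    return (frozenset((a, b)), cond) in INDEPENDENT

edges, sepset = pc_skeleton("ABCDE", oracle)
print(sorted("".join(sorted(e)) for e in edges))   # ['AC', 'AD', 'BD', 'CE', 'DE']
```

The five surviving edges match skeleton (d), and the stored separating sets (e.g. $\{A\}$ for C – D) are exactly what the later v-structure rules consume.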
Unfortunately, there is no magical oracle that can answer these independence queries. However, if the data set is truly representative of the Bayesian network, and we have enough samples, the data set itself can be used to answer them. There are many different methods for performing these queries. One is based on conditional mutual information, defined by the following equation:
$ CMI(A,B|X) = \sum_X P(X) \sum_{A,B} P(A,B|X) \log \frac{P(A,B|X)}{P(A|X)P(B|X)} $
CMI measures the degree of dependence between A and B given X; in particular, CMI = 0 exactly when A is conditionally independent of B given X. A common approach is to perform a $\chi^2$-test of the hypothesis $CMI(A,B|X) = 0$ at a user-defined significance level. The higher the significance level, the more links will be removed during the PC algorithm.
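The empirical version of the equation above is a weighted sum over observed counts. The sketch below (helper names are illustrative) computes it directly from a list of samples; under the independence hypothesis the statistic $G = 2N \cdot CMI$ (with CMI in nats) is approximately $\chi^2$-distributed, which is what the test compares against a critical value.

```python
import math
from collections import Counter

def cmi(samples, a, b, xs):
    """Empirical CMI(a, b | xs) in nats, from a list of {variable: value} dicts."""
    n = len(samples)
    key = lambda s, names: tuple(s[v] for v in names)
    c_x   = Counter(key(s, xs) for s in samples)            # counts of X
    c_ax  = Counter(key(s, [a] + xs) for s in samples)      # counts of (A, X)
    c_bx  = Counter(key(s, [b] + xs) for s in samples)      # counts of (B, X)
    c_abx = Counter(key(s, [a, b] + xs) for s in samples)   # counts of (A, B, X)
    total = 0.0
    for (va, vb, *vx), n_abx in c_abx.items():
        vx = tuple(vx)
        # P(A,B|X) / (P(A|X) P(B|X)) reduces to a ratio of counts
        ratio = (n_abx * c_x[vx]) / (c_ax[(va, *vx)] * c_bx[(vb, *vx)])
        total += (n_abx / n) * math.log(ratio)   # weighted by P(A,B,X)
    return total

# Conditionally independent by construction: given X, counts factorize
samples = [{"A": a, "B": b, "X": x}
           for x in (0, 1) for a in (0, 1) for b in (0, 1)
           for _ in range((a + 1) * (b + 1))]
print(round(cmi(samples, "A", "B", ["X"]), 10))   # 0.0
```

With real, finite data the statistic is rarely exactly zero even under true independence, which is precisely why a significance test is needed rather than an equality check.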
If the data set is truly representative of the unknown Bayesian network N, then the skeleton created using these methods will be the skeleton of N. Under the same conditions, the independence queries will also allow all the correct v-structures to be assigned. Overall, this is a very intuitive approach that, in some cases, can yield good results.
Unfortunately, these methods have some significant weaknesses, which often make score-based methods a better choice. In the real world, the data set is rarely generated exactly from the Bayesian network you are trying to reconstruct, and real-world data sets often contain missing or incorrect values. Since the process relies on accurate answers to independence queries, even small inaccuracies can lead to the incorrect removal or retention of edges, as well as incorrectly assigned or missed v-structures.
As with many learning methods, there is also the problem of overfitting. Even if the constructed Bayesian network perfectly represents the data set, it may not generalize well, especially if the data set is small or has missing or incorrect values.
Starting with a complete graph with six nodes (like the one shown below), use the PC algorithm and your own oracle to create a skeleton graph (the skeleton must meet the requirements needed for the next step). Then, using your skeleton graph and oracle, go through the steps of creating a possible DAG, applying each of the four rules at least once. List the steps and rules used to add each of your directed edges.