=Variational Bayes=

Similar to the sampling algorithms discussed in class (MCMC, Gibbs, etc.), variational Bayes is a class of methods that avoids exact inference on intractable problems. To approximate the posterior, variational Bayes chooses a free distribution (one with free parameters) and updates those parameters iteratively so that the combination of the free distribution and the joint distribution approaches the true posterior. This avoids the need to know exactly what the structure is, because our distribution will not have any conditionals in it; all structure is approximated and encoded through an iterative, expectation-maximization-like (EM-like) process. One key component of finding the right distribution is knowing the distance between the free distribution and the actual distribution.

==Distribution Distance (Kullback-Leibler Divergence)==

The following equation provides the distance (or divergence) measure needed to compare distributions:

[[media:cs-677sp10:KL_distance_equation.png|200px]]

This equation may also be seen in the form:

[[media:cs-677sp10:KL_distance_equation2.png|100px]]

The Kullback-Leibler divergence does not meet all the requirements of a distance metric, namely symmetry and the triangle inequality. However, it is always non-negative and has other useful properties, such as a conditional form and a chain rule. For example:

[[media:cs-677sp10:KL_distance_equation_conditional.png]]

[[media:cs-677sp10:KL_distance_equation_chain_rule.png]]

==Formulating the Problem==

Starting with the joint distribution and the chain rule we have:

[[media:cs-677sp10:VB_problem_formulation.png]]

Because we are working in log space, we can add the same term to the denominators of both terms in the subtraction without changing the result:

[[media:cs-677sp10:VB_problem_formulation2.png]]

[[media:cs-677sp10:VB_problem_formulation3.png]] contains the free distribution of our choosing, and because it is a pdf it integrates to 1 on the left-hand side. The right-hand side can be rewritten in terms of a KL divergence and a lower bound. The second term on the right-hand side is negative; the negative sign can be thought of as coming from flipping the fraction inside the log and pulling the sign out front:

[[media:cs-677sp10:KL_distance_rewrite.png]]

This becomes the second term on the right-hand side of the equation, and because it is the distance between our free distribution and the true posterior, it is the part we will try to minimize. The first term on the right-hand side is then the lower bound on our posterior approximation, which will be abbreviated as [[media:cs-677sp10:Lower_bound_rewrite.png]]. The final equation (sketched in LaTeX in the next section) looks like:

[[media:cs-677sp10:Problem_Formulated.png]]

The two parts together reproduce the left-hand side. However, we would like to get our lower bound as close to the true distribution as possible, so maximizing [[media:cs-677sp10:Lower_Bound.png]] equates to minimizing [[media:cs-677sp10:KL_distance_equation_half.png]], which can be visualized as follows:

[[media:cs-677sp10:Problem_Formulation_Visual.png]]

==Choosing the Right Distribution==

In our problem q(y) is a single distribution, but it can easily be extended to a set of distributions Q = {q_1, q_2, …, q_n}, each with a different set of parameters from Y = {y_1, y_2, …, y_m}.
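The equations above are shown as images; the following is a compact LaTeX sketch of the same decomposition, using assumed notation (X for the observed data, Y for the hidden variables, q(Y) for the free distribution), together with the factorized choice of q described in this section.

<pre>
% LaTeX sketch of the standard VB decomposition (assumed notation:
% X = observed data, Y = hidden variables, q(Y) = free distribution).
% Requires amsmath.
\[
\ln p(X) =
  \underbrace{\int q(Y) \ln \frac{p(X,Y)}{q(Y)} \, dY}_{\mathcal{L}(q)\ \text{(lower bound)}}
  +
  \underbrace{\int q(Y) \ln \frac{q(Y)}{p(Y \mid X)} \, dY}_{\mathrm{KL}(q \,\|\, p(Y \mid X)) \, \ge \, 0}
\]
% Since the KL term is non-negative, maximizing L(q) is the same as
% minimizing KL(q || p(Y|X)), as stated above.
% Factorized ("mean-field") choice for the free distribution:
\[
q(Y) = \prod_{i=1}^{n} q_i(y_i)
\]
</pre>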
The family of distributions the q_i come from is usually set or determined beforehand, and is chosen either for efficiency or to better approximate the true model and structure. For example, if the dependencies are unknown but are assumed to exist in some fashion, nodes can be connected in arbitrary ways just to allow influence to flow, without making the network too complex. The number of dependencies added affects the efficiency of evaluating the distribution; therefore, very simple networks are often chosen. The following are simple structures that allow influence to flow between some of the nodes but keep inference very fast, and each can be tested easily:

1) [[media:cs-677sp10:Simple_structure.PNG|200px]]

2) [[media:cs-677sp10:Simple_structure2.PNG|200px]]

==Iterate Till Convergence==

If the joint distribution can be calculated, then maximizing the lower bound is our goal. For each distribution, we freeze all parameters but one and try different values until we find the value that maximizes the lower bound, then continue on to the next parameter until all parameters of all distributions have been maximized. Repeat this process until the parameters stop changing and you have reached convergence! (A minimal code sketch of this loop is given at the end of the page.)

==Candidate Test Question==

Show how to go from the joint distribution "p(x,y)" to the variational Bayes approximation of the log marginal likelihood.
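As referenced in the "Iterate Till Convergence" section above, below is a minimal sketch of that freeze-one-parameter-at-a-time loop. It is an illustrative assumption rather than anything from the lecture: it uses a toy 2D Gaussian as the target posterior, a factorized Gaussian q(y1)q(y2) as the free distribution, a closed-form lower bound, and a simple grid search for each parameter update.

<pre>
# Toy coordinate-ascent sketch of the "freeze all parameters but one" loop
# described above (an assumed example, not from the course notes).
# Target "posterior": a correlated 2D Gaussian. Free distribution:
# a factorized Gaussian q(y1) q(y2) = N(m1, s1^2) N(m2, s2^2).
import numpy as np

# Parameters of the (assumed) true posterior p(y1, y2).
mu = np.array([1.0, -1.0])
rho = 0.8
Sigma = np.array([[1.0, rho], [rho, 1.0]])
Lam = np.linalg.inv(Sigma)          # precision matrix

def elbo(m1, s1, m2, s2):
    """Lower bound L(q) = E_q[ln p(y)] + H(q); analytic for Gaussians.
    The constant log-normalizer of p is dropped, since it does not
    affect the maximization."""
    m = np.array([m1, m2])
    d = m - mu
    expected_log_p = -0.5 * (d @ Lam @ d + Lam[0, 0] * s1**2 + Lam[1, 1] * s2**2)
    entropy = (0.5 * np.log(2 * np.pi * np.e * s1**2)
               + 0.5 * np.log(2 * np.pi * np.e * s2**2))
    return expected_log_p + entropy

# Coordinate ascent: freeze all parameters but one, grid-search that one,
# move on to the next, and repeat until nothing changes any more.
params = [0.0, 1.0, 0.0, 1.0]       # m1, s1, m2, s2
for sweep in range(100):
    old = list(params)
    for i in range(4):
        lo = 1e-3 if i in (1, 3) else -5.0   # standard deviations stay positive
        grid = np.linspace(lo, 5.0, 2001)
        scores = [elbo(*(params[:i] + [g] + params[i + 1:])) for g in grid]
        params[i] = float(grid[int(np.argmax(scores))])
    if max(abs(a - b) for a, b in zip(params, old)) < 1e-6:
        break

print("q means:", params[0], params[2])   # approach the true means (1, -1)
print("q stds :", params[1], params[3])   # under-estimate the true spread
</pre>

On this toy problem the means of q converge to the true means while the standard deviations come out smaller than the true marginal ones, which is the well-known tendency of a factorized approximation to under-estimate posterior variance.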