**MAP** (Maximum a posteriori) **Inference** is a form of Bayesian inference that deals with MAP queries. The goal of a MAP query is to find the most likely assignment for all of the non-evidence random variables given the evidence variables, this is also called the *Most Probable Explanation* (MPE). This is similar to Bayesian learning, but the computational complexity of MAP Inference is less than that of Bayesian learning. [Russell, 2010, p.804]

Given a set of random variables, $\chi$, and a set of evidence variables $E$ <blockquote>$W = \chi - E$</blockquote>

The MAP query finds the most likely assignment to the query variables $W$ given the evidence variables $E$. <blockquote>$MAP( W | E) = \underset{w~\in~W}{argmax}~P( w | E)$</blockquote>

If there are several such assignments that are equal, then the MAP query can either return the set of all possible assignments or return an arbitrarily chosen assignment. [Koller, 2009, p.26]

The result to an MAP query is not the same thing as finding the most likely assignment to each individual variable, as is shown in the example.

The MAP query is sometimes referred to as the MAP hypothesis, where the evidence variables, $E$, are referred to as data, $D$, and the assignments to the query variables, $W$, are referred to as possible hypotheses, $h$, from a hypothesis space, $H$. The MAP hypothesis is found by the equation <blockquote>$h_{MAP} = \underset{h~\in~H}{argmax}~P( h | D )$ :$= \underset{h~\in~H}{argmax}~\frac{P( D | h ) P ( h )}{P( D )}$ :$= \underset{h~\in~H}{argmax}~P( D | h ) P ( h )$</blockquote>

If the $P(h)$ term is dropped in the previous equation, denoting a uniform prior probability among all hypotheses, then we get the *Maximum Likelihood* (ML) hypothesis.[Mitchell, 1997, p.157]

This example comes from Koller [Koller, 2009, p.26]

Conditional Probability Table for $B$

$A$ | $b^0$ | $b^1$ |
---|---|---|

$a^0$ | 0.1 | 0.9 |

$a^1$ | 0.5 | 0.5 |

Probability Table for $A$

$a^0$ | $a^1$ |
---|---|

0.4 | 0.6 |

We find that $MAP(A) = a^1$ from the Probability Table for $A$. However both assignments to $B$ given $A = a^1$ give a probability of $0.6 * 0.5 = 0.3$, while the assignment of $(a^0, b^1)$ gives a probability of $0.4 * 0.9 = 0.36$.

This gives us the following $MAP(A,B) = (a^0, b^1) \ne (\underset{a~\in~A}{argmax}~P(a), \underset{b~\in~B}{argmax}~P(b))$

The Marginal MAP query is very similar to the MAP query, except that it loosens the constraint that $W$ must contain all non-evidence variables.

Given a set of random variables, $\chi$, a set of evidence variables $E$, and a set of query variables $W$ <blockquote>$Z = \chi - W - E$</blockquote>

The MAP query finds the most likely assignment to the query variables $W$ given the evidence variables $E$. <blockquote>$MAP( W | E) = \underset{w~\in~W}{argmax}~\sum_Z{P( w, z | E)}$</blockquote>

Exact inference is not a simple task. For full MAP it is NP-complete, and for the case of a marginal MAP it is at least that, if not worse. In the worst case we have to generate the entire joint distribution, integrating out anything irrelevant to our query (in the case of a marginal MAP query), and find the entry with the highest probability. This combination of difficult tasks makes for an almost unavoidable exponential increase in complexity. Luckily this is for the worst case, and often the problems we actually run into are of a much more manageable size.

Another unfortunate fact is that even though we can use an approximation algorithm with good results, it does not remove the NP-hardness of the inference. [Koller, 2009, p.288]

MAP inference is useful in many different situations, but the main idea is that MAP inference gives us the most likely explanation for the evidence rather than the most probable assignment to a single variable given the evidence, in isolation from any other variables.

Some examples include:

- Natural Language Processing - We want to find the most likely sentence given the signal we received. We are much less interested in individual phonemes than in what the sentence is as a whole, the individual phoneme approach would likely not even yield an intelligible sentence.
- Signal Processing: Given a noisy signal and other previously known information, what is the most likely message sent.
- Medical Diagnosis: We have a set of symptoms and some test results for a patient and we want to infer what is ailing our patient.
- Car Problems: If a certain part, or parts, starts malfunctioning in our car we would like to infer the state of the rest of the parts in our car to determine what needs to be fixed.
- Machine Learning: We have been given a set of data and would like to infer what the best weights would be for our neural network.

This discussion is simplified by using the, equivalent, factor representation of a joint distribution. Our task is to find $\xi^{MAP}$, the most likely assignment to the variables in $\chi$. $P_{\Phi}(\chi)$ is a distribution defined via a set of factors $\Phi$ and an unnormalized density $\tilde{P_{\Phi}}$. $Z$ is a normalizing constant.

<blockquote>$\xi^{MAP} = \underset{\xi}{argmax}~P_{\Phi}(\xi)$ :$= \underset{\xi}{argmax}~\frac{1}{Z}\tilde{P_{\Phi}}(\xi)$ :$= \underset{\xi}{argmax}~\tilde{P_{\Phi}}(\xi)$</blockquote>

This is often called the *max-product* form. This maximizes the unnormalized probability calculated by taking the product of the factors in $\Phi$ evaluated at $\xi$. This does not give us an actual probability unless we also calculate the normalizing constant.

The *max-sum* form is also very useful to avoid some of the issues associated with multiplying floating point numbers.

<blockquote>$\xi^{MAP} = \underset{\xi}{argmax}~log~\tilde{P_{\Phi}}(\xi)$</blockquote>

Another benefit to the *max-sum* form is the easy translation into an integer programming problem.

The marginal MAP query is computationally more intensive since it also involves marginalizing the non-query non-evidence variables in addition to the maximization involved in the full MAP query. $Y$ is the set of query variables, $W$ is the set non-query non-evidence variables.

<blockquote>$y^{margMAP} = \underset{y}{argmax}~P_{\Phi}(y)$ :$= \underset{y}{argmax}~\sum_W{\tilde{P_{\Phi}}(y, W)}$</blockquote>

MAP inference is an optimization problem, while that brings an added complexity there are also many algorithms already designed to handle optimization problems. MAP queries are a useful tool for many applications, but the potential intractability of some of the MAP queries we would like to make gives us the need to pause and consider the difficulty of the query we are making before we try and employ an algorithm to do so.

There is an extra difficulty associated with the marginal MAP queries that makes even approximation techniques difficult. Another issue is that the results to an MAP query need not be the same as the results to a marginal MAP query, similarly the results to one marginal MAP query need not be the same as the results to another marginal MAP query even if the query variables in one query are a subset of the query variables in the other query.

Even though MAP inference is problematic in some situations, analysis can be used to find other methods that will still give the appropriate MAP assignment (or hypothesis) without ever using Bayesian procedures.[Mitchell, 1997, p.164]

MAP inference is also more efficient than Bayesian learning, but it isn't as accurate. This trade-off comes from the fact that MAP does not take into account any uncertainty about the assignment to the query variables but the Bayesian learning method does. [Russell, 2010, p.804]

Does MAP inference always give the same result as finding the most likely assignment to each variable individually? Explain your answer using a small example.

- Koller, D., & Friedman, N. (2009).
*Probabilistic Graphical models: Principles and Techniques*. Massachusetts: MIT Press. - Mitchel, T. M. (1997).
*Machine Learning*. McGraw-Hill. - Russell, S., & Norvig, P. (2010).
*Artificial Intelligence: A Modern Approach*(3rd ed). New Jersey: Pearson Education.