## Notation

• $\mathbf{A}$ matrix
• $\mathbf{a}_i$ row vector in matrix $\mathbf{A}$
• $a_{ij}$ cell $i,j$ in matrix $\mathbf{A}$
• $K$ number of classes
• $F$ number of features
• $\mathbf{X}$ the input vectors in the training data; each row is an input vector
• $\mathbf{y}$ the classes in the training data; $y_i$ is the class of input vector $\mathbf{x}_i$
• $N$ the amount of training data, i.e. the number of rows in $\mathbf{X}$ and $\mathbf{y}$
• $\mathbf{W}$ weight matrix with $K$ rows and $F$ columns
• $\boldsymbol{\Phi}(\mathbf{x})$ feature function of an input vector that returns a matrix with one row per output class; the number of columns is $F$. By analogy with the matrix notation, we use $\boldsymbol{\phi}_y$ to indicate a specific row of the resulting matrix and $\phi_{yf}$ to reference a single cell. A block feature vector is one in which $\forall{y'}\; \boldsymbol{\phi}_y ( \mathbf{x} ) = \boldsymbol{\phi}_{y'} ( \mathbf{x} )$, i.e. the feature function is independent of the class. This is the most common scenario and can simplify implementation; in that case, the subscripts can be ignored and the output of $\boldsymbol{\phi}(\mathbf{x})$ can be treated as a row vector.
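A small NumPy sketch may make the block-feature case concrete; the helper name and the particular features below are made up for illustration:

```python
import numpy as np

# Hypothetical block feature function: the features depend only on the
# input x, so Phi(x) stacks the same row vector once per class.
def block_Phi(x, K):
    phi = np.array([x[0], x[1], x[0] * x[1]])  # F = 3 features of the input
    return np.tile(phi, (K, 1))                # shape (K, F), identical rows

Phi = block_Phi(np.array([2.0, 3.0]), K=2)
# Every row is identical, so Phi can be treated as the row vector phi(x).
```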

## Log-linear models

The model we will be dealing with has the form of a log-linear model, i.e. if $y|\mathbf{x},\mathbf{W} \sim LL \left( \mathbf{x},\mathbf{W} \right)$ ($\mathbf{x}$ is a column vector) then the likelihood function $f$ is defined as:

:$f(y;\mathbf{x},\mathbf{W}) = \frac{ \exp \left\{ \mathbf{w}_y \cdot \mathbf{x} \right\} }{ \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \mathbf{x} \right\} } }$

In the context of maximum entropy and related models, we usually extract features from the original input vector. This can be represented as:

:$f(y;\mathbf{x},\mathbf{W}, \boldsymbol{\Phi} ( \cdot ) ) = \frac{ \exp \left\{ \mathbf{w}_y \cdot \boldsymbol{\phi}_y ( \mathbf{x} ) \right\} }{ \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \boldsymbol{\phi}_{y'} ( \mathbf{x} ) \right\} } }$
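A minimal NumPy sketch of evaluating this likelihood, assuming block features so that $\boldsymbol{\phi}(\mathbf{x})$ is a single row vector (the function name and example values are illustrative):

```python
import numpy as np

def log_linear_probs(W, phi):
    """Return Pr(y | x, W) for all K classes: a softmax over the scores."""
    scores = W @ phi            # shape (K,): w_y . phi(x) for each class y
    scores -= scores.max()      # subtract the max for numerical stability
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum()

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])  # K=3 classes, F=2 features
phi = np.array([0.5, 2.0])
p = log_linear_probs(W, phi)    # a proper distribution over the 3 classes
```

Subtracting the maximum score before exponentiating leaves the ratio unchanged but avoids overflow for large weights.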

## The maximum entropy model (parameter estimation)

The maximum entropy model is the log-linear model having parameters $\mathbf{W}$ such that the resulting conditional distribution over classes, $\Pr \left( y | \mathbf{x}, \mathbf{W} \right)$, has maximum entropy subject to the constraints represented by the training data $\mathbf{X},\mathbf{y}$. It can be shown that the parameters $\mathbf{W}$ that maximize the log-likelihood of the data $\mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right)$ constitute the maximum entropy model. Using standard calculus techniques to find the maximum, we need the gradient $\nabla \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right)$ and therefore the partial derivatives $\frac{ \partial \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right) } { \partial w_{kf}}$.

### Log-likelihood and conditional log-likelihood

The log-likelihood of a log-linear model is:

:\begin{align} \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right) &= \ln \prod_i { p \left( \mathbf{x}_i, y_i | \mathbf{W} \right) } \\ &= \ln \prod_i { \Pr \left( y_i | \mathbf{x}_i, \mathbf{W} \right) p \left( \mathbf{x}_i \right) } \\ &= \ln \prod_i { f \left( y_i ; \mathbf{x}_i^T, \mathbf{W} \right) p \left( \mathbf{x}_i \right) } \\ &= \sum_i { \ln f \left( y_i ; \mathbf{x}_i^T, \mathbf{W} \right) } + \sum_i \ln p \left( \mathbf{x}_i \right) \\ &= \sum_i { \ln \frac{ \exp \left\{ \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y_i} ( \mathbf{x}_i )^T \right\} }{ \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \boldsymbol{\phi}_{y'} ( \mathbf{x}_i )^T \right\} } } } + \sum_i \ln p \left( \mathbf{x}_i \right) \\ &= \sum_i { \left( \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y_i} ( \mathbf{x}_i )^T - \ln \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \boldsymbol{\phi}_{y'} ( \mathbf{x}_i )^T \right\} } \right) } + \sum_i \ln p \left( \mathbf{x}_i \right) \end{align}

Since the term $\sum_i \ln p \left( \mathbf{x}_i \right)$ is independent of the parameters $\mathbf{W}$ and the input vectors are fixed and known (i.e. observed), this term is a constant that can be ignored in the context of our maximization problem. The result is the conditional log-likelihood (so called because it consists of the product of the conditional likelihoods $\Pr \left( y_i | \mathbf{x}_i, \mathbf{W}, \boldsymbol{\phi}(\cdot) \right)$, i.e. ignoring $p(\mathbf{x}_i)$):

:$\mathcal{L}\left( \mathbf{y} | \mathbf{X}; \mathbf{W} \right) = \sum_i { \left( \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y_i} ( \mathbf{x}_i )^T - \ln \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \boldsymbol{\phi}_{y'} ( \mathbf{x}_i )^T \right\} } \right) }$
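This quantity can be computed directly from the last form above; a NumPy sketch for the block-feature case (illustrative names, with the log-sum-exp stabilized by subtracting each row's maximum score):

```python
import numpy as np

def conditional_log_likelihood(W, Phi, y):
    """Sum_i [ w_{y_i} . phi(x_i) - ln sum_{y'} exp(w_{y'} . phi(x_i)) ].

    W: (K, F) weights; Phi: (N, F) block feature vectors phi(x_i);
    y: (N,) integer class labels.
    """
    scores = Phi @ W.T                         # (N, K): one score per class
    m = scores.max(axis=1, keepdims=True)      # stabilize the log-sum-exp
    log_norm = m.squeeze(1) + np.log(np.exp(scores - m).sum(axis=1))
    return float((scores[np.arange(len(y)), y] - log_norm).sum())
```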

### Partial Derivatives

When taking the partial derivatives of the conditional log-likelihood, it will be beneficial to further split the conditional log-likelihood of the data as follows:

:$\mathcal{L}\left( \mathbf{y} | \mathbf{X}; \mathbf{W} \right) = \sum_k { \sum_{i:y_i=k} { \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T } } - \sum_i { \ln \sum_{y'} { \exp \left\{ \mathbf{w}_{y'} \cdot \boldsymbol{\phi}_{y'} ( \mathbf{x}_i )^T \right\} } }$

i.e. we split the first term into a sum over partitions of the training data, where each partition consists of the training instances sharing the same class.

The partial derivative distributes over the two terms of the subtraction, so each can be differentiated separately.

#### Numerator (Left Term)

:\begin{align} \frac{\partial} {\partial w_{kf}} \sum_{k'} { \sum_{i:y_i=k'} { \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T } } &= \frac{\partial} {\partial w_{kf}} \sum_{k'} { \sum_{i:y_i=k'} { \sum_{f'} { w_{k'f'} \phi_{k'f'}(\mathbf{x}_i) } } } \\ &= \sum_{i:y_i=k} \frac{\partial} {\partial w_{kf}} w_{kf} \phi_{kf}(\mathbf{x}_i) \\ &= \sum_{i:y_i=k} \phi_{kf}(\mathbf{x}_i) \end{align}

#### Denominator (Right Term)

:\begin{align} \frac{\partial} {\partial w_{kf}} \sum_i { \ln \sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } &= \sum_i { \frac{\partial} {\partial w_{kf}} \ln \sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \\ &= \sum_i { \frac{1} {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \sum_{k''} { \exp \left\{ \mathbf{w}_{k''} \cdot \boldsymbol{\phi}_{k''} ( \mathbf{x}_i )^T \right\} } } \\ &= \sum_i { \frac{1} {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } \\ &= \sum_i { \frac{1} {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} \frac{\partial} {\partial w_{kf}} \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T } \\ &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \sum_{f'} w_{kf'} \phi_{kf'}(\mathbf{x}_i) } \\ &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} w_{kf} \phi_{kf}(\mathbf{x}_i) } \\ &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'} { \exp \left\{ \mathbf{w}_{k'} \cdot \boldsymbol{\phi}_{k'} ( \mathbf{x}_i )^T \right\} } } \phi_{kf}(\mathbf{x}_i) } \\ &= \sum_i { \Pr \left( k | \mathbf{x}_i, \mathbf{W} \right) \phi_{kf}(\mathbf{x}_i) } \end{align}
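Combining the two terms, the partial derivative is the observed feature count minus the expected feature count under the model: $\frac{\partial \mathcal{L}}{\partial w_{kf}} = \sum_{i:y_i=k} \phi_{kf}(\mathbf{x}_i) - \sum_i \Pr \left( k | \mathbf{x}_i, \mathbf{W} \right) \phi_{kf}(\mathbf{x}_i)$. A NumPy sketch of the full gradient for the block-feature case (illustrative names):

```python
import numpy as np

def gradient(W, Phi, y):
    """dL/dW[k,f] = sum_{i:y_i=k} phi_f(x_i) - sum_i Pr(k|x_i,W) phi_f(x_i)."""
    K, F = W.shape
    scores = Phi @ W.T                           # (N, K) class scores
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)            # (N, K): Pr(k | x_i, W)
    observed = np.zeros((K, F))
    np.add.at(observed, y, Phi)                  # accumulate phi(x_i) into row y_i
    expected = P.T @ Phi                         # (K, F) expected feature counts
    return observed - expected
```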

## Bayesian parameter estimation

From a Bayesian perspective, we can easily place a prior over $\mathbf{W}$. One common choice of prior is a set of independent Normals centered at 0 with variance $\sigma^2$, i.e.

:$p( \mathbf{W} ) = \left( \sigma\sqrt{2\pi} \right)^{-KF} \prod_k\prod_f \exp \left\{ -\frac{w_{kf}^2}{2\sigma^2} \right\}$

Typical values of $\sigma^2$ are 0.5-1.0.

We are now interested in the posterior distribution

:$p \left( \mathbf{W} | \mathbf{X},\mathbf{y} \right) \propto p \left( \mathbf{y} | \mathbf{X}, \mathbf{W} \right) p \left( \mathbf{W} \right)$

although the resulting solution is no longer the maximum entropy model.

Evaluating the full posterior is intractable, although, in theory, MCMC approaches could be used to generate samples from it. Instead, the MAP parameters $\mathbf{W}$ are often sought. As before, this optimization problem requires computing the gradient of the log of the posterior. Since the posterior is proportional to the likelihood times the prior, all that changes in the math above is the addition of the derivative of the log of the prior, $\frac{\partial}{\partial w_{kf}} \left( -\frac{w_{kf}^2}{2\sigma^2} \right) = -\frac{w_{kf}}{\sigma^2}$:

:$\frac{\partial}{\partial w_{kf}} \ln p \left( \mathbf{W} | \mathbf{X},\mathbf{y} \right) = \sum_{i:y_i=k} \phi_{kf}(\mathbf{x}_i) - \sum_i { \Pr \left( k | \mathbf{x}_i, \mathbf{W} \right) \phi_{kf}(\mathbf{x}_i) } - \frac{w_{kf}}{\sigma^2}$
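A sketch of MAP estimation by batch gradient ascent on the log-posterior, assuming block features; the function name, learning rate, and iteration count are arbitrary choices for illustration:

```python
import numpy as np

def train_map(Phi, y, K, sigma2=1.0, lr=0.1, iters=500):
    """Gradient ascent on the log-posterior with a Gaussian prior on W."""
    F = Phi.shape[1]
    W = np.zeros((K, F))
    observed = np.zeros((K, F))
    np.add.at(observed, y, Phi)                  # empirical counts, fixed
    for _ in range(iters):
        scores = Phi @ W.T                       # (N, K)
        scores -= scores.max(axis=1, keepdims=True)
        P = np.exp(scores)
        P /= P.sum(axis=1, keepdims=True)        # Pr(k | x_i, W)
        grad = observed - P.T @ Phi - W / sigma2 # observed - expected - prior
        W += lr * grad
    return W
```

In practice a quasi-Newton method such as L-BFGS would typically replace this fixed-step ascent, but the gradient computation is the same.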

### Other priors

Another prior that has been used is the Laplacian (see Goodman). The advantage of this prior is that it is spiked at zero so that most weights are driven to zero (a form of feature selection).

### Relation to regularization

It can be shown that MAP estimation with a Gaussian prior is equivalent to L2 regularization of the weights, and that MAP estimation with a Laplacian prior is equivalent to L1 regularization of the weights.

## Acknowledgments

This page is based heavily on material originally provided by Dan Klein of Berkeley and further refinements by Eric Ringger and students. See below for more links from both sources.

nlp/maximum-entropy.txt · Last modified: 2015/04/23 21:40 by ryancha