Differences

This shows you the differences between two versions of the page.

 — nlp:maximum-entropy [2015/04/23 21:40] (current)ryancha created 2015/04/23 21:40 ryancha created 2015/04/23 21:40 ryancha created Line 1: Line 1: + == Preliminary definitions == + * [[Independent and identically distributed]] + * [[Log likelihood function]] + * [[Block feature vector]] + == Notation == + * $\mathbf{A}$ matrix + * $\mathbf{a}_i$ row vector in matrix $\mathbf{A}$ + * $a_{ij}$ cell $i,j$ in matrix $\mathbf{A}$ + * $K$ number of classes + * $F$ number of features + * $\mathbf{X}$ the input vectors in the training data; each row is an input vector + * $\mathbf{y}$ the classes in the training data; $y_i$ is the class of input vector $\mathbf{x}_i$ + * $N$ the amount of training data, i.e. the number of rows in $\mathbf{X}$ and $\mathbf{y}$ ​ + * $\mathbf{W}$ weight matrix with $K$ rows and $F$ columns + * $\boldsymbol{\Phi}(\mathbf{x})$ feature function of an input vector that returns a matrix, one row per output class; the number of columns is $F$. On analogy with matrix notation, we use $\boldsymbol{\phi}_y$ to indicate a specific row in the resulting matrix, and $\phi_{yf}$ to reference a single cell in the resulting matrix. A '''​block feature vector'''​ is one in which $\forall{y'​} \boldsymbol {\phi}_y ( \mathbf{x} ) = \boldsymbol{\phi}_{y'​} ( \mathbf{x} )$, i.e. the feature function is independent of class. This is the most common scenario and can simplify implementation. In that case, the subscripts can be ignored and the output of $\boldsymbol{\phi}(\mathbf{x})$ can treated as a row vector. + + == Log-linear models == + The model we will be dealing with has the form of a log-linear model, i.e. if $y|\mathbf{x},​\mathbf{W} \sim LL \left( \mathbf{x},​\mathbf{W} \right)$ ($\mathbf{x}$ is a column vector) then the likelihood function $f$ is defined as: + + :$+ f(y;​\mathbf{x},​\mathbf{W}) = \frac{ \exp \left\{ \mathbf{w}_y \cdot \mathbf{x} \right\} }{ \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \mathbf{x} \right\} } } +$ + + In the context of maximum entropy and related models, we usually extract features from the original input vector. This can be represented as: + + :$+ f(y;​\mathbf{x},​\mathbf{W},​ \boldsymbol{\Phi} ( \cdot ) ) = \frac{ \exp \left\{ \mathbf{w}_y \cdot \boldsymbol{\phi}_y ( \mathbf{x} ) \right\} }{ \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \boldsymbol{\phi}_{y'​} ( \mathbf{x} ) \right\} } } +$ + + == The maximum entropy model (parameter estimation) == + The maximum entropy model is the log-linear model having parameters $\mathbf{W}$ such that the resulting distribution ('''​edit''':​ distribution over what?) has maximum entropy subject to the constraints represented by the training data $\mathbf{X},​\mathbf{y}$. It can be shown that the parameters $\mathbf{W}$ that maximize the log-likelihood of the data $\mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right)$ constitute the maximum entropy model. Using standard calculus techniques to find the maximum, we need to find the gradient $\nabla \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right)$ and therefore the partial derivatives $\frac{ \partial \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right) } { \partial w_{kf}}$. + + === Log-likelihood and conditional log-likelihood === + The log-likelihood of a log-linear model is: + :+ \begin{align} + \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right) &= \ln \prod_i { p \left( x_i, y_i | \mathbf{W} \right) } \\ + &= \ln \prod_i { \Pr \left( y_i | \mathbf{x}_i,​ \mathbf{W} \right) } p \left( \mathbf{x}_i \right) \\ + &= \ln \prod_i { f \left( y_i ; \mathbf{x}_i^T,​ \mathbf{W} \right) } p \left( \mathbf{x}_i \right) \\ + &= \sum_i { \ln f \left( y_i ; \mathbf{x}_i^T,​ \mathbf{W} \right) } + \sum_i \ln p \left( \mathbf{x}_i \right) \\ + &= \sum_i { \ln \frac{ \exp \left\{ \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T \right\} }{ \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \boldsymbol{\phi}_{y'​} ( \mathbf{x}_i )^T \right\} } } } + \sum_i \ln p \left( \mathbf{x}_i \right) \\ + &= \sum_i { \left( ​ \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T - \ln \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \boldsymbol{\phi}_{y'​} ( \mathbf{x}_i )^T \right\} } \right) } + \sum_i \ln p \left( \mathbf{x}_i \right) \\ + \end{align} + + + Since the term $\sum_i \ln p \left( \mathbf{x}_i \right)$ is independent of the parameters $\mathbf{W}$ and the input vectors are fixed and known (i.e. observed), this term is a constant that can be ignored in the context of our maximization problem. The result is the '''''​conditional''​ log-likelihood'''​ (so called because it consists of the product of the conditional likelihoods $\Pr \left( y_i | \mathbf{x}_i,​ \mathbf{W}, \boldsymbol{\phi}(\cdot) \right)$, i.e. ignoring $p(\mathbf{x}_i)$):​ + + :$+ \mathcal{L}\left( \mathbf{y} | \mathbf{X}; \mathbf{W} \right) = + \sum_i { \left( \mathbf{w}_{y_i} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T - \ln \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \boldsymbol{\phi}_{y'​} ( \mathbf{x}_i )^T \right\} } \right) } +$ + + === Partial Derivatives === + When taking the partial derivatives of the conditional log-likelihood,​ it will be beneficial to further split the conditional log-likelihood of the data as follows: + + :$+ \mathcal{L}\left( \mathbf{y} | \mathbf{X}; \mathbf{W} \right) + = \sum_k { \sum_{i:​y_i=k} { \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T } } - \sum_i { \ln \sum_{y'​} { \exp \left\{ \mathbf{w}_{y'​} \cdot \boldsymbol{\phi}_{y'​} ( \mathbf{x}_i )^T \right\} } } +$ + + i.e. we split the left side of the equation into a sum over partitions of the training data where each partition consists of instances in the training data having the same class. ​ + + It is possible to distribute the partial derivative of both operands of the subtraction. + + ==== Numerator (Left Term) ==== + :+ \begin{align} + \frac{\partial} {\partial w_{kf}} \sum_{k'​} { \sum_{i:​y_i=k'​} { \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T } } &= \frac{\partial} {\partial w_{kf}} \sum_{k'​} { \sum_{i:​y_i=k'​} { \sum_{f'​} { w_{k '​f'​} \phi_{k'​f'​}(\mathbf{x}_i) } } } \\ + &= \sum_{i:​y_i=k} \frac{\partial} {\partial w_{kf}} w_{kf} \phi_{kf}(\mathbf{x}_i) \\ + &= \sum_{i:​y_i=k} \phi_{kf}(\mathbf{x}_i) \\ + \end{align} + + + ==== Denominator (Right Term) ==== + :+ \begin{align} + \frac{\partial} {\partial w_{kf}} \sum_i { \ln \sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{y} ( \mathbf{x}_i )^T \right\} } } &= \sum_i { \frac{\partial} {\partial w_{kf}} \ln \sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \\ + &= \sum_i { \frac{1} {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \sum_{k''​} { \exp \left\{ \mathbf{w}_{k''​} \cdot \boldsymbol{\phi}_{k''​} ( \mathbf{x}_i )^T \right\} } } \\ + &= \sum_i { \frac{1} {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } \\ + &= \sum_i { \frac{1} {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} \frac{\partial} {\partial w_{kf}} \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T } \\ + &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} \sum_{f'​} w_{kf'​} \phi_{kf'​}(\mathbf{x}_i) } \\ + &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \frac{\partial} {\partial w_{kf}} w_{kf} \phi_{kf}(\mathbf{x}_i) } \\ + &= \sum_i { \frac{ \exp \left\{ \mathbf{w}_{k} \cdot \boldsymbol{\phi}_{k} ( \mathbf{x}_i )^T \right\} } {\sum_{k'​} { \exp \left\{ \mathbf{w}_{k'​} \cdot \boldsymbol{\phi}_{k'​} ( \mathbf{x}_i )^T \right\} } } \phi_{kf}(\mathbf{x}_i) } \\ + &= \sum_i { \Pr \left( k | \mathbf{x}_i,​ \mathbf{W} \right) \phi_{kf}(\mathbf{x}_i) } \\ + \end{align} + + + == Bayesian parameter estimation == + + From a Bayesian perspective,​ we can easily put a prior over $\mathbf{W}$. One common choice for a prior are independent Normals centered at 0 with variance $\sigma^2$, i.e. + + :$+ p( \mathbf{W} ) = \left( \sigma\sqrt{2\pi} \right)^{-KF} \prod_k\prod_f \exp \left\{ -\frac{w_{kf}^2}{2\sigma^2} ​ \right\} +$ + + Typical values of $\sigma^2$ are 0.5-1.0. + + We are now interested in the posterior distribution + + :$+ p \left( \mathbf{W} | \mathbf{X},​\mathbf{y} \right) \propto \mathcal{L}\left( \mathbf{X}, \mathbf{y}; \mathbf{W} \right) p \left( \mathbf{W} \right) +$ + + although this is no longer the maximum entropy solution ('''​edit''':​ double check). + + Evaluating the posterior is intractable,​ although, in theory, MCMC approaches could be used to generate samples from the posterior. Instead, often times the MAP parameter $\mathbf{W}$ is sought. As before, this optimization problem requires computing the gradient of the log of the posterior. Since the posterior is proportional to the likelihood times the prior, all that changes in the math above is the addition of the log of the prior: + + :$+ \frac{\partial}{\partial w_{kf}} p \left( \mathbf{W} | \mathbf{X},​\mathbf{y} \right) = \sum_{i:​y_i=k} \phi_{kf}(\mathbf{x}_i) - \sum_i { \Pr \left( y_i | \mathbf{x}_i,​ \mathbf{W} \right) \phi_{kf}(\mathbf{x}_i) } - \frac{w_{kf}^2}{2\sigma^2} +$ + + === Other priors === + + Another prior that has been used is the Laplacian (see Goodman). The advantage of this prior is that it is spiked at zero so that most weights are driven to zero (a form of feature selection). + + === Relation to regularization === + + It can be shown that adding a Gaussian prior is the same as L2 normalization of the weights and adding a Laplacian prior is equivalent to L1 normalization of the weights in the MAP solution. + + == Acknowledgments == + This page is based heavily on [http://​www.cs.berkeley.edu/​~klein/​cs294-19/​SP08%20cs294%20lecture%205%20--%20maximum%20entropy%20(6PP).pdf material] originally provided by Dan Klein of Berkeley and [http://​faculty.cs.byu.edu/​~ringger/​Winter2009-CS601R-4/​lectures/​Lecture07-maxent.pdf further refinements] by Eric Ringger and students. See below for more links from both sources. + + == Links == + * Eric Ringger'​s [http://​faculty.cs.byu.edu/​~ringger/​Fall2009-CS479/​lectures/​Lecture17-maxent.pdf maxent slides] in the context of word sense disambiguation (see also [http://​faculty.cs.byu.edu/​~ringger/​Fall2008-CS401R-1/​lectures/​Lecture13-wsd.pdf Intro] and [http://​faculty.cs.byu.edu/​~ringger/​Fall2008-CS401R-1/​lectures/​Lecture15-smoothmaxent.pdf Smoothing MaxEnt]) + * Eric Ringger'​s [http://​faculty.cs.byu.edu/​~ringger/​Winter2009-CS601R-4/​lectures/​Lecture07-maxent.pdf maxent slides] in the context of document classification + * Eric Ringger'​s [http://​faculty.cs.byu.edu/​~ringger/​Winter2009-CS601R-4/​lectures/​Lecture08-logisticreg.pdf slides] on the relation of maxent to logistic regression + * Dan Klein'​s [http://​www.cs.berkeley.edu/​~klein/​cs288/​sp09/​SP09%20cs288%20lecture%205%20--%20maximum%20entropy%20(6PP).pdf current slides] ([http://​www.cs.berkeley.edu/​~klein/​cs288/​sp09/​SP09%20cs288%20lecture%205%20--%20maximum%20entropy%20(2PP).pdf 2PP]) and [http://​www.cs.berkeley.edu/​~klein/​cs294-19/​SP08%20cs294%20lecture%205%20--%20maximum%20entropy%20(6PP).pdf previous slides] ([http://​www.cs.berkeley.edu/​~klein/​cs294-19/​SP08%20cs294%20lecture%205%20--%20maximum%20entropy%20(2PP).pdf 2PP]) + * http://​www.cs.cmu.edu/​afs/​cs/​user/​aberger/​www/​html/​tutorial/​tutorial.html + * http://​maxent.sourceforge.net/​about.html + * http://​opennlp.sourceforge.net/​