This section outlines the notation used throughout this document.
First, what we are looking for is $\hat c = argmax_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the given feature vector for document $i$. In other words, we have a document and its feature vector (that is, the words in the document), and we want the probability that the random variable $C_i$ takes on a specific value or label given that feature vector. In English: what is the probability that this document belongs to class $c$?
Now that we know what we want, here is the derivation:
Using Bayes' Theorem:
$\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$
Note that with $a$ a constant:
$argmax_x \frac{f(x)}{a} = argmax_x f(x)$
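A quick numerical check of this identity, sketched in Python (the function $f$ and the constant $a$ below are arbitrary illustrative choices):

```python
# Dividing f by a positive constant a does not change which x maximizes it.
xs = [0, 1, 2, 3, 4]
f = lambda x: -(x - 3) ** 2   # arbitrary example function, peaked at x = 3
a = 7.0                       # arbitrary positive constant

argmax_f = max(xs, key=f)
argmax_f_over_a = max(xs, key=lambda x: f(x) / a)
```

Both expressions pick out the same $x$, so dropping the denominator $P(\underline{D_i})$ leaves the argmax unchanged.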
Therefore:
$argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)$
(Note: see below for an explanation of Bayesian networks and the Naive Bayes assumption, which we make at this point.)
By the multiplication rule
$P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)$
By the Naive Bayes assumption (the features are conditionally independent given the class),
$P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,1}|C_i) \cdots P(F_{i,n}|C_i)$
Now, the product of conditional probabilities on the right-hand side of this last equation can be written in shorthand as
$\prod_{j=1}^n P(F_{i,j}|C_i = c)$
so we now have
$P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$
leading to
$\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$
Because the product of many small probabilities can underflow a floating-point representation, it is useful to work in log space, which turns the product into a sum:
$\hat c = argmax_c \log(P(C_i = c)) + \sum_{j=1}^n \log(P(F_{i,j}|C_i = c))$
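This decision rule can be sketched in Python. Here `log_prior[c]` stands for $\log P(C_i = c)$ and `log_likelihood[c][f]` for $\log P(F_{i,j}|C_i = c)$; the class names and word probabilities used to exercise it are made-up illustrative values, not taken from the text.

```python
import math

def predict(doc_features, log_prior, log_likelihood):
    """Return argmax_c of log P(C = c) + sum_j log P(f_j | C = c)."""
    best_class, best_score = None, -math.inf
    for c in log_prior:
        score = log_prior[c] + sum(log_likelihood[c][f] for f in doc_features)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

Summing logs instead of multiplying raw probabilities keeps the score in a numerically safe range even for documents with thousands of features.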
The MLE (maximum likelihood estimator) for the first term, $P(C_i = c)$ is computed by taking the number of documents with label $c$ divided by the total number of documents in the training set.
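As a sketch of that computation (the training labels below are hypothetical), the prior's MLE is just a relative frequency:

```python
from collections import Counter

def prior_mle(labels):
    """Estimate P(C = c) as (# documents labeled c) / (total # documents)."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}

# e.g. prior_mle(["sports", "sports", "politics", "tech"])
# gives sports a prior of 2/4 and the other two classes 1/4 each.
```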
The MLE for the term in the summation …
Bayes' Theorem relies on some probability theory, which is covered here.
An experiment is any action or process whose outcome is uncertain. The set of all possible outcomes of an experiment is called its sample space. An event is a subset of the sample space; in the simplest case, a single outcome.
Consider the experiment of flipping a coin. This experiment is subject to uncertainty, because the outcome could be one of two possibilities. Formally, the sample space $S$ for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or tail facing upward.
An event is an outcome (or set of outcomes) of an experiment. In the coin-flipping experiment above, the sample space $S = \{H,T\}$ contains two elementary events, $H$ and $T$, heads and tails respectively.
Conditional probability allows us to use prior knowledge about how one event affects another. For example, a lecturer, Stan, is late to class 5% of the time when the weather is good, but 25% of the time when the weather is bad. Let $A$ be the event that Stan is late, and $B$ be the event of inclement weather. Then $P(A|B)$ is read, the probability that Stan will be late given that the weather is bad. Or, the probability of $A$ given $B$.
In this case, $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$, where $B^{'}$ is the complement of $B$, that is, the weather is not bad.
Formally, conditional probability is defined as follows: :$P(A|B) = \frac{P(A \cap B)}{P(B)} \!$
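A worked instance of this definition, using a die roll rather than the weather example: let $A$ be "the roll is even" and $B$ be "the roll is greater than 3" (both are illustrative choices, not from the text).

```python
from fractions import Fraction

S = set(range(1, 7))                 # sample space of one fair die roll
A = {s for s in S if s % 2 == 0}     # event: the roll is even  -> {2, 4, 6}
B = {s for s in S if s > 3}          # event: the roll exceeds 3 -> {4, 5, 6}

def P(E):
    """Uniform probability of event E over the sample space S."""
    return Fraction(len(E), len(S))

p_a_given_b = P(A & B) / P(B)        # P(A|B) = P(A intersect B) / P(B)
```

Of the three outcomes in $B = \{4,5,6\}$, two (4 and 6) are even, so $P(A|B) = 2/3$.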
Let $A_i$ be the $i^{th}$ class that we are interested in, and $B$ be a feature of a document. Bayes' Theorem, stated below, then says that the probability of class $A_i$ given feature $B$ is equal to the probability of $B$ given class $A_i$, multiplied by the prior probability of $A_i$ and divided by the overall probability of $B$.
Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then, by the law of total probability, for any other event $B$, $P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$.
:$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!$
:$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)} \!$
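Plugging Stan's numbers into this form, with $A_1 = B$ (bad weather) and $A_2 = B'$: the prior probability of bad weather is not given in the text, so the 30% used below is an assumed illustrative value.

```python
p_late_given_bad = 0.25    # P(A|B), from the Stan example
p_late_given_good = 0.05   # P(A|B'), from the Stan example
p_bad = 0.30               # P(B): assumed prior, not from the text

# Law of total probability: P(A) = P(A|B)P(B) + P(A|B')P(B').
p_late = p_late_given_bad * p_bad + p_late_given_good * (1 - p_bad)

# Bayes' Theorem: P(B|A) = P(A|B)P(B) / P(A).
p_bad_given_late = p_late_given_bad * p_bad / p_late
```

Under this assumed prior, observing that Stan is late raises the probability of bad weather from 0.30 to about 0.68.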