
This section outlines the notation used throughout this document.

- $D_i$ - Document $i$.
- $F_j$ - Feature $j$ (word in a text document).
- $F_{i,j}$ - Feature $j$ in Document $i$.
- $C$ - Class label.
- $C_i$ - Class label for Document $i$.
- $\{c_1, \ldots , c_n\}$ - Values representing specific labels.
- $P(C=c_i| \ldots )$ - Conditional probability that $C$ takes the value $c_i$.
- $P(c_i| \ldots )$ - Shorthand for the same conditional probability.
- $P(C| \ldots )$ - Conditional probability distribution over the values of $C$.
- $P(C_i| \ldots )$ - Conditional probability distribution over the class label of Document $i$.

First, what we're looking for is $\hat c = argmax_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the given feature vector for document $i$. In other words, we have a document and its feature vector (that is, the words in the document), and we want the value or label of the random variable $C_i$ that is most probable given this feature vector. In English: which class is this document most likely to belong to?

Now that we know what we want, here is the derivation:

using Bayes' Theorem:

$\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$

Note that with $a$ a constant:

$argmax_x \frac{f(x)}{a} = argmax_x f(x)$

Therefore:

$argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)$

(Note: See below for an explanation of Bayesian Networks and the Naive Bayes assumption, which we make at this point.)

By the multiplication rule:

$P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)$

By the Naive Bayes assumption, the features are conditionally independent given the class, so the joint probability factors as

$P(C_i = c, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i = c)P(F_{i,1}|C_i = c) \cdots P(F_{i,n}|C_i = c)$

Now, the product of conditional probabilities on the right-hand side of this last equation can be written in shorthand as

$\prod_{j=1}^n P(F_{i,j}|C_i = c)$

so we now have

$P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

leading to

$\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

Because a product of many small probabilities can underflow floating-point arithmetic, it is common to work with logarithms instead; since $\log$ is monotonic, it preserves the argmax:

$\hat c = argmax_c \log(P(C_i = c)) + \sum_{j=1}^n \log(P(F_{i,j}|C_i = c))$
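The log-space decision rule above can be sketched directly in code. This is a minimal illustration, not a full implementation: the class names, priors, and per-class feature likelihoods below are made-up numbers, not estimates from real data.

```python
import math

# Illustrative (made-up) parameters.
priors = {"sports": 0.6, "politics": 0.4}          # P(C = c)
likelihoods = {                                     # P(F_j | C = c)
    "sports":   {"game": 0.20, "vote": 0.01, "team": 0.15},
    "politics": {"game": 0.02, "vote": 0.25, "team": 0.03},
}

def classify(words):
    """Return argmax_c of log P(C=c) + sum_j log P(F_j | C=c)."""
    best_class, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior)                     # log prior term
        for w in words:
            score += math.log(likelihoods[c][w])    # log likelihood terms
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(classify(["game", "team"]))   # -> sports
```

Summing logs rather than multiplying raw probabilities is what keeps a long document from underflowing to zero.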

The MLE (maximum likelihood estimate) for the first term, $P(C_i = c)$, is the number of documents with label $c$ divided by the total number of documents in the training set.
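This count-and-divide estimate of the prior is easy to sketch; the training labels below are illustrative.

```python
from collections import Counter

# Illustrative training labels (one per document).
train_labels = ["spam", "ham", "spam", "spam", "ham"]

# MLE for P(C = c): count of documents with label c / total documents.
counts = Counter(train_labels)
priors = {c: n / len(train_labels) for c, n in counts.items()}
print(priors)  # {'spam': 0.6, 'ham': 0.4}
```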

The MLE for the term in the summation …

Bayes' Theorem relies on some probability theory, which is covered here.

An experiment is any action or process with uncertain outcomes. The set of all possible outcomes of an experiment is called its **sample space**. A subset of the sample space (such as a single outcome) is called an **event**.

Consider the experiment of flipping a coin. This experiment is subject to uncertainty, because the outcome could be one of two possibilities. Formally, the sample space $S$ for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or tail facing upward.

In the above experiment of flipping a coin, the sample space $S = \{H,T\}$ contains two simple events, $H$ and $T$: heads and tails respectively.

Conditional probability allows us to use prior knowledge about how one event affects another. For example, a lecturer, Stan, is late to class 5% of the time when the weather is good, but 25% of the time when the weather is bad. Let $A$ be the event that Stan is late, and $B$ be the event of inclement weather. Then $P(A|B)$ is read as the probability that Stan will be late given that the weather is bad, or the probability of $A$ given $B$.

In this case, $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$, where $B^{'}$ is the complement of $B$; that is, the weather is not bad.

Formally, conditional probability is defined as follows:

$P(A|B) = \frac{P(A \cap B)}{P(B)}$
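A quick numerical check of this definition, using assumed values consistent with the Stan example: suppose bad weather occurs with probability $P(B) = 0.2$ (an assumption not given in the text), and Stan is late *and* the weather is bad with joint probability $P(A \cap B) = 0.05$.

```python
# Assumed values (not from the text): P(B) and the joint P(A and B).
p_B = 0.2
p_A_and_B = 0.05

# Definition of conditional probability: P(A|B) = P(A and B) / P(B).
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 0.25, matching P(A|B) in the example above
```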

Let $A_i$ be the $i^{th}$ class that we are interested in, and $B$ be a feature of a document. Bayes' Theorem, below, expresses the probability of class $A_i$ given feature $B$ in terms of the probability of $B$ given class $A_i$.

Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then, by the Law of Total Probability, for any other event $B$, $P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$.

$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)}$

Substituting the Law of Total Probability for the denominator gives, for each class $A_i$:

$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)}$
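Bayes' Theorem can be applied to the Stan example: given that Stan is late (event $A$), how likely is it that the weather was bad (event $B$)? The conditionals come from the example above; the prior $P(B) = 0.2$ is an assumed value, not given in the text.

```python
p_late_given_bad  = 0.25   # P(A | B), from the example
p_late_given_good = 0.05   # P(A | B'), from the example
p_bad = 0.2                # assumed prior P(B)

# Law of Total Probability for the denominator:
# P(A) = P(A|B) P(B) + P(A|B') P(B')
p_late = p_late_given_bad * p_bad + p_late_given_good * (1 - p_bad)

# Bayes' Theorem: P(B|A) = P(A|B) P(B) / P(A)
p_bad_given_late = p_late_given_bad * p_bad / p_late
print(round(p_bad_given_late, 3))  # -> 0.556
```

Even though Stan is late only 25% of the time in bad weather, seeing him late more than doubles our belief (from 0.2 to about 0.56) that the weather is bad.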
