===Links===
Before we dive into this topic, here are some useful links.
* http://en.wikipedia.org/wiki/Naive_Bayes
* http://www.intellektik.informatik.tu-darmstadt.de/~tom/IJCAI01/Rish.pdf
* http://citeseer.ist.psu.edu/30545.html
* http://jbnc.sourceforge.net/
* http://nltk.sourceforge.net/tutorial/classifying/index.html
* http://www.geocities.com/ResearchTriangle/Forum/1203/NaiveBayes.html
* http://www.resample.com/xlminer/help/NaiveBC/classiNB_intro.htm

===Notation===
This section outlines the notation used throughout this document.
* $D_i$ - Document $i$.
* $F_j$ - Feature $j$ (a word in a text document).
* $F_{i,j}$ - Feature $j$ in document $i$.
* $C$ - Class label.
* $C_i$ - Class label for document $i$.
* $\{c_1, \ldots , c_n\}$ - Values representing specific labels.
* $P(C=c_i| \ldots )$ - A conditional probability: the probability that $C$ takes the value $c_i$ given what appears after the bar.
* $P(c_i| \ldots )$ - Shorthand for the same conditional probability.
* $P(C| \ldots )$ - A conditional probability distribution over the values of $C$.
* $P(C_i| \ldots )$ - A conditional probability distribution over the labels of document $i$.

===Derivation of Naive Bayes for Classification===
First, what we're looking for is $\hat c = \arg\max_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the feature vector for document $i$, which is given. In other words, we have a document and its feature vector (that is, the words in the document), and we want to know the probability that the random variable $C_i$ takes on a specific label given that feature vector. In plain English: what is the probability that document $i$ belongs to class $c$?

Now that we know what we want, here is the derivation. Using Bayes' Theorem:

:$\hat c = \arg\max_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$

Note that for a positive constant $a$:

:$\arg\max_x \frac{f(x)}{a} = \arg\max_x f(x)$

Since $P(\underline{D_i})$ does not depend on $c$, it follows that:

:$\arg\max_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = \arg\max_c P(\underline{D_i}|C_i = c)P(C_i = c)$

(Note: see below for an explanation of Bayesian networks and the naive Bayes assumption, which we make at this point.)

By the multiplication rule:

:$P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},C_i = c)$

Writing the feature vector as $\underline{D_i} = (F_{i,1}, F_{i,2}, \ldots , F_{i,n})$, the naive Bayes assumption (the features are conditionally independent given the class) gives:

:$P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,1}|C_i) \cdots P(F_{i,n}|C_i)$

The product of feature terms on the right-hand side can be written in shorthand as $\prod_{j=1}^n P(F_{i,j}|C_i = c)$, so we now have $P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$, leading to:

:$\hat c = \arg\max_c P(C_i = c)\prod_{j=1}^n P(F_{i,j}|C_i = c)$

A product of many small probabilities quickly underflows floating-point representation, so in practice we work with logarithms; because $\log$ is monotonically increasing, the maximizing $c$ is unchanged:

:$\hat c = \arg\max_c \log P(C_i = c) + \sum_{j=1}^n \log P(F_{i,j}|C_i = c)$

The MLE (maximum likelihood estimate) of the first term, $P(C_i = c)$, is the number of training documents with label $c$ divided by the total number of training documents. The MLE of each term in the summation, $P(F_{i,j}|C_i = c)$, is likewise computed from counts; in a bag-of-words model it is the number of times word $j$ occurs in training documents labeled $c$ divided by the total number of word occurrences in training documents labeled $c$.
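To make these estimates concrete, here is a minimal sketch in Python of the procedure above for a bag-of-words model; the function names (<code>train</code>, <code>classify</code>) and data layout are illustrative, not from any particular library. The add-one (Laplace) smoothing in the word likelihoods is an addition over the plain MLE, used so that the logarithm of zero is never taken for words that do not occur with a given class.

<syntaxhighlight lang="python">
from collections import Counter, defaultdict
from math import log

def train(documents, labels):
    """Estimate log P(C = c) and log P(F = w | C = c) by counting.

    documents: list of documents, each a list of word tokens
    labels:    list of class labels, one per document
    """
    class_counts = Counter(labels)          # number of documents per class
    word_counts = defaultdict(Counter)      # word counts per class
    for doc, label in zip(documents, labels):
        word_counts[label].update(doc)

    vocabulary = {w for counts in word_counts.values() for w in counts}
    total_docs = len(documents)

    # MLE of the prior: fraction of training documents with each label c.
    log_prior = {c: log(n / total_docs) for c, n in class_counts.items()}

    # Add-one smoothed estimate of P(word | class); the plain MLE drops the "+ 1".
    log_likelihood = {}
    for c in class_counts:
        total_words = sum(word_counts[c].values())
        log_likelihood[c] = {
            w: log((word_counts[c][w] + 1) / (total_words + len(vocabulary)))
            for w in vocabulary
        }
    return log_prior, log_likelihood

def classify(doc, log_prior, log_likelihood):
    """Return argmax over c of log P(C = c) + sum_j log P(F_j | C = c)."""
    def score(c):
        return log_prior[c] + sum(
            log_likelihood[c][w] for w in doc if w in log_likelihood[c]
        )
    return max(log_prior, key=score)
</syntaxhighlight>

For example, <code>classify(['rain', 'tomorrow'], *train(docs, labels))</code> returns the label with the highest log score; words never seen in training are simply skipped at classification time.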
===Probability Theory===
Bayes' Theorem relies on some probability theory, which is covered here.

====Sample Spaces and Events====
An experiment is any action or process with an uncertain outcome. The possible outcomes of an experiment are represented by a '''sample space'''. Each possible outcome is called an '''event''', and together these events make up the sample space for a particular experiment.

=====Sample Space=====
Consider the experiment of flipping a coin. This experiment is subject to uncertainty because the outcome could be one of two possibilities. Formally, the sample space for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or a tail facing upward.

=====Events=====
An event is an outcome of (or a result of) an experiment. In the coin-flipping experiment above, the sample space $S = \{H,T\}$ contains two events, $H$ and $T$: heads and tails, respectively.

=====Some Set Theory=====

===Conditional Probability===
Conditional probability allows us to use prior knowledge about how one random variable affects another. For example, when the weather is good, a lecturer, Stan, is late to class 5% of the time; when the weather is bad, Stan is late 25% of the time. Let $A$ be the event that Stan is late, and $B$ the event of inclement weather. Then $P(A|B)$ is read "the probability that Stan will be late given that the weather is bad", or simply the probability of $A$ given $B$. In this case $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$, where $B^{'}$ is the complement of $B$, that is, the event that the weather is not bad.

Formally, conditional probability is defined as follows:
:$P(A|B) = \frac{P(A \cap B)}{P(B)} \!$
In words: the probability of $A$ given $B$ is the probability that both $A$ and $B$ occur, rescaled by the probability of $B$. In the classification setting, let $A_i$ be the $i^{th}$ class we are interested in and $B$ be a feature of a document; then $P(A_i|B)$ is the probability that a document belongs to class $A_i$ given that it contains feature $B$.

===The Law of Total Probability===
Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then for any other event $B$:
:$P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$

===Bayes' Theorem===
:$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!$
:$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,P(A_j)} \!$
The second form expands the denominator using the law of total probability, taking the events $A_j$ to be mutually exclusive and exhaustive.
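As a worked example of the two results above, return to Stan from the conditional probability section, and suppose, purely for illustration, that the weather is bad on 20% of class days, so $P(B) = 0.2$ (this prior is an assumed figure, not given earlier). Recall that $P(A|B) = 0.25$ and $P(A|B^{'}) = 0.05$. By the law of total probability:
:$P(A) = P(A|B)P(B) + P(A|B^{'})P(B^{'}) = (0.25)(0.2) + (0.05)(0.8) = 0.09$
and by Bayes' Theorem, the probability that the weather was bad given that Stan is late is:
:$P(B|A) = \frac{P(A|B)\,P(B)}{P(A)} = \frac{(0.25)(0.2)}{0.09} \approx 0.56$
So observing that Stan is late raises the probability of bad weather from the assumed prior of 0.2 to roughly 0.56.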