##### Differences

This shows you the differences between two versions of the page.

 nlp:naive-bayes [2015/04/23 15:37]ryancha created nlp:naive-bayes [2015/04/23 15:39] (current)ryancha 2015/04/23 15:39 ryancha 2015/04/23 15:37 ryancha created 2015/04/23 15:39 ryancha 2015/04/23 15:37 ryancha created Line 13: Line 13: This section outlines the notation used throughout this document. This section outlines the notation used throughout this document. - * <​math>​D_i ​- Document ​<​math>​i​. + * $D_i$ - Document ​$i$. - * <​math>​F_j ​- Feature ​<​math>​j ​(word in a text document). + * $F_j$ - Feature ​$j$ (word in a text document). - * <​math>​F_{i,j} ​- Feature ​<​math>​j ​in Document ​<​math>​i​. + * $F_{i,j}$ - Feature ​$j$ in Document ​$i$. - * <​math>​C ​- Class label. + * $C$ - Class label. - * <​math>​C_i ​- Class label for Document ​<​math>​i​. + * $C_i$ - Class label for Document ​$i$. - * <​math>​\{c_1, \ldots , c_n\} ​- Values representing specific labels. + * $\{c_1, \ldots , c_n\}$ - Values representing specific labels. - * <​math>​P(C=c_i| \ldots ) ​- This is a notation for a conditional probability. + * $P(C=c_i| \ldots )$ - This is a notation for a conditional probability. - * <​math>​P(c_i| \ldots ) ​- This is also a notation for a conditional probability. + * $P(c_i| \ldots )$ - This is also a notation for a conditional probability. - * <​math>​P(C| \ldots ) ​- This is notation for a probability distribution. + * $P(C| \ldots )$ - This is notation for a probability distribution. - * <​math>​P(C_i| \ldots ) ​- This is also notation for a probability distribution. + * $P(C_i| \ldots )$ - This is also notation for a probability distribution. ===Derivation of Naive Bayes for Classification=== ===Derivation of Naive Bayes for Classification=== - First, what we're looking for is <​math>​\hat c =  argmax_c P(C_i = c| \underline{D_i})​, where <​math>​\underline{D_i} ​is the feature vector for document ​<​math>​i ​which is given. In other words, we have a document and its feature vector (that is, the words in the dcoument) and we want to know the probability that the random variable ​<​math>​C ​takes on a specific value or label given this document and its feature vector. In English, what is the probability that this document belongs to class <​math>​C_i​? + First, what we're looking for is $\hat c = argmax_c P(C_i = c| \underline{D_i})$, where $\underline{D_i}$ is the feature vector for document ​$i$ which is given. In other words, we have a document and its feature vector (that is, the words in the dcoument) and we want to know the probability that the random variable ​$C$ takes on a specific value or label given this document and its feature vector. In English, what is the probability that this document belongs to class $C_i$? Now that we know what we want, here is the derivation: Now that we know what we want, here is the derivation: Line 31: Line 31: using Bayes' Theorem: using Bayes' Theorem: - <​math>​\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}​ + $\hat c = argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})}$ - Note that with <​math>​a ​a constant: + Note that with $a$ a constant: - <​math>​argmax_x \frac{f(x)}{a} = argmax_x f(x)​ + $argmax_x \frac{f(x)}{a} = argmax_x f(x)$ Therefore: Therefore: - <​math>​argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)​ + $argmax_c \frac{P(\underline{D_i}|C_i = c)P(C_i = c)}{P(\underline{D_i})} = argmax_c P(\underline{D_i}|C_i = c)P(C_i = c)$ (Note: See below for explanation of Bayesian Networks and the naive bayes assumption, which we make at this point). (Note: See below for explanation of Bayesian Networks and the naive bayes assumption, which we make at this point). Line 45: Line 45: By the multiplication rule By the multiplication rule - <​math>​P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},​C_i = c)​ + $P(\underline{D_i}|C_i = c)P(C_i = c) = P(\underline{D_i},​C_i = c)$ Because of the naive bayes assumption, ​ Because of the naive bayes assumption, ​ - <​math>​P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,​1}|C_i) \cdots P(F_{i,​n}|C_i)​ + $P(C_i, F_{i,1}, F_{i,2}, \ldots ,F_{i,n}) = P(C_i)P(F_{i,​1}|C_i) \cdots P(F_{i,​n}|C_i)$ Now, the second part of the right-hand-side of this last equation can be written in short-hand as Now, the second part of the right-hand-side of this last equation can be written in short-hand as - <​math>​\prod_{j=1}^n P(F_{i,​j}|C_i = c)​ + $\prod_{j=1}^n P(F_{i,​j}|C_i = c)$ so we now have so we now have - <​math>​P(C_i = c)\prod_{j=1}^n P(F_{i,​j}|C_i = c)​ + $P(C_i = c)\prod_{j=1}^n P(F_{i,​j}|C_i = c)$ leading to leading to - <​math>​\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,​j}|C_i = c)​ + $\hat c = argmax_c P(C_i = c)\prod_{j=1}^n P(F_{i,​j}|C_i = c)$ For the sake of representing small probabilities in digital hardware, the log function is useful for overcoming floating point underflow. For the sake of representing small probabilities in digital hardware, the log function is useful for overcoming floating point underflow. - <​math>​\hat c = argmax_c log(P(C_i = c)) + \sum_{j=1}^n log(P(F_{i,​j}|C_i = c))​ + $\hat c = argmax_c log(P(C_i = c)) + \sum_{j=1}^n log(P(F_{i,​j}|C_i = c))$ - The MLE (maximum likelihood estimator) for the first term, <​math>​P(C_i = c) ​is computed by taking the number of documents with label <​math>​c ​divided by the total number of documents in the training set. + The MLE (maximum likelihood estimator) for the first term, $P(C_i = c)$ is computed by taking the number of documents with label $c$ divided by the total number of documents in the training set. The MLE for the term in the sumation ... The MLE for the term in the sumation ... Line 78: Line 78: =====Sample Space===== =====Sample Space===== - Consider the experiment of flipping a coin. This experiment is subject to uncertainty,​ because the outcome could be one of two possibilities. Formally, the sample space <​math>​S ​for this experiment is <​math>​S = \{H,T\}​; the only possible outcomes are a head or tail facing upward. + Consider the experiment of flipping a coin. This experiment is subject to uncertainty,​ because the outcome could be one of two possibilities. Formally, the sample space $S$ for this experiment is $S = \{H,T\}$; the only possible outcomes are a head or tail facing upward. =====Events===== =====Events===== - An event is an outcome from (or a result of) an experiment. In the above experiment of flipping a coin, the sample space, ​<​math>​S = \{H,T\}​, contains two events, ​<​math>​H ​and <​math>​T​, heads and tails respectively. + An event is an outcome from (or a result of) an experiment. In the above experiment of flipping a coin, the sample space, ​$S = \{H,T\}$, contains two events, ​$H$ and $T$, heads and tails respectively. =====Some Set Theory===== =====Some Set Theory===== ===Conditional Probability=== ===Conditional Probability=== - Conditional probability allows us to use prior knowledge about how one random variable affects another random variable. For example, a lecturer, Stan, has a tendency of being late to class when the weather is good 5% of the time. However, when the weather is bad, Stan is late 25% of the time. Let <​math>​A ​be the event that Stan is late, and <​math>​B ​be the event of inclement weather. Then, <​math>​P(A|B) ​is read, the probability that Stan will be late given that the weather is bad. Or, the probability of <​math>​A ​given <​math>​B​. + Conditional probability allows us to use prior knowledge about how one random variable affects another random variable. For example, a lecturer, Stan, has a tendency of being late to class when the weather is good 5% of the time. However, when the weather is bad, Stan is late 25% of the time. Let $A$ be the event that Stan is late, and $B$ be the event of inclement weather. Then, $P(A|B)$ is read, the probability that Stan will be late given that the weather is bad. Or, the probability of $A$ given $B$. - In this case, <​math>​P(A|B) = .25​, and <​math>​P(A|B^{'​}) = 0.5 ​where <​math>​B^{'} ​is the compliment of <​math>​B​, that is, the weather is not bad. + In this case, $P(A|B) = .25$, and $P(A|B^{'​}) = 0.5$ where $B^{'}$ is the compliment of $B$, that is, the weather is not bad. Formally, conditional probability is defined as follows: Formally, conditional probability is defined as follows: - :<​math>​P(A|B) = \frac{P(A \cap B)}{P(B)} \!​ + :$P(A|B) = \frac{P(A \cap B)}{P(B)} \!$ - This formula is interpreted as follows: The Probability of class <​math>​A_i ​given feature ​<​math>​B ​is equal to the probability of <​math>​B ​given class <​math>​A_i​ + This formula is interpreted as follows: The Probability of class $A_i$ given feature ​$B$ is equal to the probability of $B$ given class $A_i$ - Let <​math>​A_i ​be the <​math>​i^{th} ​class that we are interested in, and <​math>​B ​be a feature of a document. + Let $A_i$ be the $i^{th}$ class that we are interested in, and $B$ be a feature of a document. ===The Law of Total Probability=== ===The Law of Total Probability=== - Let <​math>​A_1, \ldots , A_k ​be mutually exclusive and exhaustive events. Then for any other event <​math>​B​, <​math>​P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)​. + Let $A_1, \ldots , A_k$ be mutually exclusive and exhaustive events. Then for any other event $B$, $P(B) = \sum_{i=1}^k P(B|A_i)P(A_i)$. ===Bayes'​ Theorem=== ===Bayes'​ Theorem=== - :<​math>​P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!​ + :$P(A|B) = \frac{P(B | A)\, P(A)}{P(B)} \!$ - :<​math>​P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,​P(A_j)} ​ \!​ + :$P(A_i|B) = \frac{P(B | A_i)\, P(A_i)}{\sum_j P(B|A_j)\,​P(A_j)} ​ \!$ . . 