This assignment is designed to:

- help you become more fluent with the terminology and the techniques of the course
- give you an opportunity to work with some real text data in the context of probability and information theory
- help you start working with probability distributions in code
- help you get started conceiving design patterns for implementing probabilistic models

Show your work. Be clear and concise. **This assignment should be typed.**

[20 points]

(Adapted from: Manning & Schuetze, exercise 2.13)

Prove the following equality: $H_c(X, q) = E_p\left(\log \frac{1}{q(x)}\right)$. To help, note the following: $H_c(X, q) = H(X) + KL(p \| q)$ and $-\sum_x p(x) \log q(x) = E_p\left(\log \frac{1}{q(x)}\right)$.

Be sure to apply good proof technique: justify each step in your proofs; set up your proofs in two-column format, with each step showing a statement on the left and a justification on the right.
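If you typeset your proof in LaTeX, a two-column proof can be set as a simple `tabular`; the skeleton below shows only the format (the intermediate statements and justifications are left to you):

```latex
\begin{tabular}{ll}
\textbf{Statement} & \textbf{Justification} \\
$H_c(X, q) = H(X) + KL(p \| q)$ & given \\
$\quad = \dots$ & (justify each step) \\
$\quad = E_p\left(\log \tfrac{1}{q(x)}\right)$ & (conclusion) \\
\end{tabular}
```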

The requirement for the programming part of the assignment is to learn distributions over characters for three texts of your choice and to compute important information-theoretic functions on those distributions:

- Choose three digital text documents, each at least 1000 words long, such that two are from a similar source (e.g., chapters of the same book, books by the same author, talks from the same venue, etc.) and the third is from a dissimilar source (e.g., by a different author, in a different language, etc.).
- [20 points] Distributions as bar charts: Produce bar charts representing the distributions over characters for each of your texts.
- [30 points] Entropy: Report the entropy of the character distribution for each of your texts. Use base-2 logarithms for the entropy calculation.
  - Remember that $0 \cdot \log(0)$ is always defined to be 0.

- [30 points] K-L Divergence: Report the K-L divergence of the character distributions for each pair of your chosen texts. Be sure to compute the K-L divergence in each direction (text 1 vs. text 2 and *vice versa*) for each pair of texts.
  - Use base-2 logarithms for the K-L divergence calculation.
  - When implementing the K-L divergence of two distributions (call them p and q), you will want to avoid dividing by zero probabilities in q. To avoid such problems, "smooth" q by iterating over the support of p to find any zeros in q. Replace each zero probability with a small quantity *epsilon* (use 1.0e-10), and then re-normalize the distribution after smoothing (to maintain the sum-to-1 property). Be sure to work with a *copy of* q; otherwise, you will permanently alter the distribution each time you compare models, which is undesirable.
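To illustrate the calculations above, here is a minimal sketch in Python (one possible language choice; the function names and dictionary-based representation are illustrative, not required):

```python
import math
from collections import Counter

EPSILON = 1.0e-10

def char_distribution(text):
    """Map each character to its relative frequency in the text."""
    counts = Counter(text)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

def entropy(p):
    """Entropy in bits; 0 * log(0) terms are defined to contribute 0."""
    return -sum(px * math.log2(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D(p || q) in bits, smoothing a copy of q over the support of p."""
    q = dict(q)  # copy, so the caller's distribution is untouched
    for c in p:
        if q.get(c, 0.0) == 0.0:
            q[c] = EPSILON
    z = sum(q.values())  # re-normalize to restore the sum-to-1 property
    q = {c: v / z for c, v in q.items()}
    return sum(px * math.log2(px / q[c]) for c, px in p.items() if px > 0)
```

Note that the divergence is asymmetric, which is why both directions must be reported for each pair of texts; the bar charts can be drawn directly from the distribution dictionaries with any plotting library.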

- Select a language and environment of your choice. Be able to briefly justify your choice.
- Plan to stick with that language and environment for the duration of the course.
- Build classes that you can reuse in future projects

Your **typed** report should include:

- Your work and solutions for the exercises
- URLs for the documents you selected
- The data itself. If too large (longer than a page), include just a brief description instead; e.g., “The Project Gutenberg copy of David Copperfield”.
- The per-document character distributions as bar charts over characters. Use at most one page per bar chart.
- The computed entropy and divergence values
- Your code (just the classes you implemented)

Organize your report and use clear headings and explanations where needed.

Submit a .pdf document through Learning Suite.