# Information Theory

## Objectives

This assignment is designed to:

• help you become more fluent with the terminology and the techniques of the course
• give you an opportunity to work with some real text data in the context of probability and information theory
• help you get started conceiving design patterns for implementing probabilistic models

## Exercises

Show your work. Be clear and concise. This assignment should be typed.

### Information Theory

[20 points]

(Adapted from: Manning & Schuetze, exercise 2.13)

Prove the following equality: $H_c(X, q) = E_p\left[\log \frac{1}{q(x)}\right]$. To help, note the following: $H_c(X, q) = H(X) + KL(p \| q)$ and $-\sum_x p(x) \log q(x) = E_p\left[\log \frac{1}{q(x)}\right]$.

Be sure to apply good proof technique: justify each step, and set up your proof in two-column format, with each step showing a statement on the left and its justification on the right.
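If you are typesetting in LaTeX, one simple way to get the two-column layout is a `tabular` environment. The rows below are placeholders illustrating the format only, not the proof itself:

```latex
\begin{tabular}{l l}
\textbf{Statement} & \textbf{Justification} \\
\hline
$H_c(X, q) = H(X) + KL(p \| q)$ & given \\
$\ldots$ & definition of entropy \\
$\ldots$ & definition of K-L divergence \\
\end{tabular}
```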

## Programming Exercise

The requirement for the programming part of the assignment is to learn distributions over characters for three texts of your choice and to compute important information-theoretic functions on those distributions:

• Choose three digital text documents, each at least 1000 words long, such that two are from a similar source (e.g., chapters of the same book, books by the same author, talks from the same venue, etc.) and the third is from a dissimilar source (e.g., by a different author, in a different language, etc.).
• [20 points] Distributions as bar charts: Produce bar charts representing the distributions over characters for each of your texts.
• [30 points] Entropy: Report the entropy of the distribution of characters for each of your texts. Use base 2 logarithms for the entropy calculation.
• Remember that $0 \cdot \log(0)$ is always defined to be 0.
• [30 points] K-L Divergence: Report the K-L divergence of the character distribution for each pair of your chosen texts. Be sure to compute the K-L divergence in each direction (text 1 vs. text 2 and vice versa) for each pair of texts.
• Use base 2 logarithms for the K-L Divergence calculation.
• When implementing the K-L divergence of two distributions (call them p and q), you will want to avoid dividing by zero probabilities in q. To avoid such problems, "smooth" q: iterate over the support of p, and wherever q assigns zero probability, replace that zero with a small quantity epsilon (use 1.0e-10), then re-normalize the distribution so it still sums to 1. Perform the smoothing on a copy of q; otherwise, each comparison would permanently alter the distribution, which is undesirable.
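The steps above might be sketched as follows in Python. This is a minimal illustration, not a required design; the function and constant names are our own, and you will still need to package the pieces into reusable classes and produce the bar charts separately:

```python
from collections import Counter
import math

EPSILON = 1.0e-10  # smoothing value specified in the assignment

def char_distribution(text):
    """Return a dict mapping each character to its relative frequency."""
    counts = Counter(text)
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}

def entropy(p):
    """Shannon entropy in bits; 0 * log(0) is taken to be 0,
    so zero-probability terms are simply skipped."""
    return -sum(prob * math.log2(prob) for prob in p.values() if prob > 0)

def kl_divergence(p, q):
    """D(p || q) in bits, smoothing zeros in a copy of q."""
    q = dict(q)  # work on a copy so the caller's distribution is untouched
    for ch in p:  # iterate over the support of p to notice zeros in q
        if q.get(ch, 0.0) == 0.0:
            q[ch] = EPSILON
    total = sum(q.values())  # re-normalize to restore the sum-to-1 property
    q = {ch: prob / total for ch, prob in q.items()}
    return sum(p[ch] * math.log2(p[ch] / q[ch]) for ch in p if p[ch] > 0)
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, which is why the assignment asks for both directions for each pair of texts.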

### Coding Considerations

• Select a language and environment of your choice, and be prepared to briefly justify your choice.
• Plan to stick with that language and environment for the duration of the course.
• Build classes that you can reuse in future projects.

## Report

• Your work and solutions for the exercises.
• URLs for the documents you selected.
• The data itself. If too large (longer than a page), include just a brief description instead; e.g., “The Project Gutenberg copy of David Copperfield.”
• The per-document character distributions as bar charts over characters. Use at most one page per bar chart.
• The computed entropy and divergence values.
• Your code (just the classes you implemented).