Information Theory

Objectives

This assignment is designed to:

  • help you become more fluent with the terminology and the techniques of the course
  • give you an opportunity to work with some real text data in the context of probability and information theory
  • help you start working with probability distributions in code
  • help you start developing design patterns for implementing probabilistic models

Exercises

Show your work. Be clear and concise. This assignment should be typed.

Information Theory

[20 points]

(Adapted from: Manning & Schuetze, exercise 2.13)

Prove the following equality: $H_c(X,q) = E_p\left(\log \frac{1}{q(x)}\right)$. To help, note the following: $H_c(X,q) = H(X) + KL(p \| q)$ and $-\sum_x p(x) \log q(x) = E_p\left(\log \frac{1}{q(x)}\right)$.

Be sure to apply good proof technique: justify each step, and set up your proof in two-column format, with each step showing the statement on the left and its justification on the right.

Programming Exercise

The requirement for the programming part of the assignment is to learn distributions over characters for three texts of your choice and to compute important information-theoretic functions on those distributions:

  • Choose three digital text documents, each at least 1000 words long, such that two are from a similar source (e.g., chapters of the same book, books by the same author, talks from the same venue, etc.) and the third is from a dissimilar source (e.g., by a different author, in a different language, etc.).
  • [20 points] Distributions as bar charts: Produce bar charts representing the distributions over characters for each of your texts (a sample sketch appears after this list).
  • [30 points] Entropy: Report the entropy of the distribution of characters for each of your texts. Use base 2 logarithms for the entropy calculation (see the entropy sketch after this list).
    • Remember that $0 \cdot \log(0)$ is defined, by convention, to be 0.
  • [30 points] K-L Divergence: Report the K-L divergence of the character distribution for each pair of your chosen texts. Be sure to compute the K-L divergence in each direction (text 1 vs. text 2 and vice versa) for each pair of texts.
    • Use base 2 logarithms for the K-L divergence calculation.
    • When implementing the K-L divergence of two distributions (call them p and q), you must avoid zero probabilities in q, which would otherwise cause division by zero. To avoid this, “smooth” q: iterate over the support of p, and wherever q assigns zero probability, replace the zero with a small quantity epsilon (use 1.0e-10); then re-normalize the smoothed distribution so that it still sums to 1. Work with a copy of q; otherwise you will permanently alter the distribution each time you compare models. A sketch of this smoothing appears after this list.
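
To illustrate the distribution-building step, here is a minimal sketch, assuming Python with matplotlib; the function names char_distribution and plot_distribution, the use of repr for tick labels, and the PNG output are illustrative choices, not requirements of the assignment.

  from collections import Counter

  import matplotlib.pyplot as plt

  def char_distribution(text):
      """Map each character in the text to its relative frequency."""
      counts = Counter(text)
      total = sum(counts.values())
      return {ch: n / total for ch, n in counts.items()}

  def plot_distribution(dist, title):
      """Render a character distribution as a bar chart and save it as a PNG."""
      chars = sorted(dist)                # fix a consistent character order
      probs = [dist[ch] for ch in chars]
      plt.figure(figsize=(10, 4))
      plt.bar(range(len(chars)), probs)
      plt.xticks(range(len(chars)), [repr(ch) for ch in chars], rotation=90, fontsize=6)
      plt.ylabel("probability")
      plt.title(title)
      plt.tight_layout()
      plt.savefig(title + ".png")

Whether to lowercase the text, strip whitespace, or keep punctuation is a modeling decision left to you; whatever you choose, apply it consistently to all three texts.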
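
The entropy computation is then short in the same representation; this is a minimal sketch in bits (base 2), with the 0 · log(0) = 0 convention handled by skipping zero-probability terms.

  import math

  def entropy(dist):
      """Shannon entropy in bits; zero-probability terms contribute 0 by convention."""
      return -sum(p * math.log2(p) for p in dist.values() if p > 0)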
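
Finally, a sketch of the smoothing and K-L divergence steps described above, again assuming the dictionary-of-probabilities representation; the epsilon value comes from the assignment, while the function names are illustrative.

  import math

  EPSILON = 1.0e-10

  def smooth(p, q, epsilon=EPSILON):
      """Return a re-normalized copy of q with epsilon in place of zeros on the support of p."""
      q_smoothed = dict(q)  # work on a copy so the original q is left untouched
      for ch in p:
          if q_smoothed.get(ch, 0.0) == 0.0:
              q_smoothed[ch] = epsilon
      total = sum(q_smoothed.values())
      return {ch: prob / total for ch, prob in q_smoothed.items()}

  def kl_divergence(p, q):
      """D(p || q) in bits, computed after smoothing q on the support of p."""
      q_s = smooth(p, q)
      return sum(p_x * math.log2(p_x / q_s[ch]) for ch, p_x in p.items() if p_x > 0)

Because smoothing depends on the support of p, kl_divergence(p, q) and kl_divergence(q, p) each smooth their own second argument; computing both directions, as required, is simply a matter of calling the function twice with the arguments swapped.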

Coding Considerations

  • Select a programming language and environment of your choice, and be prepared to briefly justify that choice.
  • Plan to stick with that language and environment for the duration of the course.
  • Build classes that you can reuse in future projects.

Report

Your typed report should include:

  • Your work and solutions for the exercises
  • URLs for the documents you selected
  • The data itself. If too large (longer than a page), include just a brief description instead; e.g., “The Project Gutenberg copy of David Copperfield”.
  • The per-document character distributions, rendered as bar charts. Use at most one page per bar chart.
  • The computed entropy and divergence values
  • Your code (just the classes you implemented)

Organize your report and use clear headings and explanations where needed.

Submission

Submit a .pdf document through Learning Suite.
