This assignment is designed to:

- give hands-on experience with topic analysis using the Latent Dirichlet Allocation (LDA) model
- provide experience with Gibbs sampling on a directed graphical model
- involve you in evaluating topic quality using qualitative inspection

In this project your goal is to build a topic analysis system for documents. You will implement a Gibbs sampler on the LDA model. You will track and report a meaningful statistic to understand the convergence of your sampler. For this project, you will assume a fixed number of topics. Optionally, you are free to try several topic counts and determine which one works best for your data. Finally, you will engage in qualitative evaluation of your topic results.

The ideas necessary for success on this project have been covered in several lectures in class:

- Gibbs sampling was introduced in Lecture #26
- the LDA model is described in Lecture #29
- the complete conditional distribution for the LDA Gibbs sampler is described in Lecture #31
- similar experiments are described in the assigned paper by Griffiths and Steyvers titled "Finding Scientific Topics"

For this project, a joint project (up to two people) is welcome, as long as each contributor performs about half of the work.

You will work with a document collection of your choice. Some readily available options have been collected here: data sets. You may also choose some other collection. Whichever set you choose, it must consist of at least 1000 documents.

- Extract features into document feature vectors
- Use selected words as the evidence. Design and justify your choice (in the Data section of the report).
- It is recommended that you choose some frequency threshold (e.g., 3): remove words whose frequency in a given document falls below your chosen threshold.
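The per-document frequency filter could be sketched as follows. This is a minimal illustration, not starter code; the class and method names are our own:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: drop tokens whose count within a single document is below a threshold.
public class Features {
    static List<String> filterByFrequency(List<String> tokens, int threshold) {
        // count each word's occurrences in this document
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);
        // keep only tokens that meet the threshold
        List<String> kept = new ArrayList<>();
        for (String t : tokens) if (counts.get(t) >= threshold) kept.add(t);
        return kept;
    }
}
```

Whether you filter per document (as the threshold above does) or over the whole corpus is itself a design decision worth justifying in your Data section.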

- Choose a value of T (number of topics), and justify your choice in your report (in the Design Decisions section of your report).
- Implement the Gibbs sampling algorithm for LDA correctly.
- Track and report (a quantity proportional to) the joint likelihood of the words ($w$) and current topic assignments ($z$s) in order to track how well the sampling algorithm is converging to the desired stationary distribution. Further explanation is to be found here.
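One sweep of the collapsed Gibbs sampler might look like the sketch below, using the standard complete conditional $p(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \propto \frac{n^{(w_i)}_t + \beta}{n_t + V\beta}\,(n^{(d)}_t + \alpha)$ from lecture. All class and field names here are hypothetical, and your own data structures may differ:

```java
import java.util.Random;

// Sketch of one collapsed Gibbs sweep for LDA (illustrative names throughout).
public class LdaGibbs {
    int[][] docs;   // docs[d][i] = word id of token i in document d
    int[][] z;      // z[d][i]   = current topic assignment of that token
    int[][] nDT;    // nDT[d][t] = tokens in doc d assigned to topic t
    int[][] nWT;    // nWT[w][t] = times word w is assigned to topic t
    int[] nT;       // nT[t]     = total tokens assigned to topic t
    int T, V;       // number of topics, vocabulary size
    double alpha, beta;
    Random rng = new Random(0);

    void sweep() {
        for (int d = 0; d < docs.length; d++) {
            for (int i = 0; i < docs[d].length; i++) {
                int w = docs[d][i], old = z[d][i];
                // remove this token's current assignment from the counts
                nDT[d][old]--; nWT[w][old]--; nT[old]--;
                // complete conditional, up to normalization:
                // (nWT[w][t] + beta) / (nT[t] + V*beta) * (nDT[d][t] + alpha)
                double[] p = new double[T];
                double total = 0.0;
                for (int t = 0; t < T; t++) {
                    p[t] = (nWT[w][t] + beta) / (nT[t] + V * beta) * (nDT[d][t] + alpha);
                    total += p[t];
                }
                // draw the new topic from the unnormalized distribution
                double u = rng.nextDouble() * total;
                int t = 0;
                double acc = p[0];
                while (acc < u && t < T - 1) { t++; acc += p[t]; }
                z[d][i] = t;
                nDT[d][t]++; nWT[w][t]++; nT[t]++;
            }
        }
    }
}
```

A useful invariant to check while debugging: after every sweep, the count arrays must still sum to the total number of tokens.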

- Write enough code to present example topics, so that you can assess the quality of the topics (in the Qualitative Analysis section of your report).
- Write code to produce lists of the top N words according to $p(word | topic)$ for any topics you choose.
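Ranking words within a topic can be sketched as below. Since $p(w \mid t) = (n^{(w)}_t + \beta)/(n_t + V\beta)$ and the denominator is constant within a topic, sorting by the smoothed counts gives the same order. The names (`TopWords`, `nWT`) are illustrative assumptions, not required structure:

```java
import java.util.Arrays;

// Sketch: ids of the top n words for topic t under p(w | t).
public class TopWords {
    // nWT[w][t] = count of word w assigned to topic t
    static int[] topWords(int[][] nWT, int t, double beta, int n) {
        int V = nWT.length;
        Integer[] ids = new Integer[V];
        for (int w = 0; w < V; w++) ids[w] = w;
        // descending order of smoothed count, i.e. of p(w | t)
        Arrays.sort(ids, (a, b) -> Double.compare(nWT[b][t] + beta, nWT[a][t] + beta));
        int[] top = new int[Math.min(n, V)];
        for (int i = 0; i < top.length; i++) top[i] = ids[i];
        return top;
    }
}
```

Mapping the returned word ids back to strings through your vocabulary gives the word lists to include in the report.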

*Please limit the non-code portion of your report to about 5 pages.*

For this assignment, write a clear, well-structured, self-contained report on the work you have done, including the following ingredients:

**Time**: Please include at the top of page 1 of your report a clear measure (in hours) of how long it took you to complete this project.

**Data** [10 points]: Dedicate one section of your report to specifying your choice of data set. If you chose your own, then explain the source and content of the data. Include your approach to feature selection, and justify your approach.

**Design Decisions** [25 points]:

- Specify what you built and what choices you made.
- Someone should be able to read this section and recognize that your implementation is reasonable without having to read your code.
- Address all of the issues you encountered and how you resolved them in a coherent discussion. Please do not simply submit a debug or work log.
- Include issues encountered when implementing the Gibbs sampler.

**Results** [25 points]: Include a plot of the joint likelihood demonstrating proper convergence of your Gibbs sampler.

**Qualitative Analysis** [30 points]: Qualitative evaluation aims to answer the question **"are the results of this program any good?"** In other words, **"would this be useful to someone?"** For your qualitative evaluation, go considerably beyond a casual inspection or anecdotal evidence. Specifically, since you are evaluating a topic analysis technique, for several chosen topics report the top $N$ words according to $p(w | z)$, where $z$ is a topic of interest. Also include responses to the questions raised below – enough to convince us that you looked at the specific behavior of your models and thought about what they're doing and how you would improve them. Use tables, diagrams, graphs, and interesting examples, where needed, to make your points, and share any other insights you think will be helpful in interpreting your work.

**Questions**: Address the following questions:

- How are the results surprising?
- Why do you think your algorithm is working properly?
- To what degree are the topics internally coherent?
- Can you give a name to each topic, based on your inspection?
- To what degree are the topics distinct from one another?
- Identify some specific problems that are particularly revealing as to why or how the algorithm makes errors.
- If you were going to invest a large amount of effort into raising the usefulness of this system, what would you do and why?

**Feedback**: Include at the end of your report a short section titled "Feedback". Reflect on your experience in the project and provide any concrete feedback you have for us that would help make this project a better learning experience.

**Code**: Include your code at the end of your report.

**Clarity and structure** of your report [10 points].

Your report should be submitted as a .pdf document via Learning Suite.

Thanks go to Dan Walker, formerly of the BYU NLP Lab, for collecting the data sets.

There are many libraries that make sampling from the Dirichlet and other prior distributions easy. (e.g., http://www.arbylon.net/projects/knowceans-tools/doc/)

The multivariate beta function can be implemented according to its definition using the gamma function. The gamma function is implemented in Apache Commons Math (http://commons.apache.org/proper/commons-math/userguide/special.html).

```java
public static double logBeta(double[] alpha) {
    // log B(alpha) = sum_i log Gamma(alpha_i) - log Gamma(sum_i alpha_i)
    double sum = 0.0;
    double total = 0.0;
    for (double a : alpha) {
        sum += Gamma.logGamma(a);
        total += a;
    }
    return sum - Gamma.logGamma(total);
}
```
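The logBeta function above is enough to compute the tracked joint-likelihood quantity. One standard collapsed form (check it against your lecture notes; our notation here may differ slightly from the slides) is:

$$
\log p(\mathbf{w}, \mathbf{z}) = \sum_{t=1}^{T}\Big[\log B(\mathbf{n}_{t} + \beta\mathbf{1}_V) - \log B(\beta\mathbf{1}_V)\Big] + \sum_{d=1}^{D}\Big[\log B(\mathbf{n}_{d} + \alpha\mathbf{1}_T) - \log B(\alpha\mathbf{1}_T)\Big]
$$

where $\mathbf{n}_t$ is the vector of per-word counts assigned to topic $t$, $\mathbf{n}_d$ is the vector of per-topic counts in document $d$, and $B(\cdot)$ is the multivariate beta function. The terms not involving counts are constant across iterations, so they can be dropped when all you need is a convergence plot.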