## Topic Metrics

These metrics take as input a single topic.

### Alpha

This metric reads the alpha value for the topic from a Mallet state file. Alpha is generally proportional to the “Token Count” metric listed below, which is computed by default, so this metric may not be necessary. Note that it requires the additional parameter `state_file`. (See “Metrics with additional parameters” below.)

### Attribute Entropy

This metric takes each attribute defined for the analysis and computes the entropy of the topic across that attribute's values. For example, in the campaign speeches dataset, one of the attributes is the candidate who gave the speech corresponding to each document; this metric would compute how spread out across candidates any particular topic is, along with the entropy over every other attribute of the documents. Running this metric script thus produces one metric in the database per attribute.
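As a rough sketch of the computation (the function name and input format here are illustrative, not the actual script's API), the entropy of a topic over one attribute looks like this:

```python
import math
from collections import Counter

def attribute_entropy(token_attribute_values):
    """Entropy (in bits) of a topic across one attribute's values.

    `token_attribute_values` holds one attribute value (e.g. the
    candidate who gave the speech) per token labeled with the topic.
    """
    counts = Counter(token_attribute_values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values())
```

A topic used only in one candidate's speeches has entropy 0; a topic spread evenly across four candidates has entropy 2 bits.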

### Coherence

This metric computes the coherence of the words in a topic as defined by co-occurrence in Wikipedia articles, following the method described in “Automatic Evaluation of Topic Coherence” (Newman et al., NAACL 2010). The script requires a database of counts over Wikipedia, which at 26 GB is far too large to distribute with the code. We plan to post the database we created somewhere you can download it; if you want to compute this metric on your dataset and we still have not posted the database, email us. You can also create a database yourself, as long as it conforms to the schema the script expects, and you could imagine computing “coherence” against sources other than Wikipedia. The schema is the following:

```sql
CREATE TABLE cooccurrence_counts(word_pair primary key, count integer);
CREATE TABLE total_counts(words integer, cooccurrences integer);
CREATE TABLE word_counts(word primary key, count integer);
```

The path to the database of counts must be specified as an additional parameter in the build script. You can specify the additional parameter `counts` for this metric as described below, or you can define the variable `cooccurrence_counts` in your build script and backend.py will do those steps for you (we figured this metric is desirable enough to deserve special code).

### Document Entropy

This metric computes the entropy of the distribution over documents for a given topic. That is, given that a token is labeled with this topic, what is the probability distribution over the documents it came from? We get that distribution from the Mallet state file. This tells you how broadly a particular topic was used in the corpus; we have found that it correlates closely with the log of the topic's token count.
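A minimal sketch of the computation (the input shape is illustrative; the real script extracts these counts from the state file itself):

```python
import math

def document_entropy(doc_token_counts):
    """Entropy (in bits) of a topic's distribution over documents.

    `doc_token_counts[d]` is the number of tokens in document d
    labeled with this topic, as read from the Mallet state file.
    """
    total = sum(doc_token_counts)
    return -sum((c / total) * math.log2(c / total)
                for c in doc_token_counts if c > 0)
```

A topic spread evenly across four documents has entropy 2 bits; a topic confined to a single document has entropy 0.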

### Sentiment

There is a script in the code base that is intended to compute the overall sentiment of a topic. It works by classifying the documents in the database as bearing positive or negative sentiment, then computing the average sentiment of the tokens labeled with each topic. However, it is currently broken, and we do not ship a sentiment model with the code. We plan to fix that, as we have computed this metric on a few specific datasets in house; the code just hasn't been committed in a working state yet. If you have a sentiment classifier and you want to run this metric, it shouldn't be hard to modify the script to fit your needs.

### Subset Document Entropy

This is similar to Document Entropy, except the entropy is computed over a subset of the documents. The subset corresponds to all documents bearing a particular value for an attribute (e.g., all speeches given by Barack Obama). This is computed over all possible values for all attributes, so it produces a large number of metrics; currently that makes the UI really ugly in places, and we are working on fixing that. But it can surface some really interesting information, such as which topics Barack Obama used most consistently in his speeches (in my experiments, these turned out to be “markup in the PDFs containing his speeches,” “John McCain” (his opponent), and “change”; it is remarkable that an unsupervised algorithm and a simple metric can find such things automatically). When combined with the Topic Metric Comparison plot, or just with topic metric filters, this can also answer questions like “Which topics were common to Obama and McCain, and which were unique to each?”

### Subset Token Count

This metric is the Token Count counterpart to Subset Document Entropy: instead of entropy over the subset, we compute the total number of tokens labeled with the topic in that subset. For example, Obama could have given one long speech about a particular topic, which would earn a high value from this metric, while a consistent theme that made up only a small percentage of each speech would get a lower value from this metric than from Subset Document Entropy.

### Token Count

This simply computes the total number of tokens in the dataset labeled with each topic. Ordering the topics by token count will show which topics are largest and which are smallest; typically the largest topics are meaningless, so filtering by coherence can give a better impression of what real topics are most prevalent in the dataset.

### Type Count

Analogous to Token Count, this counts the number of unique word types that are labeled with a topic.

### Word Entropy

This computes the entropy of the word distribution for a topic: is all the mass centered on a few words, or is it spread evenly across many words? Word entropy is to type count as document entropy is to token count: the same computation, applied to words instead of documents.

## Pairwise Topic Metrics

These metrics take as input two topics and compute some number judging their similarity. We do not enforce that the triangle inequality holds, so these are not necessarily “metrics” in the technical sense.

### Document Correlation

This metric compares the usage of topics across documents: topics that are often used together get a high value. To compute it, we take the document vector for each topic (the number of tokens labeled with the topic in each document of the dataset), take the dot product of the two vectors, and divide by the product of their norms.
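In other words, this is the cosine similarity of the two document vectors. A minimal sketch (the function name and input shape are illustrative):

```python
import math

def document_correlation(vec_a, vec_b):
    """Cosine similarity between two topics' document vectors.

    `vec[d]` is the number of tokens labeled with the topic in
    document d; both vectors are indexed by the same documents.
    """
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0
```

Two topics used in identical proportions across documents score 1.0; topics that never co-occur in a document score 0.0.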

### Pairwise Coherence

This is an experimental metric that uses the Coherence topic metric to compare two topics with each other. To compute coherence within a single topic, we compute PMI over Wikipedia between each pair of the topic's top ten words. As a pairwise metric, we instead compute the PMI between every cross-topic pair of words (i.e., for each of the top ten words in topic 1, compute PMI with each of the top ten words in topic 2). The idea is that this will surface semantically related topics, perhaps better than simple Word Correlation (in my experience, Word Correlation is heavily dominated by the first word or two in a topic, while Pairwise Coherence does a better job of matching the topic as a whole semantically). This metric also requires the additional parameter `counts`, as in the Coherence topic metric; and as with Coherence, if you define `cooccurrence_counts`, backend.py will take care of the additional parameter specification for you.

### Word Correlation

This metric is similar to Document Correlation, but we use the word vector instead of the document vector. Thus, this metric gives high values for topics that use similar words in similar proportions.

## Metrics with additional parameters

Some metrics require additional parameters in order to compute (such as the path to a database of counts). We currently implement this in the build system only for topic metrics and pairwise topic metrics; other metrics must be run from the command line if they require additional parameters. To specify the parameters for topic metrics in your build script, define the variable `topic_metric_args` as a dictionary that maps each metric requiring additional parameters to a dictionary of its arguments. For example, to specify the `counts` parameter for the Coherence topic metric, the code would look like this:

```python
topic_metric_args = {}
topic_metric_args['coherence'] = {'counts': '/path/to/counts/db'}
```