UNDER CONSTRUCTION, and will be for quite a while.

Installation And Setup

See Getting Started with the Topical Guide for instructions on installing its prerequisites.

Running the example dataset

The code base includes a collection of State of the Union addresses that we provide as an example dataset. It consists of about 200 documents totaling around 700,000 tokens. To import it, simply check out the code and type

python backend.py

into a terminal (in the base directory of the code, of course). Altogether that command took me about 10-15 minutes to run with a fresh copy of the code. The time broke down as follows:

  • Before running Mallet: about 1 minute
  • Mallet running time: almost 2 minutes
  • Importing the dataset into the database: 5 minutes (4 of which were spent creating the AttributeValueWord table)
  • Importing the analysis into the database: 2.25 minutes
  • After importing the analysis (metrics and so forth): about 1 minute
  • Graph generation: 1 or 2 minutes

Perhaps we should consider having a smaller example dataset, just so someone can get things running in a minute or two instead of 10 or 15. But it's probably not a big deal.

Importing Data into the Topical Guide

We currently use a Python build system called doit to import data. In the base of the repository there is a script called backend.py that takes a directory of text files, tokenizes them, passes them through MALLET, and imports the result into the browser's database. The build system is also smart enough to know if you have already run MALLET and have the necessary output files available.
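
For those unfamiliar with doit, tasks are plain Python functions that declare their inputs and outputs, and doit skips any task whose outputs are already up to date. A minimal sketch of what such a task looks like (the task name and file names here are illustrative, not the actual tasks in our dodo.py):

def task_run_mallet():
    # doit reruns this only if corpus.mallet has changed or
    # state.mallet.gz is missing; otherwise the task is skipped.
    return {
        'actions': ['mallet train-topics --input corpus.mallet '
                    '--output-state state.mallet.gz'],
        'file_dep': ['corpus.mallet'],
        'targets': ['state.mallet.gz'],
    }

This up-to-date checking is how the build system can tell that MALLET output already exists and avoid rerunning it.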

In order to use the build system, you must provide a Python script that defines a few required functions and variables, specifying things like where the data files are found and what to call the dataset. See the dataset script specification for more information.
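
As a rough illustration, such a script might look like the following. The names here are hypothetical; the actual required names are given in the dataset script specification.

# Hypothetical dataset build script; see the dataset script
# specification for the names the build system actually requires.
dataset_name = 'my_dataset'
dataset_description = 'A small corpus of plain-text documents'

def get_data_directory():
    # Directory containing one plain-text file per document.
    return '/path/to/my/documents'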

Some of the steps in the build system take quite a long time if your dataset is very large. We have tried to speed up the dataset and analysis import scripts, and they are much faster than they were, but they can still take a while. The holdup is Django's save mechanism; if we were to build a list of insert statements and do a bulk insert in SQLite itself, it would likely go faster. While the import scripts are slow, the slowest parts output their progress, so you at least know how long you might have to wait.
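
To illustrate the idea, a bulk insert in SQLite would look roughly like this (a sketch of the possible workaround, not code that exists in the Topical Guide; the table and rows are made up):

import sqlite3

conn = sqlite3.connect('example.db')
conn.execute('CREATE TABLE IF NOT EXISTS word_counts '
             '(doc_id INTEGER, word_id INTEGER, count INTEGER)')
rows = [(1, 10, 3), (1, 11, 1), (2, 10, 7)]
# A single executemany inside one transaction avoids the per-row
# overhead of calling save() on each Django model instance.
with conn:
    conn.executemany('INSERT INTO word_counts VALUES (?, ?, ?)', rows)
conn.close()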

In particular, computing pairwise document metrics can take hours when there are thousands of documents (I ran the topic_correlation script with 4325 documents and it took about an hour). Again the holdup is Django; it simply takes too long to insert a row into the database, and this table has millions of rows. If you have a large dataset and want to see results before waiting for that to finish, you can comment that task out in dodo.py and run the scripts from the command line overnight, or something. Everything else still works in the browser without those metrics; you just can't see similar documents.

What to do if you already have Mallet output

The build system will run everything for you, including Mallet. If your dataset is large, running Mallet might take a long time, and you may have already done it. In that case, the build system is smart enough to notice that the Mallet files are already there, as long as you put them in the right place. Unfortunately, I don't know where the right place is, so Josh will have to enlighten us. If you really want to get things working before we fix this documentation, run the example State of the Union code, see where it puts its Mallet output files, and put your files in the same location for your dataset.

Running the Server

To run a locally hosted server, just run the following command from the root directory of the code:

python topic_modeling/manage.py runserver

This will start a web server that is accessible from a browser at localhost:8000. To make the server available to other machines on the same network, run:

python topic_modeling/manage.py runserver 0.0.0.0:8000

See the Django documentation for more information on running the server.

The browser UI

Choice of browser

We have noticed some problems when using Chrome on Linux (Fedora 14, at least) to view the Topic Browser. The only problem we are currently aware of is that the SVGs don't scale properly. On the same machine, Firefox renders everything just fine. The cause may be problems in our CSS and the SVG files we use; we'll have to look into that. In the meantime, Firefox handles whatever problems exist in our markup gracefully.

The Sidebar

The Datasets Tab

The Documents Tab

The Attributes Tab

The Topics Tab

The Plots Tab

The Word Tab

Saving Favorites

Importing Metrics and other extra information

There are a number of metrics available to compute for a variety of items in the database, such as the coherence of the words in a topic or the entropy of a document's topic distribution. These metrics add capabilities to the browser, as described elsewhere on this site. By default we include a subset of the available metrics, and that subset is defined in backend.py.
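
To make one of those concrete: the entropy of a document's topic distribution is ordinary Shannon entropy over the document's topic proportions. A sketch using the standard formula (not the Topical Guide's own implementation):

import math

def topic_entropy(topic_probs):
    # Shannon entropy in nats; topic_probs should be nonnegative
    # and sum to 1.
    return -sum(p * math.log(p) for p in topic_probs if p > 0)

print(topic_entropy([0.97, 0.01, 0.01, 0.01]))  # ~0.17, one dominant topic
print(topic_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.39, evenly spread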

The computed metrics are defined on line 134 of backend.py (at least in commit d54a12ce; searching for topic_metrics should be fruitful if the line number changes). To change the metrics that are run, modify the lists found in backend.py, or define those variables in your dataset's build script; the build script's values override the defaults, as the if statements in backend.py show.
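
For example, your dataset's build script could override the default topic metrics with something like the following (the variable name is the one backend.py checks for; the metric names are only illustrative):

# Overrides the default topic_metrics list in backend.py for this
# dataset only; these metric names are placeholders.
topic_metrics = ['coherence', 'token_count']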

For a complete list of metrics, see the metrics page.

Turbo Topics Collocations

Turbo topics is a method developed by David Blei and John Lafferty to extract meaningful collocations for each topic in a topic model. They released code for their method under the GPL, and we include it with the Topical Guide. You can run turbo topics directly from the build system by adding a line to your build script specifying that you want it run. Somewhere in your script, include the following line:

turbo_topics = True

Setting turbo_topics to False, or leaving out the line, will stop turbo topics from being run when you run backend.py. The code can take a very long time to run (many hours even for small datasets, perhaps days for larger ones), so we recommend not specifying that line when first importing data into the browser. After the initial import is finished and you can browse the data, you can then re-run backend.py to import turbo topics information (you can tell it to just run turbo topics by running

python backend.py turbo_topics

). The part that takes a long time is running Blei and Lafferty's code (it is written in Python, and thus is not the most efficient code out there, though it works), so it will not interfere with using the browser. When it is finished running, the data should appear in the browser along with the standard unigram word cloud.
