nlp-private:running-alfa-on-the-supercomputer [CS Wiki]

Accounts

You must have an account with the Fulton Supercomputing Lab (FSL). If you do not have an account, you can apply here. Be careful to put the project in the correct directory: as of 11/10/07, the project must be in your compute directory to use the batch queue, and in your home directory to use the hdx queue.

Getting Started

When you log in, create a separate folder for your alfa stuff:
```
mkdir alfa
```
Move into the directory.
```
cd alfa
```
Do an svn checkout of the trunk inside your new alfa directory, placing it in a directory named ALFA.
```
svn checkout http://nlp.cs.byu.edu/subversion/alfa/trunk/ ALFA
```
Move into the ALFA directory.
```
cd ALFA
```
Make a data folder:
```
mkdir data
```
Place any data you need into this folder. This may be PTB, BNC, or other datasets. They can be found on entropy.

Send the script to the supercomputer to run:

python scripts/submit.py -t 1 -P 100 -a 1 -cBaseline -m 1 -n1 -v -dPTB -xActiveLearner.xml

This command performs a single run (-n1) of the baseline (-cBaseline) on the full Penn Treebank (-dPTB with -P 100).
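
When queuing several experiments it can be convenient to assemble the command line programmatically. The following is only a minimal sketch and not part of ALFA: the helper `build_submit_command` is hypothetical, and it assumes that `submit.py` accepts each short option and its value as separate arguments (the command above mixes the `-t 1` and `-cBaseline` forms, which suggests this works, but it has not been verified here). `shlex.join` requires Python 3.8 or later.

```python
import shlex

def build_submit_command(options, flags=(), script="scripts/submit.py"):
    """Return an argument list for the submit script.

    `options` maps short option letters to values; `flags` lists
    value-less switches such as "v". Hypothetical helper, not part of ALFA.
    """
    args = ["python", script]
    for letter, value in options.items():
        args += ["-" + letter, str(value)]
    for letter in flags:
        args.append("-" + letter)
    return args

# Same settings as the baseline command above.
baseline = build_submit_command(
    {"t": 1, "P": 100, "a": 1, "c": "Baseline", "m": 1, "n": 1,
     "d": "PTB", "x": "ActiveLearner.xml"},
    flags=["v"],
)
print(shlex.join(baseline))
```

The returned list can be passed directly to `subprocess.run` if you prefer to launch the submission from Python rather than pasting the printed command into the shell.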

Script Parameters

Although the code is largely self-documenting, this format allows for lengthier explanations.

-v    --verbose       Prints more messages to the screen than normal.
-d    --dataset       The dataset you are using for the experiment. We currently have PTB, Syriac, and BNC.
-x    --xml           The XML file used to start the launch. Usually it's either ActiveLearner.xml or MultiTagActiveLearner.xml.
-m    --models        The number of models. Only used for QBC experiments; if not running QBC, set this to 1.
-P    --trainper      The percentage of the file allTraining.txt to be used as training data.
                      allTraining.txt is found in ALFA/data/dataset/, where dataset is PTB, BNC, or Syriac.
                      The data making up this percentage is chosen randomly.
-T    --traintype     The unit used to split off this percentage (either words or sentences). With multiple runs and sufficient data,
                      the split should be about equal, so we usually use sentences.
-a    --amount        The amount of data that starts out as annotated. This amount is either a percentage or a hard number of
                      words or sentences (see -p and -i).
-p    --use_percent   Whether the --amount parameter is a percentage of the data rather than a number of words or sentences.
-i    --inittype      The unit used to measure the --amount. For example, -iword -a50 starts with at least 50 words of annotated data
                      (we don't cut any sentences in half, so we get the fewest number of sentences with at least 50 words).
                      -isentence -a1 -p means start with 1 percent of the sentences as annotated data.
                      We typically start with one sentence (-a1 -isentence).
-s    --batchsize     The size of the batch query. This is how many sentences we give to the oracle each iteration.
-b    --batchtype     The unit of what we give to the oracle. This is either word or sentence.
-c    --comp          The main algorithm used to find uncertainty. For example: QBU, LS, QBC, Baseline, etc.
-n    --numtests      The number of runs of each experiment. Since we typically average 5 runs, -n is usually set to 5.
-t    --time          The estimated time the experiment will take (in hours?). It's generally good to overshoot, since the
                      supercomputer will terminate any processes that go over the specified time.
-f    --filename      Set this if you want to change the filename of the experiments.
-C    --candidates    The number of candidates from which the batch will be chosen. The default is -1, which means make all
                      possible sentences candidates. In order to run an experiment similar to the Engelson and Dagan paper,
                      you'd set -C1000 -s100 -bsentence (I believe).
-O    --switchover    The number of iterations after which you will switch to the random baseline.
-G    --stopping      The number of iterations after which you will stop the program.
-o    --outdir        The main output directory. The default is "out/". This should (if it matters) end with a slash.
-B    --switchbase    Whether the switchover point switches to the baseline or keeps the last model without training.
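
To make the -P and -i/-a descriptions more concrete, here is a minimal sketch of the two selections they describe. It is illustrative only and is not ALFA's actual code: the function names are hypothetical, sentences are assumed to be lists of tokens, the training split is shown at sentence granularity, and the initial annotated set is assumed to be taken from the front of the training data.

```python
import random

def sample_training(sentences, percent, seed=None):
    """Randomly keep `percent` percent of the sentences (the -P idea)."""
    rng = random.Random(seed)
    count = round(len(sentences) * percent / 100)
    return rng.sample(sentences, count)

def seed_by_words(sentences, min_words):
    """Take whole sentences until at least `min_words` words are covered
    (the -iword -a<min_words> idea); no sentence is cut in half."""
    chosen, total = [], 0
    for sentence in sentences:
        if total >= min_words:
            break
        chosen.append(sentence)
        total += len(sentence)
    return chosen

# Toy corpus: four sentences of 30, 15, 10 and 40 tokens.
corpus = [["a"] * 30, ["b"] * 15, ["c"] * 10, ["d"] * 40]
seed_set = seed_by_words(corpus, 50)
print(len(seed_set), sum(len(s) for s in seed_set))  # prints "3 55"
```

With a target of 50 words, the first two sentences cover only 45 words, so the third is included whole and the initial set ends up with 55 words.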

Additional Parameters for Multi-tag options

-S    --subtags     The subtag indices used for a particular run. So, if I want to run a POS Tagger just considering the first subtag,
                     I'd add -S0 to the command line.
-D    --delimeter   What separates the subtags. For Syriac, the delimiter is #.
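
As an illustration of how -S and -D fit together, the sketch below splits a composite tag on the delimiter and keeps only the requested subtag indices. The tag value is made up for the example and the function name is hypothetical; ALFA's own handling may differ, for instance in how the kept subtags are rejoined.

```python
def select_subtags(tag, indices, delimiter="#"):
    """Keep only the subtags at `indices` from a delimiter-separated tag."""
    parts = tag.split(delimiter)
    return delimiter.join(parts[i] for i in indices)

# Made-up composite tag with three subtags.
print(select_subtags("NOUN#MASC#SG", [0]))     # NOUN
print(select_subtags("NOUN#MASC#SG", [0, 2]))  # NOUN#SG
```

The first call corresponds to the -S0 case described above, where only the first subtag is considered.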

nlp-private/running-alfa-on-the-supercomputer.txt · Last modified: 2015/04/23 14:39 by ryancha
