You must have an account with the Fulton Supercomputing Lab (FSL). If you do not have an account, you can apply here. Take care to put the entire project in the correct directory: as of 11/10/07, the project must be in your compute directory to use the batch queue, and in your home directory to use the hdx queue.
mkdir alfa
cd alfa
svn checkout http://nlp.cs.byu.edu/subversion/alfa/trunk/ ALFA
cd ALFA
mkdir data
python scripts/submit.py -t 1 -P 100 -a 1 -cBaseline -m 1 -n1 -v -dPTB -xActiveLearner.xml
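When several experiments differ only in one option, it can help to generate the submit.py command lines rather than type them by hand. The following is a minimal sketch, not part of ALFA: the helper build_submit_command and the COMMON dictionary are hypothetical names, the option letters are the ones documented on this page, and the sketch assumes that submit.py accepts attached values (as in -cBaseline above), that option order does not matter, and that QBU and LS can be launched with the same XML file as Baseline.

```python
import shlex

# Hypothetical helper, not part of ALFA. It only assembles the argument list;
# it does not submit anything to the queue.
def build_submit_command(options, flags=(), script="scripts/submit.py"):
    """Return the argv list for one submit.py invocation."""
    argv = ["python", script]
    for letter, value in options.items():
        # Attached form, e.g. -cBaseline or -n1, as used in the command above.
        argv.append("-{}{}".format(letter, value))
    # Flags that take no value, such as -v.
    argv.extend("-" + letter for letter in flags)
    return argv

# Options shared by every run (same values as the example command above).
COMMON = {"t": 1, "P": 100, "a": 1, "m": 1, "n": 1,
          "d": "PTB", "x": "ActiveLearner.xml"}

commands = []
for comp in ("Baseline", "QBU", "LS"):
    options = dict(COMMON, c=comp)  # vary only the -c algorithm
    commands.append(build_submit_command(options, flags=["v"]))

for argv in commands:
    # shlex.quote keeps each argument safe to paste into a shell.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each printed line can be pasted into the shell from the ALFA directory. Printing the commands instead of running them makes it easy to check the options before anything reaches the queue.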
Although the code is largely self-documenting, this format allows for lengthier explanations.
-v --verbose Prints more messages to the screen than normal.
-d --dataset The dataset you are using for the experiment. We currently have PTB, Syriac, and BNC.
-x --xml The XML file used to start the launch. Usually it is either ActiveLearner.xml or MultiTagActiveLearner.xml.
-m --models The number of models. Only used for QBC experiments; if you are not running QBC, use 1.
-P --trainper The percent of the file allTraining.txt to be used as training data. allTraining.txt is found in ALFA/data/dataset/, where dataset is PTB, BNC, or Syriac. The percent is chosen randomly.
-T --traintype The unit used to split this percent (either words or sentences). With multiple runs and sufficient data, the split should be about equal, so we usually use sentences.
-a --amount The amount of data that starts out as annotated. This amount is either a percentage or a hard number of sentences.
-p --use_percent Whether the --amount parameter is a percentage of the data rather than a number of words or sentences.
-i --inittype The unit used to split the --amount. For example, -iword -a50 starts with at least 50 words of annotated data (we don't cut any sentences in half, so we get the fewest number of sentences with at least 50 words). -isentence -a1 -p means start with 1 percent of the sentences as annotated data. We typically start with one sentence (-a1 -isentence).
-s --batchsize The size of the batch query. This is how many sentences we give to the oracle each iteration.
-b --batchtype The unit we give to the oracle. This is either word or sentence.
-c --comp The main algorithm used to find uncertainty. For example: QBU, LS, QBC, Baseline, etc.
-n --numtests The number of runs of each experiment. Since we typically average 5 runs, -n is usually set to 5.
-t --time The estimated time the experiment will take. It's generally good to overshoot, since the supercomputer will terminate any processes that go over the specified time (in hours?).
-f --filename Use this to change the filename of the experiments.
-C --candidates The number of candidates from which the batch will be chosen. The default is -1, which makes all possible sentences candidates. To run an experiment similar to the Engelson and Dagan paper, you'd set -C1000 -s100 -bsentence (I believe).
-O --switchover The number of iterations after which you will switch to the random baseline.
-G --stopping The number of iterations after which you will stop the program.
-o --outdir The main output directory. The default is "out/". This should (if it matters) end with a slash.
-B --switchbase Whether the switchover point switches to the baseline or keeps the last model without training.
-S --subtags The subtag indices used for a particular run. For example, to run a POS tagger considering only the first subtag, add -S0 to the command line.
-D --delimeter What separates the subtags. For Syriac, the delimiter is #.
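The interaction between --inittype and --amount is easiest to see on a small example. The sketch below is an illustration of the rule described above (whole sentences only, stopping at the fewest sentences that reach the requested number of words), not ALFA's actual code: the function initial_seed is a hypothetical name, it assumes sentences are taken in their given order, and it does not cover the -p percentage mode.

```python
# Hypothetical illustration of -i/--inittype and -a/--amount; not ALFA's code.
# Assumption: sentences are taken in the order given. ALFA may pick them
# differently (for example, at random).
def initial_seed(sentences, amount, inittype="sentence"):
    """Return the leading sentences that make up the initial annotated data.

    sentences: list of sentences, each a list of words
    amount:    number of sentences, or minimum number of words
    inittype:  "sentence" or "word"
    """
    if inittype == "sentence":
        return sentences[:amount]
    if inittype != "word":
        raise ValueError("inittype must be 'word' or 'sentence'")

    seed, words = [], 0
    for sentence in sentences:
        if words >= amount:
            break  # enough words already; never split a sentence
        seed.append(sentence)
        words += len(sentence)
    return seed

# Four toy sentences of 20, 25, 10, and 30 words.
pool = [["w"] * n for n in (20, 25, 10, 30)]

seed = initial_seed(pool, 50, inittype="word")   # like -iword -a50
print(len(seed), sum(len(s) for s in seed))      # prints: 3 55

seed = initial_seed(pool, 1)                     # like -a1 -isentence
print(len(seed))                                 # prints: 1
```

With -iword -a50 the first two sentences give only 45 words, so the third is added whole and the seed ends up with 55 words rather than exactly 50.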