== Accounts ==

You must have an account with the Fulton Supercomputing Lab (FSL). If you do not have an account, you can apply [http://fsl.byu.edu/gettingAnAccount.php here].
Be careful to put the project in the correct directory. As of now (11/10/07), the project must be in your compute directory to use the batch queue; for the hdx queue, the project must be in your home directory.

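For example (a sketch only; the exact path of your compute directory on FSL is an assumption here, so check the FSL documentation), you would move into the appropriate directory before checking the project out:
<pre>
# batch queue: work out of your compute directory (the ~/compute path is an assumption)
cd ~/compute
# hdx queue: work out of your home directory
cd ~
</pre>
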
== Getting Started ==

* When you log in, create a separate folder for your alfa work: <pre>mkdir alfa</pre>
* Move into the directory. <pre>cd alfa</pre>
* Do an svn checkout inside of your new alfa directory, placing it in an ALFA directory. <pre>svn checkout http://nlp.cs.byu.edu/subversion/alfa/trunk/ ALFA</pre>
* Move into the ALFA directory. <pre>cd ALFA</pre>
* Make a data folder: <pre>mkdir data</pre>
* Place any data you need into this folder. This may be PTB, BNC, or others; they can be found on entropy (see the example after this list).
* Send the script to the supercomputer to run. <pre>python scripts/submit.py -t 1 -P 100 -a 1 -cBaseline -m 1 -n1 -v -dPTB -xActiveLearner.xml</pre>
* This command runs the baseline once on the full Penn Treebank.
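
To get a dataset from entropy onto the supercomputer, something like the following should work (a sketch only; the username, hostname, and source path are placeholders, so substitute the real location of the data):
<pre>
# copy the PTB data from entropy into ALFA/data/ (paths below are placeholders)
scp -r username@entropy.cs.byu.edu:/path/to/PTB data/PTB
</pre>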

== Script Parameters ==

As much as the code is self-documenting, this format allows for lengthier explanations.
<pre>
-v    --verbose     Prints more messages to the screen than normal.
-d    --dataset     The dataset you are using for the experiment. We currently have PTB, Syriac, and BNC.
-x    --xml         The xml file used to start the launch. Usually it's either ActiveLearner.xml or MultiTagActiveLearner.xml.
-m    --models      Only used for QBC experiments. If not running QBC, use 1 for the number of models.
-P    --trainper    The percent of the file allTraining.txt to be used as training data.
                     allTraining.txt is found in ALFA/data/dataset/, where dataset is PTB, BNC, or Syriac.
                     The percent is chosen randomly.
-T    --traintype   The type used to split this percent (either words or sentences). With multiple runs and sufficient data,
                     the split should be about equal, so we usually use sentences.
-a    --amount      The amount of data that starts out as annotated. This amount is either a percentage or a hard number of
                     sentences.
-p    --use_percent Whether the --amount parameter is a percentage of the data or a number of words or sentences.
-i    --inittype    The type of data used to split the --amount. For example, -iword -a50 starts with at least 50 words of
                     annotated data. (We don't cut any sentences in half, so we get the fewest number of sentences with at
                     least 50 words.) -isentence -a1 -p means start with 1 percent of the sentences as annotated data.
                     We typically start with one sentence (-a1 -isentence).
-s    --batchsize   The size of the batch query. This is how many sentences we give to the oracle each iteration.
-b    --batchtype   The type we give to the oracle. This is either word or sentence.
-c    --comp        The main algorithm used to find uncertainty. For example: QBU, LS, QBC, Baseline, etc.
-n    --numtests    The number of runs of each experiment we want. Since we typically average 5 runs, -n is usually set to 5.
-t    --time        The estimated time (in hours?) the experiment will take. It's generally good to overshoot, since the
                     supercomputer will terminate any processes that go over the specified time.
-f    --filename    Use if you want to change the filename of the experiments.
-C    --candidates  The number of candidates from which the batch will be chosen. The default is -1, which means make all
                     possible sentences candidates. In order to run an experiment similar to the Engelson and Dagan paper,
                     you'd set -C1000 -s100 -bsentence (I believe).
-O    --switchover  The number of iterations after which you will switch to the random baseline.
-G    --stopping    The number of iterations after which you will stop the program.
-o    --outdir      The main output directory. The default is "out/". This should (if it matters) end with a slash.
-B    --switchbase  Whether the switchover point switches to the baseline or keeps the last model without training.
</pre>
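
Putting a few of these together: a QBU experiment averaged over 5 runs, starting from one annotated sentence and querying the oracle in sentence batches of 100, might look something like this (a sketch only; adjust the time estimate, batch size, and dataset to your experiment):
<pre>
# sketch: QBU, 5 runs, sentence batches of 100, starting from one annotated sentence
python scripts/submit.py -t 4 -P 100 -a 1 -isentence -s100 -bsentence -cQBU -m 1 -n5 -v -dPTB -xActiveLearner.xml
</pre>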

== Additional Parameters for Multi-tag options ==

<pre>
-S    --subtags     The subtag indices used for a particular run. So, if I want to run a POS tagger considering just the first
                     subtag, I'd add -S0 to the command line.
-D    --delimeter   What separates the subtags. For Syriac, the delimiter is #.
</pre>
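
For example, a multi-tag run on Syriac that considers only the first subtag might look something like this (a sketch only; this exact combination of flags is an assumption, and the remaining flags follow the same conventions as above):
<pre>
# sketch: Syriac multi-tag baseline using only the first subtag, with # as the subtag delimiter
python scripts/submit.py -t 4 -P 100 -a 1 -cBaseline -m 1 -n1 -v -dSyriac -xMultiTagActiveLearner.xml -S0 -D#
</pre>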
  