nlp-private:introduction-to-language-id [CS Wiki]

Introductory Tasks

Running An Experiment in Ten Shell Commands Or Less

Ensure that your system meets the current system requirements. On an Ubuntu Linux system, this can be done using a command such as this: 
```
sudo apt-get install sun-java6-jdk ant perl ruby gnuplot gcc cmake make pdl subversion
```
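Before going further, it can help to confirm that the tools are actually on your PATH. The following is a minimal sketch, not part of the repository: the helper name `find_missing` is hypothetical, and the command names (for example `svn` for the subversion package and `java` for the JDK) are assumptions about what the packages above install.

```shell
#!/bin/sh
# Print each named command that cannot be found on PATH, one per line.
# (find_missing is a hypothetical helper, not part of the Language ID repository.)
find_missing() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Commands assumed to be provided by the packages listed above.
missing=$(find_missing java ant perl ruby gnuplot gcc cmake make svn)

# PDL is a Perl module rather than a command, so try to load it instead.
perl -MPDL -e 1 >/dev/null 2>&1 || missing="$missing PDL"

if [ -n "$missing" ]; then
    # Left unquoted on purpose so the names are joined onto one line.
    echo "Missing:" $missing
else
    echo "All required tools were found."
fi
```

Anything reported as missing can then be installed with the package manager before continuing.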

Check out the current Language ID repository:

svn co http://nlp.cs.byu.edu/subversion/NIST/HEAD

Enter the HEAD directory created by the previous command: 
```
cd HEAD
```
Be sure that the template from which the main configuration file will be generated represents your setup by editing `Language-ID/config/language-id.conf.cmake`. Do this using your favorite text editor, for example:
```
vim Language-ID/config/language-id.conf.cmake
```
Pay particular attention to `PRAAT_EXE`, `PERL_EXE`, `WAV_DATA_ORIGINAL_LOCATION`, and `SEG_DATA_ORIGINAL_LOCATION` (or their `*_WIN` counterparts if running on Windows) to be sure that these contain the correct paths. (`LABTOOLS` seems to be unused at the moment.)
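To review the current values of the path settings named above without opening an editor, something like the following may be useful. It is only a sketch: `show_path_settings` is a hypothetical helper, and it simply prints every line that mentions one of the four names (which also catches the `*_WIN` variants), making no assumption about the file's syntax.

```shell
# Print, with line numbers, the lines of a file that mention the path settings.
# (show_path_settings is a hypothetical helper, not part of the repository.)
show_path_settings() {
    grep -nE 'PRAAT_EXE|PERL_EXE|WAV_DATA_ORIGINAL_LOCATION|SEG_DATA_ORIGINAL_LOCATION' "$1"
}

template=Language-ID/config/language-id.conf.cmake
if [ -f "$template" ]; then
    show_path_settings "$template"
else
    echo "Template not found; run this from the HEAD directory." >&2
fi
```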
Generate the build system with default settings using cmake: 
```
cmake .
```
This also generates the configuration file `Language-ID/config/language-id.conf` based on the settings you provided in `Language-ID/config/language-id.conf.cmake`.
Build DETware: 
```
make
```
Build Language-ID: 
```
cd Language-ID
ant
```
Run the experiment: 
```
cd ..
make detcurve
```
The `detcurve` target will create the `experiments` directory in which all experiment data will be stored. It will then copy seg files and wav files into `experiments/data` and begin extracting features from these data files by running the seg2xml3.pl script. Once these preliminary data have been copied and analyzed, a directory specific to the current experiment will be created. As the default experiment is `fourgramall`, a directory called `experiments/fourgramall` will be created to house results specific to that experiment. Next, ling files will be generated and placed in `experiments/fourgramall/data`, and language models will be trained and placed in `experiments/fourgramall/models`. Other results, including metrics and plot data, will be placed in `experiments/fourgramall/results`.

Explore the DET curves, avgcost.txt, avgeer.txt, etc. in the `results` directory.
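The build-and-run steps above can be gathered into one shell function, sketched below. The function name `run_default_experiment` is hypothetical; it assumes you are already inside the `HEAD` checkout and have edited the configuration template. The commands themselves are the ones given above.

```shell
# Build everything and run the default experiment.
# Assumes the current directory is the HEAD checkout and that
# Language-ID/config/language-id.conf.cmake has already been edited.
# (run_default_experiment is a hypothetical name, not part of the repository.)
run_default_experiment() {
    cmake . &&                  # generate the build system and language-id.conf
    make &&                     # build DETware
    (cd Language-ID && ant) &&  # build Language-ID in a subshell, so we stay in HEAD
    make detcurve               # run the experiment
}

# Usage (from HEAD):
#   run_default_experiment && ls experiments/fourgramall/results
```

Chaining the steps with `&&` stops the run at the first step that fails, so a broken build never reaches `make detcurve`.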

Running a Specific Experiment

The example above runs only the default 'fourgramall' experiment. Running a specific experiment requires almost exactly the same process. For example, to run the 'fivegram' experiment, substitute the following cmake command:

cmake -D FEATURE_SET_NAME_FORCE=fivegram .

Then proceed to build the `detcurve` target as before:

make detcurve

This time, results will be stored in `experiments/fivegram` rather than `experiments/fourgramall` as previously.

Other parameters can be set as well, such as the normalization option for resultbuilder.pl and the result name that determines where plot output is stored:

cmake -D FEATURE_SET_NAME_FORCE=fivegram -D RBLDR_NORM_FORCE=1 -D RESULT_NAME_FORCE=nist_norm1 .

This prepares the system for running `fivegram` with a normalization of 1. It will also set the result name to `nist_norm1` to differentiate this run from our previous one. We can now run the experiment:

make detcurve
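If you switch between experiments often, the configure-then-run pattern above could be wrapped in a small function. This is a sketch only: `run_experiment` is a hypothetical name, and it assumes that the normalization and result-name options may each be left out independently, which the text above does not confirm.

```shell
# Configure and run one experiment from the HEAD directory.
#   $1 = feature set name (e.g. fivegram)
#   $2 = optional normalization option for resultbuilder.pl
#   $3 = optional result name
# (run_experiment is a hypothetical helper, not part of the repository.)
run_experiment() {
    feature_set=$1
    norm=${2:-}
    result_name=${3:-}

    # Collect the cmake arguments in the positional parameters.
    set -- -D "FEATURE_SET_NAME_FORCE=$feature_set"
    if [ -n "$norm" ]; then
        set -- "$@" -D "RBLDR_NORM_FORCE=$norm"
    fi
    if [ -n "$result_name" ]; then
        set -- "$@" -D "RESULT_NAME_FORCE=$result_name"
    fi

    cmake "$@" . && make detcurve
}

# Usage, equivalent to the commands above:
#   run_experiment fivegram 1 nist_norm1
```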

The parameters we set in our most recent run of cmake are the exact parameters used by the regression test (see Regression Tests below).

Running All Experiments

If you wish to build all defined experiments, simply run `Language-ID/scripts/runall.rb`.

Regression Tests

Regression tests have been created to help verify the integrity of any changes we make to the system. The regression tests can be run by invoking `Language-ID/scripts/run_all_tests.rb` from within the `HEAD` directory. This will attempt to compare all currently-built experiments to the baseline data stored in `Language-ID/regression` to check that the output has not changed significantly.

Papers

The following papers provide vital background information to the problem that the Spoken Language ID project is tackling:

Spoken Language ID

nlp-private/introduction-to-language-id.txt · Last modified: 2015/04/22 15:09 by ryancha
