The Spoken Language ID project seeks to….
sudo apt-get install sun-java6-jdk ant perl ruby gnuplot gcc cmake make pdl subversion
svn co <nowiki>http://nlp.cs.byu.edu/subversion/NIST/HEAD</nowiki>
cd HEAD
Edit
Language-ID/config/language-id.conf.cmake
. Do this using your favorite text editor, for example:
vim Language-ID/config/language-id.conf.cmake
. Pay particular attention to
PRAAT_EXE
,
PERL_EXE
,
WAV_DATA_ORIGINAL_LOCATION
, and
SEG_DATA_ORIGINAL_LOCATION
(or their
*_WIN
counterparts if running on Windows) to be sure that these contain the correct paths. (
LABTOOLS
seems to be unused at the moment.)
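Before running cmake, it can help to list the current values of these settings in one pass. The following is a minimal sketch that assumes only that the variable names appear literally in the file; the helper name `show_lid_paths` is hypothetical and not part of the project. Because the pattern matches substrings, the `*_WIN` counterparts are listed as well.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): print every line of a
# config file that mentions one of the path settings named above.
show_lid_paths() {
    conf_file="$1"
    # -n prefixes each match with its line number; -E enables alternation.
    grep -nE 'PRAAT_EXE|PERL_EXE|WAV_DATA_ORIGINAL_LOCATION|SEG_DATA_ORIGINAL_LOCATION' "$conf_file"
}

# Example call, run from the HEAD directory:
# show_lid_paths Language-ID/config/language-id.conf.cmake
```

Each reported line can then be checked by hand against the actual install locations on the machine.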
cmake .
<br/>This also generates the configuration file
Language-ID/config/language-id.conf
based on the settings you provided in
Language-ID/config/language-id.conf.cmake
.
make
cd Language-ID<br/>ant
cd ..<br/>make detcurve
<br/>The
detcurve
target will create the
experiments
directory in which all experiment data will be stored. It will then copy seg files and wav files into
experiments/data
and begin extracting features from these data files by running the seg2xml3.pl script.<br/>Once these preliminary data have been copied and analyzed, a directory specific to the current experiment will be created. As the default experiment is
fourgramall
, a directory called
experiments/fourgramall
will be created to house results specific to that experiment. ling files will be generated and placed in
experiments/fourgramall/data
. Language models will be trained and placed in
experiments/fourgramall/models
. Other results, including metrics and plot data, will be placed in
experiments/fourgramall/results
.
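The directory layout described above can be checked after a run with a short script. This is only a sketch: the helper name `check_experiment_dirs` is hypothetical, and it assumes the directories are created relative to the directory in which `make detcurve` was run.

```shell
#!/bin/sh
# Hypothetical helper (not part of the project): report which of the output
# directories described above exist for a given experiment name.
check_experiment_dirs() {
    name="${1:-fourgramall}"   # default experiment named in the text
    root="${2:-experiments}"   # directory created by the detcurve target
    missing=0
    for dir in "$root/data" "$root/$name/data" "$root/$name/models" "$root/$name/results"; do
        if [ -d "$dir" ]; then
            echo "ok       $dir"
        else
            echo "missing  $dir"
            missing=1
        fi
    done
    # Exit status is 0 only when every directory was found.
    return "$missing"
}

# Example call, run from the HEAD directory after `make detcurve`:
# check_experiment_dirs fourgramall
```

The non-zero return status makes the helper usable in a larger script that should stop when an expected directory is absent.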
The example above runs only the default 'fourgramall' experiment. Running a different experiment follows almost exactly the same process. For example, to run the 'fivegram' experiment, substitute the following cmake command:
cmake -D FEATURE_SET_NAME_FORCE=fivegram .
Then proceed to build the
detcurve
target as before:
make detcurve
This time, results will be stored in
experiments/fivegram
rather than
experiments/fourgramall
as previously.
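When several experiments need to be run back to back, the two commands can be wrapped in a loop. The sketch below is an illustration rather than a project script: the function name `run_experiments` and the `DRY_RUN` variable are hypothetical, and passing the default name `fourgramall` explicitly through `FEATURE_SET_NAME_FORCE` is assumed to behave the same as omitting the option.

```shell
#!/bin/sh
# Hypothetical wrapper (not part of the project): configure and build the
# detcurve target once per experiment name given as an argument.
# Set DRY_RUN=1 to print the commands instead of executing them.
run_experiments() {
    for name in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "cmake -D FEATURE_SET_NAME_FORCE=$name ."
            echo "make detcurve"
        else
            # Stop at the first failure so later runs do not hide it.
            cmake -D FEATURE_SET_NAME_FORCE="$name" . || return 1
            make detcurve || return 1
        fi
    done
}

# Example call, run from the HEAD directory:
# DRY_RUN=1 run_experiments fourgramall fivegram
```

Running with `DRY_RUN=1` first shows exactly which commands would be issued before committing to a long build.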
Other parameters can also be set, such as the normalization option for resultbuilder.pl and the result name that determines where plot output is stored:
cmake -D FEATURE_SET_NAME_FORCE=fivegram -D RBLDR_NORM_FORCE=1 -D RESULT_NAME_FORCE=nist_norm1 .
<br/> This prepares the system for running
fivegram
with a normalization of 1. It will also set the result name to be
nist_norm1
to differentiate this run from our previous one. We can now run the experiment:
make detcurve
The parameters we set in our most recent run of cmake are exactly the parameters used by the regression test.
If you wish to build all defined experiments, simply run
Language-ID/scripts/[[runall.rb]]
.
Regression tests have been created to help verify the integrity of any changes we make to the system. The regression tests can be run by invoking
Language-ID/scripts/[[run_all_tests.rb]]
from within the
HEAD
directory. This will attempt to compare all currently-built experiments to the baseline data stored in
Language-ID/regression
to check that the output has not changed significantly.
The following papers provide vital background information on the problem that the Spoken Language ID project is tackling: