This plan resulted from a discussion between Josh and Dr. Ringger on 18 December 2007.
Objective: modularize the definition file mechanism (reducing redundancy and improving reuse) and make the definition format more readable and usable.
Steps
Create a regression test for every experiment (this replaces the original plan to create one from a single experiment using a complex .def.xml file). DONE in r394.
Check for regressions.
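A regression check of this kind can be as simple as comparing an experiment's current output against a stored baseline. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes the experiment output can be treated as a list of text lines, which the plan above does not specify.

```java
import java.util.Arrays;
import java.util.List;

final class RegressionCheckSketch {
    /**
     * Returns the 1-based number of the first line at which the two outputs
     * differ, or -1 if they are identical. A missing line counts as a difference.
     */
    static int firstDifference(List<String> baseline, List<String> current) {
        int shared = Math.min(baseline.size(), current.size());
        for (int i = 0; i < shared; i++) {
            if (!baseline.get(i).equals(current.get(i))) {
                return i + 1;
            }
        }
        return baseline.size() == current.size() ? -1 : shared + 1;
    }

    public static void main(String[] args) {
        // Hypothetical output lines, used only to exercise the comparison.
        List<String> baseline = Arrays.asList("line one", "line two");
        List<String> current = Arrays.asList("line one", "line 2");
        int line = firstDifference(baseline, current);
        System.out.println(line == -1 ? "no regression" : "first difference at line " + line);
    }
}
```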
Flatten the XML files using the new inclusion mechanism (DONE in r399):
Quantizations are now defined in XML files in Language-ID/config/quantizations/
Feature templates are now defined in XML files in Language-ID/config/features/
The .def.xml files in Language-ID/config/feature_sets/ now use XML external entity inclusions to refer to the quantization and feature template files.
The .def.xml syntax now requires the feature type to be specified. <feature> tags are no longer valid and must be replaced by <file_feature>, <slice_feature>, or <count_feature> tags.
NEW Based on this new syntax, we could remove the requirement for <file_features>, <slice_features>, <count_features>, and <quantizations> groups.
Check for regressions.
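To illustrate the inclusion mechanism and the typed feature tags from the step above, here is a minimal sketch that writes a small .def.xml and one shared quantization file to a temporary directory and parses them with the JDK's DOM parser. Only the tag names <file_feature>, <slice_feature>, and <count_feature> come from this page. The root element <definition>, the <quantization> element, the name attributes, the entity name, and the file names are all assumptions made for illustration, not the project's actual format.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EntityInclusionSketch {
    // Hypothetical contents of a shared file under config/quantizations/.
    static final String QUANTIZATION_XML =
        "<quantization name=\"example_quantization\"/>\n";

    // Hypothetical .def.xml under config/feature_sets/ that includes the
    // shared file through an XML external entity.
    static final String DEF_XML =
        "<?xml version=\"1.0\"?>\n"
        + "<!DOCTYPE definition [\n"
        + "  <!ENTITY example_quantization SYSTEM \"../quantizations/example.xml\">\n"
        + "]>\n"
        + "<definition>\n"
        + "  &example_quantization;\n"
        + "  <file_feature name=\"example_file_feature\"/>\n"
        + "  <slice_feature name=\"example_slice_feature\"/>\n"
        + "  <count_feature name=\"example_count_feature\"/>\n"
        + "</definition>\n";

    /** Writes both files, parses the .def.xml, and counts elements with the given tag. */
    static int countElements(String tagName) {
        try {
            // Temporary files are left in place; a real test would clean them up.
            Path config = Files.createTempDirectory("config");
            Path quantizations = Files.createDirectories(config.resolve("quantizations"));
            Path featureSets = Files.createDirectories(config.resolve("feature_sets"));
            Files.write(quantizations.resolve("example.xml"),
                        QUANTIZATION_XML.getBytes(StandardCharsets.UTF_8));
            Path def = featureSets.resolve("example.def.xml");
            Files.write(def, DEF_XML.getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Allow external entities to be read from local files only.
            factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "file");
            // Replace entity references with the included content in the DOM.
            factory.setExpandEntityReferences(true);
            Document document = factory.newDocumentBuilder().parse(def.toFile());
            return document.getElementsByTagName(tagName).getLength();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("quantization elements: " + countElements("quantization"));
        System.out.println("slice_feature elements: " + countElements("slice_feature"));
    }
}
```

Because the parser expands the entity, code that reads the resulting DOM sees the included quantization as if it had been written inline in the .def.xml file.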
NEW Allow for one .def.xml file to “extend” another in an object-oriented inheritance sense. Thus when a lot of common code is shared between two defs, the simpler one could be extended by the more complex one, and so on. Implement this as an option in code, and then restructure the defs accordingly.
Check for regressions.
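One possible reading of "extend" is sketched below: a definition inherits every feature of its parent and may add or override entries. The class name, the feature names, and the override-by-name rule are assumptions for illustration; the step above does not fix these semantics.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DefInheritanceSketch {
    /** Hypothetical in-memory form of one .def.xml file. */
    static final class Definition {
        final String name;
        final Definition parent; // null when this definition extends nothing
        final Map<String, String> features = new LinkedHashMap<String, String>();

        Definition(String name, Definition parent) {
            this.name = name;
            this.parent = parent;
        }

        /** Registers a feature name with its tag type and returns this for chaining. */
        Definition feature(String featureName, String tagType) {
            features.put(featureName, tagType);
            return this;
        }

        /** Parent features first, then this definition's own; same-named entries are overridden. */
        Map<String, String> resolve() {
            Map<String, String> resolved =
                parent == null ? new LinkedHashMap<String, String>() : parent.resolve();
            resolved.putAll(features);
            return resolved;
        }
    }

    /** Builds a simple definition and a more complex one that extends it. */
    static Map<String, String> demo() {
        Definition simple = new Definition("simple", null)
            .feature("duration", "file_feature")
            .feature("energy", "slice_feature");
        Definition complex = new Definition("complex", simple)
            .feature("energy", "count_feature")       // overrides the parent's entry
            .feature("phone_count", "count_feature"); // adds a new entry
        return complex.resolve();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Resolving from the root of the chain downward keeps the shared features in one place, so the more complex definition only lists what it adds or changes.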
StatNLP Integration:
Move the feature definition mechanism into StatNLP.
Integrate this mechanism into the PNP experiment.
Apply StatNLP's ExperimentHarness system to SpokenLID experiments.
Check for regressions.
-
Define relevant data structures in Java (FeatureSet, FeatureDefinition) or ensure that they already exist (edu.byu.langid.features.Quantization and edu.byu.langid.features.Quantization.Quantile).
Use a scripting language supported by Java's script engines framework to implement the DSL in terms of the data structures.
Switch from static .def.xml files to DSL-based definitions. For now the DSL scripts will generate corresponding XML files until we retool the batch feature extractor (FeatureFileBatchConverter) to use DSL-based definitions directly.
Check for regressions.
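The interim arrangement, where definitions are built from data structures and then written out as XML, might look roughly like the sketch below. It uses a plain Java fluent builder as a stand-in for the scripted DSL, since no scripting language has been chosen here. FeatureSet and FeatureDefinition are the names proposed above, but their fields and methods, the <feature_set> root element, and the name attribute are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

final class DslToXmlSketch {
    /** Hypothetical shape of one feature definition: a tag type plus a name. */
    static final class FeatureDefinition {
        final String tag;
        final String name;

        FeatureDefinition(String tag, String name) {
            this.tag = tag;
            this.name = name;
        }
    }

    /** Hypothetical feature set with a small fluent interface. */
    static final class FeatureSet {
        private final List<FeatureDefinition> features = new ArrayList<FeatureDefinition>();

        FeatureSet fileFeature(String name) { return add("file_feature", name); }
        FeatureSet sliceFeature(String name) { return add("slice_feature", name); }
        FeatureSet countFeature(String name) { return add("count_feature", name); }

        private FeatureSet add(String tag, String name) {
            features.add(new FeatureDefinition(tag, name));
            return this;
        }

        /** Emits one element per feature, in insertion order. */
        String toXml() {
            StringBuilder xml = new StringBuilder("<feature_set>\n");
            for (FeatureDefinition f : features) {
                xml.append("  <").append(f.tag)
                   .append(" name=\"").append(escape(f.name)).append("\"/>\n");
            }
            return xml.append("</feature_set>\n").toString();
        }
    }

    /** Escapes the characters that are unsafe inside a double-quoted XML attribute. */
    static String escape(String value) {
        return value.replace("&", "&amp;")
                    .replace("<", "&lt;")
                    .replace(">", "&gt;")
                    .replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        System.out.print(new FeatureSet()
            .fileFeature("duration")
            .sliceFeature("energy")
            .toXml());
    }
}
```

A script run through Java's script engines framework could call the same builder methods, which would keep the XML generation in one place until the batch feature extractor reads the definitions directly.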