nlp-private:feature-definition-xml-file-roadmap [CS Wiki]

Resulting from a discussion between Josh and Dr. Ringger on 18 December 2007.

Objective: to modularize the definition file mechanism (reducing redundancy/improving reuse) and to increase readability/usability of the definition format.

Steps

Create a regression test for every experiment (this replaces the original plan to create one from a single experiment using a complex .def.xml file) – DONE in r394
In FeatureDefinitionFileParser, migrate from the custom XML parser built by Nathan (SimpleDOMParser) to the standard (Xerces) parser provided by Java.
- Motivation: Allow inclusions at the XML level using external entities.
- Implication: We have to replace the `<`, `>`, and `&` symbols with the standard XML entity references:
  - `<` becomes `&lt;`
  - `>` becomes `&gt;`
  - `&` becomes `&amp;`
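The replacements listed above could be applied mechanically while migrating existing definition files. The following is only a minimal sketch: the `XmlEscaper` class and its `escape` method are hypothetical names, not part of the project. It works in a single pass over the input, so an ampersand produced by one replacement is never escaped a second time.

```java
// Hypothetical helper (not part of the project): escapes the three characters
// listed above so the text can be placed inside an XML element.
final class XmlEscaper {
    private XmlEscaper() {}

    static String escape(String raw) {
        StringBuilder out = new StringBuilder(raw.length() + 16);
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;");  break;
                case '>': out.append("&gt;");  break;
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: energy &lt; 0.5 &amp;&amp; pitch &gt; 100
        System.out.println(escape("energy < 0.5 && pitch > 100"));
    }
}
```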
Check for regressions.
Flatten the XML files using the new inclusions mechanism – DONE in r399
- Quantizations are now defined in xml files in Language-ID/config/quantizations/
- Feature templates are now defined in xml files in Language-ID/config/features/
- The .def.xml files in Language-ID/config/feature_sets/ now use XML external entity inclusions to refer to the quantization and feature template files.
- def.xml syntax now requires the type of feature to be specified. `<feature>` tags are no longer valid and must be replaced by `<file_feature>`, `<slice_feature>`, or `<count_feature>` tags.
  - NEW Based on this new syntax, we could remove the requirement for `<file_features>`, `<slice_features>`, `<count_features>`, and `<quantizations>` groups.
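To illustrate how an XML external entity inclusion behaves under the standard Java parser, here is a small self-contained sketch. It is not the project's actual file layout. The `<feature_set>` root element, the `name` attribute, and both file names are assumptions made for the example; only the `<slice_feature>` tag comes from the syntax described above. The sketch writes two throwaway files, parses the one that declares the entity, and counts the elements pulled in from the other.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

final class EntityInclusionSketch {
    /** Returns how many slice_feature elements are visible after the inclusion is expanded. */
    static int countIncludedFeatures() {
        try {
            Path dir = Files.createTempDirectory("defxml");

            // Hypothetical template fragment: an external parsed entity, so it has no DOCTYPE.
            Files.write(dir.resolve("templates.xml"),
                "<slice_feature name=\"a\"/><slice_feature name=\"b\"/>"
                    .getBytes(StandardCharsets.UTF_8));

            // Hypothetical definition file that declares and references the entity.
            String def =
                  "<?xml version=\"1.0\"?>\n"
                + "<!DOCTYPE feature_set [\n"
                + "  <!ENTITY templates SYSTEM \"templates.xml\">\n"
                + "]>\n"
                + "<feature_set>&templates;</feature_set>\n";
            Path defFile = dir.resolve("example.def.xml");
            Files.write(defFile, def.getBytes(StandardCharsets.UTF_8));

            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            // Only allow external entities that are loaded from local files.
            factory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "file");

            // Parsing from a File lets the relative SYSTEM identifier resolve
            // against the directory of the definition file.
            Document doc = factory.newDocumentBuilder().parse(defFile.toFile());
            return doc.getElementsByTagName("slice_feature").getLength();
        } catch (Exception e) {
            throw new IllegalStateException("entity inclusion sketch failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(countIncludedFeatures());
    }
}
```

Because the parser expands the entity while it builds the DOM, code that walks the resulting tree does not need to know that the definitions were split across files.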
Check for regressions.
NEW Allow for one .def.xml file to “extend” another in an object-oriented inheritance sense. Thus when a lot of common code is shared between two defs, the simpler one could be extended by the more complex one, and so on. Implement this as an option in code, and then restructure the defs accordingly.
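One possible reading of "extend" is sketched below, under stated assumptions: each definition holds named features, a child sees its parent's features first, and a child entry with the same name overrides the parent's. The `Definition` class, the feature names, and the override rule are all hypothetical. The roadmap does not specify how conflicts should be handled.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DefinitionInheritanceSketch {
    /** Hypothetical in-memory form of one .def.xml file. */
    static final class Definition {
        final String name;
        final Definition parent; // null when this definition extends nothing
        final Map<String, String> features = new LinkedHashMap<String, String>();

        Definition(String name, Definition parent) {
            this.name = name;
            this.parent = parent;
        }

        Definition feature(String featureName, String spec) {
            features.put(featureName, spec);
            return this;
        }

        /** Parent features first, then this definition's own (same name overrides). */
        Map<String, String> resolve() {
            Map<String, String> resolved;
            if (parent == null) {
                resolved = new LinkedHashMap<String, String>();
            } else {
                resolved = parent.resolve();
            }
            resolved.putAll(features);
            return resolved;
        }
    }

    public static void main(String[] args) {
        Definition simple = new Definition("simple", null)
            .feature("energy", "mean")
            .feature("pitch", "mean");
        Definition complex = new Definition("complex", simple)
            .feature("pitch", "quantized")   // overrides the inherited entry
            .feature("duration", "count");   // adds a new entry
        // prints: {energy=mean, pitch=quantized, duration=count}
        System.out.println(complex.resolve());
    }
}
```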
Check for regressions.
StatNLP Integration:
1. Move the feature definition mechanism into StatNLP.
2. Integrate this mechanism into the PNP experiment.
3. Apply StatNLP's ExperimentHarness system to SpokenLID experiments.
Check for regressions.
Develop a Domain Specific Language to describe the feature definitions.
1. Define relevant data structures in Java (FeatureSet?, FeatureDefinition?) or ensure that they already exist (edu.byu.langid.features.Quantization and edu.byu.langid.features.Quantization.Quantile)
2. Use a scripting language supported by Java's script engines framework to implement the DSL in terms of the data structures.
3. Switch from static .def.xml files to DSL-based definitions. For now the DSL scripts will generate corresponding XML files until we retool the batch feature extractor (FeatureFileBatchConverter) to use DSL-based definitions directly.
- Ruby easy XML output library
- A presentation on Ruby for DSL implementation
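As a rough illustration of the interim stage in step 3, where DSL-based definitions generate the corresponding XML, the sketch below uses a fluent builder. It is written in plain Java rather than a scripting language purely to keep the example self-contained. The class name, the `<feature_set>` root element, and the `name` attribute are assumptions; only the three feature tag names come from the syntax described earlier.

```java
import java.util.ArrayList;
import java.util.List;

final class FeatureSetDslSketch {
    private final List<String> lines = new ArrayList<String>();

    static FeatureSetDslSketch featureSet() {
        return new FeatureSetDslSketch();
    }

    FeatureSetDslSketch fileFeature(String name)  { return add("file_feature", name); }
    FeatureSetDslSketch sliceFeature(String name) { return add("slice_feature", name); }
    FeatureSetDslSketch countFeature(String name) { return add("count_feature", name); }

    private FeatureSetDslSketch add(String tag, String name) {
        lines.add("  <" + tag + " name=\"" + escape(name) + "\"/>");
        return this;
    }

    // Escape '&' first so the ampersands introduced by later replacements stay intact.
    private static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders the collected features as a (hypothetical) XML definition. */
    String toXml() {
        StringBuilder xml = new StringBuilder("<feature_set>\n");
        for (String line : lines) {
            xml.append(line).append('\n');
        }
        return xml.append("</feature_set>\n").toString();
    }

    public static void main(String[] args) {
        System.out.print(featureSet()
            .fileFeature("duration")
            .sliceFeature("energy")
            .countFeature("pauses")
            .toXml());
    }
}
```

A scripting-language DSL would replace the chained method calls with its own syntax, while the XML-producing step could stay much the same until the batch feature extractor reads the definitions directly.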
Check for regressions.
GUI Integration in Feature Engineering Console

Spoken Language ID

nlp-private/feature-definition-xml-file-roadmap.txt · Last modified: 2015/04/22 15:19 by ryancha
