nlp-private:feature-engineering-console [CS Wiki]

For details on the design of the Feature Engineering Console, see FEC Design. For details on the current development status of the Feature Engineering Console, see FEC Development. This page focuses on current features of the FEC from a user's perspective. See also Feature Engineering Cycle.

The Feature Engineering Console, or FEC, is a graphical tool to assist the researcher in identifying the impact of different features on the performance of classification and identification systems. Important features include:

Experiment and Result Management
Cost Component Matrix
Trial Viewer
Model Weights Viewer
Integrated Regression Tests
Highly Extensible API

Email me: joshhansen@byu.edu

Getting Started

These instructions will show you how to start using the FEC for a simple proper noun phrase classification task:

Either grab the all-in-one jar file, or check out and build the projects separately. (The projects are at http://nlp.cs.byu.edu/subversion/statnlp/trunk, http://nlp.cs.byu.edu/subversion/FEC/trunk, and http://nlp.cs.byu.edu/subversion/statnlp/trunk/experimentation-module. experimentation-module depends on FEC and statnlp. FEC depends on statnlp.)
Extract the 'experiments' folder from the jar, or locate it within the statnlp project.
Get the PNP data and extract it in your 'corpora' directory. The 'corpora' and 'experiments' directories probably need to have the same parent.
If you build the projects separately, make sure the experimentation module jar file is only on the runtime classpath, not the build-time classpath. If you use the all-in-one jar, this shouldn't be an issue.
Run edu.byu.fec.FeatureEngineeringConsole, either using java -cp fec-full.jar edu.byu.fec.FeatureEngineeringConsole, or from within your IDE.
Upon first startup, you will be prompted to enter the location of the experiment XML files for the StatNLP experiment harness. This is the 'experiments' folder mentioned above. FEC will then create a configuration file named engineering-env.conf in the current working directory. As long as you start FEC from the same working directory, it will load this configuration file on later runs, sparing you the trouble of entering the experiments directory path every time.
Once this is entered and you have clicked OK, a list with a single item in it will appear. This item should be “Statistical NLP ExperimentHarness”. Double-click on it.
You should now see a list of experiments available to be run. Right-click on “pnptester-fec.xml” and select “Run Experiment”.
Assuming that all the paths in the XML file are pointing to the data, and assuming temporary suspension of Murphy's Law, the experiment should now run in the background. Click on the “Show Console” button to get more detail about what's going on behind the scenes.
Once the experiment finishes, double-click on “pnptester-fec.xml”.
From the list of results that appears (there might be more than one; this is an outstanding bug), right-click on one and select “View Cost Component Matrix”.
The Cost Component Matrix (sometimes referred to as the CCM) facilitates feature engineering by allowing you to “drill down” to specific error cases. Play around with it. Have fun!

See below for an image of the CCM.
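Because FEC looks for engineering-env.conf in the current working directory, it can help to always launch it from the same place. The following is a minimal sketch of such a launcher, not part of FEC itself. The helper names (build_fec_command, has_saved_config, launch_fec) are hypothetical. Only the main class name, the fec-full.jar file name, and the configuration file name come from the steps above.

```python
import subprocess
from pathlib import Path

MAIN_CLASS = "edu.byu.fec.FeatureEngineeringConsole"  # main class named in the steps above
CONFIG_NAME = "engineering-env.conf"                  # file FEC writes on first startup


def build_fec_command(jar_path="fec-full.jar"):
    """Return the argument list equivalent to `java -cp <jar> <main class>`."""
    return ["java", "-cp", str(jar_path), MAIN_CLASS]


def has_saved_config(working_dir):
    """Return True if a configuration file already exists in working_dir."""
    return (Path(working_dir) / CONFIG_NAME).is_file()


def launch_fec(working_dir, jar_path="fec-full.jar"):
    """Start FEC with a fixed working directory (hypothetical helper)."""
    working_dir = Path(working_dir).resolve()
    # Resolve the jar first, because the child process runs in another directory.
    jar_path = Path(jar_path).resolve()
    if not has_saved_config(working_dir):
        print("No engineering-env.conf found; expect the experiments-folder prompt.")
    return subprocess.Popen(build_fec_command(jar_path), cwd=str(working_dir))
```

Calling launch_fec("/path/to/workdir") from any location would then start FEC with that directory as its working directory, so the saved experiments path is picked up on later runs.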

Usage

When FEC loads, it should initially look something like this:
FECInitialWindow.png

System List Panel

The above screenshot focuses on the System List Panel, which displays any experimentation systems currently loaded. These are discovered at runtime as outlined in Engineering Environment. To open a system for work, double-click on its entry in the System List Panel.

Experimentation System Panel

SLIDSystemPanel.png

Experiment List Panel

The initially visible part of an Experimentation System Panel is the Experiment List Panel. The experiment list panel lists all available experiments within the system, and color-codes them according to whether the experiment has already been run to generate results:

Green - all results are present.
Yellow - only the aggregate curves are present.
Red - no results are present.
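The color coding can be read as a simple three-way rule. The sketch below only illustrates that rule. The function name and its two boolean inputs are hypothetical, and how FEC actually detects which results exist is not described here.

```python
def experiment_color(has_aggregate_curves, has_full_results):
    """Map the state of an experiment's results to its list color.

    Both flags are hypothetical inputs used for illustration only.
    """
    if has_full_results:
        return "green"   # every result is present
    if has_aggregate_curves:
        return "yellow"  # only the aggregate curves are present
    return "red"         # nothing has been generated yet
```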


Aggregate DET curve

Shows the overall effectiveness of the current featureset. This is the traditional output we have used in the past.
Bigram_and_fivegram_global_det_curve_-_11_oct_2007.png
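For readers unfamiliar with DET (detection error tradeoff) curves: each point pairs a false-alarm rate with a miss rate at one decision threshold. The sketch below computes those points from two lists of scores. It is a generic illustration, not the code FEC uses. The function name, the score lists, and the "accept when score >= threshold" convention are all assumptions.

```python
def det_points(target_scores, nontarget_scores):
    """Return (false_alarm_rate, miss_rate) pairs, one per candidate threshold.

    Assumes a trial is accepted when its score is >= the threshold.
    """
    if not target_scores or not nontarget_scores:
        raise ValueError("both score lists must be non-empty")
    # Every distinct observed score is used as a candidate threshold.
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    points = []
    for t in thresholds:
        misses = sum(1 for s in target_scores if s < t)            # targets rejected
        false_alarms = sum(1 for s in nontarget_scores if s >= t)  # non-targets accepted
        points.append((false_alarms / len(nontarget_scores),
                       misses / len(target_scores)))
    return points
```

As the threshold rises, the false-alarm rate can only fall and the miss rate can only rise, which traces out the tradeoff the curve shows. Plotting is left out of the sketch.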

Language-to-language DET curves

A matrix of all pairs of languages will show the strength of the featureset in discriminating between any two languages. To find a clean mechanism for generating language-to-language DET curves, look at how the LANGUAGE_LIST and LONG_LANGUAGE_LIST variables are consumed by the build system.
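A minimal sketch of enumerating the cells of such a matrix follows. It assumes the language list is a plain list of language codes and that each ordered pair gets its own cell. Neither assumption is confirmed here, and the evaluate_pair callback is a hypothetical stand-in for whatever produces a curve for one pair.

```python
from itertools import permutations


def language_pairs(languages):
    """Return every ordered pair of distinct languages."""
    return list(permutations(languages, 2))


def pairwise_matrix(languages, evaluate_pair):
    """Build a {(lang_a, lang_b): result} mapping for all ordered pairs.

    evaluate_pair is a hypothetical callback, e.g. one that computes
    a DET curve for lang_a against lang_b.
    """
    return {(a, b): evaluate_pair(a, b) for a, b in permutations(languages, 2)}
```

If the comparison turns out to be symmetric, itertools.combinations would halve the number of cells.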

Cost Component Matrix

Feature_Engineering_Console_-_Cost_Component_Matrix.png

Spoken Language ID Feature Engineering Console

nlp-private/feature-engineering-console.txt · Last modified: 2015/04/22 15:00 by ryancha
