Below are some emails discussing the per-template feature weight analysis.
From: Robbie Haertel
Sent: Friday, November 07, 2008 8:01 AM
To: Eric Ringger; Peter McClanahan
Subject: Char model template ranking
I wrote a script to rank templates by the sum of the absolute values or squares of the weights of the features they produce. Here are the results for the character model (using squared weights):
UVC-2-1-0 13667.672520080343
UVC-3-2-1 13660.474430102808
UVC+0+1+2 12399.267224576308
UVC+1+2+3 9606.828816105764
UVC-1-0 2559.46629499261
UVC+0+1 2114.3135426928425
UVC-2-1 2104.486720839376
UVC+1+2 1475.8852669479863
PREV_VOWELS 1400.7812718725113
…
I believe, per Peter's note, that UVC means unvoweled consonant, and that the plus and minus offsets indicate which characters are included: two to the left (-2), the current character (-0), etc.
EOS means end of sentence (which should really be EOW, end of word).
The PREV_VOWEL_* features all have the same weight, which helped me find a bug; I'm retraining right now.
Fact: the model is only 5 MB, so we can add plenty of new features.
Suggestion: it looks like we might get some mileage out of adding larger n-grams (the ones centered at zero seem to work slightly better).
Robbie
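The ranking Robbie describes could be sketched roughly as follows. This is a minimal sketch, not his actual script: it assumes the model's learned weights are available as a feature-name-to-weight map and that each feature name begins with its template name followed by a ":" separator (a hypothetical naming scheme).

```python
from collections import defaultdict

def rank_templates(feature_weights, score=lambda w: w * w):
    """Rank templates by the summed per-weight score (default: squared weight).

    feature_weights: dict mapping feature name -> learned weight.
    Assumes each feature name is "TEMPLATE:value", so the template is
    everything before the first ':' (hypothetical naming convention).
    """
    totals = defaultdict(float)
    for feature, weight in feature_weights.items():
        template = feature.split(":", 1)[0]
        totals[template] += score(weight)
    # Highest-scoring templates first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with made-up weights:
weights = {"UVC-1-0:bn": 1.2, "UVC-1-0:kt": -0.9, "PREV_VOWELS:aa": 0.5}
for template, total in rank_templates(weights):
    print(template, total)
```

Passing `score=abs` instead of the default would give the sum-of-absolute-values variant mentioned above.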
Currently the FEC allows you to view per-feature weights. What Robbie has implemented is a way to assess per-template weights, which is very important for template-level feature engineering.
Josh, would you put this on the to-do list – for you or someone else to take up when ready.
Thanks, –Eric
From: Robbie Haertel
Sent: Saturday, November 08, 2008 9:16 PM
To: Eric Ringger
Cc: Peter McClanahan
Subject: Re: Char model template ranking
Here are the most up-to-date results. I have three methods for ranking templates: sum, avg, and max, which correspond to the sum, average, and maximum of the weights of all features produced by a template. The problem with sum is that templates producing many features are unfairly favored; average has the problem that a single really good feature can be hidden; max leaves out the cumulative effect of lower-weighted features. For the word model, I've summed across all models, but to put them on the same playing field, I've normalized each model by dividing its weights by the difference between the maximum and minimum weights for that model.
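The three ranking methods and the per-model range normalization described above can be sketched like this (again assuming a feature-name-to-weight map with a hypothetical "TEMPLATE:value" naming scheme; this is an illustration, not the actual script):

```python
from collections import defaultdict

def template_stats(feature_weights):
    """Compute sum, avg, and max of |weight| for each template
    (template = part of the feature name before the first ':')."""
    groups = defaultdict(list)
    for feature, weight in feature_weights.items():
        groups[feature.split(":", 1)[0]].append(abs(weight))
    return {t: {"sum": sum(ws), "avg": sum(ws) / len(ws), "max": max(ws)}
            for t, ws in groups.items()}

def normalize(feature_weights):
    """Divide each weight by the model's weight range (max - min),
    so weights from differently-scaled models become comparable."""
    lo, hi = min(feature_weights.values()), max(feature_weights.values())
    span = (hi - lo) or 1.0  # guard against a degenerate constant model
    return {f: w / span for f, w in feature_weights.items()}
```

To combine several word models as described, one would apply `normalize` to each model's weights first, then merge the normalized maps and run `template_stats` on the result.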
Of course, the features are not necessarily self-explanatory.
Character:
Sum
UVC+0+1+2 56.90642282021061
PARTIALLY_DIACRITICIZED_N-GRAM_4 48.21648326339715
UVC+1+2+3 46.71085783636655
UVC-2-1-0 41.28993978036493
UVC-3-2-1 28.411102871841752
PARTIALLY_DIACRITICIZED_N-GRAM_3 13.139955676041424
UVC-1-0 9.059466014618154
UVC+0+1 8.834905814140908
UVC+1+2 5.10312275907008
PREV_3DIACRITICS 4.651042786597986
UVC-2-1 3.729255127717895
PARTIALLY_DIACRITICIZED_N-GRAM_2 2.656147659768024
…
By average
PREV_DIACRITIC_1 0.003508388902034368
UVC+1 0.0033205114630904674
PREV_DIACRITIC_3 0.002422712014570924
UVC-4 0.0023483482910579027
UVC 0.0018694564189856001
EOW_0 0.0018616896244633382
PREV_DIACRITIC_2 0.0018021069851990807
UVC-2 0.0015103571822412162
EOW_1 0.0014282984886792042
…
By max
PREV_DIACRITIC_1 0.003508388902034368
UVC+1 0.0033205114630904674
PREV_DIACRITIC_3 0.002422712014570924
UVC-4 0.0023483482910579027
UVC 0.0018694564189856001
EOW_0 0.0018616896244633382
PREV_DIACRITIC_2 0.0018021069851990807
UVC-2 0.0015103571822412162
…
Word:
Sum
PREV_VOWELED_SUFFIX_1 5234.069297836419
PREV_VOWELED_PREFIX_1 4604.480946580072
PREV_VOWELED_SUFFIX_2 4286.125609120276
PREV_VOWELED_PREFIX_2 4035.467960754094
PREV_VOWELING_1 1731.6868118682146
PREV1 1715.622510182117
BEFORESUFFIX_1_1 1701.4053417460768
BEFOREPREFIX_1_1 1687.5277231724885
SUFFIX_AGREEMENT_PATTERN_1 1622.8327423461471
PREFIX_AGREEMENT_PATTERN2_1 1589.946248931253
…
By average
SUFFIX_1 0.10446400188291545
PREFIX_1 0.10446400188291545
SUFFIX_3 0.10314193165181945
PREFIX_3 0.10314193165181945
SUFFIX_2 0.10290156189167299
PREFIX_2 0.10290156189167299
PREFIX_AGREEMENT_PATTERN2_0 0.03840109781621068
SUFFIX_AGREEMENT_PATTERN_0 0.026196459437768865
PREFIX_AGREEMENT_PATTERN2_1 0.021202393003390538
BEFOREPREFIX_1_1 0.019702370353790247
AFTERPREFIX_1_1 0.019369258361621268
BEFORESUFFIX_1_1 0.018596016544938705
AFTERSUFFIX_1_1 0.017126024961369425
…
By max
SUFFIX_1 0.10446400188291545
PREFIX_1 0.10446400188291545
SUFFIX_3 0.10314193165181945
PREFIX_3 0.10314193165181945
SUFFIX_2 0.10290156189167299
PREFIX_2 0.10290156189167299
PREFIX_AGREEMENT_PATTERN2_0 0.03840109781621068
SUFFIX_AGREEMENT_PATTERN_0 0.026196459437768865
PREFIX_AGREEMENT_PATTERN2_1 0.021202393003390538
BEFOREPREFIX_1_1 0.019702370353790247
AFTERPREFIX_1_1 0.019369258361621268
BEFORESUFFIX_1_1 0.018596016544938705
AFTERSUFFIX_1_1 0.017126024961369425
…
Robbie