Below are some emails discussing the per-template feature weight analysis.
From: Robbie Haertel
Sent: Friday, November 07, 2008 8:01 AM
To: Eric Ringger; Peter McClanahan
Subject: Char model template ranking
I wrote a script to rank templates by the sum of the absolute values or squares of the weights of the features they produce. Here are the results for the character model (using squared weights):
UVC-2-1-0 13667.672520080343
UVC-3-2-1 13660.474430102808
UVC+0+1+2 12399.267224576308
UVC+1+2+3 9606.828816105764
UVC-1-0 2559.46629499261
UVC+0+1 2114.3135426928425
UVC-2-1 2104.486720839376
UVC+1+2 1475.8852669479863
PREV_VOWELS 1400.7812718725113
…
I believe, per Peter's note, that UVC means unvoweled consonant, and that the plus and minus offsets indicate which characters are included: two to the left (-2), the current character (-0), etc.
EOS means end of sentence (which should really be EOW, end of word).
The PREV_VOWEL_* features all have the same weight, which helped me find a bug; I'm retraining right now.
Fact: the model is only 5 MB, so we can add plenty of new features.
Suggestion: it looks like we might get some mileage out of adding larger n-grams (the ones centered at zero seem to work slightly better).
Robbie
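The ranking Robbie describes could be sketched roughly as follows. This is a minimal sketch, not his actual script: it assumes the model's learned weights are available as a feature-name-to-weight map and that each feature name begins with its template name followed by a ":" separator (a hypothetical naming scheme).

```python
from collections import defaultdict

def rank_templates(feature_weights, score=lambda w: w * w):
    """Rank templates by the summed per-weight score (default: squared weight).

    feature_weights: dict mapping feature name -> learned weight.
    Assumes each feature name is "TEMPLATE:value", so the template is
    everything before the first ':' (hypothetical naming convention).
    """
    totals = defaultdict(float)
    for feature, weight in feature_weights.items():
        template = feature.split(":", 1)[0]
        totals[template] += score(weight)
    # Highest-scoring templates first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Toy example with made-up weights:
weights = {"UVC-1-0:bn": 1.2, "UVC-1-0:kt": -0.9, "PREV_VOWELS:aa": 0.5}
for template, total in rank_templates(weights):
    print(template, total)
```

Passing `score=abs` instead of the default would give the sum-of-absolute-values variant mentioned above.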
Currently the FEC allows you to view per-feature weights. What Robbie has implemented is a way to assess per-template weights, which is very important for template-level feature engineering.
Josh, would you put this on the to-do list – for you or someone else to take up when ready.
Thanks, –Eric
From: Robbie Haertel
Sent: Saturday, November 08, 2008 9:16 PM
To: Eric Ringger
Cc: Peter McClanahan
Subject: Re: Char model template ranking
Here are the most up-to-date results. I have three methods for ranking templates: sum, avg, and max, which correspond to the sum, average, and maximum of the weights of all features produced by a template. The problem with sum is that templates producing many features are unfairly favored; average has the problem that a single really good feature can be hidden; max leaves out the cumulative effect of lower-weighted features. For the word model, I've summed across all models, but to put them on the same playing field, I've normalized each model by dividing its weights by the difference between the maximum and minimum weights for that model.
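The three ranking methods and the per-model range normalization described above can be sketched like this (again assuming a feature-name-to-weight map with a hypothetical "TEMPLATE:value" naming scheme; this is an illustration, not the actual script):

```python
from collections import defaultdict

def template_stats(feature_weights):
    """Compute sum, avg, and max of |weight| for each template
    (template = part of the feature name before the first ':')."""
    groups = defaultdict(list)
    for feature, weight in feature_weights.items():
        groups[feature.split(":", 1)[0]].append(abs(weight))
    return {t: {"sum": sum(ws), "avg": sum(ws) / len(ws), "max": max(ws)}
            for t, ws in groups.items()}

def normalize(feature_weights):
    """Divide each weight by the model's weight range (max - min),
    so weights from differently-scaled models become comparable."""
    lo, hi = min(feature_weights.values()), max(feature_weights.values())
    span = (hi - lo) or 1.0  # guard against a degenerate constant model
    return {f: w / span for f, w in feature_weights.items()}
```

To combine several word models as described, one would apply `normalize` to each model's weights first, then merge the normalized maps and run `template_stats` on the result.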
Of course, the features are not necessarily self-explanatory.
Character:
Sum
UVC+0+1+2 56.90642282021061
PARTIALLY_DIACRITICIZED_N-GRAM_4 48.21648326339715
UVC+1+2+3 46.71085783636655
UVC-2-1-0 41.28993978036493
UVC-3-2-1 28.411102871841752
PARTIALLY_DIACRITICIZED_N-GRAM_3 13.139955676041424
UVC-1-0 9.059466014618154
UVC+0+1 8.834905814140908
UVC+1+2 5.10312275907008
PREV_3DIACRITICS 4.651042786597986
UVC-2-1 3.729255127717895
PARTIALLY_DIACRITICIZED_N-GRAM_2 2.656147659768024
…
By average
PREV_DIACRITIC_1 0.003508388902034368
UVC+1 0.0033205114630904674
PREV_DIACRITIC_3 0.002422712014570924
UVC-4 0.0023483482910579027
UVC 0.0018694564189856001
EOW_0 0.0018616896244633382
PREV_DIACRITIC_2 0.0018021069851990807
UVC-2 0.0015103571822412162
EOW_1 0.0014282984886792042
…
By max
PREV_DIACRITIC_1 0.003508388902034368
UVC+1 0.0033205114630904674
PREV_DIACRITIC_3 0.002422712014570924
UVC-4 0.0023483482910579027
UVC 0.0018694564189856001
EOW_0 0.0018616896244633382
PREV_DIACRITIC_2 0.0018021069851990807
UVC-2 0.0015103571822412162
…
Word:
Sum
PREV_VOWELED_SUFFIX_1 5234.069297836419
PREV_VOWELED_PREFIX_1 4604.480946580072
PREV_VOWELED_SUFFIX_2 4286.125609120276
PREV_VOWELED_PREFIX_2 4035.467960754094
PREV_VOWELING_1 1731.6868118682146
PREV1 1715.622510182117
BEFORESUFFIX_1_1 1701.4053417460768
BEFOREPREFIX_1_1 1687.5277231724885
SUFFIX_AGREEMENT_PATTERN_1 1622.8327423461471
PREFIX_AGREEMENT_PATTERN2_1 1589.946248931253
…
By average
SUFFIX_1 0.10446400188291545
PREFIX_1 0.10446400188291545
SUFFIX_3 0.10314193165181945
PREFIX_3 0.10314193165181945
SUFFIX_2 0.10290156189167299
PREFIX_2 0.10290156189167299
PREFIX_AGREEMENT_PATTERN2_0 0.03840109781621068
SUFFIX_AGREEMENT_PATTERN_0 0.026196459437768865
PREFIX_AGREEMENT_PATTERN2_1 0.021202393003390538
BEFOREPREFIX_1_1 0.019702370353790247
AFTERPREFIX_1_1 0.019369258361621268
BEFORESUFFIX_1_1 0.018596016544938705
AFTERSUFFIX_1_1 0.017126024961369425
…
By max
SUFFIX_1 0.10446400188291545
PREFIX_1 0.10446400188291545
SUFFIX_3 0.10314193165181945
PREFIX_3 0.10314193165181945
SUFFIX_2 0.10290156189167299
PREFIX_2 0.10290156189167299
PREFIX_AGREEMENT_PATTERN2_0 0.03840109781621068
SUFFIX_AGREEMENT_PATTERN_0 0.026196459437768865
PREFIX_AGREEMENT_PATTERN2_1 0.021202393003390538
BEFOREPREFIX_1_1 0.019702370353790247
AFTERPREFIX_1_1 0.019369258361621268
BEFORESUFFIX_1_1 0.018596016544938705
AFTERSUFFIX_1_1 0.017126024961369425
…
Robbie