Observations about the data:

  • It would be foolish not to do Active Learning from the very first sentence. Support: on the words-changed graph it is clear that if you bootstrap with 1% of the data, the number of words initially tagged plus the number of words needing changes is approximately the same as the total number of words initially tagged when starting with 5% of the data. This is further confirmed in the cost graphs, where at virtually every point the line representing training on the least data corresponds to the lowest cost.
  • If on a particular metric you are doing worse than random, just invert the priority queue (analogous to reversing your decisions on a binary classification task). The reason you do worse than random is that you are purposefully selecting sentences the model is certain about instead of uncertain ones (this also helps us find bugs). However, what is good for one metric may not be good for another.
  • There appears to be an inverse relationship between “sentence goodness” and “word goodness”: algorithms that do well at the sentence level tend to do poorly at the word level, and vice versa.
    • Much (all?) of the time this is because these algorithms are mostly just selecting long sentences (we should plot a longest-sentence baseline)
  • The length-normalized variants appear to do excellently on the “word” metric, but very poorly on the sentence metric. This means that we've pulled in really uncertain words, but at the sentence level these words did not help accuracy as much. Perhaps we are pulling in shorter sentences and hence have less data.
  • The frequency-weighted variants are not performing well. Hypothesis: if we find a sentence in which a frequent word is uncertain, that sentence will be weighted very highly. But if, in most of the other sentences, the word is “certain”, we don't want to select it. What we really want is to sum the uncertainty over all occurrences of that word.
  • COST
    • Scenario 1: The goal of an annotation project may be to fully annotate an entire corpus; all words will eventually be either tagged or verified by a human, with the hope of having a perfectly-annotated corpus (yeah right :)). Total Cost = cost to “bootstrap” + cost to “fix” errors proposed by Active Learning + cost to fix errors after active learning. These last two may not need to be separate. Also note that this cost implies that accuracy (a component of the costs) needs to be calculated over the reference set. This is fine because there is no cost associated with misclassification on a “blind” set, according to the definition of our “goal” above. If the project would like to produce a generic “tagger”, they should annotate the whole corpus, split it into training and test, train the model, report the results, then train the model over the whole corpus for distribution. If part of the project's goal is to reduce the error rate of such a tagger, then our cost changes, and this is a different scenario that is some combination of the annotation cost and a cost associated with errors in the final model (not explored here).
    • Scenario 2: It may not be feasible to expect to annotate an entire corpus (e.g. the web). In this case, the task at hand is basically to bootstrap an automatic tagger that reduces error rates over the entire corpus (including those sentences annotated by hand), so a cost must be assigned to errors in the final model. The cost function for this type of project is: initial cost of annotation (of the 1%, 5%, etc.) + cost to make changes using Active Learning + cost of errors in the final model. Here, the cost of errors in the final model is probably best computed over a held-out set.
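The priority-queue inversion described in the second observation can be sketched as below. This is a minimal illustration, not the actual selection code; the function name, the list-of-(score, sentence) input, and the higher-score-means-more-uncertain convention are all assumptions.

```python
import heapq

def select_sentences(scored, k, invert=False):
    """Select k sentences from (score, sentence) pairs, where a higher
    score means the model is more uncertain about the sentence.

    By default we pop the most uncertain sentences first. If a metric is
    doing worse than random, set invert=True to flip the ordering and pop
    from the other end of the queue instead (analogous to reversing the
    decisions of a binary classifier)."""
    sign = 1 if invert else -1  # heapq is a min-heap, so negate for max-first
    heap = [(sign * score, sent) for score, sent in scored]
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(k, len(heap)))]
```

Flipping the sign of every score is cheaper than rebuilding the queue logic, and it makes the "invert the queue" debugging trick a one-line change.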
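The long-sentence bias behind the sentence-vs-word trade-off can be seen in a toy scoring function. This is a hypothetical sketch (the real scorer is not shown here): it only illustrates why an unnormalized sum of per-token uncertainties favors long sentences, while length-normalizing favors densely uncertain, often shorter, ones.

```python
def sentence_score(token_uncertainties, normalize=False):
    """Score a sentence from its per-token uncertainty values.

    Unnormalized: the raw sum, which grows with sentence length, so
    long sentences win even if each token is only mildly uncertain.
    Normalized: the mean uncertainty per token, which rewards short,
    densely uncertain sentences instead."""
    total = sum(token_uncertainties)
    if normalize:
        return total / len(token_uncertainties)
    return total
```

A ten-token sentence of mildly uncertain words outscores a single highly uncertain word without normalization, and the ranking flips with it, which matches the observed pattern.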
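The fix proposed for the frequency-weighted variants, summing uncertainty over all occurrences of a word rather than weighting one uncertain occurrence by corpus frequency, can be sketched as follows. The function name and the `uncertainty(word, sentence)` callback are illustrative assumptions, standing in for whatever per-occurrence uncertainty the tagger produces.

```python
from collections import defaultdict

def word_uncertainty_totals(sentences, uncertainty):
    """Sum model uncertainty over every occurrence of each word.

    sentences: iterable of token lists.
    uncertainty: callable (word, sentence) -> float, a per-occurrence
    uncertainty score (hypothetical interface to the tagger).

    A frequent word that is uncertain in one sentence but certain
    everywhere else accumulates a low total, so it is no longer
    over-selected the way a frequency-weighted single occurrence is."""
    totals = defaultdict(float)
    for sent in sentences:
        for word in sent:
            totals[word] += uncertainty(word, sent)
    return dict(totals)
```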
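The two cost scenarios above can be written down directly. These are just the formulas from the bullets restated as code; the argument names and units (whatever cost scale the project uses) are assumptions for illustration.

```python
def scenario1_cost(bootstrap_cost, al_fix_cost, post_al_fix_cost):
    """Scenario 1: fully annotate the corpus.
    Total = cost to bootstrap + cost to fix errors proposed by Active
    Learning + cost to fix remaining errors afterward (the last two
    may not need to be separate)."""
    return bootstrap_cost + al_fix_cost + post_al_fix_cost

def scenario2_cost(initial_annotation_cost, al_change_cost,
                   n_heldout_errors, cost_per_error):
    """Scenario 2: bootstrap a tagger for a corpus too large to annotate.
    Residual model errors, best measured on a held-out set, carry a
    per-error cost."""
    return (initial_annotation_cost + al_change_cost
            + n_heldout_errors * cost_per_error)
```

Keeping the two totals as separate functions makes it explicit that Scenario 2 trades annotation cost against model-error cost, whereas Scenario 1 has no blind-set misclassification term at all.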
nlp-private/alfa-data-analysis.txt · Last modified: 2015/04/22 20:44 by ryancha