Home page for Active Learning for Annotation (ALFA) Project


Action List

Move action item to the Done section when complete.

Themes of Possible Papers

  • Word vs. Sentence-level annotation
  • Human waits vs. Computer waits
  • Cost-normalized algorithm (instead of just length)
  • Benefit normalizer: some words appear more often and hence are more valuable to annotate
  • How to choose the first word/sentence
  • POS tagging when tags are vectors of features
  • Syriac results
  • Feature engineering for POS Tagging on our BNC Poetry set

Possible Venues


  • model of cost
    • should render longest sentence no different than random (?)
  • granularity of annotation
  • EVSI-inspired model that works

To Do

  • We don't have the money to label all of the data we are interested in, so we will label a subset.
  • Other scenario: we will label it all but want to do it more quickly; this requires a user study to prove acceleration under a reasonable model of time/cost.


  • POS tagging experiment with MEMM: compare tagger accuracy with and without using the model to score the final transition P(</s> | …) to the stop tag.
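As a toy illustration of what the stop-tag factor changes (hypothetical transition probabilities, not our trained model), score the same tag sequence with and without the final P(</s> | t_last) factor:

```python
import math

def sequence_logprob(trans, tags, use_stop=True):
    """Sum log P(t_i | t_{i-1}) over a tag sequence, optionally adding
    the final transition to the stop tag </s>."""
    logp, prev = 0.0, "<s>"
    for t in tags:
        logp += math.log(trans[(prev, t)])
        prev = t
    if use_stop:
        logp += math.log(trans[(prev, "</s>")])
    return logp

# Hypothetical transition probabilities for illustration only.
trans = {("<s>", "DT"): 0.6, ("DT", "NN"): 0.8, ("NN", "</s>"): 0.5}
with_stop = sequence_logprob(trans, ["DT", "NN"], use_stop=True)
without_stop = sequence_logprob(trans, ["DT", "NN"], use_stop=False)
```

With the stop factor included, sequences whose final tag rarely ends a sentence are penalized; the experiment asks whether that changes the decoder's argmax often enough to affect accuracy.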


  • Issue for paper
    • do we focus on tagging alone
    • do we focus on tagging plus morph. analysis (together: “disambiguation”)?
  • Different combinators:
    • combining the hypotheses of independent (attribute-cluster) taggers
  • Two ways of evaluating:
    • every attribute
    • every attribute that is not N/A
  • Newest (as decided in the meeting on March 5)

  1. Take out randomization when using 100% of the data [DONE 3/5]
  2. Change feature selection in XML; check Romanized characters (only capital letters) [DONE 3/25]
  3. Compare independent subtags with count cutoff of 1. [DONE 3/25]
  4. Run monolithic with count cutoff of 1. [DONE 4/14 or so]

  • Currently working on:
    • Refactoring existing code that did POS tagging for sets of subtags.
    • I built a Hebrew Reader and split the Hebrew data we have (160,000 tokens). Need to submit this.
  • Things to do:

Run a most-frequent tagger for Arabic and Syriac (done for Hebrew). Add Hebrew first-round results to the wiki.

  • OLD (but reviewed)
  • Vectorial Tagging
    • Assume up front that the only tags considered are those that occurred in the training set; do not consider other tags
    • perform feature extraction and maxent vectorization statically before training


  • possibly use listener to track the behavior of fast maxent

  • Output of morphology will constrain the paths of the trie.
  • One trie for all allowable tags in each cluster (alternatively, one trie for all allowable tags in each cluster for each word)

  • Profile how many tags each word has and how many tags per word in the test set didn't occur in the training set (min, max, avg)
    • When scoring a local trigram state, the Scorer assigns a score to each branch by using the submodels on each edge
      • To enumerate all possible monolithic tags made possible by the clustering, you must take the cross product of the tries.
    • Trie captures a static version of allowable tags (we don't need to search).
    • The order of the attributes (in the trie and the chain rule) is determined by the clustering.
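A minimal sketch of the static tag trie (toy attribute vectors, not the real Syriac tagset): each cluster gets a trie over its attribute sequence, with trie depth following the attribute order fixed by the clustering, and the monolithic tags come from the cross product of the per-cluster tries:

```python
from itertools import product

def build_trie(tag_vectors):
    """Nested-dict trie over attribute vectors; attribute order
    (trie depth) is the order fixed by the clustering / chain rule."""
    root = {}
    for vec in tag_vectors:
        node = root
        for attr in vec:
            node = node.setdefault(attr, {})
    return root

def allowable(trie, prefix=()):
    """Enumerate every complete attribute vector stored in the trie."""
    if not trie:
        yield prefix
    for attr, child in trie.items():
        yield from allowable(child, prefix + (attr,))

# Toy clusters: (POS, number) and (gender,).
cluster_a = build_trie([("NOUN", "sg"), ("NOUN", "pl"), ("VERB", "sg")])
cluster_b = build_trie([("masc",), ("fem",)])
tags_a = sorted(allowable(cluster_a))
# Cross product of the tries enumerates all monolithic tags.
monolithic = [a + b for a, b in product(tags_a, sorted(allowable(cluster_b)))]
```

Because the trie stores only attribute combinations seen in training, lookup replaces search, which is exactly the "static version of allowable tags" point above.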


  • Track down most recent paper on Arabic results
    • Track down Arabic data set used in most recent results [I e-mailed Dr. Lonsdale a week ago, but he hasn't responded]
    • Build an Arabic reader from that data


  • Devise a system to archive experimental results on entropy, perhaps in a way that's consistent with the directory structure on the supercomputer [waiting to be reviewed and committed]
    • Put forth a proposal on the list and ask for reactions
    • Default: don't replace existing results
      • Include “-f” option to “force” replacement explicitly
    • Write a script that puts (scp or rsync) experimental results in the right place on entropy; it should probably be invoked by the experimenter when satisfied with the results.
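A local sketch of the proposed archiving policy (plain file copying stands in for the eventual scp/rsync step; the "-f" semantics are the point):

```python
import os
import shutil

def archive_results(src, dest, force=False):
    """Copy result files from src into the archive at dest.
    Default: never replace existing results; force=True (the proposed
    "-f" option) replaces them explicitly."""
    os.makedirs(dest, exist_ok=True)
    copied, skipped = [], []
    for name in sorted(os.listdir(src)):
        target = os.path.join(dest, name)
        if os.path.exists(target) and not force:
            skipped.append(name)
            continue
        shutil.copy2(os.path.join(src, name), target)
        copied.append(name)
    return copied, skipped
```

Returning the skipped list lets the script warn the experimenter which results already exist instead of silently clobbering them.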


  • Implement N-best decoder
    • Use that decoder for approximation to forward entropy
  • run Mallet CRF (order-2) on Syriac data
    • first on 10% to see if it works correctly
  • co-code review with Peter to deal with multiple models in-memory, sub-tag taggers, etc.
    • finish Perceptron tagging with sub-tag models
      • independent sub-tag models
      • dependencies between sub-tag models
  • search out other DCRF, performing multiple, cascaded labeling tasks on the same sequence
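One way the N-best decoder could feed the forward-entropy approximation (a sketch, assuming we renormalize over only the N returned hypotheses and ignore the mass outside the list):

```python
import math

def nbest_entropy(nbest_logprobs):
    """Approximate sequence entropy from an N-best list: renormalize
    the top-N scores into a distribution and take its Shannon entropy.
    Probability mass outside the list is simply dropped."""
    m = max(nbest_logprobs)
    weights = [math.exp(lp - m) for lp in nbest_logprobs]  # stable exp
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)
```

Larger N makes the approximation tighter at the cost of a slower decode, which is the tradeoff the N-best item is meant to explore.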


  • User study on web: quick estimate of cost of annotation
    • Conduct user study to assess sentences/words/corrections per unit time.
    • Can we find these numbers in the literature? - some from PTB (ask Robbie)

To accomplish a couple of purposes:

  1. to assess the damage of incomplete sessions
  2. to assess the per-template coverage and variance

Do the following:

  • produce another .csv file (imported into another tab in your Excel file) that summarizes stats for each template. Each template could have the same columns as the per-session sheet, plus a column indicating how many sessions used that template. The averages would be over the sessions using the template. Variances should also be included so that we can see how much variation there was on common templates.
  • We should add yet one more tab as well summarizing similar results over the tutorial questions and (separately) the final control questions.
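A sketch of the per-template summary (the column names here are placeholders for whatever the per-session sheet actually uses):

```python
import statistics
from collections import defaultdict

def template_stats(sessions):
    """Group per-session measurements by template and report the number
    of sessions, the mean, and the variance, so that high-variance
    common templates are easy to spot."""
    groups = defaultdict(list)
    for row in sessions:
        groups[row["template"]].append(row["seconds"])
    return {
        tmpl: {"sessions": len(vals),
               "mean": statistics.mean(vals),
               "variance": statistics.pvariance(vals)}
        for tmpl, vals in groups.items()
    }

# Toy sessions; "seconds" stands in for any per-session column.
sessions = [
    {"template": "T1", "seconds": 10.0},
    {"template": "T1", "seconds": 14.0},
    {"template": "T2", "seconds": 30.0},
]
stats = template_stats(sessions)
```

Running the same grouping over the tutorial questions and (separately) the final control questions yields the two extra tabs.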


  • Implement “wait-less” ALFA
    • Vary the output model complexity to investigate tradeoffs
    • Vary the scorer model complexity
    • We might consider investigating “switch-over” points based on the results above, e.g. use the most-frequent tagger, switch to an order-2 HMM, then to an MEMM
  • More fully investigate different cost scenarios:
    • Send batch to Iran with no annotations
    • Add annotations
    • On web with and without annotations but without explicit batch (see above)
    • Real time updates using AJAX and constrained Viterbi
  • Investigate the need to tighten the “fast MaxEnt” convergence threshold
    • Verify that the threshold is sufficient to get results from a cold start at 100%
  • Investigate feature selection
    • Replace count cut-offs with something more sensitive for small data sets. UPDATE: for small data sets, no cut-offs should be fine; MaxEnt will just keep a lambda close to zero for that particular feature. This suggests a better feature selector: track features over time and remove those that remain sufficiently close to zero for sufficiently long.
  • NEW 7/16/7: Investigate P(w) and P(w,t) as QBUV-like informativeness metrics.
  • Word-at-a-time Active Learning
    • Look at the citation on Forward Entropy in our paper to find its relevance
    • infrastructure:
      • support partially-annotated sentences
    • modify decoders to allow for partially labeled data; respect tagged-word constraint(s)
      • viterbi (beam)
      • monte carlo
    • modify the learner to allow for partially labeled data; use the constrained beam decoder
    • measure the impact of a single word-label constraint on performance of both decoders
      • viterbi (beam)
      • monte carlo
    • run experiment: asking the oracle to annotate one word at a time
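A minimal sketch of the tagged-word constraint in Viterbi (toy HMM with hypothetical probabilities; the real decoders are the beam and Monte Carlo variants of this): positions the oracle has labeled are restricted to that single tag, and everything else is decoded normally.

```python
import math

def constrained_viterbi(obs, tags, trans, emit, constraints):
    """Viterbi decode, respecting annotator-fixed labels.
    constraints maps a position index to its required tag; trans/emit
    hold log-probabilities; the start distribution is left uniform."""
    def allowed(i):
        return [constraints[i]] if i in constraints else tags

    delta = {t: emit[t][obs[0]] for t in allowed(0)}
    backptrs = []
    for i in range(1, len(obs)):
        new_delta, ptr = {}, {}
        for t in allowed(i):
            prev = max(delta, key=lambda p: delta[p] + trans[p][t])
            new_delta[t] = delta[prev] + trans[prev][t] + emit[t][obs[i]]
            ptr[t] = prev
        backptrs.append(ptr)
        delta = new_delta

    tag = max(delta, key=delta.get)
    path = [tag]
    for ptr in reversed(backptrs):
        tag = ptr[tag]
        path.append(tag)
    return path[::-1]

# Toy model (log-probs); constraining position 1 changes the decode.
tags = ["N", "V"]
trans = {"N": {"N": math.log(0.3), "V": math.log(0.7)},
         "V": {"N": math.log(0.9), "V": math.log(0.1)}}
emit = {"N": {"a": math.log(0.8), "b": math.log(0.2)},
        "V": {"a": math.log(0.2), "b": math.log(0.8)}}
```

Because the constraint only prunes the allowed tag set per position, the same idea drops into the beam and Monte Carlo decoders, and it is also what measuring "the impact of a single word-label constraint" amounts to.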


  • write up paragraph or two about making the utility function explicit and using a new utility, like time or cost of annotation.
  • Include EVSI references. Add to bibliography AND draft.


  • propose an approximate EVSI for implementation
    • write up material on white-board
  • write up query-by-expected-model-improvement

Unassigned To Do

  • MEMM Stochastic??
    • While running a regular MEMM outside of active learning on 100% of the PTB, I got different results with different seeds (results below). One possible cause: the training data is shuffled by default, even when 100% of it is used. Both of these experiments were on 100% of the data. Still, I thought MEMMs were deterministic given the same data…???

Seed: 1196918229884 Model POSTagger

Training…done! (3.2913366666666666 hr(s))
Running evaluation Tag Accuracy…done! (15.734 sec(s))
Tag Accuracy: 0.9671567284570664 (Unknown Accuracy: 0.9002100840336135)
Sentence Accuracy: 0.47771173848439824
Decoder Suboptimalities Detected: 1

Seed: 1196955854728 Model POSTagger

Training…done! (2.0125825 hr(s))
Running evaluation Tag Accuracy…done! (38.203 sec(s))
Tag Accuracy: 0.9667914650108057 (Unknown Accuracy: 0.9023109243697479)
Sentence Accuracy: 0.46953937592867756
Decoder Suboptimalities Detected: 1

  • New schemes
    • retrain P(w) on ref. set instead of whole set
      • P_v1(t̄) / P_v2(t̄): when the top tag sequence (v1) is significantly more probable than the second-best (v2), there is less uncertainty.
      • Try query-by-uncertainty first; then run query-by-EVSI (or suitable approx.) on the top n (small n) from query-by-uncertainty
  • PTB
    • Prose (WSJ PTB)
      • Ref set: whole ~40K sent
      • Sweep curves to 10K annotated data
    • We need to measure the effects of active learning when we have different amounts of reference data (for example, to predict what will happen with poetry, or if we slurp in web data, etc.)
  • Poetry pain
    • different sent. length
    • BNC feature engineering
  • Constrain choices using morphology or dictionary.
  • strengthen our case for MEMMs in active learning by doing HMMs for comparison as well.
  • Named Entity Recognition using our AL framework – using CoNLL data
  • Compute entropy exactly using the forward algorithm (see Mann & McCallum, NAACL-HLT '07)
  • Other languages:
    • Japanese
    • Revive Marc's Spanish results
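The exact-entropy item above (Mann & McCallum) can be sketched as a single forward pass that carries an extra expectation-semiring quantity (toy HMM, hypothetical probabilities): alongside the usual alpha we accumulate s_i(t) = Σ p(path)·log p(path) over paths ending in tag t, and the posterior entropy falls out as H = log Z − S/Z.

```python
import math

def sequence_entropy(obs, tags, trans, emit):
    """Exact entropy of the posterior over whole tag sequences in one
    forward pass (O(n * |tags|^2) instead of enumerating all paths).
    s_i(t) = sum over paths ending in t of p(path) * log p(path);
    at the end, H = log Z - S / Z."""
    alpha = {t: emit[t][obs[0]] for t in tags}
    s = {t: alpha[t] * math.log(alpha[t]) for t in tags}
    for o in obs[1:]:
        new_a, new_s = {}, {}
        for t in tags:
            new_a[t], new_s[t] = 0.0, 0.0
            for tp in tags:
                w = trans[tp][t] * emit[t][o]
                new_a[t] += alpha[tp] * w
                new_s[t] += s[tp] * w + alpha[tp] * w * math.log(w)
        alpha, s = new_a, new_s
    z = sum(alpha.values())
    return math.log(z) - sum(s.values()) / z
```

Sanity check: with uniform transitions and emissions over two tags, every length-2 path is equally likely, so the entropy is log 4.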


  • Test James's assertion “after they correct one word of the sentence, Viterbi may be able to correct the rest of the sentence on its own”. I suspect that we could construct a clear experiment to test this hypothesis.

Other Questions and Further Research

  • What is the size of the tagset for the Spanish data? The tagset seems to indicate more of a “multi-tag” approach since it is a vector of values.
  • Do vector of tags for Spanish, Syriac.
  • Semi-supervised learning combined with active learning
  • We can do tough computation in the background (on other thread) while the user annotates
    • priority queue of best from previous round (things to annotate)
    • more general: user has queue; learner has queue
  • Really start from 0
  • query-by-approx-EVSI: full EVSI with sampling
  • QBC: Try other types of models of the members of the committee
  • QBC: Method for computing the “total” model. Is an ensemble approach any better?
  • Idea: future work - optimizing on unannotated set rather than test set.
  • Idea: portability of models to new datasets


  • George: Finish up Forward Entropy paper/Comparison of QBUE and QBU
  • George: Prepare a presentation on Voted Perceptron/ Averaged Perceptron and CRFs
  • George: bibliography on private wiki
  • Eric: give all access to private wiki
  • Eric: action list page on the wiki
  • George: upload the PDF for each paper
  • Eric: create mailing list
  • Robbie: Coordinate the creation of Subversion repository with Marc
  • Marc: find out about POS annotated poetry (BNC data is stored in the data directory as BNC.zip, Emily Dickinson data is on the way)
  • Marc: abstract query-by-x
  • Marc: Submit appropriate subset of his code
  • Marc: add “future-work” list (from 401R/581 final projects) to this action list (This was added by Robbie under Abstraction and Parameterization)
  • Robbie: share suggestions on abstraction with Marc
  • James: do asymptotic analysis (Big-O) of query-by-EVSI (full EVSI)
  • Peter: make query-by-uncertainty conform to Marc's query-by-X interface in the shared code-base
  • Everyone: check out the Alembic Workbench, Callisto (Java)
  • Eric: write up other query-by-uncertainty approaches
  • Peter: another query-by-uncertainty, with uncertainty measured by (1-max_{_t_} P(_T_=_t_)) (i.e., 1 - P(viterbi sequence))
  • Peter: approx. per sentence QBU, and weighted QBU
  • James: theory - compare EVSI and Q-by-uncert
  • James: Write QBU v. EVSI insights
  • James: Write up asymptotic analysis of EVSI
  • Marc: share results of query-set batch size experiment (10, 100, 1000 sentences) on the experiment log page
  • Marc: QBC
  • Marc: Random Baseline, multiple runs
  • George: full-sentence query-by-uncertainty, where entropy is computed using Monte Carlo sampling
  • Peter: Change ActiveLearner to do data splits online - enables randomization of all experiments; allow percentage based on words or on sentences (should be close, but possibly small variance)
  • Peter: experiment on query-set batch size (10, 100, 1000, 10K words); take whole sentences only; word count is lower bound (allow for extra words if necessary to get whole sent.)
  • George: code review of MC math
  • Peter: automate ant build and python script for running on supercomputer.
  • Peter: code optimization and abstraction
  • Eric: write and submit draft
  • Peter: Put EMNLP results on entropy
  • George: pull together quickly your existing writings and some brainstorms about what to do next with MC decoder
  • George: post results summarizing the performance of Monte Carlo tagger to estimate P(_t_ | _w_)
    • automatic search for thresholds on MC decoding that yield perf. comparable to Viterbi/beam search
    • search for thresholds on MC decoding that yield a full distribution (measured by entropy) in as little time as possible.
  • George: post results from using MC decoding in full-sentence QBU
  • Peter: Build a little Unicode Syriac display app to verify that the plumbing works on the NT data
  • Peter: Change active learner to default with 1 sentence of Initial Training
  • Peter: Consolidation / refactoring of code
  • Peter: specify Normalization, Weighting in the config. file, independently of experiment's name
  • George: estimate a word's importance by summing its uncertainty everywhere in the ref. set, and weight the word's uncertainty with this sum (alternative to “weighting by probability”)
  • George: present results on QBU with per-word importance weighting
  • George: finish a 10-entry annotated bib. on active learning
  • Peter: Syriac Reader: employ the existing word_TAG reader
  • Peter: Complete Word/Tag/Not a Tag distinction
  • Peter: Run Syriac test involving monolithic tag
  • Peter: Measure mutual information of all subtag pairs
  • Peter: Measure the number (pctg.) of tags in Syriac devtest not seen in training set
  • Peter: Experiment with PTB as unlabeled set; compute informativeness on random sub-samples
  • Old: Experimental Regimen produced four graphs:
    • x: # of labeled sentences - sentence-at-a-time
    • x: # of corrected words while labeling sentences
    • x: # of labeled words - word-at-a-time
    • x: # of corrected words while labeling words
    • Ideal: x: total cost (assuming a model of cost in time or $$)
  • George: Linear-time sequence entropy
nlp-private/active-learning-for-annotation.txt · Last modified: 2015/04/22 20:45 by ryancha