Home page for Active Learning for Annotation (ALFA) Project

Resources

Action List

Move action items to the Done section when complete.

Themes of Possible Papers

  • Word vs. Sentence-level annotation
  • Human waits vs. Computer waits
  • Cost-normalized algorithm (instead of just length)
  • Benefit normalizer: some words appear more often and are hence more valuable to annotate
  • How to choose the first word/sentence
  • POS tagging when tags are vectors of features
  • Syriac results
  • Feature engineering for POS Tagging on our BNC Poetry set

Possible Venues

Pain

  • model of cost
    • should render longest-sentence selection no different from random (?)
  • granularity of annotation
  • EVSI-inspired model that works

To Do

  • We don't have the money to label all of the data we are interested in, so we will label a subset.
  • Other scenario: we will label it all but want to do it more quickly; this requires a user study to demonstrate acceleration under a reasonable model of time/cost.

Eric

  • POS tagging experiment with an MEMM: compare tagger accuracy with and without using the model to score the final transition P(</s> | …) to the stop tag.
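
For reference, a minimal sketch of the two scoring variants being compared, assuming the usual left-to-right MEMM decomposition (notation ours, not the code's). With the stop-tag transition:

$$P(t_1,\dots,t_n \mid \mathbf{w}) \;=\; \Big[\prod_{i=1}^{n} P(t_i \mid t_{i-1}, \mathbf{w})\Big]\, P(\texttt{</s>} \mid t_n, \mathbf{w})$$

Without it, the final factor is simply dropped, so sentence-final tag choices are never penalized for being unlikely stopping points.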

Peter

  • Issue for paper
    • do we focus on tagging alone
    • do we focus on tagging plus morph. analysis (together: “disambiguation”)?
  • Different combinators:
    • combining the hypotheses of independent (attribute-cluster) taggers
  • Two ways of evaluating:
    • every attribute
    • every attribute that is not N/A
  • Newest (as decided in the meeting on March 5)

  1. Take out randomization when using 100% of the data [DONE 3/5]
  2. Change feature selection in XML; check Romanized characters (only capital letters) [DONE 3/25]
  3. Compare independent subtags with count cutoff of 1 [DONE 3/25]
  4. Run monolithic with count cutoff of 1 [DONE 4/14 or so]

  • Currently working on:
    • Refactoring existing code that did POS tagging for sets of subtags.
    • I built a Hebrew Reader and split the Hebrew data we have (160,000 tokens). Need to submit this.
  • Things to do:

    • Run a most-frequent tagger for Arabic and Syriac (done for Hebrew).
    • Add Hebrew first-round results to the wiki.
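
A minimal sketch of the most-frequent-tag baseline mentioned above, assuming training data is available as sentences of (word, tag) pairs; the function names and data layout are illustrative, not the project's reader classes.

<code python>
# Most-frequent-tag baseline: tag each known word with the tag it received most
# often in training; back off to the overall most frequent tag for unknown words.
from collections import Counter, defaultdict

def train_most_frequent(tagged_sentences):
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            word_counts[word][tag] += 1
            tag_counts[tag] += 1
    lexicon = {w: counts.most_common(1)[0][0] for w, counts in word_counts.items()}
    default_tag = tag_counts.most_common(1)[0][0]
    return lexicon, default_tag

def tag_sentence(words, lexicon, default_tag):
    return [(w, lexicon.get(w, default_tag)) for w in words]

# Usage: feed it the (word, tag) sentences produced by the Hebrew/Arabic/Syriac readers.
</code>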

  • OLD (but reviewed)
  • Vectorial Tagging
    • Assume up front that the only tags considered are those that occurred in the training set; we do not consider other tags
    • perform feature extraction and maxent vectorization statically before training

  • possibly use a listener to track the behavior of fast maxent
  • The output of morphology will constrain the paths of the trie.
  • One trie for all allowable tags in each cluster (alternatively, one trie for all allowable tags in each cluster for each word)

  • Profile how many tags occur per word and how many tags in the test set did not occur in the training set for each word (min, max, avg)
    • When scoring a local trigram state, the Scorer assigns a score to each branch by using the submodels on each edge
      • To enumerate all possible monolithic tags made possible by the clustering, you must take the cross product of the tries.
    • Trie captures a static version of allowable tags (we don't need to search).
    • The order of the attributes (in the trie and the chain rule) is determined by the clustering.
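
A small sketch of the trie idea above: one trie of allowable attribute-value paths per cluster, plus a cross product of the cluster tries to enumerate the monolithic tags the clustering allows. Class and function names are illustrative assumptions, not the project's code.

<code python>
# Illustrative sketch: a trie of allowable tags (attribute-value paths) per cluster,
# and cross-product enumeration of the monolithic tags licensed by the clustering.
from itertools import product

class TagTrie:
    """Stores the attribute-value paths (tags) observed in training for one cluster."""
    def __init__(self):
        self.root = {}

    def add(self, tag):                  # tag = tuple of attribute values, in cluster order
        node = self.root
        for value in tag:
            node = node.setdefault(value, {})
        node[None] = True                # mark the end of a complete tag

    def tags(self):
        """Enumerate every complete tag stored in the trie."""
        def walk(node, prefix):
            for value, child in node.items():
                if value is None:
                    yield tuple(prefix)
                else:
                    yield from walk(child, prefix + [value])
        return list(walk(self.root, []))

def monolithic_tags(cluster_tries):
    """Cross product of the cluster tries: all monolithic tags the clustering allows."""
    return [sum(combo, ()) for combo in product(*(t.tags() for t in cluster_tries))]

# Tiny usage example with two clusters (POS and number).
pos = TagTrie(); pos.add(("NOUN",)); pos.add(("VERB",))
num = TagTrie(); num.add(("SG",)); num.add(("PL",))
print(monolithic_tags([pos, num]))   # [('NOUN', 'SG'), ('NOUN', 'PL'), ('VERB', 'SG'), ('VERB', 'PL')]
</code>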


  • Track down most recent paper on Arabic results
    • Track down Arabic data set used in most recent results [I e-mailed Dr. Lonsdale a week ago, but he hasn't responded]
    • Build an Arabic reader from that data


  • Devise a system to archive experimental results on entropy, perhaps in a way that's consistent with the directory structure on the supercomputer [waiting to be reviewed and committed]
    • Put forth a proposal on the list and ask for reactions
    • Default: don't replace existing results
      • Include “-f” option to “force” replacement explicitly
    • Write a script that puts (scp or rsync) experimental results in the right place on entropy; it should probably be invoked by the experimenter when satisfied with the results (see the sketch below).
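
A minimal sketch of such an archiving script, assuming rsync is available; the destination path on entropy and the flag names are illustrative assumptions, not the agreed convention.

<code python>
# Hypothetical archiving script: copy a results directory to the archive host.
# Without --force, rsync's --ignore-existing preserves the default of never
# replacing existing results.
import argparse, subprocess, sys

def archive(results_dir, dest="entropy:/data/alfa/results/", force=False):
    cmd = ["rsync", "-av", results_dir.rstrip("/"), dest]
    if not force:
        cmd.insert(2, "--ignore-existing")
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Archive experiment results on entropy")
    parser.add_argument("results_dir")
    parser.add_argument("-f", "--force", action="store_true",
                        help="replace existing results explicitly")
    args = parser.parse_args()
    sys.exit(archive(args.results_dir, force=args.force))
</code>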

George

  • Implement N-best decoder
    • Use that decoder for an approximation to forward entropy (see the sketch after this list)
  • run Mallet CRF (order-2) on Syriac data
    • first on 10% to see if it works correctly
  • co-code review with Peter to deal with multiple models in-memory, sub-tag taggers, etc.
    • finish Perceptron tagging with sub-tag models
      • independent sub-tag models
      • dependencies between sub-tag models
  • Search out other DCRF work that performs multiple, cascaded labeling tasks on the same sequence
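
A sketch of the N-best entropy approximation mentioned in the first item above, writing t^{(i)} for the i-th best sequence and renormalizing the N-best scores into a truncated distribution (notation ours):

$$\hat{H}(\mathbf{t} \mid \mathbf{w}) \;=\; -\sum_{i=1}^{N} \tilde{p}_i \log \tilde{p}_i, \qquad \tilde{p}_i \;=\; \frac{P(\mathbf{t}^{(i)} \mid \mathbf{w})}{\sum_{j=1}^{N} P(\mathbf{t}^{(j)} \mid \mathbf{w})}$$

This treats the N-best list as if it carried all of the probability mass, so it is only an approximation to the true sequence entropy.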

Marc

  • User study on web: quick estimate of cost of annotation
    • Conduct user study to assess sentences/words/corrections per unit time.
    • Can we find these numbers in the literature? Some are available from the PTB (ask Robbie).

To accomplish a couple of purposes:

  1. to assess the damage of incomplete sessions
  2. to assess the per-template coverage and variance

Do the following:

  • produce another .csv file (imported into another tab in your Excel file) that summarizes stats for each template. Each template could have the same columns as the per-session sheet, plus a column indicating how many sessions used that template. The averages would be over the sessions using the template. Variances should also be included so that we can see how much variation there was on common templates (see the sketch after this list).
  • We should add one more tab summarizing similar results over the tutorial questions and (separately) the final control questions.
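
A minimal sketch of the per-template summary, assuming the per-session .csv has a "template" column plus numeric stat columns; the column names here are illustrative placeholders, not the actual schema.

<code python>
# Summarize per-session stats by template: session count, mean, and variance per column.
import csv
from collections import defaultdict
from statistics import mean, pvariance

def summarize_by_template(session_csv, out_csv, stat_columns=("seconds_per_word", "corrections")):
    rows_by_template = defaultdict(list)
    with open(session_csv, newline="") as f:
        for row in csv.DictReader(f):
            rows_by_template[row["template"]].append(row)

    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["template", "num_sessions"] +
                        [f"{c}_{s}" for c in stat_columns for s in ("mean", "var")])
        for template, rows in sorted(rows_by_template.items()):
            values = {c: [float(r[c]) for r in rows] for c in stat_columns}
            writer.writerow([template, len(rows)] +
                            [x for c in stat_columns for x in (mean(values[c]), pvariance(values[c]))])
</code>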

Robbie

  • Implement “wait-less” ALFA
    • Vary the output model complexity to investigate tradeoffs
    • Vary the scorer model complexity
    • We might consider investigating "switch-over" points based on the results above, e.g., start with the most-frequent tagger, switch to an order-2 HMM, then to an MEMM
  • More fully investigate different cost scenarios:
    • Send batch to Iran with no annotations
    • Add annotations
    • On web with and without annotations but without explicit batch (see above)
    • Real time updates using AJAX and constrained Viterbi
  • Investigate the need to tighten the “fast MaxEnt” convergence threshold
    • Verify that the threshold is sufficient to get results from a cold start at 100%
  • Investigate feature selection
    • Replace count cut-offs with something more sensitive for small data sets. UPDATE: For small data sets, no cut-offs should be fine; MaxEnt will just have a lambda close to zero for that particular feature. This suggests a better feature selector: track features over time, and remove those that remain sufficiently close to zero for sufficiently long.
  • NEW 7/16/7: Investigate P(w) and P(w,t) as QBUV-like informativeness metrics.
  • Word-at-a-time Active Learning
    • Look at the citation on forward entropy in our paper to determine its relevance
    • infrastructure:
      • support partially-annotated sentences
    • modify decoders to allow for partially labeled data; respect tagged-word constraint(s) (see the constrained-beam sketch after this list)
      • viterbi (beam)
      • monte carlo
    • modify the learner to allow for partially labeled data; use the constrained beam decoder
    • measure the impact of a single word-label constraint on performance of both decoders
      • viterbi (beam)
      • monte carlo
    • run experiment: asking the oracle to annotate one word at a time
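
A minimal sketch of a beam decoder that respects tagged-word constraints, as discussed in the word-at-a-time items above; the scoring interface and names are illustrative assumptions, not the project's decoder API.

<code python>
# Beam decoder that fixes annotator-supplied labels and searches over the rest.
import math

def constrained_beam_decode(words, tagset, score, constraints=None, beam_size=5):
    """
    words:       list of tokens
    tagset:      list of candidate tags
    score(prev_tag, tag, words, i): log-score of assigning `tag` at position i given `prev_tag`
    constraints: dict {position: tag} of labels already supplied by the annotator
    Returns the highest-scoring tag sequence that agrees with every constraint.
    """
    constraints = constraints or {}
    beam = [(0.0, [])]                     # each entry: (log score, partial tag sequence)
    for i, _ in enumerate(words):
        candidates = [constraints[i]] if i in constraints else tagset
        extended = []
        for logp, seq in beam:
            prev = seq[-1] if seq else "<s>"
            for tag in candidates:
                extended.append((logp + score(prev, tag, words, i), seq + [tag]))
        beam = sorted(extended, key=lambda x: x[0], reverse=True)[:beam_size]
    return beam[0][1]

# Toy usage with a uniform scorer and one annotator-supplied label at position 0.
if __name__ == "__main__":
    uniform = lambda prev, tag, words, i: math.log(0.5)
    print(constrained_beam_decode(["the", "dog"], ["DET", "NOUN"], uniform, constraints={0: "DET"}))
</code>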

Kevin

  • write up a paragraph or two about making the utility function explicit and using a new utility, like time or cost of annotation.
  • Include EVSI references. Add to bibliography AND draft.

James

  • propose an approximate EVSI for implementation
    • write up material on white-board
  • write up query-by-expected-model-improvement

Unassigned To Do

  • MEMM stochastic??
    • While running a regular MEMM outside of active learning on 100% of the PTB, I got different results with different seeds (shown below). One possible cause is that the training data is shuffled by default, even when 100% of it is used; both of these experiments were on 100% of the data. Still, I thought MEMMs were deterministic given the same data…???

Seed: 1196918229884, Model: POSTagger

  • Training time: 3.2913366666666666 hr(s); Tag Accuracy evaluation: 15.734 sec(s)
  • Tag Accuracy: 0.9671567284570664 (Unknown Accuracy: 0.9002100840336135)
  • Sentence Accuracy: 0.47771173848439824; Decoder Suboptimalities Detected: 1

Seed: 1196955854728, Model: POSTagger

  • Training time: 2.0125825 hr(s); Tag Accuracy evaluation: 38.203 sec(s)
  • Tag Accuracy: 0.9667914650108057 (Unknown Accuracy: 0.9023109243697479)
  • Sentence Accuracy: 0.46953937592867756; Decoder Suboptimalities Detected: 1

  • New schemes
    • retrain P(w) on ref. set instead of whole set
      • Compare P_{v1}(t) / P_{v2}(t), the ratio of the probabilities of the best and second-best hypotheses: when the top hypothesis is significantly more probable than the second, there is less uncertainty (see the formula after this list).
      • Try query-by-uncertainty first; then run query-by-EVSI (or a suitable approximation) on the top n (small n) from query-by-uncertainty
  • PTB
    • Prose (WSJ PTB)
      • Ref set: whole ~40K sent
      • Sweep curves to 10K annotated data
    • We need to measure the effects of active learning when we have different amounts of reference data (for example, to predict what will happen with poetry, or if we slurp in web data, etc.)
  • Poetry pain
    • different sent. length
    • BNC feature engineering
  • Constrain choices using morphology or dictionary.
  • strengthen our case for MEMMs in active learning by doing HMMs for comparison as well.
  • Named Entity Recognition using our AL framework – using CoNLL data
  • Compute entropy exactly using the forward algorithm (see Mann & McCallum, NAACL-HLT '07)
  • Other languages:
    • Japanese
    • Revive Marc's Spanish results
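
The hypothesis-ratio uncertainty from the "New schemes" item above, written with t^{(1)} and t^{(2)} for the best and second-best tag sequences (notation ours):

$$r(\mathbf{w}) \;=\; \frac{P(\mathbf{t}^{(1)} \mid \mathbf{w})}{P(\mathbf{t}^{(2)} \mid \mathbf{w})}$$

A ratio near 1 means the decoder cannot separate its top two hypotheses, so the sentence is a good candidate query; a large ratio means the top hypothesis dominates and the sentence is less informative.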


  • Test James's assertion “after they correct one word of the sentence, Viterbi may be able to correct the rest of the sentence on its own”. I suspect that we could construct a clear experiment to test this hypothesis.

Other Questions and Further Research

  • What is the size of the tagset for the Spanish data? The tagset seems to indicate more of a “multi-tag” approach since it is a vector of values.
  • Do vector of tags for Spanish, Syriac.
  • Semi-supervised learning combined with active learning
  • We can do tough computation in the background (on another thread) while the user annotates
    • priority queue of best from previous round (things to annotate)
    • more general: user has queue; learner has queue
  • Really start from 0
  • query-by-approx-EVSI: full EVSI with sampling (see the sketch after this list)
  • QBC: Try other types of models of the members of the committee
  • QBC: Method for computing the “total” model. Is an ensemble approach any better?
  • Idea: future work - optimizing on unannotated set rather than test set.
  • Idea: portability of models to new datasets
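
A sketch of the sampled EVSI approximation behind the "query-by-approx-EVSI" item above, assuming U(·) is whatever utility the learner optimizes (e.g., accuracy on the reference set), D is the current labeled data, and θ_D is the model trained on it; the expectation over the unknown label is approximated by sampling K labels from the current model (notation ours):

$$\widehat{\mathrm{EVSI}}(x) \;=\; \frac{1}{K}\sum_{k=1}^{K} U\!\big(\theta_{D \,\cup\, \{(x,\,y_k)\}}\big) \;-\; U(\theta_D), \qquad y_k \sim P_{\theta_D}(y \mid x)$$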

Done

  • George: Finish up Forward Entropy paper/Comparison of QBUE and QBU
  • George: Prepare a presentation on Voted Perceptron/ Averaged Perceptron and CRFs
  • George: bibliography on private wiki
  • Eric: give all access to private wiki
  • Eric: action list page on the wiki
  • George: upload the PDF for each paper
  • Eric: create mailing list
  • Robbie: Coordinate the creation of Subversion repository with Marc
  • Marc: find out about POS annotated poetry (BNC data is stored in the data directory as BNC.zip, Emily Dickinson data is on the way)
  • Marc: abstract query-by-x
  • Marc: Submit appropriate subset of his code
  • Marc: add “future-work” list (from 401R/581 final projects) to this action list (This was added by Robbie under Abstraction and Parameterization)
  • Robbie: share suggestions on abstraction with Marc
  • James: do asymptotic analysis (Big-O) of query-by-EVSI (full EVSI)
  • Peter: make query-by-uncertainty conform to Marc's query-by-X interface in the shared code-base
  • Everyone: check out the Alembic Workbench, Callisto (Java)
  • Eric: write up other query-by-uncertainty approaches
  • Peter: another query-by-uncertainty, with uncertainty measured by (1-max_{_t_} P(_T_=_t_)) (i.e., 1 - P(viterbi sequence))
  • Peter: approx. per sentence QBU, and weighted QBU
  • James: theory - compare EVSI and Q-by-uncert
  • James: Write QBU v. EVSI insights
  • James: Write up asymptotic analysis of EVSI
  • Marc: share results of query-set batch size experiment (10, 100, 1000 sentences) on the experiment log page
  • Marc: QBC
  • Marc: Random Baseline, multiple runs
  • George: full-sentence query-by-uncertainty, where entropy is computed using Monte Carlo sampling
  • Peter: Change ActiveLearner to do data splits online - enables randomization of all experiments; allow percentage based on word or based on sentence (should be close, but possibly small variance)
  • Peter: experiment on query-set batch size (10, 100, 1000, 10K words); take whole sentences only; word count is lower bound (allow for extra words if necessary to get whole sent.)
  • George: code review of MC math
  • Peter: automate ant build and python script for running on supercomputer.
  • Peter: code optimization and abstraction
  • Eric: write and submit draft
  • Peter: Put EMNLP results on entropy
  • George: pull together quickly your existing writings and some brainstorms about what to do next with MC decoder
  • George: post results summarizing the performance of Monte Carlo tagger to estimate P(_t_ | _w_)
    • automatic search for thresholds on MC decoding that yield perf. comparable to Viterbi/beam search
    • search for thresholds on MC decoding that yield a full distribution (measured by entropy) in as little time as possible.
  • George: post results from using MC decoding in full-sentence QBU
  • Peter: Build a little Unicode Syriac display app to verify that the plumbing works on the NT data
  • Peter: Change active learner to default with 1 sentence of Initial Training
  • Peter: Consolidation / refactoring of code
  • Peter: specify Normalization, Weighting in the config. file, independently of experiment's name
    • George: estimate word's importance by summing up its uncertainty everywhere in the ref set., and weight the uncertainty of the word with this sum (alternative to “weighting by probability”)
  • George: present results on QBU with per-word importance weighting
  • George: finish a 10-entry annotated bib. on active learning
  • Peter: Syriac Reader: employ the existing word_TAG reader
  • Peter: Complete Word/Tag/Not a Tag distinction
  • Peter: Run Syriac test involving monolithic tag
  • Peter: Measure mutual information of all subtag pairs
  • Peter: Measure the number (pctg.) of tags in Syriac devtest not seen in training set
  • Peter: Experiment with PTB as unlabeled set; compute informativeness on random sub-samples
  • Old: Experimental Regimen produced four graphs:
    • x: # of labeled sentences - sentence-at-a-time
    • x: # of corrected words while labeling sentences
    • x: # of labeled words - word-at-a-time
    • x: # of corrected words while labeling words
    • Ideal: x: total cost (assuming a model of cost in time or $$)
  • George: Linear-time sequence entropy