17 Oct 2007

all words are now rare

baseline (inspired by bug): train model at round 1 (on batch 1) and use it across the experiment - essentially longest sentence

alternative: stop updating model at round x - (similar to switching to random selection at round x) - for some sufficiently large x, should see no disadvantage

issue: cost of waiting for computer to select next sample

idea: time-limited sample selection (use stale scores, if time doesn't permit more work) - time limit could be fixed const. or determined by completion of annotator's work unit.
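
A minimal sketch of the time-limited selection idea, assuming a fixed time budget per selection step. The names `select_next`, `score_fn`, and `stale_scores` are hypothetical and not taken from the project code; candidates that are not rescored before the deadline simply keep their stale score.

```python
import time


def select_next(candidates, stale_scores, score_fn, time_limit_s,
                clock=time.monotonic):
    """Pick the highest-scoring candidate within a time budget.

    Candidates are rescored one at a time until the budget runs out;
    any candidate not reached keeps its stale score from an earlier round.
    """
    deadline = clock() + time_limit_s
    scores = dict(stale_scores)  # start from the stale scores
    for item in candidates:
        if clock() >= deadline:
            break  # out of time: remaining items keep stale scores
        scores[item] = score_fn(item)
    # Assumption: a candidate with no score at all gets lowest priority.
    return max(candidates, key=lambda c: scores.get(c, float("-inf")))
```

To tie the limit to the completion of the annotator's work unit instead of a fixed constant, the `clock() >= deadline` check could be swapped for a check on a "human is done" flag.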

measure cumulative time of experiments

idea: most common words per length of sentence

oracle labeled data split into training and test.

19 Feb. 2008 ALFA Notes from whiteboard

  • Multiple annotators/annotations
  • Simple model w/better selection (at first)
  • More look-ahead w/ simple model
  • Interactive constrained Viterbi (cost perspective)
  • 3 processes
  • Proper initialization of prior – convergence of fast maxent (Acero & Chelba)
  • QBC:
    • Committee selection
    • Disagreement metric
    • Size
  • Randomize unannotated set every round
  • Profile
  • Net cost = gross cost – annotation time
  • For 1 cycle of a.l.
    • A. both idle (have this)
    • Human idle
    • No-one idle
    • Computer idle
  • Candidate set size
  • Data/results “commit”
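
A rough sketch of the QBC items above, using vote entropy as the disagreement metric; this is one common choice, not necessarily the metric the project settled on. The committee is modeled as a list of callables, and `draw_candidates` illustrates redrawing a fixed-size candidate set from the unannotated pool each round. All names here are hypothetical.

```python
import math
import random
from collections import Counter


def vote_entropy(votes):
    """Entropy of the committee's votes for one item (0 means full agreement)."""
    n = len(votes)
    counts = Counter(votes)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def draw_candidates(pool, size, rng):
    """Draw a fresh random candidate set from the unannotated pool."""
    return rng.sample(pool, min(size, len(pool)))


def select_by_committee(candidates, committee):
    """Return the candidate on which the committee disagrees most."""
    return max(candidates,
               key=lambda x: vote_entropy([member(x) for member in committee]))
```

Committee size and candidate set size are the two knobs: both trade selection quality against the time the annotator spends waiting.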

  Q_HC     Q_CH
  0        0
  \infty   0
  \infty   \infty
  0        \infty

  • First two rows: pay when human waits; no pay when human waits

Diagram: edge from H to C and from C to H. Edges annotated by Q_HC and Q_CH, respectively


6/6/2008 Syriac project meeting and 6/12/2008 Morph. Tagging project meeting


  • 1 character per prefix? yes
  • more than one prefix? 0-3
  • is a single separator ambiguous in some cases? use 2: one for prefix and one for suffix
  • multi-purpose interface: active learning, browsing, review for proofing
  • are we tagging prefixes or just segmenting them?

Action:

  • Brandon: clicking on side words updates attribute-value box's label
  • Peter: the prefix string is a value tagged by the Syriac tagger. shouldn't be.
  • Eric: follow up with Harry Diakoff

  • Auto-complete in little-language box
  • Support keyboard-only data entry
  • Update public site for ALFA project

6/13/2008 Syriac project meeting

Prefixes:


  • dolath (d)
  • lamad (l)
  • waw (conj)
  • prepositions

Suffixes: small, finite number

Idea: Layered / Prioritized Tags

  • tool should support full expressivity (full tag set)
  • tool should be configurable to hide layers (reduced tag set)
  • Configurable: may later want other layers of annotation, other distinctions

Important: supporting linkage of stems to headwords (“lexemes”) in the dictionary


  • Active learning
  • Review


  • Reveal context: tool-tips should reveal attributes on more distant tokens in any text view
  • configurable amount of context
  • situated in the corpus, in a document
  • ability to browse files: explorer right on the left
  • ability to browse dictionary: dictionary pane on the right
  • ability to link stems to lexemes in the dictionary
  • ability to add to dictionary with pointer(s) back to corpus for examples

Roles for annotators:

  • editors of dictionary
  • non-editors - can only propose new entries

24 June 2008 ALFA project meeting

Questions / variables for user study:

  • scope of jumps taken by active learner / measure the cost of context switching
    • article
    • genre
    • time period
    • author
    • corpus
  • availability of lexicon
  • how much context
    • sentence, phrase, QWIC, etc.
  • granularity of annotation:
    • sentence
    • phrase
    • word
    • word sub-tag
  • correct or annotated from scratch
  • presentation of top-N model hypotheses
  • order of forced annotation: step-by-step in order or jumping (per AL)

Research question:

  • 1st item choice
  • online cost model estimation
  • which model to select data for user study
    • minimize bias of item selector for user study
  • layers of annotation

Separate problem: vowel restoration


  • Hebrew OOV investigation

Interface modes:

  1. active learning
    • machine determines granularity: sentence, phrase, word, sub-tag
  2. review mode: sequential order
  3. review mode: arbitrary order
  4. review mode: AL order
  5. browse (no changes)

Date unknown (prior to 8 July 2008)

Projects to complete:

  • User study for Syriac
  • Simulating AL for Syriac
  • Cost model on the fly
  • Get Habash/Rambow data for Arabic

8 July 2008 ALFA project meeting

Proposal to Harry Diakoff

  • Machine learning
  • Annotation
  • could be joint with BYU Classics or with Perseus Project
  • 2 pages

Paper ideas:

  • Wait for it! Cost/benefit trade-offs in waiting
  • probability of datum - later (15 July 08 ?) decided to be unpromising based on prior experiments and discussion
  • utility / loss as part of active learner
  • multiple annotators / imperfect annotators
  • Cost model on the fly
  • Cost implications of error correction propagation
  • see Culotta & McCallum
  • another point on the correction vs. from scratch spectrum
  • Greedy EVSI
  • Particle filters for Bayesian models in AL

17 July 2008 Morph Tagging project meeting

Ask Ivan @ MS about Win Server licenses

Re-engage with Marc Carmen

Paul & Brandon do web-based prototype

  • Paul: GWT & JSF
  • Brandon: ASP & JSF

Eric: architecture for client/server set-up in Visio diagram

18 July 2008 Syriac project meeting

Features for prototype

  • Inspect lexical entry
  • Nestorian font
  • Font size

Features for review mode in prototype

  • highlight word in top line
  • single line of review cells
  • divider bar separating text (top) from review cells (bottom); ability to move to create more or fewer lines of review cells
  • allowance for multiple annotations
  • ability to change
  • indicator: tagged by human (blue) or machine (yellow)
  • reveal auto tag. on demand
  • reviewer can acknowledge blue and yellow tags
    • turns green
    • retrains model, as appropriate
  • reveal levels of confidence on yellow cells - little confidence bars?
  • editing lens
  • progress tracker
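
One way to picture the blue/yellow/green indicator is as a small state attached to each review cell. The sketch below is illustrative only: `TagStatus` and `ReviewCell` are hypothetical names, and the rule that only a confirmed machine tag triggers retraining is an assumption about what "as appropriate" means.

```python
from dataclasses import dataclass
from enum import Enum


class TagStatus(Enum):
    HUMAN = "blue"           # tagged by a human
    MACHINE = "yellow"       # tagged by the machine
    ACKNOWLEDGED = "green"   # confirmed by a reviewer


@dataclass
class ReviewCell:
    token: str
    tag: str
    status: TagStatus

    def acknowledge(self):
        """Mark the cell as reviewed and report whether retraining is suggested.

        Assumption: a confirmed machine tag is new evidence for the model,
        while a confirmed human tag is already in the training data.
        """
        was_machine = self.status is TagStatus.MACHINE
        self.status = TagStatus.ACKNOWLEDGED
        return was_machine
```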

Features for active learning mode in prototype:

  • “next” button should be “done” button when word-at-a-time
  • sent.-at-a-time: previous word, next word, done
  • “back” button to return to previous case
  • highlight
  • hide file browser
  • remove review boxes above and below the “lens” row
  • add path (corpus -> author -> doc) at top of view for context
  • place the highlighted word in the middle
  • use yellow & blue highlights on text in text view and in edit controls
  • constraint and prediction
  • changing attribute affects lexeme and vice versa
  • segment, then constrain lexicon, then attributes

Features for browse mode in prototype

  • tool tips on every word
  • link to dictionary

Features for all modes in prototype

  • annotator should be able to flag a transcription as possibly erroneous
  • allow viewing of image?

22 July 2008 Combined ALFA / Morph. Tagging project meetings

Focus: knowledge-free

Perspective: cost reduction

  • compare machine learning (data-driven approach) with knowledge engineering

Partial solutions:

  • morph. tagging (as we have defined it - one vector of attributes per token)
  • segmentation only
  • vowel restoration

Whole solution:

  • tag + segment
  • look up in / link to dictionary
  • required for Syriac project

Reminders on Syriac tagger:

  • Re-do without string prediction
  • Remove vowels and re-do


  • feature sets for predictive segmentation

  1. predict # of characters in prefix, suffix
  2. choose among letter sequence as prefix
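
A minimal sketch of the candidate space for predictive segmentation, assuming the earlier decisions hold (one character per prefix, 0-3 prefixes, separate separators for prefix and suffix). The function name, the separator symbols `+` and `-`, and the `max_suffix` bound are assumptions for illustration; the notes only say the suffix set is small and finite.

```python
def candidate_segmentations(word, max_prefix=3, max_suffix=2,
                            prefix_sep="+", suffix_sep="-"):
    """Enumerate (prefix_len, suffix_len, segmented_form) candidates for a word.

    Each prefix is a single character; at least one stem character is kept.
    """
    out = []
    for p in range(max_prefix + 1):
        for s in range(max_suffix + 1):
            if p + s >= len(word):
                continue  # would leave an empty stem
            prefix = word[:p]
            stem = word[p:len(word) - s]
            suffix = word[len(word) - s:]
            segmented = "".join(ch + prefix_sep for ch in prefix) + stem
            if suffix:
                segmented += suffix_sep + suffix
            out.append((p, s, segmented))
    return out
```

A classifier would then predict the pair of lengths (option 1) or score the rendered candidates directly (option 2).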

nlp-private/2008-meeting-notes.txt · Last modified: 2015/04/23 14:46 by ryancha