nlp-private:image-annotation-tools

We need to find, adapt or create an annotation tool to help us make a gold standard for document analysis and/or information extraction based on scanned document images.

__TOC__

Wish List

A list of features to keep in mind as we consider which annotation tool to choose.

Feature	Votes
Transcribe what we annotate, to compute WER of OCR, to train OCR error corrector, and to evaluate OCR error correction.	Thomas, Bill
Annotate coarse grained zones (e.g. page column sized units). Needed to evaluate basic document image analysis projects.	Josh H.
Annotate fine grained zones (e.g. rows of a table, sentences). Needed to evaluate table analysis or sentence splitting as an important intermediate step in extracting information.	Thomas, Aaron
Annotate entities.	Thomas, Joshua L.
Annotate relationships, meaning named, ordered tuples of entities.	Thomas,
Open-ended labels (e.g. not hard-coded with genealogy-specific entity and relationship labels like “Person”, “Place”, “Date”, etc.).	Thomas,
Annotate sub-entities, meaning contiguous tokens of which entities are composed. Sub-entities are valuable targets in information extraction for the downstream processes of information retrieval (searching for a name without conflating surname with given name) and record linkage (semantic-level interpretation, including record linkage, is helped by guessing that “A. Hitler” and “Hitler, Adolf” probably have the same given name). Andrew Carlson's coupled bootstrapped learning showed that learning relations and entities simultaneously, using predicate-argument-type constraints between entities and relations, limits semantic drift. Based on this, Thomas proposes that an analogous constraint be used between entities and sub-entities.	Thomas,
Selection of smaller pieces of text to annotate larger pieces. E.g. we can annotate entities by selecting tokens, as opposed to drawing rectangles. Selection can be done by dragging a mouse over contiguous tokens or by clicking on each token. Having helped create a fast annotation tool at Ancestry, Thomas recommends this approach.	Thomas, Dr. Ringger
Use OCR output as the basis of painlessly selecting tokens along with their bounding box coordinates and OCR transcriptions.	Thomas, Dr. Ringger
Save bounding boxes in output XML to reference during evaluation. Bounding boxes would make our annotations OCR-engine-agnostic. Using IDs will tie us to using just one OCR engine for all hand annotations and extractor annotations for a given gold standard reference. We may not need this feature if we are sure we will only ever use one OCR engine per document, e.g. if we settle on using Abbyy FineReader now and forever.	Thomas, Dr. Embley
Save token IDs in output XML to reference during evaluation. Annotated units larger than a token could have IDs based on their constituent tokens, e.g. an enumeration or a range of constituent token IDs.	Thomas
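The first wish-list item mentions computing the WER of OCR against what we transcribe. As a minimal sketch of that metric, assuming whitespace tokenization and the usual definition (word-level edit distance divided by the number of reference words), something like the following could be used. The function name and the sample strings are illustrative only and do not come from any existing tool.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    `reference` is the hand transcription, `hypothesis` is the OCR output.
    Tokenization by whitespace is an assumption of this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One of four reference words is misrecognized by the (made-up) OCR output.
    print(word_error_rate("John Smith born 1850", "John Srnith born 1850"))
```

Because insertions count as errors, the rate can exceed 1.0 when the OCR output contains many spurious tokens.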

Options

Options ordered from best to worst. More input needed.

Write something simple from scratch
- No baggage to worry about.
- No external people or resources to coordinate with.
DEG's FOCIH (probably in conjunction with the image display code Aaron has already started working on):
- Developed here.
- Will require adding image annotation functionality as well as many other items on the wish list.
- Has some back-end code for annotating with respect to arbitrary conceptual model, including entities, relations, sub-entities (I think).
- Has some front-end code for specifying the arbitrary conceptual model target for annotation.
- Currently tied up in a large Java code base.
- Potential to be integrated with conceptual modeling code.
NLP's CCASH
- Developed here.
- Web interface.
- Will require adding image annotation functionality as well as many other items on the wish list.
- Greater potential of being integrated with active learning and other ML/NLP code.
University of Arizona, Hong Cui's tool
- Just starting this big project including developing an annotation tool for an IE project very similar to ours, but in the bioinformatics domain.
- Overview of the bioinformatics annotation and unsupervised information extraction project: ://sites.google.com/site/biosemanticsproject/
- Annotation discussion: ://sites.google.com/site/biosemanticsproject/character-annotation-discussions. Note that “character” means “physical attribute” of some organism in this domain.
TrueViz
- Download: ://www.kanungo.com/software/software.html#trueviz
- Paper: ://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAsQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.32.7739%26rep%3Drep1%26type%3Dpdf&ei=KXOVS56JCoPetgPOka3hBg&usg=AFQjCNHGFIskGU9qzWBFQyxNfyQM0S9e9Q&sig2=YdUEhqEfvuVEHRks_k_iAg
- By Kanungo. Open source Java with a history of being extended.
- Already functioning for annotating characters, tokens, lines, coarse zones for document analysis applications.
- Annotations are rectangles or arbitrary polygons for “entities” and ordered line segments for logical relations among these “entities”.
- Would need to be extended to handle our domain-specific entities and relations.
- Probably needs to be extended to take OCR transcriptions as input as the basis for labeling entities with respect to existing tokens.
- Annotation tree view can get complex.
Don Curtis' (Ancestry developer's) annotation tool.
- Annotates given name, surname, dates and places within the major event categories, ties these to names, and family relationships between two names.
- Does not currently annotate zones of any kind, or any information outside the genealogy domain.
- He would be willing to extend it a little, but not to include all our wish list.
- Written in C#.
- They might allow us to extend it ourselves.
FootNote's crowd-sourcing annotation tool.
- http://www.footnote.com/tour/
- Can annotate Person, Place, Date, Text, and can probably allow for corrected transcriptions.
- Probably only annotates these four entity types and nothing else, and we probably would not be allowed to extend it.
Ancestry's production annotation tool.
- Probably not available.
LDS Church's “Internet Indexing” crowd-sourcing tool.
- Probably not available.
- ://indexing.familysearch.org/newuser/nuhome.jsf ://www.familysearch.org/eng/indexing/frameset_indexing.asp

Discussion:

Who should annotate?
- Suggestion: developers and non-developers annotate the training set together, both to train the non-developers and to compute inter-annotator agreement. Non-developers label the two test sets so developers don't cheat.
How many annotators for the same pages?
- Suggestion: Two with a third for tie-breaking.
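The suggestions above mention inter-annotator agreement and a third annotator for tie-breaking. As a rough sketch, assuming each annotator assigns one label per item and that Cohen's kappa is an acceptable agreement measure (the page does not say which measure to use), the two ideas could look like this. The function names and the sample labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def adjudicate(label_a, label_b, tie_breaker_label):
    """Keep the label when the first two annotators agree; otherwise defer to the third."""
    return label_a if label_a == label_b else tie_breaker_label


if __name__ == "__main__":
    annotator_1 = ["Person", "Place", "Date", "Person"]
    annotator_2 = ["Person", "Place", "Person", "Person"]
    print(cohens_kappa(annotator_1, annotator_2))
    print(adjudicate("Date", "Person", "Date"))
```

In this scheme the third annotator only needs to look at the items where the first two disagree, which keeps the extra annotation cost proportional to the disagreement rate.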
Should we and can we hire a non-researcher to annotate the two test sets?
- We should ask Ancestry if they would put one or more people on this task. There's a good chance.
The precise role for test sets: training, dev, blind
- Training set is for looking at and tuning parameters and hyper-parameters.
- Dev test set is for evaluating multiple times.
- Blind test set is reserved for the public competition (ICDAR-2011).
The amount of experimental data we need to process to be convincing
- http://www.surveysystem.com/sscalc.htm
- That and another method I can show in a spreadsheet suggest we need a sample size on the order of thousands of instances to detect accuracy differences of about 1% at p = 0.05 (depending on how close to 50% the accuracies are).
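To make the order-of-magnitude claim concrete, here is a small sketch using the standard margin-of-error formula for a proportion, n = z² · p(1 − p) / e², which is the kind of calculation that sample size calculators typically perform. This is an assumption about the method; it is not necessarily the spreadsheet method mentioned above, and comparing two systems against each other would call for a proper two-sample test. The function name and parameters are illustrative.

```python
import math
from statistics import NormalDist


def sample_size(accuracy: float, margin: float, confidence: float = 0.95) -> int:
    """Instances needed so the confidence interval around `accuracy`
    is no wider than plus or minus `margin` (normal approximation)."""
    if not 0.0 < accuracy < 1.0 or margin <= 0.0:
        raise ValueError("accuracy must be in (0, 1) and margin must be positive")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z * z * accuracy * (1 - accuracy) / (margin * margin))


if __name__ == "__main__":
    # Worst case: accuracy near 50%, plus or minus one percentage point.
    print(sample_size(0.5, 0.01))
    # Fewer instances are needed as accuracy moves away from 50%.
    print(sample_size(0.9, 0.01))
```

The required size shrinks as the accuracy moves away from 50%, which matches the caveat in the bullet above.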
Annotation only for current projects v. annotation for future projects
A look at the annotation needs for each individual project
- Thomas' project involves about four levels of nested IE/segmentation annotation, specialized for several semi-tabular record types in the new Ancestry data, as well as table detection and analysis annotation throughout the corpus.
- If NLP or AML wants to do unstructured IE, they might also be interested in the table detection labeling as well as paragraph and sentence boundary labeling.
- Josh Hansen might be interested in page structure annotation.
- Dan Walker might be interested in page category annotation.
- Check out the wiki page for more detail.
Annotation tools
- The DEG lab plans to produce a fairly general-purpose image-based annotation tool.