Back to Noisy OCR Group

<br />

We need to find, adapt or create an annotation tool to help us make a gold standard for document analysis and/or information extraction based on scanned document images.

<br />

__TOC__

<br />

Wish List

A list of features to keep in mind as we consider which annotation tool to choose; the names after each feature are the people who voted for it.

  • Transcribe what we annotate, to compute the WER of OCR output, to train an OCR error corrector, and to evaluate OCR error correction (see the WER sketch after this list). (Votes: Thomas, Bill)
  • Annotate coarse-grained zones (e.g. page-column-sized units). Needed to evaluate basic document image analysis projects. (Votes: Josh H.)
  • Annotate fine-grained zones (e.g. rows of a table, sentences). Needed to evaluate table analysis or sentence splitting as an important intermediate step in extracting information. (Votes: Thomas, Aaron)
  • Annotate entities. (Votes: Thomas, Joshua L.)
  • Annotate relationships, meaning named, ordered tuples of entities. (Votes: Thomas)
  • Open-ended labels (i.e. not hard-coded with genealogy-specific entity and relationship labels like “Person”, “Place”, “Date”, etc.). (Votes: Thomas)
  • Annotate sub-entities, meaning the contiguous tokens of which entities are composed. Sub-entities are valuable targets in information extraction for the downstream processes of information retrieval (searching for a name without conflating surname with given name) and record linkage (semantic-level interpretation, including record linkage, is helped by guessing that “A. Hitler” and “Hitler, Adolf” probably have the same given name). Based on Andrew Carlson's coupled bootstrapped learning, which showed that learning relations and entities simultaneously with predicate-argument-type constraints (constraints between entities and relations) limits semantic drift, Thomas proposes that an analogous constraint be used between entities and sub-entities. (Votes: Thomas)
  • Selection of smaller pieces of text to annotate larger pieces, e.g. annotating entities by selecting tokens rather than drawing rectangles. Selection can be done by dragging the mouse over contiguous tokens or by clicking on each token. Having helped create a fast annotation tool at Ancestry, Thomas recommends this approach. (Votes: Thomas, Dr. Ringger)
  • Use OCR output as the basis for painlessly selecting tokens along with their bounding box coordinates and OCR transcriptions. (Votes: Thomas, Dr. Ringger)
  • Save bounding boxes in the output XML to reference during evaluation. Bounding boxes would make our annotations OCR-engine-agnostic; relying on token IDs alone would tie all hand annotations and extractor annotations for a given gold standard to a single OCR engine. We may not need this feature if we are sure we will only ever use one OCR engine per document, e.g. if we settle on using ABBYY FineReader now and forever. (Votes: Thomas, Dr. Embley)
  • Save token IDs in the output XML to reference during evaluation. Annotated units larger than a token could have IDs based on their constituent tokens, e.g. an enumeration or a range of constituent token IDs (see the example output sketch after this list). (Votes: Thomas)
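
The first feature above implies computing word error rate (WER) from the gold transcriptions. A minimal sketch, assuming whitespace tokenization and plain Python; nothing here is tied to any particular annotation tool:

<code python>
# Word error rate between a gold transcription and an OCR hypothesis:
# word-level Levenshtein distance divided by the number of gold words.
def wer(gold: str, hypothesis: str) -> float:
    ref, hyp = gold.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative strings only: three of the six gold words are corrupted.
print(wer("John Smith b. 1872 in Provo", "John Smlth b 1872 in Prov0"))  # 0.5
</code>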
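To make the last two features concrete, here is a hypothetical fragment of the kind of output XML under discussion, along with an evaluation-side reader. The element and attribute names, bounding-box format, and image filename are illustrative assumptions, not a settled schema:

<code python>
# Hypothetical annotation output: each token carries its OCR text, an ID, and
# a bounding box; an entity references its constituent tokens by ID and also
# keeps its own box, so evaluation need not depend on a single OCR engine.
import xml.etree.ElementTree as ET

EXAMPLE = """
<page image="census_p013.png">
  <token id="t41" bbox="102,310,188,338">Hitler,</token>
  <token id="t42" bbox="196,310,270,338">Adolf</token>
  <entity type="Person" tokens="t41 t42" bbox="102,310,270,338">
    <subentity type="Surname" tokens="t41"/>
    <subentity type="GivenName" tokens="t42"/>
  </entity>
</page>
"""

page = ET.fromstring(EXAMPLE.strip())
tokens = {t.get("id"): t.text for t in page.findall("token")}
for entity in page.findall("entity"):
    words = [tokens[tid] for tid in entity.get("tokens").split()]
    print(entity.get("type"), words, entity.get("bbox"))
</code>

Storing both token IDs and bounding boxes would keep both evaluation options open: ID-based matching against a single OCR engine's output, or box-overlap matching that is OCR-engine-agnostic.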

<br />

Options

Options ordered from best to worst. More input needed.

  1. Write something simple from scratch
    • No baggage to worry about.
    • No external people or resources to coordinate with.
  2. DEG's FOCIH (probably in conjunction with the image display code Aaron has already started working on):
    • Developed here.
    • Will require adding image annotation functionality as well as many other items on the wish list.
    • Has some back-end code for annotating with respect to an arbitrary conceptual model, including entities, relations, and sub-entities (I think).
    • Has some front-end code for specifying the arbitrary conceptual model to be targeted by annotation.
    • Currently tied up in a large Java code base.
    • Potential to be integrated with conceptual modeling code.
  3. NLP's CCASH
    • Developed here.
    • Web interface.
    • Will require adding image annotation functionality as well as many other items on the wish list.
    • Greater potential of being integrated with active learning and other ML/NLP code.
  4. University of Arizona, Hong Cui's tool
  5. TrueViz
  6. Don Curtis' annotation tool (Don is an Ancestry developer).
    • Annotates given names, surnames, dates, and places within the major event categories; ties these to names; and annotates family relationships between pairs of names.
    • Does not currently annotate zones of any kind, or any information outside the genealogy domain.
    • He would be willing to extend it a little, but not enough to cover our whole wish list.
    • Written in C#.
    • They might allow us to extend it ourselves.
  7. FootNote's crowd-sourcing annotation tool.
    • Can annotate Person, Place, Date, and Text, and can probably accommodate corrected transcriptions.
    • Probably annotates only these four entity types, and we probably would not be allowed to extend it.
  8. Ancestry's production annotation tool.
    • Probably not available.
  9. LDS Church's “Internet Indexing” crowd-sourcing tool.

<br />

Discussion

  • Who should annotate?
    • Suggestion: developers and non-developers annotate the training set together, both to train the non-developers and to compute inter-annotator agreement. Non-developers label the two test sets so that developers can't cheat.
  • How many annotators for the same pages?
    • Suggestion: Two with a third for tie-breaking.
  • Should we and can we hire a non-researcher to annotate the two test sets?
    • We should ask Ancestry if they would put one or more people on this task. There's a good chance.
  • The precise role of each data set: training, dev, blind
    • The training set is for looking at and for tuning parameters and hyper-parameters.
    • The dev test set is for evaluating repeatedly during development.
    • The blind test set is reserved for the public competition (ICDAR 2011).
  • The amount of experimental data we need to process to be convincing
    • Back-of-the-envelope calculations (one of which I can show in a spreadsheet) suggest we need a sample size on the order of thousands of instances to detect 1% accuracy differences at p = 0.05, depending on how close to 50% the accuracies are (see the sample-size sketch after this list).
  • Annotation only for current projects v. annotation for future projects
  • A look at the annotation needs for each individual project
    • Thomas' project involves about four levels of nested IE/segmentation annotation, specialized for several semi-tabular record types in the new Ancestry data, as well as table detection and analysis annotation throughout the corpus.
    • If NLP or AML wants to do unstructured IE, they might also be interested in the table detection labeling as well as paragraph and sentence boundary labeling.
    • Josh Hansen might be interested in page structure annotation.
    • Dan Walker might be interested in page category annotation.
    • Check out the wiki page for more detail.
  • Annotation tools
    • The DEG lab plans to produce a fairly general-purpose image-based annotation tool.
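
A minimal sketch of the back-of-the-envelope sample-size calculation mentioned in the data-size bullet above, using the standard normal approximation for comparing two proportions. The accuracy values are illustrative assumptions:

<code python>
# Rough sample size needed to distinguish two accuracies (two-proportion
# z-test, normal approximation). Requires Python 3.8+ for NormalDist.
from statistics import NormalDist

def sample_size(p1: float, p2: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Instances needed (per system) to detect the difference between p1 and p2."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return int(numerator / (p1 - p2) ** 2) + 1

print(sample_size(0.95, 0.96))  # several thousand instances
print(sample_size(0.50, 0.51))  # closer to 40,000 when accuracies are near 50%
</code>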