
<br />

We need to find, adapt or create an annotation tool to help us make a gold standard for document analysis and/or information extraction based on scanned document images.

<br />

__TOC__

<br />

Wish List

A list of features to keep in mind as we consider which annotation tool to choose.

Feature | Votes
Transcribe what we annotate, so that we can compute the word error rate (WER) of OCR, train an OCR error corrector, and evaluate OCR error correction. | Thomas, Bill
Annotate coarse-grained zones (e.g. page-column-sized units). Needed to evaluate basic document image analysis projects. | Josh H.
Annotate fine-grained zones (e.g. rows of a table, sentences). Needed to evaluate table analysis or sentence splitting as an important intermediate step in extracting information. | Thomas, Aaron
Annotate entities. | Thomas, Joshua L.
Annotate relationships, meaning named, ordered tuples of entities. | Thomas
Open-ended labels (i.e. not hard-coded with genealogy-specific entity and relationship labels like “Person”, “Place”, “Date”, etc.). | Thomas
Annotate sub-entities, meaning contiguous tokens of which entities are composed. Sub-entities are valuable targets in information extraction for the downstream processes of information retrieval (searching for a name without conflating surname with given name) and record linkage (semantic-level interpretation, including record linkage, is helped by guessing that “A. Hitler” and “Hitler, Adolf” probably have the same given name). Andrew Carlson's coupled bootstrapped learning showed that learning relations and entities simultaneously, with predicate-argument type constraints between them, limits semantic drift. By analogy, Thomas proposes that a similar constraint be used between entities and sub-entities. | Thomas
Select smaller pieces of text to annotate larger pieces, e.g. annotate entities by selecting tokens rather than by drawing rectangles. Selection can be done by dragging the mouse over contiguous tokens or by clicking on each token. Having helped create a fast annotation tool at Ancestry, Thomas recommends this approach. | Thomas, Dr. Ringger
Use OCR output as the basis for painlessly selecting tokens along with their bounding box coordinates and OCR transcriptions. | Thomas, Dr. Ringger
Save bounding boxes in the output XML to reference during evaluation. Bounding boxes would make our annotations OCR-engine-agnostic, whereas relying on token IDs alone would tie us to a single OCR engine for all hand annotations and extractor annotations for a given gold standard reference. We may not need this feature if we are sure we will only ever use one OCR engine per document, e.g. if we settle on ABBYY FineReader now and forever. | Thomas, Dr. Embley
Save token IDs in the output XML to reference during evaluation. Annotated units larger than a token could have IDs based on their constituent tokens, e.g. an enumeration or a range of constituent token IDs. | Thomas
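
To make the last few wish-list items concrete, here is a minimal sketch of how tokens, entities, sub-entities and relationships could be represented in memory. Everything in it is an assumption for illustration only: the class names (Token, Span, Relation), the helper functions (compact_ids, union_bbox), the pixel-based (left, top, right, bottom) box layout, and the sample name and labels are all hypothetical, and none of it describes the format of any existing tool.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed layout: (left, top, right, bottom) in image pixel coordinates.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Token:
    token_id: int  # ID taken from one particular OCR engine's output
    text: str      # OCR transcription of the token
    bbox: BBox


@dataclass
class Span:
    label: str            # open-ended label, e.g. "Person" or "Surname"
    token_ids: List[int]  # constituent tokens
    parts: List["Span"] = field(default_factory=list)  # sub-entities, if any


@dataclass
class Relation:
    name: str
    arguments: List[Span]  # ordered tuple of entities


def compact_ids(token_ids: List[int]) -> str:
    """Render token IDs as ranges where possible, e.g. [3, 4, 5, 9] -> "3-5,9"."""
    parts: List[str] = []
    start = prev = None
    for i in sorted(set(token_ids)):
        if start is None:
            start = prev = i
        elif i == prev + 1:
            prev = i
        else:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = prev = i
    if start is not None:
        parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


def union_bbox(tokens: List[Token]) -> BBox:
    """Smallest box covering all given tokens; usable without any token IDs."""
    lefts, tops, rights, bottoms = zip(*(t.bbox for t in tokens))
    return (min(lefts), min(tops), max(rights), max(bottoms))


# Hypothetical OCR tokens for the text "Smith, John b. 1842".
tokens = [
    Token(7, "Smith,", (100, 50, 160, 70)),
    Token(8, "John", (166, 51, 205, 71)),
    Token(9, "b.", (212, 50, 228, 70)),
    Token(10, "1842", (234, 50, 280, 70)),
]
by_id = {t.token_id: t for t in tokens}

surname = Span("Surname", [7])
given = Span("GivenName", [8])
person = Span("Person", [7, 8], parts=[surname, given])
date = Span("Date", [10])
born = Relation("BornOn", [person, date])

print(compact_ids(person.token_ids))
print(union_bbox([by_id[i] for i in person.token_ids]))
```

Storing both the token-ID range and the union bounding box for each annotation would let an evaluator match on IDs when the same OCR engine was used, and fall back to box overlap when it was not.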

<br />
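
The transcription item in the wish list mentions computing the WER of OCR against hand transcriptions. As a rough sketch of what that evaluation involves, the function below computes word error rate as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the reference transcription. The function name and the whitespace tokenization are assumptions made for illustration; a real evaluation would need to settle on how to handle punctuation, case and hyphenation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()   # assumed tokenization: split on whitespace
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from hypothesis
                curr[j - 1] + 1,     # extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hand transcription versus a hypothetical OCR output with one misread word.
print(word_error_rate("born in 1842 at Leeds", "bom in 1842 at Leeds"))
```

Note that WER can exceed 1.0 when the OCR output contains many more words than the reference.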

Options

Options ordered from best to worst. More input needed.

  1. Write something simple from scratch
    • No baggage to worry about.
    • No external people or resources to coordinate with.
  2. DEG's FOCIH (probably in conjunction with the image display code Aaron has already started working on):
    • Developed here.
    • Will require adding image annotation functionality as well as many other items on the wish list.
    • Has some back-end code for annotating with respect to an arbitrary conceptual model, including entities, relations, and sub-entities (I think).
    • Has some front-end code for specifying the arbitrary conceptual model target for annotation.
    • Currently tied up in a large Java code base.
    • Potential to be integrated with conceptual modeling code.
  3. NLP's CCASH
    • Developed here.
    • Web interface.
    • Will require adding image annotation functionality as well as many other items on the wish list.
    • Greater potential of being integrated with active learning and other ML/NLP code.
  4. University of Arizona, Hong Cui's tool
  5. TrueViz
  6. Don Curtis' (Ancestry developer's) annotation tool.
    • Annotates given names, surnames, dates and places within the major event categories, ties these to names, and annotates family relationships between two names.
    • Does not currently annotate zones of any kind, or any information outside the genealogy domain.
    • He would be willing to extend it a little, but not to cover our whole wish list.
    • Written in C#.
    • They might allow us to extend it ourselves.
  7. FootNote's crowd-sourcing annotation tool.
    • Can annotate Person, Place, Date, Text, and can probably allow for corrected transcriptions.
    • Probably only annotates these four entity types and nothing else, and we probably would not be allowed to extend it.
  8. Ancestry's production annotation tool.
    • Probably not available.
  9. LDS Church's “Internet Indexing” crowd-sourcing tool.

<br />

Discussion: