We need to find, adapt or create an annotation tool to help us make a gold standard for document analysis and/or information extraction based on scanned document images.
<br />
__TOC__
<br />
Features to keep in mind as we consider which annotation tool to choose:
Feature | Votes |
---|---|
Transcribe what we annotate, so that we can compute the word error rate (WER) of the OCR, train an OCR error corrector, and evaluate OCR error correction. | Thomas, Bill |
Annotate coarse grained zones (e.g. page column sized units). Needed to evaluate basic document image analysis projects. | Josh H. |
Annotate fine grained zones (e.g. rows of a table, sentences). Needed to evaluate table analysis or sentence splitting as an important intermediate step in extracting information. | Thomas, Aaron |
Annotate entities. | Thomas, Joshua L. |
Annotate relationships, meaning named, ordered tuples of entities. | Thomas |
Open-ended labels (i.e. not hard-coded with genealogy-specific entity and relationship labels like “Person”, “Place”, “Date”, etc.). | Thomas |
Annotate sub-entities, meaning the contiguous tokens of which entities are composed. Sub-entities are valuable targets in information extraction for the downstream processes of information retrieval (searching for a name without conflating surname with given name) and record linkage (semantic-level interpretation is helped by guessing that “A. Hitler” and “Hitler, Adolf” probably have the same given name). Andrew Carlson's coupled bootstrapped learning showed that learning relations and entities simultaneously, using predicate-argument-type constraints between entities and relations, limits semantic drift. Based on that work, Thomas proposes that an analogous constraint be used between entities and sub-entities. | Thomas |
Selection of smaller pieces of text to annotate larger pieces. E.g. we can annotate entities by selecting tokens, as opposed to drawing rectangles. Selection can be done by dragging a mouse over contiguous tokens or by clicking on each token. Having helped create a fast annotation tool at Ancestry, Thomas recommends this approach. | Thomas, Dr. Ringger |
Use OCR output as the basis of painlessly selecting tokens along with their bounding box coordinates and OCR transcriptions. | Thomas, Dr. Ringger |
Save bounding boxes in output XML to reference during evaluation. Bounding boxes would make our annotations OCR-engine-agnostic, whereas relying on token IDs alone would tie us to a single OCR engine for all hand annotations and extractor annotations for a given gold standard reference. We may not need this feature if we are sure we will only ever use one OCR engine per document, e.g. if we settle on ABBYY FineReader permanently. | Thomas, Dr. Embley |
Save token IDs in output XML to reference during evaluation. Annotated units larger than a token could have IDs based on their constituent tokens, e.g. an enumeration or a range of constituent token IDs. | Thomas |
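<br />
The transcription feature above is motivated by computing the WER of the OCR against hand transcriptions. A minimal sketch of that computation follows, assuming whitespace tokenization and plain-text transcriptions. The function name `word_error_rate` is hypothetical and does not come from any of the tools under consideration. WER is taken here as the word-level edit distance (substitutions, insertions, deletions) divided by the number of words in the gold transcription.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two strings, divided by the
    number of words in the reference (gold) transcription.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from OCR output
            insertion = curr[j - 1] + 1  # extra word in OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substituted word out of four reference words.
print(word_error_rate("born in Provo Utah", "born in Prove Utah"))  # 0.25
```

Because the denominator is the reference length, WER can exceed 1.0 when the OCR output contains many spurious words.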
<br />
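The last two features in the table (bounding boxes and token IDs in the output XML) can be combined in one record per annotation. The sketch below is only an illustration under stated assumptions: the `Token` class, the `annotation` element, and its `label`, `tokens`, and `bbox` attributes are hypothetical names, not an agreed schema. A unit larger than a token gets its ID from a range of constituent token IDs when they are contiguous, or an enumeration otherwise, and its bounding box is the union of its tokens' boxes.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    index: int                       # position in the OCR token stream (hypothetical ID scheme)
    text: str                        # OCR transcription of the token
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def union_box(tokens: List[Token]) -> Tuple[int, int, int, int]:
    """Smallest rectangle that covers every token's bounding box."""
    return (
        min(t.box[0] for t in tokens),
        min(t.box[1] for t in tokens),
        max(t.box[2] for t in tokens),
        max(t.box[3] for t in tokens),
    )


def token_id_spec(tokens: List[Token]) -> str:
    """A range such as "3-5" for contiguous tokens, else an enumeration such as "3,5,9"."""
    indices = sorted(t.index for t in tokens)
    contiguous = indices == list(range(indices[0], indices[-1] + 1))
    if contiguous and len(indices) > 1:
        return f"{indices[0]}-{indices[-1]}"
    return ",".join(str(i) for i in indices)


def annotation_element(label: str, tokens: List[Token]) -> ET.Element:
    """Build one <annotation> element (hypothetical schema) from selected tokens."""
    elem = ET.Element("annotation", {
        "label": label,
        "tokens": token_id_spec(tokens),
        "bbox": ",".join(str(v) for v in union_box(tokens)),
    })
    elem.text = " ".join(t.text for t in sorted(tokens, key=lambda t: t.index))
    return elem


selected = [Token(3, "Mary", (10, 20, 60, 40)), Token(4, "Smith", (65, 18, 130, 41))]
print(ET.tostring(annotation_element("Person", selected), encoding="unicode"))
```

Storing both attributes keeps evaluation possible by token ID when a single OCR engine is used, and by bounding box overlap when OCR engines differ.
<br />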
Options ordered from best to worst. More input needed.
<br />
Discussion: