nlp-private:image-annotation-tools [2015/04/23 19:21] (current)
ryancha created
 +Back to [[Noisy OCR Group]]
 +
 +<br />
 +
 +We need to find, adapt or create an annotation tool to help us make a gold standard for document analysis and/or information extraction based on scanned document images.
 +
 +<br />
 +
 +__TOC__
 +
 +<br />
 +
 +== Wish List ==
 +
 +A list of features to keep in mind as we consider which annotation tool to choose.
 +
 +{| class="wikitable" border="1"
 +|-
 +! Feature
 +! Votes
 +|-
 +| '''​Transcribe'''​ what we annotate, to compute WER of OCR, to train OCR error corrector, and to evaluate OCR error correction.
 +| Thomas, Bill
 +|-
 +| Annotate '''​coarse grained zones'''​ (e.g. page column sized units). ​ Needed to evaluate basic document image analysis projects.
 +| Josh H.
 +|-
 +| Annotate '''​fine grained zones'''​ (e.g. rows of a table, sentences). ​ Needed to evaluate table analysis or sentence splitting as an important intermediate step in extracting information.
 +| Thomas, Aaron
 +|-
 +| Annotate '''​entities'''​.
 +| Thomas, Joshua L.
 +|-
 +| Annotate '''​relationships''',​ meaning named, ordered tuples of entities.
 +| Thomas
 +|-
 +| '''​Open-ended labels'''​ (e.g. not hard-coded with genealogy-specific entity and relationship labels like "​Person",​ "​Place",​ "​Date",​ etc.).
 +| Thomas
 +|-
 +| Annotate '''sub-entities''', meaning contiguous tokens of which entities will be composed.  Sub-entities are valuable to target in information extraction for the downstream processes of information retrieval (searching for a name without conflating surname with given name) and record linkage (semantic-level interpretation, including record linkage, is helped by guessing that "A. Hitler" and "Hitler, Adolf" probably have the same given name).  Building on Andrew Carlson's coupled bootstrapped learning, which learns relations and entities simultaneously and uses predicate-argument-type constraints (a constraint between entities and relations) to limit semantic drift, Thomas proposes that an analogous constraint be used between entities and sub-entities.
 +| Thomas
 +|-
 +| '''​Selection of smaller pieces of text'''​ to annotate larger pieces. ​ E.g. we can annotate entities by selecting tokens, as opposed to drawing rectangles. ​ Selection can be done by dragging a mouse over contiguous tokens or by clicking on each token. ​ Having helped create a fast annotation tool at Ancestry, Thomas recommends this approach.
 +| Thomas, Dr. Ringger
 +|-
 +| '''​Use OCR output'''​ as the basis of painlessly selecting tokens along with their bounding box coordinates and OCR transcriptions.
 +| Thomas, Dr. Ringger
 +|-
 +| '''​Save bounding boxes'''​ in output XML to reference during evaluation. ​ Bounding boxes would make our annotations OCR-engine-agnostic. ​ Using IDs will tie us to using just one OCR engine for all hand annotations and extractor annotations for a given gold standard reference. ​ We may not need this feature if we are sure we will only ever use one OCR engine per document, e.g. if we settle on using Abbyy FineReader now and forever.
 +| Thomas, Dr. Embley
 +|-
 +| '''​Save token IDs'''​ in output XML to reference during evaluation. ​ Annotated units larger than a token could have IDs based on their constituent tokens, e.g. an enumeration or a range of constituent token IDs.
 +| Thomas
 +|}
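
The first wish-list item mentions computing the WER of OCR against hand transcriptions. A minimal sketch of that computation follows, assuming whitespace tokenization and the usual definition of word error rate (word-level edit distance divided by the number of reference words). The function name and the sample strings are illustrative only and are not part of any tool listed on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are whitespace-separated; a real evaluation would
    need an agreed tokenization and normalization policy.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from the OCR output
            insertion = curr[j - 1] + 1   # extra word in the OCR output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word out of six.
gold = "John Smith born 1850 in Ohio"
ocr = "John Smlth born 1850 Ohio"
print(word_error_rate(gold, ocr))
```

Because insertions count as errors, WER can exceed 1.0 when the OCR output is much longer than the reference.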
 +
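
The last two wish-list items ask for bounding boxes and token IDs in the output XML. The sketch below shows one way an annotated unit larger than a token might carry both. The element name (<code>annotation</code>), the attribute names, and the <code>Token</code> class are all hypothetical; no schema has been agreed on.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Token:
    """Hypothetical OCR token: an ID, its transcription, and its bounding box."""
    token_id: int
    text: str
    box: Tuple[int, int, int, int]  # (left, top, right, bottom) in page pixels


def union_box(tokens: List[Token]) -> Tuple[int, int, int, int]:
    """Smallest rectangle that covers every token's bounding box."""
    return (
        min(t.box[0] for t in tokens),
        min(t.box[1] for t in tokens),
        max(t.box[2] for t in tokens),
        max(t.box[3] for t in tokens),
    )


def id_spec(tokens: List[Token]) -> str:
    """Token IDs as a range ("3-4") when contiguous, else an enumeration ("1,5")."""
    ids = sorted(t.token_id for t in tokens)
    if len(ids) > 1 and ids == list(range(ids[0], ids[-1] + 1)):
        return f"{ids[0]}-{ids[-1]}"
    return ",".join(str(i) for i in ids)


def annotation_element(label: str, tokens: List[Token]) -> ET.Element:
    """Build one XML element that stores both token IDs and a bounding box."""
    if not tokens:
        raise ValueError("an annotation needs at least one token")
    elem = ET.Element(
        "annotation",
        {
            "label": label,  # open-ended label, not hard-coded
            "tokens": id_spec(tokens),
            "bbox": ",".join(str(v) for v in union_box(tokens)),
        },
    )
    ordered = sorted(tokens, key=lambda t: t.token_id)
    elem.text = " ".join(t.text for t in ordered)
    return elem


# Hypothetical usage: two adjacent OCR tokens annotated as one entity.
name = [Token(3, "John", (100, 40, 150, 60)), Token(4, "Smith", (155, 38, 215, 60))]
print(ET.tostring(annotation_element("Person", name), encoding="unicode"))
```

Storing both values keeps the engine-agnostic option open: the token IDs serve evaluation against the same OCR run, and the bounding box can be matched against tokens from a different engine.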
 +<br />
 +
 +== Options ==
 +
 +Options ordered from best to worst. ​ More input needed.
 +
 +# Write something simple from scratch
 +#* No baggage to worry about.
 +#* No external people or resources to coordinate with.
 +# DEG's FOCIH (probably in conjunction with the image display code Aaron has already started working on):
 +#* Developed here.
 +#* Will require adding image annotation functionality as well as many other items on the wish list.
 +#* Has some back-end code for annotating with respect to an arbitrary conceptual model, including entities, relations, and sub-entities (I think).
 +#* Has some front-end code for specifying the arbitrary conceptual model target for annotation.
 +#* Currently tied up in a large Java code base.
 +#* Potential to be integrated with conceptual modeling code.
 +# NLP's CCASH
 +#* Developed here.
 +#* Web interface.
 +#* Will require adding image annotation functionality as well as many other items on the wish list.
 +#* Greater potential of being integrated with active learning and other ML/NLP code.
 +# University of Arizona, Hong Cui's tool
 +#* They are just starting this big project, which includes developing an annotation tool for an IE project very similar to ours, but in the bioinformatics domain.
 +#* Overview of the bioinformatics annotation and unsupervised information extraction project: [http://sites.google.com/site/biosemanticsproject/]
 +#* Annotation discussion: [http://sites.google.com/site/biosemanticsproject/character-annotation-discussions].  Note that "character" means "physical attribute" of some organism in this domain.
 +# TrueViz
 +#* Download: [http://www.kanungo.com/software/software.html#trueviz]
 +#* Paper: [http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAsQFjAA&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.32.7739%26rep%3Drep1%26type%3Dpdf&ei=KXOVS56JCoPetgPOka3hBg&usg=AFQjCNHGFIskGU9qzWBFQyxNfyQM0S9e9Q&sig2=YdUEhqEfvuVEHRks_k_iAg]
 +#* By Kanungo. ​ Open source Java with a history of being extended.
 +#* Already functioning for annotating characters, tokens, lines, coarse zones for document analysis applications.
 +#* Annotations are rectangles or arbitrary polygons for "​entities"​ and ordered line segments for logical relations among these "​entities"​.
 +#* Would need to be extended to handle our domain-specific entities and relations.
 +#* Probably needs to be extended to take OCR transcriptions as input as the basis for labeling entities with respect to existing tokens.
 +#* Annotation tree view can get complex.
 +# Don Curtis'​ (Ancestry developer'​s) annotation tool.
 +#* Annotates given name, surname, dates and places within the major event categories, ties these to names, and family relationships between two names.
 +#* Does not currently annotate zones of any kind, or any information outside the genealogy domain.
 +#* He would be willing to extend it a little, but not to include all our wish list.
 +#* Written in C#.
 +#* They might allow us to extend it ourselves.
 +# FootNote'​s crowd-sourcing annotation tool.
 +#* http://www.footnote.com/tour/
 +#* Can annotate Person, Place, Date, Text, and can probably allow for corrected transcriptions.
 +#* Probably only annotates these four entity types and nothing else, and we probably would not be allowed to extend it.
 +# Ancestry'​s production annotation tool.
 +#* Probably not available.
 +# LDS Church'​s "​Internet Indexing"​ crowd-sourcing tool.
 +#* Probably not available.
 +#* [http://indexing.familysearch.org/newuser/nuhome.jsf] [http://www.familysearch.org/eng/indexing/frameset_indexing.asp]
 +
 +<br />
 +
 +Discussion:
 +
 +* Who should annotate?
 +** Suggestion:  developers and non-developers annotate the training set together, both to train the non-developers and to compute inter-annotator agreement.  Non-developers label the two test sets so developers don't cheat.
 +* How many annotators for the same pages?
 +** Suggestion: ​ Two with a third for tie-breaking.
 +* Should we and can we hire a non-researcher to annotate the two test sets?
 +** We should ask Ancestry if they would put one or more people on this task.  There'​s a good chance.
 +* The precise role for test sets: training, dev, blind
 +** Training set is for looking at and tuning parameters and hyper-parameters.
 +** Dev test set is for evaluating multiple times.
 +** Blind test set is reserved for the public competition (ICDAR-2011).
 +* The amount of experimental data we need to process to be convincing
 +** http://www.surveysystem.com/sscalc.htm
 +** That and another method I can show in a spreadsheet suggest we need a sample size on the order of thousands of instances to detect accuracy differences of about 1% at p = 0.05 (the exact number depends on how close to 50% the accuracies are).
 +* Annotation only for current projects v. annotation for future projects
 +* A look at the annotation needs for each individual project
 +** Thomas'​ project involves about four levels of nested IE/​segmentation annotation, specialized for several semi-tabular record types in the new Ancestry data, as well as table detection and analysis annotation throughout the corpus.  ​
 +** If NLP or AML wants to do unstructured IE, they might also be interested in the table detection labeling as well as paragraph and sentence boundary labeling.  ​
 +** Josh Hansen might be interested in page structure annotation.
 +** Dan Walker might be interested in page category annotation.
 +** Check out the wiki page for more detail.
 +* Annotation tools
 +** The DEG lab plans to produce a fairly general-purpose image-based annotation tool.
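
The discussion above suggests computing inter-annotator agreement and using two annotators with a third for tie-breaking. The sketch below illustrates both, assuming agreement is measured with Cohen's kappa over per-token labels. The page does not say which agreement statistic is intended, so that choice, the function names, and the sample labels are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


def adjudicate(labels_a: Sequence[str], labels_b: Sequence[str],
               tie_breaker: Sequence[str]) -> List[str]:
    """Keep labels the two annotators agree on; otherwise defer to the third."""
    return [a if a == b else c for a, b, c in zip(labels_a, labels_b, tie_breaker)]


# Hypothetical per-token labels from two annotators and a tie-breaker.
ann_1 = ["Person", "Person", "Date", "Place"]
ann_2 = ["Person", "Place", "Date", "Place"]
ann_3 = ["Person", "Person", "Date", "Place"]
print(cohens_kappa(ann_1, ann_2))
print(adjudicate(ann_1, ann_2, ann_3))
```

Kappa corrects raw agreement for chance, which matters when one label (for example "not an entity") dominates the tokens on a page.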
  
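
The sample-size estimate in the discussion above can be roughly reproduced with the standard normal-approximation formula for the margin of error of a proportion, n = z² · p(1 − p) / e². This is an assumption about what the linked calculator computes, and it bounds the error on a single measured accuracy rather than running a paired significance test between two systems.

```python
import math


def sample_size(margin: float, p: float = 0.5, z: float = 1.96) -> int:
    """Instances needed so a measured accuracy near p is within +/- margin.

    Assumptions: simple random sampling, normal approximation, and
    z = 1.96 for a 95% confidence level (p = 0.05).
    """
    if not 0.0 < margin < 1.0 or not 0.0 < p < 1.0:
        raise ValueError("margin and p must lie strictly between 0 and 1")
    return math.ceil(z * z * p * (1.0 - p) / (margin * margin))


# A 1% margin needs on the order of thousands of instances, and fewer
# as the accuracy moves away from 50%.
for accuracy in (0.5, 0.7, 0.9):
    print(accuracy, sample_size(0.01, p=accuracy))
```

The worst case is an accuracy of 50%, which needs roughly 9,600 instances for a 1% margin; at 90% accuracy the requirement drops to a few thousand.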
nlp-private/image-annotation-tools.txt · Last modified: 2015/04/23 19:21 by ryancha