nlp-private:document-image-data-sets [CS Wiki]

Lee Jensen's Request

Primary goal:

Gold standard text representation of an image.
Gold standard entity representation of text.

Possible secondary goals, as long as they don’t overpower the project:

Analysis of labeling methods.
Inclusion in TC-11.
Competitions.
Page classification.
Table analysis research.
Formatting analysis research.

The entities are the ones I have been after all along.

Person Record
Name (given and surnames)
Gender
Events (date (mm/dd/yyyy) and place (city, county, state, country))
- Birth
- Death
- Marriage
- Other
Familial relationship to other person record
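
The person record above could be represented roughly as follows. This is only a sketch: the class names, field names, and relationship labels are assumptions made for illustration, not a schema anyone has agreed on.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Place:
    # Any part of the place may be missing on the page.
    city: Optional[str] = None
    county: Optional[str] = None
    state: Optional[str] = None
    country: Optional[str] = None


@dataclass
class Event:
    kind: str                      # "birth", "death", "marriage", or "other"
    date: Optional[str] = None     # mm/dd/yyyy; None if the page gives no date
    place: Optional[Place] = None


@dataclass
class Relationship:
    relation: str                  # hypothetical labels, e.g. "spouse", "child"
    other_id: str                  # record_id of the related PersonRecord


@dataclass
class PersonRecord:
    record_id: str
    given_names: List[str] = field(default_factory=list)
    surnames: List[str] = field(default_factory=list)
    gender: Optional[str] = None
    events: List[Event] = field(default_factory=list)
    relationships: List[Relationship] = field(default_factory=list)


# Hypothetical example record.
john = PersonRecord(
    record_id="p1",
    given_names=["John"],
    surnames=["Jones"],
    events=[Event(kind="marriage")],
    relationships=[Relationship(relation="spouse", other_id="p2")],
)
```

Keeping relationships as references to other record ids lets one person take part in several relationships without nesting records inside each other.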

As for quantity I am flexible on this, but I was hoping for twenty or so artifacts (pages) from each title. Of course newspapers are much richer than cruise books and so we may need to be flexible on this. Perhaps only take portions of a newspaper image or something of the like.

I am planning on a few thousand dollars for labeling. Possibilities for labelers include MTurk, our customer support department, or perhaps part time contractors.

Dan's Wish List

Must have:

Transcription
Topic labels of articles
OCR output
Zoning (logical structure of articles)
Newspapers only
Tens of thousands of documents (articles)

Nice to have:

Word bounding boxes
Multiple OCR engine output for each article

Bill's Wish List

Must have:

Transcriptions
OCR output of multiple engines
Zoning (physical structure)
Narrative text

Nice to have:

Word bounding boxes

Thomas' Wish List

Must have:

Semi-structured data, record and field boundaries
Word bounding boxes
Line boundaries

Very nice to have:

Transcriptions

Dr. Embley's Wish List

Must have:

Facts/predicates annotated (entities, relations among entities)
Location of facts (bounding boxes)

Nice to have:

Transcriptions

Benjamin Lambert's Wish List (from CMU)

Must have:

Transcriptions
Zoning (logical structure of newspaper articles)
Word bounding boxes

Nice to have:

Categories of documents, pages, etc.
Line bounding boxes
Paragraphs and other structure

Other Wish Lists

The list below is old and small; it has been replaced by http://www.allourideas.org/ancestry_corpus. It records what we need in a data set.

Feature	Votes
Contains semi-structured data.	Thomas
A book with various (alternating/mixed) semi-structured formats/patterns.	Dr. Embley, Thomas, Aaron
Contains OCR output (unless we get an OCR engine).	All.
OCR contains punctuation, correct token bounding boxes, text line groupings of tokens and good zoning of coarse-grained page structures.	Thomas, Dr. Embley, Aaron.
Contains hand-annotations (unless we get or make an annotation tool).	All
Hand annotations include detailed multilevel-structured information for multiple classes, e.g. sub-entities, entities, relations.	Thomas

Wish List Given to Ancestry

New data, e.g. twice as much as what we got last year (or more).
It can include the same titles as before (from the data we got last year) if the OCR is all the same format. We liked the sample we got last year in terms of diversity of document style.
The equivalent of at least three books for each document format/style. E.g. if we get city directory books that have one table of contents and an index per book, we'd need at least three such books so we can have an example of each part of the book in each of our data sub-sets: training, dev-test and blind-test sets. The more the better – like 6 books per style would be even better.
If a certain type of document is pretty big and uniform, like a set of census forms that were never bound into book-style volumes, then it's okay to give us one big title if it contains a lot of individual records/pages. We can manually split that kind of a title into different date ranges, for example, to produce our training/dev-test/blind-test split.
A variety of document styles. There are three main kinds of document formats: documents with narrative sentences and paragraphs (e.g. newspaper), documents containing mostly tables (distinct columns and rows of data, with or without column headers), and semi-structured text (e.g. city directories and event-lists like marriages or births that are written in a uniform style like “John Jones m. Sarah Wight, Jun 9., by Pastor Richard Smith”). Within these categories are potentially many sub-categories (e.g. city directories and local histories with lots of lists of names are probably both considered semi-structured). We'd like a variety even within each category, as long as there are at least three titles for each sub-category. And it's okay if a certain sub-category seems to contain some pages that are free text and some that are semi-structured, for example. Mixtures of categories found within the same document are fine (actually beneficial).
An emphasis on tabular and semi-structured data, i.e. data-rich documents, especially those containing family relationship and life-event facts about many people in a consistent or semi-consistent format. But if we have a large sample of each type, we don't necessarily need to have more tabular data than unstructured data.
Contain TIFF images corresponding to the OCR output for each page. (JPEG 2K format is harder to deal with.)
Verbose XML format containing accurate bounding box coordinates for characters or words, word-groupings of characters (e.g. spaces preserved between words), text-line groupings of words, character confidence scores, correct word order, punctuation preserved. We would also assume that zoning (column analysis) was done by the OCR engine. Having zone information in the output is not as important as text line groupings (assuming those text lines don't span multiple columns).
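
The "at least three books per style" requirement above exists so that every style can appear in each of the training, dev-test and blind-test sets. A minimal sketch of such a split follows, assuming each title arrives already tagged with a style label; the function name, the title names, and the round-robin policy are illustrative assumptions, not an agreed procedure.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

SPLITS = ("training", "dev-test", "blind-test")


def split_titles(titles: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Assign (title, style) pairs to splits, round-robin within each style."""
    by_style = defaultdict(list)
    for title, style in titles:
        by_style[style].append(title)

    assignment = {name: [] for name in SPLITS}
    for style, members in by_style.items():
        # Fewer than three titles means some split would lack this style.
        if len(members) < len(SPLITS):
            raise ValueError(f"style {style!r} needs at least 3 titles")
        for i, title in enumerate(sorted(members)):
            assignment[SPLITS[i % len(SPLITS)]].append(title)
    return assignment


# Hypothetical titles, three per style.
example = [
    ("directory_a", "city directory"),
    ("directory_b", "city directory"),
    ("directory_c", "city directory"),
    ("news_a", "newspaper"),
    ("news_b", "newspaper"),
    ("news_c", "newspaper"),
]
result = split_titles(example)
```

Splitting by whole title rather than by page keeps pages of the same book out of both training and test data. A single large uniform title, such as a set of census forms, would instead need a manual split, for example by date range.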

Options

Here are data sets we have access to. “Diverse” means there are multiple formats of text within the same corpus (e.g. more than one document type). “Mixed” means heterogeneous on the page-level, i.e. there can be multiple formats of text structure within any given page. “Unstructured” means free-text, natural language sentences.

Ancestry Data.
- Includes images
- Includes OCR in Abbyy R.S. XML format (lots of information).
- No annotations (we will make).
- Contains diverse and mixed unstructured, semi-structured and structured text in family history domain.
HBL Library
- Might need to scan and OCR ourselves.
- Interesting variety of books and other documents.
LDS Church Data
- Some interesting church-related data (patriarchal blessings, etc.)
Botanical Specimen Labels
- Might include images
- Might include OCR with nothing but text.
- Might include some annotations, not sure if multilevel-structured.
- Contains diverse but not mixed semi-structured text in botany domain.
- Might require significant knowledge of botany to label this data.
CiteSeer
- Not sure what this includes.
Bibliography Data
- No images
- Includes OCR with nothing but text, already extracted from any other kinds of text on the original page. Very simple data.
- Includes flat, in-line annotations of segmented bibliographic entries (author segment, title segment, etc.), not multilevel-structured.
- Contains non-diverse and non-mixed semi-structured text in botany domain.
POBCRIS Data
- Might include images later.
- Includes OCR in ARS XML format (lots of information).
- Some flat, in-line annotations of places and person names.
- Contains non-diverse and non-mixed unstructured text of parliamentary proceedings (UK legal domain).
Project Gutenberg
- Constant stream of images, OCR and hand-transcriptions.
U of Arizona Data
- Might include a lot of stuff in biology domain. Not sure yet.
NIST Data
- Haven't looked at it yet.

Ancestry Corpus Plan

Five-phase plan for annotating the Ancestry corpus. Text in round parentheses and the voting-derived scores were preserved from AllOurIdeas.org. Text in square brackets is further explanation added after the voting to explain how these items fit into this plan.

The time line is rough and fuzzy. There may be extra pages of one task done in the phase of another task, e.g. more transcription may take place during the later phases. We could swap the due dates for phases 2 and 3 if anyone really wanted us to.

By the way, the number of pages to be annotated or transcribed has not been determined here. Ancestry may be interested in very few pages, just a sample from each document.

Ancestry Corpus Wiki

https://facwiki.cs.byu.edu/Ancestrycorp/index.php/Main_Page

Summary Annotation Plan

Phase 1: Transcription (before Jan. 1):
- Gather and restructure what Ancestry gives us including images and OCR output
- Transcribe / correct OCR, including annotation time cost
- Align to OCR output, preserving bounding boxes, line segmentation, etc.
- Compute WER
Phase 2: E/R (before Mar. 1):
- Marking semi-structured lists and their component records, including annotation time cost
- Labeling entities and relations in unstructured and semi-structured text, including annotation time cost
Phase 3: Logical Structure:
- Segment logical page structure, mostly newspaper articles
- Label articles with their topics
Phase 4:
- Other logical structure, typo correction, database, record-linkage and co-reference stuff we may not get to. There were a few high-ranked items here that just don't fit in well with the essentials.
Phase 5:
- Leave the rest undone and be thankful we didn't have to do it all
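
The Phase 1 step of aligning the transcription to the OCR output while preserving bounding boxes could be sketched as below. This is one possible approach using Python's standard `difflib`, not a method the plan commits to; the `(left, top, right, bottom)` box layout and the function name are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (left, top, right, bottom) in pixels


def align(ocr: List[Tuple[str, Box]],
          gold: List[str]) -> List[Tuple[str, Optional[Box]]]:
    """Pair each gold word with the box of the OCR token it lines up with."""
    ocr_words = [word for word, _ in ocr]
    matcher = SequenceMatcher(a=ocr_words, b=gold, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            # One-to-one span: each gold word inherits its OCR token's box.
            for i, j in zip(range(i1, i2), range(j1, j2)):
                aligned.append((gold[j], ocr[i][1]))
        else:
            # Inserted words or uneven replacements: no reliable box.
            for j in range(j1, j2):
                aligned.append((gold[j], None))
    return aligned


# Hypothetical OCR tokens with one misrecognized word and one missed word.
ocr_tokens = [
    ("Jolm", (10, 10, 50, 30)),
    ("Jones", (60, 10, 110, 30)),
    ("m.", (120, 10, 140, 30)),
]
pairs = align(ocr_tokens, ["John", "Jones", "m.", "Sarah"])
```

Gold words that the OCR engine missed entirely come back with `None` for the box, so an annotator would still have to draw those by hand.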

Phase 1: Done by January 1, 2010, with Ancestry's full support

Basic Document Metadata:
- contains dates for when each document was written 75
- Contains other similar document-level metadata, including place, document style/type.
- contains larger document categories, e.g. “newspaper”, “local history”, “yearbook”, etc. 61
- contains a large variety of document types 52
- contains unstructured and narrative text blocks [not necessarily marked as such] 57
- contains tabular and semi-structured text blocks (not necessarily segmented) [not necessarily marked as such] 45
Document Physical Structure:
- contains OCR output for each page with just the OCR text [not just OCR, actually] 56
- contains full page transcriptions 67
- provides an accurate plain-text version of the document. [This is a transcription and will include the text as it appears in the original document (no typo-correction, which might come later)] 67
- contains gold transcribed text with word or character bounding boxes 57
- promotes OCR research [by which I mean, automatic transcription research which requires image and gold transcription] 50
- contains OCR tokens identifiable by image coordinates instead of (OCR-engine specific) token IDs [this is the plan, I guess there's little harm in also having token IDs, although they are implicit in the full list of tokens] 47
- contains OCR output for each page including character and/or word bounding boxes 44
- promotes OCR error correction research [by this I mean it has gold transcriptions aligned with OCR output and images] 43
- contains OCR output for each page including font size and style information, e.g. “bold”, “italics”, etc. [I think we get this for free from Ancestry] 41
- contains OCR output with the engine’s word error rates for each page [this should be easy to do once we have OCR and manual transcription] 35
Extra Stuff
- contains annotator cost, e.g. the time it took an annotator to produce any of the manual tasks like transcribing or labeling some unit… [this should be easy to do with our own annotation tool] 33
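
Once a page has both OCR output and a manual transcription, its word error rate can be computed as the word-level edit distance (substitutions, deletions and insertions) divided by the number of words in the transcription. The sketch below assumes plain whitespace tokenization, which is a simplification; the function name is illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription is empty")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("Jolm") and one deletion ("Wight") against 5 words.
wer = word_error_rate("John Jones m. Sarah Wight", "Jolm Jones m. Sarah")
```

Because insertions count as errors, WER can exceed 1.0 on pages where the OCR engine produces many spurious tokens.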

Phase 2: Done by March 1, 2011, with Ancestry's full support

Document Logical Structure:
- Select semi-structured lists will be identified and segmented into records. [Thomas' requirement]
Entities and Relations:
- promotes genealogy search/IR research [that's the whole point, for Ancestry] 38
- contains gold annotated entities, e.g. persons, places, dates, etc., including label and correct transcription [includes Lee's required entities and Thomas' fully segmented list records] 78
- contains gold annotation of highly structured information (e.g. nested entities and relations) [includes Lee's required entities and Thomas' fully segmented list records] 71
- contains fully labeled/segmented (semi-)structured lists, e.g. city directory entries with everything labeled [includes Lee's required entities and Thomas' fully segmented list records] 59
- contains transcriptions of labeled entities [taken from previous phase when manual transcription was done] 67
- promotes extraction of structured data (e.g. nested entities and relations) 67
- promotes named entity recognition research 63
- contains gold annotation of family relationships and life events 52
- promotes unsupervised machine learning approaches [by which I mean plenty of unlabeled data] 52
- promotes semi-supervised machine learning approaches [seems the same as for unsupervised approaches with supervised evaluation metrics] 47
- contains some labeled and a lot of unlabeled training data 50
Extra Stuff
- contains annotator cost, e.g. the time it took an annotator to produce any of the manual tasks like transcribing or labeling some unit… [this should be easy to do with our own annotation tool] 33

Phase 3: Done by April 1, 2011, if ever, hoping for Ancestry's support in this

Document Logical Structure:
- promotes document or page categorization/clustering [in this phase, just newspaper articles as documents] 52
- contains labels of text blocks indicating whether it is an unstructured (narrative) block of text vs. a structured (tabular, list) blo… 69
- promotes automatic analysis of page layout (e.g. identifying paragraphs, page headers, chapter and article segmentation, text wrapping… [in this phase, just newspaper articles and maybe chapters] 88
- contains information about how text blocks are connected within the flow of a newspaper article or family history narrative 80
- contains topic/category labels for individual newspaper articles or book sections, e.g. “classifieds” “sports”, “advertisements”, etc. 55
- promotes unsupervised machine learning approaches [by which I mean plenty of unlabeled data] 52
- promotes semi-supervised machine learning approaches [seems the same as for unsupervised approaches with supervised evaluation metrics] 47
- contains some labeled and a lot of unlabeled training data 50
Extra Stuff
- is included in a substantial public collection of corpora, such as the IAPR TC-11 Reading Systems data sets 50

Phase 4: Done by June 1, 2011, if ever, hoping for Ancestry's support in this

Document Physical Structure:
- Contains typo-corrections, i.e. contains transcriptions that distinguish between OCR mistakes and typos found in the original printed … 67
Document Logical Structure:
- contains page categories, e.g. “title page”, “advertisement”, “index page” 43
- promotes automatic analysis of page layout (e.g. identifying paragraphs, page headers, chapter and article segmentation, text wrapping… [some of this will be done in a previous phase] 88
Entities and Relations:
- promotes record linkage research 43
- contains annotation of entities tying them to a database of specific persons and places, dates are normalized, etc. 76
- yields a searchable repository of facts [if Ancestry really wants to pay for this… Actually, they will probably be building this on their own after we're done. I'm not sure if they will somehow make this available to the public along with this corpus.] 36
Extra Stuff:
- includes an annotation tool that you could later use to add more information to the corpus that's relevant to you 67

Phase 5: Will never be done :-)

contains gold marked-up tables and lists, e.g. rows and columns marked, hierarchical lists such as in indexes also marked, etc. 58
promotes structured, e.g. table and list, analysis [list stuff does come in other phases, but not tables in general] 58
contains simple linguistic annotations, e.g. part of speech tags and sentence boundaries 50
promotes NLP-based approaches, e.g. to NER or document layout analysis [too fuzzy to be a useful question] 50
contains sentence boundaries 47
contains complex linguistic annotations, e.g. parse trees of sentences 44
contains multiple OCR engines’ outputs at various levels of WER for each page [sorry, Bill. I don't think we can do this.] 36
contains more languages than English 33
contains image quality score/description for each page [related to WER, but maybe not close enough to worry about] 29
is accompanied by an executable to evaluate your output with respect to this gold reference dataset [almost the same as turning this into a competition] 22
is turned into a public competition at a good conference (e.g. ICDAR) 19
promotes image pre-processing parameter tuning research, e.g. binarization, de-warping, de-skewing, de-speckling, etc. [apparently not the kind of research we're interested in doing] 20

nlp-private/document-image-data-sets.txt · Last modified: 2015/04/23 13:21 by ryancha
