<br />
__TOC__
<br />
== Lee Jensen's Request ==
Primary goal:
Possible secondary goals, as long as they don’t overpower the project:
The entities are the ones I have been after all along.
* Person Record
** Name (given and surnames)
** Gender
** Events (date (mm/dd/yyyy) and place (city, county, state, country))
*** Birth
*** Death
*** Marriage
*** Other
** Familial relationship to other person record
As for quantity, I am flexible, but I was hoping for twenty or so artifacts (pages) from each title. Of course newspapers are much richer than cruise books, so we may need to be flexible here. Perhaps we could take only portions of a newspaper image, or something of the like.
I am planning on a few thousand dollars for labeling. Possibilities for labelers include MTurk, our customer support department, or perhaps part-time contractors.
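The person-record fields above can be sketched as a simple data structure so labelers and tool-builders agree on what a record holds. This is an illustrative Python sketch only; the field names and sample values are our own, not Ancestry's schema:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Event:
    kind: str                      # "birth", "death", "marriage", or "other"
    date: Optional[str] = None     # mm/dd/yyyy
    place: Optional[str] = None    # "city, county, state, country"

@dataclass
class PersonRecord:
    given: str
    surname: str
    gender: Optional[str] = None
    events: list = field(default_factory=list)
    # (relation type, other PersonRecord), e.g. ("spouse", ...)
    relations: list = field(default_factory=list)

# Hypothetical sample record (names from the example line later on this page;
# the date and place are made up for illustration).
p = PersonRecord(given="John", surname="Jones", gender="M")
p.events.append(Event(kind="marriage", date="06/09/1850",
                      place="Boston, Suffolk, Massachusetts, USA"))
```

A flat structure like this also maps directly onto a labeling interface: one form per person, one row per event.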
<br />
== Dan's Wish List ==
Must have:
Nice to have:
<br />
== Bill's Wish List ==
Must have:
Nice to have:
<br />
== Thomas' Wish List ==
Must have:
Very nice to have:
<br />
== Dr. Embley's Wish List ==
Must have:
Nice to have:
<br />
== Benjamin Lambert's Wish List (from CMU) ==
Must have:
Nice to have:
* Categories of documents, pages, etc.
* Line bounding boxes
* Paragraphs and other structure
<br />
== Other Wish Lists ==
The list below is old and small. It is replaced by http://www.allourideas.org/ancestry_corpus.
What we need in a data set.
{| class="wikitable"
! Feature !! Votes
|-
| Contains semi-structured data. || Thomas
|-
| A book with various (alternating/mixed) semi-structured formats/patterns. || Dr. Embley, Thomas, Aaron
|-
| Contains OCR output (unless we get an OCR engine). || All
|-
| OCR contains punctuation, correct token bounding boxes, text-line groupings of tokens, and good zoning of coarse-grained page structures. || Thomas, Dr. Embley, Aaron
|-
| Contains hand annotations (unless we get or make an annotation tool). || All
|-
| Hand annotations include detailed multilevel-structured information for multiple classes, e.g. sub-entities, entities, relations. || Thomas
|}
<br />
== Wish List Given to Ancestry ==
* New data, e.g. twice as much as what we got last year (or more).
* It can include the same titles as before (from the data we got last year) if the OCR is all the same format. We liked the sample we got last year in terms of diversity of document style.
* The equivalent of at least three books for each document format/style. E.g. if we get city directory books that have one table of contents and an index per book, we'd need at least three such books so we can have an example of each part of the book in each of our data sub-sets: training, dev-test, and blind-test sets. The more the better; six books per style would be even better.
* If a certain type of document is pretty big and uniform, like a set of census forms that were never bound into book-style volumes, then it's okay to give us one big title if it contains a lot of individual records/pages. We can manually split that kind of title into different date ranges, for example, to produce our training/dev-test/blind-test split.
* A variety of document styles. There are three main kinds of document formats: documents with narrative sentences and paragraphs (e.g. newspaper), documents containing mostly tables (distinct columns and rows of data, with or without column headers), and semi-structured text (e.g. city directories and event lists like marriages or births that are written in a uniform style like “John Jones m. Sarah Wight, Jun 9., by Pastor Richard Smith”). Within these categories are potentially many sub-categories (e.g. city directories and local histories with lots of lists of names are probably both considered semi-structured). We'd like a variety even within each category, as long as there are at least three titles for each sub-category. And it's okay if a certain sub-category seems to contain some pages that are free text and some that are semi-structured, for example. Mixtures of categories found within the same document are fine (actually beneficial).
* An emphasis on tabular and semi-structured data, i.e. data-rich documents, especially those containing family-relationship and life-event facts about many people in a consistent or semi-consistent format. But if we have a large sample of each type, we don't necessarily need to have more tabular data than unstructured data.
* TIFF images corresponding to the OCR output for each page. (JPEG 2K format is harder to deal with.)
* Verbose XML format containing accurate bounding-box coordinates for characters or words, word groupings of characters (e.g. spaces preserved between words), text-line groupings of words, character confidence scores, correct word order, and punctuation preserved. We would also assume that zoning (column analysis) was done by the OCR engine. Having zone information in the output is not as important as text-line groupings (assuming those text lines don't span multiple columns).
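To illustrate why uniformly written event lists are so valuable, here is a minimal sketch of extracting a person record from the marriage-line style quoted above. The regex is our own illustration, tuned to that one example; a real corpus would need a pattern (or learned model) per style:

```python
import re

# Hypothetical pattern for the "John Jones m. Sarah Wight, Jun 9., by ..."
# style; names are capitalized tokens, the date is "Mon DD" with optional dots.
MARRIAGE_LINE = re.compile(
    r"(?P<groom>[A-Z][\w.]*(?: [A-Z][\w.]*)*) m\. "
    r"(?P<bride>[A-Z][\w.]*(?: [A-Z][\w.]*)*), "
    r"(?P<date>[A-Z][a-z]{2}\.? \d{1,2}\.?),? by (?P<officiant>.+)"
)

line = "John Jones m. Sarah Wight, Jun 9., by Pastor Richard Smith"
match = MARRIAGE_LINE.match(line)
record = match.groupdict() if match else {}
print(record)
```

OCR noise (e.g. "m." misread as "rn.") is exactly what breaks such patterns, which is why gold annotations over noisy OCR are needed to evaluate more robust extractors.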
<br />
== Options ==
Here are data sets we have access to. “Diverse” means there are multiple formats of text within the same corpus (e.g. more than one document type). “Mixed” means heterogeneous on the page-level, i.e. there can be multiple formats of text structure within any given page. “Unstructured” means free-text, natural language sentences.
<br />
== Ancestry Corpus Plan ==
Five-phase plan for annotating the Ancestry corpus. Round parentheses and voting-derived scores were preserved from AllOurIdeas.org; square brackets enclose explanations added after the voting to show how each item fits into this plan.
The timeline is rough and fuzzy. Extra pages of one task may be done during the phase of another task, e.g. more transcription may take place during the later phases. We could swap the due dates for phases 2 and 3 if anyone really wanted us to.
By the way, the number of pages to be annotated or transcribed has not been determined here. Ancestry may be interested in very few pages, just a sample from each document.
<br />
Ancestry Corpus Wiki
== Summary Annotation Plan ==
* Phase 1: Transcription (before Jan. 1):
** Gather and restructure what Ancestry gives us, including images and OCR output
** Transcribe / correct OCR, including annotation time cost
** Align to OCR output, preserving bounding boxes, line segmentation, etc.
** Compute WER
* Phase 2: E/R (before Mar. 1):
** Marking semi-structured lists and their component records, including annotation time cost
** Labeling entities and relations in unstructured and semi-structured text, including annotation time cost
* Phase 3: Logical Structure:
* Phase 4: Other logical structure, typo correction, database, record-linkage and co-reference stuff we may not get to. There were a few high-ranked items here that just don't fit in well with the essentials.
* Phase 5:
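Phase 1 ends with computing WER against the corrected transcriptions. A minimal word-level edit-distance sketch (whitespace tokenization is our simplifying assumption; real scoring would normalize case and punctuation first):

```python
def wer(ref_tokens, hyp_tokens):
    """Word error rate: (substitutions + insertions + deletions)
    between reference and hypothesis, divided by reference length."""
    n, m = len(ref_tokens), len(hyp_tokens)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref_tokens[i - 1] == hyp_tokens[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[n][m] / n if n else 0.0

# One OCR substitution out of four reference words -> WER 0.25
print(wer("John Jones m. Sarah".split(), "John Janes m. Sarah".split()))
```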
<br />
* Phase 1: Done by January 1, 2011, with Ancestry's full support
* Phase 2: Done by March 1, 2011, with Ancestry's full support
* Phase 3: Done by April 1, 2011, if ever, hoping for Ancestry's support
* Phase 4: Done by June 1, 2011, if ever, hoping for Ancestry's support
* Phase 5: Will never be done :-)
* contains gold marked-up tables and lists, e.g. rows and columns marked, hierarchical lists such as in indexes also marked, etc. 58
* promotes structured, e.g. table and list, analysis [list stuff does come in other phases, but not tables in general] 58
* contains simple linguistic annotations, e.g. part-of-speech tags and sentence boundaries 50
* promotes NLP-based approaches, e.g. to NER or document layout analysis [too fuzzy to be a useful question] 50
* contains sentence boundaries 47
* contains complex linguistic annotations, e.g. parse trees of sentences 44
* contains multiple OCR engines' outputs at various levels of WER for each page [sorry, Bill. I don't think we can do this.] 36
* contains more languages than English 33
* contains an image quality score/description for each page [related to WER, but maybe not close enough to worry about] 29
* is accompanied by an executable to evaluate your output with respect to this gold reference dataset [almost the same as turning this into a competition] 22
* is turned into a public competition at a good conference (e.g. ICDAR) 19
* promotes image pre-processing parameter-tuning research, e.g. binarization, de-warping, de-skewing, de-speckling, etc. [apparently not the kind of research we're interested in doing] 20
<br />