__TOC__

<br />

Data Locations

Primary Location on dithers:

dithers: ssh username@dithers.cs.byu.edu

cd /var/www/deg/Data/Ancestry

Also viewable from the Web:

www.deg.byu.edu/Data/Ancestry

ancestry deg&nlp

Backup Location:

nlp.cs.byu.edu/data/ancestry

<br />

Document Categories, OCR Engines and Sizes

From Shawn Reid of Ancestry.com:

There were 3 OCR systems employed to create the data in the samples that we sent you.

  1. PrimeOCR. This is a voting system that utilizes 6 OCR engines and provides back the “best” results of all engines. This is a commercial system that we used for several years.
  2. ABBYY. Over the past 3 years, we've used an OCR engine provided by Kofax in our Dexter system. Under the covers, this is a version of ABBYY FineReader.
  3. Unknown. Our newspapers were OCR'd by our newspaper partner. I am unaware of what OCR engine they used to do this.

We will specify which pages belong to which sets in our corpus definition file (XML). Ask Thomas for the current version.

  • City Directories (OCR Engine 2)
    • Birmingham, AL; 1888 (875 pages)
    • Portland, OR; 1878-1881 (680 pages)
  • Church Year Books (OCR Engine 1)
    • Hartford First Church of Christ, 1904 (137 pages)
    • New York Church Year Book, 1859 (119 pages)
  • Family History Books (OCR Engine 1)
    • Blake Family; England; (8 pages)
    • Libby family; USA; 1602-1881 (648 pages)
  • Local History Books (OCR Engine 2)
    • Fairfield, CT (780 pages)
    • Inverness, Nova Scotia (356 pages)
  • US Navy Cruise Books (7262 pages) (OCR Engine 2)
  • Newspapers (OCR Engine 3)
    • Montclair Tribune; 1967-1968 (10,039 pages)
    • The Story City Herald; 1955 (24 pages)

<br />

Labeling Guidelines

It's best to look at one of the existing hand label files and read these guidelines at least once; refer back to them when you have questions. Try to follow the XML formatting conventions that Thomas has started, e.g. a blank line between FullName tags and between other large sections of the XML document.

If you see an inconsistency between an existing file and these guidelines, ask Thomas which one should be changed.

<br />

General Concerns

The hand label file should be named as follows:

HandLabel_BookName-FileNum.xml

I'd like to standardize the book names used in the data file names, hand label file names, predicted label file names, etc., so that they contain no underscores or hyphens and the above naming convention stays easy to read, e.g. HandLabel_FamFairfieldI-00771.xml.

<br />

Needed Information

A list of nested label tags is needed. Right now FullName is the common parent label, with GivenName, Surname, NamePrefix and NameSuffix as its possible constituents. All but FullName represent the label given to an individual token and therefore should have a tokenId attribute that corresponds to the token ID found in the XML data file being labeled. This will change in the future to allow a sequence of more than one token to represent one term, e.g. there would be three token tags (each with an ID) in the name “O'Henry”, and these would be grouped by a Surname tag.
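
For illustration, here is a minimal sketch of the current single-token convention next to a possible future multi-token grouping. The token IDs, the names, and the Token tag name are invented for this example and are not part of any existing file:

	<!-- Current convention: each constituent label wraps exactly one token. -->
	<FullName>
		<GivenName tokenId='41'>Enoch</GivenName>
		<Surname tokenId='42'>Blake</Surname>
	</FullName>

	<!-- Possible future convention (hypothetical tag name and IDs): a Surname
	     groups several token tags, e.g. the three tokens of "O'Henry". -->
	<FullName>
		<Surname>
			<Token tokenId='57'>O</Token>
			<Token tokenId='58'>'</Token>
			<Token tokenId='59'>Henry</Token>
		</Surname>
	</FullName>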

<br />

Optional Information

It might be convenient in the future to have the IncludedLabels section completed for each page, so when we add more entity types to a file, we know which ones have been done and which ones have not. See example HandLabel files for how this section looks.

If you looked for a particular label type and it simply does not appear anywhere on the page, then still mark it as complete.

It might also be nice to include the source information, but the primary place to record source information for a hand-label file is the master corpus definition file.

The labeler's name might be useful information in the future for two reasons: in case someone looking at the file has questions about why or how the original labeler did the labeling, and in case we compute IAA (Inter-Annotator Agreement) later.

<br />

Tricky Labeling Decisions

<br />

Garbled OCR

A garbled token will not be considered for inclusion in a hand label file if more than half of the true characters are garbled. To quantify this, mentally construct an edit score for the token. Each of the following edits counts as one: an extra character was appended or inserted into the token, a character within the token was replaced by another character, a space was inserted into the token, or a character was lost. If one character was replaced by more than one character, then each new character counts as one edit. Count up the total number of edits. If this value is greater than half the number of characters in the original token (before OCR), then discard that token: do not include it as part of the hand labeling. For example, if the seven-character token “Johnson” were rendered as “J0hns on” (one replaced character plus one inserted space, so two edits), two is not greater than 3.5, so the token would still be labeled.

The only exception to this that I can think of that makes sense is when one initial letter is replaced by another letter, e.g. “J. E. McCann” was rendered as “J. K. McCann” by the OCR engine. In that case, I think it's okay to label the “K” as a given name.

<br />

Corrections

Add a correction='CorrectToken' attribute to each token that is garbled. If there are fewer tokens in the corrected annotation than in the original, then put the corrected token in the attribute of the first token tag. If there are more tokens in the correction than in the original OCR, then put the first corrected token in the attribute of the single original token tag, and create empty token tags following it in which to place the additional corrected token attributes. If the token has not been included in the hand label file because it was greater than 50% corrupted, then you obviously cannot correct it by adding a correction attribute to it.
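
Here is a minimal sketch of the one-to-one case and of the case where the correction has more tokens than the original OCR. It assumes the correction attribute sits on the same label tags used elsewhere in the hand label files, and that the empty token tag reuses the label tag name; the token IDs and strings are invented:

	<!-- One garbled token, corrected by a single attribute. -->
	<FullName>
		<GivenName tokenId='12' correction='Phebe'>Phcbe</GivenName>
	</FullName>

	<!-- The OCR merged two true tokens into one: the first corrected token goes
	     on the original tag, the second on an empty tag created after it. -->
	<FullName>
		<GivenName tokenId='34' correction='Mary'>MaryAnn</GivenName>
		<GivenName correction='Ann'></GivenName>
	</FullName>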

<br />

Nested Names

If a person's name is part of a company name, place name, wedding name (e.g. “the Olsen-Johnson wedding”), house name (e.g. “the Mrs. O. M. Olson home”), or any other entity type (concept) name, do not label it as a person.

<br />

Factored Names

If a surname is factored out of multiple given names, then label them all as complete names, i.e. distribute the surname to be a part of every given name, keeping the given names and the surname in the same relative linear order in which they appear in the data file.

This also relates to natural language cases such as “Mr. and Mrs. Herschel Williams”, which should be labeled as two names: “Mr Herschel Williams” and “Mrs Herschel Williams”.
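A minimal sketch of the “Mr. and Mrs. Herschel Williams” case. The token IDs are invented, and repeating the shared tokens (with their labels) inside both FullName tags is an assumption about how to encode this; the period and “and” tokens are skipped per the rules above:

	<FullName>
		<NamePrefix tokenId='101'>Mr</NamePrefix>
		<GivenName tokenId='106'>Herschel</GivenName>
		<Surname tokenId='107'>Williams</Surname>
	</FullName>

	<FullName>
		<NamePrefix tokenId='104'>Mrs</NamePrefix>
		<GivenName tokenId='106'>Herschel</GivenName>
		<Surname tokenId='107'>Williams</Surname>
	</FullName>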

<br />

Space and Punctuation inside Name

We do not include spaces or punctuation that appear within a name, not even the periods after initials and abbreviated titles, and not even when the punctuation character actually represents a real letter (one that was garbled into a punctuation character by OCR error).
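
For example, “J. E. McCann” would be labeled using only the letter and word tokens; the period tokens are simply not wrapped in any label tag, which is why the (invented) token IDs below skip numbers:

	<FullName>
		<GivenName tokenId='61'>J</GivenName>
		<GivenName tokenId='63'>E</GivenName>
		<Surname tokenId='65'>McCann</Surname>
	</FullName>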

<br />

Name Titles

Only label the titles listed in the “Primary social titles in English” section at the bottom of the Wikipedia page below, plus military titles like Captain.

http://en.wikipedia.org/wiki/Esquire
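
As a sketch of how a title would be labeled (the name and token IDs are invented, and placing titles under NamePrefix follows the label list in the Plan below), an abbreviated military title like “Capt.” keeps only its letter token, with the period token left out per the punctuation rule above:

	<FullName>
		<NamePrefix tokenId='210'>Capt</NamePrefix>
		<GivenName tokenId='212'>John</GivenName>
		<Surname tokenId='213'>Smith</Surname>
	</FullName>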

<br />

Example Hand Label Page

<HandLabels>
	<Source>
		<Genre>Local Histories</Genre>
		<Title>Fairfield Family</Title>
		<Title>History and Genealogy of the Families of Old Fairfield</Title>
		<Page>00767</Page>
		<FileNumber>00771</FileNumber>
		<ImageFile>FamFairfieldI-003692.767.tif</ImageFile>
		<Labeler>Thomas L. Packer</Labeler>
	</Source>

	<!-- Indicates which labels are given and implies that any other labels have not been added, such as places and dates
	and relationships. -->
	<IncludedLabels>
		<Label complete='True'>FullName</Label>
		<Label complete='True'>GivenName</Label>
		<Label complete='True'>Surname</Label>
		<Label complete='True'>NameSuffix</Label>
	</IncludedLabels>

	<!-- When hand labels are marked as not complete, that just means we cannot compute recall based on this page, but
	we can still use these hand labels to compute precision because we assume that any labels that are given are accurate.
	Tokens are listed in the order they appear in the text.  Corrections are given for OCR errors. -->
	<Labels>
		<FullName>
			<GivenName tokenId='5'>Phebe</GivenName>
		</FullName>
		
		<FullName>
			<GivenName tokenId='18'>Philippa</GivenName>
		</FullName>

		...

		<FullName>
			<GivenName tokenId='749'>Enoch</GivenName>
		</FullName>
	</Labels>
</HandLabels>

<br />

Plan

  1. Pick two file formats to work with (MUC and MALLET?). I think Robbie is suggesting we use a MUC-3 format for format 1 and MALLET format for format 2. Format 1 will be used as our structured document format, containing all the document information we have, including header, coordinates, lines and words, etc. Format 2 will be used as our input to machine learning algorithms, e.g. it will be a sequence of instances/datums with accompanying features and labels.
  2. Format 1 should contain at least the following (a purely illustrative sketch follows this list):
    1. Document Type / Category
    2. Document Name
    3. Page Sequential ID (position of page within document/book)
    4. Datum Sequential ID (position of datum/token within page, which might be implicit)
    5. Datum Coordinates (geometric location within page)
  3. Translate all OCR pages into format 1.
  4. Optionally re-order the tokens so they are in actual document order (correcting ordering errors from the OCR pipeline). This could be done at the same time the OCR pages are translated into format 1.
  5. Don't examine the blind test pages except to label them.
  6. Do all training, development and tuning of system on dev pages.
  7. Label every third page of each book completely starting with page three.
  8. Don't label new pages once we've accumulated 100+ entities and their relations. (We may increase this number later.) Stop on a page boundary: don't label partial pages.
  9. Label everything we might extract, including some of the following (bold is minimum):
    1. Name Prefix (Mr., Mrs., Dr., …)
    2. Given Name (first names, middle names, first and middle initials)
    3. Surname (last name, maiden name, married name)
    4. Name Suffix (Jr., Sr., III, …)
    5. Full Name (containing any of the above name pieces)
    6. City
    7. County
    8. State
    9. Country/Nation
    10. Place (containing any of the above place pieces)
    11. Day of the Month
    12. Month
    13. Year
    14. Date (containing any of the above date pieces)
    15. Life Event (including label distinctions for Birth, Marriage, Residence/Visit, Death, containing one or more of the above entities, e.g. Full Name, Place, Date)
    16. Family Relationship (including label distinctions for Parent-Child, Grandparent-Grandchild, Sibling and Spouse, containing two Full Names)
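
Purely to illustrate the fields listed in step 2 (the actual choice of format 1 is still open; see the Questions and Task Assignments below), a page in an XML-flavored format 1 might look roughly like the following. Every tag name, attribute name and value here is invented:

	<Document category='CityDirectories' name='PortlandOR1878'>
		<Page sequentialId='17'>
			<Datum sequentialId='1' x='102' y='88' width='41' height='12'>Portland</Datum>
			<Datum sequentialId='2' x='150' y='88' width='55' height='12'>Directory</Datum>
			...
		</Page>
	</Document>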

<br />

Questions

Does the NLP code already read MUC format? [Put your answer right after the question.]

Can we put entity labels in the MUC format? How about keeping features in the MUC format, like word coordinates?

Is HTML a good format 1? Should we use that and the DEG extraction format (still in development) for format 1?

Do we need to do hand-labeling in MALLET format?

What code should we use in the NLP code base for doing feature engineering?

<br />

Task Assignments

If you are waiting on someone else to do a previous task before you can do your task, bug them about it. All coding should ideally be accompanied by the writing of unit tests. Does the NLP lab do this already?

Each assignment is listed as: Name (Predicted Date – Actual Date), Task, Description.

  • Thomas (Jun 22 – Jun 22), Explain NLP Code: Figure out (with Robbie's help) what code is already available in the NLP code base for use in this project. E.g., is there code to read format 1 (assuming MUC format)? Is there code for reading format 2 (i.e. lists of instances in MALLET format)? What feature extractor/selector code is there for us to use? Does all of this code work together (e.g. can we read format 1 into classes that existing feature extractors are able to extract from)? Write a description in the “Resources” section of this wiki.
  • Thomas (Jun 23 – ), Pick format 1: Find a format that would be good for format 1 (probably HTML, to be used by DEG code). In the “Resources” section of this wiki, provide a description or explanation.
  • Thomas (Jun 24 – ), Pick format 2: Find a standard format that would be good for format 2 (probably CoNLL format, because I think this is the only format that the NLP code base currently reads; probably not MUC-3). In the “Resources” section of this wiki, provide a link to a description of that format and/or code samples for reading and/or writing that format.
  • Thomas (Jun 22 – Jun 22), Pick format 3: Find a standard format that would be good for format 3 (MALLET). In the “Resources” section of this wiki, provide a link to a description of that format and/or code samples for reading and/or writing that format.
  • Aaron (Jun 23 – ), Format 0 to Format 1: Write, run and debug code to translate the Ancestry .dat format into format 1. We'll start with just plain text, as it appears in the Ancestry .dat format. Second, we could try going back and re-ordering tokens. Then, we may go back and mark line boundaries and other stuff.
  • OCR Clean-up: Write, run and debug code to rotate word coordinates to be correctly orthogonal, parse columns and sentences, and infer the correct order of tokens from the coordinates. (This could be done within the Format 0 to Format 1 step.) Make the new files available on dithers.
  • Format 1 to Format 2: Write, run and debug code to translate format 1 into format 2. We could instead write code to translate directly from format 0 to format 2 if that's easier. Make the new files available on dithers.
  • Format 2 to Format 3: Write, run and debug code to translate format 2 into format 3. Make the new files available on dithers.
  • Manual Labeling: Manually provide labels within format 1 files for down-stream automatic training and testing processes. There will be significant repetitive work in this, so come up with good techniques for minimizing the work somehow. It might make sense for one of us to make a GUI to speed up this process. Labels should include segmentation and label information for all entities and relations that we care to work with, as well as the actual word in case the OCR has made mistakes. This last bit of information will be useful in evaluating how well our systems do on garbled words, as opposed to ungarbled words. Make the new files available on dithers.
  • Dictionaries: Create or find dictionaries to be used in feature extraction and analogous DEG processes. E.g., we'd probably like a list of given names, surnames, place names, etc.
  • Feature Extractors: Write, run and debug code to extract features from format 1, probably run within Format 1 to Format 2.
  • Feature Selectors: Write, run and debug code to select features after feature extraction, probably run within Format 1 to Format 2.
  • Experiments: Design one or more experiments (and/or assign yourself to one here) and execute them. NLP lab experiments will likely involve feature extraction, feature selection, machine learning algorithm and parameter decisions, and data selection. Run the appropriate code on the appropriate data with the appropriate parameters. Record the results of the experiments, along with a description of the input, on the wiki here.

<br />

Resources

NLP Code Base

Looks like this code base can already read from the CoNLL format and that's it. We'll start

<br />

Papers

Named Entity Recognition for Digitised Historical Texts - Grover, Givon, Tobin, Ball. LREC 2008. http://www.lrec-conf.org/proceedings/lrec2008/pdf/342_paper.pdf

<br />
