These guidelines support a quick addition to the Ancestry-data paper, made in response to NAACL/HLT submission feedback, so that the paper can be resubmitted to ASIST.

<br />

__TOC__

<br />

Labeling Guidelines

Before you start, look at one of the existing hand label files and read these guidelines at least once. Refer back to them when you have questions. Try to follow the XML formatting conventions that Thomas has started, e.g. using blank lines to improve readability.

If you see an inconsistency between an existing file and these guidelines, ask Thomas which one should be changed.

<br />

General Concerns

The resulting hand label file should be named as follows:

HandLabels-TranscriptPageLevel_BookName-FileNum.xml

The file from which this will be derived is called:

Transcription_BookName-FileNum.xml

I'd like to standardize the book names used in the data file names, hand label file names, predicted label file names, etc., so that they contain no underscores or hyphens. That keeps the naming convention above easy to read.
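The naming convention above can be sketched in code. This is only an illustration: the function name `hand_label_name` and the regular expression are assumptions, and the guidelines do not specify the exact format of FileNum, so any run of characters without an underscore or hyphen is accepted here.

```python
import re

# Assumed pattern: Transcription_<BookName>-<FileNum>.xml, where neither
# BookName nor FileNum contains an underscore or a hyphen.
TRANSCRIPT_RE = re.compile(r"Transcription_([^_-]+)-([^_-]+)\.xml")


def hand_label_name(transcript_name: str) -> str:
    """Derive the hand label file name from a transcript file name."""
    match = TRANSCRIPT_RE.fullmatch(transcript_name)
    if match is None:
        # Either the name does not follow the convention, or the book name
        # contains an underscore or hyphen.
        raise ValueError(f"unexpected transcript file name: {transcript_name!r}")
    book_name, file_num = match.groups()
    return f"HandLabels-TranscriptPageLevel_{book_name}-{file_num}.xml"
```

A book name containing an underscore or hyphen is rejected rather than guessed at, because the two fields could no longer be separated unambiguously.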

<br />

Needed Information

A list of FullName tags containing just the text of the full name, with no nested elements/constituents/sub-entities like GivenName, Surname, NamePrefix and NameSuffix. There is also no need to include token IDs or bounding box coordinates. These are not given in the transcript files from which the full names will be manually extracted.

No header information needs to be given in this version of the hand label XML file. All such files are assumed to be complete for full names once they are shared among the group.

Optional information includes the source information given in the example below, most importantly the original OCR output data file or the transcript file.

<br />

Tricky Labeling Decisions

<br />

Nested Names

If a person's name is part of a company name, place name, wedding name (e.g. “the Olsen-Johnson wedding”), house name (e.g. “the Mrs. O. M. Olson home”), or any other entity type (concept) name, do not label it as a person.

<br />

Factored Names

If a surname is factored out of multiple given names, label them all as complete names, i.e. distribute the surname so that it is part of every given name, keeping the given names and surname in the same relative linear order in which they appear in the data file.

The same applies to natural language cases such as “Mr. and Mrs. Herschel Williams”, which should be labeled as two names: “Mr Herschel Williams” and “Mrs Herschel Williams”.
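The two factoring rules above could be sketched as follows. Labeling is done by hand, so this is only meant to make the expected output concrete. The function names, the `surname_first` flag, and the small title set are assumptions made for illustration, and the title set is not the list the guidelines refer to.

```python
import re

# Illustrative title set only (an assumption, not the guidelines' list).
TITLES = ("Mr", "Mrs", "Ms", "Miss")
_TITLE = "(" + "|".join(TITLES) + ")"
# Matches "<Title>. and <Title>. <rest of name>"; the periods are optional.
SHARED_TITLES_RE = re.compile(rf"^{_TITLE}\.?\s+and\s+{_TITLE}\.?\s+(.+)$")


def distribute_surname(given_names, surname, surname_first=False):
    """Attach a factored-out surname to every given name.

    surname_first keeps the surname before the given name when that is the
    order in which they appear in the data file.
    """
    if surname_first:
        return [f"{surname} {given}" for given in given_names]
    return [f"{given} {surname}" for given in given_names]


def expand_shared_titles(text):
    """Split a phrase like 'Mr. and Mrs. X Y' into one full name per title."""
    match = SHARED_TITLES_RE.match(text.strip())
    if match is None:
        return [text.strip()]
    first_title, second_title, rest = match.groups()
    return [f"{first_title} {rest}", f"{second_title} {rest}"]
```

For the phrase quoted above, `expand_shared_titles("Mr. and Mrs. Herschel Williams")` yields the two labels “Mr Herschel Williams” and “Mrs Herschel Williams”.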

<br />

Space and Punctuation inside Name

We do not include the space and punctuation that appear within a name. This covers the periods after initials and abbreviated titles, and also punctuation that stands in for a real letter garbled by an OCR error. Hand annotations may still include punctuation, because the label-file reading code can remove it automatically.
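The automatic clean-up mentioned above might look like the sketch below. The actual label-file reading code is not shown in these guidelines, so the function name `normalize_name` and the `drop_spaces` option are assumptions. Only ASCII punctuation is handled.

```python
import string


def normalize_name(label, drop_spaces=False):
    """Remove ASCII punctuation from a name label and tidy its whitespace.

    With drop_spaces=False, runs of whitespace collapse to single spaces.
    With drop_spaces=True, whitespace is removed entirely, which may suit
    comparisons that should ignore spacing.
    """
    no_punct = "".join(ch for ch in label if ch not in string.punctuation)
    tokens = no_punct.split()
    separator = "" if drop_spaces else " "
    return separator.join(tokens)
```

Under these assumptions, a hand annotation written as “Mr. Mark Clement, Jr.” normalizes to “Mr Mark Clement Jr”. Note that hyphens count as punctuation here, so a hyphenated name would be joined into one token.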

<br />

Name Titles

Label only the titles listed under “Primary social titles in English” at the bottom of the following Wikipedia page, plus military titles such as Captain.

http://en.wikipedia.org/wiki/Esquire

<br />

Example Hand Label Page

<HandLabels pageLevelEval='True'>
	<Source>
		<Title>Fairfield Family</Title>
		<Title>History and Genealogy of the Families of Old Fairfield</Title>
		<Page>00767</Page>
		<FileNumber>00771</FileNumber>
		<ImageFile>FamFairfieldI-003692.767.tif</ImageFile>
		<TranscriptFile>Transcription_FamFairfieldI-003692.767.tif</TranscriptFile>
		<OcrDataFile>OcrData_FamFairfieldI-003692.767.tif</OcrDataFile>
		<OcrDataFile>FamFairfieldI-003692.767.tif</OcrDataFile>
		<Labeler>Thomas L. Packer</Labeler>
	</Source>

	<Labels>
		<FullName>Phebe</FullName>
		
		<FullName>Philippa Clement</FullName>

		<FullName>Mr Mark Clement Jr</FullName>

		...

		<FullName>Enoch Jasperson</FullName>
	</Labels>
</HandLabels>
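A hand label file in this shape can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is not the group's label-file reading code. The function name `read_full_names` is an assumption, and the sample string is a shortened copy of the example above without the Source block.

```python
import xml.etree.ElementTree as ET


def read_full_names(xml_text):
    """Return the text of every FullName element, in document order."""
    root = ET.fromstring(xml_text)
    names = []
    for element in root.iter("FullName"):
        text = (element.text or "").strip()
        if text:  # skip empty FullName elements
            names.append(text)
    return names


# Shortened copy of the example page above.
SAMPLE = """<HandLabels pageLevelEval='True'>
    <Labels>
        <FullName>Phebe</FullName>
        <FullName>Philippa Clement</FullName>
        <FullName>Mr Mark Clement Jr</FullName>
    </Labels>
</HandLabels>"""

if __name__ == "__main__":
    for name in read_full_names(SAMPLE):
        print(name)
```

Because only FullName elements are read, the optional Source block can be present or absent without affecting the result.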

<br />

nlp-private/page-level-image-based-labeling-guidelines.txt · Last modified: 2015/04/23 19:23 by ryancha