Notes.
Perhaps “one sense per discourse” could be applied to NER with appropriate stop words.
Phrasal forms like “both X and Y” imply that X and Y should be the same type.
They found that the BILOU representation (Begin, Inside, Last, Outside, Unit-length) significantly outperforms BIO.
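For concreteness, here is what the two encodings look like on the same chunks, via a small BIO-to-BILOU converter (my sketch, not code from the paper):
<pre>
def bio_to_bilou(tags):
    """Convert a BIO tag sequence to BILOU: B-X..I-X runs become
    B-X..L-X (or U-X if length one); O passes through unchanged."""
    bilou = list(tags)
    n = len(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < n else "O"
        continues = nxt == "I-" + etype
        if prefix == "B":
            bilou[i] = ("B-" if continues else "U-") + etype
        else:  # "I"
            bilou[i] = ("I-" if continues else "L-") + etype
    return bilou

# "Hong Kong" is a two-token LOC, "Taipei" a unit-length one.
print(bio_to_bilou(["O", "B-LOC", "I-LOC", "O", "B-LOC"]))
# ['O', 'B-LOC', 'L-LOC', 'O', 'U-LOC']
</pre>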
An NER system should be robust across multiple domains (examples given: historical texts, news articles, patent applications and web pages). Yes.
CoNLL03 dataset (Reuters 1996 news feeds). MUC7 dataset (North American News Text Corpora). Webpages (personal, academic and computer-science conference pages).
Baseline features (a feature-extractor sketch follows this list).
Previous two tags.
Current word.
Word type (is-capitalized, all-capitalized, all-digits, alphanumeric, etc.).
Current word prefixes and suffixes.
Tokens in the window <math>c = (x_{i-2}, x_{i-1}, x_{i}, x_{i+1}, x_{i+2})</math>.
Capitalization pattern in the window c.
Conjunction of c and the previous tag.
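Roughly how these baseline features might be assembled for a single token (a minimal Python sketch; the feature names, padding token, and affix lengths are my choices, not the paper's):
<pre>
def baseline_features(tokens, i, prev_tags):
    """Baseline feature map for token i; prev_tags holds the two
    previous predicted tags. Names here are illustrative."""
    w = tokens[i]
    feats = {
        "prev_tag_1": prev_tags[-1],
        "prev_tag_2": prev_tags[-2],
        "word": w,
        "is_cap": w[:1].isupper(),
        "all_caps": w.isupper(),
        "all_digits": w.isdigit(),
        "alnum": w.isalnum(),
    }
    for k in (1, 2, 3):                       # prefixes and suffixes
        feats[f"prefix_{k}"] = w[:k]
        feats[f"suffix_{k}"] = w[-k:]
    for off in (-2, -1, 0, 1, 2):             # window c = (x_{i-2}, ..., x_{i+2})
        j = i + off
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats[f"win_{off}"] = tok                        # tokens in c
        feats[f"win_cap_{off}"] = tok[:1].isupper()      # capitalization pattern in c
        feats[f"win_{off}&prev_tag"] = f"{tok}|{prev_tags[-1]}"  # conjunction of c and previous tag
    return feats
</pre>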
Additional NER system features.
Numbers are replaced with a special token so that they can be abstracted. 1997 becomes #### and 801-867-5309 becomes ###-###-####. I like this one a lot.
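In practice this abstraction is a one-line regex (a sketch; per the examples above, each digit maps to '#'):
<pre>
import re

def abstract_numbers(text):
    # Map every digit to '#' so 1997 -> #### and 801-867-5309 -> ###-###-####.
    return re.sub(r"\d", "#", text)

print(abstract_numbers("Call 801-867-5309 before 1997."))
# Call ###-###-#### before ####.
</pre>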
Key design decisions in an NER system.
How to represent text chunks in an NER system.
What inference algorithm to use.
How to model non-local dependencies.
How to use external knowledge resources in NER.
They postulate that identical tokens should have identical label assignments. One sense per discourse, more or less.
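The bluntest way to act on that postulate would be a hard post-hoc rule like the sketch below (mine, not theirs; the paper folds such non-local dependencies in as features rather than overriding the model's output):
<pre>
from collections import Counter, defaultdict

def enforce_majority_labels(tokens, labels):
    """Relabel every occurrence of a surface form with that form's
    most frequent predicted label (a hard 'one label per token' rule)."""
    votes = defaultdict(Counter)
    for tok, lab in zip(tokens, labels):
        votes[tok][lab] += 1
    return [votes[tok].most_common(1)[0][0] for tok in tokens]
</pre>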
They ignore document boundaries on the theory that adjacent documents deal with similar things, so similarities can be exploited even across the boundary. This is valid, I think, especially for our setup where each document is a single page. We should find some way of letting pages bleed into each other.
Context aggregation features, i.e., pooling the contexts of all occurrences of the same token into its feature set. I don't think I fully understand this point [Chieu and Ng, 2003].
Extended prediction history. Previous predictions for the same token factor into the current prediction. How to keep the history from soaking up ALL of the probability? Maybe just make it another feature (sketch below)?
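Making the history "just another feature" might look like this (a sketch with invented names; exposing the label distribution as real-valued features lets the learner weight the history instead of trusting it outright):
<pre>
from collections import Counter, defaultdict

class PredictionHistory:
    """Track past predictions per surface form and expose the
    distribution as features, one signal among many."""
    def __init__(self):
        self.history = defaultdict(Counter)

    def features(self, token):
        counts = self.history[token]
        total = sum(counts.values())
        if total == 0:
            return {"hist_none": True}
        # Fractions rather than hard labels, so history can't dominate.
        return {f"hist_{lab}": c / total for lab, c in counts.items()}

    def update(self, token, predicted_label):
        self.history[token][predicted_label] += 1
</pre>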
They used word clusters induced from unlabeled text to group similar words. I like this idea. They reference [Brown et al., 1992] and [Liang, 2005] and use the Brown algorithm to group related words, then abstract them into concept- and part-of-speech-like classes (machine-induced labels, of course).
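The usual way to turn Brown clusters into features is via bit-string path prefixes at several lengths, so the model can back off from fine-grained to coarse word classes (a sketch; the toy paths and prefix lengths are illustrative):
<pre>
# Brown clustering assigns each word a bit-string path in a binary merge
# tree; prefixes of that path give coarser-to-finer word classes.
CLUSTERS = {            # toy paths; real ones come from a Brown-clustering tool
    "Monday":  "0110110",
    "Tuesday": "0110111",
    "John":    "1110010",
}

def brown_features(word, prefix_lengths=(4, 6, 10, 20)):
    path = CLUSTERS.get(word)
    if path is None:
        return {}
    # A prefix length past the end of the path just yields the full path.
    return {f"brown_{k}": path[:k] for k in prefix_lengths}

print(brown_features("Monday"))
# {'brown_4': '0110', 'brown_6': '011011', 'brown_10': '0110110', 'brown_20': '0110110'}
</pre>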
They say “… injection of gazetteer matches as features in machine-learning based approaches is critical for good performance.”
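Injecting gazetteer matches as features could be as simple as this (my sketch; a serious implementation would use a trie over multi-token entries and emit BILOU-style match positions):
<pre>
GAZETTEER = {"new york": "LOC", "john smith": "PER"}  # toy entries

def gazetteer_features(tokens, i, max_len=3):
    """Emit a feature when token i falls inside a gazetteer entry."""
    feats = {}
    for start in range(max(0, i - max_len + 1), i + 1):
        for end in range(i + 1, min(len(tokens), start + max_len) + 1):
            span = " ".join(tokens[start:end]).lower()
            etype = GAZETTEER.get(span)
            if etype:
                feats[f"gaz_{etype}"] = True
    return feats

# 'York' inside 'New York' fires the LOC gazetteer feature.
print(gazetteer_features(["He", "visited", "New", "York"], 3))
# {'gaz_LOC': True}
</pre>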
One of their knowledge sources is Wikipedia articles. I don't think it would be great for making gazetteers for historical documents though.