Notes.
Perhaps “one sense per discourse” could be applied to NER with appropriate stop words.
Phrasal forms like “both X and Y” imply that X and Y should be the same type.
They found that the BILOU representation (Begin, Inside, Last, Outside, Unit-length) significantly outperforms BIO.
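For concreteness, here is what the two encodings look like on the same chunks, via a small BIO-to-BILOU converter (my sketch, not code from the paper):
<pre>
def bio_to_bilou(tags):
    """Convert a BIO tag sequence to BILOU: B-X..I-X runs become
    B-X..L-X (or U-X if length one); O passes through unchanged."""
    bilou = list(tags)
    n = len(tags)
    for i, tag in enumerate(tags):
        if tag == "O":
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < n else "O"
        continues = nxt == "I-" + etype
        if prefix == "B":
            bilou[i] = ("B-" if continues else "U-") + etype
        else:  # "I"
            bilou[i] = ("I-" if continues else "L-") + etype
    return bilou

# "Hong Kong" is a two-token LOC, "Taipei" a unit-length one.
print(bio_to_bilou(["O", "B-LOC", "I-LOC", "O", "B-LOC"]))
# ['O', 'B-LOC', 'L-LOC', 'O', 'U-LOC']
</pre>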
An NER system should be robust across multiple domains (examples given: historical texts, news articles, patent applications and web pages). Yes.
CoNLL03 dataset (Reuters 1996 news feeds). MUC7 dataset (North American News Text Corpora). Webpages (personal, academic and computer-science conference pages).
Baseline features (a feature-extractor sketch follows this list).
Previous two tags.
Current word.
Word type (is-capitalized, all-capitalized, all-digits, alphanumeric, etc.).
Current word prefixes and suffixes.
Tokens in the window <math>c = (x_{i-2}, x_{i-1}, x_{i}, x_{i+1}, x_{i+2})</math>.
Capitalization pattern in the window c.
Conjunction of c and the previous tag.
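Roughly how these baseline features might be assembled for a single token (a minimal Python sketch; the feature names, padding token, and affix lengths are my choices, not the paper's):
<pre>
def baseline_features(tokens, i, prev_tags):
    """Baseline feature map for token i; prev_tags holds the two
    previous predicted tags. Names here are illustrative."""
    w = tokens[i]
    feats = {
        "prev_tag_1": prev_tags[-1],
        "prev_tag_2": prev_tags[-2],
        "word": w,
        "is_cap": w[:1].isupper(),
        "all_caps": w.isupper(),
        "all_digits": w.isdigit(),
        "alnum": w.isalnum(),
    }
    for k in (1, 2, 3):                       # prefixes and suffixes
        feats[f"prefix_{k}"] = w[:k]
        feats[f"suffix_{k}"] = w[-k:]
    for off in (-2, -1, 0, 1, 2):             # window c = (x_{i-2}, ..., x_{i+2})
        j = i + off
        tok = tokens[j] if 0 <= j < len(tokens) else "<PAD>"
        feats[f"win_{off}"] = tok                        # tokens in c
        feats[f"win_cap_{off}"] = tok[:1].isupper()      # capitalization pattern in c
        feats[f"win_{off}&prev_tag"] = f"{tok}|{prev_tags[-1]}"  # conjunction of c and previous tag
    return feats
</pre>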
Additional NER system features.
Numbers are replaced with a special token so that they can be abstracted. 1997 becomes #### and 801-867-5309 becomes ###-###-####. I like this one a lot.
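In practice this abstraction is a one-line regex (a sketch; per the examples above, each digit maps to '#'):
<pre>
import re

def abstract_numbers(text):
    # Map every digit to '#' so 1997 -> #### and 801-867-5309 -> ###-###-####.
    return re.sub(r"\d", "#", text)

print(abstract_numbers("Call 801-867-5309 before 1997."))
# Call ###-###-#### before ####.
</pre>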
Key design decisions in an NER system.
How to represent text chunks in an NER system.
What inference algorithm to use.
How to model non-local dependencies.
How to use external knowledge resources in NER.
They postulate that identical tokens should have identical label assignments. One sense per discourse, more or less.
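The bluntest way to act on that postulate would be a hard post-hoc rule like the sketch below (mine, not theirs; the paper folds such non-local dependencies in as features rather than overriding the model's output):
<pre>
from collections import Counter, defaultdict

def enforce_majority_labels(tokens, labels):
    """Relabel every occurrence of a surface form with that form's
    most frequent predicted label (a hard 'one label per token' rule)."""
    votes = defaultdict(Counter)
    for tok, lab in zip(tokens, labels):
        votes[tok][lab] += 1
    return [votes[tok].most_common(1)[0][0] for tok in tokens]
</pre>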
They ignore document boundaries on the theory that adjacent documents deal with similar things, so similarities can be exploited even across the boundary. This is valid, I think, especially for our setup where each document is a single page. We should find some way of letting pages bleed into each other.
Context aggregation features, i.e., pooling the contexts of all occurrences of the same token into its feature set. I don't think I fully understand this point [Chieu and Ng, 2003].
Extended prediction history. Previous predictions for the same token factor into the current prediction. How to keep the history from soaking up ALL of the probability? Maybe just make it another feature (sketch below)?
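Making the history "just another feature" might look like this (a sketch with invented names; exposing the label distribution as real-valued features lets the learner weight the history instead of trusting it outright):
<pre>
from collections import Counter, defaultdict

class PredictionHistory:
    """Track past predictions per surface form and expose the
    distribution as features, one signal among many."""
    def __init__(self):
        self.history = defaultdict(Counter)

    def features(self, token):
        counts = self.history[token]
        total = sum(counts.values())
        if total == 0:
            return {"hist_none": True}
        # Fractions rather than hard labels, so history can't dominate.
        return {f"hist_{lab}": c / total for lab, c in counts.items()}

    def update(self, token, predicted_label):
        self.history[token][predicted_label] += 1
</pre>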
They used word clusters induced from unlabeled text to group similar words. I like this idea. They reference [Brown et al., 1992] and [Liang, 2005] and use the Brown algorithm to group related words, then abstract them into concept- and part-of-speech-like classes (machine-induced labels, of course).
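The usual way to turn Brown clusters into features is via bit-string path prefixes at several lengths, so the model can back off from fine-grained to coarse word classes (a sketch; the toy paths and prefix lengths are illustrative):
<pre>
# Brown clustering assigns each word a bit-string path in a binary merge
# tree; prefixes of that path give coarser-to-finer word classes.
CLUSTERS = {            # toy paths; real ones come from a Brown-clustering tool
    "Monday":  "0110110",
    "Tuesday": "0110111",
    "John":    "1110010",
}

def brown_features(word, prefix_lengths=(4, 6, 10, 20)):
    path = CLUSTERS.get(word)
    if path is None:
        return {}
    # A prefix length past the end of the path just yields the full path.
    return {f"brown_{k}": path[:k] for k in prefix_lengths}

print(brown_features("Monday"))
# {'brown_4': '0110', 'brown_6': '011011', 'brown_10': '0110110', 'brown_20': '0110110'}
</pre>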
They say “… injection of gazetteer matches as features in machine-learning based approaches is critical for good performance.”
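Injecting gazetteer matches as features could be as simple as this (my sketch; a serious implementation would use a trie over multi-token entries and emit BILOU-style match positions):
<pre>
GAZETTEER = {"new york": "LOC", "john smith": "PER"}  # toy entries

def gazetteer_features(tokens, i, max_len=3):
    """Emit a feature when token i falls inside a gazetteer entry."""
    feats = {}
    for start in range(max(0, i - max_len + 1), i + 1):
        for end in range(i + 1, min(len(tokens), start + max_len) + 1):
            span = " ".join(tokens[start:end]).lower()
            etype = GAZETTEER.get(span)
            if etype:
                feats[f"gaz_{etype}"] = True
    return feats

# 'York' inside 'New York' fires the LOC gazetteer feature.
print(gazetteer_features(["He", "visited", "New", "York"], 3))
# {'gaz_LOC': True}
</pre>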
One of their knowledge sources is Wikipedia articles. I don't think it would be great for making gazetteers for historical documents though.