nlp-private:jlutes [2015/04/23 19:44] (current)
ryancha created
 +My favorite month is Halloween month. ​ My favorite day is Halloween day.
 +==We'​re Nearly Done Now==
 +==NER with Noisy OCR==
 +*Feature ideas
 +**Word order robustness.
 +***Effect of higher order models on f-score (1,2,3)
 +***"​Setify"​ the previous and next word features so that order matters less on, for example, a five-gram window.
 +**Aaron'​s Regexes
 +**#*\b{Title}( [A-Z][A-Za-z]*){1,3}\b
 +*Export the probabilities of the labels assigned to words along with the labels themselves.
 +*Harold B. Lee Library NER corpora.
 +*StatNLP models to try.
 +**<​s>​Train with just PER tags (with dictionary (tolerance 0)).</​s>​ Dev, full name: f-measure 29.85%, precision 37.23%, recall 24.91%.
 +**Train with just PER tags (with dictionary (tolerance 2)).
 +**Mallet (comparable to our MEMM).
 +**<​s>​Generic MEMM (without dictionary).</​s>​ Dev, full name: f-measure 30.11%, precision 29.96%, recall 30.25%.
 +***#​Prefix/​Suffix (size 1 to 10).
 +***#Current word starts uppercase.
 +***#Current word is all uppercase.
 +***#Current word starts uppercase and is not at the beginning of a sentence.
 +***#Next word starts uppercase.
 +***#​Previous word starts uppercase.
 +***#Current word contains a number.
 +***#Current word contains a hyphen.
 +**<​s>​MEMM with dictionary (clean training).</​s>​ Dev, full name: f-measure 29.16%, precision 28.25%, recall 30.13%.
 +**<​s>​MEMM with fuzzy dictionary (tolerance 3) (clean training).</​s>​ Dev, full name: f-measure 27.20%, precision 22.54%, recall 34.28%.
 +*OCR alignment model.
 +**Noisify the original data bajillions of times for the training data?
 +**Apply to the fuzzy dictionary edit distance function.
 +***Make sure that the approximate match code is okay.
 +*Run DEG on the Ancestry.com training set.
 +**Train our model on that data only.
 +**Train on that data as well as CoNLL etc.
 +*<​s>​Provide Bill Lund with fuzzy dictionary</​s>​
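The fuzzy dictionary with a tolerance, as used in the StatNLP runs above, can be sketched as an edit-distance lookup. This is only a minimal sketch, not our actual code: the names `edit_distance` and `fuzzy_lookup` are made up for illustration, matching is assumed to be case-insensitive, and every edit costs 1. An OCR alignment model would presumably replace those unit costs with learned ones.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit costs (an assumption; an OCR
    alignment model could supply character-specific costs instead)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or match
        prev = curr
    return prev[-1]


def fuzzy_lookup(token, dictionary, tolerance):
    """Return the dictionary entries within `tolerance` edits of `token`.

    Hypothetical helper: compares lowercased forms and scans the whole
    dictionary, which is fine for a sketch but slow for large name lists.
    """
    token = token.lower()
    return sorted(entry for entry in dictionary
                  if edit_distance(token, entry.lower()) <= tolerance)


names = {"John", "Mary", "Joan"}
print(fuzzy_lookup("Jolm", names, 0))  # exact match only
print(fuzzy_lookup("Jolm", names, 2))  # tolerates two OCR-style errors
```

Tolerance 0 reduces to an exact (case-insensitive) dictionary match, so the same function covers both the plain and the fuzzy dictionary settings.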
 +===Things being done now===
 +''​Improve the argument in the paper''​
 +# Results from our codebase on a vanilla MEMM.
 +# Results from Mallet on a CRF.
 +# Clean up the code to commit it to the repository.
 +''​Make our model better than DEG'​s''​
 +# Implement features from the papers below.
 +# Change to the BILOU encoding of the data.
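For the BILOU change, a minimal sketch of the conversion, assuming the data is already in the BIO variant where every chunk starts with a B- tag (IOB2). The function name `bio_to_bilou` is made up for illustration. Data in the other BIO variant, where chunks may start with I-, would need to be normalized first.

```python
def bio_to_bilou(tags):
    """Convert one sentence of BIO (IOB2) tags to BILOU.

    Assumes well-formed input: each chunk starts with B-<label> and
    continues with I-<label>.
    """
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == "I-" + label  # does the chunk go on past this token?
        if prefix == "B":
            # Start of a longer chunk stays B; a one-token chunk becomes U(nit).
            out.append(("B-" if continues else "U-") + label)
        else:
            # Inside a chunk stays I; the final token becomes L(ast).
            out.append(("I-" if continues else "L-") + label)
    return out


print(bio_to_bilou(["B-PER", "I-PER", "I-PER", "O", "B-LOC"]))
```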
 +===Things to be doing soonish===
 +# Import Aaron'​s regexes and templates as features.
 +# Try out Thomas'​ list pruning on name dictionaries.
 +# More labeled data through the HBLL.
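Importing a regex template as a feature could look roughly like the sketch below. It uses the `\b{Title}( [A-Z][A-Za-z]*){1,3}\b` template from the feature ideas. The title list, the way `{Title}` is filled in, and the function name `regex_feature` are all assumptions for illustration. Aaron's actual templates and title lists are not reproduced here.

```python
import re

# Hypothetical title list; the real {Title} slot would be filled from
# Aaron's templates.
TITLES = ["Mr", "Mrs", "Dr", "Rev"]
TITLE_ALT = "(?:" + "|".join(re.escape(t) + r"\.?" for t in TITLES) + ")"

TEMPLATE = r"\b{Title}( [A-Z][A-Za-z]*){1,3}\b"
# str.replace rather than str.format, because {1,3} is regex syntax.
PATTERN = re.compile(TEMPLATE.replace("{Title}", TITLE_ALT))


def regex_feature(tokens):
    """Return one boolean per token: True if the token lies inside a match.

    Tokens are joined with single spaces so that match offsets can be
    mapped back to token positions.
    """
    text = " ".join(tokens)
    starts, pos = [], 0
    for tok in tokens:
        starts.append(pos)
        pos += len(tok) + 1
    flags = [False] * len(tokens)
    for m in PATTERN.finditer(text):
        for i, (start, tok) in enumerate(zip(starts, tokens)):
            if start >= m.start() and start + len(tok) <= m.end():
                flags[i] = True
    return flags


print(regex_feature(["Then", "Dr.", "John", "Smith", "spoke"]))
```

The boolean per token can then be added to the MEMM feature vector like any other binary feature.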
 +===598R NER reading list===
 +====L. Ratinov and D. Roth, “Design Challenges and Misconceptions in Named Entity Recognition”====
 +L. Ratinov and D. Roth, “[http://​l2r.cs.uiuc.edu/​~danr/​Papers/​RatinovRo09.pdf Design Challenges and Misconceptions in Named Entity Recognition],​” in Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009),​ 2009, 147–155.
 +**Perhaps "one sense per discourse"​ could be applied to NER with appropriate stop words.
 +**Phrasal forms like "both X and Y" imply that X and Y should be the same type.
 +**They found that BILOU representation ''​significantly''​ outperforms BIO.
 +**An NER system should be robust across multiple domains (examples given: historical texts, news articles, patent applications and web pages). ​ Yes.
 +**CoNLL03 dataset (Reuters 1996 news feeds). ​ MUC7 dataset (North American News Text Corpora). ​ Webpages (personal, academic and computer-science conference pages).
 +***Not really much variety in the test data.
 +**Baseline features.
 +**#Previous two tags.
 +**#Current word.
 +**#Word type (is-capitalized,​ all-capitalized,​ all-digits, alphanumeric,​ &tc.).
 +**#Current word prefixes and suffixes.
 +**#Tokens in the window <​math>​c = (x_{i-2}, x_{i-1}, x_{i}, x_{i+1}, x_{i+2})</​math>​.
 +**#​Capitalization pattern in the window c.
 +**#​Conjunction of c and the previous tag.
 +**Additional NER system features.
 +***POS tags.
 +***Shallow parsing information.
 +**Numbers are replaced with a special token so that they can be abstracted. ​ 1997 becomes #### and 801-867-5309 becomes ###​-###​-####​. ​ I like this one a lot.
 +**Key design decisions in an NER system.
 +**#How to represent text chunks in NER system.
 +**#What inference algorithm to use.
 +**#How to model non-local dependencies.
 +**#How to use external knowledge resources in NER.
 +**They postulate that identical tokens should have identical label assignments. ​ One sense per discourse, more or less.
 +**They ignore document boundaries, on the idea that neighboring documents deal with similar things and so share similarities that can be used even across the boundary.  This is valid, I think, especially since in our case each document is a single page.  We should find some way of letting them bleed over into each other.
 +**Context aggregation features. ​ I don't think I understand this point [Chieu and Ng, 2003].
 +**Extended prediction history. ​ Previous predictions for the same token factor into this prediction. ​ How to keep it from assigning ALL of the probability? ​ Maybe just make it another feature?
 +**They used word clusters from unlabeled text to try to group similar words.  I like this idea.  They reference [Brown et al., 1992] and [Liang, 2005] and use the Brown algorithm to group related words and then abstract them to concepts and parts of speech (with machine-generated labels, of course).
 +**They say "... injection of gazetteer matches as features in machine-learning based approaches is critical for good performance."​
 +***They also note that they'​ve developed non-exact string matching but don't go into details in this paper due to space limitations. ​ This might be interesting.
 +**One of their knowledge sources is Wikipedia articles. ​ I don't think it would be great for making gazetteers for historical documents though.
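The number abstraction and the five-token window from the baseline features above can be sketched together. This is a minimal sketch under my own assumptions: the function names and the `<pad>` token for positions outside the sentence are made up, and the paper's exact handling of sentence edges is not reproduced.

```python
import re


def abstract_digits(token: str) -> str:
    """Replace every digit with '#', so 1997 and 1998 share one feature."""
    return re.sub(r"\d", "#", token)


def window(tokens, i, size=2, pad="<pad>"):
    """Tokens x_{i-size} .. x_{i+size} with digits abstracted.

    Positions outside the sentence get a padding token (an assumption).
    """
    return tuple(
        abstract_digits(tokens[j]) if 0 <= j < len(tokens) else pad
        for j in range(i - size, i + size + 1)
    )


print(abstract_digits("1997"))
print(abstract_digits("801-867-5309"))
print(window(["born", "in", "1997", "."], 2))
```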
 +====H. L Chieu and H. T Ng, “Named entity recognition with a maximum entropy approach.”====
 +H. L Chieu and H. T Ng, “Named entity recognition with a maximum entropy approach.”
 +*They used a tagging method with BCLUO prefixes, which I think is just the same as BILOU.
 +*Two systems are compared, one which makes use of a dictionary and another which does not.  It looks like the dictionary gives approximately 1.5% improvement in F-measure.
 +*Lists are compiled from the training data.
 +**Frequent Word List.  Words occurring in more than five documents.
 +**Useful Unigrams. ​ Top twenty for each class ranked by a correlation metric.
 +**Useful Bigrams. ​ Bigrams that precede a class.
 +**Useful Word Suffixes. ​ They use three letter suffixes. ​ We have features that list out up to 10 letter suffixes and prefixes.
 +**Useful Name Class Suffixes. ​ Unigrams that follow a class.
 +**Function Words. ​ Lower case words that appear within a name class (of, van, etc.).
 +**''​Stop Words''​. ​ They don't have this list, but it seems like it may be useful.
 +*As a preprocessing step, text is "​zoned"​ into headlines, bylines, date lines and story. ​ This wouldn'​t be so useful in historical text, but it seems that "​typing"​ the text into coarse classes like running text and tabled text may be useful.
 +*Their features.
 +**First word, case, and zone.
 +**Case and zone of preceding and succeeding words.
 +**Case sequence (prev and succ both uppercase).
 +**Token information (all-digits,​ contains-dollar-sign,​ &tc.).
 +** ...
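The Frequent Word List above (words occurring in more than five documents) is simple to compile from tokenized training data. A minimal sketch, assuming each document is already a list of tokens; the function name and the `min_docs` parameter are made up for illustration, and no case folding is applied since the notes do not say whether the paper does any.

```python
from collections import Counter


def frequent_word_list(documents, min_docs=5):
    """Return the words that occur in more than `min_docs` documents.

    `documents` is an iterable of token lists. Each word is counted at
    most once per document, so this is document frequency, not raw count.
    """
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))
    return {word for word, n in doc_freq.items() if n > min_docs}


docs = [["the", "cat"]] * 6 + [["dog", "dog", "dog"]] * 5
print(frequent_word_list(docs))
```

The other lists (useful unigrams, bigrams, suffixes) need the class labels and a ranking metric, so they are not sketched here.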
 +#P. F Brown et al., “Class-based n-gram models of natural language,​” Computational linguistics 18, no. 4 (1992): 467–479.
 +#P. Liang, “Semi-supervised learning for natural language” (Citeseer, 2005).
 +===Things done===
 +*Ancestry presentation.
 +**Dictionary (indicate fuzzy matching when appropriate).
nlp-private/jlutes.txt · Last modified: 2015/04/23 19:44 by ryancha