Table of Contents

Tasks Preparing for NAACL

Next

  1. Talk to Dr. Lonsdale and Dr. Davies in the Linguistics Department about finding a corpus of mid-20th-century news to build a language model.
  2. Establish a new lower bound based on adding spelling correction.
  3. Consider whether it is better to allow only tokens that show some indication of being correct. That would eliminate many tokens from the committed list and possibly lower the error rate, but is that useful?

Prioritized

  1. Add another data set. (Deseret News, Daily Enquirer, fax data set?)
  2. Add another OCR engine. (IrisReader, Adobe, OCRopus current version?)
  3. Add hypotheses based on an edit distance from all of the aligned tokens rather than a “spell checker” method that only considers the alternatives to a single token.
  4. Construct the multiple-token list by looking at tokens that appear in more than one OCR engine, rather than tokens that simply appear more than once across the whole corpus. This avoids the problem of a single OCR engine making the same mistake repeatedly, which would otherwise put the erroneous token in the list.
  5. Run a complete set of numbers on the multiple-token list, but change its priority in the levels of evidence, include voting, and separate punctuation.
  6. Separate recognition of punctuation from words. Currently sclite merges them into a single recognized token.
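
The edit-distance idea in item 3 can be sketched roughly as follows. This is an illustrative sketch, not project code: the function names, the lexicon-scan strategy, and the `max_dist` threshold are all assumptions. It proposes candidates within a small edit distance of any aligned alternative in a sausage slot, rather than spell-checking one token in isolation.

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def hypotheses_for_slot(aligned_tokens, lexicon, max_dist=2):
    """Return lexicon words within max_dist of ANY aligned alternative,
    so evidence from every OCR engine's token contributes hypotheses."""
    return {word for word in lexicon
            if any(edit_distance(word, t) <= max_dist
                   for t in aligned_tokens)}
```

For example, `hypotheses_for_slot(["hcuse"], {"house", "banana"})` keeps only "house". A linear scan of the lexicon is slow for large dictionaries; a BK-tree or trie-based search would be the obvious optimization if this pans out.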

Unprioritized

Done

Process

At each step, show the results.

  1. Align the OCR outputs with Oracle results using sclite, examining all alternatives to determine whether any of them are correct. This is the old baseline and already exists.
  2. Using a dictionary and gazetteer, commit to one token sequence. This is the JCDL 2009 result and exists.
  3. Collect evidence from the OCR of recurring tokens.
    • Collect recurring tokens as they occur multiple times in the same file. The commit step weighted this evidence equally with the dictionary and gazetteer; it provided marginal improvement across the entire dev set but significantly worse results (200% worse) for some individual files.
    • A better method would be to make the recurring-token file less significant than dictionary/gazetteer evidence, consulting it only when the dictionary/gazetteer look-up fails. This may permit us to use all of the recurring tokens. (One of the problems was that as we improved recall of true tokens from the recurring set, our precision dropped. Because this evidence had the same weight as the dictionary/gazetteer, the overall error rate suffered. If we only use the recurring-token list when the dictionary/gazetteer fails, we do not hurt the dictionary-only results and can only improve them.)
    • Another possible way to improve the precision of the recurring-token list is to require recurring tokens to appear in more than one OCR engine. This helps avoid the problem of the OCR engines consistently making the same mistake; at least then more than one engine would need to make the same mistake.
  4. Add hypotheses by taking each sausage and checking its tokens for existence in the dictionary/gazetteer. If a token is not found, use the spell checker on the single word to suggest an alternative. For sausages this creates a new alternative within the sausage. A viewer needs to accommodate this!
    • If a token is not found in the dictionary/gazetteer, explore whether dividing the token with a space results in two tokens that are found.
    • Explore whether tokens divided by a single dash (do the OCR engines use different dash characters?) have both halves in the dictionary/gazetteer.
    • For tokens ending in a dash, explore whether merging them with the next token produces a token found in the dictionary/gazetteer.
    • It seems that we are reacting to specific types of OCR errors. Is this a problem?
  5. Within a sausage explore across aligned alternatives using a multi-input spell checker.
  6. The commit process needs to evaluate each sausage for “fitness” using levels of evidence for the tokens within it, e.g. 1) found in the dictionary/gazetteer, 2) some type of splitting or merging results in a found token, 3) a single spell-check alternative, 4) multiple spell-check alternatives, 5) found in the recurring-token list, 6) “looks like an English word.”
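
The multi-engine filter described in step 3 can be sketched as below. The data layout (a dict mapping engine name to its token list) and the `min_engines` threshold are assumptions for illustration; the point is that each engine is counted at most once per token, so one engine repeating a mistake cannot get that mistake onto the list.

```python
from collections import Counter

def recurring_tokens(engine_outputs, min_engines=2):
    """engine_outputs: {engine_name: [token, ...]}.
    Return tokens seen in at least min_engines distinct engines."""
    engines_per_token = Counter()
    for tokens in engine_outputs.values():
        for token in set(tokens):       # count each engine only once
            engines_per_token[token] += 1
    return {t for t, n in engines_per_token.items() if n >= min_engines}
```

For example, with `{"A": ["the", "qnick", "fox", "fox"], "B": ["the", "quick", "fox"]}` the filter keeps "the" and "fox" but drops the misrecognition "qnick", which only engine A produced, even though "fox" repeats within A.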

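The levels-of-evidence commit in step 6 could be sketched as below. The helper names, the crude “looks like an English word” heuristic, and the representation of spell-check output as a precomputed dict are all illustrative assumptions; only the ordering of evidence levels comes from the list in step 6.

```python
import re

def looks_like_english(token):
    """Crude stand-in for 'looks like an English word':
    letters only, containing at least one vowel."""
    return bool(re.fullmatch(r"[A-Za-z]+", token)) and \
           any(v in token.lower() for v in "aeiou")

def split_found(token, lexicon):
    """Level 2: some split of the token yields two found words."""
    return any(token[:i] in lexicon and token[i:] in lexicon
               for i in range(1, len(token)))

def evidence_level(token, lexicon, spell_suggestions, recurring):
    """Lower level = stronger evidence, per the step-6 ordering."""
    if token in lexicon:
        return 1                        # dictionary/gazetteer hit
    if split_found(token, lexicon):
        return 2                        # splitting yields found tokens
    suggestions = spell_suggestions.get(token, [])
    if len(suggestions) == 1:
        return 3                        # single spell-check alternative
    if len(suggestions) > 1:
        return 4                        # multiple spell-check alternatives
    if token in recurring:
        return 5                        # recurring-token list
    if looks_like_english(token):
        return 6
    return 7                            # no supporting evidence

def commit_slot(alternatives, lexicon, spell_suggestions, recurring):
    """Commit to the alternative with the strongest evidence."""
    return min(alternatives,
               key=lambda t: evidence_level(t, lexicon,
                                            spell_suggestions, recurring))
```

Ranking by strongest evidence keeps the dictionary-only behavior intact: weaker sources like the recurring-token list are consulted only when nothing above them fires, matching the fallback ordering argued for in step 3.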
Tasks Preparing for JCDL

Tasks that need to happen

Expand Datasets

DocumentLattice

*toSclite

Tasks for Chris Rotz

Tasks for Johnny Williamson

Research in Scholarly Publications