Progress Tasks

Tasks that will move the research forward. (Not prioritized!)

  • Use new document collections in the current system
  • Use more OCR engines in the current system
  • Use an OCR confusion matrix to adjust the costs of mismatches. The hope is that this leaves fewer paths through the network, which may let us handle more complex documents and explore the network more quickly.
  • Add hypotheses based on spell checking.
    • Based on one-dimensional spell checking. Consider only a single word.
    • Based on multi-dimensional spell checking. Look at all of the aligned non-English words to determine which word(s) are “closest” to all of the OCR provided tokens.
  • Two pass process: when two or more OCR engines recognize the same word, yet the word is not in the dictionary, add the word to the dictionary. On the second pass, the word will be accepted in more than one document.
  • Use a language model to select between multiple accepted words. Requires augmenting the lattice as described above.
    • Need a mid-20th century news corpus for training.
  • Sclite Viewer: take an Sclite file and view the contents in a way that shows each “sausage”. – Assigned to Chris Rotz
  • Aligned Backpointer Viewer: take the aligned backpointer output of DocumentLattice and view the contents in a way that shows the optimal alignment, the “sausages” and a count of the optimal paths for each sausage.
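
The multi-dimensional spell-checking task above leaves open what “closest” means. One possible reading, offered only as a minimal sketch, is to score each dictionary word by its summed unit-cost edit distance to all of the aligned OCR tokens and keep the word(s) with the lowest total. The function names, the sample tokens, and the tiny word list below are hypothetical and are not taken from the existing code.

```python
from typing import Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance with unit costs (an assumed metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or free match
            ))
        prev = curr
    return prev[-1]


def closest_words(ocr_tokens: Iterable[str], dictionary: Iterable[str]) -> List[str]:
    """Return the dictionary word(s) with the smallest summed distance to all tokens."""
    tokens = list(ocr_tokens)
    best: List[str] = []
    best_score = None
    for word in dictionary:
        score = sum(edit_distance(word, t) for t in tokens)
        if best_score is None or score < best_score:
            best, best_score = [word], score
        elif score == best_score:
            best.append(word)  # keep ties, since several words may be equally close
    return best


if __name__ == "__main__":
    # Hypothetical aligned tokens from three OCR engines for one word position.
    tokens = ["cornmunique", "comrnunique", "communiquc"]
    words = ["communique", "community", "commune"]
    print(closest_words(tokens, words))
```

The one-dimensional variant is the same call with a single token. The unit costs in `edit_distance` are also where per-character costs from an OCR confusion matrix could be substituted, if that task is taken up.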

Clean-up Tasks

  • Go back and update the command-line interface for all runnable classes to use Apache Commons CLI
  • Set up a database to contain all of the results

Come-Up-To-Speed Tasks


  • Read: Lund, Ringger (2009) Improving optical character recognition through efficient multiple system alignment. JCDL 2009. Get a copy of this from Bill.
  • Read: T. Ikeda & T. Imai (1994) Fast A* algorithms for multiple sequence alignment. Proceedings of Genome Informatics Workshop 1994. Yokohama, Japan. Try to find this yourself first, but if not possible, see Bill.
  • Read: S. Schroedl (2005) An improved search algorithm for optimal multiple-sequence alignment. Journal of Artificial Intelligence Research 23 (January/June 2005): 587-623. Try to find this yourself first, but if not possible, see Bill.
  • Read: J. Ajot, J. Fiscus, N. Radde, and C. Laprun. Asclite – Multi-dimensional alignment program. Try to find this yourself first, but if not possible, see Bill.


  • Get an accounts on, from Ryan Amy in Library Information Systems.
  • Get an account on the supercomputer.
  • Get an account on the NLP lab private wiki. See Bill.
  • Get an account on the NLP Subversion server, Entropy. See Bill.
  • Make sure that the computer you are using has SSH, SFTP, SCP, and Eclipse.
  • Begin familiarizing yourself with Linux commands and the Bash shell.
  • Get your PC up and running. Let Bill know what you need.
  • Set up Zotero to track reading.
  • Set up Subversion in Eclipse on DS1, DS2, and your PC


Bringing up the System

  • Reproduce the results previously obtained on a subset of the Eisenhower Communique documents
    • Check out the current code from Entropy
    • Recompile on DS2 and Marylou5
    • Complete runs on both DS2 and Marylou5

Tasks for Chris Rotz

  • Come up to speed
  • Run Eisenhower Communiques through Adobe OCR

Tasks for Johnny Williamson

  • Come up to speed


nlp-private/library-ocr-tasks.txt · Last modified: 2015/04/23 13:37 by ryancha