nlp-private:library-ocr-tasks [CS Wiki]

Progress Tasks

Tasks that will move the research forward. (Not prioritized!)

Use new document collections in the current system
- Daily Enquirer (19th century)
- Deseret News (19th century)
- Fax communications (21st century)
Use more OCR engines in the current system
- ReadIRIS
- Adobe Acrobat OCR – Assigned to Chris Rotz
- Prime OCR
Use an OCR confusion matrix to adjust the costs of mismatches. The hope is that this leaves fewer paths through the network, which may allow us to handle more complex documents and explore the network more quickly.
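A minimal sketch of how confusion counts could be turned into mismatch costs, assuming character pairs (truth, OCR output) have already been collected from aligned training text. The function name, the add-one smoothing, and the negative-log-probability cost are illustrative assumptions, not what the current system does.

```python
import math
from collections import Counter

def substitution_costs(aligned_pairs, smoothing=1.0):
    """Map (truth_char, ocr_char) to -log P(ocr_char | truth_char).

    aligned_pairs is a list of (truth_char, ocr_char) tuples.
    Frequent confusions get a low cost, rare ones a high cost.
    """
    pair_counts = Counter(aligned_pairs)
    truth_counts = Counter(t for t, _ in aligned_pairs)
    alphabet = {c for pair in aligned_pairs for c in pair}
    costs = {}
    for t in truth_counts:
        # Additive smoothing so unseen confusions keep a finite cost.
        denom = truth_counts[t] + smoothing * len(alphabet)
        for o in alphabet:
            p = (pair_counts[(t, o)] + smoothing) / denom
            costs[(t, o)] = -math.log(p)
    return costs

# Toy counts (made up for illustration): "l" is sometimes read as "1".
pairs = ([("l", "l")] * 8 + [("l", "1")] * 2 +
         [("1", "1")] * 9 + [("1", "l")] +
         [("m", "m")] * 10)
costs = substitution_costs(pairs)
```

With these toy counts, reading "l" as "1" costs less than reading "l" as "m", so an aligner using these costs would prefer the plausible confusion.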
Adding hypotheses based on spell checking.
- Based on one-dimensional spell checking: consider only a single word.
- Based on multi-dimensional spell checking. Look at all of the aligned non-English words to determine which word(s) are “closest” to all of the OCR provided tokens.
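One way to read "closest to all of the OCR provided tokens" is the dictionary word with the smallest summed edit distance to every aligned token. The sketch below assumes that reading; the function names and the plain Levenshtein distance are assumptions for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

def closest_words(ocr_tokens, dictionary):
    """Return the dictionary word(s) minimizing total distance to all tokens."""
    scored = [(sum(edit_distance(w, t) for t in ocr_tokens), w)
              for w in dictionary]
    best = min(score for score, _ in scored)
    return sorted(w for score, w in scored if score == best)

# Three engines' output for the same aligned position (made-up example).
tokens = ["cornmand", "commamd", "conmand"]
candidates = closest_words(tokens, ["command", "commend", "remand", "cornland"])
```

The one-dimensional case is the same call with a single token in the list.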
Two-pass process: when two or more OCR engines recognize the same word but the word is not in the dictionary, add the word to the dictionary. On the second pass, the word will be accepted in more than one document.
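A rough sketch of the first pass, assuming the aligned output is available as a list of columns with one token per OCR engine. That data layout and the function name are hypothetical, not the DocumentLattice format.

```python
from collections import Counter

def learn_agreed_words(aligned_columns, dictionary, min_engines=2):
    """First pass: collect out-of-dictionary words that at least
    min_engines engines agree on at the same aligned position."""
    learned = set()
    for column in aligned_columns:
        for word, n in Counter(column).items():
            if n >= min_engines and word not in dictionary:
                learned.add(word)
    return learned

# Made-up columns: each inner list holds one token per engine.
columns = [
    ["Eisenhower", "Eisenhower", "Eisenhovver"],
    ["the", "the", "tbe"],
    ["Qzx", "Qzy", "Qzz"],
]
dictionary = {"the"}
learned = learn_agreed_words(columns, dictionary)
# The second pass would then run with the augmented dictionary.
augmented = dictionary | learned
```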
Use a language model to select between multiple accepted words. Requires augmenting the lattice as described above.
- Need a mid-20th century news corpus for training.
Sclite Viewer: take an Sclite file and view the contents in a way that shows each “sausage”. – Assigned to Chris Rotz
Aligned Backpointer Viewer: take the aligned backpointer output of DocumentLattice and view the contents in a way that shows the optimal alignment, the “sausages” and a count of the optimal paths for each sausage.
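As a starting point for either viewer, the sketch below prints each "sausage" as a bracketed group of alternatives with vote counts. It assumes the alignment has already been parsed into columns of tokens, one per engine, with an empty string for a gap. Parsing the actual Sclite or backpointer files is not covered, and the input layout is hypothetical.

```python
from collections import Counter

def render_sausages(columns, gap="-"):
    """Render each aligned column as [alt xN | alt xN ...]."""
    parts = []
    for column in columns:
        counts = Counter(tok if tok else gap for tok in column)
        # most_common() lists the majority reading first.
        alts = [f"{tok} x{n}" for tok, n in counts.most_common()]
        parts.append("[" + " | ".join(alts) + "]")
    return " ".join(parts)

# Made-up alignment of three engines over three positions.
view = render_sausages([
    ["the", "the", "the"],
    ["quick", "qnick", "quick"],
    ["fox", "fox", ""],
])
print(view)
```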

Go back and update the command-line interface for all runnable classes to use Apache Commons CLI.
Set up a database to contain all of the results.

Read: Lund, Ringger (2009) Improving optical character recognition through efficient multiple system alignment. JCDL 2009. Get a copy of this from Bill.
Read: T. Ikeda & T. Imai (1994) Fast A* algorithms for multiple sequence alignment. Proceedings of Genome Informatics Workshop 1994. Yokohama, Japan. Try to find this yourself first, but if not possible, see Bill.
Read: S. Schroedl (2005) An improved search algorithm for optimal multiple-sequence alignment. Journal of Artificial Intelligence Research 23 (January/June 2005): 587-623. Try to find this yourself first, but if not possible, see Bill.
Read: J. Ajot, J. Fiscus, N. Radde, and C. Laprun. Asclite – Multi-dimensional alignment program. Try to find this yourself first, but if not possible, see Bill.

Get accounts on DS1.lib.byu.edu and DS2.lib.byu.edu from Ryan Amy in Library Information Systems.
Get an account on the supercomputer.
Get an account on the NLP lab private wiki. See Bill.
Get an account on the NLP Subversion server, Entropy. See Bill.
Make sure that the computer you are using has SSH, SFTP, SCP, and Eclipse.
Begin familiarizing yourself with Linux commands and the Bash shell.
Get your PC up and running. Let Bill know what you need.
Set up Zotero to track reading.
Set up Subversion in Eclipse on DS1, DS2, and your PC.

Reproduce the results of the earlier runs on a subset of the Eisenhower Communique documents:
- Check out the current code from Entropy
- Recompile on DS2 and Marylou5
- Complete runs on both DS2 and Marylou5

nlp-private/library-ocr-tasks.txt · Last modified: 2015/04/23 13:37 by ryancha
