Progress Tasks
Tasks that will move the research forward. (Not prioritized!)
Use new document collections in the current system
Use more OCR engines in the current system
Use an OCR confusion matrix to adjust the costs of mismatches. The hope is that this leaves fewer paths through the network, which may let us handle more complex documents and explore the network more quickly.
Add hypotheses based on spell checking.
Based on one-dimensional spell checking: consider only a single word.
Based on multi-dimensional spell checking. Look at all of the aligned non-English words to determine which word(s) are “closest” to all of the OCR provided tokens.
Two-pass process: when two or more OCR engines recognize the same word, yet the word is not in the dictionary, add the word to the dictionary. On the second pass, the word will then be accepted in more than one document.
Use a language model to select between multiple accepted words. Requires augmenting the lattice as described above.
Sclite Viewer: take an Sclite file and view the contents in a way that shows each “sausage”. Assigned to Chris Rotz.
Aligned Backpointer Viewer: take the aligned backpointer output of DocumentLattice and view the contents in a way that shows the optimal alignment, the “sausages” and a count of the optimal paths for each sausage.
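The confusion-matrix, multi-dimensional spell-checking, and two-pass dictionary items above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the current system's code: the `CONFUSION_COST` table, its values, and the function names `weighted_edit_distance`, `closest_words`, and `augment_dictionary` are all hypothetical, and a real confusion matrix would be estimated from OCR output rather than typed in by hand.

```python
from collections import Counter

# Hypothetical confusion costs (assumed values): character pairs that OCR
# engines often confuse are cheaper to substitute than unrelated pairs.
CONFUSION_COST = {
    ("1", "l"): 0.2, ("l", "1"): 0.2,
    ("0", "o"): 0.2, ("o", "0"): 0.2,
}


def sub_cost(a, b):
    """Substitution cost: 0 for a match, the confusion cost if known, else 1."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get((a, b), 1.0)


def weighted_edit_distance(source, target):
    """Edit distance with unit insert/delete costs and confusion-weighted substitutions."""
    prev = [float(j) for j in range(len(target) + 1)]
    for i, a in enumerate(source, 1):
        curr = [float(i)]
        for j, b in enumerate(target, 1):
            curr.append(min(
                prev[j] + 1.0,               # delete a from source
                curr[j - 1] + 1.0,           # insert b into source
                prev[j - 1] + sub_cost(a, b),  # substitute a with b
            ))
        prev = curr
    return prev[-1]


def closest_words(ocr_tokens, dictionary):
    """Multi-dimensional spell check: dictionary word(s) with the smallest
    total distance to all aligned OCR tokens."""
    scored = {
        word: sum(weighted_edit_distance(tok, word) for tok in ocr_tokens)
        for word in dictionary
    }
    best = min(scored.values())
    return sorted(word for word, score in scored.items() if score == best)


def augment_dictionary(aligned_columns, dictionary, min_engines=2):
    """First pass of the two-pass idea: add any out-of-dictionary word that
    at least min_engines OCR engines agree on."""
    added = set()
    for tokens in aligned_columns:
        for word, count in Counter(tokens).items():
            if count >= min_engines and word not in dictionary:
                added.add(word)
    return set(dictionary) | added
```

For example, `closest_words(["he1lo", "hcllo", "hello"], {"hello", "hollow", "jello"})` scores each dictionary word against all three engine outputs at once, while the one-dimensional variant would call `weighted_edit_distance` on a single token.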
Clean-up Tasks
Come-Up-To-Speed Tasks
Background
Read: Lund, Ringger (2009) Improving optical character recognition through efficient multiple system alignment. JCDL 2009. <br/>Get a copy of this from Bill.
Read: T. Ikeda & T. Imai (1994) Fast A* algorithms for multiple sequence alignment. Proceedings of Genome Informatics Workshop 1994. Yokohama, Japan.<br/>Try to find this yourself first, but if not possible, see Bill.
Read: S. Schroedl (2005) An improved search algorithm for optimal multiple-sequence alignment. Journal of Artificial Intelligence Research 23 (January/June 2005): 587-623.<br/>Try to find this yourself first, but if not possible, see Bill.
Read: J. Ajot, J. Fiscus, N. Radde, and C. Laprun. Asclite – Multi-dimensional alignment program.<br/>Try to find this yourself first, but if not possible, see Bill.
Configuration
Get accounts on DS1.lib.byu.edu and DS2.lib.byu.edu from Ryan Amy in Library Information Systems.
Get an account on the NLP lab private wiki. See Bill.
Get an account on the NLP Subversion server, Entropy. See Bill.
Make sure that the computer you are using has SSH, SFTP, SCP, and Eclipse.
Begin familiarizing yourself with Linux commands and the Bash shell.
Get your PC up and running. Let Bill know what you need.
Set up Zotero to track reading.
Set up Subversion in Eclipse on DS1, DS2, and your PC.
Learning
Linux utilities: SSH, SCP, SFTP
Eclipse
Subversion
Mediawiki editing and authoring
PBS on Marylou5
Bringing up the System
Tasks for [[Cr24|Chris Rotz]]
Tasks for Johnny Williamson