~~NOTOC~~
Recovering high quality digital text from modern machine-printed document images using Optical Character Recognition (OCR) is nearly a solved problem. However, recovering high quality digital text from historical document images is significantly more challenging. Our document recognition project focuses on the latter problem. The work encompasses efforts to combine multiple OCR hypotheses using multi-sequence alignment methods and machine learning to select the best hybrid transcription. Such hypotheses can come from multiple OCR engines or from a single OCR engine on different inputs. The potential for transcription improvement is substantial.
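The combination strategy described above can be illustrated with a minimal sketch (not the project's actual pipeline): align several OCR hypotheses for the same line, then choose a consensus token at each position. The sketch assumes the hypotheses are similar enough that aligning each one to the first with `difflib` approximates a multi-sequence alignment; the real work uses more careful alignment and learned selection rather than simple voting.

```python
# Sketch: combine multiple OCR hypotheses by alignment plus majority vote.
from collections import Counter
from difflib import SequenceMatcher

def combine_hypotheses(hypotheses):
    reference = hypotheses[0].split()
    # columns[i] collects every hypothesis token aligned to reference[i]
    columns = [[] for _ in reference]
    for hyp in hypotheses:
        tokens = hyp.split()
        matcher = SequenceMatcher(None, reference, tokens)
        for op, r1, r2, h1, h2 in matcher.get_opcodes():
            if op in ("equal", "replace"):
                for i, j in zip(range(r1, r2), range(h1, h2)):
                    columns[i].append(tokens[j])
    # majority vote in each column yields the hybrid transcription
    return " ".join(Counter(col).most_common(1)[0][0] for col in columns)

ocr_outputs = [
    "the qu1ck brown fox",
    "the quick brown f0x",
    "the quick br0wn fox",
]
print(combine_hypotheses(ocr_outputs))  # -> "the quick brown fox"
```

No single engine output above is error-free, yet the voted hybrid is; this is the source of the improvement the project exploits.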
  
== Publications ==
  
  
^ [http://www.researchgate.net/publication/260084914_How_Well_Does_Multiple_OCR_Error_Correction_Generalize| How Well Does Multiple OCR Error Correction Generalize?] ^^
| [[media:nlp:140px-binarization-generalization.png]] | William B. Lund, Eric K. Ringger, Daniel D. Walker |
| ::: | '''In Proceedings of the 20th Document Recognition and Retrieval (DRR 2014)''' |
| ::: | As the digitization of historical documents, such as newspapers, becomes more common, the need of the archive patron for accurate digital text from those documents increases. Building on our earlier work, the contributions of this paper are: 1. demonstrating the applicability of novel methods for correcting optical character recognition (OCR) on disparate data sets, including a new synthetic training set; 2. enhancing the correction algorithm with novel features; and 3. assessing the data requirements of the correction learning method. First, we correct errors using conditional random fields (CRF) trained on synthetic training data sets in order to demonstrate the applicability of the methodology to unrelated test sets. Second, we show the strength of lexical features from the training sets on two unrelated test sets, yielding a relative reduction in word error rate on the test sets of 6.52%. New features capture the recurrence of hypothesis tokens and yield an additional relative reduction in WER of 2.30%. Further, we show that only 2.0% of the full training corpus of over 500,000 feature cases is needed to achieve correction results comparable to those using the entire training corpus, effectively reducing both the complexity of the training process and the learned correction model. |
  
  
^ [http://dl.acm.org/citation.cfm?id=2501126| Why Multiple Document Image Binarizations Improve OCR] ^^
| [[media:nlp:140px-whybinarization.png]] | William B. Lund, Douglas J. Kennard, Eric K. Ringger |
| ::: | '''2nd International Workshop on Historical Document Imaging and Processing 2013 (HIP 2013)''' |
| ::: | Our previous work has shown that the error correction of optical character recognition (OCR) on degraded historical machine-printed documents is improved with the use of multiple information sources and multiple OCR hypotheses, including from multiple document image binarizations. The contributions of this paper are in demonstrating how diversity among multiple binarizations makes those improvements to OCR accuracy possible. We demonstrate the degree and breadth to which the information required for correction is distributed across multiple binarizations of a given document image. Our analysis reveals that the sources of these corrections are not limited to any single binarization and that the full range of binarizations holds information needed to achieve the best result as measured by the word error rate (WER) of the final OCR decision. Even binarizations with high WERs contribute to improving the final OCR. For the corpus used in this research, fully 2.68% of all tokens are corrected using hypotheses not found in the OCR of the binarized image with the lowest WER. Further, we show that the higher the WER of the OCR overall, the more the corrections are distributed among all binarizations of the document image. |
  
  
  
^ [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1568648| Combining Multiple Thresholding Binarization Values to Improve OCR Output] ^^
| [[media:nlp:140px-multiple-thresholding-ocr.png]] | Bill Lund; Doug Kennard; Eric Ringger |
| ::: | '''DRR 2013''' |
| ::: | On noisy, historical document images a high optical character recognition (OCR) word error rate (WER) can render the OCR text unusable. Since image binarization is often the method used to identify foreground pixels, a significant body of research has sought to improve image-wide binarization directly. Instead of relying on any one imperfect binarization technique, our method incorporates information from multiple global threshold binarizations of the same image to improve text output. Using a new corpus of 19th century newspaper grayscale images for which the text transcription is known, we observe WERs of 13.8% and higher using current binarization techniques and a state-of-the-art OCR engine. Our novel approach combines the OCR outputs from multiple thresholded images by aligning the text output. From the word lattice we commit to one hypothesis by applying the methods of Lund et al. (2011), achieving 8.41% WER, a 39.1% reduction in error rate relative to the performance of the original OCR engine on this data set. |
  
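The multiple-global-threshold idea in the entry above can be sketched in a few lines. This is an illustrative sketch under simplified assumptions, not the paper's pipeline: each thresholded image would be fed to a real OCR engine, and the resulting hypotheses aligned and combined.

```python
# Sketch: produce a family of global-threshold binarizations of one
# grayscale page image instead of committing to a single "best" threshold.
import numpy as np

def global_binarizations(gray, thresholds=(96, 128, 160, 192)):
    """Yield one black-and-white image per global threshold value."""
    for t in thresholds:
        yield t, (gray >= t).astype(np.uint8) * 255

# Toy 2x3 "grayscale image"; real inputs would be scanned page images.
page = np.array([[30, 100, 150],
                 [90, 170, 220]], dtype=np.uint8)

for t, bw in global_binarizations(page):
    print(t, bw.tolist())
```

Each threshold reveals or destroys different strokes in a degraded image, which is exactly why the OCR hypotheses produced from the different binarizations disagree in complementary, correctable ways.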
  
  
^ [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1568659| Evaluating Supervised Topic Models in the Presence of OCR Errors] ^^
| [[media:nlp:120px-supervised-noisy-tm.png]] | Daniel Walker; Eric Ringger; Kevin Seppi |
| ::: | '''The Conference on Document Recognition and Retrieval XX (DRR 2013)''' |
| ::: | '''Received best student paper award''' |
| ::: | Topic discovery using unsupervised topic models degrades as error rates increase in OCR transcriptions of historical document images. Despite the availability of meta-data, analyses by supervised topic models, such as Supervised LDA and Topics over Non-Parametric Time, exhibit similar degradation. |
  
  
  
  
^ [http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1284063| A Synthetic Document Image Dataset for Developing and Evaluating Historical Document Processing Methods] ^^
| [[media:nlp:120px-synthetic-ocr.png]] | Dan Walker; Bill Lund; Eric Ringger |
| ::: | '''DRR 2012''' |
| ::: | We introduce synthetic document image datasets of varying levels of noise that have been created from standard (English) text corpora using an existing document degradation model applied in a novel way. Included in the datasets is the OCR output from real OCR engines, including the commercial ABBYY FineReader and the open-source Tesseract engines. These synthetic datasets are designed to exhibit some of the characteristics of an example real-world document image dataset. |
  
  
  
^ [http://www.icdar2011.org/fileup/PDF/4520a764.pdf| Progressive Alignment and Discriminative Error Correction for Multiple OCR Engines] ^^
| [[media:nlp:140px-progressive-alignment.png]] | Bill Lund; Dan Walker; Eric Ringger |
| ::: | '''ICDAR 2011''' |
| ::: | This paper presents a novel method for improving optical character recognition (OCR). The method employs the progressive alignment of hypotheses from multiple OCR engines followed by final hypothesis selection using maximum entropy classification methods. The maximum entropy models are trained on a synthetic calibration data set. Although progressive alignment is not guaranteed to be optimal, the results are nonetheless strong. Our method shows a 24.6% relative improvement over the word error rate (WER) of the best performing of the five OCR engines employed in this work. Relative to the average WER of all five OCR engines, our method yields a 69.1% relative reduction in the error rate. Furthermore, 52.2% of the documents achieve a new low WER. |

^ [http://www.icdar2011.org/fileup/PDF/4520a658.pdf| Error Correction with In-Domain Training Across Multiple OCR System Outputs] ^^
| [[media:nlp:120px-ocr-error-correction.png]] | Bill Lund; Eric Ringger |
| ::: | '''ICDAR 2011''' |
| ::: | This paper demonstrates the degree to which the word error rate (WER) can be reduced using a decision list on a combination of textual features across the aligned output of multiple OCR engines where in-domain training data is available. Our correction method leads to a 52.2% relative decrease in the mean WER and a 19.5% relative improvement over the best single OCR engine. |

^ [http://www.researchgate.net/publication/220774817_Extracting_person_names_from_diverse_and_noisy_OCR_text/file/79e415051d9d572e4e.pdf| Extracting Person Names from Diverse and Noisy OCR Text] ^^
| [[media:nlp:140px-extracting-names.png]] | Thomas Packer; Joshua Lutes; Aaron Stewart; David Embley; Eric Ringger; Kevin Seppi; Lee Jensen |
| ::: | '''CIKM 2010 Workshop on the Analysis of Noisy Documents (AND 2010)''' |
| ::: | We apply four extraction algorithms to various types of noisy OCR data found “in the wild” and focus on full name extraction. We evaluate the extraction quality with respect to hand-labeled test data and improve upon the extraction performance of the individual systems by means of ensemble extraction. |

^ [http://nlp.cs.byu.edu/~dan/papers/emnlp_2010.pdf| Evaluating Models of Latent Document Semantics in the Presence of OCR Errors] ^^
| [[media:nlp:120px-noisyocr-lds.png]] | Dan Walker; Bill Lund; Eric Ringger |
| ::: | '''EMNLP 2010''' |
| ::: | We show the effects both with document-level topic analysis (document clustering) and with word-level topic analysis (LDA) on both synthetic and real-world OCR data. As expected, experimental results show that performance declines as word error rates increase. Common techniques for alleviating these problems, such as filtering low-frequency words, are successful in enhancing model quality, but exhibit failure trends similar to models trained on unprocessed OCR output in the case of LDA. |

^ [http://dl.acm.org/citation.cfm?id=1555437| Improving Optical Character Recognition through Efficient Multiple System Alignment] ^^
| [[media:nlp:140px-mult-alignment.png]] | Bill Lund; Eric Ringger |
| ::: | '''JCDL 2009''' |
| ::: | '''Awarded Best Student Paper of the conference''' |
| ::: | By aligning the output of multiple OCR engines and taking advantage of the differences between them, the error rate based on the aligned lattice of recognized words is significantly lower than the individual OCR word error rates. Results from a collection of poor quality mid-twentieth century typewritten documents demonstrate an average reduction of 55.0% in the error rate of the lattice of alternatives and a realized word error rate (WER) reduction of 35.8% in a dictionary-based selection process. As an important precursor, an innovative admissible heuristic for the A* algorithm is developed, which results in a significant reduction in state space exploration to identify all optimal alignments of the OCR text output, a necessary step toward the construction of the word hypothesis lattice. On average 0.0079% of the state space is explored to identify all optimal alignments of the documents. |
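The dictionary-based selection step mentioned in the JCDL 2009 entry can be sketched simply: each column of the aligned word lattice holds the alternatives the engines proposed for one position, and a lexicon lookup decides among them. The tiny `LEXICON` and fallback rule below are illustrative assumptions, not the paper's actual wordlist or decision procedure.

```python
# Sketch: pick one word per lattice column, preferring in-lexicon words.
LEXICON = {"separate", "the", "wheat", "from", "chaff"}

def select(column, lexicon=LEXICON):
    for word in column:
        if word.lower() in lexicon:
            return word
    return column[0]  # no alternative is in the lexicon; keep engine 1

# Each inner list is one lattice column: aligned alternatives from engines.
lattice = [["separate", "seperate"], ["tne", "the"], ["wheat", "vheat"]]
print(" ".join(select(col) for col in lattice))  # -> "separate the wheat"
```

The hard part the paper solves is upstream of this step: building the lattice at all requires an optimal multi-sequence alignment, which the A* heuristic makes tractable.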
  
  
nlp/historical-document-recognition.txt · Last modified: 2015/05/21 22:40 by plf1
CC Attribution-Share Alike 4.0 International