nlp-private:ocr-engines [2015/04/23 19:21] (current)
ryancha created
 +Back to [[Noisy OCR Group]]
 +
 +<br />
 +
 +__TOC__
 +
 +<br />
 +
 +== Wish List ==
 +
 +A list of features to keep in mind as we consider which OCR engine to choose.
 +
 +{| class="wikitable" border="1"
 +|-
 +! Feature
 +! Votes
 +|-
 +| Can be run in batch mode.
 +| All
 +|-
 +| Provides token bounding boxes, or character bounding boxes plus token delimiters.
 +| Thomas, Aaron, Dr. Embley
 +|-
 +| Provides text line groupings of tokens.
 +| Thomas, Aaron, Dr. Embley
 +|-
 +| Does well on degraded, historical text.
 +| Thomas, Aaron, Dr. Embley
 +|-
 +| Cheap or free
 +| Professors
 +|-
 +| Open source
 +| Josh H.
 +|-
 +| Good zoning
 +| All
 +|-
 +| Low word error rate (WER)
 +| All
 +|-
 +| Provides word or character confidence levels.
 +| Thomas?
 +|-
 +| Provides word hypothesis lattices.
 +| Thomas?
 +|-
 +| Multiple OCR engines.
 +| Bill
 +|}
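Since word error rate (WER) is one of the wish-list criteria, here is a minimal sketch of how it could be computed when comparing engine output against a hand-corrected transcript. It assumes whitespace tokenization and the usual definition of WER as word-level edit distance divided by the number of reference words. The function name `word_error_rate` is illustrative and not part of any existing lab code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: tokens are whitespace-separated; no case or punctuation
    normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from OCR output
                curr[j - 1] + 1,     # extra word in OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word and one dropped word out of four reference words.
    print(word_error_rate("the quick brown fox", "the quiek brown"))
```

Note that WER can exceed 1.0 when the OCR output contains many spurious words, so it is best read as an error ratio and not as a percentage of words wrong.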
 +
 +<br />
 +
 +== Options ==
 +
 +Options are ordered from best to worst. More input is needed. We also need to look through the list of OCR options in Figure 6 of this paper: [http://www.springerlink.com/content/mdxy1gtq3ba8d91t/fulltext.pdf]
 +
 +# Novadys [http://novadys.net/]
 +#* Italian company with a free OCR service (one file at a time).
 +#* Seems pretty accurate.
 +#* Produces text and PDF output, among other formats.
 +#* Does not accept J2K format input.
 +# Adobe Acrobat Pro reader OCR
 +#* We have access to it.
 +#* BYU library has it.
 +#* That plus PDFBox gives us word bounding boxes and ...
 +# Omnipage
 +#* On library machines.
 +# DocMorph [http://docmorph.nlm.nih.gov/docmorph/docmorph.htm]
 +#* From the US National Library of Medicine
 +#* Free
 +#* Seems to produce good quality OCR output on good images, but pretty bad output on other images.
 +#* Output is plain text.  No apparent way to get token coordinates.
 +# OcrTerminal.com [http://www.ocrterminal.com/features/pricing.cgi]
 +#* This is an OCR web service that is "powered by ABBYY"; prices range from 4 to 9 cents per page.
 +#* They are open to discussing academic prices.
 +#* They provide an API and also provide an XML output with bounding boxes, text line groupings of tokens, etc.
 +# In-house PDFBox-based bounding box extraction code plus Abbyy retail product
 +#* Internal code project used to extract bounding boxes and OCR text from PDF files produced by Abbyy retail product.
 +#* Which is the cheapest edition of Abbyy FineReader that produces such PDF files?
 +#* Need to make sure we have access to zoning information in the PDF, especially lines. Aaron doesn't think it has zones, but does think it has lines.
 +#* No word/character confidence levels.
 +# Ocropus
 +#* Open source
 +#* "Free"
 +#* Will likely require creating a large set of training data to get decent character recognition accuracies.
 +#* Can be made to output bounding boxes.
 +#* Outputs intermediate image segmentations and word hypothesis lattices.
 +# Abbyy Recognition Server
 +#* $1650+
 +#* Outputs bounding boxes and lots of other useful info.
 +#* High quality character recognition and zoning, etc.
 +#* Would the three labs and the library consider going in together on purchasing this?
 +# Other OCR services for hire
 +#* Most seem expensive and/or do not provide bounding boxes.
 +# Amazon Mechanical Turk
 +#* Transcription?​
 +#* Bounding boxes?
 +# Abbyy retail product
 +#* $600
 +#* May or may not output bounding boxes; outputs very little else.
 +# Cuneiform
 +# Bill’s voting engine
 +# Ancestry'​s Scanning Service
 +#* Not sure whether they offer an OCR service that would be cheaper than buying Abbyy Recognition Server (ARS).
 +#* After asking Shawn Reid, we know that Ancestry has discontinued its digitization service.
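To weigh the per-page service against the one-time purchases listed above, a rough break-even estimate may help. The sketch below uses only the figures quoted in this list (4 to 9 cents per page, $1650 as the lower bound for Abbyy Recognition Server, $600 for the Abbyy retail product). It ignores academic discounts, labor, and hardware, so treat the results as ballpark numbers. The function name `break_even_pages` is illustrative.

```python
def break_even_pages(purchase_price_cents: int, cents_per_page: int) -> int:
    """Smallest page count at which per-page fees reach the purchase price.

    Prices are given in whole cents so the arithmetic stays exact.
    """
    if purchase_price_cents < 0 or cents_per_page <= 0:
        raise ValueError("prices must be positive")
    # Ceiling division using integers only.
    return -(-purchase_price_cents // cents_per_page)


if __name__ == "__main__":
    # Figures taken from the options list; $1650 is only a lower bound.
    products = {"Abbyy Recognition Server": 165_000, "Abbyy retail product": 60_000}
    for name, price_cents in products.items():
        for rate in (4, 9):
            pages = break_even_pages(price_cents, rate)
            print(f"{name} vs. {rate} cents/page: break-even at {pages} pages")
```

Under these assumptions, the per-page service stays cheaper than Abbyy Recognition Server until somewhere between roughly 18,000 and 41,000 pages, depending on the rate.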
 +
 +<br />
 +
 +== PDF Text and Layout Extractors ==
 +
 +# Multivalent Extract Text 
 +#* Has a stand-alone tool to extract text along with bounding boxes.
 +#* Extract Text Documentation Page: [http://multivalent.sourceforge.net/Tools/doc/ExtractText.html]
 +#* Package website: [http://multivalent.sourceforge.net/]
 +#* Downloads: [http://multivalent.sourceforge.net/download.html]
 +# PDFBox
 +#* Library, not a stand-alone tool?
 +# Apache Tika
 +#* Not sure if this can extract bounding boxes.
 +#* [http://tika.apache.org/0.7/gettingstarted.html]
 +# PDFTOHTML
 +#* [http://pdftohtml.sourceforge.net/]
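If an extractor yields token bounding boxes but no text line groupings (one of the wish-list items), lines can be approximated from the boxes alone. The following is a minimal sketch under stated assumptions, not the output format of any tool listed here: the `Token` class, its field names, and the `min_overlap` threshold are hypothetical, and coordinates are assumed to have the origin at the top left with y increasing downward (native PDF coordinates would need to be flipped first).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Token:
    """Hypothetical token record: text plus its bounding box."""
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge


def group_into_lines(tokens: List[Token], min_overlap: float = 0.5) -> List[List[Token]]:
    """Greedily assign each token to the first line it overlaps vertically.

    A token joins a line when at least `min_overlap` of its height falls
    inside the line's current vertical extent. Zero-height tokens never
    join an existing line and end up on a line of their own.
    """
    lines: List[List[Token]] = []
    for tok in sorted(tokens, key=lambda t: (t.y0, t.x0)):
        height = tok.y1 - tok.y0
        for line in lines:
            top = min(t.y0 for t in line)
            bottom = max(t.y1 for t in line)
            overlap = min(bottom, tok.y1) - max(top, tok.y0)
            if height > 0 and overlap / height >= min_overlap:
                line.append(tok)
                break
        else:
            lines.append([tok])
    # Put tokens in reading order within each line, and lines top to bottom.
    for line in lines:
        line.sort(key=lambda t: t.x0)
    lines.sort(key=lambda line: min(t.y0 for t in line))
    return lines


if __name__ == "__main__":
    sample = [
        Token("World", 60, 11, 100, 21),
        Token("Hello", 10, 10, 50, 20),
        Token("Second", 10, 40, 60, 50),
        Token("line", 70, 41, 100, 51),
    ]
    for line in group_into_lines(sample):
        print(" ".join(t.text for t in line))
```

This greedy approach is likely to break on skewed scans or multi-column pages, which is one reason engine-provided zoning and line information remains preferable.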
 +
 +<br />
  
nlp-private/ocr-engines.txt · Last modified: 2015/04/23 19:21 by ryancha
CC Attribution-Share Alike 4.0 International