<br />
__TOC__
<br />
Wish List
A list of features to keep in mind as we consider which OCR engine to choose.
Feature | Votes
Can be run in batch mode. | All
Provides token bounding boxes, or character bounding boxes plus token delimiters. | Thomas, Aaron, Dr. Embley
Provides text line groupings of tokens. | Thomas, Aaron, Dr. Embley
Does well on degraded, historical text. | Thomas, Aaron, Dr. Embley
Cheap or free. | Professors
Open source. | Josh H.
Good zoning. | All
Good WER (word error rate). | All
Provides word or character confidence levels. | Thomas?
Provides word hypothesis lattices. | Thomas?
Multiple OCR engines. | Bill
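The "Good WER" item refers to word error rate. As a minimal sketch of how candidate engines could be compared on that criterion, WER can be computed as the word-level edit distance between an engine's output and a reference transcription, divided by the number of reference words. This assumes a hand-corrected reference transcription is available and that words are split on whitespace; neither is specified on this page, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words.

    Assumption: words are whitespace-delimited; no case folding or
    punctuation normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from output
                curr[j - 1] + 1,     # insertion: extra word in output
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quiek brown"))
```

Because insertions count as errors, WER can exceed 1.0 when an engine produces many spurious tokens, which is plausible on degraded historical pages.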
<br />
Options
Options are ordered from best to worst; more input is needed. We also need to look through the OCR options listed in Figure 6 of this paper: ://www.springerlink.com/content/mdxy1gtq3ba8d91t/fulltext.pdf
-
Italian company with a free OCR service (one file at a time).
Seems pretty accurate.
Produces text and PDF output, among other formats.
Does not accept J2K format input.
Adobe Acrobat Pro reader OCR
OmniPage
-
From the US National Library of Medicine
Free
Seems to produce good-quality OCR output on good images, but pretty bad output on other images.
Output is plain text. No apparent way to get token coordinates.
-
In-house PDFBox-based bounding box extraction code plus Abbyy retail product
Internal code project used to extract bounding boxes and OCR text from PDF files produced by Abbyy retail product.
Which is the cheapest edition of Abbyy FineReader that produces such PDF files?
Need to make sure we have access to zoning information in the PDF, especially lines. Aaron doesn't think it has zones, but does think it has lines.
No word/character confidence levels.
Ocropus
Open source
“Free”
Will likely require creating a large set of training data to get decent character recognition accuracy.
Can be made to output bounding boxes.
Outputs intermediate image segmentations and word hypothesis lattices.
Abbyy Recognition Server
$1650+
Outputs bounding boxes and lots of other useful info.
High quality character recognition and zoning, etc.
Would the three labs and the library consider going in together on purchasing this?
Other OCR services for hire
Amazon Mechanical Turk
Transcription?
Bounding boxes?
Abbyy retail product
Cuneiform
Bill’s voting engine
Ancestry's Scanning Service
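This page does not describe how Bill's voting engine works. One common way to combine multiple OCR engines is per-token majority voting, sketched below. It assumes each engine's output has already been aligned into token sequences of equal length, which is the hard part and is not shown. The function name `vote_tokens` and the sample tokens are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def vote_tokens(engine_outputs: Sequence[Sequence[str]]) -> List[str]:
    """Pick, for each token position, the token most engines agree on.

    Assumption: engine_outputs[k][i] is engine k's reading of token i,
    i.e. the outputs are already aligned position by position.
    Ties go to the engine listed first.
    """
    if not engine_outputs:
        return []
    length = len(engine_outputs[0])
    if any(len(tokens) != length for tokens in engine_outputs):
        raise ValueError("all engines must supply the same number of aligned tokens")

    voted = []
    for candidates in zip(*engine_outputs):
        counts = Counter(candidates)
        best = max(counts.values())
        # Scan in engine order so that ties resolve to the earliest engine.
        voted.append(next(tok for tok in candidates if counts[tok] == best))
    return voted


if __name__ == "__main__":
    outputs = [
        ["Jolm", "Smith", "1843"],
        ["John", "Smith", "1848"],
        ["John", "Srnith", "1843"],
    ]
    print(vote_tokens(outputs))
```

If the engines also report word confidence levels (one of the wish-list features), the counts could be replaced by summed confidences; that variant is not shown here.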
<br />
PDF Text and Layout Extractors
Multivalent Extract Text
PDFBox
Apache Tika
PDFTOHTML
<br />