nlp-private:ocr-engines [CS Wiki]

Wish List

A list of features to keep in mind as we consider which OCR engine to choose.

Feature	Votes
Can be run in batch mode.	All
Provides token bounding boxes or character bounding boxes plus token delimiters	Thomas, Aaron, Dr. Embley
Provides text line groupings of tokens.	Thomas, Aaron, Dr. Embley
Does well on degraded, historical text.	Thomas, Aaron, Dr. Embley
Cheap or free	Professors
Open source	Josh H.
Good zoning	All
Good WER	All
Provides word or character confidence levels.	Thomas?
Provides word hypothesis lattices.	Thomas?
Multiple OCR engines.	Bill
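
To compare engines on the "Good WER" item, we need a common way to score OCR output against a hand-corrected transcription. The following is a minimal sketch, assuming whitespace tokenization and the usual definition of word error rate (word-level edit distance divided by the number of reference words). The function name `word_error_rate` is only illustrative and is not part of any engine listed here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the OCR output
                curr[j - 1] + 1,     # extra word in the OCR output
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, scoring the hypothesis "the quack brown" against the reference "the quick brown fox" counts one substitution and one missing word out of four reference words. Case and punctuation are compared as-is here; whether to normalize them first is a choice we would need to make before comparing engines.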

Options

Options are ordered from best to worst. More input is needed. We also need to look through the list of OCR options in Figure 6 of this paper: ://www.springerlink.com/content/mdxy1gtq3ba8d91t/fulltext.pdf

Novadys ://novadys.net/
- Italian company with a free OCR service (one file at a time).
- Seems pretty accurate.
- Produces text and PDF output, among other formats.
- Does not accept J2K format input.
Adobe Acrobat Pro reader OCR
- We have access to it.
- BYU library has it.
- That plus PDFBox gives us word bounding boxes and …
Omnipage
- On library machines.
DocMorph ://docmorph.nlm.nih.gov/docmorph/docmorph.htm
- From the US National Library of Medicine
- Free
- Seems to produce good-quality OCR output on good images, but pretty bad output on other images.
- Output is plain text. No apparent way to get token coordinates.
OcrTerminal.com ://www.ocrterminal.com/features/pricing.cgi
- This is an OCR web service that is “powered by ABBYY”. Prices range from 4 to 9 cents per page.
- They are open to discussing academic prices.
- They provide an API and also provide an XML output with bounding boxes, text line groupings of tokens, etc.
In-house PDFBox-based bounding box extraction code plus Abbyy retail product
- Internal code project used to extract bounding boxes and OCR text from PDF files produced by the Abbyy retail product.
- Which is the cheapest edition of Abbyy FineReader that produces such PDF files?
- Need to make sure we have access to zoning information in the PDF, especially lines. Aaron doesn't think it has zones, but does think it has lines.
- No word/character confidence levels.
Ocropus
- Open source
- “Free”
- Will likely require creating a large set of training data to get decent character recognition accuracies.
- Can be made to output bounding boxes.
- Outputs intermediate image segmentations and word hypothesis lattices.
Abbyy Recognition Server
- $1650+
- Outputs bounding boxes and lots of other useful info.
- High quality character recognition and zoning, etc.
- Would the three labs and the library consider going in together on purchasing this?
Other OCR services for hire
- Most seem expensive and/or do not provide bounding boxes.
Amazon Mechanical Turk
- Transcription?
- Bounding boxes?
Abbyy retail product
- $600
- May or may not output bounding boxes; outputs very little else.
Cuneiform
Bill’s voting engine
Ancestry's Scanning Service
- Not sure whether they offer an OCR service that would be cheaper than buying Abbyy Recognition Server (ARS).
- After asking Shawn Reid, we know that Ancestry has discontinued their digitization service.
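
The "Multiple OCR engines" wish and the voting engine entry above both point toward combining the output of several engines. The following is a minimal sketch of token-level majority voting. It is not a description of Bill's engine, and it assumes the engine outputs have already been aligned token by token to the same length, which is the hard part in practice. The name `vote_tokens` is hypothetical.

```python
from collections import Counter


def vote_tokens(engine_outputs):
    """Pick the most common token at each position across aligned engine outputs."""
    if not engine_outputs:
        return []
    length = len(engine_outputs[0])
    if any(len(tokens) != length for tokens in engine_outputs):
        raise ValueError("engine outputs must be aligned to the same length")

    voted = []
    for candidates in zip(*engine_outputs):
        # most_common(1) returns the top (token, count) pair; on a tie the
        # token from the earliest engine in the list wins.
        token, _count = Counter(candidates).most_common(1)[0]
        voted.append(token)
    return voted
```

With three engines reading `["Tbe", "quick", "fox"]`, `["The", "quick", "fax"]`, and `["The", "qnick", "fox"]`, each position has a two-to-one majority for the correct token. If an engine also provides word confidence levels, the plain count could be replaced by a confidence-weighted sum.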

PDF Text and Layout Extractors

Multivalent Extract Text
- Has a stand-alone tool to extract text along with bounding boxes.
- Extract Text Documentation Page: ://multivalent.sourceforge.net/Tools/doc/ExtractText.html
- Package website: ://multivalent.sourceforge.net/
- Downloads: ://multivalent.sourceforge.net/download.html
PDFBox
- Library, not a stand-alone tool?
Apache Tika
- Not sure if this can extract bounding boxes.
- ://tika.apache.org/0.7/gettingstarted.html
PDFTOHTML
- ://pdftohtml.sourceforge.net/
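
An extractor that returns only token bounding boxes still leaves the text line grouping from the wish list to us. The following is a minimal sketch of grouping tokens into lines by vertical overlap. It assumes each token comes with a top-left origin box (`x`, `y`, `width`, `height`); the `Token` class, the `group_into_lines` function, and the 0.5 overlap threshold are all illustrative and do not come from any of the tools above.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    x: float       # left edge
    y: float       # top edge (y grows downward)
    width: float
    height: float


def group_into_lines(tokens, overlap=0.5):
    """Group tokens into text lines by vertical overlap, each line sorted left to right."""
    lines = []
    for tok in sorted(tokens, key=lambda t: t.y):
        for line in lines:
            last = line[-1]
            top = max(tok.y, last.y)
            bottom = min(tok.y + tok.height, last.y + last.height)
            # Same line if the shared vertical span covers enough of the shorter box.
            if bottom - top >= overlap * min(tok.height, last.height):
                line.append(tok)
                break
        else:
            # No existing line overlaps this token, so it starts a new line.
            lines.append([tok])
    return [sorted(line, key=lambda t: t.x) for line in lines]
```

This simple approach is likely to break on skewed scans and multi-column pages, so it is no substitute for real zoning from the OCR engine.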

nlp-private/ocr-engines.txt · Last modified: 2015/04/23 13:21 by ryancha
