
Wish List

A list of features to keep in mind as we consider which OCR engine to choose.

Feature | Votes
Can be run in batch mode. | All
Provides token bounding boxes, or character bounding boxes plus token delimiters. | Thomas, Aaron, Dr. Embley
Provides text line groupings of tokens. | Thomas, Aaron, Dr. Embley
Does well on degraded, historical text. | Thomas, Aaron, Dr. Embley
Cheap or free. | Professors
Open source. | Josh H.
Good zoning. | All
Good WER (word error rate). | All
Provides word or character confidence levels. | Thomas?
Provides word hypothesis lattices. | Thomas?
Multiple OCR engines. | Bill
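
To compare engines on the "Good WER" criterion, one option is to score each engine's output against a hand-corrected transcription. The sketch below is a minimal illustration, not an agreed-upon evaluation protocol: it assumes whitespace tokenization and exact, case-sensitive word matching, and the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()  # assumption: tokens are whitespace-delimited
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from OCR)
                curr[j - 1] + 1,     # insertion (extra word in OCR output)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when an engine produces many spurious tokens, which is worth keeping in mind for badly zoned pages.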


Options

Options are ordered from best to worst. More input is needed. We also need to look through the list of OCR options in Figure 6 of this paper: www.springerlink.com/content/mdxy1gtq3ba8d91t/fulltext.pdf

    • Italian company with a free OCR service (one file at a time).
    • Seems pretty accurate.
    • Produces text and PDF output, among other formats.
    • Does not accept J2K format input.
  1. Adobe Acrobat Pro reader OCR
    • We have access to it.
    • BYU library has it.
    • That plus PDFBox gives us word bounding boxes and …
  2. Omnipage
    • On library machines.
    • From the US National Library of Medicine
    • Free
    • Seems to produce good-quality OCR output on good images, but pretty bad output on other images.
    • Output is plain text. No apparent way to get token coordinates.
    • This is an OCR web service that is “powered by ABBYY”. Prices range from 4 to 9 cents per page.
    • They are open to discussing academic prices.
    • They provide an API and also provide an XML output with bounding boxes, text line groupings of tokens, etc.
  3. In-house PDFBox-based bounding box extraction code plus Abbyy retail product
    • Internal code project used to extract bounding boxes and OCR text from PDF files produced by Abbyy retail product.
    • Which is the cheapest edition of Abbyy FineReader that produces such PDF files?
    • Need to make sure we have access to zoning information in the PDF, especially lines. Aaron doesn't think it has zones, but does think it has lines.
    • No word/character confidence levels.
  4. Ocropus
    • Open source
    • “Free”
    • Will likely require creating a large set of training data to get decent character recognition accuracies.
    • Can be made to output bounding boxes.
    • Outputs intermediate image segmentations and word hypothesis lattices.
  5. Abbyy Recognition Server
    • $1650+
    • Outputs bounding boxes and lots of other useful info.
    • High quality character recognition and zoning, etc.
    • Would the three labs and the library consider going in together on purchasing this?
  6. Other OCR services for hire
    • Most seem expensive and/or do not provide bounding boxes.
  7. Amazon Mechanical Turk
    • Transcription?
    • Bounding boxes?
  8. Abbyy retail product
    • $600
    • May or may not output bounding boxes; outputs very little else.
  9. Cuneiform
  10. Bill’s voting engine
  11. Ancestry's Scanning Service
    • Not sure whether they offer an OCR service that would be cheaper than buying Abbyy Recognition Server (ARS).
    • According to Shawn Reid, Ancestry has discontinued its digitization service.
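
Several options above are judged by whether they provide bounding boxes and text line groupings. If an engine only provides token bounding boxes, line groupings can be approximated afterwards. The sketch below is one possible approach under stated assumptions, not the lab's in-house code: the `Token` record and `group_into_lines` function are hypothetical, coordinates are assumed to have the origin at the top-left with y increasing downward, and the 0.5 overlap threshold is an arbitrary starting point.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """Hypothetical record for one OCR token and its bounding box."""
    text: str
    left: float
    top: float
    right: float
    bottom: float
    confidence: Optional[float] = None  # None if the engine reports none


def _vertical_overlap_ratio(a: Token, b: Token) -> float:
    """Vertical overlap of two boxes, relative to the shorter box's height."""
    overlap = min(a.bottom, b.bottom) - max(a.top, b.top)
    shorter = min(a.bottom - a.top, b.bottom - b.top)
    if overlap <= 0 or shorter <= 0:
        return 0.0
    return overlap / shorter


def group_into_lines(tokens: List[Token],
                     min_overlap: float = 0.5) -> List[List[Token]]:
    """Group tokens into lines by vertical overlap with each line's first token."""
    lines: List[List[Token]] = []
    # Visit tokens top to bottom so lines come out in reading order.
    for tok in sorted(tokens, key=lambda t: (t.top, t.left)):
        for line in lines:
            if _vertical_overlap_ratio(line[0], tok) >= min_overlap:
                line.append(tok)
                break
        else:
            lines.append([tok])  # no existing line overlaps enough
    # Order the tokens within each line left to right.
    for line in lines:
        line.sort(key=lambda t: t.left)
    return lines
```

This heuristic ignores columns and skewed scans, so it is no substitute for real zoning; it only illustrates why token-level bounding boxes are the minimum output needed to recover line structure.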


PDF Text and Layout Extractors

  1. Multivalent Extract Text
  2. PDFBox
    • Library, not a stand-alone tool?
  3. Apache Tika
  4. PDFTOHTML


nlp-private/ocr-engines.txt · Last modified: 2015/04/23 13:21 by ryancha