__TOC__
== Wish List ==

A list of features to keep in mind as we consider which OCR engine to choose.

{| class="wikitable" border="1"
|-
! Feature
! Votes
|-
| Can be run in batch mode.
| All
|-
| Provides token bounding boxes, or character bounding boxes plus token delimiters.
| Thomas, Aaron, Dr. Embley
|-
| Provides text line groupings of tokens.
| Thomas, Aaron, Dr. Embley
|-
| Does well on degraded, historical text.
| Thomas, Aaron, Dr. Embley
|-
| Cheap or free.
| Professors
|-
| Open source.
| Josh H.
|-
| Good zoning.
| All
|-
| Good WER (word error rate).
| All
|-
| Provides word or character confidence levels.
| Thomas?
|-
| Provides word hypothesis lattices.
| Thomas?
|-
| Multiple OCR engines.
| Bill
|}
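Since "Good WER" is one of the criteria, the following is a minimal sketch of how word error rate could be measured when comparing engines against a hand-corrected transcription. It assumes whitespace tokenization and the usual definition (word-level edit distance divided by the number of reference words); the function name <code>word_error_rate</code> is illustrative and not part of any engine's API.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # distance from i reference words to an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from OCR)
                curr[j - 1] + 1,     # insertion (extra word in OCR output)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the quick brwn fox"
    print(word_error_rate(truth, ocr_output))
```

Running the same ground-truth pages through each candidate engine and comparing the resulting rates would give a rough ranking. Note that WER can exceed 1.0 when an engine inserts many spurious tokens.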
== Options ==

Options ordered from best to worst. More input needed. We also need to look through the list of OCR options in Figure 6 of this paper: [http://www.springerlink.com/content/mdxy1gtq3ba8d91t/fulltext.pdf]

# Novadys [http://novadys.net/]
#* Italian company with a free OCR service (one file at a time).
#* Seems pretty accurate.
#* Produces text and PDF output, among other formats.
#* Does not accept J2K format input.
# Adobe Acrobat Pro reader OCR
#* We have access to it.
#* BYU library has it.
#* That plus PDFBox gives us word bounding boxes and ...
# Omnipage
#* On library machines.
# DocMorph [http://docmorph.nlm.nih.gov/docmorph/docmorph.htm]
#* From the US National Library of Medicine.
#* Free.
#* Seems to produce good-quality OCR output on good images, but pretty bad output on other images.
#* Output is plain text. No apparent way to get token coordinates.
# OcrTerminal.com [http://www.ocrterminal.com/features/pricing.cgi]
#* An OCR web service that is "powered by ABBYY"; prices range from 4 to 9 cents per page.
#* They are open to discussing academic prices.
#* They provide an API and XML output with bounding boxes, text line groupings of tokens, etc.
# In-house PDFBox-based bounding box extraction code plus Abbyy retail product
#* Internal code project used to extract bounding boxes and OCR text from PDF files produced by the Abbyy retail product.
#* Which is the cheapest edition of Abbyy FineReader that produces such PDF files?
#* Need to make sure we have access to zoning information in the PDF, especially lines. Aaron doesn't think it has zones, but does think it has lines.
#* No word/character confidence levels.
# Ocropus
#* Open source.
#* "Free"
#* Will likely require creating a large set of training data to get decent character recognition accuracy.
#* Can be made to output bounding boxes.
#* Outputs intermediate image segmentations and word hypothesis lattices.
# Abbyy Recognition Server
#* $1650+
#* Outputs bounding boxes and lots of other useful info.
#* High-quality character recognition, zoning, etc.
#* Would the three labs and the library consider going in together on purchasing this?
# Other OCR services for hire
#* Most seem expensive and/or do not provide bounding boxes.
# Amazon Mechanical Turk
#* Transcription?
#* Bounding boxes?
# Abbyy retail product
#* $600
#* May or may not output bounding boxes, and provides very little else.
# Cuneiform
# Bill's voting engine
# Ancestry's Scanning Service
#* Not sure if they offer a relatively cheap OCR service that might be cheaper than buying ARS (Abbyy Recognition Server).
#* After asking Shawn Reid, we know that Ancestry has discontinued their digitization service.
== PDF Text and Layout Extractors ==

# Multivalent Extract Text
#* Has a stand-alone tool to extract text along with bounding boxes.
#* Extract Text documentation page: [http://multivalent.sourceforge.net/Tools/doc/ExtractText.html]
#* Package website: [http://multivalent.sourceforge.net/]
#* Downloads: [http://multivalent.sourceforge.net/download.html]
# PDFBox
#* Library, not a stand-alone tool?
# Apache Tika
#* Not sure if this can extract bounding boxes.
#* [http://tika.apache.org/0.7/gettingstarted.html]
# PDFTOHTML
#* [http://pdftohtml.sourceforge.net/]
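The wish list asks for token bounding boxes, or character bounding boxes plus token delimiters. If an extractor only yields per-character boxes, token boxes can be derived from them. The following is a minimal sketch of that step, assuming a hypothetical input format of <code>(character, (x_min, y_min, x_max, y_max))</code> tuples in reading order, with whitespace characters acting as token delimiters. None of the names below come from PDFBox, Multivalent, or any other tool listed here.

```python
from typing import List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def union(boxes: List[Box]) -> Box:
    """Smallest box that encloses every box in the list."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def merge_char_boxes(chars: List[Tuple[str, Box]]) -> List[Tuple[str, Box]]:
    """Group characters into tokens, splitting on whitespace characters."""
    tokens: List[Tuple[str, Box]] = []
    text = ""
    boxes: List[Box] = []
    for ch, box in chars:
        if ch.isspace():
            # A delimiter closes the current token, if one is open.
            if boxes:
                tokens.append((text, union(boxes)))
                text, boxes = "", []
        else:
            text += ch
            boxes.append(box)
    if boxes:  # flush the final token
        tokens.append((text, union(boxes)))
    return tokens


if __name__ == "__main__":
    sample = [
        ("h", (0, 0, 5, 10)),
        ("i", (5, 0, 8, 10)),
        (" ", (8, 0, 10, 10)),
        ("x", (10, 2, 15, 9)),
    ]
    print(merge_char_boxes(sample))
```

This sketch does not handle line breaks or hyphenation; grouping tokens into text lines would need an additional pass over the vertical coordinates.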