OCR: Difference between revisions
Revision as of 19:20, 15 July 2023
OCR as a task
Software
OCRopus
- document OCR (used in Google Books, Internet Archive)
- multifont, multilanguage
- https://en.wikipedia.org/wiki/OCRopus
Tesseract
- document OCR
- https://opensource.google.com/projects/tesseract
- https://en.wikipedia.org/wiki/Tesseract_(software)
CuneiForm
- https://en.wikipedia.org/wiki/CuneiForm_(software)
keras-ocr
EasyOCR
ABBYY (FineReader)
Google Docs OCR
Rossum
- paid, online-only?
Amazon Rekognition
- more for scene text? (verify)
- paid, online-only
Amazon Textract
- more for documents? (verify)
- paid, online-only
Transym
- more for documents? (verify)
- paid, online-only
Apache Tika
- geared toward content analysis and indexing (also parses metadata and document structure)
- uses Tesseract for OCR
- https://tika.apache.org/
Integrated features / online APIs (i.e. not easy to automate)
Acrobat, Google Keep, Google Drive ('open with' converts), OneNote, IBM Datacap (https://www.ibm.com/products/data-capture-and-imaging), ABBYY
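Convenience tools / wrappers
PowerToys Text Extractor
- https://learn.microsoft.com/en-us/windows/powertoys/text-extractor
- works from a screen capture; more of a convenience tool
- works quite well and fairly quickly on text set in regular fonts, even in photographic contexts, but degrades quickly on more creative text
Lios
- https://github.com/zendalona/lios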
Output formats
hOCR
An HTML-based format that stores the position of detected words and text fragments, and optionally detected style, layout, and other information. The data is embedded in ordinary HTML or XHTML markup, so a result file is still a valid (X)HTML document.
https://en.wikipedia.org/wiki/HOCR
https://pypi.org/project/hocr-spec/
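As a rough illustration of what reading hOCR looks like, here is a minimal sketch that pulls words and their bounding boxes out of an hOCR fragment using only the Python standard library. The sample markup is made up for illustration, in the style Tesseract emits (word spans with class `ocrx_word` and a `title` attribute carrying `bbox x0 y0 x1 y1`); the names `HocrWordParser` and `extract_words` are hypothetical, and the sketch assumes word spans contain no nested spans.

```python
import re
from html.parser import HTMLParser

# Made-up sample in the style of Tesseract's hOCR output (an assumption,
# not copied from a real OCR run).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
  <span class='ocr_line' title='bbox 36 92 220 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 220 116; x_wconf 88'>world</span>
  </span>
</div>
"""

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word span currently open, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside an open word span.
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes no spans are nested inside a word span.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

The same approach extends to `ocr_line` or `ocr_par` elements by matching on those class names instead; for real files a proper HTML/XML library is probably the more robust choice.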
ALTO
https://en.wikipedia.org/wiki/ALTO_(XML)
PAGE XML
https://en.wikipedia.org/wiki/PAGE_(XML)
abbyyXML
https://support.abbyy.com/hc/en-us/articles/360017336699-ABBYY-FineReader-Engine-XML-Export