OCR
Revision as of 19:24, 15 July 2023
✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
OCR as a task
Software
OCRopus
- document OCR (used in Google Books, Internet Archive)
- multifont, multilanguage
- https://en.wikipedia.org/wiki/OCRopus
Tesseract
- document OCR
- https://opensource.google.com/projects/tesseract
- https://en.wikipedia.org/wiki/Tesseract_(software)
CuneiForm
keras-ocr
EasyOCR
ABBYY (FineReader)
Google Docs OCR
- online-only
Rossum
- paid, online-only?
- https://rossum.ai/lp/ocr-software/
Amazon Rekognition
- more for scene text?(verify)
- paid, online-only
Amazon Textract
- more for documents?(verify)
- paid, online-only
Transym
- more for documents?(verify)
- paid, online-only
- https://transym.com/
Integrated features / online APIs (i.e. not easy to automate)
- Acrobat
- Google Keep
- Google Drive ('open with' converts)
- OneNote
- IBM Datacap[1]
Convenience tools / wrappers
- OCR from screen capture; more of a convenience tool
- for text rendered from fonts this can work quite well, and fairly quickly, even in photographic contexts, though quality degrades quickly on more creative text
Document managers
Apache Tika
- geared at content analysis and indexing (also metadata/document structure parser)
- uses tesseract for OCR
- https://tika.apache.org/
Aleph
- https://docs.aleph.occrp.org/
Output formats
hOCR
An HTML-based format that stores the positions of detected words or text fragments, and optionally detected style, layout, and other information. The data is embedded as attributes in ordinary HTML or XHTML markup.
https://en.wikipedia.org/wiki/HOCR
https://pypi.org/project/hocr-spec/
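As a rough illustration of what that looks like in practice: hOCR marks recognized words as elements with class `ocrx_word`, and puts the bounding box in the `title` attribute as `bbox x0 y0 x1 y1`. The sketch below pulls out words and boxes using only the Python standard library. The sample markup, the `HocrWordParser` class, and `parse_hocr_words` are made up for this example, and it assumes words are simple spans without nested spans inside them.

```python
import re
from html.parser import HTMLParser

# Matches the "bbox x0 y0 x1 y1" property inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for elements with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []     # finished (text, (x0, y0, x1, y1)) tuples
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []     # text chunks of the word currently being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Simplification: the first closing </span> ends the current word,
        # so words containing nested spans are not handled.
        if self._bbox is not None and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def parse_hocr_words(hocr_text):
    """Return a list of (word, (x0, y0, x1, y1)) found in hOCR markup."""
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


# Hypothetical sample, hand-written for this example.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 10 20 300 50'>
    <span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 96'>Hello</span>
    <span class='ocrx_word' title='bbox 100 20 180 50; x_wconf 91'>world</span>
  </span>
</div>
"""

if __name__ == "__main__":
    for word, bbox in parse_hocr_words(SAMPLE_HOCR):
        print(word, bbox)
```

For real files a proper HTML/XML parser is probably the better choice; this only shows where the position data sits.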
ALTO
https://en.wikipedia.org/wiki/ALTO_(XML)
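For comparison with hOCR: ALTO stores each recognized word as a `String` element, with the text in `CONTENT` and the position in `HPOS`, `VPOS`, `WIDTH` and `HEIGHT`. A minimal sketch of reading those with the standard library follows. The sample document and the `parse_alto_strings` helper are made up for this example, and the code ignores the namespace because the namespace URI differs between ALTO versions.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def parse_alto_strings(alto_text):
    """Return a list of (content, (hpos, vpos, width, height)) tuples."""
    root = ET.fromstring(alto_text.strip())
    words = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            # Missing position attributes fall back to 0 in this sketch.
            box = tuple(
                float(elem.get(key, "0"))
                for key in ("HPOS", "VPOS", "WIDTH", "HEIGHT")
            )
            words.append((elem.get("CONTENT", ""), box))
    return words


# Hypothetical sample, hand-written for this example.
SAMPLE_ALTO = """
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout><Page><PrintSpace><TextBlock><TextLine>
    <String CONTENT="Hello" HPOS="10" VPOS="20" WIDTH="80" HEIGHT="30"/>
    <SP/>
    <String CONTENT="world" HPOS="100" VPOS="20" WIDTH="80" HEIGHT="30"/>
  </TextLine></TextBlock></PrintSpace></Page></Layout>
</alto>
"""

if __name__ == "__main__":
    for content, box in parse_alto_strings(SAMPLE_ALTO):
        print(content, box)
```

Note that ALTO gives a corner plus width and height, where hOCR gives two corners, so boxes need converting when mixing the two.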
PAGE XML
https://en.wikipedia.org/wiki/PAGE_(XML)
abbyyXML
https://support.abbyy.com/hc/en-us/articles/360017336699-ABBYY-FineReader-Engine-XML-Export