OCR: Difference between revisions
Latest revision as of 19:25, 15 July 2023
✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
OCR as a task
Software
OCRopus
- document OCR (used in Google Books, Internet Archive)
- multifont, multilanguage
- https://en.wikipedia.org/wiki/OCRopus
Tesseract
- document OCR
- https://opensource.google.com/projects/tesseract
- https://en.wikipedia.org/wiki/Tesseract_(software)
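CuneiForm
- https://en.wikipedia.org/wiki/CuneiForm_(software)
keras-ocr
- https://keras-ocr.readthedocs.io/en/latest/
EasyOCR
- https://github.com/JaidedAI/EasyOCR
ABBYY (FineReader)
- paid
- https://pdf.abbyy.com/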
Google Docs OCR
- online-only
Rossum
- paid, online-only?
- https://rossum.ai/lp/ocr-software/
Amazon Rekognition
- more for scene text? (verify)
- paid, online-only
Amazon Textract
- more for documents? (verify)
- paid, online-only
Transym
- more for documents? (verify)
- paid, online-only
- https://transym.com/
Integrated features / online APIs (i.e. not easy to automate)
- Acrobat
- Google Keep
- Google Drive ('open with' converts)
- OneNote
- IBM Datacap (https://www.ibm.com/products/data-capture-and-imaging)
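Convenience tools / wrappers
PowerToys Text Extractor
- https://learn.microsoft.com/en-us/windows/powertoys/text-extractor
- works from a screen capture; more of a convenience tool
- works quite well, and fairly quickly, on text rendered from regular fonts, even in photographic context, though quality degrades quickly on more creative text
Lios
- https://github.com/zendalona/lios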
Document managers with OCR
Output formats
hOCR
An HTML-based format that stores the position of detected words or text fragments and, optionally, detected style, layout, and other information. The data is embedded in ordinary HTML or XHTML markup (typically as class and title attributes), so an hOCR file can also be opened in a browser.
https://en.wikipedia.org/wiki/HOCR
https://pypi.org/project/hocr-spec/
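To make the "position of detected words" part concrete, here is a minimal sketch of pulling words and bounding boxes out of hOCR using only the Python standard library. The sample markup is hand-written for illustration (not real OCR output), the names `parse_bbox`, `HocrWordParser` and `extract_words` are made up for this note, and it assumes words are `span` elements with class `ocrx_word` that contain no nested spans.

```python
from html.parser import HTMLParser

# Hand-written sample, assumed to resemble typical hOCR output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 220 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 110 92 220 116; x_wconf 91'>world</span>
 </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from a title like 'bbox 1 2 3 4; x_wconf 95'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collect (text, bbox) pairs from spans with class 'ocrx_word'."""

    def __init__(self):
        super().__init__()
        self.words = []      # finished (text, bbox) pairs
        self._bbox = None    # bbox of the word being read, if any
        self._chunks = []    # text pieces of the word being read

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            # Words without a parsable bbox are skipped (bbox stays None).
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._chunks = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        # Assumes no nested spans inside a word.
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._chunks).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Real files may use other element names or nest character-level spans inside words, in which case the end-tag handling would need a depth counter.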
ALTO
https://en.wikipedia.org/wiki/ALTO_(XML)
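ALTO stores similar information as plain XML: words are `String` elements with `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes. A rough sketch of reading those with the standard library follows; the sample document is hand-written, the namespace shown is assumed to be the v3 one (the code ignores namespaces so other versions should also pass), and `alto_words` is a name made up for this note.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; real files carry more metadata (Description, Styles, ...).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="640" HEIGHT="480">
      <PrintSpace>
        <TextBlock ID="b1">
          <TextLine ID="l1" HPOS="36" VPOS="92" WIDTH="184" HEIGHT="24">
            <String CONTENT="Hello" HPOS="36" VPOS="92" WIDTH="60" HEIGHT="24"/>
            <SP WIDTH="14" HPOS="96" VPOS="92"/>
            <String CONTENT="world" HPOS="110" VPOS="92" WIDTH="110" HEIGHT="24"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def alto_words(xml_text):
    """Return (text, (x0, y0, x1, y1)) for every String element."""
    root = ET.fromstring(xml_text)
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        # ALTO gives top-left corner plus size; convert to corner coordinates.
        x0 = float(elem.get("HPOS", 0))
        y0 = float(elem.get("VPOS", 0))
        x1 = x0 + float(elem.get("WIDTH", 0))
        y1 = y0 + float(elem.get("HEIGHT", 0))
        words.append((elem.get("CONTENT", ""), (x0, y0, x1, y1)))
    return words


if __name__ == "__main__":
    for text, box in alto_words(SAMPLE_ALTO):
        print(text, box)
```

Note that the unit of the coordinates depends on the file's `MeasurementUnit` setting, which this sketch does not look at.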
PAGE XML
https://en.wikipedia.org/wiki/PAGE_(XML)
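PAGE XML describes regions, lines and words as polygons rather than boxes: each element has a `Coords` child with a `points` attribute, and its text sits in `TextEquiv/Unicode`. A small sketch of reading text lines, again with only the standard library; the sample is hand-written, the namespace is assumed to be the 2013-07-15 schema version (older versions store coordinates differently), and `page_lines` is a made-up name.

```python
import xml.etree.ElementTree as ET

# Hand-written sample following the 2013-07-15 layout (assumption).
SAMPLE_PAGE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="scan.png" imageWidth="640" imageHeight="480">
    <TextRegion id="r1">
      <Coords points="36,92 220,92 220,116 36,116"/>
      <TextLine id="l1">
        <Coords points="36,92 220,92 220,116 36,116"/>
        <TextEquiv><Unicode>Hello world</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name):
    """Return the first direct child with the given local name, or None."""
    for c in elem:
        if local_name(c.tag) == name:
            return c
    return None


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def page_lines(xml_text):
    """Return (text, polygon) for every TextLine element."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local_name(elem.tag) != "TextLine":
            continue
        coords = child(elem, "Coords")
        equiv = child(elem, "TextEquiv")
        uni = child(equiv, "Unicode") if equiv is not None else None
        polygon = parse_points(coords.get("points", "")) if coords is not None else []
        text = (uni.text or "") if uni is not None else ""
        lines.append((text, polygon))
    return lines


if __name__ == "__main__":
    for text, polygon in page_lines(SAMPLE_PAGE):
        print(text, polygon)
```

The same approach should work for `Word` or `TextRegion` elements by changing the local name that is matched.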
abbyyXML
https://support.abbyy.com/hc/en-us/articles/360017336699-ABBYY-FineReader-Engine-XML-Export