OCR
Revision as of 19:17, 15 July 2023
✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
OCR as a task
Software
Output formats
hOCR
An HTML-based format that stores the positions of detected words and text fragments, and optionally detected style, layout, and other information. The data is embedded in ordinary HTML or XHTML markup.
https://en.wikipedia.org/wiki/HOCR
https://pypi.org/project/hocr-spec/
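As a rough illustration of what reading hOCR can look like, here is a minimal sketch using only Python's standard library. It assumes the common convention (as produced by e.g. Tesseract) that each word is an element with class `ocrx_word` whose `title` attribute holds properties such as `bbox x0 y0 x1 y1` and `x_wconf N`. The `SAMPLE_HOCR` snippet and the names `parse_title`, `HocrWordParser` and `extract_words` are made up for this example, and real files may carry more properties than are handled here.

```python
from html.parser import HTMLParser

# Hypothetical, hand-written sample; real hOCR output has more wrapping and metadata.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 36 92 200 116'>
  <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
  <span class='ocrx_word' title='bbox 109 92 200 116; x_wconf 91'><strong>world</strong></span>
 </span>
</div>
"""


def parse_title(title):
    """Pull bbox and word confidence out of an hOCR title attribute."""
    props = {}
    for part in title.split(";"):
        fields = part.split()
        if not fields:
            continue
        key, values = fields[0], fields[1:]
        if key == "bbox" and len(values) == 4:
            props["bbox"] = tuple(int(v) for v in values)
        elif key == "x_wconf" and values:
            props["conf"] = float(values[0])
    return props


class HocrWordParser(HTMLParser):
    """Collect text, bbox and confidence for each ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._current = None  # the word being read, or None
        self._depth = 0       # open tags inside the current word (assumes no void tags)

    def handle_starttag(self, tag, attrs):
        if self._current is not None:
            self._depth += 1  # nested markup such as <strong> inside a word
            return
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._current = {"text": "", **parse_title(attrs.get("title") or "")}
            self._depth = 1

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if self._current is None:
            return
        self._depth -= 1
        if self._depth == 0:
            self._current["text"] = self._current["text"].strip()
            self.words.append(self._current)
            self._current = None


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_HOCR):
        print(word.get("bbox"), word.get("conf"), word["text"])
```

The bounding box is kept as the four integers given in the `bbox` property; a word without a `bbox` or `x_wconf` property simply lacks that key.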
ALTO
https://en.wikipedia.org/wiki/ALTO_(XML)
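For comparison, a similarly minimal sketch for ALTO, where recognized words are `String` elements carrying `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes (and optionally a word confidence in `WC`). The `SAMPLE_ALTO` document is a simplified, hand-written fragment rather than a schema-valid file, and `alto_words` is a name invented here. Since the namespace URI differs between ALTO versions, the sketch matches on the local tag name only.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified sample; a real ALTO file has more required elements and attributes.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">
  <Layout>
    <Page WIDTH="640" HEIGHT="480">
      <PrintSpace>
        <TextBlock>
          <TextLine>
            <String CONTENT="Hello" HPOS="36" VPOS="92" WIDTH="60" HEIGHT="24" WC="0.95"/>
            <SP WIDTH="13" HPOS="96" VPOS="92"/>
            <String CONTENT="world" HPOS="109" VPOS="92" WIDTH="91" HEIGHT="24" WC="0.91"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def alto_words(alto_text):
    """Return one dict per String element: text, position, size, confidence."""
    root = ET.fromstring(alto_text)
    words = []
    for element in root.iter():
        # Tags look like '{namespace}String'; keep only the local name.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        wc = element.get("WC")
        words.append({
            "text": element.get("CONTENT", ""),
            "x": float(element.get("HPOS", "0")),
            "y": float(element.get("VPOS", "0")),
            "width": float(element.get("WIDTH", "0")),
            "height": float(element.get("HEIGHT", "0")),
            "conf": float(wc) if wc is not None else None,
        })
    return words


if __name__ == "__main__":
    for word in alto_words(SAMPLE_ALTO):
        print(word)
```

Note that ALTO stores a top-left position plus width and height, where hOCR stores two corners, and that the coordinate unit depends on the file's declared measurement unit, which this sketch does not inspect.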
PAGE XML
https://en.wikipedia.org/wiki/PAGE_(XML)
abbyyXML
https://support.abbyy.com/hc/en-us/articles/360017336699-ABBYY-FineReader-Engine-XML-Export