PDF notes: Difference between revisions

From Helpful
mNo edit summary
Line 272: Line 272:




===Text in PDFs===
==Text in PDFs==
{{stub}}
{{stub}}
<!--
<!--
Line 278: Line 278:
We may think of PDFs as text documents, but they are not.


They are basically a print format. A visual one.  
They are basically a print format.  
A visual one.  




Line 284: Line 285:




 
When it comes to text, it is text much like how [https://en.wikipedia.org/wiki/Movable_type#Typesetting typesetting a tray of prepared movable type] is text: recognizably that thing, but with low-level details that may be potentially awkward to deal with, depending on your expectations.
When it comes to text, it is text like how a tray of prepared movable type is text:
recognizably that thing, but potentially awkward to deal with, depending on your expectations.




Text components may not be structurally very related.
They may be a paragraph, or a single character.
Or maybe a sentence at a time, completely independent from the next line that ''visually''
Or maybe a sentence at a time, or rather the rest of that line,
it does a hyphenation onto.
completely independent from the next line that it ''visually'' may e.g. do a line-breaking [[typographic hyphenation]] onto.
 




Their position in the stream may not be the reading order.
If the text comes from something that thinks about documents - LaTeX, a word processor, or such - then the order these fragments come in is often a natural reading order.
It regularly is
Often.


If it comes from OCR or creative composition of parts, it may not.
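
The difference between stream order and reading order can be illustrated with a minimal sketch. It assumes some extractor has already produced text fragments with page coordinates; the `Fragment` type, the `reading_order` function and the sample data are made up for illustration and do not come from any particular library. It re-sorts fragments top to bottom, then left to right. Multi-column layouts, rotated text and tables need considerably more than this.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """A piece of text with the position where it starts (hypothetical type)."""
    x: float
    y: float      # PDF user space: y grows upward from the bottom of the page
    text: str


def reading_order(fragments, line_tolerance=2.0):
    """Group fragments into lines by similar y, then sort each line by x.

    Naive heuristic: single column, horizontal text only.
    """
    lines = []  # each entry: [reference_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= line_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])
    return [
        " ".join(f.text for f in sorted(frags, key=lambda f: f.x))
        for _, frags in lines
    ]


# Fragments in the order they might appear in the content stream
stream_order = [
    Fragment(72, 700, "Second line,"),
    Fragment(150, 720.5, "line"),
    Fragment(72, 720, "First"),
    Fragment(160, 700, "drawn first."),
]
print(reading_order(stream_order))
```

The tolerance exists because fragments on the same visual line do not always share exactly the same y coordinate.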




Also, PDF (without further qualification) is only required to ''draw'' text correctly.


Also, objects that show text may be encoded in a way that only the font drawing understands.
Objects that show text may be encoded in a way that only the font drawing understands.
This is why copy-pasting text from PDFs is sometimes completely garbled.
This is due to the way it uses fonts (apparently roughly: glyph indices into the font are not mapped to unicode{{verify}}, and/or the font isn't there to fix that after the fact?{{verify}})
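
As a rough illustration of why extraction can garble: text-showing operators carry character codes that the font turns into glyphs, and getting Unicode back out depends on a separate mapping (in real PDFs, typically a font's ToUnicode CMap or a standard encoding). The sketch below uses a made-up subset font in which codes 1 to 4 draw the glyphs for "Hello"; the dictionary and the function name are hypothetical, not a real library's API.

```python
# Codes as they might appear in a content stream for a subset font:
# the font knows code 1 draws "H", code 2 draws "e", and so on.
shown_codes = bytes([1, 2, 3, 3, 4])

# Hypothetical stand-in for a ToUnicode mapping (code -> Unicode text).
to_unicode = {1: "H", 2: "e", 3: "l", 4: "o"}


def extract_text(codes, mapping=None):
    """Map character codes to text; without a mapping, guess Latin-1."""
    if mapping is None:
        # No mapping available: all an extractor can do is guess
        return codes.decode("latin-1")
    # U+FFFD marks codes the mapping does not cover
    return "".join(mapping.get(code, "\ufffd") for code in codes)


print(repr(extract_text(shown_codes, to_unicode)))  # with the mapping
print(repr(extract_text(shown_codes)))              # guessing
```

The page still draws correctly in both cases, because drawing only needs the code-to-glyph step.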




Text objects are sometimes added onto images by what amounts to OCR,  
Even if it's correctly mapped to the font, a lot of PDFs use OCR, which is why there are a lot of more minor mistakes.
which is why copy-pasting text from PDFs is sometimes clearly but only slightly wrong.






Yes, there are variants where parts of that are addressed (e.g. tagged PDFs, [[#PDF/A|PDF/A]], PDF/UA), but you generally have no control over the PDFs you get.





Revision as of 13:12, 2 August 2023

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

PDF is a file format to store published versions of documents.

It is most often seen for manuals, articles (alongside PostScript), and such.


From a functional point of view, the PDF language combines

  • layout and graphics in much the same way as PostScript does (which is why the two are fairly directly and accurately convertible), though PDF is not a full programming language as PS is.
  • Font embedding in documents
  • Other file embedding

Graphical advantages over PostScript include transparency and embedded raster images.

But like PostScript, it is a rendering format more than it is a document format.



Versions


The internal format versions often went with major Acrobat releases.

  • PDF 1.0 with Acrobat 1.0 (1993)


  • PDF 1.1 with Acrobat 2.0 (1994)
    • Introduced links, some color features, passwords


  • PDF 1.2 with Acrobat 3.0 (1996)
    • Introduced Unicode, some interactivity, media, some more color and image features


  • PDF 1.3 with Acrobat 4.0 (2000)
    • Introduced digital signatures, some color features, JavaScript actions


  • PDF 1.4 with Acrobat 5.0 (2001)
    • introduced transparency, JBIG2 images, OCR text layers


  • PDF 1.5 with Acrobat 6.0 (2003)
    • some more stream features, JPEG2000 raster


  • PDF 1.6 with Acrobat 7.0 (2004)
    • Introduced option of OpenType fonts
    • XML forms
    • AES encryption
    • Embedded multimedia


  • PDF 1.7 with Acrobat 8.0 (2006)


After this, the PDF specification was handed over to ISO. Adobe no longer releases PDF specs itself; instead it publishes extensions that initially only its own products support, and which it tries to get into the ISO standard.


  • PDF 1.7 plus Adobe Extension Level 3 with Acrobat 9.0 (2008)


  • PDF 2.0 (2017), ISO 32000-2


Most viewers support all 1.4 features, and not necessarily all features since that time. (verify)

From a quick and likely-skewed sampling of PDF files, it seems that typical versions lag a decade behind what's available, which is probably great for compatibility. This is partly intentional, as saving PDFs for backwards compatibility often went a few versions/years back.
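
A sampling like that can be sketched with the standard library alone, since a PDF file starts with a header comment of the form `%PDF-1.4`. The function names and the sample byte strings below are made up for illustration. Note the assumption: this reads only the header, while from PDF 1.4 onward the document catalog may carry a `/Version` entry that overrides it, which this sketch ignores.

```python
import re

# The file header is a comment of the form %PDF-M.m at the start of the file
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data):
    """Return (major, minor) from the PDF header, or None if not found."""
    # Search a little past the start, since some files have leading junk
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def tally_versions(blobs):
    """Count header versions over an iterable of file contents (bytes)."""
    counts = {}
    for blob in blobs:
        version = header_version(blob)
        counts[version] = counts.get(version, 0) + 1
    return counts


# Stand-ins for the first bytes of real files
samples = [b"%PDF-1.4\n...", b"%PDF-1.7\n...", b"%PDF-1.4\n...", b"not a pdf"]
print(tally_versions(samples))
```

For real files, passing `open(path, "rb").read(1024)` for each path would be enough input.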


PDF/A

For long-term preservation of documents it is a good idea to restrict a PDF to a subset of features, to ensure that

  • it can be fully rendered anywhere, e.g.
    • requiring that font information be embedded rather than linked
    • requiring a lack of encryption
    • disallowing audio, video
  • it does not use any more recent features that are potentially exploitable, e.g.
    • disallowing scripting, executable launching


PDF/A-1 (2005) is a specification of such a set of restrictions, based on PDF 1.4.

It refers to ISO 19005-1:2005, "Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1)"
PDF/A-1a and PDF/A-1b refer to specific levels of compliance.


PDF/A-2 (2011) considered features from PDF 1.5, 1.6 and 1.7, and decided to include JPEG 2000, OpenType fonts, transparency effects and layers, and digital signatures.


PDF/A-3 (2012) allows embedding of arbitrary files


PDF/A-4 (2020) is based on PDF 2.0



See also:

PDF Viewers

  • Adobe Reader
  • Foxit Reader
  • Sumatra PDF
  • PDF-XChange Viewer
  • Preview (OSX)
  • Skim (OSX)
  • BePDF (BeOS)
  • Yap
  • Vindaloo


Editing, annotating, tools

  • flpsed (annotates)
  • Pdfvue


ps2pdf12, ps2pdf13, and ps2pdf14 actually call ps2pdfwr with a specific version argument, and ps2pdfwr mostly just calls gs (Ghostscript).


Mostly page-based operations:

  • pdftk
    • merge, split, rotate, watermark, metadata, attachment, encrypt, decrypt, attempt to uncorrupt
  • pdfjam has a few convenience utilities (mostly for printing) such as pdfnup (many-on-a-page), pdfjoin (combines separate documents/pages) and pdf90 (rotate 90 degrees)
  • pdfshuffler
    • (split, rearrange, merge, rotate, crop)
  • PDFGarden (OSX)
    • split, combine, arrange, and such


GIMP, Inkscape and such can also open PDF pages, but they tend not to be very convenient.

Converters, Printing to PDF

  • many online converters


Print to PDF:



Note that most office suites can save to PDF.


Libraries



Generating from code, analysing from code


Note: There are a bunch of converters from HTML, which may be a convenient intermediate for simple documents.


  • pdflib
    • free + paid versions
    • C, has bindings for C++, Java, Perl, PHP, Python, Ruby, Tcl


  • binding to Cairo


Python:

  • reportlab
    • pdfgen (low-level)
    • Platypus (higher-level)
  • pisa (conversion from HTML, based on reportlab)




Things that output to PDF that are somewhat more specific-purpose (but sometimes quite controllable):


Technical notes



Sizes

MediaBox - size of the media

CropBox - box that is expected to be shown or printed

usually a li

BleedBox -

TrimBox

ArtBox -


Broadly, this is about professional printing, and cutting. Very roughly:

MediaBox should be the size of the media we're printing on,
CropBox, BleedBox and TrimBox are about the area you end up
it seems that TrimBox is most used when printing many copies / pages onto a single sheet (see terms like press sheet)
...and it may just be the same as CropBox
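
How the boxes fall back on each other can be sketched as follows. In the PDF specification a missing CropBox defaults to the MediaBox, and missing BleedBox, TrimBox and ArtBox default to the CropBox, which is why TrimBox often ends up identical to CropBox. The plain dict used as a page here is a hypothetical stand-in for whatever a real parser returns, the function names are made up, and the sizes assume the default user space unit of 1/72 inch (a page's UserUnit entry can change that).

```python
POINTS_PER_INCH = 72.0   # default user space unit is 1/72 inch
MM_PER_INCH = 25.4


def effective_boxes(page):
    """Fill in missing page boxes using the PDF defaults.

    `page` is a plain dict of box name -> (llx, lly, urx, ury).
    """
    media = page["MediaBox"]                 # required on every page
    crop = page.get("CropBox", media)        # defaults to MediaBox
    return {
        "MediaBox": media,
        "CropBox": crop,
        "BleedBox": page.get("BleedBox", crop),  # these three default to CropBox
        "TrimBox": page.get("TrimBox", crop),
        "ArtBox": page.get("ArtBox", crop),
    }


def size_mm(box):
    """Width and height of a box in millimetres, rounded to 0.1 mm."""
    llx, lly, urx, ury = box
    to_mm = MM_PER_INCH / POINTS_PER_INCH
    return round((urx - llx) * to_mm, 1), round((ury - lly) * to_mm, 1)


# An A4-sized page with only MediaBox and TrimBox given
page = {"MediaBox": (0, 0, 595, 842), "TrimBox": (9, 9, 586, 833)}
boxes = effective_boxes(page)
print(boxes["CropBox"], size_mm(boxes["MediaBox"]))
```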


Text in PDFs


See also