PDF notes


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

PDF can be seen as a domain-specific language that is used primarily to express the positioning of text and graphics.

It is most often seen for manuals, academic articles (there historically alongside PostScript), final read-only versions of documents, and such.


The way it expresses graphics is similar to the way PostScript works, which is why the two are fairly directly and accurately convertible. Note that PS still lives on in the context of printers.

PDFs and text

PDF, more technically

Versions

  • PDF 1.0 with Acrobat 1.0 (1993)
  • PDF 1.1 with Acrobat 2.0 (1994)
    • Introduced links, some color features, passwords
  • PDF 1.2 with Acrobat 3.0 (1996)
    • Introduced Unicode, some interactivity, media, some more color and image features
  • PDF 1.3 with Acrobat 4.0 (2000)
    • Introduced digital signatures, some color features, JavaScript actions
  • PDF 1.4 with Acrobat 5.0 (2001)
    • introduced transparency, JBIG2 images
  • PDF 1.5 with Acrobat 6.0 (2003)
    • some more stream features, JPEG2000 raster
  • PDF 1.6 with Acrobat 7.0 (2004)
    • Introduced option of OpenType fonts
    • XML forms
    • AES encryption
    • Embedded multimedia
  • PDF 1.7 with Acrobat 8.0 (2006)
    • standardized as ISO 32000-1:2008
  • PDF 1.7 plus Adobe Extension Level 3 with Acrobat 9.0 (2008)
  • PDF 2.0 (2017), ISO 32000-2:2017
    • revised as ISO 32000-2:2020


Notes:

  • Most viewers support most or all 1.4 features(verify).
  • The internal format versions used to change along with major Acrobat releases.
This stopped when PDF was no longer managed/owned by Adobe alone: ISO had been there alongside for a while, and seems to have been handed the standard around PDF 1.7. Adobe still publishes extensions that initially only its own products support, and which it tries to get into the ISO standard.
  • If you want a document to display in as many places as possible, you could limit yourself to 1.4 or 1.5,
but you can usually figure out or guess, per feature, how widely supported later features are.
As of this writing, the bulk of PDFs you'll find are probably 1.5, 1.4, and various earlier versions. You'll see handfuls of uses up to 1.7, while 2.0 is currently rare.
  • Documents can in theory use features newer than the version they report
...and get away with it because most PDF readers just implement the features they care most about
...even though doing this is out of spec
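To see which version a file reports, you can look at its header, which starts with something like %PDF-1.4. The following is a minimal sketch, assuming you already have the first bytes of the file in memory; the function name is made up for illustration. Note that since PDF 1.4 the document catalog may also carry a /Version entry that overrides the header, which this sketch does not look at.

```python
import re

# The header looks like b"%PDF-1.4". Readers commonly tolerate some junk
# before it, so this sketch searches the first kilobyte rather than
# requiring the marker at offset 0.
_HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def header_version(data: bytes):
    """Return (major, minor) as reported by the header, or None if absent."""
    match = _HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# A tiny made-up fragment, not a complete PDF.
sample = b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"
print(header_version(sample))
```

For a real file you would pass in something like open(path, "rb").read(1024). As noted above, the reported version says little about which features the document actually uses.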


Types/variants



PDF variants can be seen as "uses existing features to conform to a specific set of requirements".

This is often done to make it a more useful, reliable format within a specific field.


Retaining and finding information

PDF/A (archiving)

PDF/A is a set of requirements useful for long-term preservation of documents.

It aims to restrict features so that we can ensure that

  • it can be rendered fully anywhere, e.g.
    • requires that font information be embedded rather than linked
    • requires a lack of encryption
    • disallows audio and video
  • it avoids specifics that might risk it not rendering
    • there is no guarantee it renders identically, but avoiding external dependencies does go a long way
  • it does not use any features that are potential security risks
    • disallows scripting and launching executables


Defined by ISO 19005, "Document management — Electronic document file format for long-term preservation". You can consider this as four distinct parts:

  • PDF/A-1 (ISO 19005-1:2005), "Part 1: Use of PDF 1.4 (PDF/A-1)"
    • based on PDF 1.4
  • PDF/A-2 (ISO 19005-2:2011), "Part 2: Use of ISO 32000-1 (PDF/A-2)"
    • based on PDF 1.7 (seems to exist in part to consider features from 1.5, 1.6 and 1.7(verify))
    • decided to include JPEG 2000, OpenType fonts, transparency effects and layers, digital signatures
  • PDF/A-3 (ISO 19005-3:2012), "Part 3: Use of ISO 32000-1 with support for embedded files (PDF/A-3)"
    • based on PDF 1.7
    • allows embedding of arbitrary files (was meant for machine-parseable data?(verify))
    • is otherwise mostly the same as PDF/A-2 (verify)
  • PDF/A-4 (ISO 19005-4:2020), "Part 4: Use of ISO 32000-2 (PDF/A-4)"
    • based on PDF 2.0
    • basically went back on allowing embedding of arbitrary files, apparently because it caused more problems than it solved? (verify)
    • except for specific accessibility features?(verify)



Added letters refer to specific levels of compliance, e.g. PDF/A-1a and PDF/A-1b.

Each part has its own set of letters, some of which are shared between parts.

  • A-1 has a and b
  • A-2 has b, u, and a
  • A-3 has b, u, and a
  • A-4 has f and e

where:

  • b (“Basic”)
    • visual appearance is the same when viewing or printing
  • a (“Accessible”)
    • meets b
    • guarantees that all characters are mapped to Unicode
    • has document structure information (tagging that expresses logical structure, such as reading order)
  • u (“Unicode”), introduced in -2
    • meets b
    • guarantees that all characters are mapped to Unicode (which means these characters are searchable, not just viewable)


In theory, A-4 documents

  • always conform to what b used to guarantee(verify),
  • also conform to u (if a little differently)(verify),
  • leave accessibility to a separate standard; a document can separately conform to /A-4 and to /UA (verify)

So that only leaves some new conformance levels:

  • f: file attachments (restricted)
  • e: engineering/technical documents
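The part/letter combinations listed above can be written down as a small lookup table. This is only a sketch that checks whether a label such as "PDF/A-2u" is a combination these notes mention; the function name is made up, and it says nothing about whether a given file actually conforms (that takes a real validator).

```python
import re

# Conformance letters per PDF/A part, copied from the list in these notes.
LEVELS = {
    1: {"a", "b"},
    2: {"a", "b", "u"},
    3: {"a", "b", "u"},
    4: {"e", "f"},
}


def parse_pdfa_label(label: str):
    """Return (part, level) for a known label, or None if it is not one.

    level is None when the label names only the part, e.g. "PDF/A-3".
    """
    match = re.fullmatch(r"PDF/A-(\d)([a-z]?)", label.strip())
    if match is None:
        return None
    part, level = int(match.group(1)), match.group(2)
    if part not in LEVELS:
        return None
    if level and level not in LEVELS[part]:
        return None  # e.g. "PDF/A-1u": u was only introduced in part 2
    return part, (level or None)


print(parse_pdfa_label("PDF/A-2u"))
print(parse_pdfa_label("PDF/A-1u"))
```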



PDF/UA (universal accessibility)

Defined by: ISO 14289




"Searchable PDF"

There does not seem to be a standard called 'searchable PDF'. The term seems to just describe whether there is embedded/OCRed text available at all, which is itself not really a standard either.


For more specific guarantees, look to PDF/A and PDF/UA.

The closest thing that is a standard and implies a PDF is searchable is probably PDF/A, in particular its u and a levels.

...not that you can easily test that everything that looks like text is searchable.

graphic design

PDF/X

Defined by: ISO 15930

Meant for graphics exchange, primarily for printing.


https://en.wikipedia.org/wiki/PDF/X


PDF/VT

Defined by: ISO 16612-2


https://en.wikipedia.org/wiki/PDF/VT

PDF/E (engineering, construction, manufacturing)

Defined by: ISO 24517-1

https://en.wikipedia.org/wiki/PDF/E


PDF-Healthcare

https://www.adobe.com/uk/acrobat/resources/document-files/pdf-types/pdf-healthcare.html

PAdES - signed PDFs

While the idea of electronic signatures was introduced earlier (in PDF 1.3), the standard we call "PDF Advanced Electronic Signatures" (PAdES) refers to a more recent, more secure, and more featureful refinement.


This is not defined directly by the PDF standard. It seems to have started with an EU need for these, expressed in a directive, which led to ETSI EN 319 142 (verify) (which comes in two parts).


https://en.wikipedia.org/wiki/PAdES

https://blog.pdf-tools.com/2018/11/pades-pdf-advanced-electronic-signature.html

https://www.adobe.com/uk/acrobat/resources/document-files/pdf-types/pades.html



"Hybrid PDF"

On file structure

Objects

Appending, and why xref is central

Progressive loading, and linearization

PDF forensics

PDF Metadata

Changes

Hiding data?

On fonts and characters

Embedded and base fonts



Embedding fonts, characters and font mappings, CID

PDF tools

Editing, annotating, tools

⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).


  • flpsed (annotates)
  • Pdfvue


ps2pdf12, ps2pdf13, and ps2pdf14 are thin wrappers: they call ps2pdfwr with a specific version argument, and ps2pdfwr mostly just calls gs (Ghostscript).
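As a rough illustration of what those wrappers boil down to, here is a sketch that builds a comparable gs command line. The function name and file names are made up, and the real wrapper scripts pass some further options, so treat this as an approximation rather than a copy of what they do.

```python
def pdfwrite_command(src: str, dst: str, version: str = "1.4"):
    """Build an argument list that asks Ghostscript to write a PDF.

    version is the PDF version to aim for, which is the part that differs
    between ps2pdf12, ps2pdf13 and ps2pdf14.
    """
    return [
        "gs",
        "-q",                                # suppress startup messages
        "-dNOPAUSE",                         # do not pause after each page
        "-dBATCH",                           # exit when the input is done
        "-sDEVICE=pdfwrite",                 # the PDF output device
        f"-dCompatibilityLevel={version}",
        f"-sOutputFile={dst}",
        src,
    ]


# Only builds and prints the command; to actually run it (requires
# Ghostscript to be installed) you could hand the list to subprocess.run().
print(" ".join(pdfwrite_command("in.ps", "out.pdf")))
```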


Mostly page-based operations:

  • pdftk
    • merge, split, rotate, watermark, metadata, attachment, encrypt, decrypt, attempt to uncorrupt
  • pdfjam has a few convenience utilities (mostly for printing) such as pdfnup (many-on-a-page), pdfjoin (combines separate documents/pages) and pdf90 (rotate 90 degrees)
  • pdfshuffler
    • (split, rearrange, merge, rotate, crop)
  • pdfcrop
  • PDFGarden (OSX)
    • split, combine, arrange, and such



GIMP, Inkscape and such can also open PDF pages, but they tend not to be very convenient.

Converters, Printing to PDF

  • many online converters


Print to PDF:



Note that most office suites can save to PDF.


pdfmark

Libraries



conversions to DjVu, EPUB, MOBI, XPS, PostScript, and more
including rasterization
Python wrapper around the MuPDF library
older versions of PyMuPDF had fitz as their Python import name. Newer versions use pymupdf, but for now fitz is still there for backwards compatibility
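If code has to run against both older and newer installs, one option is to try the import names in order of preference. A minimal sketch, with a made-up helper name; it only deals with the import name and assumes nothing about the API of whatever module it finds.

```python
import importlib


def import_first(names):
    """Return the first module from names that can be imported, else None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


# Prefer the newer import name, fall back to the older one.
pdf_lib = import_first(["pymupdf", "fitz"])
if pdf_lib is None:
    print("PyMuPDF does not seem to be installed")
else:
    print("using import name:", pdf_lib.__name__)
```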


eases various alterations (less about generation, won't rasterize)
leverages QPDF and others


rendering mostly?
Python wrapper around PDFium


read, render, modify PDF


mostly generating, not altering or rendering?

Generating from code, analysing from code


Note: There are a bunch of converters from HTML, which may be a convenient intermediate for simple documents.


  • pdflib
    • free + paid versions
    • C, has bindings for C++, Java, Perl, PHP, Python, Ruby, Tcl


  • binding to Cairo


Python:

  • reportlab
    • pdfgen (low-level)
    • Platypus (higher-level)
  • pisa (conversion from HTML, based on reportlab)




Things that output to PDF that are somewhat more specific-purpose (but sometimes quite controllable):


Technical notes


PDF / page sizes

Note that most of these differences do not matter to the average person printing a page: 'it may get resized a little' is a great tradeoff for 'it just prints without difficult questions'.


They matter for things like prepress, professional printing, and the cutting involved in it, where concepts like bleed are about printing a little beyond the edge, to catch the minor inconsistencies that may happen when trimming pages.

MediaBox should be the size of the media we're printing on, while CropBox, BleedBox and TrimBox are about the area you end up with.


MediaBox (required)

should contain (be at least as large as) all of CropBox, TrimBox, ArtBox and BleedBox
in everyday use, probably the media size
in prepress, this may be larger than the media size, for reasons relating to bleed/trimming
...but software may also add printer marks beyond this (verify)


ArtBox (optional, mostly relevant to prepress)

seemed to be a weird exception, pointing at where the fancy stuff is
does not get a systematic use in prepress? (or anywhere?) yet has some specific-case uses
e.g. 'this is how much to stay away from the edges if you put it in a lightbox'
may default to cropbox

BleedBox (optional, mostly relevant to prepress)

in prepress: region that should be clipped to in production, often a few millimeters larger than TrimBox.
but prepress tends to let you define the actual bleed, which then ignores BleedBox(verify)
otherwise probably the same as CropBox(verify)

TrimBox (optional, mostly relevant to prepress)

in prepress: the intended dimensions of the product, so this seems more important than CropBox (and BleedBox)(verify)
perhaps(verify) the most important box (and the most used) when printing many copies / pages onto a single sheet (see terms like press sheet)
otherwise: same as CropBox
and yes, trim and crop being different things is confusing


CropBox (optional)

the default for TrimBox, ArtBox and BleedBox if those are not specified in the PDF(verify)
box that is expected to be shown or printed
if a PDF contains a CropBox definition, even screen viewers will probably use this, not MediaBox.
in prepress: not really used
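The defaulting between these boxes can be sketched as follows: a missing CropBox falls back to MediaBox, and missing TrimBox, BleedBox and ArtBox fall back to CropBox. The page is represented here as a plain dict of (x0, y0, x1, y1) tuples in points, which is an assumption made for illustration and not how any particular library hands you this information.

```python
MM_PER_POINT = 25.4 / 72  # the default PDF unit is 1/72 inch


def effective_boxes(page: dict) -> dict:
    """Fill in the page boxes that were not given explicitly."""
    media = page["MediaBox"]           # the only required box
    crop = page.get("CropBox", media)  # defaults to MediaBox
    return {
        "MediaBox": media,
        "CropBox": crop,
        "BleedBox": page.get("BleedBox", crop),  # these three default
        "TrimBox": page.get("TrimBox", crop),    # to CropBox
        "ArtBox": page.get("ArtBox", crop),
    }


def size_mm(box):
    """Width and height of a box in millimeters, rounded to 0.1 mm."""
    x0, y0, x1, y1 = box
    return (round((x1 - x0) * MM_PER_POINT, 1),
            round((y1 - y0) * MM_PER_POINT, 1))


# A made-up A4 page that only specifies its MediaBox.
boxes = effective_boxes({"MediaBox": (0, 0, 595.28, 841.89)})
print(size_mm(boxes["TrimBox"]))  # (210.0, 297.0)
```

A prepress-style page would instead carry an explicit TrimBox a few millimeters inside its BleedBox, in which case nothing is defaulted for those two.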


PDF extraction notes

Text in PDFs


On OCR in PDF

Text from PDFs

As just noted, PDF is not a document format.

It does not have concepts like columns, paragraphs, lines -- or necessarily even words.


PDF/A (for archiving) and PDF/UA (Universal Accessibility) do have moderate structure information, but that describes a minority of PDF files out there.

And there are plenty of little issues left over after that.
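To illustrate why lines have to be guessed at, here is a sketch that rebuilds lines from positioned text fragments by grouping on similar y coordinates and then sorting by x. The (x, y, text) input shape, the function name and the tolerance value are all assumptions for illustration; real extractors deal with far messier input (per-glyph fragments, rotation, columns).

```python
def group_into_lines(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into lines of text, top to bottom.

    y grows upward, as in PDF's default coordinate system, so larger y
    means higher on the page.
    """
    lines = []  # each entry is [y_of_first_fragment, [(x, text), ...]]
    for x, y, text in sorted(fragments, key=lambda f: -f[1]):
        if lines and abs(lines[-1][0] - y) <= y_tolerance:
            lines[-1][1].append((x, text))  # close enough: same line
        else:
            lines.append([y, [(x, text)]])  # start a new line
    # Within a line, order fragments left to right. Joining with a space
    # is itself a guess about where the word boundaries are.
    return [" ".join(t for _, t in sorted(items)) for _, items in lines]


fragments = [
    (100, 700.0, "world"),
    (50, 700.5, "Hello"),
    (50, 680.0, "second"),
    (95, 680.0, "line"),
]
print(group_into_lines(fragments))  # ['Hello world', 'second line']
```

Choosing the tolerance is already a judgment call: too small and superscripts split off into their own lines, too large and tightly set lines merge.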


Editing or annotating PDF

See also