PDF notes
PDF is a file format to store published versions of documents.
It is most often seen for manuals, academic articles (historically alongside PostScript), final versions of documents, and such.
From a functional point of view, the PDF language combines
- layout and graphics, in much the same way as PostScript does (which is why the two are fairly directly and accurately convertible), though PDF is not a full programming language as PS is.
- Font embedding in documents
- Other file embedding
Graphical advantages over PostScript include transparency, and more options for embedded raster images (e.g. JBIG2, JPEG 2000).
But like PostScript, it is a rendering format more than it is a document format.
Structure from PDFs
As just noted, PDF is not a document format.
It does not have concepts like columns, lines, or even words.
PDF/A (for archiving) and PDF/UA (Universal Accessibility)
do have moderate structure information,
but that describes a minority of PDF files out there.
And there are plenty of little issues left over after that.
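To illustrate what the lack of lines and words means in practice: text extractors typically get positioned fragments and have to guess the structure back. Below is a minimal sketch of that guessing, assuming fragments arrive as (x0, x1, y, text) tuples in page coordinates. The tuple layout, the function name and both thresholds are made up for illustration; real extractors use considerably more involved heuristics.

```python
def chunks_to_lines(chunks, y_tol=2.0, gap=1.5):
    """Group positioned text fragments into lines of text.

    chunks: list of (x0, x1, y, text) tuples (hypothetical layout).
    y_tol:  fragments whose baseline differs by at most this are one line.
    gap:    horizontal gap above which we assume a word break.
    """
    lines = []  # each entry is [y_of_first_fragment, [fragments]]
    # PDF y coordinates grow upwards, so sort descending for top-to-bottom
    for chunk in sorted(chunks, key=lambda c: -c[2]):
        if lines and abs(lines[-1][0] - chunk[2]) <= y_tol:
            lines[-1][1].append(chunk)
        else:
            lines.append([chunk[2], [chunk]])

    out = []
    for _, members in lines:
        members.sort(key=lambda c: c[0])  # left to right
        text = members[0][3]
        for prev, cur in zip(members, members[1:]):
            if cur[0] - prev[1] > gap:    # visible gap: guess a space
                text += " "
            text += cur[3]
        out.append(text)
    return out
```

Note that a word may arrive split over several fragments (kerning, font changes), and multi-column layouts will defeat this naive version entirely, which is exactly the kind of issue meant above.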
Versions and variants
The internal format versions often went with major Acrobat releases.
- PDF 1.0 with Acrobat 1.0 (1993)
- PDF 1.1 with Acrobat 2.0 (1994)
- Introduced links, some color features, passwords
- PDF 1.2 with Acrobat 3.0 (1996)
- Introduced Unicode, some interactivity, media, some more color and image features
- PDF 1.3 with Acrobat 4.0 (2000)
- Introduced digital signatures, some color features, JavaScript actions
- PDF 1.4 with Acrobat 5.0 (2001)
- Introduced transparency, JBIG2 images, OCR text layers
- PDF 1.5 with Acrobat 6.0 (2003)
- some more stream features, JPEG2000 raster
- PDF 1.6 with Acrobat 7.0 (2004)
- Introduced option of OpenType fonts
- XML forms
- AES encryption
- Embedded multimedia
- PDF 1.7 with Acrobat 8.0 (2006)
From here on, the PDF specification was handed over to ISO (PDF 1.7 became ISO 32000-1:2008). This means Adobe no longer releases PDF specs itself, but publishes extensions that initially only its own products support, and which it tries to get into the ISO standard.
- PDF 1.7 plus Adobe Extension Level 3 with Acrobat 9.0 (2008)
- PDF 2.0 (2017), ISO 32000-2
Most viewers support all 1.4 features, and not necessarily all features since that time. (verify)
From a quick and likely-skewed sampling of PDF files, it seems that typical versions lag a decade behind what's available, which is probably great for compatibility. This is partly intentional, as saving PDFs for backwards compatibility often meant going a few versions (or years) back.
Most of the PDFs I have
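For doing such a sampling yourself: a PDF file starts with a header like %PDF-1.4, so the declared version can be read without any PDF library. A minimal sketch, where the function name is my own and the 1024-byte window is an assumption about how lenient to be with junk before the header:

```python
import re

def pdf_header_version(data):
    """Return the version from a '%PDF-x.y' header in the first bytes, or None.

    data: the first chunk of the file, as bytes.
    """
    match = re.search(rb"%PDF-(\d\.\d)", data[:1024])
    return match.group(1).decode("ascii") if match else None

# Usage on a real file (path is a placeholder):
#   with open("some.pdf", "rb") as f:
#       print(pdf_header_version(f.read(1024)))
```

Keep in mind this is only what the header declares. Since PDF 1.4 the document catalog can carry a Version entry that overrides the header, and a file declaring a version need not actually use that version's features.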
PDF/A (archiving)
For long-term preservation of documents it is a good idea to restrict a PDF to a subset of features, to ensure that it
- can be rendered fully (and essentially identically) anywhere, e.g.
- requiring font information be embedded rather than linked
- requiring a lack of encryption
- disallowing audio, video
- does not use any features that are potential security risks, e.g.
- disallowing scripting, executable launching
PDF/A-1 (2005) is a specification of such a set of restrictions, based on PDF 1.4.
- It is defined in ISO 19005-1:2005, "Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1)"
PDF/A-2 (2011) considered features from PDF 1.5, 1.6 and 1.7, and decided to allow JPEG 2000, OpenType fonts, transparency effects and layers, and digital signatures
PDF/A-3 (2012) allows embedding of arbitrary files, and is otherwise the same as PDF/A-2
PDF/A-4 (2020) is based on PDF 2.0
b, u, and a are specific levels of compliance
- PDF/A-1 has a and b (written PDF/A-1a and PDF/A-1b)
- PDF/A-2 and -3 have b, u, and a
where:
- b (“Basic”)
- visual appearance is same when viewing or printing
- u (“Unicode”)
- meets b
- guarantees that all characters are mapped to Unicode (which means they are searchable, not just viewable)
- a (“Accessible”)
- meets b
- guarantees that all characters are mapped to Unicode
- has document structure information (tagging that conveys logical structure, such as reading order)
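As far as I know, a file declares which part and level it claims in its XMP metadata, via pdfaid:part and pdfaid:conformance properties in the http://www.aiim.org/pdfa/ns/id/ namespace. The sketch below reads that claim from an XMP packet that has already been extracted as a string (getting the packet out of the PDF is left to a PDF library). The function name is hypothetical.

```python
import xml.etree.ElementTree as ET

PDFAID_NS = "http://www.aiim.org/pdfa/ns/id/"

def pdfa_claim(xmp_xml):
    """Return (part, conformance) claimed in an XMP packet, or None.

    XMP allows properties as attributes or as child elements,
    so both spellings are checked.
    """
    part_key = "{%s}part" % PDFAID_NS
    conf_key = "{%s}conformance" % PDFAID_NS
    part = conformance = None
    for elem in ET.fromstring(xmp_xml).iter():
        # attribute form: <rdf:Description pdfaid:part="2" .../>
        part = elem.attrib.get(part_key, part)
        conformance = elem.attrib.get(conf_key, conformance)
        # element form: <pdfaid:part>2</pdfaid:part>
        if elem.tag == part_key and elem.text:
            part = elem.text.strip()
        elif elem.tag == conf_key and elem.text:
            conformance = elem.text.strip()
    if part is None:
        return None
    return part, conformance
```

A result like ("2", "B") would mean the file claims PDF/A-2b. This is only the claim; checking that a file actually meets the restrictions takes a dedicated validator.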
See also:
PDF/UA (universal accessibility)
https://en.wikipedia.org/wiki/PDF/UA
PDF/X
https://en.wikipedia.org/wiki/PDF/X
PDF/VT
https://en.wikipedia.org/wiki/PDF/VT
Editing, annotating, tools
⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).
- Adobe Acrobat Pro (paid)
- PDFEdit (free)
- Foxit Editor
- flpsed (annotates)
- Pdfvue
ps2pdf12, ps2pdf13, and ps2pdf14 actually call ps2pdfwr with a specific version argument, and ps2pdfwr mostly just calls gs (Ghostscript).
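As a rough sketch of what that wrapper chain boils down to, the following builds a gs command line for a given compatibility level. The function name is made up for illustration, and the real ps2pdfwr script passes further options (such as -dSAFER) that vary between Ghostscript versions, so treat this as an approximation rather than an exact equivalent.

```python
def gs_pdfwrite_command(src, dst, compat="1.4"):
    """Build a Ghostscript argument list that writes src (PS or PDF) to dst as PDF.

    compat is the target PDF version, e.g. "1.2", "1.3" or "1.4",
    matching what ps2pdf12 / ps2pdf13 / ps2pdf14 select.
    """
    return [
        "gs",
        "-q",                    # no startup banner
        "-dNOPAUSE", "-dBATCH",  # run non-interactively, exit when done
        "-sDEVICE=pdfwrite",     # the PDF-writing output device
        "-dCompatibilityLevel=" + compat,
        "-sOutputFile=" + dst,
        src,
    ]

# To actually run it (requires gs on the PATH):
#   import subprocess
#   subprocess.run(gs_pdfwrite_command("in.ps", "out.pdf"), check=True)
```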
Mostly page-based operations:
- pdftk
- merge, split, rotate, watermark, metadata, attachment, encrypt, decrypt, attempt to uncorrupt
- pdfjam has a few convenience utilities (mostly for printing) such as pdfnup (many-on-a-page), pdfjoin (combines separate documents/pages) and pdf90 (rotate 90 degrees)
- pdfshuffler
- (split, rearrange, merge, rotate, crop)
- PDFGarden (OSX)
- split, combine, arrange, and such
GIMP, Inkscape and such can also open PDF pages, but they tend not to be very convenient.
Converters, Printing to PDF
- many online converters
Print to PDF:
Note that most office suites can save to PDF.
pdfmark
Libraries
- Ghostscript (interprets PS and PDF)
- MuPDF (C)
- conversions to DjVu, epub, mobi, XPS, Postscript, and more
- including rasterization
- PyMuPDF (python)
- python wrapper around MuPDF library
- older versions of PyMuPDF had their Python import name as fitz. Newer versions use pymupdf, but for now fitz is still there for backwards compatibility
- pikepdf (python)
- eases various alterations (less about generation, won't rasterize)
- leverages QPDF and others
- PDFium (C++, used in Chrome)
- rendering mostly?
- pypdfium2 (python)
- python wrapper around PDFium
- poppler (C++, and python wrapper)
- read, render, modify PDF
- reportlab (python)
- mostly generating, not altering or rendering?
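On the PyMuPDF import name mentioned above: code that has to run against both older and newer installs can try the names in order. A minimal sketch, where import_first is a hypothetical helper of my own, not something PyMuPDF provides:

```python
import importlib

def import_first(*names):
    """Return the first module from names that can be imported, or None."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None

# Prefer the newer name, fall back to the legacy one.
# This is None if PyMuPDF is not installed at all.
pymupdf = import_first("pymupdf", "fitz")
```

Trying the newer name first means the code keeps working if the fitz alias is eventually dropped.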
Generating from code, analysing from code
Note: There are a bunch of converters from HTML, which may be a convenient intermediate for simple documents.
- pdflib
- free + paid versions
- C, has bindings for C++, Java, Perl, PHP, Python, Ruby, Tcl
- binding to Cairo
Python:
- reportlab
- pdfgen (low-level)
- Platypus (higher-level)
- pisa (conversion from HTML, based on reportlab)
- rst2pdf (reStructuredText to PDF)
- pdfminer (text analyser)
Things that output to PDF that are somewhat more specific-purpose (but sometimes quite controllable):
- Various reporting systems
- Jasper ()
- BIRT (Java, free, part of eclipse(verify))
- G2 Report Engine (Java, free)
- Pentaho
Technical notes
PDF / page sizes
Note that most of these differences do not matter to the average person printing a page.
They matter for things like prepress - professional printing, and the cutting involved in it,
where concepts like bleed are about printing a little beyond the intended edge,
to absorb the minor inconsistencies that may happen when trimming pages.
MediaBox should be the size of the media we're printing on, while CropBox, BleedBox and TrimBox are about the area you end up with.
MediaBox (required)
- should contain (be at least as large as) all of CropBox, TrimBox, ArtBox and BleedBox
- in everyday use, probably the media size
- in prepress, this may be larger than the media size, for reasons relating to bleed/trimming
- ...but software may also add printer marks beyond this (verify)
CropBox (optional)
- the default for TrimBox, ArtBox and BleedBox if those are not specified in the PDF(verify)
- box that is expected to be shown or printed
- if a PDF contains a CropBox definition, even screen viewers will probably use this, not MediaBox.
- in prepress: not really used
BleedBox (optional)
- in prepress: region that should be clipped to in production, often a few millimeters larger than TrimBox.
- but prepress tends to let you define the actual bleed, which then ignores BleedBox(verify)
- otherwise probably the same as CropBox(verify)
TrimBox (optional)
- in prepress: the intended dimensions of the final product, so this seems more important than CropBox (and BleedBox)(verify)
- perhaps(verify) the most important box (and most used) when printing many copies / pages onto a single sheet (see terms like press sheet)
- otherwise: same as CropBox
- and yes, trim and crop being different things is confusing
ArtBox (optional)
- seemed to be a weird exception, pointing at where the fancy stuff is
- does not get a systematic use in prepress? (or anywhere?) yet has some specific-case uses
- e.g. 'this is how much to stay away from the edges if you put it in a lightbox'
- may default to CropBox
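The defaulting between these boxes can be sketched as follows. This reflects my reading of the PDF specification (CropBox falls back to MediaBox; BleedBox, TrimBox and ArtBox fall back to CropBox), so take it with the same (verify) as the notes above. The dict-based page representation and both function names are made up for illustration; boxes are [left, bottom, right, top] in points.

```python
PT_TO_MM = 25.4 / 72  # default PDF user space unit is 1/72 inch

def effective_boxes(page):
    """Fill in the boxes a page does not define explicitly.

    page: dict mapping box names to [llx, lly, urx, ury] in points.
    """
    media = page["MediaBox"]           # the only required box
    crop = page.get("CropBox", media)  # defaults to MediaBox
    return {
        "MediaBox": media,
        "CropBox": crop,
        "BleedBox": page.get("BleedBox", crop),  # these three
        "TrimBox": page.get("TrimBox", crop),    # default to
        "ArtBox": page.get("ArtBox", crop),      # CropBox
    }

def bleed_mm(boxes):
    """Bleed (left, bottom, right, top) in mm: how far BleedBox extends past TrimBox."""
    b, t = boxes["BleedBox"], boxes["TrimBox"]
    return (
        (t[0] - b[0]) * PT_TO_MM,
        (t[1] - b[1]) * PT_TO_MM,
        (b[2] - t[2]) * PT_TO_MM,
        (b[3] - t[3]) * PT_TO_MM,
    )
```

For a page that defines only MediaBox, all five boxes come out identical and the bleed is zero on every side, which matches the everyday case where none of this matters.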