PDF notes: Difference between revisions

Revision as of 18:12, 22 September 2023

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

PDF is a file format to store published versions of documents.

It is most often seen for manuals, articles (alongside PostScript), and such.

From a functional point of view, the PDF language combines

- layout and graphics, in much the same way as PostScript does (which is why the two are fairly directly and accurately convertible), though PDF is not a full programming language as PS is
- font embedding in documents
- other file embedding

Graphical advantages over PostScript include transparency and embedded raster images.

But like PostScript, it is a rendering format more than it is a document format.

Structure from PDFs

Versions

The internal format versions often went with major Acrobat releases.

PDF 1.0 with Acrobat 1.0 (1993)

PDF 1.1 with Acrobat 2.0 (1994)
- Introduced links, some color features, passwords

PDF 1.2 with Acrobat 3.0 (1996)
- Introduced Unicode, some interactivity, media, some more color and image features

PDF 1.3 with Acrobat 4.0 (2000)
- Introduced digital signatures, some color features, JavaScript actions

PDF 1.4 with Acrobat 5.0 (2001)
- Introduced transparency, JBIG2 images, OCR text layers

PDF 1.5 with Acrobat 6.0 (2003)
- some more stream features, JPEG2000 raster

PDF 1.6 with Acrobat 7.0 (2004)
- Introduced option of OpenType fonts
- XML forms
- AES encryption
- Embedded multimedia

PDF 1.7 with Acrobat 8.0 (2006)

From this point on, the PDF specification was handed over to ISO (PDF 1.7 became ISO 32000-1 in 2008). This means Adobe no longer releases PDF specs itself. Instead it publishes extensions that initially only its own products support, and which it tries to get into the ISO standard.

PDF 1.7 plus Adobe Extension Level 3 with Acrobat 9.0 (2008)

PDF 2.0 (2017), ISO 32000-2

Most viewers support all 1.4 features, and not necessarily all features since that time. (verify)

From a quick and likely skewed sampling of PDF files, typical versions seem to lag about a decade behind what is available, which is probably great for compatibility. This is partly intentional, as saving PDFs for backwards compatibility often went a few versions/years back.
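
As a rough way to do such a sampling yourself: a PDF file starts with a header comment of the form %PDF-1.4, so the declared version can be read from the first bytes. A minimal sketch using only the standard library; the function names are made up for illustration. Note that since PDF 1.4 the document catalog may carry a /Version entry that overrides the header, which this sketch ignores.

```python
import re

# The header is expected at the start of the file, but some files have junk
# before it, so search the first kilobyte rather than insisting on offset 0.
_HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_header_version(data: bytes):
    """Return (major, minor) from the %PDF-M.m header, or None if not found."""
    match = _HEADER_RE.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def pdf_file_version(path: str):
    """Read the first kilobyte of a file and return its declared header version."""
    with open(path, "rb") as f:
        return pdf_header_version(f.read(1024))
```

Running pdf_file_version over a directory of files and counting the results gives a quick (and equally skewed) picture of which versions are in use.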

PDF/A

For long-term preservation of documents it is a good idea to restrict a PDF to a subset of features, to ensure that

it can be fully rendered anywhere, e.g.
- requiring that font information be embedded rather than linked
- requiring a lack of encryption
- disallowing audio and video

it does not use features that are potentially exploitable, e.g.
- disallowing scripting and executable launching

PDF/A-1 (2005) is a specification of such a set of restrictions, based on PDF 1.4.

It refers to ISO 19005-1:2005, "Document Management - Electronic document file format for long term preservation - Part 1: Use of PDF 1.4 (PDF/A-1)".

PDF/A-1a and PDF/A-1b refer to specific levels of conformance.

PDF/A-2 (2011) considered features from PDF 1.5, 1.6 and 1.7, and decided to include JPEG 2000, OpenType fonts, transparency effects and layers, and digital signatures

PDF/A-3 (2012) allows embedding of arbitrary files

PDF/A-4 (2020) is based on PDF 2.0

PDF Viewers

Adobe Reader
Foxit Reader
Sumatra PDF
PDF-XChange Viewer

Preview (OSX)
Skim (OSX)

gv
Xpdf
ePDFView
Poppler
Okular (KDE4)
KPDF (KDE)
Evince (GNOME)

MuPDF

BePDF (BeOS)

Yap
Vindaloo

Editing, annotating, tools

Adobe Acrobat Pro (paid)
PDFEdit (free)
Foxit Editor

flpsed (annotates)

Pdfvue

ps2pdf12, ps2pdf13, and ps2pdf14 actually call ps2pdfwr with a specific version argument, and ps2pdfwr mostly just calls gs (Ghostscript).
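
Roughly what ps2pdfwr ends up doing can be sketched as building a gs command line. The options used here (-q, -dSAFER, -dBATCH, -dNOPAUSE, -sDEVICE=pdfwrite, -dCompatibilityLevel, -sOutputFile) are existing Ghostscript options, but the exact set that ps2pdfwr passes varies between Ghostscript versions, so treat this as an approximation. The function names are made up for illustration.

```python
import subprocess


def gs_pdf_command(src: str, dst: str, compat: str = "1.4"):
    """Build a Ghostscript argument list that writes src (PS or PDF) out as a PDF.

    compat is the target PDF version, e.g. "1.2", "1.3" or "1.4",
    which is the distinction that ps2pdf12 / ps2pdf13 / ps2pdf14 make.
    """
    return [
        "gs",
        "-q",                              # less chatter on stdout
        "-dSAFER",                         # restrict file access from within PostScript
        "-dBATCH", "-dNOPAUSE",            # run non-interactively, exit when done
        "-sDEVICE=pdfwrite",               # the PDF-writing output device
        "-dCompatibilityLevel=" + compat,  # target PDF version
        "-sOutputFile=" + dst,
        src,
    ]


def convert(src: str, dst: str, compat: str = "1.4"):
    """Run the conversion; requires gs to be installed and on the PATH."""
    subprocess.run(gs_pdf_command(src, dst, compat), check=True)
```

Keeping the command construction separate from running it makes it easy to print and inspect the command before trusting it with real files.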

Mostly page-based operations:

pdftk
- merge, split, rotate, watermark, metadata, attachment, encrypt, decrypt, attempt to uncorrupt

pdfjam has a few convenience utilities (mostly for printing) such as pdfnup (many-on-a-page), pdfjoin (combines separate documents/pages) and pdf90 (rotate 90 degrees)

pdfshuffler
- (split, rearrange, merge, rotate, crop)

PDFGarden (OSX)
- split, combine, arrange, and such

GIMP, Inkscape and such can also open PDF pages, but they tend not to be very convenient.

Converters, Printing to PDF

many online converters

http://www.dopdf.com/

http://convert.neevia.com/

Print to PDF:

Note that most office suites can save to PDF.

Libraries

MuPDF
Ghostscript (interprets PS and PDF)

Generating from code, analysing from code

Note: There are a bunch of converters from HTML, which may be a convenient intermediate for simple documents.

pdflib
- free + paid versions
- C, has bindings for C++, Java, Perl, PHP, Python, Ruby, Tcl

binding to Cairo

Python:

reportlab
- pdfgen (low-level)
- Platypus (higher-level)

pycairo

pyPdf

pisa (conversion from HTML, based on reportlab)

rst2pdf (reStructuredText to PDF)

pdfminer (text analyser)

Things that output to PDF that are somewhat more specific-purpose (but sometimes quite controllable):

Various reporting systems
- Jasper
- BIRT (Java, free, part of Eclipse (verify))
- G2 Report Engine (Java, free)
- Pentaho

Technical notes

Sizes

MediaBox - size of the media

CropBox - box that is expected to be shown or printed

usually a li

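BleedBox - region that page contents should be clipped to in a production (printing) environment, which includes the bleed area beyond the trim edge

TrimBox - intended dimensions of the finished page after trimming

ArtBox - extent of the page's meaningful content, as intended by the page's creator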

Broadly, this is about professional printing, and cutting. Very roughly:

MediaBox should be the size of the media we're printing on,

CropBox, BleedBox and TrimBox are about the area you end up with after cutting

it seems that TrimBox is most used when printing many copies / pages onto a single sheet (see terms like press sheet)

...and it may just be the same as CropBox
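
To make the sizes concrete: each of these boxes is an array of four numbers, two diagonally opposite corners, in default user space units, which are 1/72 inch unless the page says otherwise. A minimal sketch of turning such an array into a physical size; the function name is made up for illustration, and it ignores the page's /UserUnit and /Rotate entries.

```python
POINTS_PER_INCH = 72.0
MM_PER_INCH = 25.4


def box_size_mm(box):
    """Given a box as (llx, lly, urx, ury) in points, return (width_mm, height_mm).

    The corners may be given in either order, so take absolute differences.
    """
    llx, lly, urx, ury = box
    width_pt = abs(urx - llx)
    height_pt = abs(ury - lly)
    return (width_pt / POINTS_PER_INCH * MM_PER_INCH,
            height_pt / POINTS_PER_INCH * MM_PER_INCH)


# A MediaBox of [0 0 595 842] is approximately A4 (210 x 297 mm)
width_mm, height_mm = box_size_mm((0, 0, 595, 842))
```

Comparing the result for MediaBox and TrimBox on the same page shows how much margin is expected to be cut away.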

Text in PDFs

@@ Line 4: / Line 4: @@
 It is most often seen for manuals, articles (alongside [[PostScript]]), and such.
-Like PS, A PDF should render the same everywhere.
@@ Line 20: / Line 17: @@
+==Structure from PDFs==
+<!--
+As just noted, PDF is not a document format.
+It does not have concepts like columns, lines, or even words.
+PDF/A (for archiving) and PDF/UA (Universal Accessibility) ''do'' have structure information,
+but that describes a minority of PDF files out there.
+And there are plenty of little issues left over after that.
+Things are represented in a visually oriented way, there to be rendered.
+If you want to extract formatted text, it amounts to finding all the objects on a page, and piecing it together yourself.
+PDF extraction tools do just that, but despite having just one job, tend to not do everything well.
+It's largely a question of focus - PDF can be used for such a diverse set of things,
+these tools focus on documents, and then specific kinds of them.
+Getting it right will always depend a little on context,
+and there are many refinements that will work ''great'' on a specific document set,
+but are not useful in general.
+The only reason that certain products (Acrobat, Word, various specific others) do it well is
+probably purely because more time was spent on them.
+Easier things include
+* separating header and footer from body, by looking at consistency in where text sits (often a rectangle)
+* paragraphs, because they are often separated by whitespace
+Harder are things like
+* footnotes, because they may butt up against that main body area
+* tables, because they are very diverse
+To evaluate:
+PDFExtract [https://github.com/CrossRef/pdfextract]
+: seems somewhat
+PDFBox [https://pdfbox.apache.org/]
+PyMuPDF [https://pymupdf.readthedocs.io/en/latest/]
+: more mechanical but gives more control
+Poppler
+: backs pdftotext?
+iText PDF
+See also:
+* "Towards High-Quality Text Stream Extraction from PDF" [https://aclanthology.org/W12-3211.pdf]
+-->
@@ Line 121: / Line 180: @@
 See also:
 * http://en.wikipedia.org/wiki/PDF/A
-===PDF/UA===
-<!--
-PDF/UA, for Universal Accessibility, is a PDF (1.7) that adheres to a number of extra requirements.
-This includes
-* visual guidelines, largely [https://en.wikipedia.org/wiki/Web_Content_Accessibility_Guidelines WCAG]
-* ability for assistive technology to work decently
-:: including text that is present in a sensible reading order
-* adherence to Tagged PDF, with a variety of added qualitative requirements, especially regarding semantic correctness of the tags employed
-* a document must contain 'tags'
-* all images must have alternative text
-* all fonts must be embedded in the document; tables must have a clear header (row or column) and be unambiguous to interpret; content must not be represented by color, shape, or contrast alone.
-* meaningful graphics must include text descriptions
-https://en.wikipedia.org/wiki/PDF/UA
 ==PDF Viewers==
