General documentation / cheat sheets for various languages and services

PDF format file structure

The primary free version of the PDF specification: PDF32000_2008.pdf

The GnuPDF Wiki is another good resource (e.g., CCITT_Fax_Filter).

A summary of the basic stuff

PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object.

Boolean literals appear as true and false.

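Number literals are written in ordinary decimal notation, with an optional sign and decimal point (e.g., 42, -3.5, +.002). Exponential (scientific) notation is not allowed.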

String literals are delimited by ( and ), or written as hexadecimal data between < and >. Inside a literal string, escape an unbalanced parenthesis with the \ character; a literal backslash is written as \\. (Balanced pairs of parentheses need no escaping.) A \ at the end of a line means the line break that follows is not part of the string.

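var hexadecimal_string = '7e19ea68d47cd58418bb9001776e808b';
// Each pair of hex digits encodes one byte.
var bytes = hexadecimal_string.match(/[0-9a-fA-F]{2}/g).map(function(pair) { return parseInt(pair, 16); });
// Buffer.from replaces the deprecated `new Buffer(...)`. 'latin1' keeps one character
// per byte; decode as 'utf8' only when the bytes are known to be UTF-8 text.
var string = Buffer.from(bytes).toString('latin1');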
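The escaping rules for literal strings can be sketched in the same style. This is a minimal sketch, not a full parser: decodeLiteralString is a hypothetical helper name, it expects the text between the outer parentheses, and it also handles the octal \ddd escapes that the spec allows.

```javascript
// Decode the body of a PDF literal string (the text between the outer parentheses).
// Hypothetical helper; the result has one character per decoded byte.
function decodeLiteralString(body) {
  var simple = { n: '\n', r: '\r', t: '\t', b: '\b', f: '\f', '(': '(', ')': ')', '\\': '\\' };
  return body.replace(/\\(\r\n|\r|\n|[0-7]{1,3}|[\s\S])/g, function(match, esc) {
    // Backslash followed by a line break: a continuation, so drop both.
    if (/^[\r\n]/.test(esc)) return '';
    // Octal escape \ddd: keep only the low 8 bits.
    if (/^[0-7]/.test(esc)) return String.fromCharCode(parseInt(esc, 8) & 0xff);
    // Known single-character escape, or an unknown one (drop the backslash).
    return simple.hasOwnProperty(esc) ? simple[esc] : esc;
  });
}
```

For instance, the body a\(b\)c decodes to a(b)c, and \101 decodes to A.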

Delimiters:

glyph dec hex oct name
( 40 28 50 LEFT PARENTHESIS
) 41 29 51 RIGHT PARENTHESIS
< 60 3C 74 LESS-THAN SIGN
> 62 3E 76 GREATER-THAN SIGN
[ 91 5B 133 LEFT SQUARE BRACKET
] 93 5D 135 RIGHT SQUARE BRACKET
{ 123 7B 173 LEFT CURLY BRACKET
} 125 7D 175 RIGHT CURLY BRACKET
/ 47 2F 57 SOLIDUS
% 37 25 45 PERCENT SIGN

“Regular characters” are defined as the range [!-~], excluding the above delimiters (PDF32000_2008.pdf:17).

Thus we have !-$, &, ', *-., 0-;, =, ?-Z, \, ^-z, |, and ~ (note that % sits between $ and & and must be left out), which, in a regex character class, looks like: [!-$&'*-.0-;=?-Z\\^-z|~].
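A quick way to double-check a character class like this is to derive the set from the delimiter table and compare. The sketch below does that; isRegular, REGULAR, and mismatches are illustrative names, not anything from the spec. Note that % lies inside the range !-' and has to be carved out.

```javascript
// Every character from ! (0x21) to ~ (0x7E) that is not one of the ten delimiters.
var DELIMITERS = '()<>[]{}/%';

function isRegular(ch) {
  var code = ch.charCodeAt(0);
  return code >= 0x21 && code <= 0x7e && DELIMITERS.indexOf(ch) === -1;
}

// The same set written by hand as a regex character class.
var REGULAR = /^[!-$&'*-.0-;=?-Z\\^-z|~]$/;

// Collect every ASCII character on which the two definitions disagree.
var mismatches = [];
for (var code = 0; code < 128; code++) {
  var ch = String.fromCharCode(code);
  if (isRegular(ch) !== REGULAR.test(ch)) mismatches.push(ch);
}
```

If the class is right, mismatches stays empty.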

A Name is introduced by the / character and ends at the next whitespace or delimiter character. The Name itself does not contain the initial slash. If you want funny characters in your Name, like a space or parentheses, write each one as # followed by its two-digit hexadecimal code (e.g., #20 for a space). Express the # character itself as #23.
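The # escaping can be sketched as a pair of helpers. decodeName and encodeName are hypothetical names, and the sketch assumes every character in the Name is a single byte (code 255 or below).

```javascript
// Decode a Name token such as "/Lime#20Green" into its value "Lime Green".
function decodeName(token) {
  if (token.charAt(0) !== '/') throw new Error('not a Name: ' + token);
  return token.slice(1).replace(/#([0-9A-Fa-f]{2})/g, function(match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// The reverse: escape # itself plus anything that is not a regular character.
// Assumes character codes of 255 or below, so two hex digits always suffice.
function encodeName(value) {
  return '/' + value.replace(/[^!-"$&'*-.0-;=?-Z\\^-z|~]/g, function(ch) {
    var hex = ch.charCodeAt(0).toString(16).toUpperCase();
    return '#' + (hex.length < 2 ? '0' + hex : hex);
  });
}
```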

Array objects consist of other objects of any type (including arrays), delimited by [ and ] characters. Elements are usually separated by whitespace, but whitespace is optional wherever a delimiter already marks the boundary (e.g., [/A/B(c)]). Arrays can be empty.

Dictionary objects consist of a series of key-value pairs (where the keys are all Names) delimited by << and >> strings. Type and Subtype entries are conventional metadata describing what the dictionary represents, but they are not always required. The pairs may be separated by newlines, or just by spaces.

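Stream objects are delimited by stream and endstream keywords. Stream objects must be immediately preceded by a Dictionary object describing the stream. Its Length entry (the number of bytes between stream and endstream) is required; Filter, when present, names the decoding filter(s) to apply to the data.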

The Null object is a singleton, literally expressed as null, which has the type of “Null”.

Indirect objects (Sec 7.3.10) provide a way of referring to an object. E.g.,

1 0 obj
  (Christopher Brown)
endobj

This puts my name into object number “1” and sets the “generation number” to 0. I can refer to it later as:

1 0 R

Object numbers need not be assigned in sequence, and gaps are allowed; the only requirement is that they be positive integers.

References to never-defined indirect objects are not errors; instead, they are references to the Null object.
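To make the definition/reference pairing concrete, here is a rough sketch that works on plain text. collectObjects and resolve are hypothetical helpers, and the regex assumes the word endobj never appears inside a string or stream body, which a real parser cannot assume.

```javascript
// Collect "N G obj ... endobj" definitions from a chunk of PDF text, keyed by "N G".
// Simplification: assumes "endobj" never occurs inside a string or stream body.
function collectObjects(text) {
  var objects = {};
  var pattern = /(\d+)\s+(\d+)\s+obj\b([\s\S]*?)\bendobj\b/g;
  var match;
  while ((match = pattern.exec(text)) !== null) {
    objects[match[1] + ' ' + match[2]] = match[3].trim();
  }
  return objects;
}

// Resolve a reference like "1 0 R" to the raw text of the object it names.
// A reference to a never-defined object yields the PDF literal "null".
function resolve(objects, reference) {
  var parts = /^(\d+)\s+(\d+)\s+R$/.exec(reference);
  if (!parts) throw new Error('not a reference: ' + reference);
  var key = parts[1] + ' ' + parts[2];
  return objects.hasOwnProperty(key) ? objects[key] : 'null';
}
```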

Headers

The “header line” refers to the very first line of a PDF, e.g., %PDF-1.4.

If a PDF contains binary data, the second line of the PDF must be a comment with at least four binary characters (characters with codes of 128 or greater).
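Both header checks fit in a few lines. In this sketch, readHeader is a hypothetical helper that takes the file contents as a Buffer; it only looks at the first kilobyte and assumes the header really is on the first line.

```javascript
// Inspect the first two lines of a PDF held in a Buffer.
function readHeader(buffer) {
  // 'latin1' maps each byte to exactly one character, so char codes equal byte values.
  var head = buffer.subarray(0, 1024).toString('latin1');
  var lines = head.split(/\r\n|\r|\n/);
  var version = /^%PDF-(\d+\.\d+)/.exec(lines[0]);
  if (!version) throw new Error('missing %PDF- header');
  // Count bytes with codes of 128 or greater on the second line, if it is a comment.
  var second = lines[1] || '';
  var binaryCount = 0;
  if (second.charAt(0) === '%') {
    for (var i = 1; i < second.length; i++) {
      if (second.charCodeAt(i) >= 128) binaryCount++;
    }
  }
  return { version: version[1], binaryMarker: binaryCount >= 4 };
}
```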

Cross-Reference Table

A cross-reference section gives byte offsets of where, in the PDF, to find specific indirect objects.

xref
100 3
0000000016 00000 n
0000000911 00000 n
0000001006 00000 n

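You can also have “free objects” in the cross-reference section, where the lines end with f instead of n. In a free entry, the 10-digit number is not a byte offset but the object number of the next free object, and the 5-digit number is the generation number to use if that object number is reused.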
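A cross-reference section like the one above can be parsed line by line. This is a sketch under simplifying assumptions: parseXrefSection is a hypothetical helper, it takes the section as text starting at the xref keyword, and it stops at the first line that is neither a subsection header nor an entry.

```javascript
// Parse a cross-reference section (as text) into a flat list of entries.
function parseXrefSection(text) {
  var lines = text.split(/\r\n|\r|\n/)
    .map(function(line) { return line.trim(); })
    .filter(function(line) { return line.length > 0; });
  if (lines[0] !== 'xref') throw new Error('expected "xref" keyword');
  var entries = [];
  var objectNumber = 0;
  for (var i = 1; i < lines.length; i++) {
    var header = /^(\d+) (\d+)$/.exec(lines[i]);              // "first count"
    var entry = /^(\d{10}) (\d{5}) ([nf])$/.exec(lines[i]);  // "offset generation n|f"
    if (header) {
      objectNumber = parseInt(header[1], 10);
    } else if (entry) {
      entries.push({
        objectNumber: objectNumber++,
        // For "n" entries: byte offset. For "f" entries: next free object number.
        value: parseInt(entry[1], 10),
        generation: parseInt(entry[2], 10),
        inUse: entry[3] === 'n'
      });
    } else {
      break; // e.g. the "trailer" keyword
    }
  }
  return entries;
}
```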

The byte offset is from the beginning of the PDF file, so you can call

dd bs=1 skip=1006 count=9 <my.pdf 2>&-

And it will return

102 0 obj

Which is the first part of the declaration of indirect object 102 (the subsection starts at object 100, so its third entry describes object 102).

Trailers

“Conforming readers should read a PDF file from its end.”

When reading from the end, you’ll first encounter the magic line %%EOF. The two lines immediately above it are the startxref keyword and the byte offset of the last cross-reference section; above those is the trailer dictionary, like so:

trailer << ... >>
startxref
173
%%EOF
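Reading from the end might look like the sketch below. findStartXref is a hypothetical helper; it assumes the whole file is already in a Buffer and that %%EOF sits within the last kilobyte with nothing but whitespace after it.

```javascript
// Find the byte offset that follows the final "startxref" keyword.
function findStartXref(buffer) {
  // Only the tail of the file is needed; 'latin1' keeps one character per byte.
  var tail = buffer.subarray(Math.max(0, buffer.length - 1024)).toString('latin1');
  var match = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(tail);
  if (!match) throw new Error('startxref not found near end of file');
  return parseInt(match[1], 10);
}
```

The returned number is where to start reading the cross-reference section.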

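Table 15 describes the trailer dictionary in full, but in brief: Size is the total number of entries in the file’s cross-reference table; Root refers to the document catalog; Prev gives the byte offset of the previous cross-reference section, if there is one; and Encrypt, Info, and ID carry the encryption dictionary, the document information dictionary, and the file identifier when present.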

Text

Chapter 5, PDFReference.pdf page 311, has the information on Text content in PDFs.

BT                   % begin text element
  /F13 12 Tf         % Use font F13 (Helvetica) at size 12
  288 720 Td         % The origin is the lower-left, so this puts the cursor at
                     % 288/72 = 4 inches from the left, and 720/72 = 10 inches from the bottom
  (ABC) Tj           % draw the string "ABC"
ET                   % end text element
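As a rough illustration of how the operands and operators line up, the sketch below walks a content stream like the one above and reports where each string is drawn. extractText is a hypothetical helper with heavy simplifications: strings may not contain parentheses or % characters, only Td and Tj are interpreted, and one unit is taken to be 1/72 inch.

```javascript
// Walk a simple content stream and report where each Tj string is drawn.
// Simplifications: no parentheses or % inside strings, only Td and Tj are
// interpreted, and positioning is treated as a plain translation.
function extractText(content) {
  var withoutComments = content.replace(/%.*$/gm, '');
  var tokens = withoutComments.match(/\([^()]*\)|\S+/g) || [];
  var operands = [];
  var results = [];
  var x = 0, y = 0;
  tokens.forEach(function(token) {
    if (token === 'BT') {
      x = 0; y = 0; operands = [];          // a text object starts at the origin
    } else if (token === 'Td') {
      x += parseFloat(operands[0]);         // operands come before the operator
      y += parseFloat(operands[1]);
      operands = [];
    } else if (token === 'Tj') {
      results.push({ text: operands[0].slice(1, -1), xInches: x / 72, yInches: y / 72 });
      operands = [];
    } else if (/^[A-Za-z'"*]+$/.test(token)) {
      operands = [];                        // some other operator: discard its operands
    } else {
      operands.push(token);
    }
  });
  return results;
}
```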

F13 means Helvetica because the Resources dictionary that applies to this content stream contains something like:

/Resources
<<
  /Font <<
    /F13 <<
      /Type /Font
      /Subtype /Type1
      /BaseFont /Helvetica
    >>
  >>
>>

There are nine text state parameters (the “text state” does not include the current text transformation matrices):

The commands below manipulate the text state as well as the text rendering matrices, and draw text.

Drawing

References

Drawing operator categories (PDF32000_2008.pdf:8.2)

General graphics state: w, J, j, M, d, ri, i, gs
Special graphics state: q, Q, cm
Path construction: m, l, c, v, y, h, re
Path painting: S, s, f, F, f*, B, B*, b, b*, n
Clipping paths: W, W*
Text objects: BT, ET
Text state: Tc, Tw, Tz, TL, Tf, Tr, Ts
Text positioning: Td, TD, Tm, T*
Text showing: Tj, TJ, ', "
Type 3 fonts: d0, d1
Color: CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k
Shading patterns: sh
Inline images: BI, ID, EI
XObjects: Do
Marked content: MP, DP, BMC, BDC, EMC
Compatibility: BX, EX

The “Text state”, “Text positioning”, and “Text showing” operators only apply between BT and ET markers.

Font handling

Font objects are represented by a PDF dictionary whose Type entry is “Font”.

The 14 “standard” fonts:

  1. Times-Roman
  2. Helvetica
  3. Courier
  4. Symbol
  5. Times-Bold
  6. Helvetica-Bold
  7. Courier-Bold
  8. ZapfDingbats
  9. Times-Italic
  10. Helvetica-Oblique
  11. Courier-Oblique
  12. Times-BoldItalic
  13. Helvetica-BoldOblique
  14. Courier-BoldOblique

Font Subsets. When the PDF only uses a subset of a font, it prefixes the font’s name with a tag of six capital letters followed by a plus sign (e.g., EOODIA+Poetica). The letters themselves don’t matter, but they should be unique for each distinct subset. (PDF32000_2008.pdf:9.6.4)
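Stripping the subset tag is a one-line regex. In this sketch, splitSubsetPrefix is a hypothetical helper; it relies on the tag being exactly six capital letters followed by a plus sign.

```javascript
// Split a BaseFont value such as "EOODIA+Poetica" into its subset tag and real name.
function splitSubsetPrefix(baseFont) {
  var match = /^([A-Z]{6})\+(.+)$/.exec(baseFont);
  return match
    ? { tag: match[1], name: match[2] }
    : { tag: null, name: baseFont };   // not a subset font
}
```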

See also “Extraction of Text Content” (PDF32000_2008.pdf:9.10)

FontDescriptor

TODO

Encoding

TODO

CMap

A concise introduction to the CMap format, as far as we’re concerned with it, is in “ToUnicode CMaps” (PDF32000_2008.pdf:9.10.3).

A CMap is declared as a stream in a Font dictionary’s ToUnicode field. The contents of the stream look like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfrange
<0000> <005E> <0020>
<005F> <0061> [<00660066> <00660069> <00660066006C>]
endbfrange
1 beginbfchar
<3A51> <D840DC3E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

The important bits are the beginbfrange ... endbfrange and beginbfchar ... endbfchar segments. These define the mapping from character codes to UTF-16BE-encoded Unicode character sequences.
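Pulling those two segments out of the stream text might look like the following. It is a sketch, not a PostScript interpreter: parseToUnicode and hexToUtf16 are hypothetical helpers, the input is the already-decoded stream as a string, and each bfrange row is assumed to have three columns (low code, high code, then either one destination or a bracketed list of destinations).

```javascript
// Decode hex UTF-16BE code units, e.g. "D840DC3E", into a JS string.
function hexToUtf16(hex) {
  var units = (hex.match(/[0-9A-Fa-f]{4}/g) || []).map(function(u) { return parseInt(u, 16); });
  return String.fromCharCode.apply(null, units);
}

// Build a { characterCode: unicodeString } table from ToUnicode CMap text.
function parseToUnicode(cmap) {
  var mapping = {};
  var section = /begin(bfchar|bfrange)([\s\S]*?)end\1/g;
  var bfchar = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
  var bfrange = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[([^\]]*)\])/g;
  var match, row;
  while ((match = section.exec(cmap)) !== null) {
    var body = match[2];
    if (match[1] === 'bfchar') {
      bfchar.lastIndex = 0;
      while ((row = bfchar.exec(body)) !== null) {
        mapping[parseInt(row[1], 16)] = hexToUtf16(row[2]);
      }
      continue;
    }
    bfrange.lastIndex = 0;
    while ((row = bfrange.exec(body)) !== null) {
      var lo = parseInt(row[1], 16), hi = parseInt(row[2], 16);
      if (row[3] !== undefined) {
        // One destination: bump its last code unit once per successive code.
        var base = row[3].match(/[0-9A-Fa-f]{4}/g).map(function(u) { return parseInt(u, 16); });
        for (var code = lo; code <= hi; code++) {
          var units = base.slice();
          units[units.length - 1] += code - lo;
          mapping[code] = String.fromCharCode.apply(null, units);
        }
      } else {
        // Bracketed list: one destination per code, in order.
        (row[4].match(/<[0-9A-Fa-f]+>/g) || []).forEach(function(dst, i) {
          mapping[lo + i] = hexToUtf16(dst.slice(1, -1));
        });
      }
    }
  }
  return mapping;
}
```

Looking up a character code in the returned table then gives the Unicode text for that code; codes absent from the table have no mapping.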

You might also need to incorporate the stream dictionary’s UseCMap value, but in my experience, it’s not often used.