General documentation / cheat sheets for various languages and services

PDF format file structure

The primary free version of the PDF specification: PDF32000_2008.pdf

The GnuPDF Wiki seems to be another good resource, too. (E.g., CCITT_Fax_Filter.)

A summary of the basic stuff

PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object.

Boolean literals appear as true and false.

Number literals are the usual decimal integers and reals (e.g., 42, -3.14, .5). No scientific notation, though.

String literals are delimited by ( and ), or written as hexadecimal data between < and >. You can escape literal parentheses with the \ character (balanced pairs of parentheses need no escaping), and a literal backslash is written as \\. A \ at the end of a line inside a string literal means the line break that follows is ignored.

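var hexadecimal_string = '7e19ea68d47cd58418bb9001776e808b';
// split into two-digit hex pairs (only hex digits, unlike \w) and parse each pair as one byte
var bytes = hexadecimal_string.match(/[0-9a-fA-F]{2}/g).map(function(pair) { return parseInt(pair, 16); });
// Buffer.from replaces the deprecated `new Buffer(...)` constructor
var string = Buffer.from(bytes).toString('utf8');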
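
The escape rules for parenthesized strings can be sketched in the same way. This is only a rough sketch: unescapeLiteralString is a made-up helper name, it expects the text between the outer parentheses, and it also handles the \n, \r, \t, \b, \f, and octal \ddd escapes that the spec defines but the summary above doesn't mention.

```javascript
// Hypothetical helper: decode the body of a PDF literal string
// (the text between the outer parentheses).
function unescapeLiteralString(body) {
  var simple = { n: '\n', r: '\r', t: '\t', b: '\b', f: '\f' };
  return body.replace(/\\(\r\n|\r|\n|[0-7]{1,3}|[\s\S])/g, function(match, esc) {
    // backslash followed by a line break: the line break is dropped
    if (/^[\r\n]/.test(esc)) return '';
    // one to three octal digits: a character code
    if (/^[0-7]/.test(esc)) return String.fromCharCode(parseInt(esc, 8) & 0xff);
    if (simple.hasOwnProperty(esc)) return simple[esc];
    // \( \) \\ and any unknown escape: keep the character, drop the backslash
    return esc;
  });
}

console.log(unescapeLiteralString('a\\(b\\)\\\\c'));
console.log(unescapeLiteralString('line one \\\nline two'));
```
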

Delimiters:

glyph	dec	hex	oct	name
(	40	28	50	LEFT PARENTHESIS
)	41	29	51	RIGHT PARENTHESIS
<	60	3C	74	LESS-THAN SIGN
>	62	3E	76	GREATER-THAN SIGN
[	91	5B	133	LEFT SQUARE BRACKET
]	93	5D	135	RIGHT SQUARE BRACKET
{	123	7B	173	LEFT CURLY BRACKET
}	125	7D	175	RIGHT CURLY BRACKET
/	47	2F	57	SOLIDUS
%	37	25	45	PERCENT SIGN

“Regular characters” are defined as the range [!-~], excluding the above delimiters (PDF32000_2008.pdf:17).

Thus we have !-$, &-', *-., 0-;, =, ?-Z, \, ^-z, |, and ~ (note that % sits between $ and &, so the first range has to stop at $), which, in a regex character class, looks like: [!-$&'*-.0-;=?-Z\\^-z|~].
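
A character class like this is easy to get subtly wrong, so here is a quick brute-force check against the delimiter table. It's just a sanity test, not part of any parser; the class tested is [!-$&'*-.0-;=?-Z\\^-z|~], which leaves out all ten delimiters, and the variable names are made up.

```javascript
// Walk every printable ASCII character and compare the regex against the
// definition: "regular" means "in [!-~] and not one of the ten delimiters".
var delimiters = '()<>[]{}/%';
var regular = /^[!-$&'*-.0-;=?-Z\\^-z|~]$/;
var mismatches = [];
for (var code = 0x21; code <= 0x7e; code++) {
  var ch = String.fromCharCode(code);
  var expected = delimiters.indexOf(ch) === -1;
  if (regular.test(ch) !== expected) mismatches.push(ch);
}
// an empty array means the class and the table agree
console.log(mismatches);
```
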

A Name is delimited by the / character on the left, and by whitespace (or another delimiter) on the right. The Name itself does not contain the initial slash. If you want funny characters in your Name, like a space or parentheses, you have to escape them with the # character followed by the character's two-digit hexadecimal code (e.g., #20 for a space). Express the # character itself as #23.
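
A minimal sketch of undoing the # escapes, assuming the token has already been split off from its surroundings (decodeName is a made-up helper, not something from the spec):

```javascript
// Hypothetical helper: turn a raw Name token like '/Lime#20Green' into its value.
function decodeName(token) {
  // drop the leading slash, then expand each #xx escape into the byte it encodes
  return token.slice(1).replace(/#([0-9a-fA-F]{2})/g, function(match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

console.log(decodeName('/Lime#20Green'));
console.log(decodeName('/A#23B'));
```
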

Array objects consist of other types (including arrays), delimited by [ and ] characters. The array’s distinct children are potentially separated by whitespace, but not necessarily. Arrays can be empty.

Dictionary objects consist of a series of key-value pairs (where the keys are all Names) delimited by << and >> strings. Type and Subtype names are metadata conventions used to describe the dictionary, but not required. The pairs may be separated by newlines, or maybe each item will simply be separated by a space.

Stream objects are delimited by stream and endstream keywords. Stream objects must be immediately preceded by a Dictionary object with the following key-value pairs:

/Length 5190 (required) measures the length of the stream inside the stream and endstream delimiters, not including a newline after the stream keyword.
/Filter /FlateDecode (optional) can also be an array of names, which are applied in order
/DecodeParms << /K -1 /Columns 2550 >> (optional) goes with the Filter; should specify parameters for each Filter in order
/DL 19283 (optional) the length of the completely decoded, decompressed stream, perhaps an approximation
/F <file specification> (optional) points to an external file that holds the stream’s data; when present, the bytes between stream and endstream are ignored
/FFilter ... (optional) same as /Filter, but for /F
/FDecodeParms (optional) same as /DecodeParms, but for /F
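
Here is a rough sketch of reading a stream using /Length and /Filter /FlateDecode, with Node's built-in zlib module. It builds its own tiny stream object so it can run standalone. readFlateStream is a made-up name, and the sketch assumes /Length is a direct integer (in real files it can be an indirect reference) and that there is exactly one filter with no /DecodeParms.

```javascript
var zlib = require('zlib');

// Build a tiny stream object in memory so the example is self-contained.
var payload = zlib.deflateSync(Buffer.from('BT (ABC) Tj ET'));
var pdfObject = Buffer.concat([
  Buffer.from('<< /Length ' + payload.length + ' /Filter /FlateDecode >>\nstream\n'),
  payload,
  Buffer.from('\nendstream')
]);

// Hypothetical helper: slice out /Length bytes after the stream keyword and inflate them.
function readFlateStream(buffer) {
  // latin1 maps each byte to one character, so string indices equal byte offsets
  var text = buffer.toString('latin1');
  var length = parseInt(/\/Length\s+(\d+)/.exec(text)[1], 10);
  var keywordEnd = text.indexOf('stream') + 'stream'.length;
  // skip the line break after the keyword (CRLF or LF); it is not part of the data
  var start = keywordEnd + (text[keywordEnd] === '\r' ? 2 : 1);
  return zlib.inflateSync(buffer.slice(start, start + length));
}

console.log(readFlateStream(pdfObject).toString('latin1'));
```
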

The Null object is a singleton, literally expressed as null, which has the type of “Null”.

Indirect objects (Sec 7.3.10) provide a way of referring to an object. E.g.,

1 0 obj
  (Christopher Brown)
endobj

Puts my name into object number “1” and sets the “generation number” to 0. I can refer to this later as:

1 0 R

Object numbers need not be given in sequence, or even consume all positive integers; the only requirement is that they be positive.

References to never-defined indirect objects are not errors; instead, they are references to the Null object.
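
To make the define-then-refer idea concrete, here is a toy resolver. It is a sketch only: a regex like this breaks as soon as a stream happens to contain the word endobj, and the objects table and resolve function are names invented for the example.

```javascript
var source = '1 0 obj\n  (Christopher Brown)\nendobj\n2 0 obj\n  [1 0 R 7 0 R]\nendobj\n';

// Collect "N G obj ... endobj" definitions, keyed by "objectNumber generation".
var objects = {};
var definition = /(\d+)\s+(\d+)\s+obj\s+([\s\S]*?)\s*endobj/g;
var match;
while ((match = definition.exec(source)) !== null) {
  objects[match[1] + ' ' + match[2]] = match[3];
}

// Hypothetical helper: look up "N G R"; a never-defined object resolves to null.
function resolve(reference) {
  var parts = /^(\d+)\s+(\d+)\s+R$/.exec(reference);
  var key = parts[1] + ' ' + parts[2];
  return objects.hasOwnProperty(key) ? objects[key] : null;
}

console.log(resolve('1 0 R'));
console.log(resolve('7 0 R'));
```
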

Headers

The “header line” refers to the very first line of a PDF, e.g., %PDF-1.4.

If a PDF contains binary data, the second line of the PDF must be a comment with at least four binary characters (characters with codes of 128 or greater).
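
A small sketch of checking both lines. inspectHeader is a made-up helper, looking only at the first 64 bytes is an arbitrary choice, and the four marker bytes in the sample are just example values.

```javascript
// Hypothetical helper: report the version from the header line and whether the
// second line is a comment containing at least four bytes with codes >= 128.
function inspectHeader(buffer) {
  var text = buffer.slice(0, 64).toString('latin1');
  var lines = text.split(/\r\n|\r|\n/);
  var version = /^%PDF-(\d\.\d)/.exec(lines[0]);
  var second = lines[1] || '';
  var highBytes = 0;
  for (var i = 0; i < second.length; i++) {
    if (second.charCodeAt(i) >= 128) highBytes++;
  }
  return {
    version: version ? version[1] : null,
    binaryMarker: second.charAt(0) === '%' && highBytes >= 4
  };
}

var headerSample = Buffer.concat([
  Buffer.from('%PDF-1.4\n%'),
  Buffer.from([0xe2, 0xe3, 0xcf, 0xd3]),
  Buffer.from('\n1 0 obj\n')
]);
console.log(inspectHeader(headerSample));
```
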

Cross-Reference Table

A cross-reference section gives byte offsets of where, in the PDF, to find specific indirect objects.

xref
100 3
0000000016 00000 n
0000000911 00000 n
0000001006 00000 n

100 is the index of the first of 3 consecutively numbered objects (100, 101, 102)
0000000016 is a 10-digit number denoting the byte offset of object 100
00000 is the 5-digit generation number of that object
n means this is an in-use entry (object)
0000000911 is a 10-digit number denoting the byte offset of object 101
and so on.

You can also have “free objects” in the cross-reference section, where the lines end with f instead of n, and the 10-digit number is the object number of the next free object rather than a byte offset.
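
The layout above is regular enough to parse line by line. This is a sketch under simplifying assumptions: parseXref is a made-up name, it takes the section as text, and it stops at the first line it doesn't recognize (such as the trailer keyword).

```javascript
// Hypothetical helper: parse a classic cross-reference section into entries.
function parseXref(text) {
  var lines = text.split(/\r\n|\r|\n/).filter(function(line) { return line.trim() !== ''; });
  if (lines[0].trim() !== 'xref') throw new Error('not a cross-reference section');
  var entries = [];
  var next = 0; // object number that the next entry line describes
  for (var i = 1; i < lines.length; i++) {
    var entry = /^(\d{10}) (\d{5}) ([nf])\s*$/.exec(lines[i]);
    var subsection = /^(\d+) (\d+)\s*$/.exec(lines[i]);
    if (entry) {
      entries.push({
        object: next++,
        // for 'f' entries this number is the next free object, not a byte offset
        offset: parseInt(entry[1], 10),
        generation: parseInt(entry[2], 10),
        inUse: entry[3] === 'n'
      });
    } else if (subsection) {
      next = parseInt(subsection[1], 10); // first object number of the subsection
    } else {
      break;
    }
  }
  return entries;
}

var xrefEntries = parseXref('xref\n100 3\n0000000016 00000 n\n0000000911 00000 n\n0000001006 00000 n\n');
console.log(xrefEntries);
```
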

The byte offset is from the beginning of the PDF file, so you can call

dd bs=1 skip=1006 count=9 <my.pdf 2>&-

And it will return

102 0 obj

Which is the first part of the declaration of indirect object 102 (the third entry in the table above, at byte offset 1006).

Trailers

“Conforming readers should read a PDF file from its end.”

When reading from the end, you’ll first encounter a magic line %%EOF, which will be immediately preceded (the lines coming above it in the file) by a trailer and two lines, like so:

trailer << ... >>
startxref
173
%%EOF

173 is the byte offset of the “last” xref section. If there’s only one xref section, it’ll immediately precede the trailer. If there are multiple, it’ll probably be near the top of the document.
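
Reading from the end can be sketched like this. findStartXref is a made-up helper, and looking only at the last 1024 bytes is an assumption made for the example, not something taken from the text above.

```javascript
// Hypothetical helper: find the byte offset that follows the final startxref keyword.
function findStartXref(buffer) {
  var tail = buffer.slice(Math.max(0, buffer.length - 1024)).toString('latin1');
  // anchored at the end of the file, so an earlier %%EOF is ignored
  var match = /startxref\s+(\d+)\s+%%EOF\s*$/.exec(tail);
  return match ? parseInt(match[1], 10) : null;
}

var trailerSample = Buffer.from('trailer << /Size 22 >>\nstartxref\n173\n%%EOF\n');
console.log(findStartXref(trailerSample));
```
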

Table 15 describes the trailer dictionary in full, but in brief:

/Size: The total number of entries throughout all the cross-reference tables; “this value shall be 1 greater than the highest object number defined in the file.”
/Prev: The byte offset in the decoded stream from the beginning of the file to the beginning of the previous cross-reference section.
/Root: Reference to a dictionary object describing the PDF document – not every trailer will have this.
/Info: Reference to a dictionary object with metadata about this document, like Author, Title, etc.

Text

Chapter 5, PDFReference.pdf page 311, has the information on Text content in PDFs.

BT                   % begin text element
  /F13 12 Tf         % Use font F13 (Helvetica) at size 12
  288 720 Td         % The origin is the lower-left, so this puts the cursor at
                     % 288/72 = 4 inches from the left, and 720/72 = 10 inches from the bottom
  (ABC) Tj           % draw the string "ABC"
ET                   % end text element

F13 means Helvetica because, in one of the document meta dictionaries, we saw something like:

/Resources
<<
  /Font <<
    /F13 <<
      /Type /Font
      /Subtype /Type1
      /BaseFont /Helvetica
    >>
  >>
>>

There are nine text state parameters (the “text state” does not include the current text transformation matrices):

Tc Character spacing
Tw Word spacing
Th Horizontal scaling
Tl Leading
Tf Text font
Tfs Text font size
Tmode Text rendering mode
Trise Text rise
Tk Text knockout

The commands below manipulate the text state as well as the text rendering matrices, and draw text.

fontName fontSize Tf: set the text’s font and size
charSpace Tc: set the text’s character spacing (charSpace is expressed in unscaled text space units). Default is 0. E.g.:
- 0 Tc -> My Password
- 0.25 Tc -> M y P a s s w o r d
wordSpace Tw: set the text’s word spacing (expressed in unscaled text space units). Default is 0. E.g.:
- 0 Tw -> My Password
- 2.5 Tw -> My   Password (Tw only applies to character 32, “SP”.)
x y Td: adjust the text’s current position. Specifically, “Move to the start of the next line, offset from the start of the current line by (x, y).” x and y are expressed in unscaled text space units. “When executed for the first time after BT, it establishes the text position in the current user coordinate system.” After that first time, it’s a relative adjustment from the start of the current line.
x y TD: Like x y Td, but also sets the leading parameter to -y. Equivalent to -y TL x y Td. In ACL pdfs, there are only TDs, no Tds. (The side effect on the leading is the only difference from Td; see TABLE 5.5 in PDFReference.pdf page 330.)
mode Tr: set the text rendering mode;
- mode = 0: fill (the default)
- mode = 1: stroke (outline)
- mode = 2: fill then stroke
- mode = 3: none (invisible)
- mode = 4: fill text and add to clipping path
- mode = 5: stroke text and add to clipping path
- mode = 6: fill, stroke, and add to clipping path
- mode = 7: only add to clipping path
scale Tz: set the horizontal scale; scale is a percentage. Default is 100.
leading TL: set the text leading (the vertical distance between the baselines of adjacent lines of text), in unscaled text space units. Default is 0.
rise Ts: set the text rise (move the baseline up or down from its default location), in unscaled text space units. Default is 0. E.g.,
- (This text is ) Tj 5 Ts (superscripted) Tj
- (This text is ) Tj -5 Ts (subscripted) Tj
T*: Move to the start of the next line. Same as 0 -Tl Td, where Tl is the current leading (the value last set by TL or TD).
a b c d e f Tm: Set the text matrix and the text line matrix to
```
a b 0
c d 0
e f 1
```
(some string) Tj: paints the corresponding glyphs in the graphics state.
(string) ': Move to the next line and paint the given string. Equivalent to T* (string) Tj
wordSpace charSpace (string) ": Equivalent to wordSpace Tw charSpace Tc (string) '
array TJ: Show one or more text strings. “Each element of array can be a string or a number. If the element is a string, this operator shows the string. If it is a number, the operator adjusts the text position by that amount.”
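
To see how Td, TD, TL, and T* interact, here is a toy tracer for the start-of-line position. It is a deliberately simplified sketch: it assumes the text matrix is a pure translation (no Tm, no scaling), it ignores glyph advance entirely, and traceLineStarts plus its array-of-arrays operation format are invented for the example.

```javascript
// Hypothetical helper: record where each Tj's line starts.
// Each operation is an array of operands followed by the operator name.
function traceLineStarts(operations) {
  var x = 0, y = 0, leading = 0;
  var starts = [];
  operations.forEach(function(op) {
    var args = op.slice(0, -1);
    switch (op[op.length - 1]) {
      case 'BT': x = 0; y = 0; break;          // text matrices reset to identity
      case 'TL': leading = args[0]; break;
      case 'TD': leading = -args[1];           // TD sets the leading, then acts like Td
      // falls through
      case 'Td': x += args[0]; y += args[1]; break;
      case 'T*': y -= leading; break;          // same as 0 -leading Td
      case 'Tj': starts.push([x, y]); break;   // start of the current line only
    }
  });
  return starts;
}

var lineStarts = traceLineStarts([
  ['BT'],
  [288, 720, 'Td'],
  ['(ABC)', 'Tj'],
  [0, -14, 'TD'],
  ['(DEF)', 'Tj'],
  ['T*'],
  ['(GHI)', 'Tj'],
  ['ET']
]);
console.log(lineStarts);
```
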

Drawing

lineWidth w: Set the current line width. The thinnest line that can be rendered is 0 w.
red green blue RG: Set stroke color. red, green, and blue should range between 0.0 and 1.0.
red green blue rg: Same as RG, but for non-stroking operations.
gray G: Set stroke coloring mode to grayscale; gray should range between 0.0 and 1.0.
gray g: Same as G, but for non-stroking operations.
x y m: begin a new subpath by moving the cursor to (x, y). The coordinates are absolute (in user space), not relative to the previous drawing operation. An m overrides the previous m operation if nothing was drawn in between.
x y l: draw a straight line from the cursor to the absolute point (x, y), and set the cursor to that point.
h: close current path.
x y width height re: draw rectangle with lower-left at (x, y). Equivalent to x y m, three l operations around the corners, then h, so it leaves the cursor back at (x, y).
c, v, and y are Bezier curve operations. See PDFReference.pdf page 183.
S: stroke current path
s: close and stroke current path. Equivalent to h S.
f: fill current path
B: fill and stroke current path
W: “Modify the current clipping path by intersecting it with the current path, using the nonzero winding number rule to determine which regions lie inside the clipping path.”
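
The spec describes re as shorthand for a move, three lines, and a close. A small sketch of that expansion, using the same invented array-of-operands format (expandRectangle is a made-up name):

```javascript
// Hypothetical helper: expand "x y width height re" into the equivalent
// path construction operators, all in absolute coordinates.
function expandRectangle(x, y, width, height) {
  return [
    [x, y, 'm'],
    [x + width, y, 'l'],
    [x + width, y + height, 'l'],
    [x, y + height, 'l'],
    ['h']
  ];
}

console.log(expandRectangle(10, 20, 100, 50));
```
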

References

Drawing operator categories (PDF32000_2008.pdf:8.2)

General graphics state: w, J, j, M, d, ri, i, gs
Special graphics state: q, Q, cm
Path construction: m, l, c, v, y, h, re
Path painting: S, s, f, F, f*, B, B*, b, b*, n
Clipping paths: W, W*
Text objects: BT, ET
Text state: Tc, Tw, Tz, TL, Tf, Tr, Ts
Text positioning: Td, TD, Tm, T*
Text showing: Tj, TJ, ', "
Type 3 fonts: d0, d1
Color: CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k
Shading patterns: sh
Inline images: BI, ID, EI
XObjects: Do
Marked content: MP, DP, BMC, BDC, EMC
Compatibility: BX, EX

The “Text state”, “Text positioning”, and “Text showing” operators only apply between BT and ET markers.

Font handling

Font objects are represented by a PDF object dictionary with the Type “Font”.

Type: string: “Font”
Subtype: string: possible Subtypes:
- Type1: Most PDF fonts are Type 1 (PDF32000_2008.pdf:9.6.1).
- MMType1: “Multiple Master” fonts are an extension of Type 1 fonts.
- TrueType: (PDF32000_2008.pdf:9.6.3)
- Type3: Type 3 fonts (PDF32000_2008.pdf:9.6.5)
- Type0: Type 0 fonts (PDF32000_2008.pdf:9.7.6)
BaseFont: string: The name of the font; e.g., “ESAJJT+NimbusRomNo9L-Regu”
Widths?: number[]: An array of numbers designating the widths of each unique glyph. Each width = Widths[index] in this array corresponds to the character code with the value FirstChar + index. Widths.length = LastChar + 1 - FirstChar. If the reader encounters a character code not represented by an entry in Widths, it should instead use FontDescriptor.MissingWidth.
FirstChar?: number: See Widths, above. FirstChar = LastChar + 1 - Widths.length
LastChar?: number: See Widths, above. LastChar = Widths.length + FirstChar - 1
FontDescriptor?: dictionary: Required unless this font is one of the 14 “standard” fonts. See FontDescriptor subsection below.
Encoding?: string | Encoding: As a string, can be one of MacRomanEncoding, MacExpertEncoding, or WinAnsiEncoding.
ToUnicode?: stream: a “CMap” that describes how to convert this font’s character codes into Unicode.
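
The Widths / FirstChar / MissingWidth lookup described above, as a minimal sketch. It assumes the font dictionary has already been parsed into a plain object with the same key names; glyphWidth and the sample numbers are made up.

```javascript
// Hypothetical helper: width for a character code, falling back to
// FontDescriptor.MissingWidth (which defaults to 0 when absent).
function glyphWidth(font, charCode) {
  var index = charCode - font.FirstChar;
  if (font.Widths && index >= 0 && index < font.Widths.length) {
    return font.Widths[index];
  }
  var descriptor = font.FontDescriptor || {};
  return descriptor.MissingWidth || 0;
}

var sampleFont = {
  FirstChar: 65,
  LastChar: 67,
  Widths: [722, 667, 667],
  FontDescriptor: { MissingWidth: 250 }
};
console.log(glyphWidth(sampleFont, 66), glyphWidth(sampleFont, 90));
```
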

The 14 “standard” fonts:

Times-Roman
Helvetica
Courier
Symbol
Times-Bold
Helvetica-Bold
Courier-Bold
ZapfDingbats
Times-Italic
Helvetica-Oblique
Courier-Oblique
Times-BoldItalic
Helvetica-BoldOblique
Courier-BoldOblique

Font Subsets. When the PDF only uses a subset of a font, it prefixes its name with a tag of six capital letters followed by a plus sign (e.g., the ESAJJT+ in ESAJJT+NimbusRomNo9L-Regu). The letters themselves don’t matter, but they should be unique for each distinct subset. (PDF32000_2008.pdf:9.6.4)
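
Stripping the subset tag off a BaseFont value is a one-regex job; a sketch (splitSubsetTag is a made-up name):

```javascript
// Hypothetical helper: separate the six-letter subset tag from the real font name.
function splitSubsetTag(baseFont) {
  var match = /^([A-Z]{6})\+(.+)$/.exec(baseFont);
  return match ? { tag: match[1], name: match[2] } : { tag: null, name: baseFont };
}

console.log(splitSubsetTag('ESAJJT+NimbusRomNo9L-Regu'));
console.log(splitSubsetTag('Helvetica'));
```
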

See also “Extraction of Text Content” (PDF32000_2008.pdf:9.10)

`FontDescriptor`

TODO

`Encoding`

TODO

`CMap`

A concise introduction to the CMap format, as far as we’re concerned with it, is in “ToUnicode CMaps” (PDF32000_2008.pdf:9.10.3).

A CMap is declared as a stream in a Font dictionary’s ToUnicode field. The contents of the stream look like this:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfrange
<0000> <005E> <0020>
<005F> <0061> [<00660066> <00660069> <00660066006C>]
endbfrange
1 beginbfchar
<3A51> <D840DC3E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end

The important bits are the beginbfrange … endbfrange and beginbfchar … endbfchar segments. These define the mapping from character codes to UTF-16BE-encoded Unicode character sequences.
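
A rough sketch of pulling those two segments out of a decoded ToUnicode stream. Assumptions: the stream is already decompressed text, each bfrange entry has three values (first code, last code, then either a starting destination or a bracketed array of destinations), and destinations are whole UTF-16BE code units. parseToUnicode and hexToUtf16 are made-up names.

```javascript
// Decode a hex string such as 'D840DC3E' as a sequence of UTF-16BE code units.
function hexToUtf16(hex) {
  var out = '';
  for (var i = 0; i < hex.length; i += 4) {
    out += String.fromCharCode(parseInt(hex.slice(i, i + 4), 16));
  }
  return out;
}

// Hypothetical helper: build a { characterCode: unicodeString } table.
function parseToUnicode(cmap) {
  var mapping = {};
  var section = /begin(bfchar|bfrange)([\s\S]*?)end\1/g;
  var match, item;
  while ((match = section.exec(cmap)) !== null) {
    if (match[1] === 'bfchar') {
      var pair = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>/g;
      while ((item = pair.exec(match[2])) !== null) {
        mapping[parseInt(item[1], 16)] = hexToUtf16(item[2]);
      }
    } else {
      var range = /<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*(?:<([0-9A-Fa-f]+)>|\[([^\]]*)\])/g;
      while ((item = range.exec(match[2])) !== null) {
        var first = parseInt(item[1], 16), last = parseInt(item[2], 16);
        var list = item[4] ? item[4].match(/[0-9A-Fa-f]+/g) : null;
        for (var code = first; code <= last; code++) {
          if (list) {
            // array form: one destination per code in the range
            mapping[code] = hexToUtf16(list[code - first]);
          } else {
            // single start value: bump its last code unit for each successive code
            var lastUnit = parseInt(item[3].slice(-4), 16) + (code - first);
            mapping[code] = hexToUtf16(item[3].slice(0, -4)) + String.fromCharCode(lastUnit);
          }
        }
      }
    }
  }
  return mapping;
}

var sampleCMap = [
  '2 beginbfrange',
  '<0000> <005E> <0020>',
  '<005F> <0061> [<00660066> <00660069> <00660066006C>]',
  'endbfrange',
  '1 beginbfchar',
  '<3A51> <D840DC3E>',
  'endbfchar'
].join('\n');
var toUnicode = parseToUnicode(sampleCMap);
console.log(toUnicode[0x41], toUnicode[0x5f], toUnicode[0x61]);
```

The surrogate pair D840 DC3E stays as two UTF-16 code units in the JavaScript string, which is how JavaScript represents characters outside the Basic Multilingual Plane anyway.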

You might also need to incorporate the stream dictionary’s UseCMap value, but in my experience, it’s not often used.