The primary free version of the PDF specification: PDF32000_2008.pdf
The GnuPDF Wiki seems to be another good resource, too. (E.g., CCITT_Fax_Filter.)
PDF includes eight basic types of objects: Boolean values, Integer and Real numbers, Strings, Names, Arrays, Dictionaries, Streams, and the null object.
Boolean literals appear as `true` and `false`.
Number literals are the usual. No scientific notation, though.
String literals are denoted by `(` and `)` delimiters, or as hexadecimal data with `<` and `>` delimiters. You can escape literal parentheses with the `\` character, which you can literally express as `\\`. (Balanced pairs of parentheses inside a string don't need escaping.)
A `\` at the end of a line inside a string literal means the line break that follows should be ignored.
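// Contents of a hexadecimal string, without the < and > delimiters
var hexadecimal_string = '7e19ea68d47cd58418bb9001776e808b';
// Each pair of hex digits is one byte
var bytes = hexadecimal_string.match(/[0-9a-fA-F]{2}/g).map(function(pair) { return parseInt(pair, 16); });
// Buffer.from replaces the deprecated `new Buffer(...)` constructor
var string = Buffer.from(bytes).toString('utf8');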
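The escaping rules for literal strings can be sketched the same way. This is a minimal sketch, not a full lexer: `decodeLiteralString` is a name I made up, it expects the text between the outer parentheses, and it returns one JavaScript character per byte. It also handles the `\ddd` octal escapes and the `\n`-style escapes that the spec defines, which aren't covered in the notes above.

```javascript
// Hypothetical helper: decode the body of a PDF literal string,
// i.e. the text between the outer ( and ) delimiters.
function decodeLiteralString(body) {
  var simple = { n: '\n', r: '\r', t: '\t', b: '\b', f: '\f', '(': '(', ')': ')', '\\': '\\' };
  var out = '';
  for (var i = 0; i < body.length; i++) {
    var ch = body[i];
    if (ch !== '\\') { out += ch; continue; }
    var next = body[++i];
    if (next === undefined) break;             // trailing backslash: nothing to escape
    if (next === '\r' || next === '\n') {      // line continuation: drop the line break
      if (next === '\r' && body[i + 1] === '\n') i++;
      continue;
    }
    if (/[0-7]/.test(next)) {                  // octal escape: one to three digits
      var digits = next;
      while (digits.length < 3 && /[0-7]/.test(body[i + 1] || '')) digits += body[++i];
      out += String.fromCharCode(parseInt(digits, 8) & 0xff);
      continue;
    }
    // Known escape, or an unknown one where the backslash is simply dropped
    out += simple.hasOwnProperty(next) ? simple[next] : next;
  }
  return out;
}
```

For example, `decodeLiteralString('a\\(b\\)c')` gives back the text with the escaped parentheses restored.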
Delimiters:
glyph | dec | hex | oct | name |
---|---|---|---|---|
( | 40 | 28 | 50 | LEFT PARENTHESIS |
) | 41 | 29 | 51 | RIGHT PARENTHESIS |
< | 60 | 3C | 74 | LESS-THAN SIGN |
> | 62 | 3E | 76 | GREATER-THAN SIGN |
[ | 91 | 5B | 133 | LEFT SQUARE BRACKET |
] | 93 | 5D | 135 | RIGHT SQUARE BRACKET |
{ | 123 | 7B | 173 | LEFT CURLY BRACKET |
} | 125 | 7D | 175 | RIGHT CURLY BRACKET |
/ | 47 | 2F | 57 | SOLIDUS |
% | 37 | 25 | 45 | PERCENT SIGN |
“Regular characters” are defined as the range `[!-~]`, excluding the above delimiters (PDF32000_2008.pdf:17).
Thus we have `!-$`, `&-'`, `*-.`, `0-;`, `=`, `?-Z`, `\`, `^-z`, `|`, and `~` (note that `%` has to be skipped, since it is a delimiter), which, in a regex character class, looks like: `[!-$&'*-.0-;=?-Z\\^-z|~]`.
A Name is introduced by the `/` character on the left, and ends at the next whitespace or delimiter character on the right. The Name itself does not contain the initial slash. If you want funny characters in your Name, like a space or parentheses, you have to escape them with the `#` character followed by the character's two-digit hexadecimal code (e.g., `#20` for a space). Express the `#` character itself as `#23`.
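Decoding a Name token is then a single replace over the `#xx` escapes. A minimal sketch; `decodeName` is a hypothetical helper name, and it assumes the token has already been split off from the surrounding content.

```javascript
// Hypothetical helper: decode a Name token such as "/Lime#20Green".
function decodeName(token) {
  if (token[0] !== '/') throw new Error('a Name must start with /');
  // Drop the slash, then turn each #xx escape back into the character it stands for
  return token.slice(1).replace(/#([0-9a-fA-F]{2})/g, function(match, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}
```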
Array objects consist of other types (including arrays), delimited by `[` and `]` characters. The array’s distinct children are potentially separated by whitespace, but not necessarily. Arrays can be empty.
Dictionary objects consist of a series of key-value pairs (where the keys are all Names) delimited by `<<` and `>>` strings. `Type` and `Subtype` names are metadata conventions used to describe the dictionary, but not required. The pairs may be separated by newlines, or each item may simply be separated by a space.
Stream objects are delimited by `stream` and `endstream` keywords. Stream objects must be immediately preceded by a Dictionary object with the following key-value pairs:
- `/Length 5190` (required) measures the length of the stream inside the `stream` and `endstream` delimiters, not including a newline after the `stream` keyword.
- `/Filter /FlateDecode` (optional) can also be an array of names, which are applied in order
- `/DecodeParms << /K -1 /Columns 2550 >>` (optional) goes with the Filter; should specify parameters for each Filter in order
- `/DL 19283` (optional) the length of the completely decoded, decompressed stream, perhaps an approximation
- `/F <file specification>` (optional) names an external file that holds the stream data
- `/FFilter ...` (optional) same as `/Filter`, but for `/F`
- `/FDecodeParms` (optional) same as `/DecodeParms`, but for `/F`
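To see `/Length` and `/Filter` working together, here's a rough sketch that builds a tiny stream object in memory and decodes it again. The helper names (`readStreamBytes`, `decodeStream`) are my own, only `FlateDecode` is handled, and `/DecodeParms` is ignored entirely; Node's `zlib.inflateSync` does the decompression.

```javascript
var zlib = require('zlib');

// Hypothetical helper: pull the raw bytes of one stream out of `buffer`, given
// the byte offset of the `stream` keyword and the value of /Length.
function readStreamBytes(buffer, streamKeywordOffset, length) {
  var start = streamKeywordOffset + 'stream'.length;
  // Skip the end-of-line marker after the keyword (CRLF or LF); it isn't stream data
  if (buffer[start] === 0x0d) start++;
  if (buffer[start] === 0x0a) start++;
  return buffer.subarray(start, start + length);
}

// Apply the filters named in /Filter, in order. Only FlateDecode is handled here.
function decodeStream(raw, filters) {
  return filters.reduce(function(data, filter) {
    if (filter === 'FlateDecode') return zlib.inflateSync(data);
    throw new Error('unsupported filter: ' + filter);
  }, raw);
}

// Build a small stream object to exercise the helpers
var compressed = zlib.deflateSync(Buffer.from('BT (ABC) Tj ET'));
var pdfObject = Buffer.concat([
  Buffer.from('<< /Length ' + compressed.length + ' /Filter /FlateDecode >>\nstream\n'),
  compressed,
  Buffer.from('\nendstream'),
]);

var raw = readStreamBytes(pdfObject, pdfObject.indexOf('stream\n'), compressed.length);
var decoded = decodeStream(raw, ['FlateDecode']).toString('latin1');
```

In a real file you'd get the keyword offset and `/Length` from parsing the object, not from `indexOf`.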
The Null object is a singleton, literally expressed as `null`, which has the type of “Null”.
Indirect objects (Sec 7.3.10) provide a way of referring to an object. E.g.,
1 0 obj
(Christopher Brown)
endobj
Puts my name into object number “1” and sets the “generation number” to 0. I can refer to this later as:
1 0 R
Object numbers need not be given in sequence, or even consume all positive integers; the only requirement is that they be positive.
References to never-defined indirect objects are not errors; instead, they are references to the Null object.
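In code, that "missing means null" rule is just a lookup with a fallback. A minimal sketch, assuming a hypothetical in-memory table (`objects`) that has already been filled from parsed `obj ... endobj` blocks:

```javascript
// Hypothetical in-memory object table keyed by "objectNumber generationNumber"
var objects = {};

function defineObject(objectNumber, generationNumber, value) {
  objects[objectNumber + ' ' + generationNumber] = value;
}

// Resolve a reference such as "1 0 R"; never-defined objects resolve to null
function resolveReference(reference) {
  var match = /^(\d+)\s+(\d+)\s+R$/.exec(reference);
  if (!match) throw new Error('not an indirect reference: ' + reference);
  var key = Number(match[1]) + ' ' + Number(match[2]);
  return objects.hasOwnProperty(key) ? objects[key] : null;
}

defineObject(1, 0, 'Christopher Brown');
```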
The “header line” refers to the very first line of a PDF, e.g., `%PDF-1.4`.
If a PDF contains binary data, the second line of the PDF must be a comment with at least four binary characters (characters with codes of 128 or greater).
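Checking both lines takes only a few lines of code. A sketch, with `inspectHeader` as a made-up helper name; it only looks at the first kilobyte and treats the bytes as Latin-1 so that character codes equal byte values.

```javascript
// Hypothetical helper: read the version from the header line and check whether
// the second line is a comment holding at least four bytes with codes >= 128.
function inspectHeader(buffer) {
  var head = buffer.subarray(0, 1024).toString('latin1');
  var lines = head.split(/\r\n|\r|\n/);
  var match = /^%PDF-(\d+\.\d+)/.exec(lines[0]);
  if (!match) return null;
  var second = lines[1] || '';
  var highBytes = second.split('').filter(function(ch) { return ch.charCodeAt(0) >= 128; });
  return {
    version: match[1],
    hasBinaryMarker: second[0] === '%' && highBytes.length >= 4,
  };
}
```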
A cross-reference section gives byte offsets of where, in the PDF, to find specific indirect objects.
xref
100 3
0000000016 00000 n
0000000911 00000 n
0000001006 00000 n
- `100` is the index of the first of `3` consecutively numbered objects (100, 101, 102)
- `0000000016` is a 10-digit number denoting the byte offset of object 100
- `00000` is the 5-digit generation number of that object
- `n` means this is an in-use entry (object)
- `0000000911` is a 10-digit number denoting the byte offset of object 101

You can also have “free objects” in the cross-reference section, where the lines end with `f` instead of `n`, and the 10-digit number is the object number of the next free object.
The byte offset is from the beginning of the PDF file, so you can call
dd bs=1 skip=1006 count=9 <my.pdf 2>&-
And it will return
102 0 obj
Which is the first part of the declaration of indirect object 102 (the third entry in the table above).
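The same lookup in Node might go roughly like this. It's a sketch under a few assumptions: `parseXrefSection` and `peekObject` are names I invented, and only a single subsection (one "first count" header) is handled, where a real file can have several.

```javascript
// Hypothetical helper: parse one cross-reference subsection into a map from
// object number to { offset, generation, inUse }.
function parseXrefSection(text) {
  var lines = text.trim().split(/\r\n|\r|\n/);
  if (lines[0].trim() !== 'xref') throw new Error('expected xref keyword');
  var header = lines[1].trim().split(/\s+/).map(Number);
  var first = header[0], count = header[1];
  var entries = {};
  for (var i = 0; i < count; i++) {
    var match = /^(\d{10}) (\d{5}) ([nf])/.exec(lines[2 + i]);
    if (!match) throw new Error('malformed xref entry: ' + lines[2 + i]);
    entries[first + i] = {
      offset: parseInt(match[1], 10),     // for "f" entries: the next free object number
      generation: parseInt(match[2], 10),
      inUse: match[3] === 'n',
    };
  }
  return entries;
}

var xref = parseXrefSection([
  'xref',
  '100 3',
  '0000000016 00000 n',
  '0000000911 00000 n',
  '0000001006 00000 n',
].join('\n'));

// Equivalent of the dd call: read 9 bytes starting at an object's byte offset
function peekObject(buffer, entry) {
  return buffer.subarray(entry.offset, entry.offset + 9).toString('latin1');
}
```

With the file read into a Buffer (e.g. via `fs.readFileSync`), `peekObject(buffer, xref[102])` does the same job as the `dd` call.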
“Conforming readers should read a PDF file from its end.”
When reading from the end, you’ll first encounter a magic line `%%EOF`, which will be immediately preceded (the lines coming above it in the file) by a trailer and two lines, like so:
trailer << ... >>
startxref
173
%%EOF
`173` is the byte offset of the “last” xref section. If there’s only one xref section, it’ll immediately precede the trailer. If there are multiple, it’ll probably be near the top of the document.

Table 15 describes the trailer dictionary in full, but in brief:
- `/Size`: The total number of entries throughout all the cross-reference tables; “this value shall be 1 greater than the highest object number defined in the file.”
- `/Prev`: The byte offset in the decoded stream from the beginning of the file to the beginning of the previous cross-reference section.
- `/Root`: Reference to a dictionary object describing the PDF document (not every trailer will have this).
- `/Info`: Reference to a dictionary object with metadata about this document, like Author, Title, etc.

Chapter 5, PDFReference.pdf page 311, has the information on Text content in PDFs.
BT % begin text element
/F13 12 Tf % Use font F13 (Helvetica) at size 12
288 720 Td % The origin is the lower-left, so this puts the cursor at
% 288/72 = 4 inches from the left, and 720/72 = 10 inches from the bottom
(ABC) Tj % draw the string "ABC"
ET % end text element
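Content streams are postfix: operands come first, then the operator. A very rough sketch of splitting the example above into operations follows. `tokenize` and `parseOperations` are hypothetical names, and the tokenizer assumes no nested parentheses, no `%` inside strings, and no hex strings, arrays, or dictionaries.

```javascript
// Minimal tokenizer for simple content streams (see the assumptions above)
function tokenize(content) {
  var withoutComments = content.replace(/%[^\r\n]*/g, '');
  return withoutComments.match(/\((?:[^()\\]|\\.)*\)|\/?[^\s()]+/g) || [];
}

// Group tokens into { operator, operands } pairs; operands precede their operator
function parseOperations(content) {
  var operations = [], operands = [];
  tokenize(content).forEach(function(token) {
    var isOperand = /^[(\/]/.test(token) || /^[+-]?(\d+\.?\d*|\.\d+)$/.test(token);
    if (isOperand) {
      operands.push(token);
    } else {
      operations.push({ operator: token, operands: operands });
      operands = [];
    }
  });
  return operations;
}

var operations = parseOperations([
  'BT % begin text element',
  '/F13 12 Tf % Use font F13 (Helvetica) at size 12',
  '288 720 Td',
  '(ABC) Tj % draw the string "ABC"',
  'ET % end text element',
].join('\n'));
```

Each entry in `operations` keeps its operands as raw tokens, so `/F13` and `(ABC)` would still need the Name and string decoding described earlier.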
F13 means Helvetica because, in one of the document meta dictionaries, we saw something like:
/Resources
<<
/Font <<
/F13 <<
/Type /Font
/Subtype /Type1
/BaseFont /Helvetica
>>
>>
>>
There are nine text state parameters (the “text state” does not include the current text transformation matrices):
- `Tc`: Character spacing
- `Tw`: Word spacing
- `Th`: Horizontal scaling
- `Tl`: Leading
- `Tf`: Text font
- `Tfs`: Text font size
- `Tmode`: Text rendering mode
- `Trise`: Text rise
- `Tk`: Text knockout

The commands below manipulate the text state as well as the text rendering matrices, and draw text.
- `fontName fontSize Tf`: set the text’s font and size
- `charSpace Tc`: set the text’s character spacing (`charSpace` is expressed in unscaled text space units). Default is 0. E.g.:
  - `0 Tc` -> My Password
  - `0.25 Tc` -> M y P a s s w o r d
- `wordSpace Tw`: set the text’s word spacing (expressed in unscaled text space units). Default is 0. `Tw` only applies to character 32, “SP”. E.g.:
  - `0 Tw` -> My Password
  - `2.5 Tw` -> My   Password
- `x y Td`: adjust the text’s current position. Specifically, “Move to the start of the next line, offset from the start of the current line by (`x`, `y`).” `x` and `y` are expressed in unscaled text space units. “When executed for the first time after BT, it establishes the text position in the current user coordinate system.” After that first time, it’s a relative adjustment from the start of the current line.
- `x y TD`: Like `x y Td`, but also sets the leading parameter to `-y`. Equivalent to `-y TL x y Td`. That side effect on the leading is the only difference from `Td`; see TABLE 5.5 in PDFReference.pdf page 330. In ACL pdfs, there are only `TD`s, no `Td`s.
- `mode Tr`: set the text rendering mode:
  - `0`: fill (the default)
  - `1`: stroke (outline)
  - `2`: fill then stroke
  - `3`: none (invisible)
  - `4`: fill text and add to clipping path
  - `5`: stroke text and add to clipping path
  - `6`: fill, stroke, and add to clipping path
  - `7`: only add to clipping path
- `scale Tz`: set the horizontal scale; `scale` is a percentage. Default is 100.
- `leading TL`: set the text leading (the vertical distance between the baselines of adjacent lines of text), in unscaled text space units. Default is 0.
- `rise Ts`: set the text rise (move the baseline up or down from its default location), in unscaled text space units. Default is 0. E.g.,
  - `(This text is ) Tj 5 Ts (superscripted) Tj`
  - `(This text is ) Tj -5 Ts (subscripted) Tj`
- `T*`: Move to the start of the next line. Same as `0 -Tl Td`, where `Tl` is whatever the current leading (`TL`) parameter is. The leading is negated because moving down a line means decreasing `y`.
- `a b c d e f Tm`: Set the text matrix (and the text line matrix) to
  a b 0
  c d 0
  e f 1
- `(some string) Tj`: paints the glyphs for the string, using the current text state and graphics state.
- `(string) '`: Move to the next line and paint the given string. Equivalent to `T* (string) Tj`
- `wordSpace charSpace (string) "`: Equivalent to `wordSpace Tw charSpace Tc (string) '`
- `array TJ`: Show one or more text strings. “Each element of `array` can be a string or a number. If the element is a string, this operator shows the string. If it is a number, the operator adjusts the text position by that amount.” The numbers are in thousandths of a text space unit and are subtracted from the current position, so positive values pull the following text closer.

- `lineWidth w`: Set the current line width. The thinnest line that can be rendered is `0 w`.
- `red green blue RG`: Set stroke color. `red`, `green`, and `blue` should range between 0.0 and 1.0.
- `red green blue rg`: Same as `RG`, but for non-stroking operations.
- `gray G`: Set the stroke color to a gray level; `gray` should range between 0.0 (black) and 1.0 (white).
- `gray g`: Same as `G`, but for non-stroking operations.
- `x y m`: move the cursor (the current point) to (`x`, `y`), starting a new subpath. The coordinates are absolute (in user space), not relative to the previous drawing operation. An `m` that directly follows another `m` overrides it, since nothing was drawn in between.
- `x y l`: draw a straight line from the cursor to (`x`, `y`) (again absolute), and set the cursor to that point.
- `h`: close current path.
- `x y width height re`: draw rectangle with lower-left at (`x`, `y`). It behaves like an `m`, three `l`s, and an `h`, so the cursor ends up back at (`x`, `y`).
- `c`, `v`, and `y` are Bezier curve operations. See PDFReference.pdf page 183.
- `S`: stroke current path
- `s`: close and stroke current path. Equivalent to `h S`.
- `f`: fill current path
- `B`: fill and stroke current path
- `W`: “Modify the current clipping path by intersecting it with the current path, using the nonzero winding number rule to determine which regions lie inside the clipping path.”

Drawing operator categories (PDF32000_2008.pdf:8.2)
General graphics state: w, J, j, M, d, ri, i, gs
Special graphics state: q, Q, cm
Path construction: m, l, c, v, y, h, re
Path painting: S, s, f, F, f*, B, B*, b, b*, n
Clipping paths: W, W*
Text objects: BT, ET
Text state: Tc, Tw, Tz, TL, Tf, Tr, Ts
Text positioning: Td, TD, Tm, T*
Text showing: Tj, TJ, ', "
Type 3 fonts: d0, d1
Color: CS, cs, SC, SCN, sc, scn, G, g, RG, rg, K, k
Shading patterns: sh
Inline images: BI, ID, EI
XObjects: Do
Marked content: MP, DP, BMC, BDC, EMC
Compatibility: BX, EX
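The “Text positioning” and “Text showing” operators only apply between BT and ET markers. The “Text state” operators can also appear outside a text object; the values they set carry over into later text objects in the same content stream.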
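To check my understanding of how `Td`, `TD`, `TL`, and `T*` interact, here's a small sketch that tracks where each line starts. It is deliberately simplified: `trackLineStarts` is a made-up name, it assumes the text matrix is a pure translation (no `Tm` scaling or rotation), it takes already-parsed operations with numeric operands, and it records the start of the current line instead of the exact glyph position.

```javascript
// Sketch: follow the start of the current text line through a list of operations
function trackLineStarts(operations) {
  var state = { x: 0, y: 0, leading: 0, insideText: false };
  var positions = [];
  function moveBy(dx, dy) { state.x += dx; state.y += dy; }
  operations.forEach(function(op) {
    var args = op.operands;
    switch (op.operator) {
      case 'BT': state.insideText = true; state.x = 0; state.y = 0; break;
      case 'ET': state.insideText = false; break;
      case 'TL': state.leading = args[0]; break;
      case 'Td': moveBy(args[0], args[1]); break;
      case 'TD': state.leading = -args[1]; moveBy(args[0], args[1]); break;
      case 'T*': moveBy(0, -state.leading); break;
      case 'Tj':
        if (!state.insideText) throw new Error('Tj outside BT ... ET');
        positions.push({ text: args[0], x: state.x, y: state.y });
        break;
    }
  });
  return positions;
}

var positions = trackLineStarts([
  { operator: 'BT', operands: [] },
  { operator: 'TD', operands: [72, 700] },   // also sets the leading to -700
  { operator: 'Tj', operands: ['first'] },
  { operator: 'TL', operands: [14] },
  { operator: 'T*', operands: [] },          // same as 0 -14 Td
  { operator: 'Tj', operands: ['second'] },
  { operator: 'Td', operands: [10, -14] },   // relative to the start of the current line
  { operator: 'Tj', operands: ['third'] },
  { operator: 'ET', operands: [] },
]);
```

The leading is not reset at `BT`, since it is part of the text state and not of the text object.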
Font objects are represented by a PDF object dictionary with the `Type` “`Font`”.
- `Type: string`: “Font”
- `Subtype: string`: possible Subtypes:
  - `Type1`: Most PDF fonts are Type 1 (PDF32000_2008.pdf:9.6.1).
  - `MMType1`: “Multiple Master” fonts are a derivation of Type 1 fonts.
  - `TrueType`: (PDF32000_2008.pdf:9.6.3)
  - `Type3`: Type 3 fonts (PDF32000_2008.pdf:9.6.5)
  - `Type0`: Type 0 fonts (PDF32000_2008.pdf:9.7.6)
- `BaseFont: string`: The name of the font; e.g., “ESAJJT+NimbusRomNo9L-Regu”
- `Widths?: number[]`: An array of numbers designating the widths of each unique glyph. Each `width = Widths[index]` in this array corresponds to the character code with the value `FirstChar + index`. `Widths.length = LastChar + 1 - FirstChar`. If the reader encounters a character code not represented by an entry in `Widths`, it should instead use `FontDescriptor.MissingWidth`.
- `FirstChar?: number`: See `Widths`, above. `FirstChar = LastChar + 1 - Widths.length`
- `LastChar?: number`: See `Widths`, above. `LastChar = Widths.length + FirstChar - 1`
- `FontDescriptor?`: Required unless this font is one of the 14 “standard” fonts. See `FontDescriptor` subsection below.
- `Encoding?: string | Encoding`: As a string, can be one of `MacRomanEncoding`, `MacExpertEncoding`, or `WinAnsiEncoding`.
- `ToUnicode?: stream`: a “CMap” that describes how to convert this font’s character codes into Unicode.

The 14 “standard” fonts:
- Times-Roman, Times-Bold, Times-Italic, Times-BoldItalic
- Helvetica, Helvetica-Bold, Helvetica-Oblique, Helvetica-BoldOblique
- Courier, Courier-Bold, Courier-Oblique, Courier-BoldOblique
- Symbol, ZapfDingbats
Font Subsets. When the PDF only uses a subset of a font, it prefixes the font’s name with a tag of six capital letters followed by a plus sign (e.g., “ESAJJT+”). The letters themselves don’t matter, but they should be unique for each distinct subset. (PDF32000_2008.pdf:9.6.4)
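Stripping the subset tag to get at the underlying font name is a one-line regex. A sketch, with `splitSubsetTag` as a hypothetical helper name:

```javascript
// Hypothetical helper: split a BaseFont name into its subset tag (if any) and the font name
function splitSubsetTag(baseFont) {
  var match = /^([A-Z]{6})\+(.+)$/.exec(baseFont);
  return match ? { tag: match[1], name: match[2] } : { tag: null, name: baseFont };
}
```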
See also “Extraction of Text Content” (PDF32000_2008.pdf:9.10)
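For text extraction you end up needing glyph widths, so here's the `Widths` / `FirstChar` / `MissingWidth` lookup from the Font dictionary as a sketch. `glyphWidth` is my own name, the font is assumed to be already parsed into a plain object, and the numbers in the example are illustrative, not from a real font.

```javascript
// Hypothetical helper: look up the width of a character code in a parsed Font dictionary
function glyphWidth(font, charCode) {
  var index = charCode - font.FirstChar;
  if (font.Widths && index >= 0 && index < font.Widths.length) {
    return font.Widths[index];
  }
  var descriptor = font.FontDescriptor || {};
  return descriptor.MissingWidth || 0;   // MissingWidth defaults to 0 when absent
}

// Illustrative values only
var font = {
  FirstChar: 65,
  LastChar: 67,
  Widths: [722, 667, 667],
  FontDescriptor: { MissingWidth: 500 },
};
```

For fonts other than Type 3, these widths are in thousandths of a text space unit, so they need to be divided by 1000 and multiplied by the font size.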
FontDescriptor
TODO
Encoding
TODO
CMap
A concise introduction to the CMap format, as far as we’re concerned with it, is in “ToUnicode CMaps” (PDF32000_2008.pdf:9.10.3).
A CMap is declared as a stream in a Font dictionary’s `ToUnicode` field. The contents of the stream look like this:
/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo
<< /Registry (Adobe)
/Ordering (UCS)
/Supplement 0
>> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
2 beginbfrange
<0000> <005E> <0020>
<005F> <0061> [<00660066> <00660069> <00660066006C>]
endbfrange
1 beginbfchar
<3A51> <D840DC3E>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end
The important bits are the `beginbfrange` … `endbfrange` and `beginbfchar` … `endbfchar` segments.
These define the mapping from character codes to UTF-16BE-encoded Unicode character sequences.
You might also need to incorporate the stream dictionary’s `UseCMap` value, but in my experience, it’s not often used.
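Pulling the mapping out of those two kinds of segments can be sketched with a few regexes. The assumptions are spelled out here: the helper names (`utf16beHexToString`, `offsetHex`, `parseToUnicode`) are mine, `UseCMap` and the codespace ranges are ignored, and the sample's destination values are there for illustration. Each `bfrange` line has a low code, a high code, and either a single starting destination (incremented for each code) or an array with one destination per code.

```javascript
// Decode a hex string of UTF-16BE code units, e.g. "D840DC3E", into a JavaScript string
function utf16beHexToString(hex) {
  var units = hex.match(/[0-9a-fA-F]{4}/g) || [];
  return String.fromCharCode.apply(null, units.map(function(unit) { return parseInt(unit, 16); }));
}

// Add `offset` to the last code unit of a destination (used for bfrange entries)
function offsetHex(hex, offset) {
  var last = parseInt(hex.slice(-4), 16) + offset;
  return hex.slice(0, -4) + ('0000' + last.toString(16)).slice(-4);
}

// Hypothetical helper: build a { characterCode: string } map from a ToUnicode CMap
function parseToUnicode(cmap) {
  var mapping = {};
  var section, match;
  var bfchar = /beginbfchar([\s\S]*?)endbfchar/g;
  while ((section = bfchar.exec(cmap))) {
    var pair = /<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>/g;
    while ((match = pair.exec(section[1]))) {
      mapping[parseInt(match[1], 16)] = utf16beHexToString(match[2]);
    }
  }
  var bfrange = /beginbfrange([\s\S]*?)endbfrange/g;
  while ((section = bfrange.exec(cmap))) {
    var range = /<([0-9a-fA-F]+)>\s*<([0-9a-fA-F]+)>\s*(?:<([0-9a-fA-F]+)>|\[([^\]]*)\])/g;
    while ((match = range.exec(section[1]))) {
      var low = parseInt(match[1], 16), high = parseInt(match[2], 16);
      // Array form: one destination per code; otherwise a single starting destination
      var list = match[3] === undefined ? (match[4].match(/[0-9a-fA-F]+/g) || []) : null;
      for (var code = low; code <= high; code++) {
        var hex = list ? list[code - low] : offsetHex(match[3], code - low);
        if (hex !== undefined) mapping[code] = utf16beHexToString(hex);
      }
    }
  }
  return mapping;
}

var mapping = parseToUnicode([
  '2 beginbfrange',
  '<0000> <005E> <0020>',
  '<005F> <0061> [<00660066> <00660069> <00660066006C>]',
  'endbfrange',
  '1 beginbfchar',
  '<3A51> <D840DC3E>',
  'endbfchar',
].join('\n'));
```

Keys of `mapping` are character codes as numbers; values may be more than one character long (e.g. ligatures) or a surrogate pair.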