General documentation / cheat sheets for various languages and services

Unicode

The Unicode specification is comprehensive, but needlessly opaque to newcomers. The tables that describe the fields of the Unicode data files have columns of their own, and the key to those columns sits further up in the document (instead of in simple column headers).

The current (as of 2015-04-09) Unicode version is 7.0.0, which was released on 2014-10-08. Unicode 8.0.0 has been drafted, but has not been published (the planned release date is 2015-06), and the ReadMe warns: “Do NOT attempt to make use of this data in any Unicode-based product.”

The Unicode 7.0.0 package is available at http://www.unicode.org/Public/7.0.0/

Inside 7.0.0, there is a ReadMe.txt and three directories:

  1. charts/ - contains a very large (97 MB; 2,186 pages) PDF, CodeCharts.pdf, which provides canonical graphical representations for all characters.
  2. ucd/ - The character data for all Unicode characters. More on this below.
  3. ucdxml/ - The character data, converted to XML, and zipped. I guess this is for people who prefer to parse XML over very simple ASCII text files (who are you people?!?). Furthermore, from the ucdxml.readme.txt (which is not XML, so I’m not sure how the target users are going to be able to read it): “While every effort has been made to ensure consistency of the XML representation with the UCD files, there may be some errors; the UCD files are authoritative.” So, yeah, no thanks, I’ll stick to the text versions.

ucd/ - Unicode Character Data

Documentation can be found in Unicode Standard Annex #44 (UAX #44), “Unicode Character Database”.

Most of the information below comes from that page.

Table 3. Code Point Label Tags

Tag           General_Category  Note
reserved      Cn                Noncharacter_Code_Point=F
noncharacter  Cn                Noncharacter_Code_Point=T
control       Cc
private-use   Co
surrogate     Cs
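
To make the table concrete, here is a minimal Python sketch that picks the tag for a code point. The function names are my own, the `<tag-XXXX>` label format comes from the core specification rather than from the table above, and the reserved/noncharacter split depends on the Unicode version bundled with your Python, so treat it as an illustration only.

```python
import unicodedata

# Tag for each General_Category value that gets a code point label
# (hypothetical helper table, mirroring the rows above).
_TAGS = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def is_noncharacter(cp: int) -> bool:
    """True for the 66 code points with Noncharacter_Code_Point=T."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp: int):
    """Return a label such as '<control-0009>', or None if no tag applies."""
    category = unicodedata.category(chr(cp))
    if category == "Cn":
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        tag = _TAGS.get(category)
    if tag is None:
        return None
    return f"<{tag}-{cp:04X}>"
```

For example, `code_point_label(0x09)` gives `<control-0009>`, while an ordinary character such as U+0041 gives `None` because it has a real name instead of a label.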

Table 4. Default Values for Properties

Property Name              Default Value(s)
Age                        unassigned
Bidi_Class                 L, AL, R, BN, ET
Block                      No_Block
Canonical_Combining_Class  Not_Reordered (= 0)
Decomposition_Type         None
East_Asian_Width           Neutral (= N), Wide (= W)
General_Category           Cn
Line_Break                 Unknown (= XX), ID, PR
Numeric_Type               None
Numeric_Value              NaN
Script                     Unknown (= Zzzz)

Table 8. Property Type Key

Property Type  Symbol  Examples
Catalog        C       Age, Block
Enumeration    E       Joining_Type, Line_Break
Binary         B       Uppercase, White_Space
String         S       Uppercase_Mapping, Case_Folding
Numeric        N       Numeric_Value
Miscellaneous  M       Name, Jamo_Short_Name

Table 10. Extracted Properties

File                          Status  Property                   Extracted from
DerivedBidiClass.txt          N       Bidi_Class*                UnicodeData.txt, field 4
DerivedBinaryProperties.txt   N       Bidi_Mirrored              UnicodeData.txt, field 9
DerivedCombiningClass.txt     N       Canonical_Combining_Class  UnicodeData.txt, field 3
DerivedDecompositionType.txt  N/I     Decomposition_Type         the <tag> in UnicodeData.txt, field 5
DerivedEastAsianWidth.txt     I       East_Asian_Width*          EastAsianWidth.txt, field 1
DerivedGeneralCategory.txt    N       General_Category           UnicodeData.txt, field 2
DerivedJoiningGroup.txt       N       Joining_Group              ArabicShaping.txt, field 3
DerivedJoiningType.txt        N       Joining_Type*              ArabicShaping.txt, field 2
DerivedLineBreak.txt          N       Line_Break*                LineBreak.txt, field 1
DerivedNumericType.txt        N       Numeric_Type               UnicodeData.txt, fields 6 through 8
DerivedNumericValues.txt      N       Numeric_Value              UnicodeData.txt, field 8
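
Most of these extracted files use the same line shape: a code point or a range, a semicolon, the property value, then a `#` comment. Below is a rough Python sketch of a parser for that shape. The function name is made up, and files with extra fields (DerivedNumericValues.txt, for one) would need more handling than this.

```python
def parse_extracted_line(line: str):
    """Parse 'XXXX[..YYYY] ; value # comment' into (first, last, value).

    Returns None for blank lines and comment-only lines.
    """
    # Everything after '#' is a comment.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code_points, value = (part.strip() for part in data.split(";", 1))
    # A single code point is treated as a range of length one.
    first, _, last = code_points.partition("..")
    return int(first, 16), int(last or first, 16), value
```

Given a line like `0041..005A    ; Lu # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z`, this returns `(0x41, 0x5A, "Lu")`.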

Table 12. General_Category Values

Abbr  Long                   Description
Lu    Uppercase_Letter       an uppercase letter
Ll    Lowercase_Letter       a lowercase letter
Lt    Titlecase_Letter       a digraphic character, with first part uppercase
LC    Cased_Letter           Lu / Ll / Lt
Lm    Modifier_Letter        a modifier letter
Lo    Other_Letter           other letters, including syllables and ideographs
L     Letter                 Lu / Ll / Lt / Lm / Lo
Mn    Nonspacing_Mark        a nonspacing combining mark (zero advance width)
Mc    Spacing_Mark           a spacing combining mark (positive advance width)
Me    Enclosing_Mark         an enclosing combining mark
M     Mark                   Mn / Mc / Me
Nd    Decimal_Number         a decimal digit
Nl    Letter_Number          a letterlike numeric character
No    Other_Number           a numeric character of other type
N     Number                 Nd / Nl / No
Pc    Connector_Punctuation  a connecting punctuation mark, like a tie
Pd    Dash_Punctuation       a dash or hyphen punctuation mark
Ps    Open_Punctuation       an opening punctuation mark (of a pair)
Pe    Close_Punctuation      a closing punctuation mark (of a pair)
Pi    Initial_Punctuation    an initial quotation mark
Pf    Final_Punctuation      a final quotation mark
Po    Other_Punctuation      a punctuation mark of other type
P     Punctuation            Pc / Pd / Ps / Pe / Pi / Pf / Po
Sm    Math_Symbol            a symbol of mathematical use
Sc    Currency_Symbol        a currency sign
Sk    Modifier_Symbol        a non-letterlike modifier symbol
So    Other_Symbol           a symbol of other type
S     Symbol                 Sm / Sc / Sk / So
Zs    Space_Separator        a space character (of various non-zero widths)
Zl    Line_Separator         U+2028 LINE SEPARATOR only
Zp    Paragraph_Separator    U+2029 PARAGRAPH SEPARATOR only
Z     Separator              Zs / Zl / Zp
Cc    Control                a C0 or C1 control code
Cf    Format                 a format control character
Cs    Surrogate              a surrogate code point
Co    Private_Use            a private-use character
Cn    Unassigned             a reserved unassigned code point or a noncharacter
C     Other                  Cc / Cf / Cs / Co / Cn
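
Python's standard `unicodedata` module exposes these two-letter abbreviations directly, which makes the grouping rows (L, M, N, P, S, Z, C, and LC) easy to try out. The helper names below are mine, and the results reflect whatever Unicode version your Python ships with, so this is a sketch and not a reference implementation.

```python
import unicodedata

# Major class names, keyed by the first letter of the two-letter abbreviation.
MAJOR_CLASS = {
    "L": "Letter", "M": "Mark", "N": "Number", "P": "Punctuation",
    "S": "Symbol", "Z": "Separator", "C": "Other",
}


def describe(char: str):
    """Return (General_Category abbreviation, major class long name)."""
    abbr = unicodedata.category(char)
    return abbr, MAJOR_CLASS[abbr[0]]


def is_cased_letter(char: str) -> bool:
    """LC (Cased_Letter) is the union of Lu, Ll, and Lt."""
    return unicodedata.category(char) in {"Lu", "Ll", "Lt"}
```

For instance, `describe("\u2028")` returns `("Zl", "Separator")`, and `is_cased_letter("\u01C5")` is true because U+01C5 is a titlecase digraph (Lt).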

Table 9. UnicodeData.txt

#   Field                                      Type  Status
1   Name                                       M     N
2   General_Category                           E     N
3   Canonical_Combining_Class                  N     N
4   Bidi_Class                                 E     N
5   Decomposition_Type, Decomposition_Mapping  E, S  N
6   Numeric_Value_1                            E, N  N
7   Numeric_Value_2                            E, N  N
8   Numeric_Value_3                            E, N  N
9   Bidi_Mirrored                              B     N
10  Unicode_1_Name                             M     I
11  ISO_Comment                                M     I
12  Simple_Uppercase_Mapping                   S     N
13  Simple_Lowercase_Mapping                   S     N
14  Simple_Titlecase_Mapping                   S     N

Field 0, which the table does not list, holds the code point itself. The Status column indicates whether the file or field (and its content) is considered Normative (N), Informative (I), or Provisional (P).
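
Since UnicodeData.txt is just semicolon-separated text, the field table maps onto a very small parser. The following Python sketch splits one line into named fields. The key names are my own lowercase versions of the names in the table, and it deliberately ignores special cases such as the First/Last range entries.

```python
# One name per field, in file order; field 0 is the code point.
FIELD_NAMES = [
    "code_point", "name", "general_category", "canonical_combining_class",
    "bidi_class", "decomposition", "numeric_value_1", "numeric_value_2",
    "numeric_value_3", "bidi_mirrored", "unicode_1_name", "iso_comment",
    "simple_uppercase_mapping", "simple_lowercase_mapping",
    "simple_titlecase_mapping",
]


def parse_unicode_data_line(line: str) -> dict:
    """Split one UnicodeData.txt line into a dict keyed by field name."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    record = dict(zip(FIELD_NAMES, fields))
    # Convert the few fields that are not naturally strings.
    record["code_point"] = int(record["code_point"], 16)
    record["canonical_combining_class"] = int(record["canonical_combining_class"])
    record["bidi_mirrored"] = record["bidi_mirrored"] == "Y"
    return record
```

Feeding it the entry for U+0041, `0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;`, yields a record whose `general_category` is `Lu` and whose `simple_lowercase_mapping` is `0061`. Empty fields stay as empty strings.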

Table 14. Compatibility Formatting Tags

Tag         Description
<font>      Font variant (for example, a blackletter form)
<noBreak>   No-break version of a space or hyphen
<initial>   Initial presentation form (Arabic)
<medial>    Medial presentation form (Arabic)
<final>     Final presentation form (Arabic)
<isolated>  Isolated presentation form (Arabic)
<circle>    Encircled form
<super>     Superscript form
<sub>       Subscript form
<vertical>  Vertical layout presentation form
<wide>      Wide (or zenkaku) compatibility character
<narrow>    Narrow (or hankaku) compatibility character
<small>     Small variant form (CNS compatibility)
<square>    CJK squared font variant
<fraction>  Vulgar fraction form
<compat>    Otherwise unspecified compatibility character
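
These tags show up at the front of field 5 in UnicodeData.txt, and Python's `unicodedata.decomposition()` returns that field as a string. As a small sketch (the function name is hypothetical), the tag can be separated from the code points like this:

```python
import unicodedata


def split_decomposition(char: str):
    """Return (tag, code_points); tag is None for canonical decompositions."""
    raw = unicodedata.decomposition(char)
    if not raw:
        # The character has no decomposition at all.
        return None, []
    parts = raw.split()
    tag = None
    if parts[0].startswith("<"):
        # A leading <tag> marks a compatibility decomposition.
        tag = parts.pop(0)
    return tag, [int(part, 16) for part in parts]
```

So `split_decomposition("\u00BD")` gives `("<fraction>", [0x31, 0x2044, 0x32])`, while `split_decomposition("\u00E9")` gives `(None, [0x65, 0x301])` because that decomposition is canonical and carries no tag.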

Table 15. Canonical_Combining_Class Values

Value  Long                  Description
0      Not_Reordered         Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing
1      Overlay               Marks which overlay a base letter or symbol
7      Nukta                 Diacritic nukta marks in Brahmi-derived scripts
8      Kana_Voicing          Hiragana/Katakana voicing marks
9      Virama                Viramas
10     Ccc10                 Start of fixed position classes
199    (none)                End of fixed position classes
200    Attached_Below_Left   Marks attached at the bottom left
202    Attached_Below        Marks attached directly below
204    (none)                Marks attached at the bottom right
208    (none)                Marks attached to the left
210    (none)                Marks attached to the right
212    (none)                Marks attached at the top left
214    Attached_Above        Marks attached directly above
216    Attached_Above_Right  Marks attached at the top right
218    Below_Left            Distinct marks at the bottom left
220    Below                 Distinct marks directly below
222    Below_Right           Distinct marks at the bottom right
224    Left                  Distinct marks to the left
226    Right                 Distinct marks to the right
228    Above_Left            Distinct marks at the top left
230    Above                 Distinct marks directly above
232    Above_Right           Distinct marks at the top right
233    Double_Below          Distinct marks subtending two bases
234    Double_Above          Distinct marks extending above two bases
240    Iota_Subscript        Greek iota subscript only
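
The numeric values matter mostly because normalization sorts adjacent combining marks by them. Here is a short Python sketch using the standard `unicodedata` module; the variable and function names are just for illustration.

```python
import unicodedata

acute = "\u0301"      # COMBINING ACUTE ACCENT, class 230 (Above)
dot_below = "\u0323"  # COMBINING DOT BELOW, class 220 (Below)


def combining_classes(text: str):
    """Return the Canonical_Combining_Class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


# The marks are entered in the "wrong" order: 230 before 220.
typed = "a" + acute + dot_below

# NFD applies canonical ordering, so the lower class moves first.
reordered = unicodedata.normalize("NFD", typed)
```

Here `combining_classes(typed)` is `[0, 230, 220]` and `combining_classes(reordered)` is `[0, 220, 230]`. The base letter has class 0 (Not_Reordered), so it never moves.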