General documentation / cheat sheets for various languages and services

Unicode

The Unicode specification is comprehensive, but needlessly opaque to newcomers. For example, the tables that describe the fields of the Unicode data files use single-letter codes in their own columns, and the keys to those codes sit further up in the document (instead of in simple column headers).

The current (as of 2015-04-09) Unicode version is 7.0.0, which was released on 2014-10-08. Unicode 8.0.0 has been drafted, but has not been published (the planned release date is 2015-06), and the ReadMe warns: “Do NOT attempt to make use of this data in any Unicode-based product.”

The Unicode 7.0.0 package is available at https://www.unicode.org/Public/7.0.0/

Inside 7.0.0, there is a ReadMe.txt and three directories:

charts/ - Contains a very large (97 MB; 2,186 pages) PDF, CodeCharts.pdf, which provides canonical graphical representations for all characters.
ucd/ - The character data for all Unicode characters. More on this below.
ucdxml/ - The character data, converted to XML, and zipped. I guess this is for people who prefer to parse XML over very simple ASCII text files (who are you people?!?). Furthermore, from the ucdxml.readme.txt (which is not XML, so I’m not sure how the target users are going to be able to read it): “While every effort has been made to ensure consistency of the XML representation with the UCD files, there may be some errors; the UCD files are authoritative.” So, yeah, no thanks, I’ll stick to the text versions.

`ucd/` - Unicode Character Data

Documentation can be found in Unicode Standard Annex #44 (UAX #44, “Unicode Character Database”).

Most of the information below comes from that page.

Table 3. Code Point Label Tags

Tag	General_Category	Note
reserved	`Cn`	`Noncharacter_Code_Point=F`
noncharacter	`Cn`	`Noncharacter_Code_Point=T`
control	`Cc`
private-use	`Co`
surrogate	`Cs`
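
As a rough illustration of Table 3, the sketch below builds a code point label such as `<control-0009>` from the `General_Category` value. It relies on Python's standard `unicodedata` module, whose data reflects whatever Unicode version your Python ships with (not necessarily 7.0.0), so the `reserved` result can differ between installations. The function names `is_noncharacter` and `code_point_label` are made up for this example.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def code_point_label(cp: int):
    """Return a label like '<control-0009>', or None if no Table 3 tag applies."""
    gc = unicodedata.category(chr(cp))
    if gc == "Cc":
        tag = "control"
    elif gc == "Co":
        tag = "private-use"
    elif gc == "Cs":
        tag = "surrogate"
    elif gc == "Cn":
        # Cn covers both rows of the table; Noncharacter_Code_Point decides which.
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        return None  # assigned character with an ordinary name
    return f"<{tag}-{cp:04X}>"

print(code_point_label(0x0009))  # <control-0009>
print(code_point_label(0xE000))  # <private-use-E000>
print(code_point_label(0xFFFF))  # <noncharacter-FFFF>
```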

Table 4. Default Values for Properties

Property Name	Default Value(s)
`Age`	unassigned
`Bidi_Class`	`L`, `AL`, `R`, `BN`, `ET`
`Block`	`No_Block`
`Canonical_Combining_Class`	`Not_Reordered` (= `0`)
`Decomposition_Type`	`None`
`East_Asian_Width`	`Neutral` (= `N`), `Wide` (= `W`)
`General_Category`	`Cn`
`Line_Break`	`Unknown` (= `XX`), `ID`, `PR`
`Numeric_Type`	`None`
`Numeric_Value`	`NaN`
`Script`	`Unknown` (= `Zzzz`)

Table 8. Property Type Key

Property Type	Symbol	Examples
`Catalog`	`C`	`Age`, `Block`
`Enumeration`	`E`	`Joining_Type`, `Line_Break`
`Binary`	`B`	`Uppercase`, `White_Space`
`String`	`S`	`Uppercase_Mapping`, `Case_Folding`
`Numeric`	`N`	`Numeric_Value`
`Miscellaneous`	`M`	`Name`, `Jamo_Short_Name`

Table 10. Extracted Properties

File	Status	Property	Extracted from
`DerivedBidiClass.txt`	`N`	`Bidi_Class`*	`UnicodeData.txt`, field 4
`DerivedBinaryProperties.txt`	`N`	`Bidi_Mirrored`	`UnicodeData.txt`, field 9
`DerivedCombiningClass.txt`	`N`	`Canonical_Combining_Class`	`UnicodeData.txt`, field 3
`DerivedDecompositionType.txt`	`N`/`I`	`Decomposition_Type`	the `<tag>` in `UnicodeData.txt`, field 5
`DerivedEastAsianWidth.txt`	`I`	`East_Asian_Width`*	`EastAsianWidth.txt`, field 1
`DerivedGeneralCategory.txt`	`N`	`General_Category`	`UnicodeData.txt`, field 2
`DerivedJoiningGroup.txt`	`N`	`Joining_Group`	`ArabicShaping.txt`, field 3
`DerivedJoiningType.txt`	`N`	`Joining_Type`*	`ArabicShaping.txt`, field 2
`DerivedLineBreak.txt`	`N`	`Line_Break`*	`LineBreak.txt`, field 1
`DerivedNumericType.txt`	`N`	`Numeric_Type`	`UnicodeData.txt`, fields 6 through 8
`DerivedNumericValues.txt`	`N`	`Numeric_Value`	`UnicodeData.txt`, field 8
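
The extracted files share a simple line format: a code point or `first..last` range, a semicolon, the property value, and an optional `#` comment. Here is a minimal sketch of a parser for that shape. The helper name `parse_ucd_line` is hypothetical, and the sample is a short excerpt written in the style of `DerivedGeneralCategory.txt`, not the whole file. Comment-only lines (including the `# @missing` lines that carry defaults) are simply skipped here.

```python
def parse_ucd_line(line: str):
    """Parse 'XXXX[..YYYY] ; value # comment' into (first, last, value), or None."""
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None  # blank or comment-only line
    fields = [f.strip() for f in data.split(";")]
    first, _, last = fields[0].partition("..")
    start = int(first, 16)
    end = int(last, 16) if last else start  # a single code point is a 1-long range
    return start, end, fields[1]

SAMPLE = """\
# sample lines in the style of DerivedGeneralCategory.txt
0020          ; Zs #       SPACE
0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
"""

ranges = [r for r in map(parse_ucd_line, SAMPLE.splitlines()) if r]
print(ranges)  # [(32, 32, 'Zs'), (65, 90, 'Lu')]
```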

Table 12. `General_Category` Values

Abbr	Long	Description
`Lu`	`Uppercase_Letter`	an uppercase letter
`Ll`	`Lowercase_Letter`	a lowercase letter
`Lt`	`Titlecase_Letter`	a digraphic character, with first part uppercase
`LC`	`Cased_Letter`	`Lu` / `Ll` / `Lt`
`Lm`	`Modifier_Letter`	a modifier letter
`Lo`	`Other_Letter`	other letters, including syllables and ideographs
`L`	`Letter`	`Lu` / `Ll` / `Lt` / `Lm` / `Lo`
`Mn`	`Nonspacing_Mark`	a nonspacing combining mark (zero advance width)
`Mc`	`Spacing_Mark`	a spacing combining mark (positive advance width)
`Me`	`Enclosing_Mark`	an enclosing combining mark
`M`	`Mark`	`Mn` / `Mc` / `Me`
`Nd`	`Decimal_Number`	a decimal digit
`Nl`	`Letter_Number`	a letterlike numeric character
`No`	`Other_Number`	a numeric character of other type
`N`	`Number`	`Nd` / `Nl` / `No`
`Pc`	`Connector_Punctuation`	a connecting punctuation mark, like a tie
`Pd`	`Dash_Punctuation`	a dash or hyphen punctuation mark
`Ps`	`Open_Punctuation`	an opening punctuation mark (of a pair)
`Pe`	`Close_Punctuation`	a closing punctuation mark (of a pair)
`Pi`	`Initial_Punctuation`	an initial quotation mark
`Pf`	`Final_Punctuation`	a final quotation mark
`Po`	`Other_Punctuation`	a punctuation mark of other type
`P`	`Punctuation`	`Pc` / `Pd` / `Ps` / `Pe` / `Pi` / `Pf` / `Po`
`Sm`	`Math_Symbol`	a symbol of mathematical use
`Sc`	`Currency_Symbol`	a currency sign
`Sk`	`Modifier_Symbol`	a non-letterlike modifier symbol
`So`	`Other_Symbol`	a symbol of other type
`S`	`Symbol`	`Sm` / `Sc` / `Sk` / `So`
`Zs`	`Space_Separator`	a space character (of various non-zero widths)
`Zl`	`Line_Separator`	U+2028 LINE SEPARATOR only
`Zp`	`Paragraph_Separator`	U+2029 PARAGRAPH SEPARATOR only
`Z`	`Separator`	`Zs` / `Zl` / `Zp`
`Cc`	`Control`	a C0 or C1 control code
`Cf`	`Format`	a format control character
`Cs`	`Surrogate`	a surrogate code point
`Co`	`Private_Use`	a private-use character
`Cn`	`Unassigned`	a reserved unassigned code point or a noncharacter
`C`	`Other`	`Cc` / `Cf` / `Cs` / `Co` / `Cn`
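
The grouped values in Table 12 (`L`, `LC`, `M`, and so on) are shorthand for sets of two-letter values. A small sketch of how a lookup might honour them, using Python's `unicodedata.category` (which only ever returns the two-letter form); `has_category` and `GROUPS` are names invented for this example:

```python
import unicodedata

# LC is the only grouped value that is not just "first letter matches".
GROUPS = {"LC": {"Lu", "Ll", "Lt"}}

def has_category(ch: str, abbr: str) -> bool:
    """True if ch's General_Category equals abbr or belongs to the group abbr."""
    gc = unicodedata.category(ch)
    if abbr in GROUPS:
        return gc in GROUPS[abbr]
    if len(abbr) == 1:
        return gc.startswith(abbr)  # L, M, N, P, S, Z, C
    return gc == abbr

print(has_category("A", "Lu"), has_category("A", "LC"), has_category("A", "L"))
# True True True
print(has_category("\u4e2d", "L"), has_category("\u4e2d", "LC"))  # Lo: a letter, but not cased
# True False
```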

Table 9. `UnicodeData.txt`

#	Field	Type	Status
1	`Name`	`M`	`N`
2	`General_Category`	`E`	`N`
3	`Canonical_Combining_Class`	`N`	`N`
4	`Bidi_Class`	`E`	`N`
5	`Decomposition_Type,Decomposition_Mapping`	`E`, `S`	`N`
6	`Numeric_Value_1`	`E`, `N`	`N`
7	`Numeric_Value_2`	`E`, `N`	`N`
8	`Numeric_Value_3`	`E`, `N`	`N`
9	`Bidi_Mirrored`	`B`	`N`
10	`Unicode_1_Name`	`M`	`I`
11	`ISO_Comment`	`M`	`I`
12	`Simple_Uppercase_Mapping`	`S`	`N`
13	`Simple_Lowercase_Mapping`	`S`	`N`
14	`Simple_Titlecase_Mapping`	`S`	`N`

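The Status column indicates whether the file or field (and its content) is considered Normative (`N`), Informative (`I`), or Provisional (`P`). Field 0 of `UnicodeData.txt`, which Table 9 does not list, is the code point itself.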
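
To make the field numbering concrete, here is a minimal sketch that splits one `UnicodeData.txt` line on semicolons and names the 15 resulting fields (field 0 is the code point, followed by the 14 fields in Table 9). The record type and its attribute names are my own; only the field order comes from the table. The two sample lines are the entries for U+0041 and U+00BD.

```python
from typing import NamedTuple

class UnicodeDataRecord(NamedTuple):
    code_point: int                 # field 0
    name: str                       # field 1
    general_category: str           # field 2
    canonical_combining_class: int  # field 3
    bidi_class: str                 # field 4
    decomposition: str              # field 5: optional <tag> plus mapping
    numeric_1: str                  # field 6
    numeric_2: str                  # field 7
    numeric_3: str                  # field 8
    bidi_mirrored: bool             # field 9: 'Y' or 'N'
    unicode_1_name: str             # field 10
    iso_comment: str                # field 11
    simple_uppercase: str           # field 12
    simple_lowercase: str           # field 13
    simple_titlecase: str           # field 14

def parse_unicode_data_line(line: str) -> UnicodeDataRecord:
    f = line.rstrip("\n").split(";")
    if len(f) != 15:
        raise ValueError(f"expected 15 fields, got {len(f)}")
    return UnicodeDataRecord(
        int(f[0], 16), f[1], f[2], int(f[3]), f[4], f[5],
        f[6], f[7], f[8], f[9] == "Y", *f[10:],
    )

a = parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
half = parse_unicode_data_line(
    "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"
)
print(a.general_category, a.simple_lowercase)  # Lu 0061
print(half.numeric_3, half.decomposition)      # 1/2 <fraction> 0031 2044 0032
```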

Table 14. Compatibility Formatting Tags

Tag	Description
`<font>`	Font variant (for example, a blackletter form)
`<noBreak>`	No-break version of a space or hyphen
`<initial>`	Initial presentation form (Arabic)
`<medial>`	Medial presentation form (Arabic)
`<final>`	Final presentation form (Arabic)
`<isolated>`	Isolated presentation form (Arabic)
`<circle>`	Encircled form
`<super>`	Superscript form
`<sub>`	Subscript form
`<vertical>`	Vertical layout presentation form
`<wide>`	Wide (or zenkaku) compatibility character
`<narrow>`	Narrow (or hankaku) compatibility character
`<small>`	Small variant form (CNS compatibility)
`<square>`	CJK squared font variant
`<fraction>`	Vulgar fraction form
`<compat>`	Otherwise unspecified compatibility character
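
These tags appear at the start of field 5 in `UnicodeData.txt`; a mapping with no tag is a canonical decomposition. As a sketch, the hypothetical helper below separates the tag from the code points. Python's `unicodedata.decomposition` returns the same field text, which makes it easy to try out without downloading anything.

```python
import unicodedata

def split_decomposition(field: str):
    """Split a decomposition field into (tag, [code points]); tag is None if canonical."""
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0).strip("<>")
    return tag, [int(p, 16) for p in parts]

print(split_decomposition(unicodedata.decomposition("\u00bd")))  # VULGAR FRACTION ONE HALF
# ('fraction', [49, 8260, 50])
print(split_decomposition(unicodedata.decomposition("\u00c0")))  # LATIN CAPITAL LETTER A WITH GRAVE
# (None, [65, 768])
```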

Table 15. Canonical_Combining_Class Values

Value	Long	Description
0	`Not_Reordered`	Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing
1	`Overlay`	Marks which overlay a base letter or symbol
7	`Nukta`	Diacritic nukta marks in Brahmi-derived scripts
8	`Kana_Voicing`	Hiragana/Katakana voicing marks
9	`Virama`	Viramas
10	`Ccc10`	Start of fixed position classes
…	…
199		End of fixed position classes
200	`Attached_Below_Left`	Marks attached at the bottom left
202	`Attached_Below`	Marks attached directly below
204		Marks attached at the bottom right
208		Marks attached to the left
210		Marks attached to the right
212		Marks attached at the top left
214	`Attached_Above`	Marks attached directly above
216	`Attached_Above_Right`	Marks attached at the top right
218	`Below_Left`	Distinct marks at the bottom left
220	`Below`	Distinct marks directly below
222	`Below_Right`	Distinct marks at the bottom right
224	`Left`	Distinct marks to the left
226	`Right`	Distinct marks to the right
228	`Above_Left`	Distinct marks at the top left
230	`Above`	Distinct marks directly above
232	`Above_Right`	Distinct marks at the top right
233	`Double_Below`	Distinct marks subtending two bases
234	`Double_Above`	Distinct marks extending above two bases
240	`Iota_Subscript`	Greek iota subscript only
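
The combining class mainly matters for canonical ordering: within a run of marks whose class is non-zero, marks are stably sorted by class, and anything with class 0 ends the run. Below is a minimal sketch of that reordering step only (no decomposition), checked against Python's own NFD normalization. `canonical_reorder` is a name invented for this example, and `unicodedata.combining` returns the numeric value from the table above.

```python
import unicodedata

def canonical_reorder(text: str) -> str:
    """Stable-sort each run of non-zero-class marks by Canonical_Combining_Class."""
    out, run = [], []
    for ch in text:
        if unicodedata.combining(ch) == 0:
            # Class 0 (Not_Reordered) blocks reordering: flush the pending run.
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
        else:
            run.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)

# a + COMBINING ACUTE ACCENT (230, Above) + COMBINING DOT BELOW (220, Below)
s = "a\u0301\u0323"
print([unicodedata.combining(c) for c in s])                 # [0, 230, 220]
print(canonical_reorder(s) == "a\u0323\u0301")               # True
print(canonical_reorder(s) == unicodedata.normalize("NFD", s))  # True
```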