The Unicode specification is comprehensive, but needlessly opaque to newcomers. The tables that describe the columns of the Unicode data have columns of their own, and those are explained by a key further up in the document (instead of by simple column headers).
The current (as of 2015-04-09) Unicode version is 7.0.0, which was released on 2014-10-08. Unicode 8.0.0 has been drafted, but has not been published (the planned release date is 2015-06), and the ReadMe warns: “Do NOT attempt to make use of this data in any Unicode-based product.”
The Unicode 7.0.0 package is available at https://www.unicode.org/Public/7.0.0/
Inside 7.0.0, there is a ReadMe.txt and three directories:

- charts/: contains a very large (97 MB; 2,186 pages) PDF, CodeCharts.pdf, which provides canonical graphical representations for all characters.
- ucd/: The character data for all Unicode characters. More on this below.
- ucdxml/: The character data, converted to XML, and zipped. I guess this is for people who prefer to parse XML over very simple ASCII text files (who are you people?!?). Furthermore, from the ucdxml.readme.txt (which is not XML, so I’m not sure how the target users are going to be able to read it): “While every effort has been made to ensure consistency of the XML representation with the UCD files, there may be some errors; the UCD files are authoritative.” So, yeah, no thanks, I’ll stick to the text versions.

ucd/ - Unicode Character Data

Documentation can be found at Unicode Standard Annex #44. Most of the information below comes from that page.

Tag | General_Category | Note |
---|---|---|
reserved | Cn | Noncharacter_Code_Point=F |
noncharacter | Cn | Noncharacter_Code_Point=T |
control | Cc | |
private-use | Co | |
surrogate | Cs | |

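
These tags are what UAX #44 uses to build code point labels such as <control-0009> for code points that have no character name. Here is a minimal sketch of that in Python. It assumes the standard unicodedata module, whose data tracks whatever Unicode version your Python ships with (not necessarily 7.0.0), and code_point_label is a helper name I made up for illustration.

```python
import unicodedata

# General_Category -> tag, for the nameless categories other than Cn.
TAGS = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}


def is_noncharacter(cp):
    """True for the 66 code points with Noncharacter_Code_Point=T."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def code_point_label(cp):
    """Return a label like '<control-0009>', or None if no tag applies."""
    gc = unicodedata.category(chr(cp))
    if gc == "Cn":
        # Cn covers two tags; Noncharacter_Code_Point tells them apart.
        tag = "noncharacter" if is_noncharacter(cp) else "reserved"
    else:
        tag = TAGS.get(gc)
    return f"<{tag}-{cp:04X}>" if tag else None


if __name__ == "__main__":
    for cp in (0x0009, 0xE000, 0xD800, 0xFFFF, 0x0041):
        print(f"U+{cp:04X}", code_point_label(cp))
```

Ordinary named characters (like U+0041) come back as None, since they have a real name and need no label.
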
Property Name | Default Value(s) |
---|---|
Age | unassigned |
Bidi_Class | L, AL, R, BN, ET |
Block | No_Block |
Canonical_Combining_Class | Not_Reordered (= 0) |
Decomposition_Type | None |
East_Asian_Width | Neutral (= N), Wide (= W) |
General_Category | Cn |
Line_Break | Unknown (= XX), ID, PR |
Numeric_Type | None |
Numeric_Value | NaN |
Script | Unknown (= Zzzz) |

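
A few of these defaults are easy to see from Python. The sketch below is only an illustration: it assumes the standard unicodedata module (which exposes just a handful of properties), properties_or_defaults is a made-up helper name, and U+0378 is used as an example of an unassigned code point.

```python
import math
import unicodedata


def properties_or_defaults(cp):
    """Look up a few properties; unassigned code points yield the defaults."""
    ch = chr(cp)
    return {
        "General_Category": unicodedata.category(ch),
        "Canonical_Combining_Class": unicodedata.combining(ch),
        # Python reports "no decomposition" as an empty string.
        "Decomposition": unicodedata.decomposition(ch),
        # numeric() raises ValueError unless a fallback is supplied,
        # so NaN is passed in to mirror the Numeric_Value default.
        "Numeric_Value": unicodedata.numeric(ch, math.nan),
    }


if __name__ == "__main__":
    print(properties_or_defaults(0x0378))  # unassigned
    print(properties_or_defaults(0x0035))  # DIGIT FIVE
```
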
Property Type | Symbol | Examples |
---|---|---|
Catalog | C | Age, Block |
Enumeration | E | Joining_Type, Line_Break |
Binary | B | Uppercase, White_Space |
String | S | Uppercase_Mapping, Case_Folding |
Numeric | N | Numeric_Value |
Miscellaneous | M | Name, Jamo_Short_Name |

File | Status | Property | Extracted from |
---|---|---|---|
DerivedBidiClass.txt | N | Bidi_Class * | UnicodeData.txt, field 4 |
DerivedBinaryProperties.txt | N | Bidi_Mirrored | UnicodeData.txt, field 9 |
DerivedCombiningClass.txt | N | Canonical_Combining_Class | UnicodeData.txt, field 3 |
DerivedDecompositionType.txt | N/I | Decomposition_Type | the <tag> in UnicodeData.txt, field 5 |
DerivedEastAsianWidth.txt | I | East_Asian_Width * | EastAsianWidth.txt, field 1 |
DerivedGeneralCategory.txt | N | General_Category | UnicodeData.txt, field 2 |
DerivedJoiningGroup.txt | N | Joining_Group | ArabicShaping.txt, field 3 |
DerivedJoiningType.txt | N | Joining_Type * | ArabicShaping.txt, field 2 |
DerivedLineBreak.txt | N | Line_Break * | LineBreak.txt, field 1 |
DerivedNumericType.txt | N | Numeric_Type | UnicodeData.txt, fields 6 through 8 |
DerivedNumericValues.txt | N | Numeric_Value | UnicodeData.txt, field 8 |

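
The derived files share a simple line format: a code point or a first..last range, a semicolon, the property value, and then a # comment. A rough Python sketch of a parser for one such line follows. The function name is my own, the sample lines are written in the style of DerivedGeneralCategory.txt rather than copied from it, and files with extra fields (such as DerivedNumericValues.txt) would need more than the second field this returns.

```python
def parse_derived_line(line):
    """Parse one line of a Derived*.txt file.

    Returns (first, last, value) with code points as ints, or None for
    blank and comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop the trailing comment
    if not data:
        return None
    fields = [field.strip() for field in data.split(";")]
    code_points, value = fields[0], fields[1]
    first, _, last = code_points.partition("..")
    first = int(first, 16)
    # A single code point is treated as a one-element range.
    last = int(last, 16) if last else first
    return first, last, value


if __name__ == "__main__":
    samples = [
        "# a comment-only line",
        "0041..005A    ; Lu #  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
        "00AA          ; Lo #       FEMININE ORDINAL INDICATOR",
    ]
    for sample in samples:
        print(parse_derived_line(sample))
```
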
General_Category Values

Abbr | Long | Description |
---|---|---|
Lu | Uppercase_Letter | an uppercase letter |
Ll | Lowercase_Letter | a lowercase letter |
Lt | Titlecase_Letter | a digraphic character, with first part uppercase |
LC | Cased_Letter | Lu / Ll / Lt |
Lm | Modifier_Letter | a modifier letter |
Lo | Other_Letter | other letters, including syllables and ideographs |
L | Letter | Lu / Ll / Lt / Lm / Lo |
Mn | Nonspacing_Mark | a nonspacing combining mark (zero advance width) |
Mc | Spacing_Mark | a spacing combining mark (positive advance width) |
Me | Enclosing_Mark | an enclosing combining mark |
M | Mark | Mn / Mc / Me |
Nd | Decimal_Number | a decimal digit |
Nl | Letter_Number | a letterlike numeric character |
No | Other_Number | a numeric character of other type |
N | Number | Nd / Nl / No |
Pc | Connector_Punctuation | a connecting punctuation mark, like a tie |
Pd | Dash_Punctuation | a dash or hyphen punctuation mark |
Ps | Open_Punctuation | an opening punctuation mark (of a pair) |
Pe | Close_Punctuation | a closing punctuation mark (of a pair) |
Pi | Initial_Punctuation | an initial quotation mark |
Pf | Final_Punctuation | a final quotation mark |
Po | Other_Punctuation | a punctuation mark of other type |
P | Punctuation | Pc / Pd / Ps / Pe / Pi / Pf / Po |
Sm | Math_Symbol | a symbol of mathematical use |
Sc | Currency_Symbol | a currency sign |
Sk | Modifier_Symbol | a non-letterlike modifier symbol |
So | Other_Symbol | a symbol of other type |
S | Symbol | Sm / Sc / Sk / So |
Zs | Space_Separator | a space character (of various non-zero widths) |
Zl | Line_Separator | U+2028 LINE SEPARATOR only |
Zp | Paragraph_Separator | U+2029 PARAGRAPH SEPARATOR only |
Z | Separator | Zs / Zl / Zp |
Cc | Control | a C0 or C1 control code |
Cf | Format | a format control character |
Cs | Surrogate | a surrogate code point |
Co | Private_Use | a private-use character |
Cn | Unassigned | a reserved unassigned code point or a noncharacter |
C | Other | Cc / Cf / Cs / Co / Cn |

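
The grouping rows (L, M, N, P, S, Z, C) are just the first letter of the two-letter abbreviations, and LC is the set Lu / Ll / Lt. A small Python sketch of that relationship, assuming the standard unicodedata module (describe is a name I picked for illustration):

```python
import unicodedata

# One-letter groupings, keyed by the first letter of the abbreviation.
MAJOR_CLASSES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}
CASED_LETTER = {"Lu", "Ll", "Lt"}  # the LC grouping


def describe(ch):
    """Return (abbreviation, major class, is cased letter) for one character."""
    gc = unicodedata.category(ch)
    return gc, MAJOR_CLASSES[gc[0]], gc in CASED_LETTER


if __name__ == "__main__":
    for ch in ("A", "\u01C5", "5", " ", "\u0301", "\u20AC", "\n"):
        print(f"U+{ord(ch):04X}", describe(ch))
```
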
UnicodeData.txt
# | Field | Type | Status |
---|---|---|---|
1 | Name | M | N |
2 | General_Category | E | N |
3 | Canonical_Combining_Class | N | N |
4 | Bidi_Class | E | N |
5 | Decomposition_Type, Decomposition_Mapping | E, S | N |
6 | Numeric_Value_1 | E, N | N |
7 | Numeric_Value_2 | E, N | N |
8 | Numeric_Value_3 | E, N | N |
9 | Bidi_Mirrored | B | N |
10 | Unicode_1_Name | M | I |
11 | ISO_Comment | M | I |
12 | Simple_Uppercase_Mapping | S | N |
13 | Simple_Lowercase_Mapping | S | N |
14 | Simple_Titlecase_Mapping | S | N |

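
Each line of UnicodeData.txt is a semicolon-separated record. Field 0 (not listed in the table) is the code point in hex, followed by the 14 fields above. Here is a minimal Python sketch of splitting one line into named fields. The Python field names are my own snake_case renderings of the table, and the two sample lines are the entries for U+0041 and U+00BD as I remember them, so check them against the real file.

```python
from collections import namedtuple

FIELD_NAMES = [
    "code_point",                 # field 0
    "name",                       # 1
    "general_category",           # 2
    "canonical_combining_class",  # 3
    "bidi_class",                 # 4
    "decomposition",              # 5: optional <tag> plus mapping
    "numeric_value_1",            # 6
    "numeric_value_2",            # 7
    "numeric_value_3",            # 8
    "bidi_mirrored",              # 9
    "unicode_1_name",             # 10
    "iso_comment",                # 11
    "simple_uppercase_mapping",   # 12
    "simple_lowercase_mapping",   # 13
    "simple_titlecase_mapping",   # 14
]
UnicodeDataRecord = namedtuple("UnicodeDataRecord", FIELD_NAMES)


def parse_unicode_data_line(line):
    """Split one UnicodeData.txt line into a UnicodeDataRecord."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(fields)}")
    record = UnicodeDataRecord(*fields)
    # Convert the two fields that are always plain numbers; the rest stay text.
    return record._replace(
        code_point=int(record.code_point, 16),
        canonical_combining_class=int(record.canonical_combining_class),
    )


if __name__ == "__main__":
    print(parse_unicode_data_line("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;"))
    print(parse_unicode_data_line(
        "00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;"
    ))
```

Most fields are left as strings on purpose, because empty fields are meaningful (they mean "no value here") and the numeric fields can hold fractions like 1/2.
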
The Status column indicates whether the file (and its content) is considered Normative (N), Informative (I), or Provisional (P).

Tag | Description |
---|---|
<font> | Font variant (for example, a blackletter form) |
<noBreak> | No-break version of a space or hyphen |
<initial> | Initial presentation form (Arabic) |
<medial> | Medial presentation form (Arabic) |
<final> | Final presentation form (Arabic) |
<isolated> | Isolated presentation form (Arabic) |
<circle> | Encircled form |
<super> | Superscript form |
<sub> | Subscript form |
<vertical> | Vertical layout presentation form |
<wide> | Wide (or zenkaku) compatibility character |
<narrow> | Narrow (or hankaku) compatibility character |
<small> | Small variant form (CNS compatibility) |
<square> | CJK squared font variant |
<fraction> | Vulgar fraction form |
<compat> | Otherwise unspecified compatibility character |

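
These tags show up at the start of field 5 in UnicodeData.txt. A mapping with a tag is a compatibility decomposition; a mapping with no tag is a canonical one. A quick Python sketch of pulling the two apart (parse_decomposition is my own helper name; unicodedata.decomposition from the standard library returns the field in the same textual form):

```python
import unicodedata


def parse_decomposition(field):
    """Split field 5 into (tag, code_points).

    tag is None for a canonical decomposition, or for an empty field.
    """
    parts = field.split()
    tag = None
    if parts and parts[0].startswith("<"):
        tag = parts.pop(0).strip("<>")
    return tag, [int(part, 16) for part in parts]


if __name__ == "__main__":
    # U+00BD VULGAR FRACTION ONE HALF: tagged, so a compatibility mapping.
    print(parse_decomposition(unicodedata.decomposition("\u00BD")))
    # U+00C0 LATIN CAPITAL LETTER A WITH GRAVE: untagged, so canonical.
    print(parse_decomposition(unicodedata.decomposition("\u00C0")))
```
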
Value | Long | Description |
---|---|---|
0 | Not_Reordered | Spacing and enclosing marks; also many vowel and consonant signs, even if nonspacing |
1 | Overlay | Marks which overlay a base letter or symbol |
7 | Nukta | Diacritic nukta marks in Brahmi-derived scripts |
8 | Kana_Voicing | Hiragana/Katakana voicing marks |
9 | Virama | Viramas |
10 | Ccc10 | Start of fixed position classes |
… | … | |
199 | | End of fixed position classes |
200 | Attached_Below_Left | Marks attached at the bottom left |
202 | Attached_Below | Marks attached directly below |
204 | | Marks attached at the bottom right |
208 | | Marks attached to the left |
210 | | Marks attached to the right |
212 | | Marks attached at the top left |
214 | Attached_Above | Marks attached directly above |
216 | Attached_Above_Right | Marks attached at the top right |
218 | Below_Left | Distinct marks at the bottom left |
220 | Below | Distinct marks directly below |
222 | Below_Right | Distinct marks at the bottom right |
224 | Left | Distinct marks to the left |
226 | Right | Distinct marks to the right |
228 | Above_Left | Distinct marks at the top left |
230 | Above | Distinct marks directly above |
232 | Above_Right | Distinct marks at the top right |
233 | Double_Below | Distinct marks subtending two bases |
234 | Double_Above | Distinct marks extending above two bases |
240 | Iota_Subscript | Greek iota subscript only |
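
These numbers matter mostly for normalization: within a run of combining marks, marks with a non-zero class are sorted into ascending class order, while class 0 characters stay put. A small Python sketch to see that in action, assuming the standard unicodedata module (combining_classes is a helper name of my own):

```python
import unicodedata

ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, class 230 (Above)
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, class 220 (Below)


def combining_classes(text):
    """Return the Canonical_Combining_Class of each character in text."""
    return [unicodedata.combining(ch) for ch in text]


# The marks are typed in "wrong" order: the 230 mark before the 220 mark.
typed = "a" + ACUTE + DOT_BELOW
normalized = unicodedata.normalize("NFD", typed)

if __name__ == "__main__":
    print(combining_classes(typed))
    print(combining_classes(normalized))
```

After normalization the dot below (220) comes before the acute (230), so two strings that differ only in the order of such marks compare equal once normalized.
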