
Member "harfbuzz-2.6.4/docs/html/unicode-character-categories.html" (29 Oct 2019, 4206 Bytes) of package /linux/misc/harfbuzz-2.6.4.tar.xz:



Unicode character categories

Shaping models are typically specified with respect to how scripts are defined in the Unicode standard.

Every codepoint in the Unicode Character Database (UCD) is assigned a Unicode General Category (UGC), which provides the most fundamental information about the codepoint: whether the codepoint represents a Letter, a Mark, a Number, Punctuation, a Symbol, a Separator, or something else (Other).

These seven classes are the "Major" categories. Each codepoint is further assigned to a "minor" category within its Major category, such as "Letter, uppercase" (Lu) or "Letter, modifier" (Lm).
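As an illustration of the Major/minor split, the following minimal sketch uses Python's standard `unicodedata` module, whose `category()` function returns the two-letter General Category code. The first letter of that code identifies the Major category. The `MAJOR_NAMES` table and the `split_category` helper are names invented for this example, not part of HarfBuzz.

```python
import unicodedata

# Major category names, keyed by the first letter of the two-letter
# General Category code (table written by hand for this example).
MAJOR_NAMES = {
    "L": "Letter",
    "M": "Mark",
    "N": "Number",
    "P": "Punctuation",
    "S": "Symbol",
    "Z": "Separator",
    "C": "Other",
}


def split_category(ch):
    """Return (Major category name, two-letter minor category code)."""
    code = unicodedata.category(ch)
    return MAJOR_NAMES[code[0]], code


print(split_category("A"))       # LATIN CAPITAL LETTER A
print(split_category("\u02B0"))  # MODIFIER LETTER SMALL H
```

The exact results depend on the Unicode version bundled with the Python interpreter, although the categories of long-established codepoints such as these are stable.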

Shaping models are concerned primarily with Letter and Mark codepoints. The minor categories of Mark codepoints are particularly important for shaping. Marks can be nonspacing (Mn), spacing combining (Mc), or enclosing (Me).
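The three Mark subcategories can be inspected in the same way. This is a small sketch, again using Python's standard `unicodedata` module; the `mark_kind` helper and `MARK_KINDS` table are hypothetical names chosen for the example.

```python
import unicodedata

# The three minor categories of the Mark major category.
MARK_KINDS = {
    "Mn": "nonspacing",
    "Mc": "spacing combining",
    "Me": "enclosing",
}


def mark_kind(ch):
    """Return the kind of Mark for ch, or None if ch is not a Mark."""
    return MARK_KINDS.get(unicodedata.category(ch))


print(mark_kind("\u0301"))  # COMBINING ACUTE ACCENT
print(mark_kind("\u093E"))  # DEVANAGARI VOWEL SIGN AA
print(mark_kind("\u20DD"))  # COMBINING ENCLOSING CIRCLE
print(mark_kind("A"))       # not a Mark
```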

In addition to the UGC property, codepoints in the Indic and Southeast Asian scripts are also assigned Unicode Indic Syllabic Category (UISC) and Unicode Indic Positional Category (UIPC) properties that provide more detailed information needed for shaping.

The UISC property sub-categorizes Letters and Marks according to common script-shaping behaviors. For example, UISC distinguishes between consonant letters, vowel letters, and vowel marks. The UIPC property sub-categorizes Mark codepoints by the relative visual position that they occupy (above, below, right, left, or in multiple positions).
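To make the two properties concrete, the sketch below pairs a few Devanagari codepoints with UISC and UIPC values. Python's standard `unicodedata` module does not, as far as we are aware, expose these properties, so the table is a hand-copied excerpt and should be treated as an assumption to verify against the UCD data files (IndicSyllabicCategory.txt and IndicPositionalCategory.txt). Note that the data files name the positions Top, Bottom, Left, and Right. The `INDIC_PROPS` table and `indic_props` helper are hypothetical names.

```python
# Hand-copied excerpt: codepoint -> (UISC value, UIPC value).
# Assumed values for illustration; verify against the UCD data files.
# None stands for "no positional category assigned".
INDIC_PROPS = {
    0x0915: ("Consonant", None),           # DEVANAGARI LETTER KA
    0x0905: ("Vowel_Independent", None),   # DEVANAGARI LETTER A
    0x093E: ("Vowel_Dependent", "Right"),  # DEVANAGARI VOWEL SIGN AA
    0x093F: ("Vowel_Dependent", "Left"),   # DEVANAGARI VOWEL SIGN I
    0x0941: ("Vowel_Dependent", "Bottom"), # DEVANAGARI VOWEL SIGN U
    0x0947: ("Vowel_Dependent", "Top"),    # DEVANAGARI VOWEL SIGN E
}


def indic_props(ch):
    """Look up (UISC, UIPC) for ch, defaulting to ("Other", None)."""
    return INDIC_PROPS.get(ord(ch), ("Other", None))


# All four vowel signs share one UISC value but differ in position.
for ch in "\u093E\u093F\u0941\u0947":
    print(f"U+{ord(ch):04X}", indic_props(ch))
```

A real implementation would generate such a table from the UCD files rather than maintain it by hand.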

Some complex scripts require that the text run be split into syllables. What constitutes a valid syllable in these scripts is specified by regular expressions over the Letter and Mark codepoints that take the UISC and UIPC properties into account.
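The idea of a syllable regular expression can be sketched as follows. This is a deliberately simplified toy, not the grammar used by any real shaping model: each codepoint is mapped to a one-letter class (C for consonant, V for independent vowel, M for vowel mark, H for virama, X for anything else), and the regular expression is then matched against the string of classes rather than against the text itself. The class table, the pattern, and the `split_syllables` helper are all assumptions made for this example.

```python
import re

# Hypothetical one-letter classes for a handful of Devanagari codepoints.
CLASS_OF = {
    "\u0915": "C",  # LETTER KA
    "\u0924": "C",  # LETTER TA
    "\u0930": "C",  # LETTER RA
    "\u0905": "V",  # LETTER A (independent vowel)
    "\u093F": "M",  # VOWEL SIGN I
    "\u0940": "M",  # VOWEL SIGN II
    "\u094D": "H",  # SIGN VIRAMA
}

# Toy syllable grammar over class letters:
#   zero or more consonant+virama pairs, a base consonant, an optional
#   vowel mark; or a lone independent vowel; or any single leftover item.
SYLLABLE = re.compile(r"(?:CH)*CM?|V|.")


def split_syllables(text):
    """Split text into syllables according to the toy grammar above."""
    # One class letter per codepoint, so match offsets in the class
    # string are also valid offsets in the original text.
    classes = "".join(CLASS_OF.get(ch, "X") for ch in text)
    return [text[m.start():m.end()] for m in SYLLABLE.finditer(classes)]


# KA + VIRAMA + TA + VOWEL SIGN I, followed by a bare KA.
print(split_syllables("\u0915\u094D\u0924\u093F\u0915"))
```

Matching on class letters keeps the pattern short and independent of the particular script; production grammars distinguish many more classes and also use positional information.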