Using Unicode: Difference between revisions

Latest revision as of 10:29, 13 January 2024

Unicode 15.1 was released in September 2023. Unicode 14.0 was released on September 14, 2021. There is no "Unicode 13.1", but Emoji 13.1 was released in September 2020 under the auspices of the Unicode Consortium. Unicode 13.0 was released in March 2020.

Good references include:

The utf8(7) and unicode(7) pages from the Linux man pages
RFC 3629, "UTF-8, a transformation format of ISO 10646"
UTF-8 FAQ for Linux
The Linux Unicode HOWTO
Multilingual text on Linux
The Unicode Book, version 5.0 (the web-only 5.1.0 standard supersedes this) is one of the most beautifully-constructed books I've ever seen, and well worth the price.

Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: no characters are removed or replaced in new versions, only added. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible Unicode encoding that is usable in any context designed for ASCII, so long as that context is insensitive to characters' meanings.

Interesting Unicode

Isomorphisms of the English alphabet

- Parenthesized minuscules (U249C+): ⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
- Circled majuscules (U24B6+): ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
- Circled minuscules (U24D0+): ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
- Superscript minuscules (missing q): ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ
- Bold (U1D400+): 𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙
- Italic (U1D434+): 𝐴𝐵𝐶𝐷𝐸𝐹𝐺𝐻𝐼𝐽𝐾𝐿𝑀𝑁𝑂𝑃𝑄𝑅𝑆𝑇𝑈𝑉𝑊𝑋𝑌𝑍
- Bold italic (U1D468+): 𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁
- Mathematical sans-serif (U1D5A0+): 𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹
- Mathematical sans-serif bold (U1D5D4+): 𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭
- Mathematical sans-serif italic (U1D608+): 𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡
- Mathematical sans-serif italic bold (U1D63C+): 𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕
- Mathematical script: 𝒜ℬ𝒞𝒟ℰℱ𝒢ℋℐ𝒥𝒦ℒℳ𝒩𝒪𝒫𝒬ℛ𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵
- Mathematical script bold: 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩
- Fraktur: 𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ
- Fraktur bold (U1D56C+): 𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
- Monospace (U1D670+): 𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉
- Doublestruck: 𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ
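Most of the rows above are contiguous runs of 26 codepoints, so they can be regenerated from the starting codepoint alone. The following is a minimal Python sketch of that idea; the helper name styled_alphabet is invented for illustration. It only holds for the rows given with a starting codepoint: the script, Fraktur and doublestruck rows have holes where a letter lives in the Letterlike Symbols block instead (ℬ, ℭ and ℂ, for example).

```python
import unicodedata


def styled_alphabet(base: int, count: int = 26) -> str:
    """Return `count` consecutive characters starting at codepoint `base`."""
    return "".join(chr(base + i) for i in range(count))


bold = styled_alphabet(0x1D400)     # mathematical bold capitals
circled = styled_alphabet(0x24B6)   # circled capitals

# NFKC normalization folds both styled alphabets back to plain ASCII capitals.
plain_from_bold = unicodedata.normalize("NFKC", bold)
plain_from_circled = unicodedata.normalize("NFKC", circled)
```

Because these characters carry compatibility decompositions, NFKC is one way to search or compare such text against its plain-ASCII form.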

FIXME do minuscules

Isomorphisms of the Greek alphabet

FIXME

Isomorphisms of the Arabic digits

- Bold (U1D7CE+): 𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗
- Doublestruck (U1D7D8+): 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
- Sans-serif (U1D7E2+): 𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫
- Sans-serif bold (U1D7EC+): 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵
- Monospace (U1D7F6+): 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿
- Seven-segment (U1FBF0+): 🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹
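The styled digits are genuine decimal digits as far as the Unicode character database is concerned, not mere look-alikes. A short Python sketch, restricted to the bold and doublestruck rows and assuming the interpreter's bundled Unicode data covers them:

```python
import unicodedata

# Ten consecutive codepoints from each starting point listed above.
bold_digits = "".join(chr(0x1D7CE + i) for i in range(10))
double_struck_digits = "".join(chr(0x1D7D8 + i) for i in range(10))

# Each character reports its own numeric value.
values = [unicodedata.digit(ch) for ch in double_struck_digits]

# int() accepts any Unicode decimal digits, so a styled "42" still parses.
as_int = int(bold_digits[4] + bold_digits[2])
```

One consequence is that input validation written as "does int() accept it" will let these characters through.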

UTF-8

The One True Encoding, almost always. See RFC 3629 and Annex D of ISO/IEC 10646.

UTF-8 encodes each codepoint in at most four bytes. ASCII characters (all codepoints less than 0x80) are encoded directly as a single byte. The four-byte maximum arises from RFC 3629 §3, which defines UTF-8 only on codepoints through U+10FFFF (sufficient for the 17 planes defined as of Unicode 14); if the 10646 maximum of U+7FFFFFFF were considered, UTF-8 would need up to six bytes per codepoint.
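The length boundaries can be checked empirically. This Python sketch (the helper name utf8_length is illustrative) encodes the codepoints on either side of each boundary and records how many bytes result:

```python
def utf8_length(codepoint: int) -> int:
    """Number of bytes Python's UTF-8 encoder emits for one codepoint."""
    return len(chr(codepoint).encode("utf-8"))


# Last codepoint of each length class, followed by the first of the next.
boundaries = [0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]
lengths = [utf8_length(cp) for cp in boundaries]
```

Here lengths comes out as 1, 2, 2, 3, 3, 4, 4: the transitions fall at U+0080, U+0800 and U+10000.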

The 2048 codepoints U+D800 through U+DFFF cannot be encoded in UTF-8; they are surrogates, metapoints intended for use with UTF-16.
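Python's strict UTF-8 codec enforces this restriction, which makes it easy to demonstrate. A minimal sketch, with can_encode_utf8 as an illustrative helper name:

```python
def can_encode_utf8(codepoint: int) -> bool:
    """True if the strict UTF-8 codec accepts this codepoint."""
    try:
        chr(codepoint).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


# Count the rejected codepoints in a window around the surrogate range.
rejected = sum(1 for cp in range(0xD000, 0xF000) if not can_encode_utf8(cp))
```

Only the range U+D800 through U+DFFF is rejected; its neighbours U+D7FF and U+E000 encode normally.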

The octets C0, C1, and F5 through FF never appear in valid UTF-8. ASCII characters never show up as parts of other, multibyte characters.

Octets of the form 10xxxxxx are continuation bytes, and can only be found after a valid initial byte.
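The octet rules above amount to a small classification by value. The following Python sketch is one way to write it down; the function name and the category labels are invented for illustration, and it looks at single octets only rather than validating whole sequences.

```python
def classify_octet(octet: int) -> str:
    """Classify one byte value (0-255) by its role in UTF-8."""
    if octet < 0x80:
        return "ascii"          # 0xxxxxxx: a complete character on its own
    if octet < 0xC0:
        return "continuation"   # 10xxxxxx: only valid after an initial byte
    if octet in (0xC0, 0xC1) or octet >= 0xF5:
        return "invalid"        # never appears in valid UTF-8
    if octet < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if octet < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    return "lead-4"             # 11110xxx: starts a four-byte sequence


# No byte of a multibyte character falls in the ASCII range.
sample = "\u00e9\u20ac\U0001D400".encode("utf-8")
roles = [classify_octet(b) for b in sample]
```

Because initial and continuation bytes occupy disjoint ranges, a decoder that lands in the middle of a stream can resynchronise by skipping continuation bytes.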

libc

- Ensure the proper locales are present, and are being regenerated on package updates. locale -a will list all available locales. You want en_US.UTF-8 or the appropriate regional equivalent (use C.UTF-8 for an agnostic UTF-8 encoding): locale -a | grep UTF-8$ should generate output. On Debian, run dpkg-reconfigure locales to select generated locales and rebuild the locale database (it uses libc's localedef).
- Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment (the various LC_* variables can override LANG for certain subsets of context, while LC_ALL overrides other LC_* values). On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which subsequently drives update-locale from the same package). This file is sourced by pam configs and /etc/init.d files.
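The override order among LANG, the LC_* variables and LC_ALL can be modelled in a few lines. This Python sketch is a simplified model of the lookup rather than libc's actual code; effective_locale is an illustrative name, and the fallback to "C" when nothing is set is an assumption about the default.

```python
def effective_locale(env: dict, category: str = "LC_CTYPE") -> str:
    """Resolve the locale name for one category from environment variables.

    Precedence: LC_ALL, then the category's own variable, then LANG.
    Variables that are unset or empty are skipped.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"  # assumed default when nothing is set


only_lang = effective_locale({"LANG": "en_US.UTF-8"})
overridden = effective_locale({"LANG": "en_US.UTF-8", "LC_CTYPE": "de_DE.UTF-8"})
forced = effective_locale({"LANG": "en_US.UTF-8", "LC_CTYPE": "de_DE.UTF-8", "LC_ALL": "C"})
```

Passing dict(os.environ) as env shows what the current shell would hand to a program. A stray LC_ALL=C is a common reason a correctly configured LANG appears to be ignored.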

filesystems

ext3 and friends use octets for filenames; it is up to applications to interpret them. For VFAT, ISO9660 and some others:

- Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
- Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
- nls=utf8 as an option to mount will work on a per-filesystem basis
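Since filesystems such as ext3 store filenames as raw octets, an application must be prepared for names that are not valid UTF-8. A minimal Python sketch of one approach, using the surrogateescape error handler so that undecodable bytes survive a round trip; the two helper names are invented here, and Python's own os.fsdecode and os.fsencode behave similarly on POSIX systems using the configured filesystem encoding.

```python
def filename_to_text(raw: bytes) -> str:
    """Decode a filename for display, smuggling invalid bytes through."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_filename(text: str) -> bytes:
    """Recover the exact original octets from a decoded filename."""
    return text.encode("utf-8", errors="surrogateescape")


latin1_name = b"caf\xe9.txt"            # not valid UTF-8
as_text = filename_to_text(latin1_name)
round_tripped = text_to_filename(as_text)
```

The decoded string contains a lone surrogate in place of the bad byte, so it must be re-encoded with the same handler before being passed back to the filesystem.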

Console

Check whether UTF-8 mode is being used in the terminal driver via vt-is-utf8 from console-tools.
Set it with unicode_start, also from console-tools.

Application details

vim

In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
- Greek: * (thus Ctrl-K, a* generates α, Ctrl-K, m* generates μ, etc.)
- Grave accent: ! ( a! -> à, A! -> À )
- Acute/sharp accent: ' ( a' -> á, A' -> Á )
In Insert mode, Ctrl-V starts a reference input sequence. Use the Unicode decimal codepoint. Examples:
- Ctrl-V, 227 generates ã
- Ctrl-V, 167 generates §
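To find the decimal value to type after Ctrl-V for some other character, any tool that reports codepoints will do. A small Python sketch that checks the two examples above (the variable names are illustrative):

```python
# Decimal codepoints as typed after Ctrl-V, mapped to the resulting character.
ctrl_v_examples = {227: "\u00e3", 167: "\u00a7"}   # ã and §

# chr() turns a decimal codepoint into its character; ord() goes the other way.
decoded = {number: chr(number) for number in ctrl_v_examples}
alpha_decimal = ord("\u03b1")   # α, the result of the a* digraph
```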


Dank (talk | contribs)

