Using Unicode

From dankwiki
Revision as of 07:10, 23 September 2021 by Dank (talk | contribs)

Unicode 14.0 was released September 14, 2021. There is no "Unicode 13.1", but Emoji 13.1 was released in September 2020 under the auspices of the Unicode Consortium. Unicode 13.0 was released in March 2020.

Good references include:

Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: no characters are removed or replaced in new versions, only added. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible Unicode encoding that can be used in any context designed for ASCII, provided that context is insensitive to characters' meanings.

Interesting Unicode

Isomorphisms of the English alphabet

    • Parenthesized minuscules (U249C+): ⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
    • Circled majuscules (U24B6+): ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    • Circled minuscules (U24D0+): ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
    • Superscript minuscules (missing q): ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ
    • Mathematical sans-serif (U1D5A0+): 𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹
    • Mathematical sans-serif bold (U1D5D4+): 𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭
    • Mathematical sans-serif italic (U1D608+): 𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡
    • Mathematical sans-serif bold italic (U1D63C+): 𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕
    • Mathematical script bold (U1D4D0+): 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩
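
Each range above that is listed with a starting codepoint is 26 contiguous codepoints, so plain ASCII majuscules can be mapped onto it by adding an offset. The following is a minimal sketch of that idea; the function name restyle and the two constants are hypothetical, chosen for illustration. It does not apply to the superscript minuscules, which are scattered across several blocks.

```python
def restyle(text: str, majuscule_base: int) -> str:
    """Replace ASCII A-Z with the 26 codepoints starting at majuscule_base."""
    styled = []
    for ch in text:
        if "A" <= ch <= "Z":
            styled.append(chr(majuscule_base + ord(ch) - ord("A")))
        else:
            styled.append(ch)  # leave everything else untouched
    return "".join(styled)


CIRCLED = 0x24B6           # circled majuscules, from the list above
SANS_SERIF_BOLD = 0x1D5D4  # mathematical sans-serif bold, from the list above

print(restyle("UNICODE 14", CIRCLED))
print(restyle("UNICODE 14", SANS_SERIF_BOLD))
```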

FIXME do minuscules

Isomorphisms of the Greek alphabet


Isomorphisms of the Arabic digits

  • Bold (U1D7CE+): 𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗
  • Doublestruck (U1D7D8+): 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
  • Sans-serif (U1D7E2+): 𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫
  • Sans-serif bold (U1D7EC+): 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵
  • Monospace (U1D7F6+): 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿
  • Seven-segment (U1FBF0+): 🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹
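
The mathematical digit ranges are likewise contiguous, and their characters carry the Unicode decimal-digit property, so Python's int() should parse them directly. The sketch below assumes a non-negative integer and uses the doublestruck range; the helper name to_doublestruck is hypothetical.

```python
import unicodedata

DOUBLESTRUCK_ZERO = 0x1D7D8  # start of the doublestruck digits listed above


def to_doublestruck(number: int) -> str:
    """Render a non-negative integer using doublestruck digits."""
    if number < 0:
        raise ValueError("only non-negative integers are handled in this sketch")
    return "".join(chr(DOUBLESTRUCK_ZERO + int(d)) for d in str(number))


styled = to_doublestruck(2021)
print(styled)
print(int(styled))                      # parsed back as an ordinary integer
print(unicodedata.category(styled[0]))  # general category of the first digit
```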


UTF-8 is the One True Encoding, almost always. See RFC 3629 and Annex D of ISO/IEC 10646.

UTF-8 encodes each codepoint in one to four bytes. ASCII characters (all codepoints less than 0x80) are encoded directly as a single byte. The four-byte maximum arises from RFC 3629 §3, which defines UTF-8 only on codepoints through U+10FFFF (sufficient for the 17 Planes defined as of Unicode 14); if the original 10646 maximum of 0x7FFFFFFF were considered, UTF-8 sequences could run to six bytes.
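
The length of an encoded sequence follows from the codepoint's magnitude alone. As a rough sketch (utf8_length is a hypothetical helper, not a library function), the thresholds can be written out and checked against Python's own encoder:

```python
def utf8_length(cp: int) -> int:
    """Number of bytes RFC 3629 UTF-8 uses for codepoint cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("outside the range RFC 3629 defines")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates have no UTF-8 encoding")
    if cp < 0x80:
        return 1  # ASCII
    if cp < 0x800:
        return 2
    if cp < 0x10000:
        return 3
    return 4


# Cross-check the thresholds against the built-in encoder.
for cp in (0x41, 0xE3, 0x20AC, 0x1D5A0, 0x10FFFF):
    encoded = chr(cp).encode("utf-8")
    assert utf8_length(cp) == len(encoded)
    print(f"U+{cp:04X} -> {encoded.hex()}")
```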

The 2048 codepoints U+D800 through U+DFFF cannot be encoded in UTF-8; they are surrogates, reserved for forming surrogate pairs in UTF-16.
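
This restriction can be observed from Python, whose strict UTF-8 codec refuses surrogates in both directions. A small sketch, with encodable_in_utf8 as a hypothetical helper name:

```python
SURROGATES = range(0xD800, 0xE000)  # U+D800 through U+DFFF inclusive


def encodable_in_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 codec will encode codepoint cp."""
    try:
        chr(cp).encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(len(SURROGATES))
print(encodable_in_utf8(0xD7FF), encodable_in_utf8(0xD800))

# The three-byte pattern that would carry U+D800 is rejected when decoding.
try:
    b"\xed\xa0\x80".decode("utf-8")
except UnicodeDecodeError as err:
    print("rejected:", err.reason)
```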

The octets C0, C1, and F5 through FF never appear in valid UTF-8. ASCII octets never appear as part of a multibyte sequence.

Octets of the form 10xxxxxx are continuation bytes, and can only appear within a sequence begun by a valid initial byte.
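
The rules above partition the 256 possible octet values into a handful of roles. The following sketch classifies a single octet; the function and label names are hypothetical, and it deliberately does not check that a whole sequence is well formed (overlong forms and out-of-range values beginning with E0, F0 or F4 need the following byte as well).

```python
from collections import Counter


def classify_octet(b: int) -> str:
    """Role a single octet can play in a UTF-8 stream."""
    if b < 0x80:
        return "ascii"
    if b < 0xC0:
        return "continuation"  # 10xxxxxx
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"       # never appears in valid UTF-8
    if b < 0xE0:
        return "lead-2"        # starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"        # starts a three-byte sequence
    return "lead-4"            # F0 through F4


print(Counter(classify_octet(b) for b in range(256)))
print([classify_octet(b) for b in "é".encode("utf-8")])
```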


  • Ensure the proper locales are present, and that they are regenerated on package updates. locale -a lists all available locales. You want en_US.UTF-8 or the appropriate regional equivalent (use C.UTF-8 for a region-agnostic UTF-8 locale). Since locale -a may print the codeset as either UTF-8 or utf8, locale -a | grep -iE 'utf-?8$' should generate output. On Debian, run dpkg-reconfigure locales to select generated locales and rebuild the locale database (it uses libc's localedef).
  • Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment (the individual LC_* variables override LANG for their own categories, while LC_ALL overrides LANG and every other LC_* value). On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which subsequently drives update-locale from the same package). This file is sourced by PAM configs and /etc/init.d scripts.
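
The override order among these variables can be illustrated with a small sketch. It only models the lookup for the character-type category (LC_ALL, then LC_CTYPE, then LANG, with empty values treated as unset); the real resolution is performed by setlocale(3), and both function names here are hypothetical.

```python
import os


def effective_ctype(env) -> str:
    """Locale name that would govern character handling, given env."""
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:  # unset and empty are treated alike
            return value
    return "C"     # fallback when nothing is set


def names_utf8(locale_name: str) -> bool:
    """True if the locale name's codeset part is some spelling of UTF-8."""
    base = locale_name.partition("@")[0]  # drop any @modifier
    codeset = base.partition(".")[2]      # text after the first dot
    return codeset.replace("-", "").lower() == "utf8"


chosen = effective_ctype(os.environ)
print(chosen, names_utf8(chosen))
```

This checks only the name exported in the environment; whether that locale has actually been generated is what locale -a reports.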


ext3 and friends use octets for filenames; it is up to applications to interpret them. For VFAT, ISO9660 and some others:

  • Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
  • Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
    • nls=utf8 as an option to mount will work on a per-filesystem basis


  • Check whether UTF-8 mode is being used in the terminal driver via vt-is-utf8 from console-tools.
  • Set it with unicode_start, also from console-tools.

Application details


  • In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
      • Greek: * (thus Ctrl-K, a* generates α, Ctrl-K, m* generates μ, etc)
      • Grave accent: ! ( a! -> à, A! -> À )
      • Acute/sharp accent: ' ( a' -> á, A' -> Á )
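  • In Insert mode, Ctrl-V starts a reference input sequence. Use the Unicode decimal codepoint, which is limited to three digits (a maximum of 255); larger codepoints can be entered in hexadecimal as Ctrl-V, u followed by four hex digits. Examples: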
    • Ctrl-V, 227 generates ã
    • Ctrl-V, 167 generates §
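
To find the number to type for a given character, the codepoint can be looked up programmatically. The sketch below is illustrative only: the function name is hypothetical, and the u and U hexadecimal forms it emits for codepoints above 255 come from Vim's own help (:help i_CTRL-V_digit) and should be confirmed there.

```python
def vim_ctrl_v_sequence(ch: str) -> str:
    """Describe a Ctrl-V input sequence for a single character."""
    cp = ord(ch)
    if cp <= 255:
        return f"Ctrl-V {cp:03d}"   # three decimal digits
    if cp <= 0xFFFF:
        return f"Ctrl-V u{cp:04x}"  # four hex digits
    return f"Ctrl-V U{cp:08x}"      # eight hex digits


for ch in "ã§α":
    print(ch, vim_ctrl_v_sequence(ch))
```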


See Also