Using Unicode

From dankwiki

Latest revision as of 10:29, 13 January 2024

Unicode 15.1 was released September 2023. Unicode 14.0 was released September 14, 2021. There is no "Unicode 13.1", but Emoji 13.1 was released in September 2020 under the auspices of the Unicode Consortium. Unicode 13.0 was released in March 2020.

Good references include:

Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: no characters are removed or replaced in new versions, only added. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible Unicode encoding, usable in any context designed for ASCII but insensitive to characters' meanings.

Interesting Unicode

Isomorphisms of the English alphabet

    • Parenthesized minuscules (U249C+): ⒜⒝⒞⒟⒠⒡⒢⒣⒤⒥⒦⒧⒨⒩⒪⒫⒬⒭⒮⒯⒰⒱⒲⒳⒴⒵
    • Circled majuscules (U24B6+): ⒶⒷⒸⒹⒺⒻⒼⒽⒾⒿⓀⓁⓂⓃⓄⓅⓆⓇⓈⓉⓊⓋⓌⓍⓎⓏ
    • Circled minuscules (U24D0+): ⓐⓑⓒⓓⓔⓕⓖⓗⓘⓙⓚⓛⓜⓝⓞⓟⓠⓡⓢⓣⓤⓥⓦⓧⓨⓩ
    • Superscript minuscules (missing q): ᵃᵇᶜᵈᵉᶠᵍʰⁱʲᵏˡᵐⁿᵒᵖʳˢᵗᵘᵛʷˣʸᶻ
    • Bold (U1D400+): 𝐀𝐁𝐂𝐃𝐄𝐅𝐆𝐇𝐈𝐉𝐊𝐋𝐌𝐍𝐎𝐏𝐐𝐑𝐒𝐓𝐔𝐕𝐖𝐗𝐘𝐙
    • Italic (U1D434+): 𝐴𝐵𝐶𝐷𝐸𝐹𝐺𝐻𝐼𝐽𝐾𝐿𝑀𝑁𝑂𝑃𝑄𝑅𝑆𝑇𝑈𝑉𝑊𝑋𝑌𝑍
    • Bold italic (U1D468+): 𝑨𝑩𝑪𝑫𝑬𝑭𝑮𝑯𝑰𝑱𝑲𝑳𝑴𝑵𝑶𝑷𝑸𝑹𝑺𝑻𝑼𝑽𝑾𝑿𝒀𝒁
    • Mathematical sans-serif (U1D5A0+): 𝖠𝖡𝖢𝖣𝖤𝖥𝖦𝖧𝖨𝖩𝖪𝖫𝖬𝖭𝖮𝖯𝖰𝖱𝖲𝖳𝖴𝖵𝖶𝖷𝖸𝖹
    • Mathematical sans-serif bold (U1D5D4+): 𝗔𝗕𝗖𝗗𝗘𝗙𝗚𝗛𝗜𝗝𝗞𝗟𝗠𝗡𝗢𝗣𝗤𝗥𝗦𝗧𝗨𝗩𝗪𝗫𝗬𝗭
    • Mathematical sans-serif italic (U1D608+): 𝘈𝘉𝘊𝘋𝘌𝘍𝘎𝘏𝘐𝘑𝘒𝘓𝘔𝘕𝘖𝘗𝘘𝘙𝘚𝘛𝘜𝘝𝘞𝘟𝘠𝘡
    • Mathematical sans-serif italic bold (U1D63C+): 𝘼𝘽𝘾𝘿𝙀𝙁𝙂𝙃𝙄𝙅𝙆𝙇𝙈𝙉𝙊𝙋𝙌𝙍𝙎𝙏𝙐𝙑𝙒𝙓𝙔𝙕
    • Mathematical script: 𝒜ℬ𝒞𝒟ℰℱ𝒢ℋℐ𝒥𝒦ℒℳ𝒩𝒪𝒫𝒬ℛ𝒮𝒯𝒰𝒱𝒲𝒳𝒴𝒵
    • Mathematical script bold: 𝓐𝓑𝓒𝓓𝓔𝓕𝓖𝓗𝓘𝓙𝓚𝓛𝓜𝓝𝓞𝓟𝓠𝓡𝓢𝓣𝓤𝓥𝓦𝓧𝓨𝓩
    • Fraktur: 𝔄𝔅ℭ𝔇𝔈𝔉𝔊ℌℑ𝔍𝔎𝔏𝔐𝔑𝔒𝔓𝔔ℜ𝔖𝔗𝔘𝔙𝔚𝔛𝔜ℨ
    • Fraktur bold (U1D56C+): 𝕬𝕭𝕮𝕯𝕰𝕱𝕲𝕳𝕴𝕵𝕶𝕷𝕸𝕹𝕺𝕻𝕼𝕽𝕾𝕿𝖀𝖁𝖂𝖃𝖄𝖅
    • Monospace (U1D670+): 𝙰𝙱𝙲𝙳𝙴𝙵𝙶𝙷𝙸𝙹𝙺𝙻𝙼𝙽𝙾𝙿𝚀𝚁𝚂𝚃𝚄𝚅𝚆𝚇𝚈𝚉
    • Doublestruck: 𝔸𝔹ℂ𝔻𝔼𝔽𝔾ℍ𝕀𝕁𝕂𝕃𝕄ℕ𝕆ℙℚℝ𝕊𝕋𝕌𝕍𝕎𝕏𝕐ℤ
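
For the rows above that give a base codepoint, the 26 letters are contiguous, so a row can be regenerated from its starting value. The following is a minimal Python sketch of that idea; the helper name styled_alphabet is made up for illustration. The Script, Fraktur and Doublestruck rows mix in characters from the Letterlike Symbols block (ℬ, ℭ, ℂ and so on), so a plain offset does not reproduce them.

```python
import unicodedata

def styled_alphabet(start, count=26):
    """Return `count` consecutive codepoints, beginning at `start`, as a string.

    Hypothetical helper: it assumes the block is contiguous, which holds for
    the rows listed with a base codepoint but not for Script, Fraktur or
    Doublestruck capitals.
    """
    return "".join(chr(start + i) for i in range(count))

sans_bold = styled_alphabet(0x1D5D4)   # Mathematical sans-serif bold A..Z
monospace = styled_alphabet(0x1D670)   # Mathematical monospace A..Z

# These are compatibility characters: NFKC normalization folds them to ASCII.
print(unicodedata.normalize("NFKC", sans_bold))
print(unicodedata.normalize("NFKC", monospace))
```

Both print calls emit the plain ASCII alphabet, which is one way to search or compare text that uses these lookalikes.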

FIXME do minuscules

Isomorphisms of the Greek alphabet

FIXME

Isomorphisms of the Arabic digits

  • Bold (U1D7CE+): 𝟎𝟏𝟐𝟑𝟒𝟓𝟔𝟕𝟖𝟗
  • Doublestruck (U1D7D8+): 𝟘𝟙𝟚𝟛𝟜𝟝𝟞𝟟𝟠𝟡
  • Sans-serif (U1D7E2+): 𝟢𝟣𝟤𝟥𝟦𝟧𝟨𝟩𝟪𝟫
  • Sans-serif bold (U1D7EC+): 𝟬𝟭𝟮𝟯𝟰𝟱𝟲𝟳𝟴𝟵
  • Monospace (U1D7F6+): 𝟶𝟷𝟸𝟹𝟺𝟻𝟼𝟽𝟾𝟿
  • Seven-segment (U1FBF0+): 🯰🯱🯲🯳🯴🯵🯶🯷🯸🯹
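
The mathematical digit rows above carry a decimal-digit property, so software that consults the Unicode database treats them as numbers. A small Python sketch, using only the standard unicodedata module and the base codepoints from the list (the variable names are illustrative):

```python
import unicodedata

# Doublestruck digits start at U+1D7D8, so "4" and "2" sit at offsets 4 and 2.
doublestruck_42 = chr(0x1D7D8 + 4) + chr(0x1D7D8 + 2)

# Each character reports its numeric value through the Unicode database...
values = [unicodedata.digit(ch) for ch in doublestruck_42]

# ...and int() accepts any run of Unicode decimal digits, not just ASCII.
number = int(doublestruck_42)

print(values, number)
```

The seven-segment digits were added in Unicode 13, so whether they behave the same way depends on the Unicode version bundled with the interpreter.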

UTF-8

The One True Encoding, almost always. See RFC 3629 and Annex D of ISO/IEC 10646.

UTF-8 encodes each codepoint in at most four bytes. ASCII characters (all codepoints below 0x80) are encoded directly as a single byte. The four-byte maximum arises from RFC 3629 §3, which defines UTF-8 only on codepoints through 0x10FFFF (sufficient for the 17 Planes defined as of Unicode 14); if the 10646 maximum of U+7FFFFFFF is considered, UTF-8 needs up to six bytes.
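
The one-to-four byte layout can be sketched in Python as follows. This is an illustration of the RFC 3629 bit patterns, not a replacement for the built-in str.encode; the function name utf8_encode is made up for the example.

```python
def utf8_encode(cp):
    """Encode a single codepoint to UTF-8 bytes, following RFC 3629.

    Illustrative sketch only; real code should use str.encode("utf-8").
    """
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("outside the RFC 3629 range")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates have no UTF-8 encoding")
    if cp < 0x80:       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# One sample from each length class, checked against the built-in encoder.
for cp in (0x41, 0xA7, 0x20AC, 0x1D5D4):
    encoded = utf8_encode(cp)
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {len(encoded)} byte(s): {encoded.hex()}")
```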

The 2048 codepoints U+D800 through U+DFFF cannot be encoded in UTF-8; they are surrogates, metapoints reserved for use by UTF-16.
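
To illustrate why this range is set aside, here is a short Python sketch of how UTF-16 uses it: a codepoint above U+FFFF is split into a high and a low unit drawn from U+D800 through U+DFFF. The helper name surrogate_pair is hypothetical.

```python
def surrogate_pair(cp):
    """Split a supplementary-plane codepoint into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only codepoints above U+FFFF use surrogate pairs")
    v = cp - 0x10000                 # 20 significant bits remain
    high = 0xD800 + (v >> 10)        # top 10 bits    -> U+D800..U+DBFF
    low = 0xDC00 + (v & 0x3FF)       # bottom 10 bits -> U+DC00..U+DFFF
    return high, low

high, low = surrogate_pair(0x1D5D4)
print(f"{high:04X} {low:04X}")

# A lone surrogate is not a character, and Python refuses to encode it.
try:
    "\ud800".encode("utf-8")
    lone_surrogate_encodable = True
except UnicodeEncodeError:
    lone_surrogate_encodable = False
print(lone_surrogate_encodable)
```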

The octets C0, C1, and F5 through FF never appear in valid UTF-8. ASCII octets never appear as part of a multibyte character.

Octets of the form 10xxxxxx are continuation bytes, and can only be found after a valid initial byte.
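
Those rules mean every octet can be classified on its own, without looking at its neighbours. A minimal Python sketch follows; the function name and the category labels are invented for the example.

```python
def classify_octet(b):
    """Classify one octet (0..255) by the role it can play in UTF-8."""
    if not 0 <= b <= 0xFF:
        raise ValueError("not an octet")
    if b < 0x80:
        return "ascii"          # 0xxxxxxx: a complete one-byte character
    if b < 0xC0:
        return "continuation"   # 10xxxxxx: only valid after a lead byte
    if b in (0xC0, 0xC1) or b >= 0xF5:
        return "invalid"        # never appears in valid UTF-8
    if b < 0xE0:
        return "lead-2"         # 110xxxxx: starts a two-byte sequence
    if b < 0xF0:
        return "lead-3"         # 1110xxxx: starts a three-byte sequence
    return "lead-4"             # 11110xxx: starts a four-byte sequence

for octet in (0x41, 0x9D, 0xC0, 0xC2, 0xE2, 0xF0, 0xF5):
    print(f"{octet:02X}: {classify_octet(octet)}")
```

Because a continuation byte is recognizable by itself, a decoder that lands in the middle of a stream can skip forward to the next non-continuation octet and resynchronize.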

libc

  • Ensure the proper locales are present, and are regenerated on package updates. locale -a lists all available locales. You want en_US.UTF-8 or the appropriate regional equivalent (use C.UTF-8 for an agnostic UTF-8 encoding): locale -a | grep -i 'utf-\?8$' should generate output (glibc usually lists the codeset as utf8, hence the case-insensitive match). On Debian, run dpkg-reconfigure locales to select generated locales and rebuild the locale database (it uses libc's localedef).
  • Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment (the various LC_* variables can override LANG for certain subsets of context, while LC_ALL overrides other LC_* values). On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which subsequently drives update-locale from the same package). This file is sourced by pam configs and /etc/init.d files.
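
The override order described in the second bullet (LC_ALL first, then the specific LC_* variable, then LANG) can be sketched in Python as below. Both function names are hypothetical, and the fallback to "C" when nothing is set is an assumption about typical libc behaviour rather than something stated above.

```python
import os

def effective_locale(env, category="LC_CTYPE"):
    """Return the locale name that applies to `category` given `env`.

    Precedence: LC_ALL, then the category variable, then LANG.
    Empty values are treated as unset. Falling back to "C" is an assumption.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"

def is_utf8_locale(name):
    """True if a locale name such as en_US.UTF-8 or en_US.utf8 names UTF-8."""
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"

# Check the current environment.
current = effective_locale(os.environ)
print(current, is_utf8_locale(current))
```

Passing a plain dict instead of os.environ makes it easy to see the precedence: {"LANG": "en_US.UTF-8", "LC_ALL": "C"} resolves to C, which is not a UTF-8 locale.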

filesystems

ext3 and friends use octets for filenames; it is up to applications to interpret them. For VFAT, ISO9660 and some others:

  • Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
  • Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
    • nls=utf8 as an option to mount will work on a per-filesystem basis

Console

  • Check whether UTF-8 mode is being used in the terminal driver via vt-is-utf8 from console-tools.
  • Set it with unicode_start, also from console-tools.

Application details

vim

  • In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
      • Greek: * (thus Ctrl-K, a* generates α, Ctrl-K, m* generates μ, etc)
      • Grave accent: ! ( a! -> à, A! -> À )
      • Acute/sharp accent: ' ( a' -> á, A' -> Á )
  • In Insert mode, Ctrl-V starts a reference input sequence. Use the Unicode decimal codepoint. Examples:
    • Ctrl-V, 227 generates ã
    • Ctrl-V, 167 generates §

X

See Also