Using Unicode

From dankwiki

Good references include:

Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: new versions only add characters, and never remove or replace existing ones. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible encoding of Unicode; it can be used in any context that was designed for ASCII, provided that context does not depend on what the individual characters mean.
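As a small illustration of what "ASCII-compatible" means in practice, the following Python sketch prints the UTF-8 bytes of a few characters. The helper names (utf8_hex, is_ascii_transparent) are made up for this example. The point it tries to show is that ASCII characters keep their single-byte values, while every byte of a non-ASCII character has the high bit set and so cannot be mistaken for an ASCII character such as / or NUL.

```python
def utf8_hex(text):
    """Return the UTF-8 encoding of text as a list of two-digit hex strings."""
    return [f"{b:02x}" for b in text.encode("utf-8")]


def is_ascii_transparent(text):
    """True if every non-ASCII character encodes only to bytes >= 0x80."""
    return all(
        b >= 0x80
        for ch in text
        if ord(ch) > 0x7F
        for b in ch.encode("utf-8")
    )


# Pure ASCII text is byte-for-byte identical in both encodings.
assert "ls -l /tmp".encode("utf-8") == "ls -l /tmp".encode("ascii")

# a, section sign, Greek alpha, euro sign: 1, 2, 2 and 3 bytes respectively.
for ch in ("a", "\u00a7", "\u03b1", "\u20ac"):
    print(f"U+{ord(ch):04X}", " ".join(utf8_hex(ch)))
```

This is why byte-oriented tools that only look for ASCII delimiters (path separators, newlines, NUL terminators) keep working on UTF-8 data without modification.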

libc

  • Ensure the proper locales are present and are regenerated on package updates. locale -a lists all available locales. You want en_US.utf8 or its regional equivalent: locale -a | grep 'utf8$' should produce output. On Debian, run dpkg-reconfigure locales to select which locales are generated and to rebuild the locale database (it uses libc's localedef).
  • Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment. The individual LC_* variables override LANG for their respective categories, while LC_ALL overrides LANG and every other LC_* value. On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which in turn runs update-locale from the same package). This file is read by the PAM configuration and by /etc/init.d scripts.
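The precedence rules above can be checked with a few lines of Python. This is only a sketch: the function names are invented for the example, it looks solely at the character-handling category (LC_CTYPE), and it assumes the usual convention that an empty variable counts as unset and that "C" is the fallback.

```python
import os


def effective_ctype_locale(env=None):
    """Return the locale name that governs character handling.

    Precedence: LC_ALL, then LC_CTYPE, then LANG. Empty values are treated
    as unset, and "C" is assumed when none of the three is set.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(var)
        if value:
            return value
    return "C"


def is_utf8_locale(name):
    """True if the locale name has a UTF-8 codeset (utf8 or UTF-8)."""
    # Locale names have the form language_TERRITORY.codeset@modifier.
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


if __name__ == "__main__":
    name = effective_ctype_locale()
    verdict = "UTF-8" if is_utf8_locale(name) else "not UTF-8"
    print(f"{name}: {verdict}")
```

Passing a dictionary instead of the real environment makes it easy to try combinations, for instance to confirm that LC_ALL=C hides an otherwise correct LANG=en_US.UTF-8.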

filesystems

ext3 and friends store filenames as uninterpreted octets; it is up to applications to interpret them. For VFAT, ISO9660 and some others:

  • Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
  • Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
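    • A mount option can set this per filesystem instead. The option name varies by filesystem: nls=utf8 for NTFS, utf8 or iocharset=utf8 for VFAT and ISO9660 (see mount(8))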
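The two kernel options above can be checked mechanically once you have the kernel configuration as text. The sketch below parses such text and reports whether both are set as described. The function names and the sample configuration are made up for illustration. Where the configuration lives on a given system (commonly /boot/config-<version> or /proc/config.gz) is an assumption left to the reader.

```python
def nls_settings(config_text):
    """Extract the two NLS options of interest from kernel .config text."""
    wanted = ("CONFIG_NLS_UTF8", "CONFIG_NLS_DEFAULT")
    found = {}
    for line in config_text.splitlines():
        # Lines such as "# CONFIG_FOO is not set" contain no "=" and are skipped.
        key, sep, value = line.strip().partition("=")
        if sep and key in wanted:
            found[key] = value.strip('"')
    return found


def utf8_nls_ready(config_text):
    """True if the UTF-8 NLS code is built (y or m) and is the default."""
    settings = nls_settings(config_text)
    return (
        settings.get("CONFIG_NLS_UTF8") in ("y", "m")
        and settings.get("CONFIG_NLS_DEFAULT") == "utf8"
    )


# Hypothetical excerpt, not taken from any particular kernel.
SAMPLE = """\
# CONFIG_NLS_ASCII is not set
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_UTF8=m
"""

print(nls_settings(SAMPLE))
print(utf8_nls_ready(SAMPLE))
```

If CONFIG_NLS_UTF8 is built as a module (m), the module also has to be loadable at mount time; the sketch does not check that.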

Console

  • Check whether UTF-8 mode is being used in the terminal driver via vt-is-utf8 from console-tools.
  • Set it with unicode_start, also from console-tools.
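For the curious, the output half of what unicode_start does can be sketched in a few lines. This assumes the escape sequences documented in console_codes(4) for the Linux console: ESC % G selects UTF-8 and ESC % @ returns to the default character set. The function name is invented for this example, and the sketch does not touch the keyboard mode, which the real script also handles.

```python
import sys

UTF8_ON = "\033%G"   # ESC % G: select UTF-8
UTF8_OFF = "\033%@"  # ESC % @: select the default (ISO 8859-1) character set


def set_console_utf8(enabled, stream=None):
    """Write the escape sequence that switches console UTF-8 mode on or off.

    Only meaningful when the stream is attached to a Linux virtual console;
    other terminals may ignore the sequence or interpret it differently.
    """
    stream = sys.stdout if stream is None else stream
    stream.write(UTF8_ON if enabled else UTF8_OFF)
    stream.flush()
```

Calling set_console_utf8(True) from a virtual console should have the same visible effect on output as the escape sequence sent by unicode_start; prefer the real tool for anything beyond experimentation.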

Application details

vim

  • In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
      • Greek: * (thus Ctrl-K, a* generates α, Ctrl-K, m* generates μ, etc.)
      • Grave accent: ! ( a! -> à, A! -> À )
      • Acute/sharp accent: ' ( a' -> á, A' -> Á )
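  • In Insert mode, Ctrl-V starts a numeric input sequence. Followed by up to three decimal digits, it inserts the character with that code point (0 through 255); for higher code points use Ctrl-V, u and four hexadecimal digits. Examples: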
    • Ctrl-V, 227 generates ã
    • Ctrl-V, 167 generates §
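To find the decimal code point for a character before typing it with Ctrl-V, or to confirm what a digraph produced, a short Python helper is enough. The describe function is invented for this example; it relies only on the standard unicodedata module.

```python
import unicodedata


def describe(ch):
    """Return the character with its hex code point, decimal value and name."""
    return f"{ch} U+{ord(ch):04X} decimal {ord(ch)} {unicodedata.name(ch)}"


# The two Ctrl-V examples above, starting from the decimal value.
for code in (227, 167):
    print(describe(chr(code)))

# Characters produced by the digraph examples: a*, m*, a!, A!, a', A'.
for ch in "\u03b1\u03bc\u00e0\u00c0\u00e1\u00c1":
    print(describe(ch))
```

The Greek letters come out above 255, which is why they need a digraph (or Ctrl-V, u with hexadecimal digits) rather than the three-digit decimal form.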

X

See Also

  • The UTF-8 demo/test file of Markus Kuhn