Good references include:
- The utf8(7) and unicode(7) pages from the Linux man pages
- RFC 3629, "UTF-8, a transformation format of ISO 10646"
- UTF-8 FAQ for Linux
- The Linux Unicode HOWTO
- Multilingual text on Linux
- The Unicode Book, version 5.0 (superseded by the web-only 5.1.0 standard), is one of the most beautifully constructed books I've ever seen, and well worth the price.
Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: new versions only add characters, and never remove or replace them. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible encoding of Unicode. It is usable in any context that was designed for ASCII, provided that context is insensitive to characters' meanings.
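As a rough illustration of what "ASCII-compatible" means here, the following Python sketch encodes a short string and looks at the resulting octets. The helper name utf8_bytes and the sample string are made up for this example; only str.encode is standard.

```python
def utf8_bytes(text: str) -> list[int]:
    """Return the UTF-8 octets of text as a list of integers."""
    return list(text.encode("utf-8"))


sample = "a§ã"  # one ASCII character followed by two non-ASCII ones
octets = utf8_bytes(sample)

# ASCII characters keep their single-octet ASCII value (below 0x80).
ascii_octets = [b for b in octets if b < 0x80]

# Every octet of a multi-octet sequence has its high bit set, so it can
# never be mistaken for an ASCII character by byte-oriented code.
high_octets = [b for b in octets if b >= 0x80]

print([hex(b) for b in octets])
```

Code that merely passes octets along (copying, concatenating, splitting on ASCII delimiters) therefore keeps working, which is the sense in which such contexts are "insensitive to characters' meanings".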
- Ensure the proper locales are present and are regenerated on package updates. locale -a lists all available locales. You want en_US.utf8 or its regional equivalent: locale -a | grep utf8$ should generate output. On Debian, run dpkg-reconfigure locales to select generated locales and rebuild the locale database (it uses libc's localedef).
- Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment. The various LC_* variables override LANG for their own categories, while LC_ALL overrides LANG and every other LC_* variable. On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which subsequently drives update-locale from the same package). This file is sourced by PAM configs and /etc/init.d files.
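The precedence among LANG, LC_* and LC_ALL can be sketched in a few lines of Python. This is a simplified model, not libc's actual implementation: effective_locale and is_utf8_locale are hypothetical helpers, and the "C" fallback is an assumption about what an unconfigured system reports.

```python
def effective_locale(category: str, env: dict[str, str]) -> str:
    """Resolve the locale for one category (e.g. "LC_CTYPE") from env.

    Precedence: LC_ALL, then the category's own variable, then LANG.
    Empty values count as unset; "C" is the assumed fallback.
    """
    for name in ("LC_ALL", category, "LANG"):
        value = env.get(name, "")
        if value:
            return value
    return "C"


def is_utf8_locale(name: str) -> bool:
    """True if the codeset part of a locale name is UTF-8 (utf8 or UTF-8)."""
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


# Hypothetical environment: LANG is UTF-8, but LC_CTYPE overrides it.
env = {"LANG": "en_US.UTF-8", "LC_CTYPE": "en_US.ISO-8859-1"}
print(effective_locale("LC_CTYPE", env), effective_locale("LC_TIME", env))
```

Passing dict(os.environ) as env would apply the same check to the current session; a stray LC_CTYPE or LC_ALL is a common reason for a UTF-8 LANG to have no visible effect.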
ext3 and friends use octets for filenames; it is up to applications to interpret them. For VFAT, ISO9660 and some others:
- Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
- Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
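- A mount option sets this on a per-filesystem basis. The option name varies by filesystem: VFAT and ISO9660 take iocharset=utf8 (or the utf8 flag), while NTFS takes nls=utf8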
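The two kernel options above can be checked mechanically. The sketch below parses kernel-config text and tests both settings; parse_kconfig, nls_utf8_ready and the sample text are invented for illustration, and where the real config lives (for example a file under /boot) depends on the distribution.

```python
def parse_kconfig(text: str) -> dict[str, str]:
    """Parse KEY=value lines of a kernel config, skipping comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # "# CONFIG_FOO is not set" lines land here
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value.strip('"')
    return options


def nls_utf8_ready(options: dict[str, str]) -> bool:
    """True if the UTF-8 NLS table is built (y or m) and is the default."""
    built = options.get("CONFIG_NLS_UTF8") in ("y", "m")
    return built and options.get("CONFIG_NLS_DEFAULT") == "utf8"


# Hypothetical excerpt of a kernel config, for demonstration only.
sample_config = """
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
# CONFIG_NLS_ISO8859_1 is not set
CONFIG_NLS_UTF8=m
"""
print(nls_utf8_ready(parse_kconfig(sample_config)))
```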
- In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
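- In Insert mode, Ctrl-V starts a reference input sequence. Follow it with the Unicode codepoint in decimal (up to three digits, so at most 255); for higher codepoints, type u and four hex digits, or U and up to eight hex digits. Examples: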
- Ctrl-V, 227 generates ã
- Ctrl-V, 167 generates §
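The decimal values in these examples are simply the characters' Unicode codepoints, which a short Python sketch can confirm. The describe helper is made up for this example; the hexadecimal form is included because Vim's Ctrl-V u prefix takes four hex digits rather than decimal.

```python
def describe(code: int) -> tuple[str, str, str]:
    """Return (decimal digits, U+hex notation, character) for a codepoint."""
    return (f"{code:03d}", f"U+{code:04X}", chr(code))


for code in (227, 167):
    decimal, hexform, char = describe(code)
    print(decimal, hexform, char)
```

Going the other way, ord("ã") gives the number to type after Ctrl-V for a character you already have on screen.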