Using Unicode
Revision as of 07:26, 13 January 2021
Unicode 14.0 is scheduled for release September 2021. There is no "Unicode 13.1", but Emoji 13.1 was released in September 2020 under the auspices of the Unicode Consortium. Unicode 13.0 was released in March 2020.
Good references include:
- The utf8(7) and unicode(7) pages from the Linux man pages
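- RFC 3629, "UTF-8, a transformation format of ISO 10646" (http://www.ietf.org/rfc/rfc3629.txt)
- UTF-8 FAQ for Linux (http://www.cl.cam.ac.uk/~mgk25/unicode.html)
- The Linux Unicode HOWTO (http://www.linux.org/docs/ldp/howto/Unicode-HOWTO.html)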
- Multilingual text on Linux
- The Unicode Book, version 5.0 (the web-only 5.1.0 standard supersedes this) is one of the most beautifully-constructed books I've ever seen, and well worth the price.
Unicode 5.0 corresponds to ISO 10646:2003, including amendments 1–3. Unicode versions since 2.0 are backwards-compatible: no characters are removed or replaced in new versions, only added. ISO 14651 defines string sorting order. RFC 3629 defines UTF-8, an ASCII-compatible Unicode encoding. It is usable in any context that was designed for ASCII, provided that context passes bytes through without interpreting the characters' meanings.
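The ASCII compatibility mentioned above can be checked directly. The following is a minimal sketch, not taken from the original page; the helper name utf8_bytes_by_char is made up for illustration. It encodes each character separately and shows that ASCII characters keep their single-byte value while every byte of a non-ASCII character is 0x80 or higher, so no byte can be mistaken for ASCII.

```python
def utf8_bytes_by_char(text):
    """Return (character, UTF-8 bytes) pairs for each character in text.

    Hypothetical helper, for illustration only.
    """
    return [(ch, ch.encode("utf-8")) for ch in text]


def is_ascii_transparent(text):
    """Check the two properties that make UTF-8 safe in ASCII-oriented contexts."""
    for ch, raw in utf8_bytes_by_char(text):
        if ord(ch) < 0x80:
            # ASCII characters encode to exactly their own byte value.
            if raw != bytes([ord(ch)]):
                return False
        else:
            # Non-ASCII characters never produce a byte below 0x80.
            if any(b < 0x80 for b in raw):
                return False
    return True


for ch, raw in utf8_bytes_by_char("a\u00a7\u03b1"):
    print(f"U+{ord(ch):04X} -> {raw.hex(' ')}")
print(is_ascii_transparent("path/to/\u00a7 file"))
```

This is why a path separator or a NUL byte can never appear inside a multi-byte sequence.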
Interesting Unicode
libc
- Ensure the proper locales are present and are regenerated on package updates. locale -a will list all available locales. You want en_US.utf8 or its regional equivalent: locale -a | grep utf8$ should generate output. On Debian, run dpkg-reconfigure locales to select generated locales and rebuild the locale database (it uses libc's localedef).
- Ensure that you're exposing a UTF-8-enabled locale to setlocale(3) and friends: LANG=en_US.UTF-8 should be exported in your environment (the individual LC_* variables override LANG for their own categories, while LC_ALL overrides LANG and every other LC_* value). On Debian, configure /etc/default/locale via dpkg-reconfigure locales (which subsequently drives update-locale from the same package). This file is sourced by PAM configuration and /etc/init.d scripts.
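The precedence among LC_ALL, the per-category LC_* variables and LANG can be sketched as a small lookup. This is an illustration under assumptions, not libc's actual code: the function names are hypothetical, and falling back to "C" when nothing is set is an assumption about the default (it matches glibc's behaviour, but POSIX leaves the default to the implementation).

```python
import os


def effective_locale(category, env=None):
    """Return the locale name that would apply to `category`, e.g. "LC_CTYPE".

    Hypothetical helper: checks LC_ALL, then the category variable, then LANG.
    """
    env = os.environ if env is None else env
    for var in ("LC_ALL", category, "LANG"):
        value = env.get(var)
        if value:  # unset and empty variables both fall through
            return value
    return "C"  # assumed default when nothing is set


def is_utf8_locale(name):
    """True if the codeset part of a locale name is UTF-8 (utf8, UTF-8, ...)."""
    codeset = name.partition(".")[2].partition("@")[0]
    return codeset.lower().replace("-", "") == "utf8"


current = effective_locale("LC_CTYPE")
print(current, is_utf8_locale(current))
```

Passing an explicit dictionary instead of the real environment makes it easy to try out combinations, for example effective_locale("LC_CTYPE", {"LANG": "en_US.UTF-8", "LC_ALL": "C"}).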
filesystems
ext3 and friends use octets for filenames; it is up to applications to interpret them. For VFAT, ISO9660 and some others:
- Ensure the UTF-8 module is being built (CONFIG_NLS_UTF8)
- Ensure the "Default NLS Language" is "utf8" in the kernel config (CONFIG_NLS_DEFAULT)
- Alternatively, pass nls=utf8 as an option to mount to set this on a per-filesystem basis
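Since ext3 and similar filesystems store filenames as plain octets, an application has to decide how to decode them. The sketch below shows one way an application might do that, assuming it wants to treat names as UTF-8 and still display names that are not valid UTF-8; the function name show_filename is hypothetical.

```python
import os


def show_filename(raw):
    """Decode a filename's octets as UTF-8 for display.

    Bytes that are not valid UTF-8 are rendered as \\xNN escapes
    instead of raising an error. Hypothetical helper for illustration.
    """
    return raw.decode("utf-8", errors="backslashreplace")


# Passing a bytes path makes os.listdir return the raw octets of each name.
for raw_name in os.listdir(b"."):
    print(show_filename(raw_name))
```

A name written by a Latin-1 application, such as the bytes b"caf\xe9.txt", is not valid UTF-8 and would be shown with the offending byte escaped.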
Console
- Check whether UTF-8 mode is being used in the terminal driver via vt-is-UTF8 from console-tools.
- Set it with unicode_start, also from console-tools.
Application details
vim
- In Insert mode, Ctrl-K can be used to enter characters by digraph (see loaded digraphs with :dig). Classes of digraphs share a common suffix character:
- Greek: * (thus Ctrl-K, a* generates α, Ctrl-K, m* generates μ, etc.)
- Grave accent: ! ( a! -> à, A! -> À )
- Acute/sharp accent: ' ( a' -> á, A' -> Á )
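- In Insert mode, Ctrl-V starts a numeric character entry. Type the character's Unicode codepoint in decimal (up to three digits, so values up to 255); for higher codepoints, type u followed by four hex digits or U followed by eight. Examples: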
- Ctrl-V, 227 generates ã
- Ctrl-V, 167 generates §
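To find the number to type after Ctrl-V for a given character, a few lines of Python are enough. This is a sketch, and the helper name ctrl_v_sequence is made up; it relies on vim accepting up to three decimal digits (values up to 255), u plus four hex digits, or U plus eight hex digits after Ctrl-V.

```python
import unicodedata


def ctrl_v_sequence(ch):
    """Return what to type after Ctrl-V in vim's Insert mode to produce ch.

    Hypothetical helper for illustration.
    """
    cp = ord(ch)
    if cp <= 255:
        return f"{cp:03d}"    # three decimal digits
    if cp <= 0xFFFF:
        return f"u{cp:04x}"   # u + four hex digits
    return f"U{cp:08x}"       # U + eight hex digits


for ch in "\u00e3\u00a7\u03b1":
    name = unicodedata.name(ch)
    print(f"{ch}  U+{ord(ch):04X}  {name}  Ctrl-V {ctrl_v_sequence(ch)}")
```

The reverse direction is just chr(227) or chr(167), which returns the character for a decimal codepoint.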
X
See Also
- The UTF-8 demo/test file of Markus Kuhn
- Unicode control characters by Aivosto Oy