Theory and Practice of Sprixels

dankblog! 2021-03-29, 0524 EDT at the danktower

Our ordinary language has no means for describing a particular shade of color.—Ludwig Wittgenstein

I've spent significant time over the past few weeks adding "sprixel" support to Notcurses, culminating (thus far) in yesterday's 2.2.4 release. What's a sprixel? Numerous terminal emulators support one or another "graphic protocol" by which bitmaps of arbitrary size can be written to the glyph-based viewing area. Showing that dynamic control of the font is not even approximately equivalent in power to such graphic protocols is left as an exercise for the reader (or you can scroll to the end).

Sprixel flavors

There are at least four implemented protocols:

the venerable DEC Sixel protocol, introduced on the VT125, and also present on at least the VT240 and VT330.
- Implemented by at least XTerm, Mlterm, wezterm, and foot
- with patches out for at least VTE, alacritty, and Windows Terminal.
- also supported by iTerm2, despite having its own protocol
Kitty's protocol, advanced by Kovid Goyal, and recently (Q3 2021) supported by wezterm.
The "1337" iTerm2 protocol, also implemented by wezterm.
The Jexer Image Sequence from Autumn Lamonte's Jexer

Additionally, you can shit con brio all over the Linux framebuffer directly when in a framebuffer console, though I wouldn't call this a protocol. Christian Parpart (associated with Contour) is working on his "Terminal Good Image Protocol", but it has not yet AFAIK seen implementation.

Having waded in the velvet seas of terminal graphics protocols for a minute now, I was putting some of my own thoughts down as STEGAP. It ended up looking sufficiently much like Kitty's protocol that I just advocate terminal authors use that one, though.

Sixel

Sixel is both the oldest and the most widely supported, though much of this support has come in the past year. Unfortunately, while passable for controlling early-Eighties nine-pin color dot matrix printers like the DEC LA50, it's pretty undesirable from a console bitmap standpoint. Nonetheless, combined with "sprite", it has loaned its name to "sprixels".

Sixel outputs consist of three parts: a header, a palette, and the body. The header (most of it optional) indicates image geometry and how to interpret transparent pixels. The palette (also known as color registers) is a set of small integers mapped to HSL or RGB values on the range [0..100] (i.e. there is less precision available than our native 24bpp RGB). The body is then made up of a series of indices into the palette, each followed by the pixels of that color over a six-row subset of the bitmap. Each byte encodes one full six-pixel column. The top pixel is the least significant bit, and the bottom pixel is the most significant bit of a 6-bit value (hence the name). 63 is added to this value to get the encoded byte. This scheme ensures that all values are both (a) printable and (b) seven-bit-clean. Background/transparent bits are simply unspecified. Each index+body pair is followed by either a '$' or a '-'. The former returns to the left for further rendering; the latter moves down and to the left.

In addition, there are several global Sixel parameters which can be retrieved and, in some cases, set: the number of color registers (maximum palette size), the maximum bitmap geometry, and the scrolling/positioning mode.

Kitty

Kitty output consists of two parts: a header, and the chunked body. The header indicates image geometry, image index and position index, z-order, image format, and other attributes (we always use 32-bit RGBA). Chunks of up to 4KiB follow, containing raw decoded image data. It is possible to feed PNG format to Kitty directly, and also possible to compress the chunks with zlib's deflate algorithm.

Transparency and z-ordering

These protocols differ from each other in fundamental, important ways, and forming an abstraction from them isn't trivial. Integrating them wholly into the z-ordered semantics of Notcurses was still more difficult. Notcurses works via piles (rendering contexts) of planes (drawing surfaces), with each pile's planes totally ordered on a z axis. Higher planes blend with or obscure lower ones. When Notcurses renders a pile, it projects these planes down onto a single plane, solving a single surface; it rasterizes this virtual surface to the physical viewing area by encoding it as a set of terminal escape codes and UTF-8 data. Only cells which have changed need to be rasterized, and eliding unchanged cells is a critical optimization, usually cutting the output size down by 90% or more. Every output byte is processed several times before becoming visible (we must copy it to a kernel buffer, the terminal must read it from a kernel buffer, and the terminal must display the resulting cell), so trading some computation for fewer output bytes is well worth it.

The elegant model of planes of fixed-width, independent cells solving for a single cell matrix already breaks down in the presence of Unicode wide glyphs, perhaps foremost among them U+FDFD ARABIC LIGATURE BISMILLAH AR-RAHMAN AR-RAHEEM ﷽, a single glyph occupying anywhere from one to who-knows (I've seen 9) cells, depending on font, font engine, and terminal emulator. Wide characters cannot be printed in part, and printing another character anywhere atop a wide character obliterates all of the old character. Even restricting ourselves to characters two cells wide, a single character can annihilate four columns containing two characters:

column 0: left side of character A
column 1: right side of character A
column 2: left side of character B
column 3: right side of character B

Print a wide character C at column 1, and we end up with

column 0: space
column 1: left side of character C
column 2: right side of character C
column 3: space

This isn't a problem with a single virtual plane, but imagine we have two planes, P0 and P1, with P0 above P1. They are of equal size, and P0 is transparent by default. P0 has a wide character at column 1. P1 has a wide character at column 0. What do we render? It's impossible to render the first half of P1, despite that being the logical render. Allah, the All-Powerful, has fucked us again!

This difficulty is ratcheted up significantly in the case of sprixels. In the most complicated case, a single sprixel might need to be both "under" (obscured by) and "over" (partially obscuring) text, and the set of cells obscuring and being obscured can change every frame. Imagine, for instance, that we have the following friendly octopus (512x357px, and note that it has a transparent channel):

Let's assume a cell geometry of 11x20, which just happens to be my current cell geometry. 512 pixels then occupy a little over 46 columns (512 == 46 * 11 + 6), while 357 pixels occupy just about 18 rows (357 == 17 * 20 + 17). We want to be able to draw this happy fellow atop a background of text:

Note that the text "underneath" the octopus can change while the octopus is present, and this change must be reflected if the text is visible. We also want to draw on top of the octopus, and this text too can change while the octopus is displayed, or even go away, in which case the octopus must be restored.

We might want to write atop the sprixel, but with a transparent background. Unfortunately, this is only really possible with Kitty (I'm excluding ideas like sampling the glyph and interpolating an RGB background for the text cell), where we can place an image at "z=-1" in the internal kitty image Z-axis to place it underneath text. Here I've done so, placing the sprixel underneath the layer of 'a's, but still annihilating the sprixel with some other text.

Sixel doesn't really provide this capability; writing text over a sixel graphic will annihilate affected cells in their entirety. Here we come to the most meaningful difference between Sixel and Kitty/iTerm: the former is temporal, the latter positional.

Temporal vs Positional

Sixel writes the provided glyph as a unit, starting at the current cursor position (assuming we've entered DECSDM aka "Sixel Display Mode", which we always do—otherwise all sixels are emitted at the upper left corner of the display). Bits not specified in the sixel are transparent (assuming the Device Control String has used the value 1 for parameter P2), and transparent pixels leave undisturbed whatever was already present at the time of emission. Writing an entirely transparent region is a no-op. Writing a new character (or new Sixel pixels) anywhere in the graphic's region will wipe out the entirety of that cell. The Sixel data thus annihilated are not recoverable. This means that updating the text underneath a Sixel will blow the obscuring graphics away, requiring that the affected Sixel pixels be rerendered. In the event of stacked sprixels, we would potentially need to rerender the entire stack. In a naive implementation, changing one text cell "underneath" the octopus could result in painting 781 cells (the text cell actually being changed, and the 780 cells of the octopus). This is plainly undesirable.

With Kitty, a graphic is positional. All text is assumed to be at z=0. Graphics with non-negative coordinates obscure text. Graphics with negative coordinates sit underneath text. Note that this means we must choose whether a Kitty sprixel is going to have text above it, or under it, but no arbitrary single sprixel can do both.

Cell wiping

Thankfully, unlike Sixel, we have the ability to delete graphics in the Kitty protocol (Sixel graphics can only be cleared along with a larger screen region, or entirely obscured by other output), so we can "cut" chunks out of our bitmap, delete the visible sprixel, and load the new one; doing so effectively allows us to make text underneath a "z=-1" sprixel dynamically available. We cut the chunk out by setting the α (alpha) parameter to 0 across the breadth of the affected RGBA region. This function, sprite_kitty_cell_wipe(), can be found in src/lib/kitty.c. Easy peasy. We must keep the original α values somewhere, in case the damage need be recovered (we needn't duplicate the RGB values; we leave those as they are). Since we normalize the α values on input to either 0 or 255, we only need to keep one bit of state for each pixel, and map back that bit times 0xff. For a 1024x1024 pixel image (4MiB), this would be be a maximum 1Mb bit vector, representing 3.125% overhead.

Color quantization

The Kitty protocol accepts arbitrary RGBA 32-bit words in the sRGB color space; our natural internal format is 24bpp RGB plus two bits of alpha, so that's great—our colorspaces match. Sixel is a palette-indexed format. Common sizes are 16-, 256-, and 1024-entry palettes (in XTerm, this is controlled with the numColorRegisters X resource). Palettes are independent unless the privateColorRegisters resource is explicitly set false). A color quantization step is thus necessary (note that this these quantizations can be performed in perfect parallelism when split across multiple frames/images, but not necessarily within a single frame). There are five essential color quantization algorithms:

These images throw libsixel for a loop; `img2sixel` chugs along, and then generates no output.

Uniform chunking aka popularity algorithms / Voronoi clustering. Any child can come up with this scheme. Autumn Lamonte's Jexer uses a clever variant on this. Very fast, but the least accurate on difficult images.
Median cut, with or without a histogram. Heckbert in SIGGRAPH 1982. Used by libsixel by way of netpbm.
octrees, introduced by Gervautz and Purgathofer in 1988.
Kohonen neural nets as introduced by Dekker in NeuQuant. Used by the color_quant crate.
- You'll also see Hopfield nets
K-means clustering, perhaps best known from exoquant.

Notcurses uses my own take on Voronoi, a scheme I call "relaxation", as seen in the refine_colors() function in src/lib/sixel.c. I'll likely be changing this up, though. In the meantime, it's very robust, working properly on images that trip up libsixel. For heavy-duty testing, the fascinating images at AllRGB have been invaluable.

The T-A Matrix

Associated with a sprixel (technically with the plane on which the sixel is blitted, so it can be reused across frames) is a "transparency-annihilation matrix" having dimensions equivalent to the sprixel's area in cells. These matrix entries are loaded when the sprixel is first encoded, as one of:

0 (a wholly opaque cell)
1 (a cell region with both transparent and opaque pixels)
2 (a wholly transparent cell)

Each has different properties. The wholly opaque cell obstructs text underneath it entirely, so we can elide updates to such text. We can update the text underneath a wholly transparent cell without an expensive repaint of the bitmap. A mixed cell's most interesting property is that it has no interesting properties.

When we cut out a cell (to print text above the bitmap), the T-A matrix takes the value 3 (annihilation). Only with Kitty do we actually cut the cell out of the encoded glyph (which we then delete and replay), but we always want to track this, because we often reuse the T-A matrix across frames. When we encode an image with a preseeded T-A matrix in hand, we implicitly drop the annihilated pixels. Using this method, we never obscure text atop a series of frames, eliminating flicker we would otherwise suffer.

As you might have already guessed, the cut vector mentioned above is hung off the T-A matrix (though it must be refreshed on each frame change). When an annihilated cell is uncovered, it must be reconstructed; the T-A matrix value is changed to 4 (anastasis). At render time, the resurrected cell can either be rebuilt using the cut vector directly into the image (Kitty), or the Sixel-based sprixel can be reconstructed in toto. It might be desirable to revive into cell-sized sprixels, but see below. Either way, following reconstruction, the TAM entry returns to 0, 1, or 2.

As noted, the T-A matrix would seem to be a property of a sprixel, but is really a property of the sprixel-plane (it must sometimes have lifetime beyond the sprixel, but we must check its suitability for further sprixels). The T-A matrix is absolutely key to getting decent flicker-free performance while supporting arbitrary damage to sprixels. You're encouraged to take a look at the source.

Dead ends

Several seemingly-promising ideas end up nonworkable.

A False Path: Mosaics

All of this complexity arises from two things:

Sprixels are large, potentially covering many cells in a row, and even a rectilinear area across multiple rows. The mere presence of atomic wide characters is enough to disturb the model of perfect, independent cells. Sprixels destroy the illusion entirely.
Painting sprixels tends to be slow, and is prone to flicker. We very much want to avoid sprixel repaints.

Give this problem to a ten year old, and they'll come back eventually, pointing out that things would be greatly simplified by drawing sprixels as "swarms" of cell-sized sprixels. At this point, we can just reuse our proven Z-ordered TUI painting algorithm, with the additional rule that a solved sprixel cell containing transparency should print the glyph, and then (in sixel) print the sprixel cell. in kitty, we'd need only reprint the glyph (generalizing a stack of sprixel cells is left as an exercise for the reader)! Printing atop a sprixel would mean simply deleting the sprixel cell (in kitty), and printing the glyph. No computationally intensive "cuts", nothing difficult at all, really.

As a bonus, your color quantization pretty much goes away, since you're now doing a set a palette per cell. For a 10x20px cell, a 256-entry palette can decay to no-lookahead RGB. At the 1024-entry paint, only pathological cell sizes require quantization.

One downside would be that with Sixel, you're dealing with 1x6 column sets of pixels (hence the name). Assuming a 20px high cell, this requires 4 sixels per column, with the bottom 4 bits transparent (active mask of 0x03). This would mean 20% overhead (generally, for a cell Hpx high, overhead would be (6 - H % 6) / H for H ≆ 0 mod 6). This would be unacceptable, but thanks to Sixel's RLE functionality, the transparent excess could be encoded efficiently, in five bytes. Five bytes per cell, however, still represents significant overhead, but only on the order of kilobytes for a 1024x1024px sprixel, comfortably less than 1% overhead.

Sounds great, right? Alas, with a Flaming-Sword-of-the-Most-High-like strike, Allah dashes your plans. Emitting more than a small number (fewer than ten) will bring at least XTerm to an absolutely unusable crawl. So it goes.

My memory, sir, is like a garbage disposal.—Jorge Luis Borges

A False Path: Font reprogramming

"But dank," you ask, "what if we were to seize the means of glyph production, aka the font?" Not only is that a non-portable unholy fucking mess, you still only get two colors in a stream-written glyph. Colors matter as much if not more than resolution. The NES and the SNES had basically the same resolution; the main reason why Super Mario World looked so much better than Super Mario Bros. was because it had a much wider palette (and more memory).

Implementation details

The most complex expression of Sixel requires four phases: clear the old bitmaps (possibly overlapping with new bitmaps), draw any updated text that's visible through transparent areas of the bitmaps, draw the bitmaps, and draw any text (updated or not) that's visible above. If cuts are employed, this can be reduced to three phases, since the text above the bitmap can be printed below the bitmap. Kitty never requires more than two phases (update text, update bitmaps), since ordering isn't relevant in the Kitty protocol.

It's generally necessary to immediately follow a Kitty deletion command with the redraw, if it's being replaced. The animation extension coming in 0.20.0 ought improve significantly on this situation (note: it did). Clearing a Sixel requires an overwrite with text or non-transparent sixels, or a screen clear/rectangular delete. In the case where a Sixel to be destroyed and a Sixel to be painted overlap, clearing the first sixel might require damaging cells underneath the second, hence the strong dependencies between the first three phases of Sixel rendering.

Some terminals can draw Sixels more quickly when the P2 parameter is 0 than when it is 1 (foot explicitly makes this claim). We need 1 if we have transparent pixels, but we can use 0 otherwise. This can be derived directly from the TAM--if and only if all TAM cells are 0, we can set P2=0.

Other people's bugs

XTerm through 367 doesn't update the pixel counts returned in the TIOCGWINSZ ioctl(2) when the font changes. I submitted a bug report for this, and it was fixed in Patch #368.
XTerm through at least 366 progressively slows down as Sixels are written in alternate mode, even if they are cleared. I submitted a bug report for this.
XTerm through 369 doesn't clear sixels with DECERA, DECFRA, etc. I submitted a patch for this, which was merged for XTerm 370.
Mlterm through at least 3.9.0 makes the cursor visible after emitting a Sixel, even if it was hidden before. I submitted a bug for this.
MLterm through 3.9.1 inverts the broadly-interpreted sense of DECSDM. I submitted a PR for this, which was merged.
Kitty through 0.19.3 scrolls the screen when an emitted image touches the lowest line. I submitted a bug for this; the C=1 modifier, available in 0.20.0, inhibits this scrolling.

External Links

The VT330/VT340 Programmer Reference Manual Volume 2: Graphics Programming from vt100.net
The Kitty graphics protocol
The iTerm2 graphics protocol
Thomas Dickey's XTerm Control Sequences document
Jexer, Autumn Lamonte's Java library inspired by Borland's "Turbo" line of products, is the only thing that comes close AFAIK
Issue #12 of the terminal-ng working group
hackerb9's vt340test repository

previously: "spooky tmux at a distance" 2021-02-20

Theory and Practice of Sprixels

Contents