Check out my first novel, midnight's simulacra!

Fast UNIX Servers

From dankwiki
Revision as of 20:49, 22 May 2023 by Dank (talk | contribs)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

"I love the smell of 10GbE in the morning. Smells like...victory." - W. Richard Stevens, "Secret Teachings of the UNIX Environment"

Dan Kegel's classic site "The C10K Problem" (still updated from time to time) put a Promethean order to the arcana of years, with Jeff Darcy's "High-Performance Server Architecture" adding to our understanding. I'm collecting here some followup material to these excellent works (and of course the books of W. Richard Stevens, whose torch we merely carry). Some of these techniques have found, or will find, their way into libtorque, my multithreaded event unification library (and master's thesis).

FIXME as of 2019, I need update this with information on DPDK, io_uring, and eBPF/XDP at a minimum.

Central Design Principles

Varghese's Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices is in a league of its own in this regard.

  • Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate (this applies to page caches just as much as processor caches or any other layer of the memory hierarchy). Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
  • Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or ringbuffers, and calls like Linux's splice(2).
  • Principle 3: Measure, measure, and measure again, preferably automatically. Hardware, software and networks will all surprise you. Become friends with your hardware's performance counters and tools like eBPF, dtrace, ktrace, etc. Build explicit support for performance analysis into the application, especially domain-specific statistics.

"I thought of another moral, more down to earth and concrete, and I believe that every militant chemist can confirm it: that one must distrust the almost-the-same (sodium is almost the same as potassium, but with sodium nothing would have happened), the practically identical, the approximate, all surrogates, and all patchwork. The differences can be small, but they can lead to radically different consequences, like a railroad's switch points: the chemist's trade consists in good part of being aware of these differences, knowing them close up and foreseeing their effects. And not only the chemist's trade." - Primo Levi, The Periodic Table

Queueing Theory

Event Cores

Edge and Level Triggering

  • Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) were level-triggered
  • Asynchronous I/O is pretty much by definition edge-triggered.
  • epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
  • see the mteventqueue document from libtorque's documentation:

<include src=" " />

A Garden of Interfaces

We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...

  • readv(2), writev(2) (FreeBSD's sendfile(2) has a struct iov handily attached, perfect for eg the Chunked transfer-encoding)
  • splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17
    • (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
  • sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
    • On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
  • aio_ and friends for aysnchronous i/o
  • mmap(2) and an entire associated bag of tricks (FIXME detail)
    • most uses of mincore(2) and madvise(2) are questionable at best and useless at likely. FIXME defend
    • broad use of mlock(2) as a performance hack is not even really questionable, just a bad idea FIXME defend
    • use of large pages is highly recommended for any large, non-sparse maps FIXME explain
    • mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
    • There's nothing wrong with MAP_FIXED so long as you've already allocated the region before (see caveats...)

"The linux mremap() is an idiotic system call. Just unmap the file and re-mmap it. There are a thousand ways to do it, which is why linux's mremap() syscall is stupid." - Matthew Dillon

"I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home." - Linus Torvalds

  • User-space networking stacks: The Return of Mach!
    • Linux has long had zero-copy PF_PACKET RX; get ready for zero-copy TX (using the same PACKET_MMAP interface)
    • I'm actively using PACKET_TX_RING in Omphalos as of June 2011; it works magnificently.
  • "Zero-copy" gets banded about a good bit; be sure you're aware of hardware limitations (see FreeBSD's zero_copy(9), for instance)
  • Linux 5 introduced io_uring and really locked down and expanded eBPF

The Full Monty: A Theory of UNIX Servers

We must mix and match:

  • Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
    • Socket descriptors, pipes
    • File descriptors referring to actual files (these usually have different blocking semantics)
    • Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
    • Timers (timerfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
    • Condition variables and/or mutexes becoming available
    • Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
    • Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
  • One or more event notifiers (epoll or kqueue fd)
  • One or more event vectors, into which notifiers dump events
    • kqueue supports vectorized registration of event changes, generalizing the issue
  • Threads -- one event notifier per? one shared event notifier with one event vector per? one shared event notifier feeding one shared event vector? work-stealing/handoff?
    • It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads
    • The Flash web server dynamically spawns and culls helper threads for high-latency I/O operations
    • The contest is between the costs of demultiplexing asynchronous event notifications vs managing threads
      • My opinion: if fast async notifications can be distributed across threads, one thread per processing element always suffices
      • Other opinions exist, centered around communication bottlenecks

"Thread scheduling provides a facility for juggling between clients without further programming; if it is too expensive, the application may benefit from doing the juggling itself. Effectively, the application must implement its own internal scheduler that juggles the state of each client." - George Varghese, Network Algorithmics

DoS Prevention or, Maximizing Useful Service

  • TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
    • Long-fat-pipe options, fewer MSS values, etc...but recent work (in Linux, at least) has improved them (my gut feeling: nay)
  • Various attacks like slowloris, TCPPersist as written up in Phrack 0x0d-0x42-0x09, Outpost24 etc...
  • What are the winning feedbacks? fractals and queueing theory, oh my! fixme detail

The Little Things

Hardware Esoterica

  • Direct Cache Access must be supported by NICs, northbridge chipset, OS and microarchitecture
  • Checksum offloading / TSO / LRO / Frame descriptors
    • Use ethtool on Linux to configure NICs (try ethtool -g, -k and -c)
  • PCI shared bus/bus-mastering, PCIe slots/lanes (channel grouping), PCI-X, MSI

"Many times, but not every time, a network frontend processor is likely to be an overly complex solution to the wrong part of the problem. It is possibly an expedient short-term measure (and there's certainly a place in the world for those), but as a long-term architectural approach, the commoditization of processor cores makes specialized hardware very difficult to justify." - Mike O'Dell, "Network Front-end Processors, Yet Again"

  • What about robustness in the face of hardware failure? Actual hardware interfaces (MCE, IPMI, CPU and memory blacklisting) ought mainly be the domain of the operating system, but the effects will be felt by the application. If a processing unit is removed, are compute-bound threads pruned?

Operating System Esoterica

  • The Linux networking stack is a boss hawg and a half. Check out the Linux Advanced Routing and Traffic Control (LARTC) HOWTO for details ad nauseam
  • See my TCP page -- auto-tuning is pretty much to be assumed (and best not subverted) in recent Linux/FreeBSD
  • When extending MAP_NOSYNC maps on FreeBSD, be sure to write(2) in zeros, rather than merely ftruncating (see the man page's warning)
  • Linux enforces a systemwide limit on LVMAs (maps): /proc/sys/vm/max_map_count

Tuning for the Network

  • All hosts ought employ the RFC 1323 options (see Syncookies regarding contraindications there)
  • Avoid fragmentation: datagram services (UDP, DCCP) ought ensure they're not exceeding PMTUs
  • LAN services ought consider jumbo frames.
  • There is little point in setting IPv4 TOS bits (RFC 791, RFC 1349); they've been superseded as DiffServ/ECN (RFC 3168)
    • IPv6 does away with this entire concept, using flow labels and letting the router decide

Power Consumption

Less power consumed means reduced operating cost and less waste heat, prolonging component life.

  • Using on-demand CPU throttling (ACPI P-states, voltage reduction) is a no-brainer, but requires dynamic control to be effective.
    • Be sure it's enabled in your OS and your BIOS; more info here
  • Sleep states (architectural changes) are useful outside environments pairing low-latency requirements with sporadic traffic
  • Don't wake up disks when it's not necessary; try using tmpfs or async for transient logging, and don't log MARK entries
    • If your app doesn't use disk directly, consider PXE booting and network-based logging
  • Avoid periodic taskmastering and timers where available, using event-driven notification (more effective anyway!)
  • Use as few processing elements as completely as possible, so that CPUs and caches can be powered down
    • This also applies, of course, to machines in a cluster

See Also