Fast UNIX Servers

Dan Kegel's classic site "The C10K Problem" (still updated from time to time) put a Promethean order to the arcana of years, with Jeff Darcy's "High-Performance Server Architecture" adding to our understanding. I'm collecting here some followup material to these excellent works (and of course the books of W. Richard Stevens, whose torch we merely carry).

"I love the smell of 10GbE in the morning. Smells like...victory." - W. Richard Stevens, "Secret Teachings of the UNIX Environment"

Central Design Principles

Varghese's Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices is in a league of its own in this regard.

Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate (this applies to page caches just as much as processor caches or any other layer of the memory hierarchy). Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or calls like Linux's splice(2).
Principle 3: Measure, measure, and measure again, preferably automatically. Hardware, software and networks will all surprise you. Become friends with your hardware's performance counters and tools like OProfile, DTrace, ktrace, etc.
- "I thought of another moral, more down to earth and concrete, and I believe that every militant chemist can confirm it: that one must distrust the almost-the-same (sodium is almost the same as potassium, but with sodium nothing would have happened), the practically identical, the approximate, all surrogates, and all patchwork. The differences can be small, but they can lead to radically different consequences, like a railroad's switch points: the chemist's trade consists in good part of being aware of these differences, knowing them close up and foreseeing their effects. And not only the chemist's trade." - Primo Levi, The Periodic Table

Queueing Theory

"Introduction to Queueing"
Little's Law
Leonard Kleinrock's peerless Queueing Systems (Volume 1: Theory, Volume 2: Computer Applications)
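
As a worked instance of Little's Law (L = λW: the mean number of requests in the system equals the arrival rate times the mean time each spends there), here is a minimal sketch. The function names and the sample figures are made up for illustration; the law itself assumes a system in steady state.

```python
def mean_in_system(arrival_rate, mean_time_in_system):
    """Little's Law, L = lambda * W.

    arrival_rate: mean arrivals per second (lambda)
    mean_time_in_system: mean seconds each request spends queued plus in service (W)
    """
    return arrival_rate * mean_time_in_system


def mean_time_from_occupancy(mean_occupancy, arrival_rate):
    """The same law solved for W: W = L / lambda."""
    return mean_occupancy / arrival_rate


# Illustrative numbers only: 10,000 requests/s at a mean latency of 5 ms
# implies roughly 50 requests in flight at any instant.
in_flight = mean_in_system(10_000, 0.005)
```

Read the other way, a server that can hold only so many requests in flight has its sustainable arrival rate bounded by that occupancy divided by the mean latency.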

Event Cores

epoll on Linux, /dev/poll on Solaris, kqueue on FreeBSD
liboop, libev and libevent
Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O"
- If nothing else, Drepper's plans tend to become sudden and crushing realities in the glibc world

Edge and Level Triggering

Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) were level-triggered
Asynchronous I/O is pretty much by definition edge-triggered.
epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
fixme a thorough comparison of these is sorely needed -- in short, edge-triggered saves syscalls and makes concurrency easier
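
Pending that comparison, the difference can at least be observed directly. The following is a small experiment rather than a benchmark, assuming Linux and Python's select.epoll wrapper; the function name count_ready_reports is made up for illustration. It makes one byte readable, never reads it, and counts how many consecutive polls report the descriptor.

```python
import select
import socket


def count_ready_reports(edge_triggered, polls=3):
    """Return how many of `polls` consecutive epoll waits report the socket
    readable when the pending byte is deliberately never read."""
    a, b = socket.socketpair()
    ep = select.epoll()
    try:
        b.setblocking(False)
        flags = select.EPOLLIN | (select.EPOLLET if edge_triggered else 0)
        ep.register(b.fileno(), flags)
        a.send(b"x")  # a single readiness transition: empty -> readable
        # Each poll waits at most 0.1 s; a non-empty result counts as a report.
        return sum(1 for _ in range(polls) if ep.poll(0.1))
    finally:
        ep.close()
        a.close()
        b.close()
```

Level-triggered mode keeps reporting as long as the condition holds; edge-triggered mode reports the transition once, which is why an edge-triggered consumer must drain the descriptor until EAGAIN before waiting again.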

A Garden of Interfaces

We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...

readv(2), writev(2) (FreeBSD's sendfile(2) has a struct sf_hdtr of header and trailer iovecs handily attached, perfect for e.g. the chunked transfer-encoding)
splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17
- (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
- On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
aio_ and friends for asynchronous I/O
mmap(2) and an entire associated bag of tricks (FIXME detail)
- most uses of mincore(2) and madvise(2) are questionable at best and, more likely, useless. FIXME defend
- broad use of mlock(2) as a performance hack is not even really questionable, just a bad idea FIXME defend
- use of large pages is highly recommended for any large, non-sparse maps FIXME explain
- mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
- There's nothing wrong with MAP_FIXED so long as you've already allocated the region before (see caveats...)
- "The linux mremap() is an idiotic system call. Just unmap the file and re-mmap it. There are a thousand ways to do it, which is why linux's mremap() syscall is stupid." - Matthew Dillon
User-space networking stacks: The Return of Mach!
- Linux has long had zero-copy PF_PACKET RX; get ready for zero-copy TX (using the same PACKET_MMAP interface)
"Zero-copy" gets bandied about a good bit; be sure you're aware of hardware limitations (see FreeBSD's zero_copy(9), for instance)

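To make the gather-write idea concrete, here is a minimal sketch of emitting one chunk of the chunked transfer-encoding with a single writev(2), via Python's os.writev. The helper name write_chunk is made up for illustration, and the sketch ignores short writes, which a real server on a non-blocking socket would have to handle.

```python
import os


def write_chunk(fd, payload):
    """Emit one HTTP/1.1 chunk (hex size line, data, CRLF) in one system call.

    The three pieces stay in separate buffers; writev gathers them, so the
    payload is never copied into a combined userspace buffer first.
    Returns the number of bytes written, which may be short on a
    non-blocking descriptor (not handled in this sketch).
    """
    size_line = b"%x\r\n" % len(payload)
    return os.writev(fd, [size_line, payload, b"\r\n"])
```

Calling write_chunk with an empty payload produces the terminating zero-length chunk followed by an empty trailer section.
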
The Full Monty: A Theory of UNIX Servers

We must mix and match:

Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
- Socket descriptors, pipes
- File descriptors referring to actual files (these usually have different blocking semantics)
- Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
- Timers (timerfd_create(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
- Condition variables and/or mutexes becoming available
- Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
- Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
One or more event notifiers (epoll or kqueue fd)
One or more event vectors, into which notifiers dump events
- kqueue supports vectorized registration of event changes, extending the issue
Threads -- one event notifier per? one shared event notifier with one event vector per? one shared event notifier feeding one shared event vector? work-stealing/handoff?
- It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads
- The Flash web server dynamically spawns and culls helper threads for high-latency I/O operations
- The contest is between the costs of demultiplexing asynchronous event notifications vs managing threads
  - My opinion: if fast async notifications can be distributed across threads, one thread per processing element always suffices
- "Thread scheduling provides a facility for juggling between clients without further programming; if it is too expensive, the application may benefit from doing the juggling itself. Effectively, the application must implement its own internal scheduler that juggles the state of each client." - George Varghese, Network Algorithmics
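
As a sketch of folding unlike event sources into one notifier: a socket and a "worker finished" event are both waited on through a single selector. The pipe used to surface the thread's completion is the familiar self-pipe workaround, swapped in here because threads and condition variables have no descriptor of their own; it is not something the list above prescribes. The function name and the event tags are made up for illustration.

```python
import os
import selectors
import socket
import threading


def demultiplex_once():
    """Wait on a socket and a worker-completion pipe through one notifier."""
    sel = selectors.DefaultSelector()  # epoll on Linux, kqueue on the BSDs
    client, server = socket.socketpair()
    wake_r, wake_w = os.pipe()
    sel.register(server, selectors.EVENT_READ, "socket")
    sel.register(wake_r, selectors.EVENT_READ, "worker-done")

    # Hypothetical worker thread: reports completion by writing one byte.
    worker = threading.Thread(target=os.write, args=(wake_w, b"\0"))
    worker.start()
    client.send(b"ping")

    seen = []
    try:
        for _ in range(10):  # bounded, so a lost event cannot hang the sketch
            for key, _mask in sel.select(timeout=1.0):
                # The selector is level-triggered, so drain what was reported.
                if key.data == "socket":
                    server.recv(4096)
                else:
                    os.read(wake_r, 1)
                if key.data not in seen:
                    seen.append(key.data)
            if len(seen) == 2:
                break
    finally:
        worker.join()
        sel.close()
        client.close()
        server.close()
        os.close(wake_r)
        os.close(wake_w)
    return sorted(seen)
```

On Linux, eventfd(2), signalfd(2) and the timerfd calls play the role of the pipe with less overhead; the dispatch loop keeps the same shape.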

DoS Prevention or, Maximizing Useful Service

TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
- Long-fat-pipe options, fewer MSS values, etc...but recent work (in Linux, at least) has improved them (my gut feeling: nay)
Various attacks like slowloris, TCPPersist as written up in Phrack 0x0d-0x42-0x09, Outpost24 etc...
What are the winning feedbacks? fractals and queueing theory, oh my! fixme detail

The Little Things

Hardware Esoterica

Direct Cache Access must be supported by NICs, northbridge chipset, OS and microarchitecture
IOMMU / I/OAT
Checksum offloading / TSO / LRO / Frame descriptors
- Use ethtool on Linux to configure NICs (try ethtool -g, -k and -c)
PCI shared bus/bus-mastering, PCIe slots/lanes (channel grouping), PCI-X, MSI

Operating System Esoterica

The Linux networking stack is a boss hawg and a half. Check out the Linux Advanced Routing and Traffic Control (LARTC) HOWTO for details ad nauseam
See my TCP page -- auto-tuning is pretty much to be assumed (and best not subverted) in recent Linux/FreeBSD
When extending MAP_NOSYNC maps on FreeBSD, be sure to write(2) in zeros, rather than merely ftruncating (see the man page's warning)
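
A minimal sketch of that last point: grow the backing file by writing zero bytes at its end instead of calling ftruncate(2). The helper name and block size are made up for illustration, and the mmap call itself is not shown.

```python
import os


def extend_with_zeros(fd, new_size, block=1 << 16):
    """Grow the file behind fd to new_size bytes by writing zeros at its end.

    Unlike ftruncate, this actually writes the new bytes, so the extension
    is not left as a hole in the file. Returns the resulting file size
    (unchanged if the file is already at least new_size bytes long).
    """
    size = os.fstat(fd).st_size
    zeros = bytes(block)
    while size < new_size:
        chunk = zeros[:min(block, new_size - size)]
        size += os.pwrite(fd, chunk, size)  # pwrite may write less than asked
    return size
```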

Tuning for the Network

All hosts ought employ the RFC 1323 options (see Syncookies regarding contraindications there)
Avoid fragmentation: datagram services (UDP, DCCP) ought ensure they're not exceeding PMTUs
LAN services ought consider jumbo frames.
There is little point in setting IPv4 TOS bits (RFC 791, RFC 1349); they've been superseded by DiffServ (RFC 2474) and ECN (RFC 3168)
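
For the fragmentation point, the arithmetic a datagram service needs is short. This sketch assumes the path MTU is already known (how it is discovered is not shown), an IPv4 header without options and an IPv6 header without extension headers; the names are made up for illustration.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
IPV6_HEADER = 40  # bytes, fixed header only, assuming no extension headers
UDP_HEADER = 8    # bytes


def max_udp_payload(path_mtu, ipv6=False):
    """Largest UDP payload that fits in one unfragmented packet."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    payload = path_mtu - ip_header - UDP_HEADER
    if payload < 0:
        raise ValueError("path MTU is smaller than the IP and UDP headers")
    return payload
```

With standard 1500-byte Ethernet this gives 1472 bytes over IPv4 and 1452 over IPv6; a 9000-byte jumbo frame raises the IPv4 figure to 8972.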

Power Consumption

Less power consumed means reduced operating cost and less waste heat, prolonging component life.

Using on-demand CPU throttling (ACPI P-states, voltage reduction) is a no-brainer, but requires dynamic control to be effective.
- Be sure it's enabled in your OS and your BIOS; more info here
Sleep states (architectural changes) are useful outside environments pairing low-latency requirements with sporadic traffic
- Even aggressive power-saving ACPI C-states wake up on the order of microseconds
Don't wake up disks when it's not necessary; try using tmpfs or async for transient logging, and don't log MARK entries
- If your app doesn't use disk directly, consider PXE booting and network-based logging
Avoid periodic polling and timers; where available, use event-driven notification instead (more effective anyway!)
Use as few processing elements as completely as possible, so that CPUs and caches can be powered down
- This also applies, of course, to machines in a cluster
