Fast UNIX Servers

Everyone ought to start with Dan Kegel's classic site, "The C10K Problem" (still updated from time to time). Jeff Darcy's "High-Performance Server Architecture" covers much of the same ground. Everything here is advanced follow-up material to these excellent works, and of course to the books of W. Richard Stevens.

"I love the smell of 10GbE in the morning. Smells like...victory." - W. Richard Stevens, "Secret Teachings of the UNIX Environment"

Central Design Principles

Varghese's "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices" is in a league of its own in this regard.

  • Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate (this applies to page caches just as much as processor caches or any other layer of the memory hierarchy). Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
  • Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or calls like Linux's splice(2).
  • Principle 3: Measure, measure, and measure again, preferably automatically. Hardware, software and networks will all surprise you. Become friends with your hardware's performance counters and tools like Oprofile, dtrace, ktrace, etc.
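
As a concrete instance of Principle 1's alignment advice, here's a minimal C sketch of padding per-thread counters out to the cache line to avoid false sharing. The 64-byte line size is an assumption; on Linux, sysconf(_SC_LEVEL1_DCACHE_LINESIZE) will report the real value.

    /* Hypothetical per-thread stats, padded so that writers on different
     * cores never falsely share a cache line. CACHE_LINE is an assumed
     * 64 bytes; query the hardware rather than trusting this. */
    #include <stdint.h>

    #define CACHE_LINE 64

    struct percpu_stats {
            uint64_t packets;
            uint64_t bytes;
    } __attribute__((aligned(CACHE_LINE)));

    /* Adjacent slots now occupy distinct lines; concurrent updates from
     * different processing elements won't ping-pong a line between caches. */
    static struct percpu_stats stats[16];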

"I thought of another moral, more down to earth and concrete, and I believe that every militant chemist can confirm it: that one must distrust the almost-the-same (sodium is almost the same as potassium, but with sodium nothing would have happened), the practically identical, the approximate, all surrogates, and all patchwork. The differences can be small, but they can lead to radically different consequences, like a railroad's switch points: the chemist's trade consists in good part of being aware of these differences, knowing them close up and foreseeing their effects. And not only the chemist's trade." - Primo Levi, The Periodic Table

Queueing Theory

Event Cores

Edge and Level Triggering

  • Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) are level-triggered
  • Asynchronous I/O is pretty much by definition edge-triggered.
  • epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
  • FIXME a thorough comparison of these is sorely needed -- in short, edge-triggering saves syscalls and simplifies concurrency (see the sketch below)
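
A minimal sketch of the edge-triggered discipline with epoll: with EPOLLET, readiness is reported only on transitions, so the descriptor must be non-blocking and must be drained to EAGAIN on every wakeup. Error handling is elided for brevity.

    /* Sketch: register fd edge-triggered, then drain it completely on
     * each wakeup. */
    #include <sys/epoll.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    static void event_loop(int epfd, int fd) {
            fcntl(fd, F_SETFL, fcntl(fd, F_GETFL) | O_NONBLOCK);
            struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
            epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
            struct epoll_event events[64];
            for (;;) {
                    int n = epoll_wait(epfd, events, 64, -1);
                    for (int i = 0; i < n; ++i) {
                            char buf[4096];
                            ssize_t r;
                            /* one wakeup may cover many arrivals; read until
                             * EAGAIN or we'll never hear about the remainder */
                            while ((r = read(events[i].data.fd, buf, sizeof buf)) > 0)
                                    ; /* process buf here */
                            if (r == 0 || (r < 0 && errno != EAGAIN))
                                    return; /* EOF or a real error */
                    }
            }
    }

A level-triggered loop could get away with partial reads at the cost of redundant wakeups; that's the syscall savings referred to above.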

A Garden of Interfaces

We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...

  • readv(2), writev(2) (FreeBSD's sendfile(2) has header/trailer struct iovec arrays handily attached, perfect for e.g. the chunked Transfer-Encoding)
  • splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17 (see the relay sketch after this list)
    • (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
  • sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
    • On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
  • the aio_*(3) family for asynchronous I/O
  • mmap(2) and an entire associated bag of tricks (FIXME detail)
    • most uses of mincore(2) and madvise(2) are questionable at best, and more likely useless. FIXME defend
    • broad use of mlock(2) as a performance hack isn't even really questionable, just a bad idea. FIXME defend
    • use of large pages is highly recommended for any large, non-sparse maps. FIXME explain
    • mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
    • There's nothing wrong with MAP_FIXED so long as you've already allocated the region before (see caveats...)
  • User-space networking stacks: The Return of Mach!
    • Linux has long had zero-copy PF_PACKET RX; get ready for zero-copy TX (using the same PACKET_MMAP interface)
  • "Zero-copy" gets banded about a good bit; be sure you're aware of hardware limitations (see FreeBSD's zero_copy(9), for instance)

The Full Monty: A Theory of UNIX Servers

We must mix and match:

  • Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
    • Socket descriptors, pipes
    • File descriptors referring to actual files (these usually have different blocking semantics)
    • Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
    • Timers (timerfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
    • Condition variables and/or mutexes becoming available
    • Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
    • Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
  • One or more event notifiers (epoll or kqueue fd)
  • One or more event vectors, into which notifiers dump events
    • kqueue supports vectorized registration of event changes, extending the issue
  • Threads -- one event notifier per thread? one shared event notifier with one event vector per thread? one shared event notifier feeding one shared event vector? work-stealing/handoff?

"Thread scheduling provides a facility for juggling between clients without further programming; if it is too expensive, the application may benefit from doing the juggling itself. Effectively, the application must implement its own internal scheduler that juggles the state of each client." - George Varghese, Network Algorithmics

    • It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads
    • The Flash web server dynamically spawns and culls helper threads for high-latency I/O operations
    • The contest is between the costs of demultiplexing asynchronous event notifications vs managing threads
      • My opinion: if fast async notifications can be distributed across threads, one thread per processing element always suffices (one such arrangement is sketched below)
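
One arrangement from the list above, sketched under the assumption that kernel-side distribution of wakeups is cheap: a single shared epoll notifier with a private event vector per thread. EPOLLONESHOT hands each event to exactly one thread, which must re-arm the descriptor after servicing it.

    /* Sketch: shared notifier, per-thread event vectors. epfd is created
     * once (epoll_create1(0)); one worker per processing element is
     * launched with pthread_create. */
    #include <sys/epoll.h>
    #include <pthread.h>
    #include <stddef.h>

    #define EVVEC 128

    static int epfd; /* the single shared event notifier */

    static void *worker(void *unused) {
            struct epoll_event evs[EVVEC]; /* this thread's event vector */
            (void)unused;
            for (;;) {
                    int n = epoll_wait(epfd, evs, EVVEC, -1);
                    for (int i = 0; i < n; ++i) {
                            /* ... service evs[i].data.fd ... */
                            /* re-arm: EPOLLONESHOT disabled it on delivery */
                            struct epoll_event ev = {
                                    .events = EPOLLIN | EPOLLET | EPOLLONESHOT,
                                    .data.fd = evs[i].data.fd
                            };
                            epoll_ctl(epfd, EPOLL_CTL_MOD, evs[i].data.fd, &ev);
                    }
            }
            return NULL;
    }

Whether this beats one notifier per thread is exactly the open question above; measure (Principle 3).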

DoS Prevention, or Maximizing Useful Service

  • TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
    • Syncookies historically cost you long-fat-pipe options, restricted you to a few MSS values, etc., but recent work (in Linux, at least) has improved them (my gut feeling: nay)
  • Various attacks like slowloris, the TCP persist-timer abuse written up in Phrack 0x0d-0x42-0x09, Outpost24's sockstress, etc.
  • What are the winning feedbacks? Fractals and queueing theory, oh my! FIXME detail
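
No silver bullets here, but one cheap Linux-side knob relevant to the above, shown as a sketch rather than a defense: syncookies are a sysctl (net.ipv4.tcp_syncookies), and TCP_DEFER_ACCEPT sheds connect-and-sit-silent clients (its exact semantics have varied across kernel versions, and it doesn't defeat slowloris by itself).

    /* Sketch: TCP_DEFER_ACCEPT keeps a connection from being surfaced
     * by accept(2) until data actually arrives, cheaply shedding
     * clients that complete the handshake and then go quiet. */
    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    static int harden_listener(int lfd) {
            int secs = 5; /* how long to wait for first data */
            return setsockopt(lfd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                              &secs, sizeof secs);
    }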

Hardware Esoterica

  • Direct Cache Access must be supported by the NIC, the northbridge chipset, the OS, and the microarchitecture
  • IOMMU / I/OAT
  • Checksum offloading / TSO
  • PCI shared bus/bus-mastering, PCIe slots/lanes (channel grouping), PCI-X, MSI

Operating System Esoterica

  • The Linux networking stack is a boss hawg and a half. Check out the Linux Advanced Routing and Traffic Control (LARTC) HOWTO for details ad nauseam
  • See my TCP page -- auto-tuning is pretty much to be assumed (and best not subverted) in recent Linux/FreeBSD
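
On the "best not subverted" point: explicitly setting SO_RCVBUF pins the buffer size and disables Linux's receive-buffer auto-tuning for that socket, so the once-traditional "tuning" below is nowadays usually a pessimization (shown as what not to do).

    /* Anti-pattern sketch: this locks the receive buffer at a fixed size
     * and switches off kernel auto-tuning for the socket. Don't, unless
     * measurement (Principle 3) shows a win. */
    #include <sys/socket.h>

    static void clamp_rcvbuf(int fd) {
            int sz = 1 << 20; /* one fixed megabyte, forever */
            setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof sz);
    }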

See Also