Fast UNIX Servers

From dankwiki
Revision as of 22:47, 25 June 2009

Everyone ought to start with Dan Kegel's classic site, "The C10K Problem" (still updated from time to time). Jeff Darcy's "High-Performance Server Architecture" covers much of the same ground. Everything here is advanced follow-up material to these excellent works and, of course, to the books of W. Richard Stevens.

Central Design Principles

  • Varghese's "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices" is in a league of its own in this regard.
  • Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate. Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
  • Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or calls like Linux's splice(2).
  • Principle 3: Measure, measure, and measure again, preferably automatically. Hardware, software and networks will all surprise you.
"I thought of another moral, more down to earth and concrete, and I believe that every militant chemist can confirm it: that one must distrust the almost-the-same (sodium is almost the same as potassium, but with sodium nothing would have happened), the practically identical, the approximate, all surrogates, and all patchwork. The differences can be small, but they can lead to radically different consequences, like a railroad's switch points: the chemist's trade consists in good part of being aware of these differences, knowing them close up and foreseeing their effects. And not only the chemist's trade." - Primo Levi, "The Periodic Table"

Queueing Theory

Event Cores

Edge and Level Triggering

  • Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) were level-triggered
  • epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
  • FIXME a thorough comparison of these is sorely needed
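
The difference between the two triggering modes can be seen with a pipe holding one unread byte. This is only a sketch, assuming Linux epoll; count_ready_reports is a name made up for this example. It registers the read end, writes one byte, never reads it, and counts how many of three consecutive non-blocking epoll_wait(2) calls report the descriptor as ready:

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper. extra_flags is 0 for level-triggered behavior or
 * EPOLLET for edge-triggered behavior. Returns the number of readiness
 * reports seen across three polls (0..3), or -1 on error. */
int count_ready_reports(unsigned extra_flags)
{
    int pipefd[2];
    int reports = 0;

    if (pipe(pipefd) < 0)
        return -1;
    int ep = epoll_create1(0);
    if (ep < 0) {
        close(pipefd[0]);
        close(pipefd[1]);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    ev.data.fd = pipefd[0];

    if (epoll_ctl(ep, EPOLL_CTL_ADD, pipefd[0], &ev) < 0 ||
        write(pipefd[1], "x", 1) != 1) {
        reports = -1;
    } else {
        /* The byte is never read, so the pipe stays readable throughout. */
        for (int i = 0; i < 3; i++) {
            struct epoll_event out;
            int n = epoll_wait(ep, &out, 1, 0); /* timeout 0: just poll */
            if (n < 0) {
                reports = -1;
                break;
            }
            reports += n;
        }
    }

    close(ep);
    close(pipefd[0]);
    close(pipefd[1]);
    return reports;
}
```

With extra_flags of 0 the unread byte is reported on every poll, so the count is 3. With EPOLLET only the transition to readable is reported, so the count is 1. This is why an edge-triggered consumer has to drain the descriptor until EAGAIN before waiting again.
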

A Garden of Interfaces

We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...

  • readv(2), writev(2) (FreeBSD's sendfile(2) has struct iovec header and trailer arrays handily attached, via struct sf_hdtr)
  • splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17
    • (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
  • sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
    • On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
  • aio_* and friends for asynchronous I/O
  • mmap(2) and an entire associated bag of tricks (FIXME detail)
    • most uses of mincore(2) and madvise(2) are questionable at best and, more likely, useless. FIXME defend
    • broad use of mlock(2) as a performance hack is not even really questionable FIXME defend
    • use of large pages is highly recommended for any large, non-sparse maps FIXME explain
    • mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
    • There's nothing wrong with MAP_FIXED so long as you've already allocated the region you are mapping over (see caveats...)

The Full Monty: A Theory of UNIX Servers

We must mix and match:

  • Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
    • Socket descriptors, pipes
    • File descriptors referring to actual files (these usually have different blocking semantics)
    • Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
    • Timers (timerfd_create(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
    • Condition variables and/or mutexes becoming available
    • Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
    • Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
  • One or more event notifiers (epoll or kqueue fd)
  • One or more event vectors, into which notifiers dump events
    • kqueue supports vectorized registration of event changes, extending the issue
  • Threads: one event notifier per thread? One shared event notifier with one event vector per thread? One shared event notifier feeding one shared event vector? Work-stealing/handoff?
    • It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads

DoS Prevention or, Maximizing Useful Service

  • TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
    • Syncookies have traditionally cost you long-fat-pipe options (window scaling, SACK), restricted you to a few MSS values, etc... but recent work (in Linux, at least) has improved this (my gut feeling: nay)
  • Various attacks like slowloris, TCPPersist as written up in Phrack 0x0d-0x42-0x09, Outpost24 etc...
  • What are the winning feedbacks? Fractals and queueing theory, oh my! FIXME detail

See Also