Fast UNIX Servers
Revision as of 00:40, 26 June 2009
Everyone ought to start with Dan Kegel's classic site, "The C10K Problem" (still updated from time to time). Jeff Darcy's "High-Performance Server Architecture" covers much of the same ground. Everything here is advanced followup material to these excellent works, and of course the books of W. Richard Stevens.
"I love the smell of 10GbE in the morning. Smells like...victory." - W. Richard Stevens, Secret Teachings Regarding the UNIX Environment
Central Design Principles
Varghese's "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices" is in a league of its own in this regard.
- Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate. Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
- Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or calls like Linux's splice(2).
- Principle 3: Measure, measure, and measure again, preferably automatically. Hardware, software and networks will all surprise you. Become friends with your hardware's performance counters.
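Principles 2 and 3 combine naturally: count the system calls you could avoid, then measure whether avoiding them matters on your hardware. The sketch below is only an illustration, assuming a Linux-like system with /dev/null; the helper names (timed, write_each, write_gathered, compare) are hypothetical and not taken from any source cited here.

```python
import os
import time


def timed(fn, *args):
    """Return (result, elapsed nanoseconds) for a single call to fn."""
    start = time.perf_counter_ns()
    result = fn(*args)
    return result, time.perf_counter_ns() - start


def write_each(fd, chunks):
    """One write(2) per chunk: len(chunks) system calls."""
    return sum(os.write(fd, chunk) for chunk in chunks)


def write_gathered(fd, chunks):
    """One writev(2) for all chunks: a single system call."""
    return os.writev(fd, chunks)


def compare(chunks):
    """Time both strategies against /dev/null; return bytes written and durations."""
    fd = os.open("/dev/null", os.O_WRONLY)
    try:
        each_bytes, each_ns = timed(write_each, fd, chunks)
        gathered_bytes, gathered_ns = timed(write_gathered, fd, chunks)
    finally:
        os.close(fd)
    return {"each": (each_bytes, each_ns), "gathered": (gathered_bytes, gathered_ns)}
```

Calling compare([b"x" * 64] * 256) returns the byte counts and timings for both strategies. A single run proves little; repeat it, and check the result against hardware performance counters where you can.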
"I thought of another moral, more down to earth and concrete, and I believe that every militant chemist can confirm it: that one must distrust the almost-the-same (sodium is almost the same as potassium, but with sodium nothing would have happened), the practically identical, the approximate, all surrogates, and all patchwork. The differences can be small, but they can lead to radically different consequences, like a railroad's switch points: the chemist's trade consists in good part of being aware of these differences, knowing them close up and foreseeing their effects. And not only the chemist's trade." - Primo Levi, The Periodic Table
Queueing Theory
- "Introduction to Queueing"
- Leonard Kleinrock's peerless Queueing Systems (Volume 1: Theory, Volume 2: Computer Applications)
Event Cores
- epoll on Linux, /dev/poll on Solaris, kqueue on FreeBSD
- liboop, libev and libevent
- Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O"
- If nothing else, Drepper's plans tend to become sudden and crushing realities in the glibc world
Edge and Level Triggering
- Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) were level-triggered
- Asynchronous I/O is pretty much by definition edge-triggered.
- epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
- FIXME: a thorough comparison of these is sorely needed. In short, edge-triggering saves system calls and makes concurrency easier
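The difference is easiest to see by polling twice without consuming the data. This is a minimal sketch assuming Linux and Python's select.epoll wrapper; poll_twice and drain are hypothetical helper names. A level-triggered registration reports the unread data on every poll, while EPOLLET reports it once, so an edge-triggered reader has to drain the descriptor until EAGAIN.

```python
import select
import socket


def poll_twice(flags):
    """Register one socket with `flags`, make it readable, poll twice without reading.

    Returns the number of events reported by each poll.
    """
    reader, writer = socket.socketpair()
    ep = select.epoll()
    try:
        ep.register(reader.fileno(), flags)
        writer.send(b"ping")      # `reader` becomes readable
        first = ep.poll(1.0)      # both modes report the new data
        second = ep.poll(0)       # only level-triggered reports it again
        return len(first), len(second)
    finally:
        ep.close()
        reader.close()
        writer.close()


def drain(sock):
    """Read a non-blocking socket until it would block, as edge-triggering requires."""
    data = bytearray()
    while True:
        try:
            chunk = sock.recv(4096)
        except BlockingIOError:   # EAGAIN: nothing left for now
            return bytes(data)
        if not chunk:             # peer closed the connection
            return bytes(data)
        data += chunk
```

poll_twice(select.EPOLLIN) reports an event on both polls; poll_twice(select.EPOLLIN | select.EPOLLET) reports one only on the first.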
A Garden of Interfaces
We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...
- readv(2), writev(2) (FreeBSD's sendfile(2) handily accepts header and trailer struct iovec arrays through its struct sf_hdtr argument)
- splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17
- (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
- sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
- On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
- aio_* and friends for asynchronous I/O
- mmap(2) and an entire associated bag of tricks (FIXME detail)
- most uses of mincore(2) and madvise(2) are questionable at best and most likely useless. FIXME defend
- broad use of mlock(2) as a performance hack is not even really questionable, just a bad idea FIXME defend
- use of large pages is highly recommended for any large, non-sparse maps FIXME explain
- mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
- There's nothing wrong with MAP_FIXED so long as you've already allocated the region before (see caveats...)
- User-space networking stacks: The Return of Mach!
- Linux has long had zero-copy PF_PACKET RX; get ready for zero-copy TX (using the same PACKET_MMAP interface)
- "Zero-copy" gets bandied about a good bit; be sure you're aware of hardware limitations (see FreeBSD's zero_copy(9), for instance)
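Two of the interfaces above fit together in the common "header plus file body" reply. The sketch below assumes Linux, a blocking stream socket, and Python's os.sendfile and socket.sendmsg wrappers; send_file_with_header is a hypothetical helper, not an API from any of the systems named. The header goes out in one gather write (the writev(2) idea), and the file body is moved by the kernel without passing through a user-space buffer.

```python
import os
import socket


def send_file_with_header(sock, header_parts, path):
    """Send header buffers with one gather write, then the file via sendfile(2).

    Returns the total number of bytes handed to the socket.
    """
    header_len = sum(len(part) for part in header_parts)
    sent = sock.sendmsg(header_parts)        # gather write over several buffers
    if sent < header_len:                    # finish a short write
        sock.sendall(b"".join(header_parts)[sent:])

    with open(path, "rb") as src:
        remaining = os.fstat(src.fileno()).st_size
        offset = 0
        while remaining > 0:
            # The kernel copies file pages to the socket directly.
            n = os.sendfile(sock.fileno(), src.fileno(), offset, remaining)
            if n == 0:                       # file shrank underneath us
                break
            offset += n
            remaining -= n
    return header_len + offset
```

On FreeBSD the header could ride along in the sendfile(2) call itself; on Linux it takes the two steps shown.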
The Full Monty: A Theory of UNIX Servers
We must mix and match:
- Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
- Socket descriptors, pipes
- File descriptors referring to actual files (these usually have different blocking semantics)
- Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
- Timers (timerfd_create(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
- Condition variables and/or mutexes becoming available
- Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
- Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
- One or more event notifiers (epoll or kqueue fd)
- One or more event vectors, into which notifiers dump events
- kqueue supports vectorized registration of event changes, extending the issue
- Threads -- one event notifier per? one shared event notifier with one event vector per? one shared event notifier feeding one shared event vector? work-stealing/handoff?
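As a rough illustration of folding several event-source types into one notifier, the sketch below waits on socket data and on a signal through a single epoll instance. It assumes Linux and Python 3.8 or later, and it must run in the main thread. Python's standard library has no signalfd(2) wrapper, so signal.set_wakeup_fd stands in for it here (a swapped-in technique, with the same effect of turning a signal into a readable descriptor). The demo function and the source names are hypothetical.

```python
import select
import signal
import socket


def demo():
    """Wait on a data socket and on SIGUSR1 through one epoll instance."""
    data_rx, data_tx = socket.socketpair()
    sig_rx, sig_tx = socket.socketpair()
    sig_rx.setblocking(False)
    sig_tx.setblocking(False)        # set_wakeup_fd requires a non-blocking fd

    # A Python-level handler must exist, or SIGUSR1 would kill the process.
    old_handler = signal.signal(signal.SIGUSR1, lambda signum, frame: None)
    old_wakeup = signal.set_wakeup_fd(sig_tx.fileno())
    ep = select.epoll()
    try:
        ep.register(data_rx.fileno(), select.EPOLLIN)
        ep.register(sig_rx.fileno(), select.EPOLLIN)
        names = {data_rx.fileno(): "socket", sig_rx.fileno(): "signal"}

        data_tx.send(b"hello")                  # source 1: socket data
        signal.raise_signal(signal.SIGUSR1)     # source 2: a signal

        ready = {names[fd] for fd, _events in ep.poll(1.0)}
        signums = list(sig_rx.recv(16))         # one byte per delivered signal
        return ready, signums
    finally:
        ep.close()
        signal.set_wakeup_fd(old_wakeup)
        signal.signal(signal.SIGUSR1,
                      old_handler if old_handler is not None else signal.SIG_DFL)
        for s in (data_rx, data_tx, sig_rx, sig_tx):
            s.close()
```

A single poll returns both sources, so the dispatch loop needs no special case for signals.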
"Thread scheduling provides a facility for juggling between clients without further programming; if it is too expensive, the application may benefit from doing the juggling itself. Effectively, the application must implement its own internal scheduler that juggles the state of each client." - George Varghese, Network Algorithmics
- It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads
- The Flash web server dynamically spawns and culls helper threads for high-latency I/O operations
- The contest is between the costs of demultiplexing asynchronous event notifications vs managing threads
- My opinion: if fast async notifications can be distributed across threads, one thread per processing element always suffices
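One of the arrangements asked about above, a shared event notifier with one event vector per thread, can be sketched as follows. This assumes Linux and Python's select.epoll; worker and run are hypothetical names, and nthreads would normally be the number of processing elements. EPOLLONESHOT ensures each ready descriptor is handed to exactly one thread, which then re-arms it after handling the event.

```python
import select
import socket
import threading
import time


def worker(ep, socks, results, lock, stop):
    """Pull events from the shared epoll into this thread's own event vector."""
    while not stop.is_set():
        for fd, _events in ep.poll(0.05, 8):     # at most 8 events per wakeup
            data = socks[fd].recv(4096)
            with lock:
                results.append((threading.current_thread().name, data))
            # EPOLLONESHOT disarmed this fd when it fired; re-arm it now.
            ep.modify(fd, select.EPOLLIN | select.EPOLLONESHOT)


def run(messages, nthreads=2):
    """Send each message on its own socket pair; return what the workers read."""
    pairs = [socket.socketpair() for _ in messages]
    socks = {rx.fileno(): rx for rx, _tx in pairs}
    ep = select.epoll()
    for fd in socks:
        ep.register(fd, select.EPOLLIN | select.EPOLLONESHOT)

    results, lock, stop = [], threading.Lock(), threading.Event()
    threads = [threading.Thread(target=worker,
                                args=(ep, socks, results, lock, stop))
               for _ in range(nthreads)]
    for t in threads:
        t.start()
    for (_rx, tx), msg in zip(pairs, messages):
        tx.send(msg)

    deadline = time.monotonic() + 5.0            # give up after five seconds
    while time.monotonic() < deadline:
        with lock:
            if len(results) == len(messages):
                break
        time.sleep(0.01)

    stop.set()
    for t in threads:
        t.join()
    ep.close()
    for rx, tx in pairs:
        rx.close()
        tx.close()
    return results
```

Which thread handles which message is up to the kernel; only the set of messages read is deterministic.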
DoS Prevention or, Maximizing Useful Service
- TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
- Syncookies cost you long-fat-pipe options (window scaling, SACK) and restrict the MSS to a handful of values, though recent work (in Linux, at least) has improved matters (my gut feeling: nay)
- Various attacks like slowloris, TCPPersist as written up in Phrack 0x0d-0x42-0x09, Outpost24 etc...
- What are the winning feedbacks? fractals and queueing theory, oh my! fixme detail
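Against slowloris-style attacks, which hold connections open by trickling request headers, one possible countermeasure (an assumption of this sketch, not something the sources above prescribe) is a hard deadline for completing the header, in place of an idle timeout that every trickled byte resets. The HeaderDeadlines class and its method names are hypothetical.

```python
import time


class HeaderDeadlines:
    """Track a hard deadline by which each connection must finish its header."""

    def __init__(self, limit_seconds, clock=time.monotonic):
        self.limit = limit_seconds
        self.clock = clock           # injectable, so tests can control time
        self.deadlines = {}          # connection id -> absolute deadline

    def opened(self, conn_id):
        """Start the clock when the connection is accepted."""
        self.deadlines[conn_id] = self.clock() + self.limit

    def header_complete(self, conn_id):
        """Stop tracking once the full request header has arrived."""
        self.deadlines.pop(conn_id, None)

    def expired(self):
        """Return connections to drop; partial data never extends a deadline."""
        now = self.clock()
        dead = [c for c, d in self.deadlines.items() if d <= now]
        for c in dead:
            del self.deadlines[c]
        return dead
```

An event loop would call expired() on each timer tick and close whatever it returns. A linear scan is fine for a sketch; a timer wheel or heap would be the usual choice at scale.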
Hardware Esoterica
- Direct Cache Access must be supported by NICs, northbridge chipset, OS and microarchitecture
- Checksum offloading / TSO
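For a sense of the work checksum offloading takes off the CPU, here is the 16-bit ones'-complement Internet checksum as described in RFC 1071, which the TCP, UDP and IP checksums are built on. It is a plain reference sketch, not how any particular stack or NIC implements it; the function name is hypothetical.

```python
def internet_checksum(data):
    """Return the 16-bit ones'-complement checksum of `data` (RFC 1071)."""
    data = bytes(data)
    if len(data) % 2:
        data += b"\x00"                          # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]    # big-endian 16-bit words
    while total >> 16:                           # fold carries back in
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF
```

Every payload byte is touched once per packet, which is why pushing this into the NIC pays off at 10GbE rates.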
See Also
- "sendfile(): fairly sexy (nothing to do with ECN)" on lkml
- "mmap() sendfile()" on freebsd-hackers
- "sharing memory map between processes (same parent)" on comp.unix.programmer
- "some mmap observations compared to Linux 2.6/OpenBSD" on freebsd-hackers
- Stuart Cheshire's "Laws of Networkdynamics" and "It's the Latency, Stupid"
- "mremap help? or no support for FreeBSD?" on freebsd-hackers
- "Edge-triggered interfaces are too difficult?" on LWN, 2003-05-16
- "Edge- vs Level-Triggered Events" on Pierre Phaneuf's livejournal (pphaneuf)
- "edge-triggered vs level-triggered epoll in kernel 2.6" on comp.unix.programmer, 2004-12-01
- Ian Barile's 2004-02 Dr. Dobb's Journal article, "I/O Multiplexing & Scalable Socket Servers"