Fast UNIX Servers: Difference between revisions
From dankwiki
Revision as of 22:45, 25 June 2009
Everyone ought to start with Dan Kegel's classic site, "The C10K Problem" (still updated from time to time). Jeff Darcy's "High-Performance Server Architecture" covers much of the same ground. Everything here is advanced follow-up material to these excellent works and, of course, the books of W. Richard Stevens.
Central Design Principles
- Varghese's "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices" is in a league of its own in this regard.
- Principle 1: Exploit all cycles/bandwidth. Avoid blocking I/O and unnecessary evictions of cache, but prefetch into cache where appropriate. Be prepared to exploit multiple processing elements. Properly align data and avoid cache-aliasing effects. Use jumbo frames in appropriate scenarios and proactively warn on network degradation (e.g., half-duplex Ethernet due to failed link negotiation).
- Principle 2: Don't duplicate work. Avoid unnecessary copies, context switches, system calls and signals. Use double-buffering or calls like Linux's splice(2).
- Principle 3: ...?
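As a rough illustration of Principle 2, the sketch below moves bytes between two descriptors with Linux's splice(2), staging them in a pipe so the payload is never copied into a user-space buffer. The names copy_via_splice and demo_copy, the /tmp paths and the payload are hypothetical, chosen only for this example; it assumes a Linux kernel and filesystem that support splice(2) on regular files.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Move up to len bytes from in_fd to out_fd through a pipe.
 * Returns the number of bytes moved, or -1 on error.
 * (copy_via_splice is a hypothetical helper, not a library call.) */
ssize_t copy_via_splice(int in_fd, int out_fd, size_t len)
{
    int p[2];
    ssize_t total = 0;

    if (pipe(p) < 0)
        return -1;
    while (len > 0) {
        /* Stage 1: source -> pipe (at most one pipe's worth per call). */
        ssize_t n = splice(in_fd, NULL, p[1], NULL, len, SPLICE_F_MOVE);
        if (n == 0)
            break;                          /* end of input */
        if (n < 0) { total = -1; break; }
        len -= (size_t)n;
        /* Stage 2: drain the pipe -> destination. */
        while (n > 0) {
            ssize_t m = splice(p[0], NULL, out_fd, NULL, (size_t)n, SPLICE_F_MOVE);
            if (m <= 0) { total = -1; len = 0; break; }
            n -= m;
            total += m;
        }
    }
    close(p[0]);
    close(p[1]);
    return total;
}

/* Usage: copy a small temporary file and verify the result. Returns 0 on success. */
int demo_copy(void)
{
    char src[] = "/tmp/splice-src-XXXXXX", dst[] = "/tmp/splice-dst-XXXXXX";
    static const char msg[] = "zero-copy payload";
    char back[sizeof msg] = { 0 };
    int in = mkstemp(src), out = mkstemp(dst), ok = -1;

    assert(in >= 0 && out >= 0);            /* temp files are a precondition */
    if (write(in, msg, sizeof msg) == (ssize_t)sizeof msg &&
        lseek(in, 0, SEEK_SET) == 0 &&
        copy_via_splice(in, out, 1 << 16) == (ssize_t)sizeof msg &&
        pread(out, back, sizeof back, 0) == (ssize_t)sizeof back &&
        memcmp(msg, back, sizeof msg) == 0)
        ok = 0;
    close(in);  unlink(src);
    close(out); unlink(dst);
    return ok;
}
```

The intermediate pipe is needed because splice(2) requires at least one end of each transfer to be a pipe. A real server would keep the pipe around rather than creating one per call, and would handle EINTR and EAGAIN.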
Queueing Theory
- "Introduction to Queueing"
- Leonard Kleinrock's peerless Queueing Systems (Volume 1: Theory, Volume 2: Computer Applications)
Event Cores
- epoll on Linux, /dev/poll on Solaris, kqueue on FreeBSD
- liboop, libev and libevent
- Ulrich Drepper's "The Need for Asynchronous, Zero-Copy Network I/O"
- If nothing else, Drepper's plans tend to become sudden and crushing realities in the glibc world
Edge and Level Triggering
- Historic interfaces like POSIX.1g/POSIX.1-2001's select(2) and POSIX.1-2001's poll(2) were level-triggered
- epoll (via EPOLLET) and kqueue (via EV_CLEAR) provide edge-triggered semantics
- FIXME: a thorough comparison of these is sorely needed
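Pending that comparison, here is a minimal sketch of the difference on Linux. One byte is written to a pipe and never read; the read end is then polled repeatedly through epoll. The helper count_reports is a hypothetical name introduced for this example. With level triggering every poll reports the descriptor, while with EPOLLET only the first does, because no new data arrives afterwards.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Count how many of `polls` consecutive epoll_wait() calls report a pipe
 * as readable when its single byte of data is never drained.
 * extra_flags is 0 (level-triggered) or EPOLLET (edge-triggered).
 * Returns -1 on setup failure. */
int count_reports(uint32_t extra_flags, int polls)
{
    int p[2], reports = 0;
    int ep = epoll_create1(0);

    if (ep < 0)
        return -1;
    if (pipe(p) < 0) {
        close(ep);
        return -1;
    }

    struct epoll_event ev = { .events = EPOLLIN | extra_flags };
    ev.data.fd = p[0];
    if (epoll_ctl(ep, EPOLL_CTL_ADD, p[0], &ev) < 0 ||
        write(p[1], "x", 1) != 1)
        reports = -1;

    for (int i = 0; reports >= 0 && i < polls; i++) {
        struct epoll_event got;
        int n = epoll_wait(ep, &got, 1, 0);   /* timeout 0: poll, don't block */
        reports = n < 0 ? -1 : reports + n;
    }

    close(p[0]);
    close(p[1]);
    close(ep);
    return reports;
}

/* Usage: the same undrained pipe, polled three times each way. */
void demo_triggering(void)
{
    assert(count_reports(0, 3) == 3);         /* level: reported every time */
    assert(count_reports(EPOLLET, 3) == 1);   /* edge: reported once */
}
```

The practical consequence is that an edge-triggered consumer must read until EAGAIN on a non-blocking descriptor, or it may never be told about the remaining data.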
A Garden of Interfaces
We all know doddering old read(2) and write(2) (which can't, by the way, be portably used with shared memory). But what about...
- readv(2), writev(2) (FreeBSD's sendfile(2) handily accepts struct iovec headers and trailers via its struct sf_hdtr argument)
- splice(2), vmsplice(2) and tee(2) on Linux since version 2.6.17
- (When the first page of results for your interface centers largely on exploits, might it be time to reconsider your design assumptions?)
- sendfile(2) (with charmingly different interfaces on FreeBSD and Linux)
- On Linux since 2.6.2x (FIXME get a link), sendfile(2) is implemented in terms of splice(2)
- aio_* and friends for asynchronous I/O
- mmap(2) and an entire associated bag of tricks (FIXME detail)
- most uses of mincore(2) and madvise(2) are questionable at best and more likely useless. FIXME defend
- broad use of mlock(2) as a performance hack is not even really questionable FIXME defend
- use of large pages is highly recommended for any large, non-sparse maps FIXME explain
- mremap(2) and remap_file_pages(2) on Linux can be used effectively at times
- There's nothing wrong with MAP_FIXED so long as you've already allocated the region before (see caveats...)
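One reading of the MAP_FIXED remark above, sketched under Linux assumptions: reserve a span of address space with an inaccessible anonymous mapping, then use MAP_FIXED only to replace pieces of that reservation. The function names reserve_region, commit_range and demo_fixed_commit are hypothetical, as are the sizes used. MAP_FIXED silently replaces whatever is already mapped at the target address, which is why it is only safe inside a range the caller already owns.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Reserve `total` bytes of address space with no access rights.
 * While the reservation exists, the kernel will not place any other
 * mapping in the range on its own. Returns NULL on failure. */
void *reserve_region(size_t total)
{
    void *p = mmap(NULL, total, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Replace [base+off, base+off+len) with readable, writable memory.
 * off must be page-aligned and the range must lie inside a region the
 * caller reserved earlier; nothing here checks the second condition. */
void *commit_range(void *base, size_t off, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    void *p;

    assert(off % page == 0 && len > 0);
    p = mmap((char *)base + off, len, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Usage: reserve 16 pages, make pages 4 and 5 usable, touch both ends.
 * Returns 0 on success. */
int demo_fixed_commit(void)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *base = reserve_region(16 * page);
    char *win;
    int ok;

    if (!base)
        return -1;
    win = commit_range(base, 4 * page, 2 * page);
    ok = (win == base + 4 * page);      /* MAP_FIXED lands exactly where asked */
    if (ok) {
        win[0] = 'a';
        win[2 * page - 1] = 'z';
        ok = (win[0] == 'a' && win[2 * page - 1] == 'z');
    }
    munmap(base, 16 * page);            /* releases reservation and window */
    return ok ? 0 : -1;
}
```

The same pattern works with a file descriptor in place of MAP_ANONYMOUS in commit_range, for example to lay several file mappings out contiguously.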
The Full Monty: A Theory of UNIX Servers
We must mix and match:
- Many event sources, of multiple types and possibly various triggering mechanisms (edge- vs level-triggered):
- Socket descriptors, pipes
- File descriptors referring to actual files (these usually have different blocking semantics)
- Signals, perhaps being used for asynchronous I/O with descriptors (signalfd(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_SIGNAL events)
- Timers (timerfd_create(2) on Linux unifies these with socket descriptors; kqueue supports EVFILT_TIMER events)
- Condition variables and/or mutexes becoming available
- Filesystem events (inotify(7) on Linux, EVFILT_VNODE with kqueue)
- Networking events (netlink(7) (PF_NETLINK) sockets on Linux, EVFILT_NETDEV with kqueue)
- One or more event notifiers (epoll or kqueue fd)
- One or more event vectors, into which notifiers dump events
- kqueue supports vectorized registration of event changes, extending the issue
- Threads -- one event notifier per? one shared event notifier with one event vector per? one shared event notifier feeding one shared event vector? work-stealing/handoff?
- It is doubtful (but not, AFAIK, proven impossible) that one scheduling/sharing solution is optimal for all workloads
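To make the mixing concrete, here is a small Linux-only sketch of the simplest arrangement: one thread, one event notifier and one event vector, with a signal and a timer both delivered as ordinary readable descriptors. The function run_unified_loop and the SAW_* flags are hypothetical names for this example; the choice of SIGUSR1 and a 10 ms one-shot timer is arbitrary.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

enum { SAW_SIGNAL = 1, SAW_TIMER = 2 };

/* Wait on a signalfd and a timerfd through one epoll instance.
 * Returns a bitmask of SAW_* flags, or -1 if setup fails. */
int run_unified_loop(void)
{
    struct itimerspec its = { .it_value = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 } };
    struct epoll_event ev = { .events = EPOLLIN };
    sigset_t mask, old;
    int ep, sfd, tfd, seen = -1;

    sigemptyset(&mask);
    sigaddset(&mask, SIGUSR1);
    /* Block normal delivery so the signal stays queued for the signalfd. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    ep = epoll_create1(0);
    sfd = signalfd(-1, &mask, SFD_NONBLOCK);
    tfd = timerfd_create(CLOCK_MONOTONIC, TFD_NONBLOCK);
    if (ep >= 0 && sfd >= 0 && tfd >= 0 &&
        timerfd_settime(tfd, 0, &its, NULL) == 0) {
        ev.data.fd = sfd;
        if (epoll_ctl(ep, EPOLL_CTL_ADD, sfd, &ev) == 0) {
            ev.data.fd = tfd;
            if (epoll_ctl(ep, EPOLL_CTL_ADD, tfd, &ev) == 0)
                seen = 0;
        }
    }

    if (seen == 0)
        raise(SIGUSR1);                 /* becomes pending, not delivered */

    while (seen >= 0 && seen != (SAW_SIGNAL | SAW_TIMER)) {
        struct epoll_event evs[4];      /* the event vector */
        int n = epoll_wait(ep, evs, 4, 1000);
        if (n <= 0)
            break;                      /* error or 1 s timeout: give up */
        for (int i = 0; i < n; i++) {
            if (evs[i].data.fd == sfd) {
                struct signalfd_siginfo si;
                if (read(sfd, &si, sizeof si) == (ssize_t)sizeof si &&
                    si.ssi_signo == SIGUSR1)
                    seen |= SAW_SIGNAL;
            } else if (evs[i].data.fd == tfd) {
                uint64_t expirations;
                if (read(tfd, &expirations, sizeof expirations) ==
                    (ssize_t)sizeof expirations)
                    seen |= SAW_TIMER;
            }
        }
    }

    if (ep >= 0) close(ep);
    if (sfd >= 0) close(sfd);
    if (tfd >= 0) close(tfd);
    sigprocmask(SIG_SETMASK, &old, NULL);
    return seen;
}
```

Sockets and pipes would be added to the same epoll instance in the same way. The threading questions above (one notifier per thread, a shared notifier, work handoff) begin where this single-threaded sketch ends.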
DoS Prevention or, Maximizing Useful Service
- TCP SYN -- to Syncookie or nay? The "half-open session" isn't nearly as meaningful or important a concept on modern networking stacks as it was in 2000.
- Syncookies have historically cost the connection its long-fat-pipe options (window scaling, SACK), allowed fewer MSS values, etc...but recent work (in Linux, at least) has improved them (my gut feeling: nay)
- Various attacks like slowloris, TCPPersist as written up in Phrack 0x0d-0x42-0x09, Outpost24 etc...
- What are the winning feedbacks? Fractals and queueing theory, oh my! FIXME detail
See Also
- "sendfile(): fairly sexy (nothing to do with ECN)" on lkml
- "mmap() sendfile()" on freebsd-hackers
- "sharing memory map between processes (same parent)" on comp.unix.programmer
- "some mmap observations compared to Linux 2.6/OpenBSD" on freebsd-hackers
- Stuart Cheshire's "Laws of Networkdynamics" and "It's the Latency, Stupid"
- "mremap help? or no support for FreeBSD?" on freebsd-hackers
- "Edge-triggered interfaces are too difficult?" on LWN, 2003-05-16
- "Edge- vs Level-Triggered Events" on Pierre Phaneuf's livejournal (pphaneuf)
- "edge-triggered vs level-triggered epoll in kernel 2.6" on comp.unix.programmer, 2004-12-01