
Libtorque

Revision as of 06:50, 5 November 2009

libtorque is a multithreaded event library for UNIX designed to take full advantage of the manycore, heterogeneous, NUMA future. The project proposal suggests motivation for libtorque: I believe it necessary to take scheduling and memory-placement decisions into account to optimally handle events, especially on manycore machines and especially to handle unexpected traffic sets (denial-of-service attacks, oversubscribed pipes, mixed-latency connections, etc). Along the way, I intend to shake up the UNIX programming idiom; my hope is that libtorque leads to more network programmers thinking about the complex issues involved, but simplifies rather than aggravates a task already fraught with difficulty.

Previous, non-threaded event libraries include libevent, libev and liboop.

Resources

Design/Functionality

libtorque exposes an affinity-managing, continuations-based, architecture-aware, multithreaded scheduler and various utility functions, including an architecture-, OS-, and thread-aware allocator with strong scheduler feedbacks. It can analyze arbitrary object code via libdl and libelf, discover where instructions live, and allocate around those areas. It can take dynamically balanced or (static) asymmetric interrupt loads into account. By making service decisions and allocations based on whole-system effects, libtorque provides low latency and high throughput under even pedantically asymmetric, irregular loads. By carefully distributing edge-triggered event descriptors among various processors' notification sets, highly scalable use is made of advanced low-level primitives such as kqueue and epoll. Through the use of lazy asynchronous I/O, each (expensive!) core is kept busy doing real work.

  • Q: Why aren't you letting the OS manage scheduling, thus following the advice of just about everyone (assuming no real-time requirements)?
  • A: Scalable event notification on UNIX is based around stateful polling (even in signal-driven POSIX asynchronous I/O, we want to use stateful polling on signals both for performance and event unification). The distribution of these events, at heart, drives scheduling. Since we must influence this distribution, we must make scheduling decisions.
  • Q: Why not just launch a thread per configured processing element, and use an existing event-handling library?
  • A: These threads will have to communicate with one another to spread around event sources. If reshuffling is desired, they must communicate further; if not, the system can be forced into dynamic behavior. The information most pertinent to this shuffling is at the event-handling level; we thus make the decisions there.

System discovery

  • Full support for CPUID as most recently defined by Intel and AMD (more advanced, as of 2009-10-31, than x86info)
  • Full support for Linux and FreeBSD's native cpuset libraries, and SGI's libcpuset and libNUMA
  • Discovers and makes available, for each processor type:
    • ISA, ISA-specific capabilities, and number of concurrent threads supported (degrees of SMT)
    • Line count, associativity, line length, geometry, and type of all caches
    • Entry count, associativity, page size and type of all TLBs
    • Inclusiveness relationships among cache and TLB levels
    • Interconnection topology, APIC IDs, and how caches are shared among processors
    • More: properties of hardware prefetching, ability to support non-temporal loads (MOVNTDQA, PREFETCHNTA, etc)
  • Discovers and makes available, for each memory node type:
    • Connected processor groups and relative distance information
    • Number of pages and bank geometry
    • More: OS page NUMA/prefetching policies, error-recovery info

archdetect

  • Utility built/packaged with libtorque:
(   1x) Memory node 1 of 1:
	12292928KB (11.723 GB) total, 4KB pages
(  16x) Processing unit type 1 of 1:
	Brand name: Intel(R) Xeon(R) CPU           E5520  @ 2.27GHz (OEM)
	Family: 0x006 (6) Model: 0x1a (26) Stepping: 5
	2 threads per processing core, 8 cores per package
	Cpuset ID 0: Type 1, SMT 1 Core 1 Package 2
	Cpuset ID 1: Type 1, SMT 1 Core 1 Package 1
	Cpuset ID 2: Type 1, SMT 1 Core 2 Package 2
	Cpuset ID 3: Type 1, SMT 1 Core 2 Package 1
	Cpuset ID 4: Type 1, SMT 1 Core 3 Package 2
	Cpuset ID 5: Type 1, SMT 1 Core 3 Package 1
	Cpuset ID 6: Type 1, SMT 1 Core 4 Package 2
	Cpuset ID 7: Type 1, SMT 1 Core 4 Package 1
	Cpuset ID 8: Type 1, SMT 2 Core 1 Package 2
	Cpuset ID 9: Type 1, SMT 2 Core 1 Package 1
	Cpuset ID 10: Type 1, SMT 2 Core 2 Package 2
	Cpuset ID 11: Type 1, SMT 2 Core 2 Package 1
	Cpuset ID 12: Type 1, SMT 2 Core 3 Package 2
	Cpuset ID 13: Type 1, SMT 2 Core 3 Package 1
	Cpuset ID 14: Type 1, SMT 2 Core 4 Package 2
	Cpuset ID 15: Type 1, SMT 2 Core 4 Package 1
	Cache 1 of 4: 32KB total, 64B line, 8-assoc, 2-shared (L1 data)
	Cache 2 of 4: 32KB total, 64B line, 4-assoc, 2-shared (L1 code)
	Cache 3 of 4: 256KB total, 64B line, 8-assoc, 2-shared (L2 unified)
	Cache 4 of 4: 8MB total, 64B line, 16-assoc, 16-shared (L3 unified)
	TLB 1 of 5: 4KB pages, 7-entry, 7-assoc, unshared (L2 code)
	TLB 2 of 5: 4KB pages, 64-entry, 4-assoc, unshared (L2 data)
	TLB 3 of 5: 4MB pages, 32-entry, 4-assoc, unshared (L1 data)
	TLB 4 of 5: 4KB pages, 64-entry, 4-assoc, unshared (L1 code)
	TLB 5 of 5: 4KB pages, 512-entry, 4-assoc, unshared (L2 data)

Scheduling

System architecture, allocations, continuation targets, event pattern and of course processing itself play a role in our scheduling. We first construct the scheduling universe, a scheduling model using the hardware detected during library initialization. Taken from src/libtorque/topology.h (commit a03fc592d03ccc9c20b016faaa1362b88bc07417):

// We are not considering distributed systems in this model.
//
// The scheduling universe is defined by an ordered set of $N > 1$ levels
// $L_0..L_n$. A scheduling group is a structural isomorphism, a counter $C$ of
// instances, and the $C$ affinity masks of corresponding processing elements.
// A level contains $N > 0$ scheduling groups and a description of hardware
// unique to that level. A unique surjection from classes of usable hardware to
// levels (no hardware class is described at two levels, and all usable
// hardware is described by the scheduling universe) is defined by our
// discovery algorithm.
//
// The reality compactifies things a bit, but in theory:
// Level L_0 consists of scheduling groups over threads and processor types.
// Each successive level $L_{n+1}, n >= 0$ extends $L_n$'s elements. For each
// element $E \in L_n$, extend $E$ through those paths reachable at equal cost
// from all elements in $E$.
//
// Examples:
// A uniprocessor, no matter its memories, is a single topology level.
// 2 unicore, unithread SMP processors with distinct L1 and shared L2 are two
// topology levels: 2x{proc + L1} and {L2 + memory}. Split the L2, and they
// remain two topology levels: 2x{proc + L1 + L2} and {memory}. Combine both
// caches for a single level: {2proc + L1 + L2 + memory}. Sum over a level's
// processors' threads for a thread count.
//
// Note that SMT does not come into play in shaping the topology hierarchy,
// only in calculating the number of threads in a topological group. We only
// schedule to SMT processors if
// - shared data is allocated which fits entirely within the group's cache, or
// - we need to.
  • If the memory being used is distinct, space things out as far as possible:
    • Use different NUMA nodes, to spread memory bandwidth
    • Use different packages, to engage different caches
    • Use different cores, to engage different execution units
    • Use different threads, to engage different architectural state
  • If the memory being used is shared, compress as much as possible:
    • Use the same core, to maximize per-core cache utilization and minimize trips off-core
    • Use the same package, to maximize per-package cache utilization and minimize trips off-die
    • Use the same NUMA node, to maximize per-node memory utilization and minimize trips off-node
  • Whether to use the same core is a function of proportion/distribution of shared accesses...
    • Shared writes are even more important to keep close, to minimize true coherence overhead
    • Unshared writes are even more important to space out, to minimize false coherence overhead
    • Buffers (both RX and TX) are written to once, then read from N (N>0) times -- worst case for coherence protocols
  • ...and SMT-affability of the instruction load (in other words, degree of fine-grained parallelism)
    • Network servers are not matrix kernels! Assume crappy IPC (syscalls, branching, I/O), and exploit sharing purely in terms of memory
  • Let's color connections

Edge-triggering

libtorque deals only in edge-triggered event notification, thus evading an entire level of otherwise necessary locking. FIXME EXPLAIN

Allocation

FIXME EXPLAIN

  • Networks imply buffers. Big, unshared, easily-colored buffers.
    • As much as possible, parameterize buffer allocation and move it into libtorque.
      • We get slabs/recycling for free, and only need lock when we enlarge slabs (kernel also locks, so try to match)
    • Expose allocators for explicitly shared/unshared data
    • When done based on connection type, libtorque can wait for data's arrival to allocate

Scalability

Commonly-used algorithms (anything in the event-handling hotpath) oughtn't depend on the number of processors, depth of the scheduling topology, or number of memories -- that is, they should remain O(1) as these grow. Aggregate performance ought exhibit growth linear in the number of processors, save perhaps superlinear growth when a scheduling level shares caches.

Robustness

As the number of dies grows, so do interconnects. We're extending Moore's Law, but what about Weibull's distribution? As the number of cores rises, so too does the likelihood that a processor or DIMM failure will be seen across the lifetime of a machine (remember, a failure distribution exponential as a function of time suggests only ~37% likelihood of any given component reaching MTBF). Hot-pluggable processors and memories are also likely to be encountered, especially in cloud/virtual environments. For that matter, sysadmins can reconfigure cpusets and NUMA on the fly, or even migrate our process between sets. We ought design with all this in mind.

  • Determine: how are Linux/FreeBSD processes notified/affected when their processors or memories fail?
  • Determine: how to recover/redetect/redistribute?

Compatibility wrappers

FIXME

References/Prior Art

Miscellanea

History

libtorque, under that name, began as a project for Professor Rich Vuduc's Fall 2009 CSE6230. It really began gestating in three parts:

  • Work on intrusion prevention at Reflex got me hooked on automata-based networking activations and fast networking
  • Work on ICAP servers and reverse proxies at McAfee, especially snare, got me thinking about networking API's
  • Professor Tom Conte's Spring 2009 "CS 8803 MCA: Multicore and Manycore Architecture" lit up the parallelism fire

Milestones

  • 2009-10-22: First commit (e7429294beb9dc581a7cdab2371d2ddca3169047)
  • 2009-11-13: CS 8001-CAS Arch-Whiskey presentation
  • 2009-11-19: CSE 6230 checkpoint (see proposal)
  • 2009-12-10: CSE 6230 due date (again, see proposal)

Logos

888 ,e, 888        d8           "...tear the roof off the sucka..."
888  "  888 88e   d88    e88 88e  888,8,  e88 888 8888 8888  ,e e,
888 888 888 888b d88888 d888 888b 888 "  d888 888 8888 8888 d88 88b
888 888 888 888P  888   Y888 888P 888    Y888 888 Y888 888P 888   ,
888 888 888 88"   888    "88 88"  888     "88 888  "88 88"   "YeeP"
_____________________________________________ 888 _________________
continuation-based unix i/o for manycore numa\888/© nick black 2009