libtorque

My project for Professor Rich Vuduc's Fall 2009 CSE 6230, libtorque is a multithreaded event library for UNIX designed to take full advantage of the manycore, heterogeneous, NUMA future. Previous, non-threaded event libraries include libevent, libev and liboop. My project proposal lays out the motivation for libtorque: I believe it necessary to take scheduling and memory-placement decisions into account to handle events optimally, especially on manycore machines and especially under unexpected traffic sets (denial of service attacks, oversubscribed pipes, mixed-latency connections, etc.).

Resources

Milestones

  • 2009-11-19: CSE 6230 checkpoint
  • 2009-12-10: CSE 6230 due date

Design/Functionality

libtorque exposes an affinity-managing, continuations-based, architecture-aware, multithreaded scheduler and various utility functions, including an architecture-, OS-, and thread-aware allocator with strong scheduler feedback. It can analyze arbitrary object code via libdl and libelf, discover where instructions live, and allocate around those areas. It can take dynamically balanced or (static) asymmetric interrupt loads into account. By making service decisions and allocations based on whole-system effects, libtorque provides low latency and high throughput under even pedantically asymmetric, irregular loads. By carefully distributing edge-triggered event descriptors among the various processors' notification sets, it makes highly scalable use of advanced low-level primitives such as kqueue and epoll. Through the use of lazy asynchronous I/O, each (expensive!) core is kept busy doing real work.
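As a minimal illustration of the code-analysis point (this is not libtorque's own implementation), one can discover where instructions live by walking the executable segments of every loaded object with dl_iterate_phdr(3); an allocator could then steer its placements away from those regions:

/* Minimal sketch (not libtorque code): enumerate the executable segments of
 * every loaded object via dl_iterate_phdr(3), i.e. discover "where
 * instructions live", so an allocator could avoid conflicting with them. */
#define _GNU_SOURCE
#include <link.h>
#include <stdio.h>

static int
segcb(struct dl_phdr_info *info, size_t size, void *data){
    (void)size; (void)data;
    for(unsigned i = 0 ; i < info->dlpi_phnum ; ++i){
        const ElfW(Phdr) *ph = &info->dlpi_phdr[i];
        if(ph->p_type == PT_LOAD && (ph->p_flags & PF_X)){
            printf("%s: text at %p, %zu bytes\n",
                *info->dlpi_name ? info->dlpi_name : "(main)",
                (void *)(info->dlpi_addr + ph->p_vaddr),
                (size_t)ph->p_memsz);
        }
    }
    return 0;
}

int main(void){
    return dl_iterate_phdr(segcb, NULL);
}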

  • Q: Why aren't you letting the OS manage scheduling, thus following the advice of just about everyone (assuming no real-time requirements)?
  • A: Scalable event notification on UNIX is based around stateful polling (even in signal-driven POSIX asynchronous I/O, we want to use stateful polling on signals, both for performance and for event unification). The distribution of these events, at heart, drives scheduling. Since we must influence this distribution, we must make scheduling decisions.
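The sketch below illustrates the per-processor notification sets described above, assuming Linux's epoll; the names NWORKERS and add_fd() are illustrative, not libtorque's API. Each worker thread owns its own edge-triggered epoll set, so a given descriptor's events are delivered to exactly one thread.

/* Minimal sketch (not the libtorque API): give each worker thread its own
 * epoll set and register descriptors edge-triggered, so events for a given
 * fd land in exactly one thread's notification set. */
#include <sys/epoll.h>
#include <pthread.h>
#include <unistd.h>

#define NWORKERS 4  /* illustrative; libtorque derives this from topology */

static int epfds[NWORKERS];

static void *worker(void *arg){
    int ep = *(int *)arg;
    struct epoll_event ev[64];

    for( ; ; ){
        int n = epoll_wait(ep, ev, 64, -1);
        for(int i = 0 ; i < n ; ++i){
            /* edge-triggered: drain the fd (read until EAGAIN) here */
        }
    }
    return NULL;
}

/* assign an fd to a worker; a real scheduler would weigh load and locality */
static int add_fd(int fd){
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    return epoll_ctl(epfds[fd % NWORKERS], EPOLL_CTL_ADD, fd, &ev);
}

int main(void){
    pthread_t tids[NWORKERS];

    for(int i = 0 ; i < NWORKERS ; ++i){
        if((epfds[i] = epoll_create1(0)) < 0){
            return 1;
        }
        pthread_create(&tids[i], NULL, worker, &epfds[i]);
    }
    add_fd(STDIN_FILENO);  /* example descriptor */
    pause();
    return 0;
}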

System discovery

  • Full support for CPUID as most recently defined by Intel and AMD (more advanced, as of 2009-10-31, than x86info); a minimal cache-enumeration sketch follows this list
  • Full support for Linux's and FreeBSD's native cpuset libraries, and SGI's libcpuset and libNUMA
  • Discovers and makes available, for each processor type:
    • ISA, ISA-specific capabilities, and number of concurrent threads supported (degrees of SMT)
    • Line count, associativity, line length, geometry, and type of all caches
    • Entry count, associativity, page size and type of all TLBs
    • Inclusiveness relationships among cache and TLB levels
    • Interconnection topology, APIC IDs, and how caches are shared among processors
    • More: properties of hardware prefetching, ability to support non-temporal loads (MOVNTDQA, PREFETCHNTA, etc.)
  • Discovers and makes available, for each memory node type:
    • Connected processor groups and relative distance information
    • Number of pages and bank geometry
    • More: OS page prefetching policy, error-recovery info
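As promised above, here is a minimal sketch of the kind of discovery CPUID affords, restricted to Intel's deterministic cache parameters (leaf 0x04); libtorque itself also handles AMD's leaves, TLBs, and sharing topology:

/* Minimal sketch (Intel leaf 0x04 only): walk CPUID's deterministic cache
 * parameters to recover the level, type, associativity, line size, and
 * total size of each cache reachable from this core.  Uses <cpuid.h>. */
#include <cpuid.h>
#include <stdio.h>

int main(void){
    for(unsigned idx = 0 ; ; ++idx){
        unsigned eax, ebx, ecx, edx;

        __cpuid_count(0x04, idx, eax, ebx, ecx, edx);
        (void)edx;
        unsigned type = eax & 0x1f;            /* 0: no more caches */
        if(type == 0){
            break;
        }
        unsigned level = (eax >> 5) & 0x7;
        unsigned ways = ((ebx >> 22) & 0x3ff) + 1;
        unsigned partitions = ((ebx >> 12) & 0x3ff) + 1;
        unsigned linesz = (ebx & 0xfff) + 1;
        unsigned sets = ecx + 1;
        printf("L%u %s: %u ways, %uB lines, %u KB total\n", level,
            type == 1 ? "data" : type == 2 ? "code" : "unified",
            ways, linesz, ways * partitions * linesz * sets / 1024);
    }
    return 0;
}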

Scheduling

System architecture, allocations, continuation targets, event patterns, and of course the processing itself all play a role in our scheduling. We first construct the topology hierarchy, a scheduling model built from the hardware detected during library initialization (a hypothetical sketch follows this list):

  • FIXME detail
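Since the details above are still to be filled in, the following is purely a hypothetical sketch of what such a hierarchy might look like: a tree of scheduling domains from NUMA nodes down to hardware threads, with a naive least-loaded placement policy. None of these names come from libtorque.

/* Hypothetical sketch only (the real structure is still marked FIXME above):
 * model the detected hardware as a tree of scheduling domains, from memory
 * nodes down through shared caches to SMT siblings. */
#include <stdlib.h>

typedef enum {
    DOMAIN_NUMA_NODE,   /* memory node and its attached processors */
    DOMAIN_PACKAGE,     /* physical socket */
    DOMAIN_CACHE,       /* set of cores sharing a cache level */
    DOMAIN_CORE,        /* SMT siblings sharing execution resources */
    DOMAIN_CPU          /* a single hardware thread */
} domain_type;

typedef struct sched_domain {
    domain_type type;
    unsigned id;                  /* e.g. APIC-derived identifier */
    struct sched_domain *parent;
    struct sched_domain *children;
    struct sched_domain *sibling;
    unsigned load;                /* events currently assigned below here */
} sched_domain;

/* pick the least-loaded leaf beneath a domain: a naive placement policy */
static sched_domain *least_loaded_cpu(sched_domain *d){
    if(d->children == NULL){
        return d;
    }
    sched_domain *best = NULL;
    for(sched_domain *c = d->children ; c ; c = c->sibling){
        sched_domain *leaf = least_loaded_cpu(c);
        if(best == NULL || leaf->load < best->load){
            best = leaf;
        }
    }
    return best;
}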

Robustness

As the number of dies grows, so do interconnects. We're extending Moore's Law, but what about Weibull's distribution? As the number of cores rises, so too does the likelihood that a processor or DIMM failure will be seen across the lifetime of a machine (remember, an exponential failure distribution implies only a ~37% likelihood of any given component reaching its MTBF). Hot-pluggable processors and memories are also likely to be encountered, especially in cloud/virtual environments. For that matter, sysadmins can reconfigure cpusets and NUMA on the fly, or even migrate our process between sets. We ought to design with all this in mind.
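Spelling out that ~37% figure: for an exponentially distributed component lifetime T with failure rate λ, the MTBF is 1/λ, so

  P(T > \mathrm{MTBF}) = e^{-\lambda \cdot (1/\lambda)} = e^{-1} \approx 0.368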

  • Determine: how are Linux/FreeBSD processes notified/affected when their processors or memories fail?
  • Determine: how to recover/redetect/redistribute?
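Both questions remain open. As one possible avenue for the second, here is a polling sketch (Linux-only, and an assumption of mine rather than anything libtorque does): periodically re-query the process affinity mask and flag changes, at which point the topology hierarchy could be redetected and events redistributed.

/* Sketch of one possible redetection strategy (Linux-only, polling rather
 * than notification): re-read our affinity mask and flag any change so the
 * topology hierarchy can be rebuilt and events redistributed. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    cpu_set_t last, cur;

    CPU_ZERO(&last);
    sched_getaffinity(0, sizeof(last), &last);
    for( ; ; ){
        sched_getaffinity(0, sizeof(cur), &cur);
        if(!CPU_EQUAL(&last, &cur)){
            printf("cpuset changed: now %d cpus\n", CPU_COUNT(&cur));
            last = cur;   /* here we'd redetect topology and rebalance */
        }
        sleep(1);
    }
    return 0;
}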

References/Prior Art

888 ,e, 888        d8           "...tear the roof off the sucka..."
888  "  888 88e   d88    e88 88e  888,8,  e88 888 8888 8888  ,e e,
888 888 888 888b d88888 d888 888b 888 "  d888 888 8888 8888 d88 88b
888 888 888 888P  888   Y888 888P 888    Y888 888 Y888 888P 888   ,
888 888 888 88"   888    "88 88"  888     "88 888  "88 88"   "YeeP"
_____________________________________________ 888 _________________
continuation-based unix i/o for manycore numa\888/© nick black 2009