Extended disquisitions pertaining to eXpress data paths (XDP)

dankblog! 2023-04-20, 0423 EST, at the danktower

I've spent the past three months building a substantial XDP-based application for my employer, intended to be run in Azure using the latter's "Accelerated Networking" (SR-IOV). By "substantial", I mean that it includes a significant userspace component using AF_XDP sockets, that the XDP code must use dynamic configuration data, and that it must operate on arbitrary hardware configurations. I've not seen any documentation that covers the details of such an application, whether official Linux kernel documentation, conference papers, or the essays of the technorati. Most examples involve simple packet filtering using static data, never touching on the AF_XDP funnel to userspace that makes XDP an actual eXpress Data Path and not just a good place to stick some eBPF. I hope with this post to somewhat remedy that situation.

These kinds of applications are typically developed atop the Data Plane Development Kit. I chose XDP over the DPDK, and would be remiss were I not to first discuss that more established technology.

If you're new to XDP, you might want to read my page on that topic. I recently (2023-02-03) gave a presentation to Microsoft Azure Orbital you might also read. I'm not a true expert on either of these two technologies yet, but I probably know more than you do.

DPDK

Intel's (now the Linux Foundation's) DPDK saw public release in late 2010 under the BSD license. It is mature technology with extensive hardware support and great documentation, and several general-purpose applications have been released on top of it. It consists of the Runtime Environment (RTE), the Environment Abstraction Layer (EAL), and Poll-Mode Drivers (PMDs) for various hardware and paravirtualized NICs. Devices are bound to Linux's Virtual Function I/O (VFIO) or Userspace I/O (UIO) subsystems rather than using their typical kernel drivers. This allows userspace to perform basic PCIe BAR configuration, necessary for registering buffers and enabling the card. The RTE sets up tx/rx data ("mbuf") rings (ideally backed by huge pages), and descriptor rings for same. The PMD then begins polling on the RX descriptor ring. Packets are received entirely without overhead, and processed in userspace without any compulsory context switches (mbufs can be transferred among cores using the "rings" subsystem of the RTE). Like any userspace networking system worth the price of entry, careful attention has been paid to isolation and affinity of threads, constraining allocation to local NUMA nodes, interrupt mapping, and NIC features like RSS.

I'd argue that XDP is an implementation of Van Jacobson channels, while DPDK is a true userspace networking stack. The most fundamental difference is that a device bound to DPDK is usually no longer a Linux networking interface (I believe there's something called dtap which gets around this somehow). It will not show up in ip link output, and will not have an entry in /sys/class/net (it will show up on the PCIe bus). DPDK furthermore encompasses devices beyond network interface cards, including crypto, DMA, and regex accelerators. The userspace application configures the device through UIO, and after that theoretically needn't touch the kernel (i.e. invoke system calls) to perform I/O. Everything is based off userspace ringbuffers and memory-mapped I/O.

With a dual-port 10GBASE-T Intel X550 (ixgbe driver), we see the following in lspci:

44:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
44:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)

With no driver loaded, dpdk-devbind.py --status shows:

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563'
0000:44:00.1 'Ethernet Controller X550 1563'

We load the ixgbe driver, bringing up ixgbe1, and now see:

Network devices using kernel driver
===================================
0000:44:00.0 'Ethernet Controller X550 1563' if=ixgbe0 drv=ixgbe unused=
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused= *Active*

We can unbind the driver from one of the ports using dpdk-devbind.py -u ixgbe0. Our ixgbe0 interface disappears, and we now have:

Network devices using kernel driver
===================================
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused= *Active*

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563' unused=ixgbe

We load the uio_pci_generic, vfio-pci, and igb_uio kernel modules (the first two are provided in kernel sources; the last comes from DPDK, and is built by DKMS):

Network devices using kernel driver
===================================
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused=igb_uio,vfio-pci,uio_pci_generic *Active*

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563' unused=ixgbe,igb_uio,vfio-pci,uio_pci_generic

We now have one port suitable for use by a DPDK application using the igb_uio passthrough. Note that the UIO passthroughs are inferior to vfio-pci (which also would have worked), due to their inability to use IOMMUs and lack of DMA-safety. uio_pci_generic doesn't even support MSI, making it unusable under SR-IOV or with RSS.
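The same binding state that dpdk-devbind.py reports can be read straight out of sysfs, where each PCI device with a bound driver carries a "driver" symlink. What follows is a minimal sketch, assuming a Linux host with sysfs mounted at /sys; the function names are hypothetical and not part of DPDK or any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <limits.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Return the final path component of a sysfs "driver" symlink target,
 * e.g. "../../../../bus/pci/drivers/ixgbe" yields "ixgbe". */
const char *driver_from_link(const char *target)
{
  assert(target != NULL);
  const char *slash = strrchr(target, '/');
  return slash ? slash + 1 : target;
}

/* Write the name of the driver bound to PCI address `bdf` (for example
 * "0000:44:00.0") into `out`. Returns 0 on success, or -1 if the device is
 * absent or no driver is bound (in which case the symlink does not exist). */
int pci_bound_driver(const char *bdf, char *out, size_t outlen)
{
  char path[PATH_MAX], target[PATH_MAX];
  int n = snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver", bdf);
  if (n < 0 || (size_t)n >= sizeof(path))
    return -1;
  ssize_t len = readlink(path, target, sizeof(target) - 1);
  if (len < 0)
    return -1;
  target[len] = '\0';   /* readlink() does not NUL-terminate */
  snprintf(out, outlen, "%s", driver_from_link(target));
  return 0;
}
```

Called with "0000:44:00.1" on the machine above, this would be expected to yield ixgbe, while the unbound "0000:44:00.0" would return -1.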

Anyway, this was supposed to be about AF_XDP. But there's your whirlwind introduction to DPDK. The essential bedrock fact is: DPDK is a mature system for fast userspace packet dancing. It likes to poll. It cares about your machine and its setup only so far as it gets in DPDK's way.

Why AF_XDP?

The fundamental value proposition of AF_XDP (aside from the fuzzy feeling of not needing to leave the upstream kernel tree) is this: DPDK wants to own its cards and their traffic. By default, once a NIC is being used by DPDK, it's lost to the rest of the machine (this can be worked around using both software dispatch and NIC hardware queues, I believe). DPDK requires understanding a fairly large and esoteric system with its own ex nihilo API. The kernel receive path is rather more heavyweight than its transmit path; XDP focuses on this more usefully elided RX path. I expect both to fester/live through at least the decade's end.

Before AF_XDP

By the way, there are several other intersections of eBPF with the RX path for doing more or less this same kind of thing. You can attach eBPF to ingress tc (traffic control)'s cls_bpf classifier, with its own (incompatible) set of hooks and actions. This is what you want if you can accept the skbuff creation hit and want to work on that more flexible structure. It is furthermore largely device/driver-independent. You can also attach eBPF to a socket using the SO_ATTACH_BPF sockopt, though this can only AFAIK be used to filter ingress.

Rounding out this loose confederation is the esoteric SO_ATTACH_REUSEPORT_EBPF, which allows programmatic control of distribution among multiple listeners on the same REUSEPORT-aliased endpoint.

XDP sweetness

I'm not very interested in pure kernelspace XDP programs, and anyone who's spent quality time with the kernel's eBPF verifier is likely to agree. If you can get away with one for your task, awesome; do so. If all you need do is examine some headers, possibly scribble a little, maybe use a little per-cpu state, and drop the frame or kick it out some other interface, you need never leave the soft IRQ handler, and Godspeed. This is all reasonably well documented and straightforward.

Where I get really hot is AF_XDP, the (potentially zerocopy) path to userspace. In ways, it's like a fast AF_PACKET packet socket (on drivers without XDP acceleration, it's almost exactly like AF_PACKET), except that you can prevent the packet from progressing further through the kernel (using bpf_redirect_map() to divert to an AF_XDP socket inhibits further traversal whether the frame was copied or not, though you can clone the packet to facilitate traversal should you want to). Theoretically, we could get very close to DPDK's all-userspace, all-polled path: if we put our kernel code and userspace code on different processing elements (but accessing the same memories) we could very well match it (NARRATOR: they would not match it.).

Getting the best performance out of AF_XDP requires a lot of the same system administration-esque preconditioning needed by DPDK. We're coming up on two decades of mainstream multicore: you'll need to take advantage of parallelism via multiple threads. Those threads ought be bound to isolated processing elements, which means cgroups and the cpuset controller. These days, that means cgroups v2, ideally prepared in the form of a systemd slice. The standard technique is a "cpu shield": you pin your threads, one thread per core, and remove those cpus from other cgroups. Likewise, interrupt handlers (aside from our NIC queues) have our cores masked out of their affinities, while relevant interrupt handlers are bound to our cores (you can't generally move all unrelated kernel work away; some kernel tasks, including soft IRQs, are fundamentally per-cpu). In the language of cgroups v2, you'll want your cores in a root partition, or an isolated root partition if you don't need the crutch of a scheduling domain. On NUMA systems, create a partition per zone, and bind that group to the zone's local memory nodes, holding off on allocations until bound. Of course, cgroups v2 and cgroups v1 can kinda coexist in a fine example of Terrible Decisions by Smart People, so you might be on a cgroups v2 machine that has old-school cpusets with their slightly but devastatingly different semantics, so know you're pretty much fucked I guess.
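The pinning half of a cpu shield can be sketched with the glibc affinity calls. This is only the per-thread part, under the assumption that the cgroup and IRQ-affinity work described above has been done elsewhere; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread is allowed to run on,
 * or -1 if the affinity mask could not be read. */
int first_allowed_cpu(void)
{
  cpu_set_t set;
  CPU_ZERO(&set);
  if (sched_getaffinity(0, sizeof(set), &set) != 0)
    return -1;
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
    if (CPU_ISSET(cpu, &set))
      return cpu;
  return -1;
}

/* Restrict the calling thread (pid 0 means "this thread") to exactly one
 * CPU. Returns 0 on success, -1 with errno set on failure, e.g. when the
 * CPU lies outside the cpuset the thread's cgroup permits. */
int pin_self_to_cpu(int cpu)
{
  assert(cpu >= 0 && cpu < CPU_SETSIZE);
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return sched_setaffinity(0, sizeof(set), &set);
}
```

Each worker thread would call pin_self_to_cpu() with its assigned core before allocating anything, so that first-touch allocations land on the local node.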

If you need serialization of a flow, direct it to a single queue, run your XDP on the single core to which that queue is directed, and stamp the frame there. You can then fling it to random processors. Otherwise, distribute frames among multiple XDP programs on different cores using multiple hardware queues. Queue counts can be retrieved with netlink, but beyond that you're in ethtool country. Per-queue statistics are almost always available via ethtool's NIC-specific stats.

Use huge pages where appropriate. Intel's DDIO can be very effective. For physical NICs, ensure you're using cores in the same zone (usually this just means using cores from the package supplying the PCIe root to which the device is attached. Consult sysfs to get mappings at runtime). If you've got DDIO, you're on a Xeon and probably want to look at CAT (Cache Allocation Technology) as well.
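The sysfs lookup for a NIC's NUMA node is short enough to sketch. This assumes sysfs is mounted at /sys and that the interface is backed by a real bus device; the function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read the NUMA node of the device behind network interface `ifname`.
 * Returns 0 and stores the node in *node on success. The kernel writes -1
 * into this file when it knows of no affinity. Virtual interfaces such as
 * lo have no device directory at all, in which case this returns -1 and
 * leaves *node untouched. */
int nic_numa_node(const char *ifname, int *node)
{
  assert(ifname != NULL && node != NULL);
  char path[256];
  int n = snprintf(path, sizeof(path),
                   "/sys/class/net/%s/device/numa_node", ifname);
  if (n < 0 || (size_t)n >= sizeof(path))
    return -1;
  FILE *fp = fopen(path, "r");
  if (fp == NULL)
    return -1;
  int value;
  int ok = fscanf(fp, "%d", &value) == 1;
  fclose(fp);
  if (!ok)
    return -1;
  *node = value;
  return 0;
}
```

The cores belonging to node N can then be read from /sys/devices/system/node/nodeN/cpulist and fed to whatever does your pinning.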

Almost all of this stuff requires root or a swath of capabilities, and is difficult to coordinate with other programs on arbitrary systems. Allah, the All-Powerful, has fucked us again!

Configuring XDP programs

Most existing XDP examples, especially of the kind one finds on cute little blog posts, are minimal and fairly useless. Oftentimes the author spends significant effort introducing the basics of eBPF, meaning you read more about Clang and maps than you do XDP. Once you finally bind some code, it's doing nothing more than checking ports or addresses against some compiled-in constants and dropping matches. Serious, production-level demonstrations are few and far between.

One element that's rarely touched upon is configuration of an XDP program, especially if that configuration is dynamic (i.e. it changes during runtime). The most basic means is of course embedding some constants in the C source. Leaving aside the annoyances of automatically modifying and non-interactively compiling code (let's not talk about directly patching binaries; it is after all peacetime), this is going to add some fundamental delay to your configuration cycle. The idea of rapidly reconfiguring via such a method can be rejected out of hand.

So the other obvious method is an eBPF map. Similarly to the process environment accessed with getenv(3) and setenv(3), we can map up an array of __u32 as so (in our XDP program):

struct {
  __uint(type, BPF_MAP_TYPE_ARRAY);
  __uint(max_entries, PARAM_COUNT);
  __uint(map_flags, BPF_F_MMAPABLE);
  __type(key, __u32);
  __type(value, __u32);
} parameters SEC(".maps");

then in our XDP-to-userspace interface definition (aka a header file) we might have, say:

enum parameters {
  PARAM_IPV4_MATCH_COUNT,
  PARAM_IPV6_MATCH_COUNT,
  PARAM_THREAD_COUNT,
  PARAM_COUNT
};

Userspace can set elements with bpf_map_update_elem(), and XDP can read them with bpf_map_lookup_elem(). These interfaces are atomic. It's a system call from userspace, which is lame, but not a big deal for write-once data structures or anything we're updating infrequently. From within eBPFspace, it's a CALL virtual instruction with no prologue/epilogue overhead, so essentially just an atomic read (all maps are of fixed size, and bounds checking is performed by the verifier). So some immediate questions:

Q: is there atomicity if your value type is larger than a word? A: as far as I can tell, this is implemented via RCU.
Q: can i cache the results? A: sure, static variables work as expected. but how will you invalidate this cache? and will it actually improve performance? how?
Q: can my value type contain pointers? A: sure, but nothing (i hope obviously) chases-and-copies them or anything, and they're not immediately useful to you in kernelspace.
Q: how do i make them useful? A: sigh, try bpf_probe_read_user_str() etc.
Q: can the user safely modify an area i'm reading with bpf_probe_read_user_str()? A: no, that's how bitches die, obviously not.

XDP problems

There are some serious problems imho affecting XDP usability. Some of them seem mere technical challenges; some of them seem more fundamental.

Overlap with desirable networking stack basics

Back in 1998, Ptacek and Newsham wrote one of my all-time favorite infosec papers, "Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection." I thought it a great paper then, and have thought about it regularly in the past twenty-five (jfc!) years. The central point is that an IDS/IPS can be only as confident about the traffic reaching a process as it can faithfully reproduce the networking between itself and that process. As a simple example, imagine an IPv4 packet is fragmented into two fragments A and B. Two different payloads for "A" are sent in succession using fragment offset 0, and only then is B sent. What does a userspace socket receive--the first "A" payload, or the second? It depends on the host's networking stack (and possibly on any middleware between the IPS and the host). If the IPS doesn't match (and know!) said stack (which is generally impossible), it can lead to errors.
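The fragment ambiguity is easy to see in a toy reassembler. This sketch is not how any particular stack does it: it uses plain byte offsets rather than IPv4's 8-byte fragment units, and the two policies and all names are hypothetical stand-ins for "what the end host happens to do".

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum overlap_policy { FIRST_WINS, LAST_WINS };

struct fragment {
  size_t offset;      /* byte offset into the reassembled payload */
  const char *data;
  size_t len;
};

/* Reassemble fragments, given in arrival order, into out[0..outlen).
 * Under FIRST_WINS a byte already written is never overwritten; under
 * LAST_WINS later fragments replace earlier ones.
 * Returns 0 on success, -1 if a fragment does not fit. */
int reassemble(const struct fragment *frags, size_t nfrags,
               enum overlap_policy policy, char *out, size_t outlen)
{
  unsigned char seen[64];
  assert(outlen <= sizeof(seen));
  memset(seen, 0, sizeof(seen));
  for (size_t i = 0; i < nfrags; ++i) {
    if (frags[i].offset > outlen || frags[i].len > outlen - frags[i].offset)
      return -1;
    for (size_t j = 0; j < frags[i].len; ++j) {
      size_t pos = frags[i].offset + j;
      if (policy == FIRST_WINS && seen[pos])
        continue;
      out[pos] = frags[i].data[j];
      seen[pos] = 1;
    }
  }
  return 0;
}
```

Feed it two competing "A" fragments at offset 0 followed by "B", and the two policies hand the application different payloads from the same bytes on the wire. An IPS that guesses the wrong policy is inspecting a payload the host never sees.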

XDP has some similar problems, and requires a good amount of boilerplate from the XDP program if all kinds of problems are not to be introduced. As the simplest example, a host will not normally deliver a payload to a userspace socket unless the packet was addressed to some address on the host. An XDP program forwarding to XSKs must perform this check itself (and there is no obvious way to access netlink-style stack details from an XDP program). rp_filter usually provides reverse path filtering as described in RFC 3704, but it is not applied to XDP. If a NIC is in promiscuous mode, frames will be delivered to XDP no matter their L2 destination address, etc.
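The destination check might look something like the following, written here as plain userspace C so it can be compiled and tested anywhere. It mimics the bounds-checked data/data_end style the verifier demands, but it is only a sketch: in a real XDP program the local address list would presumably live in an eBPF map populated from userspace, VLAN tags and IPv6 are ignored, and the function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ETH_HDR_LEN 14
#define IPV4_MIN_HDR_LEN 20
#define ETHERTYPE_IPV4 0x0800

/* Return 1 if [data, data_end) holds an untagged Ethernet II frame carrying
 * IPv4 whose destination address appears in locals[] (addresses stored in
 * network byte order), else 0. */
int ipv4_dst_is_local(const uint8_t *data, const uint8_t *data_end,
                      const uint32_t *locals, size_t nlocals)
{
  assert(data != NULL && data <= data_end);
  if ((size_t)(data_end - data) < ETH_HDR_LEN + IPV4_MIN_HDR_LEN)
    return 0;                       /* too short to hold both headers */
  uint16_t ethertype = (uint16_t)(data[12] << 8 | data[13]);
  if (ethertype != ETHERTYPE_IPV4)
    return 0;
  const uint8_t *ip = data + ETH_HDR_LEN;
  if ((ip[0] >> 4) != 4)
    return 0;                       /* version nibble must be 4 */
  uint32_t dst;
  memcpy(&dst, ip + 16, sizeof(dst)); /* destination address, bytes 16..19 */
  for (size_t i = 0; i < nlocals; ++i)
    if (locals[i] == dst)
      return 1;
  return 0;
}
```

Anything that fails the check would be XDP_PASSed or dropped rather than redirected to an XSK, depending on what you want the regular stack to see.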

There seems no way to determine whether UDP and TCP checksums have been validated in hardware (there are methods for checksum deltas when modifying packets). If they have not, I usually want to validate them myself before passing packets to userspace, but if they have, I don't want to incur that cost. If a copy must be made of the packet data, I'd like any checksumming folded into that, where it will likely be hidden under memory access costs.
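Folding the checksum into the copy can be sketched as below. This is a userspace illustration of the RFC 1071 one's-complement sum computed during a byte copy, not kernel code; a real TCP or UDP check would also have to add the pseudo-header into the sum, which is omitted here. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy len bytes from src to dst while accumulating the Internet checksum
 * (RFC 1071) over them as big-endian 16-bit words. Returns the complement
 * of the folded sum, so the result is 0 when the covered bytes already
 * include a correct checksum field. */
uint16_t copy_and_checksum(uint8_t *dst, const uint8_t *src, size_t len)
{
  assert(dst != NULL && src != NULL);
  uint64_t sum = 0;
  size_t i = 0;
  for (; i + 1 < len; i += 2) {
    dst[i] = src[i];
    dst[i + 1] = src[i + 1];
    sum += (uint32_t)src[i] << 8 | src[i + 1];
  }
  if (i < len) {                    /* odd trailing byte, padded with zero */
    dst[i] = src[i];
    sum += (uint32_t)src[i] << 8;
  }
  while (sum >> 16)                 /* fold carries back into 16 bits */
    sum = (sum & 0xffff) + (sum >> 16);
  return (uint16_t)~sum;
}
```

Since the source bytes are being pulled through the cache for the copy anyway, the extra additions ought to be close to free, which is the whole point.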

Small annoyances

Binding an XDP program effectively cycles the device on some NICs.
I can't determine what class of networking device I've been bound to from within my XDP program. Ought I expect Ethernet L2? Who knows!
The rx_fill_ring_empty_descs statistic seems to come preloaded with a value of 1. lol?
You can't generally take advantage of hardware offloading (LRO/TSO) in conjunction with XDP, though it seems you (sometimes) can with frag-aware XDP (read on...).
How, oh how do I get hardware timestamps? SO_TIMESTAMP[NS] is sadly unsupported on AF_XDP.

Mere technical issues

XSKs must bind to a specific NIC hardware queue. The APIs to get hardware queue information are old ethtool ioctls, or the more recent netlink reimplementations of same. There are at least four types of hardware queue, dependent on hardware and driver. The APIs to *manage* hardware queues are barely there, yet to use XDP one absolutely must either (a) collapse all RX queues into a single queue or (b) direct all desired traffic to the XDP-bound queue with a NIC ntuple rule or (c) bind XDP programs to all RX queues. Per-queue statistics are completely NIC-dependent and unstructured with no unifying infrastructure despite near-universal support for basic per-queue stats. These are all really variations on the theme that "kernel-side ethtool seems to have been implemented by a drunken goldfish".
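For reference, the old ioctl route to queue counts looks roughly like this. It is a sketch using the legacy ETHTOOL_GCHANNELS command (the same information ethtool -l prints); the wrapper name is hypothetical, and many virtual devices simply return an error.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <unistd.h>
#include <net/if.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <linux/ethtool.h>
#include <linux/sockios.h>

/* Fill *ch with the channel (queue) counts of interface `ifname`.
 * On success, ch->rx_count, ch->tx_count and ch->combined_count hold the
 * current configuration and the max_* fields the hardware limits.
 * Returns 0 on success, -1 on any failure (no such device, driver does
 * not implement the command, name too long, socket error). */
int get_channels(const char *ifname, struct ethtool_channels *ch)
{
  assert(ifname != NULL && ch != NULL);
  if (strlen(ifname) >= IFNAMSIZ)
    return -1;
  int fd = socket(AF_INET, SOCK_DGRAM, 0);   /* any socket will do */
  if (fd < 0)
    return -1;
  struct ifreq ifr;
  memset(&ifr, 0, sizeof(ifr));
  strcpy(ifr.ifr_name, ifname);              /* length checked above */
  memset(ch, 0, sizeof(*ch));
  ch->cmd = ETHTOOL_GCHANNELS;
  ifr.ifr_data = (void *)ch;
  int ret = ioctl(fd, SIOCETHTOOL, &ifr);
  close(fd);
  return ret == 0 ? 0 : -1;
}
```

Note that which of rx_count, tx_count, and combined_count is nonzero depends on the driver, so a caller has to handle every combination before deciding how many XSKs to create.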

If I've got a pool of workers, each with their own XSK, I'd often like to distribute packets to XSKs based on dynamic backlog, and I *never* want to dispatch to an XSK whose ringbuffer is full. The XSK abstraction offers no way to do this (you could build it yourself with eBPF maps).
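The build-it-yourself selection logic is simple enough; the hard part is getting trustworthy backlog numbers. This sketch assumes each worker publishes its ring occupancy somewhere the dispatcher can read it (in XDP, presumably an eBPF map), and shows only the choice itself in plain C. The struct and function are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct xsk_load {
  uint32_t backlog;    /* descriptors queued but not yet consumed */
  uint32_t capacity;   /* ring size in descriptors */
};

/* Return the index of the least-backlogged socket that still has room
 * (ties go to the lowest index), or -1 if every ring is full. */
int pick_xsk(const struct xsk_load *loads, size_t n)
{
  assert(loads != NULL || n == 0);
  int best = -1;
  for (size_t i = 0; i < n; ++i) {
    if (loads[i].backlog >= loads[i].capacity)
      continue;                     /* never dispatch to a full ring */
    if (best < 0 || loads[i].backlog < loads[best].backlog)
      best = (int)i;
  }
  return best;
}
```

The published numbers are necessarily stale by the time they are read, so this can reduce dispatches to a full ring but not eliminate them.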

I ought be able to have all my rings backed by huge pages if I so desire, but this can't be accomplished with kernel-allocated rings as used with XSKs (the larger UMEM is allocated in user space, and can (should) use huge pages).
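For the UMEM side, a huge-page-backed area can be requested with MAP_HUGETLB. This is a minimal sketch assuming the default huge page size and a length that is a multiple of it; it falls back to regular pages when the huge page pool is empty. The function name is hypothetical, and registering the area with the socket is left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of zeroed anonymous memory for use as a UMEM area,
 * trying huge pages first and falling back to regular pages.
 * *used_huge is set to 1 if huge pages were obtained, else 0.
 * Returns NULL if both attempts fail. */
void *umem_alloc(size_t len, int *used_huge)
{
  assert(len > 0 && used_huge != NULL);
  const int base = MAP_PRIVATE | MAP_ANONYMOUS;
  void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, base | MAP_HUGETLB, -1, 0);
  *used_huge = (p != MAP_FAILED);
  if (p == MAP_FAILED)
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, base, -1, 0);
  return p == MAP_FAILED ? NULL : p;
}
```

Checking used_huge at startup and logging it is cheap insurance against silently running on 4KiB pages because nobody reserved any huge ones.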

Running libxdp-based programs under valgrind results in all kinds of illegal access errors being thrown, discomforting regarding such a low-level component. Also, if xdp-loader crashes while holding its filesystem-based lock, it can't be used again until the lock is deleted, even if the PID recorded therein no longer exists.

Deeper problems

Though the infrastructure is there suggesting that multiple XDP applications can easily coexist on an interface, it's something of a charade. xdp-loader gives you a basic chaining infrastructure, sure, but take the checksum example above. In the absence of hardware checksums (and the ability to detect that they've been validated), must each XDP program validate? Must each XDP program check for L3 well-formedness? If they need defragmentation, must all programs carry that code? It seems to me that eBPF is sufficiently limited that the kernel ought be able to weave together various XDP programs, warn "this one will never match because its matches are all eaten by this already-loaded program", etc. Even that would probably be a mess. Until then, though, I think xdp-loader's default policy of chaining is misguided.

The jumbo frame support is a goddamn mess. Generally, you can't run XDP on an MTU larger than your page size less about a thousand bytes (3050 is typically as high as one can go on 4KiB pages). This effectively shoves an atomic bomb right up the ass of many high-speed networking setups. Do many people use Ethernet MTUs larger than 1500? No. Do many people who are interested in high-performance networking? Absolutely. Some drivers are rumored to support "multibuf"-based jumbo frames, but there's no easy way to discover whether you're working with such a driver. The best you can do seems to be setting the MTU high, setting the BPF_F_XDP_HAS_FRAGS flag on the struct bpf_program, and giving it the ol' college try. If it does load, it doesn't seem that XDP_OPTIONS_ZEROCOPY can be used, and your XDP program must use the "frag-aware API" if it's to access packet data in secondary fragments. Give me my 8K MTUs back!

Locally-generated packets generally don't hit XDP unless you're binding to loopback. I.e. if I have an Intel X540 with an IP address of 172.16.88.40/24 at ixgbe0, and I bind an XDP program to it, and I ping 172.16.88.40 from that same machine, an XDP program bound to ixgbe0 is not going to see that traffic, while XDP bound to lo will. This contradicts how local addresses typically work, requiring that one explain the strange restriction to users (bonne chance!), bind to both devices (requiring two XDP programs and two sets of XSKs), or build an alternate path using regular sockets to handle this case.

There doesn't seem to be any way to detect that you were unable to queue a packet to userspace via bpf_redirect_map() due to descriptor rings being full. If you're inducing reliability on a flow, this means an unfixable gap in your sequence numbers. Furthermore, it doesn't seem that such packets follow the third argument to bpf_redirect_map(), the policy on error. Instead, you just get a drop. If you could at least rely on XDP_PASS being honored, you could try to catch the packet on a regular socket.

Executive summary

XDP is mad decent, especially as DPDK gives me a terrible rash. With that said, it's by no means beautiful, and has a number of serious problems. It is easier to achieve peak rates with DPDK in most cases, so long as you needn't use other kernel infrastructure, which is entirely bypassed. Vendor DPDK sucks complete ass, because you will eventually need some very basic functionality that would normally be only an obscure iproute invocation away, whereas you will have to get your shitty vendor involved to customize their crap, and it will not be cheap. If you need transmit in addition to receive, use either XDP's TX functionality or io_uring with batched sends.

One last thing

DPDK lives on github at DPDK/dpdk like every other reasonable project in existence, resting snugly in the loving bosom of Microsoft, somehow making us money (disclosure: Microsoft has for several years now compensated me handsomely despite forthright sentiments like "Windows 11 is the worm-rotted cherry atop three decades of unceasing crapwaffle dogshit joke operating systems from Redmond. Anyone with half a brain in their head runs Linux, unless they run FreeBSD. Those unfortunate minutes of my day unwillingly spent in Outlook make me want to vomit. The other day there was a motherfucking movie advertisement in my "Start menu". Why anyone tolerates our garbage is a complete mystery to me, aside of course from the best-of-breed space/satellite service offerings due Microsoft Orbital. Kudos, Microsoft Orbital! Kudos to your handsome and stalwart maldevs, your gracious and skilled ladycoders, and your marvelous radomes! PS boss, if you're reading this, I need my laptop replaced yesterday.").

When I sent a patch in to bring the XDP documentation up-to-date with the kernel's structure definitions, I was told to modify my git-diff output and mail a different list. I mean, I get it, and I'm not going to tell you how to run your networking mailing list, but if this was github, that change would be in and I'd not be thinking about what's possibly broken about my git-diff.

previously: "a rack of one's own" 2023-03-11
