
Extended disquisitions pertaining to eXpress data paths (XDP)

I've spent the past two months building a substantial XDP-based application for my employer, intended to be run in Azure using the latter's "Accelerated Networking" (SR-IOV). By "substantial", I mean that it includes a significant userspace component using AF_XDP sockets, that the XDP code must use dynamic configuration data, and that it must operate on arbitrary hardware configurations. I've not seen any documentation that covers the details of such an application, whether official Linux kernel documentation, conference papers, or the essays of the technorati. Most examples involve simple packet filtering using static data, never touching on the AF_XDP funnel to userspace that makes XDP an actual eXpress Data Path and not just a good place to stick some eBPF. I hope with this post to somewhat remedy that situation.

These kinds of applications are typically developed atop the Data Plane Development Kit. I chose XDP over the DPDK, and would be remiss were I not to first discuss that more established technology.

XDP vs DPDK

Intel's (now the Linux Foundation's) DPDK saw public release in late 2010 under the BSD license. It is mature technology with extensive hardware support and great documentation, and several general-purpose applications have been released on top of it. It consists of the Runtime Environment (RTE), the Environment Abstraction Layer (EAL), and Poll-Mode Drivers (PMDs) for various hardware and paravirtualized NICs. Devices are bound to Linux's Virtual Function I/O (VFIO) or Userspace I/O (UIO) subsystems rather than using their typical kernel drivers. This allows userspace to perform basic PCIe BAR configuration, necessary for registering buffers and enabling the card. The RTE sets up tx/rx data ("mbuf") rings (ideally backed by huge pages), and descriptor rings for same. The PMD then begins polling on the RX descriptor ring. Packets are received without interrupts or system calls, and processed in userspace without any compulsory context switches (mbufs can be transferred among cores using the "rings" subsystem of the RTE). Like any userspace networking system worth the price of entry, careful attention has been paid to isolation and affinity of threads, constraining allocation to local NUMA nodes, interrupt mapping, and NIC features like RSS.
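
To make the shape of such an application concrete, here's a minimal sketch of a DPDK poll-mode receive loop, assuming a single already-bound port and a single RX queue (the constants and the pool name "RXPOOL" are my own, not anything DPDK requires): initialize the EAL, carve out an mbuf pool (ideally huge-page-backed), configure and start the port, and then spin on rte_eth_rx_burst().

#include <stdint.h>
#include <stdlib.h>
#include <rte_debug.h>
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_lcore.h>
#include <rte_mbuf.h>

#define RX_RING_SIZE 1024
#define NUM_MBUFS 8191
#define MBUF_CACHE 250
#define BURST_SIZE 32

int main(int argc, char **argv){
  // the EAL consumes its own arguments (cores, memory, PCI allowlist, ...)
  if(rte_eal_init(argc, argv) < 0){
    rte_exit(EXIT_FAILURE, "couldn't initialize EAL\n");
  }
  if(rte_eth_dev_count_avail() == 0){
    rte_exit(EXIT_FAILURE, "no DPDK-bound ports found\n");
  }
  uint16_t portid = 0; // assume the first bound port
  // mbuf pool backing the RX descriptor ring (huge pages, if available)
  struct rte_mempool *pool = rte_pktmbuf_pool_create("RXPOOL", NUM_MBUFS,
      MBUF_CACHE, 0, RTE_MBUF_DEFAULT_BUF_SIZE, rte_socket_id());
  if(pool == NULL){
    rte_exit(EXIT_FAILURE, "couldn't create mbuf pool\n");
  }
  struct rte_eth_conf port_conf = {0}; // defaults: one queue, no offloads
  if(rte_eth_dev_configure(portid, 1, 1, &port_conf) ||
     rte_eth_rx_queue_setup(portid, 0, RX_RING_SIZE, rte_socket_id(), NULL, pool) ||
     rte_eth_tx_queue_setup(portid, 0, RX_RING_SIZE, rte_socket_id(), NULL) ||
     rte_eth_dev_start(portid)){
    rte_exit(EXIT_FAILURE, "couldn't configure/start port %u\n", portid);
  }
  while(1){ // the poll-mode loop: no interrupts, no system calls
    struct rte_mbuf *bufs[BURST_SIZE];
    uint16_t nrx = rte_eth_rx_burst(portid, 0, bufs, BURST_SIZE);
    for(uint16_t i = 0 ; i < nrx ; i++){
      // a real application does its work here
      rte_pktmbuf_free(bufs[i]);
    }
  }
  return EXIT_SUCCESS;
}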

I'd argue that XDP is an implementation of Van Jacobson Channels, while DPDK is a true userspace networking stack. The most fundamental difference is that a device bound to DPDK is *no longer a Linux networking interface*. It will not show up in ip link output, and will not have an entry in /sys/class/net (it *will* show up on the PCIe bus). DPDK furthermore encompasses devices beyond network interface cards, including crypto, dma, and regex accelerators. The userspace application configures the device through UIO, and after that theoretically needn't touch the kernel (i.e. invoke system calls) to perform I/O. Everything is based off userspace ringbuffers and memory-mapped I/O.
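
As a concrete (if simplified) illustration of the memory-mapped I/O half of that: once a device has been detached from its kernel driver, its memory BARs can be mapped straight into a process via sysfs, which is essentially what the EAL and PMDs do on your behalf. A hedged sketch, reusing the X550's PCIe address from the example below and a purely hypothetical register offset:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void){
  // BAR0 of the X550 port from the example below; requires root
  const char *bar = "/sys/bus/pci/devices/0000:44:00.0/resource0";
  int fd = open(bar, O_RDWR);
  if(fd < 0){
    perror("open");
    return EXIT_FAILURE;
  }
  struct stat st;
  if(fstat(fd, &st)){
    perror("fstat");
    return EXIT_FAILURE;
  }
  // map the whole BAR; device registers become plain loads and stores
  volatile uint32_t *regs = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE,
                                 MAP_SHARED, fd, 0);
  if(regs == MAP_FAILED){
    perror("mmap");
    return EXIT_FAILURE;
  }
  // offset 0 is purely hypothetical; a real PMD consults the datasheet
  printf("first register: 0x%08x\n", regs[0]);
  munmap((void *)regs, st.st_size);
  close(fd);
  return EXIT_SUCCESS;
}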

With a dual-port 10GBASE-T Intel X550 (ixgbe driver), we see the following in lspci:

44:00.0 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)
44:00.1 Ethernet controller: Intel Corporation Ethernet Controller X550 (rev 01)

With no driver loaded, dpdk-devbind.py --status shows:

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563'
0000:44:00.1 'Ethernet Controller X550 1563'

We load the ixgbe driver, which creates interfaces ixgbe0 and ixgbe1 (we bring up the latter), and now see:

Network devices using kernel driver
===================================
0000:44:00.0 'Ethernet Controller X550 1563' if=ixgbe0 drv=ixgbe unused=
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused= *Active*

We can unbind the driver from one of the ports using dpdk-devbind.py -u ixgbe0. Our ixgbe0 interface disappears, and we now have:

Network devices using kernel driver
===================================
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused= *Active*

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563' unused=ixgbe

We load the uio_pci_generic, vfio-pci, and igb_uio kernel modules (the first two are provided in kernel sources; the last comes from DPDK, and is built by DKMS):

Network devices using kernel driver
===================================
0000:44:00.1 'Ethernet Controller X550 1563' if=ixgbe1 drv=ixgbe unused=igb_uio,vfio-pci,uio_pci_generic *Active*

Other Network devices
=====================
0000:44:00.0 'Ethernet Controller X550 1563' unused=ixgbe,igb_uio,vfio-pci,uio_pci_generic

We now have one port suitable for use by a DPDK application via the igb_uio passthrough (bind it with, e.g., dpdk-devbind.py -b igb_uio 0000:44:00.0). Note that the UIO passthroughs are inferior to vfio-pci (which also would have worked), due to their inability to use IOMMUs and consequent lack of DMA safety. uio_pci_generic doesn't even support MSI, making it unusable under SR-IOV or with RSS.

XDP sweetness

XDP problems

Back in 1998, Ptacek and Newsham wrote one of my all-time favorite infosec papers, "Insertion, Evasion, and Denial of Service: Eluding Network Intrusion Detection." I thought it a great paper then, and have thought about it regularly in the past twenty-five (jfc!) years. The central point is that an IDS/IPS can be only as confident about the traffic reaching a process as it is able to faithfully reproduce the networking between itself and that process. As a simple example, imagine an IPv4 packet fragmented into two fragments A and B. Two different payloads for "A" are sent in succession using fragment offset 0, and only then is B sent. Which "A" payload does a userspace socket receive, the first or the second? It depends on the host's networking stack (and possibly on any middleware between the IPS and the host). If the IPS doesn't match (and know!) said stack, which is generally impossible, its reconstruction can diverge from what the host actually sees.

XDP has some similar problems, and an XDP program requires a good amount of boilerplate if it is not to introduce all kinds of problems of its own. As the simplest example, a host will not normally deliver a payload to a userspace socket unless the packet was addressed to some address on the host. An XDP program forwarding to XSKs must perform this check itself (and there is no obvious way to access netlink-style stack details from within an XDP program). rp_filter usually provides reverse path filtering as described in RFC 3704, but it is not applied to XDP. If a NIC is in promiscuous mode, frames will be delivered to XDP no matter their L2 destination address, etc.
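
Here's a hedged sketch of that boilerplate (the map names xsks_map and local_addrs are my own; userspace is assumed to populate local_addrs with the host's IPv4 addresses and xsks_map with one XSK per RX queue): redirect only IPv4 traffic actually addressed to us, and let everything else fall through to the kernel stack.

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

struct {
  __uint(type, BPF_MAP_TYPE_HASH);
  __uint(max_entries, 64);
  __type(key, __u32);   // local IPv4 address, network byte order
  __type(value, __u8);
} local_addrs SEC(".maps");

struct {
  __uint(type, BPF_MAP_TYPE_XSKMAP);
  __uint(max_entries, 64);
  __type(key, __u32);   // RX queue index
  __type(value, __u32); // XSK file descriptor
} xsks_map SEC(".maps");

SEC("xdp")
int xdp_to_xsk(struct xdp_md *ctx){
  void *data = (void *)(long)ctx->data;
  void *data_end = (void *)(long)ctx->data_end;
  struct ethhdr *eth = data;
  if((void *)(eth + 1) > data_end){
    return XDP_DROP; // runt frame
  }
  if(eth->h_proto != bpf_htons(ETH_P_IP)){
    return XDP_PASS; // leave non-IPv4 to the kernel stack
  }
  struct iphdr *iph = (void *)(eth + 1);
  if((void *)(iph + 1) > data_end){
    return XDP_DROP;
  }
  // the check the kernel stack would otherwise have made for us:
  // is this packet actually addressed to one of our addresses?
  if(!bpf_map_lookup_elem(&local_addrs, &iph->daddr)){
    return XDP_PASS;
  }
  // deliver to the XSK bound to the queue this packet arrived on,
  // falling back to the kernel stack if no XSK is registered there
  return bpf_redirect_map(&xsks_map, ctx->rx_queue_index, XDP_PASS);
}

char _license[] SEC("license") = "GPL";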

  • There seems no way to determine whether UDP and TCP checksums have been validated in hardware (there are methods for checksum deltas when modifying packets). If they have not, I usually want to validate them myself before passing packets to userspace, but if they have, I don't want to incur that cost. If a copy must be made of the packet data, I'd like any checksumming folded into that, where it will likely be hidden under memory access costs.
  • XSKs must bind to a specific NIC hardware queue. The APIs to get hardware queue information are old ethtool ioctls, or the more recent netlink reimplementations of same (a query via those ioctls is sketched just after this list). There are at least four types of hardware queue, dependent on hardware and driver. The APIs to *manage* hardware queues are barely there, yet to use XDP one absolutely must either (a) collapse all RX queues into a single queue, (b) direct all desired traffic to the XDP-bound queue with a NIC ntuple rule, or (c) bind XDP programs to all RX queues.
  • If I've got a pool of workers, each with their own XSK, I'd often like to distribute packets to XSKs based on dynamic backlog, and I *never* want to dispatch to an XSK whose ringbuffer is full. The XSK abstraction offers no way to do this (you could build it yourself with eBPF maps).
  • Multiple XDP applications can't easily coexist. xdp-loader gives you a basic chaining infrastructure, sure, but take the checksum example above. In the absence of hardware checksums (and the ability to detect that they've been validated), must each XDP program validate? Must each XDP program check for L3 well-formedness? If they need defragmentation, must all programs carry that code? It seems to me that eBPF is sufficiently limited that the kernel ought be able to weave together various XDP programs, warn "this one will never match because its matches are all eaten by this already-loaded program", etc. Even that would probably be a mess. Until then, though, I think xdp-loader's default policy of chaining is misguided.
  • I ought be able to have all my rings backed by huge pages if I so desire, but this can't be accomplished with kernel-allocated rings as used with XSKs.
  • The jumbo frame support is a goddamn mess. Generally, you can't run XDP on an MTU larger than your page size less about a thousand bytes (3050 is typically as high as one can go on 4KiB pages). This effectively shoves an atomic bomb right up the ass of many high-speed networking setups. Do many people use Ethernet MTUs larger than 1500? No. Do many people who are interested in high-performance networking? Absolutely. Some drivers are rumored to support "multibuf"-based jumbo frames, but there's no way to discover whether you're working with such a driver. Give me my 8K MTUs back!
  • Locally-generated packets generally don't hit XDP unless you're binding to loopback. I.e., if I have an Intel X540 with an IP address of 172.16.88.40/24 at ixgbe0, and I bind an XDP program to it, and I ping 172.16.88.40 from that same machine, the XDP program is not going to see that traffic. This contradicts how local addresses typically work, requiring either that one explain the strange restriction to users, or that one build an alternate path using regular sockets to handle this case.
  • Running libxdp-based programs under valgrind results in all kinds of illegal access errors being thrown, which is discomforting for such a low-level component. Also, if xdp-loader crashes while holding its filesystem-based lock, it can't be used again until the lock is deleted, even if the PID recorded therein no longer exists.
  • Binding an XDP program effectively cycles the device on some NICs.
  • I can't determine what kind of device I've been bound to from within my XDP program. Ought I expect Ethernet L2? Who knows!
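
Expanding on the hardware-queue complaint above: the query side of those "old ethtool ioctls" looks like the sketch below, and the four queue types correspond to the fields of struct ethtool_channels. ETHTOOL_SCHANNELS is the (limited) management counterpart.

#include <linux/ethtool.h>
#include <linux/sockios.h>
#include <net/if.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <unistd.h>

int main(int argc, char **argv){
  if(argc != 2){
    fprintf(stderr, "usage: %s iface\n", argv[0]);
    return EXIT_FAILURE;
  }
  int fd = socket(AF_INET, SOCK_DGRAM, 0); // any socket serves as an ioctl handle
  if(fd < 0){
    perror("socket");
    return EXIT_FAILURE;
  }
  struct ethtool_channels ec = { .cmd = ETHTOOL_GCHANNELS, };
  struct ifreq ifr = {0};
  strncpy(ifr.ifr_name, argv[1], IFNAMSIZ - 1);
  ifr.ifr_data = (void *)&ec;
  if(ioctl(fd, SIOCETHTOOL, &ifr)){
    perror("SIOCETHTOOL"); // some drivers don't implement get_channels at all
    return EXIT_FAILURE;
  }
  printf("%s rx %u/%u tx %u/%u other %u/%u combined %u/%u (current/max)\n",
         argv[1], ec.rx_count, ec.max_rx, ec.tx_count, ec.max_tx,
         ec.other_count, ec.max_other, ec.combined_count, ec.max_combined);
  close(fd);
  return EXIT_SUCCESS;
}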