
Io uring: Difference between revisions

From dankwiki
* [https://kernel.dk/io_uring.pdf Efficient IO with io_uring]
* [https://unixism.net/loti/index.html Lord of the io_uring]
* [https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/ IoRing vs io_uring: A comparison of Windows and Linux implementations], and the same author's [https://windows-internals.com/i-o-rings-when-one-i-o-operation-is-not-enough/ I/O Rings—When One I/O is Not Enough]

Revision as of 23:04, 3 May 2023

io_uring, introduced in 2019 (kernel 5.1) by Jens Axboe, is a system for providing the kernel with a schedule of system calls, and receiving the results as they're generated. It combines asynchronous I/O, system call polybatching, and flexible buffer management, and is IMHO the most substantial development in the Linux I/O model since Berkeley sockets:

  • Asynchronous I/O without the large copy overheads and restrictions of Linux's older AIO interface (no O_DIRECT requirement, etc.)
  • System call batching across distinct system calls (not just readv() and recvmmsg())
    • Whole sequences of distinct system calls can be strung together
  • Provide a pool of buffers, and they'll be used as needed

The core system calls of io_uring (henceforth uring) are wrapped by the C API of liburing. Windows added a very similar interface, IoRing, in 2020. In my opinion, uring ought to largely displace epoll in new Linux code. FreeBSD seems to be sticking with kqueue, meaning code using uring won't run there, but neither did epoll (save through FreeBSD's somewhat dubious Linux compatibility layer).

Rings

Central to every uring are two ringbuffers: the submission queue, holding descriptors of SQEs (Submission Queue Entries), and the completion queue, holding CQEs (Completion Queue Entries). As best I can tell, this terminology was used in the NVMe specification, and before that on the IBM AS/400. Each SQE roughly corresponds to a single system call: it is tagged with an operation type, and filled in with the values that would traditionally be supplied as arguments to the appropriate function. Userspace acquires references to free SQEs, fills them in, and submits them. Submission operates up through a specified SQE, and thus all SQEs before it in the ring must also be ready to go. The kernel places results in the CQE ring. These rings are shared between kernel and userspace. The rings must be mapped separately unless the kernel advertises the IORING_FEAT_SINGLE_MMAP feature (see below). Note that SQEs are allocated externally to the SQ descriptor ring.

It is possible for a single submission to result in multiple completions (e.g. io_uring_prep_multishot_accept(3)); this is known as multishot.

uring does not generally make use of errno. Synchronous functions return the negative error code as their result. Completion queue entries have the negated error code placed in their res fields.
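
As a minimal sketch of that convention, a completion's res value can be split into "success" or an errno-style code. The helper names cqe_error and cqe_strerror are hypothetical and not part of liburing; only strerror(3) and the errno constants come from libc.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper: returns 0 if res indicates success, otherwise the
 * positive errno value that was negated into the CQE's res field. */
int cqe_error(int res)
{
    return res < 0 ? -res : 0;
}

/* Hypothetical helper: human-readable description of a CQE result. */
const char *cqe_strerror(int res)
{
    return res < 0 ? strerror(-res) : "success";
}
```

A non-negative res is the operation's ordinary return value (for instance, a byte count for a read), so only the sign needs to be checked before interpreting it.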

CQEs are usually 16 bytes, and SQEs are usually 64 bytes (but see IORING_SETUP_SQE128 and IORING_SETUP_CQE32 below). Either way, SQEs are allocated externally to the submission queue, which is merely a ring of descriptors.

Setup

The io_uring_setup(2) system call returns a file descriptor, and accepts two parameters, u32 entries and struct io_uring_params *p:

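int io_uring_setup(u32 entries, struct io_uring_params *p);
struct io_uring_params {
  __u32 sq_entries;     /* out: number of submission queue entries */
  __u32 cq_entries;     /* out: number of completion queue entries (in with IORING_SETUP_CQSIZE) */
  __u32 flags;          /* in: IORING_SETUP_* flags */
  __u32 sq_thread_cpu;  /* in: CPU for the poll thread (IORING_SETUP_SQ_AFF) */
  __u32 sq_thread_idle; /* in: poll thread idle timeout in milliseconds (IORING_SETUP_SQPOLL) */
  __u32 features;       /* out: IORING_FEAT_* flags */
  __u32 wq_fd;          /* in: existing ring fd (IORING_SETUP_ATTACH_WQ) */
  __u32 resv[3];        /* reserved, must be zero */
  struct io_sqring_offsets sq_off; /* out: offsets into the SQ ring mapping */
  struct io_cqring_offsets cq_off; /* out: offsets into the CQ ring mapping */
};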

It is wrapped by liburing's io_uring_queue_init(3) and io_uring_queue_init_params(3), both of which operate on a struct io_uring. When using these wrappers, io_uring_queue_exit(3) should be used to clean up. io_uring_queue_init(3) takes an unsigned flags argument, which is passed as the flags field of io_uring_params. io_uring_queue_init_params(3) takes a struct io_uring_params* argument, which is passed through directly to io_uring_setup(2).
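
To make the underlying call concrete, here is a minimal sketch that invokes io_uring_setup(2) directly through syscall(2), bypassing liburing. The wrapper name ring_setup is hypothetical; it assumes kernel headers recent enough to provide <linux/io_uring.h> and __NR_io_uring_setup, and it does not perform the mmap(2) calls a usable ring would also need.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical wrapper: returns the ring's file descriptor, or a negated
 * errno value on failure (the same convention liburing uses). */
int ring_setup(unsigned entries, unsigned flags, struct io_uring_params *p)
{
    memset(p, 0, sizeof(*p));   /* reserved fields must be zero */
    p->flags = flags;           /* IORING_SETUP_* flags, possibly none */
    long fd = syscall(__NR_io_uring_setup, entries, p);
    return fd < 0 ? -errno : (int)fd;
}
```

On success the kernel has filled in sq_entries, cq_entries, features, and the two offset structures. With liburing, the equivalent would be io_uring_queue_init(entries, &ring, flags), later paired with io_uring_queue_exit(&ring).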

Flags

Flag Kernel version Description
IORING_SETUP_IOPOLL 5.1 Instruct the kernel to use polled (as opposed to interrupt-driven) I/O. This is intended for block devices, and requires that O_DIRECT was provided when the file descriptor was opened.
IORING_SETUP_SQPOLL 5.1 (5.11 for full features) Create a kernel thread to poll on the submission queue. If the submission queue is kept busy, this thread will reap SQEs without the need for a system call. If enough time goes by without new submissions, the kernel thread goes to sleep, and io_uring_enter(2) must be called to wake it.
IORING_SETUP_SQ_AFF 5.1 Only meaningful with IORING_SETUP_SQPOLL. The poll thread will be bound to the core specified in sq_thread_cpu.
IORING_SETUP_CQSIZE 5.5 Create the completion queue with cq_entries entries. This value must be greater than entries, and might be rounded up to the next power of 2.
IORING_SETUP_CLAMP 5.6
IORING_SETUP_ATTACH_WQ 5.6
IORING_SETUP_R_DISABLED 5.10 Start the uring disabled, requiring that it be enabled with io_uring_register(2).
IORING_SETUP_SUBMIT_ALL 5.18 Continue submitting SQEs from a batch even after one results in error.
IORING_SETUP_COOP_TASKRUN 5.19
IORING_SETUP_TASKRUN_FLAG 5.19
IORING_SETUP_SQE128 5.19 Use 128-byte SQEs, necessary for NVMe passthroughs using IORING_OP_URING_CMD.
IORING_SETUP_CQE32 5.19 Use 32-byte CQEs, necessary for NVMe passthroughs using IORING_OP_URING_CMD.
IORING_SETUP_SINGLE_ISSUER 6.0 Hint to the kernel that only a single thread will submit requests, allowing for optimizations. This thread must either be the thread which created the ring, or (iff IORING_SETUP_R_DISABLED is used) the thread which enables the ring.
IORING_SETUP_DEFER_TASKRUN 6.1
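
The sizing rule given for IORING_SETUP_CQSIZE can be modelled in a few lines. This is only a sketch of the rule as stated in the table (the kernel's actual checks may differ in detail), and both function names are hypothetical. To request a custom size, one would set IORING_SETUP_CQSIZE in flags and the desired count in cq_entries before calling io_uring_setup(2).

```c
/* Round v up to the next power of two (v itself if it already is one).
 * Returns 0 if the result would not fit in an unsigned 32-bit value. */
unsigned round_up_pow2(unsigned v)
{
    if (v > (1u << 31))
        return 0;
    unsigned p = 1;
    while (p < v)
        p <<= 1;
    return p;
}

/* Hypothetical model of the table's rule: returns the completion queue
 * size one might expect for a requested cq_entries, or 0 if the request
 * violates "must be greater than entries". */
unsigned expected_cq_size(unsigned entries, unsigned cq_entries)
{
    if (cq_entries <= entries)
        return 0;
    return round_up_pow2(cq_entries);
}
```

The size actually granted is reported back in the cq_entries field of io_uring_params, so code should read it from there rather than rely on a prediction like this one.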

Kernel features

Various functionality was added to the kernel following the initial release of uring, and is thus not necessarily available on all kernels supporting the basic system calls. The __u32 features field of the io_uring_params parameter to io_uring_setup(2) is filled in with feature flags by the kernel.

Feature Kernel version Description
IORING_FEAT_SINGLE_MMAP 5.4 A single mmap(2) can be used for both the submission and completion rings.
IORING_FEAT_NODROP 5.5 (5.19 for full features)
IORING_FEAT_SUBMIT_STABLE 5.5
IORING_FEAT_RW_CUR_POS 5.6
IORING_FEAT_CUR_PERSONALITY 5.6
IORING_FEAT_FAST_POLL 5.7
IORING_FEAT_POLL_32BITS 5.9
IORING_FEAT_SQPOLL_NONFIXED 5.11
IORING_FEAT_ENTER_EXT_ARG 5.11
IORING_FEAT_NATIVE_WORKERS 5.12
IORING_FEAT_RSRC_TAGS 5.13
IORING_FEAT_CQE_SKIP 5.17
IORING_FEAT_LINKED_FILE 5.17
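
Checking for a feature is a bit test against the features field after io_uring_setup(2) returns. The sketch below assumes <linux/io_uring.h> is available; the helper names ring_has_feature and ring_mmap_count are hypothetical.

```c
#include <linux/io_uring.h>
#include <stdbool.h>

/* True if the kernel advertised every IORING_FEAT_* bit in feature. */
bool ring_has_feature(const struct io_uring_params *p, __u32 feature)
{
    return (p->features & feature) == feature;
}

/* Number of mmap(2) calls needed to map the submission and completion
 * rings. The SQE array is mapped separately in either case. */
int ring_mmap_count(const struct io_uring_params *p)
{
    return ring_has_feature(p, IORING_FEAT_SINGLE_MMAP) ? 1 : 2;
}
```

Testing the flags at runtime, rather than comparing kernel version numbers, also copes with distribution kernels that backport features.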

Registered resources

Submitting work

Submitting work consists of four steps:

  • Acquiring free SQEs
  • Filling in those SQEs
  • Placing those SQEs at the tail of the submission queue
  • Submitting the work, possibly using a system call
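
The four steps can be sketched against the raw kernel interface, which keeps each one visible. This is an illustrative sketch and not how liburing is implemented: the function name submit_one_nop and the user_data value are hypothetical, it assumes <linux/io_uring.h> is available, and it skips reaping the completion. With liburing, io_uring_get_sqe(3), io_uring_prep_nop(3), and io_uring_submit(3) cover the same ground.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical demo: submit a single no-op. Returns the number of SQEs
 * the kernel consumed (1 on success), or a negated errno value. */
int submit_one_nop(void)
{
    struct io_uring_params p;
    memset(&p, 0, sizeof(p));
    int fd = (int)syscall(__NR_io_uring_setup, 4, &p);
    if (fd < 0)
        return -errno;

    /* Map the SQ ring (head, tail, mask, index array) and the SQE array. */
    size_t ring_sz = p.sq_off.array + p.sq_entries * sizeof(__u32);
    size_t sqes_sz = p.sq_entries * sizeof(struct io_uring_sqe);
    char *ring = mmap(NULL, ring_sz, PROT_READ | PROT_WRITE, MAP_SHARED,
                      fd, IORING_OFF_SQ_RING);
    if (ring == MAP_FAILED) {
        int err = errno;
        close(fd);
        return -err;
    }
    struct io_uring_sqe *sqes = mmap(NULL, sqes_sz, PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, IORING_OFF_SQES);
    if (sqes == MAP_FAILED) {
        int err = errno;
        munmap(ring, ring_sz);
        close(fd);
        return -err;
    }
    unsigned *sq_tail  = (unsigned *)(ring + p.sq_off.tail);
    unsigned *sq_mask  = (unsigned *)(ring + p.sq_off.ring_mask);
    unsigned *sq_array = (unsigned *)(ring + p.sq_off.array);

    /* 1. Acquire a free SQE: the slot at the tail (the ring is empty). */
    unsigned tail = *sq_tail;
    unsigned idx = tail & *sq_mask;
    struct io_uring_sqe *sqe = &sqes[idx];

    /* 2. Fill it in: a no-op, tagged with an arbitrary user_data value. */
    memset(sqe, 0, sizeof(*sqe));
    sqe->opcode = IORING_OP_NOP;
    sqe->user_data = 42;

    /* 3. Place its index at the tail of the SQ, then publish the tail. */
    sq_array[idx] = idx;
    __atomic_store_n(sq_tail, tail + 1, __ATOMIC_RELEASE);

    /* 4. Submit with a system call (this sketch uses no SQPOLL thread). */
    int ret = (int)syscall(__NR_io_uring_enter, fd, 1, 0, 0, NULL, (size_t)0);
    if (ret < 0)
        ret = -errno;

    munmap(sqes, sqes_sz);
    munmap(ring, ring_sz);
    close(fd);
    return ret;
}
```

Step 3 is where the separation between the SQE array and the descriptor ring shows up: the ring stores only the index of the filled-in SQE, and the release store on the tail is what makes the entry visible to the kernel.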

Reaping completions

External links