Linux APIs
Latest revision as of 21:09, 16 May 2024
The Linux kernel is a protean, rapidly changing thing. This is reflected even in "stable" APIs, especially when additions are strictly augmentative with regard to standards. Readers of classics like APIUE would do well to keep up with their LKML, or at least the man pages. I'm documenting departures from these standards as I come across them in various man pages and source code. The kernel man pages can be browsed at http://www.kernel.org/doc/man-pages/online_pages.html. This page serves as a companion to "FreeBSD APIs" and "POSIX".
File descriptors
- Since 2.1.69 and glibc 2.1, pread(2) and pwrite(2) have allowed read(2)- and write(2)-like behavior on a file descriptor from a specified offset, without updating the descriptor's offset. This is the atomic equivalent of an lseek(2) followed by an I/O, and is particularly useful when multiple threads are working with the same file descriptor (since the file offset is shared by every thread using that descriptor).
- Since 2.6.23, the open(2) system call accepts the O_CLOEXEC flag (recvmsg(2) likewise accepts a corresponding MSG_CMSG_CLOEXEC flag). This sets the close-on-exec flag atomically with the update of the dtable, protecting against a race condition arising from fork(2)+exec(2) calls in other threads.
- Since 2.6.24, fcntl(2) implements a F_DUPFD_CLOEXEC operation, mating O_CLOEXEC to F_DUPFD.
- epoll(7) was introduced in 2.5.44; it is in some ways similar to FreeBSD's kqueue or Solaris's /dev/poll (one major advantage over poll(2) is that the pollfd structures need not be copied from user- to kernel-space with each invocation).
- Since 2.6.25 and 2.6.22 respectively, the timerfd and signalfd system call families have supported timer- and signal-based operations via the file descriptor model (they are fully supported by poll(2), epoll(7), etc). These roughly correspond to the EVFILT_TIMER and EVFILT_SIGNAL event filters in kqueue. Since 2.6.27, timerfd_create(2) supports TFD_CLOEXEC and TFD_NONBLOCK bits in the flags parameter, and signalfd(2) supports a corresponding SFD_CLOEXEC/SFD_NONBLOCK pair.
- In the course of 2.6.26 development, signalfd(2) saw an API change; the third parameter went from a (frankly baffling) masksize to an integer flags. See the LKML thread "signalfd API issues", and this LWN article.
- The obsolete BSD implementation of asynchronous I/O is extended via the F_GETSIG and F_SETSIG subcommands to fcntl(2).
- Since 2.6.22, the eventfd(2) system call returns a file descriptor that can be used for userspace-to-userspace and kernel-to-userspace event notification. It is associated with an 8-byte kernelspace counter, which the first parameter initializes. Since Linux 2.6.27, the eventfd2(2) system call also accepts a flags parameter, specified using EFD_NONBLOCK and EFD_CLOEXEC. Support for eventfd(2) was added in glibc 2.8, and transparent support for eventfd2(2) was added in glibc 2.9.
- 2.6.39 introduced name_to_handle_at() and open_by_handle_at(), similar to FreeBSD's getfh() and fhopen(). They effectively break openat() into two parts.
- Since 6.10, fcntl(2) supports F_DUPFD_QUERY to test whether two file descriptors reference the same underlying file.
Synchronization
- Since 2.5.7, futexes have provided Linux's primary userspace locking primitive, and been at the heart of NPTL. Their API changed numerous times through 2.5's development.
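glibc exports no futex() wrapper, so userspace reaches the primitive through syscall(2). The sketch below shows only the two basic operations, FUTEX_WAIT and FUTEX_WAKE; the wrapper names are hypothetical, and a real lock would layer an atomic state machine on top of them.

```c
#include <assert.h>
#include <errno.h>
#include <linux/futex.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw system call. */
static long futex(uint32_t *uaddr, int op, uint32_t val)
{
    assert(uaddr != NULL);
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* Sleep only if *uaddr still equals expected; if the word has already
 * changed, the kernel returns -1 with errno set to EAGAIN. */
static long futex_wait(uint32_t *uaddr, uint32_t expected)
{
    return futex(uaddr, FUTEX_WAIT, expected);
}

/* Wake up to n waiters; returns the number actually woken. */
static long futex_wake(uint32_t *uaddr, int n)
{
    return futex(uaddr, FUTEX_WAKE, (uint32_t)n);
}
```

The compare-then-sleep behavior of FUTEX_WAIT is what lets the uncontended path stay entirely in userspace.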
Processes
- prctl(2) was added to Linux 2.1.57 as an "ioctl(2) for processes". It has any number of capabilities (list them FIXME).
- arch_prctl(2) addresses architecture-specific prctl(2)-like features. It needn't generally be used. glibc provides no prototype for arch_prctl(2).
- On x86-64, it supports setting and retrieving the value of the FS and GS registers.
- kcmp(2) was added in 3.5 (when built with CONFIG_CHECKPOINT_RESTORE) to test whether two resources of two (possibly distinct) processes are equal. Since 5.12, this can be enabled with CONFIG_KCMP.
Execution resources
General details of cpu partitioning and affinity can be found on the cpuset page.
- getcpu(2) was added in 2.6.19, with glibc support in 2.6. It identifies the current CPU and NUMA node of the thread (this might be immediately invalidated). Only one CPU and NUMA id can be reported, which might not make sense for all process models (especially the NUMA part). sched_getcpu(3) is equivalent to calling getcpu(&aid, NULL, NULL).
- Since 2.5.8, sched_getaffinity(2) and sched_setaffinity(2) have provided affinity mask management within the process's cpuset. Glibc support was added in 2.3. In glibc 2.3.4's pthreads implementation, pthread_getaffinity_np(3) and pthread_setaffinity_np(3) were added as wrappers around these system calls.
- Since 2.6.26, getrusage(2) with a RUSAGE_THREAD parameter retrieves statistics for the calling thread only.
- modify_ldt(2), specific to the x86 architecture, allows the Local Descriptor Table to be modified.
- set_thread_area(2) allows an area of memory to be designated thread-specific data. It was introduced in the 2.5.29 kernel.
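The affinity and per-thread accounting calls above can be sketched as follows. This assumes glibc with _GNU_SOURCE defined before any include; the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/resource.h>

/* Number of CPUs the calling thread may run on, or -1 on error. */
static int allowed_cpus(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}

/* Pin the calling thread to a single CPU; returns 0 on success. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);
}

/* Resource usage of the calling thread only (Linux-specific RUSAGE_THREAD). */
static int thread_usage(struct rusage *ru)
{
    assert(ru != NULL);
    return getrusage(RUSAGE_THREAD, ru);
}
```

Pinning to the value returned by sched_getcpu() is a convenient way to try this out, since that CPU is necessarily within the thread's current mask.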
clone(2)
- clone(2) is far more granular with regard to what's copied and shared than fork(2).
- ...
POSIX capabilities
- CONFIG_SECURITY_CAPABILITIES must be set. Chris Friedhoff's POSIXFileCaps page is excellent.
- Since 2.2.18, the prctl(2) system call accepts the PR_SET_KEEPCAPS option, allowing capabilities to be retained across an event that causes all of the effective, real and saved-set user IDs to become non-zero when at least one was previously zero. A program run as root because it needs some capability (say, CAP_NET_RAW) can use this together with cap_set_proc(3) to drop root privileges and most capabilities.
- Since 2.6.24 or 2.6.19-rc5-mm2, CONFIG_SECURITY_FILE_CAPABILITIES enables association of POSIX capabilities with filenames via the setcap(1) tool.
- There's a new Debian package, libcap2-dev. I've not yet looked into this.
Signals
- Since 2.1.57, a process can receive notification of its parent process's death using the prctl(2) system call with an option argument of PR_SET_PDEATHSIG. Any signal can be used for such notifications; supply the chosen signal as arg2 (or 0 to cancel parent process death notification). Since 2.3.15, PR_GET_PDEATHSIG can be used to obtain the current value.
- Glibc's old pthreads library, LinuxThreads, has some major inconsistencies signal-wise with the pthreads standard. Check the docs.
- Glibc's new pthreads library, NPTL, remedies most of these inconsistencies, but check the docs.
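The parent-death notification described above can be sketched in a few lines. Note that <sys/prctl.h> spells the options PR_SET_PDEATHSIG and PR_GET_PDEATHSIG; the two helper names are hypothetical.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>

/* Ask the kernel to deliver sig to the caller when its parent dies.
 * Passing 0 cancels the notification. */
static int set_parent_death_signal(int sig)
{
    assert(sig >= 0);
    return prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0, 0, 0);
}

/* Return the currently configured signal (0 if none), or -1 on error. */
static int get_parent_death_signal(void)
{
    int sig = 0;
    if (prctl(PR_GET_PDEATHSIG, &sig, 0, 0, 0) != 0)
        return -1;
    return sig;
}
```

A child would typically call set_parent_death_signal(SIGTERM) immediately after fork(2), then check getppid() in case the parent had already exited.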
Monitoring
dnotify
- Dnotify is deprecated and terrible. Eschew it! (see the fcntl(2) man page, F_NOTIFY)
inotify
- Sexy, sexy inotify(7) has replaced dnotify as of 2.6.13 (glibc 2.4). (Here's a useful FAQ).
- FreeBSD looks like it'll be emulating inotify, likely using EVFILT_VNODE.
- inotify_init(void), since 2.6.13, creates an inotify file descriptor
- inotify_init1(int flags), since 2.6.27, creates an inotify file descriptor. Pass IN_NONBLOCK for a nonblocking descriptor, and IN_CLOEXEC for a close-on-exec descriptor.
fanotify
Merged in 2.6.36.
Networking
See below for ethtool (SIOCETHTOOL) coverage.
- recvmmsg(2), added in 2.6.33 and glibc 2.12, allows multiple messages to be received from a socket, with a timeout, using a single system call.
- The new flag MSG_WAITFORONE enables MSG_DONTWAIT following receipt of the first message.
- sendmmsg(2), added in 3.0 and glibc 2.14, allows multiple messages to be sent on a socket using a single system call.
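The batching calls above might be used along these lines. This is a sketch with a fixed batch of two datagrams and 64-byte receive buffers (both arbitrary choices); BATCH, send_pair and recv_pair are hypothetical names, and _GNU_SOURCE must be defined before any include.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 2

/* Send two datagrams with one system call; returns the number sent. */
static int send_pair(int fd, const char *a, const char *b)
{
    assert(a != NULL && b != NULL);
    struct iovec iov[BATCH] = {
        { .iov_base = (void *)a, .iov_len = strlen(a) },
        { .iov_base = (void *)b, .iov_len = strlen(b) },
    };
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    return sendmmsg(fd, msgs, BATCH, 0);
}

/* Receive up to two datagrams with one system call. MSG_WAITFORONE
 * blocks for the first message only. lens[i] receives each length. */
static int recv_pair(int fd, char bufs[BATCH][64], unsigned lens[BATCH])
{
    struct iovec iov[BATCH];
    struct mmsghdr msgs[BATCH];
    memset(msgs, 0, sizeof msgs);
    for (int i = 0; i < BATCH; i++) {
        iov[i].iov_base = bufs[i];
        iov[i].iov_len = 64;
        msgs[i].msg_hdr.msg_iov = &iov[i];
        msgs[i].msg_hdr.msg_iovlen = 1;
    }
    int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
    for (int i = 0; i < n; i++)
        lens[i] = msgs[i].msg_len;
    return n;
}
```

An AF_UNIX SOCK_DGRAM socketpair(2) is enough to exercise both helpers without touching the network.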
Netlink
- Kernel 2.1 introduced the Linux netlink system and the PF_NETLINK socket(2) protocol family.
Socket Options
SOL_SOCKET
- SO_DOMAIN (since 2.6.32) retrieves the socket domain as an integer (e.g. AF_INET, AF_INET6). This is a read-only sockopt.
- SO_PROTOCOL (since 2.6.32) retrieves the socket protocol as an integer (e.g. IPPROTO_TCP, IPPROTO_SCTP). This is a read-only sockopt.
- SO_RCVBUFFORCE (since 2.6.14) allows processes with CAP_NET_ADMIN capabilities to perform a SO_RCVBUF operation which overrides the rmem_max proc limit.
- Likewise, SO_SNDBUFFORCE (also since 2.6.14) allows SO_SNDBUF to override the wmem_max proc limit.
IPPROTO_IP
- IP_MTU_DISCOVER modifies Path MTU discovery for the associated socket descriptor.
IPPROTO_TCP
- TCP_CORK, introduced during Linux 2.2, suppresses emission of packets smaller than the MSS (through a 200ms ceiling) by coalescing writes to the socket. This can be a slight hit to latency (up through the ceiling), but can be very useful for throughput-oriented services. Clearing the flag results in queued data immediately being sent. Compare with FreeBSD's TCP_NOPUSH.
- Only since Linux 2.5.71 can TCP_CORK be combined with TCP_NODELAY.
- TCP_DEFER_ACCEPT, introduced during Linux 2.4, prevents listen(2)ing sockets from appearing ready, and accept(2) from passing back descriptors, until data has been received into socket memory. Compare with FreeBSD's SO_ACCEPTFILTER.
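Corking is a plain boolean socket option. In the sketch below (helper names hypothetical), a sender would cork the socket, issue several small writes, then uncork to flush them as full-sized segments.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cork (on = 1) or uncork (on = 0) a TCP socket. Uncorking causes any
 * queued partial segment to be sent immediately. */
static int tcp_cork(int fd, int on)
{
    assert(on == 0 || on == 1);
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Returns 1 if corked, 0 if not, -1 on error. */
static int tcp_is_corked(int fd)
{
    int on = 0;
    socklen_t len = sizeof on;
    if (getsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, &len) != 0)
        return -1;
    return on;
}
```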
ICMP_FILTER
Used only with SOCK_RAW sockets bound to the IPPROTO_ICMP protocol. The value is a bitmask of ICMP types to filter out.
Memory
- mremap(2) allows an existing memory map (resulting from a successful mmap(2) call) to be shrunk or expanded. In combination with MAP_ANONYMOUS, this provides the base for a high-speed realloc(3) implementation. By default, the map will not be moved (and an error will be returned if such a move would be necessary); by supplying the MREMAP_MAYMOVE flag, this behavior can be changed (pointers into the buffer might be invalidated by such a call). MREMAP_FIXED causes mremap(2) to accept a fifth argument, specifying an address to which the map must be moved. MREMAP_MAYMOVE must be supplied with MREMAP_FIXED, according to the Linux man pages version 3.07, even if the target destination is the same as the source.
- remap_file_pages(2), present since Linux 2.5.46 and glibc 2.3.3, allows the pages of a VMA to be permuted (the actual VMA cannot be shrunk or enlarged, as in mremap(2)).
- madvise(2) accepts several options beyond those specified by POSIX.1b; in addition to the arguments specified by POSIX.1-2001 for posix_madvise(3), Linux since version 2.6.16 supports MADV_REMOVE, MADV_DONTFORK, and MADV_DOFORK. MADV_REMOVE allows reclamation of unused pages within a sparse mapping, similar to FreeBSD's MADV_FREE. The other two disable the default sharing of maps with child processes across a fork(2), and re-enable it, respectively.
- The hugetlbfs file system supports reduction of mapping granularity in the VM. It's used by (among other applications) MySQL and kvm. More details are available at Pages.
- The *_largepages(2)/*_hugepages(2) calls were present only in Linux 2.5.36-2.5.54.
- process_vm_readv(2) allows a process to read directly from another process's address space, while process_vm_writev(2) allows one process to write into another's. Both were introduced in 3.2, and require the CONFIG_CROSS_MEMORY_ATTACH kernel option.
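The realloc(3)-like use of mremap(2) with MAP_ANONYMOUS mentioned above can be sketched as follows. The helper names are hypothetical, and _GNU_SOURCE must be defined before any include for glibc to declare mremap().

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

/* Anonymous private read/write mapping of len bytes, or NULL on failure. */
static void *map_anon(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Resize a mapping realloc-style. With MREMAP_MAYMOVE the kernel may
 * relocate it, so old pointers into the region must be discarded and
 * the return value used instead. Returns NULL on failure. */
static void *remap_anon(void *old, size_t old_len, size_t new_len)
{
    assert(old != NULL);
    void *p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}
```

Existing contents are preserved across the resize, and a move is done by remapping pages rather than copying them, which is where the speed advantage over a copying realloc comes from.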
Mounts
6.8 introduced listmount(2) and statmount(2).
Devices
Ethtool
- The SIOCETHTOOL ioctl supports low-level operations on supported networking devices. It exchanges a struct ifreq whose ifr_data field points to some ethtool struct corresponding to a provided subcommand.
See Also
- "Capabilities across execve(2)" on LKML is insightful commentary on capabilities