Linux APIs

The Linux kernel is a protean, rapidly-changing thing. This is reflected even in "stable" APIs, especially where additions are strictly augmentative with regard to standards. Readers of classics like APUE would do well to keep up with their LKML, or at least the man pages. I'm documenting departures from these standards as I come across them in various man pages and source code. The kernel man pages can be browsed at http://www.kernel.org/doc/man-pages/online_pages.html. This page serves as a companion to "FreeBSD APIs".

File descriptors

  • Since 2.6.16, the openat(2) system call can open(2) a file relative to an open directory provided as the first argument. As the man page says:
       openat() and other similar system calls suffixed "at" are supported for
       two reasons.

       First, openat() allows an application to avoid race conditions that
       could occur when using open(2) to open files in directories other than
       the current working directory. These race conditions result from the
       fact that some component of the directory prefix given to open(2) could
       be changed in parallel with the call to open(2). Such races can be
       avoided by opening a file descriptor for the target directory, and then
       specifying that file descriptor as the dirfd argument of openat().

       Second, openat() allows the implementation of a per-thread "current
       working directory", via file descriptor(s) maintained by the
       application. (This functionality can also be obtained by tricks based
       on the use of /proc/self/fd/dirfd, but less efficiently.)
  • Since 2.6.23, the open(2) system call accepts the O_CLOEXEC flag (as does recvmsg(2), via the corresponding MSG_CMSG_CLOEXEC flag). This sets the close-on-exec flag atomically with the descriptor table update that installs the new descriptor, protecting against a race with fork(2)+exec(2) calls in other threads that a separate fcntl(2) F_SETFD call would leave open.
    • Since 2.6.24, fcntl(2) implements a F_DUPFD_CLOEXEC operation, mating O_CLOEXEC to F_DUPFD.
  • epoll(7) was introduced in 2.5.44; it is in some ways similar to FreeBSD's kqueue or Solaris's /dev/poll (one major advantage over poll(2) is that the pollfd structures need not be copied from user- to kernel-space with each invocation).
    • FreeBSD aims to emulate epoll(7), but they don't yet seem to have done so...
  • Since 2.6.25 and 2.6.22 respectively, the timerfd and signalfd system call families exist, supporting timer and signal-based operations via the file descriptor model (they are fully supported by poll(2), epoll(7), etc). These roughly correspond to the EVFILT_TIMER and EVFILT_SIGNAL event filters in kqueue. Since 2.6.27, timerfd_create(2) supports TFD_CLOEXEC and TFD_NONBLOCK bits in the flags parameter, and signalfd(2) supports a corresponding SFD_CLOEXEC/SFD_NONBLOCK pair.
    • In the course of 2.6.26 development, signalfd(2) saw an API change; the third parameter went from a (frankly baffling) masksize to an integer flags. See the LKML thread "signalfd API issues", and this LWN article.
  • The obsolete BSD implementation of asynchronous I/O is extended via the F_GETSIG and F_SETSIG subcommands to fcntl(2).
  • Since 2.6.22, the eventfd(2) system call returns a file descriptor that can be used for userspace-to-userspace and kernel-to-userspace event notification. It is associated with an 8-byte kernelspace counter, which the first parameter initializes. Since Linux 2.6.27, the eventfd2(2) system call also accepts a flags parameter, specified using EFD_NONBLOCK and EFD_CLOEXEC. Support for eventfd(2) was added in glibc 2.8, and transparent support for eventfd2(2) was added in glibc 2.9.
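
A minimal sketch of two of the calls above, assuming a reasonably modern kernel and glibc. The function names open_relative_cloexec and eventfd_roundtrip are made up for illustration, and epoll_create1(2) with EPOLL_CLOEXEC (not covered above) is used to get a close-on-exec epoll descriptor. The first function opens a name relative to a directory descriptor with O_CLOEXEC and checks that FD_CLOEXEC is set; the second posts a value to an eventfd, waits for it via epoll, and reads the counter back.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Open `name` relative to directory `dirpath`, close-on-exec from birth.
 * Returns 1 if the new descriptor has FD_CLOEXEC set, 0 if not, -1 on error. */
int open_relative_cloexec(const char *dirpath, const char *name)
{
    int dirfd, fd, fdflags;

    dirfd = open(dirpath, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
    if (dirfd < 0)
        return -1;
    /* Resolved against dirfd, not against the current working directory. */
    fd = openat(dirfd, name, O_RDONLY | O_CLOEXEC);
    close(dirfd);
    if (fd < 0)
        return -1;
    fdflags = fcntl(fd, F_GETFD);
    close(fd);
    if (fdflags < 0)
        return -1;
    return (fdflags & FD_CLOEXEC) != 0;
}

/* Add `n` to a fresh eventfd counter, wait for readability with epoll,
 * and read the counter back. Returns the value read, or 0 on error. */
uint64_t eventfd_roundtrip(uint64_t n)
{
    struct epoll_event ev = { .events = EPOLLIN }, ready;
    uint64_t got = 0;
    int efd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (efd >= 0 && epfd >= 0) {
        ev.data.fd = efd;
        if (epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev) == 0 &&
            write(efd, &n, sizeof n) == (ssize_t)sizeof n &&
            epoll_wait(epfd, &ready, 1, 1000) == 1 &&
            ready.data.fd == efd) {
            /* A read returns the 8-byte counter and resets it to zero. */
            if (read(efd, &got, sizeof got) != (ssize_t)sizeof got)
                got = 0;
        }
    }
    if (epfd >= 0)
        close(epfd);
    if (efd >= 0)
        close(efd);
    return got;
}
```

The same epoll loop would serve a timerfd or signalfd descriptor unchanged, which is most of the appeal of the file descriptor model.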

Synchronization

  • Since 2.5.7, futexes have provided Linux's primary userspace locking primitive, and been at the heart of NPTL. Their API changed numerous times through 2.5's development.

Processes

  • prctl(2) was added to Linux 2.1.57 as an "ioctl(2) for processes". It has any number of capabilities (list them FIXME).
  • arch_prctl(2) addresses architecture-specific prctl(2)-like features. It needn't generally be used. glibc provides no prototype for arch_prctl(2).
    • On x86-64, it supports setting and retrieving the value of the FS and GS registers.

Execution resources

General details of cpu partitioning and affinity can be found on the cpuset page.

  • getcpu(2) was added in 2.6.19; the sched_getcpu(3) wrapper has been in glibc since 2.6. It identifies the current CPU and NUMA node of the calling thread (the answer might be stale by the time it is returned). Only one CPU and NUMA id can be reported, which might not make sense for all process models (especially the NUMA part). sched_getcpu(3) is equivalent to calling getcpu(&cpu, NULL, NULL) and returning cpu.
  • Since 2.5.8, sched_getaffinity(2) and sched_setaffinity(2) have provided affinity mask management within the process's cpuset. Glibc support was added in 2.3. In glibc 2.3.4's pthreads implementation, pthread_getaffinity_np(3) and pthread_setaffinity_np(3) were added as wrappers around these system calls.
  • Since 2.6.26, getrusage(2) with a RUSAGE_THREAD parameter retrieves statistics for the calling thread only.
  • modify_ldt(2), specific to the x86 architecture, allows the Local Descriptor Table to be modified.
  • set_thread_area(2) allows an area of memory to be designated thread-specific data. It was introduced in the 2.5.29 kernel.
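
The affinity and accounting calls above can be sketched roughly as follows. The function names are hypothetical, and the sketch assumes the machine has fewer CPUs than CPU_SETSIZE (1024 in glibc) so that a fixed-size cpu_set_t suffices. It pins the calling thread to whichever CPU it happens to be on, confirms with sched_getcpu(3), restores the original mask, and separately reads per-thread CPU time with RUSAGE_THREAD.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/resource.h>

/* Pin the calling thread to the CPU it is currently on, check that it
 * stays there, then restore the original mask.
 * Returns the CPU number, or -1 on error. */
int pin_to_current_cpu(void)
{
    cpu_set_t saved, one;
    int cpu, now;

    if (sched_getaffinity(0, sizeof saved, &saved) != 0)
        return -1;
    cpu = sched_getcpu();              /* may be stale as soon as it returns */
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    CPU_ZERO(&one);
    CPU_SET(cpu, &one);
    if (sched_setaffinity(0, sizeof one, &one) != 0)
        return -1;
    now = sched_getcpu();              /* with a one-CPU mask this is stable */
    sched_setaffinity(0, sizeof saved, &saved);
    return now == cpu ? cpu : -1;
}

/* User plus system CPU time consumed by the calling thread only,
 * in microseconds, or -1 on error. */
long long thread_cpu_usec(void)
{
    struct rusage ru;

    if (getrusage(RUSAGE_THREAD, &ru) != 0)
        return -1;
    return (ru.ru_utime.tv_sec + ru.ru_stime.tv_sec) * 1000000LL
         + ru.ru_utime.tv_usec + ru.ru_stime.tv_usec;
}
```

A pid of 0 in the sched_*affinity calls means the calling thread; pthread_setaffinity_np(3) would be the natural choice for pinning some other thread.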

clone(2)

  • clone(2) is far more granular with regard to what's copied and shared than fork(2).
    • ...

POSIX capabilities

  • CONFIG_SECURITY_CAPABILITIES must be set. Chris Friedhoff's POSIXFileCaps page is excellent.
  • Since 2.2.18, the prctl(2) system call accepts the PR_SET_KEEPCAPS option, allowing capabilities to be retained across a change that makes all of the effective, real and saved set-user-IDs non-zero when at least one was previously zero. Together with cap_set_proc(3), this lets a program that runs as root only because it needs some capability (say, CAP_NET_RAW) drop root privileges and most capabilities.
  • Since 2.6.24 or 2.6.19-rc5-mm2, CONFIG_SECURITY_FILE_CAPABILITIES enables association of POSIX capabilities with filenames via the setcap(1) tool.
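
As a rough illustration of the PR_SET_KEEPCAPS step, here is a sketch that turns the flag on and reads it back with PR_GET_KEEPCAPS. The function name enable_keepcaps is hypothetical. The actual privilege drop (setuid(2) followed by trimming the capability sets with cap_set_proc(3) from libcap) is deliberately left out, since it needs root and an extra library.

```c
#include <sys/prctl.h>

/* Enable the "keep capabilities" flag for the calling thread and read it
 * back. Returns 1 if the flag is set, 0 if not, -1 on error. */
int enable_keepcaps(void)
{
    if (prctl(PR_SET_KEEPCAPS, 1L, 0L, 0L, 0L) != 0)
        return -1;
    /* PR_GET_KEEPCAPS reports the flag as the function result. */
    return prctl(PR_GET_KEEPCAPS, 0L, 0L, 0L, 0L);
}
```

In a real daemon this call would sit immediately before the setuid(2) that gives up root, with the capability sets reduced to the one or two bits actually needed right afterwards.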

Signals

  • Since 2.1.57, a process can receive notification of its parent process's death using the prctl(2) system call with an option argument of PR_SET_PDEATHSIG. Any signal can be overloaded for such notifications; supply the chosen signal as arg2 (or 0 to cancel parent process death notification). PR_GET_PDEATHSIG can be used to obtain the current value since 2.3.15.
  • Glibc's old pthreads library, LinuxThreads, has some major inconsistencies signal-wise with the pthreads standard. Check the docs.
  • Glibc's new pthreads library, NPTL, remedies most of these inconsistencies, but check the docs.
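
The parent-death notification described above might be requested like this. It's a sketch only: the function name set_parent_death_signal is made up, and a real program would normally call it in the child right after fork(2).

```c
#include <signal.h>
#include <sys/prctl.h>

/* Ask the kernel to deliver `sig` to the caller when its parent dies
 * (0 cancels the request). Returns the setting the kernel reports back,
 * or -1 on error. */
int set_parent_death_signal(int sig)
{
    int current = -1;

    if (prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0L, 0L, 0L) != 0)
        return -1;
    /* PR_GET_PDEATHSIG stores the current value through arg2. */
    if (prctl(PR_GET_PDEATHSIG, &current, 0L, 0L, 0L) != 0)
        return -1;
    return current;
}
```

Since the parent might already have exited by the time the child makes this call, checking getppid(2) afterwards is a sensible precaution.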

Monitoring

dnotify

  • Dnotify is deprecated and terrible. Eschew it! (see the fcntl(2) man page, F_NOTIFY)

inotify

  • Sexy, sexy inotify(7) has replaced dnotify as of 2.6.13 (glibc 2.4). (Here's a useful FAQ).
  • inotify_init(void), since 2.6.13, creates an inotify file descriptor.
  • inotify_init1(int flags), since 2.6.27, creates an inotify file descriptor. Pass IN_NONBLOCK for a nonblocking descriptor, and IN_CLOEXEC for a close-on-exec descriptor.
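
A small sketch of inotify_init1(2) in use, assuming /tmp is writable. The function name, the temporary directory template and the "probe" filename are all invented for the example. It watches a fresh directory for IN_CREATE, creates a file there, and checks that the event shows up on the (nonblocking, close-on-exec) descriptor.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <limits.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Returns 1 if an IN_CREATE event was observed, 0 if not, -1 on error. */
int inotify_sees_create(void)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    char path[sizeof dir + 8];
    /* Room for one event carrying a maximum-length name, suitably aligned. */
    char buf[sizeof(struct inotify_event) + NAME_MAX + 1]
        __attribute__((aligned(__alignof__(struct inotify_event))));
    int seen = -1, ifd, fd;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/probe", dir);

    ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
    if (ifd >= 0 && inotify_add_watch(ifd, dir, IN_CREATE) >= 0) {
        fd = open(path, O_CREAT | O_WRONLY | O_CLOEXEC, 0600);
        if (fd >= 0) {
            struct pollfd pfd = { .fd = ifd, .events = POLLIN };

            close(fd);
            seen = 0;
            /* The descriptor is nonblocking, so wait with poll(2) first. */
            if (poll(&pfd, 1, 1000) == 1) {
                ssize_t n = read(ifd, buf, sizeof buf);

                if (n >= (ssize_t)sizeof(struct inotify_event)) {
                    const struct inotify_event *ev =
                        (const struct inotify_event *)buf;
                    seen = (ev->mask & IN_CREATE) != 0;
                }
            }
        }
    }
    if (ifd >= 0)
        close(ifd);
    unlink(path);
    rmdir(dir);
    return seen;
}
```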

fanotify

Merged in 2.6.36.

Networking

See below for ethtool (SIOCETHTOOL) coverage.

  • recvmmsg(2), added in 2.6.33 and glibc 2.12, allows multiple messages to be received from a socket, with a timeout, using a single system call.
    • The new flag MSG_WAITFORONE enables MSG_DONTWAIT following receipt of the first message.
  • sendmmsg(2), added in 3.0 and glibc 2.14, allows multiple messages to be sent on a socket using a single system call.
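
To make the batching concrete, here is a sketch that pushes two datagrams through an AF_UNIX socketpair with one sendmmsg(2) and collects them with one recvmmsg(2). The function name batch_roundtrip and the payloads are illustrative only, and no timeout is supplied (the last argument is NULL).

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Returns the number of datagrams received in one recvmmsg(2) call,
 * or -1 on error. */
int batch_roundtrip(void)
{
    int sv[2], i, received = -1;
    char out0[] = "one", out1[] = "two";
    char in[4][16];
    struct iovec oiov[2] = { { out0, sizeof out0 }, { out1, sizeof out1 } };
    struct iovec iiov[4];
    struct mmsghdr omsg[2], imsg[4];

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) != 0)
        return -1;

    memset(omsg, 0, sizeof omsg);
    for (i = 0; i < 2; i++) {
        omsg[i].msg_hdr.msg_iov = &oiov[i];
        omsg[i].msg_hdr.msg_iovlen = 1;
    }
    memset(imsg, 0, sizeof imsg);
    for (i = 0; i < 4; i++) {
        iiov[i].iov_base = in[i];
        iiov[i].iov_len = sizeof in[i];
        imsg[i].msg_hdr.msg_iov = &iiov[i];
        imsg[i].msg_hdr.msg_iovlen = 1;
    }

    /* Both datagrams go out in one system call. */
    if (sendmmsg(sv[0], omsg, 2, 0) == 2) {
        /* Block for the first message only; after that behave as if
         * MSG_DONTWAIT were set, so we return with whatever is queued. */
        received = recvmmsg(sv[1], imsg, 4, MSG_WAITFORONE, NULL);
    }
    close(sv[0]);
    close(sv[1]);
    return received;
}
```

After the call, each imsg[i].msg_len holds the byte count of the corresponding datagram.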

Netlink

  • Kernel 2.1 introduced the Linux netlink system and the PF_NETLINK socket(2) protocol family.

Socket Options

SOL_SOCKET

  • SO_DOMAIN (since 2.6.32) retrieves the socket domain as an integer (e.g. AF_INET, AF_INET6). This is a read-only sockopt.
  • SO_PROTOCOL (since 2.6.32) retrieves the socket protocol as an integer (e.g. IPPROTO_TCP, IPPROTO_SCTP). This is a read-only sockopt.
  • SO_RCVBUFFORCE (since 2.6.14) allows processes with the CAP_NET_ADMIN capability to perform a SO_RCVBUF operation which overrides the rmem_max proc limit.
    • Likewise, SO_SNDBUFFORCE (also since 2.6.14) allows SO_SNDBUF to override the wmem_max proc limit.

IPPROTO_IP

  • IP_MTU_DISCOVER modifies Path MTU discovery for the associated socket descriptor.
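
One way IP_MTU_DISCOVER might be used, sketched under the assumption that the caller wants the Don't Fragment bit always set: IP_PMTUDISC_DO is one of the modes listed in ip(7) and is not discussed above, and force_pmtu_discovery is a made-up name.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Request that the Don't Fragment bit always be set on this socket's
 * packets. Returns the mode the kernel reports back, or -1 on error. */
int force_pmtu_discovery(int fd)
{
    int mode = IP_PMTUDISC_DO;
    socklen_t len = sizeof mode;

    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, sizeof mode) != 0)
        return -1;
    mode = -1;
    if (getsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &mode, &len) != 0)
        return -1;
    return mode;
}
```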

IPPROTO_TCP

  • TCP_CORK, introduced during Linux 2.2, suppresses emission of packets smaller than the MSS (through a 200ms ceiling) by coalescing writes to the socket. This can be a slight hit to latency (up through the ceiling), but can be very useful for throughput-oriented services. Clearing the flag results in queued data immediately being sent. Compare with FreeBSD's TCP_NOPUSH.
    • Only since Linux 2.5.71 can TCP_CORK be combined with TCP_NODELAY.
  • TCP_DEFER_ACCEPT, introduced during Linux 2.4, prevents listen(2)ing sockets from appearing ready, and accept(2) from passing back descriptors, until data has been received into socket memory. Compare with FreeBSD's SO_ACCEPTFILTER.
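
Both TCP options above are plain integer sockopts at the IPPROTO_TCP level. A sketch with hypothetical helper names; per tcp(7), the TCP_DEFER_ACCEPT value is a number of seconds:

```c
#define _GNU_SOURCE
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Set (on != 0) or clear (on == 0) TCP_CORK. Clearing flushes whatever
 * has been queued. Returns 0 on success, -1 on error. */
int tcp_set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Keep a listening socket from reporting a connection as ready until
 * data has arrived, bounded by roughly `seconds`.
 * Returns 0 on success, -1 on error. */
int tcp_defer_accept(int listen_fd, int seconds)
{
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof seconds);
}
```

The usual corking pattern is tcp_set_cork(fd, 1), a run of small write(2) calls (say, a header followed by a body), then tcp_set_cork(fd, 0) to push the coalesced segments out.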

ICMP_FILTER

Used only with SOCK_RAW sockets bound to the IPPROTO_ICMP protocol. The value is a bitmask of ICMP types to filter out.

Memory

  • mremap(2) allows an existing memory map (resulting from a successful mmap(2) call) to be shrunk or expanded. In combination with MAP_ANONYMOUS, this provides the base for a high-speed realloc(3) implementation. By default, the map will not be moved (and an error will be returned if such a move would be necessary); by supplying the MREMAP_MAYMOVE flag, this behavior can be changed (pointers into the buffer might be invalidated by such a call). MREMAP_FIXED causes mremap(2) to accept a fifth argument, specifying an address to which the map must be moved. MREMAP_MAYMOVE must be supplied with MREMAP_FIXED, according to the Linux man pages version 3.07, even if the target destination is the same as the source.
  • remap_file_pages(2), present since Linux 2.5.46 and glibc 2.3.3, allows the pages of a VMA to be permuted (the actual VMA cannot be shrunk or enlarged, as in mremap(2)).
  • madvise(2) accepts several options beyond those specified by POSIX.1b; in addition to the arguments specified by POSIX.1-2001 for posix_madvise(3), Linux since version 2.6.16 supports MADV_REMOVE, MADV_DONTFORK, and MADV_DOFORK. MADV_REMOVE allows reclamation of unused pages within a sparse mapping, similar to FreeBSD's MADV_FREE. The other two disable the default sharing of maps with child processes across a fork(2), and re-enable it, respectively.
  • The hugetlbfs file system supports reduction of mapping granularity in the VM. It's used by (among other applications) MySQL and kvm. More details are available at Pages.
    • The *_largepages(2)/*_hugepages(2) calls were present only in Linux 2.5.36-2.5.54.
  • process_vm_readv(2) allows a process to directly read from another process's address space, while process_vm_writev(2) allows one process to write into another's. Both were introduced in 3.2 (with glibc support in 2.15), and require the CONFIG_CROSS_MEMORY_ATTACH kernel option.
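
The realloc(3)-style use of mremap(2) mentioned above can be sketched as a pair of helpers. The names map_alloc and map_resize are hypothetical, and MREMAP_MAYMOVE is passed so that growth succeeds even when the mapping has to be relocated (which means old pointers into it must not be reused).

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Allocate `len` bytes of zeroed anonymous memory, or NULL on failure. */
void *map_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    return p == MAP_FAILED ? NULL : p;
}

/* Grow or shrink a mapping obtained from map_alloc(). Contents up to the
 * smaller of the two lengths are preserved. Returns the (possibly moved)
 * mapping, or NULL on failure, in which case the old mapping is intact. */
void *map_resize(void *old, size_t old_len, size_t new_len)
{
    void *p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);

    return p == MAP_FAILED ? NULL : p;
}
```

Unlike a malloc-backed realloc, a move here is done by remapping pages rather than copying bytes, which is where the speed comes from on large buffers.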

Devices

Ethtool

  • The SIOCETHTOOL ioctl supports low-level operations on supported networking devices. It exchanges a struct ifreq whose ifr_data field points to some ethtool struct corresponding to a provided subcommand.

See Also