Linux APIs

From dankwiki
Revision as of 03:48, 9 January 2010 by Dank (talk | contribs) (→‎Processes)

The Linux kernel is a protean, rapidly changing thing. This is reflected even in "stable" APIs, especially when additions are strictly augmentative with regard to standards. Readers of classics like APIUE would do well to keep up with their LKML, or at least the man pages. I'm documenting departures from these standards as I come across them in various man pages and source code. The kernel man pages can be browsed at http://www.kernel.org/doc/man-pages/online_pages.html. This page serves as a companion to "FreeBSD APIs".

File descriptors

  • Since 2.6.23, the open(2) system call accepts the O_CLOEXEC flag (recvmsg(2) likewise accepts the corresponding MSG_CMSG_CLOEXEC flag). This atomically sets the close-on-exec flag when the descriptor is installed in the dtable, protecting against a race condition arising from fork(2)+exec(2) calls in other threads.
    • Since 2.6.24, fcntl(2) implements a F_DUPFD_CLOEXEC operation, mating O_CLOEXEC to F_DUPFD.
  • epoll(7) was introduced in 2.5.44; it is in some ways similar to FreeBSD's kqueue or Solaris's /dev/poll (one major advantage over poll(2) is that the pollfd structures need not be copied from user- to kernel-space with each invocation).
    • FreeBSD aims to emulate epoll(7), but they don't yet seem to have done so...
  • Since 2.6.25 and 2.6.22 respectively, the timerfd and signalfd system call families exist, supporting timer- and signal-based operations via the file descriptor model (they are fully supported by poll(2), epoll(7), etc). These roughly correspond to the EVFILT_TIMER and EVFILT_SIGNAL event filters in kqueue. Since 2.6.27, timerfd_create(2) supports TFD_CLOEXEC and TFD_NONBLOCK bits in the flags parameter, and signalfd(2) supports a corresponding SFD_CLOEXEC/SFD_NONBLOCK pair.
    • In the course of 2.6.26 development, signalfd(2) saw an API change; the third parameter went from a (frankly baffling) masksize to an integer flags. See the LKML thread "signalfd API issues", and this LWN article.
  • The obsolete BSD implementation of asynchronous I/O is extended via the F_GETSIG and F_SETSIG subcommands to fcntl(2).
  • Since 2.6.22, the eventfd(2) system call returns a file descriptor that can be used for userspace-to-userspace and kernel-to-userspace event notification. It is associated with an 8-byte kernelspace counter, which the first parameter initializes. Since Linux 2.6.27, the eventfd2(2) system call also accepts a flags parameter, specified using EFD_NONBLOCK and EFD_CLOEXEC. Support for eventfd(2) was added in glibc 2.8, and transparent support for eventfd2(2) was added in glibc 2.9.
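
The descriptor-based notification calls above compose naturally with epoll(7). What follows is only a minimal sketch, assuming glibc 2.9 or later and a 2.6.27+ kernel; the function name fd_event_demo and the helper watch are made up for illustration, and epoll_create1(2) with EPOLL_CLOEXEC is used even though it is not discussed above. It posts two values to an eventfd, arms a one-shot timerfd, and collects both through one epoll set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <sys/timerfd.h>
#include <time.h>
#include <unistd.h>

/* Register fd with the epoll set for readability; returns 0 on success. */
static int watch(int epfd, int fd)
{
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };
    return epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);
}

/* Hypothetical demo: posts 3 and then 4 to an eventfd, arms a one-shot
 * 10ms timerfd, and waits for both through a single epoll set. Stores the
 * eventfd counter and the timer expiration count; returns 0 on success. */
int fd_event_demo(uint64_t *counter, uint64_t *expirations)
{
    struct itimerspec its = { .it_value = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 } };
    struct epoll_event evs[2];
    int pending = 2, ret = -1;
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    int tfd = timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC | TFD_NONBLOCK);
    int epfd = epoll_create1(EPOLL_CLOEXEC);

    if (efd < 0 || tfd < 0 || epfd < 0)
        goto out;
    /* close-on-exec was requested at creation, so no fcntl(F_SETFD) race */
    assert(fcntl(efd, F_GETFD) & FD_CLOEXEC);
    if (watch(epfd, efd) != 0 || watch(epfd, tfd) != 0)
        goto out;
    /* writes add to the 8-byte kernel counter: 0 + 3 + 4 */
    if (eventfd_write(efd, 3) != 0 || eventfd_write(efd, 4) != 0)
        goto out;
    if (timerfd_settime(tfd, 0, &its, NULL) != 0)
        goto out;
    while (pending > 0) {
        int n = epoll_wait(epfd, evs, 2, 1000);
        if (n <= 0)
            goto out; /* error, or nothing within a second */
        for (int i = 0; i < n; i++) {
            int fd = evs[i].data.fd;
            uint64_t *dst = (fd == efd) ? counter : expirations;
            /* reading either descriptor yields 8 bytes and resets it */
            if (read(fd, dst, sizeof *dst) != (ssize_t)sizeof *dst)
                goto out;
            pending--;
        }
    }
    ret = 0;
out:
    if (efd >= 0) close(efd);
    if (tfd >= 0) close(tfd);
    if (epfd >= 0) close(epfd);
    return ret;
}
```

A caller passes two uint64_t variables and checks for a zero return. A signalfd(2) descriptor could be added to the same set in the same way.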

Synchronization

  • Since 2.5.7, futexes have provided Linux's primary userspace locking primitive, and been at the heart of NPTL. Their API changed numerous times through 2.5's development.

Processes

Execution resources

General details of cpu partitioning and affinity can be found on the cpusets page.

  • getcpu(2) was added in 2.6.19. It identifies the current CPU and NUMA node of the thread (this might be immediately invalidated). Only one CPU and one NUMA node ID can be reported, which might not make sense for all process models (especially the NUMA part). sched_getcpu(3) is equivalent to calling getcpu(&aid,NULL,NULL).
  • Since 2.5.8, sched_getaffinity(2) and sched_setaffinity(2) have provided affinity mask management within the process's cpuset.
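
A short sketch of the calls above, assuming glibc's sched_getcpu(3) wrapper and the fixed-size cpu_set_t (which covers 1024 CPUs; larger machines would need the CPU_ALLOC variants). The names affinity_summary and pin_to_cpu are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Returns the number of CPUs in the calling thread's affinity mask, or -1
 * on error. *current receives the CPU the thread was running on when it
 * was sampled; the scheduler may migrate it immediately afterwards. */
int affinity_summary(int *current)
{
    cpu_set_t mask;

    assert(current != NULL);
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof mask, &mask) != 0)
        return -1;
    *current = sched_getcpu();
    if (*current < 0)
        return -1;
    return CPU_COUNT(&mask);
}

/* Restricts the calling thread to a single CPU, which must lie within the
 * process's cpuset. Returns 0 on success, -1 on error. */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof mask, &mask);
}
```

Passing 0 as the pid addresses the calling thread in both affinity calls.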

clone(2)

  • clone(2) is far more granular with regard to what's copied and shared than fork(2).
    • ...

POSIX capabilities

  • CONFIG_SECURITY_CAPABILITIES must be set. This page is excellent: Chris Friedhoff's POSIXFileCaps page.
  • Since 2.2.18, the prctl(2) system call accepts the PR_SET_KEEPCAPS flag, allowing capabilities to be maintained across an event causing all of the effective, real and saved set-user-IDs to become non-zero, when at least one was previously zero. This can be used together with cap_set_proc(3) by a program that runs as root only because it needs some capability (say, CAP_NET_RAW), letting it drop root privileges and most capabilities.
  • Since 2.6.24 or 2.6.19-rc5-mm2, CONFIG_SECURITY_FILE_CAPABILITIES enables association of POSIX capabilities with filenames via the setcap(1) tool.
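
The PR_SET_KEEPCAPS sequence described above might look roughly like the following. This is a sketch under assumptions, not a vetted privilege-dropping routine: it swaps libcap's cap_set_proc for the raw capset(2) system call so that nothing beyond libc is needed, the names enable_keepcaps and drop_root_keeping are hypothetical, and a real program would also want to clear supplementary groups with setgroups(2) before changing IDs.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <linux/capability.h>
#include <string.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1: ask the kernel to retain the permitted capability set across a
 * transition to all-nonzero UIDs. Returns 0 on success. */
int enable_keepcaps(void)
{
    return prctl(PR_SET_KEEPCAPS, 1L, 0L, 0L, 0L);
}

/* Steps 2 and 3 (hypothetical helper, meant to be called as root): switch
 * to uid/gid, then shrink the permitted and effective sets to `cap` alone.
 * Uses raw capset(2) rather than libcap. Returns 0 on success. */
int drop_root_keeping(int cap, uid_t uid, gid_t gid)
{
    struct __user_cap_header_struct hdr;
    struct __user_cap_data_struct data[_LINUX_CAPABILITY_U32S_3];

    assert(cap >= 0 && cap < 32 * _LINUX_CAPABILITY_U32S_3);
    if (enable_keepcaps() != 0)
        return -1;
    if (setresgid(gid, gid, gid) != 0 || setresuid(uid, uid, uid) != 0)
        return -1;
    /* The UID change cleared the effective set but, thanks to keepcaps,
     * left the permitted set intact. Re-raise only what is needed. */
    memset(&hdr, 0, sizeof hdr);
    hdr.version = _LINUX_CAPABILITY_VERSION_3;
    hdr.pid = 0; /* the calling thread */
    memset(data, 0, sizeof data);
    data[cap / 32].permitted = 1u << (cap % 32);
    data[cap / 32].effective = 1u << (cap % 32);
    return (int)syscall(SYS_capset, &hdr, data);
}
```

A raw-socket daemon started as root might call drop_root_keeping(CAP_NET_RAW, uid, gid) once its configuration has been read.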

Signals

  • Since 2.1.57, a process can receive notification of its parent process's death using the prctl(2) system call with an option argument of PR_SET_PDEATHSIG. Any signal can be overloaded for such notifications; supply the chosen signal as arg2 (or 0 to cancel parent process death notification). PR_GET_PDEATHSIG can be used to obtain the current value since 2.3.15.
  • Glibc's old pthreads library, LinuxThreads, has some major inconsistencies signal-wise with the pthreads standard. Check the docs.
  • Glibc's new pthreads library, NPTL, remedies most of these inconsistencies, but check the docs.
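
Parent-death notification via prctl(2) takes only a few lines. In this sketch the wrapper names are hypothetical; the constants are spelled PR_SET_PDEATHSIG and PR_GET_PDEATHSIG in <sys/prctl.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>

/* Request delivery of `sig` when the parent dies; 0 cancels the request.
 * Returns 0 on success, -1 on error. */
int set_parent_death_signal(int sig)
{
    assert(sig >= 0);
    return prctl(PR_SET_PDEATHSIG, (unsigned long)sig, 0L, 0L, 0L);
}

/* Returns the currently configured signal (0 if none), or -1 on error. */
int get_parent_death_signal(void)
{
    int sig = 0;

    if (prctl(PR_GET_PDEATHSIG, &sig, 0L, 0L, 0L) != 0)
        return -1;
    return sig;
}
```

A child would typically call set_parent_death_signal(SIGTERM) right after fork(2) and then compare getppid(2) against the expected parent, since the parent may already have exited before the request was registered.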

Monitoring

  • Dnotify is deprecated and terrible. Eschew it! (see the fcntl(2) man page, F_NOTIFY)
  • Sexy, sexy inotify(7) has replaced it as of 2.6.13 (glibc 2.4). (Here's a useful FAQ).
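
For a taste of inotify(7), here is a minimal sketch that watches a scratch directory and checks that creating a file produces an IN_CREATE event. The function name sees_create and the /tmp location are assumptions for illustration, and inotify_init1(2) (2.6.27 and later) is used so the descriptor can be created non-blocking and close-on-exec in one step.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

/* Watches a fresh temporary directory, creates the file `name` inside it,
 * and returns 1 if inotify reported IN_CREATE for that name, 0 if it did
 * not, or -1 on error. */
int sees_create(const char *name)
{
    char dir[] = "/tmp/inotify-demo-XXXXXX";
    char path[PATH_MAX];
    char buf[4096] __attribute__((aligned(__alignof__(struct inotify_event))));
    int ret = -1, ifd = -1, fd = -1;
    ssize_t len;

    assert(name != NULL);
    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/%s", dir, name);
    ifd = inotify_init1(IN_NONBLOCK | IN_CLOEXEC);
    if (ifd < 0 || inotify_add_watch(ifd, dir, IN_CREATE) < 0)
        goto out;
    fd = open(path, O_CREAT | O_WRONLY | O_CLOEXEC, 0600);
    if (fd < 0)
        goto out;
    /* the event is queued by the time open(2) returns */
    len = read(ifd, buf, sizeof buf);
    if (len <= 0)
        goto out;
    ret = 0;
    /* events are variable-length: a fixed header followed by ev->len bytes */
    for (char *p = buf; p < buf + len; ) {
        const struct inotify_event *ev = (const struct inotify_event *)p;
        if ((ev->mask & IN_CREATE) && ev->len > 0 && strcmp(ev->name, name) == 0)
            ret = 1;
        p += sizeof *ev + ev->len;
    }
out:
    if (fd >= 0) {
        close(fd);
        unlink(path);
    }
    if (ifd >= 0)
        close(ifd);
    rmdir(dir);
    return ret;
}
```

Because the inotify descriptor is an ordinary file descriptor, it can also be handed to poll(2) or epoll(7) instead of being read directly.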

Netlink

  • Kernel 2.1 introduced the Linux netlink system and the PF_NETLINK socket(2) protocol family.

Socket Options

IPPROTO_IP

  • IP_MTU_DISCOVER modifies Path MTU discovery for the associated socket descriptor.

IPPROTO_TCP

  • TCP_CORK, introduced during Linux 2.2, suppresses emission of packets smaller than the MSS (through a 200ms ceiling) by coalescing writes to the socket. This can be a slight hit to latency (up through the ceiling), but can be very useful for throughput-oriented services. Clearing the flag results in queued data immediately being sent. Compare with FreeBSD's TCP_NOPUSH.
    • Only since Linux 2.5.71 can TCP_CORK be combined with TCP_NODELAY.
  • TCP_DEFER_ACCEPT, introduced during Linux 2.4, prevents listen(2)ing sockets from appearing ready, and accept(2) from passing back descriptors, until data has been received into socket memory. Compare with FreeBSD's SO_ACCEPTFILTER.
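
Both TCP options above are plain setsockopt(2) calls. The sketch below uses hypothetical helper names; the TCP_DEFER_ACCEPT value is a timeout in seconds, per tcp(7). send_corked illustrates the usual pattern of corking, issuing several writes, and uncorking to flush.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on != 0) or clear (on == 0) TCP_CORK. Returns 0 on success. */
int tcp_set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Make a listening socket report connections only once data has arrived,
 * waiting at most roughly `seconds`. Returns 0 on success. */
int tcp_defer_accept(int listen_fd, int seconds)
{
    return setsockopt(listen_fd, IPPROTO_TCP, TCP_DEFER_ACCEPT,
                      &seconds, sizeof seconds);
}

/* Hypothetical helper: send a header and a body on a connected socket so
 * that they are coalesced into as few segments as possible. */
int send_corked(int fd, const void *hdr, size_t hlen,
                const void *body, size_t blen)
{
    int ok;

    assert(hdr != NULL && body != NULL);
    if (tcp_set_cork(fd, 1) != 0)
        return -1;
    ok = write(fd, hdr, hlen) == (ssize_t)hlen &&
         write(fd, body, blen) == (ssize_t)blen;
    /* clearing the flag flushes whatever is queued, even after a failure */
    if (tcp_set_cork(fd, 0) != 0)
        return -1;
    return ok ? 0 : -1;
}
```

A server would call tcp_defer_accept on its listening socket before entering its accept(2) loop.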

Memory

  • mremap(2) allows an existing memory map (resulting from a successful mmap(2) call) to be shrunk or expanded. In combination with MAP_ANONYMOUS, this provides the base for a high-speed realloc(3) implementation. By default, the map will not be moved (and an error will be returned if such a move would be necessary); by supplying the MREMAP_MAYMOVE flag, this behavior can be changed (pointers into the buffer might be invalidated by such a call). MREMAP_FIXED causes mremap(2) to accept a fifth argument, specifying an address to which the map must be moved. MREMAP_MAYMOVE must be supplied with MREMAP_FIXED, according to the Linux man pages version 3.07, even if the target destination is the same as the source.
  • remap_file_pages(2), present since Linux 2.5.46 and glibc 2.3.3, allows the pages of a VMA to be permuted (the actual VMA cannot be shrunk or enlarged, as in mremap(2)).
  • madvise(2) accepts several options beyond those specified by POSIX.1b; in addition to those arguments specified by POSIX.1-2001 for posix_madvise(3), Linux since version 2.6.16 supports MADV_REMOVE, MADV_DONTFORK, and MADV_DOFORK. MADV_REMOVE allows reclamation of unused pages within a sparse mapping, similar to FreeBSD's MADV_FREE. The other two disable the default sharing of maps with child processes across a fork(2), and re-enable it, respectively.
  • The hugetlbfs file system supports reduction of mapping granularity in the VM. It's used by (among other applications) MySQL and kvm. More details are available at Pages.
    • The *_largepages(2)/*_hugepages(2) calls were present only in Linux 2.5.36-2.5.54.
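
The mremap(2)-as-realloc(3) idea above can be sketched in a few lines. This is only an illustration: the anon_* names are made up, sizes are assumed to be multiples of the page size, and no allocator bookkeeping is attempted. A MADV_DONTFORK wrapper is included to show one of the Linux-specific madvise(2) options.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map `len` bytes of zeroed anonymous memory; NULL on failure. */
void *anon_alloc(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Grow or shrink a mapping, letting the kernel move it if it cannot be
 * resized in place. Returns the (possibly new) address, or NULL on failure,
 * in which case the old mapping is left intact. Pointers into the old range
 * must be treated as invalid after a successful call. */
void *anon_realloc(void *old, size_t old_len, size_t new_len)
{
    void *p;

    assert(old != NULL);
    p = mremap(old, old_len, new_len, MREMAP_MAYMOVE);
    return p == MAP_FAILED ? NULL : p;
}

/* Keep the range out of children created by later fork(2) calls. */
int anon_dontfork(void *p, size_t len)
{
    return madvise(p, len, MADV_DONTFORK);
}
```

Since the kernel moves page table entries rather than copying data, growing a large buffer this way avoids the memcpy that a malloc-and-copy realloc would incur.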

See Also