Linux APIs

From dankwiki
Revision as of 16:16, 23 December 2008 by Dank (talk | contribs) (→‎File descriptors)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)

The Linux kernel is a protean, rapidly-changing thing. This is reflected even in "stable" APIs, especially when additions are strictly augmentative with regards to standards. Readers of classics like APIUE would do well to keep up with their LKML, or at least man pages. I'm documenting departures from these standards as I come across them in various man pages and source code.

File descriptors

  • Since 2.6.23, the open(2) system call accepts the O_CLOEXEC flag (as does recvmsg(2) and the corresponding MSG_CMSG_CLOEXEC flag). This atomically sets the close-on-exec flag upon update of the dtable, protecting against a race condition arising from fork(2)+exec(2) calls in other threads.
    • Since 2.6.24, fcntl(2) implements a F_DUPFD_CLOEXEC operation, mating O_CLOEXEC to F_DUPFD.
  • epoll(7) was introduced in 2.5.44; it is in ways similar to FreeBSD's kqueue or Solaris's /dev/poll (one major advantage over poll(2) is that the pollfd structures need not be copied from user- to kernel-space with each invocation).
    • FreeBSD aims to emulate epoll(7), but they don't yet seem to have done so...
  • Since 2.6.25 and 2.6.23 respectively, timerfd and signalfd system call families exist, supporting timer and signal-based operations via the file descriptor model (they are fully supported by poll(2), epoll(7), etc). These roughly correspond to EVFILT_TIMER and EVFILT_SIG event filters in kqueue. Since 2.6.27, timerfd_create(2) supports TFD_CLOEXEC and TFD_NONBLOCK bits in the flags parameter, and signalfd(2) supports a corresponding SFD_CLOEXEC/SFD_NONBLOCK pair.
    • In the course of 2.6.26 development, signalfd(2) saw an API change; the third parameter went from a (frankly baffling) masksize to an integer flags. See the LKML thread "signalfd API issues", and this LWN article.

Processes

clone(2)

  • clone(2) is far more granular with regards to what's copied and shared that fork(2).
    • ...

POSIX capabilities

  • CONFIG_SECURITY_CAPABILITIES must be set. This page is excellent: Chris Friedhoff's POSIXFileCaps page.
  • Since 2.2.18, the prctl(2) system call accepts the PR_SET_KEEPCAPS flag, allowing capabilities to be maintained across an event causing all of effective, real and saved-set-user UIDs to become non-zero, when at least one was previously zero. This can be used together with cap_set_proc(2) for a program run as root due to need for some capability (say, CAP_NET_RAW) to drop root privileges and most capabilities.
  • Since 2.6.24 or 2.6.19-rc5-mm2, CONFIG_SECURITY_FILE_CAPABILITIES enables association of POSIX capabilities with filenames via the setcap(1) tool.

Signals

  • Since 2.1.57, a process can receive notification of its parent process's death using the prctl(2) system call with an option argument of PR_SET_DEATHSIG. Any signal can be overloaded for such notifications; supply the chosen signal as arg2 (or 0 to cancel parent process death notification). PR_GET_DEATHSIG can be used to obtain the current value since 2.3.15.
  • Glibc's old pthreads library, LinuxThreads, has some major inconsistencies signal-wise with the pthreads standard. Check the docs.
  • Glibc's new pthreads library, NPTL, remedies most of these inconsistencies, but check the docs.

Monitoring

  • Dnotify is deprecated and terrible. Eschew it! (see the fcntl(2) man page, F_NOTIFY)
  • Sexy, sexy inotify(7) has replaced it as of 2.6.13 (glibc 2.4). (Here's a useful FAQ).

Netlink

  • Kernel 2.1 introduced the Linux netlink system and the PF_NETLINK socket(2) protocol family.

Socket Options

IPPROTO_IP

  • IP_MTU_DISCOVER modifies Path MTU discovery for the associated socket descriptor.

IPPROTO_TCP

  • TCP_CORK, introduced during Linux 2.2, suppresses emission of packets smaller than the MSS (through a 200ms ceiling) by coalescing writes to the socket. This can be a slight hit to latency (up through the ceiling), but can be very useful for throughput-oriented services. Clearing the flag results in queued data immediately being sent. Compare with FreeBSD's TCP_NOPUSH.
    • Only since Linux 2.5.71 can TCP_CORK be combined with TCP_NODELAY.
  • TCP_DEFER_ACCEPT, introduced during Linux 2.4, prevents listen(2)ing sockets from appearing ready, and accept(2) from passing back descriptors, until data has been received into socket memory. Compare with FreeBSD's SO_ACCEPTFILTER.

Memory

  • mremap(2) allows an existing memory map (resulting from a successful mmap(2) call) to be shrunk or expanded. In combination with MAP_ANONYMOUS, this provides the base for a high-speed realloc(3) implementation. By default, the map will not be moved (and an error will be returned if such a move would be necessary); by supplying the MREMAP_MAYMOVE flag, this behavior can be changed (pointers into the buffer might be invalidated by such a call). MREMAP_FIXED causes mremap(2) to accept a fifth argument, specifying an address to which the map must be moved. MREMAP_MAYMOVE must be supplied with MREMAP_FIXED, according to the Linux man pages version 3.07, even if the target destination is the same as the source.
  • madvise(2) accepts several options beyond those specified by POSIX.1b.; in addition to those arguments specified by POSIX.1-2001 for posix_madvise(2), Linux since version 2.6.16 supports MADV_REMOVE, MADV_DONTFORK, and MADV_DOFORK. MADV_REMOVE allows reclamation of unused pages within a sparse mapping, similar to FreeBSD's MADV_FREE. The other disable the default sharing of maps with child processes across a fork(2), and reenable it, respectively.