From dankwiki
Revision as of 23:53, 9 July 2010 by Dank (talk | contribs)
CUDA'd-up beyond all repair

CUDA (and General-Purpose Graphics Processing Unit programming in general) is rapidly becoming a mainstay of high-performance computing. As CUDA and OpenCL move off of the workstation, and into the server -- off of the console, and into the cluster -- the security of these systems will become a critical part of the associated trusted computing base. Even ignoring the issue of multiuser security, the properties of isolation and (to a lesser extent) confidentiality are important for debugging, profiling and reproducibility. I've authored the CUBAR set of tools to investigate the security properties of CUDA on NVIDIA hardware since the G80 architecture -- primarily the means and parameters of memory protection, and the division of protection between soft- and hardware. This information was essential to the attempt to implement an open source CUDA software stack (see libcudest).

See our 2010 paper, "My Other Computer is Your GPU."


Memory details

  • What address translations, if any, are performed?
    • If address translation is performed, can physical memory be multiply aliased?
  • How are accesses affected by use of incorrect state space affixes?
    • Compute Capability 2.0 introduces unified addressing, but still supports modal addressing
  • How do physical addresses correspond to distinct memory regions?

From Lindholm et al.'s "NVIDIA Tesla: A Unified Graphics and Computing Architecture":

The DRAM memory data bus width is 384 pins, arranged in six independent
partitions of 64 pins each. Each partition owns 1/6 of the physical address
space. The memory partition units directly enqueue requests. They arbitrate
among hundreds of in-flight requests from the parallel stages of the graphics
and computation pipelines. The arbitration seeks to maximize total DRAM
transfer efficiency, which favors grouping related requests by DRAM bank and
read/write direction, while minimizing latency as far as possible. The memory
controllers support a wide range of DRAM clock rates, protocols, device
densities, and data bus widths.

A single hub unit routes requests to the appropriate partition from the
nonparallel requesters (PCI-Express, host and command front end, input
assembler, and display). Each memory partition has its own depth and color ROP
units, so ROP memory traffic originates locally. Texture and load/store
requests, however, can occur between any TPC and any memory partition, so an
interconnection network routes requests and responses.

All processing engines generate addresses in a virtual address space. A memory
management unit performs virtual to physical translation. Hardware reads the
page tables from local memory to respond to misses on behalf of a hierarchy of
translation look-aside buffers spread out among the rendering engines.

Driver details

  • Is a CUDA context a true security capability?
    • Can a process modify details of the contexts it creates?
    • Can a process transmit its contexts to another? Will they persist if the originating process exits?
    • Can a process forge another process's contexts on its own?


  • What mechanisms, if any, exist to protect memory? At what granularities (of address and access) do they operate?
    • What about registers? CUDA doesn't do traditional context switches due to register preallocation.
  • How is memory protection split across hardware, kernelspace, and userspace?
    • Any userspace protection can, of course, be trivially subverted
  • Are code and data memories separated (a Harvard architecture), or unified (a von Neumann architecture)?
  • What memories, if any, are scrubbed between kernel executions?
  • How many different regions can be tracked? How many contexts? What behavior exists at these limits?


  • Have these mechanisms changed over various hardware?
    • The "Fermi" hardware (Compute Capability 2.0) adds unified addressing and caches for global memory. Effects?
  • Have these mechanisms changed over the course of various driver releases?
  • Open source efforts (particularly the nouveau project) are working on their own drivers.
    • What must this software address?
  • How is the situation affected by multiple devices, whether in an SLI/CrossFire setup or not?


  • What forensic data, if any, is created by typical CUDA programs? Adversarial programs? Broken programs?
  • What relationship exists between CPU processes and GPU kernels?


Memory space exploration

Probe memory via attempts to read, write and execute various addresses, including:

  • those unallocated within the probing context,
  • those unallocated by any running context, and
  • those unallocated by any existing context.

Probe addresses using the various state space affixes, and using Compute Capability 2.0's unified addressing.

Context exploration

Determine whether CUDA contexts can be moved or shared between processes:

  • fork(2), then call cudaMalloc() in the child without creating a new context
    • If this works, see whether the change is reflected in the parent process
    • Rule out a simple PPID check (dubious, but possible) by fork(2)ing twice
  • Transmit the CUcontext body to another process via IPC or the filesystem, and repeat the tests
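The double-fork step can be scaffolded without any CUDA calls. The following is a minimal sketch in C, assuming a POSIX system: probe_in_descendant(), probe_fn, origin_pid and parent_is_not_origin() are hypothetical names introduced here, and the placeholder probe only inspects getppid(2). A real experiment would replace it with an allocation attempt against the inherited context.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical probe run in the descendant. A real test would attempt a
 * device allocation against the inherited context and return its status. */
typedef int (*probe_fn)(void);

/* PID of the process that (in the real experiment) created the context. */
static pid_t origin_pid;

/* Placeholder probe: returns 1 if our parent is NOT the originating process. */
static int parent_is_not_origin(void)
{
    return getppid() != origin_pid;
}

/* Run probe() `depth` fork(2)s below the caller and return its result,
 * or -1 if the plumbing itself failed. depth == 2 is the double fork. */
static int probe_in_descendant(probe_fn probe, int depth)
{
    int fds[2];

    assert(depth >= 1);
    if (pipe(fds) != 0)
        return -1;
    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {               /* child: recurse or probe, then report */
        close(fds[0]);
        int result = depth > 1 ? probe_in_descendant(probe, depth - 1)
                               : probe();
        ssize_t w = write(fds[1], &result, sizeof result);
        _exit(w == (ssize_t)sizeof result ? 0 : 1);
    }
    close(fds[1]);                /* parent: collect the result */
    int result = -1;
    ssize_t r = read(fds[0], &result, sizeof result);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return r == (ssize_t)sizeof result ? result : -1;
}
```

Set origin_pid = getpid() before calling. With depth 1 the probe runs in a direct child, whose parent is the originator; with depth 2 it runs in a grandchild, whose parent is the intermediate process. Comparing the two results separates a PPID-based check from any other form of context ownership.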

Determine how many contexts can be created within a single process, and across the system.



Clone the git repository. Define (either via shell export or within a new file Makefile.local):

  • NVCC to point at your nvcc binary (from the CUDA Toolkit)
  • CUDA to point at your cuda-dev root


(Screenshot: cudash in use)

CUBAR tools

Only those tools which launch CUDA kernels use the Runtime API. They might be converted to the Driver API.

Driver API

  • cudaminimal - The minimal possible CUDA program (a call to cuInit()), used for strace(1) etc.
  • cudapinner - Allocates and pins (locks) all possible shared (system) memory, using the minimum necessary number of distinct allocations.
  • cudaspawner - Determines the maximum number of CUcontexts supported by a device, optionally applying memory pressure proportional to n.
  • cudastuffer - Allocates all possible memory on a device, using the minimum necessary number of distinct allocations.
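One way to fill a memory with few allocations is to request a large size and halve it whenever a request fails. The following is a minimal sketch in C of that strategy, which is assumed here for illustration: this page does not specify how cudastuffer or cudapinner pick their sizes, and halving is not guaranteed to reach the true minimum number of allocations. stuff(), alloc_fn, stuff_result and fake_alloc() are hypothetical names. A real tool would wrap a device allocation call such as cuMemAlloc() in the callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical allocator hook: returns 0 on success, nonzero on failure. */
typedef int (*alloc_fn)(size_t bytes, void *ctx);

typedef struct {
    size_t total;   /* bytes successfully allocated */
    size_t count;   /* number of distinct allocations made */
} stuff_result;

/* Greedily allocate until nothing of at least `granule` bytes fits.
 * Start at `start` bytes; halve the request size after each failure. */
static stuff_result stuff(alloc_fn try_alloc, void *ctx,
                          size_t start, size_t granule)
{
    stuff_result r = { 0, 0 };
    size_t want = start;

    assert(granule > 0);
    while (want >= granule) {
        if (try_alloc(want, ctx) == 0) {
            r.total += want;
            r.count++;
        } else {
            want /= 2;
        }
    }
    return r;
}

/* Simulated device for demonstration: ctx points at the bytes still free. */
static int fake_alloc(size_t bytes, void *ctx)
{
    size_t *free_bytes = ctx;

    if (bytes > *free_bytes)
        return -1;
    *free_bytes -= bytes;
    return 0;
}
```

Against the simulated 1000-byte device, starting at 1024 bytes with a 1-byte granule, the loop allocates 512, 256, 128, 64, 32 and 8 bytes: six allocations covering all 1000 bytes. Real devices round allocations up and fragment, so the counts there will differ.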

Runtime API

  • cudadump - Driver for cudaranger and sanity checker (FIXME: ought to be split up into two or more tools). For each device, it:
    • verifies that the constant memory region can be read in its entirety by a CUDA kernel
    • explores the entire virtual address space by calling cudaranger, retrying with progressively smaller ranges when accesses fail
  • cudaranger - Attempts to perform selectable memory accesses across the specified virtual address range of a device. Calculates the number of non-zero words.

cudadump must spawn cudaranger rather than launch the kernels itself: after a kernel failure, all libcuda calls return an early error (CUDA_ERROR_DEINITIALIZED (4) or CUDA_ERROR_LAUNCH_FAILED (700)) for the remainder of the process's life (i.e., creating a new context usually also fails; FIXME: provide proof).
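The narrowing of ranges on failure can be pictured as a recursive bisection. The following is a minimal sketch in C, not cudadump's actual code: explore(), range_probe, addr_range, explore_state and fake_probe() are hypothetical names, and the probe callback stands in for spawning a separate prober process and checking its exit status, so that a failed probe cannot poison the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical probe: returns 0 if all of [lo, hi) was accessible.
 * A real driver would fork(2)/exec(2) the prober and check its status. */
typedef int (*range_probe)(uintptr_t lo, uintptr_t hi, void *ctx);

typedef struct {
    uintptr_t lo, hi;       /* half-open range [lo, hi) */
} addr_range;

typedef struct {
    addr_range *out;        /* accessible ranges found so far */
    size_t cap, n;          /* capacity and current length of out[] */
    uintptr_t min_span;     /* stop splitting at this size */
} explore_state;

/* Probe [lo, hi); on failure, split in half and retry each half. */
static void explore(range_probe probe, void *ctx,
                    uintptr_t lo, uintptr_t hi, explore_state *st)
{
    assert(lo < hi && st->min_span > 0);
    if (probe(lo, hi, ctx) == 0) {
        if (st->n > 0 && st->out[st->n - 1].hi == lo) {
            st->out[st->n - 1].hi = hi;      /* extend adjacent range */
        } else if (st->n < st->cap) {
            st->out[st->n].lo = lo;
            st->out[st->n].hi = hi;
            st->n++;
        }
        return;
    }
    if (hi - lo <= st->min_span)
        return;                              /* smallest unit: give up */
    uintptr_t mid = lo + (hi - lo) / 2;
    explore(probe, ctx, lo, mid, st);
    explore(probe, ctx, mid, hi, st);
}

/* Simulated address space: ctx points at the single accessible range. */
static int fake_probe(uintptr_t lo, uintptr_t hi, void *ctx)
{
    const addr_range *ok = ctx;

    return (lo >= ok->lo && hi <= ok->hi) ? 0 : -1;
}
```

Ranges are visited in ascending order, so adjacent successes merge into one entry. The cost of a failure is one spawned process per probe, which is why min_span should not be much smaller than the protection granularity being measured.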


The CUDA Shell (cudash), which currently uses the Runtime API for kernel launches, facilitates stateful, invasive examination of the CUDA memory/process system. It has online help via the "help" command.

  • FIXME: document commands

Other tools
