Check out my first novel, midnight's simulacra!
CUBAR
CUDA (and General-Purpose Graphics Processing Unit programming in general) is rapidly becoming a mainstay of high-performance computing. As CUDA and OpenCL move off of the workstation, and into the server -- off of the console, and into the cluster -- the security of these systems will become critical parts of the associated trusted computing base. Even ignoring the issue of multiuser security, the properties of isolation and (to a lesser extent) confidentiality are important for debugging, profiling and reproducibility. I've authored cudafucker and associated tools to investigate the security properties -- primarily the means and parameters of memory protection, and the division of protection between soft- and hardware -- of CUDA on NVIDIA hardware since the G80 architecture.
Questions
Memory details
- What address translations, if any, are performed?
- If address translation is performed, can physical memory be multiply aliased?
- How are accesses affected by use of incorrect state space affixes?
- Compute Capability 2.0 introduces unified addressing, but still supports modal addressing
- How do physical addresses correspond to distinct memory regions?
Driver details
- Is a CUDA context a true security capability?
- Can a process modify details of the contexts it creates?
- Can a process transmit its contexts to another? Will they persist if the originating process exits?
- Can a process forge another process's contexts on its own?
Protection
- What mechanisms, if any, exist to protect memory? At what granularities (of address and access) do they operate?
- How is memory protection split across hardware, kernelspace, and userspace?
- Any userspace protection can, of course, be trivially subverted
- Are code and data memories separated (a Harvard architecture), or unified (Von Neumann architecture)?
- In the case of Von Neumann or Modified Harvard, is there execution protection?
- What memories, if any, are scrubbed between kernels' execution?
- How many different regions can be tracked? How many contexts? What behavior exists at these limits?
Variation
- Have these mechanisms changed over various hardware?
- The "Fermi" hardware (Compute Capability 2.0) adds unified addressing and caches for global memory. Effects?
- Have these mechanisms changed over the course of various driver releases?
- Open source efforts (particularly the nouveau project) are working on their own drivers.
- What all needs be addressed by these softwares?
- How is the situation affected by multiple devices, whether in an SLI/CrossFire setup or not?
Accountability
- What forensic data, if any, is created by typical CUDA programs? Adversarial programs? Broken programs?
- What relationship exists between CPU processes and GPU kernels?
Experiments
Memory space exploration
Probe memory via attempts to read, write and execute various addresses, including:
- those unallocated within the probing context,
- those unallocated by any running context, and
- those unallocated by any existing context.
Probe addresses using the various state space affixes, and 2.0's unified addressing.
Context exploration
Determine whether CUDA contexts can be moved or shared between processes:
- fork(2) and execute cudaAlloc(3) without creating a new context
- If this works, see whether the change is reflected in the parent binary
- Ensure that PPID isn't just being checked (dubious, but possible) by fork(2)ing twice
- Transmit the CUcontext body to another process via IPC or the filesystem, and repeat the tests
Determine how many contexts can be created across a process, and across the system.