Libcudest
This page documents my reverse engineering of the CUDA system. CUDA communicates with the closed-source NVIDIA driver primarily via several dozen undocumented ioctl()s. My open source implementation, libcudest, is hosted on GitHub. Sundry reverse-engineering utilities also live in that repository, though recent modifications to valgrind-mmt have largely superseded my tools.
libcudest began as a project for Hyesoon Kim's CS4803DGC at the Georgia Institute of Technology.
Driver versions
Newer drivers can be used with older CUDA versions, but the converse is not true. The "CUDA macroversion" listed below is the first CUDA release designed explicitly for use with the listed drivers.
Version | CUDA macroversion | Notes |
---|---|---|
195.36.15 | 3.0 | |
195.36.24 | 3.0 | |
195.36.31 | 3.0 | |
256.22 | 3.1-beta | |
256.29 | 3.1-beta | |
256.35 | 3.1-beta | |
CUDA Environment variables
Discovered via binary analysis and a shimmed getenv(3). Effects were determined via black-box and binary analyses:
Variable | Documented? | Notes and effects |
---|---|---|
__RM_NO_VERSION_CHECK | N | Also checked by nvidia-smi |
COMPUTE_PROFILE | Y | If set to 1, profiling will be performed. Implies CUDA_LAUNCH_BLOCKING. |
COMPUTE_PROFILE_CONFIG | Y | Specifies a profiler configuration file. Only checked if COMPUTE_PROFILE is set. |
COMPUTE_PROFILE_CSV | Y | If set to 1, profiling data will be written in CSV format. Only checked if COMPUTE_PROFILE is set. |
COMPUTE_PROFILE_LOG | Y | Specifies profiler output file (default: "./cuda_profile.log"). Only checked if COMPUTE_PROFILE is set. |
CUDA_AMODEL_DLL | N | |
CUDA_AMODEL_GPU | N | |
CUDA_API_TRACE_PTR | N | |
CUDA_CACHE_DISABLE | Y | If this is unset, the code cache will be used. |
CUDA_CACHE_MAXSIZE | Y | |
CUDA_CACHE_PATH | Y | If this is set, it overrides the code cache's default path of $HOME/.nv/ComputeCache |
CUDA_DEVCODE_CACHE | Y | PTX compilation cache. |
CUDA_DEVCODE_PATH | Y | Search path for fat binaries. |
CUDA_EMULATION_MODE | | |
CUDA_FORCE_PTX_JIT | | |
CUDA_HEAP_RANGE | | Checked each time a context is created |
CUDA_INJECTION64_PATH | | |
CUDA_LAUNCH_BLOCKING | Y (CUDA 3.0 Programmer's Guide, 3.2.6.1) | Forces synchronization of host threads on GPU kernels. |
CUDA_MEMCHECK | | Checked each time a context is created |
CUDA_MEMORY_LOG | | Checked each time a context is created |
CUDA_VISIBLE_DEVICES | | |
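For readers who want to repeat the discovery step, here is a minimal sketch of a getenv(3) shim. It is not the shim used for this page. It assumes glibc's dlsym(3) with RTLD_NEXT, and the logging format is made up for illustration.

```c
#define _GNU_SOURCE           /* needed for RTLD_NEXT on glibc */
#include <assert.h>
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Interposed getenv(3): log every lookup, then defer to the real one.
 * Illustrative only; the log format is an assumption. */
char *getenv(const char *name)
{
    static char *(*real_getenv)(const char *);

    if (!real_getenv)
        real_getenv = (char *(*)(const char *))dlsym(RTLD_NEXT, "getenv");
    assert(real_getenv != NULL);

    char *value = real_getenv(name);
    fprintf(stderr, "getenv(\"%s\") = %s\n", name, value ? value : "(unset)");
    return value;
}
```

Built as a shared object (for example with gcc -shared -fPIC, adding -ldl on older glibc) and loaded through LD_PRELOAD, a shim of this kind prints each variable name a process looks up, which is one way a list like the table above can be assembled.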
Maps
Ordered from highest to lowest locations in x86 memory. These are architecture-, and to a lesser degree driver- and kernel version-specific. Applications and libraries can of course create many more maps than these.
- vsyscalls. read-execute-private, very few pages, topmost area of memory, usually highest mapping
- VDSO. read-execute-private, one page, high in memory (SYSENTER/SYSEXIT)
- Userspace stack. read-write-private, many pages, high in memory
- Anonymous map. read-write-private, 3 pages, high in memory
  - Possibly associated with the nvidia driver's NV_STACK_SIZE stack (3 * 4096 bytes on amd64, 2 * 4096 on i686)
- Two sets of /dev/nvidiaX maps for each bound device. Sets are usually contiguous, and contain:
  - an anonymous page, read-write-private
  - several mappings of the device, having a variable number of pages, all read-write-shared
- Libraries. variable, middle of memory.
- Userspace heap. read-write-private, many pages, low in memory
- Application (data region). read-write-private, variable, low in memory
- Application (text region). read-execute-private, variable, usually lowest mapping
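The layout above can be checked on a live process by reading /proc/<pid>/maps. The following is a minimal sketch of a parser for one line of that file. The struct and function names are hypothetical, and it only picks out the fields used in the list above (address range, permissions, backing path).

```c
#include <stdio.h>
#include <string.h>

/* One parsed line of /proc/<pid>/maps (hypothetical helper type). */
struct map_entry {
    unsigned long start, end;
    char perms[5];    /* e.g. "rw-s" for read-write-shared */
    char path[256];   /* empty for anonymous maps */
};

/* Parse a maps line; returns 1 if at least range and perms were read. */
int parse_maps_line(const char *line, struct map_entry *m)
{
    m->path[0] = '\0';
    int n = sscanf(line, "%lx-%lx %4s %*x %*x:%*x %*u %255[^\n]",
                   &m->start, &m->end, m->perms, m->path);
    return n >= 3;
}

/* True for maps backed by /dev/nvidiactl or /dev/nvidiaX. */
int is_nvidia_map(const struct map_entry *m)
{
    return strncmp(m->path, "/dev/nvidia", 11) == 0;
}
```

Feeding each line of /proc/self/maps through parse_maps_line() after cuInit() and filtering with is_nvidia_map() should list the device mappings. The remaining entries can be compared against the ordering given above.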
mmap()s
offset | size | notes | Nouveau name | block range |
---|---|---|---|---|
reg_addr + 0x0000 | 0x2000 | not mapped by libcuda | PMC functional block | 0x000000--0x001fff |
reg_addr + 0x9000 | 0x1000 | [Rwxs] mapped in cuInit(). first mapping. per-device. | PTIMER functional block | 0x009000--0x009fff |
reg_addr + 0xc0a000 / 0xc0c000 | 0x1000 | [RWxs] location is acquired from ioctl 4e | PFIFO command submission interface | 0xc00000--0xcfffff |
ioctls
An ioctl request code (on x86) is 32 bits wide. The following layout comes from linux/asm-generic/ioctl.h in a 2.6.34 kernel:
- Bit 31: Read?
- Bit 30: Write?
- Bits 29-16: Parameter size
- Bits 15-8: Type (module)
- Bits 7-0: Number (command)
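The bit layout above translates directly into shifts and masks. Here is a minimal sketch of a decoder. The struct and function names are hypothetical, and the field widths are hard-coded from the list rather than taken from the kernel header.

```c
#include <stdint.h>

/* Fields of a 32-bit ioctl request code, per the layout listed above. */
struct ioctl_fields {
    unsigned read;    /* bit 31 */
    unsigned write;   /* bit 30 */
    unsigned size;    /* bits 29-16: parameter size in bytes */
    unsigned type;    /* bits 15-8: type (module) */
    unsigned nr;      /* bits 7-0: number (command) */
};

struct ioctl_fields decode_ioctl(uint32_t req)
{
    struct ioctl_fields f;

    f.read  = (req >> 31) & 0x1;
    f.write = (req >> 30) & 0x1;
    f.size  = (req >> 16) & 0x3fff;
    f.type  = (req >> 8) & 0xff;
    f.nr    = req & 0xff;
    return f;
}
```

For example, a request value of 0xc60046c8 splits into read=1, write=1, size=0x600, type=0x46 and number=0xc8.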
Looking at the source of the 195.36.15 kernel driver's OS interface, we see that NVIDIA uses the standard ioctl-creation macros from ioctl.h, and can be expected to adhere to this format. The type code used (NV_IOCTL_MAGIC) is 'F' (0x46), which overlaps with the framebuffer ioctl range as registered in 2.6.34. We further see that only _IOWR() is used to declare ioctls, implying that the top two bits (31 and 30) will always be '11'. Both of these deductions are borne out by strace output of a CUDA process.
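Given the 'F' type code and the exclusive use of _IOWR(), the request value seen in strace can be predicted from the command number and parameter size alone. This is a minimal sketch that spells the encoding out by hand instead of pulling in the kernel's macros. The helper name nv_iowr is hypothetical.

```c
#include <stdint.h>

#define NV_IOCTL_MAGIC 'F'   /* 0x46, as found in the driver's OS interface */

/* Hand-rolled equivalent of _IOWR(NV_IOCTL_MAGIC, nr, <size-byte type>):
 * direction bits 11, 14-bit size, 8-bit type, 8-bit number. */
uint32_t nv_iowr(uint8_t nr, uint16_t size)
{
    return (3u << 30)
         | ((uint32_t)(size & 0x3fff) << 16)
         | ((uint32_t)NV_IOCTL_MAGIC << 8)
         | (uint32_t)nr;
}
```

With the sizes from the table below, nv_iowr(0xc8, 0x600) gives 0xc60046c8 for NV_ESC_CARD_INFO, and nv_iowr(0xd2, 0x48) gives 0xc04846d2 for NV_ESC_CHECK_VERSION_STR.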
Code | Param size | Param location(s) | Driver API call sites | Notes |
---|---|---|---|---|
/dev/nvidiactl | | | | |
0xc8 (NV_ESC_CARD_INFO) | 0x600 (1536) | anonymous page | cuInit | typedef struct nv_ioctl_card_info { NvU16 flags; /* see below */ NvU8 bus; /* bus number (PCI, AGP, etc) */ NvU8 slot; /* card slot */ NvU16 vendor_id; /* PCI vendor id */ NvU16 device_id; NvU16 interrupt_line; NvU64 reg_address NV_ALIGN_BYTES(8); NvU64 reg_size NV_ALIGN_BYTES(8); NvU64 fb_address NV_ALIGN_BYTES(8); NvU64 fb_size NV_ALIGN_BYTES(8); } nv_ioctl_card_info_t; Sample: 0x00010001 0x0cb110de 0x00000026 0x00000000 0xf2000000 0x00000000 0x01000000 0x00000000 0xe0000000 0x00000000 0x10000000 0x00000000 /proc/iomem: e0000000-f30fffff : PCI Bus 0000:01 e0000000-efffffff : 0000:01:00.0 f0000000-f1ffffff : 0000:01:00.0 f2000000-f2ffffff : 0000:01:00.0 f2000000-f2ffffff : nvidia f3000000-f307ffff : 0000:01:00.0 f3080000-f3083fff : 0000:01:00.1 f3080000-f3083fff : ICH HD audio |
0xca (NV_ESC_ENV_INFO) | 0x004 | anonymous page | cuInit | typedef struct nv_ioctl_env_info { NvU32 pat_supported; } nv_ioctl_env_info_t; |
0xce (NV_ESC_ALLOC_OS_EVENT) | 0x14 | | | |
0xcf (NV_ESC_FREE_OS_EVENT) | | | | |
0xd1 (NV_ESC_STATUS_CODE) | | | | |
0xd2 (NV_ESC_CHECK_VERSION_STR) | 0x048 | stack | cuInit | typedef struct nv_ioctl_rm_api_version { NvU32 cmd; NvU32 reply; char versionString[NV_RM_API_VERSION_STRING_LENGTH]; } nv_ioctl_rm_api_version_t; #define NV_RM_API_VERSION_CMD_STRICT 0 #define NV_RM_API_VERSION_CMD_RELAXED '1' #define NV_RM_API_VERSION_CMD_OVERRIDE '2' #define NV_RM_API_VERSION_REPLY_UNRECOGNIZED 0 #define NV_RM_API_VERSION_REPLY_RECOGNIZED 1 |
0x22 | 0x00c | stack | cuInit | 3251635025 65 0 |
0x2a | 0x020 | stack | cuInit | Sample inputs: 0x7fffffffd310: 3251635025 3251635025 533 0 0x7fffffffd320: 4294955888 32767 132 0 ioctl 2a, 32-byte param, fd 3 0xc1d04214 0x5c000002 0x2080012f 0x00000000 0x0010 0x950713f0 0x00007fff 0x000000a8 0x00000000 GPU method 0x5c000002:2080012f 0x00000000 0x00000000 0x00000000 0x00000000 0x0010 0x00000000 0x00000000 0x00000000 0x00000000 0x0020 0x00000000 0x00000000 0x00000000 0x00000000 0x0030 0x00000000 0x00000000 0x00000000 0x00000000 0x0040 0x00000000 0x00000000 0x00000000 0x00000000 0x0050 0x00000000 0x00000000 0x00000000 0x00000000 0x0060 0x00000000 0x00000000 0x00000000 0x00000000 0x0070 0x00000000 0x00000000 0x00000000 0x00000000 0x0080 0x00000000 0x00000000 0x00000000 0x00000000 0x0090 0x00000000 0x00000000 0x00000000 0x00000000 0x00a0 0x00000000 0x00000000 RESULT: 0 0xc1d04214 0x5c000002 0x2080012f 0x00000000 0x0010 0x950713f0 0x00007fff 0x000000a8 0x00000029 GPU method 0x5c000002:2080012f **************MODIFICATION FROM CALL 0x00000000 0x00000000 0x00000000 0x00000000 0x0010 0x00000000 0x00000000 0x00000000 0x00000000 0x0020 0x00000000 0x00000000 0x00000000 0x00000000 0x0030 0x00000000 0x00000000 0x00000000 0x00000000 0x0040 0x00000000 0x00000000 0x00000000 0x00000000 0x0050 0x00000000 0x00000000 0x00000000 0x00000000 0x0060 0x00000000 0x00000000 0x00000000 0x00000000 0x0070 0x00000000 0x00000000 0x00000000 0x00000000 0x0080 0x00000000 0x00000000 0x00000000 0x00000000 0x0090 0x00000000 0x00000000 0x00000000 0x00000000 0x00a0 0x00000000 0x00000000 |
0x2b | 0x020 | stack | cuInit | |
0x4d | 0x048 | stack | cuInit | |
0x2d | 0x014 | stack | cuInit | |
0x4e | 0x030 | | cuInit | |
0x4f | 0x020 | | cuInit | |
0x54 | 0x30 | | | |
0x57 | 0x038 | | | |
0x58 | 0x28 | | | |
0x59 | 0x10 | | | |
/dev/nvidiaX | | | | |
0x32 | 0x014 | stack | cuInit | |
0x37 | 0x020 | stack | cuInit | |
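The twelve sample words listed with NV_ESC_CARD_INFO (0xc8) can be read against the nv_ioctl_card_info_t definition. This is a minimal sketch, assuming a little-endian host and standing in fixed-width C types for NvU8/NvU16/NvU64 and C11 _Alignas(8) for NV_ALIGN_BYTES(8). The names card_info, card_info_sample and parse_card_info are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Fixed-width stand-in for nv_ioctl_card_info_t (assumed equivalent). */
struct card_info {
    uint16_t flags;
    uint8_t  bus;
    uint8_t  slot;
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t interrupt_line;
    _Alignas(8) uint64_t reg_address;
    _Alignas(8) uint64_t reg_size;
    _Alignas(8) uint64_t fb_address;
    _Alignas(8) uint64_t fb_size;
};

/* The sample words from the 0xc8 row, in order. */
const uint32_t card_info_sample[12] = {
    0x00010001, 0x0cb110de, 0x00000026, 0x00000000,
    0xf2000000, 0x00000000, 0x01000000, 0x00000000,
    0xe0000000, 0x00000000, 0x10000000, 0x00000000,
};

/* Overlay the struct on the raw words (little-endian host assumed). */
struct card_info parse_card_info(const uint32_t *words)
{
    struct card_info ci;

    memcpy(&ci, words, sizeof ci);
    return ci;
}
```

Read this way, the sample gives vendor_id 0x10de (NVIDIA's PCI vendor ID), device_id 0x0cb1, bus 1, slot 0, a 16 MiB register window at 0xf2000000 and a 256 MiB framebuffer aperture at 0xe0000000, which lines up with the f2000000-f2ffffff and e0000000-efffffff ranges for 0000:01:00.0 in the /proc/iomem excerpt. Under these assumptions the struct occupies 48 bytes, so the 0x600-byte parameter would hold 32 such entries, though that reading is unconfirmed.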
GPU methods
Code | Param size | Notes |
---|---|---|
0x5c000002 (per-device) | | |
0x20800110 | 0x84 | RESULT: 0 0xc1d04277 0x5c000002 0x20800110 0x00000000 0x0010 0x73be4970 0x00007fff 0x00000084 0x00000000 GPU method 0x5c000002:20800110 0x00000000 0x6f466547 0x20656372 0x20535447 0x0010 0x4d303633 0x00000000 0x00000000 0x00000000 |
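The nonzero words returned by method 0x20800110 look like little-endian ASCII. Here is a minimal sketch that unpacks them, assuming the text starts at the second word of the result buffer (after the leading 0x00000000). The helper name words_to_ascii is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack little-endian 32-bit words into a NUL-terminated string,
 * stopping at the first zero byte or when `out` is full.
 * Returns the number of characters written (excluding the NUL). */
size_t words_to_ascii(const uint32_t *words, size_t nwords,
                      char *out, size_t outlen)
{
    size_t n = 0;

    if (outlen == 0)
        return 0;
    for (size_t i = 0; i < nwords; i++) {
        for (int shift = 0; shift < 32; shift += 8) {
            char c = (char)((words[i] >> shift) & 0xff);
            if (c == '\0' || n + 1 >= outlen) {
                out[n] = '\0';
                return n;
            }
            out[n++] = c;
        }
    }
    out[n] = '\0';
    return n;
}

/* Words 1..5 of the 0x20800110 result shown above. */
const uint32_t method_110_text[5] = {
    0x6f466547, 0x20656372, 0x20535447, 0x4d303633, 0x00000000,
};
```

Run over method_110_text, this yields "GeForce GTS 360M", which suggests (but does not prove) that 0x20800110 returns the device's marketing name in its 0x84-byte parameter.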
disassembly
This disassembly makes use of libcuda.so.195.36.15 (0867d66be617faab3782fa0ba19ec9ba, 7404990 bytes). Symbols were extracted via objdump -T. AMD64 ABI:
- Integer arguments via RDI, RSI, RDX, RCX, R8 and R9, then stack
- FP arguments in XMM0..XMM7, then stack
- Return value in RAX
- libcuda traces
See Also
- Kernel ioctl numbering documentation
- My CUDA and CUBAR pages
- I developed ptracer to get traces for this project
- Some traces