Libcudest: Difference between revisions

Latest revision as of 22:18, 22 August 2011

Reverse engineering of the CUDA system. CUDA primarily communicates with the NVIDIA closed-source driver via several dozen undocumented ioctl()s. My open source implementation, libcudest, is located at GitHub. Sundry utilities for reverse engineering are also within this repository, though recent modifications to valgrind-mmt have rather superseded my tools.

libcudest began as a project for Hyesoon Kim's CS4803DGC at the Georgia Institute of Technology.

Driver versions

Newer drivers can be used with older CUDA versions, but the converse is not true. The "CUDA macroversion" listed below is the first CUDA release designed explicitly for use with the listed drivers.

Version	CUDA macroversion	Notes
195.36.15	3.0
195.36.24	3.0
195.36.31	3.0
256.22	3.1-beta
256.29	3.1-beta
256.35	3.1-beta

CUDA Environment variables

Discovered via binary analysis and a shimmed getenv(3). Effects determined via blackbox and binary analyses:

Variable	Notes	Documented?	Effects
__RM_NO_VERSION_CHECK		N	Also checked by nvidia-smi
COMPUTE_PROFILE		Y	If set to 1, profiling will be performed. Implies CUDA_LAUNCH_BLOCKING.
COMPUTE_PROFILE_CONFIG		Y	Specifies a profiler configuration file. Only checked if COMPUTE_PROFILE is set.
COMPUTE_PROFILE_CSV		Y	If set to 1, profiling data will be written in CSV format. Only checked if COMPUTE_PROFILE is set.
COMPUTE_PROFILE_LOG		Y	Specifies profiler output file (default: "./cuda_profile.log"). Only checked if COMPUTE_PROFILE is set.
CUDA_AMODEL_DLL		N
CUDA_AMODEL_GPU		N
CUDA_API_TRACE_PTR		N
CUDA_CACHE_DISABLE		Y	If this is unset, the code cache will be used.
CUDA_CACHE_MAXSIZE		Y
CUDA_CACHE_PATH		Y	If this is set, it overrides the code cache's default path of $HOME/.nv/ComputeCache
CUDA_DEVCODE_CACHE		Y	PTX compilation cache.
CUDA_DEVCODE_PATH		Y	Search path for fat binaries.
CUDA_EMULATION_MODE
CUDA_FORCE_PTX_JIT
CUDA_HEAP_RANGE	Checked each time a context is created
CUDA_INJECTION64_PATH
CUDA_LAUNCH_BLOCKING		Y (CUDA 3.0 Programmer's Guide, 3.2.6.1)	Forces synchronization of host threads on GPU kernels.
CUDA_MEMCHECK	Checked each time a context is created
CUDA_MEMORY_LOG	Checked each time a context is created
CUDA_VISIBLE_DEVICES
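
The getenv(3) shim mentioned above can be approximated as follows. This is a minimal sketch rather than the tool actually used: it assumes glibc, `LD_PRELOAD` interposition and `dlsym(RTLD_NEXT, ...)`, and the log format is invented for illustration.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

/* Interposed getenv(): log every lookup, then forward to libc's version. */
char *getenv(const char *name)
{
    static char *(*real_getenv)(const char *);

    if (!real_getenv) {
        /* RTLD_NEXT: the next definition of getenv after this object. */
        real_getenv = (char *(*)(const char *))dlsym(RTLD_NEXT, "getenv");
        if (!real_getenv)
            return NULL;
    }
    char *val = real_getenv(name);
    /* Illustrative log format; stderr is unbuffered, so lines appear at once. */
    fprintf(stderr, "getenv(\"%s\") = %s\n", name, val ? val : "(unset)");
    return val;
}
```

Built as a shared object (for instance `gcc -shared -fPIC -o getenv_shim.so getenv_shim.c -ldl`, file names hypothetical) and loaded with `LD_PRELOAD=./getenv_shim.so`, this prints each variable a CUDA process looks up, whether or not it is set.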

Maps

Ordered from highest to lowest locations in x86 memory. These are architecture-, and to a lesser degree driver- and kernel version-specific. Applications and libraries can of course create many more maps than these.

vsyscalls. read-execute-private, very few pages, topmost area of memory, usually highest mapping
VDSO. read-execute-private, one page, high in memory (SYSENTER/SYSEXIT)
Userspace stack. read-write-private, many pages, high in memory
Anonymous map, 3 read-write-private pages, high in memory.
- Possibly associated with nvidia driver's NV_STACK_SIZE stack. read-write-private, (3 * 4096 on amd64, 2 * 4096 on i686)
Two sets of /dev/nvidiaX maps for each bound device. Sets are usually contiguous, and contain:
- an anonymous page, read-write-private
- several mappings of the device, having variable number of pages, all read-write-shared
Libraries. variable, middle of memory.
Userspace heap. read-write-private, many pages, low in memory
Application (data region). read-write-private, variable, low in memory
Application (text region). read-execute-private, variable, usually lowest mapping
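
A layout like the one above can be read from `/proc/<pid>/maps`. The following is a rough sketch of parsing one line of that file and picking out the /dev/nvidiaX mappings; the struct and function names are made up for this example.

```c
#include <stdio.h>
#include <string.h>

struct map_entry {
    unsigned long start, end, offset;
    char perms[5];   /* e.g. "rw-s": read, write, no exec, shared */
    char path[256];  /* empty string for anonymous maps */
};

/* Parse one /proc/<pid>/maps line. Returns 1 on success, 0 if malformed. */
int parse_maps_line(const char *line, struct map_entry *m)
{
    m->path[0] = '\0';
    /* Fields: start-end perms offset dev inode [path]; dev and inode are skipped. */
    int n = sscanf(line, "%lx-%lx %4s %lx %*s %*s %255[^\n]",
                   &m->start, &m->end, m->perms, &m->offset, m->path);
    return n >= 4;
}

/* True if the mapping is backed by an NVIDIA device node. */
int is_nvidia_device_map(const struct map_entry *m)
{
    return strncmp(m->path, "/dev/nvidia", 11) == 0;
}
```

Feeding each line of `/proc/self/maps` through `parse_maps_line()` and filtering with `is_nvidia_device_map()` should isolate the read-write-shared device mappings described above.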

mmap()s

offset	size	notes	Nouveau name	block range
reg_addr + 0x0000	0x2000	not mapped by libcuda	PMC functional block	0x000000--0x001fff
reg_addr + 0x9000	0x1000	[Rwxs] mapped in cuInit(). first mapping. per-device.	PTIMER functional block	0x009000--0x009fff
reg_addr + 0xc0a000 / 0xc0c000	0x1000	[RWxs] location is acquired from ioctl `4e`	PFIFO command submission interface	0xc00000--0xcfffff

ioctls

An ioctl request code (on x86) is 32 bits. The following layout comes from linux/asm-generic/ioctl.h in a 2.6.34 kernel:

Bit 31: Read?
Bit 30: Write?
Bits 29-16: Parameter size
Bits 15-8: Type (module)
Bits 7-0: Number (command)

Looking at the source of the 195.36.15 kernel driver's OS interface, we see that NVIDIA uses the standard ioctl-creation macros from ioctl.h, and can be expected to adhere to this format. The type code used (NV_IOCTL_MAGIC) is 'F' (0x46), which overlaps with the framebuffer ioctl range as registered in 2.6.34. We further see that only _IOWR() is used to declare ioctls, implying that the top two bits (read and write) will always be '11'. Both of these deductions are borne out by observing strace output of a CUDA process.
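
The bit layout above can be checked against strace output with a few lines of C. This is an illustrative sketch: `decode_ioctl` and `nv_iowr` are names invented here, and `nv_iowr` merely mirrors what `_IOWR(NV_IOCTL_MAGIC, nr, type)` would produce for a parameter of the given size.

```c
#include <stdint.h>

#define NV_IOCTL_MAGIC 'F'   /* 0x46, the type code noted above */

struct ioctl_fields {
    unsigned read;   /* bit 31 */
    unsigned write;  /* bit 30 */
    unsigned size;   /* bits 29-16: parameter size in bytes */
    unsigned type;   /* bits 15-8: type (module) */
    unsigned nr;     /* bits 7-0: number (command) */
};

/* Split a 32-bit ioctl request code into its fields. */
struct ioctl_fields decode_ioctl(uint32_t req)
{
    struct ioctl_fields f;
    f.read  = (req >> 31) & 0x1;
    f.write = (req >> 30) & 0x1;
    f.size  = (req >> 16) & 0x3fff;
    f.type  = (req >> 8) & 0xff;
    f.nr    = req & 0xff;
    return f;
}

/* Build the request code for a read+write ioctl with the NVIDIA type code. */
uint32_t nv_iowr(unsigned nr, unsigned size)
{
    return (3u << 30) | ((size & 0x3fffu) << 16) |
           ((uint32_t)NV_IOCTL_MAGIC << 8) | (nr & 0xffu);
}
```

For example, command 0xd2 with a 0x48-byte parameter encodes to 0xc04846d2, and decoding that value gives back read = 1, write = 1, size = 0x48, type = 0x46, nr = 0xd2.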

Code	Param size	Param location(s)	Driver API call sites	Notes
/dev/nvidiactl
0xc8 NV_ESC_CARD_INFO	0x600 (1536)	anonymous page	cuInit	Largest parameter by far. Possibly scaled? Shifted 3 bits left, this is 0x3000, the size of the amd64 anonymous mapping. More likely we support returning up to 32x 48-byte descriptors (32 * 48 = 1536). Wants the first 32 bits to be 1, all others 0; this is most likely a mask indicating which card IDs we want information for. typedef struct nv_ioctl_card_info { NvU16 flags; /* see below */ NvU8 bus; /* bus number (PCI, AGP, etc) */ NvU8 slot; /* card slot */ NvU16 vendor_id; /* PCI vendor id */ NvU16 device_id; NvU16 interrupt_line; NvU64 reg_address NV_ALIGN_BYTES(8); NvU64 reg_size NV_ALIGN_BYTES(8); NvU64 fb_address NV_ALIGN_BYTES(8); NvU64 fb_size NV_ALIGN_BYTES(8); } nv_ioctl_card_info_t; Returns (all subsequent bytes are 0): 0x00010001 0x0cb110de 0x00000026 0x00000000 0xf2000000 0x00000000 0x01000000 0x00000000 0xe0000000 0x00000000 0x10000000 0x00000000. Decoded: 0x0001: flag (NV_IOCTL_CARD_INFO_FLAG_PRESENT); 0x0001: bus/slot; 0x0cb110de: vendor + device IDs (lspci -n: `01:00.0 0300: 10de:0cb1 (rev a2)`; lspci -t -v: `\-[0000:00]-+-03.0-[01]--+-00.0 nVidia Corporation GT215 [GeForce GTS 360M]`); 0x26: IRQ line (here, #38); 0xf2000000 00000000: reg_address; 0x01000000 00000000: reg_size; 0xe0000000 00000000: fb_address; 0x10000000 00000000: fb_size. These are all system memory references, see `/proc/iomem`: e0000000-f30fffff : PCI Bus 0000:01; e0000000-efffffff : 0000:01:00.0; f0000000-f1ffffff : 0000:01:00.0; f2000000-f2ffffff : 0000:01:00.0; f2000000-f2ffffff : nvidia; f3000000-f307ffff : 0000:01:00.0; f3080000-f3083fff : 0000:01:00.1; f3080000-f3083fff : ICH HD audio
0xca NV_ESC_ENV_INFO	0x004	anonymous page	cuInit	Seems to ignore input value. Writes result value (0x00000001). typedef struct nv_ioctl_env_info { NvU32 pat_supported; } nv_ioctl_env_info_t;
0xce NV_ESC_ALLOC_OS_EVENT	0x14
0xcf NV_ESC_FREE_OS_EVENT
0xd1 NV_ESC_STATUS_CODE
0xd2 NV_ESC_CHECK_VERSION_STR	0x048	stack	cuInit	Performed immediately following opening of the nvidiactl device. typedef struct nv_ioctl_rm_api_version { NvU32 cmd; NvU32 reply; char versionString[NV_RM_API_VERSION_STRING_LENGTH]; } nv_ioctl_rm_api_version_t; #define NV_RM_API_VERSION_CMD_STRICT 0 #define NV_RM_API_VERSION_CMD_RELAXED '1' #define NV_RM_API_VERSION_CMD_OVERRIDE '2' #define NV_RM_API_VERSION_REPLY_UNRECOGNIZED 0 #define NV_RM_API_VERSION_REPLY_RECOGNIZED 1 The version string is sent in ASCII: for 195.36.15, the first 8 characters dumped as a little-endian 64-bit word read 0x312e36332e353931 ('1' '.' '6' '3' '.' '5' '9' '1', i.e. "195.36.1" reversed), followed by 0x35 ('5'). The apparent reversal is only little-endian byte order. All other bytes are 0. Writes result to first 8 bytes (0x00000001), leaves others untouched.
0x22	0x00c	stack	cuInit	Inputs set to 0. Outputs (example): 3251635025 65 0 First value is used as first input word to the majority of subsequent ioctls Second value ranges over (at least) 41--65... Not sent in 256.22/3.10...
0x2a	0x020	stack	cuInit	GPU method invocation. Second and third words specify the method being called. Fifth and sixth specify the address being passed; seventh and eighth the size thereof. Sample inputs: 0x7fffffffd310: 3251635025 3251635025 533 0 0x7fffffffd320: 4294955888 32767 132 0 First and second words are not always equivalent. Outputs are usually unchanged, but not always: ioctl 2a, 32-byte param, fd 3 0xc1d04214 0x5c000002 0x2080012f 0x00000000 0x0010 0x950713f0 0x00007fff 0x000000a8 0x00000000 GPU method 0x5c000002:2080012f 0x00000000 0x00000000 0x00000000 0x00000000 0x0010 0x00000000 0x00000000 0x00000000 0x00000000 0x0020 0x00000000 0x00000000 0x00000000 0x00000000 0x0030 0x00000000 0x00000000 0x00000000 0x00000000 0x0040 0x00000000 0x00000000 0x00000000 0x00000000 0x0050 0x00000000 0x00000000 0x00000000 0x00000000 0x0060 0x00000000 0x00000000 0x00000000 0x00000000 0x0070 0x00000000 0x00000000 0x00000000 0x00000000 0x0080 0x00000000 0x00000000 0x00000000 0x00000000 0x0090 0x00000000 0x00000000 0x00000000 0x00000000 0x00a0 0x00000000 0x00000000 RESULT: 0 0xc1d04214 0x5c000002 0x2080012f 0x00000000 0x0010 0x950713f0 0x00007fff 0x000000a8 0x00000029 GPU method 0x5c000002:2080012f **************MODIFICATION FROM CALL 0x00000000 0x00000000 0x00000000 0x00000000 0x0010 0x00000000 0x00000000 0x00000000 0x00000000 0x0020 0x00000000 0x00000000 0x00000000 0x00000000 0x0030 0x00000000 0x00000000 0x00000000 0x00000000 0x0040 0x00000000 0x00000000 0x00000000 0x00000000 0x0050 0x00000000 0x00000000 0x00000000 0x00000000 0x0060 0x00000000 0x00000000 0x00000000 0x00000000 0x0070 0x00000000 0x00000000 0x00000000 0x00000000 0x0080 0x00000000 0x00000000 0x00000000 0x00000000 0x0090 0x00000000 0x00000000 0x00000000 0x00000000 0x00a0 0x00000000 0x00000000
0x2b	0x020	stack	cuInit	GPU object creation(?)
0x4d	0x048	stack	cuInit	Performed following opening of nvidiaX device
0x2d	0x014	stack	cuInit	Performed following read of /proc/interrupts
0x4e	0x030		cuInit	Immediately prior to first mmap()
0x4f	0x020		cuInit	Invoked if mmap() returns MAP_FAILED, prior to failing out
0x54	0x30
0x57	0x038
0x58	0x28
0x59	0x10
/dev/nvidiaX
0x32	0x014	stack	cuInit	Performed several times in succession
0x37	0x020	stack	cuInit	Follows burst of 3x 0x32's, then interwoven with bursts of 2a's
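
The NV_ESC_CARD_INFO result listed in the table can be decoded mechanically. The sketch below lays out the quoted nv_ioctl_card_info struct with fixed-width types; the explicit padding before reg_address is an assumption about what NV_ALIGN_BYTES(8) does, and the copy assumes a little-endian host such as x86. The names `card_info` and `decode_card_info` are invented for this example.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CARD_INFO_FLAG_PRESENT 0x0001  /* value seen in the dump above */

struct card_info {
    uint16_t flags;
    uint8_t  bus;
    uint8_t  slot;
    uint16_t vendor_id;
    uint16_t device_id;
    uint16_t interrupt_line;
    uint8_t  pad[6];          /* assumed padding up to the 8-byte boundary */
    uint64_t reg_address;
    uint64_t reg_size;
    uint64_t fb_address;
    uint64_t fb_size;
};

/* 32 descriptors of this size fill the 0x600-byte ioctl parameter. */
_Static_assert(sizeof(struct card_info) == 48, "descriptor must be 48 bytes");

/* Decode one descriptor from 32-bit words as dumped (little-endian host assumed).
 * Returns 0 if a card is present, -1 if the buffer is short or the slot is empty. */
int decode_card_info(const uint32_t *words, size_t nwords, struct card_info *ci)
{
    if (nwords * sizeof *words < sizeof *ci)
        return -1;
    memcpy(ci, words, sizeof *ci);
    return (ci->flags & CARD_INFO_FLAG_PRESENT) ? 0 : -1;
}
```

Applied to the twelve words shown for NV_ESC_CARD_INFO, this yields bus 1, slot 0, vendor 0x10de, device 0x0cb1, IRQ 0x26, and the register and framebuffer ranges that appear in `/proc/iomem`.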

GPU methods

Code	Param size	Notes
0x5c000002 (per-device)
0x20800110	0x84	Retrieves device name:

RESULT: 0			0xc1d04277 0x5c000002 0x20800110 0x00000000 
0x0010				0x73be4970 0x00007fff 0x00000084 0x00000000 
GPU method 0x5c000002:20800110	0x00000000 0x6f466547 0x20656372 0x20535447 
0x0010				0x4d303633 0x00000000 0x00000000 0x00000000

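6f46654720656372205354474d303633 == "oFeG ecr STGM063". Each 32-bit word is printed little-endian, so reversing the bytes within each word gives the device name "GeForce GTS 360M".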
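
The byte order of the name words in the dump can be undone with a short helper. This is a sketch under the assumption that the dump prints little-endian 32-bit words, as x86 would; `words_to_ascii` is a name invented here.

```c
#include <stddef.h>
#include <stdint.h>

/* Unpack n little-endian 32-bit words into bytes, lowest byte first.
 * out must have room for 4 * n + 1 bytes; the result is NUL-terminated. */
void words_to_ascii(const uint32_t *words, size_t n, char *out)
{
    for (size_t i = 0; i < n; i++)
        for (int b = 0; b < 4; b++)
            out[4 * i + b] = (char)((words[i] >> (8 * b)) & 0xff);
    out[4 * n] = '\0';
}
```

Passing the four words 0x6f466547, 0x20656372, 0x20535447 and 0x4d303633 from the trace produces the string "GeForce GTS 360M", matching the lspci output quoted earlier.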

disassembly

These disassemblies make use of libcuda.so.195.36.15 (0867d66be617faab3782fa0ba19ec9ba, 7404990 bytes). Symbols were extracted via objdump -T.

AMD64 ABI:
- Integer arguments via RDI, RSI, RDX, RCX, R8 and R9, then stack
- FP arguments in XMM0..XMM7, then stack
- Return value in RAX
- libcuda traces
