Libcudest

From dankwiki
Jump to navigation Jump to search

Reverse engineering of the CUDA system. CUDA primarily communicates with the NVIDIA closed-source driver via several dozen undocumented ioctl()s. My open source implementation, libcudest, is located at GitHub. Sundry utilities for reverse engineering are also within this repository, though recent modifications to valgrind-mmt have rather superseded my tools.

libcudest began as a project for Hyesoon Kim's CS4803DGC at the Georgia Institute of Technology.

Driver versions

Newer drivers can be used with older CUDA versions, but the converse is not true. The "CUDA macroversion" listed below is the first CUDA release designed explicitly for use with the listed drivers.

Version CUDA macroversion Notes
195.36.15 3.0
195.36.24 3.0
195.36.31 3.0
256.22 3.1-beta
256.29 3.1-beta
256.35 3.1-beta

CUDA Environment variables

Discovered via binary analysis and a shimmed getenv(3). Effects determined via blackbox and binary analyses:

Variable Notes Documented? Effects
__RM_NO_VERSION_CHECK N Also checked by nvidia-smi
COMPUTE_PROFILE Y If set to 1, profiling will be performed. Implies CUDA_LAUNCH_BLOCKING.
COMPUTE_PROFILE_CONFIG Y Specifies a profiler configuration file. Only checked if COMPUTE_PROFILE is set.
COMPUTE_PROFILE_CSV Y If set to 1, a profiling data will be written in CSV format. Only checked if COMPUTE_PROFILE is set.
COMPUTE_PROFILE_LOG Y Specifies profiler output file (default: "./cuda_profile.log"). Only checked if COMPUTE_PROFILE is set.
CUDA_AMODEL_DLL N
CUDA_AMODEL_GPU N
CUDA_API_TRACE_PTR N
CUDA_CACHE_DISABLE Y If this is unset, the code cache will be used.
CUDA_CACHE_MAXSIZE Y
CUDA_CACHE_PATH Y If this is set, it overrides the code cache's default path of $HOME/.nv/ComputeCache
CUDA_DEVCODE_CACHE Y PTX compilation cache.
CUDA_DEVCODE_PATH Y Search path for fat binaries.
CUDA_EMULATION_MODE
CUDA_FORCE_PTX_JIT
CUDA_HEAP_RANGE Checked each time a context is created
CUDA_INJECTION64_PATH
CUDA_LAUNCH_BLOCKING Y (CUDA 3.0 Programmer's Guide, 3.2.6.1) Forces synchronization of host threads on GPU kernels.
CUDA_MEMCHECK Checked each time a context is created
CUDA_MEMORY_LOG Checked each time a context is created
CUDA_VISIBLE_DEVICES

Maps

Ordered from highest to lowest locations in x86 memory. These are architecture-, and to a lesser degree driver- and kernel version-specific. Applications and libraries can of course create many more maps than these.

  • vsyscalls. read-execute-private, very few pages, topmost area of memory, usually highest mapping
  • VDSO. read-execute-private, one page, high in memory (SYSENTER/SYSEXIT)
  • Userspace stack. read-write-private, many pages, high in memory
  • Anonymous map, 3 read-write-private pages, high in memory.
    • Possibly associated with nvidia driver's NV_STACK_SIZE stack. read-write-private, (3 * 4096 on amd64, 2 * 4096 on i686)
  • Two sets of /dev/nvidiaX maps for each bound device. Sets are usually continguous, and contain:
    • an anonymous page, read-write-private
    • several mappings of the device, having variable number of pages, all read-write-shared
  • Libraries. variable, middle of memory.
  • Userspace heap. read-write-private, many pages, low in memory
  • Application (data region). read-write-private, variable, low in memory
  • Application (text region). read-execute-private, variable, usually lowest mapping

mmap()s

offset size notes Nouveau name block range
reg_addr + 0x0000 0x2000 not mapped by libcuda PMC functional block 0x000000--0x001fff
reg_addr + 0x9000 0x1000 [Rwxs] mapped in cuInit(). first mapping. per-device. PTIMER functional block 0x009000--0x009fff
reg_addr + 0xc0a000 / 0xc0c000 0x1000 [RWxs] location is acquired from ioctl 4e PFIFO command submission interface 0xc00000--0xcfffff

ioctls

An ioctl (on x86) is 32 bits. The following definition comes from linux/asm-generic/ioctl.h in a 2.6.34 kernel:

  • Bit 31: Read?
  • Bit 30: Write?
  • Bits 29-16: Parameter size
  • Bits 15-8: Type (module)
  • Bits 7-0: Number (command)

Looking at the source of the 195.36.15 kernel driver's OS interface, we see that NVIDIA uses the standard ioctl-creation macros from ioctl.h, and can be expected to adhere to this format. The type code used (NV_IOCTL_MAGIC) is 'F' (0x46), which overlaps with the framebuffer ioctl range as registered in 2.6.34. We further see that only _IOWR() is used to declare ioctls, implying that the first two bits will always be '11'. Both of these deductions are borne out observing strace output of a CUDA process.

Code Param size Param location(s) Driver API call sites Notes
/dev/nvidiactl
0xc8

NV_ESC_CARD_INFO

0x600 (1536) anonymous page cuInit
  • Largest parameter by far.
    • Possibly scaled? Shifted 3 bits left, this is 0x3000, the size of the amd64 anonymous mapping.
    • More likely we support returning up to 32x 48-byte descriptors, and...
  • Wants the first 32 bits to be 1, all others 0.
    • ...this is most likely a mask indicating which card IDs we want information for!
typedef struct nv_ioctl_card_info
{
    NvU16    flags;               /* see below                   */
    NvU8     bus;                 /* bus number (PCI, AGP, etc)  */
    NvU8     slot;                /* card slot                   */
    NvU16    vendor_id;           /* PCI vendor id               */
    NvU16    device_id;
    NvU16    interrupt_line;
    NvU64    reg_address    NV_ALIGN_BYTES(8);
    NvU64    reg_size       NV_ALIGN_BYTES(8);
    NvU64    fb_address     NV_ALIGN_BYTES(8);
    NvU64    fb_size        NV_ALIGN_BYTES(8);
} nv_ioctl_card_info_t;
  • Returns (all subsequent bytes are 0):
0x00010001	0x0cb110de	0x00000026	0x00000000
0xf2000000	0x00000000	0x01000000	0x00000000
0xe0000000	0x00000000	0x10000000	0x00000000
  • 0x0001: flag (NV_IOCTL_CARD_INFO_FLAG_PRESENT)
  • 0x0001: bus/slot
  • 0x0cb110de: vendor + device IDs
    • lspci -n: 01:00.0 0300: 10de:0cb1 (rev a2)
    • lspci -t -v: \-[0000:00]-+-03.0-[01]--+-00.0 nVidia Corporation GT215 [GeForce GTS 360M]
  • 0x26: IRQ line (here, #38)
  • 0xf2000000 00000000: reg_address
  • 0x01000000 00000000: reg_size
  • 0xe0000000 00000000: fb_address
  • 0x10000000 00000000: fb_size
    • these are all system memory references, see /proc/iomem:
  e0000000-f30fffff : PCI Bus 0000:01
    e0000000-efffffff : 0000:01:00.0
    f0000000-f1ffffff : 0000:01:00.0
    f2000000-f2ffffff : 0000:01:00.0
      f2000000-f2ffffff : nvidia
    f3000000-f307ffff : 0000:01:00.0
    f3080000-f3083fff : 0000:01:00.1
      f3080000-f3083fff : ICH HD audio
0xca

NV_ESC_ENV_INFO

0x004 anonymous page cuInit
  • Seems to ignore input value.
  • Writes result value (0x00000001).
typedef struct nv_ioctl_env_info
{
    NvU32 pat_supported;
} nv_ioctl_env_info_t;
0xce

NV_ESC_ALLOC_OS_EVENT

0x14
0xcf

NV_ESC_FREE_OS_EVENT

0xd1

NV_ESC_STATUS_CODE

0xd2

NV_ESC_CHECK_VERSION_STR

0x048 stack cuInit
  • Performed immediately following opening of the nvidiactl device
typedef struct nv_ioctl_rm_api_version
{
    NvU32 cmd;
    NvU32 reply;
    char versionString[NV_RM_API_VERSION_STRING_LENGTH];
} nv_ioctl_rm_api_version_t;

#define NV_RM_API_VERSION_CMD_STRICT         0
#define NV_RM_API_VERSION_CMD_RELAXED       '1'
#define NV_RM_API_VERSION_CMD_OVERRIDE      '2'

#define NV_RM_API_VERSION_REPLY_UNRECOGNIZED 0
#define NV_RM_API_VERSION_REPLY_RECOGNIZED   1
  • 0x312e 3633 2e35 3931 35ull == 195.36.15
    • '1' '.' '6' '3' '.' '5' '9' '1', '5'
    • looks like: all version chars in ascii. first 8 reversed, then any left follow?
  • All other bytes are 0.
  • Writes result to first 8 bytes (0x00000001), leaves others untouched
0x22 0x00c stack cuInit
  • Inputs set to 0.
  • Outputs (example):
3251635025	65	0
  • First value is used as first input word to the majority of subsequent ioctls
  • Second value ranges over (at least) 41--65...
  • Not sent in 256.22/3.10...
0x2a 0x020 stack cuInit
  • GPU method invocation. Second and third words specify the method being called. Fifth and sixth specify the address being passed; seventh and eighth the size thereof.

Sample inputs:

0x7fffffffd310:	3251635025	3251635025	533	0
0x7fffffffd320:	4294955888	32767	132	0
  • First and second words are *not* always equivalent.
  • Outputs are usually unchanged, but not always:
ioctl 2a, 32-byte param, fd 3	0xc1d04214 0x5c000002 0x2080012f 0x00000000 
0x0010				0x950713f0 0x00007fff 0x000000a8 0x00000000 
GPU method 0x5c000002:2080012f	0x00000000 0x00000000 0x00000000 0x00000000 
0x0010				0x00000000 0x00000000 0x00000000 0x00000000 
0x0020				0x00000000 0x00000000 0x00000000 0x00000000 
0x0030				0x00000000 0x00000000 0x00000000 0x00000000 
0x0040				0x00000000 0x00000000 0x00000000 0x00000000 
0x0050				0x00000000 0x00000000 0x00000000 0x00000000 
0x0060				0x00000000 0x00000000 0x00000000 0x00000000 
0x0070				0x00000000 0x00000000 0x00000000 0x00000000 
0x0080				0x00000000 0x00000000 0x00000000 0x00000000 
0x0090				0x00000000 0x00000000 0x00000000 0x00000000 
0x00a0				0x00000000 0x00000000 
RESULT: 0			0xc1d04214 0x5c000002 0x2080012f 0x00000000 
0x0010				0x950713f0 0x00007fff 0x000000a8 0x00000029 
GPU method 0x5c000002:2080012f	**************MODIFICATION FROM CALL
0x00000000 0x00000000 0x00000000 0x00000000 
0x0010				0x00000000 0x00000000 0x00000000 0x00000000 
0x0020				0x00000000 0x00000000 0x00000000 0x00000000 
0x0030				0x00000000 0x00000000 0x00000000 0x00000000 
0x0040				0x00000000 0x00000000 0x00000000 0x00000000 
0x0050				0x00000000 0x00000000 0x00000000 0x00000000 
0x0060				0x00000000 0x00000000 0x00000000 0x00000000 
0x0070				0x00000000 0x00000000 0x00000000 0x00000000 
0x0080				0x00000000 0x00000000 0x00000000 0x00000000 
0x0090				0x00000000 0x00000000 0x00000000 0x00000000 
0x00a0				0x00000000 0x00000000 
0x2b 0x020 stack cuInit
  • GPU object creation(?)
0x4d 0x048 stack cuInit
  • Performed following opening of nvidiaX device
0x2d 0x014 stack cuInit
  • Performed following read of /proc/interrupts
0x4e 0x030 cuInit
  • Immediately prior to first mmap()
0x4f 0x020 cuInit
  • Invoked if mmap() returns MAP_FAILED, prior to failing out
0x54 0x30
0x57 0x038
0x58 0x28
0x59 0x10
/dev/nvidiaX
0x32 0x014 stack cuInit
  • Performed several times in succession
0x37 0x020 stack cuInit
  • Follows burst of 3x 0x32's, then interwoven with bursts of 2a's

GPU methods

Code Param size Notes
0x5c000002 (per-device)
0x20800110 0x84
  • Retrieves device name:
RESULT: 0			0xc1d04277 0x5c000002 0x20800110 0x00000000 
0x0010				0x73be4970 0x00007fff 0x00000084 0x00000000 
GPU method 0x5c000002:20800110	0x00000000 0x6f466547 0x20656372 0x20535447 
0x0010				0x4d303633 0x00000000 0x00000000 0x00000000 
  • 6f46654720656372205354474d303633 == "oFeG ecr STGM063"

disassembly

These disassemblies makes use of libcuda.so.195.36.15 (0867d66be617faab3782fa0ba19ec9ba, 7404990 bytes). Symbols were extracted via objdump -T.

  • AMD64 ABI:
    • Integer arguments via RDI, RSI, RDX, RCX, R8 and R9, then stack
    • FP arguments in XMM0..XMM7, then stack
    • Return value in RAX
    • libcuda traces

See Also