CUDA
Hardware
NVIDIA maintains a list of supported hardware. For actual hardware, you'll need the "nvidia.ko" kernel module. Download the nvidia-kernel-source and nvidia-kernel-common packages, unpack /usr/src/nvidia-kernel.tar.bz2, and run make-kpkg modules_image. Install the resulting .deb, and modprobe nvidia. You'll see something like this in dmesg output:
```
nvidia: module license 'NVIDIA' taints kernel.
Disabling lock debugging due to kernel taint
nvidia 0000:07:00.0: enabling device (0000 -> 0003)
nvidia 0000:07:00.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
nvidia 0000:07:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  190.53  Wed Dec 9 15:29:46 PST 2009
```
Once the module is loaded, CUDA should be able to find the device. See below for sample outputs.
Emulation
Otherwise, there's emulation...
```
[recombinator](0) $ ~/local/cuda/C/bin/linux/emurelease/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is no device supporting CUDA.

Device 0: "Device Emulation (CPU)"
  CUDA Driver Version: 2.30
  CUDA Runtime Version: 2.30
  CUDA Capability Major revision number: 9999
  CUDA Capability Minor revision number: 9999
  Total amount of global memory: 4294967295 bytes
  Number of multiprocessors: 16
  Number of cores: 128
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 1
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.35 GHz
  Concurrent copy and execution: No
  Run time limit on kernels: No
  Integrated: Yes
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)

Test PASSED
```
Each device has a compute capability, though this does not encompass all of its differentiated capabilities (see also the deviceOverlap and canMapHostMemory properties...).
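As a quick illustration (a sketch of mine, not from the SDK), the runtime API exposes these properties via cudaGetDeviceProperties():

```c
/* Hedged sketch: enumerate devices via the runtime API and print compute
 * capability plus a few of the boolean capability fields mentioned above. */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void){
	int count;

	if(cudaGetDeviceCount(&count) != cudaSuccess){
		fprintf(stderr, "Couldn't enumerate CUDA devices\n");
		return 1;
	}
	for(int i = 0 ; i < count ; ++i){
		struct cudaDeviceProp prop;

		if(cudaGetDeviceProperties(&prop, i) != cudaSuccess){
			return 1;
		}
		printf("Device %d: \"%s\", compute capability %d.%d\n",
				i, prop.name, prop.major, prop.minor);
		printf("  deviceOverlap: %d canMapHostMemory: %d integrated: %d\n",
				prop.deviceOverlap, prop.canMapHostMemory, prop.integrated);
	}
	return 0;
}
```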
Installation on Debian
libcuda-dev packages exist in the non-free archive area, and supply the core library libcuda.so. Together with the upstream toolkit and SDK from NVIDIA, this provides a full CUDA development environment for 64-bit Debian Unstable systems. I installed CUDA 2.3 on 2010-01-25 (hand-rolled 2.6.32.6 kernel, built with gcc-4.4). This machine did not have CUDA-compatible hardware (it uses Intel 965).
- Download the Ubuntu 9.04 files from NVIDIA's "CUDA Zone".
- Run the toolkit installer (sh cudatoolkit_2.3_linux_64_ubuntu9.04.run)
- For a user-mode install, supply $HOME/local or somesuch
```
* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH
*   for 32-bit Linux distributions includes /home/dank/local/cuda/lib
*   for 64-bit Linux distributions includes /home/dank/local/cuda/lib64
* OR
*   for 32-bit Linux distributions add /home/dank/local/cuda/lib
*   for 64-bit Linux distributions add /home/dank/local/cuda/lib64
*   to /etc/ld.so.conf and run ldconfig as root
* Please read the release notes in /home/dank/local/cuda/doc/
* To uninstall CUDA, delete /home/dank/local/cuda
* Installation Complete
```
- Run the SDK installer (sh cudasdk_2.3_linux.run)
- I just installed it to the same directory as the toolkit, which seems to work fine.
```
========================================
Configuring SDK Makefile (/home/dank/local/cuda/shared/common.mk)...
========================================

* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /home/dank/local/cuda/lib
* To uninstall the NVIDIA GPU Computing SDK, please delete /home/dank/local/cuda
* Installation Complete
```
Building CUDA Apps
nvcc flags
- --ptxas-options=-v displays per-kernel resource usage, including registers per thread
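For instance (file and kernel names here are hypothetical), compiling the following with nvcc --ptxas-options=-v -c regdemo.cu makes ptxas report the kernel's register count along with its shared and constant memory usage:

```c
// regdemo.cu -- a trivial kernel for exercising ptxas's verbose output.
__global__ void saxpy(float a, const float *x, float *y, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < n){
		y[i] = a * x[i] + y[i];
	}
}
```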
SDK's common.mk
This assumes use of the SDK's common.mk, as recommended by the documentation.
- Add the library path to LD_LIBRARY_PATH, assuming CUDA's been installed to a non-standard directory.
- Set CUDA_INSTALL_PATH and ROOTDIR (yeargh!) if building outside the SDK tree.
- I keep the following in bin/cudasetup of my home directory. Source it, using sh's . cudasetup syntax:
```sh
CUDA="$HOME/local/cuda/"
export CUDA_INSTALL_PATH="$CUDA"
export ROOTDIR="$CUDA/C/common/"
if [ -n "$LD_LIBRARY_PATH" ] ; then
	export "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64"
else
	export "LD_LIBRARY_PATH=$CUDA/lib64"
fi
unset CUDA
```
- Set EXECUTABLE in your Makefile, and include $CUDA_INSTALL_PATH/C/common/common.mk
Unit testing
GNU Make's .DEFAULT_GOAL special variable can be used:
```make
.PHONY: test

.DEFAULT_GOAL:=test

include $(CUDA_INSTALL_PATH)/C/common/common.mk

test: $(TARGET)
	$(TARGET)
```
Libraries
Two mutually exclusive means of driving CUDA are available: the "Driver API", and "C for CUDA" with its accompanying nvcc compiler and runtime. The latter's runtime (libcudart) is built atop the former, and thus requires its libcuda library.
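A minimal sketch of the "C for CUDA" (runtime API) route, assuming nothing beyond libcudart (all names are illustrative, not taken from the SDK):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Scale a vector in place on the device.
__global__ void scale(float *x, float a, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < n){
		x[i] *= a;
	}
}

int main(void){
	const int N = 1 << 20;
	float *h, *d;

	if((h = (float *)malloc(N * sizeof(*h))) == NULL){
		return 1;
	}
	for(int i = 0 ; i < N ; ++i){
		h[i] = 1.0f;
	}
	if(cudaMalloc((void **)&d, N * sizeof(*d)) != cudaSuccess){
		return 1;
	}
	cudaMemcpy(d, h, N * sizeof(*d), cudaMemcpyHostToDevice);
	scale<<<(N + 255) / 256, 256>>>(d, 2.0f, N);	// triple-chevron launch: runtime API only
	cudaMemcpy(h, d, N * sizeof(*d), cudaMemcpyDeviceToHost);
	printf("h[0] == %f\n", h[0]);	// expect 2.000000
	cudaFree(d);
	free(h);
	return 0;
}
```

The Driver API reaches the same hardware through cuInit(), explicit module loading, and cuLaunchGrid(), at the cost of considerably more boilerplate.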
CUDA model
Host
- A host contains zero or more CUDA-capable devices (emulation must be used if zero devices are available).
- It can run multiple CUDA processes, each composed of one or more host threads.
- A given host thread can execute code on only one device at once.
- Multiple host threads can execute code on the same device.
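To make the threading model concrete, here's a hedged sketch (illustrative names; assumes pthreads, so link with -lpthread) in which each host thread binds itself to a distinct device:

```c
#include <stdio.h>
#include <pthread.h>
#include <cuda_runtime.h>

#define MAXDEVICES 16	// arbitrary cap for this sketch

// Each host thread selects its own device before doing any CUDA work.
static void *worker(void *arg){
	int dev = *(int *)arg;
	void *p;

	if(cudaSetDevice(dev) != cudaSuccess){
		return NULL;
	}
	if(cudaMalloc(&p, 1024) == cudaSuccess){	// this allocation lives on device 'dev'
		cudaFree(p);
		printf("host thread drove device %d\n", dev);
	}
	return NULL;
}

int main(void){
	pthread_t tids[MAXDEVICES];
	int ids[MAXDEVICES], count;

	if(cudaGetDeviceCount(&count) != cudaSuccess || count < 1){
		fprintf(stderr, "No CUDA devices found\n");
		return 1;
	}
	if(count > MAXDEVICES){
		count = MAXDEVICES;
	}
	for(int i = 0 ; i < count ; ++i){
		ids[i] = i;
		pthread_create(&tids[i], NULL, worker, &ids[i]);
	}
	for(int i = 0 ; i < count ; ++i){
		pthread_join(tids[i], NULL);
	}
	return 0;
}
```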
Device
- A device packages a streaming processor array (SPA), a memory interface, and possibly memory (global memory, aka device memory).
- In CUDA terminology, an integrated (vs discrete) device does not have its own global memory.
- Specially-prepared global memory is designated constant memory, and can be cached.
- Pinned (page-locked) host memory avoids a bounce buffer, accelerating transfers (see the sketch following this list).
  - Larger one-time setup cost, due to device register programming for DMA transfers.
  - This memory will be unswappable -- allocate only as much as is needed.
  - Pinned memory can be mapped directly into CUDA space on integrated devices, or in the presence of some IOMMUs.
    - "Zero (explicit)-copy" interface (it can never hide all bus delays).
  - Write-combining memory (configured via MTRRs or PATs) avoids PCI snoop requirements and maximizes linear throughput.
    - Subtle side effects; not to be used glibly or carelessly!
- Distributes work at block granularity to Texture Processing Clusters (TPCs).
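The pinned-memory sketch referenced above: page-locked host memory, mapped into the device's address space for zero-copy access (assumes a device reporting canMapHostMemory; all names are mine):

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void touch(float *x, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < n){
		x[i] += 1.0f;	// reads and writes cross the bus ("zero-copy")
	}
}

int main(void){
	const int N = 4096;
	float *h, *d;

	// Must precede any other CUDA work in this host thread.
	if(cudaSetDeviceFlags(cudaDeviceMapHost) != cudaSuccess){
		return 1;
	}
	// Page-locked allocation, mapped into the device's address space.
	if(cudaHostAlloc((void **)&h, N * sizeof(*h), cudaHostAllocMapped) != cudaSuccess){
		return 1;
	}
	for(int i = 0 ; i < N ; ++i){
		h[i] = 0.0f;
	}
	if(cudaHostGetDevicePointer((void **)&d, h, 0) != cudaSuccess){
		return 1;
	}
	touch<<<(N + 255) / 256, 256>>>(d, N);
	cudaThreadSynchronize();	// contemporary (CUDA 2.x) synchronization call
	printf("h[0] == %f\n", h[0]);	// expect 1.000000
	cudaFreeHost(h);
	return 0;
}
```

cudaMallocHost() gets you pinned-but-unmapped memory; the cudaHostAllocWriteCombined flag requests the write-combined flavor described above.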
Texture Processing Cluster
Streaming Multiprocessors (SMs) are grouped into TPCs. Each TPC contains some number of SMs and a single texture processing unit, including a few filters and a cache for texture memory. The details of these texture caches have not generally been publicized, but NVIDIA optimization guides confirm 1- and 2-dimensional spatial caching to be in effect.
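A hedged sketch of reading linear device memory through the texture path, using the texture-reference API of that era (names are illustrative):

```c
#include <stdio.h>
#include <cuda_runtime.h>

texture<float, 1, cudaReadModeElementType> texref;	// file-scope texture reference

__global__ void viatex(float *out, int n){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < n){
		out[i] = tex1Dfetch(texref, i);	// fetched via the TPC's texture cache
	}
}

int main(void){
	const int N = 1024;
	float h[N], *din, *dout;

	for(int i = 0 ; i < N ; ++i){
		h[i] = (float)i;
	}
	cudaMalloc((void **)&din, sizeof(h));
	cudaMalloc((void **)&dout, sizeof(h));
	cudaMemcpy(din, h, sizeof(h), cudaMemcpyHostToDevice);
	cudaBindTexture(NULL, texref, din, sizeof(h));	// bind linear memory to the reference
	viatex<<<(N + 255) / 256, 256>>>(dout, N);
	cudaMemcpy(h, dout, sizeof(h), cudaMemcpyDeviceToHost);
	printf("h[1023] == %f\n", h[1023]);	// expect 1023.000000
	cudaUnbindTexture(texref);
	cudaFree(din);
	cudaFree(dout);
	return 0;
}
```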
Streaming Multiprocessor
- Each SM has a register file, fast local (shared) memory, a cache for constant memory, an instruction cache (ROP), a multithreaded instruction dispatcher, and some number of Stream Processors (SPs).
  - 8192 registers for compute capability <= 1.1, otherwise
  - 16384 registers for compute capability <= 1.3
- A group of threads which share a memory and can "synchronize their execution to coördinate accesses to memory" (use a barrier) form a block. Each thread has a threadId within its (three-dimensional) block.
  - For a block of dimensions <Dx, Dy, Dz>, the threadId of the thread having index <x, y, z> is (x + y * Dx + z * Dy * Dx).
- Register allocation is performed per-block, and rounded up to the nearest
  - 256 registers per block for compute capability <= 1.1, otherwise
  - 512 registers per block for compute capability <= 1.3.
- A group of blocks which share a kernel form a grid. Each block (and each thread within that block) has a blockId within its (two-dimensional) grid.
  - For a grid of dimensions <Dx, Dy>, the blockId of the block having index <x, y> is (x + y * Dx).
- Thus, a given thread's <blockId, threadId> pair is unique across the grid. All the threads of a block share a blockId, and corresponding threads of various blocks share a threadId (the sketch following this list exercises these formulas).
- Each time the kernel is instantiated, new grid and block dimensions may be provided.
- A block's threads, starting from threadId 0, are broken up into contiguous warps having some warp size number of threads.
- Distributes work at thread granularity to SPs.
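The indexing sketch referenced above (illustrative names) computes the per-block threadId and per-grid blockId exactly as defined, then linearizes the pair into a grid-unique index:

```c
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void whoami(unsigned *out){
	// threadId within the (three-dimensional) block: x + y*Dx + z*Dy*Dx
	unsigned tid = threadIdx.x + threadIdx.y * blockDim.x +
			threadIdx.z * blockDim.y * blockDim.x;
	// blockId within the (two-dimensional) grid: x + y*Dx
	unsigned bid = blockIdx.x + blockIdx.y * gridDim.x;
	// the <blockId, threadId> pair is unique across the grid; linearize it
	unsigned gid = bid * (blockDim.x * blockDim.y * blockDim.z) + tid;

	out[gid] = gid;
}

int main(void){
	dim3 block(4, 2, 2);	// Dx=4, Dy=2, Dz=2 -> 16 threads per block
	dim3 grid(3, 2);	// Dx=3, Dy=2 -> 6 blocks
	const unsigned n = 16 * 6;
	unsigned h[16 * 6], *d;

	cudaMalloc((void **)&d, n * sizeof(*d));
	whoami<<<grid, block>>>(d);
	cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
	printf("last index: %u\n", h[n - 1]);	// expect 95
	cudaFree(d);
	return 0;
}
```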
Stream Processor
- In-order, multithreaded processor: latencies can be hidden only by TLP, not ILP.
- Arithmetic intensity and parallelism are paramount!
- Memory-bound kernels require sufficiently high occupancy (the ratio of concurrently-running warps to the maximum possible concurrent warps, usually reckoned per SM) to hide latency.
- No branch prediction or speculation (and thus also no pipeline flushes on mispredicted branches).
Memory type | Replication | Kernel access | Host access | Cache location |
---|---|---|---|---|
Registers | Per-thread | Read-write | None | None |
Local memory | Per-thread | Read-write | None | None |
Shared memory | Per-block | Read-write | None | None |
Global memory | Per-grid | Read-write | Read-write | None |
Constant memory | Per-grid | Read | Read-write | Streaming multiprocessor |
Texture memory | Per-grid | Read | Read-write | Texture processing cluster |
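As a rough illustration of how the table's memory types appear in C for CUDA source (a sketch of mine, not from the SDK; assumes 256-thread blocks):

```c
#include <cuda_runtime.h>

__constant__ float coeff[256];	// constant memory: per-grid, read-only in kernels, cached per-SM
__device__ float table[1024];	// global memory: per-grid, read-write, uncached

__global__ void demo(const float *in, float *out, int n){
	__shared__ float tile[256];	// shared memory: per-block, read-write, on-chip
	int i = blockIdx.x * blockDim.x + threadIdx.x;
	float v;			// scalars live in per-thread registers
	float spill[64];		// dynamically-indexed per-thread arrays typically
					// end up in (slow, per-thread) local memory

	tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;	// stage global memory through shared
	__syncthreads();
	v = tile[threadIdx.x] * coeff[threadIdx.x];
	spill[threadIdx.x % 64] = v;
	if(i < n){
		out[i] = spill[threadIdx.x % 64] + table[i % 1024];
	}
}

int main(void){
	const int N = 1024;
	float hcoeff[256], *din, *dout;

	for(int i = 0 ; i < 256 ; ++i){
		hcoeff[i] = 1.0f;
	}
	cudaMemcpyToSymbol(coeff, hcoeff, sizeof(hcoeff));	// host writes constant memory
	cudaMalloc((void **)&din, N * sizeof(*din));
	cudaMalloc((void **)&dout, N * sizeof(*dout));
	cudaMemset(din, 0, N * sizeof(*din));
	demo<<<N / 256, 256>>>(din, dout, N);
	cudaThreadSynchronize();
	cudaFree(din);
	cudaFree(dout);
	return 0;
}
```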
Compute Capabilities
The original public CUDA revision was 1.0, implemented on the NV50 chipset corresponding to the GeForce 8 series. Compute capability, formed of a non-negative major and minor revision number, can be queried on CUDA-capable cards. All revisions thus far have been backwards-compatible.
Revision | Changes |
---|---|
1.1 | Atomic ops on 32-bit global integers. Breakpoints and other debugging support. |
1.2 | Atomic ops on 64-bit global integers and 32-bit shared integers. 32 warps (1024 threads) and 16K registers per multiprocessor (MP). Vote instructions. Three MPs per Texture Processing Cluster (TPC). Relaxed memory coalescing constraints. |
1.3 | Double-precision floating point at 32 cycles per operation. |
2.0 | Atomic addition on 32-bit global and shared FP. 48 warps (1536 threads), 48K shared memory banked 32 ways, and 32K registers per MP. 512K local memory per thread. __syncthreads_{count,and,or}(), __threadfence_system(), and __ballot(). 1024 threads per block and blockIdx.{x,y} values ranging through 1024. Larger texture references. |
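For example, the 1.1-era global atomics from the table look like this in practice (a sketch with made-up names; build with something like nvcc -arch=sm_11):

```c
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void count_odd(const int *in, int n, int *odd){
	int i = blockIdx.x * blockDim.x + threadIdx.x;

	if(i < n && (in[i] & 1)){
		atomicAdd(odd, 1);	// 32-bit global atomic: compute capability >= 1.1
	}
}

int main(void){
	const int N = 1 << 16;
	int *h, *din, *dodd, odd = 0;

	if((h = (int *)malloc(N * sizeof(*h))) == NULL){
		return 1;
	}
	for(int i = 0 ; i < N ; ++i){
		h[i] = i;
	}
	cudaMalloc((void **)&din, N * sizeof(*din));
	cudaMalloc((void **)&dodd, sizeof(*dodd));
	cudaMemcpy(din, h, N * sizeof(*din), cudaMemcpyHostToDevice);
	cudaMemcpy(dodd, &odd, sizeof(odd), cudaMemcpyHostToDevice);
	count_odd<<<(N + 255) / 256, 256>>>(din, N, dodd);
	cudaMemcpy(&odd, dodd, sizeof(odd), cudaMemcpyDeviceToHost);
	printf("%d odd values\n", odd);	// expect 32768
	cudaFree(din);
	cudaFree(dodd);
	free(h);
	return 0;
}
```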
deviceQuery info
Compute capability 2.0
Compute capability 1.3
Tesla C1060
Device 0: "Tesla C1060" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 4294705152 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.30 GHz Concurrent copy and execution: Yes Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
GeForce GTX 295
Device 1: "GeForce GTX 295" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 939261952 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.24 GHz Concurrent copy and execution: Yes Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
GeForce GTX 280
Device 0: "GeForce GTX 280" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 1073020928 bytes Number of multiprocessors: 30 Number of cores: 240 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.30 GHz Concurrent copy and execution: Yes Run time limit on kernels: Yes Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
GeForce GTX 260
Device 0: "GeForce GTX 260" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 3 Total amount of global memory: 938803200 bytes Number of multiprocessors: 27 Number of cores: 216 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.47 GHz Concurrent copy and execution: Yes Run time limit on kernels: Yes Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
Compute capability 1.2
GeForce 310 (PCIe x16)
Device 0: "GeForce 310" CUDA Driver Version: 3.0 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 2 Total amount of global memory: 536084480 bytes Number of multiprocessors: 2 Number of cores: 16 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 16384 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.40 GHz Concurrent copy and execution: Yes Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: Yes Compute mode: Default (multiple host threads can use this device simultaneously)
GeForce GT 240
```
Device 0: "GeForce GT 240"
  CUDA Driver Version: 3.0
  CUDA Runtime Version: 2.30
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 2
  Total amount of global memory: 1073414144 bytes
  Number of multiprocessors: 12
  Number of cores: 96
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 16384
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512,512,64
  Maximum sizes of each dimension of a grid: 65535,65535,1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.424 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: Yes
  Compute mode: Default (multiple host threads can use this device simultaneously)
```
Compute capability 1.1
Quadro FX 570
Device 1: "Quadro FX 570" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 1 Total amount of global memory: 268107776 bytes Number of multiprocessors: 2 Number of cores: 16 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 0.92 GHz Concurrent copy and execution: Yes Run time limit on kernels: Yes Integrated: No Support host page-locked memory mapping: No Compute mode: Default (multiple host threads can use this device simultaneously)
GeForce 9600 GT
```
Device 0: "GeForce 9600 GT"
  CUDA Driver Version: 2.30
  CUDA Runtime Version: 2.30
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 1
  Total amount of global memory: 536543232 bytes
  Number of multiprocessors: 8
  Number of cores: 64
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1.50 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: No
  Compute mode: Default (multiple host threads can use this device simultaneously)
```
GeForce 9400M
Device 0: "GeForce 9400M" Major revision number: 1 Minor revision number: 1 Total amount of global memory: 266010624 bytes Number of multiprocessors: 2 Number of cores: 16 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 0.80 GHz Concurrent copy and execution: No
GeForce 8800 GTS 512
Device 0: "GeForce 8800 GTS 512" Major revision number: 1 Minor revision number: 1 Total amount of global memory: 536150016 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1674000 kilohertz
GeForce 8600 GT
Device 0: "GeForce 8600 GT" Major revision number: 1 Minor revision number: 1 Total amount of global memory: 268107776 bytes Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1674000 kilohertz
GeForce 8400 GS (PCI)
Device 0: "GeForce 8400 GS" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 1 Total amount of global memory: 536608768 bytes Number of multiprocessors: 1 Number of cores: 8 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.40 GHz Concurrent copy and execution: No Run time limit on kernels: No Integrated: No Support host page-locked memory mapping: No Compute mode: Default (multiple host threads can use this device simultaneously)
Compute capability 1.0
GeForce 8800 Ultra
Device 0: "GeForce 8800 Ultra" CUDA Driver Version: 2.30 CUDA Runtime Version: 2.30 CUDA Capability Major revision number: 1 CUDA Capability Minor revision number: 0 Total amount of global memory: 804585472 bytes Number of multiprocessors: 16 Number of cores: 128 Total amount of constant memory: 65536 bytes Total amount of shared memory per block: 16384 bytes Total number of registers available per block: 8192 Warp size: 32 Maximum number of threads per block: 512 Maximum sizes of each dimension of a block: 512 x 512 x 64 Maximum sizes of each dimension of a grid: 65535 x 65535 x 1 Maximum memory pitch: 262144 bytes Texture alignment: 256 bytes Clock rate: 1.51 GHz Concurrent copy and execution: No Run time limit on kernels: Yes Integrated: No Support host page-locked memory mapping: No Compute mode: Default (multiple host threads can use this device simultaneously)
See Also
- The nouveau Wiki's CUDA page
- The gpuocelot project, hosted on Google Code.
- The NVIDIA GPU Developer Zone