CUDA
Hardware/Emulation
NVIDIA maintains a list of supported hardware. Otherwise, there's emulation...
```
[recombinator](0) $ ~/local/cuda/C/bin/linux/emurelease/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is no device supporting CUDA.

Device 0: "Device Emulation (CPU)"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         9999
  CUDA Capability Minor revision number:         9999
  Total amount of global memory:                 4294967295 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     1
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     No
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED
```
Each device has a compute capability, though this does not encompass all differentiated capabilities (see also deviceOverlap and canMapHostMemory...).
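These properties can be queried without the SDK's deviceQuery tool; a minimal sketch against the runtime API (error handling abbreviated):

```c
/* Minimal sketch: print the compute capability and the capability bits
 * mentioned above for each device. Build with: nvcc -o caps caps.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void){
	int count, i;

	if(cudaGetDeviceCount(&count) != cudaSuccess){
		fprintf(stderr, "Couldn't enumerate CUDA devices\n");
		return 1;
	}
	for(i = 0 ; i < count ; ++i){
		struct cudaDeviceProp prop;

		if(cudaGetDeviceProperties(&prop, i) != cudaSuccess){
			fprintf(stderr, "Couldn't query device %d\n", i);
			return 1;
		}
		printf("%d: %s, compute capability %d.%d, deviceOverlap %d, canMapHostMemory %d\n",
			i, prop.name, prop.major, prop.minor,
			prop.deviceOverlap, prop.canMapHostMemory);
	}
	return 0;
}
```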
Installation on Debian
libcuda-dev packages exist in the non-free archive area, and supply the core library libcuda.so. Together with the upstream toolkit and SDK from NVIDIA, this provides a full CUDA development environment for 64-bit Debian Unstable systems. I installed CUDA 2.3 on 2010-01-25 (hand-rolled 2.6.32.6 kernel, built with gcc-4.4). This machine did not have CUDA-compatible hardware (it uses Intel 965).
- Download the Ubuntu 9.04 files from NVIDIA's "CUDA Zone".
- Run the toolkit installer (sh cudatoolkit_2.3_linux_64_ubuntu9.04.run)
- For a user-mode install, supply $HOME/local or somesuch
```
* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH
*   for 32-bit Linux distributions includes /home/dank/local/cuda/lib
*   for 64-bit Linux distributions includes /home/dank/local/cuda/lib64
* OR
*   for 32-bit Linux distributions add /home/dank/local/cuda/lib
*   for 64-bit Linux distributions add /home/dank/local/cuda/lib64
* to /etc/ld.so.conf and run ldconfig as root
* Please read the release notes in /home/dank/local/cuda/doc/
* To uninstall CUDA, delete /home/dank/local/cuda
* Installation Complete
```
- Run the SDK installer (sh cudasdk_2.3_linux.run)
- I just installed it to the same directory as the toolkit, which seems to work fine.
```
========================================
Configuring SDK Makefile (/home/dank/local/cuda/shared/common.mk)...
========================================

* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /home/dank/local/cuda/lib
* To uninstall the NVIDIA GPU Computing SDK, please delete /home/dank/local/cuda
* Installation Complete
```
Building CUDA Apps
nvcc flags
- --ptxas-options=-v (forwarded to ptxas) displays per-thread register usage
SDK's common.mk
This assumes use of the SDK's common.mk, as recommended by the documentation.
- Add the library path to LD_LIBRARY_PATH, assuming CUDA's been installed to a non-standard directory.
- Set the CUDA_INSTALL_PATH and ROOTDIR (yeargh!) if outside the SDK.
- I keep the following in bin/cudasetup of my home directory. Source it, using sh's . cudasetup syntax:
CUDA="$HOME/local/cuda/" export CUDA_INSTALL_PATH="$CUDA" export ROOTDIR="$CUDA/C/common/" if [ -n "$LD_LIBRARY_PATH" ] ; then export "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64" else export "LD_LIBRARY_PATH=$CUDA/lib64" fi unset CUDA
- Set EXECUTABLE in your Makefile, and include $CUDA_INSTALL_PATH/C/common/common.mk
Unit testing
The .DEFAULT_GOAL special variable of GNU Make can be used:
```make
.PHONY: test
.DEFAULT_GOAL:=test

include $(CUDA_INSTALL_PATH)/C/common/common.mk

test: $(TARGET)
	$(TARGET)
```
Libraries
Two mutually exclusive means of driving CUDA are available: the "Driver API" and "C for CUDA" with its accompanying nvcc compiler and runtime. The latter (libcudart) is built atop the former, and requires its libcuda library.
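By way of contrast, a minimal sketch of the Driver API side (the runtime performs the equivalent of this initialization implicitly on its first call; link against libcuda):

```c
/* Minimal Driver API sketch: the explicit cuInit()/cuDeviceGet() dance
 * which libcudart otherwise hides. Build with: gcc -o drv drv.c -lcuda */
#include <cuda.h>
#include <stdio.h>

int main(void){
	int count, major, minor, i;
	CUdevice dev;

	if(cuInit(0) != CUDA_SUCCESS){
		fprintf(stderr, "Couldn't initialize the Driver API\n");
		return 1;
	}
	if(cuDeviceGetCount(&count) != CUDA_SUCCESS){
		fprintf(stderr, "Couldn't enumerate devices\n");
		return 1;
	}
	for(i = 0 ; i < count ; ++i){
		if(cuDeviceGet(&dev, i) != CUDA_SUCCESS ||
			cuDeviceComputeCapability(&major, &minor, dev) != CUDA_SUCCESS){
			fprintf(stderr, "Couldn't query device %d\n", i);
			return 1;
		}
		printf("device %d: compute capability %d.%d\n", i, major, minor);
	}
	return 0;
}
```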
CUDA model
- A given host thread can execute code on only one device at a time (though multiple host threads can execute code on the same device).
- Each multiprocessor has a register file:
  - 8192 registers for compute capability <= 1.1, otherwise
  - 16384 registers for compute capability <= 1.3.
- A group of threads which share a memory and can "synchronize their execution to coördinate accesses to memory" (use a barrier) form a block. Each thread has a threadId within its (three-dimensional) block.
  - For a block of dimensions <Dx, Dy, Dz>, the threadId of the thread having index <x, y, z> is (x + y * Dx + z * Dy * Dx).
- Register allocation is performed per-block, and rounded up to the nearest:
  - 256 registers per block for compute capability <= 1.1, otherwise
  - 512 registers per block for compute capability <= 1.3.
- A group of blocks which share a kernel form a grid. Each block (and each thread within that block) has a blockId within its (two-dimensional) grid.
  - For a grid of dimensions <Dx, Dy>, the blockId of the block having index <x, y> is (x + y * Dx).
- Thus, a given thread's <blockId, threadId> dyad is unique across the grid. All the threads of a block share a blockId, and corresponding threads of various blocks share a threadId (the sketch following this list computes both).
- Each time the kernel is instantiated, new grid and block dimensions may be provided.
- A block's threads, starting from threadId 0, are broken up into contiguous warps having some warp size number of threads.
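Tying the formulas above to the builtin variables, a minimal sketch (ids is a hypothetical device allocation with one element per launched thread):

```c
/* Minimal sketch: each thread computes the IDs described above from the
 * builtin threadIdx/blockIdx/blockDim/gridDim variables. */
__global__ void idkernel(unsigned *ids){
	/* threadId within the (three-dimensional) block: x + y*Dx + z*Dy*Dx */
	unsigned tid = threadIdx.x + threadIdx.y * blockDim.x +
			threadIdx.z * blockDim.y * blockDim.x;
	/* blockId within the (two-dimensional) grid: x + y*Dx */
	unsigned bid = blockIdx.x + blockIdx.y * gridDim.x;
	/* the <blockId, threadId> dyad, flattened grid-wide */
	unsigned gid = bid * blockDim.x * blockDim.y * blockDim.z + tid;

	ids[gid] = gid;
}

/* grid and block dimensions may be chosen anew at each instantiation */
void launch(unsigned *ids){
	dim3 dgrid(4, 2);	/* 8 blocks */
	dim3 dblock(8, 4, 2);	/* 64 threads per block */

	idkernel<<<dgrid, dblock>>>(ids);
}
```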
| Memory type | Replication | Device access | Host access |
|---|---|---|---|
| Registers | Per-thread | Read-write | None |
| Local memory | Per-thread | Read-write | None |
| Shared memory | Per-block | Read-write | None |
| Global memory | Per-grid | Read-write | Read-write |
| Constant memory | Per-grid | Read | Read-write |
| Texture memory | Per-grid | Read | Read-write |
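A minimal sketch touching each device-side space from the table (the 128-element sizes are arbitrary assumptions; the host would populate coeffs via cudaMemcpyToSymbol(), and can otherwise reach only the global, constant, and texture rows):

```c
/* Minimal sketch of the memory spaces tabulated above. Assumes
 * one-dimensional blocks of exactly 128 threads. */
__constant__ float coeffs[128];	/* constant memory: per-grid, device read-only,
				   host-writable via cudaMemcpyToSymbol() */

__global__ void scale(const float *gin, float *gout){
	__shared__ float stage[128];	/* shared memory: per-block, read-write */
	unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;	/* a register: per-thread */

	stage[threadIdx.x] = gin[idx];	/* gin/gout: global memory, per-grid */
	__syncthreads();	/* barrier: each thread goes on to read a value
				   staged by a different thread */
	gout[idx] = (stage[threadIdx.x] + stage[127 - threadIdx.x]) * coeffs[threadIdx.x];
}
```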