CUDA
Hardware/Emulation
NVIDIA maintains a list of supported hardware. Otherwise, there's emulation...
```
[recombinator](0) $ ~/local/cuda/C/bin/linux/emurelease/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is no device supporting CUDA.

Device 0: "Device Emulation (CPU)"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         9999
  CUDA Capability Minor revision number:         9999
  Total amount of global memory:                 4294967295 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     1
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     No
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED
```
Each device has a compute capability, though this does not encompass all differentiated capabilities (see also deviceOverlap and canMapHostMemory...).
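These properties can be queried without the SDK's deviceQuery tool; a minimal sketch against the runtime API (error handling abbreviated):

```c
/* Minimal sketch: print the compute capability and the capability bits
 * mentioned above for each device. Build with: nvcc -o caps caps.cu */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void){
	int count, i;

	if(cudaGetDeviceCount(&count) != cudaSuccess){
		fprintf(stderr, "Couldn't enumerate CUDA devices\n");
		return 1;
	}
	for(i = 0 ; i < count ; ++i){
		struct cudaDeviceProp prop;

		if(cudaGetDeviceProperties(&prop, i) != cudaSuccess){
			fprintf(stderr, "Couldn't query device %d\n", i);
			return 1;
		}
		printf("%d: %s, compute capability %d.%d, deviceOverlap %d, canMapHostMemory %d\n",
			i, prop.name, prop.major, prop.minor,
			prop.deviceOverlap, prop.canMapHostMemory);
	}
	return 0;
}
```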
Installation on Debian
libcuda-dev packages exist in the non-free archive area, and supply the core library libcuda.so. Together with the upstream toolkit and SDK from NVIDIA, this provides a full CUDA development environment for 64-bit Debian Unstable systems. I installed CUDA 2.3 on 2010-01-25 (hand-rolled 2.6.32.6 kernel, built with gcc-4.4). This machine did not have CUDA-compatible hardware (it uses Intel 965).
- Download the Ubuntu 9.04 files from NVIDIA's "CUDA Zone".
- Run the toolkit installer (sh cudatoolkit_2.3_linux_64_ubuntu9.04.run)
- For a user-mode install, supply $HOME/local or somesuch
```
* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH
*   for 32-bit Linux distributions includes /home/dank/local/cuda/lib
*   for 64-bit Linux distributions includes /home/dank/local/cuda/lib64
* OR
*   for 32-bit Linux distributions add /home/dank/local/cuda/lib
*   for 64-bit Linux distributions add /home/dank/local/cuda/lib64
* to /etc/ld.so.conf and run ldconfig as root
* Please read the release notes in /home/dank/local/cuda/doc/
* To uninstall CUDA, delete /home/dank/local/cuda
* Installation Complete
```
- Run the SDK installer (sh cudasdk_2.3_linux.run)
- I just installed it to the same directory as the toolkit, which seems to work fine.
```
========================================
Configuring SDK Makefile (/home/dank/local/cuda/shared/common.mk)...
========================================

* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /home/dank/local/cuda/lib
* To uninstall the NVIDIA GPU Computing SDK, please delete /home/dank/local/cuda
* Installation Complete
```
Building CUDA Apps
nvcc flags
- --ptxas-options=-v (forwarded to ptxas) displays per-thread register usage
SDK's common.mk
This assumes use of the SDK's common.mk, as recommended by the documentation.
- Add the library path to LD_LIBRARY_PATH, assuming CUDA's been installed to a non-standard directory.
- Set the CUDA_INSTALL_PATH and ROOTDIR (yeargh!) if outside the SDK.
- I keep the following in bin/cudasetup of my home directory. Source it, using sh's . cudasetup syntax:
CUDA="$HOME/local/cuda/" export CUDA_INSTALL_PATH="$CUDA" export ROOTDIR="$CUDA/C/common/" if [ -n "$LD_LIBRARY_PATH" ] ; then export "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64" else export "LD_LIBRARY_PATH=$CUDA/lib64" fi unset CUDA
- Set EXECUTABLE in your Makefile, and include $CUDA_INSTALL_PATH/C/common/common.mk
Unit testing
The .DEFAULT_GOAL special variable of GNU Make can be used:
```make
.PHONY: test
.DEFAULT_GOAL:=test

include $(CUDA_INSTALL_PATH)/C/common/common.mk

test: $(TARGET)
	$(TARGET)
```
Libraries
Two mutually exclusive means of driving CUDA are available: the "Driver API" and "C for CUDA" with its accompanying nvcc compiler and runtime. The latter (libcudart) is built atop the former, and requires its libcuda library.
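By way of contrast, a minimal sketch of the Driver API side (the runtime performs the equivalent of this initialization implicitly on its first call; link against libcuda):

```c
/* Minimal Driver API sketch: the explicit cuInit()/cuDeviceGet() dance
 * which libcudart otherwise hides. Build with: gcc -o drv drv.c -lcuda */
#include <cuda.h>
#include <stdio.h>

int main(void){
	int count, major, minor, i;
	CUdevice dev;

	if(cuInit(0) != CUDA_SUCCESS){
		fprintf(stderr, "Couldn't initialize the Driver API\n");
		return 1;
	}
	if(cuDeviceGetCount(&count) != CUDA_SUCCESS){
		fprintf(stderr, "Couldn't enumerate devices\n");
		return 1;
	}
	for(i = 0 ; i < count ; ++i){
		if(cuDeviceGet(&dev, i) != CUDA_SUCCESS ||
			cuDeviceComputeCapability(&major, &minor, dev) != CUDA_SUCCESS){
			fprintf(stderr, "Couldn't query device %d\n", i);
			return 1;
		}
		printf("device %d: compute capability %d.%d\n", i, major, minor);
	}
	return 0;
}
```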
CUDA model
- A given host thread can execute code on only one device at a time (though multiple host threads can execute code on the same device).
- Each multiprocessor has a register file:
  - 8192 registers for compute capability <= 1.1, otherwise
  - 16384 registers for compute capability <= 1.3.
- A group of threads which share a memory and can "synchronize their execution to coördinate accesses to memory" (use a barrier) form a block. Each thread has a threadId within its (three-dimensional) block.
  - For a block of dimensions <Dx, Dy, Dz>, the threadId of the thread having index <x, y, z> is (x + y * Dx + z * Dy * Dx).
- Register allocation is performed per-block, and rounded up to the nearest:
  - 256 registers per block for compute capability <= 1.1, otherwise
  - 512 registers per block for compute capability <= 1.3.
- A group of blocks which share a kernel form a grid. Each block (and each thread within that block) has a blockId within its (two-dimensional) grid.
  - For a grid of dimensions <Dx, Dy>, the blockId of the block having index <x, y> is (x + y * Dx).
- Thus, a given thread's <blockId, threadId> dyad is unique across the grid. All the threads of a block share a blockId, and corresponding threads of various blocks share a threadId (the sketch following this list computes both).
- Each time the kernel is instantiated, new grid and block dimensions may be provided.
- A block's threads, starting from threadId 0, are broken up into contiguous warps having some warp size number of threads.
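Tying the formulas above to the builtin variables, a minimal sketch (ids is a hypothetical device allocation with one element per launched thread):

```c
/* Minimal sketch: each thread computes the IDs described above from the
 * builtin threadIdx/blockIdx/blockDim/gridDim variables. */
__global__ void idkernel(unsigned *ids){
	/* threadId within the (three-dimensional) block: x + y*Dx + z*Dy*Dx */
	unsigned tid = threadIdx.x + threadIdx.y * blockDim.x +
			threadIdx.z * blockDim.y * blockDim.x;
	/* blockId within the (two-dimensional) grid: x + y*Dx */
	unsigned bid = blockIdx.x + blockIdx.y * gridDim.x;
	/* the <blockId, threadId> dyad, flattened grid-wide */
	unsigned gid = bid * blockDim.x * blockDim.y * blockDim.z + tid;

	ids[gid] = gid;
}

/* grid and block dimensions may be chosen anew at each instantiation */
void launch(unsigned *ids){
	dim3 dgrid(4, 2);	/* 8 blocks */
	dim3 dblock(8, 4, 2);	/* 64 threads per block */

	idkernel<<<dgrid, dblock>>>(ids);
}
```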
| Memory type | Replication | Device access | Host access |
|---|---|---|---|
| Registers | Per-thread | Read-write | None |
| Local memory | Per-thread | Read-write | None |
| Shared memory | Per-block | Read-write | None |
| Global memory | Per-grid | Read-write | Read-write |
| Constant memory | Per-grid | Read | Read-write |
| Texture memory | Per-grid | Read | Read-write |
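A minimal sketch touching each device-side space from the table (the 128-element sizes are arbitrary assumptions; the host would populate coeffs via cudaMemcpyToSymbol(), and can otherwise reach only the global, constant, and texture rows):

```c
/* Minimal sketch of the memory spaces tabulated above. Assumes
 * one-dimensional blocks of exactly 128 threads. */
__constant__ float coeffs[128];	/* constant memory: per-grid, device read-only,
				   host-writable via cudaMemcpyToSymbol() */

__global__ void scale(const float *gin, float *gout){
	__shared__ float stage[128];	/* shared memory: per-block, read-write */
	unsigned idx = blockIdx.x * blockDim.x + threadIdx.x;	/* a register: per-thread */

	stage[threadIdx.x] = gin[idx];	/* gin/gout: global memory, per-grid */
	__syncthreads();	/* barrier: each thread goes on to read a value
				   staged by a different thread */
	gout[idx] = (stage[threadIdx.x] + stage[127 - threadIdx.x]) * coeffs[threadIdx.x];
}
```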