CUDA: Difference between revisions

Revision as of 13:12, 15 March 2010

Hardware

NVIDIA maintains a list of supported hardware. For actual hardware, you'll need the "nvidia.ko" kernel module. Download the nvidia-kernel-source and nvidia-kernel-common packages, unpack /usr/src/nvidia-kernel.tar.bz2, and run make-kpkg modules_image. Install the resulting .deb, and modprobe nvidia. You'll see something like this in dmesg output:

nvidia: module license 'NVIDIA' taints kernel.
Disabling lock debugging due to kernel taint
nvidia 0000:07:00.0: enabling device (0000 -> 0003)
nvidia 0000:07:00.0: PCI INT A -> GSI 21 (level, low) -> IRQ 21
nvidia 0000:07:00.0: setting latency timer to 64
NVRM: loading NVIDIA UNIX x86_64 Kernel Module  190.53  Wed Dec  9 15:29:46 PST 2009

Once the module is loaded, CUDA should be able to find the device. See below for sample outputs.

Emulation

Otherwise, there's emulation...

[recombinator](0) $ ~/local/cuda/C/bin/linux/emurelease/deviceQuery
CUDA Device Query (Runtime API) version (CUDART static linking)
There is no device supporting CUDA.

Device 0: "Device Emulation (CPU)"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         9999
  CUDA Capability Minor revision number:         9999
  Total amount of global memory:                 4294967295 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     1
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.35 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     No
  Integrated:                                    Yes
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Test PASSED

Each device has a compute capability, though this does not encompass all differentiated capabilities (see also deviceOverlap and canMapHostMemory...).

CUDA model

Host

A host contains zero or more CUDA-capable devices (emulation must be used if zero devices are available).
It can run multiple CUDA processes, each composed of one or more host threads.
A given host thread can execute code on only one device at once.
Multiple host threads can execute code on the same device.

Device

A device packages a streaming processor array (SPA), a memory interface, and possibly memory (global memory. device memory).
- In CUDA terminology, an integrated (vs discrete) device does not have its own global memory.
- Specially-prepared global memory is designated constant memory, and can be cached.
Pinned (locked) host memory avoids a bounce buffer, accelerating transfers.
- Larger one-time setup cost due to device register programming for DMA transfers.
- This memory will be unswappable -- allocate only as much as is needed.
Pinned memory can be mapped directly into CUDAspace on integrated devices or in the presence of some IOMMUs.
- "Zero (explicit)-copy" interface (can never hide all bus delays)
Write-combining memory (configured via MTRRs or PATs) avoids PCI snoop requirements and maximizes linear throughput
- Subtle side-effects; not to be used glibly or carelessly!
Distributes work at block granularity to Texture Processing Clusters (TPCs).

Texture Processing Cluster

Streaming Multiprocessors (SMs) are grouped into TPCs. Each TPC contains some number of SMs and a single texture processing unit, including a few filters and a cache for texture memory. The details of these texture caches have not generally been publicized, but NVIDIA optimization guides confirm 1- and 2-dimensional spatial caching to be in effect.

Streaming Multiprocessor

Each SM has a register file, fast local (shared) memory, a cache for constant memory, an instruction cache (ROP), a multithreaded instruction dispatcher, and some number of Stream Processors (SPs).
- 8192 registers for compute capability <= 1.1, otherwise
- 16384 for compute capability <= 1.3
A group of threads which share a memory and can "synchronize their execution to coördinate accesses to memory" (use a barrier) form a block. Each thread has a threadId within its (three-dimensional) block.
- For a block of dimensions <D_x, D_y, D_z>, the threadId of the thread having index <x, y, z> is (x + y * D_x + z * D_y * D_x).
Register allocation is performed per-block, and rounded up to the nearest
- 256 registers per block for compute capability <= 1.1, otherwise
- 512 registers per block for compute capability <= 1.3.
A group of blocks which share a kernel form a grid. Each block (and each thread within that block) has a blockId within its (two-dimensional) grid.
- For a grid of dimensions <D_x, D_y>, the blockId of the block having index <x, y> is (x + y * D_x).
Thus, a given thread's <blockId X threadId> dyad is unique across the grid. All the threads of a block share a blockId, and corresponding threads of various blocks share a threadId.
Each time the kernel is instantiated, new grid and block dimensions may be provided.
A block's threads, starting from threadId 0, are broken up into contiguous warps having some warp size number of threads.
Distributes out-of-order work at warp granularity across SPs.
- One program counter per warp -- divergence within warp leads to serialization.
- Divergence is trivially supported with a per-warp stack; warps reconverge at immediate post-dominators of branches
Supports some maximum number of blocks and threads (~8 and ~768 on G80).

Block sizing

FIXME: review/verify this!

How tightly can we bound the optimal block size T, given a warp size w? The number of threads per block ought almost always be a multiple of w, both to:

facilitate coalescing (coalescing requirements are related to w/2), and
maximize utilization of SPs within warp-granular scheduling.

A SM has r registers and s words of shared memory, allocated per-block (see above). Assuming that w threads can be supported (i.e., that none requires more than r/w registers or s/w words of shared memory), the most obvious lower bound is w itself. The most obvious upper bound, assuming arbitrary available work, is the greatest multiple of w supported by hardware (and, obviously, the SDK). A block must be scheduled to an SM, which requires:

registers sufficient to support the block,
shared memory sufficient to support the block,
that the total number of threads not exceed some limit t (likely bounding the divergence-tracking stacks), and
that the total number of blocks not exceed some limit b (likely bounding the warp-scheduling complexity).

A given SM, then, supports T values through the minimum of {r/Thr_reg, s/Blk_shmem, and t}; as the block requires fewer registers and less shared memory, the upper bound converges to t. Motivations for larger blocks include:

freedom in the b dimension exposes parallelism until t <= b * T
larger maximum possible kernels (an absolute limit exists on grid dimensions)
better if data can be reused among threads (e.g. in tiled matrix multiply)

Motivations for smaller blocks include:

freedom in the t dimension exposes parallelism until t >= b * T
freedom in the r and s dimensions exposes parallelism until r >= b * T * Thr_reg or s >= b * Blk_shmem.
cheaper per-block operations(?) (__syncthreads(), voting, etc)
support for older hardware and SDKs
fairer distribution among SMs and thus possibly better utilization, lower latency
- relative speedup tends to 0 as work grows arbitrarily on finite SMs
- relative speedup tends to 1/Frac_par on infinitely many SMs

We can now optimize occupancy for a specific {t, b, r and s}, assuming t to be a multiple of both w and b:

Let T = t / b. T is thus guaranteed to be the smallest multiple of w such that t == b * T.
Check the r and w conditions. FIXME: handle reduction
FIXME: handle very large (external) kernels

Optimizing for ranges of hardware values is left as an exercise for the reader. Occupancy is only worth optimizing if the number of warps are insufficient to hide latencies. It might be possible to eliminate latencies altogether by reusing data throughout a block via shared memory; if the algorithm permits, this is almost certainly a net win. In that case, we likely want to maximize Thr_shmem. A more advanced theory would incorporate the arithmetic intensity of a kernel...FIXME

Stream Processor

In-order, multithreaded processor: memory latencies can be hidden only by TLP, not ILP.
- Arithmetic intensity and parallelism are paramount!
- Memory-bound kernels require sufficiently high occupancy (the ratio of concurrently-running warps to maximum possible concurrent warps (as applied, usually, to SMs)) to hide latency.
No branch prediction or speculation (and thus also no pipeline flushes on mispredicted branches).

Memory type	Replication	Kernel access	Host access	Cache location
Registers	Per-thread	Read-write	None	None
Local memory	Per-thread	Read-write	None	None
Shared memory	Per-block	Read-write	None	None
Global memory	Per-grid	Read-write	Read-write	None
Constant memory	Per-grid	Read	Read-write	Stream multiprocessor
Texture memory	Per-grid	Read	Read-write	Texture processing cluster

Compute Capabilities

The original public CUDA revision was 1.0, implemented on the NV50 chipset corresponding to the GeForce 8 series. Compute capability, formed of a non-negative major and minor revision number, can be queried on CUDA-capable cards. All revisions thus far have been backwards-compatible.

Revision	Changes
1.1	Atomic ops on 32-bit global integers. Breakpoints and other debugging support.
1.2	Atomic ops on 64-bit global integers and 32-bit shared integers. 32 warps (1024 threads) and 16K registers per multiprocessor (MP). Vote instructions. Three MPs per Texture Processing Cluster (TPC). Relaxed memory coalescing constraints.
1.3	Double-precision floating point at 32 cycles per operation.
2.0	Atomic addition on 32-bit global and shared FP. 48 warps (1536 threads), 48K shared memory banked 32 ways, and 32K registers per MP. 512K local memory per thread. `__syncthreads_{count,and,or}()`, `__threadfence_system()`, and `__ballot()`. 1024 threads per block and `blockIdx.{x,y}` values ranging through 1024. Larger texture references.

Installation on Debian

libcuda-dev packages exist in the non-free archive area, and supply the core library libcuda.so. Together with the upstream toolkit and SDK from NVIDIA, this provides a full CUDA development environment for 64-bit Debian Unstable systems. I installed CUDA 2.3 on 2010-01-25 (hand-rolled 2.6.32.6 kernel, built with gcc-4.4). This machine did not have CUDA-compatible hardware (it uses Intel 965).

Download the Ubuntu 9.04 files from NVIDIA's "CUDA Zone".
Run the toolkit installer (sh cudatoolkit_2.3_linux_64_ubuntu9.04.run)
- For a user-mode install, supply $HOME/local or somesuch

* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH
*   for 32-bit Linux distributions includes /home/dank/local/cuda/lib
*   for 64-bit Linux distributions includes /home/dank/local/cuda/lib64
* OR
*   for 32-bit Linux distributions add /home/dank/local/cuda/lib
*   for 64-bit Linux distributions add /home/dank/local/cuda/lib64
* to /etc/ld.so.conf and run ldconfig as root

* Please read the release notes in /home/dank/local/cuda/doc/

* To uninstall CUDA, delete /home/dank/local/cuda
* Installation Complete

Run the SDK installer (sh cudasdk_2.3_linux.run)
- I just installed it to the same directory as the toolkit, which seems to work fine.

========================================

Configuring SDK Makefile (/home/dank/local/cuda/shared/common.mk)...

========================================

* Please make sure your PATH includes /home/dank/local/cuda/bin
* Please make sure your LD_LIBRARY_PATH includes /home/dank/local/cuda/lib

* To uninstall the NVIDIA GPU Computing SDK, please delete /home/dank/local/cuda
* Installation Complete

Building CUDA Apps

`nvcc` flags

-ptax-options=-v displays per-thread register usage

SDK's common.mk

This assumes use of the SDK's common.mk, as recommended by the documentation.

Add the library path to LD_LIBRARY_PATH, assuming CUDA's been installed to a non-standard directory.
Set the CUDA_INSTALL_PATH and ROOTDIR (yeargh!) if outside the SDK.
I keep the following in bin/cudasetup of my home directory. Source it, using sh's . cudasetup syntax:

CUDA="$HOME/local/cuda/"

export CUDA_INSTALL_PATH="$CUDA"
export ROOTDIR="$CUDA/C/common/"
if [ -n "$LD_LIBRARY_PATH" ] ; then
	export "LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$CUDA/lib64"
else
	export "LD_LIBRARY_PATH=$CUDA/lib64"
fi

unset CUDA

Set EXECUTABLE in your Makefile, and include $CUDA_INSTALL_PATH/C/common/common.mk

Unit testing

The DEFAULT_GOAL special variable of GNU Make can be used:

.PHONY: test
.DEFAULT_GOAL:=test

include $(CUDA_INSTALL_PATH)/C/common/common.mk

test: $(TARGET)
        $(TARGET)

Libraries

Two mutually exclusive means of driving CUDA are available: the "Driver API" and "C for CUDA" with its accompanying nvcc compiler and runtime. The latter (libcudart) is built atop the former, and requires its libcuda library.

deviceQuery info

Compute capability 2.0

Compute capability 1.3

Tesla C1060

Device 0: "Tesla C1060"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 4294705152 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

GeForce GTX 295

Device 1: "GeForce GTX 295"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 939261952 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.24 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

GeForce GTX 280

Device 0: "GeForce GTX 280"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 1073020928 bytes
  Number of multiprocessors:                     30
  Number of cores:                               240
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.30 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

GeForce GTX 260

Device 0: "GeForce GTX 260"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         3
  Total amount of global memory:                 938803200 bytes
  Number of multiprocessors:                     27
  Number of cores:                               216
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.47 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Compute capability 1.2

GeForce 310 (PCIe x16)

Device 0: "GeForce 310"
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         2
  Total amount of global memory:                 536084480 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

GeForce 240 GT

Device 0 GeForce GT 240
  CUDA Driver Version:                           3.0
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         2
  Total amount of global memory:                 1073414144 bytes
  Number of multiprocessors:                     12
  Number of cores:                               96
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 16384
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512,512,64
  Maximum sizes of each dimension of a grid:     65535,65535,1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.424 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Compute capability 1.1

Quadro FX 570

Device 1: "Quadro FX 570"
  CUDA Driver Version: 2.30
  CUDA Runtime Version: 2.30
  CUDA Capability Major revision number: 1
  CUDA Capability Minor revision number: 1
  Total amount of global memory: 268107776 bytes
  Number of multiprocessors: 2
  Number of cores: 16
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 0.92 GHz
  Concurrent copy and execution: Yes
  Run time limit on kernels: Yes
  Integrated: No
  Support host page-locked memory mapping: No
  Compute mode: Default (multiple host threads can use this device simultaneously)

GeForce 9600 GT

Device 0: “GeForce 9600 GT”
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 536543232 bytes
  Number of multiprocessors:                     8
  Number of cores:                               64
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.50 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

GeForce 9400M

Device 0: "GeForce 9400M"
  Major revision number:                         1
  Minor revision number:                         1
  Total amount of global memory:                 266010624 bytes
  Number of multiprocessors:                     2
  Number of cores:                               16
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.80 GHz
  Concurrent copy and execution:                 No

GeForce 8800 GTS 512

Device 0: "GeForce 8800 GTS 512"
  Major revision number: 1
  Minor revision number: 1
  Total amount of global memory: 536150016 bytes
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1674000 kilohertz

GeForce 8600 GT

Device 0: "GeForce 8600 GT"
  Major revision number: 1
  Minor revision number: 1
  Total amount of global memory: 268107776 bytes
  Total amount of constant memory: 65536 bytes
  Total amount of shared memory per block: 16384 bytes
  Total number of registers available per block: 8192
  Warp size: 32
  Maximum number of threads per block: 512
  Maximum sizes of each dimension of a block: 512 x 512 x 64
  Maximum sizes of each dimension of a grid: 65535 x 65535 x 1
  Maximum memory pitch: 262144 bytes
  Texture alignment: 256 bytes
  Clock rate: 1674000 kilohertz

GeForce 8400 GS (PCI)

Device 0: "GeForce 8400 GS"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 536608768 bytes
  Number of multiprocessors:                     1
  Number of cores:                               8
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.40 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     No
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

Compute capability 1.0

GeForce 8800 Ultra

Device 0: "GeForce 8800 Ultra"
  CUDA Driver Version:                           2.30
  CUDA Runtime Version:                          2.30
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         0
  Total amount of global memory:                 804585472 bytes
  Number of multiprocessors:                     16
  Number of cores:                               128
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          262144 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    1.51 GHz
  Concurrent copy and execution:                 No
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       No
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

@@ Line 89: / Line 89: @@
 * that the total number of threads not exceed some limit ''t'' (likely bounding the divergence-tracking stacks), and
 * that the total number of blocks not exceed some limit ''b'' (likely bounding the warp-scheduling complexity).
-A given SM, then, supports '''T''' values through the minimum of {''r''/'''Thr<sub>reg</sub>''', ''s''/'''Thr<sub>shmem</sub>''', and ''t''}; as each thread requires fewer registers and less shared memory, the upper bound converges to ''t''. Motivations for larger blocks include:
+A given SM, then, supports '''T''' values through the minimum of {''r''/'''Thr<sub>reg</sub>''', ''s''/'''Blk<sub>shmem</sub>''', and ''t''}; as the block requires fewer registers and less shared memory, the upper bound converges to ''t''. Motivations for larger blocks include:
 * freedom in the ''b'' dimension exposes parallelism until ''t'' <= ''b'' * '''T'''
 * larger maximum possible kernels (an absolute limit exists on grid dimensions)
@@ Line 95: / Line 95: @@
 Motivations for smaller blocks include:
 * freedom in the ''t'' dimension exposes parallelism until ''t'' >= ''b'' * '''T'''
-* freedom in the ''r'' and ''s'' dimensions exposes parallelism until ''r'' >= ''b'' * '''T''' * '''Thr<sub>reg</sub>''' or ''s'' >= ''b'' * '''T''' * '''Thr<sub>shmem</sub>'''.
+* freedom in the ''r'' and ''s'' dimensions exposes parallelism until ''r'' >= ''b'' * '''T''' * '''Thr<sub>reg</sub>''' or ''s'' >= ''b'' * '''Blk<sub>shmem</sub>'''.
 * cheaper per-block operations(?) (<tt>__syncthreads()</tt>, voting, etc)
 * support for older hardware and SDKs