SIMD: Difference between revisions

Latest revision as of 19:09, 17 February 2012

SIMD (single instruction, multiple data) was first introduced on the Illiac IV, and is not to be confused with vector processing. SIMD-capable units are often classified using SIMDnXm nomenclature, where n is the issue width and m is the channel width; a superscalar processor would be SIMDnX1, while a scalar SISD processor would be SIMD1X1.

Compilers and Assemblers

GCC can emit SIMD instructions in four different ways:
- Target-specific code generation will sometimes use scalar SIMD instructions (for example, scalar SSE arithmetic in place of x87 when -mfpmath=sse is in effect)
- Target-specific builtins, using vector extensions
- Inline assembly, using the syntax of the GNU Assembler and a baroque register specification convention
- Auto-vectorization, developed on gcc's autovect branch, attempts to vectorize some loops
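As a rough illustration of the vector-extension route, the sketch below declares a 16-byte vector of four singles and lets the compiler pick the instructions. The names v4sf and madd4 are made up for this example, and it assumes GCC or a compatible compiler such as Clang; which instructions are actually emitted depends on the target and flags.

```c
#include <assert.h>

/* Four packed singles in one 16-byte vector (GCC/Clang vector extension). */
typedef float v4sf __attribute__((vector_size(16)));

static_assert(sizeof(v4sf) == 16, "v4sf should be the size of one XMM register");

/* Element-wise a*b + c. Hypothetical helper name; the compiler chooses
   the instructions (plausibly mulps/addps on x86 with SSE enabled, or
   scalar code on a target without a suitable SIMD unit). */
static v4sf madd4(v4sf a, v4sf b, v4sf c)
{
    return a * b + c;
}
```

Calling madd4 with a = {1, 2, 3, 4}, b = {2, 2, 2, 2} and c = {1, 1, 1, 1} yields {3, 5, 7, 9}; individual elements can be read with ordinary subscripts such as r[0] on reasonably recent compilers.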

Data Types (taken from SSE specs)

These do not necessarily map to C data types of the same name, for any given compiler!

half precision: 16-bit IEEE 754 floating-point (bias-15) (IEEE 754-2008 binary16)
single: 32-bit IEEE 754 floating-point (bias-127) (IEEE 754-2008 binary32)
double: 64-bit IEEE 754 floating-point (bias-1023) (IEEE 754-2008 binary64)
long double: 80-bit "double extended" IEEE 754-1985 floating-point (bias-16383)
- not an actual SIMD type, but an artifact of x87
word: 16-bit two's complement integer
doubleword, dword: 32-bit two's complement integer
quadword, qword: 64-bit two's complement integer
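To make the "bias" figures above concrete, here is a small sketch that pulls the exponent field out of a single and removes the bias of 127. It assumes the compiler's float is IEEE 754 binary32, which the list above warns is not guaranteed; the function names are invented for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "float assumed to be binary32");

/* Return the raw (biased) 8-bit exponent field of a binary32 value. */
static unsigned biased_exponent(float x)
{
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);      /* well-defined way to view the bits */
    return (bits >> 23) & 0xFFu;         /* bits 30..23 hold the exponent */
}

/* Return the true power-of-two exponent of a normal binary32 value
   by subtracting the bias of 127. */
static int unbiased_exponent(float x)
{
    return (int)biased_exponent(x) - 127;
}
```

For instance, 1.0f stores an exponent field of 127 (true exponent 0), and 8.0f stores 130 (true exponent 3). Zeros, subnormals, infinities and NaNs use the reserved field values 0 and 255 and are not handled by this sketch.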

x86 YMM

The AVX (Advanced Vector eXtensions) were introduced on Intel's Sandy Bridge (2011) and AMD's Bulldozer (2011). They operate on the 256-bit YMM registers (YMM0..YMM7 in 32-bit mode, YMM0..YMM15 in 64-bit mode), whose low 128 bits are aliased by the XMM registers. They are encoded using the VEX scheme. Support for AVX can be determined as follows:

Determine that CPUID.1:ECX.OSXSAVE[bit 27] is set (XGETBV is enabled for application use)
Issue XGETBV and verify that XFEATURE_ENABLED_MASK[2:1] = 0x3 (XMM state and YMM state are enabled by OS)
Determine that CPUID.1:ECX.AVX[bit 28] is set (AVX instructions supported)
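The three checks above might be coded along the following lines. This is a sketch, not a vetted detection routine: it assumes GCC or Clang on x86 (it relies on the compiler-provided cpuid.h and on GNU inline assembly), and the names read_xcr0 and avx_usable are invented here. XFEATURE_ENABLED_MASK is read as extended control register 0.

```c
#include <assert.h>

static_assert(sizeof(unsigned int) == 4, "CPUID registers are 32 bits wide");

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>

/* Read XFEATURE_ENABLED_MASK (XCR0) with XGETBV. The instruction is
   emitted as raw bytes so older assemblers without the mnemonic accept it. */
static unsigned long long read_xcr0(void)
{
    unsigned int eax, edx;
    __asm__ volatile(".byte 0x0f, 0x01, 0xd0"
                     : "=a"(eax), "=d"(edx)
                     : "c"(0));
    return ((unsigned long long)edx << 32) | eax;
}

/* Return 1 if both the CPU and the OS support AVX, otherwise 0. */
static int avx_usable(void)
{
    unsigned int eax, ebx, ecx, edx;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0;                        /* CPUID leaf 1 unavailable */
    if (!(ecx & (1u << 27)))
        return 0;                        /* OSXSAVE clear: XGETBV unusable */
    if ((read_xcr0() & 0x6) != 0x6)
        return 0;                        /* OS does not save XMM and YMM state */
    if (!(ecx & (1u << 28)))
        return 0;                        /* CPU lacks AVX */
    return 1;
}
#else
/* Not an x86 target with a GNU-style compiler: report no AVX. */
static int avx_usable(void)
{
    return 0;
}
#endif
```

The OSXSAVE test comes first deliberately: executing XGETBV when that bit is clear raises an invalid-opcode fault, so the order of the checks matters.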

x86 AVX

New 256-bit YMM registers.

New Haswell Instructions

AVX2 instructions: integer SIMD using AVX's 256-bit YMM registers.
YMM-SIMD bit manipulation, permute, gather, shift, and FMA

AVX1

Introduced on Sandy Bridge

x86 XMM

The Streaming SIMD Extensions (SSE) operate on the 128-bit XMM registers (XMM0..XMM7 in 32-bit mode, XMM0..XMM15 in 64-bit mode). In their original incarnation on the PIII, execution units (but not registers) were shared with the x87 floating-point architecture. The execution units were separated in the NetBurst microarchitecture. In the Core microarchitecture, the execution engine was widened for greater SSE throughput.

SSE5 (AMD)

Unimplemented extensions competing with SSE4, encoded using a method incompatible with VEX
Withdrawn, converted into VEX-compatible encodings, and split into:
- FMA4: Fused floating-point multiply-add (compare Intel's FMA)
- XOP: Fused integer multiply-add, byte permutations, shifts, rotates, integer vector horizontal operations (compare Intel's SSE4)
- CVT16: Half-precision conversion

SSE4 (Intel)

SSE4.1

Introduced on Penryn
dpps -- dot product of two vectors having four single components each
dppd -- dot product of two vectors having two double components each
insertps
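What dpps computes can be modelled in plain C as below. The list above does not mention it, but the instruction also takes an 8-bit immediate: its high four bits choose which element products enter the sum, and its low four bits choose which result elements receive that sum (the rest are zeroed). The function name dpps_model is hypothetical, and the model adds the products left to right, so its rounding may differ slightly from hardware.

```c
#include <assert.h>

/* Scalar model of dpps on four packed singles.
   imm8[7:4]: which products a[i]*b[i] are included in the sum.
   imm8[3:0]: which elements of out receive the sum; others become 0. */
static void dpps_model(const float a[4], const float b[4],
                       unsigned imm8, float out[4])
{
    float sum = 0.0f;
    int i;

    assert(imm8 <= 0xFFu);               /* the immediate is a single byte */

    for (i = 0; i < 4; i++)
        if (imm8 & (0x10u << i))
            sum += a[i] * b[i];

    for (i = 0; i < 4; i++)
        out[i] = (imm8 & (1u << i)) ? sum : 0.0f;
}
```

With a = {1, 2, 3, 4} and b = {5, 6, 7, 8}, an immediate of 0xF1 gives the full dot product in the lowest element, {70, 0, 0, 0}, while 0x33 multiplies only the first two pairs and broadcasts the result to the first two elements, {17, 17, 0, 0}.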

SSE4.2

Introduced on Nehalem
crc32
pcmpestri
pcmpestrm
pcmpistri
pcmpistrm
pcmpgtq
popcnt

SSE4a (AMD)

lzcnt
popcnt
extrq
insertq
movntsd
movntss

SSE3 (PNI)

Originally known as Prescott New Instructions, and introduced on P4-Prescott
movddup -- load a double from a 64-bit memory location or the lower half of an XMM register, and duplicate it into both halves of the destination XMM register
MWAIT/MONITOR

SSSE3 (TNI/MNI)

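Introduced with the Core microarchitecture. Sometimes referred to as Tejas New Instructions or Merom New Instructions.
pmaddwd -- multiply packed signed words, then sum adjacent pairs of products into doublewords (the instruction itself dates from MMX and was extended to XMM registers by SSE2; SSSE3 added the related pmaddubsw)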
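The multiply-then-pairwise-sum behaviour of pmaddwd is easier to see in scalar form. The following is an illustrative model of the 128-bit variant (eight 16-bit words per operand, four 32-bit doublewords out), not the instruction itself; pmaddwd_model is a name invented for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of the 128-bit pmaddwd: multiply corresponding signed
   16-bit words, then add each adjacent pair of 32-bit products. */
static void pmaddwd_model(const int16_t a[8], const int16_t b[8],
                          int32_t out[4])
{
    int i;

    assert(a && b && out);

    for (i = 0; i < 4; i++) {
        int32_t lo = (int32_t)a[2 * i]     * b[2 * i];
        int32_t hi = (int32_t)a[2 * i + 1] * b[2 * i + 1];
        /* Add as unsigned so the one overflowing case (all four inputs
           equal to -32768) wraps instead of being undefined in C. */
        out[i] = (int32_t)((uint32_t)lo + (uint32_t)hi);
    }
}
```

For a = {1, 2, 3, 4, 5, 6, 7, 8} and b = {1, 1, 2, 2, -1, -1, 0, 3} the result is {3, 14, -11, 24}. Note that the destination is overwritten with the pair sums; nothing is accumulated from its previous contents.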

SSE2

Introduced with the P4.
Extends the MMX instructions to XMM registers.
movapd -- move two packed doubles from a 16-byte-aligned memory location to an XMM register, or vice versa, or between two XMM registers.
- movupd -- a variant of movapd that is safe for unaligned memory references, with far inferior performance.
mulpd -- multiply two packed doubles. The multiplier is a 16-byte-aligned memory location or XMM register. The target XMM register serves as the multiplicand.
addpd -- add two packed doubles. The addend is a 16-byte-aligned memory location or XMM register. The target XMM register serves as the augend.
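From C, these instructions are usually reached through the SSE2 intrinsics in emmintrin.h rather than written by hand. The sketch below computes a*b + c on two packed doubles; the intrinsics are the standard ones, but the wrapper name madd2_pd and the scalar fallback for targets without SSE2 are additions for this example, and the compiler remains free to pick different instructions than the comments suggest.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>                   /* SSE2 intrinsics */
#endif

/* out[i] = a[i] * b[i] + c[i] for two packed doubles.
   Every pointer must be 16-byte aligned, as the aligned move requires. */
static void madd2_pd(const double *a, const double *b,
                     const double *c, double *out)
{
    assert((((uintptr_t)a | (uintptr_t)b |
             (uintptr_t)c | (uintptr_t)out) % 16) == 0);
#if defined(__SSE2__)
    __m128d va = _mm_load_pd(a);         /* aligned load, typically movapd */
    __m128d vb = _mm_load_pd(b);
    __m128d vc = _mm_load_pd(c);
    __m128d r  = _mm_add_pd(_mm_mul_pd(va, vb), vc);  /* mulpd, then addpd */
    _mm_store_pd(out, r);                /* aligned store */
#else
    out[0] = a[0] * b[0] + c[0];         /* scalar fallback without SSE2 */
    out[1] = a[1] * b[1] + c[1];
#endif
}
```

Callers need to guarantee the alignment themselves, for example by declaring the arrays with _Alignas(16); passing a misaligned pointer to the aligned load would fault at run time, which is why the sketch asserts first.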

SSE (KNI/ISSE)

Introduced with the PIII. Sometimes referred to as Katmai New Instructions, and branded for some time as the Internet Streaming SIMD Extensions.
movaps -- move four packed singles from a 16-byte-aligned memory location to an XMM register, or vice versa, or between two XMM registers.
- movups -- a variant of movaps that is safe for unaligned memory references, with far inferior performance.
mulps -- multiply four packed singles. The multiplier is a 16-byte-aligned memory location or XMM register. The target XMM register serves as the multiplicand.
addps -- add four packed singles. The addend is a 16-byte-aligned memory location or XMM register. The target XMM register serves as the augend.
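The aligned/unaligned distinction between movaps and movups shows up directly in the intrinsics from xmmintrin.h. The pair of helpers below is only a sketch: the function names are invented, the scalar fallback covers targets without SSE, and the instruction named in each comment is what a compiler would typically emit rather than a guarantee.

```c
#include <assert.h>
#include <stdint.h>

#if defined(__SSE__)
#include <xmmintrin.h>                   /* SSE intrinsics */
#endif

/* dst[i] = a[i] + b[i] for four packed singles; no alignment needed. */
static void add4_ps_unaligned(const float *a, const float *b, float *dst)
{
#if defined(__SSE__)
    __m128 va = _mm_loadu_ps(a);         /* unaligned load, typically movups */
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(dst, _mm_add_ps(va, vb));           /* addps */
#else
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];            /* scalar fallback without SSE */
#endif
}

/* Same operation, but every pointer must be 16-byte aligned. */
static void add4_ps_aligned(const float *a, const float *b, float *dst)
{
    assert((((uintptr_t)a | (uintptr_t)b | (uintptr_t)dst) % 16) == 0);
#if defined(__SSE__)
    __m128 va = _mm_load_ps(a);          /* aligned load, typically movaps */
    __m128 vb = _mm_load_ps(b);
    _mm_store_ps(dst, _mm_add_ps(va, vb));            /* addps */
#else
    for (int i = 0; i < 4; i++)
        dst[i] = a[i] + b[i];
#endif
}
```

The unaligned version accepts any float pointer, such as one pointing into the middle of a larger buffer, while the aligned version trades that flexibility for the cheaper move.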

Future Directions

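The FMA instruction set extension to x86 provides floating-point fused multiply-add: a multiplication and an addition performed with a single rounding step
- AMD's four-operand variant is called FMA4, part of what was SSE5, and shipped with Bulldozer (2011)
- Intel's three-operand FMA is among the new Haswell instructions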

x87 MMX

MMX (Intel)

3DNow! (AMD)

Other Architectures

PowerPC implements AltiVec
SPARC implements VIS, the Visual Instruction Set
PA-RISC implements MAX, the Multimedia Acceleration eXtensions
ARM implements NEON
Alpha implemented MVI, the Motion Video Instructions
SWAR: SIMD Within a Register (bit-parallel methods)
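As a taste of the SWAR approach, the sketch below adds four unsigned bytes packed into one 32-bit word using only ordinary integer instructions, masking so that a carry out of one byte never spills into its neighbour. It is one well-known bit-parallel trick chosen for illustration; the function names are invented, and the scalar version exists only to cross-check the packed one.

```c
#include <assert.h>
#include <stdint.h>

/* Add four unsigned bytes packed in x to the four packed in y. Each
   lane wraps modulo 256 and no carry crosses a lane boundary. */
static uint32_t swar_add_u8x4(uint32_t x, uint32_t y)
{
    const uint32_t low7 = 0x7F7F7F7Fu;   /* low seven bits of every lane */
    const uint32_t top  = 0x80808080u;   /* top bit of every lane */
    uint32_t sum = (x & low7) + (y & low7);  /* carries stop at bit 7 */
    return sum ^ ((x ^ y) & top);            /* fold the top bits back in */
}

/* Lane-by-lane reference implementation. */
static uint32_t scalar_add_u8x4(uint32_t x, uint32_t y)
{
    uint32_t r = 0;
    for (int lane = 0; lane < 4; lane++) {
        uint32_t a = (x >> (8 * lane)) & 0xFFu;
        uint32_t b = (y >> (8 * lane)) & 0xFFu;
        r |= ((a + b) & 0xFFu) << (8 * lane);
    }
    return r;
}

/* Check that the packed and lane-by-lane versions agree on one input. */
static void swar_check(uint32_t x, uint32_t y)
{
    assert(swar_add_u8x4(x, y) == scalar_add_u8x4(x, y));
}
```

For example, adding 0x01FF7F10 and 0x01010110 gives 0x02008020: the 0xFF + 0x01 lane wraps to 0x00 without disturbing the lane above it.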

@@ Line 1: / Line 1: @@
-==x86==
+[[File:Simd.jpeg|thumb|right|A 4-way SIMD unit]]
-* [http://software.intel.com/en-us/avx/ AVX] (Advanced Vector eXtensions) -- to be introduced on Intel's Sandy Bridge (2010) and AMD's Bulldozer (2011), and implemented within the [http://en.wikipedia.org/wiki/VEX_prefix VEX coding scheme]
+First introduced on the Illiac IV, and not to be confused with vector processing. SIMD-capable units are often taxonomized via <tt>SIMDnXm</tt> nomenclature, where '''n''' is the issue width and '''m''' is the channel width; a superscalar processor would be SIMDnX1, while a scalar SISD processor would be SIMD1X1.
-===SSE3===
+==Compilers and Assemblers==
+* [[GCC]] can issue SIMD four different ways:
+** [http://gcc.gnu.org/onlinedocs/gcc/Submodel-Options.html#Submodel-Options Target-specific code generation] will sometimes use scalar SIMD
+** [http://gcc.gnu.org/onlinedocs/gcc/Target-Builtins.html Target-specific builtins], using [http://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html vector extensions]
+** [[GCC#Inline_Assembly|Inline assembly]], using syntax of the [http://sourceware.org/binutils/docs/as/index.html GNU Assembler] and a baroque register specification convention
+** [http://gcc.gnu.org/projects/tree-ssa/vectorization.html Auto-vectorization] from [[GCC|gcc's]] <tt>autovect</tt> branch attempts to autovectorize some loops
+==Data Types (taken from SSE specs)==
+These do not necessarily map to [[C]] data types of the same name, for any given compiler!
+* '''half precision''': 16-bit IEEE 754 floating-point (bias-15) (IEEE 754 2008 '''binary16''')
+* '''single''': 32-bit IEEE 754 floating-point (bias-127) (IEEE 754 2008 '''binary32''')
+* '''double''': 64-bit IEEE 754 floating-point (bias-1023) (IEEE 754 2008 '''binary64''')
+* '''long double''': 80-bit "double extended" IEEE 754-1985 floating-point (bias-16383)
+** not an actual SIMD type, but an artifact of x87
+* '''word''': 32-bit two's complement integer
+* '''doubleword''', '''dword''': 64-bit two's complement integer
+==x86 YMM==
+The [http://software.intel.com/en-us/avx/ AVX] (Advanced Vector eXtensions) were introduced on Intel's [[Sandy Bridge]] (2010) and AMD's [[Bulldozer]] (2011). They operate on the 256-bit YMM registers (YMM0..YMM15), which are aliased by the [[SIMD#x86_XMM|XMM]] registers. They are encoded using the [[VEX]] scheme. Support for AVX can be determined via:
+* Determine that [[CPUID]].1:ECX.OSXSAVE[bit 27] is set (XGETBV is enabled for application use)
+* Issue XGETBV and verify that XFEATURE_ENABLED_MASK[2:1] = 0x3 (XMM state and YMM state are enabled by OS)
+* Determine that [[CPUID]].1:ECX.AVX[bit 28] is set (AVX instructions supported)
+==x86 AVX==
+New 256-bit YMM registers.
+===New [[Haswell]] Instructions===
+* AVX2 instructions: integer SIMD using AVX's 256-bit YMM registers.
+* YMM-SIMD bit manipulation, permute, gather, shift, and FMA
+===AVX1===
+* Introduced on [[Sandy Bridge]]
+==x86 XMM==
+The Streaming SIMD Extensions operate on the 128-bit XMM registers (XMM0..XMM7 in 32-bit mode, XMM0..XMM15 in 64-bit mode). In its original incarnation on the PIII, execution units (but not registers) were shared with the x87 floating-point architecture. The execution units were separated in the NetBurst microarchitecture. In the Core microarchitecture, the execution engine has been widened for greater SSE throughput.
+===SSE5 (AMD)===
+* Unimplemented extensions competing with SSE4, encoded using a method incompatible with VEX
+* Withdrawn, converted into VEX-compatible encodings, and split into:
+** FMA4: Fused floating-point multiply-add (compare Intel's FMA)
+** XOP: Fused integer multiply-add, byte permutations, shifts, rotates, integer vector horizontal operations (compare Intel's SSE4)
+** CVT16: Half-precision conversion
+===SSE4 (Intel)===
+====SSE4.1====
+[[File:Dppd.gif|thumb|DPPD instruction dataflow]]
+* Introduced on Penryn
+*<tt>dpps</tt> -- dot product of two vectors having four single components each
+*<tt>dppd</tt> -- dot product of two vectors having two double components each
+*<tt>insertps</tt>
+====SSE4.2====
+[[File:Intel Nehalem arch.png|thumb|Intel [[Nehalem]] microarchitecture]]
+*Introduced on [[Nehalem]]
+*<tt>crc32</tt>
+*<tt>pcmpestri</tt>
+*<tt>pcmpestrm</tt>
+*<tt>pcmpistri</tt>
+*<tt>pcmpistrm</tt>
+*<tt>pcmpgtq</tt>
+*<tt>popcnt</tt>
+===SSE4a (AMD)===
+*<tt>lzcnt</tt>
+*<tt>popcnt</tt>
+*<tt>extrq</tt>
+*<tt>insertq</tt>
+*<tt>movntsd</tt>
+*<tt>movntss</tt>
+===SSE3 (PNI)===
+*Originally known as Prescott New Instructions, and introduced on P4-Prescott
 *<tt>[http://www.intel.com/software/products/documentation/vlin/mergedprojects/analyzer_ec/mergedprojects/reference_olh/mergedProjects/instructions/instruct32_hh/movddup--move_one_double-fp_and_duplicate.htm movddup]</tt> -- move a double from a 8-byte-aligned memory location or lower half of XMM register to upper half, then duplicate upper half to lower half
+*<tt>MWAIT/MONITOR</tt>
+====SSSE3 (TNI/MNI)====
+*Introduced with the Core microarchitecture. Sometimes referred to as Tejas New Instructions or Merom New Instructions
+*<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc235.htm pmaddwd]</tt> -- multiply packed words, then horizontally sum pairs, accumulating into doublewords
 ===SSE2===
+*Introduced with the P4.
+*Extends the [[SIMD#MMX|MMX]] instructions to XMM registers.
 *<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc180.htm movapd]</tt> -- move two packed doubles from a 16-byte-aligned memory location to XMM registers, or vice versa, or between two XMM registers.
 **<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc205.htm movupd]</tt> -- <tt>movapd</tt> safe for unaligned memory references, with far inferior performance.
@@ Line 10: / Line 83: @@
 *<tt>[http://www.intel.com/software/products/documentation/vlin/mergedprojects/analyzer_ec/mergedprojects/reference_olh/mergedprojects/instructions/instruct32_hh/vc8a.htm addpd]</tt> -- add two packed doubles. the addend is a 16-byte-aligned memory location or XMM register. the target XMM register serves as the augend.
-===SSE===
+===SSE (KNI/ISSE)===
+* Introduced with the PIII. Sometimes referred to as Katmai New Instructions, and branded for some time as the Internet Streaming SIMD Extensions
 *<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc181.htm movaps]</tt> -- move four packed singles from a 16-byte-aligned memory location to XMM registers, or vice versa, or between two XMM registers.
 **<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc206.htm movups]</tt> -- <tt>movaps</tt> safe for unaligned memory references, with far inferior performance.
@@ Line 16: / Line 90: @@
 *<tt>[http://www.sesp.cse.clrc.ac.uk/html/SoftwareTools/vtune/users_guide/mergedProjects/analyzer_ec/mergedProjects/reference_olh/mergedProjects/instructions/instruct32_hh/vc9a.htm addps]</tt> -- add four packed singles. the addend is a 16-byte-aligned memory location or XMM register. the target XMM register serves as the augend.
-===Fused Multiply-Add===
+===Future Directions===
-* The [http://en.wikipedia.org/wiki/FMA_instruction_set FMA instruction set] extension to x86 should hit around 2011
+* The [http://en.wikipedia.org/wiki/FMA_instruction_set FMA instruction set] extension to x86 should hit around 2011, providing floating-point fused multiply-add
+** AMD appears to call this [http://en.wikipedia.org/wiki/FMA_instruction_set FMA4], part of what was SSE5
+==x87 MMX==
+===MMX (Intel)===
+===3DNow! (AMD)===
 ==Other Architectures==
 * PowerPC implements [http://en.wikipedia.org/wiki/AltiVec AltiVec]
+* SPARC implements [http://en.wikipedia.org/wiki/Visual_Instruction_Set VIS], the Visual Instruction Set
+* PA-RISC implements [http://en.wikipedia.org/wiki/Multimedia_Acceleration_eXtensions MAX], the Multimedia Acceleration eXtensions
+* ARM implements [http://en.wikipedia.org/wiki/ARM_architecture#Advanced_SIMD_.28NEON.29 NEON]
+* Alpha implemented [http://www.alphalinux.org/wiki/index.php/Motion_Video_Instructions MVI], the Motion Video Instructions
+* [http://cobweb.ecn.purdue.edu/~hankd/SWAR/ SWAR]: SIMD Within a Register (bit-parallel methods)
 ==See Also==
 * "[http://software.intel.com/en-us/forums/intel-avx-and-cpu-instructions/topic/61121/ Why no FMA in AVX in Sandy Bridge?]", Intel Developers Forum
 * [http://developer.amd.com/cpu/SSE5/Pages/default.aspx SSE5] guide at AMD
+* [http://softwarecommunity.intel.com/isn/Downloads/Intel%20SSE4%20Programming%20Reference.pdf SSE4] reference at Intel
+* 2007-04-19 post to http://virtualdub.org, "[http://www.virtualdub.org/blog/pivot/entry.php?id=150 SSE4 finally adds dot products]"
+* AMD64 Architecture Programmer’s Manual Volume 6: [http://support.amd.com/us/Processor_TechDocs/43479.pdf 128-Bit and 256-Bit XOP, FMA4 and CVT16 Instructions]
+* Agner Fog's [http://www.agner.org/optimize/instruction_tables.pdf x86 Instruction Tables]
+* "[http://software.intel.com/en-us/blogs/2011/06/13/haswell-new-instruction-descriptions-now-available Haswell New Instruction Descriptions now Available]", Intel Software Developer Network, 2011-06-13
+* General [[architecture]] page
 [[Category: x86]]
