Nehalem: Difference between revisions

Latest revision as of 16:32, 25 February 2012

The successor of Core 2, and predecessor of Sandy Bridge.

Move to a 133MHz base clock, against which the CPU clock is multiplied, and to QPI (Quick Path Interconnect) in place of the front-side bus
Reintroduction of SMT (HyperThreading)
Store-forwarding aliasing issue on 4k strides.
Double-pumped FP SSE + integer SSE/x87 + load + store units
Fetch up to 16 bytes of aligned instructions from cache per cycle.
Up to 4 instructions decoded per cycle, at most 1 of them complex (a complex instruction is not necessarily a single µop). Macro-fusion now also works in 64-bit mode.
Instructions with more than 4 µops are fed from MSROM, and will take more than one cycle in the Instruction Decoder Queue.
Forwarding results between integer, integer SIMD, and FP units adds latency compared to forwards within the domain.
One register may be written per cycle.
48 load buffers (up from 32), 32 store buffers (up from 20), 10 fill buffers.
36 reservation stations (up from 32), 128 ROB entries (up from 96).
Calltrace cache of 16 entries.
2-way loop end BTB for every 16 bytes, 4-way general BTB.
Loop Stream Detector replays from IDQ if the loop consists of:
- no more than four 16-byte icache fetches
- no more than 28 µops in total
- no more than 4 taken branches, none of them a RET
- preferably more than 64 iterations?
Be sure to use register parameter-passing conventions, not the stack, to avoid stalls on store-forward of high-latency floating point stores.
Peak issue rate of 1 128-bit load and 1 128-bit store per cycle.
Turbo Boost in 133MHz increments
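
A few of the figures above lend themselves to quick arithmetic. The following is a minimal Python sketch, not a model of the hardware: the function names, the multiplier of 20, and the example loop addresses are hypothetical and chosen only for illustration. It uses the nominal 133MHz step, the 16-byte aligned fetch window, and the Loop Stream Detector limits as listed above. The uncertain 64-iteration preference is left out.

```python
# Nominal clock step quoted in the list above (MHz).
BASE_CLOCK_MHZ = 133

# Size of one aligned instruction fetch from the icache (bytes).
FETCH_BYTES = 16


def core_clock_mhz(multiplier, turbo_bins=0):
    """Core clock as a multiple of the base clock.

    Each Turbo Boost bin is assumed to add one 133MHz step,
    i.e. it raises the effective multiplier by one.
    """
    return (multiplier + turbo_bins) * BASE_CLOCK_MHZ


def fetch_lines_spanned(start_addr, length_bytes):
    """Number of aligned 16-byte fetch windows touched by a code range."""
    first = start_addr // FETCH_BYTES
    last = (start_addr + length_bytes - 1) // FETCH_BYTES
    return last - first + 1


def lsd_eligible(fetch_lines, uops, taken_branches, has_ret):
    """Check a loop body against the LSD limits listed above."""
    return (
        fetch_lines <= 4
        and uops <= 28
        and taken_branches <= 4
        and not has_ret
    )


if __name__ == "__main__":
    # Hypothetical part with a multiplier of 20, plus two turbo bins.
    print(core_clock_mhz(20), core_clock_mhz(20, turbo_bins=2))

    # The same 60-byte loop body, aligned and misaligned by 8 bytes.
    for start in (0x1000, 0x1008):
        lines = fetch_lines_spanned(start, 60)
        print(hex(start), lines, lsd_eligible(lines, 20, 1, False))
```

The second half shows why loop alignment can matter here: the same 60-byte loop body fits in four fetch windows when it starts on a 16-byte boundary, but touches five when it starts 8 bytes later, which would push it past the first LSD limit.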

@@ Line 1: / Line 1: @@
+[[File:Nehalem.svg|thumb|right|Nehalem-EP NUMA arrangement]]
 [[File:IAmNehalem.jpg|thumb|right|Ehyeh asher ehyeh]]
 [[File:Intel Nehalem arch.png|thumb|right|Nehalem microarchitecture]]
+The successor of [[Core 2]], and predecessor of [[Sandy Bridge]].
+* Move to 133MHz [[QPI (Quick Path Interconnect)]], against which the CPU clock is multiplied
+* Reintroduction of [[SMP on x86#SMT|SMT (HyperThreading)]]
 * Store-forwarding aliasing issue on 4k strides.
 * Double-pumped FP SSE + integer SSE/x87 + load + store units
@@ Line 8: / Line 12: @@
 * Forwarding results between integer, integer SIMD, and FP units adds latency compared to forwards within the domain.
 * One register may be written per cycle.
-* 48 load buffers, 32 store buffers, 10 fill buffers.
+* 48 load buffers (up from 32), 32 store buffers (up from 20), 10 fill buffers.
 * 36 reservation stations (up from 32), 128 ROB entries (up from 96).
 * Calltrace cache of 16 entries.
@@ Line 19: / Line 23: @@
 * Be sure to use register parameter-passing conventions, not the stack, to avoid stalls on store-forward of high-latency floating point stores.
 * Peak issue rate of 1 128-bit load and 1 128-bit store per cycle.
+* [[Turbo Boost]] in 133MHz increments
+==See Also==
+* The Dark Knight's [http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3448 AnandTech] article, 2008-11-03.
+[[Category: x86]]
+[[CATEGORY: Hardware]]