Nehalem
From dankwiki
Latest revision as of 16:32, 25 February 2012
The successor of Core 2, and predecessor of Sandy Bridge.
- Move from the front-side bus to QPI (Quick Path Interconnect), with a 133MHz base clock against which the CPU clock is multiplied
- Reintroduction of SMT (HyperThreading)
- Store-forwarding aliasing issue on 4k strides.
- Double-pumped FP SSE + integer SSE/x87 + load + store units
- Fetch up to 16 bytes of aligned instructions from cache per cycle.
- Up to 4 instructions decoded per cycle, no more than 1 of them complex (this does not necessarily mean 1 µop). Macro-fusion now also works in 64-bit mode.
- Instructions with more than 4 µops are fed from MSROM, and will take more than one cycle in the Instruction Decoder Queue.
- Forwarding results between integer, integer SIMD, and FP units adds latency compared to forwards within the domain.
- One register may be written per cycle.
- 48 load buffers (up from 32), 32 store buffers (up from 20), 10 fill buffers.
- 36 reservation stations (up from 32), 128 ROB entries (up from 96).
- Calltrace cache of 16 entries.
- 2-way loop end BTB for every 16 bytes, 4-way general BTB.
- Loop Stream Detector replays from IDQ if the loop consists of:
  - 4 16-byte icache fetches or fewer
  - 28 total µops or fewer
  - 4 taken branches or fewer, none of them a RET
  - preferably more than 64 iterations?
- Pass parameters in registers rather than on the stack, to avoid store-forwarding stalls behind high-latency floating-point stores.
- Peak issue rate of 1 128-bit load and 1 128-bit store per cycle.
- Turbo Boost in 133MHz increments
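
The 4k-stride aliasing item above can be illustrated with a small address check. This is a minimal sketch, assuming the hazard arises when a load and an earlier store hit different addresses that share their low 12 bits (the offset within a 4 KiB page); it ignores access sizes and partial overlap, and the function name is hypothetical.

```python
PAGE_OFFSET_MASK = 0xFFF  # low 12 bits: offset within a 4 KiB page


def may_alias_4k(store_addr: int, load_addr: int) -> bool:
    """Return True if two distinct addresses share their low 12 bits.

    Assumption: this is the condition under which a load can be
    mistaken for a match against an earlier, unrelated store.
    """
    distinct = store_addr != load_addr
    same_page_offset = ((store_addr ^ load_addr) & PAGE_OFFSET_MASK) == 0
    return distinct and same_page_offset
```

For example, buffers at 0x1000 and 0x2000 would be flagged, while 0x1000 and 0x1040 would not. Padding one buffer by a cache line is a common way to break such a pattern.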
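
The Loop Stream Detector conditions listed above can be written as a simple predicate. This is a sketch only: the `Loop` record and both function names are hypothetical, the loop body is assumed to be contiguous in memory, and the "more than 64 iterations" hint is left out because the notes mark it as uncertain.

```python
from dataclasses import dataclass


@dataclass
class Loop:
    fetch_lines: int      # aligned 16-byte icache fetches the body spans
    uops: int             # total µops in the loop body
    taken_branches: int   # taken branches per iteration
    contains_ret: bool    # True if any branch in the body is a RET


def fetch_lines_spanned(start: int, size: int) -> int:
    """Count the aligned 16-byte blocks touched by `size` bytes at `start`.

    Assumes size > 0 and a contiguous loop body.
    """
    first = start // 16
    last = (start + size - 1) // 16
    return last - first + 1


def lsd_eligible(loop: Loop) -> bool:
    """Apply the limits from the list above: 4 fetches, 28 µops, 4 taken branches, no RET."""
    return (
        loop.fetch_lines <= 4
        and loop.uops <= 28
        and loop.taken_branches <= 4
        and not loop.contains_ret
    )
```

Alignment matters here: a 64-byte loop starting at 0x10 spans 4 fetch lines, but the same loop starting at 0x18 spans 5 and would fail the first condition.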
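
The clock and peak-rate items above reduce to simple arithmetic, sketched here. The multiplier of 20 and the two turbo bins in the usage note are hypothetical example values, the function names are illustrative, and 133 is the rounded figure from the list (the nominal reference clock is closer to 133.33MHz).

```python
BASE_CLOCK_MHZ = 133   # rounded reference clock, as given in the list above
LOAD_BYTES = 128 // 8  # one 128-bit load per cycle = 16 bytes


def core_clock_mhz(multiplier: int, turbo_bins: int = 0) -> int:
    """Core clock as a multiple of the base clock; each Turbo Boost bin adds one step."""
    return (multiplier + turbo_bins) * BASE_CLOCK_MHZ


def peak_load_bytes_per_sec(clock_mhz: int) -> int:
    """Upper bound on load throughput per core at one 128-bit load per cycle."""
    return LOAD_BYTES * clock_mhz * 10**6
```

With a multiplier of 20, `core_clock_mhz(20)` gives 2660 and two turbo bins raise it to 2926. The store side has the same per-cycle bound, since the peak is one 128-bit load and one 128-bit store per cycle.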
See Also
- The Dark Knight's AnandTech article, 2008-11-03: http://www.anandtech.com/cpuchipsets/intel/showdoc.aspx?i=3448