Nehalem: Difference between revisions

From dankwiki

Revision as of 19:35, 22 March 2010

Ehyeh asher ehyeh
Nehalem microarchitecture
  • Move from the front-side bus to QPI (QuickPath Interconnect). The CPU clock is a multiple of the 133MHz base clock (BCLK), not of the QPI link rate.
  • Reintroduction of SMT (HyperThreading)
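  • 4K aliasing: a load can falsely alias with an earlier store whose address differs by a multiple of 4KB, delaying the load until the store is resolved.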
  • Double-pumped FP SSE + integer SSE/x87 + load + store units
  • Fetch up to 16 bytes of aligned instructions from cache per cycle.
  • Up to 4 instructions decoded per cycle, no more than 1 of them complex (this does not necessarily mean 1 µop). Macro-fusion now also works in 64-bit mode.
  • Instructions with more than 4 µops are fed from MSROM, and will take more than one cycle in the Instruction Decoder Queue.
  • Forwarding results between integer, integer SIMD, and FP units adds latency compared to forwards within the domain.
  • One register may be written per cycle.
  • 48 load buffers (up from 32), 32 store buffers (up from 20), 10 fill buffers.
  • 36 reservation stations (up from 32), 128 ROB entries (up from 96).
  • Calltrace cache of 16 entries.
  • 2-way loop end BTB for every 16 bytes, 4-way general BTB.
  • Loop Stream Detector replays from IDQ if the loop consists of:
    • 4 or fewer 16-byte icache fetches
    • 28 or fewer total µops
    • 4 or fewer taken branches, none of them a RET
    • preferably more than 64 iterations?
  • Pass parameters in registers rather than on the stack wherever the calling convention allows, to avoid stalls on store-forwarding from high-latency floating-point stores.
  • Peak issue rate of one 128-bit load and one 128-bit store per cycle.
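
The Loop Stream Detector conditions listed above can be written as a small checker. This is only a sketch: the Loop record, its field names, and the helper functions are hypothetical, and it assumes that "16-byte icache fetches" means the number of aligned 16-byte blocks the loop body touches. The uncertain "more than 64 iterations" hint is not modelled.

```python
from dataclasses import dataclass

# Limits taken from the list above; all names here are hypothetical.
FETCH_BYTES = 16          # aligned instruction fetch width
MAX_FETCH_BLOCKS = 4      # 16-byte icache fetches
MAX_UOPS = 28
MAX_TAKEN_BRANCHES = 4


@dataclass
class Loop:
    start_addr: int       # address of the first byte of the loop body
    size_bytes: int       # total code bytes in the loop body (> 0)
    uops: int             # total µops after decode
    taken_branches: int   # taken branches per iteration
    contains_ret: bool    # True if any branch in the loop is a RET


def fetch_blocks(loop: Loop) -> int:
    """Count the aligned 16-byte blocks the loop body touches (assumed
    to equal the number of icache fetches needed per iteration)."""
    first = loop.start_addr // FETCH_BYTES
    last = (loop.start_addr + loop.size_bytes - 1) // FETCH_BYTES
    return last - first + 1


def lsd_eligible(loop: Loop) -> bool:
    """Apply the four structural limits from the list above."""
    return (fetch_blocks(loop) <= MAX_FETCH_BLOCKS
            and loop.uops <= MAX_UOPS
            and loop.taken_branches <= MAX_TAKEN_BRANCHES
            and not loop.contains_ret)
```

Under these assumptions a 64-byte loop starting on a 16-byte boundary touches 4 blocks and qualifies, while the same loop starting 8 bytes later touches 5 and does not, which is one reason to align hot loop entry points.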

See Also

  • The Dark Knight's AnandTech article, 2008-11-03.