Nehalem

Move from the FSB to QPI (QuickPath Interconnect), with a 133MHz base clock against which the CPU clock is multiplied
Reintroduction of SMT (HyperThreading)
Store-forwarding aliasing issue on 4k strides.
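
A minimal sketch of how one might test for the 4k case, assuming the issue arises when a load and an earlier store go to different addresses that share their low 12 bits (a multiple of 4096 bytes apart). The helper name same_4k_offset is made up for illustration and is not from any Intel tool or manual.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: true when two distinct addresses have identical
   low 12 bits, i.e. they differ by a nonzero multiple of 4096 bytes.
   Assumption: this is the address pattern behind the 4k aliasing issue. */
static bool same_4k_offset(const void *load_addr, const void *store_addr)
{
    uintptr_t a = (uintptr_t)load_addr;
    uintptr_t b = (uintptr_t)store_addr;
    return a != b && ((a ^ b) & 0xFFFu) == 0;
}
```

If source and destination buffers of a copy loop trip this check, padding one of them by a cache line or so is the usual thing to try.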
Double-pumped FP SSE + integer SSE/x87 + load + store units
Fetch up to 16 bytes of aligned instructions from cache per cycle.
Up to 4 instructions, no more than 1 complex (this does not necessarily mean 1 µop), decoded per cycle.
Macro-fusion now also works in 64-bit mode.
Instructions with more than 4 µops are fed from MSROM, and will take more than one cycle in the Instruction Decoder Queue.
Forwarding results between integer, integer SIMD, and FP units adds latency compared to forwards within the domain.
One register may be written per cycle.
48 load buffers (up from 32), 32 store buffers (up from 20), 10 fill buffers.
36 reservation stations (up from 32), 128 ROB entries (up from 96).
Calltrace cache of 16 entries.
2-way loop end BTB for every 16 bytes, 4-way general BTB.
Loop Stream Detector replays from IDQ if the loop consists of:
- 4 16-byte icache fetches or less
- 28 total µops or less
- 4 taken branches or less, none of them a RET
- preferably more than 64 iterations?
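
The hard limits above can be sketched as a predicate. This is only an illustration of the listed criteria: the struct, its field names, and both function names are hypothetical, and the iteration-count hint is left out because it is uncertain.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of a loop body; field names are illustrative. */
struct loop_shape {
    unsigned fetch_blocks;   /* aligned 16-byte icache fetches spanned */
    unsigned uops;           /* total µops in the loop body */
    unsigned taken_branches; /* taken branches per iteration */
    bool     has_ret;        /* true if any branch is a RET */
};

/* Number of aligned 16-byte blocks touched by code occupying
   [start, start + len). */
static unsigned fetch_blocks_spanned(uintptr_t start, size_t len)
{
    if (len == 0)
        return 0;
    return (unsigned)(((start + len - 1) >> 4) - (start >> 4) + 1);
}

/* True when the loop meets the listed limits for replay from the IDQ. */
static bool lsd_candidate(const struct loop_shape *l)
{
    return l->fetch_blocks <= 4 &&
           l->uops <= 28 &&
           l->taken_branches <= 4 &&
           !l->has_ret;
}
```

Note that alignment matters: a 64-byte loop starting on a 16-byte boundary spans 4 fetches, while the same loop starting 8 bytes later spans 5 and no longer qualifies.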
Be sure to use register parameter-passing conventions, not the stack, to avoid stalls on store-forward of high-latency floating point stores.
Peak issue rate of 1 128-bit load and 1 128-bit store per cycle.
