
Sandy Bridge

Revision as of 07:50, 12 January 2011 by Dank
Sandy Bridge microarchitecture
Sandy Bridge die

Intel released Sandy Bridge in January 2011 as the major successor to Nehalem. Core i7, i5 and i3 variants were released simultaneously. Sandy Bridge can support an on-die integrated graphics processor (IGP). All Sandy Bridge processors to date use the new LGA 1155 socket ("Socket H2"), the successor to LGA 1156 ("Socket H"). The P67, H67, Q67 and B65 chipsets ("Platform Controller Hub" or PCH in recent Intel terminology) have been released to support Sandy Bridge, and are compatible with all current variants. Sandy Bridge uses the ring-based bus designed for Larrabee (introduced on Nehalem EX) and supports AVX instructions. Sandy Bridge processors (and their on-die IGP) are built on a 32nm process.

Microarchitecture

Sandy Bridge frontend (contrast with Loop Stream Detector of Nehalem)

See my architecture page for background information.

  • As opposed to Nehalem's Loop Stream Detector, there's a simple, direct-mapped/LRU 1.5k μop cache
  • Branch prediction can use multiple target sizes, depending on relative distance, and multiple history widths, depending on branch variance
  • 2 load/store ports using symmetric addressing (2 loads or stores can execute at once)
  • 100MHz Turbo Boost increments as compared to Nehalem's 133MHz, and more bins when multiple cores are in use
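
The Turbo Boost bin arithmetic can be sketched as below. Only the two step sizes come from the list above; the base clock and bin count in the example are made-up illustrative values, and `turbo_frequency_mhz` is a hypothetical helper name.

```python
# Turbo Boost step ("bin") sizes, taken from the list above.
SANDY_BRIDGE_BIN_MHZ = 100
NEHALEM_BIN_MHZ = 133


def turbo_frequency_mhz(base_mhz, bins, bin_mhz):
    """Clock reached after Turbo Boost adds `bins` steps to the base clock."""
    return base_mhz + bins * bin_mhz


# Hypothetical part: 3300MHz base clock with 4 bins of headroom.
print(turbo_frequency_mhz(3300, 4, SANDY_BRIDGE_BIN_MHZ))  # 3700
# The same bin count with Nehalem-sized steps, for comparison.
print(turbo_frequency_mhz(3300, 4, NEHALEM_BIN_MHZ))       # 3832
```

Smaller steps give the power controller finer-grained choices for the same thermal headroom.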

Instruction Window

  • New physical register file (PRF), replacing the retirement register file (RRF) scheme, holds all in-flight operands outside the OOO window; ops in the window carry pointers only
    • Likely motivator: 256-bit operand width of AVX instructions
  • Load buffers: 48 -> 64
  • Store buffers: 32 -> 36
  • Reservation stations: 36 -> 54
  • ROB entries: 128 -> 168

Last-Level Cache

Sandy Bridge per-core memory
  • Shared LLC remains "sliced" (NUMA), with slices distributed among cores
  • L3 LLC moved out of uncore and now runs at core frequency
    • Downclocked cores mean the L3 can be underclocked relative to the IGP!
  • Cache pipeline is per-slice, rather than global to the cache
  • L3 latencies reduced
  • Three coherency domains

Interconnect

  • Larrabee's ring interconnect connects the cores, IGP, LLC slices, media engine and System Agent (northbridge). Four 32-byte rings are used:
    • Data, Request, ACK, Snoop
  • Interconnect is built directly into L3 cache
  • Fully pipelined at core clock frequency
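
A rough upper bound on data-ring throughput follows from the two figures above (32 bytes wide, clocked with the cores). The sketch assumes one 32-byte transfer per ring stop per cycle, which is an idealization, and the 3.0GHz clock is only an example value; `ring_bandwidth_gb_per_s` is a hypothetical helper name.

```python
RING_WIDTH_BYTES = 32  # ring width, from the list above


def ring_bandwidth_gb_per_s(core_clock_ghz, width_bytes=RING_WIDTH_BYTES):
    """Peak GB/s through one ring stop, assuming one transfer per core cycle."""
    return core_clock_ghz * width_bytes


# Example clock (an assumption, not a figure from this page): 3.0GHz.
print(ring_bandwidth_gb_per_s(3.0))  # 96.0
```

Because the ring is clocked with the cores, this bound scales down when the cores are downclocked.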

Branding/Differentiation

  • Core i7 exhibits the highest clock speeds and largest speedups from Turbo Boost. It supports SMT, vPro and AES-NI.
  • Core i5 does not support SMT. The Core i5-2500K does not support vPro, while other Core i5s do.
  • Core i3 does not support Turbo Boost, vPro, or AES-NI. It *does* support SMT.

All current Core i7 and Core i5 Sandy Bridge processors are quad-core, while Core i3 is dual-core. Assuming SMT to be enabled where possible, this means Core i7 provides 8 hardware threads (logical processors), while Core i5 and Core i3 both provide 4. Without SMT, the Core i3 provides 2, while the others provide 4. Currently, the Core i3 is limited to 3MB of cache, while Core i5/i7 support up to 8MB (current i5s ship with 6MB).
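
The thread counts above are just cores times threads per core. A minimal sketch, using only the core counts and SMT support stated in this section (the `Brand` type and `logical_processors` helper are names invented for illustration):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brand:
    cores: int
    smt: bool  # whether the brand supports SMT (2 threads per core)


# Core counts and SMT support as described in this section.
LINEUP = {
    "Core i7": Brand(cores=4, smt=True),
    "Core i5": Brand(cores=4, smt=False),
    "Core i3": Brand(cores=2, smt=True),
}


def logical_processors(brand, smt_enabled=True):
    """Hardware threads visible to the OS for one processor."""
    threads_per_core = 2 if (brand.smt and smt_enabled) else 1
    return brand.cores * threads_per_core


for name, brand in LINEUP.items():
    print(name, logical_processors(brand), logical_processors(brand, smt_enabled=False))
```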

Processor Support

LGA1155 chipset details

Northbridge ("System Agent")

  • Different clock and power plane
  • Provides an IOMMU
  • 16 PCIe 2.0 lanes
  • Dual-channel DDR3
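
For a sense of scale, peak bandwidth of the two interfaces above can be estimated as follows. The PCIe 2.0 signalling rate (5 GT/s per lane, 8b/10b encoding) and the 64-bit DDR3 channel width are standard figures rather than anything stated on this page, and DDR3-1333 is used only as an example speed grade. Both helper names are hypothetical.

```python
def pcie2_bandwidth_gb_per_s(lanes):
    """Peak payload GB/s per direction for PCIe 2.0.

    Each lane signals at 5 GT/s; 8b/10b encoding leaves 8 payload bits
    out of every 10 transferred, and 8 bits make a byte.
    """
    return lanes * 5.0 * 8 / 10 / 8


def ddr3_bandwidth_gb_per_s(mega_transfers_per_s, channels=2):
    """Peak GB/s for DDR3: each channel is 64 bits (8 bytes) wide."""
    return channels * 8 * mega_transfers_per_s / 1000


print(pcie2_bandwidth_gb_per_s(16))             # 8.0
print(round(ddr3_bandwidth_gb_per_s(1333), 1))  # 21.3
```

These are theoretical ceilings; sustained throughput is lower in practice.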

IGP

The 2000-series IGP has 6 execution units, while the 3000 has 12. Currently, 3000-series IGPs are reserved for K-series processors (the P67 performance-oriented chipset cannot make use of the IGP, and requires a discrete graphics adapter). The IGP has limited locking on its clock -- the H67 chipset can control IGP multipliers.

Unlike Larrabee's fully programmable shaders (suitable for GPGPU), Sandy Bridge IGP makes extensive use of fixed-function components. The instruction set is said to closely parallel the DirectX 10 API. There are a fixed 120 registers per thread, as opposed to the dynamically partitionable register file of previous Intel HD Graphics.

The IGP runs on its own voltage and frequency planes, and supports its own Turbo Boost.

MFX

The Multi-Format Codec (MFX) engine is fixed-function hardware for high-performance, low-power transcoding. Its design is heavily parallel.

Chipset

Chipsets are currently using a 65nm process. The PCH is connected to the processor via a 20Gb/s Direct Media Interface (DMI). The P67 explicitly supports unlocked (overclockable) memory, power and cores (only K-series i7 and i5 processors provide unlocked multipliers, thus a P67+K-series is required for core overclocking). The P67 does not support the IGP, but does provide lane-splitting for SLI/CrossFire setups. While the H67 locks memory to 1333MHz (DDR3-1333), the P67 can likely use DDR3-1866 and DDR3-2133.

Implications

It's impossible to fully utilize the hardware capabilities of K-series processors or P67 motherboards.

  • The K-series touts two advantages: an unlocked clock multiplier, and a 3000-series IGP. To take advantage of the unlocked multiplier, a P67 PCH is required, but the P67 does not support the IGP (overclockers are presumed, I suppose, to use discrete GPUs)!
  • The P67 supports only LGA 1155 processors thus far, all of which have IGPs unusable by the P67.
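
The mismatch described above can be written out as a small feature check. This is only a sketch of the argument: the types and the `usable_features` helper are invented for illustration, and the H67's capabilities are inferred from the text (it drives the IGP; core overclocking needs a P67).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chipset:
    name: str
    core_overclocking: bool  # can change the core multiplier
    igp_support: bool        # can make use of the on-die IGP


@dataclass(frozen=True)
class Processor:
    name: str
    unlocked_multiplier: bool
    igp_series: int


P67 = Chipset("P67", core_overclocking=True, igp_support=False)
# Inferred from the text, not stated outright.
H67 = Chipset("H67", core_overclocking=False, igp_support=True)
I5_2500K = Processor("Core i5-2500K", unlocked_multiplier=True, igp_series=3000)


def usable_features(cpu, pch):
    """Which K-series selling points survive a given chipset pairing."""
    return {
        "core overclocking": cpu.unlocked_multiplier and pch.core_overclocking,
        "IGP": pch.igp_support,
    }


for pch in (P67, H67):
    print(pch.name, usable_features(I5_2500K, pch))
```

Neither pairing enables both features at once, which is the point of the list above.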
