Sandy Bridge

Sandy Bridge microarchitecture
Sandy Bridge die

Intel released Sandy Bridge in January 2011 as a major revision of Nehalem. Core i7 and i5 models were released simultaneously; a 2011-02 drop will include Core i3 models. All released and planned Sandy Bridge packages support on-die integrated GMA HD2 graphics processors and use the new LGA 1155 ("H2") socket, a successor to LGA 1156 ("Socket H"). The P67, H67, Q67 and B65 chipsets ("Platform Controller Hub" or PCH in recent Intel terminology) have been released to support Sandy Bridge, and are compatible with all current variants. Sandy Bridge uses the scalable ring-based bus designed for Larrabee (introduced on Nehalem EX) and supports AVX instructions. Sandy Bridge processors (and their on-die IGPs) use a 32nm process; Ivy Bridge, a planned minor successor, will use a 22nm process.

Architecture

Sandy Bridge is the first Intel processor to support AVX instructions, including the 256-bit YMM0..15 registers and the VEX instruction encoding.

  • XSAVE/XRSTOR operations will properly preserve and restore the full 256-bit registers.
    • May be executed from any processor state. The save area is passed as a memory operand; the state components saved are selected by XCR0 and the EDX:EAX mask.
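  • XMM SSE registers map to the lower 128 bits of the corresponding YMM registers. Legacy (non-VEX) SSE instructions leave the upper 128 bits untouched; VEX-encoded 128-bit instructions zero them.
    • YMM-oriented AVX and legacy XMM-oriented SSE instructions cannot be freely mixed: switching between them while the upper halves are dirty incurs a state-transition penalty, avoidable by issuing VZEROUPPER first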
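The XMM/YMM aliasing can be illustrated with a toy model. This is only a sketch: the class and method names are hypothetical, and it assumes the behavior documented for AVX, namely that a legacy SSE write preserves the upper 128 bits while a VEX-encoded 128-bit write zeroes them.

```python
MASK128 = (1 << 128) - 1
MASK256 = (1 << 256) - 1


class YmmRegister:
    """Toy model of one 256-bit YMM register and its XMM alias (hypothetical)."""

    def __init__(self, value=0):
        self.value = value & MASK256

    @property
    def xmm(self):
        # XMMn is the low 128 bits of YMMn.
        return self.value & MASK128

    def write_legacy_sse(self, xmm_value):
        # Assumed legacy (non-VEX) behavior: low half replaced, upper half kept.
        upper = self.value & ~MASK128
        self.value = upper | (xmm_value & MASK128)

    def write_vex128(self, xmm_value):
        # Assumed VEX-encoded 128-bit behavior: upper half zeroed.
        self.value = xmm_value & MASK128
```

A register holding data in its upper half keeps that data across `write_legacy_sse` but loses it on `write_vex128`, which is why the two encodings interact awkwardly.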

Microarchitecture

Sandy Bridge frontend (contrast with Loop Stream Detector of Nehalem)

See my architecture page for background information.

  • As opposed to Nehalem's Loop Stream Detector, there's a simple, direct-mapped/LRU ~1.5k μop cache
    • Unlike the P4's trace cache, we're caching at instruction granularity, not trace (superblock)
    • Augments 32K L1 icache, occupying 6K (FIXME: verify that it augments rather than overlaps!)
  • Branch prediction can use multiple target sizes, depending on relative distance, and multiple history widths, depending on branch variance
  • The 2 memory ports (formerly one for loads, one for store addresses) are now symmetric: each can handle a load or a store address
  • 100MHz Turbo Boost increments as compared to Nehalem's 133MHz, and more bins when multiple cores are in use
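The Turbo Boost bin arithmetic can be sketched as below. The function names are hypothetical, and the 3.4GHz/3.8GHz pair in the usage note is just an illustrative input; the real number of bins available depends on how many cores are active, which this sketch does not model.

```python
def turbo_bins(base_mhz, max_turbo_mhz, bin_mhz=100):
    """Whole Turbo Boost bins between the base and peak clocks."""
    if max_turbo_mhz < base_mhz:
        raise ValueError("max turbo clock is below the base clock")
    return (max_turbo_mhz - base_mhz) // bin_mhz


def turbo_clock(base_mhz, bins, bin_mhz=100):
    """Clock reached after stepping up the given number of bins."""
    return base_mhz + bins * bin_mhz
```

For example, `turbo_bins(3400, 3800)` gives 4 bins of 100MHz, whereas covering the same 400MHz with 133MHz bins would allow only 3 whole steps.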

Instruction Window

  • New physical register file (PRF) outside the OOO core (RRF) contains all in-flight operands; ops in window carry pointers only
    • Likely motivator: 256-bit operand width of AVX instructions
    • There's a PRF for FP and integer vectors, and one for pure 64-bit integers
    • A return to the much-maligned NetBurst architecture! (Core architecture replaced PRF's with RRF's)
    • Downside: need for dereferencing hardware (done via deeper pipelines, not longer cycles -- higher stall/flush costs)
  • Load buffers: 48 -> 64
  • Store buffers: 32 -> 36
  • Reservation stations: 36 -> 54
  • ROB entries: 128 -> 168
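The "pointers only" idea behind a PRF can be sketched with a toy renamer. Everything here (class names, the free-list policy, the register names) is a hypothetical illustration, not a description of Intel's actual rename hardware.

```python
class PhysicalRegisterFile:
    """Toy PRF: values live here; in-flight ops hold only indices into it."""

    def __init__(self, size):
        self.values = [None] * size
        self.free = list(range(size))

    def allocate(self):
        if not self.free:
            raise RuntimeError("PRF full: rename must stall")
        return self.free.pop(0)

    def release(self, index):
        # Called when the previous mapping of a register is retired.
        self.values[index] = None
        self.free.append(index)


class Renamer:
    """Maps architectural register names to PRF indices."""

    def __init__(self, prf, arch_regs):
        self.prf = prf
        self.map = {}
        for name in arch_regs:
            index = prf.allocate()
            prf.values[index] = 0
            self.map[name] = index

    def rename(self, dest, sources):
        """Return (dest_ptr, source_ptrs, old_dest_ptr) for one op."""
        source_ptrs = [self.map[s] for s in sources]
        old = self.map[dest]
        new = self.prf.allocate()
        self.map[dest] = new
        return new, source_ptrs, old
```

The op carried through the window is just the small tuple of indices; a 256-bit operand is never copied into the window, which is the motivation suggested above. The cost is the extra lookup into `values` when the op finally executes.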

Last-Level Cache

Sandy Bridge per-core memory
  • Shared LLC remains "sliced" (NUMA), with slices distributed among cores
  • L3 LLC moved out of "uncore" and now runs at core frequency
    • Downclocked cores mean the L3 can be underclocked relative to the IGP!
  • Cache pipeline is per-slice, rather than global to the cache
  • L3 latencies reduced
  • Three coherency domains

Interconnect

  • Larrabee's ring interconnect connects the cores, IGP, LLC slices, media engine and System Agent (northbridge). Four rings are used, the data ring being 32 bytes wide:
    • Data, Request, ACK, Snoop
  • Interconnect is built directly into L3 cache
  • Fully pipelined at core clock frequency
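A rough peak figure for the data ring follows from its width and clock. This sketch assumes one 32-byte transfer per core clock per ring stop, which is an idealization; the function name is hypothetical.

```python
def ring_data_bandwidth_gbs(core_clock_ghz, ring_width_bytes=32):
    """Peak GB/s through one data-ring stop, assuming one transfer per clock."""
    return ring_width_bytes * core_clock_ghz
```

At a 3.4GHz core clock this gives about 108.8 GB/s per stop. Because the ring runs at core frequency, the figure falls in proportion when cores are downclocked.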

Branding/Differentiation

See Intel's product matrix for complete, up-to-date information.

  • Core i7 exhibits the highest clock speeds and largest speedups from Turbo Boost. It uses SMT and supports vPro and AES-NI.
  • Core i5 does not support SMT. The Core i5-2500K does not support vPro, while other Core i5's do.
  • Core i3 does not support Turbo Boost, vPro, or AES-NI. It *does* support SMT.

Core i7 and Core i5 Sandy Bridge processors are thus far either quad- or dual-core, while Core i3 is strictly dual-core. Assuming SMT to be enabled where possible, this means Core i7 provides up to 8 hardware threads (logical processors), while Core i5 and Core i3 both provide 4. Without SMT, the Core i3 provides 2, while the others provide 4. Currently, the Core i3 is limited to 3MB of cache, while Core i5/i7 support up to 8MB (current i5's ship with no more than 6MB).
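The thread counts above are simple arithmetic, sketched here. The table assumes the quad-core i7/i5 and dual-core i3 configurations discussed in the paragraph; the names `LINEUP` and `lineup_threads` are hypothetical.

```python
# Assumed configurations: quad-core i7/i5, dual-core i3.
LINEUP = {
    "Core i7": {"cores": 4, "smt": True},
    "Core i5": {"cores": 4, "smt": False},
    "Core i3": {"cores": 2, "smt": True},
}


def logical_processors(cores, smt):
    """Hardware threads exposed to the OS: two per core with SMT, else one."""
    return cores * (2 if smt else 1)


def lineup_threads(smt_enabled=True):
    """Thread count per brand, optionally with SMT switched off everywhere."""
    return {
        brand: logical_processors(cfg["cores"], cfg["smt"] and smt_enabled)
        for brand, cfg in LINEUP.items()
    }
```

With SMT enabled where supported, the i7 reaches 8 threads and the i5 and i3 both land on 4, despite the i3 having half the physical cores.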

Release Sets

  • 2011-01-09 saw the first set of Sandy Bridge processors. TDP range: 45--95W. Freq range: 2.1GHz--3.4GHz, 3GHz--3.8GHz. Quad cores with 6--8MB LLC.
    • i7-2720QM, i5-2300, i5-2400, i5-2400S, i5-2500, i5-2500K, i5-2500S, i5-2500T, i7-2600, i7-2600K, i7-2600S, i7-2820QM, i7-2920XM, i7-2710QE, i7-2715QE
  • 2011-02-20 sees a second set, with more mobile options (and the first Sandy Bridge Core i3's). TDP range: 17W--65W. Freq range: 1.4GHz--3.3GHz, 3--3.3GHz. Dual cores with 3--6MB LLC.
    • i5-2540M, i5-2520M, i7-2620M, i3-2100, i3-2100T, i3-2120, i5-2390T, i5-2510E, i7-2629M, i7-2649M, i7-2657M, i7-2617M, i5-2537M, i5-2515E

Model suffixes:

  • K: unlocked
  • S: performance-optimized
  • T: power-optimized
  • M: mobile
  • QM: quad mobile
  • XM: eXtreme mobile
  • E: embedded
  • QE: quad embedded
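The suffix legend lends itself to a small decoder. This is a sketch only: the regular expression assumes the four-digit 2xxx-style model strings listed above, and `decode_model` is a hypothetical helper, not any Intel tool.

```python
import re

# Suffix meanings, taken from the legend above.
SUFFIXES = {
    "K": "unlocked",
    "S": "performance-optimized",
    "T": "power-optimized",
    "M": "mobile",
    "QM": "quad mobile",
    "XM": "eXtreme mobile",
    "E": "embedded",
    "QE": "quad embedded",
}

# Assumed shape: brand, four-digit number, optional one- or two-letter suffix.
MODEL_RE = re.compile(r"^(i[357])-(\d{4})([A-Z]{0,2})$")


def decode_model(model):
    """Split a model string such as 'i7-2600K' into its parts."""
    match = MODEL_RE.match(model)
    if match is None:
        raise ValueError(f"unrecognized model string: {model!r}")
    brand, number, suffix = match.groups()
    if suffix and suffix not in SUFFIXES:
        raise ValueError(f"unknown suffix: {suffix!r}")
    return {
        "brand": "Core " + brand,
        "number": int(number),
        "suffix": suffix,
        "meaning": SUFFIXES.get(suffix),  # None for parts with no suffix
    }
```

For instance, `decode_model("i7-2720QM")["meaning"]` yields "quad mobile", and a suffix-less part such as "i5-2400" decodes with a meaning of `None`.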

Processor Support

LGA1155 chipset details

Northbridge ("System Agent")

  • Different clock and power plane
  • Provides an IOMMU
  • 16 PCIe 2.0 lanes
  • Dual-channel DDR3
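Peak bandwidths for the System Agent's links can be estimated as below. This sketch assumes the standard PCIe 2.0 figures (5 GT/s per lane, 8b/10b encoding) and a 64-bit DDR3 channel; the DDR3 speed grade is a parameter, and the function names are hypothetical.

```python
def pcie2_bandwidth_gbs(lanes):
    """Peak GB/s per direction for a PCIe 2.0 link of the given width."""
    raw_gbit = lanes * 5.0        # 5 GT/s per lane
    return raw_gbit * 8 / 10 / 8  # strip 8b/10b overhead, then bits -> bytes


def ddr3_bandwidth_gbs(mega_transfers, channels=2, bus_bytes=8):
    """Peak GB/s for DDR3 at the given MT/s across the given channels."""
    return mega_transfers * bus_bytes * channels / 1000
```

Under these assumptions the 16 PCIe lanes peak at 8 GB/s per direction, and dual-channel DDR3-1333 at roughly 21.3 GB/s, so memory rather than PCIe sets the ceiling for a discrete GPU streaming from system RAM.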

IGP

GMA HD2 improves transcendental math hardware over HD1, and expands the register file. The 2000-series IGP has 6 execution units, while the 3000 has 12. Currently, 3000-series IGP's are reserved for K-series processors (the P67 performance-oriented chipset cannot make use of the IGP, and requires a discrete graphics adapter). The IGP has limited locking on its clock -- the H67 chipset can control IGP multipliers.

Unlike Larrabee's fully programmable shaders (suitable for GPGPU), Sandy Bridge IGP makes extensive use of fixed-function components. The instruction set is said to closely parallel the DirectX 10 API. There are a fixed 120 registers per thread, as opposed to the dynamically partitionable register file of previous Intel HD Graphics.

The IGP runs on its own voltage and frequency planes, and supports its own Turbo Boost. It supports HDMI 1.4.

MFX

The Multi-Format Codec (MFX) engine is fixed functionality for high-performance, low-power transcoding. It features intensely parallel hardware.

Chipset

Chipsets currently use a 65nm process. The PCH is connected to the processor via a 20Gb/s Direct Media Interface 2.0 bus. The P67 explicitly supports unlocked (overclockable) memory, power and cores (only K-series i7 and i5 processors provide unlocked multipliers, thus a P67+K-series is required for core overclocking). The P67 does not support the IGP, but does provide lane-splitting for SLI/CrossFire setups. While the H67 locks memory to DDR3-1333, the P67 can likely use DDR3-1600, DDR3-1866 and DDR3-2133.

Implications

It's impossible to fully utilize the hardware capabilities of K-series processors or P67 motherboards.

  • The K-series touts two advantages: an unlocked clock multiplier, and a 3000-series IGP. To take advantage of the unlocked multiplier, a P67 PCH is required, but the P67 does not support the IGP (overclockers are presumed, I suppose, to use discrete GPUs)!
  • The P67 supports only LGA 1155 processors thus far, all of which have IGPs unusable by the P67.
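The mismatch can be made explicit with a small capability table. This is a sketch built only from the claims on this page (P67: core overclocking but no IGP; H67: IGP but no core overclocking); the table and function names are hypothetical.

```python
# Capabilities as described on this page, not an exhaustive chipset matrix.
CHIPSETS = {
    "P67": {"igp": False, "core_overclock": True},
    "H67": {"igp": True, "core_overclock": False},
}


def usable_features(chipset, model):
    """Which CPU features a given chipset/processor pairing can actually use."""
    caps = CHIPSETS[chipset]
    unlocked = model.endswith("K")  # only K-series parts have unlocked multipliers
    return {
        "core_overclock": unlocked and caps["core_overclock"],
        "igp": caps["igp"],
    }
```

No row of the table enables both features, so a K-series part always leaves either its unlocked multiplier or its 3000-series IGP idle.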

Sources