
Threadripper L3 CPUID Strangeness

From dankwiki
Revision as of 10:27, 5 February 2021 by Dank

dankblog! 2021-02-05, 0356 EDT, at the danktower

i was updating copyrights on libtorque, a project from the 2010 offering of CSE 6230 with Professor Richard Vuduc. libtorque was tremendous fun to work on, and it let me distill many thoughts i'd been kicking around for a few years, but i consider it a research project and not an industrial-strength library. i don't touch it terribly often, though i do check for compiler warnings every few gcc releases. behold the presentation i gave on it for GT's Arch-Whiskey seminar; yes, loves, i explicitly selected that background, and looked upon it, and thought it Good. meth is a hell of a drug.

on my AMD 3970X Threadripper, the archdetect program included with libtorque failed. I traced this down to the function decode_amd_l23cache() in my x86 hardware discovery code. Intel and AMD caches were at one time described by an arbitrary map from integers to complete cache descriptions, where each integer meant a completely different set of cache parameters, leading to code like:

 { .descriptor = 0xe2,
   .linesize = 64,
   .totalsize = 2 * 1024 * 1024,
   .associativity = 16,
   .level = 3,
   .memtype = MEMTYPE_UNIFIED,
 },
 { .descriptor = 0xe3,
   .linesize = 64,
   .totalsize = 4 * 1024 * 1024,
   .associativity = 16,
   .level = 3,
   .memtype = MEMTYPE_UNIFIED,
 },

and to a great deal of misery and frustration, and daydreams of becoming a stripper because you sure as shit don't want to be a programmer. writing this is about as fun as bobbing for apples in a sun-seared bucket of curdling possum shit. every time a new microarchitecture employed a different cache size, thus mandating a new descriptor, your discovery failed until somebody added that descriptor to the table. in fact, my very first PR at GOOG was to add descriptors for whatever rhodium-crusted CapEx-demolishing Xeons we were fielding, as apparently 70,000 engineers had until then been content to just read "Couldn't discover cache size for processor type FOO!" twice in their logs every time they ran a binary, before bitching on eng-misc for six hours or cosplaying a union. perhaps it was lost in the 400KB of messages about your single-core text search program being unable to elect a Paxos leader. google-sized problems, baybee!

Thankfully, both Intel and AMD unfucked themselves late in the aughts, or in any case unfucked this small fuckgrove, and introduced more sensibly structured CPUID results. AMD provides CPUID leaf 0x80000006, "L2/L3 Cache and TLB Identification", whose EDX register returns the "L3 Cache Identifiers", structured thusly:

 Bits   Description
 31:18  L3Size: L3 cache size. Specifies that the L3 cache size is within the following range:
        (L3Size[31:18] * 512KB) ≤ L3 cache size < ((L3Size[31:18] + 1) * 512KB)
 17:16  Reserved
 15:12  L3Assoc: L3 cache associativity:
          0h  L2/L3 cache is disabled.
          1h  Direct mapped.
          2h  2-way associative.
          4h  4-way associative.
          6h  8-way associative.
          8h  16-way associative.
          Ah  32-way associative.
          Bh  48-way associative.
          Ch  64-way associative.
          Dh  96-way associative.
          Eh  128-way associative.
          Fh  Fully associative.
          All other encodings are reserved.
 11:8   L3LinesPerTag: L3 cache lines per tag.
 7:0    L3LineSize: L3 cache line size in bytes.

sounds good, and I've used this for over a decade to size up my MLCs and LLCs (mid- and last-level caches), being a faithful acolyte of pebble engineering.

alas, my 3970X ate shit and died with the last words "mask: 9 size: 134217728 lsize: 64 assoc: 0 lines: 2097152", indicating a failure to discover associativity on 0x80000006:EDX. and indeed, what the fuck: edx held 0x04009140. now, unless the New Math has changed things up, one extracts bits 15:12 by shifting right by 12 and masking against 0xF, yielding... 9. scrotumtightening shitfucker! 9! an encoding that isn't even in the table! i assumed i'd fucked up the CPUID state machine somehow, and consulted cpuid -r:

0x80000006 0x00: eax=0x48006400 ebx=0x68006400 ecx=0x02006140 edx=0x04009140

mother of God, i thought. i've forgotten how to mask bits, or perhaps how to count to 12, or maybe even how to spell edx. i went to check the decoded cpuid output...

   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x40 (64)
      lines per tag         = 0x1 (1)
      associativity         = 0x9 (9)
      size (in 512KB units) = 0x100 (256)

dogs fucked the Pope; no fault of mine! now, hardware architects can do some very strange things, and indeed I saw an Intel i7 of the Broadwell era with a twice-cursed 6-way TLB, and the 96KB L2 of the Alpha 21164 was famously ménage à associative, as befitted the carefree California culture of the mid-90s. tupac was still alive (doing 187s), big pete wilson was governing (proposing 187s), and the homeless were conveniently hidden behind piles of AOL install media (each man, woman, and child on earth had approximately 187 AOL CDs). I once mailed Yale Patt about the 3-way associativity, and he responded with twelve pages of baseball stats descending into a challenge of pistols at dawn. "ps I'll predict your branches you brain-dead ass-eyed Atlanta son of a bitch. pps do you know where I can find any more female GRAs?" but i digress.

so...what's goin' on here? i've gotta get back to profitable work for the moment, but watch this space for the inevitable solution. hack on!