Threadripper L3 CPUID Strangeness

Latest revision as of 21:57, 5 February 2021

dankblog! 2021-02-05, 0356 EDT, at the danktower

i was chanting "all rights worth shit" whilst casting meaningless Annual Copyright Update cantrips upon libtorque, a project from 2010's CSE 6230 with Professor Richard Vuduc, who can bench press you with his mind. libtorque was tremendous fun to work on, leading me to distill many thoughts that i'd been kicking around for a few years, but i consider it a research project and not an industrial-strength library. i don't touch it terribly often, though I do check for compiler warnings every few gcc releases. behold! the presentation i gave on it for GT's Arch-Whiskey seminar--yes, loves, i explicitly selected that background. i looked upon it approvingly, like Keats unto Chapman's Homer. i said, softly, "nicholas, you're really killing the ol' presentation backgrounds, motherfuckers are gonna be cheering when they aren't squinting". i might or might not have sung "don't let me get in my zone" with Kanye West. i thought it Good.

indeed, meth is one hell of a drug.

on my AMD 3970X Threadripper, the archdetect program included with libtorque failed out. I traced this down to the function decode_amd_l23cache() in my x86 hardware discovery. Intel and AMD caches were at one time defined by a disordered map from integers to complete cache descriptions, as in each integer meant a completely different set of cache parameters, leading to code like:

 { .descriptor = 0xe2,
   .linesize = 64,
   .totalsize = 2 * 1024 * 1024,
   .associativity = 16,
   .level = 3,
   .memtype = MEMTYPE_UNIFIED,
 },
 { .descriptor = 0xe3,
   .linesize = 64,
   .totalsize = 4 * 1024 * 1024,
   .associativity = 16,
   .level = 3,
   .memtype = MEMTYPE_UNIFIED,
 },

and to a great deal of misery and frustration, and daydreams of becoming a stripper because you sure as shit don't want to be a programmer. writing this is about as fun as bobbing for apples in a skinsearing hot metal washbucket of curdling possum smeg. every time a new microarchitecture employed a different cache size, thus mandating a new CPUID number, your discovery failed. in fact, my very first PR at GOOG was to add descriptors for whatever rhodium-crusted CapEx-demolishing Xeons we were fielding, as apparently 70,000 engineers had until then been content to just read "Couldn't discover cache size for processor type FOO!" twice in their logs every time they ran a binary before bitching on eng-misc for six hours or cosplaying a union. perhaps it was lost in the 400KB of messages about your 1.4GB HelloWorld go binary failing to elect a paxos leader. google-sized problems, baybee!
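to make the failure mode concrete, here is a minimal sketch of that kind of descriptor lookup. it is not the libtorque code: the struct layout, the names <tt>cache_desc</tt>, <tt>known_caches</tt> and <tt>lookup_cache()</tt> are hypothetical, the memtype field is dropped for brevity, and the table holds only the two entries quoted above. any descriptor byte the table has never heard of simply falls out the bottom.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for a descriptor table. Only the two
 * entries quoted in the text are present. */
typedef struct {
    uint8_t descriptor;      /* descriptor byte reported by CPUID */
    unsigned linesize;       /* bytes per cache line */
    unsigned long totalsize; /* total cache size in bytes */
    unsigned associativity;  /* number of ways */
    unsigned level;          /* cache level (1, 2, 3) */
} cache_desc;

static const cache_desc known_caches[] = {
    { 0xe2, 64, 2ul * 1024 * 1024, 16, 3 },
    { 0xe3, 64, 4ul * 1024 * 1024, 16, 3 },
};

/* Linear search of the table. Returns NULL for any descriptor that is not
 * listed, which is exactly what happens when new silicon ships with a
 * descriptor the table predates. */
static const cache_desc *lookup_cache(uint8_t descriptor)
{
    size_t n = sizeof(known_caches) / sizeof(*known_caches);
    for (size_t i = 0; i < n; ++i) {
        if (known_caches[i].descriptor == descriptor) {
            return &known_caches[i];
        }
    }
    return NULL;
}
```

the only fix for a NULL here is to ship a new table, which is the whole problem.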

thankfully, Blue and Red Teams comutually unfucked themselves late in the aughts, or unfucked in any case this small fuckgrove, and introduced more sensibly-structured CPUID results. AMD provides CPUID leaf 0x80000006, "L2/L3 Cache and TLB Identification". The EDX register returns "L3 Cache Identifiers", structured thusly:

 Bits   Description
 31:18  L3Size: L3 cache size. Specifies the L3 cache size is within the following range:
        (L3Size[31:18] * 512KB) ≤ L3 cache size < ((L3Size[31:18]+1) * 512KB)
 17:16  Reserved
 15:12  L3Assoc: L3 cache associativity:
          0h  L2/L3 cache is disabled.
          1h  Direct mapped.
          2h  2-way associative.
          4h  4-way associative.
          6h  8-way associative.
          8h  16-way associative.
          Ah  32-way associative.
          Bh  48-way associative.
          Ch  64-way associative.
          Dh  96-way associative.
          Eh  128-way associative.
          Fh  Fully associative.
          All other encodings are reserved.
 11:8   L3LinesPerTag: L3 cache lines per tag.
 7:0    L3LineSize: L3 cache line size in bytes.
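the L3Assoc field is an encoding, not a count, so it wants a small translation step. a sketch of that mapping, built only from the table above; the function name <tt>l3assoc_ways()</tt> and the negative sentinel values are my own invention for illustration, not anything from AMD or libtorque.

```c
#include <stdint.h>

/* Translate the 4-bit L3Assoc encoding into a way count.
 * Returns 0 if the cache is disabled, -1 for fully associative,
 * and -2 for any encoding the table marks as reserved. */
static int l3assoc_ways(uint32_t enc)
{
    switch (enc & 0xfu) {
        case 0x0: return 0;   /* disabled */
        case 0x1: return 1;   /* direct mapped */
        case 0x2: return 2;
        case 0x4: return 4;
        case 0x6: return 8;
        case 0x8: return 16;
        case 0xa: return 32;
        case 0xb: return 48;
        case 0xc: return 64;
        case 0xd: return 96;
        case 0xe: return 128;
        case 0xf: return -1;  /* fully associative */
        default:  return -2;  /* reserved encoding */
    }
}
```

a caller that treats -2 as "could not discover associativity" gets the same kind of hard failure as the old descriptor tables, just with fewer entries to maintain.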

sounds good, and I've used this for over a decade to size up my MLCs and LLCs, being a faithful acolyte of pebble engineering.

as you can see, my 3970X ate shit and died with the last words mask: 9 size: 134217728 lsize: 64 assoc: 0 lines: 2097152, indicating a failure to discover associativity on 0x80000006:EDX. and indeed, what the fuck, 0x04009140 in edx. now, unless the New Math has changed things up, one extracts bits 15:12 by shifting 12 right and masking against 0xF. yielding....9. scrotumtightening shitfucker! 9! i assumed i'd fucked up the CPUID state machine somehow, and consulted cpuid -r:

0x80000006 0x00: eax=0x48006400 ebx=0x68006400 ecx=0x02006140 edx=0x04009140
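for anyone who wants to check the arithmetic at home, a minimal sketch of pulling all four fields out of that EDX value using the bit positions from the table. the struct and function names (<tt>l3_fields</tt>, <tt>decode_l3_edx()</tt>) are made up for this example.

```c
#include <stdint.h>

/* Raw fields of CPUID 0x80000006 EDX, laid out as in the table. */
typedef struct {
    unsigned size_units;    /* bits 31:18, L3Size in 512KB units */
    unsigned assoc;         /* bits 15:12, raw L3Assoc encoding */
    unsigned lines_per_tag; /* bits 11:8 */
    unsigned line_size;     /* bits 7:0, bytes */
} l3_fields;

/* Shift each field down to bit 0, then mask to its width. */
static l3_fields decode_l3_edx(uint32_t edx)
{
    l3_fields f;
    f.size_units    = (edx >> 18) & 0x3fffu; /* 14 bits */
    f.assoc         = (edx >> 12) & 0xfu;    /* 4 bits */
    f.lines_per_tag = (edx >> 8) & 0xfu;     /* 4 bits */
    f.line_size     = edx & 0xffu;           /* 8 bits */
    return f;
}
```

feeding it 0x04009140 gives 256 units of 512KB (the 134217728 bytes from the failure message), a 64-byte line, one line per tag, and an associativity encoding of 9.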

mother of God, i thought. i've forgotten how to mask bits, or perhaps how to count to 12, or maybe even how to spell edx. i went to check the decoded cpuid output...

   L3 cache information (0x80000006/edx):
      line size (bytes)     = 0x40 (64)
      lines per tag         = 0x1 (1)
      associativity         = 0x9 (9)
      size (in 512KB units) = 0x100 (256)

[image: sometimes you got a magazine and it was just AOL cds]

dogs fucked the Pope; no fault of mine! now hardware architects can do some very strange things, and indeed I saw an Intel i7 of the broadwell era with a twice-cursed 6-way TLB, and the 96KB L2 of the Alpha 21164 was famously associativé à trois, as befitted the carefree California culture of the mid-90s. tupac was still alive (doing 187s), big pete wilson was governing (proposing 187s), and the homeless were conveniently hidden behind piles of AOL install media (each man, woman, and child on earth had approximately 187 AOL cds). I once mailed Yale Patt about the 3-way associativity, and he responded with twelve pages of baseball stats descending into a challenge of pistols at dawn. "ps I'll predict your branches you brain-dead ass-eyed Atlanta son of a bitch pps send nudes" but i digress. anyway, however (justifiably) drunk with dankness the AMD boys are, i very much doubt Dr. Lisa "the Su is for Superwoman" Su is letting enneadic caches out the door.

so...what's goin' on here? i've gotta get back to profitable work for the moment, but watch this space for the inevitable solution. hack on!

several Newports later

ahhh, the Family 17h Processor Programming Reference explains that this scheme is dank no more, and the fancy new 0x8000001D leaf must be used in its place. using said leaf, the results are what we expect, and my $2000 processor's big swinging cache is properly detected. i guess even a 3990X doesn't have enough cores to update the fucking cpuid pdf. they're paying by the electron down in santa clara--it's mad max times.

      --- cache 3 ---
      type                            = unified (3)
      level                           = 0x3 (3)
      self-initializing               = true
      fully associative               = false
      extra cores sharing this cache  = 0x7 (7)
      line size in bytes              = 0x40 (64)
      physical line partitions        = 0x1 (1)
      number of ways                  = 0x10 (16)
      number of sets                  = 16384
      write-back invalidate           = true
      cache inclusive of lower levels = false
      (synth size)                    = 16777216 (16 MB)

one must issue a 0x80000000 ExtendedMaxSupport leaf request, and if 0x8000001D ExtendedCacheProperties is available, that ought be used (it also seems necessary to execute 0x80000001 FeatureExtId and check for the TopologyExtensions bit (0x400000, i.e. bit 22) in ECX). iterate on the leaf, bumping the subleaf index in ECX each time, until the cache type in EAX[4:0] comes back 0. this is actually a nice upgrade in functionality, especially as you can now detect sharing of caches among cores, inclusivity/exclusivity, and whether WBINVD or INVD is in use.
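a rough sketch of the procedure just described, kept as pure functions over register values so it runs anywhere. the names (<tt>cache_props</tt>, <tt>have_leaf_1d()</tt>, <tt>decode_leaf_1d()</tt>) are hypothetical and this is not the libtorque implementation. the field layout is the one the decoded <tt>cpuid</tt> output above reflects: counts in EBX and ECX are stored minus one. on GCC or clang the registers themselves could be filled with <tt>__get_cpuid_count()</tt> from <tt>&lt;cpuid.h&gt;</tt>, calling it with subleaf 0, 1, 2, ... until the decode reports no more caches.

```c
#include <stdbool.h>
#include <stdint.h>

/* Decoded view of one subleaf of CPUID 0x8000001D. */
typedef struct {
    unsigned type;          /* EAX[4:0]: 0 none, 1 data, 2 instruction, 3 unified */
    unsigned level;         /* EAX[7:5] */
    bool self_init;         /* EAX[8] */
    bool fully_assoc;       /* EAX[9] */
    unsigned extra_sharers; /* EAX[25:14], raw "extra cores sharing" count */
    unsigned line_size;     /* EBX[11:0] + 1, bytes */
    unsigned partitions;    /* EBX[21:12] + 1 */
    unsigned ways;          /* EBX[31:22] + 1 */
    unsigned sets;          /* ECX + 1 */
    bool wbinvd;            /* EDX[0] */
    bool inclusive;         /* EDX[1] */
    uint64_t size;          /* ways * partitions * line_size * sets */
} cache_props;

/* Leaf is usable if the max extended leaf (EAX of 0x80000000) reaches it
 * and TopologyExtensions (bit 22 of ECX from 0x80000001) is set. */
static bool have_leaf_1d(uint32_t max_ext_leaf, uint32_t ecx_80000001)
{
    return max_ext_leaf >= 0x8000001du && (ecx_80000001 & 0x400000u) != 0;
}

/* Decode one subleaf. Returns false when the cache type is 0, which is
 * the signal to stop iterating over subleaves. */
static bool decode_leaf_1d(uint32_t eax, uint32_t ebx, uint32_t ecx,
                           uint32_t edx, cache_props *out)
{
    out->type = eax & 0x1fu;
    if (out->type == 0) {
        return false;
    }
    out->level         = (eax >> 5) & 0x7u;
    out->self_init     = (eax >> 8) & 1u;
    out->fully_assoc   = (eax >> 9) & 1u;
    out->extra_sharers = (eax >> 14) & 0xfffu;
    out->line_size     = (ebx & 0xfffu) + 1;
    out->partitions    = ((ebx >> 12) & 0x3ffu) + 1;
    out->ways          = ((ebx >> 22) & 0x3ffu) + 1;
    out->sets          = ecx + 1;
    out->wbinvd        = edx & 1u;
    out->inclusive     = (edx >> 1) & 1u;
    out->size = (uint64_t)out->ways * out->partitions
              * out->line_size * out->sets;
    return true;
}
```

the synthesized size is just the product of the four geometry fields: 16 ways, 1 partition, 64-byte lines and 16384 sets multiply out to the 16777216 bytes shown in the decoded output.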

happy day! time to drink the bathroom cleaner!

previously: "not with a bang, but with a whimper" 2013-10-14