Schwarzgerät: Difference between revisions

From dankwiki
Revision as of 07:30, 21 August 2016

“Money has only a different value in the eyes of each.” ―William Makepeace Thackeray, Vanity Fair (1847), Chapter XLIV

ABSTRACT: In August of 2016, I pulled the trigger on a long-planned workstation build. That same month, Intel and NVIDIA dropped new product. Hot, salivation-provoking product: densely packed marvels bursting with FLOPS, rough dank beasts woven up from high-κ 16nm and 14nm strained-silicon FinFETs. Both companies, utterly dominant at their markets' high ends, announced pricing that shoved atom bombs up the ass of every computing enthusiast in the free world.

This is the rambling, poorly-edited, relentlessly technical story of that build. Also: epigraphs.

Intro: 10 cores of garbage

“Their way to the City lay through this town of Vanity, they contrived here to set up a fair; a fair wherein should be sold of all sorts of vanity, and that it should last all the year long. Therefore at this fair are all such merchandise sold: as houses, lands, trades, places, honours, preferments, titles, countries, kingdoms; lusts, pleasures, and delights of all sorts – as whores, bawds, wives, husbands, children, masters, servants, lives, blood, bodies, souls, silver, gold, pearls, precious stones, and what not.

And moreover, at this fair there is at all times to be deceivers, cheats, games, plays, fools, apes, knaves, and rogues and that of every kind.”

―John Bunyan, The Pilgrim's Progress (1678), Part I

As I began researching recent years' incremental improvements to high-end desktop technologies, one thing seemed obvious: the Intel Broadwell-E Core i7 6950X was laughably useless, a processor distinguished only by its price, a money-grab narrowly targeted at suckers. Broadly dismissed in reviews, its clock speed and microarchitecture are inferior to the much cheaper Skylake i7 6700K, its cost per core exceeds that of all but the largest Broadwell-EP Xeon E5 v4s, and it doesn't support multisocket configurations. Ars Technica put it best in their Broadwell-E review:

Intel is somewhat shooting itself in the foot with the pricing on the i7-6950X. The recently released Xeon Broadwell-EP processor list includes the Xeon E5-2640 v4: a 10-core 2.4 GHz/3.4 GHz part that runs at 90W, and is priced at $939, which compares favorably to the i7-6950X and its 10-cores at a 3.0 GHz/3.5 GHz clockspeeds. And because it’s a Xeon E5 processor, with the right motherboard a user can put two into the same machine for 20 cores/40 threads for only $1878, or only $150 more than the 10-core i7-6950X.

I thus felt weird several days later buying a 6950X ($1650 at Amazon). I'd since concluded that the black sheep Broadwell-E Extreme decacore is actually the best option for a very small class of users, including myself. Keep reading. I promise to explain this (tl;dr: you can't overclock Xeons, and Amdahl's Law is bad shit).

With that said, Intel and NVIDIA sure are fucking us for all we're worth on premium components.

Build Goals

“There is no way to tell his story without telling my own. And if his story really is a confession, then so is mine.” ―Apocalypse Now (1979)

I am not a gamer. If you're building a gaming rig, buy the excellent Skylake 6700K and be done with it. Game performance is almost entirely based on GPUs, games generally make poor use of extra cores, and the reference 6700K clocks in at 4GHz. You'll get a better interface to the southbridge via DMI 3.0, you'll draw less power, and you'll save money sufficient to buy a video card. You'll have a lot less PCIe capability, though: Skylake offers 16 PCIe 3.0 lanes, and another 20 via the southbridge, but the southbridge only gets 4 DMI 3.0 lanes to the processor. A 16 lane video card is thus going to monopolize your die's direct PCIe hookup.

My intense compute tasks include:

  • Compilation of large source packages
  • Developing and testing my own software, some of it using GPGPU
  • Running large simulations and data analysis, some of it using GPGPU
  • Research into high-performance computing and networking
  • Occasional virtualization, console emulation, video editing and transcoding

In addition, I require several dozen terabytes of spinning disk, significant solid state storage, XLR audio output driven by a quality DAC, and lots of USB/wireless connectivity. All non-volatile storage must employ redundancy.

Ideally, I want a multisocket NUMA solution to facilitate research on optimizing for such environments. Likewise, I'd like the 512-bit AVX extensions and TSX, for research into making use of these advanced features, and I prefer ECC memory. Beyond that, it's a matter of cores, clocks, lanes, and RAM, the more the better.

I'd like to avoid things I don't intend to use: Intel graphics, VGA outputs, IPMI and other spooky hidden network control stacks, and anything lacking Linux support.

My budget is essentially unlimited, and I expect to drop somewhere between three and five thousand dollars on the machine (not including spinning disks or case). If a particularly attractive option breaks that budget, fine. I'm not looking to spend for the sake of spending, but neither am I optimizing for price: I don't want to find myself wishing I'd splurged on something useful, and I don't want to look at purchased resources sitting idle. Beyond that, total freedom, save one constraint: I already had my case.

The case

“Die Welt ist alles, was der Fall ist (The world is all that is the case).” ―Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1921)

I'd purchased a CaseLabs Magnum T10 in 2013 (the Magnum T10 has been discontinued; the linked Magnum TH10A is similar), and was determined to finally make fitting use of it. The T10 is a gorgeous, brilliantly engineered monster of a double-wide reconfigurable case, weighing in at 24 pounds of aluminum. At 15 inches wide, 25.06 inches tall, and 20.06 inches deep (381mm x 637mm x 510mm), it'll easily fit EATX, SSI-CEB or even XL-ATX motherboards, dozens of hard drives, dual PSUs, and the powerful radiators/fans necessary to quietly cool it all. With casters attached, it could transport a small child in comfort and style.

Multisocket and custom liquid-cooled solutions are thus in play. Look upon my works, ye Mighty, and despair!

I can't praise CaseLabs enough. They're extremely expensive, but operating at a different level than anyone else. Even top-of-the-line offerings from Cooler Master, Fractal Design, and Lian Li seem glaringly deficient once you've worked with one of these masterpieces.

Cost: $490

Selecting the core

“I know of a wild region whose librarians repudiate the vain superstitious custom of seeking any sense in books and compare it to looking for meaning in dreams or in the chaotic lines of one's hands.” ―Jorge Luis Borges, “The Library of Babel” (1941)

Holding microarchitecture and memory systems constant, embarrassingly parallel tasks want us to maximize cores*clocks, whereas serial tasks want us to maximize clocks. Collections of unrelated serial tasks, and incompletely parallel tasks, can almost always benefit from higher clocks (memory and I/O-bound tasks aside), and can sometimes make use of more cores. In general, we achieve peak performance by maximizing the product of cores and clocks, but cores are subject to diminishing returns of utility, and higher clocks quickly run into issues of cooling. A processor's dynamic power consumption is linear with respect to frequency, but quadratic with respect to voltage (P=CV<sup>2</sup>f), and higher frequencies typically require higher voltages.
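To put rough numbers on that cooling problem, here is a minimal sketch of the dynamic power relation. The capacitance is normalized to 1, and the two operating points (3.0 GHz at 1.00 V, 4.0 GHz at 1.20 V) are hypothetical values picked for illustration, not measurements of any real part.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power P = C * V^2 * f, in arbitrary but consistent units."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical operating points (assumed, not measured): a clock bump
# that also needs a voltage bump to stay stable.
base = dynamic_power(1.0, 1.00, 3.0)
pushed = dynamic_power(1.0, 1.20, 4.0)
ratio = pushed / base

print(f"{4.0 / 3.0:.2f}x the clock costs {ratio:.2f}x the dynamic power")
```

Under these assumed values, a one-third increase in clock nearly doubles dynamic power, because the voltage term is squared.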

My primary compute-intense, interactive work is software development, so my main concern is fast builds. The Linux kernel, gcc, LLVM, glibc, compiz etc. are large enough to drive a few dozen cores (assuming suitable storage/memory throughput). With that said, they'll have plenty of steps (linking, for one) which cut down on parallelism and want lots of cycles. Smaller projects, though, or even large projects once most objects have been built, will expose much less parallelism. During an edit-compile-test cycle, only a few products typically need be regenerated: only a few cores can be utilized, and clocks determine turnaround time. Recall Amdahl's Law: as available parallel resources approach infinity, latency becomes dominated by the time required to run non-parallelizable tasks.
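Amdahl's Law is easy to tabulate. The sketch below assumes a workload that is 95% parallelizable; that fraction is an illustrative assumption, not something measured from a real build.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Speedup over one worker when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)

PARALLEL_FRACTION = 0.95  # assumed for illustration

for workers in (1, 4, 10, 22, 1000):
    print(f"{workers:5d} workers: {amdahl_speedup(PARALLEL_FRACTION, workers):5.2f}x")

# With unlimited workers, the serial 5% caps the speedup at 1 / 0.05.
ceiling = 1.0 / (1.0 - PARALLEL_FRACTION)
```

However many workers are thrown at it, the speedup never passes the ceiling set by the serial fraction, which is why clocks keep mattering.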

Example: Let's assume that P1 executes serial code twice as fast as P2, but that we can stuff 10 P2s in our machine. We've got a source tree with 100 files, each of which takes 2 units of time on P2 to compile to objects. Linking these 100 objects into a binary takes 8 units of time on P2. An initial build will take 28 units on a P2 machine (100/10*2 + 8), and 104 units on a P1 machine (100/1*1 + 4); the P1 machine takes nearly four times as long (104/28 ≈ 3.7). We then edit five files, and rebuild. The P2 machine requires 10 units (5/5*2 + 8), with a utilization of 18% (2 units of 50% utilization, 8 units of 10% utilization), while the P1 machine requires 9 units (5/1*1 + 4), maintaining 100% utilization. It is of course unlikely that all source files take the same amount of time to compile; the assumption of perfect parallelism is thus a charade.
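The arithmetic in this example can be checked with a toy model. It keeps the example's own simplifications: every file costs the same to compile, compiles run in full waves across the available cores, and the link is a single serial step. The function names are mine, invented for this sketch.

```python
import math

def build_time(files, compile_cost, link_cost, cores):
    """Latency of a build: compiles run in waves of `cores`, then one serial link."""
    waves = math.ceil(files / cores)
    return waves * compile_cost + link_cost

def utilization(files, compile_cost, link_cost, cores):
    """Fraction of the available core-time spent working over the whole build."""
    busy = files * compile_cost + link_cost
    return busy / (cores * build_time(files, compile_cost, link_cost, cores))

# Costs are per machine: P1 is twice as fast serially, P2 has ten cores.
P1 = dict(compile_cost=1, link_cost=4, cores=1)
P2 = dict(compile_cost=2, link_cost=8, cores=10)

for label, files in (("initial build", 100), ("rebuild", 5)):
    for name, machine in (("P1", P1), ("P2", P2)):
        t = build_time(files, **machine)
        u = utilization(files, **machine)
        print(f"{label:13s} {name}: {t:3d} units, {u:.0%} utilized")
```

Changing the file count or the link cost shows how quickly the ten-core machine's advantage evaporates once the serial link dominates.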

As for other things I run...HandBrake takes advantage of parallelism. MakeMKV does not. Valgrind does not. rtorrent does not. Blender does, sometimes. OpenShot does, kinda. Chrome does, though not within a tab. Stuff I write usually does. Stuff other people write often does not.

Let's then seek to maximize scaled cores*clocks, with cores falling off in value exponentially with a base of, eh, call it 0.9. This means the second core is 90% as useful as the first core, the third is 81% as useful, the tenth is ~39% as useful, and the thirty-fifth is ~2.8% as useful. "Useful" can be read here as "utilized". For n cores, the number of scaled cores is equal to the sum of 0.9<sup>k</sup> as k ranges from 0 to n - 1. As n approaches infinity, this sum converges to 10, which is clearly sometimes "wrong" (we know certain loads can drive more than 10 cores), but it's also clearly sometimes "wrong" that a second core is worth 0.9 (we know certain loads can drive only one core). Once we hit 22 cores (9.02 scaled cores), we achieve more than 90% of our peak benefit. I'm OK with that, as I doubt I'll be loading 20+ cores very often. Such extreme computation is usually better served by GPUs, anyway.
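The scaled-core figure is just a truncated geometric series, so it takes a few lines to compute. In this sketch the function name and the `base` parameter are my own labels; 0.9 is the falloff chosen above.

```python
def scaled_cores(n, base=0.9):
    """Sum of base**k for k in 0..n-1: the k-th extra core is worth base**k."""
    return sum(base ** k for k in range(n))

# The series converges to 1 / (1 - base) as n grows without bound.
limit = 1.0 / (1.0 - 0.9)

for n in (1, 4, 10, 22):
    s = scaled_cores(n)
    print(f"{n:2d} cores -> {s:.2f} scaled cores ({s / limit:.0%} of the limit)")
```

Swapping in a different `base` is a quick way to see how sensitive the ranking is to the assumed falloff.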

Aside regarding SMT: All the processors we consider support 2-way SMT aka HyperThreading. SMT will generally enhance throughput (i.e., make a core seem like two slightly weaker cores) for code with significant delays due to memory accesses, mispredicted branches, and data dependencies, provided of course that the virtual parallelism can be exploited. Code dominated by arithmetic intensity will not see much benefit from SMT, since the single physical core's execution units are not truly duplicated. If only one thread is available, SMT of course provides no benefit. In either of these cases, enabled SMT can actually degrade performance due to partitioning of microarchitectural resources. Machines primarily running tuned arithmetic code should probably disable SMT. For the case of compilation, it's usually going to be a win, hiding all manner of chaotic delays so long as the physical core's cache doesn't get blown out. We consider physical cores for simplicity.

Aside regarding TurboBoost: All the processors we consider support TurboBoost 2.0 or 3.0. This allows automatic clock boosts when some cores are inactive. I list the turbo speed, but calculate using the base speed.

{| class="wikitable"
! Processor
! Cores / scaled
! Base / turbo GHz
! Scaled-base product
|-
| Skylake i7 6700K
| 4 / 3.44
| 4.0 / 4.2
| 13.760
|-
| Skylake Xeon E3 1280v5
| 4 / 3.44
| 3.7 / 4.0
| 12.728
|-
| Broadwell-E i7 6950X
| 10 / 6.51
| 3.0 / 3.5
| 19.53
|-
| Broadwell-E i7 6900K
| 8 / 5.70
| 3.2 / 3.7
| 18.24
|-
| Broadwell-E i7 6850K
| 6 / 4.69
| 3.6 / 3.8
| 16.88
|-
| Broadwell-EP Xeon E5 2699v4
| 22 / 9.02
| 2.2 / 3.6
| 19.84
|-
| Broadwell-EP Xeon E5 2697v4
| 18 / 8.50
| 2.3 / 3.6
| 19.55
|-
| Broadwell-EP Xeon E5 2697Av4
| 16 / 8.15
| 2.6 / 3.6
| 21.19
|-
| Broadwell-EP Xeon E5 2697Av4 x2
| 32 / 9.66
| 2.6 / 3.6
| 25.12
|-
| Broadwell-EP Xeon E5 2687Wv4
| 12 / 7.18
| 3.0 / 3.5
| 21.54
|-
| Broadwell-EP Xeon E5 2687Wv4 x2
| 24 / 9.20
| 3.0 / 3.5
| 27.60
|}
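The last column can be regenerated from the core counts and base clocks. In this sketch the core counts and base clocks are copied from the table; the helper names are hypothetical. The table's products appear to have been computed from the scaled count rounded to two places (6.51 * 3.0 = 19.53), so the code rounds first as well.

```python
def scaled_cores(n, base=0.9):
    """Sum of base**k for k in 0..n-1."""
    return sum(base ** k for k in range(n))

def scaled_base_product(cores, base_ghz):
    """Scaled cores (rounded to two places, as the table seems to do) times base clock."""
    return round(scaled_cores(cores), 2) * base_ghz

# (name, physical cores, base GHz), taken from the table above.
PROCESSORS = [
    ("i7 6700K", 4, 4.0),
    ("Xeon E3 1280v5", 4, 3.7),
    ("i7 6950X", 10, 3.0),
    ("i7 6900K", 8, 3.2),
    ("i7 6850K", 6, 3.6),
    ("Xeon E5 2699v4", 22, 2.2),
    ("Xeon E5 2697v4", 18, 2.3),
    ("Xeon E5 2697Av4", 16, 2.6),
    ("Xeon E5 2697Av4 x2", 32, 2.6),
    ("Xeon E5 2687Wv4", 12, 3.0),
    ("Xeon E5 2687Wv4 x2", 24, 3.0),
]

ranking = sorted(PROCESSORS, key=lambda p: scaled_base_product(p[1], p[2]), reverse=True)
for name, cores, ghz in ranking:
    print(f"{name:20s} {cores:2d} cores @ {ghz:.1f} GHz -> {scaled_base_product(cores, ghz):6.2f}")
```

Sorting by the product makes the ordering explicit. Note that this metric uses base clocks only, so it says nothing about what overclocking does to the picture.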

The mobo/chipset

“If they can get you asking the wrong questions, they don't have to worry about answers.” ―Thomas Pynchon, Gravity's Rainbow (1973)

The processor

“Yog-Sothoth knows the gate. Yog-Sothoth is the gate. Yog-Sothoth is the key and guardian of the gate. Past, present, future, all are one in Yog-Sothoth.” ―H. P. Lovecraft, “The Dunwich Horror” (1929)