“There is no way to tell his story without telling my own. And if his story really is a confession, then so is mine.” ―Apocalypse Now (1979)
ABSTRACT: In August of 2016, I pulled the trigger on a long-planned workstation build. That same month, Intel and NVIDIA dropped new product. Hot, salivation-provoking product: densely packed marvels bursting with FLOPS, rough dank beasts woven up from high-κ 16nm and 14nm strained-silicon FinFETs. Both companies, utterly dominant at their markets' high ends, announced pricing that shoved atom bombs up the ass of every computing enthusiast in the free world.
This is the rambling, poorly-edited, relentlessly technical story of that build. Also: epigraphs. You can skip to the bill of materials, if you just want specs.
Intro: 10 cores of garbage
“Their way to the City lay through this town of Vanity, they contrived here to set up a fair; a fair wherein should be sold of all sorts of vanity, and that it should last all the year long. Therefore at this fair are all such merchandise sold: as houses, lands, trades, places, honours, preferments, titles, countries, kingdoms; lusts, pleasures, and delights of all sorts – as whores, bawds, wives, husbands, children, masters, servants, lives, blood, bodies, souls, silver, gold, pearls, precious stones, and what not.
And moreover, at this fair there is at all times to be deceivers, cheats, games, plays, fools, apes, knaves, and rogues and that of every kind.”
―John Bunyan, The Pilgrim's Progress (1678), Part I
As I began researching recent years' incremental improvements to high-end desktop technologies, one thing seemed obvious: the Intel Broadwell-E Core i7 6950X was laughably useless, a processor distinguished only by its price, a money-grab narrowly targeted at suckers. Broadly dismissed in reviews, its clock speed and microarchitecture are inferior to the much cheaper Skylake i7 6700K, its cost per core exceeds that of all but the largest Broadwell-EP Xeon E5 v4s, and it doesn't support multisocket configurations. Ian Cutress of Anandtech put it best in his Broadwell-E review:
Intel is somewhat shooting itself in the foot with the pricing on the i7-6950X. The recently released Xeon Broadwell-EP processor list includes the Xeon E5-2640 v4: a 10-core 2.4 GHz/3.4 GHz part that runs at 90W, and is priced at $939, which compares favorably to the i7-6950X and its 10-cores at a 3.0 GHz/3.5 GHz clockspeeds. And because it’s a Xeon E5 processor, with the right motherboard a user can put two into the same machine for 20 cores/40 threads for only $1878, or only $150 more than the 10-core i7-6950X.
I thus felt weird several days later buying a 6950X ($1650 at Amazon). I'd since concluded that the black sheep Broadwell-E Extreme decacore is actually the best option for a very small class of users, including myself. Keep reading. I promise to explain this (tl;dr: you can't easily overclock Xeons, and Amdahl's Law is bad shit).
With that said, Intel and NVIDIA sure are fucking us for all we're worth on premium components.
“Money has only a different value in the eyes of each.” ―William Makepeace Thackeray, Vanity Fair (1847), Chapter XLIV
I am not a gamer. If you're building a gaming rig, buy the excellent Skylake 6700K and be done with it. Game performance is almost entirely based on GPUs, games generally make poor use of extra cores, and the reference 6700K clocks in at 4GHz. You'll get a better interface to the southbridge via DMI 3.0, you'll draw less power, and you'll save money sufficient to buy a video card. You'll have a lot less PCIe capability, though: Skylake offers 16 PCIe 3.0 lanes, and another 20 via the southbridge, but the southbridge only gets 4 DMI 3.0 lanes to the processor. A 16 lane video card is thus going to monopolize your die's direct PCIe hookup.
My intense compute tasks include:
- Compilation of large source packages
- Developing and testing my own software, some of it using GPGPU
- Running large simulations and data analysis, some of it using GPGPU
- Research into high-performance computing and networking
- Occasional virtualization, console emulation, video editing and transcoding
In addition, I require several dozen terabytes of spinning disk, significant solid state storage, XLR audio output driven by a quality DAC, and lots of USB/wireless connectivity. All non-volatile storage must employ redundancy.
Ideally, I want a multisocket NUMA solution to facilitate research on optimizing for such environments. Likewise, I'd like the 512-bit AVX extensions and TSX, for research into making use of these advanced features, and I prefer ECC memory. Beyond that, it's a matter of cores, clocks, lanes, and RAM, the more the better.
I'd like to avoid things I don't intend to use: Intel graphics, VGA outputs, IPMI and other spooky hidden network control stacks, and anything lacking Linux support.
My budget is essentially unlimited, and I expect to drop somewhere between three and five thousand dollars on the machine (not including spinning disks or case). If a particularly attractive option breaks that budget, fine. I'm not looking to spend for the sake of spending, but neither am I optimizing for price: I don't want to find myself wishing I'd splurged on something useful, and I don't want to look at purchased resources sitting idle. Beyond that, total freedom, save one constraint: I already had my case.
“Die Welt ist alles, was der Fall ist (The world is all that is the case).” ―Ludwig Wittgenstein, Tractatus Logico-Philosophicus (1921)
I'd purchased a CaseLabs Magnum T10 in 2013 (the Magnum T10 has been discontinued; the linked Magnum TH10A is similar), and was determined to finally make fitting use of it. The T10 is a gorgeous, brilliantly engineered monster of a double-wide reconfigurable case, weighing in at 24 pounds of aluminum. At 15 inches wide, 25.06 inches tall, and 20.06 inches deep (381mm x 637mm x 510mm), it'll easily fit EATX, SSI-CEB or even XL-ATX motherboards, dozens of hard drives, dual PSUs, and the powerful radiators/fans necessary to quietly cool it all. With casters attached, it could transport a small child in comfort and style.
Multisocket and custom liquid-cooled solutions are thus in play. Look upon my works, ye Mighty, and despair!
I can't praise CaseLabs enough. They're extremely expensive, but operating at a different level than anyone else. Even top-of-the-line offerings from Cooler Master, Fractal Design, and Lian Li seem glaringly deficient once you've worked with one of these masterpieces.
Selecting the core
“I know of a wild region whose librarians repudiate the vain superstitious custom of seeking any sense in books and compare it to looking for meaning in dreams or in the chaotic lines of one's hands.” ―Jorge Luis Borges, “The Library of Babel” (1941)
Holding microarchitecture and memory systems constant, embarrassingly parallel tasks want us to maximize cores*clocks, whereas serial tasks want us to maximize clocks. Collections of unrelated serial tasks, and incompletely parallel tasks, can almost always benefit from higher clocks (memory and I/O-bound tasks aside), and can sometimes make use of more cores. In general, we achieve peak performance by maximizing the product of cores and clocks, but cores are subject to diminishing returns of utility, and higher clocks quickly run into issues of cooling. A processor's dynamic power consumption is linear with respect to frequency, but quadratic with respect to voltage (P = CV²f), and higher frequencies typically require higher voltages.
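To put a rough number on that cooling problem, here is a minimal sketch of the dynamic power relation. The voltages are hypothetical, picked only for illustration; real chips vary, and this ignores static (leakage) power entirely.

```python
def dynamic_power_ratio(f_old, f_new, v_old, v_new):
    """Ratio of new to old dynamic power under P = C * V^2 * f.

    The switched capacitance C is the same chip before and after,
    so it cancels out of the ratio.
    """
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Hypothetical overclock: 3.0 GHz at 1.00 V pushed to 4.0 GHz at 1.25 V.
ratio = dynamic_power_ratio(3.0, 4.0, 1.00, 1.25)
print(f"{4.0 / 3.0:.2f}x the clock costs {ratio:.2f}x the dynamic power")
```

Under those assumed voltages, a third more clock roughly doubles the heat to be carried away.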
My primary compute-intense, interactive work is software development, so my main concern is fast builds. The Linux kernel, gcc, LLVM, glibc, mplayer, compiz etc. are large enough to drive a few dozen cores (assuming suitable storage/memory throughput). With that said, they'll have plenty of steps (linking, for one) which cut down on parallelism and want lots of cycles. Smaller projects (or even large projects once most objects have been built) will expose much less parallelism. During an edit-compile-test cycle, only a few products typically need be regenerated: only a few cores can be utilized, and clocks determine turnaround time. Recall Amdahl's Law: as available parallel resources approach infinity, latency becomes dominated by the time required to run non-parallelizable tasks.
Example: Let's assume that P1 executes serial code twice as fast as P2, but that we can stuff 10 P2s in our machine. We've got a source tree with 100 files, each of which takes 2 units of time on P2 to compile to objects. Linking these 100 objects into a binary takes 8 units of time on P2. An initial build will take 28 units on a P2 machine (100/10*2 + 8), and 104 units on a P1 machine (100/1*1 + 4); the P1 machine is nearly four times as latent. We then edit five files, and rebuild. The P2 machine requires 10 units (5/5*2 + 8), with a utilization of 18% (2 units of 50% utilization, 8 units of 10% utilization), while the P1 machine requires 9 units (5/1*1 + 4), maintaining 100% utilization. It is of course unlikely that all source files take the same amount of time to compile; assumption of perfect parallelism is a charade.
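The arithmetic in that example can be sketched as a toy model. The `build` function and its parameters are names made up here; it assumes identical compile times per file, compilation in full parallel waves, and a strictly serial link step on a single core.

```python
from math import ceil


def build(files, cores, compile_time, link_time):
    """Return (latency, utilization) for the toy build described above."""
    waves = ceil(files / cores)              # rounds of parallel compilation
    latency = waves * compile_time + link_time
    busy = files * compile_time + link_time  # core-time doing real work
    return latency, busy / (latency * cores)


# P2 machine: 10 slow cores (2 units per file, 8 units to link).
# P1 machine: 1 fast core  (1 unit per file, 4 units to link).
for label, files in (("initial build", 100), ("five-file rebuild", 5)):
    p2_latency, p2_util = build(files, cores=10, compile_time=2, link_time=8)
    p1_latency, p1_util = build(files, cores=1, compile_time=1, link_time=4)
    print(f"{label}: P2 {p2_latency} units at {p2_util:.0%}, "
          f"P1 {p1_latency} units at {p1_util:.0%}")
```

Varying `files` shows where the crossover sits: with these numbers the single fast core wins whenever only a handful of objects need rebuilding.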
As for other things I run...HandBrake takes advantage of parallelism. MakeMKV does not. Valgrind does not. rtorrent does not. Blender does, sometimes. OpenShot does, kinda. Chrome does, though not within a tab. Stuff I write usually does. Stuff other people write often does not.
Let's then seek to maximize scaled cores*clocks, with cores falling off in value exponentially with a base of, eh, call it 0.9. This means the second core is 90% as useful as the first core, the third is 81% as useful, the tenth is ~39% as useful, and the thirty-fifth is ~2.8% as useful. "Useful" can be read here as "utilized". For n cores, then, the number of scaled cores is equal to the sum over 0.9^k as integer k ranges from 0 to n - 1. As n approaches infinity, this sum converges to 10, which is clearly sometimes "wrong" (we know certain loads can drive more than 10 cores), but it's also clearly sometimes "wrong" that a second core is worth 0.9 (we know certain loads can drive only one core). Once we hit 22 cores (9.02 scaled cores), we achieve more than 90% of our peak benefit. I'm OK with that, as I doubt I'll be loading 20+ cores very often. Such extreme computation is usually better served by GPUs, anyway.
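The scaled core count is just a finite geometric series, so it is easy to sketch. The function name `scaled_cores` is mine, not anything standard.

```python
def scaled_cores(n, base=0.9):
    """Sum of base**k for k = 0 .. n-1: the utility-weighted core count."""
    return sum(base ** k for k in range(n))


LIMIT = 1 / (1 - 0.9)  # what the series converges to as n grows without bound

for n in (4, 6, 8, 10, 22):
    s = scaled_cores(n)
    print(f"{n:2d} cores -> {s:.2f} scaled ({s / LIMIT:.0%} of the limit)")
```

The same sum has the closed form (1 - base^n) / (1 - base), which is handy when sweeping over many core counts.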
Aside regarding SMT: All the processors we consider support 2-way SMT aka HyperThreading. SMT will generally enhance throughput (i.e., make a core seem like two slightly weaker cores) for code with significant delays due to memory accesses, mispredicted branches, and data dependencies. Code dominated by arithmetic intensity will not see much benefit from SMT, since the single physical core's execution units are not truly duplicated. If only one thread is available, SMT of course provides no benefit. In either of these latter cases, enabled SMT can actually degrade performance due to partitioning of microarchitectural resources. Machines primarily running tuned arithmetic code should probably disable SMT. For the case of compilation, it's usually going to be a win, hiding all manner of chaotic delays so long as the physical core's cache doesn't get blown out. We will calculate with physical cores, because fuck it.
Aside regarding TurboBoost: All the processors we consider support TurboBoost 2.0 or 3.0. This allows automatic clock boosts when some cores are inactive. TurboBoost range tends to shrink as the base clock increases. I provide both the base and turbo speed, but calculate using the base speed, again because fuck it.
|Processor(s)||Cores / scaled||Base / turbo GHz||Scaled-base product|
|Skylake i7 6700K||4 / 3.44||4.0 / 4.2||13.76|
|Skylake Xeon E3 1280v5||4 / 3.44||3.7 / 4.0||12.73|
|Broadwell-E i7 6950X||10 / 6.51||3.0 / 3.5||19.53|
|Broadwell-E i7 6900K||8 / 5.70||3.2 / 3.7||18.24|
|Broadwell-E i7 6850K||6 / 4.69||3.6 / 3.8||16.88|
|Broadwell-EP Xeon E5 2699v4||22 / 9.02||2.2 / 3.6||19.84|
|Broadwell-EP Xeon E5 2697v4||18 / 8.50||2.3 / 3.6||19.55|
|Broadwell-EP Xeon E5 2697Av4||16 / 8.15||2.6 / 3.6||21.19|
|Broadwell-EP Xeon E5 2697Av4 x2||32 / 9.66||2.6 / 3.6||25.12|
|Broadwell-EP Xeon E5 2687Wv4||12 / 7.18||3.0 / 3.5||21.54|
|Broadwell-EP Xeon E5 2687Wv4 x2||24 / 9.20||3.0 / 3.5||27.60|
|Broadwell-EP Xeon E5 2640v4||10 / 6.51||2.4 / 3.4||15.62|
|Broadwell-EP Xeon E5 2640v4 x2||20 / 8.78||2.4 / 3.4||21.07|
The Skylakes have some nice qualities (DMI 3.0 to the southbridge, AVX512, better AES implementation, Sunrise Point chipset), but the raw power just isn't there for heavy parallelism. Also, as we'll see below, their PCIe bandwidth to the chip is severely limited (DMI 3.0 helps out a bit with this, but not terribly much). I suppose it's worth noting that the 6700K has Intel graphics built in. Skylake and Broadwell-E have TurboBoost 3.0, and (excluding the Skylake Xeon) unlocked clock multipliers for trivial overclocking.
Note that our choice of scaling base can drastically change these calculations. At .75, there's little point going beyond the 6700K's four cores. At .99, the forty-four physical cores of a dual-socket Xeon 2699v4 are all contributing. I'll admit that the chosen 0.9 is something of a sweet spot for the 6950X vs the generally slower Xeons: its tenth core counts as a respectable 40% of its first. The octacore 6900K is close on its heels, while only the very largest Xeons top it in a single-socket configuration. Ignoring everything else, the dual-socket Xeon 2687Wv4 setup wins under this model. Awesome! I was hoping to go multi-socket.
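To see how much the scaling base matters, here is a rough sweep over the three bases mentioned above. Core counts and base clocks are copied from the first table; the dual 2699v4 is not tabulated there, so its row simply doubles the single-socket core count. The names `scaled_cores` and `CANDIDATES` are illustrative.

```python
def scaled_cores(n, base):
    """Closed form of sum(base**k for k in range(n)), valid for 0 < base < 1."""
    return (1 - base ** n) / (1 - base)


# (label, physical cores, base GHz)
CANDIDATES = [
    ("i7 6700K", 4, 4.0),
    ("i7 6900K", 8, 3.2),
    ("i7 6950X", 10, 3.0),
    ("E5 2687Wv4 x2", 24, 3.0),
    ("E5 2699v4 x2", 44, 2.2),
]

BASES = (0.75, 0.9, 0.99)

print(f"{'processor':<16}" + "".join(f"{b:>9}" for b in BASES))
for label, cores, ghz in CANDIDATES:
    row = "".join(f"{scaled_cores(cores, b) * ghz:>9.2f}" for b in BASES)
    print(f"{label:<16}{row}")
```

Each cell is a scaled-base product, so reading down a column ranks the parts under that base. Because this uses unrounded scaled core counts, a few values can differ from the table in the second decimal place.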
Let's not ignore everything else, though. A new table:
|Processor(s)||Scaled-base product (SBP)||PCIe 3.0 lanes||Max memory (GB)||Chipset(s)||TDP (W)||Price||Price / SBP|
|Skylake i7 6700K||13.76||16||64 DDR4-2133||Z170||91||$320||$23|
|Broadwell-E i7 6950X||19.53||40||128 DDR4-2400||X99||140||$1650||$84|
|Broadwell-E i7 6900K||18.24||40||128 DDR4-2400||X99||140||$1120||$61|
|Broadwell-E i7 6850K||16.88||40||128 DDR4-2400||X99||140||$620||$36|
|Broadwell-EP Xeon E5 2699v4||19.84||40||1536 DDR4-2400||C612 / X99||145||$4120||$207|
|Broadwell-EP Xeon E5 2697v4||19.55||40||1536 DDR4-2400||C612 / X99||135||$3230||$165|
|Broadwell-EP Xeon E5 2697Av4||21.19||40||1536 DDR4-2400||C612 / X99||145||$2900||$136|
|Broadwell-EP Xeon E5 2697Av4 x2||25.12||80||3072 DDR4-2400||C612||290||$5800||$230|
|Broadwell-EP Xeon E5 2687Wv4||21.54||40||1536 DDR4-2400||C612 / X99||160||$2140||$99|
|Broadwell-EP Xeon E5 2687Wv4 x2||27.60||80||3072 DDR4-2400||C612||320||$4280||$155|
|Broadwell-EP Xeon E5 2640v4 x2||21.07||80||3072 DDR4-2133||C612||180||$1880||$89|
A few observations: the hexa- and octacore Broadwell-Es are pretty great values for (entry-grade) enthusiast product. Broadwell-E claims support for 128GB of DDR4, but its quadchannel memory controller only accepts one DIMM per channel, so good luck with that. The aggressively-clocked Xeon E5 2687Wv4 seems intriguing, and with a higher scaling base, its dual-socket implementation rocks the house. Otherwise, our scaling base of 0.9 makes the massively multicore Xeons seem pretty expensive. Still, I had to think about dual 2687Wv4s for a few lightheaded minutes. I'd love to run a make -j 48 on such a machine, churning out kernel builds slick as shit from a goose and warming my home besides. Expensive, though. Two 2640v4s, on the other hand, offered a bit more power than a 6950X for not much more money, and offered the opportunity of a dual-socket upgrade when Skylake-EP/Broadwell-EX knock the bottom out of the Broadwell-EP market.
Overclocking changes this whole story. Taking the 6950X up to 4GHz turns its 19.53 into a 26.04. Its price per work unit drops to $63. Accounting for TurboBoost would help the Xeons, of course. With that said, overclocking can affect all cores, rather than the limited effects of TurboBoost. Computing Scaled-Turbo Products (the TurboBoost speed, plus the base clock times the scaled core count less one) is left as an exercise for the reader.
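For readers who would rather not do the exercise by hand, a minimal sketch follows. It reads the Scaled-Turbo Product as "the first core runs at turbo, every remaining scaled core runs at base", which is my interpretation of the sentence above; the function names are made up for this illustration.

```python
def scaled_cores(n, base=0.9):
    """Sum of base**k for k = 0 .. n-1."""
    return sum(base ** k for k in range(n))


def scaled_base_product(cores, ghz, base=0.9):
    """Scaled core count times a single clock applied to every core."""
    return scaled_cores(cores, base) * ghz


def scaled_turbo_product(cores, base_ghz, turbo_ghz, base=0.9):
    """Turbo clock for the first core, base clock for the rest."""
    return turbo_ghz + base_ghz * (scaled_cores(cores, base) - 1)


PRICE_6950X = 1650  # dollars, from the table above

stock = scaled_base_product(10, 3.0)
overclocked = scaled_base_product(10, 4.0)
turbo = scaled_turbo_product(10, 3.0, 3.5)

print(f"6950X stock:       {stock:.2f} (${PRICE_6950X / stock:.0f} per unit)")
print(f"6950X at 4GHz:     {overclocked:.2f} (${PRICE_6950X / overclocked:.0f} per unit)")
print(f"6950X with turbo:  {turbo:.2f}")
```

The overclocked figure comes out a hair above 26.04 because the scaled core count is not rounded to 6.51 first. Note how little the turbo variant moves the number compared with an all-core overclock.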
“If they can get you asking the wrong questions, they don't have to worry about answers.” ―Thomas Pynchon, Gravity's Rainbow (1973)
You can't go multisocket without C612, though, and C612 got me down. It doesn't bring certain workstationish elements I want, seems USB-starved, and even large SSI motherboards don't seem to put as many I/O slots on there as one could. Aesthetics are brutish and fan support is limited. IPMI scares the hell out of me. So far as I could tell, no multisocket enthusiast/DIY scene exists. Shit shit shit. Meanwhile, there were any number of simply lovely "Generation 2" X99 boards, loaded with valuable capabilities that would consume PCIe slots on the SSI boards, PCIe slots I needed for GPUs and fast storage (with that said, I did appreciate the 10GigE built onto many C612 boards, and lament its absence on all save one X99). And anything but the 2x2640v4 was going to be awfully expensive in absolute terms. Shit shit shit. And it's not like $1880 is exactly cheap. See, this is what multi-socket thinking does to you. Shit. Alright. Single, boring old unisocket on a phat X99, FML.
“Past, present, future, all are one in Yog-Sothoth.” ―H. P. Lovecraft, “The Dunwich Horror” (1929)
Problems and annoyances
The build, and indeed the final product, were not without some problems.
- The locking mechanism on the motherboard's PCIe slots requires a delicate unlatching to remove a plugged-in card. The Noctua NH-D14 cooler's height and width make this pretty much impossible for the topmost SafeSlot when loaded with a full-sized video card. I can't currently remove my GTX 1080.
- My PCIe 3.0 16x GTX 1080 is only posting at 8x, as reported by the UEFI BIOS and nvidia-smi -a.
  - Update: this is confirmed to hit 16x under load.
- The video card blocks easy access to 6 of the 10 SATA3 ports.
- The onboard Broadcom BCM4360 wireless+bluetooth chipset has no open source drivers. The broadcom-sta-dkms package provides a working wl driver, though one must ensure conflicting drivers are not loaded for it to recognize the device.
  - Even then, the Bluetooth module required extracting firmware from the Windows driver.
- The ASUS Aura lighting system, both on the motherboard and card, does not appear to have Linux support. The motherboard lighting can be minimally controlled through the UEFI BIOS.
- The Funtin U.2 cables are very thick and inflexible, leading to some cabling annoyances. The second U.2 connector's placement on the motherboard leads to unavoidable cabling pollution of my primary airflow channel.
- The EVGA SuperNova 850's cables are likewise problematic, especially given the location of the GTX 1080's power plugs.
- The CaseLabs Magnum T10's FlexBay system is intolerant of excessively-sized 5.25" devices, including the Rosewill RSV-SATA-Cage-3 hotswap bay I originally purchased.
- The CaseLabs Magnum T10's hard drive cages can't support front-loading bays, and adding/removing drives requires removing an entire cage.
- The absence of a PCI slot on the ASUS X99 Deluxe II meant I couldn't use my ASUS Xonar ST sound card.
- The CaseLabs Magnum T10's PSU side could really use some fan cutaways.
- When both U.2 connectors are populated with NVMe SSDs, Linux notes corrected PCIe errors at a rate proportional to load.
  - Update: this was due to ASPM, which I have disabled. Problem gone.
- The ASUS X99 Deluxe II's hardware sensors do not appear to have Linux support.
  - Update: with the kernel command line option acpi_enforce_resources=lax, the nct6775 chip can be found.
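On the PCIe link width item above: besides the UEFI setup screen and nvidia-smi, Linux exposes the negotiated and maximum link widths per device in sysfs. A small sketch for listing them follows. It assumes the usual `/sys/bus/pci/devices` layout with `current_link_width` and `max_link_width` attributes; the helper name is my own. Since the link can be renegotiated with load, sample it while the device is busy before concluding anything.

```python
from pathlib import Path


def pcie_link_widths(root="/sys/bus/pci/devices"):
    """Return [(pci_address, current_width, max_width)] for devices exposing both."""
    results = []
    for dev in sorted(Path(root).glob("*")):
        try:
            current = int((dev / "current_link_width").read_text())
            maximum = int((dev / "max_link_width").read_text())
        except (OSError, ValueError):
            continue  # not a PCIe device, or the attribute is unreadable
        results.append((dev.name, current, maximum))
    return results


for address, current, maximum in pcie_link_widths():
    note = "  <-- below maximum" if current < maximum else ""
    print(f"{address}: x{current} of x{maximum}{note}")
```

The `root` parameter exists only so the function can be pointed at a copy of the directory tree for testing.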
Final Bill of Materials
- CaseLabs Magnum T10 chassis with 78mm ventilated top plus...
- StarTech HSB4SATSASBA 4-bay 3U HDD cage
- Icy Dock MB324SP-B 4-bay 1U SSD cage
- EVGA SuperNOVA 850 T2 80 Plus Titanium PSU
- Noctua NH-D14 SE2011 CPU cooler (NF-P14 and NF-P12 fans)
- 2x Noctua NF-F12 iPPC 120mm PWM fans
- 2x Noctua NF-S12A 120mm PWM fans
- ASUS X99 Deluxe II motherboard
- Intel 6950X Broadwell-E CPU
- 4x Crucial Ballistix Sport LT 16GB DDR4 DIMMs (64GB total)
- ASUS GTX1080 A8G video card (NVIDIA GP104 GPU)
- E-SDS USB/HD-Audio 3.5" front panel
- NZXT AA-APMU3-B1 5.25" USB/media card reader
- ASUS R2.0 14-pin TPM
- 9x Seagate EXOS 12TB SATA3 HDDs
- 2x Western Digital SN720 1TB NVMe SSDs
- 2x Intel 750 400GB NVMe U.2 SSDs (removed)
- 2x Funtin U.2 to Mini-SAS cables (removed)
- 4x WD40EFRX Western Digital Red 4TB SATA3 HDDs (removed, sold for $144)
- 6x ST4000VN000 Seagate NAS 4TB SATA3 HDDs (removed, sold for $216)
- Samsung 840 Pro 256GB SATA3 SSD (removed)
- Lite-on Bluray burner