Threadripping under Debian
Getting a Threadripper machine to work under Debian
After more than eight years, my old machine started showing hardware problems: first the power supply failed and had to be replaced, then the CMOS battery died. It became clear that a replacement was finally in order. When the first reports about AMD’s Ryzen family came out, and in particular the Threadripper tests, I became interested. In the end, I waited until the Threadripper 2950X was available before ordering a custom-built machine from a local vendor. Here is the basic hardware setup:
- CPU: Threadripper 2950X (16 cores, 32 threads)
- Mainboard: MSI X399 Carbon
- Memory: 32 GB
- GPU: Nvidia GeForce GTX 1060
- SSD: Samsung 970 Evo NVMe 1 TB
This post describes what I had to do on the software / Linux side to bring the machine to a usable state.
Basic installation
As a long-time Debian user, and knowing that the latest Debian stable (Stretch) would not provide a recent enough kernel, I installed Debian Buster. This went more or less flawlessly. However, I can’t help but note that the setup of LVM and disk encryption (i.e. lvm2 with cryptsetup/LUKS) provided by the Debian installer is really bad. It hasn’t improved one bit in ages: the automatic partitioning comes up with a crazy sizing scheme, and there is absolutely zero useful support when you go manual. It’s easy to end up with an installation that looks like it succeeded, but on reboot drops you to the rescue console because mounting the root volume group fails.
The only other noteworthy thing is that I directly installed the proprietary Nvidia drivers, because the free nouveau driver doesn’t really support any of the “newer” and more advanced 10X0 cards from Nvidia. As I am an LXDE user, that was of course the desktop environment I installed. However, I installed LXQt and KDE along with it to learn about their current state, but quickly went back to LXDE: LXQt is supposed to be the successor of LXDE, but it’s currently not really in a comparable state.
I also installed GNOME briefly, only to discover that it tries to use Wayland, which apparently has problems with the Nvidia drivers. I couldn’t make it work, and as I was never a big GNOME fan anyway, I simply threw it out again.
systemd-udev crashes and kernel compilation
With the system up and running, the first problem I noticed was systemd-udev constantly crashing. This also prevented suspend/hibernate from working. A longer internet search finally revealed this systemd-udev bug report in Red Hat’s bugzilla. Apparently “the BIOS/firmware is advertising it supports SEV, when in fact it doesn’t”, where SEV is AMD’s Secure Encrypted Virtualization technology. If the kernel config option CONFIG_CRYPTO_DEV_SP_PSP is set to yes, the kernel will try to use it, and apparently most distribution-provided kernels set it; the one from Debian Buster (4.18.0-2 at the time) certainly does.
There are apparently three ways to fix this issue:
- wait until the BIOS/firmware is fixed to no longer provide the wrong information,
- wait for a newer kernel to provide a workaround,
- or compile the current kernel with CONFIG_CRYPTO_DEV_SP_PSP set to no.
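To check whether a given kernel was built with the option, you can look at its config file. The following is only a minimal sketch, assuming the usual Debian layout where the config of each installed kernel is shipped as /boot/config-<release>; the helper name kconfig_value is made up for illustration.

```shell
# Print the value (y, m, ...) that a kernel config file assigns to an option.
# Prints "unset" if the option is absent or only appears as "# ... is not set".
kconfig_value() {
    config_file=$1
    option=$2
    # Keep only what follows "OPTION=" on the first matching line.
    value=$(sed -n "s/^${option}=//p" "$config_file" | head -n 1)
    printf '%s\n' "${value:-unset}"
}

# Usage for the running kernel:
#   kconfig_value "/boot/config-$(uname -r)" CONFIG_CRYPTO_DEV_SP_PSP
```

If this prints y for the running kernel, that kernel was built with the option enabled.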
Apparently, the kernel has had a workaround for the issue since 4.19-rc5, but at the time of writing Buster and Sid only have 4.18. So, off to compile a new kernel without that setting. This turned out to have become a lot more complicated since the last time I had to do this on a Debian system (which was still using make-kpkg at the time, so yes, it has been several years since I last needed to compile a kernel). It also doesn’t help that quite a bit of the documentation out there is outdated; the Debian Kernel Handbook seems to be the proper documentation.
Unfortunately, it’s still easy to get it wrong. I ran into issues with certificates (why exactly you would need certificates to compile a kernel is beyond me; my wild guess is that it’s related to “secure” boot), my changes to the configuration were overwritten during the process, and there were other issues. In the end, the following process “works for me”:
# run from the top of the unpacked Debian linux source package
# cf. chapter 4.2.3 of the Debian Kernel Handbook: generate the setup for amd64
make -f debian/rules.gen setup_amd64_none_amd64
# open the config dialog, then enable RCU_BOOST and disable CONFIG_CRYPTO_DEV_SP_PSP
make -C debian/build/build_amd64_none_amd64 xconfig
# build the kernel packages
make -f debian/rules.gen binary-arch_amd64_none_amd64
The only problem with this is that the generated kernel will have the same version number as the official package. So, a newer minor version of the Debian package can overwrite your manually built package. I haven’t really looked into this yet, but will update this section once I do (and yes, the process described in section 4.3 of said handbook didn’t work for me).
By the way, don’t expect kernel compilation to be a fast process. Apparently, the kernel configuration that Debian uses out of the box compiles half the world, and consequently this takes ages even on such a high-end machine.
The newly compiled kernel fixed the systemd-udev crashes, and suspend / hibernate then worked as well. However, when I triggered hibernate from the LXDE logout dialog, the machine wouldn’t power off. Another long round of searching the web, including reading the source of lxsession-logout.c, revealed the solution: disable upower via systemctl disable upower.
Random hangs
During my first attempt to compile a kernel, my system just hung. Completely; not even the magic SysRq key worked anymore. That’s apparently another known issue, but there are no good solutions. Setting RCU_NOCB_CPU and RCU_BOOST is one suggested fix (cf. the soft hang discussion in the AMD forum), but this didn’t really help me (cf. also the random soft lock discussion on the kernel bugtracker). However, the ZenStates GitHub repo, which is also linked there, contains a Python script which disables the C6 state.
Again, the forum suggests that the issue might be fixed with newer BIOS versions, but my mainboard has the mentioned AGESA version and the issue still occurred. Disabling the C6 state with the script fixes the problem, but results in higher energy consumption and is hence not exactly a perfect solution either. If you want to run zenstates automatically, there is also a systemd template for zenstates. Note that this needs some additional tweaking (which I haven’t gotten around to yet) to run modprobe msr.
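For the modprobe msr part, one option is to let systemd load the module at boot through a modules-load.d entry instead of calling modprobe from the unit itself. This is only a sketch and has not been tested together with the zenstates unit; the helper name enable_msr_at_boot and the file name msr.conf are illustrative choices.

```shell
# Write a modules-load.d entry so that systemd-modules-load loads the msr
# kernel module at boot. The directory defaults to the system location;
# pass another directory to try it out without root privileges.
enable_msr_at_boot() {
    conf_dir=${1:-/etc/modules-load.d}
    printf 'msr\n' > "$conf_dir/msr.conf"
}

# Usage (as root):
#   enable_msr_at_boot      # takes effect on the next boot
#   modprobe msr            # load the module for the current session
# Afterwards the ZenStates script can be run; to my knowledge its switch
# for this is --c6-disable, but check the script's --help output.
```

Since module loading happens early during boot, the module should normally be available by the time a regular service such as the zenstates unit starts.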
PCI errors
One other thing I noticed in the logs were recurring PCI errors. There are a number of suggested fixes, cf. this PCI error answer on Ask Ubuntu to use pci=nomsi or pci=noaer, or this other PCI error suggestion on Unix & Linux Stack Exchange to use pci=nommconf. For me, pci=noaer hides the errors successfully and for the moment that’s good enough (read: I haven’t investigated whether the other suggestions would actually fix the issue as well).
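On a Debian system booting via GRUB, such a kernel parameter usually goes into GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub, followed by a run of update-grub. The following is a sketch under those assumptions (GRUB as boot loader, the variable on a single double-quoted line); add_kernel_param is a hypothetical helper name.

```shell
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in a grub defaults
# file, unless that line already contains it. Uses GNU sed's in-place mode.
add_kernel_param() {
    param=$1
    grub_file=${2:-/etc/default/grub}
    # Nothing to do if the parameter is already on the command line.
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$grub_file" | grep -qF -- "$param"; then
        return 0
    fi
    sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" "$grub_file"
}

# Usage (as root):
#   add_kernel_param pci=noaer
#   update-grub             # regenerate the GRUB configuration
# After a reboot, /proc/cmdline shows whether the parameter is active.
```

Note that pci=noaer only turns off the error reporting; it does not address whatever causes the errors.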
Virtualization
The latest thing I ran into was that VirtualBox refused to start a virtual machine, claiming that the AMD virtualization feature was disabled in the BIOS. This turned out to be the case: SVM was disabled. Actually, I couldn’t easily find the option in the first place; I had to use the search function in the BIOS.
Two things I haven’t tried out yet
The two things I haven’t used yet are the goodies this motherboard provides over the one I originally wanted: Bluetooth and WLAN. I did install the Intel firmware to support both features, but can’t really say anything more about them yet.
Conclusion
I don’t have a conclusion just yet. It’s clear this new machine is a lot faster than my old box, but given that the old Core2Duo CPU is quite long in the tooth by now and that the old machine didn’t have an SSD, that’s to be expected. Although this article has gotten longer than I would have hoped, overall the machine is running quite well (and that with me running a testing version of Debian, which is usually really badly supported). In particular, I can’t say I ran into any issues with the peripherals so far. I guess the machine will have to show its power over the next months, and hopefully there will be some more fixes to BIOS/firmware and the kernel as needed.