Bugzilla – Bug 1225147
Kernel hard lockup under mild GPU load
Last modified: 2024-05-26 23:47:06 UTC
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0
Build Identifier:
I randomly get hard lockups with no errors or dmesg logs, and I can't ssh into the system either.
Reproducible: Always
Steps to Reproduce:
1. Record the screen with vaapi encoding on amdgpu
Actual Results: Kernel hard lockup after 5-30 minutes
Expected Results: System should be stable
Hardware is an i5-13600K and an RX 6600 XT. I'm running sway, and the screen recording tool doesn't matter as long as it uses vaapi encoding. The Mesa version doesn't matter either. This happens on kernel 6.7 and newer; I just tried 6.9.1 and can reproduce it there as well. It does not happen on 6.6.x, and I'm currently running 6.6.31-lts. I'd go ahead and bisect the kernel, but I'm not sure how to build and install a kernel so that it appears in the GRUB menu.
It's quite difficult to debug without any logs, unfortunately. You can try to set up kdump and get a kernel crash dump (or at least the dmesg output). If it is a kernel panic, the crash dump will be triggered automatically. Otherwise, you can trigger one manually via magic SysRq-c.
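Before relying on a manual SysRq-c, it may help to check that the machine is actually prepared for it. The following is only a minimal sketch, not an official tool: it assumes the kdump capture kernel state is exposed in /sys/kernel/kexec_crash_loaded and that the keyboard 'c' key is gated by the "debugging dumps" bit (0x8) of kernel.sysrq, as described in the kernel's sysrq admin guide. The helper names are made up for illustration.

```python
from pathlib import Path

# Bit in kernel.sysrq that gates "debugging dumps". Assumption: per the
# kernel sysrq admin guide, this also gates the keyboard 'c' (crash) key.
SYSRQ_ENABLE_DUMP = 0x8


def sysrq_crash_key_enabled(sysrq_value: int) -> bool:
    """Return True if the keyboard SysRq-c combo is allowed for this value."""
    # A value of 1 enables every SysRq function; otherwise it is a bitmask.
    return sysrq_value == 1 or bool(sysrq_value & SYSRQ_ENABLE_DUMP)


def read_int(path: str):
    """Read an integer from a procfs/sysfs file, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def report() -> str:
    """Summarize whether a manual crash dump is likely to be captured."""
    sysrq = read_int("/proc/sys/kernel/sysrq")
    crash_loaded = read_int("/sys/kernel/kexec_crash_loaded")
    lines = []
    if crash_loaded is None:
        lines.append("kdump: state unknown (kexec_crash_loaded not readable)")
    elif crash_loaded == 1:
        lines.append("kdump: capture kernel is loaded")
    else:
        lines.append("kdump: capture kernel is NOT loaded")
    if sysrq is None:
        lines.append("sysrq: state unknown (kernel.sysrq not readable)")
    elif sysrq_crash_key_enabled(sysrq):
        lines.append(f"sysrq: keyboard crash key allowed (kernel.sysrq={sysrq})")
    else:
        lines.append(f"sysrq: keyboard crash key blocked (kernel.sysrq={sysrq})")
    return "\n".join(lines)


if __name__ == "__main__":
    print(report())
```

Note that with a hard lockup where ssh is already dead, only the keyboard combination has a chance of working, which is why the sketch checks the kernel.sysrq mask rather than writing to /proc/sysrq-trigger.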
(In reply to Takashi Iwai from comment #1)
> It's quite difficult to debug without any logs, unfortunately.
> You can try to set up kdump and get the kernel crash dump (at least the
> dmesg output), too. If it were a kernel panic, the crash dump will be
> triggered automatically. Other than that, you can trigger manually via
> magic sysrq-c.
I bisected it down to the amdgpu changes in 6.7-rc1 and reported it upstream here: https://gitlab.freedesktop.org/drm/amd/-/issues/3403
Unfortunately, I can't bisect it down to a specific commit because amdgpu is broken at random commits in that tree.
Thanks. As a blind shot (as it's a 6.7 regression), could you later try a patched test kernel from the OBS home:tiwai:bsc1219983 repo? Once the build finishes, the package will appear at http://download.opensuse.org/repositories/home:/tiwai:/bsc1219983/standard/
(In reply to Takashi Iwai from comment #3)
> Thanks.
>
> As a blind shot (as it's a 6.7 regression), could you try later a test
> patched kernel in OBS home:tiwai:bsc1219983 repo? Once after the build
> finishes, the package will appear at
> http://download.opensuse.org/repositories/home:/tiwai:/bsc1219983/standard/
It has the same issue; I still get a hard lockup. I'd imagine it's related to power management or GPU clocks, because these crashes are very similar to what happens when you run a very unstable overclock and stress the system a little. Except I'm not overclocking.
So the patch didn't seem to help in your case? FWIW, it was the one-line revert mentioned in https://gitlab.freedesktop.org/drm/amd/-/issues/3142 The best you can do for the moment would be to try to catch any kernel crash or similar messages and report / track the bug in the upstream gitlab.freedesktop.org issues.
(In reply to Takashi Iwai from comment #5)
> So the patch didn't seem helping in your case?
> FWIW, it was a one-line revert mentioned in
> https://gitlab.freedesktop.org/drm/amd/-/issues/3142
>
> The best you can do for the moment would be to try to catch any kernel crash
> or such messages and report / track the bug in the upstream
> gitlab.freedesktop.org Issues.
Actually that patch does work, thanks! I must've booted into the latest kernel by accident instead of picking the one from your branch when trying it out.
That's good news. At least we're heading in the right direction. I can backport the workaround patch to TW, but since upstream got a significant rewrite of the relevant code, let's first check whether that covers your problem. I'm building another test kernel with two upstream backports:
2d5bb791e24f43b6b4231b7973009987bbcc9b06 drm/amd/display: Implement update_planes_and_stream_v3 sequence
d62d5551dd615f9e488b13595d69b308cd019e16 drm/amd/display: Backup and restore only on full updates
It's being built in the OBS home:tiwai:bsc1225147 repo. The package will appear at http://download.opensuse.org/repositories/home:/tiwai:/bsc1225147/standard/ Please give it a try later. Meanwhile, you can join the upstream gitlab.freedesktop.org issues mentioned in comment 2 and report there that the revert helped, too.
(In reply to Takashi Iwai from comment #7)
> It's a good news. At least we're heading to the right direction.
>
> I can backport the workaround patch to TW, but since the upstream got a
> significant rewrite of the relevant code, let's check whether it covers your
> problem at first.
>
> I'm building another test kernel with two upstream backports:
> 2d5bb791e24f43b6b4231b7973009987bbcc9b06
> drm/amd/display: Implement update_planes_and_stream_v3 sequence
> d62d5551dd615f9e488b13595d69b308cd019e16
> drm/amd/display: Backup and restore only on full updates
>
> It's being built in OBS home:tiwai:bsc1225147 repo. The package will appear
> at
> http://download.opensuse.org/repositories/home:/tiwai:/bsc1225147/standard/
> Please give it a try later.
>
That does not resolve the issue; I can still reproduce the hard lockup.
> Meanwhile, you can join to the upstream gitlab.freedesktop.org issues
> mentioned in comment 2, echoing that the revert helped, too.
I did: https://gitlab.freedesktop.org/drm/amd/-/issues/3142#note_2427275
(In reply to llyyr from comment #8)
> (In reply to Takashi Iwai from comment #7)
> > It's a good news. At least we're heading to the right direction.
> >
> > I can backport the workaround patch to TW, but since the upstream got a
> > significant rewrite of the relevant code, let's check whether it covers your
> > problem at first.
> >
> > I'm building another test kernel with two upstream backports:
> > 2d5bb791e24f43b6b4231b7973009987bbcc9b06
> > drm/amd/display: Implement update_planes_and_stream_v3 sequence
> > d62d5551dd615f9e488b13595d69b308cd019e16
> > drm/amd/display: Backup and restore only on full updates
> >
> > It's being built in OBS home:tiwai:bsc1225147 repo. The package will appear
> > at
> > http://download.opensuse.org/repositories/home:/tiwai:/bsc1225147/standard/
> > Please give it a try later.
> >
> That does not resolve the issue, I can still reproduce the hard lockup.
Thanks, good to know. Just to be sure, could you try the kernel-vanilla package in my OBS home:tiwai:kernel:drm-tip repo? http://download.opensuse.org/repositories/home:/tiwai:/kernel:/drm-tip/standard/
(In reply to Takashi Iwai from comment #9)
> Just to be sure, could you try kernel-vanilla package in my OBS
> home:tiwai:kernel:drm-tip repo?
>
> http://download.opensuse.org/repositories/home:/tiwai:/kernel:/drm-tip/
> standard/
It freezes. The only thing that helps is the patch which deletes this line: https://github.com/torvalds/linux/blob/c13320499ba0efd93174ef6462ae8a7a2933f6e7/drivers/gpu/drm/amd/display/dc/core/dc_state.c#L323
But it's definitely not ideal.
OK, then please update the upstream bug tracker accordingly. It's useful to know that the very latest code still suffers from the same problem. Let's take the workaround temporarily until upstream gets a proper resolution. It's not ideal, but better safe than sorry.
(In reply to Takashi Iwai from comment #11)
> OK, then please update the upstream bugtracker info accordingly. It's
> useful to know that the very latest code still suffers from the same problem.
>
6.10-rc1 should be tagged later today; I'll give that a spin and then update upstream. Might it also be worth trying out the https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next branch?
> Let's take the workaround temporarily for now until the upstream gets the
> proper resolution. It's not ideal, but better than sorry.
Thanks!
(In reply to llyyr from comment #12)
> (In reply to Takashi Iwai from comment #11)
> > OK, then please update the upstream bugtracker info accordingly. It's
> > useful to know that the very latest code still suffers from the same problem.
> >
> 6.10-rc1 should be tagged later today, I'll give that a spin then update
> upstream. Might also be worth trying out the
> https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next
> branch?
Sure, it's worth trying out.
Gave 6.10-rc1 a shot and got a freeze within minutes. The patch still works around the issue, though.