Bug 1225147

Summary: Kernel hard lockup under mild GPU load
Product: [openSUSE] openSUSE Tumbleweed
Reporter: llyyr <llyyr.public>
Component: Kernel
Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: NEW ---
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P5 - None
CC: llyyr.public, tiwai
Version: Current
Target Milestone: ---
Hardware: x86-64
OS: openSUSE Tumbleweed
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description llyyr 2024-05-23 13:56:24 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:126.0) Gecko/20100101 Firefox/126.0

The system randomly hard-locks with no errors and nothing in dmesg, and I can't SSH into it afterwards either.

Reproducible: Always

Steps to Reproduce:
1. Record the screen with VA-API encoding on amdgpu

Actual Results:  
Kernel hard lockup after 5-30 minutes of recording

Expected Results:  
System should be stable

Hardware: Intel i5-13600K CPU and AMD RX 6600 XT GPU.

I'm running sway. The screen recording tool doesn't matter as long as it uses VA-API encoding, and the Mesa version doesn't matter either.

This happens on kernel versions 6.7 and newer; I just tried 6.9.1 and can reproduce it there as well. It does not happen on 6.6.x. I'm currently running 6.6.31-lts.
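
For anyone checking whether their own kernel falls in the range reported above, here is a minimal sketch. The function names are made up for illustration, and it assumes a `sort` that supports `-V` (version sort, as in GNU coreutils).

```shell
# Succeeds if version $1 is greater than or equal to version $2.
# Assumes `sort -V` (version sort) is available, as in GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Per this report: 6.7 and newer lock up, 6.6.x does not.
kernel_in_affected_range() {
    version_ge "$1" 6.7
}

# Check the running kernel.
if kernel_in_affected_range "$(uname -r)"; then
    echo "running kernel is in the reported affected range (6.7 or newer)"
else
    echo "running kernel is older than 6.7"
fi
```

Version sort is used rather than a plain string comparison so that 6.10 is treated as newer than 6.7.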

I'd go ahead and bisect the kernel but I'm not sure how to build and install the kernel in a way that it appears in the grub menu.
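
A rough sketch of one way to bisect and get each build into the GRUB menu, to be run from a mainline kernel git checkout. Assumptions not taken from this report: the function names and the `-bisect` suffix are illustrative, the config is copied from `/boot/config-$(uname -r)`, and the GRUB config lives at `/boot/grub2/grub.cfg` (the usual openSUSE location). Nothing runs until the functions are called.

```shell
# Start a bisect between the last good and the first bad release.
start_bisect() {
    git bisect start &&
    git bisect bad v6.7 &&   # first release series reported to lock up
    git bisect good v6.6     # last release series reported to be stable
}

# Build and install the commit that git bisect has checked out.
build_and_install_kernel() {
    # Start from the running kernel's config; new options take their defaults.
    cp "/boot/config-$(uname -r)" .config &&
    make olddefconfig &&
    # The suffix makes bisect builds easy to tell apart in the boot menu.
    make -j"$(nproc)" LOCALVERSION=-bisect &&
    sudo make LOCALVERSION=-bisect modules_install install &&
    # Regenerate the GRUB menu so the new kernel is listed.
    sudo grub2-mkconfig -o /boot/grub2/grub.cfg
}
```

After booting each build and testing it, `git bisect good` or `git bisect bad` records the result and checks out the next candidate.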
Comment 1 Takashi Iwai 2024-05-24 11:14:22 UTC
Unfortunately, this is quite difficult to debug without any logs.
You could set up kdump to capture a kernel crash dump (or at least the dmesg output). If it is a kernel panic, the crash dump is triggered automatically. Otherwise, you can trigger one manually via magic SysRq-c.
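
The manual trigger could look roughly like the sketch below. It assumes kdump is already configured and loaded; the function names are illustrative. Nothing runs until the functions are called, and `trigger_crash_dump` takes the machine down immediately.

```shell
# Allow all magic SysRq functions from the keyboard for the current boot.
enable_sysrq() {
    sudo sysctl -w kernel.sysrq=1
}

# Deliberately crash the kernel (same effect as Alt+SysRq+c) so that kdump,
# if it is set up, captures a dump. The system goes down immediately.
trigger_crash_dump() {
    echo c | sudo tee /proc/sysrq-trigger > /dev/null
}
```

If the machine is locked up so that no shell is reachable, only the keyboard combination can work, which is why SysRq has to be enabled beforehand.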
Comment 2 llyyr 2024-05-24 16:27:51 UTC
(In reply to Takashi Iwai from comment #1)
> It's quite difficult to debug without any logs, unfortunately.
> You can try to set up kdump and get the kernel crash dump (at least the
> dmesg output), too.  If it were a kernel panic, the crash dump will be
> triggered automatically.  Other than that, you can trigger manually via
> magic sysrq-c.

I bisected it down to the amdgpu changes in 6.7-rc1 and reported it upstream here: https://gitlab.freedesktop.org/drm/amd/-/issues/3403

Unfortunately, I can't bisect it down to a specific commit, because amdgpu is broken at random commits in that tree.
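
When a bisect candidate is broken for unrelated reasons, `git bisect skip` is the usual way to step past it. A minimal sketch with illustrative function names; whether it narrows things down in this particular tree is untested.

```shell
# Mark the currently checked-out commit as untestable; git then picks a
# nearby commit to test instead.
skip_untestable_commit() {
    git bisect skip
}

# Print the good/bad/skip decisions recorded so far.
show_bisect_log() {
    git bisect log
}
```

If too many commits are skipped, git can only report a range of candidate commits rather than a single first bad commit.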
Comment 3 Takashi Iwai 2024-05-24 17:09:04 UTC
Thanks.

As a shot in the dark (since it's a 6.7 regression), could you later try a patched test kernel from the OBS home:tiwai:bsc1219983 repo? Once the build finishes, the package will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1219983/standard/
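
Installing a kernel from such a repo with zypper could look roughly like this. The function name, the repo alias in the usage comment and the default package name `kernel-default` are assumptions for illustration, not something stated in this report.

```shell
# Usage: install_test_kernel REPO_URL REPO_ALIAS [PACKAGE]
install_test_kernel() {
    repo_url=$1
    repo_alias=$2
    package=${3:-kernel-default}   # assumed default kernel package name

    # Register the repository with auto-refresh, fetch its metadata, then
    # install the kernel package specifically from that repository.
    sudo zypper addrepo --refresh "$repo_url" "$repo_alias" &&
    sudo zypper refresh "$repo_alias" &&
    sudo zypper install --from "$repo_alias" "$package"
}

# Example (not executed here):
# install_test_kernel \
#     http://download.opensuse.org/repositories/home:/tiwai:/bsc1219983/standard/ \
#     tiwai-bsc1219983
```

The test kernel is installed alongside the existing ones, so it still has to be selected in the boot menu.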
Comment 4 llyyr 2024-05-25 03:21:37 UTC
(In reply to Takashi Iwai from comment #3)
> Thanks.
> 
> As a blind shot (as it's a 6.7 regression), could you try later a test
> patched kernel in OBS home:tiwai:bsc1219983 repo?  Once after the build
> finishes, the package will appear at
>   http://download.opensuse.org/repositories/home:/tiwai:/bsc1219983/standard/

It has the same issue: I still get a hard lockup.

I'd imagine it's related to power management or GPU clocks, because these crashes are very similar to what happens when you run a very unstable overclock and stress the system a little, except that I'm not overclocking.
Comment 5 Takashi Iwai 2024-05-25 07:26:59 UTC
So the patch doesn't seem to help in your case?
FWIW, it was a one-line revert mentioned in
  https://gitlab.freedesktop.org/drm/amd/-/issues/3142

The best you can do for the moment is to try to capture any kernel crash or similar messages, and to report and track the bug in the upstream gitlab.freedesktop.org issue tracker.
Comment 6 llyyr 2024-05-25 14:08:14 UTC
(In reply to Takashi Iwai from comment #5)
> So the patch didn't seem helping in your case?
> FWIW, it was a one-line revert mentioned in
>   https://gitlab.freedesktop.org/drm/amd/-/issues/3142
> 
> The best you can do for the moment would be to try to catch any kernel crash
> or such messages and report / track the bug in the upstream
> gitlab.freedesktop.org Issues.

Actually that patch does work, thanks! I must've booted into the latest kernel instead of picking the one from your branch by accident when trying it out.
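
To avoid testing the wrong kernel, the booted release can be checked before each run. A minimal sketch: the function name is illustrative, and the string to look for depends on how the test package names its release.

```shell
# Succeeds if the running kernel's release string contains the given text.
booted_kernel_matches() {
    case "$(uname -r)" in
        *"$1"*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Show what is actually running.
echo "booted kernel: $(uname -r)"
```

For example, `booted_kernel_matches 6.9.1 || echo "wrong kernel booted"` would flag a boot into a different version.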
Comment 7 Takashi Iwai 2024-05-26 07:25:01 UTC
That's good news. At least we're heading in the right direction.

I can backport the workaround patch to TW, but since upstream got a significant rewrite of the relevant code, let's first check whether that rewrite covers your problem.

I'm building another test kernel with two upstream backports:
2d5bb791e24f43b6b4231b7973009987bbcc9b06
  drm/amd/display: Implement update_planes_and_stream_v3 sequence
d62d5551dd615f9e488b13595d69b308cd019e16
  drm/amd/display: Backup and restore only on full updates

It's being built in the OBS home:tiwai:bsc1225147 repo. The package will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1225147/standard/
Please give it a try later.

Meanwhile, you could also comment on the upstream gitlab.freedesktop.org issue mentioned in comment 2 to report that the revert helped.
Comment 8 llyyr 2024-05-26 10:30:14 UTC
(In reply to Takashi Iwai from comment #7)
> It's a good news.  At least we're heading to the right direction.
> 
> I can backport the workaround patch to TW, but since the upstream got a
> significant rewrite of the relevant code, let's check whether it covers your
> problem at first.
> 
> I'm building another test kernel with two upstream backports:
> 2d5bb791e24f43b6b4231b7973009987bbcc9b06
>   drm/amd/display: Implement update_planes_and_stream_v3 sequence
> d62d5551dd615f9e488b13595d69b308cd019e16
>   drm/amd/display: Backup and restore only on full updates
> 
> It's being built in OBS home:tiwai:bsc1225147 repo.  The package will appear
> at
>   http://download.opensuse.org/repositories/home:/tiwai:/bsc1225147/standard/
> Please give it a try later.
> 
That does not resolve the issue, I can still reproduce the hard lockup.

> Meanwhile, you can join to the upstream gitlab.freedesktop.org issues
> mentioned in comment 2, echoing that the revert helped, too.
I did: https://gitlab.freedesktop.org/drm/amd/-/issues/3142#note_2427275
Comment 9 Takashi Iwai 2024-05-26 11:04:46 UTC
(In reply to llyyr from comment #8)
> That does not resolve the issue, I can still reproduce the hard lockup.

Thanks, good to know.

Just to be sure, could you try the kernel-vanilla package from my OBS home:tiwai:kernel:drm-tip repo?
  http://download.opensuse.org/repositories/home:/tiwai:/kernel:/drm-tip/standard/
Comment 10 llyyr 2024-05-26 14:47:42 UTC
(In reply to Takashi Iwai from comment #9)
> Just to be sure, could you try kernel-vanilla package in my OBS
> home:tiwai:kernel:drm-tip repo?
>   http://download.opensuse.org/repositories/home:/tiwai:/kernel:/drm-tip/standard/

It freezes. The only thing that helps is the patch that deletes this line: https://github.com/torvalds/linux/blob/c13320499ba0efd93174ef6462ae8a7a2933f6e7/drivers/gpu/drm/amd/display/dc/core/dc_state.c#L323

But it's definitely not ideal.
Comment 11 Takashi Iwai 2024-05-26 16:03:08 UTC
OK, then please update the upstream bug tracker accordingly. It's useful to know that the very latest code still suffers from the same problem.

Let's take the workaround for now, until upstream has a proper fix. It's not ideal, but better safe than sorry.
Comment 12 llyyr 2024-05-26 16:22:40 UTC
(In reply to Takashi Iwai from comment #11)
> OK, then please update the upstream bugtracker info accordingly.  It's
> useful to know that the very latest code still suffers from the same problem.
> 
6.10-rc1 should be tagged later today. I'll give that a spin and then update upstream. Might it also be worth trying out the https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next branch?

> Let's take the workaround temporarily for now until the upstream gets the
> proper resolution.  It's not ideal, but better than sorry.

Thanks!
Comment 13 Takashi Iwai 2024-05-26 16:26:30 UTC
(In reply to llyyr from comment #12)
> (In reply to Takashi Iwai from comment #11)
> > OK, then please update the upstream bugtracker info accordingly.  It's
> > useful to know that the very latest code still suffers from the same problem.
> > 
> 6.10-rc1 should be tagged later today, I'll give that a spin then update
> upstream. Might also be worth trying out the
> https://gitlab.freedesktop.org/agd5f/linux/-/tree/amd-staging-drm-next
> branch?

Sure, it's worth trying out.
Comment 14 llyyr 2024-05-26 23:47:06 UTC
I gave 6.10-rc1 a shot and got a freeze within minutes. The patch still works around the issue, though.