Bug 1180742 - [amdgpu]An AMD Vega series GPU randomly crashes
Status: NEW
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.2
Hardware: x86-64 openSUSE Leap 15.2
Priority: P5 - None
Severity: Normal
Assigned To: openSUSE Kernel Bugs
Reported: 2021-01-10 12:46 UTC by Iakov Karpov
Modified: 2022-02-28 15:22 UTC (History)


Attachments
partial kernel log (8.04 KB, text/plain)
2021-01-10 12:46 UTC, Iakov Karpov
Partial kernel log of 5.3.18-107.g0b709ea-default (39.65 KB, text/plain)
2021-02-06 13:11 UTC, Iakov Karpov

Description Iakov Karpov 2021-01-10 12:46:18 UTC
Created attachment 844970 [details]
partial kernel log

The AMDGPU kernel driver randomly crashes the GPU, usually under load, on Radeon VII hardware.
The GPU hang is relatively hard to hit, as it usually takes 5 to 7 days before it crashes.
After a hang, the driver attempts to reset the GPU, but sometimes the reset fails and the system stays mostly unresponsive. It can still be accessed over the network, and it still reacts to keyboard events in some way, but the display stays dead.
It also seems to bring the PCIe link down to 1.0 mode, where it stays until reboot.
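
For anyone trying to confirm the PCIe observation above, here is a minimal sketch in Python that reads the link attributes the kernel exposes in sysfs for a PCI device (current_link_speed, max_link_speed, current_link_width, max_link_width). The function names are made up for illustration, and the PCI address in the usage note is a placeholder, not the address of the reporter's card. PCIe 1.0 corresponds to 2.5 GT/s; the exact string format differs between kernel versions, so the sketch only compares current against maximum.

```python
from pathlib import Path

# Attribute names as exposed under /sys/bus/pci/devices/<address>/
LINK_ATTRIBUTES = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_status(device_dir):
    """Return the PCIe link attributes found in one PCI device directory.

    Attributes that are missing or unreadable are reported as None.
    """
    device = Path(device_dir)
    status = {}
    for name in LINK_ATTRIBUTES:
        try:
            status[name] = (device / name).read_text().strip()
        except OSError:
            status[name] = None
    return status


def link_below_maximum(status):
    """Return True if the current link speed differs from the maximum.

    This is only a hint: a GPU may also lower its link speed while idle
    as part of normal power saving, so check under load.
    """
    current = status.get("current_link_speed")
    maximum = status.get("max_link_speed")
    return current is not None and maximum is not None and current != maximum
```

Usage would look like `read_link_status("/sys/bus/pci/devices/0000:0a:00.0")`, with the address replaced by whatever `lspci -D` reports for the GPU.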

There's an open upstream bug that may be related: https://gitlab.freedesktop.org/drm/amd/-/issues/716

That particular GPU works fine in a Windows machine.

openSUSE Leap 15.2, kernel 5.3.18-lp152.57-default #1 SMP Fri Dec 4 07:27:58 UTC 2020 (7be5551)
Comment 1 Takashi Iwai 2021-01-11 12:58:09 UTC
It's some GPU hang that leads to the real kernel crash... which has sometimes happened to others, too.  Unfortunately there is no fix for this yet, and likely there won't be one for the Leap 15.2 kernel.

Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS Kernel:SLE15-SP3?  The latter contains the backport of DRM stack up to 5.9.x.
Comment 2 Iakov Karpov 2021-01-11 18:10:11 UTC
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too.  Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
> 
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3?  The latter contains the backport of DRM stack up to 5.9.x.

Kernel 5.3.18-100.g3524980 of Kernel:SLE15-SP3 won't boot on this machine (stuck right after the bootloader, not even a single line on screen after "loading initrd").

Testing with Kernel:stable may require some time.
Comment 3 Takashi Iwai 2021-01-18 15:34:02 UTC
(In reply to Iakov Karpov from comment #2)
> (In reply to Takashi Iwai from comment #1)
> > It's some GPU hang that leads to the real kernel crash.... which happened on
> > others sometimes, too.  Unfortunately there is no fix for this and likely
> > not for Leap 15.2 kernel.
> > 
> > Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> > Kernel:SLE15-SP3?  The latter contains the backport of DRM stack up to 5.9.x.
> 
> kernel 5.3.18-100.g3524980 of Kernel:SLES15-SP3 won't boot on this machine
> (stuck right after bootloader, not even a single line after "loading initrd"
> on screen.

That's bad.  Do you have Secure Boot enabled?  If so, disable it when you test a kernel from an OBS repo rather than the official release.
Comment 4 Iakov Karpov 2021-01-25 16:41:24 UTC
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too.  Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
> 
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3?  The latter contains the backport of DRM stack up to 5.9.x.

I've been testing kernel 5.10.6-3.g183dcff-default of Kernel:stable for almost 14 days now without a single crash.

(In reply to Takashi Iwai from comment #3)
> That's bad.  Do you have the secure boot enabled?  If so, disable it when
> you test a kernel from OBS repo that is other than the official release.

I'm on kernel 5.3.18-107.g0b709ea-default of Kernel:SLE15-SP3 now, and it works for me. I didn't change anything about Secure Boot, though; I don't think I had it enabled. I'll report back in another 2 weeks if it doesn't crash sooner.
Comment 5 Iakov Karpov 2021-02-06 13:10:01 UTC
(In reply to Takashi Iwai from comment #1)
> It's some GPU hang that leads to the real kernel crash.... which happened on
> others sometimes, too.  Unfortunately there is no fix for this and likely
> not for Leap 15.2 kernel.
> 
> Could you try the kernel in OBS Kernel:stable or SLE15-SP3 kernel in OBS
> Kernel:SLE15-SP3?  The latter contains the backport of DRM stack up to 5.9.x.

It crashed on the 12th day with 5.3.18-107.g0b709ea-default (Kernel:SLE15-SP3).
Comment 6 Iakov Karpov 2021-02-06 13:11:04 UTC
Created attachment 845864 [details]
Partial kernel log of 5.3.18-107.g0b709ea-default
Comment 7 Takashi Iwai 2021-03-17 08:39:53 UTC
So something unstable is still floating around.  Tweaking the module options (like disabling power management) might work around it, but that's not a proper solution.
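
Before changing any module options, it can help to see what the loaded driver is currently using. The sketch below, in Python, reads parameter values from /sys/module/<module>/parameters/. The comment above does not say which options are meant; dpm and runpm are picked here only as examples of power-management-related amdgpu parameters, and the function name is made up for illustration.

```python
from pathlib import Path


def read_module_parameters(module="amdgpu", names=("dpm", "runpm"), sysfs_root="/sys"):
    """Return the current values of selected parameters of a loaded module.

    Parameters that do not exist (or a module that is not loaded) yield None.
    The default names are illustrative examples, not a recommended set.
    """
    params_dir = Path(sysfs_root) / "module" / module / "parameters"
    values = {}
    for name in names:
        try:
            values[name] = (params_dir / name).read_text().strip()
        except OSError:
            values[name] = None
    return values
```

Calling `read_module_parameters()` on the affected machine returns a dictionary such as `{"dpm": ..., "runpm": ...}`, which is worth attaching to a report alongside the kernel log whenever an option is changed.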

I believe the best way would be to report it to the upstream bug tracker and/or follow the issue there.
Comment 8 Takashi Iwai 2021-03-17 08:40:57 UTC
It's rather similar to the upstream issue:
  https://gitlab.freedesktop.org/drm/amd/-/issues/934
Comment 9 Miroslav Beneš 2022-02-26 08:53:44 UTC
Still not resolved upstream according to the reports. It might be worked around by disabling the GPU's dynamic power management or by manipulating the GPU frequency throttling.
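
One way to experiment with the frequency side of this is the amdgpu sysfs attribute power_dpm_force_performance_level in the GPU's device directory (for example /sys/class/drm/card0/device, where card0 is an assumption about card numbering). The Python sketch below reads it and, when run as root, writes a new level. Whether any particular level avoids the hang is not established in this report; the helper names are made up for illustration.

```python
from pathlib import Path

# Levels documented for the amdgpu power_dpm_force_performance_level attribute.
KNOWN_LEVELS = (
    "auto", "low", "high", "manual",
    "profile_standard", "profile_min_sclk", "profile_min_mclk", "profile_peak",
)
ATTRIBUTE = "power_dpm_force_performance_level"


def get_performance_level(device_dir):
    """Return the current forced performance level of the GPU."""
    return (Path(device_dir) / ATTRIBUTE).read_text().strip()


def set_performance_level(device_dir, level):
    """Write a new performance level; needs root on a real system.

    Raises ValueError for an unknown level and FileNotFoundError if the
    attribute is absent (for example, when the device is not an amdgpu card).
    """
    if level not in KNOWN_LEVELS:
        raise ValueError(f"unknown performance level: {level!r}")
    attribute = Path(device_dir) / ATTRIBUTE
    if not attribute.is_file():
        raise FileNotFoundError(str(attribute))
    attribute.write_text(level + "\n")
```

The setting does not persist across reboots, so a test run of several days would need it reapplied after every boot.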

Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not supported anymore, and Leap 15.3 is probably no better if I read your feedback correctly. Leap 15.4 will be based on the v5.14 kernel.
Comment 10 Iakov Karpov 2022-02-26 09:07:54 UTC
(In reply to Miroslav Beneš from comment #9)
> Still not resolved in upstream according to the reports. Might be worked
> around by disabling the dynamic power management of the GPU or by the GPU
> frequency throttling manipulation.
> 
> Iakov, by any chance, would the latest kernel from Leap 15.4 or the latest
> kernel from OBS Kernel:stable:Backport work better for you? Leap 15.2 is not
> supported anymore, Leap 15.3 is probably not better if I read your feedback
> correctly. Leap 15.4 will be based on v5.14 kernel.

I'm currently using Leap 15.3 with kernel 5.15.13 of Kernel:stable:Backport. It's better, but it still crashes sometimes. With 5.16.x kernels my system is crashing every few minutes, but I'm not sure the GPU is the cause there. I was not able to recover any crash logs, so there is no bug report on that.
Comment 11 Miroslav Beneš 2022-02-26 09:22:15 UTC
Thanks for the feedback. I'll leave the bug open and will occasionally monitor it.

CCing Patrik and Thomas so that they are aware, but I am not sure if we can do anything here besides waiting for upstream.
Comment 12 Takashi Iwai 2022-02-28 15:22:57 UTC
One thing that might be worth trying is to update kernel-firmware-amdgpu from the OBS Kernel:stable:Backport repo (if not done yet).