Bug 1212169 - drm:amdgpu_job_timeout
Summary: drm:amdgpu_job_timeout
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL: https://gitlab.freedesktop.org/drm/am...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-09 11:58 UTC by Berthold Gunreben
Modified: 2023-11-20 14:55 UTC

Attachments
output of hwinfo (2.69 MB, text/plain)
2023-06-09 11:58 UTC, Berthold Gunreben
messages from crash to hard reboot (804.97 KB, text/plain)
2023-07-24 08:53 UTC, Berthold Gunreben
new occurrence with current kernel and firmware (8.25 KB, application/gzip)
2023-08-11 12:46 UTC, Berthold Gunreben
boot.msg with dyndbg=+p (27.11 KB, application/x-xz)
2023-08-16 13:31 UTC, Berthold Gunreben
dmesg with kernel 6.5.2-1.gfdde566-default (196.68 KB, text/plain)
2023-09-11 21:56 UTC, Berthold Gunreben

Description Berthold Gunreben 2023-06-09 11:58:27 UTC
Created attachment 867480 [details]
output of hwinfo

After updating Tumbleweed from kernel 6.3.2 to 6.3.4, I encounter sporadic display hangs:

[11488.681228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3214852, emitted seq=3214854
[11488.681693] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 3679 thread firefox:cs0 pid 3746
[11488.682121] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!

From what I have read, KDE has to be restarted before it works again; however, the original issue has probably been seen before.
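
For anyone comparing occurrences across logs, here is a minimal Python sketch that pulls the ring name, the sequence numbers, and the offending process out of messages like the ones quoted above. The regular expressions are derived only from these three sample lines, so they are an assumption about the message format and may need adjusting for other kernel versions; the function name parse_timeouts is purely illustrative.

```python
import re

# Patterns derived from the sample lines in this report only (an assumption:
# other kernel versions may word these messages differently).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] \[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, signaled seq=(?P<signaled>\d+), "
    r"emitted seq=(?P<emitted>\d+)"
)
PROCESS_RE = re.compile(
    r"\*ERROR\* Process information: process (?P<process>\S+) pid (?P<pid>\d+)"
)


def parse_timeouts(log_text):
    """Return one dict per ring timeout found in log_text."""
    events = []
    for line in log_text.splitlines():
        match = TIMEOUT_RE.search(line)
        if match:
            events.append({
                "timestamp": float(match["ts"]),
                "ring": match["ring"],
                "signaled": int(match["signaled"]),
                "emitted": int(match["emitted"]),
                "process": None,
                "pid": None,
            })
            continue
        match = PROCESS_RE.search(line)
        if match and events:
            # Attach the process information to the most recent timeout.
            events[-1]["process"] = match["process"]
            events[-1]["pid"] = int(match["pid"])
    return events


SAMPLE = """\
[11488.681228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3214852, emitted seq=3214854
[11488.681693] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 3679 thread firefox:cs0 pid 3746
[11488.682121] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
"""

if __name__ == "__main__":
    for event in parse_timeouts(SAMPLE):
        gap = event["emitted"] - event["signaled"]
        print(event["ring"], event["process"], event["pid"], "gap:", gap)
```

For the sample above, the gap between the emitted and signaled sequence numbers is 2, and the process reported is firefox (pid 3679).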
Comment 1 Takashi Iwai 2023-06-09 13:52:12 UTC
Could you verify whether the problem is gone when you boot with 6.3.2?  i.e. confirm that it's a kernel regression between 6.3.2 and 6.3.4?
Comment 2 Berthold Gunreben 2023-07-24 08:33:09 UTC
This issue is kind of hard to reproduce. Yesterday I ran into it again; it seems to be tracked here:

https://gitlab.freedesktop.org/drm/amd/-/issues/1974

Notably, I only run into the issue when the screensaver is on, which points to a power management issue with the graphics card. I have never encountered it with screen saving manually disabled.

Kernel right now is 6.4.2-1-default
Comment 3 Berthold Gunreben 2023-07-24 08:51:50 UTC
OK ... I should not have said that: just now, the machine crashed. I could not even use ssh to access it. I'll add the messages from the start of the GPU problem until the reboot.
Comment 4 Berthold Gunreben 2023-07-24 08:53:04 UTC
Created attachment 868392 [details]
messages from crash to hard reboot
Comment 5 Takashi Iwai 2023-08-02 12:52:17 UTC
This might be related with the firmware.
Could you try to update to the latest 6.4.x kernel in OBS Kernel:stable, and the latest kernel-firmware-* package (version 20230731) from OBS Kernel:HEAD?
Comment 6 Berthold Gunreben 2023-08-02 13:41:25 UTC
(In reply to Takashi Iwai from comment #5)
> This might be related with the firmware.
> Could you try to update to the latest 6.4.x kernel in OBS Kernel:stable, and
> the latest kernel-firmware-* package (version 20230731) from OBS Kernel:HEAD?

I updated to kernel 6.5.0-rc4-1.g2390421-default; the kernel-firmware packages are at the version mentioned above. I will get back to you if the system has issues; however, those crashes are quite rare anyway...
Comment 7 Berthold Gunreben 2023-08-10 14:24:47 UTC
Just a heads-up: since the upgrade of the kernel and kernel-firmware packages, I have not had any crashes. Obviously I don't know whether this is by chance or whether the issue is fixed. I now run 6.5.0-rc5-2.g997a7e4-default and will continue with Kernel:stable for the time being.

One thing to note (for what it's worth ... I don't know): nvtop now displays much lower utilization of the GPUs. Also, the usage looks more balanced, and the external GPU is used even when the internal one is not fully loaded. This seems to be an improvement anyway.
Comment 8 Berthold Gunreben 2023-08-11 12:46:36 UTC
Created attachment 868767 [details]
new occurrence with current kernel and firmware

The issue just got me again, this time with current kernel 6.5.0-rc5-2.g997a7e4-default and current kernel-firmware-radeon-20230731-444.1.noarch
Comment 9 Takashi Iwai 2023-08-11 13:30:21 UTC
OK, could you rather report / update the upstream bug tracker entry?
Comment 10 Berthold Gunreben 2023-08-16 10:41:37 UTC
Just a question: Is there a way to know what firmware has actually been loaded? The firmware has been installed, but I would like to double check that it is actually used.
Comment 11 Takashi Iwai 2023-08-16 10:53:27 UTC
As of now, the easiest way would be to boot with the firmware_class.dyndbg=+p boot option.  This enables the debug output in the firmware loader, which shows every attempt to load a firmware file.
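
A rough Python sketch for checking the result afterwards: it tests whether the option is present on the kernel command line and collects firmware file names mentioned in a log. The exact wording of the firmware loader's debug messages is not shown in this report, so the sketch only assumes that relevant lines contain the word "firmware" and a file name ending in .bin (optionally compressed as .xz or .zst); both helper names are hypothetical.

```python
import re

OPTION = "firmware_class.dyndbg=+p"

# Assumption: firmware files are named *.bin, optionally with a .xz or .zst
# compression suffix. Adjust the pattern if your firmware uses other names.
FIRMWARE_FILE_RE = re.compile(r"[\w./+-]+\.bin(?:\.(?:xz|zst))?")


def has_boot_option(cmdline, option=OPTION):
    """Check whether option appears as a separate word on the command line."""
    return option in cmdline.split()


def firmware_files(log_text):
    """Collect firmware file names from log lines that mention firmware."""
    seen = []
    for line in log_text.splitlines():
        if "firmware" not in line.lower():
            continue
        for name in FIRMWARE_FILE_RE.findall(line):
            if name not in seen:
                seen.append(name)
    return seen


if __name__ == "__main__":
    import sys

    with open("/proc/cmdline") as handle:
        print("debug option active:", has_boot_option(handle.read()))
    # Pass a saved log (for example the output of dmesg) as the first argument.
    if len(sys.argv) > 1:
        with open(sys.argv[1], errors="replace") as handle:
            for name in firmware_files(handle.read()):
                print(name)
```

Run it with a saved log file as the only argument; without an argument it only reports whether the boot option is active.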
Comment 12 Berthold Gunreben 2023-08-16 13:31:59 UTC
Created attachment 868840 [details]
boot.msg with dyndbg=+p

Looks good to me; however, I am adding boot.msg with firmware_class.dyndbg=+p to document this and make sure everything is all right.
Comment 13 B 2023-08-22 01:41:04 UTC
Hello, I just wanted to mention that I also get this type of error and it causes my system to freeze. I need to restart with [Alt]+[SysRq]+REISUB.


openSUSE Tumbleweed VERSION="20230821"
6.4.11-1-default
Mesa 23.1.5

Dmesg log (full log): https://paste.opensuse.org/pastes/321366d5a4e7
Journalctl amdgpu: https://paste.opensuse.org/pastes/05e50981b729

Reported upstream... 
https://gitlab.freedesktop.org/drm/amd/-/issues/2801
Comment 14 Berthold Gunreben 2023-09-04 07:32:45 UTC
Just wanted to mention, last night I had this crash with 6.5.0-rc7-1.g869afb7-default. I just updated to 6.5.0-7.gb5edcad-default and kernel-firmware-20230829-448.1 from Kernel:HEAD. I will mention here if I get a crash again.
Comment 15 Takashi Iwai 2023-09-11 09:36:47 UTC
Just to be sure, could you guys check with the latest 6.5.x kernel from OBS Kernel:stable repo?

Also, 6.6-rc1 will be available soon later in OBS Kernel:HEAD repo in this week.  If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.
Comment 16 B 2023-09-11 12:10:33 UTC
(In reply to Takashi Iwai from comment #15)
> Just to be sure, could you guys check with the latest 6.5.x kernel from OBS
> Kernel:stable repo?
> 
> Also, 6.6-rc1 will be available soon later in OBS Kernel:HEAD repo in this
> week.  If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.

at first glance it seems kernel 6.5.2 fixed the issue, maybe someone else can confirm?
Comment 17 Berthold Gunreben 2023-09-11 20:32:12 UTC
(In reply to B from comment #16)
> (In reply to Takashi Iwai from comment #15)
> > Just to be sure, could you guys check with the latest 6.5.x kernel from OBS
> > Kernel:stable repo?
> > 
> > Also, 6.6-rc1 will be available soon later in OBS Kernel:HEAD repo in this
> > week.  If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.
> 
> at first glance it seems kernel 6.5.2 fixed the issue, maybe someone else
> can confirm?

My last crash happened with 6.5.0-rc5-2.g997a7e4-default. I currently run 6.5.2-1.gfdde566-default and did not see any issues so far. However, since it typically took something like 2 weeks for this to happen, I would not yet be sure if it is fixed. I would say, give it another two weeks ...
Comment 18 Berthold Gunreben 2023-09-11 21:56:25 UTC
Created attachment 869434 [details]
dmesg with kernel 6.5.2-1.gfdde566-default

I (again) was too early. I just had another crash related to this bug. I am now in the process of updating to the latest kernel, but 6.5.2 did not really fix the issue.
Comment 19 B 2023-09-12 10:59:46 UTC
Actually, maybe I was wrong and your error and mine aren't related. I just re-read the issue as described on the AMD GitLab, and it seems that while the symptom is the same (a crash), the cause is not. So ignore what I posted above.
Comment 20 Berthold Gunreben 2023-09-15 07:58:13 UTC
OK, 6.6.0-rc1-2.g45a1ae6-default also shows the issue. I can add the dmesg output if you like, but I don't think there is anything new in it.
Comment 21 Takashi Iwai 2023-09-15 10:48:34 UTC
If it happens with 6.6-rc1, you'd better report it to the upstream bug tracker.
We keep an eye on the development there, but cannot provide much help from here.
Comment 22 Berthold Gunreben 2023-11-18 11:06:45 UTC
I wanted to mention that the issue has not occurred for several weeks now. Therefore, I think it might be solved in newer kernels. I am currently running kernel 6.7.0-rc1, but the one before also did not show the issue.
Comment 23 Takashi Iwai 2023-11-20 14:55:33 UTC
OK, then let's close for now.
Feel free to reopen if the problem reappears.  Thanks.