Bugzilla – Bug 1212169
drm:amdgpu_job_timeout
Last modified: 2023-11-20 14:55:33 UTC
Created attachment 867480 [details] output of hwinfo
After updating Tumbleweed from kernel 6.3.2 to 6.3.4, I encounter sporadic display hangs:
[11488.681228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx_0.0.0 timeout, signaled seq=3214852, emitted seq=3214854
[11488.681693] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process firefox pid 3679 thread firefox:cs0 pid 3746
[11488.682121] amdgpu 0000:07:00.0: amdgpu: GPU reset begin!
From what I read, KDE would have to be restarted to work again; however, the original issue has probably been seen before.
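For comparing several occurrences, the timeout line quoted above can be picked apart with a small script. This is only a sketch: the regular expression is modelled on the single log line in this report, the function name `parse_timeout` is made up for illustration, and reading "emitted minus signaled" as the number of jobs still outstanding on the ring is an assumption, not something confirmed in this report.

```python
import re

# Pattern modelled on the one amdgpu_job_timedout line quoted in this report.
# Other kernel versions may word the message differently (assumption).
TIMEOUT_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:amdgpu_job_timedout \[amdgpu\]\] \*ERROR\* "
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)


def parse_timeout(line):
    """Return the fields of a ring timeout line, or None if it does not match."""
    match = TIMEOUT_RE.search(line)
    if match is None:
        return None
    signaled = int(match["signaled"])
    emitted = int(match["emitted"])
    return {
        "timestamp": float(match["ts"]),
        "ring": match["ring"],
        "signaled": signaled,
        "emitted": emitted,
        # Assumed meaning: jobs submitted to the ring but not yet completed.
        "pending": emitted - signaled,
    }


sample = (
    "[11488.681228] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
    "ring gfx_0.0.0 timeout, signaled seq=3214852, emitted seq=3214854"
)
info = parse_timeout(sample)
print(info["ring"], info["pending"])
```

Feeding the saved dmesg attachments through `parse_timeout` line by line would show whether the same ring is affected every time.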
Could you verify whether the problem is gone when you boot with 6.3.2? i.e. confirm that it's a kernel regression between 6.3.2 and 6.3.4?
This issue is kind of hard to reproduce. Yesterday, I again ran into the issue, which seems to be tracked here: https://gitlab.freedesktop.org/drm/amd/-/issues/1974 Notably, I only run into the issue when the screensaver is on, which points to a power management issue with the graphics card. I never encountered this when I manually disabled screen saving. Kernel right now is 6.4.2-1-default
ok ... I should not have said so; just now, the machine crashed. I could not even use ssh to access it. I'll add the messages from the start of the GPU problem until the hard reboot.
Created attachment 868392 [details] messages from crash to hard reboot
This might be related to the firmware. Could you try to update to the latest 6.4.x kernel in OBS Kernel:stable, and the latest kernel-firmware-* package (version 20230731) from OBS Kernel:HEAD?
(In reply to Takashi Iwai from comment #5)
> This might be related with the firmware.
> Could you try to update to the latest 6.4.x kernel in OBS Kernel:stable, and
> the latest kernel-firmware-* package (version 20230731) from OBS Kernel:HEAD?
I updated to kernel 6.5.0-rc4-1.g2390421-default; the kernel-firmware packages are at the version mentioned above. I will get back to you if the system has issues; however, those crashes are quite rare anyway...
Just a heads-up: Since the upgrade of the kernel and kernel-firmware packages, I have not had any crashes. Obviously I don't know if this is by chance, or if the issue is fixed. I now run 6.5.0-rc5-2.g997a7e4-default and will continue with Kernel:stable for the time being. One thing to note (for what it's worth ... I don't know): nvtop now displays much lower utilization of the GPUs. Also, the usage looks more balanced, and the external GPU is also used when the internal one is not fully loaded. Seems to be an improvement anyway.
Created attachment 868767 [details] new occurance with current kernel and firmware
The issue just got me again, this time with current kernel 6.5.0-rc5-2.g997a7e4-default and current kernel-firmware-radeon-20230731-444.1.noarch
OK, could you rather report / update the upstream bug tracker entry?
Just a question: Is there a way to know what firmware has actually been loaded? The firmware has been installed, but I would like to double check that it is actually used.
As of now, the easiest way would be to boot with the firmware_class.dyndbg=+p boot option. This enables the debug output in the firmware loader and shows every firmware file it attempts to load.
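Before searching the boot log for firmware loader messages, it may help to confirm that the option actually reached the running kernel. The following is a minimal sketch, assuming the usual space-separated `name=value` layout of /proc/cmdline; the helper names `kernel_param_value` and `read_cmdline` are hypothetical and not part of any existing tool.

```python
def kernel_param_value(cmdline, name):
    """Return the value of `name` on a kernel command line.

    Returns "" for a bare flag without "=", and None if the parameter
    is absent. Quoted values containing spaces are not handled (assumption:
    the parameter of interest has no spaces).
    """
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == name:
            return value if sep else ""
    return None


def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().strip()


# Example with a made-up command line; on a real system one would call
# kernel_param_value(read_cmdline(), "firmware_class.dyndbg") instead.
example = "BOOT_IMAGE=/boot/vmlinuz root=UUID=abcd quiet firmware_class.dyndbg=+p"
print(kernel_param_value(example, "firmware_class.dyndbg"))
```

If the function returns None on the running system, the option was not picked up by the bootloader configuration and the firmware loader debug messages will not appear in the log.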
Created attachment 868840 [details] boot.msg with dyndbg=+p
Looks good to me; however, I am adding boot.msg captured with firmware_class.dyndbg=+p to document this and make sure everything is all right.
Hello, I just wanted to mention that I also get this type of error and it causes my system to freeze. I need to restart with [Alt]+[SysReq]+REISUB.
openSUSE Tumbleweed VERSION="20230821"
6.4.11-1-default
Mesa 23.1.5
Dmesg log (full log): https://paste.opensuse.org/pastes/321366d5a4e7
Journalctl amdgpu: https://paste.opensuse.org/pastes/05e50981b729
Reported upstream... https://gitlab.freedesktop.org/drm/amd/-/issues/2801
Just wanted to mention, last night I had this crash with 6.5.0-rc7-1.g869afb7-default. I just updated to 6.5.0-7.gb5edcad-default and kernel-firmware-20230829-448.1 from Kernel:HEAD. I will mention here if I get a crash again.
Just to be sure, could you guys check with the latest 6.5.x kernel from the OBS Kernel:stable repo? Also, 6.6-rc1 will be available in the OBS Kernel:HEAD repo later this week. If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.
(In reply to Takashi Iwai from comment #15)
> Just to be sure, could you guys check with the latest 6.5.x kernel from OBS
> Kernel:stable repo?
>
> Also, 6.6-rc1 will be available soon later in OBS Kernel:HEAD repo in this
> week. If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.
At first glance it seems kernel 6.5.2 fixed the issue; maybe someone else can confirm?
(In reply to B from comment #16)
> (In reply to Takashi Iwai from comment #15)
> > Just to be sure, could you guys check with the latest 6.5.x kernel from OBS
> > Kernel:stable repo?
> >
> > Also, 6.6-rc1 will be available soon later in OBS Kernel:HEAD repo in this
> > week. If 6.5.x still suffers from the problem, check 6.6-rc1 later, too.
>
> at first glance it seems kernel 6.5.2 fixed the issue, maybe someone else
> can confirm?
My last crash happened with 6.5.0-rc5-2.g997a7e4-default. I currently run 6.5.2-1.gfdde566-default and did not see any issues so far. However, since it typically took something like 2 weeks for this to happen, I would not yet be sure if it is fixed. I would say, give it another two weeks ...
Created attachment 869434 [details] dmesg with kernel 6.5.2-1.gfdde566-default
I (again) was too early. I just had another crash related to this bug. I am now in the process of updating to the latest kernel, but 6.5.2 did not really fix the issue.
Actually, maybe I was wrong and your error and mine aren't related. I just re-read the issue as described on the AMD GitLab, and it seems that while the symptom is the same (crash), the cause is not. So, ignore what I posted above.
ok, 6.6.0-rc1-2.g45a1ae6-default also displays the issue. I can add the dmesg output if you like, but I don't think there is anything new in there.
If it happens with 6.6-rc1, you'd better report it to the upstream bug tracker. Here we keep eyes on the development there, but cannot provide much help.
I wanted to mention that the issue has not occurred for quite some weeks now. Therefore, I think the issue might be solved in newer kernels. I am currently running kernel 6.7.0-rc1, but the one before also did not show the issue.
OK, then let's close for now. Feel free to reopen if the problem reappears. Thanks.