Bugzilla – Bug 1177973
amdgpu: kernel panic on boot upon modeset on AMD Renoir APU
Last modified: 2023-04-26 13:54:54 UTC
Created attachment 842864 [details] screenshot of kernel panic upon booting kernel-default-5.8.14
Hi All,
I'm currently on kernel-default-5.8.14-1.2.x86_64 and kernel-firmware...20201005.
Upon boot, my system (AMD Ryzen 5 PRO 4650G with Radeon Graphics, Gigabyte B550M AORUS PRO (rev. 1.0) - most recent BIOS F10, Display DELL U4320Q, 3840x2160 via DisplayPort) frequently hangs with a kernel panic. Statistically, about one out of 5 boot attempts is successful; sometimes it works on the first attempt, sometimes it takes significantly more than 5 attempts. It seems random. However, once the boot was successful, I can run the computer without any stability issues the entire day.
As the machine works nicely apart from booting (or with nomodeset to avoid loading amdgpu), I think defective hardware can be ruled out. The only thing that does not work reliably is booting. Unfortunately I have not been able to get a direct log from the kernel panic; I only managed to take a photo of the screen (attached).
I also tried kernel:stable as of today: kernel-default-5.9.1-1.1.g8abc535.x86_64 plus the corresponding firmware from the same source - same result: most boot attempts lead to a kernel panic.
This is NOT a new problem / regression with a specific kernel or Tumbleweed version; I have been experiencing these problems since I bought the Ryzen APU (plus mainboard). I have tested various kernel versions starting from 5.8.x over about 2 months, all with more or less the same behavior.
As this is my first bug submission here, please be patient if any required information is missing; I'll try my best to deliver it upon request. Thanks!
Created attachment 842865 [details] screenshot of kernel panic upon booting kernel-default-5.9.1
Does a 5.7.x kernel work? You can find an old kernel package in my OBS home:tiwai:kernel:5.7 repo:
http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/
The Oops stack trace indicates that it's basically a kernel warning, but unfortunately the kgdb_breakpoint() call in ASSERT_CRITICAL() leads to the kernel panic unnecessarily. At a quick glance, I didn't find an easy way to ignore this breakpoint. Ideally the original issue (the unexpected code execution; in this case, it's about a double open) should be fixed, but it's helpful to check whether it works otherwise or not... Let's see.
(In reply to Takashi Iwai from comment #2)
> Does 5.7.x kernel work? You can find an old kernel package in my OBS
> home:tiwai:kernel:5.7 repo.
>
> http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/
>
Yes, it seems like 5.7.12-1.g9c98feb-default from your repo above does work indeed. At least boot worked on the first attempt in 4 out of 4 boot processes now (I tried different scenarios in case this matters, such as reboot, hard reset, power off + power on). Although I know 4 attempts is _not_ enough for statistics - but hey, that's the longest sequence of successful boot processes I ever had so far on this rig :-)
FWIW, I'm building a test kernel with the kgdb_breakpoint() call removed from the code path. It's being built in the OBS home:tiwai:bsc1177973 repo. The build should finish after some time (usually an hour or so), and will appear at
http://download.opensuse.org/repositories/home:/tiwai:/bsc1177973/standard/
Please give it a try later, too. Last but not least, please give the output of hwinfo.
Created attachment 842877 [details] output from hwinfo
Created attachment 842878 [details] output from hwinfo --short
(In reply to Takashi Iwai from comment #4)
> The build should finish after some time (usually an hour or so), and will appear
> at
> http://download.opensuse.org/repositories/home:/tiwai:/bsc1177973/standard/
Up to now, there is no x86_64 subdirectory present at the URL above. Has something gone wrong, or did I just not wait long enough yet?
The output of hwinfo has been attached to this bug already; I hope it is in a usable format...
Please be patient, it just takes long sometimes...
Hrm, something must be wrong in OBS publishing. But if you want to try now, you can fetch the packages via the osc command-line. Install the osc package, then get the binaries via
osc getbinaries home:tiwai:bsc1177973/kernel-default/standard/x86_64
Never mind, the package is available now on the URL. Note that this kernel will still keep showing the WARNING with the stack trace, but it shouldn't go to nirvana but keep running. If it seems to be working more or less, please upload the dmesg output showing the stack trace.
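For anyone collecting the requested dmesg output, a minimal sketch of cutting just the warning blocks out of a longer log. It assumes the usual kernel markers around a WARNING trace ("[ cut here ]" at the start and "[ end trace ... ]" at the end); the function name extract_traces is made up for illustration.

```shell
#!/bin/sh
# extract_traces (hypothetical helper): read a kernel log on stdin and print
# only the lines between a "[ cut here ]" marker and the next "[ end trace"
# marker, inclusive. If no end marker follows, it prints to the end of input.
extract_traces() {
    sed -n '/\[ cut here ]/,/\[ end trace /p'
}

# Typical use (reading the kernel log may require root):
#   dmesg | extract_traces > dmesg_warnings.out
```

The full dmesg output is still the more useful attachment for the bug; the filtered view only helps to check quickly whether a trace is present at all.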
(In reply to Takashi Iwai from comment #10)
> Never mind, the package is available now on the URL.
Sorry, I did not manage to download & test earlier... kernel-default-5.9.1-1.1.gc31670b.x86_64 boots without trouble (5 out of 5 attempts). Thank You so much! Hero --> Takashi :-)
I'm attaching dmesg outputs of 2 boot attempts:
attempt #1 does not show the familiar warning originating from dal_gpio_open_ex - this attempt is most likely one that would have succeeded with my previously used kernels from the kernel:stable repo
attempt #2 does include the dal_gpio_open_ex warning. This boot was only able to succeed with your magic kernel.
Once again, thanks for all your efforts and this terrific support!!
Created attachment 842927 [details] dmesg boot attempt #1
Created attachment 842928 [details] dmesg boot attempt #2
Good to hear that it works now :) Judging from the warning, it seems that the xcmddc program is triggering the problem. Maybe it's accessing the sysfs entry concurrently, which caused amdgpu to complain. Could you check which package xcmddc belongs to?
% rpm -qf $(which xcmddc)
Also, if possible, identify who uses it. This can be udev.
(In reply to Takashi Iwai from comment #14)
> Could you check which package does xcmddc belong to?
> % rpm -qf $(which xcmddc)
xcm-0.5.4-lp152.3.5.x86_64
(I'm back on Leap 15.2 today, using your kernel, but can change back to Tumbleweed easily if required)
> Also, if possible, identify who uses it. This can be udev.
How would I do that? I tried
% rpm -q --whatrequires xcm
no package requires xcm
So I did
% zypper rm xcm
which did not uninstall anything else (as expected). Reboot succeeded, and I have not yet noticed anything that did not work as before removing xcm. See dmesg_attempt3_5.9.1-1.gc31670b-default.out which I will upload in a minute.
Out of curiosity, I tried to reboot with kernel-default-5.9.0-2.1, which was not yet purged (installed from Kernel:Head last Saturday). This one now also boots without trouble (only tried twice so far, so this might need further investigation). See dmesg_attempt4_5.9.0-2.gb1f22f7-default.out which will follow asap.
So xcmddc might possibly be the culprit in this case. No idea why this is installed on my system; I cannot recall ever installing it manually (which does not mean anything, I don't remember a lot of things I possibly did, people say...)
I'll do some more reboots with this 5.9.0 kernel and see if I can reproduce the kernel panic once again in the meantime...
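On the "who uses it" question: rpm --whatrequires only reports package dependencies, so it will not show a udev rule that calls the binary. A rough sketch of searching rule files for the program name instead; it assumes the common rule directories /usr/lib/udev/rules.d and /etc/udev/rules.d, and find_invokers is a made-up helper name.

```shell
#!/bin/sh
# find_invokers (hypothetical helper): print the files under the given
# directories that mention a program name.
find_invokers() {
    prog=$1
    shift
    for dir in "$@"; do
        # Skip directories that do not exist on this system.
        [ -d "$dir" ] || continue
        grep -rl -- "$prog" "$dir"
    done
}

# Files shipped by the package that look udev-related (needs rpm):
#   rpm -ql xcm | grep -i udev
# Rules on disk that mention the program:
#   find_invokers xcmddc /usr/lib/udev/rules.d /etc/udev/rules.d
```

This only finds textual mentions; whether a matching rule actually fires at boot would still have to be checked separately.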
Created attachment 842931 [details] dmesg boot attempt #3, kernel 5.9.1-1.gc31670b-default
Created attachment 842932 [details] dmesg boot attempt #4, kernel 5.9.0-2.gb1f22f7-default
The bug is likely intermittent. It's caused by the concurrent access to the GPIO to get / write some EDID thingy, so sometimes it hits, sometimes not. I was afraid that it caused some deep loop, but apparently it's a one-off thing, and we can fix it properly just by dropping the ASSERT*() call there. I'm going to submit the patch to upstream after a bit more discussion with them.
About the invocation of xcmddc: no worries, I checked the xcm package, and found that it contains a udev rule to invoke the command. I guess you might be able to trigger the bug by running something like
xcmddc --i2c /dev/$I2C --identify
where $I2C is i2c-0 or such existing device file. Run the above from multiple places at the same time, and you might see the kernel warning again.
(In reply to Takashi Iwai from comment #18)
> I guess you might be able to trigger the bug by running like
> xcmddc --i2c /dev/$I2C --identify
> where $I2C is i2c-0 or such existing device file. Run the above from
> multiple places at the same time, and you might see the kernel warning again.
For the record: I did several more (more than 10) reboots with 5.9.0 in the absence of the xcm package, 100% of them successful. After reinstalling xcm, the first boot with 5.9.0 failed with the familiar kernel panic. 5.9.1 with your fix still works :-)
I was not able to reproduce the warnings with 3 concurrent
xcmddc --i2c /dev/i2c-0 --identify
calls in a while true loop from bash with your 5.9.1 kernel (at least I did not find any warnings in dmesg or the journal), but maybe "same time" is not that easy to achieve (or I just did not manage to find the warnings...)
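A sketch of one way to raise the odds of overlapping calls: start several background workers that each repeat the command many times, then look for the warning afterwards. The helper name run_parallel and the worker/iteration counts are arbitrary choices, not something from the xcm package, and there is no guarantee this hits the race.

```shell
#!/bin/sh
# run_parallel (hypothetical helper): start WORKERS background subshells that
# each run the given command ITERATIONS times, then wait for all of them.
run_parallel() {
    workers=$1
    iterations=$2
    shift 2
    w=0
    while [ "$w" -lt "$workers" ]; do
        (
            i=0
            while [ "$i" -lt "$iterations" ]; do
                # Output of the command itself is discarded.
                "$@" >/dev/null 2>&1
                i=$((i + 1))
            done
        ) &
        w=$((w + 1))
    done
    # Block until every background worker has finished.
    wait
}

# Example (as root; use an i2c device file that exists on the machine):
#   run_parallel 8 200 xcmddc --i2c /dev/i2c-0 --identify
#   dmesg | grep -c dal_gpio_open_ex
```

More workers than CPU cores and a high iteration count make an overlap more likely than three loops started by hand, but the window may still be narrow outside of boot.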
I submitted the fix patches to upstream, and backported them to the openSUSE master branch, which will be merged to the stable branch for the TW update later.
The submitted series: https://lore.kernel.org/amd-gfx/20201023074656.11855-1-tiwai@suse.de/
The fix went into the master branch. Let's close the bug.