Bug 1177973 - amdgpu: kernel panic on boot upon modeset on AMD Renoir APU
Summary: amdgpu: kernel panic on boot upon modeset on AMD Renoir APU
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: x86-64 openSUSE Tumbleweed
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-10-21 14:29 UTC by Bernhard Randolf
Modified: 2023-04-26 13:54 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
screenshot of kernel panic upon booting kernel-default-5.8.14 (630.13 KB, image/jpeg)
2020-10-21 14:29 UTC, Bernhard Randolf
screenshot of kernel panic upon booting kernel-default-5.9.1 (883.76 KB, image/jpeg)
2020-10-21 14:31 UTC, Bernhard Randolf
output from hwinfo (1.55 MB, text/plain)
2020-10-21 16:07 UTC, Bernhard Randolf
output from hwinfo --short (4.11 KB, text/plain)
2020-10-21 16:07 UTC, Bernhard Randolf
dmesg boot attempt #1 (83.27 KB, text/plain)
2020-10-22 14:29 UTC, Bernhard Randolf
dmesg boot attempt #2 (91.93 KB, text/plain)
2020-10-22 14:30 UTC, Bernhard Randolf
dmesg boot attempt #3, kernel 5.9.1-1.gc31670b-default (82.68 KB, text/plain)
2020-10-22 15:12 UTC, Bernhard Randolf
dmesg boot attempt #4, kernel 5.9.0-2.gb1f22f7-default (83.06 KB, text/plain)
2020-10-22 15:13 UTC, Bernhard Randolf

Description Bernhard Randolf 2020-10-21 14:29:17 UTC
Created attachment 842864 [details]
screenshot of kernel panic upon booting kernel-default-5.8.14

Hi All,

I'm currently on kernel-default-5.8.14-1.2.x86_64 and kernel-firmware...20201005

Upon boot, my system (AMD Ryzen 5 PRO 4650G with Radeon Graphics, Gigabyte B550M AORUS PRO (rev. 1.0) with the most recent BIOS F10, display DELL U4320Q at 3840x2160 via DisplayPort) frequently hangs with a kernel panic.

Statistically, about one in five boot attempts is successful. Sometimes it works on the first attempt, sometimes it takes significantly more than five attempts; it seems random.

However, once boot was successful, I can run the computer without any stability issues the entire day.
As the machine works nice apart from booting (or with nomodeset to avoid loading amdgpu), I think defective hardware can be ruled out.
The only thing that does not work reliably is booting.


Unfortunately, I have not been able to capture a log of the kernel panic directly; I only managed to take a photo of the screen (attached).

I also tried kernel:stable as of today: kernel-default-5.9.1-1.1.g8abc535.x86_64 plus the corresponding firmware from the same source - same result: most boot attempts lead to kernel panic.

This is NOT a new problem / regression with a specific kernel or Tumbleweed version; I have been experiencing these problems since I bought the Ryzen APU (plus mainboard).
I have tested various kernel versions starting from 5.8.x over about 2 months, all with more or less the same behavior.

As this is my first bug submission here, please be patient if any required information is missing, I'll try my best to deliver them upon request.

Thanks!
Comment 1 Bernhard Randolf 2020-10-21 14:31:05 UTC
Created attachment 842865 [details]
screenshot of kernel panic upon booting kernel-default-5.9.1
Comment 2 Takashi Iwai 2020-10-21 15:17:22 UTC
Does a 5.7.x kernel work?  You can find an old kernel package in my OBS home:tiwai:kernel:5.7 repo.
  http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/

The Oops stack trace indicates that this is basically a kernel warning, but unfortunately the kgdb_breakpoint() call in ASSERT_CRITICAL() unnecessarily turns it into a kernel panic.

At a quick glance, I didn't find an easy way to ignore this breakpoint.  Ideally the original issue (the unexpected code execution; in this case, a double open) should be fixed, but it would be helpful to check whether everything else works... Let's see.
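
To make the distinction between the warning and the panic concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not the amdgpu code: `ToyGpio`, `KernelPanic`, and the `breakpoint_on_failure` switch are hypothetical names invented for illustration. A second open without a close hits the assertion path; whether that is fatal depends only on whether the assertion also fires a breakpoint.

```python
class KernelPanic(RuntimeError):
    """Stand-in for the machine halting; not a real kernel facility."""


class ToyGpio:
    """Hypothetical pin that may only be opened once at a time."""

    def __init__(self, breakpoint_on_failure):
        self.is_open = False
        self.breakpoint_on_failure = breakpoint_on_failure
        self.warnings = []  # collected warning messages, like a tiny dmesg

    def _assert_critical(self, expr):
        """Toy assertion: always warn, optionally trap as well."""
        if expr:
            return
        self.warnings.append("WARNING: assertion failed")
        if self.breakpoint_on_failure:
            # Models a debugger breakpoint firing with no debugger attached.
            raise KernelPanic("breakpoint hit without a debugger")

    def open(self):
        if self.is_open:
            # Second open without a close: the unexpected code path.
            self._assert_critical(False)
            return "ALREADY_OPENED"
        self.is_open = True
        return "OK"

    def close(self):
        self.is_open = False


# With the breakpoint removed, the double open is reported but survivable.
pin = ToyGpio(breakpoint_on_failure=False)
print(pin.open(), pin.open(), pin.warnings)
```

In this model, constructing the pin with `breakpoint_on_failure=True` makes the second `open()` raise `KernelPanic` instead of returning an error code.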
Comment 3 Bernhard Randolf 2020-10-21 15:43:58 UTC
(In reply to Takashi Iwai from comment #2)
> Does 5.7.x kernel work?  You can find an old kernel package in my OBS
> home:tiwai:kernel:5.7 repo.
>  
> http://download.opensuse.org/repositories/home:/tiwai:/kernel:/5.7/standard/
> 

Yes, it seems like 5.7.12-1.g9c98feb-default from your repo above does work indeed.
At least boot worked on first attempt in 4 out of 4 boot-processes now (I tried different scenarios in case this matters, such as reboot, hard reset, power off + power on).
Although I know 4 attempts is _not_ enough for statistics - but hey, that's the longest sequence of successful boot-processes I ever had so far on this rig :-)
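
As a rough check of how unlikely such a streak would be on the failing kernels, the arithmetic can be sketched as below. It assumes each boot succeeds independently with a probability of about one in five (the estimate from the description); both the independence and the exact rate are assumptions, not measurements.

```python
from fractions import Fraction


def chance_of_streak(success_rate, boots):
    """Probability that `boots` independent boots all succeed."""
    return success_rate ** boots


# Assumed per-boot success rate on the failing kernels: about one in five.
failing_rate = Fraction(1, 5)
print(chance_of_streak(failing_rate, 4))  # 1/625
```

Under those assumptions, four successful boots in a row would happen by chance only about 0.16% of the time, so even this short streak is suggestive.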
Comment 4 Takashi Iwai 2020-10-21 15:56:11 UTC
FWIW, I'm building a test kernel with the kgdb_breakpoint() call removed from the code path.  It's being built in the OBS home:tiwai:bsc1177973 repo.  The build should finish after some time (usually an hour or so), and will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1177973/standard/

Please give it a try later, too.

Last but not least, please give the output of hwinfo.
Comment 5 Bernhard Randolf 2020-10-21 16:07:24 UTC
Created attachment 842877 [details]
output from hwinfo
Comment 6 Bernhard Randolf 2020-10-21 16:07:57 UTC
Created attachment 842878 [details]
output from hwinfo --short
Comment 7 Bernhard Randolf 2020-10-21 17:26:30 UTC
(In reply to Takashi Iwai from comment #4)
> The build should finish after some time (usually an hour or so), and will appear
> at
>   http://download.opensuse.org/repositories/home:/tiwai:/bsc1177973/standard/

Up to now, there is no x86_64 subdirectory present at the URL above. Has something gone wrong, or did I just not wait long enough?

The output of hwinfo has already been attached to this bug; I hope it is in a usable format...
Comment 8 Takashi Iwai 2020-10-21 17:29:50 UTC
Please be patient, it just takes long sometimes...
Comment 9 Takashi Iwai 2020-10-21 20:35:05 UTC
Hrm, something must be wrong in OBS publishing.

But if you want to try now, you can fetch the packages via osc command-line.
Install osc package, then get the binaries via
  osc getbinaries home:tiwai:bsc1177973/kernel-default/standard/x86_64
Comment 10 Takashi Iwai 2020-10-22 12:27:01 UTC
Never mind, the package is available now on the URL.

Note that this kernel will still keep showing the WARNING with the stack trace, but it shouldn't go to nirvana; it should keep running.  If it seems to be working more or less, please upload the dmesg output showing the stack trace.
Comment 11 Bernhard Randolf 2020-10-22 14:28:36 UTC
(In reply to Takashi Iwai from comment #10)
> Never mind, the package is available now on the URL.

Sorry, I did not manage to download & test earlier...


kernel-default-5.9.1-1.1.gc31670b.x86_64 boots without trouble (5 out of 5 attempts). Thank You so much!

Hero --> Takashi :-)

I'm attaching dmesg outputs of 2 boot attempts:
Attempt #1 does not show the familiar warning originating from dal_gpio_open_ex. This attempt is most likely one that would have succeeded with my previously used kernels from the kernel:stable repo.

Attempt #2 does include the dal_gpio_open_ex warning. This boot was only able to succeed with your magic kernel.

Once again, thanks for all your efforts and this terrific support!!
Comment 12 Bernhard Randolf 2020-10-22 14:29:49 UTC
Created attachment 842927 [details]
dmesg boot attempt #1
Comment 13 Bernhard Randolf 2020-10-22 14:30:36 UTC
Created attachment 842928 [details]
dmesg boot attempt #2
Comment 14 Takashi Iwai 2020-10-22 14:44:47 UTC
Good to hear that it works now :)

Judging from the warning, it seems that the xcmddc program is triggering the problem.  Maybe it's accessing the sysfs entry concurrently, which caused amdgpu to complain.

Could you check which package xcmddc belongs to?
  % rpm -qf $(which xcmddc)

Also, if possible, identify who uses it.  This can be udev.
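
One way to look for a udev rule that invokes the program is to search the rule files for its name. The following is a minimal sketch, assuming the rules live in the usual directories listed in `DEFAULT_RULE_DIRS`; the function name `find_rule_references` is made up for this example, and the search is a plain substring match, so it can miss rules that call the program indirectly.

```python
from pathlib import Path

# Common udev rule locations (an assumption; adjust for the system at hand).
DEFAULT_RULE_DIRS = ("/etc/udev/rules.d", "/usr/lib/udev/rules.d")


def find_rule_references(program, rule_dirs=DEFAULT_RULE_DIRS):
    """Return (file, line number, line) for each rule line naming `program`."""
    hits = []
    for directory in rule_dirs:
        root = Path(directory)
        if not root.is_dir():
            continue  # directory absent on this system
        for rule_file in sorted(root.glob("*.rules")):
            try:
                text = rule_file.read_text(errors="replace")
            except OSError:
                continue  # unreadable file: skip rather than abort the scan
            for number, line in enumerate(text.splitlines(), start=1):
                # Ignore comment lines; report everything else that matches.
                if program in line and not line.lstrip().startswith("#"):
                    hits.append((str(rule_file), number, line.strip()))
    return hits
```

Calling `find_rule_references("xcmddc")` returns an empty list when no rule mentions the program; any hit can then be mapped back to its owning package with `rpm -qf` on the reported file.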
Comment 15 Bernhard Randolf 2020-10-22 15:11:48 UTC
(In reply to Takashi Iwai from comment #14)
> Could you check which package xcmddc belongs to?
>   % rpm -qf $(which xcmddc)
 
xcm-0.5.4-lp152.3.5.x86_64    (I'm back on Leap 15.2 today, using your kernel, but can change back to Tumbleweed easily if required)

> Also, if possible, identify who uses it.  This can be udev.

How would I do that?
I tried

    % rpm -q --whatrequires xcm
    no package requires xcm

So I did

    % zypper rm xcm 

which did not uninstall anything else (as expected).

Reboot succeeded, and I have not yet noticed anything that stopped working after removing xcm.
See dmesg_attempt3_5.9.1-1.gc31670b-default.out which I will upload in a minute.

Out of curiosity, I tried to reboot with kernel-default-5.9.0-2.1 which was not yet purged (installed from Kernel:Head last Saturday).
This one now also boots without trouble (only tried twice, so far, so this might need further investigation)
See dmesg_attempt4_5.9.0-2.gb1f22f7-default.out which will follow asap.

So xcmddc might possibly be the culprit in this case.
No idea why this is installed on my system; I cannot recall ever installing it manually (which does not mean anything, I don't remember a lot of things I possibly did, people say...)

I'll do some more reboots with this 5.9.0 kernel and see if I can reproduce the kernel panic once again in the meantime...
Comment 16 Bernhard Randolf 2020-10-22 15:12:29 UTC
Created attachment 842931 [details]
dmesg boot attempt #3, kernel 5.9.1-1.gc31670b-default
Comment 17 Bernhard Randolf 2020-10-22 15:13:06 UTC
Created attachment 842932 [details]
dmesg boot attempt #4, kernel 5.9.0-2.gb1f22f7-default
Comment 18 Takashi Iwai 2020-10-22 15:29:15 UTC
The bug is likely intermittent.  It's caused by the concurrent access to the GPIO to get / write some EDID thingy, so sometimes it hits, sometimes not.
I was afraid that it caused some deep loop, but apparently it's a one-off thing, and we can fix it properly just by dropping the ASSERT*() call there.

I'm going to submit the patch to upstream after a bit more discussion with them.

About the invocation of xcmddc: no worries, I checked the xcm package and found that it contains a udev rule that invokes the command.

I guess you might be able to trigger the bug by running something like
  xcmddc --i2c /dev/$I2C --identify
where $I2C is i2c-0 or such existing device file.  Run the above from multiple places at the same time, and you might see the kernel warning again.
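
Starting the copies by hand rarely makes them overlap, so a small launcher can help. The sketch below is an assumption-laden helper, not part of xcm or the kernel tooling: `run_concurrently` is a hypothetical name, and a barrier is used only to release all worker threads at about the same moment. It cannot guarantee that the calls actually collide inside the driver.

```python
import subprocess
import threading
from concurrent.futures import ThreadPoolExecutor


def run_concurrently(command, workers):
    """Start `workers` copies of `command` as close together as possible.

    Returns the list of exit codes, one per copy.
    """
    barrier = threading.Barrier(workers)

    def launch(_):
        barrier.wait()  # hold every thread here until all are ready
        return subprocess.run(command, capture_output=True).returncode

    # One thread per copy, so that no copy waits behind another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(launch, range(workers)))
```

For example, `run_concurrently(["xcmddc", "--i2c", "/dev/i2c-0", "--identify"], 8)` would start eight copies of the command above; `/dev/i2c-0` and the count of eight are placeholders, and access to the device file may require root.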
Comment 19 Bernhard Randolf 2020-10-22 15:51:08 UTC
(In reply to Takashi Iwai from comment #18)
> I guess you might be able to trigger the bug by running something like
>   xcmddc --i2c /dev/$I2C --identify
> where $I2C is i2c-0 or such existing device file.  Run the above from
> multiple places at the same time, and you might see the kernel warning again.

For the record: I did several more (more than 10) reboots with 5.9.0 in absence of the xcm package, 100% of them successful. After reinstalling xcm, first boot with 5.9.0 failed with the familiar kernel-panic. 5.9.1 with your fix still works :-)

I was not able to reproduce the warnings with 3 concurrent xcmddc --i2c /dev/i2c-0 --identify calls in a while-true loop from bash with your 5.9.1 kernel (at least I did not find any warnings in dmesg or the journal), but maybe "same time" is not that easy to achieve (or I just did not manage to find the warnings...)
Comment 20 Takashi Iwai 2020-10-23 09:36:08 UTC
I submitted the fix patches upstream and backported them to the openSUSE master branch, which will be merged into the stable branch for a TW update later.
Comment 21 Takashi Iwai 2020-10-23 09:36:59 UTC
The submitted series:
  https://lore.kernel.org/amd-gfx/20201023074656.11855-1-tiwai@suse.de/
Comment 22 Takashi Iwai 2020-10-23 16:14:45 UTC
The fix went into the master branch.  Let's close the bug.