Bug 1227301 - Kernel boot crashes on Thinkpad P14s Gen 3 AMD
Status: NEW
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Xen
Version: Leap 15.6
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Jürgen Groß
QA Contact: E-mail List
Reported: 2024-07-02 15:17 UTC by Takashi Iwai
Modified: 2024-07-09 07:19 UTC (History)



Attachments
dmesg with crash of Leap 15.6 kernel (100.61 KB, text/plain)
2024-07-02 15:17 UTC, Takashi Iwai
dmesg from TW kernel (109.05 KB, text/plain)
2024-07-02 15:17 UTC, Takashi Iwai
Debug patch (5.88 KB, patch)
2024-07-03 11:07 UTC, Jürgen Groß
dmesg from the patched 6.9.7 kernel (103.68 KB, text/plain)
2024-07-03 15:02 UTC, Takashi Iwai
Debug patch V2 (7.21 KB, patch)
2024-07-05 08:10 UTC, Jürgen Groß
dmesg from the v2 patched 6.9.7 kernel (109.26 KB, text/plain)
2024-07-05 15:45 UTC, Takashi Iwai
logs from xen and normal boots (90.00 KB, application/x-tar)
2024-07-08 09:03 UTC, Takashi Iwai
acpidump output (1.56 MB, text/plain)
2024-07-09 07:19 UTC, Takashi Iwai
hwinfo output (2.44 MB, text/plain)
2024-07-09 07:19 UTC, Takashi Iwai

Description Takashi Iwai 2024-07-02 15:17:09 UTC
Created attachment 875832 [details]
dmesg with crash of Leap 15.6 kernel

When I boot a recent kernel (openSUSE Leap 15.6 or the TW 6.9.x kernel) as Xen Dom0 on the company's standard laptop (Thinkpad P14s Gen 3 AMD), it crashes with a kernel oops and the boot cannot proceed.

After searching the net, I found that it crashes while loading the ucsi_acpi driver, and blacklisting that driver indeed lets the boot proceed further.  (As a result, the touchpad and some USB functionality are missing, though.)

The attached dmesg output was taken after manually loading the ucsi_acpi module.

I also checked with the 6.9.7 TW backport kernel, and it hits the same problem.
Comment 1 Takashi Iwai 2024-07-02 15:17:38 UTC
Created attachment 875833 [details]
dmesg from TW kernel
Comment 3 Jürgen Groß 2024-07-03 11:07:09 UTC
Created attachment 875846 [details]
Debug patch

Could you try to boot with the patch applied to your kernel? You'd need to add "xen_mc_debug" to the kernel command line.

The kernel log should then contain more data to help narrow down the root cause.
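
Before collecting the log it can help to confirm that the parameter actually reached the running kernel. The following is only a minimal sketch: the helper names are hypothetical and not part of the debug patch, and it assumes the dom0 kernel exposes its command line in /proc/cmdline as usual.

```python
def has_kernel_param(cmdline: str, name: str) -> bool:
    """Return True if `name` appears as a bare flag or as name=value."""
    for token in cmdline.split():
        # Compare only the part before "=" so both "foo" and "foo=1" match.
        if token.split("=", 1)[0] == name:
            return True
    return False


def read_cmdline(path: str = "/proc/cmdline") -> str:
    """Read the command line of the running kernel (Linux only)."""
    with open(path, encoding="ascii", errors="replace") as f:
        return f.read().strip()
```

On the test machine, `has_kernel_param(read_cmdline(), "xen_mc_debug")` should return True once the boot entry has been edited correctly.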
Comment 4 Takashi Iwai 2024-07-03 15:02:34 UTC
Created attachment 875854 [details]
dmesg from the patched 6.9.7 kernel
Comment 5 Takashi Iwai 2024-07-03 15:06:22 UTC
The above is the log from the patched kernel.  This time the kernel was booted with nomodeset, but that shouldn't matter.  The bug happens right after modprobe of the ucsi_acpi module.

As far as I understand, the second Oops ("BUG: unable to handle page fault for address: ffffc90040715100") happened while reading a byte value via ACPI_GET8(logical_addr_ptr) in acpi_ex_system_memory_space_handler().
Comment 6 Jürgen Groß 2024-07-03 15:25:50 UTC
(In reply to Takashi Iwai from comment #5)
> The above is the log from the patched kernel.  This time the kernel was
> booted with nomodeset, but that shouldn't matter.  The bug happens right
> after modprobe of the ucsi_acpi module.
> 
> As far as I understand, the second Oops ("BUG: unable to handle page fault
> for address: ffffc90040715100") happened while reading a byte value via
> ACPI_GET8(logical_addr_ptr) in acpi_ex_system_memory_space_handler().

This is to be expected: establishing the mapping failed because the hypervisor returned a negative value (an error) when the kernel tried to update a PTE, so the later read hits an unmapped address.
Comment 7 Jürgen Groß 2024-07-05 08:10:43 UTC
Created attachment 875905 [details]
Debug patch V2

Second try with more data being printed in the error case.

Can you please replace the first debug patch with this one?
Comment 8 Takashi Iwai 2024-07-05 15:45:51 UTC
Created attachment 875916 [details]
dmesg from the v2 patched 6.9.7 kernel
Comment 9 Jürgen Groß 2024-07-07 08:15:27 UTC
(In reply to Takashi Iwai from comment #8)
> Created attachment 875916 [details]
> dmesg from the v2 patched 6.9.7 kernel

Thanks, this makes things much clearer.

It seems the kernel is trying to map part of the MSI space (physical address range 0xfee00000 - 0xfeeff000). When running as dom0 this should not happen, as the hypervisor owns this region and will deny mapping it.

It seems the ucsi driver needs to be made Xen aware.
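
To illustrate the kind of check involved, here is a minimal sketch that tests whether a physical address range touches the MSI window. The exact bounds are an assumption (0xfee00000 through 0xfeefffff inclusive), and the function name is hypothetical; this is neither what the hypervisor nor the debug patch implements.

```python
# Assumption: the MSI window is taken as 0xfee00000..0xfeefffff inclusive.
MSI_START = 0xFEE00000
MSI_END = 0xFEEFFFFF


def overlaps_msi_window(start: int, end: int) -> bool:
    """True if the inclusive range [start, end] touches the assumed MSI window."""
    if start > end:
        raise ValueError("start must not exceed end")
    # Two inclusive ranges overlap when each starts before the other ends.
    return start <= MSI_END and end >= MSI_START
```

Any 4 KiB page inside the window, for example `overlaps_msi_window(0xFEE10000, 0xFEE10FFF)`, is reported as overlapping, whereas a range just below it is not.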
Comment 10 Jürgen Groß 2024-07-08 08:46:49 UTC
Are you able to tell which I/O resources are located in the physical address range feec2000-feec2fff?

You should be able to find out by booting without Xen and checking "cat /proc/iomem" and/or "lspci -v".

I'm pretty sure the region fee01000-feefffff should only be used as MSI space.
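
The /proc/iomem lookup can also be scripted. The sketch below is hedged: the function name is hypothetical and SAMPLE is made-up text in the /proc/iomem layout, not data from the affected machine. Note that /proc/iomem shows real addresses only to root.

```python
import re

# One /proc/iomem entry: optional indentation, "start-end : name" in hex.
_ENTRY = re.compile(r"^\s*([0-9a-fA-F]+)-([0-9a-fA-F]+) : (.*)$")


def find_overlapping(iomem_text: str, start: int, end: int):
    """Return (lo, hi, name) for each entry overlapping inclusive [start, end]."""
    hits = []
    for line in iomem_text.splitlines():
        m = _ENTRY.match(line)
        if not m:
            continue  # skip anything that is not a resource line
        lo, hi = int(m.group(1), 16), int(m.group(2), 16)
        if lo <= end and hi >= start:
            hits.append((lo, hi, m.group(3).strip()))
    return hits


# Hypothetical sample in /proc/iomem format, for illustration only.
SAMPLE = """\
fd000000-fdffffff : Reserved
fec00000-fec003ff : IOAPIC 0
fee00000-fee00fff : Local APIC
  fee00000-fee00fff : pnp 00:04
"""
```

On a real system the text would come from reading /proc/iomem as root, with the range of interest passed as `find_overlapping(text, 0xfeec2000, 0xfeec2fff)`.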
Comment 11 Takashi Iwai 2024-07-08 09:03:54 UTC
Created attachment 875936 [details]
logs from xen and normal boots
Comment 12 Jürgen Groß 2024-07-09 07:14:07 UTC
There seems to be no BAR located in the area that the kernel is trying to map.

Could you please provide an acpidump?
Comment 13 Takashi Iwai 2024-07-09 07:19:09 UTC
Created attachment 875954 [details]
acpidump output
Comment 14 Takashi Iwai 2024-07-09 07:19:29 UTC
Created attachment 875955 [details]
hwinfo output