Bug 1218005 - Leap 15.5 does not boot with any of the kernel-default from update repository
Summary: Leap 15.5 does not boot with any of the kernel-default from update repository
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: Leap 15.5
Hardware: x86-64 openSUSE Leap 15.5
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-12-13 07:48 UTC by Anders Stedtlund
Modified: 2024-02-15 16:30 UTC (History)
CC List: 2 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
rdosreport.txt from boot attempt with kernel version 5.14.21-150500.55.36.1 (200.82 KB, text/plain)
2023-12-13 07:48 UTC, Anders Stedtlund
dmesg after successful boot with default kernel (89.23 KB, text/plain)
2023-12-13 07:49 UTC, Anders Stedtlund
rdosreport.txt from boot attempt with kernel version 5.14.21-150500.55.7.1 (200.76 KB, text/plain)
2023-12-13 13:49 UTC, Anders Stedtlund
hwinfo for working kernel 5.14.21-150500.53 (2.74 MB, text/plain)
2023-12-13 13:50 UTC, Anders Stedtlund
dmesg after successful boot with 6.6.9-lp155.2.g61d1d44-default (86.91 KB, text/plain)
2024-01-10 09:36 UTC, Anders Stedtlund

Description Anders Stedtlund 2023-12-13 07:48:56 UTC
Created attachment 871315 [details]
rdosreport.txt from boot attempt with kernel version 5.14.21-150500.55.36.1

I have an Acer laptop that does not boot with any of the kernel-default packages from the update repository. The boot process ends with:
" Starting Dracut Emergency Shell…
Warning: /dev/disk/by-uuid/B2C3-C97D does not exist
Warning: /dev/disk/by-uuid/d733130e-1ad1-4ee3-be20-8f74d7763d88 does not exist"

B2C3-C97D refers to /boot/efi in a working system.
d733130e-1ad1-4ee3-be20-8f74d7763d88 refers to / in a working system.

With the following kernel parameters the boot succeeds:
nvme_core.default_ps_max_latency_us=0 pcie_aspm=off

Default kernel 5.14.21-150500.53.2 is working.
E.g. kernel 5.14.21-150500.55.36.1 from update repository is NOT working.
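
When comparing boots with and without these parameters, it helps to confirm what a given boot was actually started with. A minimal sketch, assuming the command line is read from /proc/cmdline; the function name has_kernel_param is made up for illustration and is not part of any tool:

```shell
#!/bin/sh
# has_kernel_param CMDLINE PARAM
# Hypothetical helper: succeeds only if PARAM (e.g. "pcie_aspm=off")
# appears as a complete, space-separated word in CMDLINE.
has_kernel_param() {
    cmdline=$1
    param=$2
    # Pad with spaces so that partial matches such as "pcie_aspm=offx" fail.
    case " $cmdline " in
        *" $param "*) return 0 ;;
    esac
    return 1
}

# Usage on a running system:
#   has_kernel_param "$(cat /proc/cmdline)" pcie_aspm=off \
#       && echo "booted with pcie_aspm=off"
```

This only checks the parameters passed at boot; it does not say anything about what the PCI core or the NVMe driver did with them.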
Comment 1 Anders Stedtlund 2023-12-13 07:49:56 UTC
Created attachment 871316 [details]
dmesg after successful boot with default kernel
Comment 2 Takashi Iwai 2023-12-13 09:07:31 UTC
There are a few older update kernels, e.g. 5.14.21-150500.55.31.1, 5.14.21-150500.55.28.1, etc.  You can see it via "zypper se -s kernel-default".

Could you try them and figure out which kernel still worked and which not?
i.e. narrowing down the regression range.

Note that it might be safer to increase the number of installable kernels beforehand by editing /etc/zypp/zypp.conf.  Add more entries to the multiversion.kernels line, e.g.
  multiversion.kernels = latest,latest-1,latest-2,latest-3,running
This will allow the four newest kernels, plus the running one, to be kept on the system without purging.
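
The edit above can also be scripted. A minimal sketch, assuming the file already uses the "multiversion.kernels = ..." form shown here; the function name set_multiversion_kernels is hypothetical, and it keeps a copy of the original file with a .bak suffix:

```shell
#!/bin/sh
# set_multiversion_kernels FILE VALUE
# Hypothetical helper: sets the multiversion.kernels line in FILE to VALUE,
# appending the line if no uncommented one exists.
# The original file is preserved as FILE.bak.
set_multiversion_kernels() {
    file=$1
    value=$2
    cp -p "$file" "$file.bak" || return 1
    if grep -q '^[[:space:]]*multiversion\.kernels[[:space:]]*=' "$file.bak"; then
        # Rewrite only the matching line; every other line is copied unchanged.
        sed "s|^[[:space:]]*multiversion\.kernels[[:space:]]*=.*|multiversion.kernels = $value|" \
            "$file.bak" > "$file"
    else
        printf 'multiversion.kernels = %s\n' "$value" >> "$file"
    fi
}

# Usage (as root):
#   set_multiversion_kernels /etc/zypp/zypp.conf \
#       latest,latest-1,latest-2,latest-3,running
```

Commented-out lines are deliberately left alone, so check the result by eye before installing further kernels.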

Also, please give the hwinfo and dmesg outputs from the working kernel, too.
Comment 3 Anders Stedtlund 2023-12-13 09:17:39 UTC
If I remember correctly, it started with the first of the update kernels. I will test again though.
Comment 4 Takashi Iwai 2023-12-13 09:21:13 UTC
There are tons of update kernels between *-150500.53.2 and *-150500.55.36.  Apparently the update took only the latest one, and it was broken.  You can install the older update kernels by installing the kernel with versions, e.g.
  zypper in --oldpackage kernel-default-optional-5.14.21-150500.55.31.1
Comment 5 Anders Stedtlund 2023-12-13 13:49:32 UTC
Created attachment 871329 [details]
rdosreport.txt from boot attempt with kernel version 5.14.21-150500.55.7.1
Comment 6 Anders Stedtlund 2023-12-13 13:50:10 UTC
Created attachment 871330 [details]
hwinfo for working kernel 5.14.21-150500.53
Comment 7 Anders Stedtlund 2023-12-13 13:53:12 UTC
I was a bit unclear before. I had tested with the first kernel update that I could find which seems to be 5.14.21-150500.55.7.1. I have now retested with that kernel and it does not boot.
I have added rdosreport.txt from this attempt.
I also added hwinfo for the working kernel.
The dmesg above is from the working kernel.
Comment 8 Takashi Iwai 2023-12-13 15:32:30 UTC
OK, thanks.  So the breakage appeared already at the very first 15.5 update kernel.

To narrow it down further, let's try swapping the kernel modules.
You can copy the kernel modules from the working kernel to the broken kernel.
e.g. 
  mkdir /lib/modules/5.14.21-150500.55.7-default/updates
  cp /lib/modules/5.14.21-150500.53-default/kernel/drivers/nvme/*/*.ko.zst /lib/modules/5.14.21-150500.55.7-default/updates
  depmod -a 5.14.21-150500.55.7-default
  dracut -f --kver 5.14.21-150500.55.7-default

and retest.  This will replace only the nvme-* modules while keeping the rest.
If this works, you can remove modules from */updates one by one to figure out which module broke.
Comment 9 Anders Stedtlund 2023-12-14 09:11:40 UTC
I followed your instructions but the kernel still doesn't boot.
Comment 10 Takashi Iwai 2023-12-14 09:39:25 UTC
It means that it's triggered by changes in other parts, e.g. PCI core.

Did you test with only the pcie_aspm=off boot option?
Comment 11 Anders Stedtlund 2023-12-14 09:49:14 UTC
I have now tested with only pcie_aspm=off, and the kernel booted.
Comment 12 Anders Stedtlund 2023-12-21 12:46:27 UTC
Is there anything else I can provide to help troubleshoot this?
Comment 13 Takashi Iwai 2024-01-02 12:23:17 UTC
Could you verify whether the problem is present with the recent upstream kernels?  Install kernel-default from OBS Kernel:stable:Backport repo
  http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/

If it works, there may already be a fix upstream that we can backport to the SLE15-SP5 kernel.
Comment 14 Anders Stedtlund 2024-01-03 07:40:59 UTC
Unfortunately, I have another issue when trying to load the kernel from Backport. I can't add it to UEFI, as it seems that the password I have saved does not match. This is not good at all, but it is a different problem.
Comment 15 Takashi Iwai 2024-01-03 08:05:38 UTC
Is Secure Boot disabled in the BIOS?
Comment 16 Anders Stedtlund 2024-01-03 08:19:01 UTC
I managed to get into UEFI by resetting the password. I have disabled Secure Boot, and the kernel from Backport boots without any issues, as far as I can tell.
Comment 17 Takashi Iwai 2024-01-08 15:19:37 UTC
(In reply to Anders Stedtlund from comment #16)
> I managed to get into UEFI by resetting the password. I have disabled Secure
> Boot, and the kernel from Backport boots without any issues, as far as I can tell.

OK, could you give the dmesg output from the recent upstream kernel, too?
Comment 18 Anders Stedtlund 2024-01-10 09:36:28 UTC
Created attachment 871739 [details]
dmesg after successful boot with 6.6.9-lp155.2.g61d1d44-default
Comment 19 Takashi Iwai 2024-01-15 16:56:12 UTC
Thanks.  I still haven't identified the cause (or a possible upstream fix).

Just to be sure, let's check whether the very latest SLE15-SP5 still suffers from the problem.  Please test the kernel in OBS Kernel:SLE15-SP5 repo
  http://download.opensuse.org/repositories/Kernel:/SLE15-SP5/pool/

Also, I'm building a test kernel with a few PCIe core patches reverted.  It's being built in OBS home:tiwai:bsc1218005 repo.  Once the build finishes (it takes an hour or so), the package will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1218005/pool/
If the above kernel from OBS Kernel:SLE15-SP5 repo doesn't work, try mine later.

Note that those kernels are unofficial builds, hence you'd need to disable Secure Boot if it's turned on.  Also, the kernel revisions may be lower than those of the official release kernels, so you'd better increase the number of installable kernels beforehand by editing /etc/zypp/zypp.conf.  Add more entries to multiversion.kernels, e.g.
  multiversion.kernels = latest,latest-1,latest-2,latest-3,running
Comment 20 Anders Stedtlund 2024-01-16 11:42:38 UTC
I tested:
kernel-default-5.14.21-150500.225.1.gcc7d8b6.x86_64
from:
http://download.opensuse.org/repositories/Kernel:/SLE15-SP5/pool/
and:
kernel-default-5.14.21-150500.1.1.ge47c72f.x86_64
from:
http://download.opensuse.org/repositories/home:/tiwai:/bsc1218005/pool/

Both failed to boot.
Comment 21 Takashi Iwai 2024-01-16 12:25:30 UTC
Hmm, OK, then let's go back and verify the following:

- 5.14.21-150500.53 kernel works as is without option
- 5.14.21-150500.55.7 kernel boots only with pcie_aspm=off

If both above are true, try to swap the whole modules of the latter kernel with the former in the following way:

  % mv /lib/modules/5.14.21-150500.55.7-default /lib/modules/5.14.21-150500.55.7-default.old
  % cp -a /lib/modules/5.14.21-150500.53-default /lib/modules/5.14.21-150500.55.7-default
  % depmod -a 5.14.21-150500.55.7-default
  % dracut -f --kver 5.14.21-150500.55.7-default

Then boot the *-55.7 kernel without extra options and verify whether it boots.
It'll get warnings about BTF, but those can be ignored.

If this boots up, something in the modules is problematic.  If this doesn't boot up, it means that some change in the built-in kernel code is problematic instead.
Comment 22 Anders Stedtlund 2024-01-16 15:24:09 UTC
- 5.14.21-150500.53
Boot OK!
- 5.14.21-150500.55.7
Boot OK with pcie_aspm=off

Replace modules in 5.14.21-150500.55.7-default with modules from 5.14.21-150500.53-default.
Boot OK!
Comment 23 Takashi Iwai 2024-01-16 16:34:34 UTC
(In reply to Anders Stedtlund from comment #22)
> - 5.14.21-150500.53
> Boot OK!
> - 5.14.21-150500.55.7
> Boot OK with pcie_aspm=off
> 
> Replace modules in 5.14.21-150500.55.7-default with modules from
> 5.14.21-150500.53-default.
> Boot OK!

That's an interesting result.  So I was scratching the wrong surface.
It might be that the previous test with module replacement didn't work properly.

The next step would be to identify which module actually breaks.  It'll be a great help for understanding what's going wrong.

You can copy back the new modules from the saved directory (*-55.7-default.old) to *-55.7-default directory again, piece-by-piece.
e.g. let's begin with the main suspect, nvme driver modules:

  % rm -r /lib/modules/5.14.21-150500.55.7-default/kernel/drivers/nvme
  % cp -a /lib/modules/5.14.21-150500.55.7-default.old/kernel/drivers/nvme /lib/modules/5.14.21-150500.55.7-default/kernel/drivers/
  % depmod -a 5.14.21-150500.55.7-default
  % dracut -f --kver 5.14.21-150500.55.7-default

This will restore all nvme modules to their *-55.7 versions.  Retest with this.

If this breaks the boot, it's one (or more) of the nvme modules.  You can try again by copying each *.ko.zst from the *-53-default directory to narrow down the culprit.

OTOH, if replacing the nvme drivers doesn't break the boot, it's something else.  Try replacing each directory in turn until you hit the failure.
Comment 24 Takashi Iwai 2024-01-16 16:35:54 UTC
Note that we only care about the modules that are actually loaded.  You can check the lsmod output on the working system, and check whether those modules are included in the directory you are about to replace.
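
The lsmod check can be scripted roughly as follows. A minimal sketch: the function name loaded_modules_in_dir is hypothetical, it reads lsmod output on stdin, and it assumes module files are named <module>.ko with an optional compression suffix, with '-' in file names corresponding to '_' in lsmod names:

```shell
#!/bin/sh
# loaded_modules_in_dir DIR
# Hypothetical helper: reads lsmod-style output on stdin and prints the
# names of loaded modules that have a module file somewhere under DIR.
loaded_modules_in_dir() {
    dir=$1
    # Module names available under DIR: strip the path and the .ko* suffix,
    # then map '-' to '_' to match the names that lsmod reports.
    avail=$(find "$dir" -type f -name '*.ko*' |
        sed -e 's|.*/||' -e 's|\.ko.*$||' | tr '-' '_' | sort -u)
    # First column of lsmod output, skipping the header line.
    awk 'NR > 1 { print $1 }' | sort -u | while read -r mod; do
        if printf '%s\n' "$avail" | grep -qxF -- "$mod"; then
            printf '%s\n' "$mod"
        fi
    done
    return 0
}

# Usage on the working system:
#   lsmod | loaded_modules_in_dir "/lib/modules/$(uname -r)/kernel/drivers/nvme"
```

An empty result means no loaded module comes from that directory, so replacing it should not affect the boot and it can be skipped.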
Comment 25 Anders Stedtlund 2024-01-17 15:19:03 UTC
This is the module that seems to be the culprit:
/lib/modules/5.14.21-150500.55.7-default/kernel/drivers/pci/controller/vmd.ko.zst

If I copy vmd.ko.zst from 5.14.21-150500.53-default and keep all other modules from 5.14.21-150500.55.7-default, the kernel boots.
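
For anyone repeating this kind of single-module test, here is a minimal sketch of the copy step. The function name swap_module and the separate backup directory are my own assumptions, not something prescribed in this thread; depmod and dracut still have to be run afterwards as in the earlier comments:

```shell
#!/bin/sh
# swap_module GOOD_TREE TARGET_TREE REL_PATH BACKUP_DIR
# Hypothetical helper: copies one module file from the known-good tree into
# the tree under test, first saving the replaced file under BACKUP_DIR
# (kept outside the module tree so that depmod never sees it).
swap_module() {
    good=$1; target=$2; rel=$3; backup=$4
    if [ ! -f "$good/$rel" ]; then
        echo "swap_module: missing $good/$rel" >&2
        return 1
    fi
    # Save the original once; never overwrite an existing backup.
    if [ -f "$target/$rel" ] && [ ! -e "$backup/$rel" ]; then
        mkdir -p "$(dirname "$backup/$rel")" || return 1
        cp -p "$target/$rel" "$backup/$rel" || return 1
    fi
    mkdir -p "$(dirname "$target/$rel")" || return 1
    cp -p "$good/$rel" "$target/$rel"
}

# Usage (as root), followed by depmod and dracut for the *-55.7 kernel:
#   swap_module /lib/modules/5.14.21-150500.53-default \
#       /lib/modules/5.14.21-150500.55.7-default \
#       kernel/drivers/pci/controller/vmd.ko.zst /root/module-backup
```

The backup location /root/module-backup in the usage line is only an example path.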
Comment 26 Takashi Iwai 2024-01-17 15:44:00 UTC
Thanks, that's great info!  I hadn't thought of this module.
Comment 27 Takashi Iwai 2024-01-17 15:55:12 UTC
I'm building another test kernel with the revert of the problematic change in PCI/vmd.  It's being built in OBS home:tiwai:bsc1218005-2 repo.  Once the build finishes, the package will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1218005-2/pool/

Please give it a try later.
Comment 28 Takashi Iwai 2024-01-17 16:08:01 UTC
Also, yet another kernel is being built in OBS home:tiwai:bsc1218005-3 repo.
This one is with more complete backports of PCI/vmd stuff instead of reverting the patch.  Check this one later when you have time, too.
Comment 29 Anders Stedtlund 2024-01-18 11:24:15 UTC
Unfortunately, none of those kernels boot.
Comment 30 Takashi Iwai 2024-01-18 11:35:18 UTC
OK thanks.  Then it must be yet another patch touching PCI/vmd stuff.  It's the only one left between *-53 and *-55.7.

I'm building another kernel in OBS home:tiwai:bsc1218005-4 repo.  Please give it a try later.

BTW, such a problem might depend on whether it's a hot or cold boot.  Make sure that you do a cold boot after updating the kernel.
Comment 31 Anders Stedtlund 2024-01-19 05:08:12 UTC
(In reply to Takashi Iwai from comment #30)
> OK thanks.  Then it must be yet another patch touching PCI/vmd stuff.  It's
> the only one left between *-53 and *-55.7.
> 
> I'm building another kernel in OBS home:tiwai:bsc1218005-4 repo.  Please
> give it a try later.
> 
> BTW, such a problem might depend on whether it's a hot or cold boot.  Make
> sure that you do a cold boot after updating the kernel.

This kernel boots! Both hot and cold boot.

Btw, I had an issue adding your latest repos, including this latest one: "Could not find ./repo/repoinit.xml". YaST did not add them, so I had to use zypper.
Comment 32 Takashi Iwai 2024-01-19 09:36:12 UTC
Thanks, finally nailed down.  The problematic patch was the backport of the commit 0a584655ef89541dae4d48d2c523b1480ae80284
  PCI: vmd: Fix secondary bus reset for Intel bridges

It's still not known which additional fix is missing (the backport of all the PCI/vmd changes didn't help, as reported in comment 29).  So currently the patch is just reverted.

The fix will likely be included in the regular update in February.

(In reply to Anders Stedtlund from comment #31)
> Btw, I had an issue adding your latest repos, including this latest one:
> "Could not find ./repo/repoinit.xml". YaST did not add them, so I had to use zypper.

I don't know what's missing, but repomd.xml is present.
Anyway, it's only for testing purposes and isn't meant to stay in your zypper repo list long term.  Once you've installed the kernel (keep it until the next update), remove this repo.
Comment 33 Anders Stedtlund 2024-01-19 10:09:17 UTC
Thank you for your support! Let me know if you want me to test anything before it gets official.

I think I will use kernels from Kernel:/stable:/Backport/standard/ for the time being.
Comment 43 Maintenance Automation 2024-02-14 16:30:07 UTC
SUSE-SU-2024:0469-1: An update that solves 19 vulnerabilities, contains eight features and has 41 security fixes can now be installed.

Category: security (important)
Bug References: 1065729, 1108281, 1141539, 1174649, 1181674, 1193285, 1194869, 1209834, 1210443, 1211515, 1212091, 1214377, 1215275, 1215885, 1216441, 1216559, 1216702, 1217895, 1217987, 1217988, 1217989, 1218005, 1218447, 1218527, 1218659, 1218713, 1218723, 1218730, 1218738, 1218752, 1218757, 1218768, 1218778, 1218779, 1218804, 1218832, 1218836, 1218916, 1218948, 1218958, 1218968, 1218997, 1219006, 1219012, 1219013, 1219014, 1219053, 1219067, 1219120, 1219128, 1219136, 1219285, 1219349, 1219412, 1219429, 1219434, 1219490, 1219512, 1219568, 1219582
CVE References: CVE-2021-33631, CVE-2023-46838, CVE-2023-47233, CVE-2023-4921, CVE-2023-51042, CVE-2023-51043, CVE-2023-51780, CVE-2023-51782, CVE-2023-6040, CVE-2023-6356, CVE-2023-6531, CVE-2023-6535, CVE-2023-6536, CVE-2023-6915, CVE-2024-0565, CVE-2024-0641, CVE-2024-0775, CVE-2024-1085, CVE-2024-1086
Jira References: PED-4729, PED-6694, PED-7322, PED-7615, PED-7616, PED-7620, PED-7622, PED-7623
Sources used:
openSUSE Leap 15.5 (src): kernel-livepatch-SLE15-SP5-RT_Update_10-1-150500.11.5.1, kernel-source-rt-5.14.21-150500.13.35.1, kernel-syms-rt-5.14.21-150500.13.35.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5-RT_Update_10-1-150500.11.5.1
SUSE Real Time Module 15-SP5 (src): kernel-source-rt-5.14.21-150500.13.35.1, kernel-syms-rt-5.14.21-150500.13.35.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 44 Maintenance Automation 2024-02-15 16:30:15 UTC
SUSE-SU-2024:0516-1: An update that solves 21 vulnerabilities, contains nine features and has 40 security fixes can now be installed.

Category: security (important)
Bug References: 1065729, 1108281, 1141539, 1174649, 1181674, 1193285, 1194869, 1209834, 1210443, 1211515, 1212091, 1214377, 1215275, 1215885, 1216441, 1216559, 1216702, 1217895, 1217987, 1217988, 1217989, 1218005, 1218447, 1218527, 1218659, 1218689, 1218713, 1218723, 1218730, 1218752, 1218757, 1218768, 1218778, 1218779, 1218804, 1218832, 1218836, 1218916, 1218948, 1218958, 1218968, 1218997, 1219006, 1219012, 1219013, 1219014, 1219053, 1219067, 1219120, 1219128, 1219136, 1219285, 1219349, 1219412, 1219429, 1219434, 1219490, 1219512, 1219568, 1219582, 1219608
CVE References: CVE-2021-33631, CVE-2023-46838, CVE-2023-47233, CVE-2023-4921, CVE-2023-51042, CVE-2023-51043, CVE-2023-51780, CVE-2023-51782, CVE-2023-6040, CVE-2023-6356, CVE-2023-6531, CVE-2023-6535, CVE-2023-6536, CVE-2023-6915, CVE-2024-0340, CVE-2024-0565, CVE-2024-0641, CVE-2024-0775, CVE-2024-1085, CVE-2024-1086, CVE-2024-24860
Jira References: PED-4729, PED-6694, PED-7322, PED-7615, PED-7616, PED-7618, PED-7620, PED-7622, PED-7623
Sources used:
openSUSE Leap 15.5 (src): kernel-livepatch-SLE15-SP5_Update_10-1-150500.11.5.1, kernel-source-5.14.21-150500.55.49.1, kernel-default-base-5.14.21-150500.55.49.1.150500.6.21.2, kernel-obs-build-5.14.21-150500.55.49.1, kernel-syms-5.14.21-150500.55.49.1, kernel-obs-qa-5.14.21-150500.55.49.1
SUSE Linux Enterprise Micro 5.5 (src): kernel-default-base-5.14.21-150500.55.49.1.150500.6.21.2
Basesystem Module 15-SP5 (src): kernel-source-5.14.21-150500.55.49.1, kernel-default-base-5.14.21-150500.55.49.1.150500.6.21.2
Development Tools Module 15-SP5 (src): kernel-obs-build-5.14.21-150500.55.49.1, kernel-source-5.14.21-150500.55.49.1, kernel-syms-5.14.21-150500.55.49.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5_Update_10-1-150500.11.5.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 45 Maintenance Automation 2024-02-15 16:30:39 UTC
SUSE-SU-2024:0514-1: An update that solves 21 vulnerabilities, contains nine features and has 41 security fixes can now be installed.

Category: security (important)
Bug References: 1065729, 1108281, 1141539, 1174649, 1181674, 1193285, 1194869, 1209834, 1210443, 1211515, 1212091, 1214377, 1215275, 1215885, 1216441, 1216559, 1216702, 1217895, 1217987, 1217988, 1217989, 1218005, 1218447, 1218527, 1218659, 1218689, 1218713, 1218723, 1218730, 1218738, 1218752, 1218757, 1218768, 1218778, 1218779, 1218804, 1218832, 1218836, 1218916, 1218948, 1218958, 1218968, 1218997, 1219006, 1219012, 1219013, 1219014, 1219053, 1219067, 1219120, 1219128, 1219136, 1219285, 1219349, 1219412, 1219429, 1219434, 1219490, 1219512, 1219568, 1219582, 1219608
CVE References: CVE-2021-33631, CVE-2023-46838, CVE-2023-47233, CVE-2023-4921, CVE-2023-51042, CVE-2023-51043, CVE-2023-51780, CVE-2023-51782, CVE-2023-6040, CVE-2023-6356, CVE-2023-6531, CVE-2023-6535, CVE-2023-6536, CVE-2023-6915, CVE-2024-0340, CVE-2024-0565, CVE-2024-0641, CVE-2024-0775, CVE-2024-1085, CVE-2024-1086, CVE-2024-24860
Jira References: PED-4729, PED-6694, PED-7322, PED-7615, PED-7616, PED-7618, PED-7620, PED-7622, PED-7623
Sources used:
openSUSE Leap 15.5 (src): kernel-source-azure-5.14.21-150500.33.34.1, kernel-syms-azure-5.14.21-150500.33.34.1
Public Cloud Module 15-SP5 (src): kernel-source-azure-5.14.21-150500.33.34.1, kernel-syms-azure-5.14.21-150500.33.34.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.