Bug 1217083 - kernel-firmware (ucode-amd): snapshot 20231108 boot fails
Summary: kernel-firmware (ucode-amd): snapshot 20231108 boot fails
Status: RESOLVED NORESPONSE
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: All Other
Importance: P5 - None : Critical
Target Milestone: ---
Assignee: dracut maintainers
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-13 13:16 UTC by Roeland Jansen
Modified: 2024-07-08 10:20 UTC
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
this is how they all end (52.19 KB, image/jpeg)
2023-11-22 13:10 UTC, Roeland Jansen
dracut debug log of a failed system (5.75 MB, text/x-log)
2023-11-23 13:22 UTC, Roeland Jansen
updated to latest w/o ucode (5.75 MB, text/x-log)
2023-11-23 13:41 UTC, Roeland Jansen

Description Roeland Jansen 2023-11-13 13:16:09 UTC
Recently I updated 20231107 to 20231108 on two laptops running legacy boot.

Both hang for some time at the Plymouth console screen. dracut then reports that it cannot find /dev/system/root.

It ends up in the dracut emergency shell.

In the shell, lvm pvscan reports no physical volumes.

It also happens on a virtual machine (also not UEFI).

If I try to recover with a USB stick, no partitions are shown. If I show "all",
the installed system is identified as unknown, architecture unknown
(and yes, it's a 64-bit TW stick).

I could not recover them anymore. Also, on one of the laptops, grub does not show Windows anymore; the other does.

I luckily have a snapshot on my work vm so I can revert back.


The following packages are updated when this happens.

  alsa alsa-devel code kernel-firmware-all kernel-firmware-amdgpu kernel-firmware-ath10k kernel-firmware-ath11k kernel-firmware-atheros kernel-firmware-bluetooth kernel-firmware-bnx2 kernel-firmware-brcm kernel-firmware-chelsio kernel-firmware-dpaa2 kernel-firmware-i915 kernel-firmware-intel kernel-firmware-iwlwifi kernel-firmware-liquidio
  kernel-firmware-marvell kernel-firmware-media kernel-firmware-mediatek kernel-firmware-mellanox kernel-firmware-mwifiex kernel-firmware-network kernel-firmware-nfp kernel-firmware-nvidia kernel-firmware-platform kernel-firmware-prestera kernel-firmware-qcom kernel-firmware-qlogic kernel-firmware-radeon kernel-firmware-realtek kernel-firmware-serial
  kernel-firmware-sound kernel-firmware-ti kernel-firmware-ueagle kernel-firmware-usb-network libasound2 libatopology2 libbrotli-devel libbrotlicommon1 libbrotlicommon1-x86-64-v3 libbrotlidec1 libbrotlidec1-x86-64-v3 libbrotlienc1 libbrotlienc1-x86-64-v3 libbytesize-lang libbytesize1 libgusb2 libnghttp2-14 libsqlite3-0 libsqlite3-0-x86-64-v3
  libsvn_auth_kwallet-1-0 libxxhash0 openSUSE-release openSUSE-release-appliance-custom openSUSE-release-ftp sqlite3-devel sqlite3-tcl subversion subversion-bash-completion subversion-perl sysuser-shadow ucode-amd wmctrl


I tried mounting the system in rescue mode and created a new grub.cfg, in which os-prober no longer shows Windows. The interesting part is that the last line dracut -f prints is "adding boot menu entry for UEFI firmware settings".

Basically I will end up reinstalling Windows and Linux on both laptops
(in legacy boot).

If needed, I can install the non-kernel parts and see which package eventually triggers this.
Comment 1 Roeland Jansen 2023-11-14 07:18:24 UTC
add info:


started to apply subsets of the updates until I found the package.

in my case, ucode-amd triggered the dracut emergency shell.

Both the failed vm (intel) and laptop (intel) dropped in dracut emerg. shell:

- not finding /dev/mapper/system-root
- lvm pvscan in dracut shell: no disks
   
On the vm I reverted to the initial snapshot, taboo'd ucode-amd and then applied the rest of the patches -- boots.

In the laptop I did the recovery dance, mounted all on /mnt, --rbind and chrooted. I removed ucode-amd there, reran dracut and --- it boots again.

hope this info helps.
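
For readers who arrive here with the same symptom, the recovery steps described in this comment could look roughly like the sketch below. It is only an illustration, not the reporter's exact commands: the root device /dev/system/root is taken from the description, while the helper name `rescue_cmds`, the `vgchange -ay` step and the use of `--regenerate-all` (so that the initrd is rebuilt for the installed kernels rather than the rescue kernel) are assumptions. The function only prints the commands so they can be reviewed before being run as root from a rescue system.

```shell
#!/bin/sh
# Print (do not execute) the rescue steps: mount the installed root,
# bind-mount the pseudo filesystems, then remove the package and
# rebuild the initrd inside the chroot.
# rescue_cmds is a hypothetical helper; all arguments are assumptions.
rescue_cmds() {
    root_dev=${1:-/dev/system/root}   # LVM root LV named in the report
    pkg=${2:-ucode-amd}               # package to remove
    mnt=${3:-/mnt}                    # mount point in the rescue system

    echo "vgchange -ay"               # assumption: activate LVM volume groups first
    echo "mount $root_dev $mnt"
    for fs in proc sys dev; do
        echo "mount --rbind /$fs $mnt/$fs"
    done
    echo "chroot $mnt rpm --erase $pkg"
    # assumption: rebuild for all installed kernels, not the rescue kernel
    echo "chroot $mnt dracut -f --regenerate-all"
}

rescue_cmds
```

A separate /boot or /usr filesystem, if present, would need to be mounted below /mnt before the chroot steps; that part is not covered here.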
Comment 2 Takashi Iwai 2023-11-14 14:03:39 UTC
If it's a regression of amd-ucode, you can disable the ucode loading by passing the dis_ucode_ldr boot option.

Also, could you check whether the old amd-ucode (found in TW history repo) still works?
  http://download.opensuse.org/history/
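
As a rough illustration of the boot option mentioned in this comment: `dis_ucode_ldr` is appended to the kernel command line, for example by pressing `e` in the GRUB menu and editing the line that starts with `linux`. Whether the option actually reached the kernel can be checked after boot. The helper name `ucode_loader_disabled` is hypothetical, and the optional file argument exists only so the check can be tried against sample text.

```shell
#!/bin/sh
# Succeed if dis_ucode_ldr appears as a separate word on the kernel
# command line. The file argument defaults to /proc/cmdline.
# ucode_loader_disabled is a hypothetical helper name.
ucode_loader_disabled() {
    cmdline_file=${1:-/proc/cmdline}
    # Put every option on its own line, then look for an exact match.
    tr ' ' '\n' < "$cmdline_file" | grep -qx 'dis_ucode_ldr'
}

if ucode_loader_disabled; then
    echo "microcode loader disabled via dis_ucode_ldr"
else
    echo "dis_ucode_ldr not present on the kernel command line"
fi
```

If the boot succeeds with the option set and fails without it, that would point at microcode loading itself; if it fails either way, the cause is more likely elsewhere.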
Comment 3 Roeland Jansen 2023-11-16 15:12:50 UTC
it's all intel -- so I would think it's not used at all.

I taboo'd the package and a restart was ok there. Directly all back.
Comment 4 Roeland Jansen 2023-11-16 15:13:45 UTC
and obviously rpm --erase'd.
Comment 5 Takashi Iwai 2023-11-16 15:20:27 UTC
Hm, then it smells really strange.  In general, the microcode won't be updated unless it really matches the CPU (the CPU itself checks).

Since this is the only report -- although it should have hit far more people -- I'm afraid that we're scratching the wrong surface.

Could you double-check whether it's really the ucode-amd package that breaks things?
As mentioned, you can control the ucode loading via a boot option (so you can type in on GRUB menu at boot time).
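
To make the vendor-matching point in this comment concrete, here is a minimal sketch that reads the CPU vendor from /proc/cpuinfo and names the microcode package that could apply to it. The package names ucode-intel and ucode-amd come from this report; the helper name `relevant_ucode_pkg` and the optional file argument are assumptions for illustration.

```shell
#!/bin/sh
# Map the CPU vendor string from /proc/cpuinfo to the microcode
# package that could apply to it.
# relevant_ucode_pkg is a hypothetical helper name.
relevant_ucode_pkg() {
    cpuinfo=${1:-/proc/cpuinfo}
    # Take the value of the first "vendor_id : ..." line.
    vendor=$(awk -F': *' '/^vendor_id/ { print $2; exit }' "$cpuinfo")
    case "$vendor" in
        GenuineIntel) echo "ucode-intel" ;;
        AuthenticAMD) echo "ucode-amd" ;;
        *)            echo "unknown" ;;
    esac
}

relevant_ucode_pkg
```

On an Intel machine this should print ucode-intel, which is why ucode-amd being the trigger on Intel hardware looks so odd.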
Comment 6 Roeland Jansen 2023-11-22 13:00:38 UTC
both on the laptop upstairs and on the vm it was definitely the case.

what I did was:

laptop above (in the unbootable state): started rescue, mounted all on /mnt, including rbind mounts of /proc, /sys and /dev; chrooted, removed only the ucode-amd package, and it booted.

the vm I use at work: installed all the packages -- broken; reverted to the snapshot and updated incrementally in chunks, and w/o ucode-amd it booted.

In a different setting, while updating a client 12.4 --> 15.5 offline, I had the same problem, where even two out of four PVs were missing. The LVs found were / and /usr.
(SLES)

In the meantime (about a week), I believe packages have been updated on our RMTs, and then it started working. It's what I observed, not 100% conclusive.

And I just got word that a colleague of mine had the same issues, several LVs not seen. He's kicking the remote RMT to update and will try to re-update from a previous snapshot.
Comment 7 Roeland Jansen 2023-11-22 13:10:12 UTC
Created attachment 870905 [details]
this is how they all end

in the dracut shell, in the lvm section, neither pvscan nor pvs finds any disks.

note that this not only happened in tumbleweed but also sles15.5 and 15.s for SAP
Comment 8 Takashi Iwai 2023-11-22 13:21:12 UTC
Then it's not an issue of amd-ucode itself, but something screwed up with dracut (or the info fed to dracut).

If it were a problem of CPU ucode, you wouldn't reach that point at all.

Tossed to dracut maintainers.
Comment 9 Antonio Feijoo 2023-11-22 13:56:15 UTC
(In reply to Roeland Jansen from comment #0)
> Recently I updated 20231107 to 20231108 on two laptops running legacy boot.

The previous dracut update was in snapshot 20231101 (059+suse.511.g0bdb16ac), no changes between 20231107 and 20231108.

> I tried mounting the system in rescue mode and created a new grub.cfg, in
> which os-prober no longer shows Windows. The interesting part is that the
> last line dracut -f prints is "adding boot menu entry for UEFI firmware settings".

This output is not from dracut, but from grub2-mkconfig. dracut does not add boot menu entries.

> it's all intel -- so I would think it's not used at all.
> 
> I taboo'd the package and a restart was ok there. Directly all back.

Then I'm wondering how ucode-amd can be included in your initramfs. Are you building non-hostonly initrds?

Can you attach the output of `dracut -f --debug test.img`?
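
One way to answer the question of whether microcode ended up in the initramfs is to look at the image listing from `lsinitrd` (shipped with dracut). The sketch below filters such a listing for the early microcode blobs, which dracut places under kernel/x86/microcode/ as GenuineIntel.bin and AuthenticAMD.bin. The helper name `ucode_in_listing` is hypothetical, and the exact listing format may vary between dracut versions.

```shell
#!/bin/sh
# Read an lsinitrd listing on stdin and print the vendor microcode
# blobs found in it, one path per line, without duplicates.
# ucode_in_listing is a hypothetical helper name.
ucode_in_listing() {
    grep -o 'kernel/x86/microcode/[A-Za-z]*\.bin' | sort -u
}

# Typical use against a real image (requires lsinitrd from dracut):
#   lsinitrd test.img | ucode_in_listing
```

A host-only initrd on an Intel machine would be expected to list only GenuineIntel.bin; seeing AuthenticAMD.bin as well would suggest a non-host-only build.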
Comment 10 Roeland Jansen 2023-11-22 14:05:23 UTC
if I literally throw away that package, it boots.

And what I just got from my colleague: all packages installed, VMware, and "model name      : Intel(R) Xeon(R) Gold 6338N CPU @ 2.20GHz" as the CPU seen on the VMs,

he installed 15.5 (SLES) --> missing lv's and pv's 

he just removed ucode-intel and it boots.... 

regarding the question about the output of `dracut -f --debug test.img`?

I then need to forcefully f* up one instance. Not sure if we can do that. At work, these vm's certainly cannot be f* up.

The images are not unified ones, specifically built on all the respective images.

I have even seen a VM where both ucode-intel AND ucode-amd were installed as rpm's.

The plot thickens.
Comment 11 Antonio Feijoo 2023-11-22 14:19:28 UTC
(In reply to Roeland Jansen from comment #1)
> in my case, ucode-amd triggered the dracut emergency shell.
> 
> Both the failed vm (intel) and laptop (intel) dropped in dracut emerg. shell:

+

(In reply to Roeland Jansen from comment #10)
> he installed 15.5 (SLES) --> missing lv's and pv's 
> 
> he just removed ucode-intel and it boots.... 

so the boot fails with any ucode? that's pretty weird...
 
> regarding the question about the output of `dracut -f --debug test.img`?
> 
> I then need to forcefully f* up one instance. Not sure if we can do. At
> work, these vm's certainly cannot be f* up.

You can install the ucode package that you say is breaking your boot, run `dracut -f --debug test.img 2>&1 &> dracut.log` (it does not install the initrd in the /boot partition), and uninstall the ucode package.

Otherwise, if we don't have any kind of log it's quite difficult to guess what may be happening, I've never seen a similar bug report.
Comment 12 Roeland Jansen 2023-11-22 15:07:03 UTC
a colleague of mine and I will be at the office and will try to reproduce this issue with a new vm and send the test.img
 
will also read the story above and act on it.
Comment 13 Antonio Feijoo 2023-11-22 15:19:35 UTC
(In reply to Roeland Jansen from comment #12)
> a colleague of mine and I will be at the office and will try to reproduce
> this issue with a new vm and send the test.img

Just to clarify, we ask to attach the `dracut.log` file, not the `test.img`.

BTW, you can use this `test.img` initramfs without breaking the default, copy it to your /boot partition and edit the grub entry at boot when the grub menu is displayed, changing the value of the `initrd` line to /test.img
Comment 14 Roeland Jansen 2023-11-23 10:06:35 UTC
we're going to redo the test. my colleague mentioned that when the boot started to fail, he had also seen a UEFI entry in grub, like what I mentioned in one of the first comments of this report.

And his environment is also legacy boot.

We'll update you later today I hope
Comment 15 Roeland Jansen 2023-11-23 13:16:30 UTC
the debug log to be attached
Comment 16 Roeland Jansen 2023-11-23 13:22:09 UTC
Created attachment 870934 [details]
dracut debug log of a failed system
Comment 17 Roeland Jansen 2023-11-23 13:41:06 UTC
Created attachment 870936 [details]
updated to latest w/o ucode

this is the log when we uninstall ucode (in this case intel) and do the zypper up. It then will boot just fine.
Comment 18 Roeland Jansen 2023-11-23 13:45:23 UTC
and as an added bonus:

the system from the log above (w/o ucode) boots, as said.
After this we installed ucode-intel; it fired off dracut and
the system fails to boot afterwards.

It stalls at "Reached target Basic System";
after that, dracut initqueue complains about timeouts because it cannot find all LVs.

Don't think that the ucode itself is the issue, but rather something that triggers dracut to create a broken initrd?
Comment 19 Antonio Feijoo 2023-11-24 10:20:02 UTC
> //etc/os-release@4(source): PRETTY_NAME='SUSE Linux Enterprise Server 15 SP5'

The logs you are providing are from SLE, so please provide the Tumbleweed logs.

If this is only about a SLE bug, you should open a L3 incident, so it can be addressed correctly. Thank you.
Comment 25 Roeland Jansen 2023-11-27 08:59:31 UTC
well....


The problem is that it's at home and work a TW issue (3 or 4 systems)
AND it also is a same issue on SLES (*) (at work)

100% same issue, same problem, no PVs found after update.
We do know the "work-around" - on all the systems, removing ucode* upfront.

(*) I could just skip it and let it go like "if someone else has this issue, I don't care" but to me that's not helping. 

Two different teams looking at it doesn't seem to me an effective use of resources?
Comment 26 Antonio Feijoo 2023-11-27 10:24:45 UTC
(In reply to Roeland Jansen from comment #25)
> well....
> 
> 
> The problem is that it's at home and work a TW issue (3 or 4 systems)
> AND it also is a same issue on SLES (*) (at work)
> 
> 100% same issue, same problem, no PVs found after update.
> We do know the "work-around" - on all the systems, removing ucode* upfront.
> 
> (*) I could just skip it and let it go like "if someone else has this issue,
> I don't care" but to me that's not helping. 
> 
> Two different teams looking at it doesn't seem to me an effective use of
> resources?

SLE is a commercial product, therefore all its incidents must be handled through the SUSE Customer Center (https://scc.suse.com/). That does not mean different teams, but different processes. Thank you for your understanding and I hope this does not cause you any inconvenience.

Other than that, I suspect (it's the only thing I can do with the info I have) that you are experiencing at least 2 different issues:
- 1 : Tumbleweed update from 23031107 to 20231108 on two laptops with Intel CPU => problem with ucode-amd (comment #0, comment #3)
- 2 : SLES 15.5, update? laptop or vm? with Intel CPU => problem with ucode-intel (comment #10)

Both Tumbleweed and SLE have different kernel, dracut, firmware versions... Tumbleweed is a rolling release, so it's very unlikely that an issue with a TW update is related to an issue in SLE.

BTW, AFAIK microcode only affects physical CPUs; VM guests do not have microcode of their own, so it's even more strange that a ucode package is breaking a vm. And, as Takashi said in comment #8, a problem with a microcode should not allow the system to get this far in the boot process.

So, it'd be great if you can provide the Tumbleweed logs requested in comment #11 (`dracut -f --debug test.img 2>&1 &> dracut.log`), and the file /run/initramfs/rdsosreport.txt after trying to boot with the generated initramfs test.img, passing also `rd.debug` to the kernel command line. Thanks!
Comment 27 Roeland Jansen 2023-11-27 15:07:45 UTC
SLE15 is a VM for a customer. 

the TW systems I cannot provide anymore as they are all fixed, and reinstalling the ucode packages does not trigger the issues after the fix, which led my colleague and me to think of this:

if ucode (intel/amd) is there, it triggers a specific initrd build that b0rks the system (ref comment #8). If there is no ucode installed --> basically no additional dracut run.

If you then update the system, putting ucode back and doing a dracut -f does not fail anymore.


What I can do is checking out if I can find back both ISOs (TW) and redo this in a VM.

re 23031107 to 20231108 -- happened on a vm (vmware workstation) in windows (intel) and on two physical systems (laptops, also intel)

the base image 15.5 we used for the customer (SLE) booted fine and failed after the update. we luckily had a snapshot, so when we went back, removed ucode there (intel) and updated, all was fine.

So our idea is that somewhere between versions/updates there is a specific condition that breaks the booting process.

Give me some time to find the specific snapshots and replay this in a workstation vm. 

Re SLE: we can definitely 'replay' this from the base image and update but think the rdsosreport could be a problem. We'll see.
Comment 29 Antonio Feijoo 2024-02-29 16:26:27 UTC
Closing this bug for Tumbleweed after 3 months without response. Please reopen it if you can reproduce it with the current Tumbleweed version (snapshot 20240228) and provide its logs (see comment #26), because we didn't have any other reports similar to this one during this time span.

For the SLE case, an incident must be opened through the SUSE Customer Center (https://scc.suse.com/).

Thank you.
Comment 31 Maintenance Automation 2024-04-02 08:30:06 UTC
SUSE-RU-2024:1081-1: An update that has four fixes can now be installed.

Category: recommended (important)
Bug References: 1217083, 1219841, 1220485, 1221675
Maintenance Incident: [SUSE:Maintenance:33012](https://smelt.suse.de/incident/33012/)
Sources used:
Basesystem Module 15-SP5 (src):
 dracut-055+suse.382.g80b55af2-150500.3.18.1
openSUSE Leap 15.5 (src):
 dracut-055+suse.382.g80b55af2-150500.3.18.1
SUSE Linux Enterprise Micro 5.5 (src):
 dracut-055+suse.382.g80b55af2-150500.3.18.1

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.