Bug 1226497 - No boot from software RAID after upgrade to 15.6 on Supermicro H12DSi-NT6
Status: IN_PROGRESS
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Bootloader
Version: Leap 15.6
Hardware: x86-64 Other
Importance: P5 - None : Major
Target Milestone: ---
Assignee: Bootloader Maintainers
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-18 20:19 UTC by Georg Pfuetzenreuter
Modified: 2024-07-16 07:04 UTC (History)
CC List: 5 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Georg Pfuetzenreuter 2024-06-18 20:19:35 UTC
Hi,

this follows https://progress.opensuse.org/issues/162401.

Background:
We have several hypervisors with a software RAID (MDADM) setup, which also houses /boot and /boot/efi. This always worked very well on Supermicro platforms and avoided the need for additional tricks to keep the EFI partition redundant and in sync.

Situation:
The upgrade to 15.6 on one such machine (with a Supermicro H12SSL-NT board) worked fine.
A second machine (with a Supermicro H12DSi-NT6 board), however, freezes at POST during the reboot after the 15.6 upgrade. It hangs while printing "Ready to boot", the point at which it would usually hand over to the OS bootloader.

Entering a chroot from a live environment, downgrading the GRUB packages to the ones used on 15.5

grub2-2.06-150500.29.25.12.x86_64.rpm
grub2-i386-pc-2.06-150500.29.25.12.noarch.rpm
grub2-x86_64-efi-2.06-150500.29.25.12.noarch.rpm

and running `bash -x /usr/lib/bootloader/grub2-efi/install` makes the machine boot again.

Advice would be appreciated - we have several more of the H12DSi-NT6 machines which we cannot upgrade to 15.6 because of this.
Comment 1 Georg Pfuetzenreuter 2024-06-18 21:07:05 UTC
Additional, potentially useful, information:

```
falkor21 (Hypervisor):~ # grep -Ev '^#|^$' /etc/sysconfig/bootloader
LOADER_TYPE="grub2-efi"
SECURE_BOOT=no
TRUSTED_BOOT="no"
UPDATE_NVRAM=yes

falkor21 (Hypervisor):~ # efibootmgr
BootCurrent: 0000
Timeout: 1 seconds
BootOrder: 0000,0010,0004,0006,0008,000A,000C,000E,0002,0001,000F
Boot0000* opensuse
Boot0001  Hard Drive
Boot0002* UEFI: Built-in EFI Shell
Boot0004* (B1/D0/F0) UEFI HTTP: IPv4 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecefcb981e)
Boot0006* (B1/D0/F1) UEFI HTTP: IPv4 Supermicro 10GBASE-T Ethernet Controller(MAC:3cecefcb981f)
Boot0008* (B33/D0/F0) UEFI HTTP: IPv4 Intel(R) Ethernet Converged Network Adapter XL710-Q2(MAC:40a6b7a59b68)
Boot000A* (B33/D0/F1) UEFI HTTP: IPv4 Intel(R) Ethernet Converged Network Adapter XL710-Q2(MAC:40a6b7a59b69)
Boot000C* (B161/D0/F0) UEFI HTTP: IPv4 Intel(R) Ethernet Converged Network Adapter XL710-Q2(MAC:40a6b7a59258)
Boot000E* (B161/D0/F1) UEFI HTTP: IPv4 Intel(R) Ethernet Converged Network Adapter XL710-Q2(MAC:40a6b7a59259)
Boot000F  USB CD
Boot0010* UEFI: ATEN Virtual CDROM YS0J
```

I also noticed that GRUB2 from 15.6 creates `/boot/efi/EFI/BOOT/BOOTX64.EFI` in addition to `/boot/efi/EFI/opensuse/grubx64.efi`. Not sure if that is relevant.

GRUB2 package version from 15.6 repositories: 2.12-150600.6.12.
Comment 2 Michael Chang 2024-07-03 08:30:12 UTC
Hi Georg,

As far as I can tell, there are no major changes in how grub handles software RAID in version 2.12. It should not fail like that when encountering a problem. Normally, a failed root device access should drop you into a rescue shell, providing some information for troubleshooting.

This makes me suspect that the issue might not be specific to software RAID. Instead, it could be a new regression in version 2.12 caused by the newly introduced bli module, which may also render the screen black indefinitely. [1]

Please check if this is the case by disabling 25_bli in grub.cfg. Follow these steps after updating to grub 2.12:

> 1. chmod -x /etc/grub.d/25_bli
> 2. grub2-mkconfig -o /boot/grub2/grub.cfg

Ensure that the following section is completely **removed** from the resulting grub.cfg:

> ### BEGIN /etc/grub.d/25_bli ###
> if [ "$grub_platform" = "efi" ]; then
>   insmod bli
> fi
> ### END /etc/grub.d/25_bli ###

Finally, reboot and check whether it makes any difference.

[1] https://mail.gnu.org/archive/html/grub-devel/2023-07/msg00077.html
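The check described above can be scripted. The following is only a minimal sketch: `bli_enabled` is a hypothetical helper name (not part of GRUB), and it assumes the module is loaded through a plain `insmod bli` line, as in the section quoted above.

```shell
# Sketch: report whether a grub.cfg still loads the bli module.
# bli_enabled is a hypothetical helper name, not a GRUB tool.
bli_enabled() {
    cfg=$1
    # Succeeds (exit status 0) if some line consists of "insmod bli",
    # ignoring leading and trailing whitespace.
    grep -Eq '^[[:space:]]*insmod[[:space:]]+bli[[:space:]]*$' "$cfg"
}
```

After regenerating the configuration, `bli_enabled /boot/grub2/grub.cfg && echo "bli is still loaded"` should print nothing if the workaround took effect.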

Thanks.
Comment 3 Georg Pfuetzenreuter 2024-07-06 11:23:52 UTC
Hi Michael,

thank you for the pointer and instructions!
After following them, the machine indeed boots fine!

I suspected the issue to be RAID specific since booting a 15.6 based live ISO image (through the "Virtual CD-ROM" feature of the BMC) works fine without the workaround.
Comment 4 Georg Pfuetzenreuter 2024-07-09 21:33:11 UTC
It seems the "25_bli" file gets its executable bit reset after updating/reinstalling the package. But I found it is installed with "noreplace"

```
%config(noreplace) %{_sysconfdir}/grub.d/25_bli
```

meaning a better workaround seems to be inserting `exit 0` at the beginning of the script.

But of course, not a permanent solution. ;-)
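A minimal sketch of that `exit 0` workaround follows. It assumes GNU sed and that the first line of the script is a shebang; `disable_grub_script` is a hypothetical helper name chosen for illustration.

```shell
# Sketch: neutralise a /etc/grub.d script by making it exit immediately.
# disable_grub_script is a hypothetical helper name.
disable_grub_script() {
    script=$1
    # Nothing to do if the guard is already present on line 2.
    if [ "$(sed -n '2p' "$script")" = "exit 0" ]; then
        return 0
    fi
    # Append "exit 0" after line 1 (assumed to be the shebang).
    # The in-place flag and one-line "a" form are GNU sed features.
    sed -i '1a exit 0' "$script"
}
```

Usage would be `disable_grub_script /etc/grub.d/25_bli` followed by `grub2-mkconfig -o /boot/grub2/grub.cfg`. Because the script stays executable, the BEGIN/END markers for 25_bli should still appear in grub.cfg, but with nothing between them.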
Comment 5 Michael Chang 2024-07-10 11:30:31 UTC
(In reply to Georg Pfuetzenreuter from comment #4)
> It seems the "25_bli" file gets its executable bit reset after
> updating/reinstalling the package. But I found it is installed with
> "noreplace"
> 
> ```
> %config(noreplace) %{_sysconfdir}/grub.d/25_bli
> ```
> 
> meaning a better workaround seems to be inserting `exit 0` at the beginning
> of the script.
> 
> But of course, not a permanent solution. ;-)

Somehow I have the impression that the permission should be preserved if the file is packaged as %config(noreplace). Maybe something has changed or my memory has betrayed me. 

Anyway, I have taken a brief look into the bli issue. Although some related patches addressing the GUID alignment problem have been merged upstream, the issue still does not seem to be completely fixed. I am trying to build a test package in the hope of sorting it out. Since the issue is specific to the firmware, we need your assistance: would you be willing to help test it?

Thanks.
Comment 6 Michael Chang 2024-07-11 04:45:07 UTC
Hi Georg,

I’m not entirely sure what went wrong, but here’s the initial patch to start with:

https://build.opensuse.org/projects/home:michael-chang:bsc:1226497/packages/grub2/files/0001-Increase-grub_guid_t-alignment-from-4-to-8.patch?expand=1

The repository has been published here:

https://download.opensuse.org/repositories/home:/michael-chang:/bsc:/1226497/openSUSE_Tumbleweed/

If you have some time and it won’t disrupt your work, please consider testing it and providing feedback on the results. We'll be able to plan the next steps based on what we learn from this. If you encounter any issues, feel free to let me know.

Thanks.
Comment 7 Georg Pfuetzenreuter 2024-07-11 14:23:27 UTC
Hi Michael,

thanks for looking into it.

I'm happy to test - it's a bit inconvenient to recover from the live system, but doable. ;-)

Could you enable the linked home project to build for 15.6 as well? Or should I pick the Tumbleweed build?
Comment 8 Michael Chang 2024-07-12 07:08:49 UTC
(In reply to Georg Pfuetzenreuter from comment #7)
> Hi Michael,
> 
> thanks for looking into it.
> 
> I'm happy to test - it's a bit inconvenient to recover from the live system,
> but doable. ;-)

Thank you.

> 
> Could you enable the linked home project to build for 15.6 as well? Or
> should I pick the Tumbleweed build?

It's now published:
https://download.opensuse.org/repositories/home:/michael-chang:/bsc:/1226497/15.6/

Thanks.
Comment 9 Georg Pfuetzenreuter 2024-07-12 16:53:19 UTC
Installed grub2-2.12-lp156.30.1.x86_64.rpm, grub2-i386-pc-2.12-lp156.30.1.noarch.rpm, and grub2-x86_64-efi-2.12-lp156.30.1.noarch.rpm from your repository, enabled the 25_bli script and generated grub.cfg -> machine does not boot (stuck at "DXE -- Ready to boot.." again).
Comment 10 Michael Chang 2024-07-15 04:49:12 UTC
(In reply to Georg Pfuetzenreuter from comment #9)
> Installed grub2-2.12-lp156.30.1.x86_64.rpm,
> grub2-i386-pc-2.12-lp156.30.1.noarch.rpm 
> grub2-x86_64-efi-2.12-lp156.30.1.noarch.rpm from your repository, enabled
> the 25_bli script and generated grub.cfg -> machine does not boot (stuck at
> "DXE -- Ready to boot.." again).

Thank you for the quick turnaround. It looks like the GUID alignment isn’t the issue this time.

After digging a bit more, it seems like get_part_uuid() might be the problem, especially since your ESP (/boot/efi) is in a RAID setup.

For this second round, I have added the patch:

https://build.opensuse.org/projects/home:michael-chang:bsc:1226497/packages/grub2/files/0001-bli-Fix-crash-in-get_part_uuid.patch?expand=1

It should be published in the same location:

https://download.opensuse.org/repositories/home:/michael-chang:/bsc:/1226497/15.6/ 

Could you please test it again? Your help is greatly appreciated.

Thanks.
Comment 11 Georg Pfuetzenreuter 2024-07-15 15:40:20 UTC
Thanks for the new patch and explanation.
I followed the same steps - lo and behold, it boots!
Comment 12 Michael Chang 2024-07-16 07:04:10 UTC
Thank you very much for your help. Based on the positive results, I have sent the patch upstream for review.

https://lore.kernel.org/grub-devel/20240716065500.17142-1-mchang@suse.com/T/#u

I'll include it once it is accepted.
Thanks.