Bugzilla – Bug 1182776
[Build 154.1] System cannot boot after migration: Synchronous Exception at 0x000000007F5DF438
Last modified: 2024-05-16 07:40:13 UTC
Created attachment 846545 [details]
Serial Console

* Platform and arch: aarch64, Virtual System
* OS Version: SLES+HA 15-SP3 Public Beta (Build 154.1), Full media
  - Installed via: ISO
  - Using procedure: manual, by openQA
* LOGS: serial console log
* Results:
  - Expected: aarch64 system in a previous OS version is successfully migrated to 15-SP3.
  - Real: after offline migration finishes, system is not able to boot
* Reproducible: yes
* openQA:
  * Link to failed tests:
    https://openqa.suse.de/tests/5523702#step/grub_test/322 (migration from 15-LTSS)
    https://openqa.suse.de/tests/5523689#step/grub_test/322 (migration from 15-SP1-LTSS)
    https://openqa.suse.de/tests/5523697#step/grub_test/322 (migration from 15-SP2)
  * Link to last successful runs:
    https://openqa.suse.de/tests/5475230 (Snapshot 10/Build 150.1, migration from 15-LTSS)
    https://openqa.suse.de/tests/5475217 (Snapshot 10/Build 150.1, migration from 15-SP1-LTSS)
    https://openqa.suse.de/tests/5475225 (Snapshot 10/Build 150.1, migration from 15-SP2)
* Description:

When migrating VMs from 15-LTSS, 15-SP1-LTSS or 15-SP2-LTSS to 15-SP3 using textmode migration and the Full medium, the migration process finishes successfully, but when the VM is restarted after migration, it does not come back online. The serial console from the openQA test (attached) shows the following messages right after the ISO image grub menu is loaded:

Synchronous Exception at 0x000000007F5DF438
AllocatePool: failed to allocate 688 bytes
ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdePkg/Library/UefiLib/UefiLibPrint.c(203): Buffer != ((void *) 0)

Additionally, the same migration test was run twice today on the same hypervisor at the same time, one run migrating the VM to Snapshot 10 and the other migrating it to the Public Beta candidate. Migration to Snapshot 10 was possible while migration to the Public Beta candidate failed as described:

Migration to Snapshot 10: http://mango.qa.suse.de/tests/3556
Migration to Public Beta: http://mango.qa.suse.de/tests/3555
It looks to me that the exception appears before the kernel gets loaded into memory. I.e. it comes from Grub(?). Michael, can you please glance at the serial log and share your thoughts on the matter?
(In reply to Libor Pechacek from comment #1)
> It looks to me that the exception appears before the kernel gets loaded into
> memory. I.e. it comes from Grub(?). Michael, can you please glance at the
> serial log and share your thoughts on the matter?

It is an assertion failure in AAVMF, so it has to be sorted out in the virtual machine firmware first.

Synchronous Exception at 0x000000007F5DF438
AllocatePool: failed to allocate 472 bytes
ASSERT [HiiDatabase] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdeModulePkg/Universal/HiiDatabaseDxe/Font.c(1686): Cell != ((void *) 0)
(In reply to Alvaro Carvajal from comment #0)
> Additionally, the same migration test was run today on the same hypervisor
> at the same time, one migrating the VM to Snapshot 10, and the other
> migrating the VM to Public Beta candidate. Migration to Snapshot 10 was
> possible while migration to Public Beta candidate failed as described:
>
> Migration to Snapshot 10: http://mango.qa.suse.de/tests/3556
> Migration to Public Beta: http://mango.qa.suse.de/tests/3555

If migration to Snapshot 10 and earlier works, then I wonder whether this is really a regression in grub, because its aarch64 port has not changed for nearly a month.

* Wed Jan 27 2021 mchang@suse.com
- Complete Secure Boot support on aarch64 (jsc#SLE-15020)
  * 0001-Add-support-for-Linux-EFI-stub-loading-on-aarch64.patch
  * 0002-arm64-make-sure-fdt-has-address-cells-and-size-cells.patch
  * 0003-Make-grub_error-more-verbose.patch
  * 0004-arm-arm64-loader-Better-memory-allocation-and-error-.patch
  * 0005-Make-linux_arm_kernel_header.hdr_offset-be-at-the-ri.patch
  * 0006-efi-Set-image-base-address-before-jumping-to-the-PE-.patch
  * 0007-linuxefi-fail-kernel-validation-without-shim-protoco.patch
  * 0008-squash-Add-support-for-Linux-EFI-stub-loading-on-aar.patch
  * 0009-squash-Add-support-for-linuxefi.patch
Looks to be shim related. The serial log for Snapshot 10 had no shim in place, while in the Public Beta shim was loaded by AAVMF before grub.

Given that the problem emerged in the wake of the recent secure boot integration by YaST, which made shim really effective in the boot process for aarch64, maybe we can look in this direction first.

CC Gary and Joey.
Is secure boot enabled in "firmware" or not ?
(In reply to Michael Chang from comment #5)
> Is secure boot enabled in "firmware" or not ?

I do not think so. I see these messages in the serial console when the test is starting:

Variable SecureBoot is 0
Variable SecureBootEnable is 0

And then I see this:

[ 1.357581] ima: secureboot mode disabled

In the boot after the migration to 15-SP3.

To be sure, I will trigger a test to capture FW settings and get back to you.
Created attachment 846576 [details] Secure Boot Configuration Got the attached screenshot from the following test: http://mango.qa.suse.de/tests/3563 (ongoing as of now) This is a clone of http://mango.qa.suse.de/tests/3559, itself a clone of http://mango.qa.suse.de/tests/3555 which I referenced while opening the bug. As can be seen on the screenshot, Secure Boot State is disabled.
Just FYI our migration test hit this bug too https://openqa.nue.suse.com/tests/5554749#step/grub_test/322 https://openqa.nue.suse.com/tests/5554747#step/grub_test/322
We are still seeing this with the new build 156.3:

https://openqa.suse.de/tests/5552630#step/grub_test/322

Also, with bsc#1182663 fixed, our media+SCC migration scenarios, which use the Online medium and SCC during migration, were able to perform the migration itself, but are also failing to boot after migration in the same manner:

https://openqa.suse.de/tests/5552981#step/grub_test/322

However, online migration scenarios with zypper are working:

https://openqa.suse.de/tests/5552985

All linked tests are for migrations from 15-SP2, but we're seeing the same issue in migrations from 15-GA and 15-SP1 as well.
A couple of updates to this issue. It's also visible on 64bit with UEFI enabled. I've tried to disable secure boot, but it didn't help.

Also, the hard drive with the installed system is not listed in tianocore, but when I tried to boot from the "Misc devices" entry, I was able to boot with secure boot enabled, and got the following on the serial console:

Creating boot entry "Boot0009" with label "sles-secureboot" for file "\EFI\sles\shim.efi"
Welcome to GRUB!
Please press 't' to show the boot menu on this console
Create new secret key
Failed to generate secret key: EFI_NOT_FOUND

The same trick didn't work at all with secure boot disabled; grub got stuck at the kernel loading step.
Adjusting the component to bootloader, as recent analysis points towards grub2 and the addition of shim in aarch64 secureboot introduced with the recent yast2-bootloader update.
Another update: on 64bit, this seems more related to https://bugzilla.suse.com/show_bug.cgi?id=1182749

When testing manually, if secure boot is disabled on the VM before the installation, it all works just fine. But if it's enabled during installation, disabling it afterwards doesn't help.

It still might be that the issues share the same root cause.
The Synchronous Exception is caused by an assertion on null pointers returned by the UEFI memory allocation service in AAVMF. The source snippet relevant to

> Synchronous Exception at 0x000000007F5DF438
> AllocatePool: failed to allocate 688 bytes
> ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdePkg/Library/UefiLib/UefiLibPrint.c(203): Buffer != ((void *) 0)

is

> Buffer = (CHAR16 *) AllocatePool(BufferSize);
> ASSERT (Buffer != NULL);

while

> Synchronous Exception at 0x000000007F5DF438
> AllocatePool: failed to allocate 472 bytes
> ASSERT [HiiDatabase] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdeModulePkg/Universal/HiiDatabaseDxe/Font.c(1686): Cell != ((void *) 0)

is

> Cell = (EFI_HII_GLYPH_INFO *) AllocateZeroPool (StrLength * sizeof (EFI_HII_GLYPH_INFO));
> ASSERT (Cell != NULL);

Can we try to allocate more memory to the guest and see if that helps ? The shim is loaded in front of grub, and will not be relinquished until the kernel calls ExitBootServices() ..

If that doesn't help, then there is probably an infinite loop somewhere that runs until OOM..
It is about the "Boot from Hard Disk" entry in the grub_test step. The "working" case appears to have some workaround applied: the test case seems to expect that it would "(re)boot to uefi menu" rather than "Boot from Hard Disk". From there it selects the newly installed boot entry ...

Why not boot from the hard disk directly instead of starting from the cdrom ? The install should have set the boot order to boot from the hard disk, so this does not look normal either.

I wonder whether "Boot from Hard Disk" is somehow broken with the new shim integration, as it now really finds fallback.efi and for some reason gets stuck there. But at the same time the test case seems to be bogus, as it relies on unintended behavior, so it doesn't really perform the test -- booting the target disk after "reboot" (it boots the cdrom again instead).

Could openQA help to improve the flow so the test result can be understood more clearly ? I would consider "Boot from Hard Disk" to have failed in the "working" case as well, given that it didn't really work.

menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class gnu --class os {
  if search --no-floppy --file /efi/boot/fallback.efi --set ; then
    for os in opensuse sles caasp ; do
      if [ -f /efi/$os/grub.efi ] ; then
        chainloader /efi/$os/grub.efi
      fi
    done
  fi
}

No shim so no fallback.efi, and the test case expects the search to fail ?
(In reply to Michael Chang from comment #13)
> Can we try to allocate more memory to the guest and see of that helps ? The
> shim is loaded in front of grub, and will not be relinquished until kernel
> calling out ExitBootServce() ..
>
> If that didn't help, then probably there's infinte looping somewhere until
> OOM..

I triggered a test with QEMURAM=2048, which is double what is configured for the tests in openqa.suse.de:

http://mango.qa.suse.de/tests/3564#step/grub_test/322

But it failed in the same step as before.
Please have a look at this bug report ..

https://bugzilla.suse.com/show_bug.cgi?id=1176967

Basically, we should avoid using "Boot from hard disk" to test booting the disk after installation, as the secure boot signkey may differ between the media and the disk. (The shim on the media is used to provide the shim-lock, not the one from the disk.)
(In reply to Alvaro Carvajal from comment #15)
> (In reply to Michael Chang from comment #13)
> http://mango.qa.suse.de/tests/3564#step/grub_test/322
>
> But it failed in the same step as before.

Thanks a lot for the verification, then it is really something else ...
(In reply to Michael Chang from comment #14)
> menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> gnu --class os {
> if search --no-floppy --file /efi/boot/fallback.efi --set ; then
> for os in opensuse sles caasp ; do
> if [ -f /efi/$os/grub.efi ] ; then
> chainloader /efi/$os/grub.efi
> fi
> done
> fi
> }

Hm. The arm build merely does 'exit' for "Boot from Hard Disk" ...

menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class gnu --class os {
  exit
}

IIRC, the shim has a hook function for the 'exit' of the loaded image .. If so it might be related to the problem here, since previously it had worked without shim.

Hi Gary,

Do you have any ideas/thoughts ?
Thanks.
(In reply to Michael Chang from comment #18)
> (In reply to Michael Chang from comment #14)
>
> > menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> > gnu --class os {
> > if search --no-floppy --file /efi/boot/fallback.efi --set ; then
> > for os in opensuse sles caasp ; do
> > if [ -f /efi/$os/grub.efi ] ; then
> > chainloader /efi/$os/grub.efi
> > fi
> > done
> > fi
> > }
>
> Hm. The arm build merely does 'exit' for "Boot from Hard Disk" ...
>
> menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> gnu --class os {
> exit
> }
>
> IIRC, the shim has hook funciton to the 'exit' of the loaded image .. If so
> it might be related to the problem here, since previously it had worked
> without shim.
>
> Hi Gary,
>
> Did you have any idea/thoughts ?
> Thanks.

The exit hook from shim is quite simple: unhook the system services and call the actual BS->Exit().

Per serial0.txt, the boot option next to cdrom is the firmware menu (UiApp):

[Bds]=============Begin Load Options Dumping ...=============
Driver Options:
SysPrep Options:
Boot Options:
Boot0001: UEFI QEMU QEMU CD-ROM 0x0001
Boot0000: UiApp 0x0109
Boot0007: EFI Internal Shell 0x0001
Boot0002: UEFI Misc Device 0x0001
PlatformRecovery Options:
PlatformRecovery0000: Default PlatformRecovery 0x0001
[Bds]=============End Load Options Dumping=============

I wonder if something went wrong in UiApp.
(In reply to Gary Ching-Pang Lin from comment #19)
> (In reply to Michael Chang from comment #18)
> > (In reply to Michael Chang from comment #14)
> [Bds]=============Begin Load Options Dumping ...=============
> Driver Options:
> SysPrep Options:
> Boot Options:
> Boot0001: UEFI QEMU QEMU CD-ROM 0x0001
> Boot0000: UiApp 0x0109
> Boot0007: EFI Internal Shell 0x0001
> Boot0002: UEFI Misc Device 0x0001
> PlatformRecovery Options:
> PlatformRecovery0000: Default PlatformRecovery 0x0001
> [Bds]=============End Load Options Dumping=============
>
> I wonder if something went wrong in UiApp.

Yes, at least the exception looks like trouble in drawing the menu.

@ Alvaro,
Is it possible to rearrange the order in the openQA test ? Either having "Boot0002: UEFI Misc Device" at the top of the list, or in second place after "Boot0001: UEFI QEMU QEMU CD-ROM 0x0001", so that "Boot from Hard Disk" can fall through to it and thus avoid jumping through the hoops of UiApp ...

Thanks.
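
As a side note to the load-option dump quoted above: the order in which the firmware lists its boot options can be pulled out of a serial log with a few lines of shell. This is only a sketch. The function name list_boot_options is made up for illustration, and it assumes the log keeps the "Boot Options:" and "PlatformRecovery Options:" markers on lines of their own, as in the dump.

```shell
# Hypothetical helper (name invented for this sketch): print the boot options
# from an edk2 "[Bds]" load-option dump in the order the firmware lists them.
# Reads a serial log on stdin.
list_boot_options() {
    awk '
        { sub(/^[ \t]+/, "") }                      # tolerate indented log lines
        /^Boot Options:/             { in_boot = 1; next }
        /^PlatformRecovery Options:/ { in_boot = 0 }
        in_boot && /^Boot[0-9A-Fa-f]+:/ { print }   # BootXXXX entries only
    '
}
```

Fed with the dump from comment #19, the second line printed is the UiApp entry, which is where a grub 'exit' from the CD-ROM entry ends up.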
Surprisingly, "Boot from Hard Disk" failed for x86_64 efi as well ... It entered the uefi menu properly, but since it didn't boot the installed disk the result is deemed a failure. "Entering the uefi menu" aligns with the arm behavior here; maybe the x86 code path has accidentally changed, so this is not a coincidence.

I wouldn't mind testing the "Boot from Hard Disk" functionality, but it would be better in a separate test that is not on the migration or major path. The reboot after installation should go straight to the newly installed boot entry, as indicated by the uefi boot order, to conform to what the uefi spec expects.
(In reply to Michael Chang from comment #20)
> @ Alvaro,
> Is it possible to rearrange the order in the openQA test ? Either having
> "Boot0002: UEFI Misc Device" on top of the list, or at the second place
> after "Boot0001: UEFI QEMU QEMU CD-ROM 0x0001" that the "Boot from Hard
> Disk" can fall through to it thus avoid jumping through the hoops of UiApp
> ...
>
> Thanks.

I figure I can do it before the test starts (i.e., before the migration), but not after, as the issue in the grub_test step is precisely that the test code cannot get into the FW menu.

I also tried (with build 154.1) to skip getting into the FW menu after the migration and boot directly, but this didn't work. I am not sure whether that is due to the same issue or due to bsc#1022064 (as described in the workaround).
Created attachment 846667 [details]
Boot Order

Configured the boot order as shown in the attached screenshot for the test in http://mango.qa.suse.de/tests/3575, however as can be seen it's failing in the grub_test test module right after migration.

It does not look as if the test itself is changing back this boot order, but if necessary I can re-trigger and save a video of the test.
(In reply to Alvaro Carvajal from comment #23)
> It does not look as if the test itself is changing back this boot order, but
> if necessary I can re-trigger and save a video of the test.

It seems to always try to load UiApp after grub exits, regardless of how you specify the boot order. My test reveals the same problem. Besides, I captured the crash dump, which indicates that UiApp is really at fault.

Unloading driver at 0x00078328000

Synchronous Exception at 0x000000007F5C60D8

Synchronous Exception at 0x000000007F5C60D8
PC 0x00007F5C60D8 (0x00007F5A7000+0x0001F0D8) [ 0] DxeCore.dll
PC 0x00007F5B6D20 (0x00007F5A7000+0x0000FD20) [ 0] DxeCore.dll
PC 0x00007F5B7844 (0x00007F5A7000+0x00010844) [ 0] DxeCore.dll
PC 0x00007F5ADF88 (0x00007F5A7000+0x00006F88) [ 0] DxeCore.dll
PC 0x00007F5AE8D0 (0x00007F5A7000+0x000078D0) [ 0] DxeCore.dll
PC 0x000078404494 (0x0000783FA000+0x0000A494) [ 1] UiApp.dll
PC 0x000078408948 (0x0000783FA000+0x0000E948) [ 1] UiApp.dll
PC 0x00007BADF8C8 (0x00007BACA000+0x000158C8) [ 2] SetupBrowser.dll
PC 0x00007BAD4E94 (0x00007BACA000+0x0000AE94) [ 2] SetupBrowser.dll
PC 0x000078401FE8 (0x0000783FA000+0x00007FE8) [ 3] UiApp.dll
PC 0x00007F5AE658 (0x00007F5A7000+0x00007658) [ 4] DxeCore.dll
PC 0x00007B9DBF54 (0x00007B9D5000+0x00006F54) [ 5] BdsDxe.dll
PC 0x00007B9DF1D0 (0x00007B9D5000+0x0000A1D0) [ 5] BdsDxe.dll
PC 0x00007F5B1E08 (0x00007F5A7000+0x0000AE08) [ 6] DxeCore.dll

[ 0] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 1] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Application/UiApp/UiApp/DEBUG/UiApp.dll
[ 2] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/SetupBrowserDxe/SetupBrowserDxe/DEBUG/SetupBrowser.dll
[ 3] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Application/UiApp/UiApp/DEBUG/UiApp.dll
[ 4] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 5] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/BdsDxe/BdsDxe/DEBUG/BdsDxe.dll
[ 6] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll

  X0 0x0000000078328000  X1 0x00000000000D0FF0  X2 0xAFAFAFAFAFAFAFAF  X3 0x0000000078329010
  X4 0x00000000783FA000  X5 0x000000007F5D16C8  X6 0x0000000070616D6D  X7 0x0000000000000000
  X8 0x000000007BFFF508  X9 0x0000000700000000 X10 0x0000000078730000 X11 0x0000000078A81FFF
 X12 0x0000000000000000 X13 0x0000000000000008 X14 0x0000000000000000 X15 0x0000000000000000
 X16 0x000000007BAA8250 X17 0x0000000025239427 X18 0x0000000038CC9B8B X19 0x0000000078328000
 X20 0x00000000783F9FFF X21 0x00000000783F9FFF X22 0x0000000000000000 X23 0x00000000000000D2
 X24 0x0000000000000001 X25 0x0000000000000007 X26 0x000000007BFFFE40 X27 0x00000000783FA000
 X28 0x0000000000000007  FP 0x000000007F5A6070  LR 0x000000007F5B6D20

  V0 0xAFAFAFAFAFAFAFAF AFAFAFAFAFAFAFAF  V1 0x5F3832315F534541 5F534C543A343833
  V2 0x213A4C4C41003635 324148535F4D4347  V3 0x0000000000000000 0000000040000000
  V4 0x0010000000000000 0000000000000000  V5 0x4010040140100401 4010040140100401
  V6 0x1000000000000040 1000000000000040  V7 0x0000000000000000 0000000000000000
  V8 0x0000000000000000 0000000000000000  V9 0x0000000000000000 0000000000000000
 V10 0x0000000000000000 0000000000000000 V11 0x0000000000000000 0000000000000000
 V12 0x0000000000000000 0000000000000000 V13 0x0000000000000000 0000000000000000
 V14 0x0000000000000000 0000000000000000 V15 0x0000000000000000 0000000000000000
 V16 0x0000000000000000 0000000000000000 V17 0x0000000000000000 0000000000000000
 V18 0x0000000000000000 0000000000000000 V19 0x0000000000000000 0000000000000000
 V20 0x0000000000000000 0000000000000000 V21 0x0000000000000000 0000000000000000
 V22 0x0000000000000000 0000000000000000 V23 0x0000000000000000 0000000000000000
 V24 0x0000000000000000 0000000000000000 V25 0x0000000000000000 0000000000000000
 V26 0x0000000000000000 0000000000000000 V27 0x0000000000000000 0000000000000000
 V28 0x0000000000000000 0000000000000000 V29 0x0000000000000000 0000000000000000
 V30 0x0000000000000000 0000000000000000 V31 0x0000000000000000 0000000000000000

  SP 0x000000007F5A6070 ELR 0x000000007F5C60D8 SPSR 0x20000205 FPSR 0x00000000
 ESR 0x9600004F FAR 0x0000000078329000

 ESR : EC 0x25 IL 0x1 ISS 0x0000004F

Data abort: Permission fault, third level

Stack dump:
  000007F5A5F70: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5F90: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FB0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FD0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FF0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A6010: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A6030: 0000000000000000 0000000000000000 000000007BAACA14 0000000040000304
  000007F5A6050: 0000000000000000 000000009600004F 0000000078329000 00000000783F9FFF
> 000007F5A6070: 000000007F5A6110 000000007F5B7844 00000000000000D2 0000000078328000
  000007F5A6090: 0000000000000001 00000000000000D2 000000007F5C71B1 000000007F5C90A3
  000007F5A60B0: 000000007F5CF000 0000000070616D6D 0000000044525049 0000000000000000
  000007F5A60D0: 000000007F5D0088 0000000178328000 000000007F5CF2D0 0000000000000000
  000007F5A60F0: 0000000100000000 0000000000000007 0000000000000150 000000007F5AD698
  000007F5A6110: 000000007F5A6160 000000007F5ADF88 0000000078AB3698 000000007F5D0000
  000007F5A6130: 000000007F5D0800 000000007B61F2F0 0000000000000828 0000000000000018
  000007F5A6150: 0000000078414940 0000000000000000 000000007F5A6200 000000007F5AE8D0
ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/ArmPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c(273): ((BOOLEAN)(0==1))
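
The ESR value and the image-relative PC in the dump above can be re-derived by hand; the sketch below does the bit slicing with plain shell arithmetic. The field positions (EC in bits 31:26, IL in bit 25, ISS in bits 24:0, and for data aborts the fault status code in ISS bits 5:0 and the write-not-read flag in ISS bit 6) follow the Arm architecture. The helper names esr_field and image_offset are invented here for illustration.

```shell
# Hypothetical helpers (names invented for this sketch).

# usage: esr_field VALUE SHIFT MASK  -> prints (VALUE >> SHIFT) & MASK as hex
esr_field() {
    printf '0x%X\n' $(( ($1 >> $2) & $3 ))
}

# usage: image_offset PC IMAGE_BASE  -> prints PC - IMAGE_BASE as hex
image_offset() {
    printf '0x%X\n' $(( $1 - $2 ))
}

esr=0x9600004F                          # "ESR 0x9600004F" from the dump
ec=$(esr_field "$esr" 26 0x3F)          # exception class, bits 31:26
il=$(esr_field "$esr" 25 0x1)           # instruction length flag, bit 25
iss=$(esr_field "$esr" 0 0x1FFFFFF)     # syndrome, bits 24:0
dfsc=$(esr_field "$iss" 0 0x3F)         # data fault status code, ISS bits 5:0
wnr=$(esr_field "$iss" 6 0x1)           # write-not-read, ISS bit 6

# First backtrace frame: PC 0x7F5C60D8 inside DxeCore.dll loaded at 0x7F5A7000
off=$(image_offset 0x7F5C60D8 0x7F5A7000)

echo "EC=$ec IL=$il ISS=$iss DFSC=$dfsc WnR=$wnr offset=$off"
```

EC 0x25 is a data abort taken without a change in exception level, and DFSC 0xF is the encoding for a level 3 permission fault, which matches the handler's "Permission fault, third level" line. WnR=1 says the faulting access was a write.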
And it is very odd that the nvram medium used by openQA looks to be ephemeral/volatile, so the entry is not persistent: every reboot resets it to default and always has to start from the cdrom ...

I didn't experience the problem; in my test it is persistent. I could reboot to the newly created entry after installation and thus didn't have to go through the cdrom again ...

I saw sles-secureboot and sles in the serial output; they were created by shim-install and grub2-install respectively. They persist after powering off and restarting the system. If I change the boot order, the reboot always reflects the change I made.

SetBootOrderFromQemu: setting BootOrder: success
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
Driver Options:
SysPrep Options:
Boot Options:
Boot000A: sles-secureboot 0x0001
Boot0004: sles 0x0001
Boot0002: UEFI Misc Device 2 0x0001
Boot0000: UiApp 0x0109
Boot0001: UEFI Misc Device 0x0001
Boot0003: EFI Internal Shell 0x0001
PlatformRecovery Options:
PlatformRecovery0000: Default PlatformRecovery 0x0001
[Bds]=============End Load Options Dumping=============

Moreover, both entries worked (secure boot disabled) to boot the linux system, but only if you type exit in sles-secureboot will you see the crash dump from UiApp, which coincides with the openQA result here. (This explains why it didn't happen for Snapshot 10.)
FWIW, I used this script to start qemu ...

#!/bin/bash

IMG=/root/qemu/disk1.qcow2
CDROM=/root/SLE-15-SP3-Full-aarch64-Build154.1-Media1.iso
IMG_FMT=${IMG##*.}
EFI="/usr/share/qemu/aavmf-aarch64-code.bin"
EFI_NVRAM="$PWD/aavmf-aarch64-vars.bin"

qemu-system-aarch64 -enable-kvm -m 1024 -cpu host -machine virt \
  -nographic \
  -device virtio-scsi-pci,id=scsi0 \
  -drive if=pflash,format=raw,unit=0,file=$EFI,readonly=on \
  -drive if=pflash,format=raw,unit=1,file=$EFI_NVRAM \
  -drive media=cdrom,if=none,id=cd0,format=raw,file=$CDROM \
  -device scsi-cd,drive=cd0,bus=scsi0.0 \
  -drive if=none,format=${IMG_FMT},file=${IMG},id=hd0 \
  -device virtio-blk-device,drive=hd0,bootindex=0 \
  -netdev type=user,id=vnet \
  -device virtio-net,netdev=vnet,mac=52:54:00:12:34:56

And

# rpm -qf /usr/share/qemu/aavmf-aarch64-code.bin
qemu-uefi-aarch64-202011-3.2.noarch
(In reply to Michael Chang from comment #26)
> FWIW, I used this script to start qemu ...
>
> #!/bin/bash
>
> IMG=/root/qemu/disk1.qcow2
> CDROM=/root/SLE-15-SP3-Full-aarch64-Build154.1-Media1.iso
> IMG_FMT=${IMG##*.}
> EFI="/usr/share/qemu/aavmf-aarch64-code.bin"
> EFI_NVRAM="$PWD/aavmf-aarch64-vars.bin"
>
> qemu-system-aarch64 -enable-kvm -m 1024 -cpu host -machine virt \
> -nographic \
> -device virtio-scsi-pci,id=scsi0 \
> -drive if=pflash,format=raw,unit=0,file=$EFI,readonly=on \
> -drive if=pflash,format=raw,unit=1,file=$EFI_NVRAM \
> -drive media=cdrom,if=none,id=cd0,format=raw,file=$CDROM \
> -device scsi-cd,drive=cd0,bus=scsi0.0 \
> -drive if=none,format=${IMG_FMT},file=${IMG},id=hd0 \
> -device virtio-blk-device,drive=hd0,bootindex=0 \
> -netdev type=user,id=vnet \
> -device virtio-net,netdev=vnet,mac=52:54:00:12:34:56
>
> And
> # rpm -qf /usr/share/qemu/aavmf-aarch64-code.bin
> qemu-uefi-aarch64-202011-3.2.noarch

Hello, for openQA, "bootindex=0" is set for the cdrom (or other devices). I am not sure whether there is any relationship with my previous bug, just FYI:
https://bugzilla.suse.com/show_bug.cgi?id=1180080
(In reply to Richard Fan from comment #27)
> (In reply to Michael Chang from comment #26)
> Hello, for openqa, there is "bootindex=0" set for cdrom(or other devices). I
> am not sure if any relationship with my previous bug, just fyi
> https://bugzilla.suse.com/show_bug.cgi?id=1180080

Yes, that bootindex= is significant and can probably explain the problem here.

If I remove the bootindex attached to hd0, booting fails and eventually lands in the grub shell. Then I have to type 'exit' to get to the boot menu, and from there select sles or sles-secureboot to boot. It worked.

It appears to me that bootindex is used to hint the boot device to qemu; if it is not specified, the device is skipped and thus is not visible to the firmware/ovmf. This has the benefit of speeding up device discovery, as only a few (known) devices and subsystems have to be initialized. When you enter the ovmf menu, it triggers a full device rescan, so all devices are iterated and usable. Then you can boot the "missing" device from the boot manager.

I'm not sure whether openQA attaches bootindex to the target disk ?

Thanks.
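
For illustration, one way to make both devices visible to the firmware is to give each of them an explicit bootindex (lower values are tried first, and each value may be used only once). The sketch below only prints the two -device arguments; it does not start qemu. The drive ids cd0 and hd0 and the scsi0.0 bus are taken from the script in comment #26, while the helper name boot_device_args and the choice of indices 0 and 1 are assumptions, not what openQA actually configures.

```shell
# Hypothetical helper (name invented for this sketch): emit -device arguments
# that give both the CD-ROM and the target disk an explicit bootindex.
# usage: boot_device_args CD_INDEX HD_INDEX
boot_device_args() {
    cd_index=$1
    hd_index=$2
    printf '%s\n' \
        "-device scsi-cd,drive=cd0,bus=scsi0.0,bootindex=${cd_index}" \
        "-device virtio-blk-device,drive=hd0,bootindex=${hd_index}"
}

# CD-ROM first, installed disk second
boot_device_args 0 1
```

The matching -drive definitions for cd0 and hd0 still have to be passed as in the original script.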
(In reply to Michael Chang from comment #28)
> (In reply to Richard Fan from comment #27)
> > (In reply to Michael Chang from comment #26)
>
> > Hello, for openqa, there is "bootindex=0" set for cdrom(or other devices). I
> > am not sure if any relationship with my previous bug, just fyi
> > https://bugzilla.suse.com/show_bug.cgi?id=1180080
>
> Yes that bootindex= is significant, probably can be used to explain the
> problem here.
>
> If I remove bootindex attached to the hd0, the booting failed and eventually
> landed in the grub shell. Then I have to type 'exit' to the boot menu, and
> from there selecting the sles or sles-secureboot to boot. It worked.
>
> It appeared to me that the bootindex is used to hint the qemu the boot
> device, if not specified then the device would be skipped thus is not
> visible to to the firmware/ovmf. This has the benefit of speeding up the
> device discovery, as only a few (known) device and subsystem has to be
> initialized. When you enter the ovmf menu, it would triggerd a full device
> rescan and therefore all devices are iterated and usable. Then you could
> boot the "missing" device from the boot manager.
>
> I'm not sure whether openQA attached bootindex to the target disk ?
>
> Thanks.

Hi Michael,

I found an easy way to reproduce the issue, and it seems the issue may have something to do with the ISO image (rather than the upgraded system, but I am not 100% sure).

However, please ignore my message if you have already reproduced the issue in a simple way.

==================================================

I compared the 156.3 and 150.1 ISO images with the same "hd" image; only 156.3 hits the issue.

# /usr/bin/qemu-img create -f qcow2 -b SLE-15-SP3-Full-aarch64-Build156.3-Media1.iso cd0-overlay0 9172019200

# /usr/bin/qemu-system-aarch64 \
  -m 1024 \
  -machine virt,usb=off,gic-version=2,its=off \
  -cpu host \
  -netdev user,id=qanet0 \
  -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 \
  -boot menu=on,splash-time=5000 \
  -smp 2 \
  -enable-kvm \
  -vnc :91 \
  -monitor stdio \
  -device virtio-scsi-pci,id=scsi0 \
  -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/libvirt/images/SLES-15-SP3-aarch64-Build156.3@aarch64-gnome.qcow2,cache.no-flush=on \
  -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on \
  -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,serial=hd0 \
  -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/libvirt/images/cd0-overlay0,cache.no-flush=on \
  -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on \
  -device scsi-cd,id=cd0-device,drive=cd0-overlay0,bootindex=0,serial=cd0 \
  -drive id=pflash-code-overlay0,if=pflash,file=/home/aavmf-aarch64-code.bin,readonly=on \
  -drive id=pflash-vars-overlay0,if=pflash,file=/home/aavmf-aarch64-vars.bin,unit=1,format=raw

Once you get the boot/install menu, type "c" to get to the grub command line, then type "exit" to reproduce the issue. The expected result is to enter the "UEFI BIOS".
The crash seems to be caused by shim. Shim modifies the Loaded Image handle for the second stage bootloader. If the second stage bootloader just returns, without calling Exit(), shim restores the Loaded Image handle. However, shim didn't do the restoration when handling Exit() from the second stage bootloader. OVMF seems to be fine with that. AAVMF, on the other hand, probably does some additional check or clean-up, so it would need the original Loaded Image handle. Will dig into it further.
(In reply to Gary Ching-Pang Lin from comment #31)
> The crash seems caused by shim. Shim modifies the Loaded Image handle for
> the second stage bootloader. If the second stage bootloader just returns,
> not Exit(), shim restores the Loaded Image handle. However, shim didn't do
> the restoration when handling Exit() from the second stage bootloader. OVMF
> seems alright to live with it. On the other hand, AAVMF probably did some
> additional check or clean-up, so it would need the original Loaded Image
> handle. Will dig it further.

Hi Gary.

Great job. So I thought it is time to reassign as now we are investigating the fix in the shim layer. Feel free to ask if you need anything here.

Thanks.
Submitted the patch upstream for further review.

https://github.com/rhboot/shim/pull/306
I created https://bugzilla.suse.com/show_bug.cgi?id=1183213 to track it for Tumbleweed where the problem is also present on upgrade tests.
This is an autogenerated message for OBS integration: This bug (1182776) was mentioned in https://build.opensuse.org/request/show/877920 Factory / shim
Fixes merged and resolved with RC1 candidate
This is an autogenerated message for openQA integration by the openqa_review script: This bug is still referenced in a failing openQA test: offline_slehpc15sp1_espos_scc_basesys-desk-dev-hpc-python2-srv-wsm_def_full_tm https://openqa.suse.de/tests/5620567 To prevent further reminder comments one of the following options should be followed: 1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted 2. The openQA job group is moved to "Released" 3. The label in the openQA scenario is removed
This is an autogenerated message for openQA integration by the openqa_review script: This bug is still referenced in a failing openQA test: activate_encrypted_volume https://openqa.suse.de/tests/5846206 To prevent further reminder comments one of the following options should be followed: 1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted 2. The openQA job group is moved to "Released" 3. The label in the openQA scenario is removed