Bug 1182776 - [Build 154.1] System cannot boot after migration: Synchronous Exception at 0x000000007F5DF438
Summary: [Build 154.1] System cannot boot after migration: Synchronous Exception at 0x...
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP3
Classification: SUSE Linux Enterprise Server
Component: Bootloader
Version: Public Beta
Hardware: armv5 Other
Importance: P1 - Urgent : Normal
Target Milestone: unspecified
Assignee: Gary Ching-Pang Lin
QA Contact: Jiri Srain
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 1183213
Reported: 2021-02-25 17:58 UTC by Alvaro Carvajal
Modified: 2024-05-16 07:40 UTC (History)
CC List: 13 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
Serial Console (113.46 KB, application/gzip)
2021-02-25 17:58 UTC, Alvaro Carvajal
Secure Boot Configuration (9.63 KB, image/png)
2021-02-26 14:47 UTC, Alvaro Carvajal
Boot Order (13.17 KB, image/png)
2021-03-02 14:43 UTC, Alvaro Carvajal

Description Alvaro Carvajal 2021-02-25 17:58:19 UTC
Created attachment 846545 [details]
Serial Console

* Platform and arch: aarch64, Virtual System

* OS Version: SLES+HA 15-SP3 Public Beta (Build 154.1), Full media
  - Installed via: ISO
  - Using procedure: manual, by openQA

* LOGS: serial console log

* Results:
  - Expected: aarch64 system in a previous OS version is successfully migrated to 15-SP3.
  - Real: after offline migration finishes, system is not able to boot

* Reproducible: yes

* openQA:
  * Link to failed tests:
https://openqa.suse.de/tests/5523702#step/grub_test/322 (migration from 15-LTSS)
https://openqa.suse.de/tests/5523689#step/grub_test/322 (migration from 15-SP1-LTSS)
https://openqa.suse.de/tests/5523697#step/grub_test/322 (migration from 15-SP2)
  * Link to last successful runs:
https://openqa.suse.de/tests/5475230 (Snapshot 10/Build 150.1, migration from 15-LTSS)
https://openqa.suse.de/tests/5475217 (Snapshot 10/Build 150.1, migration from 15-SP1-LTSS)
https://openqa.suse.de/tests/5475225 (Snapshot 10/Build 150.1, migration from 15-SP2)

* Description: when migrating VMs from 15-LTSS, 15-SP1-LTSS or 15-SP2-LTSS to 15-SP3 using textmode migration and the Full medium, the migration process itself finishes successfully, but the VM does not come back online when it is restarted afterwards. The serial console from the openQA test (attached) shows the following messages right after the ISO image grub menu is loaded:

Synchronous Exception at 0x000000007F5DF438
AllocatePool: failed to allocate 688 bytes
ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdePkg/Library/UefiLib/UefiLibPrint.c(203): Buffer != ((void *) 0)

Additionally, the same migration test was run twice today on the same hypervisor at the same time: one run migrated the VM to Snapshot 10, and the other migrated the VM to the Public Beta candidate. Migration to Snapshot 10 succeeded, while migration to the Public Beta candidate failed as described:

Migration to Snapshot 10: http://mango.qa.suse.de/tests/3556
Migration to Public Beta: http://mango.qa.suse.de/tests/3555
Comment 1 Libor Pechacek 2021-02-26 11:56:56 UTC
It looks to me that the exception appears before the kernel gets loaded into memory. I.e. it comes from Grub(?). Michael, can you please glance at the serial log and share your thoughts on the matter?
Comment 2 Michael Chang 2021-02-26 13:10:06 UTC
(In reply to Libor Pechacek from comment #1)
> It looks to me that the exception appears before the kernel gets loaded into
> memory. I.e. it comes from Grub(?). Michael, can you please glance at the
> serial log and share your thoughts on the matter?

It is an assertion failure in AAVMF, so that has to be sorted out on the virtual machine firmware side first.

Synchronous Exception at 0x000000007F5DF438
AllocatePool: failed to allocate 472 bytes
ASSERT [HiiDatabase] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdeModulePkg/Universal/HiiDatabaseDxe/Font.c(1686): Cell != ((void *) 0)
Comment 3 Michael Chang 2021-02-26 13:23:34 UTC
(In reply to Alvaro Carvajal from comment #0)

> Additionally, the same migration test was run today on the same hypervisor
> at the same time, one migrating the VM to Snapshot 10, and the other
> migrating the VM to Public Beta candidate. Migration to Snapshot 10 was
> possible while migration to Public Beta candidate failed as described:
> 
> Migration to Snapshot 10: http://mango.qa.suse.de/tests/3556
> Migration to Public Beta: http://mango.qa.suse.de/tests/3555

If migration to Snapshot 10 and earlier works, then I wonder whether it is really a regression in grub, because its aarch64 port has not changed for nearly a month.

* Wed Jan 27 2021 mchang@suse.com
- Complete Secure Boot support on aarch64 (jsc#SLE-15020)
  * 0001-Add-support-for-Linux-EFI-stub-loading-on-aarch64.patch
  * 0002-arm64-make-sure-fdt-has-address-cells-and-size-cells.patch
  * 0003-Make-grub_error-more-verbose.patch
  * 0004-arm-arm64-loader-Better-memory-allocation-and-error-.patch
  * 0005-Make-linux_arm_kernel_header.hdr_offset-be-at-the-ri.patch
  * 0006-efi-Set-image-base-address-before-jumping-to-the-PE-.patch
  * 0007-linuxefi-fail-kernel-validation-without-shim-protoco.patch
  * 0008-squash-Add-support-for-Linux-EFI-stub-loading-on-aar.patch
  * 0009-squash-Add-support-for-linuxefi.patch
Comment 4 Michael Chang 2021-02-26 13:59:41 UTC
Looks to be shim related. The serial log for Snapshot 10 had no shim in place, while in the Public Beta shim was loaded by aavmf before grub.

Given that the problem emerged in the wake of the recent secure boot integration by YaST, which made shim really effective in the boot process for aarch64, maybe we can look in this direction first.

CC Gary and Joey.
Comment 5 Michael Chang 2021-02-26 14:01:44 UTC
Is secure boot enabled in "firmware" or not ?
Comment 6 Alvaro Carvajal 2021-02-26 14:29:30 UTC
(In reply to Michael Chang from comment #5)
> Is secure boot enabled in "firmware" or not ?

I do not think so. I see these messages in the serial console when the test is starting:

Variable SecureBoot is 0
Variable SecureBootEnable is 0

And then I see this:

[    1.357581] ima: secureboot mode disabled

In the boot after the migration to 15-SP3.

To be sure, I will trigger a test to capture FW settings and get back to you.
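
A minimal sketch of how the firmware's Secure Boot state could be read back from a saved serial log, assuming the firmware prints "Variable SecureBoot is <n>" lines as quoted above. The function name `secureboot_state` and the log path argument are hypothetical and not part of openQA.

```shell
# Hypothetical helper: report the firmware Secure Boot state from a serial log.
# Assumes lines of the form "Variable SecureBoot is <n>" as seen in this report.
secureboot_state() {
    # Use the last reported value, in case the log spans several boots.
    # "Variable SecureBootEnable is ..." lines do not match this pattern.
    sb_value=$(sed -n 's/^.*Variable SecureBoot is \([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1)
    case "$sb_value" in
        0)  echo "disabled" ;;
        "") echo "unknown" ;;
        *)  echo "enabled" ;;
    esac
}
```

Called as `secureboot_state serial0.txt`, it prints one word, which makes it easy to compare runs against different builds.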
Comment 7 Alvaro Carvajal 2021-02-26 14:47:27 UTC
Created attachment 846576 [details]
Secure Boot Configuration

Got the attached screenshot from the following test:

http://mango.qa.suse.de/tests/3563 (ongoing as of now)

This is a clone of http://mango.qa.suse.de/tests/3559, itself a clone of http://mango.qa.suse.de/tests/3555 which I referenced while opening the bug.

As can be seen on the screenshot, Secure Boot State is disabled.
Comment 9 Alvaro Carvajal 2021-03-01 08:17:04 UTC
We are still seeing this with the new build 156.3:

https://openqa.suse.de/tests/5552630#step/grub_test/322

Also, with bsc#1182663 fixed, our media+SCC migration scenarios, which use the Online medium and SCC during migration, were able to perform the migration itself, but they also fail to boot after migration in the same manner:

https://openqa.suse.de/tests/5552981#step/grub_test/322

However, online migration scenarios with zypper are working:

https://openqa.suse.de/tests/5552985

All linked tests are for migrations from 15-SP2, but we're seeing the same issue in migrations from 15-GA and 15-SP1 as well.
Comment 10 Rodion Iafarov 2021-03-01 09:21:27 UTC
A couple of updates on this issue.
It's also visible on 64bit with UEFI enabled. I've tried to disable secure boot, but it didn't help.

Also, the hard drive with the installed system is not listed in tianocore, but when I booted from the "Misc devices" entry, I was able to boot with secure boot enabled, and got the following on the serial console:

Creating boot entry "Boot0009" with label "sles-secureboot" for file "\EFI\sles\shim.efi"

Welcome to GRUB!

Please press 't' to show the boot menu on this console

Create new secret key
Failed to generate secret key: EFI_NOT_FOUND

The same trick did not work at all with secure boot disabled: grub got stuck at the kernel loading step.
Comment 11 Stefan Weiberg 2021-03-01 09:49:37 UTC
Adjusting the component to bootloader, as recent analysis points towards grub2 and the addition of shim in aarch64 secure boot introduced with the recent yast2-bootloader update.
Comment 12 Rodion Iafarov 2021-03-01 11:46:34 UTC
Another update: on 64bit, this seems more related to https://bugzilla.suse.com/show_bug.cgi?id=1182749
When testing manually, if secure boot is disabled on the VM before the installation, it all works just fine. But if it is enabled during installation, disabling it afterwards doesn't help. The issues might still share the same root cause.
Comment 13 Michael Chang 2021-03-01 14:22:57 UTC
The Synchronous Exception is caused by assertions on null pointers returned by the uefi memory allocation service in aavmf. The source snippet relevant to

> Synchronous Exception at 0x000000007F5DF438
> AllocatePool: failed to allocate 688 bytes
> ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdePkg/Library/UefiLib/UefiLibPrint.c(203): Buffer != ((void *) 0)

is

>   Buffer = (CHAR16 *) AllocatePool(BufferSize);
>   ASSERT (Buffer != NULL);

while

> Synchronous Exception at 0x000000007F5DF438
> AllocatePool: failed to allocate 472 bytes
> ASSERT [HiiDatabase] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable201911/MdeModulePkg/Universal/HiiDatabaseDxe/Font.c(1686): Cell != ((void *) 0)

is

>   Cell = (EFI_HII_GLYPH_INFO *) AllocateZeroPool (StrLength * sizeof (EFI_HII_GLYPH_INFO));
>   ASSERT (Cell != NULL);

Can we try to allocate more memory to the guest and see if that helps ? The shim is loaded in front of grub, and will not be relinquished until the kernel calls ExitBootServices() ..

If that doesn't help, then there is probably infinite looping somewhere until OOM..
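
To collect every firmware assertion from a serial log together with the source location it points at, something like the following sketch could be used. It assumes the "ASSERT [Module] path(line): expression" layout seen in the messages above; the name `list_asserts` is hypothetical.

```shell
# Hypothetical helper: print "module source-file line" for each firmware ASSERT
# found in a serial log. Assumes the "ASSERT [Module] path(line): expr" layout.
list_asserts() {
    sed -n 's/^ASSERT \[\([^]]*\)\] \(.*\)(\([0-9][0-9]*\)): .*$/\1 \2 \3/p' "$1"
}
```

Running it on logs from a failing and a passing build gives a quick view of which modules hit a failed allocation.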
Comment 14 Michael Chang 2021-03-01 15:34:42 UTC
The failure is in the "Boot from Hard Disk" entry used in the grub_test step.

The "working" case appears to rely on a workaround: the test case seems to expect "Boot from Hard Disk" to (re)boot into the uefi menu rather than boot the hard disk. From there it selects the newly installed boot entry ...

Why not boot from the hard disk directly instead of starting from the cdrom ? The installation should have set the boot order to boot from the hard disk, so this does not look normal either.

I wonder whether "Boot from Hard Disk" is somehow broken with the new shim integration, as it now really finds fallback.efi and for some reason gets stuck there. At the same time the test case seems to be bogus, as it relies on unintended behavior and therefore never really performs the test: on "reboot" it loads the cdrom again rather than the system on the target disk.

Could openQA help to improve the flow so that the test result can be understood more clearly ? I thought "Boot from Hard Disk" also failed in the "working" case, given that it didn't really boot the disk.

menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class gnu --class os {
  if search --no-floppy --file /efi/boot/fallback.efi --set ; then
    for os in opensuse sles caasp ; do
      if [ -f /efi/$os/grub.efi ] ; then
        chainloader /efi/$os/grub.efi
      fi
    done
  fi
}

No shim so no fallback.efi, and the test case expects the search to fail ?
Comment 15 Alvaro Carvajal 2021-03-01 16:16:46 UTC
(In reply to Michael Chang from comment #13)
> Can we try to allocate more memory to the guest and see if that helps ? The
> shim is loaded in front of grub, and will not be relinquished until the kernel
> calls ExitBootServices() ..
> 
> If that doesn't help, then there is probably infinite looping somewhere until
> OOM..

I triggered a test with QEMURAM=2048 which is double what is configured for the tests in openqa.suse.de:

http://mango.qa.suse.de/tests/3564#step/grub_test/322

But it failed in the same step as before.
Comment 16 Michael Chang 2021-03-01 16:28:08 UTC
Please have a look to this bug report ..

https://bugzilla.suse.com/show_bug.cgi?id=1176967

Basically, we should avoid using "Boot from hard disk" to test boot the disk after installation, as the secure boot signkey may be different for the media and disk. (The shim in the media is used to provide the shim-lock, not the one from the disk).
Comment 17 Michael Chang 2021-03-01 16:30:04 UTC
(In reply to Alvaro Carvajal from comment #15)
> (In reply to Michael Chang from comment #13)

> http://mango.qa.suse.de/tests/3564#step/grub_test/322
> 
> But it failed in the same step as before.

Thanks a lot for the verification, then it is really something else ...
Comment 18 Michael Chang 2021-03-01 16:40:31 UTC
(In reply to Michael Chang from comment #14)

> menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> gnu --class os {
>   if search --no-floppy --file /efi/boot/fallback.efi --set ; then
>     for os in opensuse sles caasp ; do
>       if [ -f /efi/$os/grub.efi ] ; then
>         chainloader /efi/$os/grub.efi
>       fi
>     done
>   fi
> }

Hm. The arm build merely does 'exit' for "Boot from Hard Disk" ...

menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class gnu --class os {
  exit
}

IIRC, shim installs a hook function on the 'exit' of the loaded image .. If so, it might be related to the problem here, since it previously worked without shim.

Hi Gary,

Do you have any ideas/thoughts ?
Thanks.
Comment 19 Gary Ching-Pang Lin 2021-03-02 03:38:19 UTC
(In reply to Michael Chang from comment #18)
> (In reply to Michael Chang from comment #14)
> 
> > menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> > gnu --class os {
> >   if search --no-floppy --file /efi/boot/fallback.efi --set ; then
> >     for os in opensuse sles caasp ; do
> >       if [ -f /efi/$os/grub.efi ] ; then
> >         chainloader /efi/$os/grub.efi
> >       fi
> >     done
> >   fi
> > }
> 
> Hm. The arm build merely does 'exit' for "Boot from Hard Disk" ...
> 
> menuentry "Boot from Hard Disk" --class opensuse --class gnu-linux --class
> gnu --class os {
>   exit
> }
> 
> > IIRC, the shim has hook function to the 'exit' of the loaded image .. If so
> it might be related to the problem here, since previously it had worked
> without shim.
> 
> Hi Gary,
> 
> Did you have any idea/thoughts ?
> Thanks.

The exit hook from shim is quite simple: unhook the system services and call the actual BS->Exit().

Per serial0.txt, the boot option next to cdrom is the firmware menu (UiApp):

[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot0001: UEFI QEMU QEMU CD-ROM              0x0001
    Boot0000: UiApp              0x0109
    Boot0007: EFI Internal Shell                 0x0001
    Boot0002: UEFI Misc Device           0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery               0x0001
[Bds]=============End Load Options Dumping=============

I wonder if something went wrong in UiApp.
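
The same check can be scripted against a saved serial log. The sketch below prints the boot option that follows the CD-ROM entry in the [Bds] dump, assuming the dump layout shown above. The name `next_after_cdrom` is hypothetical.

```shell
# Hypothetical helper: print the boot option listed right after the CD-ROM
# entry in a "[Bds] ... Load Options Dumping" section of a serial log.
next_after_cdrom() {
    awk '
        /Boot Options:/             { in_boot = 1; next }
        /PlatformRecovery Options:/ { in_boot = 0 }
        in_boot && /Boot[0-9A-Fa-f]+:/ {
            if (seen_cd) {
                sub(/^[ \t]+/, "")                      # drop indentation
                sub(/[ \t]+0x[0-9A-Fa-f]+[ \t]*$/, "")  # drop attribute column
                print
                exit
            }
            if ($0 ~ /CD-ROM/) seen_cd = 1
        }
    ' "$1"
}
```

For the dump quoted above this yields "Boot0000: UiApp", the entry the firmware would try next once grub on the cdrom exits.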
Comment 20 Michael Chang 2021-03-02 06:25:49 UTC
(In reply to Gary Ching-Pang Lin from comment #19)
> (In reply to Michael Chang from comment #18)
> > (In reply to Michael Chang from comment #14)

> [Bds]=============Begin Load Options Dumping ...=============
>   Driver Options:
>   SysPrep Options:
>   Boot Options:
>     Boot0001: UEFI QEMU QEMU CD-ROM              0x0001
>     Boot0000: UiApp              0x0109
>     Boot0007: EFI Internal Shell                 0x0001
>     Boot0002: UEFI Misc Device           0x0001
>   PlatformRecovery Options:
>     PlatformRecovery0000: Default PlatformRecovery               0x0001
> [Bds]=============End Load Options Dumping=============
> 
> I wonder if something went wrong in UiApp.

Yes at least the exception looked like trouble in drawing the menu.

@ Alvaro,
Is it possible to rearrange the boot order in the openQA test ? Either put "Boot0002: UEFI Misc Device" at the top of the list, or in second place after "Boot0001: UEFI QEMU QEMU CD-ROM  0x0001", so that "Boot from Hard Disk" can fall through to it and thus avoid jumping through the hoops of UiApp ...

Thanks.
Comment 21 Michael Chang 2021-03-02 09:26:07 UTC
Surprisingly, "Boot from Hard Disk" failed for x86_64 efi as well ... It entered the uefi menu properly, but since it didn't boot the installed disk the result is deemed a failure.

The "entering the uefi menu" behavior matches the arm behavior here; maybe the x86 code path has accidentally changed, so it is not a coincidence. I wouldn't mind testing the "Boot from Hard Disk" functionality, but it would be better done in a separate test that is not on the migration or major path. The reboot after installation should go straight to the newly installed boot entry, as indicated by the uefi boot order, to conform to what the uefi spec expects.
Comment 22 Alvaro Carvajal 2021-03-02 13:17:30 UTC
(In reply to Michael Chang from comment #20)
> @ Alvaro,
> Is it possible to rearrange the order in the openQA test ? Either having
> "Boot0002: UEFI Misc Device" on top of the list, or at the second place
> after "Boot0001: UEFI QEMU QEMU CD-ROM  0x0001" that the "Boot from Hard
> Disk" can fall through to it thus avoid jumping through the hoops of UiApp
> ...
> 
> Thanks.

I figure I can do it before the test starts (i.e., before the migration), but not after, as precisely the issue in the grub_test step is that the test code cannot get into the FW menu. I also tried (with build 154.1) to skip getting into the FW menu after the migration and boot directly, but this didn't work. Not sure if due to the same issue or due to bsc#1022064 (as described in the work around).
Comment 23 Alvaro Carvajal 2021-03-02 14:43:25 UTC
Created attachment 846667 [details]
Boot Order

Configured the boot order as shown in the attached screenshot for the test in http://mango.qa.suse.de/tests/3575; however, as can be seen, it is still failing in the grub_test test module right after migration.

It does not look as if the test itself is changing back this boot order, but if necessary I can re-trigger and save a video of the test.
Comment 24 Michael Chang 2021-03-03 07:24:29 UTC
(In reply to Alvaro Carvajal from comment #23)

> It does not look as if the test itself is changing back this boot order, but
> if necessary I can re-trigger and save a video of the test.

It always seems to try to load uiapp after grub exits, regardless of how you specify the boot order. My test reveals the same problem.

Besides, I captured the crash dump, which indicates that uiapp is really at fault.

Unloading driver at 0x00078328000


Synchronous Exception at 0x000000007F5C60D8


Synchronous Exception at 0x000000007F5C60D8
PC 0x00007F5C60D8 (0x00007F5A7000+0x0001F0D8) [ 0] DxeCore.dll
PC 0x00007F5B6D20 (0x00007F5A7000+0x0000FD20) [ 0] DxeCore.dll
PC 0x00007F5B7844 (0x00007F5A7000+0x00010844) [ 0] DxeCore.dll
PC 0x00007F5ADF88 (0x00007F5A7000+0x00006F88) [ 0] DxeCore.dll
PC 0x00007F5AE8D0 (0x00007F5A7000+0x000078D0) [ 0] DxeCore.dll
PC 0x000078404494 (0x0000783FA000+0x0000A494) [ 1] UiApp.dll
PC 0x000078408948 (0x0000783FA000+0x0000E948) [ 1] UiApp.dll
PC 0x00007BADF8C8 (0x00007BACA000+0x000158C8) [ 2] SetupBrowser.dll
PC 0x00007BAD4E94 (0x00007BACA000+0x0000AE94) [ 2] SetupBrowser.dll
PC 0x000078401FE8 (0x0000783FA000+0x00007FE8) [ 3] UiApp.dll
PC 0x00007F5AE658 (0x00007F5A7000+0x00007658) [ 4] DxeCore.dll
PC 0x00007B9DBF54 (0x00007B9D5000+0x00006F54) [ 5] BdsDxe.dll
PC 0x00007B9DF1D0 (0x00007B9D5000+0x0000A1D0) [ 5] BdsDxe.dll
PC 0x00007F5B1E08 (0x00007F5A7000+0x0000AE08) [ 6] DxeCore.dll

[ 0] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 1] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Application/UiApp/UiApp/DEBUG/UiApp.dll
[ 2] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/SetupBrowserDxe/SetupBrowserDxe/DEBUG/SetupBrowser.dll
[ 3] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Application/UiApp/UiApp/DEBUG/UiApp.dll
[ 4] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll
[ 5] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Universal/BdsDxe/BdsDxe/DEBUG/BdsDxe.dll
[ 6] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/Build/ArmVirtQemu-AARCH64/DEBUG_GCC5/AARCH64/MdeModulePkg/Core/Dxe/DxeMain/DEBUG/DxeCore.dll

  X0 0x0000000078328000   X1 0x00000000000D0FF0   X2 0xAFAFAFAFAFAFAFAF   X3 0x0000000078329010
  X4 0x00000000783FA000   X5 0x000000007F5D16C8   X6 0x0000000070616D6D   X7 0x0000000000000000
  X8 0x000000007BFFF508   X9 0x0000000700000000  X10 0x0000000078730000  X11 0x0000000078A81FFF
 X12 0x0000000000000000  X13 0x0000000000000008  X14 0x0000000000000000  X15 0x0000000000000000
 X16 0x000000007BAA8250  X17 0x0000000025239427  X18 0x0000000038CC9B8B  X19 0x0000000078328000
 X20 0x00000000783F9FFF  X21 0x00000000783F9FFF  X22 0x0000000000000000  X23 0x00000000000000D2
 X24 0x0000000000000001  X25 0x0000000000000007  X26 0x000000007BFFFE40  X27 0x00000000783FA000
 X28 0x0000000000000007   FP 0x000000007F5A6070   LR 0x000000007F5B6D20  

  V0 0xAFAFAFAFAFAFAFAF AFAFAFAFAFAFAFAF   V1 0x5F3832315F534541 5F534C543A343833
  V2 0x213A4C4C41003635 324148535F4D4347   V3 0x0000000000000000 0000000040000000
  V4 0x0010000000000000 0000000000000000   V5 0x4010040140100401 4010040140100401
  V6 0x1000000000000040 1000000000000040   V7 0x0000000000000000 0000000000000000
  V8 0x0000000000000000 0000000000000000   V9 0x0000000000000000 0000000000000000
 V10 0x0000000000000000 0000000000000000  V11 0x0000000000000000 0000000000000000
 V12 0x0000000000000000 0000000000000000  V13 0x0000000000000000 0000000000000000
 V14 0x0000000000000000 0000000000000000  V15 0x0000000000000000 0000000000000000
 V16 0x0000000000000000 0000000000000000  V17 0x0000000000000000 0000000000000000
 V18 0x0000000000000000 0000000000000000  V19 0x0000000000000000 0000000000000000
 V20 0x0000000000000000 0000000000000000  V21 0x0000000000000000 0000000000000000
 V22 0x0000000000000000 0000000000000000  V23 0x0000000000000000 0000000000000000
 V24 0x0000000000000000 0000000000000000  V25 0x0000000000000000 0000000000000000
 V26 0x0000000000000000 0000000000000000  V27 0x0000000000000000 0000000000000000
 V28 0x0000000000000000 0000000000000000  V29 0x0000000000000000 0000000000000000
 V30 0x0000000000000000 0000000000000000  V31 0x0000000000000000 0000000000000000

  SP 0x000000007F5A6070  ELR 0x000000007F5C60D8  SPSR 0x20000205  FPSR 0x00000000
 ESR 0x9600004F          FAR 0x0000000078329000

 ESR : EC 0x25  IL 0x1  ISS 0x0000004F

Data abort: Permission fault, third level

Stack dump:
  000007F5A5F70: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5F90: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FB0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FD0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A5FF0: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A6010: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
  000007F5A6030: 0000000000000000 0000000000000000 000000007BAACA14 0000000040000304
  000007F5A6050: 0000000000000000 000000009600004F 0000000078329000 00000000783F9FFF
> 000007F5A6070: 000000007F5A6110 000000007F5B7844 00000000000000D2 0000000078328000
  000007F5A6090: 0000000000000001 00000000000000D2 000000007F5C71B1 000000007F5C90A3
  000007F5A60B0: 000000007F5CF000 0000000070616D6D 0000000044525049 0000000000000000
  000007F5A60D0: 000000007F5D0088 0000000178328000 000000007F5CF2D0 0000000000000000
  000007F5A60F0: 0000000100000000 0000000000000007 0000000000000150 000000007F5AD698
  000007F5A6110: 000000007F5A6160 000000007F5ADF88 0000000078AB3698 000000007F5D0000
  000007F5A6130: 000000007F5D0800 000000007B61F2F0 0000000000000828 0000000000000018
  000007F5A6150: 0000000078414940 0000000000000000 000000007F5A6200 000000007F5AE8D0
ASSERT [ArmCpuDxe] /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202011/ArmPkg/Library/DefaultExceptionHandlerLib/AArch64/DefaultExceptionHandler.c(273): ((BOOLEAN)(0==1))
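
The ESR value in the dump can be decoded by hand. The sketch below splits it into the architectural fields (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]; for data aborts, WnR is ISS bit 6 and DFSC is ISS bits [5:0]) and also computes the offset of a PC inside its image. The function names `decode_esr` and `image_offset` are hypothetical.

```shell
# Decode the fields of an AArch64 ESR_ELx value for a data abort.
# Layout: EC = bits [31:26], IL = bit 25, ISS = bits [24:0];
# for data aborts, WnR = ISS bit 6 and DFSC = ISS bits [5:0].
decode_esr() {
    esr=$(( $1 ))
    ec=$(( (esr >> 26) & 0x3F ))
    il=$(( (esr >> 25) & 0x1 ))
    iss=$(( esr & 0x1FFFFFF ))
    wnr=$(( (iss >> 6) & 0x1 ))
    dfsc=$(( iss & 0x3F ))
    printf 'EC=0x%02X IL=%d ISS=0x%08X WnR=%d DFSC=0x%02X\n' \
        "$ec" "$il" "$iss" "$wnr" "$dfsc"
}

# Offset of a PC inside its image: PC minus the image load base.
image_offset() {
    printf '0x%X\n' $(( $1 - $2 ))
}
```

`decode_esr 0x9600004F` prints `EC=0x25 IL=1 ISS=0x0000004F WnR=1 DFSC=0x0F`, which agrees with the "ESR : EC 0x25 IL 0x1 ISS 0x0000004F" line: DFSC 0x0F is the level 3 permission fault reported in the dump, and WnR=1 marks the faulting access to FAR 0x78329000 as a write. `image_offset 0x7F5C60D8 0x7F5A7000` prints `0x1F0D8`, the DxeCore.dll offset shown in the first backtrace frame.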
Comment 25 Michael Chang 2021-03-03 07:39:59 UTC
And it is very odd, as the nvram medium used by openQA looks to be ephemeral/volatile, so the entry is not persistent: every reboot resets it to the default and always has to try the cdrom first ...

I didn't experience this problem; in my test the nvram is persistent. I could reboot to the newly created entry after installation and thus didn't have to go through the cdrom again ...

I also saw sles-secureboot and sles on the serial output; they were created by shim-install and grub2-install respectively. They persist after powering off and restarting the system. If I change the boot order, the next reboot always reflects the change I made.

SetBootOrderFromQemu: setting BootOrder: success
[Bds]OsIndication: 0000000000000000
[Bds]=============Begin Load Options Dumping ...=============
  Driver Options:
  SysPrep Options:
  Boot Options:
    Boot000A: sles-secureboot            0x0001
    Boot0004: sles               0x0001
    Boot0002: UEFI Misc Device 2                 0x0001
    Boot0000: UiApp              0x0109
    Boot0001: UEFI Misc Device           0x0001
    Boot0003: EFI Internal Shell                 0x0001
  PlatformRecovery Options:
    PlatformRecovery0000: Default PlatformRecovery               0x0001
[Bds]=============End Load Options Dumping=============

Moreover, both entries worked (secure boot disabled) to boot the linux system, but only if you type exit in sles-secureboot will you see the crash dump from uiapp, which coincides with the openQA result here. (This explains why it didn't happen for Snapshot 10.)
Comment 26 Michael Chang 2021-03-03 07:46:17 UTC
FWIW, I used this script to start qemu ...

#!/bin/bash

IMG=/root/qemu/disk1.qcow2
CDROM=/root/SLE-15-SP3-Full-aarch64-Build154.1-Media1.iso
IMG_FMT=${IMG##*.}
EFI="/usr/share/qemu/aavmf-aarch64-code.bin"
EFI_NVRAM="$PWD/aavmf-aarch64-vars.bin"

qemu-system-aarch64 -enable-kvm -m 1024 -cpu host -machine virt \
-nographic \
-device virtio-scsi-pci,id=scsi0 \
-drive if=pflash,format=raw,unit=0,file=$EFI,readonly=on \
-drive if=pflash,format=raw,unit=1,file=$EFI_NVRAM \
-drive media=cdrom,if=none,id=cd0,format=raw,file=$CDROM \
-device scsi-cd,drive=cd0,bus=scsi0.0 \
-drive if=none,format=${IMG_FMT},file=${IMG},id=hd0 \
-device virtio-blk-device,drive=hd0,bootindex=0 \
-netdev type=user,id=vnet \
-device virtio-net,netdev=vnet,mac=52:54:00:12:34:56

And
 # rpm -qf /usr/share/qemu/aavmf-aarch64-code.bin
 qemu-uefi-aarch64-202011-3.2.noarch
Comment 27 Richard Fan 2021-03-03 08:16:40 UTC
(In reply to Michael Chang from comment #26)
> FWIW, I used this script to start qemu ...
> 
> #!/bin/bash
> 
> IMG=/root/qemu/disk1.qcow2
> CDROM=/root/SLE-15-SP3-Full-aarch64-Build154.1-Media1.iso
> IMG_FMT=${IMG##*.}
> EFI="/usr/share/qemu/aavmf-aarch64-code.bin"
> EFI_NVRAM="$PWD/aavmf-aarch64-vars.bin"
> 
> qemu-system-aarch64 -enable-kvm -m 1024 -cpu host -machine virt \
> -nographic \
> -device virtio-scsi-pci,id=scsi0 \
> -drive if=pflash,format=raw,unit=0,file=$EFI,readonly=on \
> -drive if=pflash,format=raw,unit=1,file=$EFI_NVRAM \
> -drive media=cdrom,if=none,id=cd0,format=raw,file=$CDROM \
> -device scsi-cd,drive=cd0,bus=scsi0.0 \
> -drive if=none,format=${IMG_FMT},file=${IMG},id=hd0 \
> -device virtio-blk-device,drive=hd0,bootindex=0 \
> -netdev type=user,id=vnet \
> -device virtio-net,netdev=vnet,mac=52:54:00:12:34:56
> 
> And
>  # rpm -qf /usr/share/qemu/aavmf-aarch64-code.bin
>  qemu-uefi-aarch64-202011-3.2.noarch

Hello, for openQA, "bootindex=0" is set for the cdrom (or other devices). I am not sure whether there is any relationship with my previous bug, just FYI:
https://bugzilla.suse.com/show_bug.cgi?id=1180080
Comment 28 Michael Chang 2021-03-03 09:32:14 UTC
(In reply to Richard Fan from comment #27)
> (In reply to Michael Chang from comment #26)

> Hello, for openqa, there is "bootindex=0" set for cdrom(or other devices). I
> am not sure if any relationship with my previous bug, just fyi 
> https://bugzilla.suse.com/show_bug.cgi?id=1180080

Yes that bootindex= is significant, probably can be used to explain the problem here.

If I remove bootindex attached to the hd0, the booting failed and eventually landed in the grub shell. Then I have to type 'exit' to the boot menu, and from there selecting the sles or sles-secureboot to boot. It worked.

It appears to me that bootindex is used to hint the boot device to qemu; if it is not specified, the device is skipped and thus not visible to the firmware/ovmf. This has the benefit of speeding up device discovery, as only a few (known) devices and subsystems have to be initialized. When you enter the ovmf menu, it triggers a full device rescan, so all devices are enumerated and usable. Then you can boot the "missing" device from the boot manager.

I'm not sure whether openQA attached bootindex to the target disk ?

Thanks.
Comment 29 Richard Fan 2021-03-03 10:34:40 UTC
(In reply to Michael Chang from comment #28)
> (In reply to Richard Fan from comment #27)
> > (In reply to Michael Chang from comment #26)
> 
> > Hello, for openqa, there is "bootindex=0" set for cdrom(or other devices). I
> > am not sure if any relationship with my previous bug, just fyi 
> > https://bugzilla.suse.com/show_bug.cgi?id=1180080
> 
> Yes that bootindex= is significant, probably can be used to explain the
> problem here.
> 
> If I remove bootindex attached to the hd0, the booting failed and eventually
> landed in the grub shell. Then I have to type 'exit' to the boot menu, and
> from there selecting the sles or sles-secureboot to boot. It worked.
> 
> It appeared to me that the bootindex is used to hint the qemu the boot
> device, if not specified then the device would be skipped thus is not
> visible to the firmware/ovmf. This has the benefit of speeding up the
> device discovery, as only a few (known) device and subsystem has to be
> initialized. When you enter the ovmf menu, it would trigger a full device
> rescan and therefore all devices are iterated and usable. Then you could
> boot the "missing" device from the boot manager.
> 
> I'm not sure whether openQA attached bootindex to the target disk ?
> 
> Thanks.

Hi Michael,

I found an easy way to reproduce the issue, and it seems that the issue may have something to do with the ISO image (rather than the upgraded system, but I am not 100% sure).

However, please ignore my messages if you have already reproduced the issue in a simple way.

==================================================

I compared the 156.3 and 150.1 ISO images with the same "hd" image; only 156.3 hits the issue.

#/usr/bin/qemu-img create -f qcow2 -b SLE-15-SP3-Full-aarch64-Build156.3-Media1.iso cd0-overlay0 9172019200

#/usr/bin/qemu-system-aarch64 \
-m 1024 \
-machine virt,usb=off,gic-version=2,its=off \
-cpu host \
-netdev user,id=qanet0 \
-device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 \
-boot menu=on,splash-time=5000 \
-smp 2 \
-enable-kvm \
-vnc :91 \
-monitor stdio \
-device virtio-scsi-pci,id=scsi0 \
-blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/libvirt/images/SLES-15-SP3-aarch64-Build156.3@aarch64-gnome.qcow2,cache.no-flush=on \
-blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on \
-device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,serial=hd0 \
-blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/libvirt/images/cd0-overlay0,cache.no-flush=on \
-blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on \
-device scsi-cd,id=cd0-device,drive=cd0-overlay0,bootindex=0,serial=cd0 \
-drive id=pflash-code-overlay0,if=pflash,file=/home/aavmf-aarch64-code.bin,readonly=on \
-drive id=pflash-vars-overlay0,if=pflash,file=/home/aavmf-aarch64-vars.bin,unit=1,format=raw

Once you get the boot/install menu entry, type "c" to enter the grub shell, then type "exit" to reproduce the issue. The expected result is to enter the "UEFI BIOS".
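To make the comparison between ISO builds easier to repeat, the command above could be assembled by a small helper like the one below. This is only a sketch: build_qemu_argv and its parameter names are hypothetical, and every qemu option is copied unchanged from the command above. Only the hard disk image, the CD overlay and the firmware paths are parameters.

```python
import shlex

# Sketch only: build_qemu_argv is a hypothetical helper, not part of openQA.
# All qemu options are taken verbatim from the reproducer command.


def build_qemu_argv(hd_image, cd_overlay,
                    fw_code="/home/aavmf-aarch64-code.bin",
                    fw_vars="/home/aavmf-aarch64-vars.bin"):
    """Return the reproducer command line as an argv list."""
    return [
        "/usr/bin/qemu-system-aarch64",
        "-m", "1024",
        "-machine", "virt,usb=off,gic-version=2,its=off",
        "-cpu", "host",
        "-netdev", "user,id=qanet0",
        "-device", "virtio-net,netdev=qanet0,mac=52:54:00:12:34:56",
        "-boot", "menu=on,splash-time=5000",
        "-smp", "2",
        "-enable-kvm",
        "-vnc", ":91",
        "-monitor", "stdio",
        "-device", "virtio-scsi-pci,id=scsi0",
        # Hard disk: no bootindex is attached to hd0 in the reproducer.
        "-blockdev", "driver=file,node-name=hd0-overlay0-file,"
                     f"filename={hd_image},cache.no-flush=on",
        "-blockdev", "driver=qcow2,node-name=hd0-overlay0,"
                     "file=hd0-overlay0-file,cache.no-flush=on",
        "-device", "virtio-blk-device,id=hd0-device,"
                   "drive=hd0-overlay0,serial=hd0",
        # CD overlay backed by the ISO under test; it carries bootindex=0.
        "-blockdev", "driver=file,node-name=cd0-overlay0-file,"
                     f"filename={cd_overlay},cache.no-flush=on",
        "-blockdev", "driver=qcow2,node-name=cd0-overlay0,"
                     "file=cd0-overlay0-file,cache.no-flush=on",
        "-device", "scsi-cd,id=cd0-device,drive=cd0-overlay0,"
                   "bootindex=0,serial=cd0",
        "-drive", "id=pflash-code-overlay0,if=pflash,"
                  f"file={fw_code},readonly=on",
        "-drive", "id=pflash-vars-overlay0,if=pflash,"
                  f"file={fw_vars},unit=1,format=raw",
    ]


if __name__ == "__main__":
    # Print the command for one overlay; create the overlay with qemu-img first.
    argv = build_qemu_argv(
        "/var/lib/libvirt/images/"
        "SLES-15-SP3-aarch64-Build156.3@aarch64-gnome.qcow2",
        "/var/lib/libvirt/images/cd0-overlay0",
    )
    print(shlex.join(argv))
```

Running it once per ISO overlay (one backed by 156.3, one by 150.1) with the same "hd" image gives two commands that differ only in the CD path.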
Comment 30 Richard Fan 2021-03-03 10:37:05 UTC
(In reply to Richard Fan from comment #29)
> (In reply to Michael Chang from comment #28)
> > (In reply to Richard Fan from comment #27)
> > > (In reply to Michael Chang from comment #26)
> > 
> > > Hello, for openqa, there is "bootindex=0" set for cdrom(or other devices). I
> > > am not sure if any relationship with my previous bug, just fyi 
> > > https://bugzilla.suse.com/show_bug.cgi?id=1180080
> > 
> > Yes that bootindex= is significant, probably can be used to explain the
> > problem here.
> > 
> > If I remove the bootindex attached to hd0, booting fails and eventually
> > lands in the grub shell. Then I have to type 'exit' to get to the boot
> > menu, and from there select sles or sles-secureboot to boot. It worked.
> > 
> > It appears to me that bootindex is used to hint the boot device to qemu;
> > if it is not specified, the device is skipped and thus not visible to the
> > firmware/ovmf. This has the benefit of speeding up device discovery, as
> > only a few (known) devices and subsystems have to be initialized. When you
> > enter the ovmf menu, it triggers a full device rescan, so all devices are
> > iterated and usable. Then you can boot the "missing" device from the boot
> > manager.
> > 
> > I'm not sure whether openQA attaches bootindex to the target disk?
> > 
> > Thanks.
> 
> Hi Michael,
> 
> I found an easy way to reproduce the issue, and seems that the issue may have
> something to do with the ISO image (rather than the upgraded system, but I
> am not 100% sure)
> 
> However, please omit my messages if you have reproduced the issue as well in
> a simple way.
> 
> ==================================================
> 
> I did compare the 156.3 and 150.1 iso images with same "hd" image, only
> 156.3 can hit the issue.
> 
> #/usr/bin/qemu-img create -f qcow2 -b
> SLE-15-SP3-Full-aarch64-Build156.3-Media1.iso cd0-overlay0 9172019200
> 
> #/usr/bin/qemu-system-aarch64 \
> -m 1024 \
> -machine virt,usb=off,gic-version=2,its=off \
> -cpu host \
> -netdev user,id=qanet0 \
> -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 \
> -boot menu=on,splash-time=5000 \
> -smp 2 \
> -enable-kvm \
> -vnc :91 \
> -monitor stdio \
> -device virtio-scsi-pci,id=scsi0 \
> -blockdev
> driver=file,node-name=hd0-overlay0-file,filename=/var/lib/libvirt/images/
> SLES-15-SP3-aarch64-Build156.3@aarch64-gnome.qcow2,cache.no-flush=on \
> -blockdev
> driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on
> \
> -device virtio-blk-device,id=hd0-device,drive=hd0-overlay0,serial=hd0 \
> -blockdev
> driver=file,node-name=cd0-overlay0-file,filename=/var/lib/libvirt/images/cd0-
> overlay0,cache.no-flush=on \
> -blockdev
> driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on
> \
> -device scsi-cd,id=cd0-device,drive=cd0-overlay0,bootindex=0,serial=cd0 \
> -drive
> id=pflash-code-overlay0,if=pflash,file=/home/aavmf-aarch64-code.bin,
> readonly=on \
> -drive
> id=pflash-vars-overlay0,if=pflash,file=/home/aavmf-aarch64-vars.bin,unit=1,
> format=raw
> 
> Once you get the boot/install menu entry, type "c" to enter the grub shell,
> then type "exit" to reproduce the issue. The expected result is to enter
> the "UEFI BIOS".
Comment 31 Gary Ching-Pang Lin 2021-03-04 09:36:09 UTC
The crash seems to be caused by shim. Shim modifies the Loaded Image handle for the second stage bootloader. If the second stage bootloader just returns instead of calling Exit(), shim restores the Loaded Image handle. However, shim does not do the restoration when handling Exit() from the second stage bootloader. OVMF seems to tolerate this. AAVMF, on the other hand, probably does some additional check or clean-up, so it needs the original Loaded Image handle. Will dig into it further.
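The two ways the second stage can finish, as described above, can be illustrated with a toy model. This is not shim's code and does not claim to match the eventual patch: all names are hypothetical, the Loaded Image data is reduced to a single "owner" field, and the Exit() boot service is modelled as a Python exception.

```python
# Toy model only: names are hypothetical and nothing here is shim's real code.


class SecondStageExit(Exception):
    """Stands in for the second stage calling the Exit() boot service."""


class LoadedImage:
    """Reduced stand-in for the Loaded Image data that shim modifies."""

    def __init__(self, owner):
        self.owner = owner


def run_second_stage(image, second_stage, restore_on_exit):
    """Swap the Loaded Image data, run the second stage, then restore it.

    restore_on_exit=False models the behaviour described above (restore only
    on a plain return); restore_on_exit=True restores on both paths.
    """
    saved = image.owner
    image.owner = "second-stage"   # hand a modified handle to the second stage
    try:
        second_stage()
    except SecondStageExit:
        if restore_on_exit:
            image.owner = saved    # the restoration missing on the Exit() path
        return "exit"
    image.owner = saved            # plain return: always restored
    return "return"


def returns_normally():
    """Second stage that simply returns to its caller."""


def calls_exit():
    """Second stage that leaves through Exit()."""
    raise SecondStageExit()


if __name__ == "__main__":
    for stage in (returns_normally, calls_exit):
        for restore in (False, True):
            image = LoadedImage("shim")
            how = run_second_stage(image, stage, restore)
            print(stage.__name__, restore, how, image.owner)
```

In this model, only the combination of calls_exit with restore_on_exit=False leaves the firmware looking at the modified data, which corresponds to the grub "exit" reproducer.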
Comment 32 Michael Chang 2021-03-04 11:44:04 UTC
(In reply to Gary Ching-Pang Lin from comment #31)
> The crash seems caused by shim. Shim modifies the Loaded Image handle for
> the second stage bootloader. If the second stage bootloader just returns,
> not Exit(), shim restores the Loaded Image handle. However, shim didn't do
> the restoration when handling Exit() from the second stage bootloader. OVMF
> seems alright to live with it. On the other hand, AAVMF probably did some
> additional check or clean-up, so it would need the original Loaded Image
> handle. Will dig it further.

Hi Gary.

Great job. I think it is time to reassign the bug, since we are now investigating the fix in the shim layer. Feel free to ask if you need anything here.

Thanks.
Comment 33 Gary Ching-Pang Lin 2021-03-05 07:19:07 UTC
Submitted the patch upstream for further review.
https://github.com/rhboot/shim/pull/306
Comment 37 Guillaume GARDET 2021-03-09 08:25:08 UTC
I created https://bugzilla.suse.com/show_bug.cgi?id=1183213 to track it for Tumbleweed, where the problem is also present in upgrade tests.
Comment 38 OBSbugzilla Bot 2021-03-09 10:00:07 UTC
This is an autogenerated message for OBS integration:
This bug (1182776) was mentioned in
https://build.opensuse.org/request/show/877920 Factory / shim
Comment 40 Stefan Weiberg 2021-03-15 10:46:38 UTC
Fixes merged and resolved with RC1 candidate
Comment 41 Oliver Kurz 2021-03-30 06:05:12 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: offline_slehpc15sp1_espos_scc_basesys-desk-dev-hpc-python2-srv-wsm_def_full_tm
https://openqa.suse.de/tests/5620567

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed
Comment 42 openQA Review 2021-04-20 05:21:40 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: activate_encrypted_volume
https://openqa.suse.de/tests/5846206

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released"
3. The label in the openQA scenario is removed