Bug 1216472 - VMs with secure boot do not start (assertion in edk2)
Summary: VMs with secure boot do not start (assertion in edk2)
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Virtualization:Other
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Joey Lee
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-23 06:26 UTC by Vasily Ulyanov
Modified: 2023-11-21 03:11 UTC
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
VMI startup log (509.64 KB, text/x-log)
2023-10-23 06:26 UTC, Vasily Ulyanov
Domain xml (9.08 KB, text/xml)
2023-11-01 13:27 UTC, Vasily Ulyanov
Domain xml (original) (5.39 KB, text/xml)
2023-11-02 12:24 UTC, Vasily Ulyanov

Description Vasily Ulyanov 2023-10-23 06:26:55 UTC
Created attachment 870374 [details]
VMI startup log

One of the KubeVirt tests that runs a VM with Secure Boot fails. The debug logs show an assertion in edk2:

ASSERT /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202308/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c(1478): (Index != 0) || (LevelType == 0x01)

Tested with the OVMF build from the Virtualization project: qemu-ovmf-x86_64-202308-Virt.1699.256.29.noarch
Comment 1 Joey Lee 2023-11-01 12:58:16 UTC
Hi Vasily,

(In reply to Vasily Ulyanov from comment #0)
> Created attachment 870374 [details]
> VMI startup log
> 
> One of the KubeVirt tests that runs a VM with Secure Boot fails. The debug
> logs show an assertion in edk2:
> 
> ASSERT
> /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202308/UefiCpuPkg/Library/
> BaseXApicX2ApicLib/BaseXApicX2ApicLib.c(1478): (Index != 0) || (LevelType ==
> 0x01)
> 
> Tested with ovmf build from the Virtualization project:
> qemu-ovmf-x86_64-202308-Virt.1699.256.29.noarch

I cannot reproduce this issue on my local machine with qemu-ovmf + Secure Boot.

My code/vars settings are:

    <loader readonly='yes' secure='no' type='pflash'>/usr/share/qemu/ovmf-x86_64-code.bin</loader>
    <nvram template='/usr/share/qemu/ovmf-x86_64-vars.bin'>/var/lib/libvirt/qemu/nvram/opensuseTW_VARS.fd</nvram>
Comment 2 Joey Lee 2023-11-01 13:15:19 UTC
(In reply to Vasily Ulyanov from comment #0)
> Created attachment 870374 [details]
> VMI startup log

Based on the above log, it looks like the VM used OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd as its EFI firmware:

{"component":"virt-launcher","level":"info","msg":"\t\t<loader readonly=\"yes\" secure=\"yes\" type=\"pflash\">/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>","subcomponent":"libvirt","timestamp":"2023-10-19T17:30:50.321164Z"}
{"component":"virt-launcher","level":"info","msg":"\t\t<nvram template=\"/usr/share/OVMF/OVMF_VARS.secboot.fd\">/tmp/default_testvmi</nvram>","subcomponent":"libvirt","timestamp":"2023-10-19T17:30:50.321177Z"}

But those files are not from the qemu-ovmf-x86_64 RPM.

Hi Vasily,

Do you know where OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd come from?
Comment 3 Vasily Ulyanov 2023-11-01 13:26:44 UTC
(In reply to Joey Lee from comment #1)
> 
> I can not reproduce this issue on my local machine with qemu-ovmf + secure
> boot. 
> 
> My code/vars are:
> 
>     <loader readonly='yes' secure='no'
> type='pflash'>/usr/share/qemu/ovmf-x86_64-code.bin</loader>
>     <nvram
> template='/usr/share/qemu/ovmf-x86_64-vars.bin'>/var/lib/libvirt/qemu/nvram/
> opensuseTW_VARS.fd</nvram>

I guess with `secure='no'` the VM will run **without** Secure Boot, right? This works fine for me as well. The issue is observed when `secure='yes'` and `ovmf-x86_64-smm-ms-code.bin` is used. I will grab the full domxml from the test run and attach it here.
  
> Do you know where are OVMF_CODE.secboot.fd and OVMF_VARS.secboot.fd from?

Those are just symlinks to `/usr/share/qemu/ovmf-x86_64-smm-ms-code.bin` and `/usr/share/qemu/ovmf-x86_64-smm-ms-vars.bin`:

qemu@testvmi:/> ls -la /usr/share/OVMF/OVMF_CODE.secboot.fd
lrwxrwxrwx 1 root root 35 Nov  1 01:52 /usr/share/OVMF/OVMF_CODE.secboot.fd -> ../qemu/ovmf-x86_64-smm-ms-code.bin
qemu@testvmi:/> 
qemu@testvmi:/> ls -la /usr/share/OVMF/OVMF_VARS.secboot.fd
lrwxrwxrwx 1 root root 35 Nov  1 01:52 /usr/share/OVMF/OVMF_VARS.secboot.fd -> ../qemu/ovmf-x86_64-smm-ms-vars.bin
qemu@testvmi:/> 
qemu@testvmi:/> rpm -qa | grep ovmf
qemu-ovmf-x86_64-202308-Virt.1699.256.44.noarc
Comment 4 Vasily Ulyanov 2023-11-01 13:27:54 UTC
Created attachment 870566 [details]
Domain xml
Comment 5 Joey Lee 2023-11-01 14:15:05 UTC
(In reply to Vasily Ulyanov from comment #4)
> Created attachment 870566 [details]
> Domain xml

May I know your libvirt version? I ran into a problem when I tested the ms-ovmf image, and I filed a separate bug for it: bsc#1216789.
Comment 6 Vasily Ulyanov 2023-11-01 14:22:01 UTC
(In reply to Joey Lee from comment #5)
> (In reply to Vasily Ulyanov from comment #4)
> > Created attachment 870566 [details]
> > Domain xml
> 
> May I know your libvirt version? I got a problem when I tested ms-ovmf
> image. I filed another issue bsc#1216789.

Libvirt is from the Virtualization project in OBS. It should be the latest one:

qemu@testvmi:/> rpm -qa | grep libvirt-
system-group-libvirt-20170617-25.2.noarch
libvirt-libs-9.8.0-Virt.1699.1088.7.x86_64
libvirt-daemon-log-9.8.0-Virt.1699.1088.7.x86_64
libvirt-client-9.8.0-Virt.1699.1088.7.x86_64
libvirt-daemon-common-9.8.0-Virt.1699.1088.7.x86_64
libvirt-daemon-driver-qemu-9.8.0-Virt.1699.1088.7.x86_64
Comment 7 Joey Lee 2023-11-02 06:51:02 UTC
I can NOT reproduce the issue on my local machine. Maybe it is related to the hardware of the host machine.

My environment is:

qemu-8.1.2-1.2.x86_64
libvirt-9.8.0-2.1.x86_64
qemu-ovmf-x86_64-202308-1.2.noarch

The following configuration works:

  <os firmware='efi'>
    <type arch='x86_64' machine='pc-q35-8.1'>hvm</type>
    <firmware>
      <feature enabled='yes' name='enrolled-keys'/>
      <feature enabled='yes' name='secure-boot'/>
    </firmware>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/qemu/ovmf-x86_64-smm-ms-code.bin</loader>
    <nvram template='/usr/share/qemu/ovmf-x86_64-smm-ms-vars.bin'>/var/lib/libvirt/qemu/nvram/opensuseTW_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

Because ovmf-x86_64-smm-ms-code.bin has a default feature setting in /usr/share/qemu/firmware/50-ovmf-x86_64-secure-ms.json, newer libvirt automatically adds the 'enrolled-keys' and 'secure-boot' features to the firmware section. It works fine.

I also tried creating the symlinks as described in comment #3, and it still works on my local machine:

  <os>
    <type arch='x86_64' machine='pc-q35-8.1'>hvm</type>
    <loader readonly='yes' secure='yes' type='pflash'>/usr/share/OVMF/OVMF_CODE.secboot.fd</loader>
    <nvram template='/usr/share/OVMF/OVMF_VARS.secboot.fd'>/var/lib/libvirt/qemu/nvram/opensuseTW_VARS.fd</nvram>
    <boot dev='hd'/>
  </os>

Because OVMF_CODE.secboot.fd does NOT have a feature definition in any .json file, libvirt does not auto-add the features. But the above configuration still works on my local machine with Secure Boot enabled.
Comment 8 Vasily Ulyanov 2023-11-02 08:38:05 UTC
Hm... Actually, KubeVirt tests are run on a VM. So it is a 'nested' scenario. Can this affect the issue somehow?

Also, just to make sure, KubeVirt domain has:

<features>
  <acpi/>
  <smm state="on"/>
</features>

Do you have those features enabled?
Comment 9 Joey Lee 2023-11-02 09:45:10 UTC
(In reply to Vasily Ulyanov from comment #8)
> Hm... Actually, KubeVirt tests are run on a VM. So it is a 'nested'
> scenario. Can this affect the issue somehow?
> 
> Also, just to make sure, KubeVirt domain has:
> 
> <features>
>   <acpi/>
>   <smm state="on"/>
> </features>
> 
> Do you have those features enabled?

Yes, my xml has the following features:

  <features>
    <acpi/>
    <apic/>
    <smm state='on'/>
    <ioapic driver='qemu'/>
  </features>
Comment 10 Joey Lee 2023-11-02 09:47:10 UTC
Hi Vasily, 

Could you please try the ovmf package from the SUSE:ALP:Source:Standard:1.0 repo on IBS? It is based on edk2-stable202305, so it can help narrow down which edk2 version introduced the problem.

Thanks a lot!
Joey Lee
Comment 11 Vasily Ulyanov 2023-11-02 10:36:46 UTC
(In reply to Joey Lee from comment #10)
> Hi Vasily, 
> 
> Could you please try ovmf in SUSE:ALP:Source:Standard:1.0 repo on IBS? It's
> edk2-stable202305. It can help to narrow down the scope of edk2 version.
> 
> Thanks a lot!
> Joey Lee

Hi Joey, the version from ALP works fine.
Comment 12 Joey Lee 2023-11-02 11:25:21 UTC
(In reply to Vasily Ulyanov from comment #11)
> (In reply to Joey Lee from comment #10)
> > Hi Vasily, 
> > 
> > Could you please try ovmf in SUSE:ALP:Source:Standard:1.0 repo on IBS? It's
> > edk2-stable202305. It can help to narrow down the scope of edk2 version.
> > 
> > Thanks a lot!
> > Joey Lee
> 
> Hi Joey, the version from ALP works fine.

Thanks! Then the issue is related to a change between edk2-stable202305 and edk2-stable202308.

Could you please share how to set up your environment? Or could you share the machine (or virtual machine) with me for debugging?
Comment 13 Joey Lee 2023-11-02 12:06:51 UTC
Hi Vasily

(In reply to Vasily Ulyanov from comment #4)
> Created attachment 870566 [details]
> Domain xml

Per your XML, the CPU mode is:

<cpu mode="custom" match="exact" check="full">
  <model fallback="forbid">SapphireRapids</model>
  <vendor>Intel</vendor>
  <topology sockets="1" dies="1" cores="1" threads="1"/>
  <feature policy="require" name="ss"/>
  <feature policy="require" name="vmx"/>
  <feature policy="require" name="pdcm"/>
  <feature policy="require" name="hypervisor"/>
  <feature policy="require" name="tsc_adjust"/>
  <feature policy="require" name="cldemote"/>
  <feature policy="require" name="movdiri"/>
  <feature policy="require" name="movdir64b"/>
  <feature policy="require" name="md-clear"/>
  <feature policy="require" name="stibp"/>
  <feature policy="require" name="ibpb"/>
  <feature policy="require" name="ibrs"/>
  <feature policy="require" name="amd-stibp"/>
  <feature policy="require" name="amd-ssbd"/>
  <feature policy="require" name="tsx-ctrl"/>
  <feature policy="require" name="sbdr-ssdp-no"/>
  <feature policy="require" name="fbsdp-no"/>
  <feature policy="require" name="psdp-no"/>
  <feature policy="disable" name="amx-bf16"/>
  <feature policy="disable" name="amx-tile"/>
  <feature policy="disable" name="amx-int8"/>
  <feature policy="disable" name="fzrm"/>
  <feature policy="disable" name="fsrs"/>
  <feature policy="disable" name="fsrc"/>
  <feature policy="disable" name="xfd"/>
</cpu>

Could you please also try changing the CPU mode to this?

  <cpu mode='host-model' check='partial'/>

The above is my setting, and it works for me.
Comment 14 Joey Lee 2023-11-02 12:11:12 UTC
(In reply to Vasily Ulyanov from comment #0)
> Created attachment 870374 [details]
> VMI startup log
> 
> One of the KubeVirt tests that runs a VM with Secure Boot fails. The debug
> logs show an assertion in edk2:
> 
> ASSERT
> /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202308/UefiCpuPkg/Library/
> BaseXApicX2ApicLib/BaseXApicX2ApicLib.c(1478): (Index != 0) || (LevelType ==
> 0x01)
> 

The above ASSERT comes from GetProcessorLocation2ByApicId(), which derives the processor's location (package, core, thread, and so on) from its APIC ID. So I suspect the cause is a difference in the CPU settings.
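GetProcessorLocation2ByApicId() reads the topology from CPUID leaf 0x1F (V2 Extended Topology) when the maximum standard leaf allows it, otherwise from leaf 0x0B. A minimal guest-side diagnostic to see what the guest CPU actually reports might look like the sketch below. It assumes an x86 build with GCC or Clang, whose `<cpuid.h>` provides `__get_cpuid_max()` and `__cpuid_count()`; the helper names are hypothetical and not part of edk2.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/Clang: __get_cpuid_max(), __cpuid_count() */
#endif

/* Leaf numbers, named as in edk2's Cpuid.h. */
#define CPUID_EXTENDED_TOPOLOGY     0x0Bu
#define CPUID_V2_EXTENDED_TOPOLOGY  0x1Fu

/* ECX[15:8]: level type of this subleaf (0 = invalid, 1 = SMT, 2 = core, ...). */
static unsigned level_type_from_ecx(uint32_t ecx)
{
    return (ecx >> 8) & 0xFFu;
}

/* EBX[15:0]: number of logical processors at this level. */
static unsigned logical_procs_from_ebx(uint32_t ebx)
{
    return ebx & 0xFFFFu;
}

/* Hypothetical helper: prints each subleaf of a topology leaf until the level
   type reads 0. Returns the index of that first invalid subleaf (0 means even
   the first level is invalid), 16 if none was found within the bound, or -1
   if the leaf cannot be queried. */
static int dump_topology_leaf(unsigned leaf)
{
    assert(leaf == CPUID_EXTENDED_TOPOLOGY || leaf == CPUID_V2_EXTENDED_TOPOLOGY);
    (void)leaf;
#if defined(__x86_64__) || defined(__i386__)
    unsigned max_leaf = __get_cpuid_max(0, NULL);
    printf("max standard leaf: 0x%x\n", max_leaf);
    if (max_leaf < leaf)
        return -1;
    for (unsigned sub = 0; sub < 16; sub++) {      /* bounded, just in case */
        unsigned eax, ebx, ecx, edx;
        __cpuid_count(leaf, sub, eax, ebx, ecx, edx);
        unsigned type = level_type_from_ecx(ecx);
        printf("leaf 0x%x sub %u: type %u, shift %u, logical procs %u, x2APIC ID %u\n",
               leaf, sub, type, eax & 0x1Fu, logical_procs_from_ebx(ebx), edx);
        if (type == 0)
            return (int)sub;
    }
    return 16;
#else
    return -1;                                      /* not an x86 build */
#endif
}
```

Calling `dump_topology_leaf(CPUID_V2_EXTENDED_TOPOLOGY)` inside the guest prints one line per subleaf. Judging by the debug log in comment 18, one would expect it to report type 0 on subleaf 0 (and return 0) on the affected guest, but that is an expectation, not a measured result.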
Comment 15 Vasily Ulyanov 2023-11-02 12:23:17 UTC
(In reply to Joey Lee from comment #13)
> Hi Vasily
> 
> (In reply to Vasily Ulyanov from comment #4)
> > Created attachment 870566 [details]
> > Domain xml
> 
> Per your xml, the cpu mode is:
> 
> <cpu mode="custom" match="exact" check="full">
>   <model fallback="forbid">SapphireRapids</model>
>   <vendor>Intel</vendor>
>   <topology sockets="1" dies="1" cores="1" threads="1"/>
>   <feature policy="require" name="ss"/>
>   <feature policy="require" name="vmx"/>
>   <feature policy="require" name="pdcm"/>
>   <feature policy="require" name="hypervisor"/>
>   <feature policy="require" name="tsc_adjust"/>
>   <feature policy="require" name="cldemote"/>
>   <feature policy="require" name="movdiri"/>
>   <feature policy="require" name="movdir64b"/>
>   <feature policy="require" name="md-clear"/>
>   <feature policy="require" name="stibp"/>
>   <feature policy="require" name="ibpb"/>
>   <feature policy="require" name="ibrs"/>
>   <feature policy="require" name="amd-stibp"/>
>   <feature policy="require" name="amd-ssbd"/>
>   <feature policy="require" name="tsx-ctrl"/>
>   <feature policy="require" name="sbdr-ssdp-no"/>
>   <feature policy="require" name="fbsdp-no"/>
>   <feature policy="require" name="psdp-no"/>
>   <feature policy="disable" name="amx-bf16"/>
>   <feature policy="disable" name="amx-tile"/>
>   <feature policy="disable" name="amx-int8"/>
>   <feature policy="disable" name="fzrm"/>
>   <feature policy="disable" name="fsrs"/>
>   <feature policy="disable" name="fsrc"/>
>   <feature policy="disable" name="xfd"/>
> </cpu>
> 
> Could you please also try to change the cpu mode to this?
> 
>   <cpu mode='host-model' check='partial'/>
> 
> The above model is my setting. It works to me.


The attached dom.xml is the output of `virsh dumpxml`, meaning that it has been 'adjusted' by libvirt. The original dom.xml already specifies the CPU mode as 'host-model', so those CPU settings actually match the host hardware. I will grab the original domain xml and attach it here for reference.
Comment 16 Vasily Ulyanov 2023-11-02 12:24:00 UTC
Created attachment 870592 [details]
Domain xml (original)
Comment 17 Joey Lee 2023-11-03 12:32:57 UTC
Thanks to Vasily for helping set up the environment for bisecting. After bisecting edk2-stable202305..edk2-stable202308, the first bad commit is:

From 1fadd18d0c0c65ffde9e128a486414ba43b3387c Mon Sep 17 00:00:00 2001      [edk2-stable202308]
From: "Zhang, Hongbin1" <Hongbin1.Zhang@intel.com>
Date: Mon, 29 May 2023 14:39:38 +0800
Subject: [PATCH 177/271] UefiCpuPkg: Get processor extended information for
 SmmCpuServiceProtocol

Some features like RAS need to use processor extended information
under smm, So add code to support it

Signed-off-by: Hongbin1 Zhang <hongbin1.zhang@intel.com>
Cc: Eric Dong <eric.dong@intel.com>
Reviewed-by: Ray Ni <ray.ni@intel.com>
Cc: Rahul Kumar <rahul1.kumar@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Cc: Star Zeng <star.zeng@intel.com>
Reviewed-by: Jiaxin Wu <jiaxin.wu@intel.com>

After reverting this patch, edk2-stable202308 OVMF works fine on the affected machine.

I will check why this change causes problem.
Comment 18 Joey Lee 2023-11-14 14:00:58 UTC
(In reply to Vasily Ulyanov from comment #0)
> Created attachment 870374 [details]
> VMI startup log
> 
> One of the KubeVirt tests that runs a VM with Secure Boot fails. The debug
> logs show an assertion in edk2:
> 
> ASSERT
> /home/abuild/rpmbuild/BUILD/edk2-edk2-stable202308/UefiCpuPkg/Library/
> BaseXApicX2ApicLib/BaseXApicX2ApicLib.c(1478): (Index != 0) || (LevelType ==
> 0x01)
> 

I have checked the ASSERT code in OVMF, in
UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c:GetProcessorLocation2ByApicId
    //
    // first level reported should be SMT.
    //
    ASSERT ((Index != 0) || (LevelType == CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_SMT));
    if (LevelType == CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_INVALID) {            
      break;
    }

Then I added a debug log to print Index and LevelType on the affected virtual machine, kubevirt-ci-node72. It shows:

GetProcessorLocation2ByApicId, Index: 0, LevelType: 0
ASSERT /home/joeyli/source_code-git/edk2/UefiCpuPkg/Library/BaseXApicX2ApicLib/BaseXApicX2ApicLib.c(1479): (Index != 0) || (LevelType == 0x01)

The LevelType returned by the CPUID instruction (CPUID V2 Extended Topology Enumeration Leaf, 0x1F) is CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_INVALID, but Index 0 (the first level) should be CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_SMT. That's why the ASSERT is triggered.
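The invariant behind the ASSERT can be modelled outside edk2. The sketch below replays recorded ECX values of leaf 0x1F subleafs 0, 1, 2, ... in the same order as the loop quoted above (check the first level is SMT, then stop at the first invalid level). The function name and the sample ECX values are hypothetical; the level-type constants follow the values shown in the ASSERT message (SMT = 0x01) and edk2's INVALID = 0x00.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Level-type values (ECX[15:8]) used by the quoted loop. */
#define LEVEL_TYPE_INVALID 0x00u
#define LEVEL_TYPE_SMT     0x01u

/* Hypothetical model of the quoted loop: returns false exactly where edk2's
   ASSERT would fire, i.e. when the first level reported is not SMT. The walk
   stops at the first invalid level type. Illustration only, not edk2 code. */
static bool first_level_is_smt(const uint32_t *ecx_by_subleaf, size_t count)
{
    assert(ecx_by_subleaf != NULL || count == 0);
    for (size_t index = 0; index < count; index++) {
        unsigned level_type = (ecx_by_subleaf[index] >> 8) & 0xFFu;
        if (index == 0 && level_type != LEVEL_TYPE_SMT)
            return false;          /* ASSERT ((Index != 0) || (LevelType == SMT)) */
        if (level_type == LEVEL_TYPE_INVALID)
            break;
    }
    return true;
}
```

For example, a sequence like `{0x100, 0x201, 0x002}` (SMT, then core, then invalid) passes, while `{0x000}`, which matches the "Index: 0, LevelType: 0" debug output above, fails on the very first level.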

On the affected machine, MaxStandardCpuIdIndex is 31 (0x1F), which is >= CPUID_V2_EXTENDED_TOPOLOGY, so the "CPUID V2 Extended Topology Enumeration Leaf" logic is used. But CPUID should NOT return the CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_INVALID level type for the first level.
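Checking only the maximum standard leaf is what lets leaf 0x1F be picked here. Intel's SDM describes a stricter presence test for leaf 0x1F: the highest supported leaf must be >= 0x1F and CPUID.1FH:EBX[15:0] must be non-zero. A sketch of leaf selection under that rule, with leaf 0x0B as the fallback, is below. It is an illustration of the SDM rule only, not the code of any edk2 commit; the function name is hypothetical, and CPUID results are passed in so the logic can be exercised without executing CPUID.

```c
#include <assert.h>
#include <stdint.h>

#define CPUID_EXTENDED_TOPOLOGY     0x0Bu
#define CPUID_V2_EXTENDED_TOPOLOGY  0x1Fu

/* Hypothetical leaf selection following the SDM presence rule:
   a topology leaf is usable only if (a) max_std_leaf >= leaf and
   (b) EBX[15:0] of its subleaf 0 is non-zero.
   ebx_1f / ebx_0b are the EBX values returned for subleaf 0 of 0x1F / 0x0B.
   Returns the leaf to use, or 0 if neither is usable. */
static uint32_t pick_topology_leaf(uint32_t max_std_leaf,
                                   uint32_t ebx_1f, uint32_t ebx_0b)
{
    if (max_std_leaf >= CPUID_V2_EXTENDED_TOPOLOGY && (ebx_1f & 0xFFFFu) != 0)
        return CPUID_V2_EXTENDED_TOPOLOGY;
    if (max_std_leaf >= CPUID_EXTENDED_TOPOLOGY && (ebx_0b & 0xFFFFu) != 0)
        return CPUID_EXTENDED_TOPOLOGY;
    return 0;
}
```

With `max_std_leaf = 31` and a zero EBX from leaf 0x1F, this falls back to 0x0B, assuming leaf 0x0B is populated in the guest. Whether leaf 0x1F really returns an all-zero EBX on the affected guest is not shown in this report; the LevelType 0 in the debug log only suggests it.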
Comment 19 Joey Lee 2023-11-14 14:03:27 UTC
Hi James, 

Sorry to bother you! Regarding my comment #18, do you know what returns the CPUID_EXTENDED_TOPOLOGY_LEVEL_TYPE_INVALID level type from the CPUID instruction: QEMU or the host machine's CPU?

Thanks!
Comment 20 Joey Lee 2023-11-15 07:37:41 UTC
I also filed a bug on tianocore bugzilla: 

https://bugzilla.tianocore.org/show_bug.cgi?id=4598
Comment 21 Joey Lee 2023-11-15 12:06:51 UTC
(In reply to Joey Lee from comment #20)
> I also filed a bug on tianocore bugzilla: 
> 
> https://bugzilla.tianocore.org/show_bug.cgi?id=4598

I followed Gerd Hoffmann's suggestion and confirmed that Gerd's patch 170d4ce8e9 in edk2 mainline fixes the problem on the affected machine:

commit 170d4ce8e90abb1eff03852940a69c9d17f8afe5
Author: Gerd Hoffmann <kraxel@redhat.com>
Date:   Tue Oct 17 13:28:07 2023 +0200

    UefiCpuPkg/BaseXApicX2ApicLib: fix CPUID_V2_EXTENDED_TOPOLOGY detection

I tested the 2023-11-15 master branch and also backported the above patch to edk2-stable202308 OVMF. Both work.

I will backport the 170d4ce8e9 patch to our edk2-stable202308 ovmf.
Comment 22 Joey Lee 2023-11-21 03:11:22 UTC
The backported 170d4ce8e9 patch has been merged into openSUSE:Factory/ovmf. Setting this issue to FIXED.