Bug 1165773

Summary: System does not boot with kernel 5.6 (Kernel:HEAD)
Product: [openSUSE] openSUSE Tumbleweed Reporter: Fabian Vogt <fvogt>
Component: KernelAssignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None CC: fvogt, glin, gm.venekamp, jlee, jslaby, mchang, tiwai
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: grub.cfg on the working Dell Precision 5510
tpm: check event log version before reading final events
dmesg of boot with 5.4
ACPI tables
Proposed kernel patch

Description Fabian Vogt 2020-03-05 08:12:02 UTC
Using kernel-default-5.6.rc4-1.1.g8a04afc.x86_64, the system does not boot.
GRUB loads kernel and initrd, but after "Loading initial ramdisk ..." nothing happens - the system is stuck there and a hard reset is required.

I tried adding some options for debugging ("debug loglevel=9 nomodeset video=efifb mitigations=off"), but those didn't change anything.

The same issue happens with kernel-default-5.6.rc2-2.1.g84ceb66.x86_64.
Older kernels (5.4) boot just fine.
Comment 1 Jiri Slaby 2020-03-06 08:12:06 UTC
What about debug with earlyprintk=vga, earlyprintk=efi, or earlyprintk=serial (depending on use case)?
Comment 2 Fabian Vogt 2020-03-06 09:12:31 UTC
(In reply to Jiri Slaby from comment #1)
> What about debug with earlyprintk=vga, earlyprintk=efi, or
> earlyprintk=serial (depending on use case)?

It's a Dell Precision 5510 with UEFI, secure boot disabled.

I tried both earlyprintk=vga and earlyprintk=efi, none of them did anything.

I tried earlyprintk=efi with the working 5.4 kernel as well and it didn't do anything either though.
Comment 3 Jiri Slaby 2020-03-06 09:26:34 UTC
Ah, I never know which of those it is. So does earlycon=efifb output anything?
Comment 4 Fabian Vogt 2020-03-06 09:47:37 UTC
(In reply to Jiri Slaby from comment #3)
> Ah, I never know which of those it is. So does earlycon=efifb output
> anything?

Nope, still nothing.

This option did work on 5.4 though.
Comment 5 Fabian Vogt 2020-03-06 10:04:46 UTC
I used QEMU to boot my system and the kernel is working, so the issue is somehow with the hardware or the EFI.

qemu-system-x86_64 -accel kvm -bios /usr/share/qemu/ovmf-x86_64-suse-code.bin -drive file=/dev/sda,if=virtio,format=raw,readonly=on
Comment 6 Michael Pujos 2020-03-10 13:49:28 UTC
Are you absolutely certain Secure Boot is disabled ?
Comment 7 Fabian Vogt 2020-03-10 13:56:56 UTC
(In reply to Michael Pujos from comment #6)
> Are you absolutely certain Secure Boot is disabled ?

Yes.
Comment 8 Takashi Iwai 2020-03-11 11:47:10 UTC
I'm installing a fresh TW now on 5510 for testing.  Let's see whether I can reproduce the problem.
Comment 9 Takashi Iwai 2020-03-11 11:53:41 UTC
Now tested 5.6-rc5 from Kernel:HEAD, and it boots fine.

Tested with Dell Precision 5510 with WD15 dock, UEFI boot mode.

So this seems specific to either setup or some combination.
Comment 10 Takashi Iwai 2020-04-03 16:35:53 UTC
Is this still a problem with the latest kernel?
Comment 11 Fabian Vogt 2020-04-06 13:48:23 UTC
(In reply to Takashi Iwai from comment #10)
> Is this still a problem with the latest kernel?

Just tried with kernel-default-5.6.2-2.1.g969c535.x86_64. Unfortunately still broken.
Comment 12 Takashi Iwai 2020-05-19 15:50:21 UTC
Is this still a problem with the latest TW?

I can retry the fresh installation on the same machine, but I'm afraid it would work again, i.e. some subtle configuration difference may cause the problem...
Comment 13 Fabian Vogt 2020-05-20 07:10:32 UTC
(In reply to Takashi Iwai from comment #12)
> Is this still a problem with the latest TW?
> 
> I can retry the fresh installation on the same machine, but I'm afraid it
> would work again, i.e. some subtle configuration difference may cause the
> problem...

Yep, tried both the TW (kernel-default-5.6.12-1.3.x86_64) and Kernel:HEAD (kernel-default-5.7.rc6-1.1.g3603fcd.x86_64), same issue. I'm still unable to get any sign of life out of the kernel.
Comment 14 Fabian Vogt 2020-05-20 13:00:10 UTC
Progress!

I built some test kernels from git to maybe narrow it down, but those didn't boot at all, until I tried "linux" instead of "linuxefi" for booting. (Didn't work for the 5.6 kernel that broke first though, probably a different issue). And indeed, using "linux"/"initrd" instead of "linuxefi"/"initrdefi" got the 5.7-rc6 kernel to boot here as well now.

Narrowing this down a bit further, I added a boot entry to load grub.efi directly and not go through shim. With that setup, even "linuxefi" works again.

What I do not understand is:

- Why does shim make any difference if secure boot is disabled?
- Why is there just a freeze and no error message?
- Why does linux work, but not linuxefi?
- Why do kernel 5.4 and older boot even with linuxefi and shim?
Comment 15 Gary Ching-Pang Lin 2020-05-21 05:39:17 UTC
(In reply to Fabian Vogt from comment #14)
> Progress!
> 
> I built some test kernels from git to maybe narrow it down, but those didn't
> boot at all, until I tried "linux" instead of "linuxefi" for booting.
> (Didn't work for the 5.6 kernel that broke first though, probably a
> different issue). And indeed, using "linux"/"initrd" instead of
> "linuxefi"/"initrdefi" got the 5.7-rc6 kernel to boot here as well now.
> 
> Narrowing this down a bit further, I added a boot entry to load grub.efi
> directly and not go through shim. With that setup, even "linuxefi" works
> again.
> 
> What I do not understand is:
> 
> - Why does shim make any difference if secure boot is disabled?
To simplify the boot path, we always boot shim and then grub2. When secure boot is disabled, shim still provides the verification protocol for grub2 but skips the verification of signatures.

> - Why is there just a freeze and no error message?
It may crash somewhere between grub2 handing over to kernel and kernel finishing initialization of efi.

> - Why does linux work, but not linuxefi?
AFAIK, linuxefi is designed for secure boot and never upstreamed. Furthermore, we disabled linux in the older grub2. Since grub 2.0.4, a new verification framework is implemented, so that we can use it with/without secure boot.

> - Why do kernel 5.4 and older boot even with linuxefi and shim?
Micheal, any idea?

BTW, could you post the grub2 version and /boot/grub2/grub.cfg? Maybe we can try to figure out with that.
Comment 16 Michael Chang 2020-05-21 07:33:25 UTC
(In reply to Gary Ching-Pang Lin from comment #15)
> (In reply to Fabian Vogt from comment #14)
> > Progress!
> > 
> > I built some test kernels from git to maybe narrow it down, but those didn't
> > boot at all, until I tried "linux" instead of "linuxefi" for booting.
> > (Didn't work for the 5.6 kernel that broke first though, probably a
> > different issue). And indeed, using "linux"/"initrd" instead of
> > "linuxefi"/"initrdefi" got the 5.7-rc6 kernel to boot here as well now.
> > 
> > Narrowing this down a bit further, I added a boot entry to load grub.efi
> > directly and not go through shim. With that setup, even "linuxefi" works
> > again.
> > 
> > What I do not understand is:
> > 
> > - Why does shim make any difference if secure boot is disabled?
> To simplify the boot path, we always boot shim and then grub2. When secure
> boot is disabled, shim still provides the verification protocol for grub2
> but skips the verification of signatures.

I had impression that shim has hooks to some efi boot service ? Is it valid for this case ?
 
> > - Why is there just a freeze and no error message?
> It may crash somewhere between grub2 handing over to kernel and kernel
> finishing initialization of efi.

Yes unfortunately it is quite often that no error message for early efi boot failure, besides error itself can be asynchronous that cannot be captured and logged. 

You may give GRUB_TERMINAL="console" a try, which uses the same efi_console_output protocol for displaying console output as earlyprintk=efi. The gfxterm uses different output protocol (GOP) that may impose screen refresh in the transition.

> > - Why does linux work, but not linuxefi?
> AFAIK, linuxefi is designed for secure boot and never upstreamed.
> Furthermore, we disabled linux in the older grub2. Since grub 2.0.4, a new
> verification framework is implemented, so that we can use it with/without
> secure boot.

As long as linux command totally bypassed the kernel's efistub, it is not influenced/suffered by the problem if it is related to booting efistub.

> 
> > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> Micheal, any idea?

No idea either. Is is possible to "bisect" the problem as this sounded like kernel regression ?
Comment 17 Gary Ching-Pang Lin 2020-05-21 07:49:11 UTC
(In reply to Michael Chang from comment #16)
> (In reply to Gary Ching-Pang Lin from comment #15)
> > (In reply to Fabian Vogt from comment #14)
> > > Progress!
> > > 
> > > I built some test kernels from git to maybe narrow it down, but those didn't
> > > boot at all, until I tried "linux" instead of "linuxefi" for booting.
> > > (Didn't work for the 5.6 kernel that broke first though, probably a
> > > different issue). And indeed, using "linux"/"initrd" instead of
> > > "linuxefi"/"initrdefi" got the 5.7-rc6 kernel to boot here as well now.
> > > 
> > > Narrowing this down a bit further, I added a boot entry to load grub.efi
> > > directly and not go through shim. With that setup, even "linuxefi" works
> > > again.
> > > 
> > > What I do not understand is:
> > > 
> > > - Why does shim make any difference if secure boot is disabled?
> > To simplify the boot path, we always boot shim and then grub2. When secure
> > boot is disabled, shim still provides the verification protocol for grub2
> > but skips the verification of signatures.
> 
> I had impression that shim has hooks to some efi boot service ? Is it valid
> for this case ?
>  
Once secure boot is disabled, shim will not hook those services.

> > > - Why is there just a freeze and no error message?
> > It may crash somewhere between grub2 handing over to kernel and kernel
> > finishing initialization of efi.
> 
> Yes unfortunately it is quite often that no error message for early efi boot
> failure, besides error itself can be asynchronous that cannot be captured
> and logged. 
> 
> You may give GRUB_TERMINAL="console" a try, which uses the same
> efi_console_output protocol for displaying console output as
> earlyprintk=efi. The gfxterm uses different output protocol (GOP) that may
> impose screen refresh in the transition.
> 
> > > - Why does linux work, but not linuxefi?
> > AFAIK, linuxefi is designed for secure boot and never upstreamed.
> > Furthermore, we disabled linux in the older grub2. Since grub 2.0.4, a new
> > verification framework is implemented, so that we can use it with/without
> > secure boot.
> 
> As long as linux command totally bypassed the kernel's efistub, it is not
> influenced/suffered by the problem if it is related to booting efistub.
> 
> > 
> > > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> > Micheal, any idea?
> 
> No idea either. Is is possible to "bisect" the problem as this sounded like
> kernel regression ?

Hmmm maybe we can start with the changes in kernel efistub.
Comment 18 Fabian Vogt 2020-05-22 13:46:11 UTC
(In reply to Gary Ching-Pang Lin from comment #15)
> (In reply to Fabian Vogt from comment #14)
> > What I do not understand is:
> > 
> > - Why does shim make any difference if secure boot is disabled?
> To simplify the boot path, we always boot shim and then grub2. When secure
> boot is disabled, shim still provides the verification protocol for grub2
> but skips the verification of signatures.

Which means there should not be any difference. There is one though.

> > - Why does linux work, but not linuxefi?
> AFAIK, linuxefi is designed for secure boot and never upstreamed.
> Furthermore, we disabled linux in the older grub2. Since grub 2.0.4, a new
> verification framework is implemented, so that we can use it with/without
> secure boot.

Which apparently uses a different codepath?

> > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> Micheal, any idea?
> 
> BTW, could you post the grub2 version and /boot/grub2/grub.cfg? Maybe we can
> try to figure out with that.

grub2-x86_64-efi-2.04-5.1.noarch

I'm booting by manually running "linux(efi) /boot/vmlinuz" in the grub console, so the content of grub.cfg shouldn't have (much) influence.

(In reply to Michael Chang from comment #16)
> (In reply to Gary Ching-Pang Lin from comment #15)
> > (In reply to Fabian Vogt from comment #14)
> > > - Why is there just a freeze and no error message?
> > It may crash somewhere between grub2 handing over to kernel and kernel
> > finishing initialization of efi.
> 
> Yes unfortunately it is quite often that no error message for early efi boot
> failure, besides error itself can be asynchronous that cannot be captured
> and logged.
>
> You may give GRUB_TERMINAL="console" a try, which uses the same
> efi_console_output protocol for displaying console output as
> earlyprintk=efi. The gfxterm uses different output protocol (GOP) that may
> impose screen refresh in the transition.

I tried "terminal_output console" during runtime, which I assume is equivalent.
Even with "set debug=all" before "boot" there is absolutely nothing after it tried to jump into the kernel.

> > > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> > Micheal, any idea?
> 
> No idea either. Is is possible to "bisect" the problem as this sounded like
> kernel regression ?

The strange thing is that none of my self-built kernels booted with "linuxefi", not even a 5.4 one which boots when from Kernel:HEAD.
Otherwise I would've suspected some of the changes to header.S (like the bump of the header version to 2.15) after 5.4.
Comment 19 Gary Ching-Pang Lin 2020-05-25 06:10:24 UTC
(In reply to Fabian Vogt from comment #18)
> (In reply to Gary Ching-Pang Lin from comment #15)
> > (In reply to Fabian Vogt from comment #14)
> > > What I do not understand is:
> > > 
> > > - Why does shim make any difference if secure boot is disabled?
> > To simplify the boot path, we always boot shim and then grub2. When secure
> > boot is disabled, shim still provides the verification protocol for grub2
> > but skips the verification of signatures.
> 
> Which means there should not be any difference. There is one though.
> 
> > > - Why does linux work, but not linuxefi?
> > AFAIK, linuxefi is designed for secure boot and never upstreamed.
> > Furthermore, we disabled linux in the older grub2. Since grub 2.0.4, a new
> > verification framework is implemented, so that we can use it with/without
> > secure boot.
> 
> Which apparently uses a different codepath?
> 
> > > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> > Micheal, any idea?
> > 
> > BTW, could you post the grub2 version and /boot/grub2/grub.cfg? Maybe we can
> > try to figure out with that.
> 
> grub2-x86_64-efi-2.04-5.1.noarch
> 
> I'm booting by manually running "linux(efi) /boot/vmlinuz" in the grub
> console, so the content of grub.cfg shouldn't have (much) influence.
> 
> (In reply to Michael Chang from comment #16)
> > (In reply to Gary Ching-Pang Lin from comment #15)
> > > (In reply to Fabian Vogt from comment #14)
> > > > - Why is there just a freeze and no error message?
> > > It may crash somewhere between grub2 handing over to kernel and kernel
> > > finishing initialization of efi.
> > 
> > Yes unfortunately it is quite often that no error message for early efi boot
> > failure, besides error itself can be asynchronous that cannot be captured
> > and logged.
> >
> > You may give GRUB_TERMINAL="console" a try, which uses the same
> > efi_console_output protocol for displaying console output as
> > earlyprintk=efi. The gfxterm uses different output protocol (GOP) that may
> > impose screen refresh in the transition.
> 
> I tried "terminal_output console" during runtime, which I assume is
> equivalent.
> Even with "set debug=all" before "boot" there is absolutely nothing after it
> tried to jump into the kernel.
> 
> > > > - Why do kernel 5.4 and older boot even with linuxefi and shim?
> > > Micheal, any idea?
> > 
> > No idea either. Is is possible to "bisect" the problem as this sounded like
> > kernel regression ?
> 
> The strange thing is that none of my self-built kernels booted with
> "linuxefi", not even a 5.4 one which boots when from Kernel:HEAD.
> Otherwise I would've suspected some of the changes to header.S (like the
> bump of the header version to 2.15) after 5.4.

Did you enable CONFIG_EFI_STUB in your self-built kernel? It adds the EFI header to the kernel image so that linuxefi can parse the image and use the shim protocol to verify the kernel when secure boot is enabled.
Comment 20 Fabian Vogt 2020-05-25 06:17:46 UTC
(In reply to Gary Ching-Pang Lin from comment #19)
> (In reply to Fabian Vogt from comment #18)
> > The strange thing is that none of my self-built kernels booted with
> > "linuxefi", not even a 5.4 one which boots when from Kernel:HEAD.
> > Otherwise I would've suspected some of the changes to header.S (like the
> > bump of the header version to 2.15) after 5.4.
> 
> Did you enable CONFIG_EFI_STUB in your self-built kernel? It adds the EFI
> header to the kernel image so that linuxefi can parse the image and use the
> shim protocol to verify the kernel when secure boot is enabled.

Yes.
Comment 21 Michael Chang 2020-05-25 07:39:11 UTC
(In reply to Fabian Vogt from comment #18)
> (In reply to Gary Ching-Pang Lin from comment #15)
> > (In reply to Fabian Vogt from comment #14)

> The strange thing is that none of my self-built kernels booted with
> "linuxefi", not even a 5.4 one which boots when from Kernel:HEAD.
> Otherwise I would've suspected some of the changes to header.S (like the
> bump of the header version to 2.15) after 5.4.

Are they all possess the same problem ? Which is with secure boot disabled condition grub/linuxefi works but shim/grub/linuxefi failed ? Is it only reproduced on Dell 5510 in your opinion ? Did you use the exactly the same kernel options with openSUSE's to test 5.4 ?
Comment 22 Fabian Vogt 2020-05-25 08:10:47 UTC
(In reply to Michael Chang from comment #21)
> (In reply to Fabian Vogt from comment #18)
> > (In reply to Gary Ching-Pang Lin from comment #15)
> > > (In reply to Fabian Vogt from comment #14)
> 
> > The strange thing is that none of my self-built kernels booted with
> > "linuxefi", not even a 5.4 one which boots when from Kernel:HEAD.
> > Otherwise I would've suspected some of the changes to header.S (like the
> > bump of the header version to 2.15) after 5.4.
> 
> Are they all possess the same problem ? Which is with secure boot disabled
> condition grub/linuxefi works but shim/grub/linuxefi failed ?

Yes.

> Is it only
> reproduced on Dell 5510 in your opinion ?

No idea. Apparently it worked for Takashi even on that model, but I don't know the details. Maybe it was a setup without shim.

> Did you use the exactly the same
> kernel options with openSUSE's to test 5.4 ?

The options which worked for 5.4 didn't work for 5.6+.
Comment 23 Michael Chang 2020-05-25 10:10:10 UTC
(In reply to Fabian Vogt from comment #22)
> (In reply to Michael Chang from comment #21)
> > (In reply to Fabian Vogt from comment #18)
> > > (In reply to Gary Ching-Pang Lin from comment #15)
> > > > (In reply to Fabian Vogt from comment #14)

> No idea. Apparently it worked for Takashi even on that model, but I don't
> know the details. Maybe it was a setup without shim.

Thanks. Let's check if Takashi could help to provide the information ..

Hi Takashi,

Would you please help to check shim was used in your testing ? If not would it be possible to use 'shim-install' to install shim and grub and disable secure boot to test ?

> > Did you use the exactly the same
> > kernel options with openSUSE's to test 5.4 ?
> 
> The options which worked for 5.4 didn't work for 5.6+.

Sorry I didn't make it clear. I meant your 5.4 which didn't work and openSUSE kernel's 5.4 which worked. I am curious if you built with exactly the same kernel config for testing in case of leaving any option relevant behind ?
Comment 24 Fabian Vogt 2020-05-25 10:14:20 UTC
(In reply to Michael Chang from comment #23)
> (In reply to Fabian Vogt from comment #22)
> > > Did you use the exactly the same
> > > kernel options with openSUSE's to test 5.4 ?
> > 
> > The options which worked for 5.4 didn't work for 5.6+.
> 
> Sorry I didn't make it clear. I meant your 5.4 which didn't work and
> openSUSE kernel's 5.4 which worked. I am curious if you built with exactly
> the same kernel config for testing in case of leaving any option relevant
> behind ?

No, for quicker testing I actually edited the x86 defconfig. I'll build the same config as the working 5.4 and report back.
Comment 25 Takashi Iwai 2020-05-25 10:28:03 UTC
(In reply to Michael Chang from comment #23)
> (In reply to Fabian Vogt from comment #22)
> > (In reply to Michael Chang from comment #21)
> > > (In reply to Fabian Vogt from comment #18)
> > > > (In reply to Gary Ching-Pang Lin from comment #15)
> > > > > (In reply to Fabian Vogt from comment #14)
> 
> > No idea. Apparently it worked for Takashi even on that model, but I don't
> > know the details. Maybe it was a setup without shim.
> 
> Thanks. Let's check if Takashi could help to provide the information ..
> 
> Hi Takashi,
> 
> Would you please help to check shim was used in your testing ? If not would
> it be possible to use 'shim-install' to install shim and grub and disable
> secure boot to test ?

What's the best way to figure out?  It's been linuxefi and shim was installed.
To be sure, I ran shim-install, rebooted and it worked.  Below is grub.cfg file

The system is a bit outdated, based on the fresh TW installation in weeks ago with 5.6.10 KOTD.
Comment 26 Takashi Iwai 2020-05-25 10:29:04 UTC
Created attachment 838104 [details]
grub.cfg on the working Dell Precision 5510
Comment 27 Fabian Vogt 2020-05-26 10:34:10 UTC
(In reply to Fabian Vogt from comment #24)
> (In reply to Michael Chang from comment #23)
> > (In reply to Fabian Vogt from comment #22)
> > > > Did you use the exactly the same
> > > > kernel options with openSUSE's to test 5.4 ?
> > > 
> > > The options which worked for 5.4 didn't work for 5.6+.
> > 
> > Sorry I didn't make it clear. I meant your 5.4 which didn't work and
> > openSUSE kernel's 5.4 which worked. I am curious if you built with exactly
> > the same kernel config for testing in case of leaving any option relevant
> > behind ?
> 
> No, for quicker testing I actually edited the x86 defconfig. I'll build the
> same config as the working 5.4 and report back.

It indeed is the kernel config!

Broken:
# CONFIG_RESET_ATTACK_MITIGATION is not set

Works:
CONFIG_RESET_ATTACK_MITIGATION=y

So for booting to succeed, the option has to be turned on.

None of my self-compiled kernels worked initially because x86_64_defconfig does not have that enabled.
The distro kernel also switched it off between 5.4 and 5.6.
Comment 28 Jiri Slaby 2020-05-26 10:44:02 UTC
(In reply to Fabian Vogt from comment #27)
> The distro kernel also switched it off between 5.4 and 5.6.

(due to bug 1163396)
Comment 29 Takashi Iwai 2020-05-26 10:51:52 UTC
That's interesting...  We can fiddle this via efivars sysfs on the fly?
Comment 30 Fabian Vogt 2020-05-26 15:33:53 UTC
Many kernel builds, soft and hard reboots later, I think I'm quite a bit closer.

By adding some efi_printk in strategic places, I narrowed the place where the kernel gets stuck down to __calc_tpm2_event_size.

In line 229 (https://github.com/torvalds/linux/blob/v5.7-rc6/include/linux/tpm_eventlog.h#L229) the value of efispecid->num_algs is garbage (0xC269276A). Making __calc_tpm2_event_size return 0 early makes the kernel boot reliably again.

With disabled "reset attack mitigation", the loop searches for alg_id == 0x4 first, then 0x2777 (garbage?). With "reset attack mitigation" enabled, it searches for 0x4 first, but then for 0x0, which it encounters quite early in the garbage pile that is efispecid->digest_sizes and so the loop terminates.

So if I'm not mistaken, this is some uninitialized memory read, so it's essentially random whether any particular way of booting works or not. I'm really not familiar with this area of boot code, so I'm not sure whether this is a bootloader or kernel issue, though I assume that it's the latter at this point.
Comment 31 Joey Lee 2020-05-27 02:57:15 UTC
(In reply to Fabian Vogt from comment #30)
> Many kernel builds, soft and hard reboots later, I think I'm quite a bit
> closer.
> 
> By adding some efi_printk in strategic places, I narrowed the place where
> the kernel gets stuck down to __calc_tpm2_event_size.
> 
> In line 229
> (https://github.com/torvalds/linux/blob/v5.7-rc6/include/linux/tpm_eventlog.
> h#L229) the value of efispecid->num_algs is garbage (0xC269276A). Making
> __calc_tpm2_event_size return 0 early makes the kernel boot reliably again.
> 
> With disabled "reset attack mitigation", the loop searches for alg_id == 0x4
> first, then 0x2777 (garbage?). With "reset attack mitigation" enabled, it
> searches for 0x4 first, but then for 0x0, which it encounters quite early in
> the garbage pile that is efispecid->digest_sizes and so the loop terminates.
> 
> So if I'm not mistaken, this is some uninitialized memory read, so it's
> essentially random whether any particular way of booting works or not. I'm
> really not familiar with this area of boot code, so I'm not sure whether
> this is a bootloader or kernel issue, though I assume that it's the latter
> at this point.

Thanks for Fabian's debugging! I think that Ard's patch is against this issue.

https://lkml.org/lkml/2020/5/17/127

Date: Sun, 17 May 2020 14:57:54 +0200
From: Ard Biesheuvel <ardb@kernel.org>
Subject: [PATCH 7/7] tpm: check event log version before reading final events

From: Loic Yhuel <loic.yhuel@gmail.com>

This fixes the boot issues since 5.3 on several Dell models when the TPM
is enabled. Depending on the exact grub binary, booting the kernel would
freeze early, or just report an error parsing the final events log.

We get an event log in the SHA-1 format, which doesn't have a
tcg_efi_specid_event_head in the first event, and there is a final events
table which doesn't match the crypto agile format.
__calc_tpm2_event_size reads bad "count" and "efispecid->num_algs", and
either fails, or loops long enough for the machine to be appear frozen.

So we now only parse the final events table, which is per the spec always
supposed to be in the crypto agile format, when we got a event log in this
format.
Comment 32 Jiri Slaby 2020-05-27 05:55:55 UTC
Created attachment 838222 [details]
tpm: check event log version before reading final events

(In reply to Joey Lee from comment #31)
> https://lkml.org/lkml/2020/5/17/127
> 
> Date: Sun, 17 May 2020 14:57:54 +0200
> From: Ard Biesheuvel <ardb@kernel.org>
> Subject: [PATCH 7/7] tpm: check event log version before reading final events

If you could check it, I would commit it.
Comment 33 Fabian Vogt 2020-05-27 05:58:46 UTC
(In reply to Joey Lee from comment #31)
> Thanks for Fabian's debugging! I think that Ard's patch is against this
> issue.
> 
> https://lkml.org/lkml/2020/5/17/127

Yep, that's a perfect match. I'll still give it a try to be sure though.

If I had debugged that two weeks earlier, it wouldn't have been a waste of (quite a lot of) time...

Can this be backported? By that I mean to the TW kernel as well as SLE/Leap.
Comment 34 Fabian Vogt 2020-05-27 06:16:12 UTC
(In reply to Fabian Vogt from comment #33)
> (In reply to Joey Lee from comment #31)
> > Thanks for Fabian's debugging! I think that Ard's patch is against this
> > issue.
> > 
> > https://lkml.org/lkml/2020/5/17/127
> 
> Yep, that's a perfect match. I'll still give it a try to be sure though.

I did - it made no difference.
Looking at the patch, it's obvious that it can't work actually. The freeze happens before any of the patched code is run.

BTW: I have the TPM disabled here in the Firmware Setup. Maybe this functionality is handled by the firmware even without the TPM though.
Comment 35 Joey Lee 2020-05-27 06:37:57 UTC
(In reply to Fabian Vogt from comment #34)
> (In reply to Fabian Vogt from comment #33)
> > (In reply to Joey Lee from comment #31)
> > > Thanks for Fabian's debugging! I think that Ard's patch is against this
> > > issue.
> > > 
> > > https://lkml.org/lkml/2020/5/17/127
> > 
> > Yep, that's a perfect match. I'll still give it a try to be sure though.
> 
> I did - it made no difference.
> Looking at the patch, it's obvious that it can't work actually. The freeze
> happens before any of the patched code is run.
> 
> BTW: I have the TPM disabled here in the Firmware Setup. Maybe this
> functionality is handled by the firmware even without the TPM though.

hm.... 

The efi_enable_reset_attack_mitigation() is before efi_retrieve_tpm2_eventlog(). But the efi_enable_reset_attack_mitigation() just only set a variable. It should not cause that the behavior of efi_retrieve_tpm2_eventlog() be changed.
Comment 36 Joey Lee 2020-05-27 06:44:26 UTC
(In reply to Fabian Vogt from comment #27)
> (In reply to Fabian Vogt from comment #24)
> > (In reply to Michael Chang from comment #23)
> > > (In reply to Fabian Vogt from comment #22)
> > > > > Did you use the exactly the same
> > > > > kernel options with openSUSE's to test 5.4 ?
> > > > 
> > > > The options which worked for 5.4 didn't work for 5.6+.
> > > 
> > > Sorry I didn't make it clear. I meant your 5.4 which didn't work and
> > > openSUSE kernel's 5.4 which worked. I am curious if you built with exactly
> > > the same kernel config for testing in case of leaving any option relevant
> > > behind ?
> > 
> > No, for quicker testing I actually edited the x86 defconfig. I'll build the
> > same config as the working 5.4 and report back.
> 
> It indeed is the kernel config!
> 
> Broken:
> # CONFIG_RESET_ATTACK_MITIGATION is not set
> 
> Works:
> CONFIG_RESET_ATTACK_MITIGATION=y
> 
> So for booting to succeed, the option has to be turned on.
> 
> None of my self-compiled kernels worked initially because x86_64_defconfig
> does not have that enabled.
> The distro kernel also switched it off between 5.4 and 5.6.

Hi Fabian, 

Could you please attach the dmesg when RESET_ATTACK_MITIGATION=y ?

On the other hand, can this issue be reproduced when TPM be enabled in firmware?
Can system boot with "tpm: check event log version before reading final events" patch when TPM be enabled?

Thanks
Comment 37 Takashi Iwai 2020-05-27 07:08:56 UTC
I've been trying to reproduce the problem on the same model here, but in vain, I never see it.  It really depends on the setup details...
Comment 38 Fabian Vogt 2020-05-27 07:20:58 UTC
Created attachment 838227 [details]
dmesg of boot with 5.4

(In reply to Joey Lee from comment #36)
> Hi Fabian, 
> 
> Could you please attach the dmesg when RESET_ATTACK_MITIGATION=y ?

Done. Though I'm not sure whether this was a boot with or without shim.

> On the other hand, can this issue be reproduced when TPM be enabled in
> firmware?
> Can system boot with "tpm: check event log version before reading final
> events" patch when TPM be enabled?

I have the suspicion that the issue might not be reproducible after enabling the TPM,
so I'd rather not try that until absolutely necessary.

> Thanks
Comment 39 Joey Lee 2020-05-27 07:42:43 UTC
Hi Takashi,

(In reply to Takashi Iwai from comment #37)
> I've been trying to reproduce the problem on the same model here, but in
> vain, I never see it.  It really depends on the setup details...

In dmesg, The firmware version of Fabian Vogt's machine is:

Mai 27 08:09:14 linux-e202 kernel: DMI: Dell Inc. Precision 5510/0N8J4R, BIOS 1.2.18 12/01/2016

What's your firmware version?

Thanks
Comment 40 Takashi Iwai 2020-05-27 08:06:00 UTC
(In reply to Joey Lee from comment #39)
> Hi Takashi,
> 
> (In reply to Takashi Iwai from comment #37)
> > I've been trying to reproduce the problem on the same model here, but in
> > vain, I never see it.  It really depends on the setup details...
> 
> In dmesg, The firmware version of Fabian Vogt's machine is:
> 
> Mai 27 08:09:14 linux-e202 kernel: DMI: Dell Inc. Precision 5510/0N8J4R,
> BIOS 1.2.18 12/01/2016

Mine is:
  Dell Inc. Precision 5510/0N8J4R, BIOS 1.5.1 10/17/2017

So it's a bit newer firmware.
Comment 41 Joey Lee 2020-05-27 09:06:16 UTC
(In reply to Takashi Iwai from comment #40)
> (In reply to Joey Lee from comment #39)
> > Hi Takashi,
> > 
> > (In reply to Takashi Iwai from comment #37)
> > > I've been trying to reproduce the problem on the same model here, but in
> > > vain, I never see it.  It really depends on the setup details...
> > 
> > In dmesg, The firmware version of Fabian Vogt's machine is:
> > 
> > Mai 27 08:09:14 linux-e202 kernel: DMI: Dell Inc. Precision 5510/0N8J4R,
> > BIOS 1.2.18 12/01/2016
> 
> Mine is:
>   Dell Inc. Precision 5510/0N8J4R, BIOS 1.5.1 10/17/2017
> 
> So it's a bit newer firmware.

There have the following firmware update for Precision 5510 on Dell's support website: 

1.13.1   06 Jan 2020
1.12.0	21 October 2019	
1.11.2	28 August 2019	
1.9.0	23 August 2019	
1.10.0	20 March 2019	
1.8.0	05 October 2018	
1.7.0	04 May 2018	
1.6.1	06 February 2018	
1.5.1	14 November 2017	
1.4.0	15 September 2017	

Looks that 1.5.1 has some changes about PCR and variable:
https://www.dell.com/support/home/zh-tw/drivers/driversdetails?driverid=ncyp7&oscode=wt64a&productcode=precision-m5510-workstation

Hi Fabian, 

Do you care to update your machine's firmware?
Comment 42 Takashi Iwai 2020-05-27 09:09:35 UTC
IMO, it's even better to keep the firmware as is.  Otherwise we may loose the test environment that can reproduce the problem.
Comment 43 Fabian Vogt 2020-05-27 09:14:42 UTC
(In reply to Takashi Iwai from comment #42)
> IMO, it's even better to keep the firmware as is.  Otherwise we may loose
> the test environment that can reproduce the problem.

Agreed. In addition, the dock is currently perfectly stable here, and any firmware change might cause the same issues that others are facing...
Comment 44 Fabian Vogt 2020-05-27 10:06:26 UTC
Apparently corrupted log entries are a bug in shim:

https://github.com/rhboot/shim/issues/184

That's been fixed quite some time ago, can I you give me a shim.efi with that fix included for testing?
Comment 45 Gary Ching-Pang Lin 2020-05-28 03:22:28 UTC
(In reply to Fabian Vogt from comment #44)
> Apparently corrupted log entries are a bug in shim:
> 
> https://github.com/rhboot/shim/issues/184
> 
> That's been fixed quite some time ago, can I you give me a shim.efi with
> that fix included for testing?

Here is the shim with the tpm fix in my branch:
https://build.opensuse.org/package/binaries/home:gary_lin:branches:devel:openSUSE:Factory/shim/standard
Comment 46 Fabian Vogt 2020-05-28 06:23:09 UTC
The patched shim didn't help either, unfortunately. The garbage alg_id was different (though it might be different on each boot), but still with the same broken num_algs value.
Comment 47 Joey Lee 2020-05-29 05:22:36 UTC
(In reply to Fabian Vogt from comment #34)
> (In reply to Fabian Vogt from comment #33)
> > (In reply to Joey Lee from comment #31)
> > > Thanks for Fabian's debugging! I think that Ard's patch is against this
> > > issue.
> > > 
> > > https://lkml.org/lkml/2020/5/17/127
> > 
> > Yep, that's a perfect match. I'll still give it a try to be sure though.
> 
> I did - it made no difference.
> Looking at the patch, it's obvious that it can't work actually. The freeze
> happens before any of the patched code is run.
> 
> BTW: I have the TPM disabled here in the Firmware Setup. Maybe this
> functionality is handled by the firmware even without the TPM though.

Hi Fabian, 

Could you please attach your ACPI dump please?
# acpidump > acpidump.raw

I am checking the TCG's TPM2 spec, there have four different implementations of the TPM visibility to OS. I want to find the behavior of get_event_log when TPM be disabled in spec.

Thanks
Comment 48 Fabian Vogt 2020-05-29 06:11:06 UTC
Created attachment 838333 [details]
ACPI tables

(In reply to Joey Lee from comment #47)
> (In reply to Fabian Vogt from comment #34)
> > (In reply to Fabian Vogt from comment #33)
> > > (In reply to Joey Lee from comment #31)
> > > > Thanks for Fabian's debugging! I think that Ard's patch is against this
> > > > issue.
> > > > 
> > > > https://lkml.org/lkml/2020/5/17/127
> > > 
> > > Yep, that's a perfect match. I'll still give it a try to be sure though.
> > 
> > I did - it made no difference.
> > Looking at the patch, it's obvious that it can't work actually. The freeze
> > happens before any of the patched code is run.
> > 
> > BTW: I have the TPM disabled here in the Firmware Setup. Maybe this
> > functionality is handled by the firmware even without the TPM though.
> 
> Hi Fabian, 
> 
> Could you please attach your ACPI dump please?
> # acpidump > acpidump.raw

I used acpidump -b in the hope that the binary format is easier to analyze
programatically. I made it after booting 5.4 with linuxefi, but without shim.

> I am checking the TCG's TPM2 spec, there have four different implementations
> of the TPM visibility to OS. I want to find the behavior of get_event_log
> when TPM be disabled in spec.

Some more details: In the firmware setup, on the Security -> TPM 2.0 screen it's completely disabled, so that the other options (clear, configuration) aren't even shown.

> Thanks
Comment 49 Joey Lee 2020-05-29 17:46:28 UTC
(In reply to Fabian Vogt from comment #48)
> Created attachment 838333 [details]
> ACPI tables

I believe that the behavior of Dell's "Precision 5510/0N8J4R, BIOS 1.2.18 12/01/2016" firmware violated the TCG TPM2 spec. And I didn't see the possibility to workaround by kernel.

There have two firmware issues:

- The first and also the main bug is that the return value of EFI_TCG2_GET_EVENT_LOG is wrong when TPM be disabled. The TCG EFI Protocol Specification v00.13 indicated:

6.5 EFI_TCG2_PROTOCOL.GetEventLog
...
4. If no TPM is present, the function SHALL set the following values and return
   EFI_SUCCESS:
        EventLogLocation = NULL                                                         
        EventLogLastEntry = NULL
        EventLogTruncated = FALSE

So, base on the above rule 4, get_event_log returns EFI_SUCCESS when _NO_ TPM. But Dell's firmware didn't return EventLogLocation=NULL and EventLogLastEntry=NULL. It causes that kernel pass the return value checking logic in efi_retrieve_tpm2_eventlog() and runs into the __calc_tpm2_event_size(). Then kernel stuck in __calc_tpm2_event_size()

I don't think that this firmware issue can be workaround by kernel because the returned EventLogLocation and EventLogLastEntry are not NULL or ZERO. Kernel have no other way to detect TPM existence in EFI stub stage.


- The second issue is that the behavior of TPM object in ACPI tables is strange from Dell:

In TCG PC Client Platform Firmware Profile Specification v1.04, it indicates there have four implementation type for TPM Visibility to OS. But Dell's implementation of ACPI tables didn't follow any of them:

6.1 TPM Visibility to the OS

  1. When the TPM is visible to the OS and enabled:

  2. When the TPM is visible to the OS but disabled:
    a. The TPM2 ACPI Table SHALL be listed in the RSDT and XSDT tables.
    b. The TPM device object SHALL be present in the ACPI namespace and STA() SHALL return 0xF.

  3. When the TPM is discoverable but hidden from the OS:
    a. The TPM2 ACPI Table SHALL be listed in RSDT and XDST tables.
    b. The TPM device object SHALL be present in the ACPI namespace and STA() SHALL return 0x0.

  4. When the TPM is not discoverable and is hidden from the OS:
    a. The TPM2 ACPI Table SHALL NOT be listed in the RSDT or XSDT tables.
    b. The TPM device object SHALL NOT be present in the ACPI namespace.

Because the TPM 2.0 option be disabled in firmware UI, So case 1. is not our target. And, the TPM device object be present in DSDT from Dell firmware. So case 4. is also not our target. 

Then checking the behavior of _STA in TPM device object, it returns Zero because TTDP and TPMF be set to Zero when TPM 2.0 be disabled. So, looks that Dell's firmware closes to the case 3 implementation: "When the TPM is discoverable but hidden from the OS"

But, Dell's firmware didn't expose TPM2 ACPI table. So it violated the rule a. of case 3. This is not a key problem of this bug, but it still a firmware issue.
Comment 50 Joey Lee 2020-06-01 11:02:56 UTC
(In reply to Joey Lee from comment #49)
> (In reply to Fabian Vogt from comment #48)
> > Created attachment 838333 [details]
> > ACPI tables
> 
> I believe that the behavior of Dell's "Precision 5510/0N8J4R, BIOS 1.2.18
> 12/01/2016" firmware violated the TCG TPM2 spec. And I didn't see the
> possibility to workaround by kernel.
> 
> There have two firmware issues:
> 
> - The first and also the main bug is that the return value of
> EFI_TCG2_GET_EVENT_LOG is wrong when TPM be disabled. The TCG EFI Protocol
> Specification v00.13 indicated:
> 
> 6.5 EFI_TCG2_PROTOCOL.GetEventLog
> ...
> 4. If no TPM is present, the function SHALL set the following values and
> return
>    EFI_SUCCESS:
>         EventLogLocation = NULL                                             
> 
>         EventLogLastEntry = NULL
>         EventLogTruncated = FALSE
> 
> So, base on the above rule 4, get_event_log returns EFI_SUCCESS when _NO_
> TPM. But Dell's firmware didn't return EventLogLocation=NULL and
> EventLogLastEntry=NULL. It causes that kernel pass the return value checking
> logic in efi_retrieve_tpm2_eventlog() and runs into the
> __calc_tpm2_event_size(). Then kernel stuck in __calc_tpm2_event_size()
> 
> I don't think that this firmware issue can be workaround by kernel because
> the returned EventLogLocation and EventLogLastEntry are not NULL or ZERO.
> Kernel have no other way to detect TPM existence in EFI stub stage.
> 
> 
> - The second issue is that the behavior of TPM object in ACPI tables is
> strange from Dell:
> 
> In TCG PC Client Platform Firmware Profile Specification v1.04, it indicates
> there have four implementation type for TPM Visibility to OS. But Dell's
> implementation of ACPI tables didn't follow any of them:
> 
> 6.1 TPM Visibility to the OS
> 
>   1. When the TPM is visible to the OS and enabled:
> 
>   2. When the TPM is visible to the OS but disabled:
>     a. The TPM2 ACPI Table SHALL be listed in the RSDT and XSDT tables.
>     b. The TPM device object SHALL be present in the ACPI namespace and
> STA() SHALL return 0xF.
> 
>   3. When the TPM is discoverable but hidden from the OS:
>     a. The TPM2 ACPI Table SHALL be listed in RSDT and XDST tables.
>     b. The TPM device object SHALL be present in the ACPI namespace and
> STA() SHALL return 0x0.
> 
>   4. When the TPM is not discoverable and is hidden from the OS:
>     a. The TPM2 ACPI Table SHALL NOT be listed in the RSDT or XSDT tables.
>     b. The TPM device object SHALL NOT be present in the ACPI namespace.
> 
> Because the TPM 2.0 option be disabled in firmware UI, So case 1. is not our
> target. And, the TPM device object be present in DSDT from Dell firmware. So
> case 4. is also not our target. 
> 
> Then checking the behavior of _STA in TPM device object, it returns Zero
> because TTDP and TPMF be set to Zero when TPM 2.0 be disabled. So, looks
> that Dell's firmware closes to the case 3 implementation: "When the TPM is
> discoverable but hidden from the OS"
> 
> But, Dell's firmware didn't expose TPM2 ACPI table. So it violated the rule
> a. of case 3. This is not a key problem of this bug, but it still a firmware
> issue.

Base on TCG EFI Protocol Specification, it defined a signature field in TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed before paring TCG_EfiSpecIDEventStruct.
Comment 51 Joey Lee 2020-06-02 07:57:00 UTC
(In reply to Joey Lee from comment #50)
> (In reply to Joey Lee from comment #49)
[...snip]
>
> Base on TCG EFI Protocol Specification, it defined a signature field in
> TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> before paring TCG_EfiSpecIDEventStruct.

hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It still has possibility that the 16 bytes are zero from a garbage data chunk. So we can not use signature field to filter out garbage event log.
Comment 52 Fabian Vogt 2020-06-02 09:39:56 UTC
I checked the wording of the TPM config in the FW setup.
In its current state, there's just a single checkbox with "TPM On", which is unchecked.

The help text says that this means the TPM remains in its current state (not sure what that is, can't see it) but is invisible to the OS.
Comment 53 Fabian Vogt 2020-06-02 14:24:37 UTC
I did one more check, with patched kernel (comment 31) and without shim.
There is no log entry created (as expected), so efi_call_proto(tcg2_protocol, get_event_log, ...) sets log_last_entry to 0.

(In reply to Joey Lee from comment #51)
> (In reply to Joey Lee from comment #50)
> > (In reply to Joey Lee from comment #49)
> [...snip]
> >
> > Base on TCG EFI Protocol Specification, it defined a signature field in
> > TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> > before paring TCG_EfiSpecIDEventStruct.
> 
> hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It
> still has possibility that the 16 bytes are zero from a garbage data chunk.
> So we can not use signature field to filter out garbage event log.

I found the relevant part of the specification and I interpret it differently:

5.3 Event Log Header
> The fields of the event log header are defined to be PCRIndex of 0, EventType
> of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> TCG_EfiSpecIDEventStruct. This first event is the event log header.

The kernel does not check for any of those zeros currently.

Table 3 then defines that this first event always has it's signature set to
"Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that either.

I'll check what the content of those fields is, maybe that gives a hint.
Comment 54 Fabian Vogt 2020-06-02 14:52:14 UTC
> (In reply to Joey Lee from comment #51)
> > (In reply to Joey Lee from comment #50)
> > > (In reply to Joey Lee from comment #49)
> > [...snip]
> > >
> > > Base on TCG EFI Protocol Specification, it defined a signature field in
> > > TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> > > before paring TCG_EfiSpecIDEventStruct.
> > 
> > hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It
> > still has possibility that the 16 bytes are zero from a garbage data chunk.
> > So we can not use signature field to filter out garbage event log.
> 
> I found the relevant part of the specification and I interpret it
> differently:
> 
> 5.3 Event Log Header
> > The fields of the event log header are defined to be PCRIndex of 0, EventType
> > of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> > TCG_EfiSpecIDEventStruct. This first event is the event log header.
> 
> The kernel does not check for any of those zeros currently.
> 
> Table 3 then defines that this first event always has it's signature set to
> "Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that
> either.
> 
> I'll check what the content of those fields is, maybe that gives a hint.

It indeed does!

In __calc_tpm2_event_size, printing event_header and efispecid shows this:

event_header->pcr_idx = 0x4;
event_header->event_type = 0x80000003;
event_header->digest = 0x4;

efispecid->signature = some unprintable stuff
efispecid->spec_version_major = 0x3;

So the first event is actually not an event log header at all, but rather
the EV_EFI_BOOT_SERVICES_APPLICATION event logged by shim.

So this could be detected by checking for the event header and even log
signatures properly, right?

What is responsible for generating the event log header and making sure that it's the first event in the log?
Comment 55 Joey Lee 2020-06-02 16:36:27 UTC
(In reply to Fabian Vogt from comment #54)
> > (In reply to Joey Lee from comment #51)
> > > (In reply to Joey Lee from comment #50)
> > > > (In reply to Joey Lee from comment #49)
> > > [...snip]
> > > >
> > > > Base on TCG EFI Protocol Specification, it defined a signature field in
> > > > TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> > > > before paring TCG_EfiSpecIDEventStruct.
> > > 
> > > hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It
> > > still has possibility that the 16 bytes are zero from a garbage data chunk.
> > > So we can not use signature field to filter out garbage event log.
> > 
> > I found the relevant part of the specification and I interpret it
> > differently:
> > 
> > 5.3 Event Log Header
> > > The fields of the event log header are defined to be PCRIndex of 0, EventType
> > > of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> > > TCG_EfiSpecIDEventStruct. This first event is the event log header.
> > 
> > The kernel does not check for any of those zeros currently.
> > 
> > Table 3 then defines that this first event always has it's signature set to
> > "Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that
> > either.
> > 
> > I'll check what the content of those fields is, maybe that gives a hint.
> 
> It indeed does!
> 
> In __calc_tpm2_event_size, printing event_header and efispecid shows this:
> 
> event_header->pcr_idx = 0x4;
> event_header->event_type = 0x80000003;
> event_header->digest = 0x4;
> 
> efispecid->signature = some unprintable stuff
> efispecid->spec_version_major = 0x3;
> 
> So the first event is actually not an event log header at all, but rather
> the EV_EFI_BOOT_SERVICES_APPLICATION event logged by shim.
> 
> So this could be detected by checking for the event header and even log
> signatures properly, right?
> 

Not 100% works. Because garbage chunk still has high probability to show a ZERO binary in the signature location.

Base on spec, the right behavior of firmware is to return NULL EventLogLocation and NULL EventLogLastEntry when GetEventLog be call.

Another potential problem is that if any old machine does not follow spec to set the signature field, then the TCG_EfiSpecIDEventStruct event log will not work with patched kernel. I think that part must be confirmed by kernel upstream experts.

> What is responsible for generating the event log header and making sure that
> it's the first event in the log?

Base on spec, firmware should generates the first TCG_EfiSpecIDEventStruct event log header. Then other components add log to event log array. I think that the behavior of shim must be checked when TPM be disabled. At least shim should checks the TCG_EfiSpecIDEventStruct log.

Hi Gary, 

Could you please help to check the behavior of EV_EFI_BOOT_SERVICES_APPLICATION in shim when TPM be disabled?

Thanks
Comment 56 Joey Lee 2020-06-02 16:40:42 UTC
(In reply to Fabian Vogt from comment #53)
> I did one more check, with patched kernel (comment 31) and without shim.
> There is no log entry created (as expected), so
> efi_call_proto(tcg2_protocol, get_event_log, ...) sets log_last_entry to 0.
> 
> (In reply to Joey Lee from comment #51)
> > (In reply to Joey Lee from comment #50)
> > > (In reply to Joey Lee from comment #49)
> > [...snip]
> > >
> > > Base on TCG EFI Protocol Specification, it defined a signature field in
> > > TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> > > before paring TCG_EfiSpecIDEventStruct.
> > 
> > hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It
> > still has possibility that the 16 bytes are zero from a garbage data chunk.
> > So we can not use signature field to filter out garbage event log.
> 
> I found the relevant part of the specification and I interpret it
> differently:
> 
> 5.3 Event Log Header
> > The fields of the event log header are defined to be PCRIndex of 0, EventType
> > of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> > TCG_EfiSpecIDEventStruct. This first event is the event log header.
> 
> The kernel does not check for any of those zeros currently.
> 
> Table 3 then defines that this first event always has it's signature set to
> "Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that
> either.

The ZERO signature be used in TCG_PCR_EVENT

https://docs.microsoft.com/en-us/windows/win32/api/tbs/nf-tbs-tbsi_get_tcg_log_ex

If the Signature member is not set to a null-terminated ASCII string of "Spec ID Event03", then the events in the log are of type TCG_PCR_EVENT, and the TCG_EfiSpecIdEventStruct structure does not contain the NumberOfAlgorithms and DigestSizes members.
Comment 57 Fabian Vogt 2020-06-02 17:15:51 UTC
(In reply to Joey Lee from comment #55)
> (In reply to Fabian Vogt from comment #54)
> > So the first event is actually not an event log header at all, but rather
> > the EV_EFI_BOOT_SERVICES_APPLICATION event logged by shim.
> > 
> > So this could be detected by checking for the event header and even log
> > signatures properly, right?
> 
> Not 100% works. Because garbage chunk still has high probability to show a
> ZERO binary in the signature location.

Where does it say in the specification that all zeros is an allowed value
for the signature?

> Base on spec, the right behavior of firmware is to return NULL
> EventLogLocation and NULL EventLogLastEntry when GetEventLog be call.
> 
> Another potential problem is that if any old machine does not follow spec to
> set the signature field, then the TCG_EfiSpecIDEventStruct event log will
> not work with patched kernel. I think that part must be confirmed by kernel
> upstream experts.

Apparently other OSs already check this value, so I would hope that this is ok.
If not, there should be some explicit code to make sure it's not reading garbage
in any case. At least it could check that
(void*)&efispecid->digest_sizes[efispecid->num_algs] < (void*)log_last_entry

(In reply to Joey Lee from comment #56)
> (In reply to Fabian Vogt from comment #53)
> > I found the relevant part of the specification and I interpret it
> > differently:
> > 
> > 5.3 Event Log Header
> > > The fields of the event log header are defined to be PCRIndex of 0, EventType
> > > of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> > > TCG_EfiSpecIDEventStruct. This first event is the event log header.
> > 
> > The kernel does not check for any of those zeros currently.
> > 
> > Table 3 then defines that this first event always has it's signature set to
> > "Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that
> > either.
> 
> The ZERO signature be used in TCG_PCR_EVENT
> 
> https://docs.microsoft.com/en-us/windows/win32/api/tbs/nf-tbs-
> tbsi_get_tcg_log_ex
>
> If the Signature member is not set to a null-terminated ASCII string of
> "Spec ID Event03", then the events in the log are of type TCG_PCR_EVENT, and
> the TCG_EfiSpecIdEventStruct structure does not contain the
> NumberOfAlgorithms and DigestSizes members.

The kernel does not check this and uses NumberOfAlgorithms/DigestSizes
anyway. So for a zero signature the behaviour is already broken.

In this error case here where the log header is missing, the first 16 bytes can
never be zero because it contains a TCG_EVENTTYPE which starts at 0x80000000.
So even if a zeroed signature is allowed, it would not let the error case pass.
Comment 58 Gary Ching-Pang Lin 2020-06-04 03:23:36 UTC
(In reply to Joey Lee from comment #55)
> (In reply to Fabian Vogt from comment #54)
> > > (In reply to Joey Lee from comment #51)
> > > > (In reply to Joey Lee from comment #50)
> > > > > (In reply to Joey Lee from comment #49)
> > > > [...snip]
> > > > >
> > > > > Base on TCG EFI Protocol Specification, it defined a signature field in
> > > > > TCG_EfiSpecIDEventStruct. Maybe we can check the value of signature filed
> > > > > before paring TCG_EfiSpecIDEventStruct.
> > > > 
> > > > hm... Unfortunately the signature should be "Spec ID Event03" or "ZERO". It
> > > > still has possibility that the 16 bytes are zero from a garbage data chunk.
> > > > So we can not use signature field to filter out garbage event log.
> > > 
> > > I found the relevant part of the specification and I interpret it
> > > differently:
> > > 
> > > 5.3 Event Log Header
> > > > The fields of the event log header are defined to be PCRIndex of 0, EventType
> > > > of EV_NO_ACTION, Digest of 20 bytes of 0, and Event content defined as
> > > > TCG_EfiSpecIDEventStruct. This first event is the event log header.
> > > 
> > > The kernel does not check for any of those zeros currently.
> > > 
> > > Table 3 then defines that this first event always has it's signature set to
> > > "Spec ID Event03", so zeros aren't allowed. The kernel doesn't check that
> > > either.
> > > 
> > > I'll check what the content of those fields is, maybe that gives a hint.
> > 
> > It indeed does!
> > 
> > In __calc_tpm2_event_size, printing event_header and efispecid shows this:
> > 
> > event_header->pcr_idx = 0x4;
> > event_header->event_type = 0x80000003;
> > event_header->digest = 0x4;
> > 
> > efispecid->signature = some unprintable stuff
> > efispecid->spec_version_major = 0x3;
> > 
> > So the first event is actually not an event log header at all, but rather
> > the EV_EFI_BOOT_SERVICES_APPLICATION event logged by shim.
> > 
> > So this could be detected by checking for the event header and even log
> > signatures properly, right?
> > 
> 
> Not 100% works. Because garbage chunk still has high probability to show a
> ZERO binary in the signature location.
> 
> Base on spec, the right behavior of firmware is to return NULL
> EventLogLocation and NULL EventLogLastEntry when GetEventLog be call.
> 
> Another potential problem is that if any old machine does not follow spec to
> set the signature field, then the TCG_EfiSpecIDEventStruct event log will
> not work with patched kernel. I think that part must be confirmed by kernel
> upstream experts.
> 
> > What is responsible for generating the event log header and making sure that
> > it's the first event in the log?
> 
> Base on spec, firmware should generates the first TCG_EfiSpecIDEventStruct
> event log header. Then other components add log to event log array. I think
> that the behavior of shim must be checked when TPM be disabled. At least
> shim should checks the TCG_EfiSpecIDEventStruct log.
> 
> Hi Gary, 
> 
> Could you please help to check the behavior of
> EV_EFI_BOOT_SERVICES_APPLICATION in shim when TPM be disabled?
> 
> Thanks

Shim probes EFI_TCG_PROTOCOL/EFI_TCG2_PROTOCOL to check the existence of TPM and then measures the UEFI image it loads and the variables it touches. The EV_EFI_BOOT_SERVICES_APPLICATION event is generated when shim loads grub2. If Secure boot is enabled, shim will also measure the images from grub2 when verifying them.

Per comment#52, the firmware likely provides the TCG protocol during the Boot Service phase, so shim and efi stub both detect the existence of TPM and log the events.
Comment 59 Joey Lee 2020-06-04 10:21:38 UTC
(In reply to Gary Ching-Pang Lin from comment #58)
> (In reply to Joey Lee from comment #55)
[...snip]
> > Hi Gary, 
> > 
> > Could you please help to check the behavior of
> > EV_EFI_BOOT_SERVICES_APPLICATION in shim when TPM be disabled?
> > 
> > Thanks
> 
> Shim probes EFI_TCG_PROTOCOL/EFI_TCG2_PROTOCOL to check the existence of TPM
> and then measures the UEFI image it loads and the variables it touches. The
> EV_EFI_BOOT_SERVICES_APPLICATION event is generated when shim loads grub2.
> If Secure boot is enabled, shim will also measure the images from grub2 when
> verifying them.
> 
> Per comment#52, the firmware likely provides the TCG protocol during the
> Boot Service phase, so shim and efi stub both detect the existence of TPM
> and log the events.

Base on the TCG PC Client Platform Firmware Profile spec, The EFI_TCG2_EFI_PROTOCOL SHALL NOT be present only on type 4 "When the TPM is not discoverable and is hidden from the OS".
Comment 60 Joey Lee 2020-06-04 10:31:39 UTC
I have built a new tpm2.0-tools on my home:joeyli:branches:security branch on OBS (only openSUSE TW build success). The tpm2_eventlog tool can be used to parsing tpm event log on userland.

If everything works when booting, kernel should generated a sysfs file that it be used to expose tpm2 event log. Just direct cat it then using tpm2_eventlog for parsing. e.g.

# cat /sys/kernel/security/tpm0/binary_bios_measurements > eventlog.bin
# tpm2_eventlog -V eventlog.bin | less

---
- Event[0]:
  pcrIndex: 0
  eventType: EV_NO_ACTION
  digest: 0000000000000000000000000000000000000000
  eventDataSize: 37
  SpecID:
    - Signature: Spec ID Event03
...

I have checked the tpm2_eventlog tool. It also does not check the signature value of TCG_EfiSpecIDEventStruct event. But it runs many size checking logic in specid_event() function to avoid a garbage chunk.
Comment 61 Fabian Vogt 2020-06-04 11:26:58 UTC
(In reply to Joey Lee from comment #60)
> I have built a new tpm2.0-tools on my home:joeyli:branches:security branch
> on OBS (only openSUSE TW build success). The tpm2_eventlog tool can be used
> to parsing tpm event log on userland.
> 
> If everything works when booting, kernel should generated a sysfs file that
> it be used to expose tpm2 event log. Just direct cat it then using
> tpm2_eventlog for parsing. e.g.
> 
> # cat /sys/kernel/security/tpm0/binary_bios_measurements > eventlog.bin

AFAIU, the efistub has to copy the event log to a configuration table manually,
which is the code we've been discussing here. As it is unable to get the
correct size for events, it can't perform the copying. So AFAICT this can't
work when booting with the missing event log header.

Currently I've got kernel 5.7.0-rc6-1.g3603fcd-default booted with linuxefi
but without shim and the /sys/kernel/security/tpm0/ directory does not exist.
Comment 62 Fabian Vogt 2020-06-05 10:23:29 UTC
Created attachment 838528 [details]
Proposed kernel patch

I made a patch which hardens the tpm driver against the firmware issue.
With this it boots fine here and in a VM with TPM 2.0 the event log is still read correctly.

If there aren't any objections I'll submit it later today or on Monday.
Comment 63 Joey Lee 2020-06-06 02:58:21 UTC
(In reply to Fabian Vogt from comment #62)
> Created attachment 838528 [details]
> Proposed kernel patch
> 
> I made a patch which hardens the tpm driver against the firmware issue.
> With this it boots fine here and in a VM with TPM 2.0 the event log is still
> read correctly.
> 
> If there aren't any objections I'll submit it later today or on Monday.

Fabian's patch looks good to me.

Base on TCG PC Client Platform Firmware Profile spec, I agree with you that kernel can check the "Spec ID Event03" because other EV_NO_ACTION event format should not be handled by EFI stub. 

But... Practically I am not sure that it may causes any problem on old TPM2 machines, because I didn't see other code to restrict Signature field, even tpm2_evenlog.

Let's see what the upstream expert's comment.

Thanks!
Comment 64 Jiri Slaby 2020-06-08 10:11:03 UTC
(In reply to Fabian Vogt from comment #62)
> Created attachment 838528 [details]
> Proposed kernel patch

Could you make zero_digest static? So that it's not created on stack. You don't have to initialize it then too.
Comment 65 Fabian Vogt 2020-06-08 10:28:21 UTC
(In reply to Jiri Slaby from comment #64)
> (In reply to Fabian Vogt from comment #62)
> > Created attachment 838528 [details]
> > Proposed kernel patch
> 
> Could you make zero_digest static? So that it's not created on stack. You
> don't have to initialize it then too.

Possible, but unless the compiler/linker is smart enough to coalesce it with other zeros in .rodata it would allocate space there AFAICT. Those few bytes on the stack should (hopefully) not hurt.

FTR: https://lore.kernel.org/linux-efi/1894249.ujS34B1uSo@linux-e202.suse.de/T/#u
Comment 66 Fabian Vogt 2020-06-15 10:07:38 UTC
(In reply to Fabian Vogt from comment #65)
> (In reply to Jiri Slaby from comment #64)
> > (In reply to Fabian Vogt from comment #62)
> > > Created attachment 838528 [details]
> > > Proposed kernel patch
> > 
> > Could you make zero_digest static? So that it's not created on stack. You
> > don't have to initialize it then too.
> 
> Possible, but unless the compiler/linker is smart enough to coalesce it with
> other zeros in .rodata it would allocate space there AFAICT. Those few bytes
> on the stack should (hopefully) not hurt.
> 
> FTR:
> https://lore.kernel.org/linux-efi/1894249.ujS34B1uSo@linux-e202.suse.de/T/#u

Applied to efi/urgent now: https://lore.kernel.org/linux-efi/CAMj1kXHs6rUUjqJcq+1iSmRfkowSpLJnscdfSrRaf5iTCj3_sQ@mail.gmail.com/

Depending on how quickly it makes it into kernel/stable, can this be added as patch? If you don't think this is worthwhile, feel free to close as FIXED.

The 15 SP2/15.2 kernels appear to have CONFIG_RESET_ATTACK_MITIGATION=y, so it should work for those "by accident".
Comment 67 G.M. Venekamp 2020-06-15 13:00:25 UTC
It seems I ran into this issue today. I upgraded (zypper dup) from 20200610 to 20200612 and than I can no-longer boot the system. At first I though it had to do with the new kernel (5.7.1), however after a rollback (snapper) and locking the kernel (zypper addlock) followed by zypper dup, I can no-longer boot kernel 5.6.14. For what it is worth: I am using a Dell Latitude 7480.
Comment 68 Takashi Iwai 2020-06-15 13:22:27 UTC
(In reply to G.M. Venekamp from comment #67)
> It seems I ran into this issue today. I upgraded (zypper dup) from 20200610
> to 20200612 and than I can no-longer boot the system. At first I though it
> had to do with the new kernel (5.7.1), however after a rollback (snapper)
> and locking the kernel (zypper addlock) followed by zypper dup, I can
> no-longer boot kernel 5.6.14. For what it is worth: I am using a Dell
> Latitude 7480.

This might be rather a bug in intel CPU ucode.  Try to boot with dis_ucode_ldr boot option.
Comment 69 G.M. Venekamp 2020-06-15 13:34:55 UTC
Yes, indeed that works for me. Thanks.
Comment 70 Takashi Iwai 2020-06-15 13:46:02 UTC
(In reply to G.M. Venekamp from comment #69)
> Yes, indeed that works for me. Thanks.

OK, then it's bug 1172856.  The update of ucode-intel with the revert of the buggy firmware is on its way.

Interestingly, I hit the bug above only with Leap 15.1 kernel, and 5.6/5.7 kernel seemed working on my machine.  Maybe it depends on timing or subtle difference...
Comment 71 Fabian Vogt 2020-07-16 08:10:32 UTC
Fix arrived in stable some time ago, closing.