Bug 1214537 - SRSO mitigations break nested virtualization
Summary: SRSO mitigations break nested virtualization
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP5
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other Other
Priority: P5 - None  Severity: Normal
Target Milestone: ---
Assignee: Kernel Bugs
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-08-23 14:54 UTC by Dominique Leuenberger
Modified: 2023-11-27 11:41 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Dominique Leuenberger 2023-08-23 14:54:22 UTC
Identified as part of openQA tests:

https://openqa.opensuse.org/tests/3525465#live

This happens only on a few workers: OW19, OW21, OW22, OW24

So far we have traced this to a regression introduced by a kernel update:

kernel-default-5.14.21-150500.55.12.1.x86_64 <--- WORKING
kernel-default-5.14.21-150500.55.19.1.x86_64 <--- BROKEN

As an experiment, openqaworker24 (OW24) was downgraded to kernel 5.14.21-150500.55.12 and the test was restarted there; it passed.

The tests started failing on those workers on August 14, which coincided with the kernel update.
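
For triaging other workers, a minimal Python sketch of comparing an installed kernel release against the two releases named above. The helper names (release_key, newer_than_known_good) are hypothetical, and the comparison is a naive numeric one, not RPM's own version comparison. It only makes sense within one kernel flavor: as later comments in this bug note, an RT kernel release such as 5.14.21-150500.13.14 follows a separate numbering.

```python
import re

# Releases taken from the bug description (kernel-default flavor only).
KNOWN_GOOD = "5.14.21-150500.55.12.1"
KNOWN_BAD = "5.14.21-150500.55.19.1"


def release_key(release: str) -> tuple[int, ...]:
    """Turn a release string into a tuple of its numeric fields.

    Naive assumption: every run of digits is one comparable field.
    This is NOT the rpm version comparison algorithm.
    """
    return tuple(int(field) for field in re.findall(r"\d+", release))


def newer_than_known_good(release: str) -> bool:
    """True if the release sorts after the last known-good kernel."""
    return release_key(release) > release_key(KNOWN_GOOD)
```

A worker whose kernel-default release is newer than the known-good one is a candidate for the regression; that is a hint for where to look, not proof that the worker is affected.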
Comment 1 Fabian Vogt 2023-08-23 17:00:22 UTC
Looking at the diff between those kernels, the most obvious cause is the introduction of SRSO mitigations. I rebooted an openQA worker with "mitigations=off" passed on the kernel cmdline and nested guests boot properly again.
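
A minimal Python sketch for checking which state a host is in before and after such a reboot. It assumes the kernel exposes the SRSO status in /sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow (kernels without the SRSO patches will not have that file), and it treats a plain "mitigations=off" token in /proc/cmdline as sufficient evidence; the function names are hypothetical.

```python
from pathlib import Path

# Assumed location of the SRSO status file; absent on kernels without SRSO support.
SRSO_SYSFS = Path("/sys/devices/system/cpu/vulnerabilities/spec_rstack_overflow")
CMDLINE = Path("/proc/cmdline")


def mitigations_off(cmdline: str) -> bool:
    """True if the kernel command line contains the token mitigations=off."""
    return "mitigations=off" in cmdline.split()


def srso_status(path: Path = SRSO_SYSFS) -> str:
    """Return the kernel's own SRSO status line, or a placeholder if unreadable."""
    try:
        return path.read_text().strip()
    except OSError:
        return "unknown (status file not readable on this system)"


def report() -> str:
    """Summarize the mitigation state of the running host in one line."""
    try:
        off = mitigations_off(CMDLINE.read_text())
    except OSError:
        off = False  # no /proc/cmdline, e.g. not a Linux host
    return f"mitigations=off on cmdline: {off}; SRSO status: {srso_status()}"
```

Calling print(report()) on a worker before and after the reboot shows whether the command-line change actually took effect.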
Comment 2 Nikolay Borisov 2023-08-23 19:15:13 UTC
Can you provide any more information about this? What exactly breaks? Any logs? Is clang involved in any of the compiled kernels?
Comment 3 Fabian Vogt 2023-08-23 19:51:15 UTC
(In reply to Nikolay Borisov from comment #2)
> Can you provide any more information about this? What exactly breaks,

"qemu-system-x86_64 -nographic -enable-kvm" produces no output.
The QEMU monitor shows that IP is still at FFF0, so the KVM guest appears to make no progress.

> any logs?

With -d cpu_reset, QEMU logs only two entries: one with all zeros and one with the initial CPU state, identical to a working system.

Anything else that could be of help?

> Is clang involved in any of the compiled kernels?

No idea, they're just the SLE kernel binaries.
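
The "no output" symptom above could be turned into a repeatable check along these lines. This is a sketch under assumptions: the helper name produces_output is hypothetical, the 10 second timeout is arbitrary, and whether QEMU behaves the same with stdin redirected to /dev/null has not been verified here. Only the QEMU command line itself is taken from this comment.

```python
import subprocess

# Command from the comment above; everything else in this sketch is an assumption.
QEMU_CMD = ["qemu-system-x86_64", "-nographic", "-enable-kvm"]


def produces_output(cmd: list[str], timeout: float = 10.0) -> bool:
    """Run cmd and report whether it wrote anything non-blank to stdout.

    The process is killed after `timeout` seconds; output captured up to
    that point still counts.
    """
    try:
        result = subprocess.run(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
        out = result.stdout
    except subprocess.TimeoutExpired as exc:
        # exc.stdout holds the partial output, or None if nothing was read.
        out = exc.stdout
    return bool(out and out.strip())
```

On a host with QEMU and KVM available, produces_output(QEMU_CMD) returning False within the timeout would match the broken behavior described here, while a working host is expected to show firmware output on the serial console.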
Comment 4 Fabian Vogt 2023-08-24 15:01:42 UTC
With kernel 6.5-rc7 on openqaworker19 it works; with 6.5-rc6 it fails. So it's likely fixed by one of the SRSO fixes in between, probably "x86/retpoline: Don't clobber RFLAGS during srso_safe_ret()".
Comment 5 Nikolay Borisov 2023-08-25 15:10:48 UTC
The respective fix (alongside some others) has been pushed to sle15-sp4/for-next and sle12-sp5/for-next.
Comment 6 Nikolay Borisov 2023-09-11 07:41:39 UTC
Can this be considered fixed?
Comment 7 Oliver Kurz 2023-09-22 13:18:27 UTC
(In reply to Nikolay Borisov from comment #6)
> Can this be considered fixed?

Can you please reference a submit request that we can follow including the fix?
Comment 8 Nikolay Borisov 2023-09-25 12:15:53 UTC
(In reply to Oliver Kurz from comment #7)
> (In reply to Nikolay Borisov from comment #6)
> > Can this be considered fixed?
> 
> Can you please reference a submit request that we can follow including the
> fix?

The earliest kernel release that contains this commit is:

rpm-5.14.21-150500.13.14
Comment 9 Oliver Kurz 2023-09-25 12:37:18 UTC
Well, if you consider a bug fixed then I suggest you set this bug to "RESOLVED FIXED" accordingly. BUT

(In reply to Nikolay Borisov from comment #8)
> (In reply to Oliver Kurz from comment #7)
> > (In reply to Nikolay Borisov from comment #6)
> > > Can this be considered fixed?
> > 
> > Can you please reference a submit request that we can follow including the
> > fix?
> 
> The earliest kernel where this commit is released is: 
> 
> rpm-5.14.21-150500.13.14

and in the original bug description

kernel-default-5.14.21-150500.55.12.1.x86_64 <--- WORKING
kernel-default-5.14.21-150500.55.19.1.x86_64 <--- BROKEN

so I don't see how version 5.14.21-150500.13.14 would fix the problem. Maybe you mean 5.14.21-150500.55.13.14, but that would still be part of the broken one.
Comment 10 Marcus Meissner 2023-09-25 13:11:54 UTC
5.14.21-150500.13.14 is an RT kernel version, the September update.

This is part of the retracted kernel update set.

Those updates are currently being retested in QA.

This bug does not seem to be in the References though.
Comment 11 Nikolay Borisov 2023-09-25 17:32:48 UTC
(In reply to Marcus Meissner from comment #10)
> 5.14.21-150500.13.14 is a RT kernel version, the September update.
> 
> This is part of the retracted kernel update set.
> 
> they are currently being retested in QA.
> 
> This bug does not seem to be in the References though.

Yes, because those patches were backported proactively and were considered part of the usual git-fixes flow. At the same time, this issue was reported, and we discovered that this particular fix also resolves it. That's why the bug is not referenced.
Comment 12 Dominik Heidler 2023-11-21 12:57:28 UTC
Any news here?
Comment 13 Nikolay Borisov 2023-11-23 12:31:39 UTC
This was fixed at the time it was reported.
Comment 14 Fabian Vogt 2023-11-27 11:41:41 UTC
I removed the zypper locks and will reopen if it breaks.