Bugzilla – Full Text Bug Listing
| Summary: | kernel panic in systemd-testsuite VM after switch_root | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Thomas Blume <thomas.blume> |
| Component: | KVM | Assignee: | systemd maintainers <systemd-maintainers> |
| Status: | RESOLVED INVALID | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | CC: | fabiano.rosas, fbui, thomas.blume |
| Version: | Current | Flags: | fabiano.rosas: needinfo? |
| Target Milestone: | --- | | |
| Hardware: | x86-64 | | |
| OS: | openSUSE Tumbleweed | | |
| Whiteboard: | | | |
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | qemu run log with panic (-cpu IvyBridge); qemu run log without panic (-cpu Haswell) | | |
Description
Thomas Blume
2024-06-13 10:26:21 UTC
Created attachment 875461 [details]
qemu run log with panic -cpu IvyBridge
Created attachment 875462 [details]
qemu run log without panic -cpu Haswell
The panic also happens when the -cpu parameter is omitted.

I'm having difficulty reproducing this. I can't get past this point:
[ 14.652047][ T326] sd 0:0:0:0: [sda] Preferred minimum I/O size 512 bytes
[ 14.684284][ T10] scsi 1:0:0:0: Attached scsi generic sg1 type 5
[ 14.725152][ T326] sda: sda1 sda2 sda3 sda4
Starting dracut initqueue hook...
[ OK ] Stopped Virtual Console Setup.
[ 14.995020][ T312] sr 1:0:0:0: [sr0] scsi3-mmc drive: 4x/4x cd/rw xa/form2 tray
[ 14.997164][ T312] cdrom: Uniform CD-ROM driver Revision: 3.20
Stopping Virtual Console Setup...
Starting Virtual Console Setup...
[ OK ] Finished Virtual Console Setup.
[ OK ] Reached target System Initialization.
[ OK ] Reached target Basic System.
The dracut timeout starts after a while. Any suggestions? This happens with KVM as well, so it's not just emulation slowness.
Also, that panic looks like systemd just died. It doesn't seem to be something kernel-level or virtualization-related. So perhaps we could try to isolate what exactly is killing systemd and then see how that relates to the emulated CPU. Would rd.debug be of any help? Or maybe drop to the dracut shell and switch root manually?
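Both of those debugging routes go through the kernel command line. A minimal sketch (the option names are the standard dracut ones; the qemu invocation in the comment is an assumed direct-kernel-boot setup, not taken from this bug):

```shell
# rd.debug           - verbose dracut/initqueue logging in the initrd
# rd.break=pre-pivot - drop to the dracut emergency shell just before
#                      switch_root, so the root switch can be tried by hand
debug_cmdline="rd.debug rd.break=pre-pivot"

# Hypothetical usage with direct kernel boot:
#   qemu-system-x86_64 -kernel vmlinuz -initrd initrd \
#       -append "root=/dev/sda2 $debug_cmdline" ...
echo "$debug_cmdline"
```

From the pre-pivot shell, `chroot /sysroot` and running the systemd binary manually would show the loader error directly.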
I think I've narrowed this issue down.
Depending on the model passed to the '-cpu' option, the path of the shared library "libcrypto.so" that the dynamic linker loads when systemd is executed varies:
When it fails:
# ldd /sysroot/usr/lib/systemd/systemd
libcrypto.so.3 => /lib64/libcrypto.so.3
When it succeeds:
# ldd /sysroot/usr/lib/systemd/systemd
libcrypto.so.3 => /lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3.1.4
Apparently the dynamic loader chooses which version of the library to load based on some of the features supported by the CPU.
However, when the test image is built, the library is installed in the image based on the output of ldd run on the systemd binary installed on the *host* (from which the VM running the testsuite is spawned). Hence the path of libcrypto is based on the features supported by the host CPU (and in my case it's /lib64/glibc-hwcaps/x86-64-v3/).
But when the VM is spawned, the features supported by the CPU emulated by the VM may differ from the host ones, which explains why in some situations the dynamic linker looks for libcrypto in the wrong place.
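The selection described above can be sketched roughly as follows. This is a simplification with a hypothetical helper, not the actual loader code: it keys on AVX2 alone, whereas glibc checks the full x86-64-v3 feature set (AVX2, BMI1/2, FMA, MOVBE, ...). The point is that Haswell advertises avx2 and IvyBridge does not, so the two guests resolve libcrypto to different paths:

```shell
# Rough sketch of the glibc-hwcaps decision (hypothetical helper).
pick_libcrypto() {
    local cpu_flags="$1"
    case " $cpu_flags " in
        *" avx2 "*) echo "/lib64/glibc-hwcaps/x86-64-v3/libcrypto.so.3" ;;
        *)          echo "/lib64/libcrypto.so.3" ;;
    esac
}

pick_libcrypto "sse4_2 avx avx2 bmi2 fma"   # Haswell-like guest
pick_libcrypto "sse4_2 avx"                 # IvyBridge-like guest
```

If the image only contains the path that matched on the host, the other branch dereferences a file that isn't there, and PID 1 dies.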
Fabiano, do you know whether one possible fix would be to pass "-cpu host" to qemu, so that the difference between the CPUs on the host and in the VM would be minimal?

(In reply to Franck Bui from comment #6)
> Fabiano, do you know whether one possible fix would be to pass "-cpu host"
> to qemu so the difference between the cpus on the host and on the VM would
> be minimal?

It would work, however that option is restricted to use along with KVM acceleration, which may not be available when you run the tests. The examples from comment #0, for instance, are running entirely emulated.

If the problem is simply that the host CPU has _more_ features than the guest CPU in the test, then the -cpu max option might be useful. It should enable all implemented features.

Another workaround, from the virtualization tools perspective, would be to pinpoint which feature(s) cause the issue and craft a compatible -cpu option by disabling some of them. However, that might not be portable across host machines.
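The trade-off in comments #6/#7 can be sketched as below. `choose_cpu_arg` is a hypothetical helper, not part of the systemd test suite: it uses "-cpu host" only when KVM is available, and falls back to "-cpu max" under pure emulation so the guest still gets every feature TCG implements:

```shell
# Hypothetical helper: pick a qemu -cpu argument depending on whether
# KVM acceleration is available (the device path is parameterized so it
# can be tested without a real /dev/kvm).
choose_cpu_arg() {
    local kvm_dev="${1:-/dev/kvm}"
    if [ -c "$kvm_dev" ]; then
        echo "-enable-kvm -cpu host"   # guest CPU == host CPU
    else
        echo "-cpu max"                # TCG: enable all implemented features
    fi
}

qemu_cpu=$(choose_cpu_arg)
```

Note that "-cpu max" only guarantees the guest has at least the features of any named model; it does not guarantee it matches the host that built the image.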
Not sure if this is important, but just to mention it: the "host" CPU in the openQA test case is emulated too, i.e. the openQA machine is itself a VM. So qemu runs nested here.

(In reply to Thomas Blume from comment #8)
> [...] the "host" cpu in the openQA testcase is emulated too [...]
> So qemu runs nested here.

Yes, that's relevant. It gives us more options.

Looking at the systemd tests I see a TEST_NESTED_KVM option. Has that been tried? It seems it would run the test guest as a proper nested KVM guest (whereas currently we're merely running an emulated guest inside of a VM). A nested guest would have a CPU with the same feature set as the level-1 guest.

(In reply to Fabiano Rosas from comment #9)
> Looking at the systemd tests I see a TEST_NESTED_KVM option. Has that been
> tried?

Indeed, that works, as it sets QEMU_KVM, which in turn enables '-cpu host':

-->
# Let's use KVM if possible
if [[ -c /dev/kvm ]] && get_bool $QEMU_KVM; then
    qemu_options+=(-machine "accel=kvm" -enable-kvm -cpu host)
fi
--<

Still, wouldn't that mean qemu cannot run a systemd-based VM on any machine with a CPU older than Haswell?
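TEST_NESTED_KVM only helps when nested virtualization is actually enabled on the level-1 guest. A minimal check (the `nested` module parameter paths are the standard kvm_intel/kvm_amd ones; `nested_enabled` is a hypothetical helper that takes the parameter file as an argument so either module's path can be passed in):

```shell
# Hypothetical helper: report whether a kvm module's "nested" parameter
# file says nested virtualization is on ("Y" on Intel, "1" on AMD).
nested_enabled() {
    local param_file="$1"   # e.g. /sys/module/kvm_intel/parameters/nested
    [ -r "$param_file" ] && grep -qxE '1|Y' "$param_file"
}

# Typical usage:
#   nested_enabled /sys/module/kvm_intel/parameters/nested && echo "nested OK"
#   nested_enabled /sys/module/kvm_amd/parameters/nested   && echo "nested OK"
```

Without this, /dev/kvm won't exist inside the level-1 guest and the test suite silently falls back to TCG emulation, reintroducing the CPU-feature mismatch.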
Per Franck Bui's analysis, I understand this issue exists due to the particular way systemd builds the test image, so only that specific scenario should be affected.

I'd be glad to continue investigating this, but at the moment I don't see any indication that we have an issue anywhere in the virtualization stack.

(In reply to Fabiano Rosas from comment #11)
> I'd be glad to continue investigating this, but at the moment I don't see
> indication that we have an issue anywhere in the virtualization stack.

Indeed, I misinterpreted Franck's comment, sorry. This is a testsuite problem, hence closing the bug.