Bugzilla – Bug 1218547
Leap 15.5 - Xorg fails to start on kernel 5.14.21-150500.55.39.1
Last modified: 2024-01-20 18:15:58 UTC
Created attachment 871666 [details]
Xorg.0.log file showing the segfault resulting from this kernel update

NOTE: Reporting as SLES 15SP5 as there is no Leap 15.5 product that I am able to see, and IIRC, Leap 15.5 is based on SLE 15 SP5. Please change if this is not the correct product to report under.

Running VMware Workstation 17.5 on Tumbleweed (host OS shouldn't matter, but is provided for completeness) with nVidia 3090ti video card in host. Leap 15.5 virtual machine with 8 GB of RAM and 4 CPUs, accelerated graphics is enabled in the VM.

After updating the kernel from 5.14.21-150500.55.36.1 to 5.14.21-150500.55.39.1, the system locks right as Xorg is starting (the last messages on the text console indicate Locale service is started). If I start with nomodeset, I can get a text console - if I don't, I can't switch VTs, but I can ssh into the system.

The Xorg log (attached) reports a segfault.

I tested by applying all updates except all kernel updates (anything starting with kernel-default was locked during the 'zypper up'); system booted properly. Unlocking kernel-default and running an update resulted in the boot failure.

Please let me know if any other information is needed. Happy to provide the VM image if someone wants to test with the free VMware Player (it should fail there as well).
Additional note: I have also reproduced this using the net install CD (presumably the full DVD ISO will do this as well) allowing live updates during the installation - so it should be fairly easy to reproduce from scratch as well.
The vmware X driver seems to fail to initialize, ending with a segfault:

[ 6.843] (II) vmware(0): Initialized VMWARE_CTRL extension version 0.2
[ 6.843] (II) vmware(0): Initialized VMware Xinerama extension.
[ 6.843] (II) vmware(0): vgaHWGetIOBase: hwp->IOBase is 0x03d0
[ 6.877] (EE) vmware(0): Unable to map frame buffer BAR. Invalid argument (22)
[ 6.877] (EE)
[ 6.877] (EE) Backtrace:
[ 6.877] (EE) 0: /usr/bin/X (xorg_backtrace+0x65) [0x557fa89a9915]
[ 6.877] (EE) 1: /usr/bin/X (0x557fa87e5000+0x1c85e9) [0x557fa89ad5e9]
[ 6.877] (EE) 2: /lib64/libpthread.so.0 (0x7f43d2846000+0x16910) [0x7f43d285c910]
[ 6.877] (EE) 3: /lib64/libc.so.6 (0x7f43d1c09000+0x1914ea) [0x7f43d1d9a4ea]
[ 6.877] (EE) 4: /usr/lib64/xorg/modules/drivers/vmware_drv.so (0x7f43d181c000+0xae5b) [0x7f43d1826e5b]
[ 6.877] (EE) 5: /usr/bin/X (AddScreen+0xd7) [0x557fa88451e7]
[ 6.877] (EE) 6: /usr/bin/X (InitOutput+0x27d) [0x557fa8886c1d]
[ 6.877] (EE) 7: /usr/bin/X (0x557fa87e5000+0x63d45) [0x557fa8848d45]
[ 6.877] (EE) 8: /lib64/libc.so.6 (__libc_start_main+0xef) [0x7f43d1c3e24d]
[ 6.877] (EE) 9: /usr/bin/X (_start+0x2c) [0x557fa88327da]
[ 6.877] (EE)
[ 6.877] (EE) Segmentation fault at address 0x0

The suspected recent change on the kernel side is the set of security fix backports. Adding Thomas to Cc.
.... or it might be the earlier failure to open the drm:

[ 6.568] (EE) vmware(0): Failed to open drm.
[ 6.568] (WW) vmware(0): Disabling 3D support.
[ 6.568] (WW) vmware(0): Disabling Render Acceleration.
[ 6.568] (WW) vmware(0): Disabling RandR12+ support.
[ 6.568] (--) vmware(0): VMware SVGA regs at (0x1070, 0x1071)
....

Jim, could you provide the Xorg log from the working case (with the previous kernel)?
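The error and warning lines quoted in the two comments above can be pulled out of an Xorg.0.log mechanically. The following is a minimal sketch, not a tool from this report: the function name `xorg_messages` and the regular expression are illustrative assumptions, based only on the `[ time] (LEVEL) text` layout visible in the excerpts.

```python
import re

# Assumed layout, taken from the excerpts above: "[ 6.877] (EE) vmware(0): message".
# The two-character marker covers (II), (WW), (EE), (--), (**), (==), (++), (!!), (??).
LINE_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+\((?P<level>[A-Z!?+*=-]{2})\)\s?(?P<text>.*)$"
)


def xorg_messages(lines, levels=("EE", "WW")):
    """Return (time, level, text) tuples for non-empty messages at the given levels."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # not a timestamped log line
        if match.group("level") in levels and match.group("text"):
            found.append(
                (float(match.group("time")), match.group("level"), match.group("text"))
            )
    return found


if __name__ == "__main__":
    sample = [
        "[ 6.568] (EE) vmware(0): Failed to open drm.",
        "[ 6.568] (WW) vmware(0): Disabling 3D support.",
        "[ 6.877] (EE) vmware(0): Unable to map frame buffer BAR. Invalid argument (22)",
    ]
    for time, level, text in xorg_messages(sample):
        print(f"{time:8.3f} {level} {text}")
```

Listing the matches in file order makes it easier to see that the "Failed to open drm" error precedes the frame buffer mapping failure, which is the ordering the comment above points at.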
Created attachment 871676 [details] Xorg.0.log from before the kernel update when using nomodeset (shows that it fails with nomodeset in either case) I went to look for this, and now I'm puzzled. The working system is using Wayland and there's no Xorg log file. If I start with nomodeset with the previous kernel, it starts with Xorg, but the startup also fails. (Xorg.0.log attached, but it looks to be the same segfault). So it seems that adding nomodeset to the startup is causing it to try to start Xorg rather than Xwayland and changing the problem in a way I was not anticipating. If I start without 'nomodeset' on the working kernel, the system starts up as expected using Xwayland, and the new kernel does not - so there is still an issue, but it looks to actually not be Xorg (though Xorg failing with nomodeset is clearly an issue as well).
Doing some additional digging, I see that wayland writes to the system log - the only messages in the output of 'journalctl -b' related to wayland are the following:

--- snip ---
Jan 05 12:59:07 localhost.localdomain systemd[1239]: Reached target GNOME Wayland Session.
Jan 05 12:59:07 localhost.localdomain systemd[1239]: Starting GNOME Shell on Wayland...
Jan 05 12:59:07 localhost.localdomain gnome-shell[1699]: Running GNOME Shell (using mutter 41.9) as a Wayland display server
Jan 05 13:00:37 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: start operation timed out. Terminating.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: State 'stop-sigterm' timed out. Killing.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1699 (gnome-shell) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1777 (gmain) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1793 (gdbus) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1801 (dconf worker) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Main process exited, code=killed, status=9/KILL
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Failed with result 'timeout'.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: Failed to start GNOME Shell on Wayland.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Triggering OnFailure= dependencies.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: Stopped target GNOME Wayland Session.
--- snip ---

When it has reached this point, the system becomes completely nonresponsive - unable to even connect via ssh.
Before the system reported that the attempt to terminate org.gnome.Shell@wayland.service had failed (timestamp Jan 05 13:00:42), I was able to connect (and thus had the log on-screen to copy/paste here). It looks like this issue needs to be reclassified as a wayland/kernel issue rather than xorg. Apologies for missing that nomodeset was changing the issue.
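For anyone comparing their own 'journalctl -b' output against the excerpt above, the timeout lines can be picked out with a small filter. This is only a sketch under the assumption that the messages keep the wording shown in the excerpt; `timed_out_units` and the pattern are hypothetical names introduced here for illustration.

```python
import re

# Assumed wording, copied from the journal excerpt above:
#   "<unit>.service: start operation timed out. Terminating."
#   "<unit>.service: State 'stop-sigterm' timed out. Killing."
TIMEOUT_RE = re.compile(
    r":\s(?P<unit>\S+\.service): "
    r"(?P<what>start operation timed out|State '[^']+' timed out)"
)


def timed_out_units(journal_lines):
    """Map each unit name to the list of timeout messages logged for it."""
    result = {}
    for line in journal_lines:
        match = TIMEOUT_RE.search(line)
        if match:
            result.setdefault(match.group("unit"), []).append(match.group("what"))
    return result


if __name__ == "__main__":
    import sys

    # Example use: journalctl -b | python3 this_script.py
    for unit, events in timed_out_units(sys.stdin).items():
        print(unit, "->", "; ".join(events))
```

In the excerpt, the only unit this would report is org.gnome.Shell@wayland.service, first for the start timeout and then for the 'stop-sigterm' timeout.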
With the information that this looked to be related to the VMware X driver failing to initialize, I tried a couple of additional tests:

1. Disabling accelerated graphics
2. With accelerated graphics enabled, increasing system RAM in the VM from 8 GB to 16 GB (the VM's display configuration recommends this when only 8 GB of system RAM is allocated alongside 8 GB of VRAM).

Test 1 resulted in a system that starts up. Test 2 hung the guest to the point that I had difficulty shutting the VM down (I had to kill it with the 'kill' command on the host).

I can run the VM with accelerated graphics disabled, but am happy to continue to provide info to help resolve the issue so that acceleration can be used.
(In reply to Jim Henderson from comment #6)
> With the information that this looked to be related to the VMware X driver
> failing to initialize, I tried a couple of additional tests:
>
> 1. Disabling accelerated graphics

So this is likely because the DRM initialization fails for some reason. Even on X11, it failed to open DRM, as shown in comment 3.

Something for Thomas, I suppose.
A fix was backported very recently to the SLE15-SP5 branch. I'm building a test kernel in the OBS home:tiwai:bsc1218738 repo. Once the build finishes (it takes an hour or so), the package will appear at http://download.opensuse.org/repositories/home:/tiwai:/bsc1218738/pool/ Could you give it a try later?
I should have some time to test it tomorrow, most likely, if the package has arrived in the repo by then. Today's just slammed, but I'll make it a priority tomorrow once the package is available.
I'd say it's the same bug as bsc#1218229. Adding a dependency for now.
Another test in bsc#1218738 failed, so the package in OBS home:tiwai:bsc1218738 is likely still broken. But it's still worth trying. Meanwhile, another test kernel with a further fix is being built in the OBS home:tiwai:bsc1218738-2 repo. Please check this one later, too. It'll appear at http://download.opensuse.org/repositories/home:/tiwai:/bsc1218738-2/pool/
The kernel update from bsc1218738 seems to be working. I've double checked, and 3D acceleration is enabled in the VM, and Wayland is being used. The kernel update from bsc1218738-2 also seems to be working; as before 3D acceleration is enabled in the VM, and Wayland is in use. I did also test forcing Xorg to run with nomodeset (only with the second update, but I can test the first if needed), but Xorg does not start. I'll attach the Xorg.0.log file from that failed startup.
Created attachment 871840 [details] Xorg.0.log while running the kernel pulled from bsc1218738-2 and using nomodeset See previous comment for details
Thanks for the quick testing! So it seems OK in your case, interestingly. Could you check the kernel dmesg output with the *-2 kernel, just to be sure? We're rebuilding yet another one in the OBS home:tiwai:bsc1218738-3 repo, since *-2 still had a minor issue. If you have time, please test it later (once the build finishes), too.
Absolutely! What specific information is useful out of dmesg, and in which test case - the one with nomodeset, or without? Just want to make sure I pull the right info.
Just run the VM guest as usual with acceleration enabled, like before, and get the dmesg from the guest. There is now yet another update: OBS home:tiwai:bsc1218738-4 is being built. Please check that one later instead. Thanks!
Created attachment 871858 [details] dmesg output from the bsc1218738-4 kernel build Updated to the bsc1218738-4 kernel, rebooted (leaving out the nomodeset parameter, with accel enabled), and pulled this dmesg output from the system per request. Desktop still is functional when started in this way, and uses wayland. With nomodeset, it switches to Xorg and still fails in the same way (just providing that data point; I assume that the focus is on the Wayland server's functionality and that's why there's no change on Xorg).
Just wanted to check back and make sure that I sent everything that was needed with the last log. Had someone else report this issue through the FB group as well, just FYI.
The fix has now landed in the SLE15-SP5 branch. It will likely be in the regular update in February. Thanks for your report and tests!
Excellent, thanks! For users who need the fix before then in Leap 15.5, can they apply it from somewhere before it hits the regular channel?
KOTD (kernel of the day - which is built from the latest git branch) can be taken from OBS Kernel:SLE15-SP5 repo. But it's an unofficial build, hence Secure Boot has to be turned off.
Cool, thanks for that info. If anyone asks, I'll let them know what they need to do. Probably easier for them to roll back and lock the kernel update before updating, but nice to have the option.
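The "lock the kernel update" workaround mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than an official procedure: the helper names `kernel_lock_commands` and `run` are made up here, the package name kernel-default is taken from the original report, and the commands need root privileges when actually executed. `zypper addlock`, `zypper removelock`, and `zypper locks` are existing zypper subcommands.

```python
import subprocess


def kernel_lock_commands(package="kernel-default", lock=True):
    """Build the zypper commands to lock (or unlock) a package and list the locks."""
    action = "addlock" if lock else "removelock"
    return [
        ["zypper", action, package],  # add or remove the package lock
        ["zypper", "locks"],          # show the current locks to verify the result
    ]


def run(commands, dry_run=True):
    """Print the commands, or execute them when dry_run is False (requires root)."""
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands instead of changing the system.
    run(kernel_lock_commands(lock=True))
```

Once the fixed kernel reaches the regular update channel, calling `kernel_lock_commands(lock=False)` builds the matching unlock commands so that 'zypper up' picks up kernel updates again.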