Bug 1218547 - Leap 15.5 - Xorg fails to start on kernel 5.14.21-150500.55.39.1
Summary: Leap 15.5 - Xorg fails to start on kernel 5.14.21-150500.55.39.1
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP5
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other Other
Priority: P5 - None  Severity: Normal
Target Milestone: ---
Assignee: Thomas Zimmermann
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on: 1218229
Blocks:
Reported: 2024-01-04 18:45 UTC by Jim Henderson
Modified: 2024-01-20 18:15 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Xorg.0.log file showing the segfault resulting from this kernel update (29.25 KB, text/x-log)
2024-01-04 18:45 UTC, Jim Henderson
Xorg.0.log from before the kernel update when using nomodeset (shows that it fails with nomodeset in either case) (29.25 KB, text/plain)
2024-01-05 20:54 UTC, Jim Henderson
Xorg.0.log while running the kernel pulled from bsc1218738-2 and using nomodeset (29.25 KB, text/plain)
2024-01-12 19:27 UTC, Jim Henderson
dmesg output from the bsc1218738-4 kernel build (110.04 KB, text/x-log)
2024-01-13 19:09 UTC, Jim Henderson

Description Jim Henderson 2024-01-04 18:45:17 UTC
Created attachment 871666
Xorg.0.log file showing the segfault resulting from this kernel update

NOTE:  Reporting as SLES 15SP5 as there is no Leap 15.5 product that I am able to see, and IIRC, Leap 15.5 is based on SLE 15 SP5.  Please change if this is not the correct product to report under.

Running VMware Workstation 17.5 on Tumbleweed (host OS shouldn't matter, but is provided for completeness) with nVidia 3090ti video card in host.  Leap 15.5 virtual machine with 8 GB of RAM and 4 CPUs, accelerated graphics is enabled in the VM.

After updating the kernel from 5.14.21-150500.55.36.1 to 5.14.21-150500.55.39.1, the system locks right as Xorg is starting (the last messages on the text console indicate Locale service is started).  If I start with nomodeset, I can get a text console - if I don't, I can't switch VTs, but I can ssh into the system.

The Xorg log (attached) reports a segfault.

I tested by applying all updates except all kernel updates (anything starting with kernel-default was locked during the 'zypper up'); system booted properly.  Unlocking kernel-default and running an update resulted in the boot failure.
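
The isolation test described above can be sketched as a short script. This is only a sketch, assuming the lock was created with a 'kernel-default*' wildcard (the report says only that anything starting with kernel-default was locked); the function names are hypothetical, and by default nothing is executed, the commands are only printed.

```python
import shlex
import subprocess


def kernel_isolation_commands(pattern="kernel-default*"):
    """Return the zypper calls for the two phases of the test.

    Phase 1: lock the kernel packages and apply every other update.
    Phase 2 (after a reboot to confirm the system still works):
    remove the lock and update again so only the kernel changes.
    The wildcard pattern is an assumption, not taken from the report.
    """
    return [
        ["zypper", "addlock", pattern],
        ["zypper", "--non-interactive", "update"],
        ["zypper", "removelock", pattern],
        ["zypper", "--non-interactive", "update"],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in commands:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)
```

Calling run(kernel_isolation_commands()) prints the four commands without touching the system. A reboot between the first update and the removelock step is still a manual action.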

Please let me know if any other information is needed.  Happy to provide the VM image if someone wants to test with the free VMware Player (it should fail there as well).
Comment 1 Jim Henderson 2024-01-04 18:46:21 UTC
Additional note:  I have also reproduced this using the net install CD (presumably the full DVD ISO will do this as well) allowing live updates during the installation - so it should be fairly easy to reproduce from scratch as well.
Comment 2 Takashi Iwai 2024-01-05 16:13:41 UTC
The vmware X driver seems to fail to initialize, ending with a segfault:
[     6.843] (II) vmware(0): Initialized VMWARE_CTRL extension version 0.2
[     6.843] (II) vmware(0): Initialized VMware Xinerama extension.
[     6.843] (II) vmware(0): vgaHWGetIOBase: hwp->IOBase is 0x03d0
[     6.877] (EE) vmware(0): Unable to map frame buffer BAR. Invalid argument (22)
[     6.877] (EE) 
[     6.877] (EE) Backtrace:
[     6.877] (EE) 0: /usr/bin/X (xorg_backtrace+0x65) [0x557fa89a9915]
[     6.877] (EE) 1: /usr/bin/X (0x557fa87e5000+0x1c85e9) [0x557fa89ad5e9]
[     6.877] (EE) 2: /lib64/libpthread.so.0 (0x7f43d2846000+0x16910) [0x7f43d285c910]
[     6.877] (EE) 3: /lib64/libc.so.6 (0x7f43d1c09000+0x1914ea) [0x7f43d1d9a4ea]
[     6.877] (EE) 4: /usr/lib64/xorg/modules/drivers/vmware_drv.so (0x7f43d181c000+0xae5b) [0x7f43d1826e5b]
[     6.877] (EE) 5: /usr/bin/X (AddScreen+0xd7) [0x557fa88451e7]
[     6.877] (EE) 6: /usr/bin/X (InitOutput+0x27d) [0x557fa8886c1d]
[     6.877] (EE) 7: /usr/bin/X (0x557fa87e5000+0x63d45) [0x557fa8848d45]
[     6.877] (EE) 8: /lib64/libc.so.6 (__libc_start_main+0xef) [0x7f43d1c3e24d]
[     6.877] (EE) 9: /usr/bin/X (_start+0x2c) [0x557fa88327da]
[     6.877] (EE) 
[     6.877] (EE) Segmentation fault at address 0x0

The suspected recent change on the kernel side is the security fix backports.
Adding Thomas to Cc.
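
For anyone triaging a similar Xorg.0.log, the error lines quoted above can be pulled out mechanically. The following is a minimal sketch, assuming the usual "[ time] (XX) message" line layout seen in this log; the names xorg_messages and first_error are hypothetical helpers, not part of any Xorg tooling.

```python
import re

# Matches lines such as "[     6.877] (EE) vmware(0): Unable to map ..."
# The two-character marker covers (EE), (WW), (II), (--), (**), (==) and so on.
LINE_RE = re.compile(
    r"^\[\s*(?P<time>\d+\.\d+)\]\s+\((?P<level>[A-Z!?+*=-]{2})\)\s?(?P<msg>.*)$"
)


def xorg_messages(log_text, levels=("EE",)):
    """Return (timestamp, level, message) tuples for the requested markers.

    Lines that carry a marker but no message text are skipped.
    """
    found = []
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        message = match.group("msg").strip()
        if match.group("level") in levels and message:
            found.append((float(match.group("time")), match.group("level"), message))
    return found


def first_error(log_text):
    """Return the earliest (EE) entry, or None if the log has none."""
    errors = xorg_messages(log_text)
    return errors[0] if errors else None
```

Passing levels=("EE", "WW") also collects warnings, which helps when an earlier warning explains a later crash.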
Comment 3 Takashi Iwai 2024-01-05 16:18:13 UTC
... or it might be the earlier failure to open the DRM device:
[     6.568] (EE) vmware(0): Failed to open drm.
[     6.568] (WW) vmware(0): Disabling 3D support.
[     6.568] (WW) vmware(0): Disabling Render Acceleration.
[     6.568] (WW) vmware(0): Disabling RandR12+ support.
[     6.568] (--) vmware(0): VMware SVGA regs at (0x1070, 0x1071)
....

Jim, could you provide the Xorg.0.log from the working case (with the previous kernel)?
Comment 4 Jim Henderson 2024-01-05 20:54:41 UTC
Created attachment 871676
Xorg.0.log from before the kernel update when using nomodeset (shows that it fails with nomodeset in either case)

I went to look for this, and now I'm puzzled.  The working system is using Wayland and there's no Xorg log file.

If I start with nomodeset with the previous kernel, it starts with Xorg, but the startup also fails.  (Xorg.0.log attached, but it looks to be the same segfault).

So it seems that adding nomodeset to the startup is causing it to try to start Xorg rather than Xwayland and changing the problem in a way I was not anticipating.

If I start without 'nomodeset' on the working kernel, the system starts up as expected using Xwayland, and the new kernel does not - so there is still an issue, but it looks to actually not be Xorg (though Xorg failing with nomodeset is clearly an issue as well).
Comment 5 Jim Henderson 2024-01-05 21:07:24 UTC
Doing some additional digging, I see that wayland writes to the system log - the only messages in the output of 'journalctl -b' related to wayland are the following:

--- snip ---

Jan 05 12:59:07 localhost.localdomain systemd[1239]: Reached target GNOME Wayland Session.
Jan 05 12:59:07 localhost.localdomain systemd[1239]: Starting GNOME Shell on Wayland...
Jan 05 12:59:07 localhost.localdomain gnome-shell[1699]: Running GNOME Shell (using mutter 41.9) as a Wayland display server
Jan 05 13:00:37 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: start operation timed out. Terminating.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: State 'stop-sigterm' timed out. Killing.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1699 (gnome-shell) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1777 (gmain) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1793 (gdbus) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Killing process 1801 (dconf worker) with signal SIGKILL.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Main process exited, code=killed, status=9/KILL
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Failed with result 'timeout'.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: Failed to start GNOME Shell on Wayland.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: org.gnome.Shell@wayland.service: Triggering OnFailure= dependencies.
Jan 05 13:00:42 localhost.localdomain systemd[1239]: Stopped target GNOME Wayland Session.

--- snip ---

When it has reached this point, the system becomes completely nonresponsive - unable to even connect via ssh.  Prior to the system reporting that the attempt to terminate org.gnome.Shell@wayland.service failed (timestamp Jan 05 13:00:42), I was able to connect (and thus had the log on-screen to copy/paste here).

It looks like this issue needs to be reclassified as a wayland/kernel issue rather than xorg.  Apologies for missing that nomodeset was changing the issue.
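
The timing in the journal excerpt above can be checked with a few lines of Python. This is a rough sketch, assuming the short journal format shown ("Mon DD HH:MM:SS host ...") and an English locale for month names; the year is not in the log, so 2024 is an assumption taken from the comment date, and the helper names are hypothetical.

```python
from datetime import datetime


def parse_stamp(line, year=2024):
    """Parse the 'Jan 05 12:59:07' prefix of a short-format journal line."""
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


def seconds_between(lines, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    line containing end_marker."""
    start = next(parse_stamp(line) for line in lines if start_marker in line)
    end = next(parse_stamp(line) for line in lines if end_marker in line)
    return (end - start).total_seconds()
```

Applied to the excerpt, the gap from "Starting GNOME Shell on Wayland" to "start operation timed out" is 90 seconds, and the SIGKILL follows 5 seconds later. So gnome-shell appears to have hung for the whole start window rather than crashing right away, though that reading is an interpretation of the timestamps only.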
Comment 6 Jim Henderson 2024-01-05 21:27:12 UTC
With the information that this looked to be related to the VMware X driver failing to initialize, I tried a couple of additional tests:

1.  Disabling accelerated graphics
2.  With accelerated graphics enabled, increasing system RAM in the VM from 8 GB to 16 GB (recommended in the display configuration for the VM with only 8 GB of system RAM allocated to the system and 8 GB of VRAM allocated).

Test 1 resulted in a system that starts up.

Test 2 hung the guest to the point that I had difficulty shutting the VM down (I had to kill it with the 'kill' command on the host).

I can run the VM with accelerated graphics disabled, but am happy to continue to provide info to help resolve the issue so that acceleration can be used.
Comment 7 Takashi Iwai 2024-01-08 12:07:53 UTC
(In reply to Jim Henderson from comment #6)
> With the information that this looked to be related to the VMware X driver
> failing to initialize, I tried a couple of additional tests:
> 
> 1.  Disabling accelerated graphics

So this is likely because the DRM initialization fails for some reason.
Even on X11, it failed to open DRM, as shown in comment 3.

Something for Thomas, I suppose.
Comment 8 Takashi Iwai 2024-01-11 18:49:03 UTC
There is a fix backported very recently to the SLE15-SP5 branch.
I'm building a test kernel in the OBS home:tiwai:bsc1218738 repo.
Once the build finishes (it takes an hour or so), the package will appear at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1218738/pool/

Could you give it a try later?
Comment 9 Jim Henderson 2024-01-11 18:59:33 UTC
I should have some time to test it tomorrow, most likely, if the package has arrived in the repo by then.  Today's just slammed, but I'll make it a priority tomorrow once the package is available.
Comment 10 Thomas Zimmermann 2024-01-12 08:15:47 UTC
I'd say it's the same bug as bsc#1218229. Adding a dependency for now.
Comment 12 Takashi Iwai 2024-01-12 16:17:48 UTC
Another test in bsc#1218738 failed, so the package in OBS home:tiwai:bsc1218738 is likely still broken.  But it's still worth trying.

Meanwhile, another test kernel with a fix is being built in the OBS home:tiwai:bsc1218738-2 repo.  Please check this one later, too.  It'll appear later at
  http://download.opensuse.org/repositories/home:/tiwai:/bsc1218738-2/pool/
Comment 13 Jim Henderson 2024-01-12 19:24:53 UTC
The kernel update from bsc1218738 seems to be working.

I've double checked, and 3D acceleration is enabled in the VM, and Wayland is being used.

The kernel update from bsc1218738-2 also seems to be working; as before 3D acceleration is enabled in the VM, and Wayland is in use.

I did also test forcing Xorg to run with nomodeset (only with the second update, but I can test the first if needed), but Xorg does not start.  I'll attach the Xorg.0.log file from that failed startup.
Comment 14 Jim Henderson 2024-01-12 19:27:13 UTC
Created attachment 871840
Xorg.0.log while running the kernel pulled from bsc1218738-2 and using nomodeset

See previous comment for details
Comment 15 Takashi Iwai 2024-01-12 20:31:47 UTC
Thanks for the quick testing!

So it seems OK in your case, interestingly.  Could you check the kernel dmesg output with the *-2 kernel, just to be sure?

We're rebuilding yet another one in the OBS home:tiwai:bsc1218738-3 repo, since *-2 still had a minor issue.  If you have time, please test it later (once the build finishes), too.
Comment 16 Jim Henderson 2024-01-12 21:39:31 UTC
Absolutely!  What specific information is useful out of dmesg, and in which test case - the one with nomodeset, or without?

Just want to make sure I pull the right info.
Comment 17 Takashi Iwai 2024-01-13 07:28:30 UTC
Just run the vm guest as usual with the acceleration enabled like before, and get the dmesg from the guest.

Yet another update: OBS home:tiwai:bsc1218738-4 is now being built.  Please check this one instead, once it is available.  Thanks!
Comment 18 Jim Henderson 2024-01-13 19:09:51 UTC
Created attachment 871858
dmesg output from the bsc1218738-4 kernel build

Updated to the bsc1218738-4 kernel, rebooted (leaving out the nomodeset parameter, with accel enabled), and pulled this dmesg output from the system per request.

Desktop still is functional when started in this way, and uses wayland.

With nomodeset, it switches to Xorg and still fails in the same way (just providing that data point; I assume that the focus is on the Wayland server's functionality and that's why there's no change on Xorg).
Comment 19 Jim Henderson 2024-01-18 18:33:21 UTC
Just wanted to check back and make sure that I sent everything that was needed with the last log.  Had someone else report this issue through the FB group as well, just FYI.
Comment 20 Takashi Iwai 2024-01-19 09:29:50 UTC
The fix has now landed in the SLE15-SP5 branch.  It will likely be in the regular update in February.  Thanks for your report and tests!
Comment 21 Jim Henderson 2024-01-19 18:53:44 UTC
Excellent, thanks!  For users who need the fix before then in Leap 15.5, can they apply it from somewhere before it hits the regular channel?
Comment 22 Takashi Iwai 2024-01-20 08:36:04 UTC
KOTD (kernel of the day, which is built from the latest git branch) can be taken from the OBS Kernel:SLE15-SP5 repo.  But it's an unofficial build, hence Secure Boot has to be turned off.
Comment 23 Jim Henderson 2024-01-20 18:15:58 UTC
Cool, thanks for that info.  If anyone asks, I'll let them know what they need to do.  Probably easier for them to roll back and lock the kernel update before updating, but nice to have the option.