Bug 1140081

Summary: Unable to change VTs after wayland compositor crash
Product: [openSUSE] openSUSE Tumbleweed Reporter: Fabian Vogt <fvogt>
Component: BasesystemAssignee: systemd maintainers <systemd-maintainers>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: alynx.zhou, fbui, fvogt, jslaby, tiwai, tzimmermann, vliaskovitis
Version: CurrentFlags: fbui: needinfo? (jslaby)
Target Milestone: ---   
Hardware: Other   
OS: Other   
URL: https://github.com/systemd/systemd/issues/13241
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Bug Depends on: 1101591    
Bug Blocks:    
Attachments: Patch that tries to restore VT twice

Description Fabian Vogt 2019-07-02 12:45:44 UTC
Unfortunately the issue seems to be back, but this time I'm unable to reproduce it in a VM the same way as in the original bug report.

Issue first seen with kernel-default-5.2.rc5-3.1.ga562f5a.x86_64 and systemd 242.

Jun 30 22:09:00 linux-e202.suse.de systemd-logind[1263]: Failed to restore VT, ignoring: Input/output error
Jul 01 17:23:43 linux-e202.suse.de systemd-logind[1263]: Failed to restore VT, ignoring: Input/output error
Jul 01 19:07:57 linux-e202.suse.de systemd-logind[1254]: Failed to restore VT, ignoring: Input/output error

I set SYSTEMD_LOG_LEVEL=debug for logind now, once it happens again (if it does at all with that option...) I'll add the debug messages here.

+++ This bug was initially created as a clone of Bug #1101591 +++

When the wayland compositor crashes, the system becomes unusable:

- Ctrl-Alt-FX has no effect
- "chvt X" over SSH hangs

chvt shows the following trace:

[<0>] __vt_event_wait.isra.4.part.5+0x2b/0x60
[<0>] vt_waitactive+0x60/0xa0
[<0>] vt_ioctl+0x6a5/0x1100
[<0>] tty_ioctl+0x21c/0x890
[<0>] do_vfs_ioctl+0x90/0x5e0
[<0>] ksys_ioctl+0x60/0x90
[<0>] __x64_sys_ioctl+0x16/0x20
[<0>] do_syscall_64+0x60/0x110
[<0>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[<0>] 0xffffffffffffffff

Backtraces of all tasks are attached, but chvt seems to be the only one inside VT/tty/graphics code.

Linux linux-e202.suse.de 4.18.0-rc4-1.g36e98dd-default #1 SMP PREEMPT Mon Jul 9 07:48:38 UTC 2018 (36e98dd) x86_64 x86_64 x86_64 GNU/Linux

The system has hybrid graphics, with i915 and nouveau loaded.

This is 100% reproducible.
Comment 1 Stefan Dirsch 2019-07-02 13:02:55 UTC
I suggest to disable nouveau driver support first via

nouveau.modeset=0

(already needs to be active in initrd of course).
Comment 3 Franck Bui 2019-07-16 08:59:59 UTC
(In reply to Fabian Vogt from comment #0)
> I set SYSTEMD_LOG_LEVEL=debug for logind now, once it happens again (if it
> does at all with that option...) I'll add the debug messages here.

So I guess you haven't managed to reproduce with the debug logs enabled, have you ?
Comment 4 Fabian Vogt 2019-07-16 09:19:49 UTC
(In reply to Franck Bui from comment #3)
> (In reply to Fabian Vogt from comment #0)
> > I set SYSTEMD_LOG_LEVEL=debug for logind now, once it happens again (if it
> > does at all with that option...) I'll add the debug messages here.
> 
> So I guess you haven't managed to reproduce with the debug logs enabled,
> have you ?

I didn't immediately reboot after configuring the loglevel and disabling nouveau, but waited until it happened again and was forced to reset. That was only two days ago, so now I'm waiting for it to happen again.
Comment 5 Fabian Vogt 2019-07-24 11:40:58 UTC
It happened again, with nouveau.modeset=0:

Jul 24 13:05:31 linux-e202.suse.de org_kde_powerdevil[1912]: The Wayland connection broke. Did the Wayland compositor die?
Jul 24 13:05:31 linux-e202.suse.de systemd-logind[1241]: Inhibitor PowerDevil (KDE handles power events) pid=1912 uid=1000 mode=block stopped.
Jul 24 13:05:31 linux-e202.suse.de systemd-logind[1241]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/login1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=7424 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Jul 24 13:05:31 linux-e202.suse.de kernel: kwin_wayland[1787]: segfault at 7fb14eccb1e0 ip 00007fb154608190 sp 00007fffdd6e1e80 error 4 in libEGL_mesa.so.0.0.0[7fb1545f8000+23000]
Jul 24 13:05:31 linux-e202.suse.de kernel: Code: 00 00 4c 89 e2 48 89 ee 48 89 c7 e8 8a fe ff ff eb ce 0f 1f 84 00 00 00 00 00 55 48 8b 46 68 48 89 d5 48 8b 7a 20 48 8b 40 68 <ff> 50 20 48 89 ef e8 65 0d ff ff b8 01 00 00 00 5d c3 66 66 2e 0f
Jul 24 13:05:31 linux-e202.suse.de systemd-logind[1241]: Inhibitor Bildschirmsperre (Stellt sicher, dass der Bildschirm vor dem Herunterfahren gesperrt wird) pid=1787 uid=1000 mode=delay stopped.
Jul 24 13:05:31 linux-e202.suse.de systemd-logind[1241]: Sent message type=signal sender=n/a destination=n/a path=/org/freedesktop/login1 interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=7425 reply_cookie=0 signature=sa{sv}as error-name=n/a error-message=n/a
Jul 24 13:05:32 linux-e202.suse.de dbus-daemon[1197]: [system] Activating via systemd: service name='org.freedesktop.resolve1' unit='dbus-org.freedesktop.resolve1.service' requested by ':1.2' (uid=0 pid=1205 comm="/usr/sbin/NetworkManager --no-daemon ")
Jul 24 13:05:32 linux-e202.suse.de dbus-daemon[1197]: [system] Activation via systemd failed for unit 'dbus-org.freedesktop.resolve1.service': Unit dbus-org.freedesktop.resolve1.service not found.
Jul 24 13:05:32 linux-e202.suse.de systemd-logind[1241]: Got message type=signal sender=:1.0 destination=n/a path=/org/freedesktop/systemd1 interface=org.freedesktop.systemd1.Manager member=UnitRemoved cookie=15623 reply_cookie=0 signature=so error-name=n/a error-message=n/a
Jul 24 13:05:34 linux-e202.suse.de systemd-logind[1241]: Got message type=signal sender=org.freedesktop.DBus destination=n/a path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=NameOwnerChanged cookie=875 reply_cookie=0 signature=sss error-name=n/a error-message=n/a
Jul 24 13:05:34 linux-e202.suse.de systemd-logind[1241]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=RemoveMatch cookie=7426 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Jul 24 13:05:35 linux-e202.suse.de sddm-helper[1723]: [PAM] Closing session
Jul 24 13:05:35 linux-e202.suse.de kernel: traps: ThreadPoolSingl[2756] trap int3 ip:5650f3898797 sp:7f14f9115f60 error:0 in vivaldi-bin[5650f144a000+6c6f000]
Jul 24 13:05:35 linux-e202.suse.de systemd[23050]: run-user-1000-doc.mount: Succeeded.
Jul 24 13:05:35 linux-e202.suse.de systemd[23050]: run-user-1000-gvfs.mount: Succeeded.
Jul 24 13:05:35 linux-e202.suse.de systemd[1726]: run-user-1000-doc.mount: Succeeded.
Jul 24 13:05:35 linux-e202.suse.de systemd[1726]: run-user-1000-gvfs.mount: Succeeded.
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Failed to reset keyboard mode, ignoring: Input/output error
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Failed to set VT_AUTO mode, ignoring: Input/output error
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Failed to restore VT, ignoring: Input/output error
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Got message type=method_call sender=:1.1406 destination=org.freedesktop.login1 path=/org/freedesktop/login1 interface=org.freedesktop.login1.Manager member=ReleaseSession cookie=2 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Sent message type=method_call sender=n/a destination=org.freedesktop.DBus path=/org/freedesktop/DBus interface=org.freedesktop.DBus member=GetConnectionUnixUser cookie=7427 reply_cookie=0 signature=s error-name=n/a error-message=n/a
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Got message type=method_return sender=org.freedesktop.DBus destination=:1.6 path=n/a interface=n/a member=n/a cookie=876 reply_cookie=7427 signature=u error-name=n/a error-message=n/a
Jul 24 13:05:35 linux-e202.suse.de systemd-logind[1241]: Sent message type=method_return sender=n/a destination=:1.1406 path=n/a interface=n/a member=n/a cookie=7428 reply_cookie=2 signature=n/a error-name=n/a error-message=n/a
Jul 24 13:05:35 linux-e202.suse.de sddm-helper[1723]: pam_unix(sddm:session): session closed for user fabian
Jul 24 13:05:35 linux-e202.suse.de sddm-helper[1723]: [PAM] Ended.
Comment 8 Franck Bui 2019-10-18 15:49:48 UTC
@Fabian, as discussed privately I'll attach a patch that you could try out after making sure that you still reproduce this issue.

Thanks !
Comment 9 Franck Bui 2019-10-18 15:50:24 UTC
Created attachment 821945 [details]
Patch that tries to restore VT twice
Comment 10 Fabian Vogt 2019-10-24 13:10:17 UTC
(In reply to Franck Bui from comment #9)
> Created attachment 821945 [details]
> Patch that tries to restore VT twice

Added on top of v243 and after triggering some crashes I get "systemd-logind[1587]: FOO: First attempt to restore VT failed, retrying a 2nd time with a fresh handle" and am able to switch VTs.

I can't say for sure that it's a 100% fix, but so far the issue hasn't returned.
Comment 11 Franck Bui 2019-10-29 11:00:12 UTC
Thanks Fabian for the testing. I'll submit the patch to upstream.
Comment 12 Franck Bui 2019-10-29 11:13:26 UTC
However I can't remember why my first attempt [1] to fix this race was about to make logind spawn a process to become the controlling terminal process. The new fix seems much simpler but maybe I'm missing something...

Takashi, could you confirm that there's no need for a process (such as logind) to become the controlling terminal process before restoring VT_AUTO ?

Also, one thing I'm still wondering: if the process that has set VT_PROCESS dies, why doesn't the kernel automatically restore VT_AUTO for the VT ?

Leaving the VT unusable until another "foreign" process restore VT_AUTO doesn't sound the best option.

[1] https://github.com/systemd/systemd/pull/9961
Comment 13 Takashi Iwai 2019-10-29 15:38:14 UTC
Erm, that's a way old issue I worked on it, so my memory is really vague.  Now reading back the bsc#1101591, too...

IIRC, the switch to the fallback ops happens at disassociate_ctty() function, that is, when the control process exits.  And the race that was seen in the past was where another process (dbus stuff) was a session leader and it exits in a racy way while logind tries to restore it.  So the fix was that logind itself takes the control.

But yeah, just for restoring the VT mode itself, the process doesn't need to be the control process.

The restore to VC_AUTO is handled in reset_vc() which is called only at change_console() and vc_init().  This might be improved by invoking reset_vc() somehow at vt's close op when the control process exits.

But I believe Jiri is the right person to ask this; he is the maintainer of tty layer, after all ;)  Cc'ed.
Comment 14 Franck Bui 2019-10-29 16:35:10 UTC
(In reply to Takashi Iwai from comment #13)
> Erm, that's a way old issue I worked on it, so my memory is really vague. 
> Now reading back the bsc#1101591, too...

yeah sorry, but my question is only about the kernel behavior here that is why it doesn't restore VT_AUTO when the control process (i.e. the one that set VT_PROCESS) dies.
Comment 15 Takashi Iwai 2019-10-29 18:18:34 UTC
(In reply to Franck Bui from comment #14)
> (In reply to Takashi Iwai from comment #13)
> > Erm, that's a way old issue I worked on it, so my memory is really vague. 
> > Now reading back the bsc#1101591, too...
> 
> yeah sorry, but my question is only about the kernel behavior here that is
> why it doesn't restore VT_AUTO when the control process (i.e. the one that
> set VT_PROCESS) dies.

As far as I read the code, the kernel seems restoring to VT_AUTO not at hangup or close but only at changing the VT.  As mentioned, change_console() has a couple of places checking the presence of the control session and calls reset_vc() accordingly.  But I don't know why it's written so and whether it's the intentional behavior, too.
Comment 16 Franck Bui 2019-11-13 15:29:20 UTC
I finally queued the fix for Factory/TW. It should be available in the next update so I'm closing this bug.

I prefer not backporting the fix to SLE as I'm not enterily sure that it won't cause any regression and since it addresses a corner case, let's wait until someone face it again on the older distros.

Jiri, it would be nice if you could answer the question in comment #12 though.