Bug 1223178 - [Build 80.1] openQA test fails in sshd: Segfault or fatal journal entry detected in journal
Summary: [Build 80.1] openQA test fails in sshd: Segfault or fatal journal entry detec...
Status: VERIFIED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Basesystem (show other bugs)
Version: unspecified
Hardware: Other Other
: P2 - High : Normal
Target Milestone: ---
Assignee: Antonio Larrosa
QA Contact:
URL: https://openqa.suse.de/tests/14103898...
Whiteboard: Seams not severe at all as a journali...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-04-22 04:06 UTC by Chenzi Cao
Modified: 2024-05-09 06:06 UTC (History)
7 users (show)

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
sshd journal log (3.47 KB, text/plain)
2024-04-22 04:06 UTC, Chenzi Cao
Details
Whole journal log (158.82 KB, text/x-log)
2024-04-22 04:36 UTC, Chenzi Cao
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Chenzi Cao 2024-04-22 04:06:56 UTC
Created attachment 874400 [details]
sshd journal log

Environment: ppc64le

Steps:
- Fresh install SLES15SP6 build80.1 by libyui, with basesystem module, server applications module and python3 module
- Minimal system role

Result:
After installation, segfault or fatal journal entry detected.

Please find the attached sshd journal log for this issue, and I'll provide the whole journal log later.

## Observation

openQA test in scenario sle-15-SP6-Online-ppc64le-minimal+role_minimal@ppc64le-spvm fails in
[sshd](https://openqa.suse.de/tests/14103898/modules/sshd/steps/427)

## Test suite description
Like default but explicitly select the system role "minimal". The resulting system should roughly correspond to an unregistered system but with access to modules for optional installation.


## Reproducible

Fails since (at least) Build [62.1](https://openqa.suse.de/tests/13711573)


## Expected result

Last good: (unknown) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le-spvm&test=minimal%2Brole_minimal&version=15-SP6)
Comment 1 Chenzi Cao 2024-04-22 04:36:14 UTC
Created attachment 874401 [details]
Whole journal log

This is the whole journal log.

@Antonio Larrosa, I run a case with "script_run("update-crypto-policies --set LEGACY");" in sshd.pm, the issue still exists:

https://openqa.suse.de/tests/14112210#step/sshd/437
Comment 2 Radoslav Tzvetkov 2024-04-22 07:44:45 UTC
As I understand, this is a journaling bug that does not impede the functionality?
Comment 3 Antonio Larrosa 2024-04-23 09:05:35 UTC
Radoslav, the problem is that sshd shows this in the journal:

Apr 22 00:11:52 grenache-7 sshd[2299]: fatal: Timeout before authentication for 2a07:de40:b203:12:4048:e1ff:fed7:b5d0 port 33548

That is reported as a fatal error so the openqa test fails because it's not ignored as other fatal errors are ignored in sshd.pm since they're known and expected (see the `if` clause at https://github.com/chcao/os-autoinst-distri-opensuse/blob/33997b81bf65c14001e07a3f38203de8ec102d20/tests/console/sshd.pm#L111 )

What I'm not sure is if we should also ignore this error or there's a real problem (which I'm leaning to think). Specially because this morning I noticed someone reported a similar issue upstream with ssh connections hanging out after upgrading their clients and server to 9.6 (although in this case, the report says it's the client that gives a timeout, not the server, so it might be a different issue, but I'll be tracking it: https://lists.mindrot.org/pipermail/openssh-unix-dev/2024-April/041311.html ).

I see that there's already openssh 9.7 that has some work done related to timeouts (https://www.openssh.com/releasenotes.html) so I'll talk with Chenzi Cao to test that in openqa to see if it's fixed with it.
Comment 4 Radoslav Tzvetkov 2024-04-24 12:51:56 UTC
At this stage of the project, I strongly discourage upgrading.

Please backport to fix the issue. However, our SR deadline is 15.05.
Comment 5 Antonio Larrosa 2024-04-25 15:26:57 UTC
Yes, that was my idea, to backport the fix from 9.7 if 9.7 fixed the issue (specially since I'm not confident about the rebase of the gssapi patch needed for 9.7, but that's a different issue).

Chenzi Cao and me have been debugging this in the last days, so I'll add an update just to keep everyone on the loop.

In the last test run at https://openqa.suse.de/tests/14146074, one can see this in the sshd journal log:

```
# journalctl -b -u sshd.service; echo aQWQh-$?-
Apr 25 06:48:30 grenache-5 systemd[1]: Starting OpenSSH Daemon...
Apr 25 06:48:30 grenache-5 sshd-gen-keys-start[2116]: Checking for missing server keys in /etc/ssh
Apr 25 06:48:30 grenache-5 sshd-gen-keys-start[2130]: ssh-keygen: generating new host keys: RSA ECDSA ED25519
Apr 25 06:48:31 grenache-5 sshd[2185]: Server listening on 0.0.0.0 port 22.
Apr 25 06:48:31 grenache-5 sshd[2185]: Server listening on :: port 22.
Apr 25 06:48:31 grenache-5 systemd[1]: Started OpenSSH Daemon.
Apr 25 06:48:41 grenache-5 sshd[2499]: Connection closed by 10.145.10.212 port 56150
Apr 25 06:48:43 grenache-5 sshd[2503]: Connection closed by 10.145.10.212 port 56160
Apr 25 06:48:43 grenache-5 sshd[2504]: Accepted keyboard-interactive/pam for root from 2a07:de40:b203:12:4048:e1ff:fed7:b5d0 port 43220 ssh2
Apr 25 06:48:44 grenache-5 sshd[2504]: pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 25 06:48:50 grenache-5 sshd[2529]: Accepted keyboard-interactive/pam for root from 2a07:de40:b203:12:4048:e1ff:fed7:b5d0 port 43226 ssh2
Apr 25 06:48:50 grenache-5 sshd[2529]: pam_unix(sshd:session): session opened for user root by (uid=0)
Apr 25 06:50:41 grenache-5 sshd[2500]: fatal: Timeout before authentication for 2a07:de40:b203:12:4048:e1ff:fed7:b5d0 port 43212
```

Considering that the sshd service is enabled and journalctl -b is used, the start of sshd happening at 06:48:30 is the start of the service during system boot. The fatal timeout happens at 6:50:41. So I checked in the code and the login authentication is using a 120 seconds timeout, which means the connection was initiated at 6:48:41 which is just 10 seconds after sshd is started and much before the sshd test script that fails is started.

Looking at autoinst-log.txt we can see that around 6:48:41 openQA was doing this:

```
[2024-04-25T12:48:40.349311+02:00] [debug] [pid:2786] ||| finished first_boot installation (runtime: 68 s)
[2024-04-25T12:48:40.351370+02:00] [debug] [pid:2786] ||| starting system_prepare tests/console/system_prepare.pm
[2024-04-25T12:48:40.356563+02:00] [debug] [pid:2786] tests/console/system_prepare.pm:31 called testapi::select_console
[2024-04-25T12:48:40.356703+02:00] [debug] [pid:2786] <<< testapi::select_console(testapi_console="root-console")
[2024-04-25T12:48:40.385429+02:00] [debug] [pid:2787] Establishing VNC connection to localhost:55591
[2024-04-25T12:48:41.630354+02:00] [debug] [pid:2787] Connected to Xvnc - PID 8821
icewm PID is 8825
[2024-04-25T12:48:42.653326+02:00] [debug] [pid:2787] Wait for SSH on host grenache-5.oqa.prg2.suse.org (timeout: 120)
[2024-04-25T12:48:42.659505+02:00] [debug] [pid:2787] Xterm PID: 8826
[2024-04-25T12:48:42.713797+02:00] [debug] [pid:2787] led state 0 0 0 -261
[2024-04-25T12:48:42.720191+02:00] [debug] [pid:2786] tests/console/system_prepare.pm:31 called testapi::select_console -> lib/susedistribution.pm:820 called Utils::Backends::use_ssh_serial_console -> lib/Utils/Backends.pm:98 called testapi::select_console
[2024-04-25T12:48:42.720466+02:00] [debug] [pid:2786] <<< testapi::select_console(testapi_console="root-ssh")
[2024-04-25T12:48:42.728796+02:00] [debug] [pid:2787] Establishing VNC connection to localhost:60773
XIO:  fatal IO error 11 (Resource temporarily unavailable) on X server ":34011"
      after 28809 requests (28652 known processed) with 0 events remaining.
xterm: fatal IO error 11 (Resource temporarily unavailable) or KillClient on X server ":34011"
[2024-04-25T12:48:43.753414+02:00] [info] [pid:2523] ::: backend::driver::_collect_orphan: Driver backend collected unknown process with pid 7185 and exit status: 1
[2024-04-25T12:48:44.004029+02:00] [debug] [pid:2787] Connected to Xvnc - PID 8827
```

Note that the host is running in UTC+2 and the vm is running in UTC-4 which gives a 6 hour difference between them, so we can just look at minutes and seconds ignoring the hours.

Having that in mind, it seems that openQA was running `select_console 'root-console';` from tests/console/system_prepare.pm:31 at that time, which indeed internally uses ssh to run Xvnc, but that didn't give any error by itself. We will keep investigating.
Comment 6 Chenzi Cao 2024-04-30 09:41:52 UTC
To update status here:
Antonio and I are debugging it, this is the latest VR:
https://openqa.suse.de/tests/14165193

If need to file progress ticket to update the testing, please feel free to contact us (qe-yam squad), I cc our PO to this bug report, thanks.
Comment 7 Hans Petter Jansson 2024-05-02 17:06:17 UTC
Is this happening on other archs too, or is it just ppc64le? Are the same tests succeeding in other environments?
Comment 9 Antonio Larrosa 2024-05-03 06:48:37 UTC
This definitely look like a journalling issue, The error message is written to the log inside the select_console 'root-console'; call in system_prepare.pm before the sshd.pm tests are even started. I'd like to debug it a bit more to understand why it's happening but I don't have much time, so in the meantime, I'd suggest to ignore this error in sshd.pm just like the code that handles the diffie-hellman-group1-sha1 log message (bsc#1185584).

Btw, note that we probably only want to handle this for ppc64le since there are chances that ignoring this error might also ignore other relevant errors (as is the current case if there are two "fatal" log messages and one of them is the diffie-hellman-group1-sha1 case, in which case, the other error is silently ignored in sshd.pm line 109). So IMHO I'd suggest using something like ` | grep -v 'fatal: Timeout before authentication for [0-9a-fA-F:]\+ port'` at the end of both grep commands in lines https://github.com/os-autoinst/os-autoinst-distri-opensuse/blob/master/tests/console/sshd.pm#L107 and 108 (but as I said, only for ppc64le). This will silently ignore this error, but at least it won't silently ignore others that might appear in the future.

In any case, I'll let the decision on what to do to the openQA experts.

Just for the record, we tested openssh 9.7 (just to see if it fixed anything) and it didn't change the behaviour.
Comment 10 Radoslav Tzvetkov 2024-05-03 12:50:49 UTC
Chenzi can you please verify?
Comment 11 Radoslav Tzvetkov 2024-05-03 12:52:13 UTC
I'm putting it as resolved to confirm Antonio's conclusion.
Comment 13 Lemon Li 2024-05-09 06:06:26 UTC
We have filed a ticket https://progress.opensuse.org/issues/159891 following Antonio Larrosa's suggestion, and the related PR https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19248 already merged. 

VR: https://openqa.suse.de/tests/14234820#step/sshd/426

Verified.