Bug 1221906 - [15SP6][HA] openQA test fails in pacemaker_cts_cluster_exerciser: cts-lab hangs due to issue of `journalctl --until`
Summary: [15SP6][HA] openQA test fails in pacemaker_cts_cluster_exerciser: cts-lab han...
Status: VERIFIED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: systemd
Version: unspecified
Hardware: Other Other
Priority: P3 - Medium; Severity: Normal
Target Milestone: ---
Assignee: systemd maintainers
QA Contact:
URL: https://openqa.suse.de/tests/13845070...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-03-25 02:09 UTC by lili zhao
Modified: 2024-07-12 09:50 UTC
CC List: 6 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
screen shot for 'cts-lab' (195.99 KB, image/png)
2024-03-25 02:33 UTC, lili zhao
pacemaker_cts_cluster_exerciser test results (passed) (218.21 KB, image/png)
2024-03-28 08:21 UTC, lili zhao
output of rpm-qi-changelog (14.14 KB, image/png)
2024-04-23 03:25 UTC, lili zhao

Description lili zhao 2024-03-25 02:09:18 UTC
- Environment Details:
    SLE15 SP6 recent builds, all arches

- Summary:
    In the openQA test case "ha_pacemaker_cts_client", the test module "pacemaker_cts_cluster_exerciser" times out on the command "cts-lab". It hangs because of a systemd issue in how journalctl handles logs.

- Reproduce steps:
  1. Set up the test env.
  The test cases are in openQA; I can help set up the environments as https://openqa.suse.de/tests/13825589#next_previous did.
  2. Log in to "ha_pacemaker_cts_client".
  3. Run cts-lab as following:
	/usr/share/pacemaker/tests/cts-lab \
	  --nodes 'pacemaker-node01 pacemaker-node02' \
	  --stonith-type external/sbd \
	  --stonith-args pcmk_delay_max=30,pcmk_off_action=reboot,action=reboot \
	  --test-ip-base 10.0.2.20 \
	  --no-loop-tests \
	  --no-unsafe-tests \
	  --at-boot 1 \
	  --outputfile /tmp/cts_cluster_exerciser.log \
	  --once
  4. "cts-lab" hangs; see the attached screenshot.

  5. Some feedback from Yan Gao, FYI:
  We are definitely suffering from a fresh systemd issue in how journalctl handles logs here, which I believe is this one: https://github.com/systemd/systemd/issues/31776
  It looks like it has just been fixed by this PR: https://github.com/systemd/systemd/pull/31861
Comment 1 lili zhao 2024-03-25 02:33:32 UTC
Created attachment 873762
screen shot for 'cts-lab'
Comment 2 Yan Gao 2024-03-25 07:03:43 UTC
(In reply to lili zhao from comment #0)
>   5. Some feedback from Yan Gao FYI:
>   We are definitely suffering from a fresh issue with systemd about
> journalctl handling logs here, which i believe is about this:
> https://github.com/systemd/systemd/issues/31776
>   Looks like it has been just fixed by this
> https://github.com/systemd/systemd/pull/31861

To clarify, it's not journalctl itself that hangs. It's cts-lab that waits for the collection of logs older than a specified timestamp to complete, but the collection never completes.

Apparently `journalctl --until` no longer works correctly in combination with the options `--after-cursor` and `--lines`. When all of these options are supplied together, journalctl keeps returning logs that are newer than the specified `--until` timestamp.
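The misbehavior described above can be checked directly: every entry journalctl returns for `--until T` should have a timestamp at or before T. Below is a minimal sketch of such a check, assuming that `journalctl -o json` prints one JSON object per line with `__REALTIME_TIMESTAMP` (microseconds since the epoch, as a string) and that `--until` accepts `@<epoch seconds>`. The script name, argument order and default line count are hypothetical; this is not how cts-lab itself collects logs.

```python
"""Sketch: detect journalctl entries newer than the --until timestamp.

Assumes `journalctl -o json` emits one JSON object per line with the
`__REALTIME_TIMESTAMP` field in microseconds since the epoch (as a string).
"""
import json
import subprocess
import sys


def find_violations(entries, until_usec):
    """Return the entries whose realtime timestamp is later than until_usec."""
    return [e for e in entries if int(e["__REALTIME_TIMESTAMP"]) > until_usec]


def run_journalctl(cursor, until_epoch_s, lines):
    # Combine the three options named in the report in a single query.
    cmd = [
        "journalctl",
        "--after-cursor", cursor,
        "--until", f"@{until_epoch_s}",
        "--lines", str(lines),
        "-o", "json",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]


def main(args):
    # Hypothetical usage: check_until.py CURSOR UNTIL_EPOCH_SECONDS [LINES]
    cursor, until_s = args[0], int(args[1])
    lines = int(args[2]) if len(args) > 2 else 100
    bad = find_violations(run_journalctl(cursor, until_s, lines), until_s * 1_000_000)
    print(f"{len(bad)} entries newer than --until")
    return 1 if bad else 0


# Only query the journal when a cursor and timestamp are actually given.
if __name__ == "__main__" and len(sys.argv) >= 3:
    sys.exit(main(sys.argv[1:]))
```

A non-zero count on an affected build, and zero once the backport is installed, would match the behavior reported here.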
Comment 3 Franck Bui 2024-03-25 10:08:25 UTC
Hi,

(In reply to lili zhao from comment #0)
>   We are definitely suffering from a fresh issue with systemd about
> journalctl handling logs here, which i believe is about this:
> https://github.com/systemd/systemd/issues/31776
>   Looks like it has been just fixed by this
> https://github.com/systemd/systemd/pull/31861

Would it be possible to try a test package that includes the backport of this PR?
Comment 4 lili zhao 2024-03-26 03:07:12 UTC
Hi Franck Bui,
I can help create the openQA test environments for debugging, but could you please tell me how to apply the fix on the VM (for example, via make, patch, ...)?
Comment 5 Franck Bui 2024-03-26 15:56:53 UTC
Hi Lili,

(In reply to lili zhao from comment #4)
> I can help to create the openQA test case envs for debugging, but could you
> please tell me how to make the fix take effect on VM (for example, how to do
> make/patch/...)?

The idea is that I'll provide you with a TEST package on IBS that includes the backport of the suggested fix. You can then use it in your openQA test case to check whether it effectively fixes the reported issue.
Comment 6 lili zhao 2024-03-27 00:53:55 UTC
(In reply to Franck Bui from comment #5)
> Hi Lili,
> 
> (In reply to lili zhao from comment #4)
> > I can help to create the openQA test case envs for debugging, but could you
> > please tell me how to make the fix take effect on VM (for example, how to do
> > make/patch/...)?
> 
> The idea is that I'll provide you a TEST package on IBS including the
> backport of the suggested fix that you'll use in your openQA test case to
> check if it's effectively fix the reported issue.

Got it, thanks!
Comment 7 Franck Bui 2024-03-27 10:13:58 UTC
Lili, you can find the test package at this URL: https://build.suse.de/package/show/home:fbui:systemd:SLE-15-SP6-bsc1221906/systemd

Can you please give it a try?
Comment 8 lili zhao 2024-03-28 08:19:59 UTC
(In reply to Franck Bui from comment #7)
> Lili, you can find the test package by following this url:
> https://build.suse.de/package/show/home:fbui:systemd:SLE-15-SP6-bsc1221906/
> systemd
> 
> Can you please give it a try ?

Thanks for the package; the test cases passed with the new repo: https://openqa.suse.de/tests/13897424#step/pacemaker_cts_cluster_exerciser/33 (see the test results or the attachment).
Comment 9 lili zhao 2024-03-28 08:21:40 UTC
Created attachment 873884
pacemaker_cts_cluster_exerciser test results (passed)
Comment 10 Franck Bui 2024-03-28 09:06:44 UTC
Thanks for testing Lili, I'll queue the patch for the next SP6 snapshot.
Comment 11 lili zhao 2024-04-22 02:10:35 UTC
Updates:
This issue still exists on the latest build, Public RC candidate 82.1:
https://openqa.suse.de/tests/14107506#step/pacemaker_cts_cluster_exerciser/23
Comment 12 Franck Bui 2024-04-22 07:46:50 UTC
Did you check that the installed version of systemd contains the fix?
Comment 13 lili zhao 2024-04-23 03:24:39 UTC
I did not find any changelog entry related to the fix in the output of "# rpm -qi --changelog systemd | grep journalctl" or "# rpm -qi --changelog systemd | grep bsc | grep 1221906"; see the attached screenshot for details.
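The changelog check above can also be scripted so that it matches the exact bug number rather than any line containing it. A minimal sketch follows, assuming that SUSE changelog entries cite bugs as "bsc#<number>" (an assumption based on the bug's own naming, not confirmed in this report); the command-line argument is a hypothetical usage choice.

```python
"""Sketch: check whether an installed package's changelog cites a bug number.

Assumes changelog entries reference bugs as "bsc#<number>".
"""
import re
import subprocess
import sys


def changelog_mentions(changelog_text, bug_id):
    """Return the changelog lines that reference exactly bsc#<bug_id>."""
    # \b stops bsc#1221906 from also matching e.g. bsc#12219060.
    pattern = re.compile(rf"bsc#{re.escape(str(bug_id))}\b")
    return [line for line in changelog_text.splitlines() if pattern.search(line)]


def installed_changelog(package="systemd"):
    # `rpm -q --changelog` prints the changelog without the `-i` info header.
    result = subprocess.run(
        ["rpm", "-q", "--changelog", package],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Hypothetical usage: check_changelog.py 1221906
if __name__ == "__main__" and len(sys.argv) >= 2:
    hits = changelog_mentions(installed_changelog(), sys.argv[1])
    print("\n".join(hits) if hits else "bug not referenced in the changelog")
```

An empty result is consistent with the situation reported here, where the fix had not yet been submitted to the build under test.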
Comment 14 lili zhao 2024-04-23 03:25:23 UTC
Created attachment 874433
output of rpm-qi-changelog
Comment 15 Thao Huynh 2024-04-30 08:27:02 UTC
Do you have any update on this bug, @Franck?
Comment 16 Franck Bui 2024-04-30 14:03:37 UTC
Sorry, I thought the fix had already been submitted, but that turned out to be wrong.

I have now submitted it via sr#328293.

Please allow some time for it to become available.
Comment 22 OBSbugzilla Bot 2024-05-09 20:15:06 UTC
This is an autogenerated message for OBS integration:
This bug (1221906) was mentioned in
https://build.opensuse.org/request/show/1172983 Factory / systemd
Comment 23 Thao Huynh 2024-05-10 12:39:21 UTC
Thanks for the fix, Franck; we can move the status forward now.
Comment 24 Franck Bui 2024-05-15 13:40:23 UTC
Thanks for the positive status, I'll close the bug then.
Comment 26 lili zhao 2024-05-22 01:48:29 UTC
Verified fixed.