Bug 1192336

Summary: [Build 58.2] openQA test fails in ctdb
Product: [openSUSE] PUBLIC SUSE Linux Enterprise High Availability Extension 15 SP4
Reporter: Lumir Palovsky <lpalovsky>
Component: Other
Assignee: Noel Power <nopower>
Status: RESOLVED FIXED
QA Contact:
Severity: Normal
Priority: P2 - High
CC: llzhao, mloviska, nopower, rtsvetkov, samba-maintainers, suse-beta, zzhou
Version: unspecified
Target Milestone: PublicBeta-202202
Hardware: Other
OS: SLES 15
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: HB report, y2logs, problem detection logs

Description Lumir Palovsky 2021-11-04 10:03:13 UTC
Created attachment 853528 [details]
HB report

## Observation

Hello,
there is a consistent failure in the ctdb module on SLE15SP4 Build 58.2.
The issue affects all architectures (aarch64, ppc64le, x86_64).

The problem seems to be in sbd rather than in ctdb, but I was not able to narrow the issue down any further.

In the journalctl output I see the following error messages:

Nov 01 18:11:04.617651 ctdb-node01 sbd[6719]: /dev/disk/by-path/ip-10.0.2.1:3260-iscsi-iqn.2016-02.de.openqa:132-lun-0:    error: servant_md: mbox read failed in servant.
Nov 01 18:16:43.250898 ctdb-node01 nmbd[11893]: [2021/11/01 18:16:43.250837,  0] ../../source3/nmbd/nmbd.c:902(main)
Nov 01 18:16:43.251791 ctdb-node01 nmbd[11893]:   nmbd version 4.15.0-git.177.73057cd57b61.3-SUSE-oS15.0-x86_64 started.
Nov 01 18:16:43.251924 ctdb-node01 nmbd[11893]:   Copyright Andrew Tridgell and the Samba Team 1992-2021
Nov 01 18:16:46.318278 ctdb-node01 smbd[11933]: [2021/11/01 18:16:46.318219,  0] ../../source3/smbd/server.c:1738(main)
Nov 01 18:16:46.318877 ctdb-node01 smbd[11933]:   smbd version 4.15.0-git.177.73057cd57b61.3-SUSE-oS15.0-x86_64 started.
Nov 01 18:16:46.319015 ctdb-node01 smbd[11933]:   Copyright Andrew Tridgell and the Samba Team 1992-2021
Nov 01 18:16:46.371723 ctdb-node01 smbd[11933]: [2021/11/01 18:16:46.371671,  0] ../../lib/util/become_daemon.c:120(exit_daemon)
Nov 01 18:16:46.371744 ctdb-node01 smbd[11933]:   exit_daemon: daemon failed to start: Samba failed to init printing subsystem, error code 13
Nov 01 18:16:50.890848 ctdb-node01 nmbd[11893]: [2021/11/01 18:16:50.890760,  0] ../../source3/nmbd/nmbd.c:60(terminate)
Nov 01 18:16:50.890869 ctdb-node01 nmbd[11893]:   Got SIGTERM: going down...

I have found an explanation that this could be an sbd device timeout issue, but it appeared only with the newest build.

I have also cloned the job on my personal instance and got the same result.
Below, in the "## Reproducible" section, you can find links to all the test runs.
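
For triage, the fatal entry in the journal excerpt above is the smbd `exit_daemon` line rather than the earlier sbd message. The following is a minimal Python sketch for pulling such entries out of saved journal text. The helper name `find_daemon_failures` and the `JOURNAL` sample are hypothetical, and the sketch assumes that the reported "error code" is a plain errno value (13 is `EACCES` on Linux), which the log itself does not state.

```python
import errno
import re

# Matches the "exit_daemon" message format seen in the journal excerpt.
FAILURE_RE = re.compile(
    r"(?P<unit>\w+)\[(?P<pid>\d+)\]:\s+exit_daemon: daemon failed to start: "
    r"(?P<reason>.+), error code (?P<code>\d+)$"
)


def find_daemon_failures(lines):
    """Return one dict per 'daemon failed to start' journal line."""
    failures = []
    for line in lines:
        match = FAILURE_RE.search(line.rstrip())
        if match is None:
            continue
        code = int(match.group("code"))
        failures.append({
            "unit": match.group("unit"),
            "pid": int(match.group("pid")),
            "reason": match.group("reason"),
            "code": code,
            # Assumption: the code is an errno value.
            "errno_name": errno.errorcode.get(code, "unknown"),
        })
    return failures


# Two lines copied from the journal excerpt in this report.
JOURNAL = [
    "Nov 01 18:16:46.371744 ctdb-node01 smbd[11933]:   exit_daemon: daemon "
    "failed to start: Samba failed to init printing subsystem, error code 13",
    "Nov 01 18:16:50.890869 ctdb-node01 nmbd[11893]:   Got SIGTERM: going down...",
]

for failure in find_daemon_failures(JOURNAL):
    print(failure["unit"], failure["errno_name"], failure["reason"])
```

If the errno assumption holds, a permission error at smbd startup points toward an access-control layer rather than toward the sbd device.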

## Test suite description
The base test suite is used for job templates defined in YAML documents. It has no settings of its own.


## Reproducible

Fails since (at least) Build [43.1](https://openqa.suse.de/tests/7262075)

aarch64:
https://openqa.suse.de/tests/7583912

ppc64le:
https://openqa.suse.de/tests/7584337

x86_64:
https://openqa.suse.de/tests/7604278

x86_64 reproduced on separate instance outside of OSD:
https://mordor.suse.cz/tests/1530


## Expected result

Last good: [39.1](https://openqa.suse.de/tests/7257330) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=ppc64le&distri=sle&flavor=Online&machine=ppc64le&test=ha_ctdb_node01&version=15-SP4)
Comment 1 Lumir Palovsky 2021-11-04 10:03:56 UTC
Created attachment 853529 [details]
y2logs
Comment 2 Lumir Palovsky 2021-11-04 10:04:25 UTC
Created attachment 853530 [details]
problem detection logs
Comment 3 openQA Review 2021-11-19 00:26:39 UTC
This is an autogenerated message for openQA integration by the openqa_review script:

This bug is still referenced in a failing openQA test: ha_ctdb_node01
https://openqa.suse.de/tests/7667308

To prevent further reminder comments one of the following options should be followed:
1. The test scenario is fixed by applying the bug fix to the tested product or the test is adjusted
2. The openQA job group is moved to "Released" or "EOL" (End-of-Life)
3. The bugref in the openQA scenario is removed or replaced, e.g. `label:wontfix:boo1234`
Comment 4 Roger Zhou 2021-12-02 07:18:11 UTC
Hi @Lumir,

Try to use the Bugzilla "Other" component less. It is too ambiguous to catch the attention of the individual domain experts, so responses come slower than we expect.

When you think sbd is suspicious, that's a good starting point indeed, but try to prove that and eventually use the "sbd" component in Bugzilla.

However, in this case, you might clarify why the sbd device is not accessible before approaching the sbd developers. Is the iSCSI target still around, for example?

My gut feeling is that this bug is probably an openQA configuration problem.
Comment 5 Lumir Palovsky 2021-12-06 10:16:52 UTC
(In reply to Roger Zhou from comment #4)
> Hi @Lumir,
> 
> Try to less use Bugzilla "Other" component. It is ambiguous to catch the
> attention of the individual domain expert to respond faster than we expect. 
> 
> When you think sbd is suspicious, that's a good starting point indeed, but
> try to prove that and eventually use "sbd" component in Bugzilla.
> 
> However, in this case, you might clarify why the sbd device is not
> accessible before approach sbd developers. Is iscsi target still around, for
> example? 
> 
> My gut feeling, probably this bug is some openQA configuration problem.


Hello Roger,

Sorry, I didn't know that this component is not checked that often.
The problem is that there are issue indicators in different places and I am not sure at the moment which is the actual cause of the failure.

I am less inclined to think the test itself is at fault, as the failure is consistent across architectures, including outside of OSD. There is a failing nmb service too, so maybe it is Samba-related rather than HA-related.
Currently I am collecting more data, and once I have something clearer I will reassign it to the correct group.
Comment 6 Lumir Palovsky 2021-12-06 11:23:33 UTC
Hello colleagues, 

There is a consistent openQA test failure in the HA build validation ctdb-based tests since Build 58.2.

The issue seems to be in the printing subsystem, as you can see in the message below:

Nov 01 18:16:46.371744 ctdb-node01 smbd[11933]:   exit_daemon: daemon failed to start: Samba failed to init printing subsystem, error code 13

This results in the nmb/smb services not starting up.

Once I disabled the spooler in smb.conf with "disable spoolss = yes", the services were up and running again, as was the whole cluster.

If you need additional logs, please let me know. 

regards

Lumir.
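
The workaround in comment 6 can be checked mechanically on a test node. The following is a minimal Python sketch that reads smb.conf-style text with the standard `configparser` module and reports whether `disable spoolss` is enabled in `[global]`. The function name `spoolss_disabled` and the `SAMPLE` text are hypothetical, and `configparser` only approximates the smb.conf syntax (it does not follow `include` directives, for example), so treat this as a rough check and not as a replacement for Samba's own tooling.

```python
import configparser

# Hypothetical smb.conf fragment carrying the workaround from comment 6.
SAMPLE = """
[global]
workgroup = WORKGROUP
disable spoolss = yes
"""


def spoolss_disabled(conf_text):
    """Return True if 'disable spoolss' is set to a true value in [global]."""
    # strict=False tolerates repeated keys; interpolation=None leaves
    # Samba's %-style variables untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    return parser.getboolean("global", "disable spoolss", fallback=False)


print(spoolss_disabled(SAMPLE))
```

Disabling spoolss only sidesteps the printing subsystem start-up; it does not explain why the initialization was denied in the first place.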
Comment 7 Noel Power 2021-12-06 11:35:27 UTC
Note: There is an apparmor update in flight https://build.suse.de/request/show/259417 (also note: sle15-sp3 is where sle15-sp4 apparmor inherits from)

*** This bug has been marked as a duplicate of bug 1191532 ***
Comment 8 Lumir Palovsky 2021-12-08 10:57:15 UTC
Just FYI: The bug is still present in the most recent build 70.1.
Comment 9 Lumir Palovsky 2021-12-17 06:52:34 UTC
Hello, 

In Beta2 candidate 74.1 the issue is still present:
https://openqa.suse.de/tests/7863893#step/ctdb/52
Comment 10 Lumir Palovsky 2021-12-17 11:09:09 UTC
Reopening, since the issue was still present in the last Beta2 candidate.
Comment 11 Noel Power 2021-12-17 14:33:55 UTC
(In reply to Lumir Palovsky from comment #10)
> Reopening, since the issue was still present in the last Beta2 candidate.

Indeed, it is still happening (differently, of course).
apparmor in sle15-sp4 was updated to the factory version (which included the original fix for this issue).

Even though TW (Tumbleweed) is running the same versions of samba and apparmor, it appears that an extra rule is now needed for SLE15-SP4 to resolve the new DENIED entries I now see in audit.log, e.g.
    
   apparmor="DENIED" operation="file_mmap" profile="samba-bgqd" name="/usr/lib64/samba/samba-bgqd" pid=2876 comm="samba-bgqd" requested_mask="m" denied_mask="m" fsuid=0 ouid=0

As I said, weirdly this doesn't seem to affect TW/Factory.
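
The DENIED entry quoted in comment 11 is a flat list of `key=value` pairs, some of them quoted. A minimal Python sketch for splitting such an entry into fields is shown here; the name `parse_audit_fields` is hypothetical, and the regular expression only covers the simple quoting seen in this one entry, not every encoding the audit subsystem can produce.

```python
import re

# key="quoted value" or key=bare_value
PAIR_RE = re.compile(r'(\w+)=(?:"([^"]*)"|(\S+))')


def parse_audit_fields(entry):
    """Split an AppArmor audit entry into a dict of string fields."""
    fields = {}
    for key, quoted, bare in PAIR_RE.findall(entry):
        # Exactly one of the two value groups is non-empty for a given pair.
        fields[key] = quoted if quoted else bare
    return fields


# The entry quoted in comment 11.
ENTRY = (
    'apparmor="DENIED" operation="file_mmap" profile="samba-bgqd" '
    'name="/usr/lib64/samba/samba-bgqd" pid=2876 comm="samba-bgqd" '
    'requested_mask="m" denied_mask="m" fsuid=0 ouid=0'
)

fields = parse_audit_fields(ENTRY)
print(fields["profile"], fields["operation"], fields["name"], fields["denied_mask"])
```

Here the `samba-bgqd` profile was denied the `m` (memory-map) permission on its own binary, which is the kind of gap an additional profile rule would have to cover.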
Comment 13 Noel Power 2021-12-20 14:21:40 UTC
Assigning to the public product. It would be good if, in future, these types of bugs could be opened against the public product by default.
Comment 14 Noel Power 2021-12-20 15:06:19 UTC
(In reply to Noel Power from comment #13)
> Assigning to the public product. It would be good if in future these typeof
> bugs could be opened against the public project by default.

https://gitlab.com/apparmor/apparmor/-/merge_requests/819
Comment 15 OBSbugzilla Bot 2021-12-20 21:01:26 UTC
This is an autogenerated message for OBS integration:
This bug (1192336) was mentioned in
https://build.opensuse.org/request/show/941697 Factory / apparmor
Comment 16 Noel Power 2021-12-23 08:35:10 UTC
*** Bug 1192141 has been marked as a duplicate of this bug. ***
Comment 17 Stefan Weiberg 2022-01-04 10:57:51 UTC
This should be resolved with build 78.1
Comment 18 Lumir Palovsky 2022-01-18 10:16:55 UTC
Can confirm, issue does not appear in 81.1 anymore.
Comment 19 Martin Loviska 2022-02-03 08:34:46 UTC
Hello Noel,

we can see the same problem in sle15sp3 - QR
- Tested image:
https://dist.suse.de/ibs/SUSE:/SLE-15-SP3:/Update:/QR/images/SLES15-SP3-JeOS.x86_64-15.3-kvm-and-xen-Build150300.4.7.9.qcow2

```
Feb 03 08:02:15 localhost systemd[1]: Starting Samba SMB Daemon...
Feb 03 08:02:16 localhost update-apparmor-samba-profile[30361]: Reloading updated AppArmor profile for Samba...
Feb 03 08:02:16 localhost smbd[30376]: [2022/02/03 08:02:16.578040,  0] ../../source3/smbd/server.c:1734(main)
Feb 03 08:02:16 localhost smbd[30376]:   smbd version 4.15.4-git.324.8332acf1a63150300.3.25.3-SUSE-oS15.0-x86_64 started.
Feb 03 08:02:16 localhost smbd[30376]:   Copyright Andrew Tridgell and the Samba Team 1992-2021
Feb 03 08:02:16 localhost systemd[1]: Started Samba SMB Daemon.
Feb 03 08:02:16 localhost smbd[30376]: [2022/02/03 08:02:16.665501,  0] ../../lib/util/become_daemon.c:120(exit_daemon)
Feb 03 08:02:16 localhost smbd[30376]:   exit_daemon: daemon failed to start: Samba failed to init printing subsystem, error code 13
Feb 03 08:02:16 localhost systemd[1]: smb.service: Main process exited, code=exited, status=1/FAILURE
Feb 03 08:02:16 localhost systemd[1]: smb.service: Failed with result 'exit-code'.
```

As per https://bugzilla.suse.com/show_bug.cgi?id=1192141#c8 I have dropped a comment here, please let me know whether a new report is needed or not.

Thanks
Comment 20 Noel Power 2022-02-03 09:28:14 UTC
(In reply to Martin Loviska from comment #19)
> Hello Noel,
> 
> we can see the same problem in sle15sp3 - QR
> - Tested image:
> https://dist.suse.de/ibs/SUSE:/SLE-15-SP3:/Update:/QR/images/SLES15-SP3-JeOS.
> x86_64-15.3-kvm-and-xen-Build150300.4.7.9.qcow2
> 
> ```
> Feb 03 08:02:15 localhost systemd[1]: Starting Samba SMB Daemon...
> Feb 03 08:02:16 localhost update-apparmor-samba-profile[30361]: Reloading
> updated AppArmor profile for Samba...
> Feb 03 08:02:16 localhost smbd[30376]: [2022/02/03 08:02:16.578040,  0]
> ../../source3/smbd/server.c:1734(main)
> Feb 03 08:02:16 localhost smbd[30376]:   smbd version
> 4.15.4-git.324.8332acf1a63150300.3.25.3-SUSE-oS15.0-x86_64 started.
> Feb 03 08:02:16 localhost smbd[30376]:   Copyright Andrew Tridgell and the
> Samba Team 1992-2021
> Feb 03 08:02:16 localhost systemd[1]: Started Samba SMB Daemon.
> Feb 03 08:02:16 localhost smbd[30376]: [2022/02/03 08:02:16.665501,  0]
> ../../lib/util/become_daemon.c:120(exit_daemon)
> Feb 03 08:02:16 localhost smbd[30376]:   exit_daemon: daemon failed to
> start: Samba failed to init printing subsystem, error code 13
> Feb 03 08:02:16 localhost systemd[1]: smb.service: Main process exited,
> code=exited, status=1/FAILURE
> Feb 03 08:02:16 localhost systemd[1]: smb.service: Failed with result
> 'exit-code'.
> ```
> 
> As per https://bugzilla.suse.com/show_bug.cgi?id=1192141#c8 I have dropped a
> comment here, please let me know whether a new report is needed or not.
> 
> Thanks

Please ensure that the installed apparmor is the very latest; see https://build.suse.de/package/rdiff/SUSE:SLE-15-SP3:Update/apparmor?linkrev=base&rev=3 for the changelog entries you should see on the installed apparmor (note: this is only a day old, so it is possible it had not reached the repos when your test image was created).
Comment 21 Martin Loviska 2022-02-03 11:45:33 UTC
(In reply to Noel Power from comment #20)
> 
> please ensure that the installed apparmor the very latest, see
> https://build.suse.de/package/rdiff/SUSE:SLE-15-SP3:Update/
> apparmor?linkrev=base&rev=3 and the changelog entries you should see on the
> installed apparmor (note: this is only a day old so it is possible it might
> not have reached the repos when your test image was created)

Yup, you are right! 
The image was built with these pre-installed apparmor packages, which, according to your link, were published on 02-Dec-2021 18:55:06.

# 2022-02-02 20:37:00 libapparmor1.rpm installed ok
2022-02-02 20:37:00|install|libapparmor1|2.13.6-3.8.1|x86_64
# 2022-02-02 20:37:27 apparmor-parser.rpm installed ok
2022-02-02 20:37:27|install|apparmor-parser|2.13.6-3.8.1|x86_64
# 2022-02-02 20:37:28 apparmor-abstractions.rpm installed ok
2022-02-02 20:37:28|install|apparmor-abstractions|2.13.6-3.8.1|noarch
# 2022-02-02 20:37:28 apparmor-profiles.rpm installed ok
2022-02-02 20:37:28|install|apparmor-profiles|2.13.6-3.8.1|noarch
# 2022-02-02 20:37:35 patterns-base-apparmor.rpm installed ok
2022-02-02 20:37:35|install|patterns-base-apparmor|20200124-10.5.1|x86_64

After the update, I have the newest packages, built on 27-Jan-2022 15:16:15.

2022-02-03 11:19:39|install|libapparmor1|2.13.6-150300.3.11.1|x86_64
2022-02-03 11:19:40|install|apparmor-parser|2.13.6-150300.3.11.2|x86_64
2022-02-03 11:19:45|install|apparmor-abstractions|2.13.6-150300.3.11.2|noarch
2022-02-03 11:19:45|install|apparmor-profiles|2.13.6-150300.3.11.2|noarch
2022-02-03 11:32:53|install|apparmor-abstractions|2.13.6-150300.3.11.2|noarch
2022-02-03 11:32:53|install|apparmor-profiles|2.13.6-150300.3.11.2|noarch
2022-02-03 11:32:53|install|patterns-base-apparmor|20200124-10.5.1|x86_64

Sorry for the noise!
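
The package lists in comment 21 appear to be pipe-delimited package-manager history lines (timestamp, action, name, version, architecture); that reading of the format is an assumption based only on the lines quoted there. Under that assumption, a minimal Python sketch for finding the most recently installed version of each package could look like this. The name `latest_installs` is hypothetical, and the sketch orders entries by timestamp only; it does not attempt RPM version comparison.

```python
def latest_installs(history_lines):
    """Map package name -> (timestamp, version, arch) of its newest install."""
    latest = {}
    for line in history_lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        parts = line.split("|")
        if len(parts) < 5 or parts[1] != "install":
            continue
        timestamp, _action, name, version, arch = parts[:5]
        previous = latest.get(name)
        # "YYYY-MM-DD HH:MM:SS" timestamps sort correctly as plain strings.
        if previous is None or timestamp >= previous[0]:
            latest[name] = (timestamp, version, arch)
    return latest


# A subset of the lines quoted in comment 21.
HISTORY = [
    "# 2022-02-02 20:37:28 apparmor-profiles.rpm installed ok",
    "2022-02-02 20:37:00|install|libapparmor1|2.13.6-3.8.1|x86_64",
    "2022-02-02 20:37:28|install|apparmor-profiles|2.13.6-3.8.1|noarch",
    "2022-02-03 11:19:39|install|libapparmor1|2.13.6-150300.3.11.1|x86_64",
    "2022-02-03 11:19:45|install|apparmor-profiles|2.13.6-150300.3.11.2|noarch",
]

for name, (timestamp, version, _arch) in sorted(latest_installs(HISTORY).items()):
    print(name, version, timestamp)
```

Comparing the result against the version named in the update request is a quick way to confirm whether a test image predates the fix.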
Comment 22 Noel Power 2022-02-04 09:08:07 UTC
Based on comment #21, I am assuming this is working and closing again; please reopen if necessary.