Bug 1225150 - [Build 6.71] MinimalVM combustion fails to unmount /sysroot/dev/shm in AMD workers
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Documentation
Version: unspecified
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Jana Halackova
QA Contact: Frank Sundermeyer
URL: https://openqa.suse.de/tests/14337468...
Whiteboard: https://jira.suse.com/browse/DOCTEAM-...
Keywords:
Depends on:
Blocks:
 
Reported: 2024-05-23 14:08 UTC by Pablo Herranz Ramírez
Modified: 2024-06-12 07:07 UTC
CC List: 3 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
Intel worker ok (110.60 KB, text/x-log)
2024-05-23 14:08 UTC, Pablo Herranz Ramírez
AMD worker NOK (65.03 KB, text/x-log)
2024-05-23 14:09 UTC, Pablo Herranz Ramírez

Description Pablo Herranz Ramírez 2024-05-23 14:08:59 UTC
Created attachment 875056 [details]
Intel worker ok

## Observation

openQA test in scenario sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-jeos-main-combustion@64bit-virtio-vga fails in
[image_checks](https://openqa.suse.de/tests/14337468/modules/image_checks/steps/2)

## Reproducible

Fails since (at least) Build [6.37](https://openqa.suse.de/tests/14104622)


## Expected result

Last good: [6.39](https://openqa.suse.de/tests/14094642) (or more recent)


## Further details

The test runs properly on Intel machines but fails on AMD ones. After analyzing both logs, I've come to the conclusion that something prevents the filesystem /sysroot/dev/shm from being unmounted, which causes the combustion script to fail and drops the system into emergency mode.

Attached are the `journalctl --no-pager` outputs from a failing (NOK) AMD worker and a passing (OK) Intel worker.
Comment 1 Pablo Herranz Ramírez 2024-05-23 14:09:34 UTC
Created attachment 875057 [details]
AMD worker NOK
Comment 2 Pablo Herranz Ramírez 2024-05-23 14:10:37 UTC
Might be similar to https://bugzilla.suse.com/show_bug.cgi?id=1222411
Comment 3 Fabian Vogt 2024-05-24 08:22:26 UTC
Please try with a "wait" command at the end of the combustion script.
Comment 4 Pablo Herranz Ramírez 2024-05-27 10:29:55 UTC
I've created a new combustion image with `sleep 5` at the end and the tests have started to pass. Which would be the proper way to proceed?

https://openqa.suse.de/tests/14442171#
Comment 5 Fabian Vogt 2024-05-27 10:46:29 UTC
(In reply to Pablo Herranz Ramírez from comment #4)
> I've created a new combustion image with `sleep 5` at the end and the tests
> have started to pass. Which would be the proper way to proceed?
> 
> https://openqa.suse.de/tests/14442171#

Have you tried with "wait"?
Comment 6 Pablo Herranz Ramírez 2024-05-27 12:20:23 UTC
I've tried `wait` alone but the test fails. Do I need to specify the PID of the job?

https://openqa.suse.de/tests/14454009#
Comment 7 Fabian Vogt 2024-05-27 12:56:16 UTC
(In reply to Pablo Herranz Ramírez from comment #6)
> I've tried `wait` alone but the test fails. Do I need to specify the PID of
> the job?
> 
> https://openqa.suse.de/tests/14454009#

Does it fail with the same error? "wait" without arguments should wait for all jobs.

I just realized that this can't really work by design, as tee only quits once the script has finished, but the script waits for tee to quit... That should result in a deadlock though, not failure.

What happens with "exec 1>&- 2>&-; wait"?
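
A minimal sketch of the mechanism described above, assuming the script sends its output through a background `tee` via process substitution (the temporary log file and the `run_script` name are hypothetical, used only for this demo). While the script still holds the write end of the pipe, `tee` never sees end-of-file, so a bare `wait` cannot return; closing stdout and stderr first lets `tee` exit, and `wait` then returns once it has flushed its output.

```shell
#!/bin/bash
# Hypothetical demo: "log" and "run_script" are illustrative names only.
log=$(mktemp)

run_script() (
    # Send stdout and stderr through a background tee (here appending to a
    # temporary file instead of a console device).
    exec > >(exec tee -a "$log") 2>&1

    echo "configuring system"

    # Close both outputs so tee sees end-of-file, then wait for it to exit.
    exec 1>&- 2>&-
    wait
)

run_script
```

The function body runs in a subshell so that closing the outputs does not affect the calling shell. Note that whether a bare `wait` also covers process substitutions depends on the bash version, so this is worth verifying on the target system.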
Comment 8 Pablo Herranz Ramírez 2024-05-27 13:04:10 UTC
Yes, that works fine :)

https://openqa.suse.de/tests/14454510#
Comment 9 Fabian Vogt 2024-05-27 13:09:08 UTC
(In reply to Pablo Herranz Ramírez from comment #8)
> Yes, that works fine :)
> 
> https://openqa.suse.de/tests/14454510#

Great, can you do that 3x to make sure it's not a random success?

If this works, I'll add it to the combustion README and we should probably mention it in the product documentation as well.
Comment 10 Pablo Herranz Ramírez 2024-05-27 14:15:23 UTC
5 of 10 jobs are failing, but this looks like a different failure. I'll continue investigating tomorrow:

```
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454945
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454946
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454947
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454948
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454949
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454950
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454951
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454952
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454953
1 job has been created:
 - sle-15-SP6-JeOS-for-kvm-and-xen-x86_64-Build6.73-jeos-main-combustion@64bit-virtio-vga -> https://openqa.suse.de/tests/14454954
```
Comment 12 Fabian Vogt 2024-05-28 10:44:17 UTC
(In reply to Pablo Herranz Ramírez from comment #11)
> I've restarted them 10 and now the tests are all green. Seems like some
> sporadic issue (ssh-ing to a s390x machine?!?!) was going on yesterday.

Maybe a conflict with VNC ports?

> The fix suggested by @fvogt works like a charm:
> 
> https://openqa.suse.de/tests/14461909
> https://openqa.suse.de/tests/14461910
> https://openqa.suse.de/tests/14461911
> https://openqa.suse.de/tests/14461912
> https://openqa.suse.de/tests/14461913
> https://openqa.suse.de/tests/14461914
> https://openqa.suse.de/tests/14461915
> https://openqa.suse.de/tests/14461916
> https://openqa.suse.de/tests/14461917
> https://openqa.suse.de/tests/14461918

Perfect!

Reassigning to documentation.

Can you please mention in the combustion sections that the script should ensure that all processes complete before it ends, like this:

# Close outputs and wait for tee to finish.
exec 1>&- 2>&-; wait;

IMO it's mostly a workaround until tukit handles this better, but it's good practice anyway, so it can be recommended in general.
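
One possible variation, not taken from this thread and offered only as a sketch: registering the close-and-wait step in an `EXIT` trap so that it also runs when the script exits early. The `provision` function, the temporary log file, and the simulated failure are all hypothetical.

```shell
#!/bin/bash
# Sketch only: "log" and "provision" are hypothetical names for this demo.
log=$(mktemp)

provision() (
    # Mirror all output through a background tee.
    exec > >(exec tee -a "$log") 2>&1

    # Run the close-and-wait step on every exit path, including early exits.
    trap 'exec 1>&- 2>&-; wait' EXIT

    echo "step 1 done"
    exit 3   # simulate a failing step
    echo "never reached"
)

status=0
provision || status=$?
echo "exit status: $status"
```

The trap does not call `exit` itself, so the status passed to `exit` in the script body is what the caller sees.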
Comment 13 Tomáš Bažant 2024-06-07 10:58:00 UTC
Thank you for reporting this bug!
It is being tracked and processed as part of our queue.
Comment 14 Jana Halackova 2024-06-12 07:07:12 UTC
Fixed by: https://github.com/SUSE/doc-modular/pull/343