Bug 1216381 - [Build ] openQA test fails in install_update - system hit emergency shell - RAID0 configuration with 4 disks (20GB each)
Summary: [Build ] openQA test fails in install_update - system hit emergency shell - ...
Status: IN_PROGRESS
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP5
Classification: openSUSE
Component: Other
Version: unspecified
Hardware: x86-64 Other
Priority: P5 - None
Severity: Major
Target Milestone: ---
Assignee: dracut maintainers
QA Contact:
URL: https://openqa.suse.de/tests/12556928...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-18 13:54 UTC by Santiago Zarate
Modified: 2024-06-17 13:46 UTC (History)
6 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---
thomas.blume: needinfo? (santiago.zarate)


Attachments
serial log from openQA (455.18 KB, text/plain)
2023-10-18 13:54 UTC, Santiago Zarate
Details
logs (205.59 KB, text/plain)
2023-11-03 09:42 UTC, Zaoliang Luo
Details
rd.udev.log_level=debug (1.91 MB, text/x-log)
2023-11-03 11:07 UTC, Santiago Zarate
Details

Description Santiago Zarate 2023-10-18 13:54:38 UTC
Created attachment 870303 [details]
serial log from openQA

## Observation

Looking at the logs:
uuid-079a5aeb:ec59f6b0:57804be4:3506c9b7" ]"
[  138.280124] localhost dracut-initqueue[446]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-id\x2fmd-uuid-8b417c40:baf040df:c5a0c4db:e7533591.sh: "[ -e "/dev/disk/by-id/md-uuid-8b417c40:baf040df:c5a0c4db:e7533591" ]"
[  138.280816] localhost dracut-initqueue[446]: Warning: /lib/dracut/hooks/initqueue/finished/devexists-\x2fdev\x2fdisk\x2fby-uuid\x2f1c885c02-2dd4-4e8c-b030-f8820de87767.sh: "if ! grep -q After=remote-fs-pre.target /run/systemd/generator/systemd-cryptsetup@*.service 2>/dev/null; then
[  138.280816] localhost dracut-initqueue[446]:     [ -e "/dev/disk/by-uuid/1c885c02-2dd4-4e8c-b030-f8820de87767" ]
[  138.280816] localhost dracut-initqueue[446]: fi"
[  138.281681] localhost dracut-initqueue[446]: Warning: dracut-initqueue: starting timeout scripts
[  138.325502] localhost dracut-initqueue[2138]: mdadm: failed to start array /dev/md1: Invalid argument
[  138.323357] localhost kernel: md/raid0:md1: too few disks (3 of 4) - aborting!
[  138.323382] localhost kernel: md: pers->run() failed ...
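The kernel message above is the key signature: a RAID0 array refuses to start unless every member is present. When triaging serial logs for this class of failure, the signature can be matched mechanically; a minimal sketch (a hypothetical helper, not part of openQA or mdadm):

```python
import re

# Matches kernel lines like:
#   md/raid0:md1: too few disks (3 of 4) - aborting!
MD_TOO_FEW = re.compile(
    r"md/raid\d+:(?P<array>md\d+): too few disks \((?P<found>\d+) of (?P<want>\d+)\)"
)

def find_failed_arrays(log_text):
    """Return (array, found, wanted) tuples for every aborted md assembly."""
    return [
        (m.group("array"), int(m.group("found")), int(m.group("want")))
        for m in MD_TOO_FEW.finditer(log_text)
    ]

sample = (
    "[  138.323357] localhost kernel: md/raid0:md1: too few disks (3 of 4) - aborting!\n"
    "[  138.323382] localhost kernel: md: pers->run() failed ...\n"
)
print(find_failed_arrays(sample))  # [('md1', 3, 4)]
```

Note the asymmetry this exposes: RAID1 (md0 here) degrades gracefully with a missing member, while RAID0 has no redundancy and aborts outright, which is why only md1 fails.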


openQA test in scenario sle-15-SP5-Server-DVD-Incidents-Minimal-x86_64-qam-minimal-RAID1:investigate:retry@64bit-smp fails in
[install_update](https://openqa.suse.de/tests/12556928/modules/install_update/steps/38)

## Test suite description
Testsuite maintained at https://gitlab.suse.de/qa-maintenance/qam-openqa-yml. Installation of RAID1 using expert partitioner
minimal = base pattern; minimal (enhanced base) pattern adds convenience packages


## Reproducible

Fails since (at least) Build [](https://openqa.suse.de/tests/12556928) (current job)


## Expected result

Last good: [](https://openqa.suse.de/tests/12467201) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Server-DVD-Incidents-Minimal&machine=64bit-smp&test=qam-minimal-RAID1%3Ainvestigate%3Aretry&version=15-SP5)
Comment 1 Santiago Zarate 2023-10-18 13:58:14 UTC
Changing severity because this seems RAID related.
Comment 2 Antonio Feijoo 2023-10-20 12:50:51 UTC
While I'm struggling to clone this job to our local openQA server due to some DNS issues (sigh), I see that the first (https://openqa.suse.de/tests/12467201) and the third/last (https://openqa.suse.de/tests/12563612) time this job was run it didn't fail, and apparently the RAID failure could be related to hardware problems (i.e. some error when the virtual disks were created).

Could you rerun this test to check if it consistently fails again? Thanks!
Comment 3 Zaoliang Luo 2023-10-25 14:34:16 UTC
(In reply to Antonio Feijoo from comment #2)
> While I'm struggling to clone this job to our local openQA server due to
> some DNS issues (sigh), I see that the first
> (https://openqa.suse.de/tests/12467201) and the third/last
> (https://openqa.suse.de/tests/12563612) time this job was run it didn't
> fail, and apparently the RAID failure could be related to hardware problems
> (i.e. some error when the virtual disks were created).
> 
> Could you rerun this test to check if it consistently fails again? Thanks!

I have been looking into this issue for quite a long time:

http://10.168.192.143/tests/208#step/update_minimal/86
It hits the emergency shell at a different place.

The only difference from other successful test runs: there is a warning that /dev/disk/by-id/md-uuid-bad-** does not exist.

We have seen this issue before, but it is sporadic:

https://openqa.suse.de/tests/12466822#step/update_minimal/85
Comment 4 Santiago Zarate 2023-10-26 07:29:30 UTC
(In reply to Antonio Feijoo from comment #2)
> While I'm struggling to clone this job to our local openQA server due to
> some DNS issues (sigh), I see that the first
> (https://openqa.suse.de/tests/12467201) and the third/last
> (https://openqa.suse.de/tests/12563612) time this job was run it didn't
> fail, and apparently the RAID failure could be related to hardware problems
> (i.e. some error when the virtual disks were created).
> 
> Could you rerun this test to check if it consistently fails again? Thanks!

Antonio, given the sporadic nature of this, what logs should we provide so that this report is more helpful the next time the failure shows up, and our reports have most of the information from the get-go?
Comment 5 Thomas Blume 2023-10-26 08:31:48 UTC
Looking at: https://openqa.suse.de/tests/12556928/logfile?filename=serial0.txt, I can see this:

-->
Oct 17 18:46:47.130528 localhost (udev-worker)[440]: vdd3: Failed to add device '/dev/vdd3' to watch: Operation not permitted
Oct 17 18:46:47.130894 localhost (udev-worker)[445]: vdb3: Process '/sbin/mdadm -I /dev/vdb3' failed with exit code 1.
[...]
Oct 17 18:49:03.145536 localhost kernel: md/raid0:md1: too few disks (3 of 4) - aborting!
Oct 17 18:49:03.145568 localhost kernel: md: pers->run() failed ...
--<


So, it seems that some devices are not available when mdadm wants to assemble them.
Can you see that in the other failing tests too?
Comment 6 Antonio Feijoo 2023-10-26 08:47:55 UTC
(In reply to Thomas Blume from comment #5)
> Looking at:
> https://openqa.suse.de/tests/12556928/logfile?filename=serial0.txt, I can
> see this:
> 
> -->
> Oct 17 18:46:47.130528 localhost (udev-worker)[440]: vdd3: Failed to add
> device '/dev/vdd3' to watch: Operation not permitted
> Oct 17 18:46:47.130894 localhost (udev-worker)[445]: vdb3: Process
> '/sbin/mdadm -I /dev/vdb3' failed with exit code 1.
> [...]
> Oct 17 18:49:03.145536 localhost kernel: md/raid0:md1: too few disks (3 of
> 4) - aborting!
> Oct 17 18:49:03.145568 localhost kernel: md: pers->run() failed ...
> --<
> 
> 
> So, it seems that some devices are not available when mdadm wants to
> assemble them.

Thanks Thomas. It doesn't seem that dracut is the culprit, but since this is a RAID-related issue, I would also like to ask for Neil's expert opinion.
Comment 7 Hannes Reinecke 2023-10-26 09:12:03 UTC
This is curious:

+ cat /proc/mdstat
Personalities : [raid1] [raid0] 
md1 : inactive vdd3[3] vdc3[2] vda3[0]
      307008 blocks super 1.0
       
md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
      8191936 blocks super 1.0 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>

It'll be interesting to figure out why 'md1' isn't activated properly; what happened to 'vdb3' ? It sure should be part of the 'md1', right?
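The /proc/mdstat excerpt above can be inspected mechanically to answer exactly this question, i.e. which arrays are stuck inactive and which members they actually picked up. A sketch of such a check (illustrative only; real /proc/mdstat output has more variants than this handles):

```python
def parse_mdstat(text):
    """Map array name -> (state, member devices) from /proc/mdstat-style text."""
    arrays = {}
    for line in text.splitlines():
        # Array lines look like "md1 : inactive vdd3[3] vdc3[2] vda3[0]";
        # skip the Personalities header and the trailing "unused devices" line.
        if " : " in line and not line.startswith(("Personalities", "unused")):
            name, rest = line.split(" : ", 1)
            fields = rest.split()
            state = fields[0]  # "active" or "inactive"
            # Member entries carry a role index in brackets, e.g. "vdd3[3]".
            members = [f.split("[")[0] for f in fields if "[" in f]
            arrays[name.strip()] = (state, members)
    return arrays

mdstat = """\
Personalities : [raid1] [raid0]
md1 : inactive vdd3[3] vdc3[2] vda3[0]
      307008 blocks super 1.0

md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
      8191936 blocks super 1.0 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
"""

info = parse_mdstat(mdstat)
print(info["md1"])  # ('inactive', ['vdd3', 'vdc3', 'vda3']) -- vdb3 is missing
```

Running this over the excerpt confirms that md1 assembled only three of its four members: vdb3 never joined, even though its sibling vdb2 made it into md0.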
Comment 8 Hannes Reinecke 2023-10-26 09:16:21 UTC
Oct 17 18:46:47.130528 localhost (udev-worker)[440]: vdd3: Failed to add device '/dev/vdd3' to watch: Operation not permitted
Oct 17 18:46:47.130894 localhost (udev-worker)[445]: vdb3: Process '/sbin/mdadm -I /dev/vdb3' failed with exit code 1.

These two operations are happening basically at the same time, so might it be possible that we're hitting an internal race condition in the 'md' driver?
I.e., that these two operations stomp on each other?
Comment 9 Hannes Reinecke 2023-10-26 09:27:17 UTC
So it would be good to know whether you can execute 'mdadm -I /dev/vdb3' manually once you are in this situation.
Comment 10 Thomas Blume 2023-10-26 09:35:51 UTC
(In reply to Hannes Reinecke from comment #7)
> This is curious:
> 
> + cat /proc/mdstat
> Personalities : [raid1] [raid0] 
> md1 : inactive vdd3[3] vdc3[2] vda3[0]
>       307008 blocks super 1.0
>        
> md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
>       8191936 blocks super 1.0 [4/4] [UUUU]
>       bitmap: 0/1 pages [0KB], 65536KB chunk
> 
> unused devices: <none>
> 
> It'll be interesting to figure out why 'md1' isn't activated properly; what
> happened to 'vdb3' ? It sure should be part of the 'md1', right?

Indeed, and it is pretty confusing that a udev worker displays an error for vdd3, but vdb3 fails to assemble.
Santiago, can you please add:

debug rd.udev.log_level=debug

to the kernel command line and provide /run/initramfs/rdsosdebug.txt from the test vm?
Comment 11 Zaoliang Luo 2023-10-30 16:34:32 UTC
(In reply to Thomas Blume from comment #10)
> (In reply to Hannes Reinecke from comment #7)
> > This is curious:
> > 
> > + cat /proc/mdstat
> > Personalities : [raid1] [raid0] 
> > md1 : inactive vdd3[3] vdc3[2] vda3[0]
> >       307008 blocks super 1.0
> >        
> > md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
> >       8191936 blocks super 1.0 [4/4] [UUUU]
> >       bitmap: 0/1 pages [0KB], 65536KB chunk
> > 
> > unused devices: <none>
> > 
> > It'll be interesting to figure out why 'md1' isn't activated properly; what
> > happened to 'vdb3' ? It sure should be part of the 'md1', right?
> 
> Indeed and it is pretty confusing that an udev worker displays an error for
> vdd3, but vdb3 fails to assemble.
> Santiago, can you please add:
> 
> debug rd.udev.log_level=debug
> 
> to the kernel command line and provide /run/initramfs/rdsosdebug.txt from
> the test vm?

I'll try to get logs for you if I can reproduce the issue again.
Comment 12 Zaoliang Luo 2023-11-03 09:42:28 UTC
https://openqa.suse.de/tests/12725295#step/install_update/36

is this what you need?
Comment 13 Zaoliang Luo 2023-11-03 09:42:42 UTC
Created attachment 870620 [details]
logs
Comment 14 Santiago Zarate 2023-11-03 11:07:42 UTC
Created attachment 870627 [details]
rd.udev.log_level=debug

Zaoliang, I'm attaching the log from your openQA instance, http://10.168.192.143/tests/371#step/install_update/139, which booted with the right parameters.

Thomas, can you give it a look?
Comment 15 Santiago Zarate 2023-11-03 11:09:34 UTC
(In reply to Zaoliang Luo from comment #12)
> https://openqa.suse.de/tests/12725295#step/install_update/36
> 
> is this what you need?

for some reason that job didn't have the extrabootparams, but https://openqa.suse.de/tests/12736507#step/install_update/139 does :) thanks again!
Comment 16 Zaoliang Luo 2023-11-03 11:10:43 UTC
(In reply to Santiago Zarate from comment #15)
> (In reply to Zaoliang Luo from comment #12)
> > https://openqa.suse.de/tests/12725295#step/install_update/36
> > 
> > is this what you need?
> 
> for some reason that job didn't have the extrabootparams, but
> https://openqa.suse.de/tests/12736507#step/install_update/139 does :) thanks
> again!

Okay, after so many retries ;) yes, thanks!
Comment 17 Hannes Reinecke 2023-11-03 13:17:40 UTC
Oct 30 17:58:25.347918 localhost kernel: block device autoloading is deprecated and will be removed.
Oct 30 17:58:25.358508 localhost (udev-worker)[433]: vdc2: Failed to add device '/dev/vdc2' to watch: Operation not permitted
Oct 30 17:58:25.358792 localhost (udev-worker)[438]: vda2: Process '/sbin/mdadm -I /dev/vda2' failed with exit code 2.
Oct 30 17:58:25.358823 localhost (udev-worker)[438]: vda2: Failed to add device '/dev/vda2' to watch: Operation not permitted
Oct 30 17:58:25.359226 localhost (udev-worker)[434]: vda3: Process '/sbin/mdadm -I /dev/vda3' failed with exit code 1.
Oct 30 17:58:25.359846 localhost kernel: md: could not open device unknown-block(253,2).
Oct 30 17:58:25.359864 localhost kernel: md: md_import_device returned -1
Comment 23 Thomas Blume 2024-06-17 12:20:52 UTC
It seems the issue now also appears for customers; see bug#1225064.
Comment 24 Thomas Blume 2024-06-17 12:59:01 UTC
(In reply to Hannes Reinecke from comment #17)
> Oct 30 17:58:25.358823 localhost (udev-worker)[438]: vda2: Failed to add
> device '/dev/vda2' to watch: Operation not permitted

Maybe there is an upstream fix:

https://github.com/systemd/systemd/issues/24668

I'm checking, and will work on a test package if appropriate.
Comment 25 Thomas Blume 2024-06-17 13:46:52 UTC
(In reply to Thomas Blume from comment #24)
> (In reply to Hannes Reinecke from comment #17)
> > Oct 30 17:58:25.358823 localhost (udev-worker)[438]: vda2: Failed to add
> > device '/dev/vda2' to watch: Operation not permitted
> 
> Maybe there is an upstream fix:
> 
> https://github.com/systemd/systemd/issues/24668
> 
> Checking and working in a testpackage if appropriate.

Santiago, the test package is here:

https://download.suse.de/ibs/home:/tsaupe:/branches:/SUSE:/SLE-15-SP5:/Update:/systemd-bsc1216381/standard/

Could you give it a try on the openQA machines?