Bugzilla – Full Text Bug Listing
| Summary: | [Build ] openQA test fails in install_update - system hit emergency shell - RAID0 configuration with 4 disks (20GB each) | | |
|---|---|---|---|
| Product: | [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP5 | Reporter: | Santiago Zarate <santiago.zarate> |
| Component: | Other | Assignee: | dracut maintainers <dracut-maintainers> |
| Status: | IN_PROGRESS | QA Contact: | |
| Severity: | Major | | |
| Priority: | P5 - None | CC: | antonio.feijoo, hare, nfbrown, santiago.zarate, thomas.blume, zluo |
| Version: | unspecified | Flags: | thomas.blume: needinfo? (santiago.zarate) |
| Target Milestone: | --- | | |
| Hardware: | x86-64 | | |
| OS: | Other | | |
| URL: | https://openqa.suse.de/tests/12556928/modules/install_update/steps/38 | | |
| See Also: | https://bugzilla.suse.com/show_bug.cgi?id=1210443, https://bugzilla.suse.com/show_bug.cgi?id=1219073, https://bugzilla.suse.com/show_bug.cgi?id=1225064 | | |
| Whiteboard: | | | |
| Found By: | openQA | Services Priority: | |
| Business Priority: | | Blocker: | Yes |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | serial log from openQA; logs; rd.udev.log_level=debug | | |
Description — Santiago Zarate, 2023-10-18 13:54:38 UTC
Changing severity because this seems RAID related.

Comment 2 (Antonio Feijoo):

While I'm struggling to clone this job to our local openQA server due to some DNS issues (sigh), I see that the first (https://openqa.suse.de/tests/12467201) and the third/last (https://openqa.suse.de/tests/12563612) runs of this job didn't fail, so the RAID failure could be related to hardware problems (i.e. some error when the virtual disks were created). Could you rerun this test to check if it consistently fails again? Thanks!

(In reply to Antonio Feijoo from comment #2)
> Could you rerun this test to check if it consistently fails again?

I checked this issue for quite a long time: http://10.168.192.143/tests/208#step/update_minimal/86 hits the emergency shell at a different place. The only difference from the other, successful test runs is a warning that /dev/disk/by-id/md-uuid-bad-** does not exist. We had this issue before, but it is sporadic: https://openqa.suse.de/tests/12466822#step/update_minimal/85

(In reply to Antonio Feijoo from comment #2)
> Could you rerun this test to check if it consistently fails again?

Antonio, at the moment, if this is failing, what logs could we provide so that this report becomes more helpful for you the next time the issue shows up, so our reports have most of the information from the get-go, given its sporadic nature?

Comment 5 (Thomas Blume):

Looking at https://openqa.suse.de/tests/12556928/logfile?filename=serial0.txt, I can see this:

-->
Oct 17 18:46:47.130528 localhost (udev-worker)[440]: vdd3: Failed to add device '/dev/vdd3' to watch: Operation not permitted
Oct 17 18:46:47.130894 localhost (udev-worker)[445]: vdb3: Process '/sbin/mdadm -I /dev/vdb3' failed with exit code 1.
[...]
Oct 17 18:49:03.145536 localhost kernel: md/raid0:md1: too few disks (3 of 4) - aborting!
Oct 17 18:49:03.145568 localhost kernel: md: pers->run() failed ...
--<

So it seems that some devices are not available when mdadm wants to assemble them. Can you see that in the other failing tests too?

(In reply to Thomas Blume from comment #5)
> So it seems that some devices are not available when mdadm wants to
> assemble them.

Thanks Thomas. It doesn't seem that dracut is the culprit, but since this is a RAID related issue, I would also ask for Neil's expert opinion.

Comment 7 (Hannes Reinecke):

This is curious:
+ cat /proc/mdstat
Personalities : [raid1] [raid0]
md1 : inactive vdd3[3] vdc3[2] vda3[0]
307008 blocks super 1.0
md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
8191936 blocks super 1.0 [4/4] [UUUU]
bitmap: 0/1 pages [0KB], 65536KB chunk
unused devices: <none>
It'll be interesting to figure out why 'md1' isn't activated properly; what happened to 'vdb3'? It sure should be part of 'md1', right?
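The snapshot above already shows the core of the problem: md1 lists only three of its four members (vdb3 is missing), and the kernel log confirms a raid0 array aborts with "too few disks". As an editorial aside, this check can be done mechanically; the following sketch (ours, not part of the bug's workflow) flags inactive arrays in an mdstat dump, using the exact output quoted above as its sample input:

```shell
# Flag inactive md arrays and list their members. The heredoc reuses the
# /proc/mdstat snapshot from this bug; on a live system, replace it with
# `cat /proc/mdstat`.
cat <<'EOF' |
Personalities : [raid1] [raid0]
md1 : inactive vdd3[3] vdc3[2] vda3[0]
      307008 blocks super 1.0

md0 : active raid1 vdd2[3] vda2[0] vdc2[2] vdb2[1]
      8191936 blocks super 1.0 [4/4] [UUUU]
      bitmap: 0/1 pages [0KB], 65536KB chunk

unused devices: <none>
EOF
awk '/^md[0-9]+ :/ && $3 == "inactive" {
    printf "%s is inactive; members:", $1
    for (i = 4; i <= NF; i++) printf " %s", $i
    print ""
}'
# prints: md1 is inactive; members: vdd3[3] vdc3[2] vda3[0]
```

The missing vdb3 in that member list is exactly the device whose `mdadm -I` invocation failed in the udev log.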
Oct 17 18:46:47.130528 localhost (udev-worker)[440]: vdd3: Failed to add device '/dev/vdd3' to watch: Operation not permitted
Oct 17 18:46:47.130894 localhost (udev-worker)[445]: vdb3: Process '/sbin/mdadm -I /dev/vdb3' failed with exit code 1.

These two operations are happening basically at the same time, so might it be possible that we're hitting an internal race condition in the 'md' driver, i.e. that these two operations stomp on each other? It would be good to know whether you can execute 'mdadm -I /dev/vdb3' manually once you are in this situation.

Comment 10 (Thomas Blume):

(In reply to Hannes Reinecke from comment #7)
> It'll be interesting to figure out why 'md1' isn't activated properly; what
> happened to 'vdb3'? It sure should be part of 'md1', right?

Indeed, and it is pretty confusing that a udev worker displays an error for vdd3, but vdb3 fails to assemble. Santiago, can you please add:

debug rd.udev.log_level=debug

to the kernel command line and provide /run/initramfs/rdsosdebug.txt from the test VM?

(In reply to Thomas Blume from comment #10)
> Santiago, can you please add: debug rd.udev.log_level=debug to the kernel
> command line and provide /run/initramfs/rdsosdebug.txt from the test VM?

I'll try to get logs for you if I can reproduce the issue again.

Comment 12 (Zaoliang Luo):

https://openqa.suse.de/tests/12725295#step/install_update/36

Is this what you need?

Created attachment 870620 [details]
logs
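Hannes' suggestion above to re-run 'mdadm -I /dev/vdb3' by hand from the emergency shell could be sketched as a small helper. This is our own illustration, not a procedure from the bug: the function name retry_assembly is invented, the device names come from this bug's logs, and the script only acts if the member device actually exists:

```shell
# Sketch: from the emergency shell, retry the incremental assembly that
# udev's `mdadm -I` rule failed to do, then try to start the array.
# retry_assembly is a hypothetical helper name; /dev/vdb3 and /dev/md1
# are the devices from this bug and will differ elsewhere.
retry_assembly() {
    dev=$1 md=$2
    if [ ! -b "$dev" ]; then
        # Nothing to do outside the affected VM.
        echo "skip: $dev is not a block device here"
        return 0
    fi
    mdadm -I "$dev"      # re-attempt incremental assembly of this member
    mdadm --run "$md"    # start the array once all members are present
    cat /proc/mdstat     # verify md1 switched from inactive to active
}

retry_assembly /dev/vdb3 /dev/md1
```

If the manual `mdadm -I` succeeds where the udev-triggered one failed, that would support the race-condition theory from comment #7.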
Created attachment 870627 [details]
rd.udev.log_level=debug

Zaoliang, I'm attaching the log from your openQA instance (http://10.168.192.143/tests/371#step/install_update/139) that booted with the right parameters. Thomas, can you give it a look?

Comment 15 (Santiago Zarate):

(In reply to Zaoliang Luo from comment #12)
> https://openqa.suse.de/tests/12725295#step/install_update/36
> Is this what you need?

For some reason that job didn't have the extrabootparams, but https://openqa.suse.de/tests/12736507#step/install_update/139 does :) thanks again!

(In reply to Santiago Zarate from comment #15)
> For some reason that job didn't have the extrabootparams

Okay, after so many re-tries ;) yes, thanks!

Comment 17 (Hannes Reinecke):

Oct 30 17:58:25.347918 localhost kernel: block device autoloading is deprecated and will be removed.
Oct 30 17:58:25.358508 localhost (udev-worker)[433]: vdc2: Failed to add device '/dev/vdc2' to watch: Operation not permitted
Oct 30 17:58:25.358792 localhost (udev-worker)[438]: vda2: Process '/sbin/mdadm -I /dev/vda2' failed with exit code 2.
Oct 30 17:58:25.358823 localhost (udev-worker)[438]: vda2: Failed to add device '/dev/vda2' to watch: Operation not permitted
Oct 30 17:58:25.359226 localhost (udev-worker)[434]: vda3: Process '/sbin/mdadm -I /dev/vda3' failed with exit code 1.
Oct 30 17:58:25.359846 localhost kernel: md: could not open device unknown-block(253,2).
Oct 30 17:58:25.359864 localhost kernel: md: md_import_device returned -1

It seems the issue now also appears for customers, see bug#1225064.

Comment 24 (Thomas Blume):

(In reply to Hannes Reinecke from comment #17)
> Oct 30 17:58:25.358823 localhost (udev-worker)[438]: vda2: Failed to add
> device '/dev/vda2' to watch: Operation not permitted

Maybe there is an upstream fix: https://github.com/systemd/systemd/issues/24668. Checking and working on a test package if appropriate.

(In reply to Thomas Blume from comment #24)
> Checking and working on a test package if appropriate.

Santiago, the test package is here: https://download.suse.de/ibs/home:/tsaupe:/branches:/SUSE:/SLE-15-SP5:/Update:/systemd-bsc1216381/standard/ Could you give it a try on the openQA machines?
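When triaging serial logs like the ones in comment #17, the members whose incremental assembly failed can be pulled out mechanically. The following sketch is ours, not part of the bug workflow; the sample lines are copied from comment #17, and on a real run the heredoc would be replaced by the saved serial0.txt:

```shell
# Extract "mdadm -I <device> failed with exit code N" events from a
# serial log. Sample lines are from comment #17 of this bug.
cat <<'EOF' |
Oct 30 17:58:25.358792 localhost (udev-worker)[438]: vda2: Process '/sbin/mdadm -I /dev/vda2' failed with exit code 2.
Oct 30 17:58:25.359226 localhost (udev-worker)[434]: vda3: Process '/sbin/mdadm -I /dev/vda3' failed with exit code 1.
EOF
sed -n "s/.*'\/sbin\/mdadm -I \(\/dev\/[a-z0-9]*\)' failed with exit code \([0-9]*\).*/\1 exit=\2/p"
# prints:
# /dev/vda2 exit=2
# /dev/vda3 exit=1
```

Comparing these failing members across the Oct 17 and Oct 30 runs (vdb3 in one, vda2/vda3 in the other) would show whether the failure always hits a different device, which is what a timing race would suggest.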