|
Bugzilla – Full Text Bug Listing |
| Summary: | [Build 20231109] openQA test fails in networkd_dhcp | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Dominique Leuenberger <dimstar> |
| Component: | Basesystem | Assignee: | systemd maintainers <systemd-maintainers> |
| Status: | NEW --- | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | dheidler, dimstar, fvogt, petr.vorel, santiago.zarate, thomas.blume |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| URL: | https://openqa.opensuse.org/tests/3714912/modules/networkd_dhcp/steps/11 | ||
| See Also: |
https://bugzilla.suse.com/show_bug.cgi?id=1220119 https://bugzilla.opensuse.org/show_bug.cgi?id=1225930 |
||
| Whiteboard: | |||
| Found By: | openQA | Services Priority: | |
| Business Priority: | Blocker: | Yes | |
| Marketing QA Status: | --- | IT Deployment: | --- |
|
Description
Dominique Leuenberger
2023-11-10 13:12:49 UTC
Current findings. The command: "networkctl status" to nspawn container node1 hangs: https://openqa.opensuse.org/tests/3738427#step/networkd_dhcp/15 Tried on my private openQA setup, but couldn't reproduce the issue there (will retry on another machine). The ps log on the container host shows: --> 1147 ? Ss 0:00 login -- root 1161 tty6 Ss 0:00 \_ -bash 3171 tty6 S+ 0:00 \_ systemd-run -tM node1 /bin/bash -c networkctl status --< inside the container, the status command returns correctly: --> Nov 18 01:45:51.308612 node1 systemd[1]: Started /bin/bash -c networkctl status. Nov 18 01:45:51.458301 node1 systemd[1]: run-u16.service: Deactivated successfully. --< So, apparently the feedback from node1 to the container host is hanging. (In reply to Thomas Blume from comment #1) > So, apparently the feedback from node1 to the container host is hanging. Looks like there is a time keeping issue. dmesg on the host shows: --> [ 303.093179] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: [ 303.094160] clocksource: 'kvm-clock' wd_nsec: 480016638 wd_now: 47a9fb099a wd_last: 478d5e909c mask: ffffffffffffffff [ 303.095819] clocksource: 'tsc' cs_nsec: 41654552965 cs_now: b3c0bc3362 cs_last: 9e6a91c09e mask: ffffffffffffffff [ 303.097036] clocksource: Clocksource 'tsc' skewed 41174536327 ns (41174 ms) over watchdog 'kvm-clock' interval of 480016638 ns (480 ms) [ 303.098189] clocksource: 'kvm-clock' (not 'tsc') is current clocksource. [ 303.099101] tsc: Marking TSC unstable due to clocksource watchdog --< I can't see that on my private machine, where the test works. Also on my second machine, the test runs successfull. Dominique, is this openQA instance running in a VM? If so, that might be the difference, because mine are in containers on bare metal. If in a VM, could you try using a different timekeeping for the VM? (In reply to Thomas Blume from comment #3) > Also on my second machine, the test runs successfull. > Dominique, is this openQA instance running in a VM? > If so, that might be the difference, because mine are in containers on bare > metal. > If in a VM, could you try using a different timekeeping for the VM? basically, all openQA tests run in a VM (qemu/kvm with VNC output) Let's see if Santiago (as a representative of openQA) has an idea about the clock source (In reply to Dominique Leuenberger from comment #4) > (In reply to Thomas Blume from comment #3) > > Also on my second machine, the test runs successfull. > > Dominique, is this openQA instance running in a VM? > > If so, that might be the difference, because mine are in containers on bare > > metal. > > If in a VM, could you try using a different timekeeping for the VM? > > basically, all openQA tests run in a VM (qemu/kvm with VNC output) > > Let's see if Santiago (as a representative of openQA) has an idea about the > clock source I think I have an idea, IIRC... let me check in the meantime couple of things and try to set the clock/timekeeping and see if I can reproduce it by playing with the RTC parameter in qemu it seems that the test might be hanging somewhere... lets see how it goes: https://openqa.opensuse.org/tests/3751459#live Dimstar, can you show me or do you know the SR that introduced the latest change? Unfortunately, I cannot reproduce the issue, even in a VM. The test just runs fine. I don't see the error mentioned in comment#2 either: --> localhost:~ # dmesg | grep clocksource [ 0.000003] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns [ 0.025394] clocksource: refined-jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645519600211568 ns [ 0.090525] clocksource: tsc-early: mask: 0xffffffffffffffff max_cycles: 0x1e33052aadd, max_idle_ns: 440795310221 ns [ 0.119165] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns [ 0.770899] clocksource: Switched to clocksource kvm-clock [ 0.788281] clocksource: acpi_pm: mask: 0xffffff max_cycles: 0xffffff, max_idle_ns: 2085701024 ns [ 0.890714] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x1e33052aadd, max_idle_ns: 440795310221 ns --< (In reply to Santiago Zarate from comment #9) > it seems that the test might be hanging somewhere... lets see how it goes: > https://openqa.opensuse.org/tests/3751459#live > > Dimstar, can you show me or do you know the SR that introduced the latest > change? Santiago, it seems the test still hangs. On the other hand I cannot reproduce the issue on my test setup. Hence, I assume a time keeping issue in the test VM (see comment#2 and comment#10). Reassigning to you for now, could you try to rerun the test on a VM where the time keeping issue doesn't appear? If the issue still happens then, please assign back. FYI I also noticed some "Marking clocksource 'tsc'" issues triggered on x86_64 on my private openQA worker by ltp_net_tcp_cmds, triggered for 2 tests: arp and host: OpenQA::run_ltp.pm: Starting ipneigh01_arp [ 410.818828][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: [ 410.820365][ C0] clocksource: 'kvm-clock' wd_nsec: 506724586 wd_now: 610448443a wd_last: 60e6144350 mask: ffffffffffffffff [ 410.821824][ C0] clocksource: 'tsc' cs_nsec: 49302930542 cs_now: 17c1364ba0a cs_last: 153fdc32526 mask: ffffffffffffffff [ 410.823293][ C0] clocksource: Clocksource 'tsc' skewed 48796205956 ns (48796 ms) over watchdog 'kvm-clock' interval of 506724586 ns (506 ms) [ 410.824843][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource. [ 410.825707][ C0] tsc: Marking TSC unstable due to clocksource watchdog [ 411.109710][T28135] Initializing XFRM netlink socket OpenQA::run_ltp.pm: Starting ipneigh01_ip OpenQA::run_ltp.pm: Starting arping01 OpenQA::run_ltp.pm: Starting clockdiff01 OpenQA::run_ltp.pm: Starting ftp OpenQA::run_ltp.pm: Starting host [ 410.818845][ C0] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large: [ 410.820687][ C0] clocksource: 'kvm-clock' wd_nsec: 506741379 wd_now: 61044885d3 wd_last: 60e6144350 mask: ffffffffffffffff [ 410.821794][ C0] clocksource: 'tsc' cs_nsec: 804834172 cs_now: 154a546a082 cs_last: 153fdc32526 mask: ffffffffffffffff [ 410.822919][ C0] clocksource: Clocksource 'tsc' skewed 298092793 ns (298 ms) over watchdog 'kvm-clock' interval of 506741379 ns (506 ms) [ 410.824088][ C0] clocksource: 'kvm-clock' (not 'tsc') is current clocksource. [ 410.824768][ C0] tsc: Marking TSC unstable due to clocksource watchdog [ 411.126759][T28138] Initializing XFRM netlink socket I wondered, whether this is somehow related to the way openQA run qemu: /usr/bin/qemu-system-x86_64 -device VGA,edid=on,xres=1024,yres=768 -only-migratable -chardev ringbuf,id=serial0,logfile=serial0,logappend=on -serial chardev:serial0 -audiodev none,id=snd0 -device intel-hda -device hda-output,audiodev=snd0 -global isa-fdc.fdtypeA=none -m 1536 -cpu host -netdev user,id=qanet0 -device virtio-net,netdev=qanet0,mac=52:54:00:12:34:56 -object rng-random,filename=/dev/urandom,id=rng0 -device virtio-rng-pci,rng=rng0 -boot order=c -device qemu-xhci -device usb-tablet -smp 1 -enable-kvm -no-shutdown -vnc :96,share=force-shared -device virtio-serial -chardev pipe,id=virtio_console,path=virtio_console,logfile=virtio_console.log,logappend=on -device virtconsole,chardev=virtio_console,name=org.openqa.console.virtio_console -chardev pipe,id=virtio_console_user,path=virtio_console_user,logfile=virtio_console_user.log,logappend=on -device virtconsole,chardev=virtio_console_user,name=org.openqa.console.virtio_console_user -chardev socket,path=qmp_socket,server=on,wait=off,id=qmp_socket,logfile=qmp_socket.log,logappend=on -qmp chardev:qmp_socket -S -device virtio-scsi-pci,id=scsi0 -blockdev driver=file,node-name=hd0-overlay0-file,filename=/var/lib/openqa/pool/6/raid/hd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=hd0-overlay0,file=hd0-overlay0-file,cache.no-flush=on,discard=unmap -device virtio-blk,id=hd0-device,drive=hd0-overlay0,bootindex=0,serial=hd0 -blockdev driver=file,node-name=cd0-overlay0-file,filename=/var/lib/openqa/pool/6/raid/cd0-overlay0,cache.no-flush=on -blockdev driver=qcow2,node-name=cd0-overlay0,file=cd0-overlay0-file,cache.no-flush=on,discard=unmap -device scsi-cd,id=cd0-device,drive=cd0-overlay0,serial=cd0 Also, originally I noted this on ltp_net_tcp_cmds running on ppc64le on o3, where host test keeps failing (but arp test is working), I don't see it on x86_64 on o3. Apologies, this completely skipped my radar; let me read a bit and come back to this bug (In reply to Thomas Blume from comment #11) > (In reply to Santiago Zarate from comment #9) > > it seems that the test might be hanging somewhere... lets see how it goes: > > https://openqa.opensuse.org/tests/3751459#live > > > > Dimstar, can you show me or do you know the SR that introduced the latest > > change? > > Santiago, it seems the test still hangs. > On the other hand I cannot reproduce the issue on my test setup. > > Hence, I assume a time keeping issue in the test VM (see comment#2 and > comment#10). > Reassigning to you for now, could you try to rerun the test on a VM where > the time keeping issue doesn't appear? > If the issue still happens then, please assign back. The timekeeping warnings happen when the VM is paused and unpaused, which happens when creating and restoring snapshots. I had a look as well. Meanwhile it's the networkd_init module that fails: https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 systemd-run -tM node1 /bin/bash -c "zypper lr -u" does not return. While it appears like zypper got stuck, this is not the case. The zypper process finishes successfully (visible in logs) and inside the container it also reports that the run-uNR.service unit finished. So it appears to be the communication issue indicated by comment 1. Might be https://github.com/systemd/systemd/issues/27189 upstream, no activity. (In reply to Fabian Vogt from comment #16) > I had a look as well. Meanwhile it's the networkd_init module that fails: > https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 I'm trying to reproduce the issue, but the assets seem to have been deleted. Even tried the latest test there: https://openqa.opensuse.org/tests/4209853# But openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240520-Media.iso isn't available. openQA colleagues, could you please reschedule the test so that I can get the assets for the reproducer? (In reply to Thomas Blume from comment #18) > (In reply to Fabian Vogt from comment #16) > > I had a look as well. Meanwhile it's the networkd_init module that fails: > > https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 > > I'm trying to reproduce the issue, but the assets seem to have been deleted. > Even tried the latest test there: > > https://openqa.opensuse.org/tests/4209853# > > But openSUSE-Tumbleweed-DVD-x86_64-Snapshot20240520-Media.iso isn't > available. > > openQA colleagues, could you please reschedule the test so that I can get > the assets for the reproducer? you can try out with https://openqa.opensuse.org/tests/4246330#downloads (In reply to Fabian Vogt from comment #16) > I had a look as well. Meanwhile it's the networkd_init module that fails: > https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" does not return. > > While it appears like zypper got stuck, this is not the case. The zypper > process finishes successfully (visible in logs) and inside the container it > also reports that the run-uNR.service unit finished. So it appears to be the > communication issue indicated by comment 1. With the hint in comment#19 I could reproduce the issue on my local openQA setup. But unlike in the original test, where there is a timeout: [2024-05-16T08:17:58.087198Z] [info] [pid:80064] ::: basetest::runtest: # Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' timed out at /usr/lib/os-autoinst/testapi.pm line 926. testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper lr -u\"") called at opensuse/lib/networkdbase.pm line 28 The log shows that the systemd-run command returns with failure: --> [2024-06-04T11:11:21.773323Z] [info] [pid:1987] ::: basetest::runtest: # Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' failed at /usr/lib/os-autoinst/testapi.pm line 926. testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper lr -u\"") called at opensuse/lib/networkdbase.pm line 28 --< Not sure what makes this difference. Investigating further. (In reply to Thomas Blume from comment #20) > (In reply to Fabian Vogt from comment #16) > > I had a look as well. Meanwhile it's the networkd_init module that fails: > > https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 > > > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" does not return. > > > > While it appears like zypper got stuck, this is not the case. The zypper > > process finishes successfully (visible in logs) and inside the container it > > also reports that the run-uNR.service unit finished. So it appears to be the > > communication issue indicated by comment 1. > > With the hint in comment#19 I could reproduce the issue on my local openQA > setup. > But unlike in the original test, where there is a timeout: > > [2024-05-16T08:17:58.087198Z] [info] [pid:80064] ::: basetest::runtest: # > Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' timed > out at /usr/lib/os-autoinst/testapi.pm line 926. > testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper > lr -u\"") called at opensuse/lib/networkdbase.pm line 28 > > > The log shows that the systemd-run command returns with failure: > > --> > [2024-06-04T11:11:21.773323Z] [info] [pid:1987] ::: basetest::runtest: # > Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' > failed at /usr/lib/os-autoinst/testapi.pm line 926. > testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper > lr -u\"") called at opensuse/lib/networkdbase.pm line 28 > --< > > Not sure what makes this difference. > Investigating further. systemd-run -tM node1 /bin/bash -c "zypper lr -u" fails on my test setup due to a pty allocation error. strace shows: --> [...] openat(AT_FDCWD, "/dev/pts/1", O_RDWR|O_NOCTTY|O_CLOEXEC) = -1 ENOENT (No such file or directory) close(5) = 0 close(7) = 0 exit_group(1) = ? +++ exited with 1 +++ --< However, the similar command: machinectl shell node1 succeeds. According to the documentation at https://www.freedesktop.org/software/systemd/man/latest/systemd-run.html# 'systemd-run -tM' is for starting an interactive shell. This is not required for the test and the manpage states: --> [...] Note that machinectl(1)'s shell command is usually a better alternative for requesting a new, interactive login session on the local host or a local container. [...] --< For the non-interactive test run, the command: systemd-run -PM node1 /bin/bash -c "zypper lr -u" can be used. My test setup succeeds with the following patch: --> diff --git a/lib/networkdbase.pm b/lib/networkdbase.pm index f763732fb..b90d61239 100644 --- a/lib/networkdbase.pm +++ b/lib/networkdbase.pm @@ -25,7 +25,7 @@ Run C<$script> in the systemd-nspawn container called C<$machine>. sub assert_script_run_container { my ($self, $machine, $script) = @_; - assert_script_run("systemd-run -tM $machine /bin/bash -c \"$script\""); + assert_script_run("systemd-run -PM $machine /bin/bash -c \"$script\""); } =head2 script_run_container --< Could you please try this? (In reply to Thomas Blume from comment #21) > (In reply to Thomas Blume from comment #20) > > (In reply to Fabian Vogt from comment #16) > > > I had a look as well. Meanwhile it's the networkd_init module that fails: > > > https://openqa.opensuse.org/tests/4194179#step/networkd_init/48 > > > > > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" does not return. > > > > > > While it appears like zypper got stuck, this is not the case. The zypper > > > process finishes successfully (visible in logs) and inside the container it > > > also reports that the run-uNR.service unit finished. So it appears to be the > > > communication issue indicated by comment 1. > > > > With the hint in comment#19 I could reproduce the issue on my local openQA > > setup. > > But unlike in the original test, where there is a timeout: > > > > [2024-05-16T08:17:58.087198Z] [info] [pid:80064] ::: basetest::runtest: # > > Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' timed > > out at /usr/lib/os-autoinst/testapi.pm line 926. > > testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper > > lr -u\"") called at opensuse/lib/networkdbase.pm line 28 > > > > > > The log shows that the systemd-run command returns with failure: > > > > --> > > [2024-06-04T11:11:21.773323Z] [info] [pid:1987] ::: basetest::runtest: # > > Test died: command 'systemd-run -tM node1 /bin/bash -c "zypper lr -u"' > > failed at /usr/lib/os-autoinst/testapi.pm line 926. > > testapi::assert_script_run("systemd-run -tM node1 /bin/bash -c \"zypper > > lr -u\"") called at opensuse/lib/networkdbase.pm line 28 > > --< > > > > Not sure what makes this difference. > > Investigating further. > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" > > fails on my test setup due to a pty allocation error. It's possible you're testing a snapshot with broken systemd: bsc#1225943 Maybe try an older snapshot where the original issue occurs. (In reply to Fabian Vogt from comment #22) > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" > > > > fails on my test setup due to a pty allocation error. > > It's possible you're testing a snapshot with broken systemd: bsc#1225943 > > Maybe try an older snapshot where the original issue occurs. I'd need the older asset for the test, but that isn't available anymore, see comment#18 The -P def makes more sense than -t... created also: https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/19457 with link to the VR (In reply to Thomas Blume from comment #21) > > My test setup succeeds with the following patch: > > --> > diff --git a/lib/networkdbase.pm b/lib/networkdbase.pm > index f763732fb..b90d61239 100644 > --- a/lib/networkdbase.pm > +++ b/lib/networkdbase.pm > @@ -25,7 +25,7 @@ Run C<$script> in the systemd-nspawn container called > C<$machine>. > > sub assert_script_run_container { > my ($self, $machine, $script) = @_; > - assert_script_run("systemd-run -tM $machine /bin/bash -c \"$script\""); > + assert_script_run("systemd-run -PM $machine /bin/bash -c \"$script\""); > } > > =head2 script_run_container > --< > > Could you please try this? (In reply to Thomas Blume from comment #23) > (In reply to Fabian Vogt from comment #22) > > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" > > > > > > fails on my test setup due to a pty allocation error. > > > > It's possible you're testing a snapshot with broken systemd: bsc#1225943 > > > > Maybe try an older snapshot where the original issue occurs. > > I'd need the older asset for the test, but that isn't available anymore, see > comment#18 https://openqa.opensuse.org/tests/4241474#downloads looks fine? (In reply to Fabian Vogt from comment #25) > (In reply to Thomas Blume from comment #23) > > (In reply to Fabian Vogt from comment #22) > > > > systemd-run -tM node1 /bin/bash -c "zypper lr -u" > > > > > > > > fails on my test setup due to a pty allocation error. > > > > > > It's possible you're testing a snapshot with broken systemd: bsc#1225943 > > > > > > Maybe try an older snapshot where the original issue occurs. > > > > I'd need the older asset for the test, but that isn't available anymore, see > > comment#18 > > https://openqa.opensuse.org/tests/4241474#downloads looks fine? Yepp, that works, thanks! trying (In reply to Thomas Blume from comment #26) > > https://openqa.opensuse.org/tests/4241474#downloads looks fine? > > Yepp, that works, thanks! > > trying the test works with the patch from comment#24: http://10.168.189.195/tests/150#step/networkd_init/48 Trying now without the patch. (In reply to Thomas Blume from comment #27) > (In reply to Thomas Blume from comment #26) > > > https://openqa.opensuse.org/tests/4241474#downloads looks fine? > > > > Yepp, that works, thanks! > > > > trying > > the test works with the patch from comment#24: > > http://10.168.189.195/tests/150#step/networkd_init/48 > > Trying now without the patch. Unfortunately, it also works. :( http://10.168.189.195/tests/152 (In reply to Thomas Blume from comment #28) > (In reply to Thomas Blume from comment #27) > > (In reply to Thomas Blume from comment #26) > > > > https://openqa.opensuse.org/tests/4241474#downloads looks fine? > > > > > > Yepp, that works, thanks! > > > > > > trying > > > > the test works with the patch from comment#24: > > > > http://10.168.189.195/tests/150#step/networkd_init/48 > > > > Trying now without the patch. > > Unfortunately, it also works. :( > > http://10.168.189.195/tests/152 openqa-clone-job --repeat 100 maybe? (In reply to Santiago Zarate from comment #29) > (In reply to Thomas Blume from comment #28) > > (In reply to Thomas Blume from comment #27) > > > (In reply to Thomas Blume from comment #26) > > > > > https://openqa.opensuse.org/tests/4241474#downloads looks fine? > > > > > > > > Yepp, that works, thanks! > > > > > > > > trying > > > > > > the test works with the patch from comment#24: > > > > > > http://10.168.189.195/tests/150#step/networkd_init/48 > > > > > > Trying now without the patch. > > > > Unfortunately, it also works. :( > > > > http://10.168.189.195/tests/152 > > openqa-clone-job --repeat 100 maybe? I ran 20 repetitions, which took about 13 hours, but couldn't reproduce the issue. :( |