|
Bugzilla – Full Text Bug Listing |
| Summary: | intermittently VM startup indefinitely stucks at "start job is running for wicked managed network interfaces" | ||
|---|---|---|---|
| Product: | [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6 | Reporter: | fei wang <fei2.wang> |
| Component: | Basesystem | Assignee: | wicked maintainers <wicked-maintainers> |
| Status: | NEW --- | QA Contact: | |
| Severity: | Major | ||
| Priority: | P2 - High | CC: | aginies, cfamullaconrad, claudio.fontana, dfaggioli, fei2.wang, lma, marc.ruehrschneck, mt, poswald, pragyansri.pathi, rtsvetkov, wicked-maintainers |
| Version: | unspecified | Flags: | mt: needinfo?(fei2.wang) |
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | SLES 15 | ||
| Whiteboard: | |||
| Found By: | Beta-Customer | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | screenshot for the failure symptom | ||
|
Description
fei wang
2024-02-21 09:43:30 UTC
Created attachment 872891
screenshot for the failure symptom
What version of the qemu package do you have there?

Can you try with '-cpu host,host-phys-bits=on'?

(In reply to fei wang from comment #0)
> ...... for the latest
> instance of the failure I am not able to get it boot up any longer even I
> simplified the command line to “qemu-system-x86_64 -name vm0 -enable-kvm
> -daemonize -cpu host -smp 4 -m 10240 -vnc :10 -drive
> file=/home/images/vm0.img”.

* If the image /home/images/vm0.img is a sparse file, please make sure there is enough free disk space under /home/images/ on the host.

* I suggest explicitly specifying the image format instead of letting qemu auto-probe it. E.g., get the image format through "qemu-img info /home/images/vm0.img", then add it to the qemu cli: "-drive file=/home/images/vm0.img,format={raw,qcow2}"

* Please make sure there is enough free virtual disk space in image vm0.img for the various mount points.

* In general, the message "start job is running for wicked managed network interfaces" is normal. If it stays longer, it usually means that, according to the configuration, wicked is waiting for an IP for the nic.

I have no idea yet why the guest os stays stuck there overnight, but it seems there is at least one nic configuration in wicked and wicked is waiting for an IP according to that configuration.
Even though the simplified qemu cli you used doesn't contain a virtual nic, qemu provides a default nic (e1000) because you didn't explicitly specify the '-nodefaults' option. You can see it by running the 'info network' command in your qemu monitor.
So I suggest:
1. Add the '-nodefaults' option to your simplified qemu cli to start the vm.
2. If the guest os starts up successfully, remove the existing wicked nic configuration.
3. Shut off the vm.
4. Start the vm using your regular qemu cli.
5. If the guest os starts up successfully, re-configure the nic(s) in wicked.
6. Observe.
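The suggestions above (explicit image format, no default devices) can be sketched as a small shell helper. This is only a sketch under stated assumptions: the function names parse_format and build_qemu_cmd are made up for illustration, the command line is printed rather than executed, and '-vga std' is an addition that is not in the original command line, on the assumption that '-nodefaults' also drops the default VGA adapter that the VNC console relies on.

```shell
#!/bin/sh
# Sketch only; parse_format and build_qemu_cmd are hypothetical helper names.

# Extract the image format from the human-readable output of
# "qemu-img info IMAGE", which contains a line like "file format: qcow2".
parse_format() {
    sed -n 's/^file format: //p'
}

# Print (not run) the simplified command line from comment #0 with an
# explicit image format and without qemu's default devices.
# Assumption: "-vga std" is added because -nodefaults also removes the
# default VGA adapter, which the VNC console would otherwise rely on.
build_qemu_cmd() {
    img=$1   # e.g. /home/images/vm0.img
    fmt=$2   # raw or qcow2
    echo "qemu-system-x86_64 -name vm0 -enable-kvm -daemonize -nodefaults" \
         "-vga std -cpu host -smp 4 -m 10240 -vnc :10" \
         "-drive file=${img},format=${fmt}"
}

# Demo with canned "qemu-img info" output; on a real host one would use:
#   fmt=$(qemu-img info /home/images/vm0.img | parse_format)
fmt=$(printf 'image: vm0.img\nfile format: qcow2\n' | parse_format)
build_qemu_cmd /home/images/vm0.img "$fmt"
```

With the default nic gone, the guest should only see a network interface if one is added explicitly with '-device'.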
BTW, you previously used vfio to pass a device through to the vm. In that case the host fully allocates all of the memory (10G) for the qemu instance instead of COW, so please make sure there is enough free memory on the host to avoid OOM.

(In reply to Dario Faggioli from comment #2)
> What version of the qemu package do you have there ?
>
> Can you try with '-cpu host,host-phys-bits=on' ?

Hey Dario, I tried adding '-cpu host,host-phys-bits=on' but saw neither an improvement nor increased log verbosity. May I know what this parameter does?

(In reply to Lin Ma from comment #3)
> (In reply to fei wang from comment #0)
> > ...... for the latest
> > instance of the failure I am not able to get it boot up any longer even I
> > simplified the command line to “qemu-system-x86_64 -name vm0 -enable-kvm
> > -daemonize -cpu host -smp 4 -m 10240 -vnc :10 -drive
> > file=/home/images/vm0.img”.
>
> * If the image /home/images/vm0.img is a sparse based file, Please make sure
> there is enough free disk space for path /home/images/ on the host.
>
> * I suggest explicitly specifying image format instead of auto probing by
> qemu. e.g.:
> get the image format information through "qemu-img info
> /home/images/vm0.img", then add the format information to qemu cli: "-drive
> file=/home/images/vm0.img,format={raw,qcow2}"
>
> * Please make sure there is enough free virtual disk space in image vm0.img
> for various mount points.
>
> * In general, The message "start job is running for wicked managed network
> interfaces" is normal, If this message stays longer, usually it means:
> according to the configuration, wicked is waiting for an IP for the nic.
>
> I have no idea yet why the guest os stucks there for overnight, But it seems
> there is at least one nic configuration in wicked and wicked is waiting for
> an IP according to that configuration.
> Even though the simplified qemu cli you used doesn't contain a virtual nic,
> But qemu offers a default nic(e1000) for you because you doesn't explicitly
> specify the '-nodefaults' option, You can see it by perform 'info network'
> command via your qemu monitor interface.
> So I suggest:
> 1. Add '-nodefaults' option to your simplified qemu cli to start the vm.
> 2. If the guest os successfully start up, then remove the exist wicked nic
> configuration.
> 3. shutoff the vm.
> 4. Start the vm using your regular qemu cli.
> 5. If the guest os successfully start up, then re-config the nic(s) in
> wicked.
> 6. Observe.
>
> BTW, you used to use vfio to passthrough device to the vm, In this case,
> host fully allocates all of memory(10G) for qemu instance instead of COW, So
> please make sure there is enough free memory on the host to avoid OOM.

WF:
- The VM is indeed using a sparse qcow2 image, but judging from df -h there is still plenty of free storage space on both of the systems on which we observe the same failure symptom.
- Miraculously, after booting the VM multiple times I got lucky and it booted again. I will try the -nodefaults option if I hit consistent consecutive failures.
- The troublesome thing is that we run this test with DPDK Test Suite, an automation tool, so the command line parameters are inherited from there and we don't have much flexibility to customize the qemu cli. I understand there is a way to do it, but it would take a certain amount of effort, and what's worse, it means we would have to implement a SLES-specific CLI.
- Both of our systems have 256GB of memory, so I suppose 10GB would not be a problem for them.

As for getting stuck at "start job is running for wicked managed network interfaces", it seems we need to ask the wicked team for further help.
@Claudio, Any thoughts?
Below is a workaround to avoid this issue that you might try:
Launch your SLES 15 SP6 vm image and take a look at two files, e.g.:
(I assume that you're using wicked to manage the network in the guest os and that only one nic is configured by wicked)
guest:~ # cat /etc/sysconfig/network/ifcfg-eth0
BOOTPROTO='dhcp4'
STARTMODE='auto'
guest:~ # cat /etc/udev/rules.d/70-persistent-net.rules
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="virtio-pci", ATTR{dev_id}=="0x0", KERNELS=="0000:00:03.0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"
You can see that a nic whose bus is 0 and addr is 0x3 will be assigned the interface name "eth0", and there is a wicked network configuration for interface eth0 (ifcfg-eth0).
So, according to the above example, what you need to do is ensure there is a virtual nic whose bus is pci.0 and addr is 0x3 when you run DPDK Test Suite. E.g.:
......
-device e1000,netdev=nttsip1,bus=pci.0,addr=0x3
......
If it's hard for you to customize the qemu cli parameters, you can choose one of these three options:
A. Dig into DPDK Test Suite to see if it allows users to specify the bus number and addr for pci devices.
B. Re-generate/re-customize your vm image to clean up the network configuration.
C. Use the sles15sp6 jeos image.
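The mapping described above, from the KERNELS=="0000:00:03.0" match in the udev rule to the bus=/addr= part of the qemu '-device' option, can be sketched in shell. This is only an illustration: pci_to_qemu_opts is a made-up helper name, the PCI function number is ignored (assumed to be 0), and the mapping of guest bus NN to a qemu bus named "pci.NN" is an assumption that fits the root bus pci.0 used in the example. Note also that the rule shown matches DRIVERS=="virtio-pci", so for that particular rule to apply the device model may need to be a virtio one rather than e1000.

```shell
#!/bin/sh
# Sketch only; pci_to_qemu_opts is a hypothetical helper name.

# Convert a guest PCI address as it appears in KERNELS=="0000:00:03.0"
# (domain:bus:slot.function) into the bus=/addr= part of a qemu -device
# option. The function number is ignored (assumed to be 0).
# Assumption: guest PCI bus NN corresponds to a qemu bus named "pci.NN".
pci_to_qemu_opts() {
    bdf=$1                # e.g. 0000:00:03.0
    rest=${bdf#*:}        # strip the domain   -> 00:03.0
    bus=${rest%%:*}       # bus number         -> 00
    slotfn=${rest#*:}     # slot.function      -> 03.0
    slot=${slotfn%%.*}    # slot number        -> 03
    printf 'bus=pci.%d,addr=0x%x\n' "0x${bus}" "0x${slot}"
}

# Pull the address out of a rule like the one in 70-persistent-net.rules
# and print the matching -device option. Device model and netdev id are
# simply copied from the example in this thread.
rule='SUBSYSTEM=="net", ACTION=="add", DRIVERS=="virtio-pci", ATTR{dev_id}=="0x0", KERNELS=="0000:00:03.0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth0"'
bdf=$(printf '%s\n' "$rule" | sed -n 's/.*KERNELS=="\([^"]*\)".*/\1/p')
echo "-device e1000,netdev=nttsip1,$(pci_to_qemu_opts "$bdf")"
```

For the rule quoted in this thread, the sketch reproduces the bus=pci.0,addr=0x3 placement used in the '-device' example.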
Currently I have worked around this issue by switching from wicked to the NetworkManager service. So far so good, but I am not sure whether there is any obvious drawback/caveat/pitfall in going with NetworkManager.

Lin, is switching to networkmanager a valid option? Is this fully supported?

(In reply to Marc Ruehrschneck from comment #8)
> Lin, is switching to networkmanager a valid option? Is this fully supported?

Yes, it is a valid option. I think NetworkManager is fully officially supported, although I'm not 100% sure. We need PM or the NetworkManager team to help confirm it.

As per the question in comment #6, assigned to wicked maintainers to answer the general question about the wicked symptom and eventually comment #8.

Hi, we are at the final stage of SLE15 SP6. When shall we expect this?

Based on the comments this is not a Virtu bug, it's more a wicked behavior change / bug.

It could be the same issue as mentioned in bsc#1222105, which is fixed in the latest version. Is a re-validation possible? Thanks in advance.

(In reply to Clemens Famulla-Conrad from comment #13)
> It could be the the same issue as mentioned in bsc#1222105, which is fixed
> in the latest version. Is a re validation possible?

It'd be relevant if there were also a bridge (with the eth interface as port) with STP enabled, and the nic port were unable to find carrier (which the bridge inherits).

As you're using Intel E810 VFs, it could be a variant of this Intel E810 nic reset & ethtool reading issue:
https://bugzilla.suse.com/show_bug.cgi?id=1215269
Please make sure your kernels and DPDK drivers include this bug fix.

I tried accessing https://bugzilla.suse.com/show_bug.cgi?id=1222105 and also https://bugzilla.suse.com/show_bug.cgi?id=1215269; unfortunately I got a "You are not authorized to access bug #" error. Do I need to apply for additional permission? Thanks.

Fei, I added you to https://bugzilla.suse.com/show_bug.cgi?id=1215269

I went through https://bugzilla.suse.com/show_bug.cgi?id=1215269; it was raised by my colleague.
Actually we haven't got any fix plan from our driver team for that issue yet. Also, I tend to believe these are two separate issues, though I am not sure. Would it be possible to add me to the CC list for https://bugzilla.suse.com/show_bug.cgi?id=1222105? I'd like to check the symptom and solution mentioned there, and will give it a try if possible. Thx.

Fei, you write, "this issue is intermittent with a medium probability" and "elapsed time 28s doesn’t change and bootup process cannot proceed".

You're not reinstalling the VM, just starting + stopping it, right?

As the "elapsed time 28s doesn’t change" is coming from systemd, it sounds like the complete VM / kernel freezes. Is kdump enabled, and is there perhaps a kernel dump in /var/crash/…? It takes quite a while until it gets written.

Could you attach a supportconfig from the same, but working case? Please also attach the output of
`journalctl -o short-precise -b > journal.0.log`
and, when it happened in the previous boot, please also try:
`journalctl -o short-precise -b -1 > journal.1.log`
This would give us some hints about the config + environment in the VM.

When possible, please enable debug logging (once, before reboot), that is WICKED_DEBUG=all WICKED_LOG_LEVEL=debug2, as described at https://en.opensuse.org/openSUSE:Bugreport_wicked:
...
# enable debugging, applied to wickedd*.service as well as to wicked.service aka network.service
# (when requested in a bug report to enable debug level 2, use '{$1=debug2}' below)
perl -i -lpe 's{^(WICKED_DEBUG)=.*}{$1=all};s{^(WICKED_LOG_LEVEL)=.*}{$1=debug}' /etc/sysconfig/network/config
...

Let me try to reproduce the failure, then collect more logs and get back to you. Thanks. |
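The debug settings requested above can also be sketched as a plain shell helper that mirrors the perl one-liner from the wiki. This is a sketch only: enable_wicked_debug is a made-up name, the default level debug2 is taken from the request in this thread, and the demo runs against a scratch file rather than /etc/sysconfig/network/config. In journalctl, '-b' selects the current boot and '-b -1' the previous one.

```shell
#!/bin/sh
# Sketch only; enable_wicked_debug is a hypothetical helper name.

# Set WICKED_DEBUG=all and WICKED_LOG_LEVEL=<level> in a sysconfig-style
# file, mirroring the perl one-liner from the wiki. Default level: debug2.
enable_wicked_debug() {
    cfg=$1
    level=${2:-debug2}
    tmp=$(mktemp) || return 1
    # Rewrite into a temp file first, then copy back so that the
    # ownership and permissions of the original file are kept.
    sed -e 's/^WICKED_DEBUG=.*/WICKED_DEBUG=all/' \
        -e "s/^WICKED_LOG_LEVEL=.*/WICKED_LOG_LEVEL=${level}/" \
        "$cfg" > "$tmp" && cat "$tmp" > "$cfg"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Demo on a scratch file rather than /etc/sysconfig/network/config.
demo=$(mktemp)
printf 'WICKED_DEBUG=""\nWICKED_LOG_LEVEL=""\nOTHER="x"\n' > "$demo"
enable_wicked_debug "$demo" debug2
cat "$demo"
rm -f "$demo"

# After the next boot, the logs could then be collected with:
#   journalctl -o short-precise -b    > journal.0.log   # current boot
#   journalctl -o short-precise -b -1 > journal.1.log   # previous boot
```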