Bugzilla – Full Text Bug Listing
| Summary: | LTC16580 Bonding slaves are not mandatory devices by default (was: Ethernet bonding does not come up properly) | | |
|---|---|---|---|
| Product: | [openSUSE] SUSE Linux 10.1 | Reporter: | LTC BugProxy <bugproxy> |
| Component: | Network | Assignee: | Christian Zoz <zoz> |
| Status: | VERIFIED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | | |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| See Also: | https://bugzilla.linux.ibm.com/show_bug.cgi?id=16580 | | |
| Whiteboard: | | | |
| Found By: | Other | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | bonding.tar, bonding-debug.patch, sysconfig-0.31.0-15.35.i586.rpm, netwk_debug_minimum.patch, "gai.debug" | | |
Description
LTC BugProxy 2005-07-08 22:35:40 UTC
Created attachment 41511 [details]
bonding.tar
IBM attachment id 11038
Created attachment 41512 [details]
bonding-debug.patch
Created attachment 41514 [details]
sysconfig-0.31.0-15.35.i586.rpm
Created attachment 41515 [details]
netwk_debug_minimum.patch
---- Additional Comments From skodati@in.ibm.com 2005-07-15 16:02 EDT -------
This bug is taking longer than expected to conclude, partly because of the low reproduction rate with debug messages enabled. ifenslave returns 1 when it fails to attach the device; when invoked with the -v option it should have printed the reason for the failure. There is only one place in the ifenslave code where no debug message is printed: when the ABI version is not valid. I am rerunning the test cases after adding an error message for that failure. Thanks.

---- Additional Comments From skodati@in.ibm.com 2005-07-19 14:54 EDT -------
I am struggling to reproduce the problem with the required debug information; so far, 228 passes without a single failure. Thanks.

---- Additional Comments From skodati@in.ibm.com 2005-07-20 12:04 EDT -------
Finally I was able to reproduce the problem once, after 296 iterations, and found some surprising results. ifenslave fails at ioctl(skfd, SIOCGIFFLAGS, &ifr2) in the following code:

    else if (abi_ver < 1) {
        /* The driver is using an old ABI, so we'll set the interface
         * down to avoid any conflicts due to same IP/MAC */
        strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ);
        if (ioctl(skfd, SIOCGIFFLAGS, &ifr2) < 0) {   /* <-- HERE */
            int saved_errno = errno;
            fprintf(stderr, "SIOCGIFFLAGS on %s failed: %s\n",
                    slave_ifname, strerror(saved_errno));
        }

Strangely, this check is only done when abi_ver < 1, but all the logs suggest the abi_ver it received is 2. I am debugging the problem further. Thanks.

---- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-07-21 15:23 EDT -------
If ifenslave itself is failing, isn't that a different failure than the original problem (in which ifenslave would never be called)? The failure you cite seems very strange; I think you'd have to have an old bonding driver installed to follow that path.
---- Additional Comments From skodati@in.ibm.com 2005-07-21 15:32 EDT -------
(In reply to comment #32)
> If ifenslave itself is failing, isn't that a different failure than the original
> problem (in which ifenslave would never be called)?

True. In the case where there was a failure to add eth1, ioctl(skfd, SIOCGIFFLAGS, &ifr2) fails with the following error:

    SIOCGIFFLAGS on eth1 failed: No such device

> The failure you cite seems very strange; I think you'd have to have an old
> bonding driver installed to follow that path.

Sorry for the confusion; I think I overlooked another instance of the code where a similar check exists even for abi_ver 2.

---- Additional Comments From skodati@in.ibm.com 2005-07-26 17:33 EDT -------
I think the prime reason for the problem is a very slight time delay between eth1 coming up and the device being attached to the bonding device. When eth1 is attached to bond0 through ifenslave in /sbin/ifup ( /sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE ), it returns 1 with the failure ( SIOCGIFFLAGS on eth1 failed: No such device ). But from the logs I could see that eth1 came up just afterwards, with a very slight delay. To verify this, I tested a small patch that checks the return status of ifenslave in /sbin/ifup and retries attaching eth1; the retry always succeeded. I can see 894 passes so far without any failure. A possible workaround reflecting this testing is to check the status of bonding towards the end of the init scripts and restart bond0. I had a ST chat with Sanjay today and we will discuss this workaround tomorrow ( 27th July ). Thanks.

changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ASSIGNED |NEEDINFO
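The retry workaround described above can be sketched as a shell fragment. The $BONDING_OPTIONS, $INTERFACE, and $BSIFACE variables are the ones quoted from /sbin/ifup in this report; the function name, the $IFENSLAVE override, the retry count, and the sleep interval are illustrative assumptions, not the actual patch:

```shell
# Sketch (not the actual patch): if ifenslave fails because the slave
# has not been probed yet, wait a moment and try again a few times.
# $IFENSLAVE, the retry count, and the sleep interval are assumptions.
: "${IFENSLAVE:=/sbin/ifenslave}"

enslave_with_retry() {
    tries=0
    max_tries=5
    until $IFENSLAVE -v $BONDING_OPTIONS "$INTERFACE" "$BSIFACE"; do
        tries=$((tries + 1))
        if [ "$tries" -ge "$max_tries" ]; then
            echo "giving up on $BSIFACE after $max_tries attempts" >&2
            return 1
        fi
        sleep 1    # give the driver time to finish probing the slave
    done
    return 0
}
```

This mirrors the reported result: a single retry after a one-second delay was enough in all 894 test passes.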
------- Additional Comments From skodati@in.ibm.com 2005-08-01 10:25 EDT -------
Please update the report with your comments. Keeping the report in NEEDINFO. Thanks.
---- Additional Comments From skodati@in.ibm.com 2005-08-16 16:13 EDT -------
Novell, any comments/suggestions on this bug report? Thanks.

Created attachment 46400 [details]
"gai.debug"
---- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-08-18 01:59 EDT -------
getcfg trace file

I did some tests today on one of the problem machines. Right now, my best guess is that whatever is loading the modules is loading e1000 and bonding in parallel, causing the probe of eth0 and eth1 by e1000 to overlap with the loop in the ifup of bond0 that is looking for them.

It is unclear to me which agent actually performs the modprobe for e1000; it doesn't appear to happen in the main loop of /etc/init.d/network. Any init script gurus want to chime in here? Since I'm coming in over the network, it's hard to test the theory that the getcfg- itself might trigger a hotplug event or something to load the driver, although trying that scenario on a system I have locally doesn't cause the driver to load for a getcfg- query.

I did tinker with an install line in /etc/modprobe.conf.local for e1000, as follows:

    install e1000 /sbin/modprobe --ignore-install e1000 && { logger -s -p kern.warning e1000 sleep 5 ; sleep 5 ; logger -s -p kern.warning sleep done ; }

At the time I thought this might give the driver time to finish probing, but it didn't make any difference. It did produce the "sleep done" message in /var/log/messages, however:

    Aug 17 15:53:01 fvt10-mds6 /etc/hotplug/pci.agent[2280]: logger: sleep done
    Aug 17 15:53:01 fvt10-mds6 logger: sleep done

The message coming from the hotplug pci.agent suggests that hotplug is loading e1000, but I still don't know what the mechanism is. The interleaved messages from bonding and e1000 appear in /var/log/messages as follows:

    17:30:50 x kernel: Intel(R) PRO/1000 Network Driver - version 5.2.39
    17:30:50 x kernel: Copyright (c) 1999-2004 Intel Corporation.
    17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.0[A] -> GSI 29 (level, low) -> IRQ 29
    17:30:50 x kernel: Ethernet Channel Bonding Driver: v2.6.0 (January 14, 2004)
    17:30:50 x kernel: bonding: MII link monitoring set to 100 ms
    17:30:50 x kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
    17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.1[B] -> GSI 30 (level, low) -> IRQ 30
    17:30:50 x kernel: e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
    17:30:50 x kernel: bonding: bond0: enslaving eth0 as a backup interface with a down link.
    17:30:50 x kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
    17:30:50 x kernel: bonding: bond0: link status definitely up for interface eth0.
    17:30:50 x kernel: bonding: bond0: making interface eth0 the new active one.

Also, I do not believe that the get_all_interfaces function called by getcfg-interface is the root of the problem. I think I discussed this possibility with somebody, but I don't recall who. I instrumented /lib/libgetconfig and /sbin/getcfg and had them print all sorts of deep, meaningful stuff. Among the tidbits is that only the failing getcfg-interface call enters get_all_interfaces; the successful calls don't get that far.
In the attached trace, the two getcfg-interface calls for the slave devices are pids 2572 and 2573, grepped here for your convenience:

    getcfg 2572 /sbin/getcfg-interface -- bus-pci-0000:06:08.1
    2572 case g_i: rv 0 from split_hwdesc
    2572 case g_i: rv 0 from complete_hwdesc_sysfs
    2572 get_all_interfaces net bus-pci-0000:06:08.1
    2572 iface: 'bond0' cfg: 'bond0' mcfg: 'bus-pci-0000:06:08.1'
    2572 iface: 'eth0' cfg: 'eth-id-00:09:6b:f1:ac:06' mcfg: 'bus-pci-0000:06:08.1'
    2572 iface: 'lo' cfg: 'lo' mcfg: 'bus-pci-0000:06:08.1'
    2572 ret: cl net cfgname bus-pci-0000:06:08.1 ifs r 0
    2572 get_all_if ret !=1 iflist exit 11
    getcfg 2573 /sbin/getcfg-interface -- bus-pci-0000:06:08.0
    2573 case g_i: rv 0 from split_hwdesc
    2573 case g_i: rv 0 from complete_hwdesc_sysfs
    2573 case g_i: match_type: h->iface eth0 h->devtype eth iftype net

Without the source handy that might not make much sense, but 2572 is the "exit 11" failure case for eth1; 2573 succeeds for eth0. Note that there is a lot of interleaving in the trace file; I'm not sure how much of that is real and how much is an artifact of buffering in fprintf (I set it to unbuffered, but who knows). I have a fair level of faith in the interleaving of the bonding/e1000 kernel messages, since they line up that way in dmesg right from the kernel printk.

Lastly, note that in /etc/init.d/network, the e1000 driver does not appear to be loaded until bonding is initialized, so the WAIT_FOR_INTERFACES loop won't make any difference (as I read it; I might be mistaken here, but it doesn't look like the e1000 devices will be put into the MANDATORY list, because their STARTMODE is "off").

---- Additional Comments From zoz@suse.de -------
I don't understand your problem, because this bug report is really messed up: attachments don't fit their descriptions, and a lot of useless lines make it hard to find the relevant parts. Please excuse me, but could you describe in a few lines what the problem(s) is (are)?
You write something about a getcfg-interface problem and also about an ifenslave failure. So please, one after another:
1) What do you want to set up?
2) What is your configuration for that?
3) What _exactly_ is the failure you see at first, and what is the state of all involved network interfaces?
4) Does it happen only at boot time, or is it reproducible when you set STARTMODE=manual in all ifcfg-* files and call 'rcnetwork start -o boot manual' later?
5) Do you see the problem with SP2 as well?

---- Additional Comments From zoz@suse.de -------
I looked again over this report, and it might be just a timing problem (as far as I understood the report). The automatic determination of mandatory devices may fail. Go use the MANDATORY_DEVICES variable in the config file.

------- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-08-18 13:06 EDT -------
(In reply to comment #39)
[...]
> You write something about a getcfg-interface problem and also about an
> ifenslave failure. So please, one after another:
> 1) What do you want to set up?

The system is trying to boot up and start bonding at boot time, with two e1000 devices in an active-backup configuration.

> 2) What is your configuration for that?

Somebody else needs to provide the details; all I know is that it's an SMP x86 system of some sort.

> 3) What _exactly_ is the failure you see at first, and what is the state of all
> involved network interfaces?

At boot time, when /etc/init.d/network gets to the ifenslave part, it first runs a loop that does getcfg-interface on all of the devices listed as BONDING_SLAVEs in ifcfg-bond0. Very often, one of these getcfg-interface calls will fail with exit code 11. The initial suspicion was that there was something wrong with getcfg itself, but after yesterday's session, I believe the problem is that the e1000 module is being loaded simultaneously with the ifup bond0 / getcfg-interface loop, causing one of the interfaces to not be probed at the time getcfg-interface tries to look it up.
> 4) Does it happen only at boot time, or is it reproducible when you set
> STARTMODE=manual in all ifcfg-* files and call
> 'rcnetwork start -o boot manual' later?

Boot time for sure. I have not personally tried the other two.

> 5) Do you see the problem with SP2 as well?

Yes.

> I looked again over this report and it might be just a timing problem (as far as
> I understood the report). The automatic determination of mandatory devices may
> fail. Go use the MANDATORY_DEVICES variable in the config file.

Adding the slaves to MANDATORY_DEVICES does bring things up (at least after a couple of tries; I'm not sure if the submitters or assignee tried it more and still saw failures). The slaves don't go into MANDATORY_DEVICES automatically because they're configured as "off." That doesn't explain the problem, though; I've never previously seen a case that required the bonding slaves to be added to MANDATORY_DEVICES by hand. It looks like something is running the modprobe of e1000 in the background.

---- Additional Comments From zoz@suse.de 2005-08-19 01:13 MST -------
Of course modprobe e1000 is running in the background. This is triggered via hotplug, and that is the reason why the network script waits for mandatory devices to be set up properly. The problem is to determine which of the available network devices are mandatory for the system. Either you set the STARTMODE of the bonded interfaces to auto, or you add their devices to MANDATORY_DEVICES. So I either have to update the ifup manpage to make this understandable, or I have to check the configuration files of bonding or vlan interfaces for the devices they depend on and add those devices to the mandatory device list.

Another question: what error message do you see at boot time if bonding failed? Can you please attach the relevant part of /var/log/boot.msg? (Not the complete file, please.)
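The two remedies suggested in this exchange would look roughly like the following sysconfig fragments. This is an editorial sketch only; the bus-pci device names are the ones that appear elsewhere in this report, and the file locations follow standard SUSE sysconfig conventions:

```shell
# Option 1 (sketch): in /etc/sysconfig/network/config, list the slave
# devices as mandatory so the network script waits for them at boot.
MANDATORY_DEVICES="bus-pci-0000:06:08.0 bus-pci-0000:06:08.1"

# Option 2 (sketch): in each slave's ifcfg-eth-id-* file, start the
# slave automatically instead of leaving STARTMODE at "off".
STARTMODE="auto"
```

Either setting causes the slaves to be treated as mandatory, which is what the WAIT_FOR_INTERFACES loop keys on.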
------- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-08-19 14:32 EDT -------
(In reply to comment #41)
> ---- Additional Comments From zoz@suse.de 2005-08-19 01:13 MST -------
> Of course modprobe e1000 is running in the background. This is triggered via
> hotplug, and that is the reason why the network script waits for mandatory
> devices to be set up properly.

I'm having some trouble seeing why this ever works correctly (except by luck), unless something has changed very recently, since the modprobe of the driver would presumably always race with the getcfg-interface loop in /etc/init.d/network.

> The problem is to determine which of the available network devices are mandatory
> for the system. Either you set the STARTMODE of the bonded interfaces to auto or
> add their devices to MANDATORY_DEVICES.

Doing so (STARTMODE auto, or explicitly adding to MANDATORY_DEVICES) has never been necessary in my past experience, and apparently not in SuSE's, either, since the documentation found at http://portal.suse.com/sdb/en/2004/09/tami_sles9_bonding_setup.html says to remove the slave device ifcfg-eth-* files, which, if I'm reading the code correctly, would exclude them from consideration as detected MANDATORY_DEVICES in /etc/init.d/network. That document does mention adding to MODULES_LOADED_ON_BOOT (which I have not tried) and WAIT_FOR_INTERFACES (which doesn't help unless the devices are MANDATORY).

The documentation distributed with bonding differs; it says to keep the slave ifcfg-eth-* files, but set them to STARTMODE=off. That text was based on a mailing list posting (which I can't find at the moment), but I've never seen (or had a previous report of) this particular problem following the bonding.txt instructions.
> So i either had to update the ifup manpage to make this understandable or i have
> to check configuration files of bonding or vlan interfaces for the devices they
> depend on and add these devices to the mandatory device list.

I just checked what appears to be the current sysconfig on ftp.suse.com, version 0.32.0, and it does have a new(?) ifcfg-bonding.5 man page with some good stuff in it, but it doesn't describe any special steps related to setting up the slave configurations. FWIW, the most recent bonding.txt is always kept at http://sourceforge.net/projects/bonding and is likely to be more up to date than what's in the kernel source; I don't know if you want to add that to the manual page or not (as external links may come and go over the long term).

I think it would be most intuitive for end users for the init script itself to wait for the slave devices to become ready (treat them as MANDATORY, or possibly add a "wait for ready" type loop into the bonding device check, but that might be too much code duplication).

> Another question: What error message do you see at boot time if bonding failed?
> Can you please attach the relevant part of /var/log/boot.msg?

It's short, I'll just paste it in here:

    Setting up network interfaces:
        lo
        lo IP address: 127.0.0.1/8  done
        bond0
        bond0 Could not get an interface for slave device 'bus-pci-0000:06:08.1'
        bond0 IP address: 192.168.10.124/16 as bonding master
        enslaving eth0
        eth0 is already a slave

Using an "eth" type name in the BONDING_SLAVE variable doesn't make any difference; it still fails (although the message is a bit different), presumably because the device hasn't been probed by e1000 at that point.

---- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-08-23 18:26 EDT -------
Any updates? Submitters, do you have a viable workaround at this point (using MANDATORY_DEVICES in /etc/sysconfig/network/config, or something else)? SuSE, any update on a long-term fix?
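The "wait for ready" loop suggested above might look like the following sketch. The function name, the $SYSFS_NET override, and the default timeout are assumptions for illustration, not sysconfig code:

```shell
# Sketch of a "wait for ready" loop for bonding slaves: poll sysfs
# until the slave device has been registered by its driver, or give up
# after a timeout. All names and the timeout are illustrative.
: "${SYSFS_NET:=/sys/class/net}"

wait_for_slave() {
    dev=$1
    timeout=${2:-10}    # seconds to wait before giving up
    while [ "$timeout" -gt 0 ]; do
        if [ -d "$SYSFS_NET/$dev" ]; then
            return 0    # device has been probed and registered
        fi
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1            # device never appeared
}
```

This is essentially the race being debated: the loop succeeds once the driver's probe has finished, which is exactly the window the getcfg-interface calls were losing.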
---- Additional Comments From zoz@suse.de -------
> I'm having some trouble seeing why this ever works correctly (except by luck),
> unless something has changed very recently, since the modprobe of the driver
> would presumably always race with the getcfg-interface loop in
> /etc/init.d/network.

That's why we have the loop. We have to wait sometimes.

Further, the article in the SUSE portal is not completely correct. I will speak to Tami to correct this.

And yes, I will make BONDING_SLAVES mandatory automatically. But it will take some time, since I'm very busy with SL 10.0 currently.
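The planned fix (deriving mandatory devices from the slaves declared in the bonding ifcfg files) could be sketched like this. This is an illustration under assumed variable conventions (BONDING_MASTER=yes and BONDING_SLAVE_* entries), not the get_slaves() function that actually shipped in sysconfig:

```shell
# Illustrative sketch (not the shipped get_slaves()): scan ifcfg-*
# files for bonding masters and collect their BONDING_SLAVE_* devices,
# so the network script can add them to the mandatory device list.
collect_bonding_slaves() {
    cfgdir=${1:-/etc/sysconfig/network}
    slaves=""
    for cfg in "$cfgdir"/ifcfg-*; do
        [ -f "$cfg" ] || continue
        # Only bonding masters declare BONDING_MASTER=yes
        grep -qs '^BONDING_MASTER=.*yes' "$cfg" || continue
        # Collect every BONDING_SLAVE_* assignment, stripping quotes
        for s in $(sed -n "s/^BONDING_SLAVE[_0-9]*=['\"]*\([^'\"]*\).*/\1/p" "$cfg"); do
            slaves="$slaves $s"
        done
    done
    echo $slaves
}
```

The network script could then append the result to its mandatory device list before entering the WAIT_FOR_INTERFACES loop, which is the behavior the eventual fix provides.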
---- Additional Comments From vosburgh@us.ibm.com (prefers email via fubar@us.ibm.com) 2005-10-12 12:18 EDT -------
SuSE, any updates on when the fix should appear?

---- Additional Comments From zoz@suse.de -------
WIP. I'm just testing the code. Will probably go to SP3 beta4.

---- Additional Comments From zoz@suse.de -------
Fixed for SLES9 SP3. Patches still need to go to SVN for the next release.

---- Additional Comments From zoz@suse.de -------
Added patches to svn. Maybe worth a YOUpdate.

---- Additional Comments From zoz@suse.de -------
The new function get_slaves() did not work well in all cases. Added an improved version of this function to SP3 as well.

changed:
What |Removed |Added
----------------------------------------------------------------------------
Status|ACCEPTED |CLOSED
------- Additional Comments From thinh@us.ibm.com (prefers email via th2tran@austin.ibm.com) 2006-02-08 15:37 EDT -------
No response from the bug submitter for months.
Fix is in SLES9 SP3. Closing.
Please re-open if you can recreate this on SLES9 SP3.
Thanks.