Bugzilla – Bug 95834
LTC16580 Bonding slaves are not mandatory devices by default (was: Ethernet bonding does not come up properly)
Last modified: 2016-02-13 06:23:21 UTC
LTC Owner is: skodati@in.ibm.com LTC Originator is: gsanjay@us.ibm.com

Ethernet bonding does not come up properly because getcfg-interface returns error 11 during boot. The problem is intermittent, and it has happened on both eth0 and eth1, or on just one of the two interfaces. We have seen this problem on multiple x346 systems.

Provide output from "uname -a", if possible: 2.6.5-7.139-bigsmp
Hardware Environment
Machine type: x345
CPU type: IA-32
Describe any special hardware you think might be relevant to this problem: Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) Rev-01
Is this reproducible? Intermittent problem.
Describe the steps: Set up bonding using the procedure at http://support.novell.com/techcenter/tips/10046.html. Reboot the system in a loop; you may see this problem intermittently.

Additional information: Sametime chat transcript:

adcosta@us.ibm...: Hi
adcosta@us.ib...: Should we raise an LTC defect regarding the getcfg failure during boot-time bonding?
Jay Vosburgh: Probably, yes. I'm not sure who would look into it right offhand.
Jay Vosburgh: Is the e1000 in question on a PCI card? If so, I'd try swapping it out just to make sure it's not a hardware problem with a particular card.
adcosta@us.ib...: The problem happens intermittently. Could that still be a hardware problem? Also, we are seeing the same thing on many SLES9 machines.
adcosta@us.ib...: The upcoming sanfs release relies heavily on bonding for HA.
Jay Vosburgh: Well, if it's on several machines, it's probably not a failed card. Might still be a firmware type of problem, but that's less likely.
Jay Vosburgh: From what I can tell, it looks like a problem with the one interface not being initialized properly.
Jay Vosburgh: Do you see the same problem with the interfaces not coming up even if bonding is not configured?
adcosta@us.ib...: The getcfg works after the boot even if it had failed during ifup.
adcosta@us.ib...: No, the interfaces come up without bonding.
Jay Vosburgh: Interesting.
adcosta@us.ib...:
Also, we captured the return code from getcfg in ifup, and it is 11.
Jay Vosburgh: Yah, I haven't looked at the getcfg source to see what that 11 might mean.
adcosta@us.ib...: So for setup documentation, should users be pointed to the sourceforge.net doc for SLES9 bonding setup?
Jay Vosburgh: Sure.
adcosta@us.ib...: I noted that the Novell documentation uses the bus-pci names directly in the BONDING_SLAVE variables, whereas at sourceforge it is ethN values.
Jay Vosburgh: Yah, either one works; I have a note to update bonding.txt (as the bus id names give module load order independence).
adcosta@us.ib...: Ok, I will ask Sanjay to raise an LTC defect. Thanks so much for your help.
Jay Vosburgh: Sure thing. The web site for bonding, where the doc is, is http://sourceforge.net/projects/bonding (click on documentation).

Sanjay, the system is running an older kernel version. Please install SLES 9 SP2 RC2 (2.6.5-7.183) to see if the problem recreates. The isos are available on ftp3.liux.ibm.com in '/suse/beta_cds/sles-9-sp2/i386/RC':

-rw-r----- 1 root suse 521572352 Jun 8 10:56 SLES-9-SP-2-i386-RC2-CD1.iso
-rw-r----- 1 root suse 660611072 Jun 8 10:57 SLES-9-SP-2-i386-RC2-CD2.iso
-rw-r----- 1 root suse 657164288 Jun 8 10:58 SLES-9-SP-2-i386-RC2-CD3.iso

Please attach the complete /var/log/messages collected from the failing system and any messages printed on the console. Thanks.

Created an attachment (id=11038): config info and system logs

This is the config info and boot logs that I looked through. Note that "aftr_boot" is after a failure, and "restart" is after a network restart that comes up correctly.

Error 11 is returned from getcfg-interface when get_all_interfaces(), called from getcfg (case get_interface), returns any value other than 1. From tools/get_config.c:
555         if (1 == get_all_interfaces(interfacetype, hwdesc->hwdesc,
556                                     interfacelist)) {
557                 if (verbosity == 0)
558                         printf("%s\n", interfacelist);
559                 else
560                         printf("%s (indirekt)\n", interfacelist);
561                 return 0;
562         }
563         return 11;

get_all_interfaces() (from get_config_lib.c) will return a value other than 1 under many circumstances. In most of those cases an error is logged, except that it can return 0 when it fails to complete one iteration within dlist_for_each_data(){}. From the boot messages it is most likely that it failed to get the details for eth1. I am planning to prepare a debug patch to capture all error codes; hopefully we can get more information with it. Thanks.

Created an attachment (id=11046): debug patch to identify the problem

Please let me know if you have any problems applying the patch and rebuilding the rpms. I will be glad to assist you in building binary rpms with the patch. Thanks.

Created an attachment (id=11052): binary rpm with the patch

Sanjay, please update the report with the details of the debug messages requested by Jay, by setting DEBUG to "yes" in /etc/sysconfig/network/config. Thanks.

Created an attachment (id=11259): debug patch to add timestamps

I had a phone chat with Sanjay and he explained the practical problems with the workaround, since it adds a time delay to the boot process. I think one option to resolve the problem is to add eth1 and eth0 as MANDATORY_DEVICES in /etc/sysconfig/network/config. That will ensure that the given interfaces are up and running. I am attaching a minimal debug patch to timestamp the interface bring-up times. I decided to keep it minimal after concerns that too much debug output might cause the problem to disappear.

Sanjay, apply the following patch to /etc/init.d/network and attach the file /etc/ltdebug to the report. Please provide the details of the success/failure of bonding for each iteration. Thanks.
Sanjay asked me to carry out the tests on lab machines; I am waiting for the machine details and access to the lab. Moving the report to NEEDINFO. Thanks.
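The minimal timestamp instrumentation described in the preceding comments can be sketched as a small shell hook. This is a hypothetical illustration, not the attached patch: the helper name lt_debug is invented, and the log path defaults to /tmp/ltdebug here rather than the /etc/ltdebug file mentioned in the report, to keep the sketch harmless.

```shell
#!/bin/sh
# Sketch of a minimal interface bring-up timestamping hook.
# lt_debug is a hypothetical name; the report's actual patch logged
# to /etc/ltdebug.
LTDEBUG_LOG="${LTDEBUG_LOG:-/tmp/ltdebug}"

lt_debug() {
    # $1: event description, e.g. "ifup eth1 start"
    echo "$(date '+%Y-%m-%d %H:%M:%S') $1" >> "$LTDEBUG_LOG"
}

# Example instrumentation points around an interface bring-up:
lt_debug "ifup eth1 start"
# ... the real script would run ifup/ifenslave here ...
lt_debug "ifup eth1 done"
```

Comparing the timestamps of the "start" and "done" lines against the kernel messages in /var/log/messages is what lets the bring-up delay be measured without adding much debug noise.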
Created attachment 41511 [details] bonding.tar IBM attachment id 11038
Created attachment 41512 [details] bonding-debug.patch
Created attachment 41514 [details] sysconfig-0.31.0-15.35.i586.rpm
Created attachment 41515 [details] netwk_debug_minimum.patch
---- Additional Comments From skodati@in.ibm.com 2005-07-15 16:02 EDT -------
This bug is taking longer than expected to conclude, partly because of the low reproduction rate with debug messages enabled. ifenslave returns 1 when there is a failure to attach the device; when invoked with the -v option it should have printed the reason for the failure. There is only one place in the ifenslave code where no debug message is printed: when the ABI version is not valid. I am rerunning the testcases after adding an error message for that failure. Thanks.
---- Additional Comments From skodati@in.ibm.com 2005-07-19 14:54 EDT -------
I am struggling to reproduce the problem with the required debug information; so far 228 passes without a single failure. Thanks.
---- Additional Comments From skodati@in.ibm.com 2005-07-20 12:04 EDT -------
Finally, I was able to reproduce the problem once, after 296 iterations. I found some surprising results, though. ifenslave fails at ioctl(skfd, SIOCGIFFLAGS, &ifr2) in the following code:

	else if (abi_ver < 1) {
		/* The driver is using an old ABI, so we'll set the interface
		 * down to avoid any conflicts due to same IP/MAC */
		strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ);
		if (ioctl(skfd, SIOCGIFFLAGS, &ifr2) < 0) {   <-- HERE
			int saved_errno = errno;
			fprintf(stderr, "SIOCGIFFLAGS on %s failed: %s ",
				slave_ifname, strerror(saved_errno));
		}

Strangely, this check is done when abi_ver < 1, but all the logs suggest the abi_ver it received is 2. I am debugging the problem further. Thanks.
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-07-21 15:23 EDT ------- If ifenslave itself is failing, isn't that a different failure than the original problem (in which ifenslave would never be called)? The failure you cite seems very strange; I think you'd have to have an old bonding driver installed to follow that path.
---- Additional Comments From skodati@in.ibm.com 2005-07-21 15:32 EDT -------
(In reply to comment #32)
> If ifenslave itself is failing, isn't that a different failure than the
> original problem (in which ifenslave would never be called)?

True. In the case where there was a failure to add eth1, ioctl(skfd, SIOCGIFFLAGS, &ifr2) fails with the following error:

SIOCGIFFLAGS on eth1 failed: No such device

> The failure you cite seems very strange; I think you'd have to have an old
> bonding driver installed to follow that path.

Sorry for the confusion; I think I overlooked another instance of the code where a similar check exists even for abi_ver 2.
---- Additional Comments From skodati@in.ibm.com 2005-07-26 17:33 EDT -------
I think the prime reason the problem appears is a very slight time delay between eth1 coming up and the device being attached to the bonding device. When eth1 is attached to bond0 through ifenslave in /sbin/ifup (/sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE), it returns 1 with the failure (SIOCGIFFLAGS on eth1 failed: No such device). But from the logs I could see that eth1 came up just a moment later. To verify this, I tested with a small patch that checks the return status of the ifenslave call in /sbin/ifup and retries attaching eth1; it always succeeded. I can see 894 passes so far without any failure. A possible workaround reflecting this testing is to check the status of bonding toward the end of the init scripts and restart bond0. I had an ST chat with Sanjay today and we will discuss this workaround tomorrow (27th July). Thanks.
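The retry logic described in this comment can be sketched as a generic shell helper. This is a hypothetical illustration, not the actual test patch: the retry count, delay, and function name are invented, and only the ifenslave invocation in the comment is quoted from /sbin/ifup.

```shell
#!/bin/sh
# Sketch: re-run a command a few times if it fails, since eth1 may
# come up a moment after the first ifenslave attempt. The retry count
# and delay are illustrative, not the values from the actual patch.
RETRY_DELAY="${RETRY_DELAY:-1}"

retry() {
    # $1: number of attempts; remaining args: the command to run
    tries="$1"; shift
    n=0
    while [ "$n" -lt "$tries" ]; do
        "$@" && return 0      # succeeded, stop retrying
        n=$((n + 1))
        sleep "$RETRY_DELAY"  # give the device time to be probed
    done
    return 1                  # all attempts failed
}

# In /sbin/ifup the guarded call would look roughly like:
#   retry 3 /sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE
```

The key property is that a single transient "No such device" failure no longer aborts the enslave; only a persistent failure propagates an error status.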
changed:
           What      |Removed     |Added
----------------------------------------------------------------------------
           Status    |ASSIGNED    |NEEDINFO

------- Additional Comments From skodati@in.ibm.com 2005-08-01 10:25 EDT -------
Please update the report with your comments. Keeping the report in NEEDINFO. Thanks.
---- Additional Comments From skodati@in.ibm.com 2005-08-16 16:13 EDT -------
Novell, any comments or suggestions on this bug report? Thanks.
Created attachment 46400 [details] "gai.debug"
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-08-18 01:59 EDT -------
getcfg trace file

I did some tests today on one of the problem machines. Right now, my best guess is that whatever is loading the modules is loading e1000 and bonding in parallel, causing the probe of eth0 and eth1 by e1000 to overlap with the loop in the ifup of bond0 that looks for them. It is unclear to me what agent actually performs the modprobe for e1000; it doesn't appear to happen in the main loop of /etc/init.d/network. Any init script gurus want to chime in here?

Since I'm coming in over the network, it's hard to test the theory that getcfg- itself might trigger a hotplug event or something to load the driver, although trying that scenario on a system I have locally doesn't cause the driver to load for a getcfg- query.

I did tinker with an install line in /etc/modprobe.conf.local for e1000, as follows:

install e1000 /sbin/modprobe --ignore-install e1000 && { logger -s -p kern.warning e1000 sleep 5 ; sleep 5 ; logger -s -p kern.warning sleep done ; }

At the time I thought this might give the driver time to finish probing, but it didn't make any difference. It did produce the "sleep done" message in /var/log/messages, however:

Aug 17 15:53:01 fvt10-mds6 /etc/hotplug/pci.agent[2280]: logger: sleep done
Aug 17 15:53:01 fvt10-mds6 logger: sleep done

The message coming from the hotplug pci.agent suggests that hotplug is loading e1000, but I still don't know what the mechanism is. The interleaved messages from bonding and e1000 appear in /var/log/messages as follows:

17:30:50 x kernel: Intel(R) PRO/1000 Network Driver - version 5.2.39
17:30:50 x kernel: Copyright (c) 1999-2004 Intel Corporation.
17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.0[A] -> GSI 29 (level, low) -> IRQ 29
17:30:50 x kernel: Ethernet Channel Bonding Driver: v2.6.0 (January 14, 2004)
17:30:50 x kernel: bonding: MII link monitoring set to 100 ms
17:30:50 x kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.1[B] -> GSI 30 (level, low) -> IRQ 30
17:30:50 x kernel: e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
17:30:50 x kernel: bonding: bond0: enslaving eth0 as a backup interface with a down link.
17:30:50 x kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full Duplex
17:30:50 x kernel: bonding: bond0: link status definitely up for interface eth0.
17:30:50 x kernel: bonding: bond0: making interface eth0 the new active one.

Also, I do not believe that the get_all_interfaces function called by getcfg-interface is the root of the problem. I think I discussed this possibility with somebody, but I don't recall who. I instrumented /lib/libgetconfig and /sbin/getcfg and had them print all sorts of deep, meaningful stuff. Among the tidbits: only the failing getcfg-interface call enters get_all_interfaces; the successful calls don't get that far.
In the attached trace, the two getcfg-interface calls for the slave devices are pids 2572 and 2573, grepped here for your convenience:

getcfg 2572 /sbin/getcfg-interface -- bus-pci-0000:06:08.1
2572 case g_i: rv 0 from split_hwdesc
2572 case g_i: rv 0 from complete_hwdesc_sysfs
2572 get_all_interfaces net bus-pci-0000:06:08.1
2572 iface: 'bond0' cfg: 'bond0' mcfg: 'bus-pci-0000:06:08.1'
2572 iface: 'eth0' cfg: 'eth-id-00:09:6b:f1:ac:06' mcfg: 'bus-pci-0000:06:08.1'
2572 iface: 'lo' cfg: 'lo' mcfg: 'bus-pci-0000:06:08.1'
2572 ret: cl net cfgname bus-pci-0000:06:08.1 ifs r 0
2572 get_all_if ret !=1 iflist exit 11
getcfg 2573 /sbin/getcfg-interface -- bus-pci-0000:06:08.0
2573 case g_i: rv 0 from split_hwdesc
2573 case g_i: rv 0 from complete_hwdesc_sysfs
2573 case g_i: match_type: h->iface eth0 h->devtype eth iftype net

Without the source handy that might not make much sense, but 2572 is the "exit 11" failure case for eth1; 2573 succeeds for eth0. Note that there is a lot of interleaving in the trace file; I'm not sure how much of that is real and how much is an artifact of buffering in fprintf (I set it to unbuffered, but who knows). I have a fair level of faith in the interleaving of the bonding/e1000 kernel messages, since they line up that way in dmesg right from the kernel printk.

Lastly, note that in /etc/init.d/network, it doesn't appear that the e1000 driver is loaded until bonding is initialized, so the WAIT_FOR_INTERFACES loop won't make any difference (as I read it; I might be mistaken here, but it doesn't look like the e1000 devices will be put into the MANDATORY list, because their STARTMODE is "off").
I don't understand your problem, because this bug report is really messed up. Attachments don't fit their descriptions, and there are a lot of useless lines which make it hard to find the relevant parts. Please excuse me, but could you describe in a few lines what the problem(s) is (are)? You write something about a getcfg-interface problem and also about an ifenslave failure. So please, one after another:

1) What do you want to set up?
2) What is your configuration for that?
3) What exactly is the failure you see at first, and what is the state of all involved network interfaces?
4) Does it happen only at boot time, or is it reproducible when you set STARTMODE=manual in all ifcfg-* files and call 'rcnetwork start -o boot manual' later?
5) Do you see the problem with SP2 as well?

I looked again over this report and it might be just a timing problem (as far as I understood the report). The automatic determination of mandatory devices may fail. Go use the MANDATORY_DEVICES variable in the config file.
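The workaround suggested here amounts to a one-line change in /etc/sysconfig/network/config. A minimal sketch follows; the interface names and bus-pci ids are the ones appearing elsewhere in this report, and the surrounding file contents are omitted.

```shell
# /etc/sysconfig/network/config (excerpt, sketch)
# List the bonding slaves explicitly so the network script waits for
# them to appear before ifup bond0 runs its getcfg-interface loop.
MANDATORY_DEVICES="eth0 eth1"
# Persistent bus ids (from the getcfg trace in this report) work too:
# MANDATORY_DEVICES="bus-pci-0000:06:08.0 bus-pci-0000:06:08.1"
```

With the slaves listed, the WAIT_FOR_INTERFACES logic has something to wait for, so the race with the background e1000 modprobe no longer matters.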
------- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-08-18 13:06 EDT -------
(In reply to comment #39)
[...]
> You write something about a getcfg-interface problem and also about an
> ifenslave failure. So please, one after another:
> 1) What do you want to set up?

The system is trying to boot up and start bonding at boot time, with two e1000 devices in an active-backup configuration.

> 2) What is your configuration for that?

Somebody else needs to provide the details; all I know is that it's an SMP x86 system of some sort.

> 3) What exactly is the failure you see at first, and what is the state of all
> involved network interfaces?

At boot time, when /etc/init.d/network gets to the ifenslave part, it first runs a loop that does getcfg-interface on all of the devices listed as BONDING_SLAVEs in ifcfg-bond0. Very often, one of these getcfg-interface calls will fail with exit code 11. The initial suspicion was that there was something wrong with getcfg itself, but after yesterday's session, I believe the problem is that the e1000 module is being loaded simultaneously with the ifup bond0 / getcfg-interface loop, causing one of the interfaces to not yet be probed at the time getcfg-interface tries to look it up.

> 4) Does it happen only at boot time, or is it reproducible when you set
> STARTMODE=manual in all ifcfg-* files and call 'rcnetwork start -o boot manual'
> later?

Boot time for sure. I have not personally tried the other two.

> 5) Do you see the problem with SP2 as well?

Yes.

> I looked again over this report and it might be just a timing problem (as far
> as I understood the report). The automatic determination of mandatory devices
> may fail. Go use the MANDATORY_DEVICES variable in the config file.

Adding the slaves to MANDATORY_DEVICES does bring things up (at least after a couple of tries; I'm not sure if the submitters or assignee tried it more and still saw failures).
The slaves don't go into MANDATORY_DEVICES automatically because they're configured as "off". That doesn't explain the problem, though; I've never previously seen a case that required the bonding slaves to be added to MANDATORY_DEVICES by hand. It looks like something is running the modprobe of e1000 in the background.
Of course modprobe e1000 is running in the background. It is triggered via hotplug, and that is the reason why the network script waits for mandatory devices to be set up properly. The problem is to determine which of the available network devices are mandatory for the system. Either set the STARTMODE of the bonded interfaces to auto, or add their devices to MANDATORY_DEVICES.

So either I have to update the ifup manpage to make this understandable, or I have to check the configuration files of bonding or vlan interfaces for the devices they depend on and add those devices to the mandatory device list.

Another question: what error message do you see at boot time if bonding failed? Can you please attach the relevant part of /var/log/boot.msg? (Not the complete file, please.)
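The automatic approach described here, reading the BONDING_SLAVE* assignments out of an ifcfg-bond* file and adding them to the mandatory device list, might look roughly like the following. This is a hypothetical sketch with an invented helper name; the actual sysconfig fix (the get_slaves() function mentioned later in this report) may differ in detail.

```shell
#!/bin/sh
# Sketch: extract BONDING_SLAVE* values from an ifcfg-bondN file so
# they can be appended to MANDATORY_DEVICES automatically.
# get_slaves_sketch is a hypothetical name for illustration only.
get_slaves_sketch() {
    # $1: path to an ifcfg-bondN file
    # Print the value of every BONDING_SLAVE* assignment, one per
    # line, with surrounding quote characters stripped.
    sed -n 's/^BONDING_SLAVE[0-9]*=//p' "$1" | tr -d "\"'"
}

# Usage sketch within the init script:
#   MANDATORY_DEVICES="$MANDATORY_DEVICES $(get_slaves_sketch /etc/sysconfig/network/ifcfg-bond0)"
```

Parsing the assignments with sed rather than sourcing the file avoids executing anything from the config and keeps the helper side-effect free.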
------- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-08-19 14:32 EDT -------
(In reply to comment #41)
> ---- Additional Comments From zoz@suse.de 2005-08-19 01:13 MST -------
> Of course modprobe e1000 is running in the background. This is triggered via
> hotplug. And that is the reason why the network script waits for mandatory
> devices to be set up properly.

I'm having some trouble seeing why this ever works correctly (except by luck), unless something has changed very recently, since the modprobe of the driver would presumably always race with the getcfg-interface loop in /etc/init.d/network.

> The problem is to determine which of the available network devices are
> mandatory for the system. Either you set the STARTMODE of the bonded
> interfaces to auto or add their devices to MANDATORY_DEVICES.

Doing so (STARTMODE auto, or explicitly adding to MANDATORY_DEVICES) has never been necessary in my past experience, and apparently not in SuSE's, either, since the documentation found at http://portal.suse.com/sdb/en/2004/09/tami_sles9_bonding_setup.html says to remove the slave device ifcfg-eth-* files, which, if I'm reading the code correctly, would exclude them from consideration as detected MANDATORY_DEVICES in /etc/init.d/network. That document does mention adding to MODULES_LOADED_ON_BOOT (which I have not tried) and WAIT_FOR_INTERFACES (which doesn't help unless the devices are MANDATORY).

The documentation distributed for bonding differs; it says to keep the slave ifcfg-eth-* files, but set them to STARTMODE=off. That text was based on a mailing list posting (which I can't find at the moment), but I've never seen (or had a previous report of) this particular problem when following the bonding.txt instructions.
> So either I have to update the ifup manpage to make this understandable, or I
> have to check the configuration files of bonding or vlan interfaces for the
> devices they depend on and add those devices to the mandatory device list.

I just checked what appears to be the current sysconfig on ftp.suse.com, version 0.32.0, and it does have a new(?) ifcfg-bonding.5 man page that has some good stuff in it, but it doesn't describe any special steps related to setting up the slave configurations.

FWIW, the most recent bonding.txt is always kept at http://sourceforge.net/projects/bonding; it is likely to be more up to date than what's in the kernel source. I don't know if you want to add that to the manual page or not (as external links may come and go over the long term).

I think it would be most intuitive for end users for the init script itself to wait for the slave devices to become ready (treat them as MANDATORY, or possibly add a "wait for ready" type loop into the bonding device check, but that might be too much code duplication).

> Another question: What error message do you see at boot time if bonding
> failed? Can you please attach the relevant part of /var/log/boot.msg?

It's short, I'll just paste it in here:

Setting up network interfaces:
    lo
    lo        IP address: 127.0.0.1/8           done
    bond0
    bond0     Could not get an interface for slave device 'bus-pci-0000:06:08.1'
    bond0     IP address: 192.168.10.124/16 as bonding master
              enslaving eth0
              eth0 is already a slave

Using an "eth" type name in the BONDING_SLAVE variable doesn't make any difference; it still fails (although the message is a bit different), presumably because the device hasn't been probed by e1000 at that point.
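For reference, the ifcfg-bond0 discussed throughout this report would look roughly like the sketch below. Only the bus-pci slave ids and the 192.168.10.124/16 address are taken from this report; the other values are illustrative, and further variables are omitted.

```shell
# /etc/sysconfig/network/ifcfg-bond0 (sketch, abbreviated)
STARTMODE='onboot'            # illustrative value
BOOTPROTO='static'            # illustrative value
IPADDR='192.168.10.124/16'    # address from the boot.msg excerpt
# Slaves by persistent bus id (Novell doc style) ...
BONDING_SLAVE0='bus-pci-0000:06:08.0'
BONDING_SLAVE1='bus-pci-0000:06:08.1'
# ... or by interface name (bonding.txt style):
# BONDING_SLAVE0='eth0'
# BONDING_SLAVE1='eth1'
```

As the chat at the top of the report notes, either naming style works; the bus-pci form is independent of module load order, which matters in exactly the race being debugged here.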
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-08-23 18:26 EDT -------
Any updates? Submitters, do you have a viable workaround at this point (using MANDATORY_DEVICES in /etc/sysconfig/network/config, or something else)? SuSE, any update on a long-term fix?
> I'm having some trouble seeing why this ever works correctly (except by luck),
> unless something has changed very recently, since the modprobe of the driver
> would presumably always race with the getcfg-interface loop in
> /etc/init.d/network.

That's why we have the loop. We have to wait sometimes. Further, the article in the SUSE portal is not completely correct. I will speak to Tami to correct this. And yes, I will make BONDING_SLAVES mandatory automatically. But it will take some time, since I'm very busy with SL 10.0 currently.
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com) 2005-10-12 12:18 EDT ------- SuSE, any updates on when the fix should appear?
WIP. I'm just testing the code. Will probably go to SP3 beta4.
Fixed for SLES9 SP3. Patches still need to go to SVN for next release.
Added patches to svn. Maybe worth a YOU update. The new function get_slaves() did not work well in all cases; added an improved version of this function to SP3 as well.
changed:
           What      |Removed     |Added
----------------------------------------------------------------------------
           Status    |ACCEPTED    |CLOSED

------- Additional Comments From thinh@us.ibm.com(prefers email via th2tran@austin.ibm.com) 2006-02-08 15:37 EDT -------
No response from the bug submitter for months. The fix is in SLES9 SP3. Closing. Please re-open if you can recreate this on SLES9 SP3. Thanks.