Bug 95834 - LTC16580 Bonding slaves are not mandatory devices by default (was: Ethernet bonding does not come up properly)
Status: VERIFIED FIXED
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Network
Version: unspecified
Hardware: All Linux
Priority: P5 - None   Severity: Normal
Target Milestone: ---
Assignee: Christian Zoz
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-07-08 22:35 UTC by LTC BugProxy
Modified: 2016-02-13 06:23 UTC
0 users

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
bonding.tar (130.00 KB, application/x-tar)
2005-07-08 22:36 UTC, LTC BugProxy
bonding-debug.patch (849 bytes, patch)
2005-07-08 22:36 UTC, LTC BugProxy
sysconfig-0.31.0-15.35.i586.rpm (256.51 KB, audio/x-pn-realaudio-plugin)
2005-07-08 22:37 UTC, LTC BugProxy
netwk_debug_minimum.patch (1.05 KB, patch)
2005-07-08 22:37 UTC, LTC BugProxy
"gai.debug" (17.87 KB, text/plain)
2005-08-18 06:06 UTC, LTC BugProxy

Description LTC BugProxy 2005-07-08 22:35:40 UTC
LTC Owner is: skodati@in.ibm.com
LTC Originator is: gsanjay@us.ibm.com


Ethernet bonding does not come up properly because getcfg-interface returns
error 11 during boot. The problem is intermittent, and it has happened on eth0,
eth1, or both interfaces. We have seen this problem on multiple x346 systems.



Provide output from "uname -a", if possible:
2.6.5-7.139-bigsmp

Hardware Environment
    Machine type : x345
    Cpu type IA-32
    Describe any special hardware you think might be relevant to this problem:
    Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) Rev-01

Is this reproducible?
    Intermittent problem.
Describe the steps:
    Set up bonding using the procedure at
http://support.novell.com/techcenter/tips/10046.html.
    Reboot the system in a loop; you may see this problem intermittently.


Additional information:
Sametime chat transcript:

adcosta@us.ibm...	Hi
adcosta@us.ib...	Should we raise an LTC defect regarding getcfg failure 
during boot time bonding ?
Jay Vosburgh	Probably, yes.  I'm not sure who would look into it right 
offhand.

Jay Vosburgh	Is the e1000 in question on a PCI card?  If so, I'd try 
swapping it out just to make sure it's not a hardware problem with a 
particular card
adcosta@us.ib...	The problem happens intermittently. Could that still be 
a hardware problem? Also, we are seeing the same thing on many SLES9 
machines
adcosta@us.ib...	The upcoming sanfs release relies heavily on bonding 
for HA.
Jay Vosburgh	Well, if it's on several machines, it's probably not a failed 
card.  Might still be a firmware type of problem, but that's less likely.
Jay Vosburgh	From what I can tell, it looks like a problem with the one 
interface not being initialized properly.
Jay Vosburgh	Do you see the same problem with the interfaces not coming up 
even if bonding is not configured?
adcosta@us.ib...	The getcfg works after the boot even if it had failed 
during ifup
adcosta@us.ib...	No, the interfaces come up without bonding
Jay Vosburgh	Interesting.
adcosta@us.ib...	Also, we captured the retcode from getcfg in ifup & it 
is 11
Jay Vosburgh	Yah, I haven't looked at the getcfg source to see what that 11 
might mean
adcosta@us.ib...	So for setup documentation, should users be pointed to 
the sourceforge.net doc for SLES9 bonding setup?
Jay Vosburgh	Sure

adcosta@us.ib...	I noted that the Novell documentation uses the bus-pci 
names directly in the BONDING_SLAVE variables whereas at sourceforge it is 
ethN values,..
Jay Vosburgh	Yah, either one works; I have a note to update the bonding.txt 
(as the bus id names give module load order independence)
adcosta@us.ib...	Ok, I will ask Sanjay to raise a LTC defect. Thanks so 
much for your help.
Jay Vosburgh	Sure thing




web site for bonding and where doc is

http://sourceforge.net/projects/bonding

click on documentation

Sanjay, the system is running an older kernel version.  Please install sles 9
SP2 RC2 (2.6.5-7.183) to see if the problem recreates.

The ISOs are available on ftp3.linux.ibm.com in '/suse/beta_cds/sles-9-sp2/i386/RC':
-rw-r-----   1 root     suse     521572352 Jun  8 10:56 SLES-9-SP-2-i386-RC2-CD1.iso
-rw-r-----   1 root     suse     660611072 Jun  8 10:57 SLES-9-SP-2-i386-RC2-CD2.iso
-rw-r-----   1 root     suse     657164288 Jun  8 10:58 SLES-9-SP-2-i386-RC2-CD3.iso

Please attach complete /var/log/messages collected from the failure system and
messages printed on console (if any). Thanks. 

Created an attachment (id=11038)
config info and system logs

This is the config info and boot logs that I looked through.  Note that
"aftr_boot" is after a failure, and "restart" is after a network restart that
comes up correctly.

Error 11 is returned from getcfg-interface when get_all_interfaces(), called
from getcfg (case get_interface), returns any value other than 1.

from tools/get_config.c

...
    555             if (1 == get_all_interfaces(interfacetype, hwdesc->hwdesc,
    556                                        interfacelist)) {
    557                 if (verbosity == 0)
    558                     printf("%s\n", interfacelist);
    559                 else
    560                     printf("%s (indirekt)\n", interfacelist);
    561                 return 0;
    562             }
    563             return 11;
....

get_all_interfaces() (from get_config_lib.c) will return a value other than 1
under many circumstances. In most cases an error is logged, except that it can
return 0 when it fails to complete one iteration within dlist_for_each_data(){}.
From the boot messages it is more likely that it failed to get the details for
eth1.

I am planning to prepare a debug patch to capture all error codes, and hopefully
we can get more information with it. 

Thanks. 

Created an attachment (id=11046)
debug patch to identify the problem 

Please let me know if you have any problems applying the patch and rebuilding
the RPMs. I will be glad to assist you in building binary RPMs with the patch.
Thanks. 

Created an attachment (id=11052)
binary rpm with the patch


Sanjay, Please update the report with details of debug messages as requested by
Jay, by setting DEBUG to "yes" in /etc/sysconfig/network/config.
Thanks. 

Created an attachment (id=11259)
debug patch to add timestamps

I had a phone conversation with Sanjay, and he explained the practical problems
with the workaround, since it adds a delay to the boot process.

I think one option to resolve the problem is to add eth1 and eth0 as
MANDATORY_DEVICES in /etc/sysconfig/network/config. That will ensure that the
given interfaces are up and running.
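As a sketch only, the MANDATORY_DEVICES workaround would look roughly like the
fragment below. The device list matches this report's setup, but the
WAIT_FOR_INTERFACES value is an illustrative assumption, not taken from the
affected systems:

```shell
# /etc/sysconfig/network/config (fragment, hypothetical example)
# Force the network script to wait for the bonding slaves at boot
# instead of racing the driver probe.
MANDATORY_DEVICES="eth0 eth1"

# Seconds to wait for mandatory devices (value here is illustrative).
WAIT_FOR_INTERFACES=30
```

The bus-pci-* hardware descriptions seen later in this report could be used in
place of the ethN names, since they do not depend on module load order.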

I am attaching a minimal debug patch to timestamp the interface bring-up times.
I decided to keep it minimal after concerns that too much debug information
might cause the problem to disappear.

Sanjay, apply the following patch to /etc/init.d/network and attach the file
/etc/ltdebug to the report. Please provide the details of success or failure of
bonding for each iteration.

Thanks. 

Sanjay asked me to run the tests on lab machines; I am waiting for the machine
details and lab access. Moving the report to NEEDINFO. 
Thanks.
Comment 1 LTC BugProxy 2005-07-08 22:36:37 UTC
Created attachment 41511 [details]
bonding.tar

IBM attachment id 11038
Comment 2 LTC BugProxy 2005-07-08 22:36:55 UTC
Created attachment 41512 [details]
bonding-debug.patch
Comment 3 LTC BugProxy 2005-07-08 22:37:28 UTC
Created attachment 41514 [details]
sysconfig-0.31.0-15.35.i586.rpm
Comment 4 LTC BugProxy 2005-07-08 22:37:52 UTC
Created attachment 41515 [details]
netwk_debug_minimum.patch
Comment 5 LTC BugProxy 2005-07-15 20:11:27 UTC
---- Additional Comments From skodati@in.ibm.com  2005-07-15 16:02 EDT -------
This bug is taking longer than expected to conclude, partly because of the low
reproduction rate with debug messages enabled. 
ifenslave returns 1 when there is a failure to attach the device; when invoked
with the -v option it should have printed the reason for the failure. There is
only one place in the ifenslave code where no debug message is printed: when
the ABI version is not valid. I am rerunning the testcases after adding an
error message for that failure. 
Thanks. 
Comment 6 LTC BugProxy 2005-07-19 19:00:50 UTC
---- Additional Comments From skodati@in.ibm.com  2005-07-19 14:54 EDT -------
I am struggling to reproduce the problem with the required debug information;
so far, 228 passes without a single failure.
Thanks. 
Comment 7 LTC BugProxy 2005-07-20 16:12:57 UTC
---- Additional Comments From skodati@in.ibm.com  2005-07-20 12:04 EDT -------
I was finally able to reproduce the problem once, after 296 iterations. I found
some surprising results, though.

ifenslave fails at ioctl(skfd, SIOCGIFFLAGS, &ifr2) in the following code:

                        else if (abi_ver < 1) {
                                /* The driver is using an old ABI, so we'll set
                                 * the interface down to avoid any conflicts
                                 * due to same IP/MAC
                                 */
                                strncpy(ifr2.ifr_name, slave_ifname, IFNAMSIZ);
                                if (ioctl(skfd, SIOCGIFFLAGS, &ifr2) < 0) { /* <-- HERE */
                                        int saved_errno = errno;
                                        fprintf(stderr,
                                                "SIOCGIFFLAGS on %s failed: %s\n",
                                                slave_ifname,
                                                strerror(saved_errno));
                                }

Strangely, this check is done when abi_ver < 1, but all the logs suggest the
abi_ver it received is 2. I am debugging the problem further. 
Thanks. 
Comment 8 LTC BugProxy 2005-07-21 19:30:23 UTC
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-07-21 15:23 EDT -------
If ifenslave itself is failing, isn't that a different failure than the original
problem (in which ifenslave would never be called)?

The failure you cite seems very strange; I think you'd have to have an old
bonding driver installed to follow that path. 
Comment 9 LTC BugProxy 2005-07-21 19:35:48 UTC
---- Additional Comments From skodati@in.ibm.com  2005-07-21 15:32 EDT -------
(In reply to comment #32)
> If ifenslave itself is failing, isn't that a different failure than the original
> problem (in which ifenslave would never be called)?

True. In the case where there was a failure to add eth1, ioctl(skfd,
SIOCGIFFLAGS, &ifr2) fails with the following error:

SIOCGIFFLAGS on eth1 failed: No such device

> The failure you cite seems very strange; I think you'd have to have an old
> bonding driver installed to follow that path.

Sorry for the confusion; I think I overlooked another instance of the code
where a similar check exists even for abi_ver 2. 
Comment 10 LTC BugProxy 2005-07-26 21:41:53 UTC
---- Additional Comments From skodati@in.ibm.com  2005-07-26 17:33 EDT -------
I think the primary reason the problem appears is a very slight delay between
eth1 coming up and the attempt to attach it to the bonding device. 

When eth1 is attached to bond0 through ifenslave in /sbin/ifup
( /sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE ), it returns 1 with
the failure ( SIOCGIFFLAGS on eth1 failed: No such device ).

But from the logs I could see that eth1 came up just slightly later. To verify
this I tested with a small patch which checks the return status of ifenslave
( /sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE ) in /sbin/ifup and
retries attaching eth1; it always succeeded. I have seen 894 passes so far
without any failure. 

A possible workaround reflecting this testing is to check the status of bonding
towards the end of the init scripts and restart bond0 if necessary. 
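The retry approach described above can be sketched as a small shell helper.
This is an illustration only, not the actual patch: retry_enslave and the
retry count are invented here, and the real change would wrap the ifenslave
call inside /sbin/ifup:

```shell
# Hypothetical sketch of the retry workaround (not the actual patch).
# Retries a command a few times, since the slave device may appear a
# moment after the first attempt.
retry_enslave() {
    # $1 = max attempts; remaining args = command to run.
    max=$1; shift
    n=1
    while ! "$@"; do
        [ "$n" -ge "$max" ] && return 1
        n=$((n + 1))
        sleep 1     # give the driver a moment to finish probing the slave
    done
    return 0
}

# In /sbin/ifup this would wrap the existing call, something like:
#   retry_enslave 3 /sbin/ifenslave -v $BONDING_OPTIONS $INTERFACE $BSIFACE
```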

I had a Sametime chat with Sanjay today, and we will discuss this workaround
tomorrow (27th July). 

Thanks. 
Comment 11 LTC BugProxy 2005-08-01 14:36:03 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO




------- Additional Comments From skodati@in.ibm.com  2005-08-01 10:25 EDT -------
Please update the report with your comments. Keeping the report in NEEDINFO. Thanks 
Comment 12 LTC BugProxy 2005-08-16 20:23:01 UTC
---- Additional Comments From skodati@in.ibm.com  2005-08-16 16:13 EDT -------
Novell, any comments or suggestions on this bug report? Thanks. 
Comment 13 LTC BugProxy 2005-08-18 06:06:53 UTC
Created attachment 46400 [details]
"gai.debug"
Comment 14 LTC BugProxy 2005-08-18 06:06:55 UTC
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-08-18 01:59 EDT -------
 
getcfg trace file

I did some tests today on one of the problem machines. Right now, my best
guess is that whatever is loading the modules is loading e1000 and bonding in
parallel, causing the probe of eth0 and eth1 by e1000 to overlap with the loop
in the ifup of bond0 that is trying to look for them.

It is unclear to me what agent actually performs the modprobe for e1000; it
doesn't appear to happen in the main loop of /etc/init.d/network.  Any init
script gurus want to chime in here?  Since I'm coming in over the network, it's
hard to test the theory that the getcfg- itself might trigger a hotplug event
or something to load the driver, although trying that scenario on a system I
have locally doesn't cause the driver to load for a getcfg- query. 

I did tinker with an install line in /etc/modprobe.conf.local for e1000, as
follows:

install e1000 /sbin/modprobe --ignore-install e1000 && { logger -s -p kern.warning e1000 sleep 5 ; sleep 5 ; logger -s -p kern.warning sleep done ; }


At the time I thought this might give the driver time to finish probing, but it
didn't make any difference.  It did produce the "sleep done" message in
/var/log/messages as follows, however:

Aug 17 15:53:01 fvt10-mds6 /etc/hotplug/pci.agent[2280]: logger: sleep done
Aug 17 15:53:01 fvt10-mds6 logger: sleep done

The message coming from the hotplug pci.agent suggests that hotplug is loading
e1000, but I still don't know what the mechanism is.

The interleaved messages from bonding and e1000 appear in /var/log/messages as
follows:

17:30:50 x kernel: Intel(R) PRO/1000 Network Driver - version 5.2.39
17:30:50 x kernel: Copyright (c) 1999-2004 Intel Corporation.
17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.0[A] -> GSI 29 (level, low)
-> IRQ 29
17:30:50 x kernel: Ethernet Channel Bonding Driver: v2.6.0 (January 14, 2004)
17:30:50 x kernel: bonding: MII link monitoring set to 100 ms
17:30:50 x kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network
Connection
17:30:50 x kernel: ACPI: PCI interrupt 0000:06:08.1[B] -> GSI 30 (level, low)
-> IRQ 30
17:30:50 x kernel: e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network
Connection
17:30:50 x kernel: bonding: bond0: enslaving eth0 as a backup interface with a
down link.
17:30:50 x kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 100 Mbps Full
Duplex
17:30:50 x kernel: bonding: bond0: link status definitely up for interface
eth0.
17:30:50 x kernel: bonding: bond0: making interface eth0 the new active one.

Also, I do not believe that the get_all_interfaces function called by
getcfg-interface is the root of the problem.  I think I discussed this
possibility with somebody, but I don't recall who.  I instrumented
/lib/libgetconfig and /sbin/getcfg and had it print all sorts of deep,
meaningful stuff.  Among the tidbits is that only the failing getcfg-interface
call enters get_all_interfaces; the successful calls don't get that far.

In the attached trace, the two getcfg-interface calls for the slave devices are
pids 2572 and 2573, grepped here for your convenience:

getcfg 2572  /sbin/getcfg-interface -- bus-pci-0000:06:08.1
  2572 case g_i: rv 0 from split_hwdesc
  2572 case g_i: rv 0 from complete_hwdesc_sysfs
  2572 get_all_interfaces net bus-pci-0000:06:08.1 
  2572 iface: 'bond0'  cfg: 'bond0' mcfg: 'bus-pci-0000:06:08.1'
  2572 iface: 'eth0'  cfg: 'eth-id-00:09:6b:f1:ac:06' mcfg:
'bus-pci-0000:06:08.1'
  2572 iface: 'lo'  cfg: 'lo' mcfg: 'bus-pci-0000:06:08.1'
  2572 ret: cl net cfgname bus-pci-0000:06:08.1 ifs  r 0
  2572 get_all_if ret !=1 iflist  exit 11

getcfg 2573  /sbin/getcfg-interface -- bus-pci-0000:06:08.0
  2573 case g_i: rv 0 from split_hwdesc
  2573 case g_i: rv 0 from complete_hwdesc_sysfs
  2573 case g_i: match_type: h->iface eth0 h->devtype eth iftype net

Without the source handy that might not make much sense, but 2572 is the "exit
11" failure case for eth1; 2573 succeeds for eth0.  Note that there is a lot of
interleaving in the trace file; I'm not sure how much of that is real and how
much is an artifact of buffering in fprintf (I set it to unbuffered, but who
knows).  I have a fair level of faith in the interleaving of the bonding/e1000
kernel messages, since they line up that way in dmesg right from the kernel
printk.

Lastly, note that in /etc/init.d/network, the e1000 driver does not appear to
be loaded until bonding is initialized, so the WAIT_FOR_INTERFACES loop won't
make any difference (as I read it; I might be mistaken here, but it doesn't
look like the e1000 devices will be put into the MANDATORY list, because their
STARTMODE is "off"). 
Comment 15 Christian Zoz 2005-08-18 07:31:03 UTC
I don't understand your problem, because this bug report is really messed up.
Attachments don't fit their descriptions, a lot of useless lines make it hard
to find the relevant parts, and so on. Please excuse me, but could you
describe in a few lines what the problem(s) is (are)?

You write something about a getcfg-interface problem and also about an
ifenslave failure. So please, one after another:
1) What do you want to set up?
2) What is your configuration for that?
3) What _exactly_ is the failure you see at first, and what is the state of all 
   involved network interfaces?
4) Does it happen only at boot time, or is it reproducible when you set 
   STARTMODE=manual in all ifcfg-* files and call 
   'rcnetwork start -o boot manual' later?
5) Do you see the problem with SP2 as well?

I looked again over this report, and it might be just a timing problem (as far
as I understood the report). The automatic determination of mandatory devices
may fail. Please use the MANDATORY_DEVICES variable in the config file.
Comment 16 LTC BugProxy 2005-08-18 17:12:31 UTC
------- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-08-18 13:06 EDT -------
(In reply to comment #39)
[...]
> You write something about a getcfg-interface problem and also about an
> ifenslave failure. So please, one after another:
> 1) What do you want to set up?

The system is trying to boot up and start bonding at boot time, with two e1000
devices in an active-backup configuration.

> 2) What is your configuration for that?

Somebody else needs to provide whatever details; all I know is that it's a SMP
x86 system of some sort.

> 3) What _exact_ is the failiure you see at first and what is the state of all 
>    involved network interfaces?

At boot time, when /etc/init.d/network gets to the ifenslave part, it first
runs a loop that does getcfg-interface on all of the devices listed as
BONDING_SLAVEs in ifcfg-bond0.

Very often, one of these getcfg-interface calls will fail with exit code 11.

The initial suspicion was that there was something wrong with getcfg itself, but
after yesterday's session, I believe the problem is that the e1000 module is
being loaded simultaneously with the ifup bond0 / getcfg-interface loop, causing
one of the interfaces to not be probed at the time getcfg-interface tries to
look it up.


> 4) Does it happen only at boot time, or is it reproducible when you set 
>    STARTMODE=manual in all ifcfg-* files and call 
>    'rcnetwork start -o boot manual' later?

Boot time for sure.  I have not personally tried the other two.

> 5) Do you see the problem with SP2 as well?

Yes.

> I looked again over this report and it might be just a timing problem (as far as
> i understood the report). The automatic determination of mandatory devices may
> fail. Go use the MANDATORY_DEVICES variable in the config file.

Adding the slaves to MANDATORY_DEVICES does bring things up (at least after a
couple of tries; I'm not sure if the submitters or assignee tried it more and
still saw failures). The slaves don't go into MANDATORY_DEVICES automatically
because they're configured as "off."

That doesn't explain the problem, though; I've never previously seen a case that
required the bonding slaves to be added to MANDATORY_DEVICES by hand.  It looks
like something is running the modprobe of e1000 in the background. 
Comment 17 Christian Zoz 2005-08-19 07:13:01 UTC
Of course modprobe e1000 is running in the background. It is triggered via
hotplug, and that is the reason why the network script waits for mandatory
devices to be set up properly.

The problem is to determine which of the available network devices are
mandatory for the system. Either set the STARTMODE of the bonded interfaces to
auto or add their devices to MANDATORY_DEVICES.
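As an illustration of the first option, a slave's ifcfg file could set
STARTMODE directly. The file name below is taken from this report's getcfg
trace, and BOOTPROTO='none' is an assumption for a bonding slave; treat this
as a sketch rather than a verified configuration:

```shell
# /etc/sysconfig/network/ifcfg-eth-id-00:09:6b:f1:ac:06 (fragment)
# Hypothetical example: STARTMODE='auto' (instead of 'off') lets the
# network script treat this slave as mandatory, so boot waits for the
# e1000 probe to finish instead of racing it.
BOOTPROTO='none'
STARTMODE='auto'
```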

So I either have to update the ifup manpage to make this understandable, or I
have to check the configuration files of bonding or VLAN interfaces for the
devices they depend on and add those devices to the mandatory device list.

Another question: what error message do you see at boot time when bonding
fails? Can you please attach the relevant part of /var/log/boot.msg? (Not the
complete file, please.)
Comment 18 LTC BugProxy 2005-08-19 18:38:36 UTC
------- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-08-19 14:32 EDT -------
(In reply to comment #41)
> ---- Additional Comments From zoz@suse.de  2005-08-19 01:13 MST -------
> Of course is modprobe e1000 running in background. This is triggered via
> hotplug. And that is the reason why the network script waits for mandatory
> devices to be set up properly.

I'm having some trouble seeing why this ever works correctly (except by luck),
unless something has changed very recently, since the modprobe of the driver
would presumably always race with the getcfg-interface loop in
/etc/init.d/network.  

> The problem is to determine which of the available network devices are mandatory
> for the system. Either you set the STARTMODE of the bonded interfces to auto or
> add their devices to MANDATORY_DEVICES.

Doing so (STARTMODE auto or explicitly adding to MANDATORY_DEVICES) has never
been necessary in my past experience, and apparently not in SuSE's either,
since the documentation found at
documentation found at

http://portal.suse.com/sdb/en/2004/09/tami_sles9_bonding_setup.html

says to remove the slave device ifcfg-eth-* files, which, if I'm reading the
code correctly, would exclude them from consideration as detected
MANDATORY_DEVICES in /etc/init.d/network.  That document does mention adding to
MODULES_LOADED_ON_BOOT (which I have not tried) and WAIT_FOR_INTERFACES (which
doesn't help unless the devices are MANDATORY).

The documentation distributed for bonding differs; it says to keep the slave
ifcfg-eth-* files, but set them as STARTMODE=off.  That text was based on a
mailing list posting (which I can't find at the moment), but I've never seen (or
had a previous report of) this particular problem following the bonding.txt
instructions.

> So i either had to update the ifup manpage to make this understandable or i have
> to check configuration files of bonding or vlan interfaces for the devices they
> depend on and add these devices to the mandatory device list.

I just checked what appears to be the current sysconfig on ftp.suse.com, version
0.32.0, and it does have a new(?) ifcfg-bonding.5 man page that has some good
stuff in it, but it doesn't describe any special steps related to setting up the
slave configurations.

FWIW, the most recent bonding.txt is always kept at

http://sourceforge.net/projects/bonding

it is likely to be more up to date than what's in the kernel source; I don't
know if you want to add that to the manual page or not (as external links may
come and go over the long term).

I think it would be most intuitive for end users for the init script itself to
wait for the slave devices to become ready (treat them as MANDATORY, or possibly
add a "wait for ready" type loop into the bonding device check, but that might
be too much code duplication).

> Another question: What error message do you see at boot time if bonding failed?
> Can you please attach the relevant part of /var/log/boot.msg? 

It's short, I'll just paste it in here:

Setting up network interfaces:
    lo       
    lo        IP address: 127.0.0.1/8   
done    bond0    
    bond0     Could not get an interface for slave device 'bus-pci-0000:06:08.1'
    bond0     IP address: 192.168.10.124/16   as bonding master 
              enslaving eth0
eth0 is already a slave

Using an "eth" type name in the BONDING_SLAVE variable doesn't make any
difference; it still fails (although the message is a bit different), presumably
because the device hasn't been probed by e1000 at that point. 
Comment 19 LTC BugProxy 2005-08-23 22:35:21 UTC
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-08-23 18:26 EDT -------
Any updates?

Submitters, do you have a viable workaround at this point (using
MANDATORY_DEVICES in /etc/sysconfig/network/config, or something else)?

SuSE, any update on a long-term fix? 
Comment 20 Christian Zoz 2005-08-24 10:32:55 UTC
> I'm having some trouble seeing why this ever works correctly (except by luck), 
> unless something has changed very recently, since the modprobe of the driver 
> would presumably always race with the getcfg-interface loop in
> /etc/init.d/network.

That's why we have the loop. We have to wait sometimes.

Furthermore, the article in the SUSE portal is not completely correct. I will
speak to Tami about correcting it.

And yes, I will make BONDING_SLAVES mandatory automatically. But it will take
some time, since I'm very busy with SL 10.0 currently.
Comment 21 LTC BugProxy 2005-10-12 16:22:29 UTC
---- Additional Comments From vosburgh@us.ibm.com(prefers email via fubar@us.ibm.com)  2005-10-12 12:18 EDT -------
SuSE, any updates on when the fix should appear? 
Comment 22 Christian Zoz 2005-10-13 10:24:59 UTC
WIP. I'm just testing the code. Will probably go to SP3 beta4.
Comment 23 Christian Zoz 2005-10-17 12:37:12 UTC
Fixed for SLES9 SP3.

Patches still need to go to SVN for next release.
Comment 24 Christian Zoz 2005-10-24 11:25:13 UTC
Added patches to SVN. Maybe worth a YOU update.

New function get_slaves() did not work well in all cases. Added improved version of this function to SP3 as well.
Comment 25 LTC BugProxy 2006-02-08 20:40:11 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ACCEPTED                    |CLOSED




------- Additional Comments From thinh@us.ibm.com(prefers email via th2tran@austin.ibm.com)  2006-02-08 15:37 EDT -------
No response from the bug submitter for months.
The fix is in SLES9 SP3. Closing.
Please re-open if you can recreate this on SLES9 SP3.

Thanks.