Bug 104379

Summary: nfs connection errors w/ beta1 and network started by ifplugd
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Matthias Hopf <mhopf>
Component: NetworkAssignee: Jaroslav Kysela <perex>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Major    
Priority: P5 - None    
Version: RC 1   
Target Milestone: ---   
Hardware: i686   
OS: All   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: Syslog from startup to test end
Ping log
Test script, just for completeness

Description Matthias Hopf 2005-08-12 12:45:13 UTC
On this laptop (Dell D410) with beta1 the automounter often hangs when I plug in
the network cable after booting (using ifplugd, restarting ypbind and autofs).
Trying to perform the same nfs mount by hand hangs as well.

Unfortunately this isn't exactly easily reproducable. Right now it the laptop is
in this state (and will remain in it until I boot it), and I haven't seen that
on other machines, thus I didn't set it on Critical. SL9.3 worked fine, and I
haven't seen it so far with networked plugged in on boot time, though I am not
sure about that.

Excert from /var/log/messages:

Aug 12 06:30:41 linux ifplugd(eth0)[4462]: client: Starting DHCP Client Daemon o
n eth0... . IP/Netmask: 10.10.101.229 / 255.255.0.0 
Aug 12 06:30:42 linux ifplugd(eth0)[4462]: Program executed successfully.
Aug 12 06:31:56 linux kernel: RPC: error 5 connecting to server hilbert
Aug 12 06:32:56 linux kernel: RPC: error 5 connecting to server hilbert
Aug 12 06:32:56 linux automount[9787]: >> mount: hilbert:/work: can't read super
block
Aug 12 06:32:56 linux automount[9787]: mount(nfs): nfs: mount failure hilbert:/w
ork on /mounts/work
Aug 12 06:32:56 linux automount[9787]: failed to mount /mounts/work
Aug 12 06:33:56 linux kernel: RPC: error 5 connecting to server hilbert
Aug 12 06:34:56 linux kernel: RPC: error 5 connecting to server hilbert
Aug 12 06:35:41 linux kernel: smb_add_request: request [ee7e9e40, mid=28040] tim
ed out!
Aug 12 06:35:56 linux kernel: RPC: error 5 connecting to server hilbert

Sometimes a manually invoked mount returns successfully, but an ls on that
directory just returns with an Input/Output error.

I've seen this behavior with both wotan and hilbert.

I'm not exactly sure whether this should be reported for the Network or the
Kernel component.
Comment 1 Forgotten User ZhJd0F0L3x 2005-08-12 16:06:21 UTC
it just does not work this way. If you want NFS / automount, you need a
permanent network connection. Or you need to stop the automounter before
unplugging the cable and restart after plugging the cable.

NFS is no offline filesystem. Ifplugd has nothing to do with this.
Comment 2 Forgotten User ZhJd0F0L3x 2005-08-12 16:07:02 UTC
NFS does not work without network. Design issue.
Comment 3 Matthias Hopf 2005-08-12 16:26:42 UTC
?!? As long as no NFS mount is requested from autofs (which isn't because there
are no automount tables), nfs is not invoked at all. Only after the network
cable is attached, and NIS configured and autofs restarted.

It also always seemed to work with 9.3.

Again, I boot the system w/o network, I plugin in ethernet, it gets configured
by ifplugd, I restart NIS and autofs (root console login), I try to log in (NIS
user), NFS hangs on the mount request. IMHO this shouldn't happen.
Comment 4 Dr. Werner Fink 2005-08-12 16:34:20 UTC
I'm not maintainer of ifplugd and I do not know how to
fix that.  Btw: It has always worked is no an argument,
may it was a bug.
Comment 5 Forgotten User ZhJd0F0L3x 2005-08-12 20:03:27 UTC
anyway, it has nothing to do with ifplugd. ifplugd just does "ifup $INTERFACE"
as soon as a cable is plugged in adn "ifdown $INTERFACE" after cable is unplugged.
If it is a bug, it is clearly not a bug in ifplugd since ifplugd obviously works
fine.

Just leave the bug assigned to me, i am on vacation for the next 2 weeks.
Comment 6 Matthias Hopf 2005-08-31 14:35:52 UTC
After installing Beta4 the issue remains, and it even got worse - but maybe this
is just coincidence:

In the post-installation network startup, dhcp failed and the machine didn't get
an ip address. After the runlevel change finished, I did

/etc/init.d/network restart
/etc/init.d/ypbind restart
/etc/init.d/autofs restart

and I got network and working homes. However, as soon as I cd to
/mounts/work/CDs, the mount sits there for some time, and then I get a

kernel: RPC: error 5 connecting to server hilbert
automount[x]: mount(nfs): nfs: mount failure hilbert:/work on /mounts/work

After some time (minutes) it works.
As this happened very often on this machine, I doubt that each time hilbert was
down.

Maybe this is an issue with the network driver, this machine has a Broadcom
BCM5751...
BTW - This is definitively not minor!
Comment 7 Christian Zoz 2005-09-01 12:04:07 UTC
I had this problems to, but on 9.3 and mostly then when i had no time to debug.
autofs/nfs told me that it could not resolve hostname of the server. Network
setup was ok. Sometimes i could not ping another host for a short period. Then
it worked again.

Maybe it is hardware related. A few days ago my network connetion was down
compeletly until i removed a cable from the hub to another laptop that was
switched off.
Comment 8 Forgotten User ZhJd0F0L3x 2005-09-02 20:06:24 UTC
this bug has not much to do with ifplugd - it is a generic networking bug
Comment 9 Andreas Jaeger 2005-09-05 09:42:00 UTC
Matthias, please check that your clock is correct, this could explain some of
the problems.

Olaf, further ideas?
Comment 10 Olaf Kirch 2005-09-05 10:01:23 UTC
I would also try restarting nscd after the network comes up. 
 
But I think in order to understand what is going on here, we need 
a tcpdump. 
 
Before plugging in the cable, start 
 
tcpdump -s0 -w /tmp/eth0.log -i eth0 
 
then plug in the cable, try to access nfs volume. If it doesn't hang 
unplug the cable, stop tcpdump, go back to square 1 :-) 
Repeat until you observe a hang. 
 
It may also be useful to enable NFS and RPC debugging on the client: 
 
echo 65535 > /proc/sys/sunrpc/rpc_debug 
echo 65535 > /proc/sys/sunrpc/nfs_debug 
 
make sure you turn that off again once you're done, as it creates quite 
a bit of debug output. 
Comment 11 Olaf Kirch 2005-09-05 19:28:41 UTC
Maybe Neil could help with this one? 
Comment 12 Chris L Mason 2005-09-05 20:26:25 UTC
Neil, please take a look.  Feel free to set this bug to NEEDINFO if you want to wait until 
the information Olaf requested has been provided. 
Comment 13 Neil Brown 2005-09-05 23:12:06 UTC
This looks a lot like a networking issue to me.  I would definitely be interested
in the tcpdump output, but it would be good if you could also try other network
operations like 'ping' and 'ssh' and report on how well they work.

Also, the output of "ifconfig -a" and "ip route" (or "netstat -nr") might
help.

I suspect you are using 'soft' mounts.  These a generally frowned upon, but
in the current sitation they mean you get IO errors instead of a hard hang,
which may be a win...
Comment 14 Matthias Hopf 2005-09-06 11:52:51 UTC
I don't know about the mount options. I tested using the standard autofs maps,
and mount by hand without any options, I'm currently not sure what the default is.

I'll try to reproduce and perform the tests. I'll attach the logfiles as soon as
I got something.
Comment 15 Matthias Hopf 2005-09-06 12:16:37 UTC
Ok, was just able to reproduce. Will attach logfiles now.

BTW - the date is completely berserk on this machine (some days off), so if that
could be the source of the problem... I'll configure ntp on this machine now.

Note that the logfiles will be generated with the misconfigured wall time clock.

1st I configured nfs and rpc debugging.
2nd I started tcpdump
3rd I plugged in the network cable
4th ifup eth0
5th rcypbind restart
6th rcautofs restart
7th cd /mounts/work/CDs (failed)
8th cd /suse/mhopf (failed)

The tcpdump logfile will be compressed due to its size (127MB).
Comment 16 Matthias Hopf 2005-09-06 12:17:23 UTC
Created attachment 48912 [details]
Syslog from startup to test end
Comment 17 Matthias Hopf 2005-09-06 12:18:05 UTC
Created attachment 48913 [details]
Ping log

Note that one ping packet was lost during the whole test.
Comment 18 Neil Brown 2005-09-07 00:41:57 UTC
Still waiting for the tcpdump... if it is too big for bugzilla,
just dump it somewhere (e.g. /suse/mhopf) and I'll pick it up.

Also it might help to see the output of 
   /usr/sbin/rpcinfo -p hilbert
   showmount -e hilbert

Thanks,
NeilBrown
Comment 19 Matthias Hopf 2005-09-07 10:02:57 UTC
I didn't notice that the browser gave me an error message due to the size of the
log.

You can now get it from ~mhopf/Export/eth0.log.gz

It's a hefty 127MB in uncompressed form.

ifconfig -a and ip route follow.
Comment 20 Matthias Hopf 2005-09-07 10:09:32 UTC
All captured not during the session I created the logs, but with a cd
/mounts/work/CDs blocking:

iproute -a:

10.10.0.0/16 dev eth0  proto kernel  scope link  src 10.10.101.229 
169.254.0.0/16 dev eth0  scope link 
127.0.0.0/8 dev lo  scope link 
default via 10.10.0.8 dev eth0 


ifconfig -a:


eth0      Link encap:Ethernet  HWaddr 00:12:3F:22:4C:FD  
          inet addr:10.10.101.229  Bcast:10.10.255.255  Mask:255.255.0.0
          inet6 addr: 2001:780:101:a00:212:3fff:fe22:4cfd/64 Scope:Global
          inet6 addr: fe80::212:3fff:fe22:4cfd/64 Scope:Link
          UP BROADCAST NOTRAILERS RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:1869 errors:0 dropped:0 overruns:0 frame:0
          TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:146522 (143.0 Kb)  TX bytes:6462 (6.3 Kb)
          Interrupt:169 

eth1      Link encap:Ethernet  HWaddr 00:12:F0:7E:12:20  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)
          Interrupt:201 Base address:0xc000 Memory:dfcff000-dfcfffff 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:270 errors:0 dropped:0 overruns:0 frame:0
          TX packets:270 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:22066 (21.5 Kb)  TX bytes:22066 (21.5 Kb)

sit0      Link encap:IPv6-in-IPv4  
          NOARP  MTU:1480  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)


rpcinfo -p hilbert:

   program vers proto   port
    100000    2   tcp    111  portmapper
    100000    2   udp    111  portmapper
    100004    2   udp    709  ypserv
    100004    1   udp    709  ypserv
    100004    2   tcp    712  ypserv
    100004    1   tcp    712  ypserv
    390113    1   tcp   7937
    100007    2   udp    791  ypbind
    100007    1   udp    791  ypbind
    100007    2   tcp    794  ypbind
    100007    1   tcp    794  ypbind
    100005    1   udp    649  mountd
    100005    2   udp    649  mountd
    100005    1   tcp    652  mountd
    100005    2   tcp    652  mountd
    100003    2   udp   2049  nfs
    100003    2   tcp   2049  nfs


showmount -e hilbert:

Export list for hilbert:
/patchinfo          10.10.0.0/255.255.0.0
/patchinfo          *.suse.de
/mbuild             *.hwlab.suse.de
/mbuild             10.10.0.0/255.255.0.0
/mbuild             *.suse.de
/work/local         (everyone)
/work/local         (everyone)
/work/users/uli     *.suse.de
/work/users/coolo   *.suse.de
/work/users/mmj     *.suse.de
/work/users/matz    *.suse.de
/work/users/adrian  *.suse.de
/work/users/aj      *.suse.de
/work/users/kukuk   *.suse.de
/work/tmp           <anon clnt>
/work/built         10.10.0.0/255.255.0.0
/work/built         *.hwlab.suse.de
/work/built         *.suse.de
/work/src/Copyright Boole.suse.de
/work/cd/pac        *.hwlab.suse.de
/work/cd/pac        *.suse.de
/work/src/done      *.hwlab.suse.de
/work/src/done      *.suse.de
/work/SRC/ChangeLog machcd4.suse.de
/work/SRC/ChangeLog machcd3.suse.de
/work/SRC/ChangeLog machcd2.suse.de
/work/SRC/ChangeLog machcd.suse.de
/work               10.11.0.0/255.255.0.0
/work               *.hwlab.suse.de
/work               10.10.0.0/255.255.0.0
/work               *.suse.de
/work               hilbert6.suse.de
/work               hilbert5.suse.de
/work               hilbert4.suse.de
/work               hilbert3.suse.de
/work               hilbert2.suse.de
/work               hilbert1.suse.de
/                   localhost

Comment 21 Neil Brown 2005-09-08 04:11:51 UTC
Well, it looks like everything is working perfectly EXCEPT NFS over TCP.

If you use
   tcpdump -r eth0.log -w eth0nfs.log host  10.10.101.229 and port 2049
to just get the NFS traffic to/from your notebook, and look at
eth0nfs.log in ethereal, you will see an NFS NULL request (effectively a 'ping')
over UDP which works fine.
Then your NFS client tries to create a TCP connection to the NFS server by
sending a [SYN] packet.  It expects to get a [SYN, ACK] packet back, but
only sees an [ACK].  It looks very much like the SYN flag is being stripped
out of the reply packet before it gets to your notebook.
This is only happening for NFS connections.  Other tcp connections work fine,
as evidenced by 'showmount' and 'rpcinfo' working perfectly.

My only guess is that your firewall is doing something odd to 
these TCP/NFS packets, how ever I don't know anything about how SUSE-Firewall
works, so I cannot be certain.

Can I suggest you try disabling the firewall and trying again, or would that
be politically incorrect :-)
My only other suggestion to so assign this bug to someone who knows about
SUSEFirewall2, but I don't know who that is.

NeilBrown
Comment 22 Matthias Hopf 2005-09-08 09:51:32 UTC
Even if it was political incorrect, I wouldn't care ;)
I'll try stopping the firewall as well as using NFS over UDP.
Comment 23 Olaf Kirch 2005-09-09 09:41:23 UTC
This sounds very much like a problem with netfilter we had before; 
see bug #61818. This was reported upstream and got fixed there, maybe 
it crept back in. 
 
The problem in that bug was that immediately following a reboot, the NFS 
client will end up opening the same TCP port it used before, so it tries to 
establish a TCP connection from client:1234 -> server:2049. The server still 
has a TCP control block for this, and replies with a single ACK containing 
what it thinks are the right sequence numbers. That ACK is eaten by the 
conntrack module because the connection isn't yet in state ESTABLISHED. 
 
Does that sound plausible? 
Comment 24 Olaf Kirch 2005-09-09 09:45:57 UTC
Check the 9.2 branch for these patches: 
 
patches.fixes/netfilter-tcp-rst-ack-fix 
patches.fixes/netfilter-tcp-rst-ack-not-assured-fix 
 
These patches supposedly made it into mainline, but looking at the current 
code it seems we still have the old behavior. 
Comment 25 Matthias Hopf 2005-09-09 10:00:49 UTC
ATM I cannot reproduce the problem. It works out-of-the-box as anticipated.

I'll update the laptop to RC1 and check whether the issue pops up again.


The explanation seems to make sense, I'll think of a better testing scheme after
the upgrade. BTW - I've never seen this behavior with 9.3 (never tested 9.2 with
this laptop) - does the 9.3 kernel have these patches applied as well?
Comment 26 Karsten Keil 2005-09-09 11:02:27 UTC
For 9.3 kernel this was already in mainline I think. 
Comment 27 Matthias Hopf 2005-09-09 12:46:35 UTC
It seems that Olaf is right.

I built a test script that does tcpdump, mounts & unmounts an nfs file system.

If I start the script, reboot, and start it again, everything works.
If I start the script, unplug the network cable before the umount, reboot, and
start it again, it freezes.
If I start it with firewall disabled, everything works.

I can attach tcpdump logs if needed, but I don't think it's necessary. Can
someone point me to a kernel (or an easy way to build it myself) with the
patches for testing?
Comment 28 Matthias Hopf 2005-09-09 12:47:14 UTC
Created attachment 49377 [details]
Test script, just for completeness
Comment 29 Matthias Hopf 2005-10-12 11:56:13 UTC
What's the status on this one? Have the patches been ported to the current
kernel? Will they be included with a security kernel update, or just with the
next box product?
Comment 30 Olaf Kirch 2005-10-14 07:33:39 UTC
Good question - Ludwig submitted patches to netfilter-devel a while ago,   
and they got merged upstream. But the current mainline kernel looks like this   
was dropped again. 
Comment 31 Matthias Hopf 2005-11-29 14:45:14 UTC
Raising severity. We didn't get this fixed (again) in 10.0, we definitively should in 10.1.
Comment 32 Olaf Kirch 2005-12-02 15:32:55 UTC
I agree. This came up on netdev a couple of days ago, and it turned out
they really dropped the patches from mainline for some reason. They
just resubmitted the patch.

Jaroslav, can you add this to our HEAD branch please, if it's not in
the latest 2.6.15-rc yet?
Patch is here:
http://marc.theaimsgroup.com/?l=linux-netdev&m=113326103416574&w=2

Thanks!
Comment 33 Jaroslav Kysela 2006-01-11 10:20:07 UTC
The mentioned patch is already in mainstream 2.6.15 kernel (thus in HEAD).
Comment 34 Jon Nelson 2006-01-11 15:24:52 UTC
I am experiencing the same problem here with a boxed copy of 10.0

What is the likelihood of a 10.0 kernel fix?