|
Bugzilla – Full Text Bug Listing |
| Summary: | nfs connection errors w/ beta1 and network started by ifplugd | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Matthias Hopf <mhopf> |
| Component: | Network | Assignee: | Jaroslav Kysela <perex> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Major | ||
| Priority: | P5 - None | ||
| Version: | RC 1 | ||
| Target Milestone: | --- | ||
| Hardware: | i686 | ||
| OS: | All | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
Syslog from startup to test end
Ping log Test script, just for completeness |
||
|
Description
Matthias Hopf
2005-08-12 12:45:13 UTC
it just does not work this way. If you want NFS / automount, you need a permanent network connection. Or you need to stop the automounter before unplugging the cable and restart after plugging the cable. NFS is no offline filesystem. Ifplugd has nothing to do with this. NFS does not work without network. Design issue. ?!? As long as no NFS mount is requested from autofs (which isn't because there are no automount tables), nfs is not invoked at all. Only after the network cable is attached, and NIS configured and autofs restarted. It also always seemed to work with 9.3. Again, I boot the system w/o network, I plugin in ethernet, it gets configured by ifplugd, I restart NIS and autofs (root console login), I try to log in (NIS user), NFS hangs on the mount request. IMHO this shouldn't happen. I'm not maintainer of ifplugd and I do not know how to fix that. Btw: It has always worked is no an argument, may it was a bug. anyway, it has nothing to do with ifplugd. ifplugd just does "ifup $INTERFACE" as soon as a cable is plugged in adn "ifdown $INTERFACE" after cable is unplugged. If it is a bug, it is clearly not a bug in ifplugd since ifplugd obviously works fine. Just leave the bug assigned to me, i am on vacation for the next 2 weeks. After installing Beta4 the issue remains, and it even got worse - but maybe this is just coincidence: In the post-installation network startup, dhcp failed and the machine didn't get an ip address. After the runlevel change finished, I did /etc/init.d/network restart /etc/init.d/ypbind restart /etc/init.d/autofs restart and I got network and working homes. However, as soon as I cd to /mounts/work/CDs, the mount sits there for some time, and then I get a kernel: RPC: error 5 connecting to server hilbert automount[x]: mount(nfs): nfs: mount failure hilbert:/work on /mounts/work After some time (minutes) it works. As this happened very often on this machine, I doubt that each time hilbert was down. Maybe this is an issue with the network driver, this machine has a Broadcom BCM5751... BTW - This is definitively not minor! I had this problems to, but on 9.3 and mostly then when i had no time to debug. autofs/nfs told me that it could not resolve hostname of the server. Network setup was ok. Sometimes i could not ping another host for a short period. Then it worked again. Maybe it is hardware related. A few days ago my network connetion was down compeletly until i removed a cable from the hub to another laptop that was switched off. this bug has not much to do with ifplugd - it is a generic networking bug Matthias, please check that your clock is correct, this could explain some of the problems. Olaf, further ideas? I would also try restarting nscd after the network comes up. But I think in order to understand what is going on here, we need a tcpdump. Before plugging in the cable, start tcpdump -s0 -w /tmp/eth0.log -i eth0 then plug in the cable, try to access nfs volume. If it doesn't hang unplug the cable, stop tcpdump, go back to square 1 :-) Repeat until you observe a hang. It may also be useful to enable NFS and RPC debugging on the client: echo 65535 > /proc/sys/sunrpc/rpc_debug echo 65535 > /proc/sys/sunrpc/nfs_debug make sure you turn that off again once you're done, as it creates quite a bit of debug output. Maybe Neil could help with this one? Neil, please take a look. Feel free to set this bug to NEEDINFO if you want to wait until the information Olaf requested has been provided. This looks a lot like a networking issue to me. I would definitely be interested in the tcpdump output, but it would be good if you could also try other network operations like 'ping' and 'ssh' and report on how well they work. Also, the output of "ifconfig -a" and "ip route" (or "netstat -nr") might help. I suspect you are using 'soft' mounts. These a generally frowned upon, but in the current sitation they mean you get IO errors instead of a hard hang, which may be a win... I don't know about the mount options. I tested using the standard autofs maps, and mount by hand without any options, I'm currently not sure what the default is. I'll try to reproduce and perform the tests. I'll attach the logfiles as soon as I got something. Ok, was just able to reproduce. Will attach logfiles now. BTW - the date is completely berserk on this machine (some days off), so if that could be the source of the problem... I'll configure ntp on this machine now. Note that the logfiles will be generated with the misconfigured wall time clock. 1st I configured nfs and rpc debugging. 2nd I started tcpdump 3rd I plugged in the network cable 4th ifup eth0 5th rcypbind restart 6th rcautofs restart 7th cd /mounts/work/CDs (failed) 8th cd /suse/mhopf (failed) The tcpdump logfile will be compressed due to its size (127MB). Created attachment 48912 [details]
Syslog from startup to test end
Created attachment 48913 [details]
Ping log
Note that one ping packet was lost during the whole test.
Still waiting for the tcpdump... if it is too big for bugzilla, just dump it somewhere (e.g. /suse/mhopf) and I'll pick it up. Also it might help to see the output of /usr/sbin/rpcinfo -p hilbert showmount -e hilbert Thanks, NeilBrown I didn't notice that the browser gave me an error message due to the size of the log. You can now get it from ~mhopf/Export/eth0.log.gz It's a hefty 127MB in uncompressed form. ifconfig -a and ip route follow. All captured not during the session I created the logs, but with a cd
/mounts/work/CDs blocking:
iproute -a:
10.10.0.0/16 dev eth0 proto kernel scope link src 10.10.101.229
169.254.0.0/16 dev eth0 scope link
127.0.0.0/8 dev lo scope link
default via 10.10.0.8 dev eth0
ifconfig -a:
eth0 Link encap:Ethernet HWaddr 00:12:3F:22:4C:FD
inet addr:10.10.101.229 Bcast:10.10.255.255 Mask:255.255.0.0
inet6 addr: 2001:780:101:a00:212:3fff:fe22:4cfd/64 Scope:Global
inet6 addr: fe80::212:3fff:fe22:4cfd/64 Scope:Link
UP BROADCAST NOTRAILERS RUNNING MULTICAST MTU:1500 Metric:1
RX packets:1869 errors:0 dropped:0 overruns:0 frame:0
TX packets:74 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:146522 (143.0 Kb) TX bytes:6462 (6.3 Kb)
Interrupt:169
eth1 Link encap:Ethernet HWaddr 00:12:F0:7E:12:20
BROADCAST MULTICAST MTU:1500 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
Interrupt:201 Base address:0xc000 Memory:dfcff000-dfcfffff
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:16436 Metric:1
RX packets:270 errors:0 dropped:0 overruns:0 frame:0
TX packets:270 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:22066 (21.5 Kb) TX bytes:22066 (21.5 Kb)
sit0 Link encap:IPv6-in-IPv4
NOARP MTU:1480 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:0
RX bytes:0 (0.0 b) TX bytes:0 (0.0 b)
rpcinfo -p hilbert:
program vers proto port
100000 2 tcp 111 portmapper
100000 2 udp 111 portmapper
100004 2 udp 709 ypserv
100004 1 udp 709 ypserv
100004 2 tcp 712 ypserv
100004 1 tcp 712 ypserv
390113 1 tcp 7937
100007 2 udp 791 ypbind
100007 1 udp 791 ypbind
100007 2 tcp 794 ypbind
100007 1 tcp 794 ypbind
100005 1 udp 649 mountd
100005 2 udp 649 mountd
100005 1 tcp 652 mountd
100005 2 tcp 652 mountd
100003 2 udp 2049 nfs
100003 2 tcp 2049 nfs
showmount -e hilbert:
Export list for hilbert:
/patchinfo 10.10.0.0/255.255.0.0
/patchinfo *.suse.de
/mbuild *.hwlab.suse.de
/mbuild 10.10.0.0/255.255.0.0
/mbuild *.suse.de
/work/local (everyone)
/work/local (everyone)
/work/users/uli *.suse.de
/work/users/coolo *.suse.de
/work/users/mmj *.suse.de
/work/users/matz *.suse.de
/work/users/adrian *.suse.de
/work/users/aj *.suse.de
/work/users/kukuk *.suse.de
/work/tmp <anon clnt>
/work/built 10.10.0.0/255.255.0.0
/work/built *.hwlab.suse.de
/work/built *.suse.de
/work/src/Copyright Boole.suse.de
/work/cd/pac *.hwlab.suse.de
/work/cd/pac *.suse.de
/work/src/done *.hwlab.suse.de
/work/src/done *.suse.de
/work/SRC/ChangeLog machcd4.suse.de
/work/SRC/ChangeLog machcd3.suse.de
/work/SRC/ChangeLog machcd2.suse.de
/work/SRC/ChangeLog machcd.suse.de
/work 10.11.0.0/255.255.0.0
/work *.hwlab.suse.de
/work 10.10.0.0/255.255.0.0
/work *.suse.de
/work hilbert6.suse.de
/work hilbert5.suse.de
/work hilbert4.suse.de
/work hilbert3.suse.de
/work hilbert2.suse.de
/work hilbert1.suse.de
/ localhost
Well, it looks like everything is working perfectly EXCEPT NFS over TCP. If you use tcpdump -r eth0.log -w eth0nfs.log host 10.10.101.229 and port 2049 to just get the NFS traffic to/from your notebook, and look at eth0nfs.log in ethereal, you will see an NFS NULL request (effectively a 'ping') over UDP which works fine. Then your NFS client tries to create a TCP connection to the NFS server by sending a [SYN] packet. It expects to get a [SYN, ACK] packet back, but only sees an [ACK]. It looks very much like the SYN flag is being stripped out of the reply packet before it gets to your notebook. This is only happening for NFS connections. Other tcp connections work fine, as evidenced by 'showmount' and 'rpcinfo' working perfectly. My only guess is that your firewall is doing something odd to these TCP/NFS packets, how ever I don't know anything about how SUSE-Firewall works, so I cannot be certain. Can I suggest you try disabling the firewall and trying again, or would that be politically incorrect :-) My only other suggestion to so assign this bug to someone who knows about SUSEFirewall2, but I don't know who that is. NeilBrown Even if it was political incorrect, I wouldn't care ;) I'll try stopping the firewall as well as using NFS over UDP. This sounds very much like a problem with netfilter we had before; see bug #61818. This was reported upstream and got fixed there, maybe it crept back in. The problem in that bug was that immediately following a reboot, the NFS client will end up opening the same TCP port it used before, so it tries to establish a TCP connection from client:1234 -> server:2049. The server still has a TCP control block for this, and replies with a single ACK containing what it thinks are the right sequence numbers. That ACK is eaten by the conntrack module because the connection isn't yet in state ESTABLISHED. Does that sound plausible? Check the 9.2 branch for these patches: patches.fixes/netfilter-tcp-rst-ack-fix patches.fixes/netfilter-tcp-rst-ack-not-assured-fix These patches supposedly made it into mainline, but looking at the current code it seems we still have the old behavior. ATM I cannot reproduce the problem. It works out-of-the-box as anticipated. I'll update the laptop to RC1 and check whether the issue pops up again. The explanation seems to make sense, I'll think of a better testing scheme after the upgrade. BTW - I've never seen this behavior with 9.3 (never tested 9.2 with this laptop) - does the 9.3 kernel have these patches applied as well? For 9.3 kernel this was already in mainline I think. It seems that Olaf is right. I built a test script that does tcpdump, mounts & unmounts an nfs file system. If I start the script, reboot, and start it again, everything works. If I start the script, unplug the network cable before the umount, reboot, and start it again, it freezes. If I start it with firewall disabled, everything works. I can attach tcpdump logs if needed, but I don't think it's necessary. Can someone point me to a kernel (or an easy way to build it myself) with the patches for testing? Created attachment 49377 [details]
Test script, just for completeness
What's the status on this one? Have the patches been ported to the current kernel? Will they be included with a security kernel update, or just with the next box product? Good question - Ludwig submitted patches to netfilter-devel a while ago, and they got merged upstream. But the current mainline kernel looks like this was dropped again. Raising severity. We didn't get this fixed (again) in 10.0, we definitively should in 10.1. I agree. This came up on netdev a couple of days ago, and it turned out they really dropped the patches from mainline for some reason. They just resubmitted the patch. Jaroslav, can you add this to our HEAD branch please, if it's not in the latest 2.6.15-rc yet? Patch is here: http://marc.theaimsgroup.com/?l=linux-netdev&m=113326103416574&w=2 Thanks! The mentioned patch is already in mainstream 2.6.15 kernel (thus in HEAD). I am experiencing the same problem here with a boxed copy of 10.0 What is the likelihood of a 10.0 kernel fix? |