Bugzilla – Bug 128784
No file locking possible on /home directory on nfs server
Last modified: 2006-04-09 17:49:28 UTC
I have an x86 desktop on SuSE 10.0, which mounts the /home directory over nfs. When the nfs server is running SuSE 9.1, everything works. However, I decided to upgrade the server to SuSE 10.0 as well, and now several programs don't start anymore. The affected programs that I have identified so far are:
- Eclipse
- OpenOffice
- Skype

For OO I found a workaround (see comment 3 & 4) in:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=152269

For Eclipse it is possible to disable file locking as well, according to the information in a log file:

!MESSAGE Error reading configuration: An error occurred while locking file "/home/aart/.eclipse/org.eclipse.platform_3.1.1/configuration/org.eclipse.osgi/.manager/.fileTableLock": "Stale NFS file handle". A common reason is that the file system or Runtime Environment does not support file locking for that location. Please choose a different location, or disable file locking passing "-Dosgi.locking=none" as a VM argument.

For Skype I haven't found any workarounds. There is probably more software not working that I haven't tested yet. For the time being I will have to boot my server in 9.1. Since it dual-boots SuSE 10.0, I'm more than willing to test any patches.

My setup:

On the server, the /etc/fstab entry for the /home directory is:
/dev/hda8 /home reiserfs acl,user_xattr 1 2

This directory is exported with the YaST nfs server utility. This results in /etc/exports containing:
/home *(rw,no_root_squash,sync)

On the desktop client the nfs directory is mounted based on the following entry in /etc/fstab:
192.168.1.2:/home /home nfs acl,rw 1 1
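The failure described above can probably be reproduced without starting Eclipse, OpenOffice or Skype. The following is a minimal sketch, assuming Python is available on the client; the function name probe_lock and the scratch-file prefix are hypothetical and not part of this report. It requests a non-blocking exclusive lock with fcntl.lockf, the byte-range style of lock that an NFS client normally forwards to the server's lock manager, and returns the errno name if the request fails.

```python
import errno
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try to take and release an exclusive advisory lock in `directory`.

    Returns (True, None) on success or (False, errno_name) on failure.
    """
    # Scratch file inside the directory under test (hypothetical name prefix).
    fd, path = tempfile.mkstemp(prefix=".lockprobe-", dir=directory)
    try:
        try:
            # Non-blocking exclusive lock over the whole file.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            # Release it again; a failing unlock is reported the same way.
            fcntl.lockf(fd, fcntl.LOCK_UN)
        except OSError as exc:
            return False, errno.errorcode.get(exc.errno, str(exc.errno))
        return True, None
    finally:
        # Always clean up the scratch file.
        os.close(fd)
        os.unlink(path)
```

Calling probe_lock(os.path.expanduser("~")) on the client once against the 9.1 server and once against the 10.0 server should show whether plain lock requests fail in the same way as the applications do. The exact errno returned on a broken setup is not something this sketch can predict.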
Why was this a blocker? I guess this is kernel nfs.
One more for Neil, although it looks as though the lock server just isn't running.
I see a lockd process on the nfs server when I do a ps ax, just after the nfs processes, when running 10.0 on the server. This looks very similar to the process list on 9.1. Here is that part of the ps ax dump when running 10.0 on the server:

4253 ? S< 0:00 [nfsd4]
4258 ? S 0:00 [nfsd]
4259 ? S 0:00 [nfsd]
4260 ? S 0:00 [nfsd]
4261 ? S 0:00 [nfsd]
4303 ? Ssl 0:00 /usr/sbin/nscd
4292 ? S 0:00 [lockd]

One more piece of info: the problem also exists when the firewall on the server is switched off.

Why a blocker? To me it is one: when three of my five most-used apps run crippled or not at all against a 10.0 server, I won't be using it. From your point of view, with all the issues being reported, I can assume you will have a different opinion, but since I didn't see a description of the definitions of these priorities when I had to select one, I selected based on my perception.
I think I have the same or a similar bug on an x86_64 client, again mounting /home over nfs from an i586 server. Symptoms for me include OpenOffice reporting "A General Input/Output error while accessing ...". The workaround suggested above worked for me. Rather more difficult to fix is MSOffice under CrossOver Office doing the same thing. It was all working with a 9.3 client and server.

Possibly relevant: I've got multiple entries like this in /var/log/messages:

Oct 18 23:35:30 nova kernel: lockd: unexpected unlock status: 7

Possibly also relevant: I thought I saw something about the way locking is handled in nfs having changed in 2.6.12. My 9.3 machines were running 2.6.11; the 10.0 machines run 2.6.13...
Yep, there is a bug in the kernel-based 'statd' which makes lockd not work. You could disable the kernel-based statd with the kernel parameter
lockd.nsm_use_kstatd=0
but then you would need the user-space statd, which doesn't come with SuSE-10. The simplest short-term fix would be to add the above kernel parameter to the appropriate line in /boot/grub/menu.lst, get a source distribution of nfs-utils, and compile statd from that. A better solution would be to fix the kernel bug, which of course requires recompiling the kernel. I'll try to put together a patch in the next day or so.
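After editing the boot line and rebooting, it may be worth confirming that the parameter was actually passed to the kernel. A minimal sketch, assuming a Linux system where the boot command line can be read from /proc/cmdline; the helper names are hypothetical, and only the parameter name lockd.nsm_use_kstatd comes from the comment above.

```python
def parse_cmdline(text):
    """Split a kernel command line into a {name: value} dict.

    Bare flags (no '=') map to None. Quoted values containing
    spaces are not handled by this sketch.
    """
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def kstatd_disabled(cmdline_text):
    """True if lockd.nsm_use_kstatd=0 appears on the given command line."""
    return parse_cmdline(cmdline_text).get("lockd.nsm_use_kstatd") == "0"


def read_cmdline(path="/proc/cmdline"):
    """Return the command line the running kernel was booted with."""
    with open(path) as handle:
        return handle.read()
```

On the server, kstatd_disabled(read_cmdline()) returning True only shows that the parameter was present at boot. It does not prove that lockd honored it, nor that a user-space statd is running.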
Created attachment 54719
Patch to fix statd/lockd problem in SuSE10

This patch (which applies to the SUSE-10 kernel, not to mainline) should fix the locking problem. If you are able to test it, please let me know the result.

Thanks,
NeilBrown
I have applied the patch and the nfsd module loads without errors. But when I try to mount a directory, the nfsd module crashes. Here is the output of dmesg:

Oct 19 17:53:32 server kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Oct 19 17:53:32 server kernel: printing eip:
Oct 19 17:53:32 server kernel: 00000000
Oct 19 17:53:32 server kernel: *pde = 00000000
Oct 19 17:53:32 server kernel: Oops: 0000 [#1]
Oct 19 17:53:32 server kernel: Modules linked in: w83627hf w83781d hfsplus vfat fat subfs speedstep_lib freq_table nfsd exportfs ipt_MASQUERADE ipt_pkttype ipt_TCPMSS ipt_LOG ipt_limit af_packet ppp_synctty button battery eeprom i2c_sensor ac i2c_isa ppp_generic slhc edd ip6t_REJECT ipt_REJECT ipt_state iptable_mangle iptable_nat iptable_filter ip6table_mangle ip_conntrack ip_tables ip6table_filter ip6_tables ipv6 fcdslsl quota_v2 capi i2c_prosavage i2c_algo_bit savagefb capifs via_agp agpgart ehci_hcd shpchp pci_hotplug uhci_hcd usbcore via_rhine mii kernelcapi i2c_viapro i2c_core via_ircc irda crc_ccitt generic parport_pc lp parport dm_mod reiserfs fan ide_cd cdrom thermal processor via82cxxx ide_disk ide_core
Oct 19 17:53:32 server kernel: CPU: 0
Oct 19 17:53:32 server kernel: EIP: 0060:[<00000000>] Tainted: P U VLI
Oct 19 17:53:32 server kernel: EFLAGS: 00010246 (2.6.13-15-default)
Oct 19 17:53:32 server kernel: EIP is at _stext+0x3feffdc0/0x20
Oct 19 17:53:32 server kernel: eax: df30ac00 ebx: e0aa38a0 ecx: df30ac64 edx: 00000005
Oct 19 17:53:32 server kernel: esi: df30ac00 edi: 00000003 ebp: df30ac64 esp: da4aff84
Oct 19 17:53:32 server kernel: ds: 007b es: 007b ss: 0068
Oct 19 17:53:32 server kernel: Process nfsd (pid: 6034, threadinfo=da4ae000 task=deacaaa0)
Oct 19 17:53:32 server kernel: Stack: c02ec638 da4e890c da4e88e0 00000002 df30ac40 00000000 00000000 db08700c
Oct 19 17:53:32 server kernel: 01000000 00000000 da4e88e0 ffffc81e df30ac00 e0a7735f da4b1fbc e0aa3880
Oct 19 17:53:32 server kernel: deacaaa0 fffffeff ffffffff fffffef8 ffffffff e0a771f0 00000000 00000000
Oct 19 17:53:32 server kernel: Call Trace:
Oct 19 17:53:32 server kernel: [<c02ec638>] svc_process+0x578/0x6b0
Oct 19 17:53:32 server kernel: [<e0a7735f>] nfsd+0x16f/0x2e0 [nfsd]
Oct 19 17:53:32 server kernel: [<e0a771f0>] nfsd+0x0/0x2e0 [nfsd]
Oct 19 17:53:32 server kernel: [<c01012f1>] kernel_thread_helper+0x5/0x14
Oct 19 17:53:32 server kernel: Code: Bad EIP value.
I also managed to apply and test the patch. I get the same NULL pointer dereference as Rico. PS. I compiled and tested this on an AMD Duron system.
Created attachment 54854
Second attempt to fix bug

Okay... fixing that bug exposed another one. This one only affects you if you have ACL support enabled, and the client I was using for testing didn't. Please try this new patch.

Thanks,
NeilBrown
This time you worked your magic. Works great. All three apps that didn't work before (Skype, Eclipse and OpenOffice) now work again without the workarounds. Thanks Neil, I appreciate the good work. PS. I don't know whether I should change the status to fixed myself, or whether you would like some more people to test (like Rico). So I'll leave you the honors of declaring this thing fixed.
Best not to mark it 'fixed' until the patch is in the CVS, and preferably upstream too. I'll see what I can do. Thanks for the testing and quick feedback.
NeilBrown
Ok, this is now in the CVS, and will be heading upstream shortly, so I'm marking it fixed. Thanks again,
*** Bug 129744 has been marked as a duplicate of this bug. ***
*** Bug 132096 has been marked as a duplicate of this bug. ***
*** Bug 133619 has been marked as a duplicate of this bug. ***