Bug 128784

Summary: No file locking possible on /home directory on nfs server
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Aart de Vries <aart.de.vries>
Component: Kernel
Assignee: Neil Brown <nfbrown>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P5 - None
CC: clain, forgotten_uCb0QVSAVR
Version: Final
Target Milestone: ---
Hardware: x86
OS: SuSE Linux 10.0
Whiteboard:
Found By: Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Patch to fix statd/lockd problem in SuSE10
             Second attempt to fix bug

Description Aart de Vries 2005-10-17 17:29:22 UTC
I have an x86 desktop on SuSE 10.0, which mounts the /home directory over nfs.
When this nfs server is running SuSE 9.1, everything works.
However I decided to upgrade the server to SuSE 10.0 as well, but now several
programs don't start anymore. 

The affected programs that I have identified so far are:
- Eclipse
- OpenOffice
- Skype

For OO I found a workaround (see comment 3 & 4) in:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=152269

For eclipse it is possible to disable file locking as well according to the
information in a log file:
!MESSAGE Error reading configuration: An error occurred while locking file
"/home/aart/.eclipse/org.eclipse.platform_3.1.1/configuration/org.eclipse.osgi/.manager/.fileTableLock":
 "Stale NFS file handle". A common reason is that the file system or Runtime
Environment does not support
 file locking for that location. Please choose a different location, or disable
file locking passing "-Dosgi.locking=none" as a VM argument.

For Skype I haven't found any workaround.

There is probably more software not working that I haven't tested yet. For the
time being I will have to boot my server in 9.1. Since it is dual booting SuSE
10.0, I'm more than willing to test any patches. My setup:

On server /etc/fstab entry for /home directory 
/dev/hda8       /home   reiserfs        acl,user_xattr 1 2

This directory is exported with the YaST nfs server utility. This results 
in /etc/exports containing:
/home   *(rw,no_root_squash,sync)

On the desktop-client the nfs directory is mounted based on the following entry 
in /etc/fstab:
192.168.1.2:/home       /home   nfs     acl,rw 1 1
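
A quick way to check whether locking works on a given mount, independent of Eclipse, OpenOffice or Skype, is to request a lock directly. The following is a minimal sketch, not part of the original report: the function name probe_lock and the scratch-file prefix are made up for illustration. It takes and releases a POSIX advisory lock (the fcntl-style kind that NFSv3 forwards to lockd) on a temporary file in the directory being tested.

```python
import fcntl
import os
import tempfile


def probe_lock(directory):
    """Try to take and release a POSIX advisory lock on a scratch file.

    Returns (True, None) if the lock could be taken and released,
    or (False, error) if the lock request failed.
    probe_lock is a hypothetical helper name, not an existing tool.
    """
    # Create the scratch file inside the directory under test so the lock
    # request goes to the filesystem (and NFS server) backing that directory.
    fd, path = tempfile.mkstemp(prefix=".locktest-", dir=directory)
    try:
        try:
            # Non-blocking exclusive lock, then an explicit unlock.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.lockf(fd, fcntl.LOCK_UN)
        except OSError as err:
            return False, err
        return True, None
    finally:
        os.close(fd)
        os.unlink(path)
```

Calling probe_lock("/home/aart") on the client should return (True, None) on a healthy mount. If the failure described above is what the applications are hitting, the second element would be expected to carry the underlying error, such as the "Stale NFS file handle" that Eclipse reports.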
Comment 1 Dr. Werner Fink 2005-10-18 09:22:12 UTC
Why was this a blocker? I guess this is kernel nfs.
Comment 2 Chris L Mason 2005-10-18 13:48:14 UTC
One more for Neil, although it looks as though the lock server just isn't running.
Comment 3 Aart de Vries 2005-10-18 15:48:20 UTC
I see a lockd process on the nfs-server when I do a ps ax, just after the nfs processes when running 10.0 on the server. This looks very similar to the process list on 9.1. Here is that part of the ps ax dump when running 10.0 on the server:
 4253 ?        S<     0:00 [nfsd4]
 4258 ?        S      0:00 [nfsd]
 4259 ?        S      0:00 [nfsd]
 4260 ?        S      0:00 [nfsd]
 4261 ?        S      0:00 [nfsd]
 4303 ?        Ssl    0:00 /usr/sbin/nscd
 4292 ?        S      0:00 [lockd]
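
The same check can be scripted when comparing the 9.1 and 10.0 process lists. This is a rough sketch that assumes the default `ps ax` column layout shown above (PID, TTY, STAT, TIME, COMMAND); the function name kernel_threads is invented for illustration. It counts kernel threads, which ps prints in square brackets.

```python
import re


def kernel_threads(ps_output):
    """Count kernel threads (shown as [name]) in the text output of `ps ax`.

    Returns a dict mapping thread name to the number of occurrences.
    Assumes the five default columns: PID TTY STAT TIME COMMAND.
    """
    counts = {}
    for line in ps_output.splitlines():
        fields = line.split(None, 4)
        if len(fields) < 5:
            continue  # header fragments or blank lines
        match = re.fullmatch(r"\[(\S+)\]", fields[4].strip())
        if match:
            name = match.group(1)
            counts[name] = counts.get(name, 0) + 1
    return counts
```

Note that a [lockd] entry only shows that the thread exists. As this report demonstrates, lock requests can still fail while lockd is running.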

One more piece of info: The problem also exists when the firewall on the server is switched off.

Why a blocker? To me it is one: when three of my five most-used apps run crippled or not at all against a 10.0 server, I won't be using it. From your point of view, with all the issues being reported, I assume you will have a different opinion. But since I didn't see a description of what these priorities mean when I had to select one, I chose based on my own perception.
Comment 4 Gavin Burnell 2005-10-18 22:43:41 UTC
I think I have the same or a similar bug on an x86_64 client, again mounting /home over nfs from an i586 server. Symptoms for me include OpenOffice reporting "A General Input/Output error while accessing ...". The workaround suggested above worked for me. Rather more difficult to fix is MSOffice under CrossOver Office doing the same thing. It was all working with a 9.3 client and server.

Possibly relevant: I've got multiple things like this in /var/log/messages:
Oct 18 23:35:30 nova kernel: lockd: unexpected unlock status: 7

Possibly also relevant: I thought I saw something about the way locking is handled in nfs having changed in 2.6.12. My 9.3 machines were running 2.6.11, the 10.0 machines run 2.6.13...
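
To see how often a client logs this message and with which status codes, the log can be scanned with a few lines of Python. This is only a sketch: the function name is invented, and it assumes the message text is exactly as in the line quoted above.

```python
import re
from collections import Counter

# Matches the kernel message quoted above and captures the numeric status.
UNLOCK_STATUS = re.compile(r"lockd: unexpected unlock status: (\d+)")


def unlock_status_counts(log_text):
    """Return a Counter mapping each reported unlock status to its frequency."""
    return Counter(int(m.group(1)) for m in UNLOCK_STATUS.finditer(log_text))
```

It could be fed the contents of /var/log/messages, for example via open("/var/log/messages").read(). As far as I can tell, status 7 in the NLM version 4 protocol is NLM4_STALE_FH, which would be consistent with the "Stale NFS file handle" error in the Eclipse log from the description.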
Comment 5 Neil Brown 2005-10-19 04:35:04 UTC
Yep, there is a bug in the kernel-based 'statd' which makes lockd not work.

You could disable the kernel-based statd with the kernel parameter
   lockd.nsm_use_kstatd=0
but then you would need the user-space statd, which doesn't come with SuSE-10.

The simplest short-term fix would be to add the above kernel parameter to
the appropriate line in
   /boot/grub/menu.lst
and get a source distrib of nfs-utils and compile statd from that.
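
For the first half of that workaround, the "appropriate line" in a GRUB legacy menu.lst is the one beginning with `kernel`. The sketch below is an illustration only: the function name is made up, it works on the file contents as a string rather than editing /boot/grub/menu.lst in place, and it appends the parameter to every `kernel` line, which may be more than you want if the file has several boot entries.

```python
PARAM = "lockd.nsm_use_kstatd=0"


def add_kernel_param(menu_lst_text, param=PARAM):
    """Append `param` to each GRUB legacy `kernel` line that lacks it.

    Returns the modified text; the input is not changed and no file is written.
    Commented-out lines (starting with '#') are left alone.
    """
    out = []
    for line in menu_lst_text.splitlines():
        words = line.split()
        if words[:1] == ["kernel"] and param not in words:
            line = line.rstrip() + " " + param
        out.append(line)
    return "\n".join(out) + "\n"
```

Review the result and keep a backup of the original file before writing it back. After a reboot, /proc/cmdline should show whether the parameter was actually passed to the kernel.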

A better solution would be to fix the kernel bug, which of course requires
recompiling the kernel.  I'll try to put together a patch in the next day or so.
Comment 6 Neil Brown 2005-10-19 09:50:34 UTC
Created attachment 54719 [details]
Patch to fix statd/lockd problem in SuSE10

This patch (which applies to the SUSE-10 kernel, not to
mainline) should fix the locking problem.

If you are able to test it, please let me know the result.

Thanks,
NeilBrown
Comment 7 Rico Rommel 2005-10-19 16:52:52 UTC
I have applied the patch and the module nfsd is loading without errors.
But when I try to mount a directory, the nfsd module crashes.


Here is the output of dmesg:

Oct 19 17:53:32 server kernel: Unable to handle kernel NULL pointer dereference at virtual address 00000000
Oct 19 17:53:32 server kernel:  printing eip:
Oct 19 17:53:32 server kernel: 00000000
Oct 19 17:53:32 server kernel: *pde = 00000000
Oct 19 17:53:32 server kernel: Oops: 0000 [#1]
Oct 19 17:53:32 server kernel: Modules linked in: w83627hf w83781d hfsplus vfat fat subfs speedstep_lib freq_table nfsd exportfs ipt_MASQUERADE ipt_pkttype ipt_TCPMSS ipt_LOG ipt_limit af_packet ppp_synctty button battery eeprom i2c_sensor ac i2c_isa ppp_generic slhc edd ip6t_REJECT ipt_REJECT ipt_state iptable_mangle iptable_nat iptable_filter ip6table_mangle ip_conntrack ip_tables ip6table_filter ip6_tables ipv6 fcdslsl quota_v2 capi i2c_prosavage i2c_algo_bit savagefb capifs via_agp agpgart ehci_hcd shpchp pci_hotplug uhci_hcd usbcore via_rhine mii kernelcapi i2c_viapro i2c_core via_ircc irda crc_ccitt generic parport_pc lp parport dm_mod reiserfs fan ide_cd cdrom thermal processor via82cxxx ide_disk ide_core
Oct 19 17:53:32 server kernel: CPU:    0
Oct 19 17:53:32 server kernel: EIP:    0060:[<00000000>]    Tainted: P     U VLI
Oct 19 17:53:32 server kernel: EFLAGS: 00010246   (2.6.13-15-default) 
Oct 19 17:53:32 server kernel: EIP is at _stext+0x3feffdc0/0x20
Oct 19 17:53:32 server kernel: eax: df30ac00   ebx: e0aa38a0   ecx: df30ac64   edx: 00000005
Oct 19 17:53:32 server kernel: esi: df30ac00   edi: 00000003   ebp: df30ac64   esp: da4aff84
Oct 19 17:53:32 server kernel: ds: 007b   es: 007b   ss: 0068
Oct 19 17:53:32 server kernel: Process nfsd (pid: 6034, threadinfo=da4ae000 task=deacaaa0)
Oct 19 17:53:32 server kernel: Stack: c02ec638 da4e890c da4e88e0 00000002 df30ac40 00000000 00000000 db08700c 
Oct 19 17:53:32 server kernel:        01000000 00000000 da4e88e0 ffffc81e df30ac00 e0a7735f da4b1fbc e0aa3880 
Oct 19 17:53:32 server kernel:        deacaaa0 fffffeff ffffffff fffffef8 ffffffff e0a771f0 00000000 00000000 
Oct 19 17:53:32 server kernel: Call Trace:
Oct 19 17:53:32 server kernel:  [<c02ec638>] svc_process+0x578/0x6b0
Oct 19 17:53:32 server kernel:  [<e0a7735f>] nfsd+0x16f/0x2e0 [nfsd]
Oct 19 17:53:32 server kernel:  [<e0a771f0>] nfsd+0x0/0x2e0 [nfsd]
Oct 19 17:53:32 server kernel:  [<c01012f1>] kernel_thread_helper+0x5/0x14
Oct 19 17:53:32 server kernel: Code:  Bad EIP value.
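
When comparing oopses from several testers, it helps to reduce each one to its call chain. The following sketch pulls the frames out of a 2.6-style trace like the one above. The function name and the regular expression are assumptions based on this one sample, not a general oops parser.

```python
import re

# One frame looks like: [<e0a7735f>] nfsd+0x16f/0x2e0 [nfsd]
FRAME = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?"
)


def call_trace(dmesg_text):
    """Return the frames after 'Call Trace:' as (symbol, module) tuples.

    `module` is None for symbols in the core kernel image.
    Parsing stops at the first line that is not a frame.
    """
    frames = []
    in_trace = False
    for line in dmesg_text.splitlines():
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            match = FRAME.search(line)
            if match is None:
                break
            frames.append((match.group(2), match.group(3)))
    return frames
```

For the trace above, the innermost frame is svc_process. Together with an EIP of 00000000 and "Bad EIP value", this suggests that svc_process called through a NULL function pointer.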
Comment 8 Aart de Vries 2005-10-19 21:42:10 UTC
I also managed to apply and test the patch. I get the same NULL pointer dereference as Rico. PS. I compiled and tested this on an AMD Duron system.
Comment 9 Neil Brown 2005-10-19 22:55:10 UTC
Created attachment 54854 [details]
Second attempt to fix bug

Okay... fixing that bug exposed another one.
This one only affects you if you have ACL support
enabled, and the client I was using for testing
didn't.

Please try this new patch.

Thanks,
NeilBrown
Comment 10 Aart de Vries 2005-10-20 02:02:37 UTC
This time you worked your magic. Works great. All three apps that didn't work before (Skype, Eclipse and OpenOffice) now all work again without the workarounds.

Thanks Neil, 
I appreciate the good work.

PS. I don't know whether I should change the status to fixed myself, or whether you would like some more people to test first (like Rico). So I'll leave you the honors of declaring this thing fixed.
Comment 11 Neil Brown 2005-10-20 05:18:42 UTC
Best not to mark it 'fixed' until the patch is in the CVS, and preferably
upstream too.  I'll see what I can do.

Thanks for the testing and quick feedback.

NeilBrown
Comment 12 Neil Brown 2005-10-21 01:19:16 UTC
Ok, this is now in the CVS, and will be heading upstream shortly, so I'm
marking it fixed.  Thanks again,
Comment 13 Olaf Kirch 2005-10-24 11:32:27 UTC
*** Bug 129744 has been marked as a duplicate of this bug. ***
Comment 14 Neil Brown 2005-11-08 10:58:05 UTC
*** Bug 132096 has been marked as a duplicate of this bug. ***
Comment 15 Olaf Kirch 2005-11-14 12:05:23 UTC
*** Bug 133619 has been marked as a duplicate of this bug. ***