Bug 1217488

Summary: pynfs CID* tests on NFS v4.0 fail: OP_SETCLIENTID should return NFS4_OK, instead got NFS4ERR_DELAY
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Petr Vorel <petr.vorel>
Component: Kernel:Filesystems
Assignee: Neil Brown <nfbrown>
Status: IN_PROGRESS ---
QA Contact: Petr Vorel <petr.vorel>
Severity: Normal
Priority: P5 - None
CC: pcervinka, petr.vorel, yosun
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1217128
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Petr Vorel 2023-11-24 19:58:34 UTC
Although, because of #1217128, we started testing via cdmackay/pynfs.git, which has a fix [1] for "nfs4lib.BadCompoundRes: operation OP_SETCLIENTID should return NFS4_OK, instead got NFS4ERR_DELAY", we still get 9 tests failing with this error on Tumbleweed. Any idea what could be wrong now?

Or I can ask Calum Mackay on linux-nfs if you're busy with more important stuff.

[1] https://git.linux-nfs.org/?p=cdmackay/pynfs.git;a=commit;h=0d4d3fd0bb7a63860b46f3fed9e9ebf287ea51f8
Comment 1 Petr Cervinka 2024-04-17 09:35:45 UTC
The failure has also reached 15-SP6 in the latest build (80.1): https://openqa.suse.de/tests/14048454#step/CID5/2


nfs4lib.BadCompoundRes: operation OP_SETCLIENTID should return NFS4_OK, instead got NFS4ERR_DELAY


Traceback (most recent call last):
  File "/root/pynfs/nfs4.0/lib/testmod.py", line 234, in run
    self.runtest(self, environment)
  File "/root/pynfs/nfs4.0/servertests/st_setclientid.py", line 363, in testLotsOfClients
    c.init_connection(id)
  File "/root/pynfs/nfs4.0/nfs4lib.py", line 407, in init_connection
    check_result(res)
  File "/root/pynfs/nfs4.0/nfs4lib.py", line 918, in check_result
    raise BadCompoundRes(resop, res.status, msg)
nfs4lib.BadCompoundRes: operation OP_SETCLIENTID should return NFS4_OK, instead got NFS4ERR_DELAY
Comment 2 Neil Brown 2024-04-22 02:03:04 UTC
pynfs only waits for 10 seconds for the DELAY error to go away.  I guess that isn't long enough.
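A minimal sketch of the client-side behaviour described above, assuming a simple poll-and-sleep loop. `retry_while_delayed`, `send_op` and the string status constants are hypothetical names, not pynfs's actual code; only the 10-second limit comes from the comment. The clock and sleep functions are injectable so the loop can be exercised without real waiting.

```python
import time
from typing import Callable

# Hypothetical stand-ins for the NFSv4 status codes seen in the traceback.
NFS4_OK = "NFS4_OK"
NFS4ERR_DELAY = "NFS4ERR_DELAY"


def retry_while_delayed(
    send_op: Callable[[], str],
    timeout: float = 10.0,   # the 10-second budget mentioned in the comment
    interval: float = 1.0,   # assumed pause between retries
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Re-send an operation while the server answers NFS4ERR_DELAY.

    Returns the first non-DELAY status, or NFS4ERR_DELAY if the next
    retry would fall past the deadline (the failure the tests report).
    """
    deadline = clock() + timeout
    while True:
        status = send_op()
        if status != NFS4ERR_DELAY:
            return status
        if clock() + interval > deadline:
            # Out of patience: the caller turns this into BadCompoundRes.
            return status
        sleep(interval)
```

Raising `timeout` only helps if the server eventually stops answering NFS4ERR_DELAY; if it keeps refusing new clients, no finite timeout is enough.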

I think that failing OP_SETCLIENTID just because there are already lots of clients is a bad choice. Certainly fail if there is a real shortage of memory, but not otherwise. Certainly look for idle clients to clean up, but don't fail.
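The policy proposed above can be modelled in a few lines. This is a toy Python model under stated assumptions, not nfsd code (which lives in the kernel and is written in C): `ClientTable`, `soft_limit`, `idle_timeout` and the `memory_short` flag are hypothetical, and "idle" is assumed to mean "no activity for `idle_timeout` seconds".

```python
from dataclasses import dataclass, field
from typing import Dict

NFS4_OK = "NFS4_OK"
NFS4ERR_DELAY = "NFS4ERR_DELAY"


@dataclass
class ClientTable:
    """Toy model of the proposed SETCLIENTID admission policy (hypothetical)."""
    soft_limit: int      # "lots of clients": start looking for idle ones here
    idle_timeout: float  # seconds without activity before a client counts as idle
    last_seen: Dict[str, float] = field(default_factory=dict)

    def reclaim_idle(self, now: float) -> int:
        """Drop clients idle for at least idle_timeout; return how many were dropped."""
        idle = [c for c, t in self.last_seen.items() if now - t >= self.idle_timeout]
        for c in idle:
            del self.last_seen[c]
        return len(idle)

    def setclientid(self, client_id: str, now: float, memory_short: bool) -> str:
        if len(self.last_seen) >= self.soft_limit:
            # Many clients: clean up idle ones, but do not fail for that reason alone.
            self.reclaim_idle(now)
        if memory_short:
            # Only a genuine memory shortage makes the server ask the client to retry.
            return NFS4ERR_DELAY
        self.last_seen[client_id] = now
        return NFS4_OK
```

In this model, the behaviour the comment objects to would correspond to returning NFS4ERR_DELAY whenever `len(self.last_seen) >= self.soft_limit`, regardless of memory.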

I'll post a patch upstream and see what they think.
Comment 3 Petr Vorel 2024-04-22 04:59:02 UTC
Neil's patch on the mailing list: https://lore.kernel.org/linux-nfs/171375175915.7600.6526208866216039031@noble.neil.brown.name/

Thanks, Neil!
Comment 4 Petr Vorel 2024-04-23 15:16:20 UTC
Based on the upstream maintainer's comment that 1 GB of RAM is not enough [1], I tested with more RAM (QEMURAM=3600), and it solved the problem [2]. Let's see whether Neil's v2 fix [3] gets merged upstream.
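One way to see why more RAM can make the DELAY errors go away: assume, hypothetically (this bug does not state the formula), that the server caps the number of clients in proportion to whole GiB of installed memory, with a fixed floor. `max_clients`, `clients_per_gib` and `floor` are illustrative names and values, and QEMURAM=3600 is treated as 3600 MiB.

```python
GIB = 1024 ** 3


def max_clients(ram_bytes: int, clients_per_gib: int = 1024, floor: int = 1024) -> int:
    """Hypothetical memory-scaled client cap.

    Only whole GiB count, so a guest with just under 1 GiB usable
    contributes 0 GiB and falls back to the floor.
    """
    whole_gib = ram_bytes // GIB
    return max(whole_gib * clients_per_gib, floor)
```

Under these assumptions, a roughly 1 GiB guest gets only the floor, while a 3600 MiB guest counts 3 whole GiB and gets three times the headroom, which would be consistent with the larger VM passing.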

[1] https://lore.kernel.org/linux-nfs/ZiZnbV+htcvGuGQl@tissot.1015granger.net/
[2] http://quasar.suse.cz/tests/3237
[3] https://lore.kernel.org/linux-nfs/171385732687.7600.2864936377155228614@noble.neil.brown.name/