Bug 1216201 - NFS Client is broken in MicroOS since 10.10
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Network
Version: Current
Hardware: Other openSUSE Tumbleweed
Importance: P5 - None : Critical
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-10-13 07:32 UTC by Janos Miko
Modified: 2023-10-16 22:36 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Janos Miko 2023-10-13 07:32:33 UTC
Hello,

It seems the latest nfs-client update broke the NFS client in some cases.

Context: I'm running multiple Kubernetes clusters on top of openSUSE MicroOS using the following Terraform module:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner

I'm using Rancher Longhorn for volumes in the cluster, and Longhorn's RWX volumes use NFS under the hood.

Yesterday, a scheduled transactional-update run upgraded all my nodes (including the nfs-client package, from 2.6.3-39.4 to 2.6.3-39.5), and the RWX volumes started to fail.

The journal showed the following error message:
---
Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting command: /usr/local/sbin/nsmounter
Oct 11 07:42:23 dev-worker-1 k3s[1294]: Mounting arguments: mount -t nfs -o vers=4.1,noresvport,intr,hard 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c /var/lib/kubelet/plugins/kubernetes.io/csi/driver.longhorn.io/185c34f566c2eca6e8c7c6a2ede2094c076d7d25ddae286dc633eeef80551af0/globalmount
Oct 11 07:42:23 dev-worker-1-autoscaled-small-19baf778f50efd8c k3s[1294]: Output: mount.nfs: Protocol not supported for 10.43.207.185:/pvc-13538170-4278-4467-b2b0-1f1ba6f54a4c on /var/lib/kubelet/plugin
---
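
To see how many mounts fail this way on a node, the journal can be scanned for the error string shown above. This is only a minimal sketch: the helper name `count_nfs_proto_errors` is hypothetical, and the unit name `k3s` in the usage comment is an assumption taken from the log prefix (it may differ on other setups).

```shell
# Hypothetical helper: count journal lines on stdin that contain the
# "Protocol not supported" error reported by mount.nfs.
count_nfs_proto_errors() {
  # grep -c prints the number of matching lines; it exits non-zero when
  # there are none, so "|| true" keeps the helper from reporting failure.
  grep -c 'mount.nfs: Protocol not supported' || true
}

# Usage (assumes the messages are logged by a unit named "k3s"):
#   journalctl -u k3s --since today | count_nfs_proto_errors
```

A non-zero count after the update, and zero after a rollback, would match the behaviour described in this report.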

I also enabled debug logging in the kernel; you can find those logs in the related bug report.

After a few hours I used the MicroOS rollback to return to the latest snapshot from 2 days ago, and the issue is now solved on that node.

Related bug reports on GitHub:
https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/issues/1016
https://github.com/longhorn/longhorn/issues/6857

Please let me know if I can provide any additional information about this issue.

Since my comment in https://bugzilla.opensuse.org/show_bug.cgi?id=1214540, multiple users have reported the same issue.

Regards,
Janos Miko
Comment 1 Neil Brown 2023-10-16 01:08:17 UTC
There is no change to the nfs-client package between 2.6.3-39.4 and 2.6.3-39.5. It appears that the package was rebuilt, possibly because some package that it depends on changed.
So it seems likely that some shared library that NFS uses was also updated.

Do you have a list of other packages that were changed during the update which caused the breakage?
Comment 2 Neil Brown 2023-10-16 06:00:38 UTC
I had a look through the kernel logs in 
  https://github.com/longhorn/longhorn/issues/6857#issuecomment-1758393346

and it looks like a kernel problem.
You probably updated to kernel-default-6.5.6-1.1.x86_64
as that was released recently.  Maybe that has a problem for your particular use case.
At a guess, it might be related to the use of the noresvport option, as I doubt that it is tested much.

I'll try experimenting...
Comment 3 Janos Miko 2023-10-16 07:18:23 UTC
I tried downgrading the kernel on all the nodes, but that doesn't seem to help.

~ # transactional-update shell

transactional update # zypper install -y --oldpackage https://download.opensuse.org/history/20231008/tumbleweed/repo/oss/x86_64/kernel-default-6.5.4-1.1.x86_64.rpm
transactional update # zypper addlock kernel-default
transactional update # exit

~ # touch /var/run/reboot-required

Then I waited for the reboot.

The nfs-client version is still the same:

~ # zypper info nfs-client
Loading repository data...
Reading installed packages...


Information for package nfs-client:
-----------------------------------
Repository     : openSUSE-Tumbleweed-Oss
Name           : nfs-client
Version        : 2.6.3-39.5
Arch           : x86_64
Vendor         : openSUSE
Installed Size : 874.5 KiB
Installed      : Yes
Status         : out-of-date (version 2.6.3-39.4 installed)
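
The output above lists two different versions, one in the `Version` field and one in the `Status` line. The sketch below pulls both out so they can be compared at a glance. It assumes, based only on the output shown here, that `Version` refers to the repository candidate and that the `Status` line carries the installed version when the package is out of date. The helper name `zypper_info_versions` is hypothetical.

```shell
# Hypothetical helper: read `zypper info <package>` output on stdin and
# print the version from the "Version" field and, if present, the
# installed version mentioned in the "Status" line.
zypper_info_versions() {
  awk -F' *: *' '
    $1 == "Version" { print "repository=" $2 }
    $1 == "Status" && match($2, /version [^ ]+ installed/) {
      # Skip the leading "version " (8 chars) and drop the trailing
      # " installed" (10 chars) to leave just the version string.
      print "installed=" substr($2, RSTART + 8, RLENGTH - 18)
    }
  '
}

# Usage:
#   zypper info nfs-client | zypper_info_versions
```

If the `Status` line does not mention an installed version, only the `repository=` line is printed.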
Comment 4 Janos Miko 2023-10-16 08:30:01 UTC
Hm, forget my previous comment.

After all nodes restarted, it looks like it's working.

When I wrote my previous comment, I had only checked that the pod was running on an already restarted node, which uses the previous kernel. Now all my nodes are actually downgraded and rebooted, and it works.

So the issue occurs with kernel-default-6.5.6-1.1, while kernel-default-6.5.4-1.1 works.
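
A quick way to check which side of this finding a node is on is to classify its running kernel release. This is a rough sketch: the function name `classify_kernel` is hypothetical, and it assumes `uname -r` reports a release string beginning with the upstream version followed by a dash (for example `6.5.6-1-default`). Only the two versions tested in this report are classified; anything else is reported as unknown.

```shell
# Hypothetical helper: classify a kernel release string against the two
# versions tested in this report.
classify_kernel() {
  case "$1" in
    6.5.4-*) echo "known-good" ;;  # reported working after the downgrade
    6.5.6-*) echo "known-bad" ;;   # reported failing with "Protocol not supported"
    *)       echo "unknown" ;;     # not covered by this report
  esac
}

# Classify the kernel that is currently running on this node.
classify_kernel "$(uname -r)"
```

Because the check uses the running kernel, it also catches nodes that have the older package installed but have not yet rebooted into it.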
Comment 5 Neil Brown 2023-10-16 22:36:21 UTC
I believe this bug is fixed by
https://lore.kernel.org/all/20231009145901.99260-1-olga.kornievskaia@gmail.com/

I have submitted that patch, so the next kernel released for Tumbleweed should have the fix. Until then, please use the 6.5.4 kernel.