Bug 1219508

Summary: autofs stalls at random times
Product: [openSUSE] openSUSE Tumbleweed Reporter: Martin Wilck <martin.wilck>
Component: BasesystemAssignee: David Disseldorp <ddiss>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: martin.wilck, rgoldwyn, santiago.zarate
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: journal

Description Martin Wilck 2024-02-02 14:26:26 UTC
Created attachment 872404 [details]
journal

I am using autofs to access NFS resources. In irregular intevals, autofs stalls, which results in hanging processes trying to access autofs-mounted resources.

Example:

First a "good" case:

[ 1292.438345] apollon kernel: autofs_mount_wait:254: autofs4:pid:27516:autofs_mount_wait: waiting for mount name=/
[ 1292.438348] apollon kernel: autofs_wait:419: autofs4:pid:27516:autofs_wait: new wait id = 0x00000006, name = 00000000819633e1, nfy=1
[ 1292.438351] apollon kernel: autofs_notify_daemon:92: autofs4:pid:27516:autofs_notify_daemon: wait id = 0x00000006, name = 00000000819633e1, type=5
[ 1291.740237] apollon automount[12337]: handle_packet: type = 5
[ 1291.740335] apollon automount[12337]: handle_packet_missing_direct: token 6, name /net/ananas.local/export/Public, request pid 27516
[ 1291.740396] apollon automount[12337]: attempting to mount entry /net/ananas.local/export/Public
[ 1291.740461] apollon automount[12337]: lookup_mount: lookup(hosts): /net/ananas.local/export/Public -> ananas.local:/export/Public
[ 1291.740530] apollon automount[12337]: parse_mount: parse(sun): expanded entry: ananas.local:/export/Public
[ 1291.740590] apollon automount[12337]: parse_mount: parse(sun): gathered options:
[ 1291.740680] apollon automount[12337]: sun_mount: parse(sun): mounting root /net/ananas.local/export/Public, mountpoint /net/ananas.local/export/Public, what ananas.local:/export/Public, fstype nfs, options nosuid,nodev
[ 1291.740750] apollon automount[12337]: mount(nfs): root=/net/ananas.local/export/Public name=/net/ananas.local/export/Public what=ananas.local:/export/Public, fstype=nfs, options=nosuid,nodev
[ 1291.740839] apollon automount[12337]: mount(nfs): nfs options="nosuid,nodev", nobind=0, nosymlink=0, ro=0
[ 1291.740905] apollon automount[12337]: get_nfs_info: called with host ananas.local(192.168.1.167) proto 6 version 0x20
[ 1291.740959] apollon automount[12337]: get_nfs_info: nfs v3 rpc ping time: 0.000209
[ 1291.741009] apollon automount[12337]: get_nfs_info: host ananas.local cost 209 weight 0
[ 1291.741072] apollon automount[12337]: get_nfs_info: called with host ananas.local(192.168.1.167) proto 17 version 0x20
[ 1291.741123] apollon automount[12337]: prune_host_list: selected subset of hosts that support NFS3 over TCP
[ 1291.741173] apollon automount[12337]: mount_mount: mount(nfs): calling mkdir_path /net/ananas.local/export/Public
[ 1291.741209] apollon automount[12337]: mount(nfs): calling mount -t nfs -s -o nosuid,nodev ananas.local:/export/Public /net/ananas.local/export/Public
[ 1291.820337] apollon automount[12337]: mount_mount: mount(nfs): mounted ananas.local:/export/Public on /net/ananas.local/export/Public
[ 1291.822090] apollon automount[12337]: dev_ioctl_send_ready: token = 6
[ 1291.822434] apollon automount[12337]: mounted /net/ananas.local/export/Public
[ 1292.540540] apollon kernel: autofs_mount_wait:256: autofs4:pid:27516:autofs_mount_wait: mount wait done status=0

Shortly later, a similar mount hangs after the "sun_mount: parse(sun): mounting..." message:

[ 1342.089949] apollon kernel: autofs_mount_wait:254: autofs4:pid:28784:autofs_mount_wait: waiting for mount name=/
[ 1342.089964] apollon kernel: autofs_wait:419: autofs4:pid:28784:autofs_wait: new wait id = 0x00000007, name = 00000000739379e8, nfy=1
[ 1342.089979] apollon kernel: autofs_notify_daemon:92: autofs4:pid:28784:autofs_notify_daemon: wait id = 0x00000007, name = 00000000739379e8, type=5
[ 1342.090224] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=000000007ea364a8 ananas.local
[ 1342.090251] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=00000000739379e8 /
[ 1342.090264] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=00000000739379e8 /
[ 1341.378250] apollon automount[12337]: handle_packet_missing_direct: token 7, name /net/ananas.local/export/Git, request pid 28784
[ 1341.q378421] apollon automount[12337]: attempting to mount entry /net/ananas.local/export/Git
[ 1341.378598] apollon automount[12337]: lookup_mount: lookup(hosts): /net/ananas.local/export/Git -> ananas.local:/export/Git
[ 1341.378758] apollon automount[12337]: parse_mount: parse(sun): expanded entry: ananas.local:/export/Git
[ 1341.378918] apollon automount[12337]: parse_mount: parse(sun): gathered options:
[ 1341.379110] apollon automount[12337]: sun_mount: parse(sun): mounting root /net/ananas.local/export/Git, mountpoint /net/ananas.local/export/Git, what ananas.local:/export/Git, fstype nfs, options nosuid,nodev
<-- nothing -->
[ 3276.455832] apollon automount[12337]: handle_packet: type = 5

Nothing happens until I noticed and tried to restart "autofs.service", which times out, too:

[ 3302.288384] apollon systemd[1]: Stopping Automounts filesystems on demand...
[ 3302.289735] apollon automount[12337]: do_notify_state: signal 15
[ 3302.291839] apollon automount[12337]: master_notify_state_change: sig 15 switching /net from 4 to 5
[ 3482.297728] apollon systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing.
[ 3482.989282] apollon kernel: autofs_dentry_release:114: autofs4:pid:28785:autofs_dentry_release: releasing 0000000002d0d9f7
[ 3482.989361] apollon kernel: autofs_dentry_release:114: autofs4:pid:28785:autofs_dentry_release: releasing 00000000d97fdbcf
...
[ 3482.305113] apollon systemd[1]: autofs.service: Killing process 12337 (automount) with signal SIGKILL.
[ 3482.305203] apollon systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL
[ 3482.305379] apollon systemd[1]: autofs.service: Failed with result 'timeout'.
[ 3482.323666] apollon systemd[1]: Stopped Automounts filesystems on demand.

The subsequent restart fails as well:

[ 3482.364940] apollon systemd[1]: Starting Automounts filesystems on demand...
...
[ 3482.383394] apollon automount[933]: Starting automounter version 5.1.8, master map auto.master
[ 3482.383955] apollon automount[933]: using kernel protocol version 5.05

[ 3483.262099] apollon kernel: autofs_d_manage:426: autofs4:pid:940:autofs_d_manage: dentry=00000000739379e8 /
[ 3483.262108] apollon kernel: autofs_mount_wait:254: autofs4:pid:940:autofs_mount_wait: waiting for mount name=/
[ 3483.262121] apollon kernel: autofs_wait:429: autofs4:pid:940:autofs_wait: existing wait id = 0x00000007, name = 00000000739379e8, nfy=1
[ 3482.579217] apollon automount[933]: parse_mount: parse(sun): dequote(""/export/Git"") -> /export/Git
[ 3482.579445] apollon automount[933]: parse_mapent: parse(sun): gathered options:
[ 3482.579738] apollon automount[933]: parse_mapent: parse(sun): dequote(""ananas.local:/export/Git"") -> ananas.local:/export/Git
[ 3482.579965] apollon automount[933]: update_offset_entry: parse(sun): updated multi-mount offset /export/Git -> ananas.local:/export/Git
...
[ 3482.584814] apollon automount[933]: tree_mapent_mount_offset: mount offset /net/ananas.local/export/Backup
[ 3482.585019] apollon automount[933]: remount_active_mount: trying to re-connect to mount /net/ananas.local/export/Backup
[ 3482.585271] apollon automount[933]: mounted offset on /net/ananas.local/export/Backup with timeout 300, freq 75 seconds
[ 3482.585580] apollon automount[933]: remount_active_mount: re-connected to mount /net/ananas.local/export/Backup
[ 3482.585848] apollon automount[933]: tree_mapent_mount_offset: mount offset /net/ananas.local/export/Git
[ 3557.456823] apollon automount[933]: st_expire: state 0 path /net
[ 3558.142739] apollon kernel: autofs_d_manage:426: autofs4:pid:1058:autofs_d_manage: dentry=0000000063ddaa27 zeus.local
[ 3558.142778] apollon kernel: autofs_d_manage:426: autofs4:pid:1058:autofs_d_manage: dentry=00000000c3906c58 /
[ 3558.142881] apollon kernel: autofs_direct_busy:155: autofs4:pid:1058:autofs_direct_busy: top 00000000c3906c58 /
[ 3557.462849] apollon automount[933]: state.c:690: assertion failed: ap->state == ST_READY
[ 3557.462914] apollon automount[933]: expire_proc: exp_proc = 140512157763264 path /net
[ 3557.462985] apollon automount[933]: expire_proc_indirect: expire /net/zeus.local/mnt/git
[ 3557.463041] apollon automount[933]: expire_proc_indirect: expire /net/zeus.local
[ 3662.547687] apollon systemd[1]: autofs.service: start operation timed out. Terminating.
[ 3842.797526] apollon systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing.
[ 3842.798330] apollon systemd[1]: autofs.service: Killing process 933 (automount) with signal SIGKILL.
[ 3843.479204] apollon kernel: autofs_mount_wait:256: autofs4:pid:940:autofs_mount_wait: mount wait done status=-4
[ 3843.480412] apollon kernel: autofs_dentry_release:114: autofs4:pid:940:autofs_dentry_release: releasing 00000000b3646c6f
[ 3843.480418] apollon kernel: autofs_dentry_release:114: autofs4:pid:940:autofs_dentry_release: releasing 0000000039d15e2f
[ 3842.800405] apollon systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL
[ 3842.800493] apollon systemd[1]: autofs.service: Failed with result 'timeout'.
[ 3842.830035] apollon systemd[1]: Failed to start Automounts filesystems on demand.
[ 3842.831266] apollon systemd[1]: autofs.service: Consumed 4min 45.195s CPU time.

Full journal attached.

At the time of the failure (and afterwards), I see the following related mounts:

apollon:~ # mount -t nfs4
zeus.local:/mnt/git on /net/zeus.local/mnt/git type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.30.40,local_lock=none,addr=192.168.30.20)
ananas.local:/export/vms on /net/ananas.local/export/vms type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.40,local_lock=none,addr=192.168.1.167)
ananas.local:/export/Public on /net/ananas.local/export/Public type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.40,local_lock=none,addr=192.168.1.167)
apollon:~ # mount -t autofs
systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1614)
-hosts on /net type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=170192)
-hosts on /net/zeus.local/home type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192)
-hosts on /net/zeus.local/mnt/git type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192)
-hosts on /net/zeus.local/mnt/nv type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192)
-hosts on /net/ananas.local/export/Backup type autofs (rw,relatime,fd=-1,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=-1)
-hosts on /net/ananas.local/export/Git type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571)
-hosts on /net/ananas.local/export/Media type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571)
-hosts on /net/ananas.local/export/Public type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571)
-hosts on /net/ananas.local/export/Wilck type autofs (rw,relatime,fd=-1,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=-1)
-hosts on /net/ananas.local/export/home type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571)
-hosts on /net/ananas.local/export/vms type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571)

Both nfs servers (ananas.local and zeus.local) are running openSUSE Leap 15.5.
I am seeing this kind of stalls only on my tumbleweed system. For a long time I'd thought that the problems were caused by the SUSE VPN somehow interfering with autofs, but I the case documented here was observed without VPN active.
Comment 1 Martin Wilck 2024-02-02 14:32:43 UTC
I forgot, this is recent tumbleweed

autofs-5.1.8-6.3.x86_64
kernel-default-6.7.2-1.1.x86_64
Comment 2 David Disseldorp 2024-02-09 14:16:07 UTC
Thanks for the report Martin.

(In reply to Martin Wilck from comment #0)
> Created attachment 872404 [details]
> journal
> 
> I am using autofs to access NFS resources. In irregular intevals, autofs
> stalls, which results in hanging processes trying to access autofs-mounted
> resources.

Is this a regression from previous kernel / autofs versions? If so, which ones?
 
From a rough skim of the journal you provided it appears that the kernel autofs_wait() path isn't able to obtain sbi->wq_mutex, but a bit more debugging will be needed to work back from there.
Comment 3 Martin Wilck 2024-02-09 15:17:23 UTC
Do you have any hints how to do more debugging? The problem is sporadic; so far I haven't been able to make it happen voluntarily.

I've compiled automount 5.1.9 locally and I have been running that instead of the TW package for a week now, without seeing the hang even once, which used to occur about 1-2 times a day before.

So it might make sense to test a properly packaged autofs 5.1.9. I haven't had time to create that yet though.
Comment 4 Martin Wilck 2024-02-09 15:21:38 UTC
(In reply to David Disseldorp from comment #2)

> Is this a regression from previous kernel / autofs versions? If so, which
> ones?

Yes, but I can't tell which. It has happened on my TW system for a long time already. I just didn't report it because I thought it might be a local misconfiguration of some sort (which is still sort of possible).

Given that I chatted about this on June 22nd in the #discuss-support-network-filesystems channel, I've been seeing the issue since ~2 years. My Leap systems that are in the same network and use the same autofs setup don't show this behavior though.
Comment 5 David Disseldorp 2024-02-10 12:55:16 UTC
(In reply to Martin Wilck from comment #3)
> Do you have any hints how to do more debugging? The problem is sporadic; so
> far I haven't been able to make it happen voluntarily.

A crash dump at the time of the apparent deadlock might be helpful, you could also run automount under strace, but both options wouldn't exactly be unobtrusive.

> I've compiled automount 5.1.9 locally and I have been running that instead
> of the TW package for a week now, without seeing the hang even once, which
> used to occur about 1-2 times a day before.

That sounds promising.

> 
> So it might make sense to test a properly packaged autofs 5.1.9. I haven't
> had time to create that yet though.

Sounds reasonable. If confirmed, it'd also give us some good data-points for bisect. I'll put something together on the OBS.
Comment 6 David Disseldorp 2024-02-12 10:39:09 UTC
(In reply to David Disseldorp from comment #5)
> (In reply to Martin Wilck from comment #3)
...
> > So it might make sense to test a properly packaged autofs 5.1.9. I haven't
> > had time to create that yet though.
> 
> Sounds reasonable. If confirmed, it'd also give us some good data-points for
> bisect. I'll put something together on the OBS.

I've rebased our Factory sources against upstream 5.1.9:

https://build.opensuse.org/package/show/home:dmdiss:autofs_519/autofs

Please give it a spin if you have a moment for testing.
Comment 7 Martin Wilck 2024-02-12 11:14:39 UTC
Thanks, package is installed and running.
Comment 8 Martin Wilck 2024-02-13 10:37:53 UTC
No hang in 24 hours.
Comment 9 Martin Wilck 2024-02-16 13:43:35 UTC
No hang in 4 days, even though I also activated autofs via vpn and sssd (which I'd deactivated before because I'd suspected it was causing the issues).

I vote for pushing this package to factory.
Comment 10 David Disseldorp 2024-02-19 01:40:06 UTC
(In reply to Martin Wilck from comment #9)
> No hang in 4 days, even though I also activated autofs via vpn and sssd
> (which I'd deactivated before because I'd suspected it was causing the
> issues).
> 
> I vote for pushing this package to factory.

Proposed for factory via https://build.opensuse.org/request/show/1147480

Thanks a lot for the test feedback, Martin.

@Goldwyn: flagging needinfo, as you're listed as sole bugowner / reviewer. Feel free to additionally assign me those roles if you like.
Comment 11 David Disseldorp 2024-02-21 03:37:52 UTC
(In reply to David Disseldorp from comment #10)
...
> Proposed for factory via https://build.opensuse.org/request/show/1147480
> 
> Thanks a lot for the test feedback, Martin.
> 
> @Goldwyn: flagging needinfo, as you're listed as sole bugowner / reviewer.
> Feel free to additionally assign me those roles if you like.

Merged, thanks - closing...