|
Bugzilla – Full Text Bug Listing |
| Summary: | autofs stalls at random times | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Martin Wilck <martin.wilck> |
| Component: | Basesystem | Assignee: | David Disseldorp <ddiss> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | martin.wilck, rgoldwyn, santiago.zarate |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | Other | ||
| OS: | Other | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | journal | ||
I forgot, this is recent tumbleweed autofs-5.1.8-6.3.x86_64 kernel-default-6.7.2-1.1.x86_64 Thanks for the report Martin. (In reply to Martin Wilck from comment #0) > Created attachment 872404 [details] > journal > > I am using autofs to access NFS resources. In irregular intevals, autofs > stalls, which results in hanging processes trying to access autofs-mounted > resources. Is this a regression from previous kernel / autofs versions? If so, which ones? From a rough skim of the journal you provided it appears that the kernel autofs_wait() path isn't able to obtain sbi->wq_mutex, but a bit more debugging will be needed to work back from there. Do you have any hints how to do more debugging? The problem is sporadic; so far I haven't been able to make it happen voluntarily. I've compiled automount 5.1.9 locally and I have been running that instead of the TW package for a week now, without seeing the hang even once, which used to occur about 1-2 times a day before. So it might make sense to test a properly packaged autofs 5.1.9. I haven't had time to create that yet though. (In reply to David Disseldorp from comment #2) > Is this a regression from previous kernel / autofs versions? If so, which > ones? Yes, but I can't tell which. It has happened on my TW system for a long time already. I just didn't report it because I thought it might be a local misconfiguration of some sort (which is still sort of possible). Given that I chatted about this on June 22nd in the #discuss-support-network-filesystems channel, I've been seeing the issue since ~2 years. My Leap systems that are in the same network and use the same autofs setup don't show this behavior though. (In reply to Martin Wilck from comment #3) > Do you have any hints how to do more debugging? The problem is sporadic; so > far I haven't been able to make it happen voluntarily. A crash dump at the time of the apparent deadlock might be helpful, you could also run automount under strace, but both options wouldn't exactly be unobtrusive. > I've compiled automount 5.1.9 locally and I have been running that instead > of the TW package for a week now, without seeing the hang even once, which > used to occur about 1-2 times a day before. That sounds promising. > > So it might make sense to test a properly packaged autofs 5.1.9. I haven't > had time to create that yet though. Sounds reasonable. If confirmed, it'd also give us some good data-points for bisect. I'll put something together on the OBS. (In reply to David Disseldorp from comment #5) > (In reply to Martin Wilck from comment #3) ... > > So it might make sense to test a properly packaged autofs 5.1.9. I haven't > > had time to create that yet though. > > Sounds reasonable. If confirmed, it'd also give us some good data-points for > bisect. I'll put something together on the OBS. I've rebased our Factory sources against upstream 5.1.9: https://build.opensuse.org/package/show/home:dmdiss:autofs_519/autofs Please give it a spin if you have a moment for testing. Thanks, package is installed and running. No hang in 24 hours. No hang in 4 days, even though I also activated autofs via vpn and sssd (which I'd deactivated before because I'd suspected it was causing the issues). I vote for pushing this package to factory. (In reply to Martin Wilck from comment #9) > No hang in 4 days, even though I also activated autofs via vpn and sssd > (which I'd deactivated before because I'd suspected it was causing the > issues). > > I vote for pushing this package to factory. Proposed for factory via https://build.opensuse.org/request/show/1147480 Thanks a lot for the test feedback, Martin. @Goldwyn: flagging needinfo, as you're listed as sole bugowner / reviewer. Feel free to additionally assign me those roles if you like. (In reply to David Disseldorp from comment #10) ... > Proposed for factory via https://build.opensuse.org/request/show/1147480 > > Thanks a lot for the test feedback, Martin. > > @Goldwyn: flagging needinfo, as you're listed as sole bugowner / reviewer. > Feel free to additionally assign me those roles if you like. Merged, thanks - closing... |
Created attachment 872404 [details] journal I am using autofs to access NFS resources. In irregular intevals, autofs stalls, which results in hanging processes trying to access autofs-mounted resources. Example: First a "good" case: [ 1292.438345] apollon kernel: autofs_mount_wait:254: autofs4:pid:27516:autofs_mount_wait: waiting for mount name=/ [ 1292.438348] apollon kernel: autofs_wait:419: autofs4:pid:27516:autofs_wait: new wait id = 0x00000006, name = 00000000819633e1, nfy=1 [ 1292.438351] apollon kernel: autofs_notify_daemon:92: autofs4:pid:27516:autofs_notify_daemon: wait id = 0x00000006, name = 00000000819633e1, type=5 [ 1291.740237] apollon automount[12337]: handle_packet: type = 5 [ 1291.740335] apollon automount[12337]: handle_packet_missing_direct: token 6, name /net/ananas.local/export/Public, request pid 27516 [ 1291.740396] apollon automount[12337]: attempting to mount entry /net/ananas.local/export/Public [ 1291.740461] apollon automount[12337]: lookup_mount: lookup(hosts): /net/ananas.local/export/Public -> ananas.local:/export/Public [ 1291.740530] apollon automount[12337]: parse_mount: parse(sun): expanded entry: ananas.local:/export/Public [ 1291.740590] apollon automount[12337]: parse_mount: parse(sun): gathered options: [ 1291.740680] apollon automount[12337]: sun_mount: parse(sun): mounting root /net/ananas.local/export/Public, mountpoint /net/ananas.local/export/Public, what ananas.local:/export/Public, fstype nfs, options nosuid,nodev [ 1291.740750] apollon automount[12337]: mount(nfs): root=/net/ananas.local/export/Public name=/net/ananas.local/export/Public what=ananas.local:/export/Public, fstype=nfs, options=nosuid,nodev [ 1291.740839] apollon automount[12337]: mount(nfs): nfs options="nosuid,nodev", nobind=0, nosymlink=0, ro=0 [ 1291.740905] apollon automount[12337]: get_nfs_info: called with host ananas.local(192.168.1.167) proto 6 version 0x20 [ 1291.740959] apollon automount[12337]: get_nfs_info: nfs v3 rpc ping time: 0.000209 [ 1291.741009] apollon automount[12337]: get_nfs_info: host ananas.local cost 209 weight 0 [ 1291.741072] apollon automount[12337]: get_nfs_info: called with host ananas.local(192.168.1.167) proto 17 version 0x20 [ 1291.741123] apollon automount[12337]: prune_host_list: selected subset of hosts that support NFS3 over TCP [ 1291.741173] apollon automount[12337]: mount_mount: mount(nfs): calling mkdir_path /net/ananas.local/export/Public [ 1291.741209] apollon automount[12337]: mount(nfs): calling mount -t nfs -s -o nosuid,nodev ananas.local:/export/Public /net/ananas.local/export/Public [ 1291.820337] apollon automount[12337]: mount_mount: mount(nfs): mounted ananas.local:/export/Public on /net/ananas.local/export/Public [ 1291.822090] apollon automount[12337]: dev_ioctl_send_ready: token = 6 [ 1291.822434] apollon automount[12337]: mounted /net/ananas.local/export/Public [ 1292.540540] apollon kernel: autofs_mount_wait:256: autofs4:pid:27516:autofs_mount_wait: mount wait done status=0 Shortly later, a similar mount hangs after the "sun_mount: parse(sun): mounting..." message: [ 1342.089949] apollon kernel: autofs_mount_wait:254: autofs4:pid:28784:autofs_mount_wait: waiting for mount name=/ [ 1342.089964] apollon kernel: autofs_wait:419: autofs4:pid:28784:autofs_wait: new wait id = 0x00000007, name = 00000000739379e8, nfy=1 [ 1342.089979] apollon kernel: autofs_notify_daemon:92: autofs4:pid:28784:autofs_notify_daemon: wait id = 0x00000007, name = 00000000739379e8, type=5 [ 1342.090224] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=000000007ea364a8 ananas.local [ 1342.090251] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=00000000739379e8 / [ 1342.090264] apollon kernel: autofs_d_manage:426: autofs4:pid:12348:autofs_d_manage: dentry=00000000739379e8 / [ 1341.378250] apollon automount[12337]: handle_packet_missing_direct: token 7, name /net/ananas.local/export/Git, request pid 28784 [ 1341.q378421] apollon automount[12337]: attempting to mount entry /net/ananas.local/export/Git [ 1341.378598] apollon automount[12337]: lookup_mount: lookup(hosts): /net/ananas.local/export/Git -> ananas.local:/export/Git [ 1341.378758] apollon automount[12337]: parse_mount: parse(sun): expanded entry: ananas.local:/export/Git [ 1341.378918] apollon automount[12337]: parse_mount: parse(sun): gathered options: [ 1341.379110] apollon automount[12337]: sun_mount: parse(sun): mounting root /net/ananas.local/export/Git, mountpoint /net/ananas.local/export/Git, what ananas.local:/export/Git, fstype nfs, options nosuid,nodev <-- nothing --> [ 3276.455832] apollon automount[12337]: handle_packet: type = 5 Nothing happens until I noticed and tried to restart "autofs.service", which times out, too: [ 3302.288384] apollon systemd[1]: Stopping Automounts filesystems on demand... [ 3302.289735] apollon automount[12337]: do_notify_state: signal 15 [ 3302.291839] apollon automount[12337]: master_notify_state_change: sig 15 switching /net from 4 to 5 [ 3482.297728] apollon systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing. [ 3482.989282] apollon kernel: autofs_dentry_release:114: autofs4:pid:28785:autofs_dentry_release: releasing 0000000002d0d9f7 [ 3482.989361] apollon kernel: autofs_dentry_release:114: autofs4:pid:28785:autofs_dentry_release: releasing 00000000d97fdbcf ... [ 3482.305113] apollon systemd[1]: autofs.service: Killing process 12337 (automount) with signal SIGKILL. [ 3482.305203] apollon systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL [ 3482.305379] apollon systemd[1]: autofs.service: Failed with result 'timeout'. [ 3482.323666] apollon systemd[1]: Stopped Automounts filesystems on demand. The subsequent restart fails as well: [ 3482.364940] apollon systemd[1]: Starting Automounts filesystems on demand... ... [ 3482.383394] apollon automount[933]: Starting automounter version 5.1.8, master map auto.master [ 3482.383955] apollon automount[933]: using kernel protocol version 5.05 [ 3483.262099] apollon kernel: autofs_d_manage:426: autofs4:pid:940:autofs_d_manage: dentry=00000000739379e8 / [ 3483.262108] apollon kernel: autofs_mount_wait:254: autofs4:pid:940:autofs_mount_wait: waiting for mount name=/ [ 3483.262121] apollon kernel: autofs_wait:429: autofs4:pid:940:autofs_wait: existing wait id = 0x00000007, name = 00000000739379e8, nfy=1 [ 3482.579217] apollon automount[933]: parse_mount: parse(sun): dequote(""/export/Git"") -> /export/Git [ 3482.579445] apollon automount[933]: parse_mapent: parse(sun): gathered options: [ 3482.579738] apollon automount[933]: parse_mapent: parse(sun): dequote(""ananas.local:/export/Git"") -> ananas.local:/export/Git [ 3482.579965] apollon automount[933]: update_offset_entry: parse(sun): updated multi-mount offset /export/Git -> ananas.local:/export/Git ... [ 3482.584814] apollon automount[933]: tree_mapent_mount_offset: mount offset /net/ananas.local/export/Backup [ 3482.585019] apollon automount[933]: remount_active_mount: trying to re-connect to mount /net/ananas.local/export/Backup [ 3482.585271] apollon automount[933]: mounted offset on /net/ananas.local/export/Backup with timeout 300, freq 75 seconds [ 3482.585580] apollon automount[933]: remount_active_mount: re-connected to mount /net/ananas.local/export/Backup [ 3482.585848] apollon automount[933]: tree_mapent_mount_offset: mount offset /net/ananas.local/export/Git [ 3557.456823] apollon automount[933]: st_expire: state 0 path /net [ 3558.142739] apollon kernel: autofs_d_manage:426: autofs4:pid:1058:autofs_d_manage: dentry=0000000063ddaa27 zeus.local [ 3558.142778] apollon kernel: autofs_d_manage:426: autofs4:pid:1058:autofs_d_manage: dentry=00000000c3906c58 / [ 3558.142881] apollon kernel: autofs_direct_busy:155: autofs4:pid:1058:autofs_direct_busy: top 00000000c3906c58 / [ 3557.462849] apollon automount[933]: state.c:690: assertion failed: ap->state == ST_READY [ 3557.462914] apollon automount[933]: expire_proc: exp_proc = 140512157763264 path /net [ 3557.462985] apollon automount[933]: expire_proc_indirect: expire /net/zeus.local/mnt/git [ 3557.463041] apollon automount[933]: expire_proc_indirect: expire /net/zeus.local [ 3662.547687] apollon systemd[1]: autofs.service: start operation timed out. Terminating. [ 3842.797526] apollon systemd[1]: autofs.service: State 'stop-sigterm' timed out. Killing. [ 3842.798330] apollon systemd[1]: autofs.service: Killing process 933 (automount) with signal SIGKILL. [ 3843.479204] apollon kernel: autofs_mount_wait:256: autofs4:pid:940:autofs_mount_wait: mount wait done status=-4 [ 3843.480412] apollon kernel: autofs_dentry_release:114: autofs4:pid:940:autofs_dentry_release: releasing 00000000b3646c6f [ 3843.480418] apollon kernel: autofs_dentry_release:114: autofs4:pid:940:autofs_dentry_release: releasing 0000000039d15e2f [ 3842.800405] apollon systemd[1]: autofs.service: Main process exited, code=killed, status=9/KILL [ 3842.800493] apollon systemd[1]: autofs.service: Failed with result 'timeout'. [ 3842.830035] apollon systemd[1]: Failed to start Automounts filesystems on demand. [ 3842.831266] apollon systemd[1]: autofs.service: Consumed 4min 45.195s CPU time. Full journal attached. At the time of the failure (and afterwards), I see the following related mounts: apollon:~ # mount -t nfs4 zeus.local:/mnt/git on /net/zeus.local/mnt/git type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.30.40,local_lock=none,addr=192.168.30.20) ananas.local:/export/vms on /net/ananas.local/export/vms type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.40,local_lock=none,addr=192.168.1.167) ananas.local:/export/Public on /net/ananas.local/export/Public type nfs4 (rw,nosuid,nodev,relatime,vers=4.2,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=192.168.1.40,local_lock=none,addr=192.168.1.167) apollon:~ # mount -t autofs systemd-1 on /proc/sys/fs/binfmt_misc type autofs (rw,relatime,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=1614) -hosts on /net type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,indirect,pipe_ino=170192) -hosts on /net/zeus.local/home type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192) -hosts on /net/zeus.local/mnt/git type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192) -hosts on /net/zeus.local/mnt/nv type autofs (rw,relatime,fd=6,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=170192) -hosts on /net/ananas.local/export/Backup type autofs (rw,relatime,fd=-1,pgrp=933,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=-1) -hosts on /net/ananas.local/export/Git type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571) -hosts on /net/ananas.local/export/Media type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571) -hosts on /net/ananas.local/export/Public type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571) -hosts on /net/ananas.local/export/Wilck type autofs (rw,relatime,fd=-1,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=-1) -hosts on /net/ananas.local/export/home type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571) -hosts on /net/ananas.local/export/vms type autofs (rw,relatime,fd=6,pgrp=12337,timeout=300,minproto=5,maxproto=5,offset,pipe_ino=71571) Both nfs servers (ananas.local and zeus.local) are running openSUSE Leap 15.5. I am seeing this kind of stalls only on my tumbleweed system. For a long time I'd thought that the problems were caused by the SUSE VPN somehow interfering with autofs, but I the case documented here was observed without VPN active.