Bugzilla – Bug 55370
pidof, checkproc, killall etc. hang if a binary from an unavailable NFS mount is running.
Last modified: 2007-07-12 14:27:50 UTC
Started $HOME/bin/titrax (NFS-home), Yast decided it had to restart network -> NFS-Home became unavailable. Network scripts hang at checkproc/killall/pidof:

fix:/tmp # ps xa
  PID TTY      STAT   TIME COMMAND
...
10211 ?        S      0:00 titrax
...
19386 pts/44   S      0:00 /bin/bash /sbin/yast2
19401 pts/44   S      0:00 /usr/lib/YaST2/bin/y2controlcenter
19417 pts/44   S      0:00 /bin/bash /sbin/yast2 lan
19432 pts/44   S      0:02 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19534 pts/44   S      0:00 /bin/bash /usr/lib/YaST2/servers_non_y2/ag_initscripts
19538 pts/44   S      0:00 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19539 pts/44   S      0:00 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19540 pts/44   S      0:00 /bin/bash /etc/init.d/network stop
19666 pts/44   S      0:00 /bin/bash /sbin/ifdown-dhcp all -o rc
19670 pts/44   S      0:00 pidof dhcpcd
19684 pts/44   R+     0:00 ps xa

fix:/tmp # strace -p 19670
Process 19670 attached - interrupt to quit
--- SIGCONT (Continued) @ 0 (0) ---
stat64("/proc/10211/exe", <unfinished ...>

Bad. pidof et al. should at least fail gracefully.
Any ideas what the problem could be?
Most likely these applications try to stat /proc/pid/cwd or something similar for every process, including those that hang on the dead NFS mount.
*** Bug 55369 has been marked as a duplicate of this bug. ***
Stupid me, I should have looked at the bug more closely. It stats /proc/pid/exe, where the executable resides on an NFS partition. Of course this will hang, and there's nothing I can do about it. Either yast shouldn't do an rcnetwork restart (why does it think it needs to mess with all interfaces just because someone added a wlan card?), or ifdown should use dhcpcd's pid file to kill it, rather than stating the executable.
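The pid-file approach suggested in this comment could look roughly like the sketch below. The helper name kill_by_pidfile and the pid file path in the usage note are assumptions for illustration, not code from the sysconfig scripts. The point is that only the pid file is read, so no /proc/pid/exe entry is ever stat'ed.

```shell
# Hypothetical helper: signal a daemon through its pid file instead of pidof.
# Usage: kill_by_pidfile <pidfile> [signal-name]
kill_by_pidfile() {
    _kbp_file=$1
    _kbp_sig=${2:-TERM}
    _kbp_pid=

    # No readable pid file means there is nothing we can safely signal.
    [ -r "$_kbp_file" ] || return 1

    # read returns non-zero on a missing trailing newline, so also accept
    # a non-empty value in that case.
    read -r _kbp_pid < "$_kbp_file" || [ -n "$_kbp_pid" ] || return 1

    # Refuse anything that is not a plain decimal pid.
    case $_kbp_pid in
        ''|*[!0-9]*) return 1 ;;
    esac

    kill -s "$_kbp_sig" "$_kbp_pid" 2>/dev/null
}
```

A call might look like `kill_by_pidfile /var/run/dhcpcd-eth0.pid TERM` (the path is an assumption). One known weakness of this approach is that a stale pid file can point at an unrelated process.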
Assigning to sysconfig maintainer; please discuss with yast folks.
Why can't we fail gracefully if the network is away? I see this in so many places where the machine is dead without any need; this is really annoying. Every developer should be on a defective switch which cuts off his network about 30% of the time, then things like this would improve.
Olaf? Can't NFS finally stop hanging all the time?
This is a feature, folks. If you want NFS to fail if the network goes away, mount your file system with -o soft and expect to lose data or crash your KDE session whenever someone rips out the ethernet cable. You cannot have it both ways.
As this problem only occurs in rare cases, there won't be any changes for SLES9/SL9.1. But for 9.2 I will try to get rid of pidof in ifup-dhcp and additionally provide an 'rcnetwork reload'.
We now have an rcnetwork reload, which should partially solve this. For the rest there is no solution.
Why does pidof need to stat /proc/pid/exe? Also, can't ifup-dhcp get rid of pidof as well?
ifup-dhcp scans /proc/$pid/cmdline to distinguish dhcp clients by the interface they are running on. ifup-dhcp can maybe get rid of the pidofproc calls; one way would be to let it write a pid file (per interface) earlier (it currently does so after getting an address, because that's when it forks). But in ifup-dhcp there are also some checkproc calls, and I'm not sure whether we could get rid of all of them.
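Checking /proc/$pid/cmdline for an interface name only reads procfs, not the filesystem that holds the executable. A minimal sketch of such a check follows, assuming the interface name appears as a separate argument of the dhcp client. cmdline_has_arg is a hypothetical helper and not the code used in ifup-dhcp.

```shell
# Hypothetical helper: succeed if process <pid> was started with <arg>
# as one of its command-line arguments.
cmdline_has_arg() {
    _cha_pid=$1
    _cha_arg=$2

    [ -r "/proc/$_cha_pid/cmdline" ] || return 1

    # Arguments in cmdline are NUL-separated; turn them into lines and
    # look for an exact, fixed-string match of the whole argument.
    tr '\0' '\n' < "/proc/$_cha_pid/cmdline" | grep -qxF -- "$_cha_arg"
}
```

For example, `cmdline_has_arg "$pid" eth0` would tell whether the client with that pid was started for eth0. Note that tr and grep must themselves be reachable without touching the stalled mount.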
Reopened by zoz@suse.de at Fri Sep 17 12:20:42 2004, took initial reporter seife@suse.de to cc
Then let's have a look at it.
But not now.
Do we still plan to work on this? I think part of the problem is solved because roaming machines may use NetworkManager, which doesn't use pidof et al. in the background. (At least I suppose so, since it doesn't use the ifup scripts for DHCP.) Given the decreased priority, I suggest closing the bug as WONTFIX.
Reopening due to bug 187175.
*** Bug 187175 has been marked as a duplicate of this bug. ***
The ironic thing is that pidof returns pids for /asdf/dhcpcd just as for /sbin/dhcpcd, so it doesn't really need to stat the exe... I am in the process of writing replacements for pidof, checkproc and killproc which don't cause NFS hangs.
Good, please just consider fixing those in sysvinit.rpm so we do not diverge very much from the rest of the world ...
No, I'm writing replacements which are compatible but have the limited functionality which is sufficient for sysconfig.
Created attachment 99637: shell code replacing pidof, checkproc, killproc
The attached three functions should do the job. I'll add them to /etc/sysconfig/network/scripts/functions, which is sourced by all sysconfig scripts. They should automatically be used in all places, as far as I can see. I don't know if there are any other binaries that we might need to replace.
adjusting severity to major, so it matches the one of bug 187175
Now that it is (supposed to be) fixed in subversion, it would be good if it gets test coverage. Christian, are you going to submit the current sysconfig code to Factory any time soon? We are in the Alpha stage of 10.2 and now would be the best time to get this in. Thanks.
The changes were submitted to autobuild some time ago.
This change caused a lot of trouble. See bug 213249. I am reverting this change, but will still use the replacement for pidof locally in ifup-dhcp. Werner, can we get an improved pidof some day? Peter, do we really have to use pidof? Is there no other way?
Currently I have no clue how to handle this. All system calls trying to get information about a file on a stalled NFS file system will sleep or block forever. Even alarm() does not wake a sleeping system call; the only method would be to fork() and execve() a second process that does the job, and then read from a pipe(). If the subprocess does not provide the information on the pipe within a time period, simply (SIG)KILL the subprocess and skip the file in the main process, e.g. killproc or pidofproc. But this slows down the boot process a lot.
It's fine to revert the change, with or without the pidof replacement. BTW, I have meanwhile figured that if we really use replacements we should probably name them my_pidof, my_checkproc etc. to avoid confusion, and to keep other scripts from accidentally using them just because they happen to source the sysconfig functions for some reason.
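The idea of pushing the risky call into a helper process and giving up after a deadline can be sketched in shell as follows. This is only an illustration under assumptions: run_with_deadline is a hypothetical name, it reports the result through the exit status rather than a pipe, and the return value 124 is borrowed from timeout(1) as a convention. Whether SIGKILL actually gets rid of a helper stuck on a hard NFS mount depends on the mount options and the kernel, so the parent does not wait for the killed child at all.

```shell
# Hypothetical helper: run a command in the background and stop waiting
# for it after <seconds>.
# Usage: run_with_deadline <seconds> <command> [args...]
run_with_deadline() {
    _rwd_secs=$1
    shift

    "$@" &
    _rwd_child=$!
    _rwd_waited=0

    # Poll once per second while the child still exists.
    while kill -0 "$_rwd_child" 2>/dev/null; do
        if [ "$_rwd_waited" -ge "$_rwd_secs" ]; then
            # Deadline passed: try to kill the child, but do not wait
            # for it, since it may be stuck in an uninterruptible call.
            kill -s KILL "$_rwd_child" 2>/dev/null
            return 124
        fi
        sleep 1
        _rwd_waited=$((_rwd_waited + 1))
    done

    # Child finished in time: hand back its exit status.
    wait "$_rwd_child"
}
```

A caller could then write `run_with_deadline 2 test -e "/proc/$pid/exe"` and skip the process when the result is 124. The one-second polling granularity shows why this approach would slow things down noticeably when applied to every process.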
Removed checkproc() and killproc() and improved pidof(), which is now my_pidof(). my_pidof now does not use 'basename' (in /usr) and gets the executable from /proc/*/exe, because /proc/*/cmdline does not always contain the full path.
Just to note: I have a killproc version around which can test whether the path of the executable is part of an NFS mount. Just have a look into my export directory for killproc-2.12.tar.gz. The programs killproc and checkproc/pidofproc now know about the option -N for testing whether the specified executable is part of an NFS mount. This should also work for symbolic links used for the executable.
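The actual my_pidof is not reproduced in this report, so the following is only a guess at its general shape. It assumes that readlink is available on the root filesystem and that reading the /proc/*/exe link (as opposed to stat'ing its target) does not touch the NFS server, which is my understanding of how procfs behaves. Directory components are stripped with shell parameter expansion instead of basename.

```shell
# Sketch of a pidof replacement (not the sysconfig implementation).
# Prints the pids whose executable matches <name-or-path>; returns 1 if none.
my_pidof() {
    _mp_want=$1
    _mp_pids=

    for _mp_dir in /proc/[0-9]*; do
        # Read the link itself; kernel threads and processes of other
        # users make readlink fail, and we just skip those.
        _mp_exe=$(readlink "$_mp_dir/exe" 2>/dev/null) || continue

        case $_mp_want in
            /*) [ "$_mp_exe" = "$_mp_want" ] || continue ;;       # full path given
            *)  [ "${_mp_exe##*/}" = "$_mp_want" ] || continue ;; # bare name given
        esac

        _mp_pids="${_mp_pids}${_mp_pids:+ }${_mp_dir#/proc/}"
    done

    [ -n "$_mp_pids" ] && echo "$_mp_pids"
}
```

With this sketch, `my_pidof /sbin/dhcpcd` matches only that exact path, while `my_pidof dhcpcd` matches any executable with that file name. Executables that were deleted while running show up with a " (deleted)" suffix in the link and are not matched.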
OK, will try that. But not immediately.
See bug #224563 ... same problem with pidof and killall5 from sysvinit.
I have not tested this killproc version so far. But ifup-dhcp no longer uses pidof or a pidof replacement. It now parses the ifup output. There were other bugs that required changing the parts of ifup-dhcp that used pidof (see bug 282033 and bug 260073). So this problem may finally be considered fixed.