Bug 55370 (suse40370) - pidof, checkproc, killall etc hang if a binary from a not available NFS mount is running.
Status: RESOLVED FIXED
Duplicates: 55369
Alias: suse40370
Product: openSUSE 10.3
Classification: openSUSE
Component: Network
Version: unspecified
Hardware: All Linux
Priority: P3 - Medium
Severity: Normal
Target Milestone: ---
Assignee: Christian Zoz
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2004-05-12 00:47 UTC by Christian Zoz
Modified: 2007-07-12 14:27 UTC
CC List: 3 users

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
shell code replacing pidof, checkproc, killproc (1.97 KB, text/plain)
2006-09-26 13:59 UTC, Peter Poeml

Description Forgotten User ZhJd0F0L3x 2004-05-12 00:47:09 UTC
I started $HOME/bin/titrax (home directory on NFS), then YaST decided it had
to restart the network, so the NFS home became unavailable. The network
scripts hang in checkproc/killall/pidof:

fix:/tmp # ps xa
  PID TTY      STAT   TIME COMMAND
...
10211 ?        S      0:00 titrax
...
19386 pts/44   S      0:00 /bin/bash /sbin/yast2
19401 pts/44   S      0:00 /usr/lib/YaST2/bin/y2controlcenter
19417 pts/44   S      0:00 /bin/bash /sbin/yast2 lan
19432 pts/44   S      0:02 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19534 pts/44   S      0:00 /bin/bash /usr/lib/YaST2/servers_non_y2/ag_initscripts
19538 pts/44   S      0:00 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19539 pts/44   S      0:00 /usr/lib/YaST2/bin/y2base lan qt -geometry 800x600
19540 pts/44   S      0:00 /bin/bash /etc/init.d/network stop
19666 pts/44   S      0:00 /bin/bash /sbin/ifdown-dhcp all -o rc
19670 pts/44   S      0:00 pidof dhcpcd
19684 pts/44   R+     0:00 ps xa
fix:/tmp # strace -p 19670
Process 19670 attached - interrupt to quit
--- SIGCONT (Continued) @ 0 (0) ---
stat64("/proc/10211/exe",  <unfinished ...>

Bad. pidof et al. should at least fail gracefully.
Comment 1 Forgotten User ZhJd0F0L3x 2004-05-12 00:47:09 UTC
Comment 2 Thorsten Kukuk 2004-05-12 17:59:59 UTC
Any ideas what the problem could be? 
Comment 3 Olaf Kirch 2004-05-12 19:02:19 UTC
Most likely these applications try to stat /proc/pid/cwd or something
similar for every process, including those that hang on the dead
NFS mount.
Comment 4 Olaf Kirch 2004-05-12 19:03:18 UTC
*** Bug 55369 has been marked as a duplicate of this bug. ***
Comment 5 Olaf Kirch 2004-05-12 19:09:36 UTC
Stupid me, I should have looked at the bug more closely.
It stats /proc/pid/exe, where the executable resides on an NFS
partition. Of course this will hang, and there's nothing I can
do about it.
 
Either YaST shouldn't do an rcnetwork restart (why does it think it
needs to mess with all interfaces just because someone added a WLAN
card?).
 
Or ifdown should use dhcpcd's PID file to kill it, rather than stat()ing
the executable.
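Olaf's second suggestion can be sketched in Python. This is a minimal illustration, not the sysconfig code: the function name pid_from_pidfile and the proc_root parameter (added so the lookup can be tested against a fake /proc) are hypothetical. It reads the PID from the pid file and compares argv[0] from /proc/<pid>/cmdline to guard against a reused PID, so it avoids the stat64() on /proc/<pid>/exe seen in the strace above. Reading cmdline is assumed, not guaranteed, to be safe while the process is blocked on NFS.

```python
import os


def pid_from_pidfile(pidfile, expected_name, proc_root="/proc"):
    """Return the PID stored in `pidfile` if that process still runs a
    program named `expected_name`, otherwise None.

    Only the pid file and /proc/<pid>/cmdline are read; the executable
    behind /proc/<pid>/exe is never stat()ed."""
    try:
        with open(pidfile) as f:
            pid = int(f.read().split()[0])
    except (OSError, ValueError, IndexError):
        return None  # no pid file, empty file, or garbage in it
    try:
        with open(os.path.join(proc_root, str(pid), "cmdline"), "rb") as f:
            argv0 = f.read().split(b"\0", 1)[0].decode(errors="replace")
    except OSError:
        return None  # stale pid file: the process is gone
    # Guard against PID reuse by comparing the program name in argv[0].
    return pid if os.path.basename(argv0) == expected_name else None
```

ifdown could then call pid_from_pidfile with dhcpcd's pid file and, if the result is not None, send SIGTERM to it with os.kill(). The actual pid file location (for example something like /var/run/dhcpcd-eth0.pid) is an assumption and is not stated in this report.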
Comment 6 Olaf Kirch 2004-05-12 19:11:07 UTC
Assigning to the sysconfig maintainer; please discuss with the YaST folks.
Comment 7 Forgotten User ZhJd0F0L3x 2004-05-12 19:18:22 UTC
Why can't we fail gracefully if the network is gone? I see this in so many
places where the machine hangs for no good reason; this is really annoying.
Every developer should be on a defective switch that cuts off his network
about 30% of the time; then things like this would improve.
Comment 8 Michal Svec 2004-05-12 19:58:37 UTC
Olaf? Can't NFS finally stop hanging all the time?
Comment 9 Olaf Kirch 2004-05-12 20:32:12 UTC
This is a feature, folks. If you want NFS to fail when the network
goes away, mount your file system with -o soft and expect to lose
data or crash your KDE session whenever someone rips out the Ethernet
cable.
 
You cannot have it both ways. 
Comment 10 Christian Zoz 2004-06-01 16:02:27 UTC
As this problem only occurs in rare cases, there won't be any changes for
SLES9/SL9.1. But for 9.2 I will try to get rid of pidof in ifup-dhcp and
additionally provide an 'rcnetwork reload'.
Comment 11 Christian Zoz 2004-09-13 22:13:16 UTC
We now have an rcnetwork reload, which should partially solve this. For the
rest there is no solution.
Comment 12 Michal Svec 2004-09-13 22:32:06 UTC
Why does pidof need to stat /proc/pid/exe?

Also, can't ifup-dhcp get rid of pidof as well?
Comment 13 Peter Poeml 2004-09-17 00:15:43 UTC
ifup-dhcp scans /proc/$pid/cmdline to distinguish DHCP clients by the
interface they are running on.
ifup-dhcp could perhaps get rid of the pidofproc calls; one way would be to
let it write a PID file (per interface) earlier (it currently does so only
after getting an address, because that's when it forks).
But ifup-dhcp also contains some checkproc calls, and I'm not sure
whether we could get rid of all of them.
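The /proc/$pid/cmdline scan described above can be sketched in Python. It is an illustration under assumptions, not the code in ifup-dhcp: the function name dhcp_client_pids, the default client name "dhcpcd", the proc_root parameter, and the assumption that the interface name appears as a separate command-line argument are all hypothetical. It only reads cmdline and never touches /proc/<pid>/exe.

```python
import os


def dhcp_client_pids(interface, client="dhcpcd", proc_root="/proc"):
    """Return the sorted PIDs whose argv[0] names `client` and whose
    remaining arguments include `interface`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip /proc/self, /proc/net, and other non-PID entries
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were scanning
        # cmdline is NUL-separated; kernel threads have an empty cmdline.
        argv = [a.decode(errors="replace") for a in raw.split(b"\0") if a]
        if argv and os.path.basename(argv[0]) == client and interface in argv[1:]:
            pids.append(int(entry))
    return sorted(pids)
```

Matching on the last path component of argv[0] means a client started as "/sbin/dhcpcd" or simply "dhcpcd" is found either way; the interface check is what lets several clients on different interfaces be told apart.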
Comment 14 Christian Zoz 2004-09-17 18:20:42 UTC
Reopened by zoz@suse.de at Fri Sep 17 12:20:42 2004; took initial reporter seife@suse.de to CC.
Comment 15 Christian Zoz 2004-09-17 18:20:42 UTC
Then let's have a look at it.
Comment 16 Christian Zoz 2004-09-17 18:22:03 UTC
But not now.
Comment 17 Peter Poeml 2006-05-31 09:43:23 UTC
Do we still plan to work on this? I think part of the problem is solved
because roaming machines may use NetworkManager, which doesn't use pidof
et al. in the background. (At least I suppose so, since it doesn't use
the ifup scripts for DHCP.) Given the decreased priority, I suggest
closing the bug as WONTFIX.
Comment 18 Peter Poeml 2006-09-26 10:38:48 UTC
Reopening due to bug 187175.
Comment 19 Peter Poeml 2006-09-26 10:41:28 UTC
*** Bug 187175 has been marked as a duplicate of this bug. ***
Comment 20 Peter Poeml 2006-09-26 13:23:09 UTC
The ironic thing is that pidof returns PIDs for /asdf/dhcpcd just as
for /sbin/dhcpcd, so it doesn't really need to stat the exe...

I am in the process of writing replacements for pidof, checkproc and
killproc which don't cause NFS hangs. 
Comment 21 Michal Svec 2006-09-26 13:31:04 UTC
Good; please consider fixing those in sysvinit.rpm instead, so we do not diverge too much from the rest of the world.
Comment 22 Peter Poeml 2006-09-26 13:47:11 UTC
No, I'm writing replacements that are compatible but have only the
limited functionality that sysconfig needs.
Comment 23 Peter Poeml 2006-09-26 13:59:26 UTC
Created attachment 99637
shell code replacing pidof, checkproc, killproc

The three attached functions should do the job.

I'll add them to /etc/sysconfig/network/scripts/functions, which is
sourced by all sysconfig scripts. As far as I can see, they should then
automatically be used in all places.

I don't know if there are any other binaries that we might need to
replace.
Comment 24 Peter Poeml 2006-09-26 14:32:00 UTC
Adjusting severity to major, so that it matches that of bug 187175.
Comment 25 Peter Poeml 2006-10-09 08:37:05 UTC
Now that it is (supposed to be) fixed in Subversion, it would be good
to get test coverage for it. Christian, are you going to submit the
current sysconfig code to Factory any time soon? We are in the Alpha stage
of 10.2, and now would be the best time to get this in. Thanks.
Comment 26 Christian Zoz 2006-10-26 11:41:45 UTC
The changes were submitted to autobuild some time ago.
Comment 27 Christian Zoz 2006-11-17 17:10:47 UTC
This change caused much trouble. See bug 213249.

I am reverting this change, but will still use the replacement for pidof locally in ifup-dhcp.

Werner, can we get an improved pidof some day?

Peter, do we really have to use pidof? Is there no other way?
Comment 28 Dr. Werner Fink 2006-11-17 17:31:48 UTC
Currently I have no idea how to handle this. All system calls that
try to get information about a file on a stalled NFS file system
will sleep or stay blocked forever. Even alarm() does not wake a
sleeping system call. The only method would be to fork() and execve()
a second process that does the job, and then read its result from a
pipe(). If the subprocess does not provide the information on the pipe
within a given time period, simply (SIG)KILL the subprocess and skip
the file in the main process, e.g. in killproc or pidofproc. But this
slows down the boot process a lot.
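The fork-and-pipe workaround Werner describes could look roughly like this in Python. It is a sketch under assumptions: the name stat_with_timeout and its 2-second default are hypothetical, and the child simply calls os.stat() after fork() instead of execve()ing a separate helper program. The parent waits on the pipe with select(); if no answer arrives in time, it sends SIGKILL and reports the file as unknown so the caller can skip it.

```python
import os
import select
import signal


def stat_with_timeout(path, timeout=2.0):
    """Check whether `path` exists by running os.stat() in a forked child.

    Returns True or False when the child answers within `timeout` seconds,
    or None when it does not (for example, because it is stuck on a stalled
    NFS mount)."""
    r, w = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: perform the call that may block, report one byte, and exit
        # without ever returning into the parent's code path.
        try:
            os.close(r)
            try:
                os.stat(path)
                reply = b"1"
            except OSError:
                reply = b"0"
            os.write(w, reply)
        finally:
            os._exit(0)
    # Parent: wait for the reply, but only for `timeout` seconds.
    os.close(w)
    try:
        ready, _, _ = select.select([r], [], [], timeout)
        reply = os.read(r, 1) if ready else b""
    finally:
        os.close(r)
    if reply:
        os.waitpid(pid, 0)  # the child has answered and is exiting: reap it
        return reply == b"1"
    # No answer: kill the child and skip the file. Reap it only if it is
    # already gone, since a child blocked in the kernel may not die at once.
    os.kill(pid, signal.SIGKILL)
    os.waitpid(pid, os.WNOHANG)
    return None
```

The cost Werner mentions is visible here: every file that has to be checked costs a fork(), and every stalled file costs the full timeout before the caller can move on.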
Comment 29 Peter Poeml 2006-11-18 09:46:15 UTC
It's fine to revert the change, with or without pidof replacement.

BTW, I have since realized that if we really use replacements, we should
probably name them my_pidof, my_checkproc, etc., to avoid confusion and
to keep other scripts from using them unknowingly just because they
happen to source the sysconfig functions for some reason.
Comment 30 Christian Zoz 2006-11-20 11:48:55 UTC
Removed checkproc() and killproc(), and improved pidof(), which is now my_pidof().

my_pidof no longer uses 'basename' (which lives in /usr) and gets the executable from /proc/*/exe, because /proc/*/cmdline does not always contain the full path.
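The /proc/*/exe approach can be sketched in Python as well. This is not the shell function from sysconfig; the name pidof_by_exe_link and the proc_root parameter are hypothetical. It calls readlink() on /proc/<pid>/exe, which returns the link text instead of following the link the way the stat64() in the original strace does. That the kernel can produce this text without contacting a dead NFS server is an assumption here, not something verified in this report.

```python
import os


def pidof_by_exe_link(exe_path, proc_root="/proc"):
    """Return the sorted PIDs whose /proc/<pid>/exe link text equals the
    full path `exe_path`. The link is read, never followed or stat()ed."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-PID entries such as /proc/self
        try:
            target = os.readlink(os.path.join(proc_root, entry, "exe"))
        except OSError:
            continue  # kernel thread, exited process, or no permission
        if target == exe_path:
            pids.append(int(entry))
    return sorted(pids)
```

Comparing full paths this way sidesteps the problem that argv[0] in cmdline may be a bare name, which is the reason given above for switching to /proc/*/exe.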
Comment 31 Dr. Werner Fink 2006-11-28 14:47:25 UTC
Just to note: I have a killproc version that can test whether the path
of the executable is on an NFS file system. Have a look in my
export directory for killproc-2.12.tar.gz. The programs killproc and
checkproc/pidofproc now support the option -N, which tests whether the
specified executable is on NFS. This should also work for
symbolic links used for the executable.
Comment 32 Christian Zoz 2006-11-28 15:08:21 UTC
OK, will try that. But not immediately.
Comment 33 Dr. Werner Fink 2006-12-13 11:17:37 UTC
See bug #224563: same problem with pidof and killall5 from sysvinit.
Comment 34 Christian Zoz 2007-07-12 14:27:50 UTC
I have not tested this killproc version so far. But ifup-dhcp no longer uses either pidof or a pidof replacement; it now parses the ifup output. Other bugs required changes to the parts of ifup-dhcp that used pidof (see bug 282033 and bug 260073).

So this problem may be considered fixed finally.