Bugzilla – Bug 387202
nscd keeps crashing in mem.c
Last modified: 2010-02-03 09:19:12 UTC
Hi,

on my machine, nscd always crashes with an assertion after some time:

25033: handle_request: request received (Version = 2) from PID 25253
25033: GETFDPW
25033: provide access to FD 5, for passwd
25033: Reloading "0" in password cache!
25033: Reloading "10020" in password cache!
25033: remove GETPWBYNAME entry "mmarek"
25033: remove GETPWBYUID entry "10020"
nscd: mem.c:399: gc: Assertion `next_hash == &he[db->head->nentries]' failed.

or

25991: handle_request: request received (Version = 2) from PID 26107
25991: GETPWBYNAME (nobody)
25991: Haven't found "nobody" in password cache!
25991: Reloading "mmarek" in password cache!
25991: remove GETPWBYNAME entry "mmarek"
25991: remove GETPWBYUID entry "10020"
nscd: mem.c:392: gc: Assertion `off_alloc == off_allocend' failed.
Created attachment 212695 [details]
nscd log

Log output from the last run. I did

rm /var/run/nscd/*
/usr/sbin/nscd -d 2>&1 | tee log-nscd
Petr?
Hmm, do you still encounter this with the 11.0 nscd?
Yes.
$ rpm -q nscd
nscd-2.8-15
*** Bug 388435 has been marked as a duplicate of this bug. ***
Created attachment 225788 [details] nscd core file I too am seeing many nscd crashes, sometimes every few minutes, and this stops Thunderbird working. I have attached a core file: is there any other information that would be useful? I am also happy to test any fixes that might be available. nscd is version 2.8-14.1 running on Opensuse 11.0 x86_64. Bob
I can reproduce this myself, just so far didn't figure out what the bug is. I'm still working on it.
Does this help? From /var/log/nscd.log (enabled by hand): 17429: pruning services cache; time 1216524649 17429: considering GETSERVBYPORT entry "`nɑ/tcp", timeout 1216552141 17429: considering GETSERVBYPORT entry " 372^K211e^?/tcp", timeout 1216552130 17429: considering GETSERVBYPORT entry "@/خ/tcp", timeout 1216551996 17429: considering GETSERVBYPORT entry " 272L?354^?/tcp", timeout 1216552070 17429: considering GETSERVBYPORT entry " 332f8343^?/tcp", timeout 1216552142 17429: considering GETSERVBYPORT entry " 252372rU^?/tcp", timeout 1216551945 17429: considering GETSERVBYPORT entry " e5301/tcp", timeout 1216552050 17429: considering GETSERVBYPORT entry "0214247^A/tcp", timeout 1216552119 17429: considering GETSERVBYPORT entry " 312Q=i^?/tcp", timeout 1216552080 17429: considering GETSERVBYPORT entry " ^ZQ356^C^?/tcp", timeout 1216552070 17429: considering GETSERVBYNAME entry "netbios-ns/tcp", timeout 1216552463 17429: considering GETSERVBYNAME entry "bootps/udp", timeout 1216552463 17429: considering GETSERVBYPORT entry "", timeout 1216551905 17429: considering GETSERVBYPORT entry "220u=O/tcp", timeout 1216552098 17429: considering GETSERVBYPORT entry " 272237303^?^?/tcp", timeout 1216552087 17429: considering GETSERVBYPORT entry " 352317/,^?/tcp", timeout 1216552080 17429: considering GETSERVBYPORT entry " :_^^352^?/tcp", timeout 1216551945 17429: considering GETSERVBYNAME entry "ipp/udp", timeout 1216552463 17429: considering GETSERVBYPORT entry "260316317^K/tcp", timeout 1216552087 17429: considering GETSERVBYPORT entry "`fIESC/tcp", timeout 1216552113 17429: considering GETSERVBYPORT entry "@@347^X/tcp", timeout 1216552391 17429: considering GETSERVBYPORT entry "320315301313/tcp", timeout 1216552087 17429: considering GETSERVBYPORT entry "321^B", timeout 1216551905 17429: considering GETSERVBYPORT entry "pVK261/tcp", timeout 1216552050 17429: considering GETSERVBYPORT entry "^P!365(/tcp", timeout 1216552391 17429: considering GETSERVBYPORT entry " *323 
247^?/tcp", timeout 1216552391 17429: considering GETSERVBYPORT entry " 212kb267^?/tcp", timeout 1216551934 17429: considering GETSERVBYNAME entry "netbios-ssn/tcp", timeout 1216552463 17429: considering GETSERVBYPORT entry "^Pr^E240/tcp", timeout 1216552130 17429: considering GETSERVBYPORT entry "@322^FH/tcp", timeout 1216552097 17429: considering GETSERVBYPORT entry " ʧuI^?/tcp", timeout 1216552113 17429: considering GETSERVBYPORT entry " 272255^CM^?/tcp", timeout 1216552087 17429: considering GETSERVBYPORT entry " ʻ205367^?/tcp", timeout 1216551905 ... and then it dies a little bit later.
Created attachment 233949 [details] NSCD debug log I am also seeing this bug on 2 systems which use LDAP account information from a RHEL 5 server. I'm attaching the first system's nscd debug log.
Created attachment 233951 [details] NSCD debug log 2 the 2nd system's NSCD debug log.
NSCD is pretty critical for reducing the load on my LDAP server. I would be happy to run some test cases or debug code for you if you want. I have no trouble crashing nscd very quickly on my OpenSUSE 11.0 systems.
I too am having the same problem. I'm using LDAP for account information and Kerberos for passwords. I'm seeing nscd crash on all of my servers at least every 15 minutes (I've got a script set up to restart it every 5 minutes if it's dead). I'm having this problem both in dom0 and in domU on Xen as well as at home on my non-Xen systems.

I installed libnscd-debuginfo and then ran nscd -d in gdb and got the following:

...
6685: Reloading "103" in password cache!
6685: Reloading "13" in password cache!
6685: Reloading "100" in password cache!
6685: remove GETPWBYUID entry "0"
6685: remove GETPWBYNAME entry "root"
nscd: mem.c:399: gc: Assertion `next_hash == &he[db->head->nentries]' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x4103f950 (LWP 6688)]
0x00007fe6907ce5c5 in raise () from /lib64/libc.so.6
(gdb) where
#0 0x00007fe6907ce5c5 in raise () from /lib64/libc.so.6
#1 0x00007fe6907cfbb3 in abort () from /lib64/libc.so.6
#2 0x00007fe6907c71e9 in __assert_fail () from /lib64/libc.so.6
#3 0x00007fe691362b68 in ?? () from /usr/sbin/nscd
#4 0x00007fe691361494 in ?? () from /usr/sbin/nscd
#5 0x00007fe6913582c6 in ?? () from /usr/sbin/nscd
#6 0x00007fe690d14040 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fe69086f0cd in clone () from /lib64/libc.so.6
(gdb)

Unfortunately it doesn't appear there is a debuginfo package for nscd, so this doesn't help quite as much as I'd hoped.
Which debuginfo packages would include the appropriate symbols to be able to get function names from the errors of nscd shown above?
(In reply to comment #17 from Jon Schewe)
> Which debuginfo packages would include the appropriate symbols to be able to
> get function names from the errors of nscd shown above?
>

glibc-debuginfo
Thanks. Now I've got a real stack trace to share. Took all of 10 minutes for it to crash this time.

GNU gdb 6.8
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux"...
(gdb) run -d
Starting program: /usr/sbin/nscd -d
[Thread debugging using libthread_db enabled]
[New Thread 0x7f8d562276f0 (LWP 818)]
[New Thread 0x4102f950 (LWP 821)]
[New Thread 0x42112950 (LWP 822)]
[New Thread 0x415a7950 (LWP 823)]
[New Thread 0x417a8950 (LWP 824)]
[New Thread 0x419a9950 (LWP 825)]
[New Thread 0x41baa950 (LWP 826)]
[New Thread 0x40584950 (LWP 827)]
[New Thread 0x40785950 (LWP 828)]
818: Reloading "root" in group cache!
818: remove INITGROUPS entry "root"
nscd: mem.c:392: gc: Assertion `off_alloc == off_allocend' failed.

Program received signal SIGABRT, Aborted.
[Switching to Thread 0x42112950 (LWP 822)]
0x00007f8d556be5c5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
64 ../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
in ../nptl/sysdeps/unix/sysv/linux/raise.c
(gdb) where
#0 0x00007f8d556be5c5 in *__GI_raise (sig=<value optimized out>) at ../nptl/sysdeps/unix/sysv/linux/raise.c:64
#1 0x00007f8d556bfbb3 in *__GI_abort () at abort.c:88
#2 0x00007f8d556b71e9 in *__GI___assert_fail (assertion=0x7f8d5625b3d0 "off_alloc == off_allocend", file=0x7f8d5625b379 "mem.c", line=392, function=0x7f8d5625b450 "gc") at assert.c:78
#3 0x00007f8d56252ba6 in gc (db=0x7f8d5645f200) at mem.c:392
#4 0x00007f8d56251494 in prune_cache (table=0x7f8d5645f200, now=1222348776, fd=-1) at cache.c:499
#5 0x00007f8d562482c6 in nscd_run_prune (p=<value optimized out>) at connections.c:1390
#6 0x00007f8d55c04040 in start_thread (arg=<value optimized out>) at pthread_create.c:297
#7 0x00007f8d5575f0cd in clone () from /lib64/libc.so.6
(gdb) print off_alloc
$1 = 1436420560
(gdb) print off_allocend
$2 = 512
Looks like Ubuntu has the same bug (#271423). But no solution there either.
For me nscd is particularly important because I'm using it for offline LDAP authentication. So far I'm using a watchdog to restart it when it dies. openSUSE 11.0 x86_64.
We're having the same problem: openSUSE 11.0 (i386 and x86_64) with LDAP for authentication and NFS automount tables. nscd frequently dies in less than an hour. As a result, an LDAP user cannot start Thunderbird (segfaults) or VMware Workstation 6.0.5 (freezes) after nscd has died. As a local user it works. See also bug#157078 and http://bugs.gentoo.org/show_bug.cgi?id=223205.

The only workaround for us so far is a watchdog daemon that restarts nscd every time it crashes. I can provide you with an strace of nscd the next time it crashes.

Best regards,
Bernd
I gave up on nscd and have been using unscd - http://busybox.net/~vda/unscd/ - and it seems to work just great. Last time I checked it had been up for a month.
Created SRPM, minimal testing on 11.0: https://bugzilla.novell.com/show_bug.cgi?id=157078#c73
Seems I ran into the same stability problem. I have 30 PCs running openSuSE 11.0 on 64 bit. I started logging of nscd now. A fix to the problem would be very welcome.
Our workaround is a watchdog daemon that restarts nscd:

==CUT==
watch_procs="/usr/sbin/nscd"
(
    while true; do
        for proc in $watch_procs; do
            if ! checkproc $proc; then
                logger -t watchdog "Restarting $proc."
                start_daemon $proc
            fi
        done
        sleep 60
    done
) &
==CUT==

Nscd crashes up to eight times daily:

adnws001:~ # grep nscd /var/log/messages
Oct 27 06:16:19 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 27 08:46:21 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 27 12:46:24 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 27 22:31:36 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 04:46:40 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 09:36:43 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 11:01:44 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 13:05:49 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 16:34:35 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 16:36:36 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 17:41:02 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 28 22:40:04 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 29 04:47:06 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 29 11:02:09 adnws001 watchdog: Restarting /usr/sbin/nscd.
Oct 29 12:43:10 adnws001 watchdog: Restarting /usr/sbin/nscd.
Luxury! I used the watchdog approach under SuSE 10.2, but under SuSE 11.0 I see nscd crashing every few minutes, or I did before I disabled it.
Following the suggestion in https://bugzilla.novell.com/show_bug.cgi?id=387202#c23 and also a private email from Jaroslaw I have installed unscd on a couple of machines. So far it is looking good and if I don't find any problems I'll roll it out to other machines. I would be interested in hearing a comment from SuSE about unscd as it sounds like it could be the solution to a major headache, but SuSE are much better qualified than I to make that judgement. The drawback is that very few people seem to have tested it so far, and it is a very important piece of software that has to work in a wide variety of environments. On the other hand, the standard nscd has been notoriously flakey for many years, so the bar isn't very high!
I now employ the watchdog approach from Comment #26. Many thanks for the code snippet!

Just for statistical fun on Halloween: the average lifetime of nscd here is 144 minutes per SuSE 11.0 box (averaged over the crashes on 34 moderately loaded boxes during 14 hours). (Using NIS and DNS, openSuSE 11.0, x86-64, no LDAP.)

Funnily enough, only 11 of the 197 crashes left a kernel message in syslog (mostly segfaults, sometimes "general protection").
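
The 144-minute figure is consistent with dividing the total observed box time by the number of crashes. A minimal check, assuming the 197 crashes are the total across all 34 boxes over the same 14 hours (the comment does not say so explicitly):

```shell
# Assumption: 197 crashes in total across 34 boxes, each observed for 14 hours.
boxes=34
hours=14
crashes=197
# Total observed time in minutes divided by the crash count (integer division).
echo $(( boxes * hours * 60 / crashes ))
```

This prints 144, since 34 * 14 * 60 = 28560 minutes and 28560 / 197 is just under 145.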
Watchdogs are a spawn of evil. I've rolled out unscd on all 250 desktops now. Fingers crossed (but still keeping the watchdog alive so I can at least catch any potential issues with unscd). Bob, did you disable debugging already? How's performance?
FWIW we are very strongly considering unscd for post-11.1, though nothing is decided yet.
I am running unscd on several heavily-loaded 11.0 machines now and so far it is looking very good. It hasn't crashed, and every so often I run getent on every account to confirm it is telling the truth (in the past nscd has suffered from corrupt caches as well as segmentation faults).
*If* you are going to try unscd *and* you are using apparmor, you'll need to edit /etc/apparmor.d/usr.sbin.nscd and right after

capability net_bind_service,

add:

capability setgid,
capability setuid,

for it to work.
*** Bug 439210 has been marked as a duplicate of this bug. ***
I'm planning to release packages at http://www.suse.de/~pbaudis/bug-387202 (will mirror out in an hour or two) as maintenance update for 11.0 in a short while.
nscd still crashes with this patch and in 11.1 - very seldom for me, much more frequently for others. So I will hold this a little more and try to fix that crash too.
*** Bug 426396 has been marked as a duplicate of this bug. ***
*** Bug 417865 has been marked as a duplicate of this bug. ***
*** Bug 426679 has been marked as a duplicate of this bug. ***
(Status update: In bug 446233, we have tested the patch to fix this issue and fixed another race condition which kicks in if nscd does not crash because of this one. Some people still report occasional crashes, but I don't have enough data to debug these. I will wait probably until next Tuesday and proceed to submit 11.0 update with all the nscd fixes we have by then, and some small extras.)
*** Bug 157078 has been marked as a duplicate of this bug. ***
*** Bug 374990 has been marked as a duplicate of this bug. ***
Based on the severity and priority of this bug: How about adding the watchdog workaround as a patch into the current nscd package and release it as an update? That is, modify nscd to start a master process which monitors its child(ren), restart them automatically upon death (maybe with a log entry) and have it kill them upon exit. This patch should neither be that much to add nor too difficult. Of course this is not a real fix and quite ugly, IMHO. However, it would be a _quick_ workaround for all 11.0 installations to make nscd more stable (from the users point of view), relieving the users from implementing a watchdog themselves. It would also buy some time until the real bug is found and squashed.
Just to add that the unscd solution continues to work well for me. It has been running on a number of heavily loaded machines for over a month now without ever crashing, and there has been no sign of bad data. It is good to know progress is being made on the standard nscd as well.
I also switched to unscd about 4 weeks ago on a pool of 34 machines. I haven't encountered any problem since.
Walter: We already do have such a watchdog, it's called "init". Just adding nscd -d to /etc/inittab should work fine. :-)
I see. So, I guess OpenSUSE 12.0 will scrap all those useless scripts in /etc/init.d and start everything from /etc/inittab, right? Nice. Maybe I need to clarify this: comment #51 _was_ meant seriously, no joke intended!
Created attachment 258547 [details] sample watchdog script In case it is useful, here is my version of the watchdog script, designed to be run as a cron job. It has a couple of good features: (1) can check other services, not just nscd (2) uses chkconfig to make sure it only checks services that are meant to be running
Nice script but because of comment #54 quite obsolete, isn't it? :-\

No, seriously, if such a wrapper were implemented in nscd itself, _all_ SUSE users would benefit, even those not capable of writing a wrapper themselves and those unable to find (or unaware of) this bugzilla entry. Again, this could be quickly released as an nscd update until the real fix is done (which we have been waiting for since when? two years?).

The required patch would only need to do the following (in pseudo-code):

/* signal handler to kill spawned nscd child */
signal(SIGTERM, { kill(child_pid); } );

/* core loop to (re)spawn nscd child */
for (;;) {
    child_pid = fork();
    if (child_pid == 0) {
        nscd_main(); /* run nscd main() */
    } else {
        wait(); /* wait for nscd child to exit/die */
        log("restarting nscd child");
    }
}

Would that be too difficult?
Nice idea but dangerous: imagine some condition that caused nscd to fail as soon as it started. Then the nscd parent would whirl round burning up CPU and your system would be much more messed up than if nscd just died.
Then add a sleep() to wait a couple of seconds after each wait() to throttle respawning. This should be usually good practice in watchdog wrappers anyways, so I left it out. I said it's only pseudo-code. Moreover, logging will make you notice the problem.
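
The throttled respawn idea can be sketched in shell, in the style of the watchdog scripts elsewhere in this bug. This is only an illustration, not a proposed nscd patch: the function name `respawn`, its argument order, and the restart cap `max` are hypothetical, and the cap exists only so the loop can terminate in a test.

```shell
# respawn DELAY MAX CMD [ARGS...]
# Run CMD in the foreground; each time it exits, log, sleep DELAY seconds,
# and start it again, at most MAX times in total.
respawn() {
    local delay=$1 max=$2 count=0
    shift 2
    while [ "$count" -lt "$max" ]; do
        "$@" || true        # blocks until the child exits; exit status is ignored
        count=$((count + 1))
        echo "child exited, restart $count"
        sleep "$delay"      # throttle, so a child that dies at once cannot spin the CPU
    done
}
```

A call such as `respawn 5 1000 /usr/sbin/nscd -d` would rely on `nscd -d` staying in the foreground, as the debug mode is described in this thread.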
Hi!

I've switched to unscd too and it rocks. I'm using it in 32- and 64-bit environments, on a few servers and on my laptop, and have never had a problem with it.

For regular nscd I have 2 simple scripts to make it restart without much hassle. Here they are:

MartiniMan-lap:~/bin # cat nscd_check
#!/bin/bash
while true ; do
    if [ ! "`pidof nscd`" ] ; then
        echo "`date +%d:%m:%y-%H:%M:%S` restarting nscd"
        sudo rcnscd restart ;
    fi
    sleep 1 ;
done

----------------------------------------------------------------------------------------------------------
Pedro Oliveira
IT Consultant
Email: pmsoliveira@gmail.com
URL: http://pedro.linux-geex.com
Telefone: +351 96 5867227
----------------------------------------------------------------------------------------------------------
Sorry, I forgot the second script, which makes the previous one start automatically from RC. Just create this executable file:

/etc/init.d/nscd_check

#!/bin/sh
### BEGIN INIT INFO
# Provides:          nscd_check
# Required-Start:    nscd
# Should-Start:
# Required-Stop:
# Should-Stop:
# Default-Start:     3 5
# Default-Stop:      0 1 2 6
# Short-Description: check for nscd
# Description:       check if nscd is running and restarts it if not
### END INIT INFO
#
. /etc/rc.status

# Reset status of this service
rc_reset

case "$1" in
    start)
        echo -n "Starting nscd_check"
        nohup /sbin/nscd_check >> /var/log/messages &
        rc_status -v
        ;;
    stop)
        echo -n "Shutting down nscd_check"
        pkill nscd_check
        rc_status -v
        ;;
    restart)
        $0 stop
        $0 start
        rc_status
        ;;
    *)
        echo "Usage: $0 {start|stop|restart}"
        exit 1
        ;;
esac
rc_exit
#####################################

After this just type:

insserv nscd_check

Hope it helps.
nscd 2.9, as shipped with openSuSE 11.1, crashes too. I am using unscd now. Wouldn't it be reasonable to provide unscd as a patch for openSuSE 11.0 and 11.1?
nscd remains crashy for me, too (opensuse 11.1) /me back to using unscd.
Update released for: glibc, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd Products: openSUSE 11.0 (debug, i386, i686, ppc, ppc64, x86_64)
nscd still crashes every hour or so after updating to nscd-2.8-14.2 on SuSE 11.0. I will reinstate unscd.
If nscd still crashes for you, please:

(i) Set persistent to 0 for all databases in your /etc/nscd.conf
(ii) /etc/init.d/nscd stop and run ulimit -c unlimited; nscd -d
(iii) When nscd crashes, please post a core here, compress it if it is larger than 1M or so.
(iv) Also post your /etc/nsswitch.conf with the core.

Without this information, I cannot fix any crashes; nscd on 11.0 crashed only once for me so far after this fix, and I don't have quite enough data to debug it yet, it seems.

Thanks!
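
Step (i) can be sketched as follows. It assumes that "persistent to 0" corresponds to the value `no`, since the shipped config quoted later in this bug documents the option as `persistent <service> <yes|no>`. The function name `disable_persistent` is hypothetical; it only prints the edited config and leaves the file itself untouched.

```shell
# Assumption: "persistent ... 0" means "no" in nscd.conf syntax.
# disable_persistent FILE: print FILE with every active
# "persistent <service> yes" line changed to "no". Commented lines are left alone.
disable_persistent() {
    sed -E 's/^([[:space:]]*persistent[[:space:]]+[[:alnum:]_]+[[:space:]]+)yes[[:space:]]*$/\1no/' "$1"
}

# Intended use, as root (the commands after the first are steps (ii) above):
#   disable_persistent /etc/nscd.conf > /etc/nscd.conf.new   # review, then install
#   /etc/init.d/nscd stop
#   ulimit -c unlimited; nscd -d
```

Writing to a separate file first keeps the original config available for comparison before it is replaced.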
Egbert König: I plan to package unscd nicely in buildservice in the future, I'm just not sure when I will get to it.

To clarify, there are two bugs: bug 387202 against 11.0 and bug 446233 against 11.1. Since nscd is basically the same in 11.0 and 11.1 by now and I will continue to keep them in sync, I'm going to mark 446233 as a dupe of this one and bump this one to 11.1; further nscd updates will be released for both 11.0 and 11.1. Both of these bugs are in fact many different bugs in nscd; (un)fortunately the unfixed ones trigger only rarely, so they aren't as easy to debug.
*** Bug 446233 has been marked as a duplicate of this bug. ***
If anybody cares, I *have* packaged it (although the packaging needs some work) by using bits from the nscd package. home:jnelson-suse if you like. I *have* seen unscd crash, but not the latest version (0.36), which has been very slightly patched to unlink the pidfile and sockets. I actively solicit improvements. I'M NOT RESPONSIBLE FOR ANYTHING THAT GOES WRONG.
Sorry, of course I forgot to mention that - I'm using your work as a base for mine. :)
(In reply to comment #66 from Petr Baudis)
> If nscd still crashes for you, please:
>
> (i) Set persistent to 0 for all databases in your /etc/nscd.conf
> (ii) /etc/init.d/nscd stop and run ulimit -c unlimited; nscd -d
> (iii) When nscd crashes, please post a core here, compress it if it is larger
> than 1M or so.
> (iv) Also post your /etc/nsswitch.conf with the core.
>
> Without this information, I cannot fix any crashes; nscd on 11.0 crashed only
> once for me so far after this fix, and I don't have quite enough data to debug
> it yet, it seems.
>
> Thanks!
>

I managed to get another crash, and will attach the requested info. nscd is version 2.8-14.2, running on SuSE 11.0.
Created attachment 263361 [details] nscd core file plus log messages and config files
Hi,

Some good news while everybody is complaining: I installed all SUSE 11.0 updates with "zypper update" and rebooted the system two days ago, and since then nscd keeps running. Before that it crashed every few hours and was restarted by my watchdog daemon.

adnws001:~ # rpm -qa | egrep 'nscd|glibc'
libnscd-2.0.2-81.1
nscd-2.8-14.2
glibc-2.8-14.2
glibc-locale-2.8-14.2
glibc-devel-2.8-14.2
glibc-info-2.8-14.2
adnws001:~ # uname -a
Linux adnws001 2.6.25.18-0.2-pae #1 SMP 2008-10-21 16:30:26 +0200 i686 i686 i386 GNU/Linux

Thanks a lot!

Bye,
Bernd
FWIW, the 11.0 update package works for me so far.
*** Bug 467393 has been marked as a duplicate of this bug. ***
FWIW, another variant, this time with 11.1 (nscd-2.9-2.8):

10765: provide access to FD 12, for hosts
10765: Reloading "die-offenbachs.homelinux.org" in hosts cache!
10765: Reloading "0" in group cache!
10765: Reloading "2222" in group cache!
10765: remove GETHOSTBYNAME entry "localhost"
10765: remove GETPWBYUID entry "51"
10765: remove GETPWBYNAME entry "nobody"
10765: remove GETPWBYUID entry "65534"
10765: remove GETPWBYNAME entry "postfix"
nscd: mem.c:412: gc: Assertion `next_data < &he_data[db->head->nentries]' failed.
Aborted

Once nscd has crashed, amarok takes ages to start up (say 5-10 minutes!); with nscd running it takes 2-5 secs.

Now that B O'B will set up a new world order, these bugs really cry for immediate fixes, Petr.
Actually, I have just prepared a new round of nscd updates for 11.0 and 11.1, at http://www.suse.de/~pbaudis/bug-387202-2/

I apologize to those whom I told earlier that the 11.1 and 11.0 nscd are identical; it turns out that the 11.1 glibc update I prepared last did not actually make it to 11.1. :-( So 11.0 should actually have a much more stable nscd than 11.1 now. I will try to trigger another round of updates now.
The SWAMPID for this issue is 22192. Please submit the patch and patchinfo file using this ID. (https://swamp.suse.de/webswamp/wf/22192)
Yet another watchdog.

root's crontab entry:

-0,*/5 * * * 1-7 /root/bin/watchdog_nscd > /dev/null

script:

#!/bin/bash
# watchdog to restart the nscd service
# idea for the case statement taken from "307:rc"
/usr/sbin/rcnscd status start; status=$?
echo "Status= "$status
case $status in
    [1-47])
        echo "failed"
        /bin/logger -p user.warn -t watchdog \
            "nscd is not running, restarting. -- Bugzilla 387202; "\
            "see root's crontab to disable this wd"
        /usr/sbin/rcnscd restart
        ;;
    [56])
        echo "skipped"
        ;;
    0|*)
        echo "Nothing to do"
        ;;
esac

I believe you should create some kind of watchdog and push it via YOU to systems until this problem is really solved.
Sorry, errata in #81 "/usr/sbin/rcnscd status start" should be "/usr/sbin/rcnscd status", of course. It's of no consequence, anyway.
Petr, for what it's worth, since I installed http://www.suse.de/~pbaudis/bug-387202-2/ nscd hasn't crashed. A YaST-compatible repo structure for this dir would ease testing greatly, though. (Well, I use createrepo internally.)
Cheered too soon :-(. Crashed after three days, but I stopped the nscd debugging before my last post. Will set it up again now.
Update released for: nscd Products: openSUSE 11.0 (i386, ppc, x86_64)
I got what I think is that update:

cer@nimrodel:~> rpm -q -i nscd
Name        : nscd                         Relocations: (not relocatable)
Version     : 2.8                               Vendor: SUSE LINUX Products GmbH, Nuernberg, Germany
Release     : 14.4                          Build Date: Sun 25 Jan 2009 10:06:27 PM CET
Install Date: Tue 03 Feb 2009 04:21:08 AM CET      Build Host: stravinsky.suse.de

I had nscd crash twice today, i.e. after the update:

Feb 3 15:55:01 nimrodel watchdog: nscd is not running, restarting. --
Feb 3 17:55:01 nimrodel watchdog: nscd is not running, restarting. --
Feb 3 17:55:01 nimrodel nscd: 12295 invalid persistent database file "/var/run/nscd/passwd": verification failed

Admittedly, it is crashing less.
Carlos, can you please follow the reporting guidelines I outlined in comment 66? Thank you.
(In reply to comment #87)
> Carlos, can you please follow the reporting guidelines I outlined in comment
> 66? Thank you.

Let me see...

> (i) Set persistent to 0 for all databases in your /etc/nscd.conf

Huh? I have:

nimrodel:~ # grep -i persistent /etc/nscd.conf
# persistent <service> <yes|no>
persistent passwd yes
persistent group yes
persistent hosts no
persistent services yes

What exactly do I edit? My configuration is your default supplied config, I think.

> (ii) /etc/init.d/nscd stop and run ulimit -c unlimited; nscd -d

Done. I now have in the startup script this:

case "$1" in
    start)
        echo -n "Starting Name Service Cache Daemon"
        #/sbin/startproc -p $NSCD_PID $NSCD_BIN
        # Bug 387202#c66
        ulimit -c unlimited
        /sbin/startproc -p $NSCD_PID $NSCD_BIN -d
        rc_status -v
        ;;

If this is not adequate, please tell me how I change the script - it has to be that way, I have a watchdog restarting the daemon automatically.

[...]

No, it is not adequate, status says "unused". Undoing the "-d" till you expand the instructions.
Hi Petr, Tried several times with ulimit -c unlimited; nscd -d and nscd does crash, but I never get a core dump ... Tried with a small script dividing by 0 and it writes a core. Any ideas how to get a core dump of a nscd crash?
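
A few generic Linux settings commonly decide whether a crashing process leaves a core file: the core size limit of the shell that starts it, the kernel's core_pattern, the suid_dumpable setting (relevant if the process changes its user ID), and whether the working directory is writable. Whether any of these explains the missing nscd core here is an assumption; the sketch below only prints the values so they can be posted along with a report. The function name `core_dump_report` is hypothetical.

```shell
# Sketch: print settings that commonly decide whether a core file gets written.
# Generic Linux checks only; nothing here is confirmed as the cause in this bug.
core_dump_report() {
    echo "core size limit: $(ulimit -c)"
    for f in /proc/sys/kernel/core_pattern /proc/sys/fs/suid_dumpable; do
        if [ -r "$f" ]; then
            echo "$f: $(cat "$f")"
        else
            echo "$f: not readable"
        fi
    done
    echo "working directory: $(pwd)"
}

core_dump_report
```

Run it in the same shell, and as the same user, from which `nscd -d` is started, since the core size limit is inherited from that shell.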
I wasn't able to get the version from #77 to crash as long as I run it in debug mode, unlike when running it as an ordinary runlevel service. Since debug mode prevents nscd from forking, maybe some fork- or clone-related race condition in nscd is the real McCoy in this issue.

Roland, Carlos, please keep the ulimit -c unlimited; nscd -d running in a terminal, and make sure that rcnscd is not running.
(In reply to comment #90) > Roland, Carlos please keep the ulimit -c unlimited; nscd -d running in a > terminal, and be sure, that rcnscd is not running. Well, I have done just that, but I still need clarification on "persistent" configuration, as per #88
Ok, nscd just crashed. Output in window was:

...
921: GETFDGR
8921: provide access to FD 6, for group
8921: handle_request: request received (Version = 2) from PID 10892
8921: GETFDGR
8921: provide access to FD 6, for group
8921: remove GETPWBYUID entry "51"
8921: remove GETPWBYNAME entry "nobody"
8921: remove GETPWBYUID entry "65534"
8921: remove GETPWBYNAME entry "postfix"
nscd: mem.c:368: gc: Assertion `off_allocend <= db->head->first_free' failed.
nimrodel:~/Bugzilla/Bug_387202 #

There is no core in that directory.

Config:

cer@nimrodel:~> cat /etc/nscd.conf | egrep -v "^[[:space:]]*$|^#"
debug-level 0
paranoia no
enable-cache passwd yes
positive-time-to-live passwd 600
negative-time-to-live passwd 20
suggested-size passwd 211
check-files passwd yes
persistent passwd yes
shared passwd yes
max-db-size passwd 33554432
auto-propagate passwd yes
enable-cache group yes
positive-time-to-live group 3600
negative-time-to-live group 60
suggested-size group 211
check-files group yes
persistent group yes
shared group yes
max-db-size group 33554432
auto-propagate group yes
enable-cache hosts yes
positive-time-to-live hosts 600
negative-time-to-live hosts 0
suggested-size hosts 211
check-files hosts yes
persistent hosts no
shared hosts yes
max-db-size hosts 33554432
enable-cache services yes
positive-time-to-live services 28800
negative-time-to-live services 20
suggested-size services 211
check-files services yes
persistent services yes
shared services yes
max-db-size services 33554432
cer@nimrodel:~>
I forgot:

cer@nimrodel:~> cat /etc/nsswitch.conf | egrep -v "^[[:space:]]*$|^#"
passwd: compat
group: compat
hosts: files mdns4_minimal [NOTFOUND=return] dns
networks: files dns
services: files
protocols: files
rpc: files
ethers: files
netmasks: files
netgroup: files nis
publickey: files
bootparams: files
automount: files nis
aliases: files
cer@nimrodel:~>

And to avoid confusion, I'm on 11.0.
One more:

11415: GETFDPW
11415: provide access to FD 4, for passwd
11415: Reloading "0.pool.ntp.org" in hosts cache!
11415: Reloading "1.ch.pool.ntp.org" in hosts cache!
11415: Reloading "0.es.pool.ntp.org" in hosts cache!
11415: Reloading "1.pool.ntp.org" in hosts cache!
11415: Reloading "2.pool.ntp.org" in hosts cache!
11415: Reloading "0.ch.pool.ntp.org" in hosts cache!
11415: Reloading "3.pool.ntp.org" in hosts cache!
11415: Reloading "0.uk.pool.ntp.org" in hosts cache!
11415: Reloading "users.opensuse.org" in hosts cache!
11415: Reloading "0.fr.pool.ntp.org" in hosts cache!
11415: remove GETAI entry "0.pool.ntp.org"
11415: remove GETAI entry "1.ch.pool.ntp.org"
11415: remove GETAI entry "0.es.pool.ntp.org"
11415: remove GETAI entry "1.pool.ntp.org"
11415: remove GETAI entry "2.pool.ntp.org"
11415: remove GETAI entry "0.ch.pool.ntp.org"
11415: remove GETAI entry "nimrodel"
11415: remove GETAI entry "3.pool.ntp.org"
11415: remove GETAI entry "0.uk.pool.ntp.org"
11415: remove GETAI entry "0.fr.pool.ntp.org"
11415: remove GETPWBYNAME entry "upsd"
11415: remove GETPWBYUID entry "115"
nscd: mem.c:477: gc: Assertion `next_hash == &he[db->head->nentries]' failed.
nimrodel:~/Bugzilla/Bug_387202 #
nimrodel:~/Bugzilla/Bug_387202 # ulimit -c unlimited; nscd -d
15414: invalid persistent database file "/var/run/nscd/passwd": verification failed
One more:

15414: handle_request: request received (Version = 2) from PID 17943
15414: GETFDGR
15414: provide access to FD 6, for group
15414: remove GETPWBYUID entry "1000"
15414: remove GETPWBYNAME entry "cer"
15414: handle_request: request received (Version = 2) from PID 7657
15414: GETAI (www.os-translation.com.ar)
15414: remove GETPWBYNAME entry "lp"
15414: remove GETPWBYUID entry "4"
nscd: mem.c:368: gc: Assertion `off_allocend <= db->head->first_free' failed.
Aborted

Another:

21408: GETFDPW
21408: provide access to FD 4, for passwd
21408: handle_request: request received (Version = 2) from PID 1113
21408: GETFDPW
21408: provide access to FD 4, for passwd
21408: remove GETPWBYUID entry "101"
21408: remove GETPWBYNAME entry "messagebus"
nscd: mem.c:368: gc: Assertion `off_allocend <= db->head->first_free' failed.
Aborted
Another one: 2138: Reloading "0.pool.ntp.org" in hosts cache! 2138: remove GETAI entry "0.pool.ntp.org" 2138: Reloading "1.ch.pool.ntp.org" in hosts cache! 2138: Reloading "0.es.pool.ntp.org" in hosts cache! 2138: Reloading "1.pool.ntp.org" in hosts cache! 2138: Reloading "2.pool.ntp.org" in hosts cache! 2138: Reloading "0.ch.pool.ntp.org" in hosts cache! 2138: Reloading "3.pool.ntp.org" in hosts cache! 2138: Reloading "0.uk.pool.ntp.org" in hosts cache! 2138: Reloading "0.fr.pool.ntp.org" in hosts cache! 2138: remove GETAI entry "1.ch.pool.ntp.org" 2138: remove GETAI entry "0.es.pool.ntp.org" 2138: remove GETAI entry "1.pool.ntp.org" 2138: remove GETAI entry "2.pool.ntp.org" 2138: remove GETAI entry "0.ch.pool.ntp.org" 2138: remove GETAI entry "3.pool.ntp.org" 2138: remove GETAI entry "0.uk.pool.ntp.org" 2138: remove GETAI entry "0.fr.pool.ntp.org" 2138: remove GETPWBYUID entry "51" 2138: remove GETPWBYNAME entry "nobody" 2138: remove GETPWBYUID entry "65534" 2138: remove GETPWBYNAME entry "postfix" 2138: remove GETPWBYNAME entry "upsd" 2138: remove GETPWBYUID entry "115" Segmentation fault nimrodel:~/Bugzilla/Bug_387202 # l total 8 drwxr-xr-x 2 root root 4096 Feb 4 02:00 ./ drwxrwxr-x 33 cer root 4096 Feb 4 01:58 ../ -rw-r--r-- 1 root root 0 Feb 4 04:45 nscd.log Feb 5 12:40:51 nimrodel kernel: nscd[2139]: segfault at fffffdc4 ip b7f8c822 sp adda5f5c error 4 in nscd[b7f7c000+1c000] Well, as I see no comments on how to produce that core file, and it keeps crashing, I'm restarting the "normal" service with automatic watchdog restarting the service, instead of "nscd -d" in a terminal. I will not comment further unless I get feedback to the contrary, I see no point.
> Well, as I see no comments on how to produce that core file, and it keeps
> crashing, I'm restarting the "normal" service with automatic watchdog
> restarting the service, instead of "nscd -d" in a terminal. I will not
> comment further unless I get feedback to the contrary, I see no point.

You truly have a point here, Carlos. FWIW, your crashes nicely sum up what I see only very sporadically here. Petr, I really wonder why you don't provide the nscd debug packages (see #77). Otherwise, similar to Carlos, I see no point in running nscd via gdb...
The debuginfo packages are right there where Petr said in comment #77. Maybe you are confused that there's no nscd-debug{info,source}? That's because debug packages don't exist for subpackages (and nscd is a subpackage of glibc); you need to install glibc-debuginfo (-debugsource). Also, Carlos: you haven't yet followed the guidelines of comment #66. You still use persistent databases. Yes, the comment talks about setting it to "0". Of course you should instead use "no" for all databases you have in nscd.conf. See nscd.conf(5). But indeed, more logs are not necessary I think; we see the assertions that cause nscd to exit. A core file would be a bit more useful. Core files aren't placed into the current pwd, but into the working dir of nscd, which usually is '/' (nscd chdir's into that one as a daemon). You probably have some lying around there.
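For reference, the setting discussed here lives in /etc/nscd.conf as lines of the form "persistent <database> <yes|no>" (see nscd.conf(5)). A minimal sketch for listing the databases that still have persistence enabled follows; the helper name persistent_dbs is made up for illustration and is not part of nscd.

```shell
# List nscd databases whose "persistent" option is still set to "yes".
# persistent_dbs is a hypothetical helper, not something shipped with nscd.
persistent_dbs() {
    conf=${1:-/etc/nscd.conf}
    # Relevant lines look like: persistent <database> <yes|no>
    # Commented-out lines start with '#', so their first field never matches.
    awk '$1 == "persistent" && $3 == "yes" { print $2 }' "$conf"
}
```

Called without an argument it reads /etc/nscd.conf. Each database it prints would need its line changed to "persistent <database> no", followed by a restart of nscd.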
Update released for: glibc, glibc-debuginfo, glibc-debugsource, glibc-devel, glibc-html, glibc-i18ndata, glibc-info, glibc-locale, glibc-obsolete, glibc-profile, nscd Products: openSUSE 11.1 (debug, i586, i686, ppc, ppc64, x86_64)
(In reply to comment #98)
> The debuginfo packages are right there where Petr said in comment #77.
> Maybe you are confused that there's no nscd-debug{info,source}? That's because
> debug packages don't exist for subpacks (and nscd is one of glibc),
> you need to install glibc-debuginfo (-debugsource).

Well, if you want me to install something in order to produce the coredump, tell me exactly what to install.

> Also, Carlos: you didn't yet follow the guidelines of comment #66. You still
> use persistent databases. Yes, the comment talks about setting it to "0".
> Of course instead you should use "no" for all databases you have in nscd.conf.
> See nscd.conf(5).

No, I didn't; I said in #88 and #91 that I needed clarification. I still do. Do you mean I should use:

persistent whatever no

> But indeed, more logs are not necessary I think, we see the assertions
> that cause nscd to exit. core file would be a bit more usefull. They aren't
> placed into the current pwd, but into the working dir of nscd, which usually
> is '/' (nscd chdir's into that one as daemon). You probably have some lying
> around there.

One of the crashes was not an assertion but a segfault. No, there is no core on /. I can run an "updatedb; locate corewhatever", but I need to know the exact name to search for, because it yields upwards of 5713 entries. Or alternatively, an exact "find" command to find it. As far as I can see, there is no core in:

/
/tmp
/var/run/nscd
/root/Bugzilla/Bug_387202 <-- pwd where I run "nscd -d"
/root
/home/cer

Note: the "nscd -d" command runs in an xterm where I did "su -" to root, in order to keep an eye on it. The xterm is under gnome. I remember some mention years ago of X blocking coredumps. But somebody here said he managed to produce coredumps with code dividing by zero, so it must be nscd which impedes them :-?

Suggestion: search for all assertions in the code and replace them with, or add, logger calls. In my programming days, an assertion was the last resort, and never used in production code. It was used instead of proper code to find an unexpected situation, never as error handling code.
> The debuginfo packages are right there where Petr said in comment #77.
> Maybe you are confused that there's no nscd-debug{info,source}?

Yes.

> That's because
> debug packages don't exist for subpacks (and nscd is one of glibc),
> you need to install glibc-debuginfo (-debugsource).

Done that already. As noted before, it would be MUCH easier for every tester if Petr provided a zypp compatible repo structure over there: then people could add the repo target, update and install additional packages.

> But indeed, more logs are not necessary I think, we see the assertions
> that cause nscd to exit. core file would be a bit more usefull. They aren't
> placed into the current pwd, but into the working dir of nscd, which usually
> is '/' (nscd chdir's into that one as daemon). You probably have some lying
> around there.

I got an nscd segfault (a few days ago) too, assertions also, but no core:

> for f in $(locate core | egrep '\<core$'); do [ -f $f ] && l $f; done
lrwxrwxrwx 1 root root 11 7. Jan 22:20 /dev/core -> /proc/kcore
lrwxrwxrwx 1 root root 11 25. Dez 14:30 /lib/udev/devices/core -> /proc/kcore
-rw-r--r-- 1 root root 213 3. Dez 08:26 /var/adm/perl-modules/yast2-core

For whatever reason, something prevents the kernel from creating an nscd core file. Since two people see this behavior, I bet you won't see ANY cores from anybody as long as you cannot tell us how! Read: try to simulate an assertion or segfault with nscd, and get it to produce one. I bet again that this fails also. Now find the reason, tell us, and we're back in the game.
Created attachment 270688 [details] nscd backtrace nscd crashed again on my openSUSE 11.0 with nscd-2.8-14.4. Still no core dump, but this time a backtrace: 23851: provide access to FD 6, for group 23851: Reloading "20915" in group cache! *** glibc detected *** nscd: corrupted double-linked list: 0xb7f8d6e0 *** ======= Backtrace: ========= /lib/libc.so.6[0xb7de3fc4] /lib/libc.so.6[0xb7de4264] . . full output in the attached tar.gz file. It also includes my /etc/nscd.conf and /etc/nsswitch.conf files. Hope it helps ...
re #101: core files for multi-threaded processes (which nscd is) aren't named "core", but rather "core.$PID", hence your egrep pattern won't find them.

> Read: try to simulate an assertion
> or segfault with nscd, and get it to produce one.

Easy:

% ulimit -c unlimited
% nscd -d &
[1] 28280
% kill -SEGV $!
[1]+ Segmentation fault (core dumped) nscd -d
% pwd; ls -l core.28280
/
-rw------- 1 root root 42344448 2009-02-06 15:10 core.28280

without debug:

% nscd
% pidof nscd
28304
% kill -SEGV $(pidof nscd)
% ls -l core.28304
-rw------- 1 root root 42344448 2009-02-06 15:12 core.28304
% file core.28280 core.28304
core.28280: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'nscd -d'
core.28304: ELF 32-bit LSB core file Intel 80386, version 1 (SYSV), SVR4-style, from 'nscd'
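Since an exact "find" command was asked for earlier in this report, here is a minimal sketch that matches both naming schemes ("core" and "core.<PID>") while skipping unrelated names such as yast2-core. The function name find_cores is made up for illustration, and -maxdepth assumes GNU or BSD find.

```shell
# Find core dumps directly inside a directory, matching "core" and "core.<PID>".
# find_cores is a hypothetical helper name; -maxdepth needs GNU or BSD find.
find_cores() {
    dir=${1:-/}
    # -type f skips symlinks such as /dev/core -> /proc/kcore.
    find "$dir" -maxdepth 1 -type f \( -name core -o -name 'core.[0-9]*' \)
}
```

Run as root with no argument it looks in '/', the working directory of a daemonized nscd; pass another directory (for example the one where "nscd -d" was started) to look there instead.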
Michael, not here, unfortunately:

xrated:/# export LANG=C
xrated:/# cat /etc/SuSE-release
openSUSE 11.1 (i586)
VERSION = 11.1
xrated:/# ulimit -c unlimited
xrated:/# nscd -d &
[1] 8139
xrated:/# kill -SEGV $!
xrated:/# pwd; ls -l core*
/
ls: cannot access core*: No such file or directory
[1]+ Segmentation fault nscd -d
xrated:/# nscd
xrated:/# pidof nscd
8206
xrated:/# kill -SEGV $(pidof nscd)
xrated:/# ls -l core*
ls: cannot access core*: No such file or directory
xrated:/# uname -a
Linux xrated 2.6.27.7-9-pae #1 SMP 2008-12-04 18:10:04 +0100 i686 athlon i386 GNU/Linux

Do you remember any further details which may prevent core dumping?
Does $ /sbin/sysctl -a | grep kernel.core show any unusual settings? Like kernel.core_pattern with an absolute path?
No. xrated:/# /sbin/sysctl -a | grep kernel.core kernel.core_uses_pid = 0 kernel.core_pattern = core
Is there enough space on '/'? Also note that your segfault message doesn't include the "(core dumped)" string, so it's really not even attempting to dump core. Very strange. What does 'ulimit -a' say in that very shell, after doing 'ulimit -c unlimited' and the forced segfault in nscd?
apparmor may be getting in the way here, too.
Bingo, that was the missing hint:

xrated:/# rcapparmor stop
Unloading AppArmor profiles done
xrated:/# nscd -d &
[1] 18566
xrated:/# kill -SEGV $!
xrated:/# ls -l core*
-rw------- 1 root root 143011840 Feb 6 21:27 core.18566
[1]+ Segmentation fault (core dumped) nscd -d

I've filed a bugzilla report about this silliness: https://bugzilla.novell.com/show_bug.cgi?id=473529 Would you please vote for it? Thanks. Now back to the 'real' problem; I will keep nscd running in 'observation' mode.
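The checks walked through in the preceding comments (core size limit, kernel.core_pattern, AppArmor) can be collected into one small report. This is only a sketch: the function names are made up, and reading loaded profiles from /sys/kernel/security/apparmor/profiles is an assumption about how AppArmor exposes them on the system at hand.

```shell
# Classify a kernel.core_pattern value (hypothetical helper).
classify_core_pattern() {
    case $1 in
        '|'*) echo pipe ;;      # core is piped to a helper program
        /*)   echo absolute ;;  # core is written to a fixed location
        *)    echo relative ;;  # core lands in the process's working directory
    esac
}

# Print the settings that decide whether and where a core file is written.
core_dump_report() {
    echo "core size limit: $(ulimit -c)"
    if [ -r /proc/sys/kernel/core_pattern ]; then
        pattern=$(cat /proc/sys/kernel/core_pattern)
        echo "core_pattern: $pattern ($(classify_core_pattern "$pattern"))"
    fi
    # Assumption: an active AppArmor lists its loaded profiles in this file.
    if [ -r /sys/kernel/security/apparmor/profiles ]; then
        grep nscd /sys/kernel/security/apparmor/profiles ||
            echo "no nscd AppArmor profile loaded"
    fi
}
```

Running core_dump_report as root in the same shell that starts nscd shows the values that shell will hand to the daemon; a limit of 0, an absolute core_pattern, or a loaded nscd profile each point at a different reason for a missing core file.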
Created attachment 271484 [details] nscd core, nscd.conf, nsswitch.conf nscd core dump from an openSUSE 11.0 system. In addition, I have added /etc/nscd.conf, /etc/nsswitch.conf and the standard output to the tar file.
Created attachment 272448 [details] nscd core, nscd.conf, nsswitch.conf Here's one, that seems new (11.1 updated): 1827: provide access to FD 12, for hosts 1827: handle_request: request received (Version = 2) from PID 30345 1827: GETPWBYNAME (root) 1827: handle_request: request received (Version = 2) from PID 30345 1827: GETPWBYNAME (root) 1827: handle_request: request received (Version = 2) from PID 30345 1827: GETPWBYNAME (root) 1827: handle_request: request received (Version = 2) from PID 30345 1827: GETPWBYNAME (root) 1827: handle_request: request received (Version = 2) from PID 30345 1827: GETPWBYNAME (root) 1827: handle_request: request received (Version = 2) from PID 30346 1827: GETFDGR 1827: provide access to FD 9, for group 1827: handle_request: request received (Version = 2) from PID 30377 1827: GETFDHST 1827: provide access to FD 12, for hosts 1827: remove GETAI entry "xrated" 1827: remove GETHOSTBYADDR entry "127.0.0.2" 1827: Reloading "0" in password cache! 1827: remove INITGROUPS entry "root" nscd: mem.c:477: gc: Assertion `next_hash == &he[db->head->nentries]' failed. Aborted (core dumped)
This bug is marked NEEDINFO, but what info is needed? I've lost track.
Removing bogus NEEDINFO; there are some new crashes to look at, but I'm decreasing priority and severity since they seem to happen much more rarely.
(In reply to comment #115) > Removing bogus NEEDINFO; there are some new crashes to look at, but I'm > decreasing priority and severity since they seem to happen much more rarely. It crashes several times per day here - see some of my last watchdog entries - I have seen it crash four times in an hour: Mar 21 03:35:02 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 21 06:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 21 14:20:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 21 16:40:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 21 17:55:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 21 22:55:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 04:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 06:40:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 12:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 13:20:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 14:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 17:05:02 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 22 20:45:01 nimrodel watchdog: nscd is not running, restarting. 
-- Bugzilla 387202; see root's crontab to disable this wd Mar 22 22:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 00:05:02 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 03:20:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 03:25:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 03:30:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 03:35:02 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 03:45:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 05:45:02 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 13:50:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 23 23:05:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd Mar 24 02:20:01 nimrodel watchdog: nscd is not running, restarting. -- Bugzilla 387202; see root's crontab to disable this wd You should clear all asserts from the C code: they are not logged to syslog, only to console. 
Not all crashes are segfaults - see the kernel log for the same period: Mar 21 03:34:27 nimrodel kernel: nscd[3461]: segfault at bffe0178 ip b7e8e32e sp afe1d034 error 6 in libc-2.8.so[b7e20000+13d000] Mar 21 22:51:12 nimrodel kernel: nscd[19438]: segfault at b8000012 ip b7f66825 sp addbee6c error 4 in nscd[b7f56000+1c000] Mar 22 06:37:28 nimrodel kernel: nscd[31545]: segfault at bfffe44c ip b7fe4822 sp ade3ceec error 4 in nscd[b7fd4000+1c000] Mar 22 20:41:34 nimrodel kernel: nscd[10540]: segfault at bfff66bc ip b80bc825 sp adf14f5c error 4 in nscd[b80ac000+1c000] Mar 23 03:19:59 nimrodel kernel: nscd[17683]: segfault at fff1518c ip b7e42450 sp ade070b8 error 4 in libc-2.8.so[b7dd5000+13d000] cer@nimrodel:~>
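The watchdog script behind the log entries above is not shown in this report. A hypothetical sketch of what such a cron-driven check might look like follows; the function name and its two-argument interface are assumptions made for illustration only.

```shell
# Hypothetical watchdog check; NOT the script used in this report.
# $1: command that succeeds while the daemon is running
# $2: command that restarts the daemon
nscd_watchdog() {
    check=$1
    restart=$2
    # Left unquoted on purpose so "pidof nscd" splits into command and argument.
    if $check >/dev/null 2>&1; then
        return 0
    fi
    echo "nscd is not running, restarting."
    $restart
}
```

From root's crontab it could be invoked, for example every five minutes, as: nscd_watchdog "pidof nscd" "rcnscd start" 2>&1 | logger -t watchdog. Piping through logger is what would put the message into syslog, which is the point raised above about asserts only reaching the console.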
Clearing asserts will just make nscd segfault a few moments later, at a place that's even harder to debug. :-(
Of course. Clearing an assert doesn't mean commenting it out, just handling the error condition cleanly. Once an assert triggers, it means that a situation thought impossible by the programmer has in fact happened, and thus the code has to be changed to avoid that situation happening. On the other hand, an assert in such a daemon just kills the daemon silently, without any message to the user/admin. An assert is intended as a message from the dead to the creator of the program, so that the creator can reprogram the cylon. This is not happening. Those asserts are useless. Instead, the assert message should be sent to syslog with "warn" or "critical" level, and the program halted only after logging the situation. At best, the assert could be used to restart or reinit the daemon (the idea of dividing nscd into a parent and child is not so bad).
Changing the code to avoid the situation happening is the hard part, unfortunately. ;-) I agree that it would be nice if the assert() would be syslogged. I will try to make a patch when I finish the more urgent things on my hands. The asserts still certainly aren't useless, since a mere assert does not help anything anyway - you need to grab a core dump in order to really debug stuff. 95% of the crashes happen during database prune cycle; at this point, little but complete state reset can be done, and that's then pretty much equivalent to the watchdog solution.
I was running opensuse 10.2 on a 150+ node cluster with no problems, but when we upgraded to opensuse 11.1, we started seeing all sorts of network problems. I isolated many of the issues down to nscd dying. After bumping up the debug level, I saw messages like this in /var/log/messages: nscd[13710]: segfault at ffffff468bde0600 ip 00007f468aec3445 sp 00007f468080e7f8 error 4 in libc-2.9.so I'm running NIS, so when this happens I immediately get: do_ypcall: clnt_call: RPC: Unable to send; errno = Operation not permitted on any nodes with nscd down. I'm running unscd as a replacement with much more success, but I wanted to submit this issue to the bug report as there doesn't appear to be a recent update. I also wanted to know if running unscd is the recommended work-around/fix for now, as well as in the future. Thanks.
I can't speak for SuSE, and I recognise that nscd has to work in a number of different environments. But my experience (in an LDAP site) is that unscd has been a total success. Since November I have run it on a mixture of 10.3, 11.0 and 11.1 with not a single crash. I also run a monitor job which regularly checks the output of getent, and unscd has passed that test too. In contrast, every recent version of the SuSE nscd I have tried has crashed frequently, sometimes as often as every few minutes. Also, even more perniciously, it sometimes keeps running but gives wrong information for some usernames.
FYI: We're considering using unscd for the upcoming releases. The instability of upstream nscd is a constant hassle; although we have already provided many improvements, it's still a sad story.
For those interested, we have backported another patch from mainline that reportedly makes nscd considerably more stable; we will likely include this in future maintenance updates. If you need stable nscd in 11.1, please test packages that will be available at http://www.suse.de/~pbaudis/bug-505215/ - thanks!
> If you need stable nscd in 11.1, please test packages that > will be available at http://www.suse.de/~pbaudis/bug-505215/ - thanks! Better try this: http://www.suse.de/~pbaudis/bug-509398/ We will see what will be the outcome...
What are the details of bug 505215?
That bug contained support request by a customer containing some core dumps and backtraces pretty much the same as the ones attached to this bug. The outcome of the support request was the patch that I've included in the test build above.
Petr, after installing the packages from http://www.suse.de/~pbaudis/bug-505215/, I didn't get another crash here while running it for two days, whereas the previous versions crashed a few times per day. Unfortunately, there's an online update for 11.1 with glibc-2.10.1-3 and nscd-2.10.1-3 (note the higher version numbers!), which does NOT contain your latest fix, thus today those got installed (I've no idea how to prevent this with zypper; yum had a nice exclude pattern regex per repo for such cases). It would be nice to get yet another glibc/nscd update containing your fix soon.
Hi Petr, guess what, I managed to miss the "downgrade", as noted in https://bugzilla.novell.com/show_bug.cgi?id=387202#c128, until this Monday. But because I didn't harvest any nscd crashes in that time, something must be wrong :-[! Indeed, since Monday, with the "official" glibc/nscd update running, it crashes every few hours again. I think that another glibc/nscd update is not only in order, it's crucial for any serious use of 11.1, please...
I'm sorry that I don't have time to rebuild the package again with the correct revision number - I think SRPMs should be in that directory too so you should be able to do that yourself easily. A maintenance update for SLE11/11.1 is already in the making.
> I'm sorry that I don't have time to rebuild the package again with the
> correct revision number - I think SRPMs should be in that directory too so
> you should be able to do that yourself easily.

Of course, I can build that myself, but then I would have to distribute it to a bunch of pretty widespread systems, install it manually, and get rid of it when the official release happens. And this procedure is further complicated by the "not so convenient" behavior of zypper (e.g. one must use the deprecated 'zypper dup' in order to change vendor..).

> A maintenance update for SLE11/11.1 is already in the making.

That's what I'm after. Cool. Hopefully it doesn't get delayed until after the summer holiday season...
Will there be also a fix for openSUSE 11.0?
Unfortunately, that bugfix made another quite hard-to-track-down nscd bug show up, which also collided with my vacation, so this did get delayed. Hopefully I have found the culprit of that one too (an inverted boolean condition in glibc-2.3.5-nscd-zeronegtimeout.diff), and as soon as we get an ack that the issues are fixed, I'll push the button. In 11.2, unscd is already the default caching daemon instead of nscd. Unfortunately, an 11.0 update is laborious since it does not share codebase (and testing) with our SLE11 product, and thus it's unlikely it will be done, also since all 11.0 users that need nscd have probably found a safer way to deal with the problems than receiving a poorly tested nscd update through the maintenance channel. I recommend using unscd on 11.0 if you require a caching daemon.
The update is in process of being released.
(In reply to comment #133) > In 11.2, unscd is already the default caching daemon instead of nscd. And it doesn't work. At least, in my 11.2 M7 it doesn't start on boot, and "rcnscd start" fails. No messages in syslog or anywhere I can see. > Unfortunately, 11.0 update is laborous since it does not share codebase (and > testing) with our SLE11 product and thus it's unlikely it will be done, also > since all 11.0 users that need nscd probably found a safer way to deal with the > problems than receiving a poorly tested nscd update through the maintenance > channel. I recommend to use unscd on 11.0 if you require a caching daemon. AFAIK, all users of 11.0 "need" nscd, as it is part of the default system, and some programs or daemons may complain if nscd is not running or is not installed (dependencies).
The problem with unscd and 11.2 M7 (and 11.1 for that matter) is that unscd bumps up against the as-shipped apparmor profile - unscd uses setgroups and that appears to be a no-no. Either set nscd to report-only mode or shut off apparmor (which I do not recommend) until the profile can be repaired. Out of curiosity, why is the profile shipped by apparmor instead of nscd/unscd? As far as 11.0 users "needing" nscd, that's just plain bogus - I'm not aware of any software that *requires* nscd to be running (or even installed), but I'm prepared to be enlightened.
According to bug #157078 (marked as a duplicate of this bug) you _need_ nscd for Thunderbird and nss_ldap under 11.0. Should be fixed by providing a working nscd, see https://bugzilla.novell.com/show_bug.cgi?id=157078#c76
Carlos, then please open a new bug for that; unless it's an apparmor problem, we have that covered by bug 535467 and I've just fixed that one yesterday (Jon: ...by moving the apparmor profile to the nscd package ;-). Walter is right (fun fact: in 11.2+, nscd is also required to be running to work around some routers providing broken DNS services). You have convinced me :), I will prepare also 11.0 nscd update; I will need your cooperation for testing the update, though.
When you say "in 11.2+, nscd is also required to be running to work around some routers providing broken DNS services" - can you go into more detail? (By private mail if that is more appropriate.) Personally, I use dnsmasq to provide DNS caching/validation/local resolution services because I have to deal with broken routers, too. Pertinent to the upcoming 11.2 release: unscd's manpage needs a healthy update (it documents none of the command line switches) and I can't make 'nscd -i' work without a second option. The invocation (usage) text needs a slight update (it does not say that -i requires a parameter) - ideally a parameterless -i invocation would use something like "all" and invalidate all of the caches... That said, unscd is shaping up very nicely. The problems with nss_ldap are many and scary - I can see nscd being required to solve issues with that library. I'm very glad some nscd replacement is going to be present, and I'm overjoyed that it'll be installed by default in 11.2 - I suspect this will help out a great deal!
11.2+ uses optimized glibc name resolution mechanism that looks up IPv4 and IPv6 addresses in parallel instead of sequentially - this confuses some cheap network routers, making each process doing the resolution time out once; in case nscd is running, this is not an issue. If you have any problems with unscd, please open separate bugs for them - this bug is already huge and it's hard to track the unscd problems this way.
11.0 test nscd package is now available at http://www.suse.de/~pbaudis/bug-387202/ (the url should start working in ~1hour) - could 11.0 users please test if it works well for them? We can then release it as a maintenance update.
It seems no one who cares about nscd is still using 11.0; fair enough, I will close this bug, since we cannot release an untested maintenance update.
(In reply to comment #142) > It seems noone who cares about nscd is still using 11.0; fair enough, I will > close this bug, we cannot release an untested maintenance update. Sorry. This bug did not have "needinfo" when I asked bugzilla to display "my" buglist, so I didn't notice it when I looked. Blame radar failure. (and anyway, I have a watch daemon that restarts the dead nscd automatically). I'm using both 11.0 and 11.2, so I will attempt testing your package this weekend at the latest.
I have downloaded your rpm. However, that rpm is version 2.8.4-14.4, and I already had installed that version, and as I commented on #135, it crashes. I have installed your version, nonetheless, and I will report what happens. [...] Four hours later it is still running fine. I will report on the weekend if it does not crash, or earlier if it does. I'll attempt to leave the bugzilla as needinfo from myself.
(In reply to comment #144)
> I have downloaded your rpm. However, that rpm is version 2.8.4-14.4, and I
> already had installed that version, and as I commented on #135, it crashes.
>
> I have installed your version, nonetheless, and I will report what happens.
> [...]
> Four hours later it is still running fine. I will report on the weekend if it
> does not crash, or earlier if it does. I'll attempt to leave the bugzilla as
> needinfo from myself.

cer@nimrodel:~> ps axu | grep nscd
root 7239 0.0 0.0 141508 1024 ? Ssl Jan19 0:07 /usr/sbin/nscd

And today is the 24th, so it hasn't crashed. True, I have hibernated the machine, so it hasn't been running that many hours, but it is good news: it hasn't crashed yet. Looks good!
I have installed the new version of nscd too and have not seen any crashes on these machines in the last few days (usually I get around one crash a day). Seems to be a great improvement. (Sorry missed the checking of the new patch too.)
Thank you all for testing! Maintenance, we are getting positive feedback, ok to release the update? Can we have a SWAMPID?
I am undecided.... a large change which might be risky, but fixing some bugs for customers at least ... :/ I would however say yes ... +1
If I recall my 11.0 experience with nscd... The biggest risk would be that nscd starts working suddenly :). So +1 for an update.
I suspect the people who need a working nscd in an LDAP environment gave up on the default one a long time ago and switched to unscd, so that's probably why you didn't get more feedback.
The SWAMPID for this issue is 30491. Please submit the patch and patchinfo file using this ID. (https://swamp.suse.de/webswamp/wf/30491)
convinced by security ;)
Thanks, patchinfo and package submitted.