Bug 467161

Summary: lam-mpi segfaults if nscd is not running; EINPROGRESS not handled in libc somewhere
Product: [openSUSE] openSUSE 11.4
Reporter: Toni Harbaugh-Blackford <harbaugh>
Component: Basesystem
Assignee: Philipp Thomas <pth>
Status: RESOLVED WORKSFORME
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P3 - Medium
CC: forgotten_xs3PtXj4XH, harbaugh, lchiquitto, marcin.mogielnicki, radmanic, ralf, stefan.fent
Version: Milestone 4 of 6
Flags: coolo: SHIP_STOPPER-
Target Milestone: Factory
Hardware: x86-64
OS: SUSE Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Toni Harbaugh-Blackford 2009-01-17 14:24:11 UTC
User-Agent:       Mozilla/5.0 (X11; U; Linux i686 (x86_64); en-US; rv:1.8.1.14) Gecko/20080404 Firefox/2.0.0.14

When nscd is not running, even a simple LAM-MPI 'hello world' program segfaults
within the call getpwuid(getuid()), which should not happen.  strace shows that the
segfault happens after an EINPROGRESS return from the LDAP communication.

After restarting nscd, the LAM-MPI 'hello world' program runs without error.

The problem appears to lie not in LAM-MPI itself but somewhere in libc.


Reproducible: Always

Steps to Reproduce:
1. stop nscd if it is running ('/etc/init.d/nscd stop')
2. set coredump rlimit to unlimited ('ulimit -c unlimited')
3. lamboot -v
4. hcc -o lam_hello lam_hello.c -lmpi
5. mpirun -np 1 ./lam_hello

step 5 produces core dump

6. mpirun -np 1 strace ./lam_hello > lam_hello.out 2>&1

step 6 captures strace output

7. lamhalt

Actual Results:  
coredump

$ gdb ./lam_hello core
.
.
.
Loaded symbols for /lib64/libnss_dns.so.2
Core was generated by `./lam_hello'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
(gdb) bt
#0  0x00007f723ff85e3a in ?? () from /lib64/libc.so.6
#1  0x00007f723ff86e38 in realloc () from /lib64/libc.so.6
#2  0x00007f723e4133a9 in CRYPTO_realloc () from /usr/lib64/libcrypto.so.0.9.8
#3  0x00007f723e46cb92 in lh_insert () from /usr/lib64/libcrypto.so.0.9.8
#4  0x00007f723e4700ea in ?? () from /usr/lib64/libcrypto.so.0.9.8
#5  0x00007f723e4702d7 in ?? () from /usr/lib64/libcrypto.so.0.9.8
#6  0x00007f723e46f73c in ERR_load_ERR_strings () from /usr/lib64/libcrypto.so.0.9.8
#7  0x00007f723e470339 in ERR_load_crypto_strings () from /usr/lib64/libcrypto.so.0.9.8
#8  0x00007f723e75c2f9 in SSL_load_error_strings () from /usr/lib64/libssl.so.0.9.8
#9  0x00007f723f6bc03c in ldap_pvt_tls_init () from /usr/lib64/libldap-2.4.so.2
#10 0x00007f723f6bc951 in ldap_int_tls_start () from /usr/lib64/libldap-2.4.so.2
#11 0x00007f723f8d4580 in ?? () from /lib64/libnss_ldap.so.2
#12 0x00007f723f8d4d14 in ?? () from /lib64/libnss_ldap.so.2
#13 0x00007f723f8d553e in ?? () from /lib64/libnss_ldap.so.2
#14 0x00007f723f8d5bcf in ?? () from /lib64/libnss_ldap.so.2
#15 0x00007f723f8d61b9 in _nss_ldap_getpwuid_r () from /lib64/libnss_ldap.so.2
#16 0x00007f723fd08ab8 in ?? () from /lib64/libnss_compat.so.2
#17 0x00007f723fd08cad in ?? () from /lib64/libnss_compat.so.2
#18 0x00007f723fd09040 in _nss_compat_getpwuid_r () from /lib64/libnss_compat.so.2
#19 0x00007f723ffaecfc in getpwuid_r () from /lib64/libc.so.6
#20 0x00007f723ffae55f in getpwuid () from /lib64/libc.so.6
#21 0x00007f72408a8f13 in lam_tmpdir_init_opt () from /usr/lib64/liblam.so.0
#22 0x00007f72408b27d8 in _cio_init () from /usr/lib64/liblam.so.0
#23 0x00007f72408b3099 in _cipc_init () from /usr/lib64/liblam.so.0
#24 0x00007f72408b3b62 in kinit () from /usr/lib64/liblam.so.0
#25 0x00007f72408b38ab in kenter () from /usr/lib64/liblam.so.0
#26 0x00007f7240d308da in lam_linit () from /usr/lib64/libmpi.so.0
#27 0x00007f7240d32870 in lam_mpi_init () from /usr/lib64/libmpi.so.0
#28 0x00007f7240d2bc63 in MPI_Init () from /usr/lib64/libmpi.so.0
#29 0x0000000000400898 in main ()
(gdb) quit
Quitting: You can't do that without a process to debug.


Expected Results:  
After restarting nscd, 'hello world' runs:

$ sudo /etc/init.d/nscd start                         
Starting Name Service Cache Daemon                                                                                                       done
$ mpirun -np 1 ./lam_hello                            
From process: 0 out of 1, Hello World! 


Here is the tail of the strace output from step 6 of the 'steps to reproduce',
showing the communication with the LDAP server:

$ tail -30 lam_hello.out
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.85")}, 16) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(43102), sin_addr=inet_addr("129.43.63.154")}, [16]) = 0
close(4)                                = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_IP) = 4
fcntl(4, F_SETFD, FD_CLOEXEC)           = 0
setsockopt(4, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
setsockopt(4, SOL_TCP, TCP_NODELAY, [1], 4) = 0
fcntl(4, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
connect(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, 16) = -1 EINPROGRESS (Operation now in progress)
poll([{fd=4, events=POLLOUT|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLOUT}])
getpeername(4, {sa_family=AF_INET, sin_port=htons(389), sin_addr=inet_addr("129.43.52.86")}, [16]) = 0
fcntl(4, F_GETFL)                       = 0x802 (flags O_RDWR|O_NONBLOCK)
fcntl(4, F_SETFL, O_RDWR)               = 0
write(4, "0\35\2\1\1w\30\200\0261.3.6.1.4.1.1466.20037", 31) = 31
poll([{fd=4, events=POLLIN|POLLPRI|POLLERR|POLLHUP}], 1, 30000) = 1 ([{fd=4, revents=POLLIN}])
read(4, "0\f\2\1\1x\7\n", 8)            = 8
read(4, "\1\0\4\0\4\0", 6)              = 6
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV (core dumped) +++
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE).  You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------


Again, the stack trace of the core dump is identical to the one shown above.
Comment 1 Toni Harbaugh-Blackford 2009-01-17 14:28:11 UTC
This happens in SLES 11 RC1 also, so hopefully we can get it patched
before SLES 11 is GA?

Thanks,
Toni
Comment 2 Petr Baudis 2009-02-13 11:07:32 UTC
nss_ldap -> Ralf
Comment 4 Milisav Radmanic 2009-03-02 13:22:53 UTC
(In reply to comment #1)
> This happens in SLES 11 RC1 also, so hopefully we can get it patched
> before SLES 11 is GA?
> 
> Thanks,
> Toni

How did you test this on SLES 11 RC1? There is no maintained lam package available for SLES 11. Furthermore, the code branch for openSUSE 11.2 isn't even in Alpha state yet.
And on openSUSE 11.1 (where the lam package is available) the issue can't be reproduced as described.

regards
Milisav
Comment 5 Toni Harbaugh-Blackford 2009-03-02 13:37:00 UTC
I used openmpi on SLES instead of lam, with the same results.

I disabled ldap and used 'plain' /etc/passwd, with the same results.
Comment 6 Toni Harbaugh-Blackford 2009-03-02 13:37:51 UTC
I have not tested SLES 11 since RC1.
Comment 7 Milisav Radmanic 2009-03-02 14:21:04 UTC
(In reply to comment #5)
> I used openmpi on SLES instead of lam, with the same results
> 
> I disabled ldap and used 'plain' /etc/passwd, with the same results.


How do you use mpirun without lam? Can you please describe how to reproduce the error now? If I use mpicc to compile a hello world example like this:


/*                      
 * Sample hello world MPI program for testing MPI.
 */

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int
main(int argc, char **argv)
{
  int rank, size;

  /* Start up MPI */

  MPI_Init(&argc, &argv);

  /* Get some info about MPI */

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Print out the canonical "hello world" message */

  printf("Hello, world!  I am %d of %d\n", rank, size);

  /* All done */

  MPI_Finalize();
  return 0;
}



I can run it with mpirun without the error above.


Thanks
Milisav
Comment 8 Toni Harbaugh-Blackford 2009-03-02 14:39:18 UTC
steps to reproduce with openmpi

1) make sure openmpi binaries are in $PATH and libs are in $LD_LIBRARY_PATH
2) /etc/init.d/nscd stop
3) mpicc -o hello hello.c
4) mpirun -np 1 ./hello

You *must* stop nscd; if nscd is running the program will work.  If nscd
is not running, the program will segfault in getpwuid_r().

Again, I have not tested this on SLES 11 RC4, just RC1.
Comment 10 John Jolly 2009-03-16 13:24:27 UTC
This seems to be a problem with getpwuid, but only with the OpenMPI build.  I am trying to track down the problem with the build right now.

This seems to be fixed in OpenMPI 1.3, but as this problem is in SLES 10 SP2, I am unable to update to a newer version.
Comment 11 Toni Harbaugh-Blackford 2009-05-13 18:24:15 UTC
I tested this on the latest openSUSE Factory, and the problem appears to be
resolved for all versions of LAM and openmpi that I tried (old and new).

Could someone else please test?

thanks,
toni
Comment 12 Toni Harbaugh-Blackford 2009-05-29 10:50:06 UTC
The problem is not fixed; recent versions of Factory just mask it
with MALLOC_CHECK_, which is now set by default in /etc/profile.d.

If nscd is NOT running, all MPI applications
(no matter what version) segfault if MALLOC_CHECK_ is NOT set.
If MALLOC_CHECK_ *is* set, MPI applications succeed, but there is
no indication of any problem, even though certain MALLOC_CHECK_ settings
*SHOULD* report an error if there is one.  If there really *is* a problem,
why doesn't MALLOC_CHECK_ report it?

Here are the results for various settings of MALLOC_CHECK_ for the
'hello world' MPI program (in this case using openmpi, but it doesn't matter)
on the latest Factory (2.6.30-rc6-git3-4-default):

---- MALLOC_CHECK_ not set -------
$ ./openmpi_hello.sh 
[mandy-abcc:28858] *** Process received signal ***
[mandy-abcc:28858] Signal: Floating point exception (8)
[mandy-abcc:28858] Signal code: Integer divide-by-zero (1)
[mandy-abcc:28858] Failing at address: 0x7f7df4b81ccd
[mandy-abcc:28858] [ 0] /lib64/libpthread.so.0 [0x7f7df4e75a90]
[mandy-abcc:28858] [ 1] /lib64/libc.so.6 [0x7f7df4b81ccd]
[mandy-abcc:28858] [ 2] /lib64/libc.so.6(cfree+0x76) [0x7f7df4b83c76]
[mandy-abcc:28858] [ 3] /usr/lib64/libcrypto.so.0.9.8(CRYPTO_free+0x19) [0x7f7df30265b9]
[mandy-abcc:28858] [ 4] /usr/lib64/libssl.so.0.9.8 [0x7f7df33724e4]
[mandy-abcc:28858] [ 5] /usr/lib64/libssl.so.0.9.8(ssl_create_cipher_list+0x442) [0x7f7df3372ad2]
[mandy-abcc:28858] [ 6] /usr/lib64/libssl.so.0.9.8(SSL_CTX_new+0x1d3) [0x7f7df336d433]
[mandy-abcc:28858] [ 7] /usr/lib64/libldap-2.4.so.2 [0x7f7df42d0a65]
[mandy-abcc:28858] [ 8] /usr/lib64/libldap-2.4.so.2 [0x7f7df42d0f29]
[mandy-abcc:28858] [ 9] /usr/lib64/libldap-2.4.so.2 [0x7f7df42d10f7]
[mandy-abcc:28858] [10] /usr/lib64/libldap-2.4.so.2(ldap_int_tls_start+0x68) [0x7f7df42d1228]
[mandy-abcc:28858] [11] /lib64/libnss_ldap.so.2 [0x7f7df44e9580]
[mandy-abcc:28858] [12] /lib64/libnss_ldap.so.2 [0x7f7df44e9d14]
[mandy-abcc:28858] [13] /lib64/libnss_ldap.so.2 [0x7f7df44ea53e]
[mandy-abcc:28858] [14] /lib64/libnss_ldap.so.2 [0x7f7df44eabcf]
[mandy-abcc:28858] [15] /lib64/libnss_ldap.so.2(_nss_ldap_getpwuid_r+0x49) [0x7f7df44eb1b9]
[mandy-abcc:28858] [16] /lib64/libnss_compat.so.2 [0x7f7df4705ab8]
[mandy-abcc:28858] [17] /lib64/libnss_compat.so.2 [0x7f7df4705cad]
[mandy-abcc:28858] [18] /lib64/libnss_compat.so.2(_nss_compat_getpwuid_r+0x100) [0x7f7df4706040]
[mandy-abcc:28858] [19] /lib64/libc.so.6(getpwuid_r+0xec) [0x7f7df4baecfc]
[mandy-abcc:28858] [20] /lib64/libc.so.6(getpwuid+0x6f) [0x7f7df4bae55f]
[mandy-abcc:28858] [21] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(orte_sys_info+0xb4) [0x7f7df5b74ad4]
[mandy-abcc:28858] [22] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(orte_init_stage1+0xed) [0x7f7df5b6eded]
[mandy-abcc:28858] [23] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(orte_system_init+0xa) [0x7f7df5b71f1a]
[mandy-abcc:28858] [24] /usr/lib64/mpi/gcc/openmpi/lib64/libopen-rte.so.0(orte_init+0x44) [0x7f7df5b6ead4]
[mandy-abcc:28858] [25] mpirun(orterun+0x164) [0x402d90]
[mandy-abcc:28858] [26] mpirun(main+0x1b) [0x402c27]
[mandy-abcc:28858] [27] /lib64/libc.so.6(__libc_start_main+0xe6) [0x7f7df4b2c586]
[mandy-abcc:28858] [28] mpirun [0x402b49]
[mandy-abcc:28858] *** End of error message ***
./openmpi_hello.sh: line 13: 28858 Floating point exception mpirun -host $MYHOST -mca btl self -np 1 ./mpi_hello2

---- MALLOC_CHECK_=0

$ export MALLOC_CHECK_=0
$ ./openmpi_hello.sh    
From process: 0 out of 1, Hello World! on system mandy-abcc 

---- MALLOC_CHECK_=1

$ export MALLOC_CHECK_=1
$ ./openmpi_hello.sh    
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
From process: 0 out of 1, Hello World! on system mandy-abcc 

---- MALLOC_CHECK_=2

$ export MALLOC_CHECK_=2
$ ./openmpi_hello.sh    
From process: 0 out of 1, Hello World! on system mandy-abcc 

---- MALLOC_CHECK_=3

$ export MALLOC_CHECK_=3
$ ./openmpi_hello.sh    
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
malloc: using debugging hooks
From process: 0 out of 1, Hello World! on system mandy-abcc
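For reference (my addition, based on the glibc documentation rather than this report): glibc documents MALLOC_CHECK_ as controlling what happens when heap corruption *is detected*, and its consistency checks only catch a subset of corruption, typically at the moment the damaged chunk is freed or reallocated. That may explain why the runs above succeed silently: the checking allocator can change heap layout enough that the corruption is never detected, not that it is absent.

```shell
# Documented MALLOC_CHECK_ semantics (see mallopt(3) / glibc manual):
#   0  silently ignore any detected heap corruption
#   1  print a diagnostic message on stderr and continue
#   2  call abort() immediately
#   3  print a diagnostic and abort (combination of 1 and 2)
# Exporting it before the run enables the slower, checking allocator:
export MALLOC_CHECK_=3
echo "MALLOC_CHECK_=$MALLOC_CHECK_"
```

So "MPI applications succeed" under MALLOC_CHECK_=0/1/2/3 means only that no corruption was detected on those runs, which is weaker than proof that the bug is gone.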
Comment 13 Marcin Mogielnicki 2010-03-18 12:12:17 UTC
As this bug report has not been updated for quite a long time, I'd like to ask: has the cause been identified and fixed (not necessarily in a SUSE release; any pointer to a fixed version anywhere will do)? I'm hit by this bug, and the workaround, i.e. MALLOC_CHECK_, makes my machines leak memory on the vanilla SLES 11 kernel. Memory does not leak as long as no binaries requiring MALLOC_CHECK_ run there. Losing 100 GB overnight is not uncommon, so I'm desperate to identify what causes it. I can see neither a problem description nor a bug report about my problem anywhere.
Comment 14 Philipp Thomas 2010-12-13 14:28:09 UTC
OK, I can confirm it still fails in current Factory, so this issue is still unsolved. Moving to Factory therefore.
Comment 15 Forgotten User xs3PtXj4XH 2011-02-20 18:03:50 UTC
Looking for an update on this prior to the release of 11.4.  Does it still fail on 11.4 RC1?
Comment 16 Philipp Thomas 2011-11-03 13:04:22 UTC
It seems to work in SP2, thus closing.