Bugzilla – Bug 157078
thunderbird segfaults upon startup when nss_ldap is in use
Last modified: 2011-03-04 12:31:22 UTC
just this:
jmatejek@titan:~> thunderbird
/usr/bin/thunderbird: line 137: 22466 Segmentation fault $AOSS $MOZ_PROGRAM $@
strace is attached, i was unable to get a backtrace, but i can provide it if you tell me what to do
Created attachment 72281 [details] strace of the thunderbird process
This does not happen on my machine. Did you try using your old profile here? Does it work with a fresh profile?
no, it fails even if I remove the profile completely. however, this only happens for an LDAP user with an NFS home. on a local account, TB works fine even with the original profile
You chose Hardware:Other, please correct that, which CPU are you using?
it's Athlon XP. precisely:
jmatejek@titan:/data/yapt/stable-all> cat /proc/cpuinfo
processor : 0
vendor_id : AuthenticAMD
cpu family : 6
model : 6
model name : AMD Athlon(tm) XP 1600+
stepping : 2
cpu MHz : 1400.272
cache size : 256 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 mmx fxsr sse syscall mmxext 3dnowext 3dnow ts
bogomips : 2802.89
Not critical since it only happens in special cases. I couldn't reproduce it yet.
The "special case" is where account info is stored in LDAP. To reproduce create an LDAP user account then attempt to run thunderbird as that user.
Created attachment 73841 [details] thunderbird backtrace (10.1 beta 8)
Sorry, should have said above, this is still occurring in 10.1 beta 8.
This is nss_ldap related it seems. Ralf, could you please have a look at the backtrace?
Yes, seems so. I'll take a look.
Created attachment 76595 [details] another backtrace (Beta 9)
Ugh. The problem seems to be that Thunderbird ships with its own version of libldap (most probably the one from the Mozilla LDAP SDK), which has symbol name conflicts with the system's libldap (from OpenLDAP). See the newly attached backtrace. It calls a few functions from /usr/lib64/libldap-2.3.so.0 and then suddenly dives into /usr/lib64/thunderbird/libldap50.so. Is it possible to link Thunderbird against the OpenLDAP version of libldap? Other options are to either link nss_ldap statically against libldap or link Thunderbird statically against its version of libldap.
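One way to confirm that both libldap builds really end up in the same process is to look at the files the process has mapped. The following is a minimal sketch, assuming a Linux /proc/<pid>/maps listing. The function name `mapped_libs` and the sample text (including its addresses) are hypothetical; only the two library paths are taken from the backtrace described above.

```python
import os

def mapped_libs(maps_text, prefix="libldap"):
    """Return the distinct mapped file paths whose basename starts with prefix."""
    found = []
    for line in maps_text.splitlines():
        # Columns: address perms offset dev inode [pathname]
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue  # anonymous mapping, no backing file
        path = fields[5].strip()
        if os.path.basename(path).startswith(prefix) and path not in found:
            found.append(path)
    return found

# Illustrative excerpt only: addresses, devices and inodes are made up.
SAMPLE = """\
2b0000000000-2b0000040000 r-xp 00000000 08:01 1001 /usr/lib64/libldap-2.3.so.0
2b0000040000-2b0000042000 rw-p 00040000 08:01 1001 /usr/lib64/libldap-2.3.so.0
2b0000100000-2b0000130000 r-xp 00000000 08:01 2002 /usr/lib64/thunderbird/libldap50.so
2b0000200000-2b0000221000 rw-p 00000000 00:00 0
"""

print(mapped_libs(SAMPLE))
```

For a live process the text would come from reading /proc/<pid>/maps. Two entries in the result would suggest that two different LDAP client libraries share one symbol namespace.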
*** Bug 153444 has been marked as a duplicate of this bug. ***
(In reply to comment #11)
> Ugh. The problem seems to be that Thunderbird ships with its own version of
> libldap (most probably the one from the mozilla LDAP SDK), which has symbol
> name conflicts with the System's libldap (from OpenLDAP). See the new attached
> Backtrace. It calls a few functions from /usr/lib64/libldap-2.3.so.0 and then
> suddenly dives into /usr/lib64/thunderbird/libldap50.so.
yes, just noticed it myself in the backtrace from the other bug.
> Is it possible to link Thunderbird against the OpenLDAP version of libldap?
No, there is no option for that, and it would probably require many changes.
> Other options are to either link nss_ldap statically against libldap or link
> Thunderbird statically against its version of libldap.
I will check if it's possible to link its own ldap version statically. Thanks.
just note from #153444 Comment #9 - in my case this happens _only_ if network manager is used. (ok, I'm repeating this info, but just in case...)
It's most likely not possible to fix this in the 10.1 timeframe. We will have that problem at least if nss_ldap is active. Lukas, your backtrace also shows that nss_ldap is in use. Could this be the case or do we have some side issue which pulls in nss_ldap?
> Lukas, your backtrace shows also that nss_ldap is in use. Could this be the > case or do we have some side issue which pulls in nss_ldap? I use ldap for user authentication, so there's probably no side side issue which pulls in nss_ldap.
(In reply to comment #11)
> Other options are to either link nss_ldap statically against libldap or link
> Thunderbird statically against its version of libldap.
Would this really solve the problem? I'm not sure how this would change the namespace conflicts.
(In reply to comment #17)
> (In reply to comment #11)
> > Other options are to either link nss_ldap statically against libldap or link
> > Thunderbird statically against its version of libldap.
>
> Would this really solve the problem? I'm not sure how this would change the
> namespace conflicts.
I think so. But after thinking a bit about it, linking nss_ldap statically against OpenLDAP's libldap will most likely be problematic, as that would also mean that nss_ldap needs to be statically linked against cyrus-sasl, which is AFAIK not possible (cyrus-sasl might need to dlopen some plugins at runtime).
As a "workaround" you can force install thunderbird-1.5.0.2-1.1.fc5.whatever.rpm to get a working mail client (sorry kmail). You have to add the --nodeps flag 'cause nspr and nss differ between the distros but this works as a temporary solution for me.....
I wonder why that should make a difference. I wouldn't expect one because it's not a SUSE-specific problem but a basic one.
*** Bug 188557 has been marked as a duplicate of this bug. ***
Same thing here on SuSE 10.0. Not sure why this started failing all of a sudden. I worked around the failure by running `getent passwd jerry >> /etc/passwd`, where jerry was a posixAccount in my directory service.
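Running the getent redirect twice would append the same account twice. A hedged sketch of an idempotent variant follows; the helper names `passwd_line` and `ensure_local_entry` are hypothetical, and writing to the real /etc/passwd would of course need root privileges.

```python
import pwd

def passwd_line(entry):
    """Format a pwd entry as one passwd line (seven colon-separated fields)."""
    return ":".join(str(field) for field in (
        entry.pw_name, entry.pw_passwd, entry.pw_uid, entry.pw_gid,
        entry.pw_gecos, entry.pw_dir, entry.pw_shell))

def ensure_local_entry(username, passwd_path, lookup=pwd.getpwnam):
    """Append username's entry to passwd_path unless one is already present.

    lookup resolves the name through NSS (and therefore LDAP) by default.
    Returns True if a line was appended, False if nothing had to be done.
    """
    with open(passwd_path) as handle:
        if any(line.split(":", 1)[0] == username for line in handle):
            return False
    entry = lookup(username)
    with open(passwd_path, "a") as handle:
        handle.write(passwd_line(entry) + "\n")
    return True
```

Calling `ensure_local_entry("jerry", "/etc/passwd")` as root would mirror the workaround above. The local copy goes stale whenever the directory entry changes, which is part of why later comments call this approach kludgey.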
(In reply to comment #19)
> As a "workaround" you can force install
> thunderbird-1.5.0.2-1.1.fc5.whatever.rpm to get a working mail client
It would be nice to have a fix in SUSE 10.1 because on our site we use nss_ldap so it is broken for everyone.
I hit this problem (on SuSE 10.1) and discovered a simpler workround mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=292127
Just ensure that the nscd daemon is running (for some reason it had crashed on my systems):
# chkconfig nscd on
# rcnscd start
The same problem is still in 10.2 Beta1. After uninstalling the nss_ldap package, thunderbird started working again.
*** Bug 287199 has been marked as a duplicate of this bug. ***
Still in 10.2 release. Running "nscd" as Bob Vickers mentioned does work around it, but (at least for me) only temporarily: I have to re-start nscd every few hours or Thunderbird starts crashing at startup again. Gerald Carter's workaround of adding a dummy entry in your /etc/passwd file corresponding to the LDAP entry worked better for me, but I'm just on a single-user system...
Created attachment 180971 [details] valgrind output
I've hit this bug also on openSUSE 10.3. Uninstalling nss_ldap helped, too.
The problem is that the nss daemon dies. Calling "rcnscd restart" is sufficient. Uninstalling the nss_ldap package is not an option, because without it you can't authenticate against ldap.
Talked to Michael Meeks who has some experience with the feature of "interposing" (which is what causes the symbols to override each other). There are a couple of solutions:
1) Re-namespace the thunderbird symbols like cairo in the upstream bug
2) Dlopen both ldap's with RTLD_LOCAL
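Option 2 can be illustrated from Python, whose ctypes module wraps dlopen() and accepts the same mode flags. This is only a sketch of the flag usage: libm serves as a harmless stand-in for the two LDAP libraries, and the helper name `dlopen_local` is hypothetical.

```python
import ctypes
import ctypes.util
import os

def dlopen_local(name, fallback):
    """dlopen() a library with RTLD_LOCAL so its symbols stay out of the global scope."""
    path = ctypes.util.find_library(name) or fallback
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_LOCAL)

# Stand-in library; in the scenario discussed here it would be each libldap build.
libm = dlopen_local("m", "libm.so.6")
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

# The symbol is resolved through this handle (dlsym on the handle), not globally.
print(libm.sqrt(9.0))
```

With RTLD_LOCAL the loaded library's symbols are not made available for resolving references in libraries loaded later, so two libraries exporting the same names need not collide. This only helps when whoever loads the libraries controls the dlopen() call.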
Weeelll ... if you can statically link openldap into nss, and then use a map file to ensure that none of its symbols are exported: that would do the trick. Hopefully, (being cross platform) we have to maintain such a map-file anyway; so it should be possible to ensure that we don't get any openldap symbols straying out into the main process. Failing that, as JP says if you can dlopen *both* versions RTLD_LOCAL - then you'd be alright, but the 1st solution is far simpler. Poke in 'info ld' for:
"When it is used to export symbols in executables, it is similar to `--export-dynamic', except for that symbols can be exported selectively with a version script.
{ global: foo; bar; local: *; };" etc.
but perhaps I'm teaching my grandparents to suck eggs :-)
(In reply to comment #31 from Michael Meeks)
> Weeelll ... if you can statically link openldap into nss, and then use a map
> file to ensure that none of it's symbols are exported: that would do the
> trick.
Yes, static linking should fix this problem. But it will create other headaches, mainly maintenance related (see also comment #18). You will have to link all of libldap's dependencies statically into nss_ldap as well (mainly openssl and cyrus-sasl). That means with every update of those dependent packages (security or "normal" maintenance) we will have to release nss_ldap updates as well. So IMO we should try to avoid linking nss_ldap statically.
> Hopefully, (being cross platform) we have to maintain such a map-file
> anyway;
> so it should be possible to ensure that we don't get any openldap symbols
> straying out into the main process.
nss_ldap already uses such a version script; it only exports the needed symbols. IIRC the problem with Thunderbird is not that it calls functions from the libldap that nss_ldap is linked against (OpenLDAP), but that nss_ldap calls into the libldap that Thunderbird is linked against (Mozilla LDAP SDK).
> Failing that, as JP says if you can dlopen *both* versions RTLD_LOCAL - then
> you'd be alright, but the 1st solution is far simpler. Poke in 'info ld' for:
>
> "When it is used to export symbols in
> executables, it is similar to `--export-dynamic', except for that
> symbols can be exported selectively with a version script.
>
> { global: foo; bar; local: *; };" etc.
>
> but perhaps I'm teaching my grandparents to suck eggs :-)
*** Bug 349545 has been marked as a duplicate of this bug. ***
Strangely enough, I'm experiencing the same issues with nss_ldap-253-19.4/OS10.2. Restarting nscd temporarily fixes the problem.
This Bug is still present in latest MozillaThunderbird 2.0.0.12_1.1 in OpenSUSE 10.3. It helps to add --without-ldap to the configuration options, remove all ldap files from the specfile and then rebuild the rpm.
Well - it's an interesting bug certainly :-) glibc dlopen's the *nss* libraries - that then link to openldap client:
26206: transferring control: wget
26206: --2008-05-26 15:28:47-- http://www.ibm.com/ Resolving www.ibm.com...
26206: find library=libnss_files.so.2 [0]; searching
26206: search cache=/etc/ld.so.cache
26206: trying file=/lib64/libnss_files.so.2
this good stuff then links directly to openldap-client:
$ ldd /lib/libnss_ldap.so.2
linux-gate.so.1 => (0xffffe000)
libldap-2.4.so.2 => /usr/lib/libldap-2.4.so.2 (0xf7e6b000)
liblber-2.4.so.2 => /usr/lib/liblber-2.4.so.2 (0xf7e5c000)
...
$ rpm -qf /usr/lib/libldap-2.4.so.2
openldap2-client-32bit-2.4.9-4
etc. So - the symbol problems are not -such- an issue for the glibc pieces since they are versioned GLIBC_PRIVATE, but as soon as we link the external library we have:
$ objdump -T /lib/libnss_ldap.so.2
/lib/libnss_ldap.so.2: file format elf32-i386
DYNAMIC SYMBOL TABLE:
...
00000000 DF *UND* 000002d8 ldap_first_attribute
...
00000000 DF *UND* 0000002b ldap_memfree
etc. which cause the grief of course. SOOooo ... Perhaps this is something for RTLD_DEEPBIND in glibc ?
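For readers unfamiliar with the flag: RTLD_DEEPBIND is a glibc extension to dlopen() that makes a library prefer its own symbols, and those of its dependencies, over same-named symbols already present in the global scope. The sketch below only shows how the flag is passed, again with libm as a stand-in. It is not the glibc patch discussed in this bug, which presumably changes how glibc itself loads NSS modules, and the name `dlopen_deepbind` is hypothetical.

```python
import ctypes
import ctypes.util
import os

# RTLD_DEEPBIND is glibc-specific; fall back to 0 where Python does not expose it.
DEEPBIND = getattr(os, "RTLD_DEEPBIND", 0)

def dlopen_deepbind(name, fallback):
    """dlopen() a library so that it binds to its own dependencies first."""
    path = ctypes.util.find_library(name) or fallback
    return ctypes.CDLL(path, mode=os.RTLD_NOW | os.RTLD_LOCAL | DEEPBIND)

# Stand-in library; the real case would be libnss_ldap and its OpenLDAP libldap.
libm_deep = dlopen_deepbind("m", "libm.so.6")
libm_deep.cos.restype = ctypes.c_double
libm_deep.cos.argtypes = [ctypes.c_double]

print(libm_deep.cos(0.0))
```

Applied to the situation above, the undefined ldap_* references in libnss_ldap would be resolved against the libldap it was linked with rather than against whatever libldap the application had already loaded.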
I've built a test package with the RTLD_DEEPBIND thing on, please get it at: http://www.gnome.org/~rodrigo/glibc-2.8-13.src.rpm
it's an SRPM, so just rebuild it with:
$ rpmbuild -ba glibc-2.8-13.src.rpm
was a pain to rebuild :P but with this glibc, thunderbird does not segfault anymore.
Is it possible to have this incorporated with 11.0 release?
(resetting needinfo, i forgot to do that before)
Depends on glibc maintainers, CCing them
Friedrich: Rebuilding --without-ldap may be a problem in multiuser environments where reliance on ldap directories is high (.edu and large .com). Over the past year I had 93 support requests concerning this particular bug alone, which, for 250 active users, places this problem among the top 3 most reported bugs for our organisation.
Created attachment 219530 [details] The patch I added to the glibc package, waiting for approval
Petr - this is potentially a glibc issue ;-)
Note that this bug is present in the final version of 11.0 as well as the beta version. It has become much more serious because in my experience nscd is even more unreliable under 11.0 than it was under 10.2, so the workround is less effective.
Created attachment 226434 [details] stabilised nscd.conf
I managed to stabilise nscd somewhat on 11.0 (GM) by tweaking the config file. This has been rolled out on 250+ workstations and so far (fingers crossed) I've got no reports of segfaulting gtk applications (before, both firefox, thunderbird and openoffice would break repeatedly). I am using ldap authentication in a dual-server setup using sync replication. Relevant lines from ldap.conf below:
host ldap0.mydomain ldap1.mydomain
bind_policy soft
bind_timeout 3
As for nscd.conf (attached) the cache values have been found experimentally. The drawback of those values is that nscd consumes about 128MB of memory, which is an overkill, but at least it doesn't trip over itself every so often. For our systems, this configuration is a functioning workaround. It MAY work for you.
I can confirm this bug (users are LDAP authenticated and have NFS homes) for OpenSUSE 11.0 x86_64 and the nscd workaround (don't know about the reliability yet, though).
I tried Jaroslaw's suggestions but on my systems nscd is still hopelessly unreliable and only stays up for a few minutes. The workround of adding entries to the local /etc/passwd file works but is obviously very kludgey.
Unfortunately the nscd.conf workaround is partial. It reduces the frequency of nscd crashes, but doesn't eliminate them entirely. I've implemented a watchdog. YMMV.
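The watchdog script itself is not attached to this bug. A minimal sketch of what such a check might look like is given below, assuming `pidof` is available, that `rcnscd restart` is the restart command (as used elsewhere in this report), and that restarts are logged to syslog under the tag nscd_watchdog. All function names are hypothetical.

```python
import subprocess
import syslog

def nscd_running():
    """True if at least one nscd process exists (pidof exits 0 in that case)."""
    result = subprocess.run(["pidof", "nscd"], stdout=subprocess.DEVNULL)
    return result.returncode == 0

def restart_nscd():
    """Restart the daemon via the SUSE rc script."""
    subprocess.run(["rcnscd", "restart"], check=True)

def log_restart(message):
    """Write one line to syslog, tagged nscd_watchdog with the PID."""
    syslog.openlog("nscd_watchdog", syslog.LOG_PID, syslog.LOG_DAEMON)
    syslog.syslog(syslog.LOG_WARNING, message)

def check_once(is_running=nscd_running, restart=restart_nscd, log=log_restart):
    """Restart nscd if it is dead. Returns True if a restart was performed."""
    if is_running():
        return False
    restart()
    log("Restarted dead nscd")
    return True
```

Calling `check_once()` from a root cron job every minute or so would approximate the behaviour described. The three callables are parameters so the logic can be exercised without touching a real daemon. As noted above, this only treats the symptom.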
So - the real fix is the deepbind fix; Petr - re-assigning to you ;-)
deepbind will be in 11.1
What about an (optional?) update package for 11.0 which has still a lifetime of nearly 2 years?
I think adding this feature would be too high-impact for maintenance update. But I do plan a maintenance update for nscd stability issues.
Well, we've waited for a fix for over 2 years (bug originally filed against 10.1). I would rather argue that the current _problem_ is high-impact, and a fix for it should not be postponed any more, deepbind or not. It's not very "SUSE-way" to have a watchdog babysitting a dying nscd. Please consider releasing an update for 11.0 as soon as feasible. This bug is a _REAL_ PITA.
Reopening since present in 11.0
I too think the bug is high impact, after all Thunderbird is one of the most heavily used open source applications. But I think it would be reasonable for SuSE to close the bug IF they could provide a working nscd (since nscd is enabled by default).
Well - altering glibc in this way is -potentially- rather risky; as a compromise - I'd suggest that we let this shake out in 11.1 for a while before we consider back-porting to 11.0. If we break everyone's glibc there will be more problems than thunderbird not working ;-)
Certainly, fixing nscd in 11.0 is highest priority item for me right now and I have a fix almost ready. But backporting deepbind is too risky, I think, and we already had this problem for some time; so I think fixing nscd will be fine for 11.0.
I'd like to remind you that in my scenario (user authentication with nss_ldap and kerberos) nscd is not at all relevant for this bug. Thunderbird segfaults no matter if nscd is running or not. The only workaround that is feasible here is to disable ldap functionality in thunderbird, which someone has pointed out may not be feasible for others who rely on large ldap address books.
Forgot to mention that my information was filed in bug 349545
(In reply to comment #59 from Petr Baudis)
> Certainly, fixing nscd in 11.0 is highest priority item for me right now and I
> have a fix almost ready.
That's really good news.
> But backporting deepbind is too risky, I think, and we
> already had this problem for some time; so I think fixing nscd will be fine for
> 11.0.
Agreed. Nscd stability is after all what we need. The means by which this is achieved are less of an issue. Is it available in one of the /opensuse/repositories/* on buildservice? I'm happy to test it. Thanks!
I have to agree that this is a real issue since Thunderbird is not some obscure software. This bug really affects most OpenSUSE installations in a SOHO or anything bigger, so I'd think that SUSE/Novell would consider this bug rather important because those people might be potential SLES customers... That being said, and given that an nscd which does not crash is definitely a benefit for anybody, how about releasing a fixed glibc with deepbind in the /opensuse/repositories/* tree? That's why I said optional update in comment #52. That way anybody finding this bug could install the new glibc packages while not forcing them on anybody else by a general update package.
(In reply to comment #57 from Michael Meeks)
> Well - altering glibc in this way is -potentially- rather risky; as a
> compromise - I'd suggest that we let this shake out in 11.1 for a while before
> we consider back-porting to 11.0. If we break everyone's glibc there will be
> more problems than thunderbird not working ;-)
>
Your suggestion is not really acceptable. Reasons:
1) It's not only thunderbird that is affected. High profile applications are too, OpenOffice.org among others. Entire ldap based authentication is affected here.
2) Developers seem to live with the N+1 version, whereas actual users live in the N-1...N space. I do appreciate that N-1 and N bugs are so passée, but postponing the fix is just... cruel :)
3) Petr says in Comment #59 that he's treating this problem as high priority. I would rather let him work in peace on it and release the fix as soon as it's ready and not leave it for "next release".
But that's just my point of view. Regards,
Comment on attachment 226434 [details] stabilised nscd.conf Obsoleting the nscd.conf workaround as not really effective.
MozillaThunderbird i586 2.0.0.16-0.1 Update installed today.
/usr/bin/thunderbird: line 137: 12035 Segmentation fault $AOSS $MOZ_PROGRAM $@
Seems I should bump the version number of my selfmade thunderbird packages so that the upgrades from openSUSE don't reintroduce that bug every time.
I have managed to stabilise nscd/thunderbird by installing package "lsb". Has anyone else had luck with that?
Don't know. I've installed the package anyways. Remind me to check my logs for nscd_watchdog entries (a simple watchdog script that restarts a deceased nscd) and we'll see. For reference, here are the entries of this week (today is Oct 3rd):
Sep 29 09:44:53 work2 nscd_watchdog[8725]: Restarted dead nscd
Sep 30 10:12:25 work2 nscd_watchdog[694]: Restarted dead nscd
Sep 30 18:01:57 work2 nscd_watchdog[13305]: Restarted dead nscd
Oct 1 09:35:31 work2 nscd_watchdog[3686]: Restarted dead nscd
Oct 1 09:49:04 work2 nscd_watchdog[3943]: Restarted dead nscd
Oct 1 09:49:24 work2 nscd_watchdog[3981]: Restarted dead nscd
Oct 1 11:04:11 work2 nscd_watchdog[3872]: Restarted dead nscd
Oct 1 11:09:26 work2 nscd_watchdog[3935]: Restarted dead nscd
Oct 2 10:27:06 work2 nscd_watchdog[17754]: Restarted dead nscd
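To compare restart frequency before and after a change such as installing lsb, the entries can be tallied per day. The following is a small sketch assuming the standard syslog layout shown above (month, day, time, host, tag); the function name `restarts_per_day` is hypothetical, and the sample is exactly the nine lines quoted in this comment.

```python
from collections import Counter

def restarts_per_day(log_text, tag="nscd_watchdog"):
    """Count watchdog log lines per (month, day)."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        # fields: month, day, time, host, "tag[pid]:", message...
        if len(fields) >= 5 and fields[4].startswith(tag + "["):
            counts[(fields[0], fields[1])] += 1
    return counts

LOG = """\
Sep 29 09:44:53 work2 nscd_watchdog[8725]: Restarted dead nscd
Sep 30 10:12:25 work2 nscd_watchdog[694]: Restarted dead nscd
Sep 30 18:01:57 work2 nscd_watchdog[13305]: Restarted dead nscd
Oct 1 09:35:31 work2 nscd_watchdog[3686]: Restarted dead nscd
Oct 1 09:49:04 work2 nscd_watchdog[3943]: Restarted dead nscd
Oct 1 09:49:24 work2 nscd_watchdog[3981]: Restarted dead nscd
Oct 1 11:04:11 work2 nscd_watchdog[3872]: Restarted dead nscd
Oct 1 11:09:26 work2 nscd_watchdog[3935]: Restarted dead nscd
Oct 2 10:27:06 work2 nscd_watchdog[17754]: Restarted dead nscd
"""

for (month, day), count in restarts_per_day(LOG).items():
    print(month, day, count)
```

For this week the tally is one restart on Sep 29, two on Sep 30, five on Oct 1 and one on Oct 2.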
lsb doesn't make any difference to me...nscd still crashes in a few minutes.
Any news on when a patch will be available? Running OpenSUSE 11.0 x64 and having the same problem.
:~> thunderbird
/usr/bin/thunderbird: line 134: 5626 Segmentation fault $MOZ_PROGRAM $@
As a workaround, I have also implemented a watchdog but it would be fantastic to have this issue solved once and for all.
We're having the same problem with openSUSE 11.0, latest patches and LDAP for authentication and NFS automount tables. When nscd crashes also VMware Workstation 6.0.5 cannot be started. It hangs. See also bug#387202.
Created attachment 247137 [details] unscd package
As for Bug #387202 I've made source RPM of unscd. Attached. The nscd.conf file is not exactly the same as for distribution nscd package. There be dragons.
You may want to disable debugging in the source before rebuilding.
I've just committed a patch to OBS mozilla's MozillaThunderbird package (bug 439588) to fix the crash which led to that report. People using nss_ldap are encouraged to test that package and report back here, but please note that it doesn't fix the underlying symbol namespace clash; it fixes only (but at least) one crash resulting from it.
For 11.0, we will work-around the issue by improving nscd stability, so I'm repointing this to the bug tracking that. *** This bug has been marked as a duplicate of bug 387202 ***
This ignores that bug 349545 is completely independent of nscd, which I do not run at all, so I'm reopening 349545
This deepbind change breaks any binary on SuSE that is built using an external malloc. Examples include Google Chrome and Splunk. There are certainly others. The breakage occurs because libresolv, with this fix, will malloc with glibc's malloc and then free with the external malloc. Now that Thunderbird has a fix, I highly encourage the patch to be backed out of glibc in future versions of SuSE. TCMalloc is gaining traction, so there will be more programs in the future that crash only on SuSE. See http://code.google.com/p/google-perftools/issues/detail?id=228
We are aware of this problem; we were tracking it in bug 477061, unfortunately the bug is not public to all. :-( It was also related to https://features.opensuse.org/310176 - at any rate, in 11.4, deepbinding is turned off again since sssd is now used for LDAP queries and there are no library conflicts anymore. Therefore, custom malloc overrides should start working fine again.