Bugzilla – Bug 153386
OOo-base segfaults
Last modified: 2006-03-31 13:54:38 UTC
OOo-base segfaults when it is newly started and "Table" is selected with the mouse. It happens on an x86-64 system. On another i386 system, it hangs when creating a new form and immediately closing the oowriter window (via the window manager).
I have reproduced it. It looks like another problem with gcj/gij. This one looks critical => updating severity. I'll try to debug it tomorrow.
Created attachment 70415 [details] Backtrace from the crash.
I will attach output from valgrind that shows many locations where gcj uses uninitialized values. I am not 100% sure that this is the reason for the crash, though I think it is quite probable, and I would be happy if anyone from the gcc team could look into these problems.
I wanted to extract a small example showing the crash from OOo, but that is not so easy. So I found another example at http://www.inonit.com/cygwin/jni/invocationApi/ It uses gcj/gij in a similar way to OOo. It does not show the crash, but it does show the problems with uninitialized variables. I will attach the example sources. You can use them the following way:
1. gcc -lgcj -o invoke invoke.c
2. ./invoke
You should get:
--- cut ---
pmladek@golem:/prace/tmp/gcj> ./invoke
Hello, World!
Arguments sent to this program:
From-C-program
--- cut ---
The example uses the precompiled InvocationHelloWorld.class. It can be compiled by Sun Java using the command: javac InvocationHelloWorld.java It cannot be compiled by gcj because of bug #154318.
Created attachment 70762 [details] Valgrind log from OOo
It was obtained with:
OpenOffice_org-2.0.2-4 (will be used on SL 10.1 beta6)
gcc-4.1.0_20060218-2 (from SL 10.1-beta5)
I got it the following way:
1. cd /usr/lib/ooo-2.0/program
2. valgrind --log-file=~/obase.valgring.log ./soffice.bin
3. open an .odb file
4. click on the button Tables
Then you should get a crash.
Created attachment 70765 [details] A sample .odb file
Created attachment 70770 [details] Sample code that shows the gcj/gij problems. See comment #3 and http://www.inonit.com/cygwin/jni/invocationApi/ for more details.
Created attachment 70773 [details] Valgrind log from the example
The valgrind stuff of the small example looks harmless. The OO valgrind log shows that this is related to improper use of threading by OO:
==7124== Thread 6:
==7124== Invalid read of size 4
==7124==    at 0x26E6D92D: GC_local_malloc_atomic (pthread_support.c:331)
==7124==    by 0x26839D61: _Jv_AllocString (java-gc.h:57)
==7124==    by 0x2687D923: _Jv_RunFinalizers() (boehm.cc:540)
==7124==    by 0x26867D5F: gnu::gcj::runtime::FinalizerThread::run() (natFinalizerThread.cc:60)
==7124==    by 0x2687773A: _Jv_ThreadRun(java::lang::Thread*) (natThread.cc:297)
==7124==    by 0x8D9E99D: clone (in /lib/libc-2.3.90.so)
==7124==  Address 0x44 is not stack'd, malloc'd or (recently) free'd
==7124==
==7124== Invalid read of size 4
==7124==    at 0x26E6E0E0: GC_local_gcj_malloc (pthread_support.c:371)
==7124==    by 0x2683A2F9: _Jv_AllocObjectNoFinalizer (java-gc.h:46)
==7124==    by 0x2683A365: _Jv_AllocObject (prims.cc:448)
==7124==    by 0x26839D61: _Jv_AllocString (java-gc.h:57)
==7124==    by 0x2687D923: _Jv_RunFinalizers() (boehm.cc:540)
==7124==    by 0x26867D5F: gnu::gcj::runtime::FinalizerThread::run() (natFinalizerThread.cc:60)
==7124==    by 0x2687773A: _Jv_ThreadRun(java::lang::Thread*) (natThread.cc:297)
==7124==    by 0x8D9E99D: clone (in /lib/libc-2.3.90.so)
==7124==  Address 0x234 is not stack'd, malloc'd or (recently) free'd
==7124==
==7124== Invalid read of size 4
==7124==    at 0x26E6E0E0: GC_local_gcj_malloc (pthread_support.c:371)
==7124==  Address 0x234 is not stack'd, malloc'd or (recently) free'd
==7124== Can't extend stack to 0x18218C60 during signal delivery for thread 6:
==7124==   too small or bad protection modes
==7124==
==7124== Process terminating with default action of signal 11 (SIGSEGV)
==7124==  Access not within mapped region at address 0x18218C60
==7124==    at 0x26E6E0E0: GC_local_gcj_malloc (pthread_support.c:371)
it's accessing a (now) invalid thread local data.
I.e. same problem as gcc.gnu.org/PR13212 again - and the workaround that was applied to cure it doesn't work for the OO case. Can you try if LD_PRELOADing libgcj.so to OO works around the problem?
See https://www.redhat.com/archives/fedora-devel-java-list/2006-January/msg00002.html for discussion.
Hmm, OOo freezes when I preload libgcj.so and OOo tries to use it. It is probably because it tries to load the module once more in oob680-m3/jvmfwk/plugins/sunmajor/pluginlib/sunjavaplugin.cxx (function jfw_plugin_startJavaVirtualMachine). I have to leave for some hours now. I'll look at it later today.
Hmm, either we wait for Tom Tromey to produce the final fix as indicated in Red Hat's Bugzilla, or we try to find out ourselves what Boehm's fix was.
Given our java competencies I'll defer this one to Tom. Boehm's fix was the wrapper in https://www.redhat.com/archives/fedora-devel-java-list/2006-January/msg00002.html though see the audit trail of PR13212 for why Jakub thinks this is wrong for anything but x86_64.
Re: comment #11: When I preloaded libgcj, it really froze in oob680-m3/jvmfwk/plugins/sunmajor/pluginlib/sunjavaplugin.cxx, at the line: err= pCreateJavaVM(&pJavaVM, ppEnv, &vm_args); I tried to hack OOo to use libgcj in a cleaner way. In the original version, it loads libgcj with the function osl_loadModule and finds the symbol JNI_CreateJavaVM with the function osl_getSymbol. I hacked it to be linked against libgcj and to call the function JNI_CreateJavaVM directly. Unfortunately, it did not help. It still froze. It could be related to the fear mentioned at http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13212#c20 that the preloaded library could potentially break all manner of things. Should I spend more time on attempts to run OOo with the preloaded libgcj?
I do not mind if we wait for Tom's fix. Is he really going to fix it?
I tried to use the wrapper mentioned in https://www.redhat.com/archives/fedora-devel-java-list/2006-January/msg00002.html I fixed it to use dlvsym (RTLD_NEXT, "pthread_create","GLIBC_2.1"), so that it should work on ix86. See http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13212#c23 It did not work because our libgcj does not provide the Boehm collector's pthread_create wrapper (symbol GC_pthread_create). Does Tom plan another fix? If I understand the last comment correctly (http://gcc.gnu.org/bugzilla/show_bug.cgi?id=13212#c25), Tom is only waiting for a list of the dlsym calls for the different architectures. So, we should probably check if the wrapper is usable with OOo. What do you think?
I'm confident that Tom or Jakub will do a fix for FC5. Maybe you can add some information on the OOo failures to the upstream bugreport. Can you check if the OOo binary is linked against libgcj and if not, try linking against it? That could also work instead of preloading it.
Strange. I tried to link soffice.bin against libgcj. It freezes the same way as it did when I preloaded libgcj.
How does it look with the gcc fix? It still does not work on beta8. Note that this fix is more important for SL 10.1 because gij is the default JRE there. It is less important for NLD10 because the Sun JRE is used by default there.
There is no fix yet. I don't know how to proceed here and how serious the problem is (I tried to reproduce it on my beta7 i386 box with no luck). Some PM needs to chime in.
It's serious and we have to investigate this.
Ok, so far these are the results of my investigation:
- I have identified one patch in FC5 gcc that hints at a fix somewhere else. mbuild g148-rguenther-2 currently builds gcc with that patch for x86_64 and i386
- our libgcj.so does provide
00e50e80 g DF .text 000001ea Base pthread_create
00e503e0 g DF .text 000000c7 Base GC_pthread_detach
00e504b0 g DF .text 000000a9 Base GC_pthread_join
00e50230 g DF .text 0000007e Base GC_pthread_sigmask
the question is whether it will override libpthread's
00006210 g DF .text 0000006f (GLIBC_2.0) pthread_create
00005820 g DF .text 000009ef GLIBC_2.1 pthread_create
if LD_PRELOADed. Note that libgcj itself refers to libpthread.so.
The gcc fix mentioned above should make GC_pthread_create visible. I'll check if this makes the proposed wrapper work.
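The symbol comparison in this comment can be repeated mechanically. What follows is only a minimal sketch, not part of the original report: it assumes the listings come from `objdump -T` (the column layout matches), and the helper name pthread_create_versions and the library paths in the usage comment are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper (not from the report): read `objdump -T` output on
# stdin and print the version tag of each pthread_create defined in .text.
# Assumed column layout: address, flags, type, section, size, version, name.
pthread_create_versions() {
    awk '$NF == "pthread_create" && $4 == ".text" { print $(NF - 1) }'
}

# Example use (library paths are assumptions; adjust to the system):
#   objdump -T /usr/lib/libgcj.so.7.0.0 | pthread_create_versions
#   objdump -T /lib/libpthread.so.0     | pthread_create_versions
```

A libgcj definition tagged only Base, next to the GLIBC_2.0 and GLIBC_2.1 versions in libpthread, is the situation this comment is asking about.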
Further investigation reveals that the proposed wrapper has found its way into gcc's boehm-gc already. So, now figuring out why it doesn't trigger.
This means LD_PRELOADing libgcj.so should work.
Actually, I'll try http://gcc.gnu.org/ml/java/2006-03/msg00190.html which just appeared.
This didn't seem to work. /work/built/mbuild/g148-rguenther-3 contains the gcc - though I have not rebuilt openoffice with that. Trying that now and checking results tomorrow.
Note that you can get a much faster build of OOo if you disable some optional features. See the beginning of OpenOffice_org.spec. It is enough to set:
%define test_build_langs 0
%define test_build_binfilters 0
%define test_build_openclipart 0
Richard, do you need some more help from our (OOo) side? Note: OOo still freezes on beta8 when it tries to CreateJavaVM and libgcj is LD_PRELOADed. I do not know why...
The failure looks the same now, regardless of LD_PRELOADing libgcj or not. Looking at strace output of -e open,clone I see that OO clones a lot of threads without having libgcj.so opened before pressing 'Tables' in the wizard. Only then the java machinery starts up and finally segfaults in the known way. So I really see no way of fixing this on the gcj side, but think that OO needs a workaround or fix for the issue. I don't know the exact failure mode, but can imagine the garbage collector being confused by memory that gets passed from OO to the JVM - more thorough investigation would be necessary to prove this. I cannot imagine this is supposed to work without telling the JVM the origin of the memory it gets through some JNI interface, and I don't know if gij provides a means of doing so. There could be several ways to address this in OO (though I don't know anything about OO internals):
- start the JVM early
- link against libgcj instead of libpthread
- ?
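The strace observation in this comment (threads are cloned before libgcj.so is ever opened) can be checked with a small filter. This is a rough sketch under assumptions, not something taken from the report: the helper name clones_before_libgcj and the log file name are hypothetical, and the patterns also accept openat and clone3 because newer systems use those calls.

```sh
#!/bin/sh
# Hypothetical helper (not from the report): read an strace log on stdin and
# print how many clone calls appear before the first successful open of
# libgcj.so. A non-zero result means threads already existed when the
# library, and with it the collector, was loaded.
clones_before_libgcj() {
    awk '
        /open(at)?\(.*libgcj\.so.*\) = [0-9]/ { exit }   # libgcj loaded: stop counting
        /clone3?\(/                           { n++ }    # thread created before that
        END                                   { print n + 0 }
    '
}

# Example use (the log file name is an assumption):
#   strace -f -e open,clone -o soffice.strace ./soffice.bin
#   clones_before_libgcj < soffice.strace
```

Failed lookups of libgcj.so along the library search path are ignored, since only an open that returns a file descriptor counts as loading the library.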
I have submitted the GCC with the various patches to BETA.
Ah ha ! - this is the old cookie of Boehm being *useless* wrt. allowing late registration of new threads [ so their stacks can be marked / scanned ] for GC roots correctly. We had precisely the same problem in mono - although we fixed it by tweaking Boehm there & calling some thread registration goodness. Is there a GCJ hook we can call to register a thread ? It would be ~fairly easy to keep track of all known threads inside OO.o and then register them at 1 shot when we load the Java plugin - what do you think ?
My patch to do something similar for Mono in our internal version was: http://lists.ximian.com/archives/public/mono-patches/2005-August/061826.html of course this is a gross stop-gap hack ;-) *Really* IMHO, there should be two fast methods: boehm_gc_enter() boehm_gc_leave() called at top/tail of each entry/exit point into managed / unmanaged code - so that we can do a far more precise job of managing the stacks, and of course - elide this whole "is a thread registered" mess from square 1 - if the __thread local variable in use to track the stack is not set - clearly the thread needs to be registered ;-) Currently Boehm seems to revel in checking way more stack than it needs, groping around to get the beginning of the stack (in dodgy & unorthodox ways) etc. etc. very sub-optimal. Anyhow - the nice/general fix is prolly far-off ;-(
There is a patch that adds _Jv_GCAttachThread/_Jv_GCDetachThread (GC_attach_thread/GC_detach_thread) that attach the _current_ thread to the GC. I don't know if that would help - GCC with this patch was checked into BETA this morning.
So, with a proper versioning of the libgcj pthread_create symbol LD_PRELOAD of libgcj.so now finally catches all pthread_create calls. Now, we get a SIGSEGV in a different place, namely
Program received signal SIGSEGV, Segmentation fault.
0xb78bb7f3 in GC_is_black_listed () from /usr/lib/libgcj.so.7.0.0
(gdb) bt
#0 0xb78bb7f3 in GC_is_black_listed () from /usr/lib/libgcj.so.7.0.0
#1 0xb78b979d in GC_allochblk_nth () from /usr/lib/libgcj.so.7.0.0
#2 0xb78b9b62 in GC_allochblk () from /usr/lib/libgcj.so.7.0.0
#3 0xb78c527d in GC_new_hblk () from /usr/lib/libgcj.so.7.0.0
#4 0xb78bb581 in GC_allocobj () from /usr/lib/libgcj.so.7.0.0
#5 0xb78c00a8 in GC_generic_malloc_inner () from /usr/lib/libgcj.so.7.0.0
#6 0xb78c0511 in GC_generic_malloc_inner_ignore_off_page () from /usr/lib/libgcj.so.7.0.0
#7 0xb78be3f7 in GC_grow_table () from /usr/lib/libgcj.so.7.0.0
#8 0xb78be61a in GC_register_finalizer_inner () from /usr/lib/libgcj.so.7.0.0
#9 0xb78be7c3 in GC_register_finalizer_no_order () from /usr/lib/libgcj.so.7.0.0
#10 0xb72d9b2d in _Jv_RegisterFinalizer () from /usr/lib/libgcj.so.7.0.0
#11 0xb72d2052 in _Jv_NewStringUtf8Const () from /usr/lib/libgcj.so.7.0.0
#12 0xb72a5deb in _Jv_Linker::ensure_class_linked () from /usr/lib/libgcj.so.7.0.0
#13 0xb72a5fbe in _Jv_Linker::wait_for_state () from /usr/lib/libgcj.so.7.0.0
#14 0xb72cc421 in java::lang::Class::initializeClass () from /usr/lib/libgcj.so.7.0.0
#15 0xb72cc3c3 in java::lang::Class::initializeClass () from /usr/lib/libgcj.so.7.0.0
#16 0xb7297417 in _Jv_CreateJavaVM () from /usr/lib/libgcj.so.7.0.0
#17 0xb729c58d in JNI_CreateJavaVM () from /usr/lib/libgcj.so.7.0.0
#18 0xacbf47fc in jfw_plugin_startJavaVirtualMachine () from /usr/lib/ooo-2.0/program/sunjavaplugin.so
#19 0xb57df25e in jfw_startVM () from /usr/lib/ooo-2.0/program/libjvmfwk.so.3
#20 0xade0b7aa in component_writeInfo () from /usr/lib/ooo-2.0/program/javavm.uno.so
#21 0xae8607ca in connectivity::getJavaVM () from /usr/lib/ooo-2.0/program/libdbtools680li.so
#22 0xae02350f in ?? () from /usr/lib/ooo-2.0/program/libjdbc2.so
#23 0xae01318f in ?? () from /usr/lib/ooo-2.0/program/libjdbc2.so
#24 0xae021b08 in ?? () from /usr/lib/ooo-2.0/program/libjdbc2.so
#25 0xacc28ddc in Java_com_sun_star_sdbcx_comp_hsqldb_NativeStorageAccess_seek () from /usr/lib/ooo-2.0/program/libhsqldb2.so
#26 0xb28a1e5a in ?? () from /usr/lib/ooo-2.0/program/libdbpool2.so
#27 0xae9c00c8 in ?? () from /usr/lib/ooo-2.0/program/libdba680li.so
#28 0xae9c02fc in ?? () from /usr/lib/ooo-2.0/program/libdba680li.so
#29 0xae9c05a1 in ?? () from /usr/lib/ooo-2.0/program/libdba680li.so
#30 0xae9c1508 in ?? () from /usr/lib/ooo-2.0/program/libdba680li.so
#31 0xae9c1650 in ?? () from /usr/lib/ooo-2.0/program/libdba680li.so
#32 0xae59c7b8 in ?? () from /usr/lib/ooo-2.0/program/libdbu680li.so
#33 0xae59c8a0 in ?? () from /usr/lib/ooo-2.0/program/libdbu680li.so
#34 0xae632839 in component_writeInfo () from /usr/lib/ooo-2.0/program/libdbu680li.so
#35 0xae697cd6 in component_writeInfo ()
#36 0xae68ffb3 in component_writeInfo () from /usr/lib/ooo-2.0/program/libdbu680li.so
#37 0xae6a2854 in component_writeInfo () from /usr/lib/ooo-2.0/program/libdbu680li.so
#38 0xae6a2964 in component_writeInfo () from /usr/lib/ooo-2.0/program/libdbu680li.so
#39 0xb6539b7a in TransferableClipboardListener::AddRemoveListener () from /usr/lib/ooo-2.0/program/libsvt680li.so
#40 0xb66caa93 in SvtIconChoiceCtrl::ClickIcon () from /usr/lib/ooo-2.0/program/libsvt680li.so
no Cigar yet.
Looks like we have a fix! Whoooo! (requires LD_PRELOAD of libgcj.so)
So, the fixed libgcj is now in STABLE and will appear in RC1. It is now required to fix OpenOffice to LD_PRELOAD libgcj.so - Petr, can you prepare an updated package that LD_PRELOADs libgcj.so in the various oo* binaries?
OK, I'm going to add it to the wrappers. Just to be sure: is any fix necessary in the OOo code, or is it enough to preload the library? Did you test whether it still works with other JREs, ...?
Please make sure this is handled correctly, so that we do not preload with other JREs!
Hmm, I have an installation of NLD10-beta8. I updated all the subpackages that were present and built from gcc sources (libgcc.rpm, cpp.rpm and libgcj.rpm) from STABLE (to 4.1.0-10). Then I restarted the machine to be sure that the whole system is consistent. OOo froze after I LD_PRELOADed libgcj, opened an .odb file and pressed the Tables button. What did I do wrong? Note that I used OOo built with the un-patched gcc. Is this the problem?
It works for me using
LD_PRELOAD=/usr/lib/libgcj.so.7.0.0 /usr/lib/ooo-2.0/program/soffice.bin
with the example .odb. Which architecture are you on? Rebuilding OO should not be necessary (though you can try if linking against libgcj makes the LD_PRELOAD unnecessary). Can you run
LD_PRELOAD=/usr/lib/libgcj.so.7.0.0 gdb /usr/lib/ooo-2.0/program/soffice.bin
and send the backtrace after the segfault?
Wooops, we screwed up. Don't bother yet.
Sigh, new gcc in preparation...
OK. Do you have somewhere a working build, so I can test the OOo side with it?
A working build should fall out of mbuild g148-rguenther-8 for i386 shortly. You can use that for testing.
Great, the libgcj from g148-rguenther-8 works well. I am going to update OOo. BTW: What about PPC? Should I preload libgcj there as well or are these problems ix86 specific?
I believe the problems are not architecture dependent, so preloading on every architecture will be required.
I have just submitted the package. libgcj is preloaded only when the gcj/gij JRE is selected. It is preloaded on all platforms (ix86, x86_64, ppc*). The user is asked to restart OOo when they select gcj, as was already done for the other JREs. The LD_PRELOAD is in the soffice wrapper, so it does not break the ooqstart speedup. Kendy gave me a hint to preload /usr/\$LIB/libgcj.so.7.0.0. The $LIB should make sure that even 64-bit applications work when they are executed from OOo. I did some tests on all platforms (i686, x86_64, ppc). However, I did not have the new libgcj for ppc, so I just checked that it is preloaded there. Great, it also fixed the other gcj-related bugs, like bug #150517.
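The wrapper logic described in this comment could look roughly like the sketch below. It is an illustration under assumptions, not the code from the submitted package: the report does not show how the soffice wrapper learns which JRE is selected, so the vendor argument, the function names and the SELECTED_JRE variable are all hypothetical.

```sh
#!/bin/sh
# Sketch of the wrapper logic described above. How the real soffice wrapper
# learns which JRE is selected is not shown in this report; the "vendor"
# argument is an assumption made for illustration.
gcj_preload_value() {
    case "$1" in
        *gcj*|*gij*)
            # The literal token $LIB is left unexpanded on purpose: the
            # dynamic linker replaces it with lib or lib64 for each process.
            printf '%s\n' '/usr/$LIB/libgcj.so.7.0.0'
            ;;
        *)
            # Any other JRE: preload nothing.
            :
            ;;
    esac
}

# Prepend the value to LD_PRELOAD only when there is something to preload.
apply_gcj_preload() {
    preload=$(gcj_preload_value "$1")
    if [ -n "$preload" ]; then
        LD_PRELOAD="$preload${LD_PRELOAD:+:$LD_PRELOAD}"
        export LD_PRELOAD
    fi
}

# Example use inside a wrapper (SELECTED_JRE is a hypothetical variable):
#   apply_gcj_preload "$SELECTED_JRE"
#   exec /usr/lib/ooo-2.0/program/soffice.bin "$@"
```

Keeping the decision in one function means that a non-gcj JRE leaves LD_PRELOAD untouched, which is the behaviour requested earlier in this thread.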
The question is whether we want to close this bug now. It would be great to use Red Hat's "cleaner" solution once it is ready. I would vote to remove this hack for NLD10 if possible.
Let's leave it open - and lower priority.
Frankly, I don't expect a cleaner solution - other than maybe linking against libgcj to avoid the LD_PRELOAD (if that works - it would be nice if you try that)
*** Bug 161710 has been marked as a duplicate of this bug. ***
*** Bug 150517 has been marked as a duplicate of this bug. ***
I'm testing another fix from upstream which is reported to also cure the OpenOffice problems. It will be in g148-rguenther-10 once it is ready.
Ok, it looks like that fix also works - and it no longer requires an LD_PRELOAD of libgcj.so. So Petr, can you verify it works for you with the libgcj from g148-rguenther-10 and prepare a new OO without the LD_PRELOAD again?
Great, it works without the need to LD_PRELOAD libgcj. I tried to reproduce 3 different crashes and it worked. I am going to remove the LD_PRELOAD.
I have removed it in ooo-build CVS. I will submit a package based on it next week. I think that we can close this bug now if the gcc fix really goes to SL10.1.
OK, I have just submitted the updated OOo package. => I am closing the bug as FIXED