Bugzilla – Bug 133596
LTC20983- IBM Java2-142 JVM Crashes (i686)
Last modified: 2006-08-01 19:29:41 UTC
The IBM JVM (IBMJava2-142-ia32-SDK-1.4.2-3.0.i386.rpm) crashes when starting Eclipse 3.1. Sometimes Eclipse starts and then crashes in the middle of working, but most of the time it crashes during startup. When launched from the console, the following error appears: JVMDG217: Dump Handler is Processing Signal 11 - Please Wait. I did not find any core dump from the JVM. The Sun 1.4.2 JVM works just fine. I am running SuSE 10 (OSS) with kernel 2.6.13-15-default (non-GPL). It works on SuSE 9.3, which is what I used before.
Please always choose the correct component and version. Daniel: Can this be supported in that fashion?
Kevin, do you by any chance know what could cause this? It seems like a JVM bug ... is there any chance of getting more information?
Hi Daniel, I really don't know anything about Java. About the only thing I can do is mirror this back to the LTC and try to forward it on to someone in the Java software group. Likewise for bug 133703 and bug 133706.
What occurs if you start java with the Just-In-Time compiler disabled? That is, using the command line option -Djava.compiler=NONE. Is this an EM64T system with an i386 version of SuSE installed? Also, do you see any difference if using the new Java 5 release ( http://www-128.ibm.com/developerworks/java/jdk/linux/download.html ) ?
Oh, also, does it generate a javacore.txt file by chance or a Linux core file if you ensure that ulimit -c unlimited ?
Nothing changes when I start java with the JIT compiler disabled. I haven't gotten a core or a javacore.txt with ulimit -c unlimited, but I have only let the crash run for a few minutes. I'll let it run longer and see what happens. This does not happen running "J2RE 1.5.0 IBM Linux build pxi32dev-20051104", which I have been using for some time now. The system I am running is a ThinkPad T41 with 2 GB RAM and SuSE OSS10.
I was able to get a javacore.txt produced, but I'm not sure it is valid. As the initial bug report states, the JVM spits out the following error "JVMDG217: Dump Handler is Processing Signal 11 - Please Wait.". It never creates a javacore.txt file though. After letting the crash run for 30 minutes, I finally gave up. Rather than killing the java process though, I decided to manually send it a SIGSEGV (kill -11) to see if it would core. That's when I got the javacore.txt file (no Linux core though). Not sure if it is valid for your tests, but I have attached it just in case.
Created attachment 63267 [details] javacore.txt file
Glen: Do you have any idea what might be causing this? Is it the new kernel? There seem to be some problems with kernels > 2.6.11 ... however, up to now they were mostly seen on 64-bit platforms.
---- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-01-26 10:08 EDT ------- I apologize for the delay but I have a couple of questions while I try to set something up here to recreate on. Do any of y'all know if this shows up on SLES 10 Beta (not OSS)? I may have a couple of x86 systems already with that installed that I may try running Eclipse and Java on. Second, I see that Java 5 works, but has the latest Java 1.4.2 service release (SR3) been downloaded and tried? The problem we saw with Java 1.4.2 before was an issue with the JIT compiler and the addition of the execute protection feature in PPC64 mainline kernels, which required some changes in Java that were similarly done for x86_64/ia32/ia64 kernels in SLES 9 SMP kernels. If this were related then disabling the JITC should have circumvented it, but apparently it doesn't.
changed: What |Removed |Added ---------------------------------------------------------------------------- Status|ASSIGNED |NEEDINFO ------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-02-20 21:43 EDT ------- Java 1.4.2 SR4 was released at the start of the month. Has that release been tried? http://www-128.ibm.com/developerworks/java/jdk/linux/download.html
Same issue with 1.4.2 SR4. I was successfully using a beta release of the 1.5.0 JVM, and I just upgraded to GA-1 - it appears to be working as well.
Sorry for the delay. That's all really strange, as it works perfectly here. Could you please provide me with some information: 1.) What output does java -version create? Does it segfault immediately as well? 2.) Could you provide the output of: rpm -qa | grep "\(java.*ibm\|eclipse\)" 3.) And the output of cat /proc/cpuinfo. Thanks, Daniel
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-03-31 14:56 EDT ------- I downloaded and tried running Eclipse version 3.1.2 (Build id: M20060118-1600) along with Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142ifx-20060209 (SR4-1) (JIT enabled: jitc)) on a T41p and also on an x366 (x86 32-bit installation), but it always seems to come up OK. There is somebody within IBM who is seeing something very similar and was working with our Java folks; I am going to try to debug it with them. They also alerted me that a problem like this has been reported to the Eclipse folks, though on a different distro, and I don't think it's distro specific. See https://bugs.eclipse.org/bugs/show_bug.cgi?id=111252
changed: What |Removed |Added ---------------------------------------------------------------------------- Status|NEEDINFO |OPEN ------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-04-05 10:56 EDT -------
I was able to recreate the "JVMDG217: Dump Handler is Processing Signal 11 - Please Wait." though not on OSS10, but I don't believe it's an OS issue. Java dies in one of its signal handlers while receiving and processing a SIGUSR2 signal. The signal handler expects two additional parameters besides the signal number (a pointer to a siginfo_t and a pointer to a ucontext_t); they don't appear to be valid, and it faults when accessing the context structure data. Supposedly, the sigaction() call that registered the handler had the SA_SIGINFO flag set in sa_flags to indicate that the handler expects three arguments instead of a single parameter. I added some printks to the kernel and ended up seeing the following:
    sys_rt_sigaction (java.bin:3277): sa=b7d97bd0 flags=14000004 <---- looks like the initial registration
    sys_rt_sigaction (java.bin:3277): oldsa=00000000 oldflags=0 <--- no prev so does look like first time
    setup_rt_frame (java.bin:3284): sp=b618e778 pc=b7d97bd0 uc=b618e808 info=b618e788
    setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
    setup_rt_frame (java.bin:3284): sp=b618e258 pc=b7d97bd0 uc=b618e2e8 info=b618e268
    setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
    setup_rt_frame (java.bin:3284): sp=b618e778 pc=b7d97bd0 uc=b618e808 info=b618e788
    setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
    setup_rt_frame (java.bin:3284): sp=b618e258 pc=b7d97bd0 uc=b618e2e8 info=b618e268
    setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
    setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
    setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
    sys_rt_sigaction (java.bin:3277): sa=b4dff488 flags=14000000 <-- the handler and flags get changed
    sys_rt_sigaction (java.bin:3277): oldsa=b7d97bd0 oldflags=14000004
    sys_rt_sigaction (java.bin:3277): sa=b7d97bd0 flags=14000000 <- handler restored, not SA_SIGINFO
    sys_rt_sigaction (java.bin:3277): oldsa=b4dff488 oldflags=14000000
    setup_frame (java.bin:3279): sp=b6b34cd0 pc=b7d97bd0 sc=b6b34cd8
I added tracing for the sigaction() system call to identify whether java itself was modifying the sa structure info. I see an initial call that appears to be the first registration, since the previous sa was all zeroes. Later, just before the incorrect path to set up the stack frame is taken, you see two sigaction invocations from java. The first changes the sa_flags and the sa_handler; the second restores the original sa_handler, but the flags are not restored properly. Missing is the SA_SIGINFO flag, which is defined in asm-i386/signal.h as
    #define SA_SIGINFO 0x00000004u
The sa_handler addresses 0xb7d97bd0 and 0xb4dff488 in the trace correspond to signalSigactionAdapter() and FAMNoExists() respectively. Hopefully, that will help the Java team isolate it further. I am continuing to work with the Java team to find out whether this is indeed an IBM Java 1.4.2 bug, which the latest data seems to indicate it might be (at the least, it is a userspace problem, not a kernel one).
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-04-05 11:00 EDT -------
Just to clarify my previous comment: a signal handler registered with the SA_SIGINFO flag set in sa_flags should go through the setup_rt_frame routine, while one with the old-style single parameter should go through the setup_frame routine. These kernel routines set up the stack frame for delivering the signal. This corresponds to the following logic in handle_signal() in arch/i386/kernel/signal.c:
    /* Set up the stack frame */
    if (ka->sa.sa_flags & SA_SIGINFO)
        setup_rt_frame(sig, ka, info, oldset, regs);
    else
        setup_frame(sig, ka, oldset, regs);
So, if the SA_SIGINFO flag is not set, the stack frame is set up for a single-parameter handler (which the java signal handler isn't).
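For readers following the trace, here is a small sketch that decodes the flags values logged above and mirrors the kernel's frame choice. It is only an illustration: the I386_* constants are the i386 Linux values of SA_SIGINFO, SA_RESTORER and SA_RESTART, hard-coded so the result does not depend on the host, and the helper names (unknown_sa_bits, uses_rt_frame, print_sa_flags, check_trace_values) are made up for this example and do not come from the kernel or the JVM.

```c
#include <assert.h>
#include <stdio.h>

/* i386 Linux sa_flags bits, hard-coded for illustration (assumption: these
 * match asm-i386/signal.h; SA_SIGINFO is quoted in the comment above). */
#define I386_SA_SIGINFO  0x00000004u
#define I386_SA_RESTORER 0x04000000u
#define I386_SA_RESTART  0x10000000u

/* Hypothetical helper: bits in `flags` that this sketch does not recognise. */
static unsigned int unknown_sa_bits(unsigned int flags)
{
    return flags & ~(I386_SA_SIGINFO | I386_SA_RESTORER | I386_SA_RESTART);
}

/* Mirrors the branch in handle_signal(): 1 means setup_rt_frame
 * (three-argument handler), 0 means setup_frame (single argument). */
static int uses_rt_frame(unsigned int flags)
{
    return (flags & I386_SA_SIGINFO) != 0;
}

/* Hypothetical helper: print a flags value with the names of its known bits. */
static void print_sa_flags(unsigned int flags)
{
    printf("%08x:%s%s%s -> %s\n", flags,
           (flags & I386_SA_RESTART)  ? " SA_RESTART"  : "",
           (flags & I386_SA_RESTORER) ? " SA_RESTORER" : "",
           (flags & I386_SA_SIGINFO)  ? " SA_SIGINFO"  : "",
           uses_rt_frame(flags) ? "setup_rt_frame" : "setup_frame");
}

/* Self-check against the two java.bin values seen in the trace. */
static void check_trace_values(void)
{
    /* Initial registration: SA_SIGINFO present, so the rt frame is used. */
    assert(unknown_sa_bits(0x14000004u) == 0);
    assert(uses_rt_frame(0x14000004u));
    /* After the "restore": same handler address, SA_SIGINFO gone. */
    assert(unknown_sa_bits(0x14000000u) == 0);
    assert(!uses_rt_frame(0x14000000u));
}
```

Calling print_sa_flags() on 0x14000004 and 0x14000000 shows that the only difference between the two registrations is the SA_SIGINFO bit, which is what switches the kernel from setup_rt_frame to setup_frame.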
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-04-06 17:32 EDT -------
The Java folks came back and indicated that FAMNoExists was not a signal handler in their code. If it isn't Java's, then where does it come from? So I did a quick Google search and found a package called Gamin, used for File Alteration Monitoring, that sounded suspicious because it has a mechanism using SIGUSR2 to enable "debug-on-the-fly", which tells me it must have its own signal handler for that purpose. I believe SuSE ships a gamin-0.1.5-5.3 OSS package. So, I took a look at its source and found the following interesting code:
    prev = signal(SIGUSR2, gam_error_signal);
    /* if there is already an handler switch back to the original
       to avoid disturbing the application behaviour */
    if ((prev != SIG_IGN) && (prev != SIG_DFL) && (prev != NULL))
        signal(SIGUSR2, prev);
    }
So, you see that the old semantics are used via the signal() function to establish a signal handler, but since one was already there it immediately tries to restore it. Unfortunately, by restoring the original handler this way, the flags are lost, because signal() returns and accepts only the single-argument sighandler_t, not the flags. And so, glibc converts the signal() call into what amounts to a sigaction() call without the SA_SIGINFO flag enabled. BTW, if you're wondering why the gam_error_signal name is different, it's because it and all the rest of the functions in the file are declared static, so there is no symbol for them. The closest public symbol to that routine appears to be FAMNoExists. I made some changes and it appears to be working now.
I was watching /proc/pid/maps while I was running through the recreation steps of creating a project, getting the open file dialog, and creating a new file. At the point that the open file dialog appears, the following appears in the maps:
    b4f2d000-b4f33000 r-xp 00000000 08:02 338766 /usr/lib/libfam.so.0.0.0
    b4f33000-b4f34000 rwxp 00005000 08:02 338766 /usr/lib/libfam.so.0.0.0
So, indeed, the gamin library gets loaded. Here are excerpts from the kernel traces:
    sys_rt_sigaction (gam_server:3317): oldsa=00000000 oldflags=0 <--- event 1
    sys_rt_sigaction (gam_server:3317): sa=08050ad9 flags=4000004 <--- event 2
    sys_rt_sigaction (gam_server:3317): oldsa=00000000 oldflags=0 <--- event 2 (cont.)
    . . .
    sys_rt_sigaction (java.bin:3285): oldsa=b7d97bd0 oldflags=14000004 <-- event 3
    setup_rt_frame (java.bin:3287): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
    setup_rt_frame (java.bin:3285): sp=bfffe550 pc=b7d97bd0 uc=bfffe5e0 info=bfffe560
event 1: shows the new logic, which is to first interrogate for an existing handler.
event 2: since there is none (or it's the default one), the gamin library registers its own.
event 3: interrogation shows that a handler already exists, so you don't see a new registration.
I recreated and found the problem on a non-SuSE distro, but I will try to give y'all a patch against the gamin you deliver (I checked all the way to the latest 0.1.7 source and it still has the problem) as soon as I can set up the environment.
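The flag loss described in this comment can be reproduced outside of Java and gamin with a minimal sketch like the one below. It assumes Linux with glibc-style signal() semantics; app_handler and lib_handler are hypothetical stand-ins for the JVM's three-argument handler and gamin's single-argument debug handler, and the other function names are invented for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for the application's three-argument handler. */
static void app_handler(int sig, siginfo_t *info, void *uctx)
{
    (void)sig; (void)info; (void)uctx;
}

/* Hypothetical stand-in for the library's single-argument handler. */
static void lib_handler(int sig)
{
    (void)sig;
}

/* Returns 1 if the current SIGUSR2 disposition has SA_SIGINFO set. */
static int sigusr2_has_siginfo(void)
{
    struct sigaction cur;
    int rc = sigaction(SIGUSR2, NULL, &cur);   /* query only, no change */
    assert(rc == 0);
    (void)rc;
    return (cur.sa_flags & SA_SIGINFO) != 0;
}

/* Register the application handler with SA_SIGINFO, as in the trace. */
static void install_app_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = app_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGUSR2, &sa, NULL);
    assert(rc == 0);
    (void)rc;
}

/* The pattern quoted from gamin: install via signal(), then hand the
 * previous handler back via signal(). Only the address round-trips. */
static void library_init_with_signal(void)
{
    void (*prev)(int) = signal(SIGUSR2, lib_handler);
    if (prev != SIG_IGN && prev != SIG_DFL && prev != NULL)
        signal(SIGUSR2, prev);   /* handler restored, SA_SIGINFO is not */
}
```

Calling install_app_handler(), then library_init_with_signal(), and checking sigusr2_has_siginfo() before and after shows the flag disappearing while the handler address stays the same, which matches the two sys_rt_sigaction calls in the earlier java.bin trace.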
Sorry everyone, I have been traveling quite a bit and have not been able to keep up with this thread. I am also internal to IBM (RSS, NRSC in Raleigh NC). Is there anything I can do to be of assistance? I show my current version of gamin as gamin-0.1.5-5.3.
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-04-21 17:13 EDT ------- I downloaded the source rpm for ftp://ftp.suse.com/pub/suse/i386/update/10.0/rpm/src/gamin-0.1.5-5.3.src.rpm and installed it on a SLES 9 box since I don't have an OSS 10.0 installation handy so that I could just produce the proper patch for gamin-0.1.5-5.3 source. I'll be attaching it shortly. Since Brad appears to be an IBMer, I am contacting him internally to help him in getting the testfix rpm built on his OSS 10.0 box to verify the patch.
Created attachment 79544 [details] gam_error_sigaction.patch
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-04-21 17:17 EDT ------- patch to use sigaction() rather than signal() and restore prev handler with flags. This patch modifies gamin-0.1.5/libgamin/gam_error.c and gamin-0.1.5/lib/gam_error.c to first query for an existing SIGUSR2 signal handler. If none exists, or it's the default one, then the library can register its own. If, after registering its own, it determines that one got registered within the window of time between the initial check and the registration, it restores the original handler using sigaction(), which allows the flags to be restored properly. Looking at open gamin bugs in GNOME bugzilla, I found the following one: http://bugzilla.gnome.org/show_bug.cgi?id=321601 Apparently, they found the same issue with the restoration of a SIGUSR2 signal handler at the end of last year, fixed it in a similar manner and tested it, though the bug appears to still be open.
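The approach described in this comment might look roughly like the sketch below. This is not the attached gam_error_sigaction.patch, only an illustration of the query, register, restore sequence under stated assumptions: lib_debug_handler, app_handler, has_user_handler and library_init_with_sigaction are hypothetical names, and treating SIG_IGN like SIG_DFL follows the condition in the original gamin code rather than anything stated about the patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Hypothetical stand-in for an application's three-argument handler. */
static void app_handler(int sig, siginfo_t *info, void *uctx)
{
    (void)sig; (void)info; (void)uctx;
}

/* Hypothetical stand-in for the library's own debug handler. */
static void lib_debug_handler(int sig)
{
    (void)sig;
}

/* 1 if `sa` describes a real handler rather than SIG_DFL / SIG_IGN. */
static int has_user_handler(const struct sigaction *sa)
{
    if (sa->sa_flags & SA_SIGINFO)
        return 1;
    return sa->sa_handler != SIG_DFL && sa->sa_handler != SIG_IGN;
}

/* Returns 1 if the library handler was installed, 0 if an existing
 * handler was left in place, -1 on error. */
static int library_init_with_sigaction(void)
{
    struct sigaction cur, sa, prev;

    /* Step 1: query the current disposition without changing it. */
    if (sigaction(SIGUSR2, NULL, &cur) != 0)
        return -1;
    if (has_user_handler(&cur))
        return 0;

    /* Step 2: register our handler and capture what was replaced. */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = lib_debug_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR2, &sa, &prev) != 0)
        return -1;

    /* Step 3: if a handler appeared between steps 1 and 2, put the whole
     * struct back, so flags such as SA_SIGINFO survive the restore. */
    if (has_user_handler(&prev)) {
        if (sigaction(SIGUSR2, &prev, NULL) != 0)
            return -1;
        return 0;
    }
    return 1;
}
```

The key difference from the signal()-based code is that struct sigaction carries sa_flags and sa_mask along with the handler address, so a restore hands back exactly what was there before.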
----- Additional Comments From brmurphy@us.ibm.com 2006-06-08 15:24 EDT ------- Sorry, took me a while to get back to working on this. This fixed the issue on my system I believe. I've had Eclipse running with IBM JVM 1.4.2 all day with no problems.
changed: What |Removed |Added ---------------------------------------------------------------------------- Status|FIXEDAWAITINGTEST |TESTED ------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-06-08 15:51 EDT ------- Brad indicated the test of the patch was successful so marking bug as TESTED and then moving to SUBMITTED state.
Hi. Is this already in SR4-1 Update? Eclipse works perfectly here, so it seems that this bug can probably be closed. Would like to have a confirmation from IBM that I can close this. Thanks a lot for the investigations, Regards, Daniel
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com) 2006-07-27 09:13 EDT ------- Daniel, The problem is not with Java or Eclipse but with the gamin package. That is where the fix was done. Looking at the gamin website, I see the changelog mention a similar fix done in April 2006 ( http://www.gnome.org/~veillard/gamin/ChangeLog.html ) : "- lib/gam_error.c: (gam_error_init): avoid changing the signal at all as it would break applications if they setup their signal handlers with sigaction, and used the SA_SIGINFO flag (which would change the number of arguments to the handler)" Though the last release of gamin available 0.1.7 obviously doesn't contain it since it's a release from Oct 2005. Therefore I wouldn't consider it fixed unless either Novell picks up a later release of gamin that contains the patch (which I don't think exists yet) or applies the patch to the existing gamin package they deliver.
Oops, sorry. I mixed this up with another bug and overlooked that this was the gamin issue. Apparently gamin is no longer included in SUSE Linux >= 10.1 or in any SLES release. This would actually mean we can close this bug as 'WONTFIX', as we do not provide non-security updates for SL 10.0.
changed: What |Removed |Added ---------------------------------------------------------------------------- Status|SUBMITTED |REOPENED Resolution|FIX_BY_IBM | ------- Additional Comments From chavez@us.ibm.com (prefers email at lnx1138@us.ibm.com) 2006-08-01 15:26 EDT ------- Returning bug as WILL_NOT_FIX per SuSE comments...