Bug 133596

Summary: LTC20983- IBM Java2-142 JVM Crashes (i686)
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Brad Murphy <bradmurphy>
Component: X11 Applications Assignee: Daniel Bornkessel <dbornkessel>
Status: RESOLVED WONTFIX QA Contact: Adrian Schröter <adrian.schroeter>
Severity: Normal    
Priority: P5 - None CC: bugproxy, corryk
Version: Final   
Target Milestone: ---   
Hardware: i686   
OS: SuSE Linux 10.0   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: javacore.txt file
gam_error_sigaction.patch

Description Brad Murphy 2005-11-12 21:40:00 UTC
The IBM JVM (IBMJava2-142-ia32-SDK-1.4.2-3.0.i386.rpm) crashes when starting Eclipse 3.1.  Sometimes Eclipse starts and then crashes in the middle of working, but most of the time it crashes during startup.  When launched from the console, the following error appears:
   JVMDG217: Dump Handler is Processing Signal 11 - Please Wait.
I did not find any core dump from the JVM.  The Sun 1.4.2 JVM works just fine.  I am running SuSE 10 (OSS) with kernel 2.6.13-15-default (non-GPL).  It works on SuSE 9.3, which is what I used before.
Comment 1 Michael Gross 2005-11-14 15:41:03 UTC
Please always choose the correct component and version.
Daniel: Can this be supported in that fashion?
Comment 2 Daniel Bornkessel 2005-11-14 19:22:16 UTC
Kevin, do you by any chance know what could cause this?
It seems like a JVM bug ... is there any chance to get more information?
Comment 3 Kevin Corry 2005-11-14 19:39:04 UTC
Hi Daniel,
I really don't know anything about Java. About the only thing I can do is mirror this back to the LTC and try to forward it on to someone in the Java software group. Likewise for bug 133703 and bug 133706.

Comment 4 LTC BugProxy 2006-01-10 20:49:17 UTC
What occurs if you start java with the Just-In-Time compiler disabled, that is, using the command line option -Djava.compiler=NONE? Is this an EM64T system with an i386 version of SuSE installed? Also, do you see any difference when using the new Java 5 release ( http://www-128.ibm.com/developerworks/java/jdk/linux/download.html )?
Comment 5 LTC BugProxy 2006-01-10 20:54:30 UTC
Oh, also, does it generate a javacore.txt file by chance, or a Linux core file if you make sure that ulimit -c unlimited is set?
Comment 6 Brad Murphy 2006-01-13 13:22:01 UTC
Nothing changes when I start java with the JIT compiler disabled.  I haven't gotten a core or a javacore.txt with ulimit -c unlimited, but I have only let the crash run for a few minutes.  I'll let it run longer and see what happens.

This does not happen running "J2RE 1.5.0 IBM Linux build pxi32dev-20051104", which I have been using for some time now.

The system I am running is a ThinkPad T41 with 2 GB RAM and SuSE OSS10.
Comment 7 Brad Murphy 2006-01-13 13:55:35 UTC
I was able to get a javacore.txt produced, but I'm not sure it is valid.  As the initial bug report states, the JVM spits out the following error "JVMDG217: Dump Handler is Processing Signal 11 - Please Wait.".  It never creates a javacore.txt file though.  After letting the crash run for 30 minutes, I finally gave up.  Rather than killing the java process though, I decided to manually send it a SIGSEGV (kill -11) to see if it would core.  That's when I got the javacore.txt file (no Linux core though).  Not sure if it is valid for your tests, but I have attached it just in case.
Comment 8 Brad Murphy 2006-01-13 13:56:19 UTC
Created attachment 63267 [details]
javacore.txt file
Comment 9 Daniel Bornkessel 2006-01-25 15:17:03 UTC
Glen: Do you have any idea what might be causing this? Is it the new kernel? There seem to be some problems with kernels > 2.6.11; however, up to now they were mostly on 64-bit platforms.
Comment 10 LTC BugProxy 2006-01-26 15:11:00 UTC
---- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-01-26 10:08 EDT -------
I apologize for the delay, but I have a couple of questions while I try to set
something up here to recreate this on. Do any of y'all know if this shows up on
SLES 10 Beta (not OSS)? I may have a couple of x86 systems with that already
installed on which I could try running Eclipse and Java. Second, I see that
Java 5 works, but has the latest Java 1.4.2 service release (SR3) been
downloaded and tried?

The problem we saw with Java 1.4.2 before was an issue with the JIT compiler
and the addition of the execute protection feature in PPC64 mainline kernels,
which required some changes in Java similar to those done for x86_64/ia32/ia64
kernels in SLES 9 SMP kernels. If this were related, then disabling the JITC
should have circumvented it, but apparently it doesn't.
Comment 11 LTC BugProxy 2006-02-21 02:45:24 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEEDINFO




------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-02-20 21:43 EDT -------
Java 1.4.2 SR4 was released at the start of the month. Has that release been
tried? http://www-128.ibm.com/developerworks/java/jdk/linux/download.html 
Comment 12 Brad Murphy 2006-02-21 13:58:14 UTC
Same issue with 1.4.2 SR4.  I was successfully using a beta release of the 1.5.0 JVM, and I just upgraded to GA-1 - it appears to be working as well.
Comment 13 Daniel Bornkessel 2006-03-28 09:12:34 UTC
Sorry for the delay, that's all really strange as it works perfectly here.
Could you please provide me with some infos:
1.) what output does 
java -version 
create? Does it segfault immediately as well?
2.) could you provide the output of:
rpm -qa | grep "\(java.*ibm\|eclipse\)"
3.) and the output of
cat /proc/cpuinfo

Thanks,
Daniel
Comment 14 LTC BugProxy 2006-03-31 20:00:11 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-03-31 14:56 EDT -------
I downloaded and tried running Eclipse version 3.1.2 (Build id: M20060118-1600)
along with Classic VM (build 1.4.2, J2RE 1.4.2 IBM build cxia32142ifx-20060209
(SR4-1) (JIT enabled: jitc)) on a T41p and also on an x366 (x86 32-bit
installation), but it always seems to come up OK.

Somebody within IBM who was working with our Java folks is seeing something
very similar, and I am going to try to debug it with them. They also alerted
me that a problem like this has been reported to the Eclipse folks, though on a
different distro; I don't think it's distro-specific. See
https://bugs.eclipse.org/bugs/show_bug.cgi?id=111252 
Comment 15 LTC BugProxy 2006-04-05 15:01:09 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEEDINFO                    |OPEN




------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-04-05 10:56 EDT -------
I was able to recreate the "JVMDG217: Dump Handler is Processing Signal 11 -
Please Wait." message, though not on OSS10, but I don't believe it's an OS issue.

Java dies in one of its signal handlers while receiving and processing a SIGUSR2
signal. The handler expects two additional parameters besides the signal
number (a pointer to a siginfo_t and a pointer to a ucontext_t), but they don't
appear to be valid, and it faults when accessing the context structure data.
The sigaction() system call that registered the handler is supposed to have had
the SA_SIGINFO flag set in sa_flags to indicate that the handler expects three
arguments instead of a single parameter.
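
For readers unfamiliar with the three-argument handler form described above, the following is a minimal sketch of how such a handler is normally registered. It is not taken from the JVM; the names on_sigusr2 and deliver_with_siginfo are made up for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Written by the handler so the caller can see what was delivered. */
static volatile sig_atomic_t last_signo = 0;
static volatile sig_atomic_t got_context = 0;

/* Three-argument handler: only safe when registered with SA_SIGINFO,
   because otherwise the second and third arguments are not set up. */
static void on_sigusr2(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    last_signo = info->si_signo;       /* dereferences the siginfo_t */
    got_context = (ucontext != NULL);  /* ucontext_t pointer from the kernel */
}

/* Hypothetical helper: registers the handler with SA_SIGINFO, sends
   SIGUSR2 to the calling thread, and returns 1 if the handler saw a
   valid siginfo_t and a non-NULL context pointer. */
int deliver_with_siginfo(void)
{
    struct sigaction sa;
    int rc;

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigusr2;          /* three-argument member */
    sa.sa_flags = SA_SIGINFO | SA_RESTART; /* SA_SIGINFO selects that form */
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR2, &sa, NULL);
    assert(rc == 0);

    raise(SIGUSR2);  /* the handler has run by the time raise() returns */
    return last_signo == SIGUSR2 && got_context;
}
```

If the same function were installed without SA_SIGINFO, the info and ucontext arguments would not be valid, which matches the fault described in this comment.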

I added some printks to the kernel and ended up seeing the following:

sys_rt_sigaction (java.bin:3277): sa=b7d97bd0 flags=14000004 <---- looks like the initial registration
sys_rt_sigaction (java.bin:3277): oldsa=00000000 oldflags=0 <--- no prev so does look like first time
setup_rt_frame (java.bin:3284): sp=b618e778 pc=b7d97bd0 uc=b618e808 info=b618e788
setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
setup_rt_frame (java.bin:3284): sp=b618e258 pc=b7d97bd0 uc=b618e2e8 info=b618e268
setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
setup_rt_frame (java.bin:3284): sp=b618e778 pc=b7d97bd0 uc=b618e808 info=b618e788
setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
setup_rt_frame (java.bin:3284): sp=b618e258 pc=b7d97bd0 uc=b618e2e8 info=b618e268
setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
setup_rt_frame (java.bin:3279): sp=b6b34c30 pc=b7d97bd0 uc=b6b34cc0 info=b6b34c40
setup_rt_frame (java.bin:3279): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
sys_rt_sigaction (java.bin:3277): sa=b4dff488 flags=14000000  <-- the handler and flags get changed
sys_rt_sigaction (java.bin:3277): oldsa=b7d97bd0 oldflags=14000004
sys_rt_sigaction (java.bin:3277): sa=b7d97bd0 flags=14000000 <- handler restored, not SA_SIGINFO
sys_rt_sigaction (java.bin:3277): oldsa=b4dff488 oldflags=14000000
setup_frame (java.bin:3279): sp=b6b34cd0 pc=b7d97bd0 sc=b6b34cd8

I added tracing for the sigaction() system call to identify whether java itself
was modifying the sa structure info. I see an initial call that appears to be
the first registration, since the previous sa was all zeroes. Later, just
before the incorrect path to set up the stack frame is taken, you see two
sigaction invocations from java. The first changes the sa_flags and the
sa_handler, and the second restores the original sa_handler, but the flags are
not restored properly. Missing is the SA_SIGINFO flag, which is defined in
asm-i386/signal.h as #define SA_SIGINFO    0x00000004u

The sa_handler addresses 0xb7d97bd0 and 0xb4dff488 in the trace correspond to
signalSigactionAdapter() and FAMNoExists() respectively. Hopefully, that will
help the Java team isolate it further.
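
The flags values in the trace can be decoded by hand. The sketch below does this in code. Only the SA_SIGINFO value is quoted in this report; the SA_RESTART and SA_RESTORER values are the i386 kernel definitions as I recall them and should be treated as an assumption, and the I386_ prefixed names and decode_sa_flags() are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* i386 values. SA_SIGINFO is quoted in the report; the other two are
   assumed from the i386 kernel headers. */
#define I386_SA_SIGINFO  0x00000004u
#define I386_SA_RESTORER 0x04000000u
#define I386_SA_RESTART  0x10000000u

/* Append one flag name to buf, separated by '|'. */
static void append_flag(char *buf, size_t len, const char *name)
{
    size_t used = strlen(buf);
    snprintf(buf + used, len - used, "%s%s", used ? "|" : "", name);
}

/* Hypothetical helper: render the three flags of interest as text. */
const char *decode_sa_flags(unsigned int flags, char *buf, size_t len)
{
    assert(len > 0);
    buf[0] = '\0';
    if (flags & I386_SA_RESTART)
        append_flag(buf, len, "SA_RESTART");
    if (flags & I386_SA_RESTORER)
        append_flag(buf, len, "SA_RESTORER");
    if (flags & I386_SA_SIGINFO)
        append_flag(buf, len, "SA_SIGINFO");
    return buf;
}
```

Under those assumptions, 0x14000004 decodes to SA_RESTART|SA_RESTORER|SA_SIGINFO and 0x14000000 to SA_RESTART|SA_RESTORER, so the only difference between the original and the restored registration is the SA_SIGINFO bit.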

I am continuing to work with the Java team to find out if this is indeed an IBM
Java 1.4.2 bug, which the latest data seems to indicate it might be (at least a
userspace problem, not a kernel one). 
Comment 16 LTC BugProxy 2006-04-05 15:05:12 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-04-05 11:00 EDT -------
Just to clarify my previous comment: for a signal handler that has set the
SA_SIGINFO sa_flag, delivery should go through setup_rt_frame, while for one
with the old-style single parameter it should go through setup_frame. These
kernel routines handle setting up the stack frame for delivering the signal.
This corresponds to the following logic in arch/i386/kernel/signal.c
handle_signal():

        /* Set up the stack frame */
        if (ka->sa.sa_flags & SA_SIGINFO)
                setup_rt_frame(sig, ka, info, oldset, regs);
        else
                setup_frame(sig, ka, oldset, regs);

So, if the SA_SIGINFO flag is not set, then the stack frame is set up for the
single-parameter handler (which the java signal handler isn't). 
Comment 17 LTC BugProxy 2006-04-06 21:35:53 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-04-06 17:32 EDT -------
The Java folks came back and indicated that FAMNoExists was not a signal
handler in their code. If it isn't Java's, then where does it come from? So I
did a quick Google search and found a package called Gamin, used for File
Alteration Monitoring, that sounded suspicious because it has a mechanism using
SIGUSR2 to enable "debug-on-the-fly", which tells me it must have its own
signal handler for that purpose. I believe SuSE ships a gamin-0.1.5-5.3 OSS
package. So, I took a look at its source and found the following interesting
code:

        prev = signal(SIGUSR2, gam_error_signal);
        /* if there is already an handler switch back to the original
           to avoid disturbing the application behaviour */
        if ((prev != SIG_IGN) && (prev != SIG_DFL) && (prev != NULL))
            signal(SIGUSR2, prev);
    }

So, you see that the old semantics are used via the signal() function to
establish a signal handler, but since one was already there it immediately
tries to restore it. Unfortunately, restoring the original handler this way
loses the flags, because signal() only returns and accepts the single-argument
sighandler_t, not the flags. And so, glibc converts the signal() call into what
would be a sigaction() call without the SA_SIGINFO flag enabled.
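
The effect can be reproduced outside Java and gamin. The following is a small sketch, not code from either project: app_handler stands in for the Java handler, lib_handler for gam_error_signal, and both names plus siginfo_survives_signal_restore() are made up for the example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* Stand-in for the application's three-argument handler. */
static void app_handler(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo; (void)info; (void)ucontext;
}

/* Stand-in for the library's old-style one-argument handler. */
static void lib_handler(int signo)
{
    (void)signo;
}

/* Return the sa_flags currently registered for SIGUSR2. */
static int current_flags(void)
{
    struct sigaction cur;
    int rc = sigaction(SIGUSR2, NULL, &cur);
    assert(rc == 0);
    return cur.sa_flags;
}

/* Returns 1 if SA_SIGINFO is still set after a signal()-based
   swap-and-restore, 0 if it was lost. */
int siginfo_survives_signal_restore(void)
{
    struct sigaction sa;
    void (*prev)(int);
    int rc;

    /* The application registers its handler with SA_SIGINFO. */
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = app_handler;
    sa.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&sa.sa_mask);
    rc = sigaction(SIGUSR2, &sa, NULL);
    assert(rc == 0);
    assert(current_flags() & SA_SIGINFO);

    /* Same pattern as the gamin excerpt: install, then put back. */
    prev = signal(SIGUSR2, lib_handler);
    if (prev != SIG_IGN && prev != SIG_DFL && prev != NULL)
        signal(SIGUSR2, prev);

    /* The handler address is back, but the flags are not. */
    return (current_flags() & SA_SIGINFO) != 0;
}
```

On a glibc system I would expect siginfo_survives_signal_restore() to return 0: the handler address is restored, but the registration no longer carries SA_SIGINFO.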

BTW, if you're wondering why the gam_error_signal name is different, it's
because it and all the rest of the functions in the file are declared static,
so there is no symbol for them. The closest public symbol to that routine
appears to be FAMNoExists.

I made some changes and it appears to be working now. I was watching the
/proc/pid/maps while I was running through the recreation steps of creating a
project, getting the open file dialog, and creating a new file, and at the point
that the open file dialog appears, the following appears in the maps:

b4f2d000-b4f33000 r-xp 00000000 08:02 338766     /usr/lib/libfam.so.0.0.0
b4f33000-b4f34000 rwxp 00005000 08:02 338766     /usr/lib/libfam.so.0.0.0

So, indeed the gamin library gets loaded. Here are excerpts from the kernel traces:

sys_rt_sigaction (gam_server:3317): oldsa=00000000 oldflags=0 <--- event 1
sys_rt_sigaction (gam_server:3317): sa=08050ad9 flags=4000004 <--- event 2
sys_rt_sigaction (gam_server:3317): oldsa=00000000 oldflags=0   <--- event 2 (cont.)
.
.
.
sys_rt_sigaction (java.bin:3285): oldsa=b7d97bd0 oldflags=14000004 <-- event 3
setup_rt_frame (java.bin:3287): sp=b6b34710 pc=b7d97bd0 uc=b6b347a0 info=b6b34720
setup_rt_frame (java.bin:3285): sp=bfffe550 pc=b7d97bd0 uc=bfffe5e0 info=bfffe560

event 1: shows the new logic, which is to first interrogate for an existing handler.
event 2: since there is none (or it's the default one), the gamin library registers its own.
event 3: interrogation shows that a handler already exists, so you don't see a new registration.

I recreated and found the problem on a non-SuSE distro, but I will try to give
y'all a patch against the gamin you deliver (I checked all the way to the latest
0.1.7 source and it still has the problem) as soon as I can set up the environment. 
Comment 18 Brad Murphy 2006-04-11 13:15:40 UTC
Sorry everyone, I have been traveling quite a bit and have not been able to keep up with this thread.  I am also internal to IBM (RSS, NRSC in Raleigh NC).  Is there anything I can do to be of assistance?  I show my current version of gamin as gamin-0.1.5-5.3.
Comment 19 LTC BugProxy 2006-04-21 21:36:41 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-04-21 17:13 EDT -------
I downloaded the source rpm for
ftp://ftp.suse.com/pub/suse/i386/update/10.0/rpm/src/gamin-0.1.5-5.3.src.rpm

and installed it on a SLES 9 box since I don't have an OSS 10.0 installation
handy so that I could just produce the proper patch for gamin-0.1.5-5.3 source.
I'll be attaching it shortly.

Since Brad appears to be an IBMer, I am contacting him internally to help him in
getting the testfix rpm built on his OSS 10.0 box to verify the patch. 
Comment 20 LTC BugProxy 2006-04-21 21:38:13 UTC
Created attachment 79544 [details]
gam_error_sigaction.patch
Comment 21 LTC BugProxy 2006-04-21 21:38:16 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-04-21 17:17 EDT -------
 
patch to use sigaction() rather than signal() and restore prev handler with
flags

This patch modifies gamin-0.1.5/libgamin/gam_error.c and
gamin-0.1.5/lib/gam_error.c to first query for an existing SIGUSR2 signal
handler. If none exists or it's the default one, then the library can register
its own. If, after registering its own, it determines that one got registered
within the window of time between the initial check and the registration, it
restores the original handler using sigaction(), which allows the flags to be
restored properly.
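
The approach described above can be sketched roughly as follows. This is not the attached patch, only an illustration of the query-first logic. All function names are made up, and treating SIG_IGN the same as SIG_DFL is an assumption carried over from the original gamin check.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <string.h>

/* True if the disposition is a real handler, not SIG_DFL or SIG_IGN. */
static int has_handler(const struct sigaction *sa)
{
    if (sa->sa_flags & SA_SIGINFO)
        return 1;
    return sa->sa_handler != SIG_DFL && sa->sa_handler != SIG_IGN;
}

/* Returns 1 if our handler was installed, 0 if an existing handler was
   left in place, -1 on error. */
int install_if_unset(int signo, void (*handler)(int))
{
    struct sigaction old, sa;

    /* Query only: a NULL new action changes nothing. */
    if (sigaction(signo, NULL, &old) != 0)
        return -1;
    if (has_handler(&old))
        return 0;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(signo, &sa, &old) != 0)
        return -1;

    /* Someone registered between the check and our call: put the full
       struct back, so the handler, mask and flags are all restored. */
    if (has_handler(&old)) {
        sigaction(signo, &old, NULL);
        return 0;
    }
    return 1;
}

static void app_handler(int s, siginfo_t *i, void *u) { (void)s; (void)i; (void)u; }
static void debug_handler(int s) { (void)s; }

/* Demo: an application handler registered with SA_SIGINFO is untouched. */
int existing_handler_survives(void)
{
    struct sigaction app, cur;
    int rc;

    memset(&app, 0, sizeof app);
    app.sa_sigaction = app_handler;
    app.sa_flags = SA_SIGINFO | SA_RESTART;
    sigemptyset(&app.sa_mask);
    rc = sigaction(SIGUSR2, &app, NULL);
    assert(rc == 0);

    if (install_if_unset(SIGUSR2, debug_handler) != 0)
        return 0;
    rc = sigaction(SIGUSR2, NULL, &cur);
    assert(rc == 0);
    return (cur.sa_flags & SA_SIGINFO) && cur.sa_sigaction == app_handler;
}

/* Demo: with the default disposition, the debug handler is installed. */
int installs_when_unset(void)
{
    signal(SIGUSR1, SIG_DFL);
    return install_if_unset(SIGUSR1, debug_handler);
}
```

The key design point is that the previous registration is only ever handled as a whole struct sigaction, so the flags travel with the handler address.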

Looking at open gamin bugs in GNOME bugzilla, I found the following one:

http://bugzilla.gnome.org/show_bug.cgi?id=321601

Apparently, they found the same issue at the end of last year with the
restoration of a SIGUSR2 signal handler, fixed it in a similar manner, and
tested it, though the bug appears to still be open. 
Comment 22 LTC BugProxy 2006-06-08 19:29:57 UTC
----- Additional Comments From brmurphy@us.ibm.com  2006-06-08 15:24 EDT -------
Sorry, took me a while to get back to working on this.  This fixed the issue on
my system I believe.  I've had Eclipse running with IBM JVM 1.4.2 all day with
no problems. 
Comment 23 LTC BugProxy 2006-06-08 19:55:14 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|FIXEDAWAITINGTEST           |TESTED




------- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-06-08 15:51 EDT -------
Brad indicated the test of the patch was successful so marking bug as TESTED and
then moving to SUBMITTED state. 
Comment 24 Daniel Bornkessel 2006-07-27 10:26:59 UTC
Hi.
Is this already in SR4-1 Update? Eclipse works perfectly here, so it seems that this bug can probably be closed.
Would like to have a confirmation from IBM that I can close this.
Thanks a lot for the investigations,
Regards,
Daniel
Comment 25 LTC BugProxy 2006-07-27 13:15:08 UTC
----- Additional Comments From chavez@us.ibm.com(prefers email via lnx1138@us.ibm.com)  2006-07-27 09:13 EDT -------
Daniel,

The problem is not with Java or Eclipse but with the gamin package. That is
where the fix was done.

Looking at the gamin website, I see the changelog mention a similar fix done in
April 2006 ( http://www.gnome.org/~veillard/gamin/ChangeLog.html ) :

"- lib/gam_error.c: (gam_error_init): avoid changing the signal at all as it
would break applications if they setup their signal handlers with sigaction, and
used the SA_SIGINFO flag (which would change the number of arguments to the
handler)"

However, the latest available gamin release, 0.1.7, obviously doesn't contain
it, since it's a release from Oct 2005.

Therefore I wouldn't consider it fixed unless either Novell picks up a later
release of gamin that contains the patch (which I don't think exists yet) or
applies the patch to the existing gamin package they deliver. 
Comment 26 Daniel Bornkessel 2006-07-27 15:02:41 UTC
Oops, sorry. I mixed it up with another bug and overlooked that this was the gamin thing.
Apparently gamin is no longer included in either SUSE Linux >= 10.1 or any SLES. This would actually mean we can close this bug as 'WONTFIX', as we do not provide non-security updates for SL 10.0.
Comment 27 LTC BugProxy 2006-08-01 19:29:41 UTC
changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|SUBMITTED                   |REOPENED
         Resolution|FIX_BY_IBM                  |




------- Additional Comments From chavez@us.ibm.com (prefers email at lnx1138@us.ibm.com)  2006-08-01 15:26 EDT -------
Returning bug as WILL_NOT_FIX per SuSE comments...