Bug 106147

Summary: init[1]: segfault at 000000000040839a rip 000000000040839a rsp 00007fffffb81b20 error 15
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Ulrich Windl <Ulrich.Windl>
Component: Basesystem Assignee: Dr. Werner Fink <werner>
Status: RESOLVED WORKSFORME QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: asklein, hare, matz, vetter
Version: Beta 1   
Target Milestone: ---   
Hardware: x86-64   
OS: SUSE Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: /var/log/boot.msg with "segfault" in "init"
/var/log/boot.msg (beta3) with segfault for init and hotplug

Description Ulrich Windl 2005-08-22 13:12:08 UTC
In /var/log/boot.msg I find init problems for a dual Xeon 2.8GHz machine:
<0>init[1]: segfault at 000000000040839a rip 000000000040839a rsp
00007fffffb81b20 error 15
Comment 1 Ulrich Windl 2005-08-22 13:13:13 UTC
Created attachment 46911 [details]
/var/log/boot.msg with "segfault" in "init"
Comment 2 Dr. Werner Fink 2005-08-22 14:09:46 UTC
I need _MORE_ information, because init works
on the x86_64 machines around here.

Btw: does this happen also with Beta2 ... this because on
Beta 1 there was a broken gcc4
Comment 3 Ulrich Windl 2005-08-23 06:58:14 UTC
Apart from testing Beta2 (which I don't have at the moment), what information do
you need?
Comment 4 Dr. Werner Fink 2005-08-23 09:34:44 UTC
The question is: does it work with Beta 2?  I ask because it
is known that the gcc4 on Beta 1 was broken.
Comment 5 Ulrich Windl 2005-08-23 10:05:33 UTC
Does the gcc bug mean the problem is in the kernel, or is it in the application
(init)? For the latter, replacing init (from beta2) would not be a big deal.
Comment 6 Dr. Werner Fink 2005-08-23 10:15:37 UTC
I've no idea if only init or both init and the kernel should
be changed.  Nevertheless, the x86_64 systems around here boot
very well.

Matz?  Do you have an idea?
Comment 7 Michael Matz 2005-08-23 14:52:46 UTC
We know only that the gcc bug affected pppd (libpcap to be precise). 
Apart from that we didn't see the bug anywhere else.  But this doesn't mean 
it didn't happen anywhere else, just that we don't know.  Hence we also 
don't know if init or the kernel were affected at all.  As it is working 
on all machines I know it might not even be the compiler's fault.  I would 
try to use the init from beta2 to rule this out. 
 
Having said that we also had funny segfaults on double Opterons, which were 
a hardware problem solved by a BIOS update.  Although the machine in 
this bug report is a double Xeon, it might have a similar problem, or 
a random memory corruption.  A memcheck should be run, and all available BIOS 
updates should be applied. 
Comment 8 Ulrich Windl 2005-08-24 05:58:45 UTC
(In reply to comment #7)
(...)
> a hardware problem solved by a BIOS update.  Although the machine in 
> this bug report is a double Xeon, it might have a similar problem, or 
> a random memory corruption.  A memcheck should be run, and all available BIOS 
> updates should be applied. 

I ran the memtest86 for a while before installation. Also the machine has ECC
RAM. Thus the memory corruption would have to happen outside of the RAM chips.
Anyway I'll consider the suggested update options.
Comment 9 Dr. Werner Fink 2005-08-24 09:03:54 UTC
And without such a system I can not fix a segfault within init.
Comment 10 Ulrich Windl 2005-08-29 08:30:07 UTC
The problem is still present in 10.0 beta3 (with different addresses).
Comment 11 Ulrich Windl 2005-08-29 08:32:07 UTC
Created attachment 47929 [details]
/var/log/boot.msg (beta3) with segfault for init and hotplug
Comment 12 Dr. Werner Fink 2005-08-29 10:22:42 UTC
I can not debug this due to the fact that I do not
have your system to do so.
Comment 13 Dr. Werner Fink 2005-08-29 10:43:35 UTC
If I read the boot.msg correctly, this could be the init process
from the initial ramfs during boot.  Hotplug also caught a segfault.
Comment 14 Ulrich Windl 2005-08-29 11:35:14 UTC
Yes, you don't have the hardware, but you have the programs. You could at least
disassemble the code in question. Or alternatively, give me a hint on how to
debug it or provide more information.
Comment 15 Dr. Werner Fink 2005-08-29 11:43:16 UTC
Hmmm ... I've the program /sbin/init from the sysvinit package, yes,
but this does not catch any SIGSEGV and simply guessing is not
very helpful ... besides this a second program, hotplug, catches
a SIGSEGV ... IMHO your system has a problem, e.g. broken BIOS.

I saw such a broken BIOS 6 months ago on my own x86_64 from
Intel.  After an update the SIGSEGVs were gone.
Comment 16 Ulrich Windl 2005-08-29 12:06:51 UTC
What speaks against a system problem IMHO is this: The problems only occurred
with 64bit SUSE Linux 10, not with SuSE Linux 9.3. Also these problems only
occur when booting. I had run multiple extensive iozone benchmarks that did not
cause any instability. Finally the system is just a few weeks old. I'll look for
updates anyway, but let's guess what will be the outcome...
Currently I have BIOS A01 dated 01/18/2005
There is a revision   A02 date 08/28/2005 (yesterday)
Comment 17 Dr. Werner Fink 2005-08-29 12:36:20 UTC
Sorry I do not have a ``Glaskugel'' (crystal ball) ... or should I read
10500 lines of code without knowing what I am looking for and
not knowing what is wrong with your system.  And it could
be that the new gcc uses new features of x86_64 which are
never used by the old gcc from 9.3.

Please try out the new BIOS.
Comment 18 Michael Matz 2005-08-29 13:29:45 UTC
From looking at the data we have again:  
<6>hotplug[215]: segfault at 00000000004000f0 rip 00000000004000f0 rsp  
    00007fffff84a760 error 15  
<0>init[1]: segfault at 000000000040836a rip 000000000040836a rsp  
    00007fffffccf800 error 15  
  
Note how the accessed address causing the segfault is the same as the  
one in the instruction pointer.  This means that the segfault was not  
generated by a normal data access, but by the instruction fetch phase  
of the CPU.  The address in init (0x40836a) is a perfectly fine address 
in .text on x86-64 in the init program.  There can't be a segfault 
from that address under normal circumstances or due to software errors. 
For there to be a segfault either the dynamic linker would have to not 
map certain memory from the file system (but then whole pages would 
be missing, and not much else would work either), or the hardware signals 
a segfault erratically for no reason. 
 
It really really does not look very much like a software problem.  
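
The observation above can be checked mechanically. The following is a minimal Python sketch, not something used in this report: the function name decode_segfault and the regular expression are hypothetical, the bit meanings follow the x86 page-fault error code layout, and it assumes the kernel prints the "error" field in hexadecimal (if the field were decimal, the decoded flags would differ).

```python
import re

# x86 page-fault error-code bits (assumed to be what the kernel prints
# in the "error" field of these messages).
PF_BITS = [
    (0x01, "protection violation (page present)"),
    (0x02, "write access"),
    (0x04, "user mode"),
    (0x08, "reserved bit set in a paging entry"),
    (0x10, "instruction fetch"),
]

# Hypothetical pattern for the "segfault at ... rip ... rsp ... error ..."
# lines quoted from boot.msg; \s+ tolerates the wrapped lines.
SEGFAULT_RE = re.compile(
    r"(?P<comm>[\w.-]+)\[(?P<pid>\d+)\]:\s+segfault\s+at\s+(?P<addr>[0-9a-f]+)"
    r"\s+rip\s+(?P<rip>[0-9a-f]+)\s+rsp\s+(?P<rsp>[0-9a-f]+)"
    r"\s+error\s+(?P<err>[0-9a-f]+)"
)


def decode_segfault(line):
    """Parse one segfault message; return a dict, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    rip = int(m.group("rip"), 16)
    err = int(m.group("err"), 16)  # assumption: field is hexadecimal
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "rip": rip,
        # True when the faulting address is the instruction pointer itself
        "fault_at_rip": addr == rip,
        "error": err,
        "flags": [name for bit, name in PF_BITS if err & bit],
    }


if __name__ == "__main__":
    for msg in (
        "<6>hotplug[215]: segfault at 00000000004000f0 rip 00000000004000f0 "
        "rsp 00007fffff84a760 error 15",
        "<0>init[1]: segfault at 000000000040836a rip 000000000040836a "
        "rsp 00007fffffccf800 error 15",
    ):
        print(decode_segfault(msg))
```

Under the hexadecimal assumption, both messages decode to a user-mode instruction fetch from a page that is present, with the fault address equal to rip, which is consistent with the reading that the fault arose in the instruction fetch rather than in a data access.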
Comment 19 Ulrich Windl 2005-08-29 14:13:35 UTC
OK, the README for the newer BIOS mentions microcode updates (buggy CPU?) and
improved resource allocation for PCI devices. I cannot give a complete answer
right now, but after the BIOS update the graphics card got a different memory
mapping. It might be that the segmentation faults have gone, but I cannot really
answer, because the onboard graphics card just died (and the machine refuses to
boot). As said before, the machine is brand new, but it is from Dell 8-( Sorry.
Comment 20 Dr. Werner Fink 2005-08-30 09:37:58 UTC
Now I've also found a buggy system around here.  Same symptoms,
and also an Intel system.  A newer system with the same CPU type
does not show this behaviour.
Comment 21 Ulrich Windl 2005-08-31 13:37:50 UTC
The motherboard got replaced because of the defective onboard graphics chip, and
the firmware upgrade has been installed again. I'm still seeing the segfault in
init (rip and rsp slightly different).
Comment 22 Dr. Werner Fink 2005-09-12 14:13:01 UTC
WORKSFORME on the systems around here