Bugzilla – Bug 106147
init[1]: segfault at 000000000040839a rip 000000000040839a rsp 00007fffffb81b20 error 15
Last modified: 2005-09-12 14:13:01 UTC
In /var/log/boot.msg I find init problems for a dual Xeon 2.8GHz machine: <0>init[1]: segfault at 000000000040839a rip 000000000040839a rsp 00007fffffb81b20 error 15
Created attachment 46911 [details] /var/log/boot.msg with "segfault" in "init"
I need _MORE_ information, because init works on the x86_64 systems around here. Btw: does this also happen with Beta2? I ask because Beta 1 had a broken gcc4.
Apart from testing Beta2 (which I don't have at the moment), what information do you need?
The question is: does it work with Beta 2? It is known that the gcc4 in Beta 1 was broken.
Does the gcc bug mean the problem is in the kernel, or is it in the application (init)? For the latter, replacing init (from beta2) would not be a big deal.
I've no idea whether only init, or both init and the kernel, should be changed. Nevertheless, the x86_64 systems around here boot very well. Matz? Do you have an idea?
We only know that the gcc bug affected pppd (libpcap, to be precise). Apart from that we didn't see the bug anywhere else. But this doesn't mean it didn't happen anywhere else, just that we don't know. Hence we also don't know if init or the kernel were affected at all. Since it works on all machines I know of, it might not even be the compiler's fault. I would try the init from beta2 to rule this out. Having said that, we also had funny segfaults on double Opterons, which were a hardware problem solved by a BIOS update. Although the machine in this bug report is a double Xeon, it might have a similar problem, or a random memory corruption. A memcheck should be run, and all available BIOS updates should be applied.
(In reply to comment #7)
> (...)
> a hardware problem solved by a BIOS update. Although the machine in
> this bug report is a double Xeon, it might have a similar problem, or
> a random memory corruption. A memcheck should be run, and all available BIOS
> updates should be applied.

I ran the memtest86 for a while before installation. Also the machine has ECC RAM. Thus the memory corruption would have to happen outside of the RAM chips. Anyway I'll consider the suggested update options.
And without such a system I cannot fix a segfault within init.
The problem is still present in 10.0 beta3 (with different addresses).
Created attachment 47929 [details] /var/log/boot.msg (beta3) with segfault for init and hotplug
I cannot debug this, because I do not have a system like yours to debug it on.
If I read the boot.msg correctly, this could be the init process from the initial ramfs during boot. Hotplug also gets a segfault.
Yes, you don't have the hardware, but you have the programs. You could at least disassemble the code in question. Or alternatively, give me a hint on how to debug it or provide more information.
Hmmm ... I have the program /sbin/init from the sysvinit package, yes, but here it does not get any SIGSEGV, and simply guessing is not very helpful ... besides this, a second program, hotplug, also gets a SIGSEGV ... IMHO your system has a problem, e.g. a broken BIOS. I saw such a broken BIOS 6 months ago on my own x86_64 from Intel. After an update the SIGSEGVs were gone.
What speaks against a system problem IMHO is this: The problems only occurred with 64bit SUSE Linux 10, not with SuSE Linux 9.3. Also, these problems only occur when booting. I had run multiple extensive iozone benchmarks that did not cause any instability. Finally, the system is just a few weeks old. I'll look for updates anyway, but let's guess what the outcome will be... Currently I have BIOS A01, dated 01/18/2005. There is a revision A02, dated 08/28/2005 (yesterday).
Sorry, I do not have a crystal ball ... or should I read 10500 lines of code without knowing what to look for and without knowing what is wrong with your system? And it could be that the new gcc uses new features of x86_64 which were never used by the old gcc from 9.3. Please try out the new BIOS.
Looking again at the data we have:

<6>hotplug[215]: segfault at 00000000004000f0 rip 00000000004000f0 rsp 00007fffff84a760 error 15
<0>init[1]: segfault at 000000000040836a rip 000000000040836a rsp 00007fffffccf800 error 15

Note how the accessed address causing the segfault is the same as the one in the instruction pointer. This means that the segfault was not generated by a normal data access, but by the instruction fetch phase of the CPU. The address in init (0x40836a) is a perfectly fine address in .text on x86-64 in the init program. There can't be a segfault from that address under normal circumstances or due to software errors. For there to be a segfault, either the dynamic linker would have to fail to map certain memory from the file system (but then whole pages would be missing, and not much else would work either), or the hardware signals a segfault erratically for no reason. It really really does not look very much like a software problem.
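The check described above (fault address equal to rip, hence an instruction fetch) can be sketched in a few lines of Python. This is only an illustration, not a tool used in this report: the function names `parse_segfault` and `decode_error` are made up for the example, and it assumes the kernel prints the `error` field in hexadecimal, which the report itself does not state. Under that assumption, `error 15` is 0x15 and decodes via the x86 page-fault error-code bits to a user-mode instruction fetch that hit a protection violation, which agrees with the observation that the fault address equals rip.

```python
import re

# Page-fault error-code bits as defined by the x86 architecture.
PF_PROT = 1 << 0   # 1 = protection violation, 0 = page not present
PF_WRITE = 1 << 1  # 1 = write access, 0 = read access
PF_USER = 1 << 2   # 1 = fault raised while in user mode
PF_RSVD = 1 << 3   # 1 = reserved bit set in a page-table entry
PF_INSTR = 1 << 4  # 1 = fault raised by an instruction fetch

# Matches lines such as:
# <0>init[1]: segfault at 000000000040836a rip 000000000040836a rsp ... error 15
LINE_RE = re.compile(
    r"(?:<\d+>)?(?P<prog>[^\s\[<>]+)\[(?P<pid>\d+)\]: "
    r"segfault at (?P<addr>[0-9a-fA-F]+) "
    r"rip (?P<rip>[0-9a-fA-F]+) "
    r"rsp (?P<rsp>[0-9a-fA-F]+) "
    r"error (?P<err>[0-9a-fA-F]+)"
)


def decode_error(err):
    """Split a page-fault error code into named boolean flags."""
    return {
        "protection_violation": bool(err & PF_PROT),
        "write": bool(err & PF_WRITE),
        "user_mode": bool(err & PF_USER),
        "reserved_bit": bool(err & PF_RSVD),
        "instruction_fetch": bool(err & PF_INSTR),
    }


def parse_segfault(line):
    """Parse one kernel segfault line; return None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    rip = int(m.group("rip"), 16)
    # ASSUMPTION: the error field is hexadecimal, so "15" means 0x15.
    err = int(m.group("err"), 16)
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": addr,
        "rip": rip,
        "rsp": int(m.group("rsp"), 16),
        "error": err,
        "addr_equals_rip": addr == rip,
        "flags": decode_error(err),
    }


if __name__ == "__main__":
    sample = ("<0>init[1]: segfault at 000000000040836a "
              "rip 000000000040836a rsp 00007fffffccf800 error 15")
    print(parse_segfault(sample))
```

If the error field were decimal instead, the instruction-fetch bit would not be set, so the hexadecimal reading is the one consistent with the two log lines quoted in this comment.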
OK, the README for the newer BIOS mentions microcode updates (buggy CPU?) and improved resource allocation for PCI devices. I cannot give a complete answer right now, but after the BIOS update the graphics card got a different memory mapping. It might be that the segmentation faults have gone, but I cannot really say, because the onboard graphics card just died (and the machine refuses to boot). As said before, the machine is brand new, but it is from Dell 8-( Sorry.
Now I've also found a buggy system around here. Same symptoms, and also an Intel system. A newer system with the same CPU type does not show this behaviour.
The motherboard got replaced because of the defective onboard graphics chip, and the firmware upgrade has been installed again. I'm still seeing the segfault in init (rip and rsp slightly different).
WORKSFORME on the systems around here