Bugzilla – Bug 154709
ibm intellistation random freezes
Last modified: 2006-03-08 15:34:57 UTC
I am now using an IBM IntelliStation M Pro for testing. As a reference, I tested 10.0 on it and did not have any problems. 10.1 beta6 freezes randomly on the same hardware, both during installation and in normal use.
I reinstalled 10.0, and it ran rock solid all night long compiling the kernel in a while true loop. On 10.0 I need 'pci=noacpi' to get the onboard Adaptec SCSI controller running. With 10.1 beta6 the controller works without this parameter, and the random freezes occur whether or not I use it.
Please attach more information about the used hardware (`hwinfo'), 500 lines of your syslog and /var/log/boot.msg.
I only have 10.0 installed now. Is it worth sending hwinfo from it?
Does it work with nolapic? If yes, please attach acpidmp, hwinfo and dmidecode output.
With nolapic it does not work at all, as the driver for the onboard Adaptec controller does not load correctly.
Created attachment 71269 [details] hwinfo
Created attachment 71270 [details] boot.msg
Created attachment 71271 [details] last 500 lines of messages
I installed 10.1 again. This time I experimented with the acpi=off setting. After the reboot, when YaST starts again, the machine freezes after about 30 seconds. After 3-4 such reproducible freezes I used Ctrl+Alt+Backspace to kill YaST, and was able to collect the necessary information. 10.0 runs rock solid on the same machine...
Please be a bit more specific about these crashes. Are you talking about YaST only or does the whole system freeze? Is there an oops, does the numlock key work after the crash, can you switch to a console?
With YaST running, the crash comes earlier, probably because there is more activity. It's a real freeze: no messages on screen, the Num Lock/Caps Lock keys do not respond, I can't switch consoles, I cannot ping the machine, and the cursor no longer blinks. The machine is simply dead, without any notice or anything in the logs. With 10.0 even Xen is stable, with the host and two Xen guests compiling the kernel all day long without any trouble. So it's something in the 10.1 kernel...
I would agree but your attachment is much less than the requested 500 lines. Please look into it and check if there is really no indication about what's going wrong.
That was the whole /var/log/messages, although the file was actually created by:
tail -500 /var/log/messages > messages500
Now it's about 300 lines, so I am sending it again.
Created attachment 71384 [details] /var/log/messages
At first glance I cannot find anything critical in this logfile. You might debug this further by removing any kernel modules that are not strictly required for the system to run. Try various boot parameters, including the `safe settings', and check whether the system still crashes. If not, begin reloading the modules one by one until you can locate the culprit. You might use the SysRq keys after the crash to determine the kernel function in which the crash occurred. Maybe this is a problem with X11, so check whether the system also crashes in runlevel 3, ...
It's not an X11 problem. My primary goal on this machine was testing and getting to know Xen. X11 is not needed on a server, so I use runlevel 3.
There is some progress:
- When I tried the safe settings before, the kernel did not recognize the onboard Adaptec SCSI controller. This time I had no such problem, and with safe settings I was able to finish the installation. I started a kernel compilation and the machine froze soon after that. It had an uptime of 22 minutes, about four times longer than usual.
- I removed most of the kernel modules, and now I have more than one hour of uptime while the machine runs:
  while make -j 500 bzImage ; do make clean ; done

What should I do next?
- Stay with safe settings and reinsert modules?
- Boot normally, remove modules until I have the same set, and test reinserting from there?
Boot normally. Does just removing the processor and thermal modules help?
'processor' was loaded during the successful test. Thermal was not loaded on previous boot, when I checked. It's loaded now.

Modules which were loaded during the successful test:
aic7xxx edac_mc generic i2c_algo_bit i2c_core ide_core ipv6 nls_utf8 pci_hotplug piix processor reiserfs scsi_mod scsi_transport_spi sd_mod

Extra modules loaded on normal boot:
1a2,3
> af_packet
> agpgart
2a5,7
> cdrom
> dm_mod
> e100
3a9
> edd
6a13,16
> i2c_i801
> i2c_savage4
> i82860_edac
> ide_cd
7a18,19
> ide_disk
> intel_agp
8a21,22
> loop
> mii
9a24
> ntfs
16a32,47
> sg
> shpchp
> snd
> snd_ac97_bus
> snd_ac97_codec
> snd_intel8x0
> snd_mixer_oss
> snd_page_alloc
> snd_pcm
> snd_pcm_oss
> snd_seq
> snd_seq_device
> snd_timer
> soundcore
> uhci_hcd
> usbcore
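For readers who want to produce a comparison like this themselves, here is a minimal sketch. It is not the reporter's actual procedure (the thread only shows diff output); the function name extra_modules and the file names are made up for illustration. It assumes each list is a plain file with one module name per line.

```shell
#!/bin/sh
# Hypothetical helper: print modules that appear in the second list but
# not in the first. Both arguments are files with one module name per line.
extra_modules() {
    good_sorted=$(mktemp) || return 1
    now_sorted=$(mktemp) || return 1
    # comm needs sorted input; sort -u also drops duplicate names.
    sort -u "$1" > "$good_sorted"
    sort -u "$2" > "$now_sorted"
    # -1 hides lines only in the first file, -3 hides lines common to both,
    # so only the lines unique to the second file remain.
    comm -13 "$good_sorted" "$now_sorted"
    rm -f "$good_sorted" "$now_sorted"
}

# Capturing a list on a running system (first column of lsmod, header skipped):
#   lsmod | awk 'NR > 1 { print $1 }' > modules.normal
# Comparing against a list saved during the stable run:
#   extra_modules modules.stable modules.normal
```

Unlike diff, comm does not report positions, only the names themselves, which is all that is needed to pick candidates for removal.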
Dunno then. There is no clear candidate. Do a binary search of the additional modules.
Sorry, but what do you mean by binary search?
You remove half the modules and check if the problem is still there. If yes, repeat step 1 with the modules that are still loaded. If not, repeat the algorithm with the half you removed. At some point you should be left with only the module that is the culprit.
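The search described here can be sketched in shell. This only models the logic: on real hardware every check is a manual boot-and-stress cycle, so freezes_with below is a hypothetical stand-in that simulates a single bad module, and bisect_modules and CULPRIT are likewise names invented for this illustration. The sketch assumes exactly one module is responsible.

```shell
#!/bin/sh
# Hypothetical predicate: succeeds (exit 0) if the machine freezes with the
# given modules loaded. Simulated here; in practice a person answers this
# after loading the modules and running a stress test.
CULPRIT=${CULPRIT:-i82860_edac}
freezes_with() {
    for m in "$@"; do
        [ "$m" = "$CULPRIT" ] && return 0
    done
    return 1
}

# Narrow a list of suspect modules down to one, assuming a single culprit.
bisect_modules() {
    while [ "$#" -gt 1 ]; do
        half=$(( $# / 2 ))
        first="" rest="" i=0
        # Split the current candidates into a first half and the remainder.
        for m in "$@"; do
            i=$(( i + 1 ))
            if [ "$i" -le "$half" ]; then first="$first $m"; else rest="$rest $m"; fi
        done
        # Module names contain no whitespace, so unquoted expansion is safe.
        if freezes_with $first; then set -- $first; else set -- $rest; fi
    done
    echo "$1"
}

# Example: bisect_modules af_packet agpgart cdrom e100 i82860_edac ide_cd
```

Each round halves the candidate list, so the sixteen or so extra modules in this report would need about five test boots instead of one per module.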
Good news: I found the offending module. It's i82860_edac. I did not have any freezes after removing it.

i386beta6:~ # modinfo i82860_edac
filename: /lib/modules/2.6.16-rc5-git2-2-smp/kernel/drivers/edac/i82860_edac.ko
license: GPL
author: Red Hat Inc. (http://www.redhat.com.com) Ben Woodard <woodard@redhat.com>
description: ECC support for Intel 82860 memory hub controllers
vermagic: 2.6.16-rc5-git2-2-smp SMP 586 REGPARM gcc-4.1
depends: edac_mc
alias: pci:v00008086d00002531sv*sd*bc*sc*i*
srcversion: F3EB7D4B3F2D7950514C91B
i386beta6:~ #
Good. Now we need the help of the kernel maintainers.
Ok, good that we made it unsupported. I would vote for just disabling the CONFIG_* for that module. Comments?
Fine with me. If it's that broken...
Ok. I will send a report to the maintainers too.
-------------------------------------------------------------------
Wed Mar 8 09:01:47 CET 2006 - ak@suse.de

- config/{i386,x86_64}/*: Disable CONFIG_EDAC_I82860 because it hangs machines (#154709)

Fixed