Bug 154709 - ibm intellistation random freezes
Status: RESOLVED FIXED
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Other
Version: Beta 6
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Andreas Kleen
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-03-02 14:48 UTC by peter czanik
Modified: 2006-03-08 15:34 UTC

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
hwinfo (37.84 KB, text/plain)
2006-03-05 09:34 UTC, peter czanik
boot.msg (20.47 KB, text/plain)
2006-03-05 09:34 UTC, peter czanik
last 500 lines of messages (10.78 KB, text/plain)
2006-03-05 09:39 UTC, peter czanik
/var/log/messages (28.53 KB, text/plain)
2006-03-06 15:02 UTC, peter czanik

Description peter czanik 2006-03-02 14:48:38 UTC
I am now using an IBM IntelliStation M Pro for testing. As a reference, I tested 10.0 on it and did not have any problems.
10.1 beta6 freezes randomly on the same hardware, both during installation and in normal use.
Comment 1 peter czanik 2006-03-03 07:24:45 UTC
I reinstalled 10.0, and it ran rock solid all night long, compiling the kernel in a 'while true' loop.
On 10.0 I need 'pci=noacpi' to get the onboard Adaptec SCSI controller running. With 10.1 beta6 it works without this parameter, and the random freezes occur even if I use it.
Comment 2 Michael Gross 2006-03-03 15:06:39 UTC
Please attach more information about the hardware used (`hwinfo'), the last 500 lines of your syslog, and /var/log/boot.msg.
Comment 3 peter czanik 2006-03-03 15:12:19 UTC
I only have 10.0 installed now. Is it worth sending hwinfo from it?
Comment 4 Andreas Kleen 2006-03-04 14:10:58 UTC
Does it work with nolapic?

If yes, please attach acpidmp, hwinfo, dmidecode
Comment 5 peter czanik 2006-03-04 21:31:50 UTC
With nolapic it does not work at all, as the driver for the onboard Adaptec controller does not load correctly.
Comment 6 peter czanik 2006-03-05 09:34:06 UTC
Created attachment 71269
hwinfo
Comment 7 peter czanik 2006-03-05 09:34:45 UTC
Created attachment 71270
boot.msg
Comment 8 peter czanik 2006-03-05 09:39:56 UTC
Created attachment 71271
last 500 lines of messages

I installed 10.1 again. This time I experimented with the acpi=off setting. After the reboot, when YaST starts again, the machine freezes after about 30 seconds. After 3-4 such reproducible freezes I used Ctrl+Alt+Backspace to kill YaST, and was then able to collect the necessary information.
10.0 runs rock solid on the same machine...
Comment 9 Michael Gross 2006-03-06 13:56:07 UTC
Please be a bit more specific about these crashes. Are you talking about YaST only or does the whole system freeze? Is there an oops, does the numlock key work after the crash, can you switch to a console?
Comment 10 peter czanik 2006-03-06 14:02:12 UTC
With YaST running, the crash comes earlier, probably because there is more activity.
It really is a freeze: there are no messages on screen, Num Lock, Caps Lock etc. do not respond, I cannot switch consoles, I cannot ping the machine, and the cursor no longer blinks. The machine is simply dead, without any notice and with nothing in the logs.
With 10.0 even Xen is stable: the host and two Xen guests compile the kernel all day long without any trouble. So it's something in the 10.1 kernel...
Comment 11 Michael Gross 2006-03-06 14:47:19 UTC
I would agree, but your attachment is much shorter than the requested 500 lines. Please look into it and check whether there is really no indication of what is going wrong.
Comment 12 peter czanik 2006-03-06 15:01:21 UTC
That was the whole /var/log/messages; the file was actually created by:
tail -500 /var/log/messages > messages500
Now it's about 300 lines, so I am sending it again.
Comment 13 peter czanik 2006-03-06 15:02:48 UTC
Created attachment 71384
/var/log/messages
Comment 14 Michael Gross 2006-03-07 10:17:48 UTC
At first glance I cannot find anything critical in this log file. You might debug this further by removing any kernel modules that are not strictly required for the system to run. Try various boot parameters, including the `safe settings', and check whether the system still crashes. If not, begin reloading the modules one by one until you can locate the culprit. You might use the SysRq keys after the crash to determine the kernel function in which the crash occurred. Maybe this is a problem with X11, so check whether the system also crashes in runlevel 3, ...
Comment 15 peter czanik 2006-03-07 10:32:29 UTC
It's not an X11 problem. My primary goal on this machine was to test and get to know Xen. X11 is not needed on a server, so I use runlevel 3.
Comment 16 peter czanik 2006-03-07 12:29:10 UTC
There is some progress:

- When I tried to use safe settings before, the kernel did not recognize the onboard Adaptec SCSI controller. This time I had no such problem, and with safe settings I was able to finish the installation. I started a kernel compilation and the machine froze soon after that. It had an uptime of 22 minutes, about four times longer than usual.

- I removed most of the kernel modules, and now I have more than one hour of uptime while the machine is running: while make -j 500 bzImage ; do make clean ; done

What should I do next?
- stay with safe settings and reinsert modules, or
- boot normally, remove modules until I have the same set, and test reinserting from there?
Comment 17 Andreas Kleen 2006-03-07 13:01:28 UTC
Boot normally.

Does just removing the processor and thermal modules help?
Comment 18 peter czanik 2006-03-07 13:31:06 UTC
'processor' was loaded during the successful test.
'thermal' was not loaded on the previous boot when I checked. It's loaded now.

Modules that were loaded during the successful test:
aic7xxx
edac_mc
generic
i2c_algo_bit
i2c_core
ide_core
ipv6
nls_utf8
pci_hotplug
piix
processor
reiserfs
scsi_mod
scsi_transport_spi
sd_mod 

Extra modules loaded on normal boot:
1a2,3
> af_packet
> agpgart
2a5,7
> cdrom
> dm_mod
> e100
3a9
> edd
6a13,16
> i2c_i801
> i2c_savage4
> i82860_edac
> ide_cd
7a18,19
> ide_disk
> intel_agp
8a21,22
> loop
> mii
9a24
> ntfs
16a32,47
> sg
> shpchp
> snd
> snd_ac97_bus
> snd_ac97_codec
> snd_intel8x0
> snd_mixer_oss
> snd_page_alloc
> snd_pcm
> snd_pcm_oss
> snd_seq
> snd_seq_device
> snd_timer
> soundcore
> uhci_hcd
> usbcore 
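
The list above appears to be diff output comparing two sorted module lists. As a rough sketch only, the same comparison could be done in Python on two saved copies of lsmod output. The function names here are made up for illustration and are not from this report; the only thing assumed about lsmod is that the first column is the module name and the header line starts with "Module".

```python
from typing import List, Set


def module_names(lsmod_output: str) -> Set[str]:
    """Collect module names (first column) from lsmod-style text."""
    names: Set[str] = set()
    for line in lsmod_output.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def extra_modules(baseline: str, current: str) -> List[str]:
    """Return modules present in `current` but not in `baseline`, sorted."""
    return sorted(module_names(current) - module_names(baseline))
```

Calling extra_modules() with the lsmod text from the stable boot as baseline and the text from a normal boot as current should give the candidate list to investigate.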
Comment 19 Andreas Kleen 2006-03-07 13:35:47 UTC
Dunno then.  There is no clear candidate.

Do a binary search of the additional modules.
Comment 20 peter czanik 2006-03-07 13:50:18 UTC
Sorry, but what do you mean by binary search?
Comment 21 Andreas Kleen 2006-03-07 14:15:07 UTC
Remove half of the modules and check whether the problem is still there.
If yes, repeat step 1 with the modules that are still loaded.
If not, repeat the algorithm with the half you removed.

At some point you should be left only with the module that is the culprit.
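
The procedure above can be sketched in Python. This is only an illustration under two assumptions: exactly one module is responsible, and module dependencies are ignored. The callback freezes_with is hypothetical; in practice it stands for the manual step of booting with only the given extra modules loaded and running the compile loop long enough to see whether the machine hangs.

```python
from typing import Callable, List, Sequence


def find_culprit(
    candidates: Sequence[str],
    freezes_with: Callable[[List[str]], bool],
) -> str:
    """Narrow a list of suspect modules down to one by halving it.

    freezes_with(loaded) must return True if the machine still freezes
    when only the modules in `loaded` (out of the suspects) are present.
    """
    if not candidates:
        raise ValueError("need at least one candidate module")
    suspects = list(candidates)
    while len(suspects) > 1:
        half = len(suspects) // 2
        kept, removed = suspects[:half], suspects[half:]
        if freezes_with(kept):
            # Problem still there: the culprit is among the kept modules.
            suspects = kept
        else:
            # Problem gone: the culprit is in the half that was removed.
            suspects = removed
    return suspects[0]
```

With about 30 extra modules this takes roughly five test rounds rather than one round per module, which matters when each round means waiting an hour for a possible freeze.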

Comment 22 peter czanik 2006-03-08 07:01:37 UTC
Good news: I found the offending module. It's i82860_edac.
I have not had any freezes since removing it.

i386beta6:~ # modinfo i82860_edac
filename:       /lib/modules/2.6.16-rc5-git2-2-smp/kernel/drivers/edac/i82860_edac.ko
license:        GPL
author:         Red Hat Inc. (http://www.redhat.com.com) Ben Woodard <woodard@redhat.com>
description:    ECC support for Intel 82860 memory hub controllers
vermagic:       2.6.16-rc5-git2-2-smp SMP 586 REGPARM gcc-4.1
depends:        edac_mc
alias:          pci:v00008086d00002531sv*sd*bc*sc*i*
srcversion:     F3EB7D4B3F2D7950514C91B
i386beta6:~ #
Comment 23 Michael Gross 2006-03-08 15:13:50 UTC
Good. Now we need the help of the kernel maintainers.
Comment 24 Andreas Kleen 2006-03-08 15:19:17 UTC
Ok, good that we made it unsupported.

I would vote for just disabling the CONFIG_* for that module. Comments?
Comment 25 Olaf Kirch 2006-03-08 15:24:58 UTC
Fine with me. If it's that broken...
Comment 26 Andreas Kleen 2006-03-08 15:26:20 UTC
Ok. I will send a report to the maintainers too.
Comment 27 Andreas Kleen 2006-03-08 15:34:57 UTC
-------------------------------------------------------------------
Wed Mar  8 09:01:47 CET 2006 - ak@suse.de

- config/{i386,x86_64}/*: Disable CONFIG_EDAC_I82860 because it 
  hangs machines (#154709)

Fixed