Bug 112896 - Kernel hardfreeze on double AMD 64
Summary: Kernel hardfreeze on double AMD 64
Status: VERIFIED FIXED
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel
Version: Beta 2
Hardware: x86-64 All
: P5 - None : Major
Target Milestone: ---
Assignee: Thomas Renninger
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-25 08:05 UTC by Uwe Köhler
Modified: 2006-06-13 11:18 UTC

See Also:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
hwinfo file created in SuSE 9.3 single processor kernel (253.89 KB, text/plain)
2005-08-25 08:06 UTC, Uwe Köhler
System logfile after normal boot with hardfreeze (17.46 KB, text/plain)
2005-08-25 08:07 UTC, Uwe Köhler
System logfile after booting with acpi=off (no freeze) (15.76 KB, text/plain)
2005-08-25 08:07 UTC, Uwe Köhler
System logfile after booting with acpi = oldboot (system freezes) (12.92 KB, text/plain)
2005-08-25 08:08 UTC, Uwe Köhler
/var/log/messages of Kernel Freeze with powersaved on. (17.02 KB, text/plain)
2005-08-26 10:46 UTC, Uwe Köhler
acpidmp from 10.0 (164 bytes, text/plain)
2006-01-05 13:04 UTC, Uwe Köhler
dmesg SuSE 10.0 (17.13 KB, text/plain)
2006-01-05 13:26 UTC, Uwe Köhler
/proc/acpi/dsdt (13.40 KB, text/plain)
2006-01-05 13:27 UTC, Uwe Köhler

Description Uwe Köhler 2005-08-25 08:05:30 UTC
After installing the default SuSE 10.0 kernel on a dual-processor AMD 64
Opteron machine, the kernel hard-freezes after a short while (the delay seems
to be random).

When booting with acpi=off, this problem goes away and the machine runs fine.
I also noticed lots of "kernel segfault" messages in the system log file
(attached).
Comment 1 Uwe Köhler 2005-08-25 08:06:51 UTC
Created attachment 47480
hwinfo file created in SuSE 9.3 single processor kernel
Comment 2 Uwe Köhler 2005-08-25 08:07:21 UTC
Created attachment 47481
System logfile after normal boot with hardfreeze
Comment 3 Uwe Köhler 2005-08-25 08:07:48 UTC
Created attachment 47482
System logfile after booting with acpi=off (no freeze)
Comment 4 Uwe Köhler 2005-08-25 08:08:44 UTC
Created attachment 47483
System logfile after booting with acpi = oldboot (system freezes)
Comment 5 Olaf Kirch 2005-08-25 09:03:17 UTC
Looking at the diff between the boot logs, one thing that sticks out is the 
cpufreq stuff that gets disabled for acpi=off. 
 
Also, can you please reproduce without the nvidiafb module loaded? 
Comment 6 Pavel Machek 2005-08-25 09:51:44 UTC
Hmm, powernow-k8 somehow thinks it has 128 cpus... oops. Really try it without
cpufreq.
Comment 7 Uwe Köhler 2005-08-25 10:00:53 UTC
Thanks for the tips. However, so far I have always relied on the delivered SuSE
kernels. Here are my questions:

1. Unload nvidiafb = recompile kernel?

2. No cpufreq = deinstall the package?

Anything else you need?
Comment 8 Thomas Renninger 2005-08-25 10:10:34 UTC
You definitely hit bug #102518. Could you update to the latest BIOS? This should
fix the segfault .... error 4 messages.
But normally the machine does not freeze.

Maybe you also hit #103786 (ondemand locks CPUs on SMP Opteron machines with
cpufreq).
It should be possible to work around this with the userspace governor:
/etc/sysconfig/powersave/cpufreq:
POWERSAVE_CPUFREQ_CONTROL="userspace"

Pavel: The 128 cpus is also known: #103028 (and is maybe the most severe one).
Andi is currently assigned to it. He and Mark expect a bug in cpu_online().
I am on holidays for three days now and I don't know which bug to look at first,
anyway.
If you could have a look at this one this would be really great.
I can also provide you a Dual Core machine (4x2 Opteron, *willimas*) with serial
console and power switch attached. Tell me if you have time for it and I will
send you details on how to access those things.

Uwe: You can disable cpufreq by adding 1 to the boot params (boot into init 1),
then invoke "chkconfig powersaved off" and reboot.
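
A minimal shell sketch of how the governor setting from the workaround above could be checked. The helper name sysconfig_value is hypothetical; the file path and variable name are the ones quoted in this comment, and it is an assumption that the file uses plain VAR="value" lines.

```shell
#!/bin/sh
# Print the value assigned to a variable in a sysconfig-style file.
# Usage: sysconfig_value FILE VARNAME
# Commented-out assignments are ignored; if the variable is assigned
# more than once, the last assignment wins.
sysconfig_value() {
    sed -n "s/^[[:space:]]*$2=\"\{0,1\}\([^\"]*\)\"\{0,1\}[[:space:]]*\$/\1/p" "$1" | tail -n 1
}

# Example (run as any user):
#   sysconfig_value /etc/sysconfig/powersave/cpufreq POWERSAVE_CPUFREQ_CONTROL
# Disabling the daemon entirely, as described above, needs root:
#   chkconfig powersaved off
```

If a cpufreq driver is loaded, the governor the kernel is actually using can be read from /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor.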
Comment 9 Uwe Köhler 2005-08-26 10:44:31 UTC
OK, found a floppy drive, put it in the machine, flashed the BIOS and X11 in
SuSE 9.3 stopped working. The computer is three weeks old, but the BIOS version
was probably quite old. Running sax2 fixed SuSE 9.3 single processor kernel.

The story for 10.0:
After the BIOS flash reboot with default values (/var/log/messages to follow)
and I got to a terminal login.
init 3 to run sax2
/etc/rc.status: line 122: 6266 Segmentation fault    stty size 0<&1 >/dev/null 2>&1

boot 1
chkconfig powersaved off
reboot

boot default
get me to terminal (X11 not working due to BIOS Flash)
init 3
sax2 (which is much better now, well done!)

Everything works well.

Should I submit a powersaved bug now and close this one? Does the same trick
work for SuSE 9.3 (I have to use the single processor kernel there)?

Many thanks for your help, hope you get powersaved working as well. The
machine is sooo loud. I will start testing applications now.
Comment 10 Uwe Köhler 2005-08-26 10:46:41 UTC
Created attachment 47745
/var/log/messages of Kernel Freeze with powersaved on.
Comment 11 Olaf Kirch 2005-08-26 10:48:00 UTC
Thomas Renninger is on vacation. Pavel, do you have time to look into this 
next week? 
Comment 12 Pavel Machek 2005-08-26 21:51:31 UTC
It looks to me like "different bios versions provoke different crashes in
suse9.3 and suse10.0, and X somehow fails to work after bios update". I do not
think I can help that much with this one.

Yes, this should definitely be split into different bugs.

Thomas: I guess I'll let andi debug cpu_online() stuff. I'm not too experienced
with SMP.
Comment 13 Olaf Kirch 2005-08-29 08:48:22 UTC
Do you mean to say that it's the X server locking up the machine? 
This should be easy to verify by booting into runlevel 3. 
Comment 14 Uwe Köhler 2005-08-29 10:20:45 UTC
I (if you refer to me :-) did not mean that. After the BIOS update the graphics
card was not found and I had to rerun sax2 to enable X again. This could be
mentioned in the handbook. I suspect that the Kernel Freeze is independent of X.
I did not write down all my tries, but I am sure it froze on me a couple of
times before X started.

I did not try SuSE 9.3 with SMP after the BIOS update. I really need the machine
to run at all, even if slow. In SuSE 10.0 beta it runs fine without powersaved.
Do you still want me to try with powersaved on and init 3? I hope my next beta 3
download succeeds, and then I would try with beta 3.
Comment 15 Olaf Kirch 2005-08-29 14:16:10 UTC
This may be the same issue as bug 103786 
Comment 16 Pavel Machek 2005-09-06 21:07:10 UTC
Is it still broken? Can you confirm it is not the same as bug 103786?
Comment 17 Uwe Köhler 2005-09-08 08:18:45 UTC
I just installed Beta 4 with exactly the same problem. As far as bug 103786 is
concerned, I have not seen any of the phenomena described there. The kernel
freeze appears before or just after X11 managed to start and not after 2 hours.
All works fine when powersaved is off. Anything else I can do to help?
Comment 18 Uwe Köhler 2005-09-19 11:26:42 UTC
Any more information you need? I found that powersaved is off in RC1 on a
single processor X86_64 machine. Is this connected? I did not have a problem on
the single processor machine in Beta4.
Comment 19 Pavel Machek 2005-09-19 14:20:17 UTC
I'd suggest running without powersaved on SMP machines, at least for now. It is
not really well tested. Stefan, does powersaved bring any benefits on SMP
machine? It looks like all risks to me...

Oops, okay, there are smp notebooks :-(. Blacklist powernow-k8 for smp kernels?
Comment 20 Forgotten User ZhJd0F0L3x 2005-09-19 14:37:24 UTC
I have yet to see an amd64 smp notebook :-) But it brings benefit on servers 
also: the south sea island may not drown so fast if we cut energy consumption. 
 
Anyway, i only heard about problems with the ondemand governor and the latest 
powersave package does disable ondemand and switch to userspace if a SMP amd64 
machine is detected during installation. 
Comment 21 Thomas Renninger 2005-10-16 14:41:21 UTC
What is the status here?
With the powersave daemon 10.0 Gold we set the userspace governor for all SMP machines. Did this help for you?
If not, can you confirm that if the powersave daemon is stopped and the powernow-k8 module is not loaded, the machine does not freeze?
Comment 22 Uwe Köhler 2005-10-20 07:49:50 UTC
Hi there,

SuSE 10.0 only arrived one day before my holidays. Installed it today and still had the problem. The workaround of switching off the powersave daemon still works. I could just do with quieter fans ;-) Is there any way to switch off the constant reminders that powersaved is not running?

This is what I did:
chkconfig powersaved off

Hope this helps
Comment 23 Thomas Renninger 2005-10-20 08:29:16 UTC
Yes, I have heard that others also have a running powersave daemon but still get the annoying kpowersave message that it is not running.

Do the dbus/hal and powersave daemon processes run when you get this error?
Hmm, maybe the powersave daemon dies because of the cpus_online bug that exports 128 directories to /sys/devices/system/cpu/cpuX.
Could you go into runlevel 3, start the powersave daemon, and check whether the process is really running (ps aux|grep powersaved)?

If it is running and you still get the kpowersave message we have a dbus/kpowersave problem.

If it is not running, it dies unexpectedly; I will try to verify and fix it here if this is the case. Hmm, just had a look at one of our Dual Core AMD64 machines: the powersave daemon is running and cpufreq (even with 128 cpus reported) is working fine (with the userspace governor).

Anyway there is another bug report to let kpowersave not complain about a not running powersave daemon. This is a nice example why we should do it.

For now you should quit the kpowersave by right-clicking on its icon and answer no when you get asked whether it should get started the next time.
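
A small sketch for checking the 128-directory symptom mentioned above. It counts the cpuN entries below the sysfs CPU directory so the number can be compared with the processors actually installed. The function name count_cpu_dirs is made up for illustration.

```shell
#!/bin/sh
# Count the cpuN directories below a sysfs-style CPU directory.
# Entries such as "cpufreq" or "cpuidle" are not counted because the
# name must continue with a digit after "cpu".
# Usage: count_cpu_dirs DIR
count_cpu_dirs() {
    n=0
    for d in "$1"/cpu[0-9]*; do
        [ -d "$d" ] && n=$((n + 1))
    done
    echo "$n"
}

# Example:
#   count_cpu_dirs /sys/devices/system/cpu
#   grep -c '^processor' /proc/cpuinfo    # processors the kernel lists on x86
```

On an affected kernel the first number would presumably be 128 while the second matches the real CPU count. Using ps aux | grep '[p]owersaved' instead of the plain grep keeps the grep process itself out of the result.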
Comment 24 Holger Macht 2005-10-20 08:57:29 UTC
I think we also need a running kpowersave even if the powersave daemon is not running. I suggest adding a checkbox to the kpowersave configuration for whether it should complain about a not running powersave daemon. There are also other cases where it makes sense to disable this popup.

Is this ok with you Thomas?
Comment 25 Thomas Renninger 2005-10-20 09:11:03 UTC
> Installed it today and still had the problem
-> Does this mean the machine still freezes?
Could you check whether the userspace governor was set in /etc/sysconfig/powersave/cpufreq via the variable CPUFREQ_CONTROL="userspace"? If not, please do so and try to start the powersave daemon again. Still freezing?

There is an extra bug for the kpowersave pop-up issue: #121965
Comment 26 Uwe Köhler 2005-10-20 09:49:41 UTC
Yes, the machine freezes. I tried it again just a few minutes ago. CPUFREQ_CONTROL="userspace" is set correctly, but I still have to switch off powersaved.

Here are a couple of lines in dmesg that might have to do with it?

 ACPI: Looking for DSDT in initrd... not found!
 not found!

ACPI: CPU0 (power states: C1[C1])
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: CPU1 (power states: C1[C1])
    ACPI-0733: *** Warning: Processor Device is not present
    ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x3
    ACPI-0733: *** Warning: Processor Device is not present
    ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x4
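
To see at a glance which processor objects the kernel could not map, the warnings above can be pulled out of a saved dmesg file. This is only a sketch: the function name failed_acpiids is hypothetical, and the pattern is taken from the message text quoted above, which may differ between kernel versions.

```shell
#!/bin/sh
# Print the acpiid of every "Error getting cpuindex" warning in a
# saved dmesg file, one per line, in the order they were logged.
# Usage: failed_acpiids FILE
failed_acpiids() {
    sed -n 's/.*Error getting cpuindex for acpiid \(0x[0-9a-fA-F][0-9a-fA-F]*\).*/\1/p' "$1"
}

# Example:
#   dmesg > dmesg.txt
#   failed_acpiids dmesg.txt
```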
Comment 27 Danny Al-Gaaf 2005-10-20 11:44:44 UTC
(In reply to comment #24)
> I think we also need a running kpowersave even if the powersave daemon is not
> running. 

I don't know if this is a laptop, but if this is a desktop machine you don't need a running kpowersave without powersave. In this case there is no functionality in KPowersave.
Comment 28 Thomas Renninger 2005-12-22 08:56:58 UTC
Sorry for answering so late.
ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x3
ACPI-0733: *** Warning: Processor Device is not present
-> This should not be severe.

Could you post acpidmp, please?
Is it possible for you to attach a serial console and possibly grep the last messages when the kernel is dying?
Most important for priority: Is it possible for you to install a recent OpenSuse 10.1 installation? If there the problem is still present, we have to fix it! If it's an easy/uncritical fix we could still backport it to 10.0.
Comment 29 Thomas Renninger 2005-12-22 10:12:56 UTC
Please post acpidmp. I think I found a patch that went mainline in 2.6.15-rc5 that could solve this issue and should be safe to add to 10.0.
Comment 30 Thomas Renninger 2006-01-04 13:09:25 UTC
Does it still freeze with a recent version of OpenSuse 10.1?
Could you also please attach the whole dmesg output.
Comment 31 Uwe Köhler 2006-01-05 07:40:58 UTC
Sorry, you hit my holidays again and today is my first day with access to the machine. It is normally in use now, but I will try to test 10.1 tomorrow.

As far as acpidmp is concerned it would help me a lot if you could tell me where to find the information short of find / -name "acpidmp".

Cheers and a happy new year.

Uwe
Comment 32 Thomas Renninger 2006-01-05 09:15:39 UTC
Sorry, acpidmp is a binary writing out parts of the BIOS to stdout.
Just do acpidmp >/tmp/acpidmp.txt and attach the file. Thanks.
Comment 33 Uwe Köhler 2006-01-05 13:03:01 UTC
Thanks for your help. Doesn't look good, though. Tried it on SuSE 10.0 with the following result:
acpidmp > acpidmp.txt
acpidmp: cannot map the RSDT

I will attach the file.
Comment 34 Uwe Köhler 2006-01-05 13:04:29 UTC
Created attachment 62040
acpidmp from 10.0
Comment 35 Thomas Renninger 2006-01-05 13:15:49 UTC
Ooops, acpidmp cannot follow the DSDT or some other ACPI table pointer in the RSDT table ...
Could you check whether you get some sane output from acpidump -t DSDT (note the additional "u": acpidump, not acpidmp)?
Could you also attach full dmesg output, please.
Comment 36 Thomas Renninger 2006-01-05 13:17:05 UTC
If acpidump also does not work, just post /proc/acpi/dsdt. Thanks.
Comment 37 Uwe Köhler 2006-01-05 13:22:51 UTC
linux:/home/ukoehler # dmesg > dmesg.txt
linux:/home/ukoehler # acpidump -t DSDT > acpidump.txt
ACPI tables were not found. If you know location of RSD PTR table (from dmesg, etc), supply it with either --addr or -a option
linux:~ # cat /proc/acpi/dsdt > dsdt.txt

Find files attached.
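
The fallback used above (read the table straight from the kernel when the dump tools fail) can be sketched as a small helper that picks the first readable source. The function name first_readable is hypothetical. /proc/acpi/dsdt is the path used in this report; /sys/firmware/acpi/tables/DSDT is where newer kernels expose the same table, and which of the two exists depends on the kernel version.

```shell
#!/bin/sh
# Print the first argument that names a readable file and return 0;
# return 1 if none of the given paths is readable.
# Usage: first_readable PATH...
first_readable() {
    for f in "$@"; do
        if [ -r "$f" ]; then
            printf '%s\n' "$f"
            return 0
        fi
    done
    return 1
}

# Example (usually needs root):
#   src=$(first_readable /sys/firmware/acpi/tables/DSDT /proc/acpi/dsdt) \
#       && cat "$src" > dsdt.txt
```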
Comment 38 Uwe Köhler 2006-01-05 13:26:27 UTC
Created attachment 62041
dmesg SuSE 10.0
Comment 39 Uwe Köhler 2006-01-05 13:27:06 UTC
Created attachment 62042
/proc/acpi/dsdt
Comment 40 Thomas Renninger 2006-01-05 13:52:37 UTC
Damn, it seems not to be the bug I hoped it would be.
Do you still have a free partition for a 10.1 OpenSuse Preview installation?
If it works there, we could lower the severity. Otherwise this is important to fix.

Hannes: You had the only machine I know of, where acpidmp fails and also reported the bug, I expect it to be that machine that also freezes?
Does it run again, so that we can reproduce the freeze here if 10.1 or SLES/NLD 10 Previews still freeze?
Comment 41 Uwe Köhler 2006-01-09 07:47:28 UTC
Tried to install version 10.0.42 over the weekend. This crashed the bootmanager and currently I cannot boot the machine at all (haven't got the 10.0 installation DVD with me for a rescue). No further information so far. Will try on.
Comment 42 Forgotten User ZhJd0F0L3x 2006-01-09 08:04:04 UTC
(In reply to comment #41)
> Tried to install version 10.0.42 over the weekend. This crashed the bootmanager
> and currently I cannot boot the machine at all (haven't got the 10.0
> installation DVD with me for a rescue). No further information so far. Will try
> on.

i had a similar problem with the bootloader.
If you have any CD / DVD at all, try commenting out the line
#gfxmenu (hd0,0)/message
in /boot/grub/menu.lst, this gave me back the bootability (machine hung in the
bootloader, never got to the kernel). Will have to create a bugreport for this :-)
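
A hedged sketch of the edit described above, for use from a rescue system. The function name comment_out_gfxmenu is made up for illustration, and it writes the result to stdout rather than changing the file in place, so the original menu.lst can be kept as a backup.

```shell
#!/bin/sh
# Print a GRUB menu.lst with every active "gfxmenu" line commented out.
# Lines that are already commented are left unchanged.
# Usage: comment_out_gfxmenu FILE
comment_out_gfxmenu() {
    sed 's/^\([[:space:]]*\)gfxmenu\([[:space:]]\)/\1#gfxmenu\2/' "$1"
}

# Example (as root, with the installed system's /boot mounted):
#   cp /boot/grub/menu.lst /boot/grub/menu.lst.orig
#   comment_out_gfxmenu /boot/grub/menu.lst.orig > /boot/grub/menu.lst
```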

Comment 43 Uwe Köhler 2006-01-09 10:30:47 UTC
Many thanks for the tip. Version 10.0.42 ran for 45 minutes (idle) which is a lot longer than 10.0 (about 20s). It froze when I tried to shut the system down. Any more information you want to fix version 10.0?
Comment 44 Thomas Renninger 2006-01-09 11:29:35 UTC
This is probably a duplicate of #141238.
Could you follow the last comments of Holger Köhlerschmidt there and see if it helps.
Let's go on there if you think this is the problem.
For 10.0 this probably will become a "Won't fix" for the system freeze.
10.1 has higher priority at the moment, let's see that we get this machine running smoothly on the latest version first.
Now all begins to make sense... The shutdown problem and the "not able to use gfxmenu" entry with bootable CDs, seem to have the same base issue.
Comment 45 Forgotten User ZhJd0F0L3x 2006-01-09 11:54:09 UTC
(In reply to comment #44)
> Now all begins to make sense... The shutdown problem and the "not able to use
> gfxmenu" entry with bootable CDs, seem to have the same base issue.

I don't think the gfxmenu problem is related. This is from grub, not from CD and i have an i386 UP machine with the gfxmenu problem.

Also, it always worked and just failed with the latest release, so i think it is just a plain simple gfxmenu bug.
Comment 46 Thomas Renninger 2006-04-20 21:30:56 UTC
Bug was still set to "need info"..., please reassign if info has been provided.
> It froze when I tried to shut the system down.
This sounds like another bug? You might want to switch to console and watch the kernel output, maybe you see the kernel ooopsing?
Not sure if I already mentioned that: You should be sure to run the latest BIOS.
Comment 47 Thomas Renninger 2006-06-13 10:43:11 UTC
This bug is getting too long and confusing:
Summary: 
 - cpufreq/ondemand does not freeze the machine anymore -> Closing.
 - The machine does not shutdown -> another problem -> new bug or duplicate of 
   #141238 (if this is an AMD Turion with ATI chipset, it probably is a 
   duplicate).
Comment 48 Uwe Köhler 2006-06-13 11:18:51 UTC
Initial problem has been fixed by your help and installing quiet fans on the machine. Will test with Suse 10.1 in the near future and open a new report if necessary.

Many thanks for your help