Bugzilla – Bug 112896
Kernel hardfreeze on double AMD 64
Last modified: 2006-06-13 11:18:51 UTC
After installing the default SuSE 10.0 kernel on a dual-processor AMD 64 Opteron machine, the kernel hard-freezes after a short while (the delay seems to be random). When booting with acpi=off, this problem goes away and the machine runs fine. I also noticed lots of "kernel segfault" messages in the system log file (attached).
Created attachment 47480 [details] hwinfo file created in SuSE 9.3 single processor kernel
Created attachment 47481 [details] System logfile after normal boot with hardfreeze
Created attachment 47482 [details] System logfile after booting with acpi=off (no freeze)
Created attachment 47483 [details] System logfile after booting with acpi = oldboot (system freezes)
Looking at the diff between the boot logs, one thing that sticks out is the cpufreq stuff that gets disabled for acpi=off. Also, can you please reproduce without the nvidiafb module loaded?
Hmm, powernow-k8 somehow thinks it has 128 cpus... oops. Really try it without cpufreq.
Thanks for the tips. However, so far I have always relied on the delivered SuSE kernels. Here are my questions: 1. Unload nvidiafb = recompile kernel? 2. No cpufreq = deinstall the package? Anything else you need?
You definitely hit bug #102518. Could you update to the latest BIOS? This should fix the "segfault ... error 4" messages. But normally the machine does not freeze.
Maybe you also hit #103786 (ondemand locks CPUs on SMP Opteron machines with cpufreq). This can be worked around with the userspace governor:
/etc/sysconfig/powersave/cpufreq:
POWERSAVE_CPUFREQ_CONTROL="userspace"
Pavel: The 128 cpus issue is also known: #103028 (and is maybe the most severe one). Andi is currently assigned to it. He and Mark suspect a bug in cpu_online(). I am on holiday for three days now and I don't know which bug to look at first anyway. If you could have a look at this one, that would be really great. I can also provide you with a dual-core machine (4x2 Opteron, *willimas*) with serial console and power switch attached. Tell me if you have time for it and I will send you details on how to access those things.
Uwe: You can disable cpufreq by adding 1 to the boot params (boot into init 1), then invoking "chkconfig powersaved off" and rebooting.
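
For anyone who wants to script the governor workaround described above, here is a minimal shell sketch. set_governor is a hypothetical helper, not part of the powersave package; the file path and variable name are simply the ones quoted in this comment and may differ between releases.

```shell
# Sketch only: set the cpufreq governor variable in a sysconfig-style file.
# set_governor is a hypothetical helper name, not a powersave tool.
set_governor() {
    conf=$1                              # e.g. /etc/sysconfig/powersave/cpufreq
    governor=$2                          # e.g. userspace
    var=${3:-POWERSAVE_CPUFREQ_CONTROL}  # variable name as quoted in this comment
    if grep -q "^${var}=" "$conf" 2>/dev/null; then
        # Rewrite the existing assignment via a temporary file.
        sed "s|^${var}=.*|${var}=\"${governor}\"|" "$conf" > "$conf.tmp" &&
            mv "$conf.tmp" "$conf"
    else
        # No assignment yet: append one.
        printf '%s="%s"\n' "$var" "$governor" >> "$conf"
    fi
}

# Example (as root, after backing up the file):
#   set_governor /etc/sysconfig/powersave/cpufreq userspace
```

The optional third argument lets the same helper be used if the variable has a different name on your release.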
OK, found a floppy drive, put it in the machine, flashed the BIOS, and X11 in SuSE 9.3 stopped working. The computer is three weeks old, but the BIOS version was probably quite old. Running sax2 fixed the SuSE 9.3 single processor kernel.
The story for 10.0:
- After the BIOS flash, rebooted with default values (/var/log/messages to follow) and got to a terminal login.
- init 3 to run sax2
  /etc/rc.status: line 122: 6266 Segmentation fault stty size 0<&1 >/dev/null 2>&1
- boot 1
- chkconfig powersaved off
- reboot
- boot default, which gets me to a terminal (X11 not working due to the BIOS flash)
- init 3
- sax2 (which is much better now, well done!)
Everything works well. Should I submit a powersaved bug now and close this one? Does the same trick work for SuSE 9.3 (I have to use the single processor kernel there)? Many thanks for your help, hope you get powersaved working as well. The machine is sooo loud. I will start testing applications now.
Created attachment 47745 [details] /var/log/messages of Kernel Freeze with powersaved on.
Thomas Renninger is on vacation. Pavel, do you have time to look into this next week?
It looks to me like "different BIOS versions provoke different crashes in SuSE 9.3 and SuSE 10.0, and X somehow fails to work after a BIOS update". I do not think I can help that much with this one. Yes, this should definitely be split into different bugs. Thomas: I guess I'll let Andi debug the cpu_online() stuff. I'm not too experienced with SMP.
Do you mean to say that it's the X server locking up the machine? This should be easy to verify by booting into runlevel 3.
I (if you refer to me :-) did not mean that. After the BIOS update the graphics card was not found and I had to rerun sax2 to enable X again. This could be mentioned in the handbook. I suspect that the kernel freeze is independent of X. I did not write down all my tries, but I am sure it froze on me a couple of times before X started. I did not try SuSE 9.3 with SMP after the BIOS update. I really need the machine to run at all, even if slow. In SuSE 10.0 beta it runs fine without powersaved. Do you still want me to try with powersaved on and init 3? I hope my next beta 3 download succeeds, and I would then try with beta 3.
This may be the same issue as bug 103786
Is it still broken? Can you confirm it is not same as bug 103786?
I just installed Beta 4 with exactly the same problem. As far as bug 103786 is concerned, I have not seen any of the phenomena described there. The kernel freeze appears before or just after X11 has managed to start, and not after 2 hours. All works fine when powersaved is off. Anything else I can do to help?
Any more information you need? I found that powersaved is off in RC1 on a single processor X86_64 machine. Is this connected? I did not have a problem on the single processor machine in Beta4.
I'd suggest running without powersaved on SMP machines, at least for now. It is not really well tested. Stefan, does powersaved bring any benefits on SMP machines? It looks like all risk to me... Oops, okay, there are SMP notebooks :-(. Blacklist powernow-k8 for SMP kernels?
I have yet to see an amd64 SMP notebook :-) But it also brings benefits on servers: the south sea islands may not drown so fast if we cut energy consumption. Anyway, I have only heard about problems with the ondemand governor, and the latest powersave package disables ondemand and switches to userspace if an SMP amd64 machine is detected during installation.
What is the status here? With the powersave daemon 10.0 Gold we set the userspace governor for all SMP machines. Did this help for you? If not, can you confirm that if the powersave daemon is stopped and the powernow-k8 module is not loaded, the machine does not freeze?
Hi there, SuSE 10.0 only arrived one day before my holidays. Installed it today and still had the problem. The workaround of switching off the powersave daemon still works. I could just do with quieter fans ;-) Any way to switch off the constant reminders that powersaved is not running? This is what I did: chkconfig powersaved off. Hope this helps.
Yes, I have heard that others also have the problem of a running powersave daemon but still get the annoying kpowersave message that it is not running. Are the dbus/hal and powersave daemon processes running when you get this error?
Hmm, maybe the powersave daemon dies because of the cpus_online bug that exports 128 directories to /sys/devices/system/cpu/cpuX.
Could you go into runlevel 3, start the powersave daemon, and check whether the process is really running (ps aux|grep powersaved)? If it is running and you still get the kpowersave message, we have a dbus/kpowersave problem. If it is not running, it dies unexpectedly; I will try to verify and fix it here if this is the case.
Hmm, just had a look at one of our Dual Core AMD64 machines: the powersave daemon is running and cpufreq (even with 128 cpus reported) is working fine (with the userspace governor).
Anyway, there is another bug report about letting kpowersave not complain about a powersave daemon that is not running. This is a nice example of why we should do it. For now you should quit kpowersave by right-clicking on its icon and answering no when asked whether it should be started next time.
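
One caveat with the check suggested above: "ps aux|grep powersaved" can also match the grep process itself. A small sketch that avoids this by comparing exact command names is shown below. count_named is a hypothetical helper name; it only filters whatever process list is piped into it.

```shell
# Sketch only: count lines on stdin that are exactly the given command name.
# Feed it one command name per line, e.g. from "ps -e -o comm=".
count_named() {
    grep -cxF -- "$1"
}

# Example:
#   ps -e -o comm= | count_named powersaved
# A result of 0 would mean no process with exactly that name was listed.
```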
I think we also need a running kpowersave even if the powersave daemon is not running. I suggest adding a checkbox to the kpowersave configuration for whether it should complain about a powersave daemon that is not running. There are also other cases where it makes sense to disable this popup. Is this OK with you, Thomas?
Installed it today and still had the problem -> Does this mean the machine still freezes? Could you check whether the userspace governor is set in /etc/sysconfig/powersave/cpufreq, i.e. CPUFREQ_CONTROL="userspace"? If not, please set it and try to start the powersave daemon again. Still freezing? There is a separate bug for the kpowersave pop-up issue: #121965
Yes, the machine freezes. I tried it again just a few minutes ago. CPUFREQ_CONTROL="userspace" is set correctly, but I still have to switch off powersaved. Here are a couple of lines in dmesg that might have to do with it:
ACPI: Looking for DSDT in initrd... not found! not found!
ACPI: CPU0 (power states: C1[C1])
ACPI: Processor [CPU1] (supports 8 throttling states)
ACPI: CPU1 (power states: C1[C1])
ACPI-0733: *** Warning: Processor Device is not present
ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x3
ACPI-0733: *** Warning: Processor Device is not present
ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x4
(In reply to comment #24)
> I think we also need a running kpowersave even if the powersave daemon is not
> running.
I don't know if this is a laptop, but if this is a desktop machine you don't need a running kpowersave without powersave. In this case there is no functionality in KPowersave.
Sorry for answering so late.
ACPI-0521: *** Warning: Error getting cpuindex for acpiid 0x3
ACPI-0733: *** Warning: Processor Device is not present
-> This should not be severe.
Could you post the acpidmp output, please? Is it possible for you to attach a serial console and capture the last messages when the kernel is dying? Most important for priority: Is it possible for you to install a recent OpenSuse 10.1? If the problem is still present there, we have to fix it! If it's an easy/uncritical fix we could still backport it to 10.0.
Please post the acpidmp output. I think I found a patch that went mainline in 2.6.15-rc5 that could solve this issue and should be safe to add to 10.0.
Does it still freeze with a recent version of OpenSuse 10.1? Could you also please attach the whole dmesg output.
Sorry, you hit my holidays again and today is my first day with access to the machine. It is normally in use now, but I will try to test 10.1 tomorrow. As far as acpidmp is concerned it would help me a lot if you could tell me where to find the information short of find / -name "acpidmp". Cheers and a happy new year. Uwe
Sorry, acpidmp is a binary writing out parts of the BIOS to stdout. Just do acpidmp >/tmp/acpidmp.txt and attach the file. Thanks.
Thanks for your help. Doesn't look good, though. Tried it on SuSE 10.0 with the following result:
acpidmp > acpidmp.txt
acpidmp: cannot map the RSDT
I will attach the file.
Created attachment 62040 [details] acpidmp from 10.0
Ooops, acpidmp cannot follow the DSDT or some other ACPI table pointer in the RSDT table ... Could you check whether you get some sane output with acpidump -t DSDT (note the additional "u": acpidmp vs. acpidump)? Could you also attach the full dmesg output, please.
If acpidump also does not work, just post /proc/acpi/dsdt. Thanks.
linux:/home/ukoehler # dmesg > dmesg.txt
linux:/home/ukoehler # acpidump -t DSDT > acpidump.txt
ACPI tables were not found. If you know location of RSD PTR table (from dmesg, etc), supply it with either --addr or -a option
linux:~ # cat /proc/acpi/dsdt > dsdt.txt
Find files attached.
Created attachment 62041 [details] dmesg SuSE 10.0
Created attachment 62042 [details] /proc/acpi/dsdt
Damn, it seems not to be the bug I hoped it would be. Do you still have a free partition for a 10.1 OpenSuse Preview installation? If it works there, we could lower the severity. Otherwise this is important to fix. Hannes: You had the only machine I know of where acpidmp fails, and you also reported the bug; I assume it is that machine that also freezes? Is it running again, so that we can reproduce the freeze here if the 10.1 or SLES/NLD 10 Previews still freeze?
Tried to install version 10.0.42 over the weekend. This crashed the bootmanager and currently I cannot boot the machine at all (haven't got the 10.0 installation DVD with me for a rescue). No further information so far. Will try on.
(In reply to comment #41)
> Tried to install version 10.0.42 over the weekend. This crashed the bootmanager
> and currently I cannot boot the machine at all (haven't got the 10.0
> installation DVD with me for a rescue). No further information so far. Will try
> on.
I had a similar problem with the bootloader. If you have any CD / DVD at all, try commenting out the gfxmenu line in /boot/grub/menu.lst so that it reads #gfxmenu (hd0,0)/message. This gave me back bootability (the machine hung in the bootloader and never got to the kernel). Will have to create a bug report for this :-)
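
As an illustration of the menu.lst change described above, here is a minimal shell sketch. comment_out_gfxmenu is a hypothetical helper name; it prints the edited file to standard output and leaves the original untouched, so the result can be reviewed before replacing anything.

```shell
# Sketch only: print FILE with any active "gfxmenu" line commented out.
# Lines that are already commented (#gfxmenu ...) are left unchanged.
comment_out_gfxmenu() {
    sed 's/^\([[:space:]]*\)gfxmenu/\1#gfxmenu/' "$1"
}

# Example (review the output before copying it over the original):
#   comment_out_gfxmenu /boot/grub/menu.lst > /tmp/menu.lst.new
```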
Many thanks for the tip. Version 10.0.42 ran for 45 minutes (idle) which is a lot longer than 10.0 (about 20s). It froze when I tried to shut the system down. Any more information you want to fix version 10.0?
This is probably a duplicate of #141238. Could you follow the last comments of Holger Köhlerschmidt there and see if it helps. Let's go on there if you think this is the problem. For 10.0 this probably will become a "Won't fix" for the system freeze. 10.1 has higher priority at the moment, let's see that we get this machine running smoothly on the latest version first. Now all begins to make sense... The shutdown problem and the "not able to use gfxmenu" entry with bootable CDs, seem to have the same base issue.
(In reply to comment #44)
> Now all begins to make sense... The shutdown problem and the "not able to use
> gfxmenu" entry with bootable CDs, seem to have the same base issue.
I don't think the gfxmenu problem is related. This is from grub, not from a CD, and I have an i386 UP machine with the gfxmenu problem. Also, it always worked and only failed with the latest release, so I think it is just a plain simple gfxmenu bug.
Bug was still set to "need info"..., please reassign if info has been provided.
> It froze when I tried to shut the system down.
This sounds like another bug? You might want to switch to a console and watch the kernel output; maybe you will see the kernel oopsing. Not sure if I already mentioned this: you should be sure to run the latest BIOS.
This bug is getting too long and confusing.
Summary:
- cpufreq/ondemand does not freeze the machine anymore -> Closing.
- The machine does not shut down -> another problem -> new bug or duplicate of #141238 (if this is an AMD Turion with ATI chipset, it probably is a duplicate).
Initial problem has been fixed with your help and by installing quiet fans on the machine. Will test with SuSE 10.1 in the near future and open a new report if necessary. Many thanks for your help.