Bugzilla – Bug 145192
SUSE 9.3 Pro and SUSE 10 not running with a Gigabyte K8NS Board
Last modified: 2007-02-16 14:22:03 UTC
This seems to be a kernel and/or hardware dependent problem. It happens for me with SUSE 9.3 Pro and also with SUSE 10.0, in both cases with 32 and 64 bit (other versions not checked). The installation runs without any problems each time, in each version I've checked. But once the installation is 'really' completed (patches installed, final configuration also completed) and the system boots for the first time, it freezes after a short time. As far as I could see, this can already happen during the boot process, regardless of whether I boot into runlevel 5 or runlevel 3. Sometimes the boot process finishes and the freeze happens after some minutes. Because this happens with various graphics boards from various vendors, I'm relatively sure it has nothing to do with the graphics problems mentioned in the release notes. After days and days of testing - also with various motherboards - I could see that the problem seems to be related to some services which are started by default after the installation process (if they are installed, of course ;-). Maybe some of these packages are unusual, but we install them to have a quick way to switch various services between our machines in case of problems. Some of the services are switched off here by default and are only switched on in the special case that we need them on this machine. I can't say why, but disabling all these services solves the problem. I'm sure that not all of these services are related to this problem, but maybe there are some dependencies between them. Here is a list of all services we switch off after installation:
  SUSE 10: canna, cpuspeed, cupsrenice, cups, irq_balancer, iscsi, isdn, isdngw, mwavem, nfs, pcmcia, powersaved, rcd, smbfs, suse-blinux, vocald, xend
  SUSE 9.3 Pro: canna, cpuspeed, cups, irq_balancer, isdn, isdngw, mwavem, nfs, pcmcia, powersaved, rcd, smbfs, suse-blinux, vocald, xend
If these services are off, the machine works with no problem, regardless of whether it is 32 or 64 bit and regardless of whether it is SUSE 10 or SUSE 9.3. Unfortunately, when the machine freezes, I could not find any information in the logfiles. The only way to reactivate the machine was the reset switch. Please let me know if you need more information.
Definitely no YaST bug -> HW or kernel problem. @sf: any ideas?
Does this happen only on K8NS? Can you please find out which service fails (irq_balancer probably not, it's never started on UP systems)? Any error message would be appreciated. Best would be to get the kernel messages via serial console.
Sorry for the delay, it was a bit time consuming to check this, and at the moment I can only say something about 10.0. As far as I could see, powersaved seems to be the problem. I've disabled (insserv -r 'service') all the candidates on the list (except those, of course, which are not part of the distribution - pcmcia, rcd and suse-blinux - drag and drop is a nice thing, but sometimes ... ;-) and then, step by step, enabled (insserv 'service') only one service from the list and rebooted the machine. The machine was in a clean state each time, immediately after install, with no patches installed. As said before, the machine simply freezes, without any notification in the logfiles. (Watching the kernel messages via serial console was impossible at the moment.) But this sounds plausible to me in this context: powersaved lets the machine go to sleep, but the machine never wakes up.
Perhaps this could help too: when the machine freezes, the hard disk activity LED is permanently on.
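The one-service-at-a-time procedure described above can be sketched as a dry-run helper. This is only an illustration: the function name print_bisect_plan is hypothetical, and it merely prints the insserv commands (as quoted in the comment) instead of running them, so the plan can be reviewed before each reboot.

```shell
#!/bin/sh
# print_bisect_plan CANDIDATE SERVICE...
# Hypothetical dry-run helper: prints the commands that would leave only
# CANDIDATE enabled among the listed services. Nothing is executed.
print_bisect_plan() {
    candidate=$1
    shift
    for svc in "$@"; do
        if [ "$svc" = "$candidate" ]; then
            # The one service under test stays enabled.
            echo "insserv $svc"
        else
            # Every other suspect is removed from the runlevels.
            echo "insserv -r $svc"
        fi
    done
    echo "reboot"
}
```

For example, print_bisect_plan powersaved cups powersaved nfs lists the commands for a test run in which only powersaved is enabled.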
Seems to be a powersave problem. Thomas, could you please take a look?
First check whether a laptop ACPI module has been wrongly detected for that machine: cat /var/lib/acpi/laptop_modules must be empty (Gigabyte K8NS is a workstation, right?). When trying the stuff below, please always add the boot param sysreq=1 (or add it in the grub menu). If you get to the freeze (this only works when the machine is not totally dead -> the NUM lock key should still show keyboard LED activity), hit the key combination Sysreq-T and try to write down the last called functions, most importantly the one shown near an EIP line. I expect cpufreq or an ACPI module. The disk LED looks like ACPI. Please try:
  - Use boot param "1" to boot into init 1 (machine is still running?).
  - Then set CPUFREQD_MODULE="off" in /etc/sysconfig/powersave/cpufreq.
  - Enter: "init 5" -> Is the machine still freezing? Then it's not cpufreq.
If it is still freezing please try:
  - Set ACPI_MODULES="NONE" in /etc/sysconfig/powersave/common
  - Throw out the "processor thermal fan" modules in INITRD_MODULES="..." in /etc/sysconfig/kernel
  - Invoke "mkinitrd" (this will build a new initrd, but not include the fan, processor and thermal modules, so they won't get loaded early on the next reboot)
  - Boot into runlevel 5
The machine should not freeze anymore, even when powersaved runs? If yes, let's identify which module causes the freeze:
    for x in fan ac processor thermal button; do
        echo "Module $x freezing machine?"
        modprobe $x
        echo "No."
    done
It might still not freeze; then it might be an ACPI read/write access in /proc/acpi/*/*. But this should be enough for now ...
Ahh, first search for the newest BIOS. ACPI problems often get solved by a new BIOS!
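The two sysconfig changes requested above (CPUFREQD_MODULE="off" and ACPI_MODULES="NONE") are plain KEY="value" assignments, so they could be applied with a small helper like the one below. This is a minimal sketch: the function name set_sysconfig_var is made up for illustration, and it assumes simple values that contain no "|", "&" or backslash characters.

```shell
#!/bin/sh
# set_sysconfig_var FILE KEY VALUE
# Hypothetical helper: rewrites the line KEY="..." in FILE to KEY="VALUE",
# or appends the assignment if FILE has no such line yet.
set_sysconfig_var() {
    file=$1
    key=$2
    value=$3
    tmp=$(mktemp) || return 1
    if grep -q "^${key}=" "$file"; then
        # Replace the existing assignment, leaving all other lines alone.
        sed "s|^${key}=.*|${key}=\"${value}\"|" "$file" > "$tmp"
    else
        # No assignment present: keep the file and append a new line.
        cat "$file" > "$tmp"
        printf '%s="%s"\n' "$key" "$value" >> "$tmp"
    fi
    # Write back through the original path so its permissions are kept.
    cat "$tmp" > "$file" && rm -f "$tmp"
}
```

Run as root, a call such as set_sysconfig_var /etc/sysconfig/powersave/cpufreq CPUFREQD_MODULE off would then correspond to the second step of the first list.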
> First check whether a laptop ACPI module has been wrongly detected for that
> machine: cat /var/lib/acpi/laptop_modules must be empty (Gigabyte K8NS is a
> workstation, right?).
Yes to both (laptop_modules is empty, K8NS is a workstation board).
> - Then set CPUFREQD_MODULE="off" in /etc/sysconfig/powersave/cpufreq.
I've done this directly during the install process, at the point where I can pause before the first boot. This seems to be the solution. I've installed all current patches during the normal install process. The machine is running, in a state where nothing differs from your normal install suggestions except the partitioning, the software selection and the above (and the network dependent settings, of course). In this special case it seems that the BIOS release has nothing to do with the problem. On the other hand, the newest BIOS fixes some ACPI problems; some of the wakeup functions are now working.
Sorry, the above was not the solution, but at least a part of the problem. A freshly installed machine is freezing again, but not so fast. After disabling cpufreqd and powersaved the machine runs with no problems. I'll try the second step (changing ACPI_MODULES to "NONE"...), but for a clean environment I have to install the machine again. This needs (hopefully) 2 or 3 days.
After 2 or 3 days of running without problems, I'm hopeful the problem is solved. The 3 things together (CPUFREQD_MODULE="off" in /etc/sysconfig/powersave/cpufreq, changing ACPI_MODULES to "NONE" in /etc/sysconfig/powersave/common, and rebuilding the initrd after removing the fan, thermal and processor modules) seem to be the solution.
Sorry again, the previous was not the solution. Exactly the same thing as above happened today. Now I've done all the things suggested in comment 6 step by step, without touching the machine from boot until freeze (except to check whether it is still alive). The part of comment 6 that checks whether one of the initrd modules is the problem was not done, because the machine also freezes without any of these modules loaded. As far as I could see, disabling the services completely (insserv -r) solves the problem. On the other machines I've never changed /etc/sysconfig/powersave/cpufreq, /etc/sysconfig/kernel and /etc/sysconfig/powersave/common, and these machines are running without problems.
I've tried it again - if I do all the things suggested in comment 6, the machine freezes after some time (from some hours up to some days). If I simply prevent the execution of powersaved completely (insserv -r), the machine works. It looks to me as if at least one of the problems (or the first of them) is powersaved directly. For now I don't know how to continue.
I've tried it again. If I simply disable powersaved and cpuspeed completely (insserv -r), then the machine runs without any problems. I did this immediately after registering the last crash (on 03/13/06) and the machine has been running until now. The problem seems to be a combination of cpuspeed and powersaved directly because - following the suggestions in comment 6 - no other modules are loaded.
You must not run cpuspeed and the powersave daemon together! Have you done that? Then please try again with only the powersave daemon running. This is the preferred daemon to control the CPU frequency.
I'll try this immediately. The daemons canna, cupsrenice, cups, irq_balancer, iscsi, isdn, isdngw, mwavem, nfs, smbfs, vocald, xend and powersaved are active (insserv), regardless of whether they can start (isdn, for example, cannot start because there is no ISDN hardware). For powersaved the suggested settings from comment 6 are used. Please be patient for 1 week for the result.
After 1 week of problem-free running, I'm hopeful enough to say that the powersaved daemon itself does not seem to be the problem. But now I'm not sure how to continue. Currently 3 changes are in use:
  - in /etc/sysconfig/powersave/cpufreq, CPUFREQD_MODULE="off" is set,
  - in /etc/sysconfig/powersave/common, ACPI_MODULES="NONE" is set, and
  - in /etc/sysconfig/kernel, the modules "processor thermal fan" are removed from INITRD_MODULES. A new initrd without these modules is used.
What should be the next steps to track down the problem?
Sorry for the delay, you must reassign the bug back to me to get rid of the "need info" state -> yes, it's stupid ... It is probably cpufreq. If you do not start powersaved and do:
    modprobe powernow-k8
    modprobe cpufreq_ondemand
    echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
and go on working (switch between high and low load on the machine so that CPU frequencies are switched), you can monitor that with:
    watch -n1 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_*
Does the machine also freeze then? (Have you already updated to the latest BIOS for this mainboard? If not, you should do that first!)
The same procedure as every ... ;-) To be more precise - the machine is completely reinstalled, all current patches are applied, cpuspeed and powersaved are not started (disabled via insserv -r and rebooted), no other changes. The first freeze came immediately after entering the above (comment 17) lines. The second try ran for approx. 4 hours. The BIOS is the latest (F18).
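Because the machine may freeze without anything reaching the logfiles, it can help to print each scaling file together with its name, so the last values seen are unambiguous. The sketch below is one possible way to do that; the function name show_scaling is hypothetical, and the default directory is the cpu0 path used in the comment above.

```shell
#!/bin/sh
# show_scaling [DIR]
# Hypothetical helper: prints "name: value" for every readable scaling_*
# file in DIR (default: the cpu0 cpufreq directory).
show_scaling() {
    dir=${1:-/sys/devices/system/cpu/cpu0/cpufreq}
    for f in "$dir"/scaling_*; do
        # Skip files that are missing or not readable by this user.
        [ -r "$f" ] || continue
        printf '%s: %s\n' "${f##*/}" "$(cat "$f")"
    done
}
```

Calling show_scaling in a loop with a one-second sleep, over a serial console or an ssh session, gives a record of the last governor and frequency values before a freeze.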
Can you retry with CPUFREQ_CONTROL="userspace" in /etc/sysconfig/powersave/cpufreq and start powersaved? You should be able to reproduce this faster by alternately loading and idling the machine (let cpufreq switch, go up and down with the load). If the userspace governor works, this should be a sufficient workaround for 10.0, as you then still have all functionality; in that case please provide the output of all files in:
    /sys/devices/system/cpu/cpu0/cpufreq/ondemand/*
Is it possible for you to install the latest 10.1 RCx to check whether the problem is still there?
The same as the previous try. The machine freezes (at least in this special case, I've tried this only once) immediately after/during the CPU frequency changes. (I've set up a cron job to produce varying load, and the machine freezes immediately after the job starts.) I'll try to install 10.1.
So, 10.1 RC3 / 32 bit is running: without powersaved there is no problem, with powersaved it is the same as with 10.0 - the machine simply freezes.
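Alternating between load and idle, as suggested above, can be scripted so that the frequency switches happen on a predictable schedule. The following is only a rough sketch: the function names busy_for and load_idle_cycle and the durations are arbitrary choices, and the busy loop is a crude stand-in for a real workload.

```shell
#!/bin/sh
# busy_for SECONDS: keep one CPU busy for roughly SECONDS seconds.
busy_for() {
    end=$(( $(date +%s) + $1 ))
    while [ "$(date +%s)" -lt "$end" ]; do
        :   # spin; the repeated date calls are the load
    done
}

# load_idle_cycle ROUNDS BUSY IDLE
# Hypothetical helper: ROUNDS times, spin for BUSY seconds and then
# sleep for IDLE seconds, so the governor has reason to step up and down.
load_idle_cycle() {
    rounds=$1
    busy=$2
    idle=$3
    i=0
    while [ "$i" -lt "$rounds" ]; do
        busy_for "$busy"
        sleep "$idle"
        i=$((i + 1))
    done
}
```

Something like load_idle_cycle 20 10 10 would give twenty transitions in each direction within a few minutes.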
Sorry for the late answer. This all sounds like a BIOS issue, ACPI related. Can you please attach acpidump output and dmesg output when the powernow-k8 module has been loaded. If powersaved is not activated you can just modprobe powernow-k8, then rmmod powernow-k8, the machine should not freeze as no stepping should be performed. Now there should be something in dmesg regarding cpufreq/powernow. Can you send the whole dmesg output, please. Maybe this comes out to something like #189488. However I doubt it's exactly the same problem because we are talking about a Gigabyte mainboard here and the other one is an MSI.
Maybe you are right. Recently I ran many tests with different releases of different Linux distributions (SUSE, Fedora, Debian, Knoppix). The tests are not finished yet, but this much can be said at the moment: SUSE 9.0, Fedora Core 1 and Debian Sarge (Debian with kernel 2.4 AND with kernel 2.6) don't have these problems. With Fedora Core 5 and Knoppix 5 I got exactly the same problems I have with SUSE 9.3 and 10, and the same solution (preventing the above listed packages from starting). The only distribution without any problems is Debian, because in Debian all these packages are not installed by default. I've not done the cross-check (installing the 'problem packages' in Debian) so far. Another phenomenon, probably related to ACPI, that I found during my tests - the BIOS data is destroyed by Linux. I'll create a separate bug report for this. BTW, you mentioned bug #189488, but I have no access to this bug.
> Another phenomenon probably regarding acpi I found during my tests too - the
> bios data is destroyed from linux
Does that mean acpidump does not work? If that is the case you could try with
    acpidump --addr 0xXY --length 10000 >acpidump
You should be able to read out the address from dmesg|less. At the beginning, the addresses of the different ACPI tables are listed (the last value):
    ACPI: RSDP (v000 VIAK8T ) @ 0x00000000000f77c0
    ACPI: RSDT (v001 VIAK8T AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fee3040
    ACPI: FADT (v001 VIAK8T AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fee30c0
    ACPI: SSDT (v001 PTLTD POWERNOW 0x00000001 LTP 0x00000001) @ 0x000000007fee8300
    ACPI: MADT (v001 VIAK8T AWRDACPI 0x42302e31 AWRD 0x00000000) @ 0x000000007fee8240
    ACPI: DSDT (v001 VIAK8T AWRDACPI 0x00001000 MSFT 0x0100000e) @ 0x0000000000000000
Try the DSDT address. If you have no luck with that, try another address with a large length value (e.g. 100000); acpidump should get the table as soon as it finds the "DSDT" magic value of the DSDT table in the specified memory range. This can probably only be fixed with the acpidump output. Please also provide the dmesg output with the powernow-k8 module loaded, as described in comment #22.
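Reading the table address out of dmesg by hand is error-prone, so the lookup could be scripted. The sketch below assumes the line format shown in this comment ("ACPI: NAME (...) @ 0xADDRESS", with the address as the last field); other kernels may print these lines differently, and the function name acpi_table_addr is made up for illustration.

```shell
#!/bin/sh
# acpi_table_addr NAME
# Hypothetical helper: reads dmesg-style text on stdin and prints the last
# field of the first "ACPI: NAME ..." line, which in the format shown above
# is the table address.
acpi_table_addr() {
    awk -v name="$1" '
        {
            for (i = 1; i < NF; i++) {
                # Scanning for "ACPI:" also tolerates a leading prefix field.
                if ($i == "ACPI:" && $(i + 1) == name) {
                    print $NF
                    exit
                }
            }
        }'
}
```

Usage would be along the lines of dmesg | acpi_table_addr DSDT, with the printed value then passed to acpidump as described in the comment.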
No, acpidump works - as far as I could see after a short look - but please be patient for some days. At the moment I have really big problems with the timed wakeup feature of the BIOS and have to manage 6 or 7 different installations on each test machine. This is very important for us.
Regarding acpidump - if I'm right (please see bug #193369), the BIOS data values are incorrect after the operating system has been running at least once. So I think the information extracted via acpidump could be incorrect. Is there another way to get this information? I've asked the motherboard manufacturer for a tool like acpidump, but system independent. I hope for an answer soon.
Why do you think the table is corrupted? It may simply be a modification in the ACPI interpreter which your BIOS does not like. Can you post the acpidump and dmesg output, please? Then we could work in parallel. Can you also be a bit more specific about what you want to do with the alarm? Are you trying to make use of /proc/acpi/alarm? Both are based on ACPI, and both problems could have one common source.
I'm sick at the moment, so please be patient for some days.
No reaction on this bug for a long time. Closing as CANTFIX.