Bugzilla – Bug 299891
Crash in early stage install
Last modified: 2007-09-28 14:11:37 UTC
I am trying to install 10.3beta1 on a dual Opteron system presently running 10.3alpha7. When booting from the burned DVD, the Welcome splash screen appears, and after that the boot menu screen. When I choose to install the beta release (and press Esc to see the messages on the console), the install procedure gets as far as this:
--------------------------------------------------------------
Loading basic drivers.........................OK
Starting hardware detection ...........OK
(If...................brokenmodules=driver_name)
---------------------------------------------------------------
After that, the system usually powers off instantly, or sometimes hangs after the messages
----------------------------------------------------------------
Micro Star International CK804 IDE drivers pata, amd74xx, generic loading pata
----------------------------------------------------------------
When I choose Boot from harddisk, 10.3alpha7 boots OK.
When I choose to do a memory test, the test completes OK.
When I choose to do a firmware test, the system hangs as stated before.
System is MSI K8N Master2FAR, two single core Opterons, 2 GB mem, 200 GB Maxtor IDE disk, Matrox Parhelia graphics card, Hauppauge WinTV 150 tv-card, Plextor PX 740A dvd writer.
Checksum of the DVD download is OK. I burned the DVD twice, the last time on the 10.2 OS and at the lowest speed. Same result.
See also my post on the Open Suse Beta forum in Suseforums.net
Please add the output of "lspci -nn" to Bug 299010. *** This bug has been marked as a duplicate of bug 299010 ***
I've added lspci -nn to bug 299010. The behaviour signalled by the original poster of 299010 is somewhat different from my findings: in my case, the crash appears in a much earlier state. But that's just my 2 cents.
Jogchum. Please post '/var/log/boot.msg' and 'hwinfo --all' from working installation. Also, after 10.3b1 installation system is fully loaded...
* Press "ctrl-alt-f9". It will give you a command console.
* mount a usb stick or hard disk partition to /mnt
* "cp /var/log/boot.msg /mnt", "dmesg > /mnt/dmesg.log", "hwinfo --all > /mnt/hwinfo.log"
* Post the results here.
Thanks.
Created attachment 157786 [details] The boot.msg created by 10.3alpha7
Created attachment 157787 [details] hwinfo created by 10.3.alpha7
I'm afraid I'm unable to fulfill the last request you made:
> after 10.3b1 installation system is fully loaded...
> * Press "ctrl-alt-f9". It will give you a command console.
When the 10.3b1 installation system is fully loaded, I have only a few seconds (less than 5) before the system crashes, and in that time either I get no response to 'ctrl-alt-F9', or I do get a prompt, namely /# but then the keyboard is as dead as a doornail (even numlock etc. don't work).
I'm giving up for now, and I have only tomorrow night before I go on a three-week holiday.
regards, jogchum
Console F4 logs kernel messages. Taking a photo of the console when the machine locks up should give us some clue. Thanks.
Created attachment 158046 [details] Photo of console 4
After a few times trying, going to console 4 succeeded! I've made a - hopefully readable - photo of the screen. See the attachment. I won't be able to react for the next three weeks. regards, jogchum
Thanks. Hmmm... Weird. pata_amd hasn't been changed between a6 and b1. Please ping me back when you come back from your vacation. I'll prep debug kernels. Thanks.
Jogchum, did you have a chance to look into this bug on Beta1 or Beta2? Since there hasn't been any progress on this bug, I'm lowering severity to crit.
Back from holiday; tried beta2, behaviour is the same... Tejun, something I can do further, perhaps with debug kernels you planned to prepare? regards, Jogchum
Tried beta3, same result... regards, Jogchum
info provided
Can you please try the live CD of beta3plus from: http://ftp.opensuse.org/pub/opensuse/distribution/10.3-Beta3plus/iso/cd/ This has several kernel fixes in this area
Created attachment 164149 [details] Screenshot just before entering non-responding state; live CD 10.3beta3plus
Crashes also (powers the system off), but at a seemingly later stage. On one try, however, the system did not power off but went into a non-responding state; numlock etc. didn't react either.
I made a number of photographs, two of which I'll add. The first one (img_3090.jpg) shows the last screen before the non-responding state. The second one (img_3092.jpg) is from another start-up, one which led to a power-off. The shot was taken just before the power-off, so it includes the last message line on the console before power-off.
I noted that this CD is an i386 release, not x86-64. But I assume you noted that too.
regards, Jogchum
Created attachment 164150 [details] Screenshot just before power-off; 10.3beta3plus live CD
I didn't notice the arch, no. But interesting that it doesn't matter to your computer. Adrian also noted sudden power offs and interestingly also has a matrox graphics card. They seem to dislike our X server or something.
I noted that in the screenshots I uploaded a few days ago (img_3090.jpg and img_3092.jpg) acpi was called just before the anomalies (power-off, hung system) appeared. So I installed with boot option acpi=off, and now the system installs properly. Two o'clock in the night now, going to bed... regards, Jogchum
So, with ACPI turned off, harddisks are detected and work properly, right? cc'ing Thomas for ACPI.
Yes, correct. I have 10.3beta3 installed now.
Funny thing is, now I have 10.3beta3 installed, I don't have to give acpi=off when booting the installed system - it runs fine...
Totally OT on this bug, but I gave the 386 live release from beta3 a go on my Acer 1710 laptop (which runs 10.0 at the moment). It seemed to try to run the X-system (runlevel 5 was entered), but without success: after a few flickerings on a blackish screen, it gave the login prompt on console 1. It's our 'production' machine, so I can't do too much testing on it. But as said, totally OT here.
regards, Jogchum
> Adrian also noted sudden power offs and interestingly also has a matrox
> graphics card. They seem to dislike our X server or something.
This is a Matrox Parhelia, which is not supported by the mga driver. So it's very unlikely that it is related to the graphics driver. With fbdev driver in use it would probably happen with any other graphics card.
>Funny thing is, now I have 10.3beta3 installed, I don't have to give acpi=off
>when booting the installed system - it runs fine...
Ok, then I close this one as fixed, please reopen if you should see further problems.
>Totally OT
Yeah, very Off-Topic. Better open another bug report if this should get addressed.
Honestly, I doubt if this bug should be considered as 'fixed'. In the install phase the bug is still there. Now that it is clear that acpi=off is a workaround for the installation process, the bug should be better traceable, though. Just my 2 cents...
regards, Jogchum
PS I will open a bug report for the problem with the Acer I noticed (provided this bug has not been reported yet, of course).
Jogchum, if installation of b3 doesn't work w/o acpi=off, please reopen the bug.
Indeed, the installation problem is as stated in the original posting: installation works only with acpi=off. I am reopening the bug.
Yes sorry, the bug should not have been closed. Investigating...
Jogchum: If you pass the acpi=off boot parameter when you boot the install system, the parameter should have been added by yast. Can you check in /boot/grub/menu.lst whether your first/default boot entry has an acpi=off parameter added? If yes, you can remove it and you should run into this problem again (you can add the parameter again later to get a working system...).
This bug has been declared as a duplicate of bug #299010. There a lot of people were involved and added in CC. Were they dropped by marking this as a duplicate? I already asked because I also saw this in another bug some time ago and I thought this got fixed; shouldn't all the CC'ed people from bug #299010 also be CC'ed here?
Yes, it is present in the default boot entry; and on removal the system reboots (not: powers off! at least not on the one try I gave it); when I give acpi=off on the boot prompt the system starts up normally. regards, Jogchum
I doubt if 299010 is still seen as a duplicate of this bug; see comment #27 from Tejun Heo on 299010. Behaviour is quite different with me, and my controller is different too. regrs, Jogchum
Yep. I tried to reproduce this on a MSI Platinum, but this seems to work fine now.
Could you try (instead of acpi=off):
pci=noacpi
noapic
nolapic
possibly also could help:
pci=nommconf
pci=nomsi
It's enough to find the first one working; in case pci=noacpi works you should also try the noapic or nolapic boot param.
Hmm, maybe I am still too disk/irq oriented (the disk is found correctly now?); the fact that the machine powers off looks more like a bug that is not irq related.
Maybe we should start with this: does this boot parameter work (without acpi=off):
init=/bin/bash
If yes and the disk is found, I am on the wrong track and the above boot parameters probably do not need to be tested. I'd suggest not loading any acpi modules then. Kay showed me how to do this via udev rules, but I forgot again... Kai could you please tell us.
None of
pci=noacpi
noapic (I tried also pci=noapic: not sure if I understood you right here)
nolapic (also pci=nolapic)
pci=nommconf
pci=nomsi
works: in all cases power-off.
init=/bin/bash works, the root device is mounted, and the /home partition (these are the only two partitions on the disk, apart from swap of course) is mountable.
What strikes me is that power-off is always seen just or almost just after the message
Loading CPUFreq modules
Could that give a clue maybe?
regards, Jogchum
> Loading CPUFreq modules
> Could that give a clue maybe?
Definitely. I could have seen this before, but was confused by the disk related duplicates and by the fact that Tejun was assigned to this one -> taking over.
Still not sure whether it's cpufreq, possibly other ACPI accesses. Can you try to boot with the CPUFREQ=off boot parameter. If this does not work it's probably some other ACPI module.
If this works, do:
- rmmod battery
- rmmod thermal
- echo 0x21F >/sys/module/acpi/parameters/debug_level
Do (always wait some secs, to be sure the step just done is not the offender):
- modprobe processor
- modprobe powernow-k8
- modprobe cpufreq_ondemand
- echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
Please extract the logged messages in /var/log/messages by time/date after the machine hung and got rebooted and attach them.
Is this an i386/32 bit or x86_64/64 bit installed system? If it is an i386 installed system, nohz=disable could help?
I'll run the tests when I'm home from work. System and installation is x86_64/64 bit. regards, Jogchum
With CPUFREQ=off the system starts up correctly. None of the commands halts or hangs the system, but see the attachment for the responses to the commands, and the (few) relevant lines in /var/log/messages regards, Jogchum
Created attachment 173419 [details] System responses from the tests and relevant lines /var/log/messages
Can you please attach acpidump output.
It could also be useful if you boot with CPUFREQ=off (this one should be your preferred boot parameter for now and provide you most functionality -> all but cpufreq).
Copy this code into a file and do a chmod 755 on the file and execute it:
---------------------------------------
#!/bin/bash
rmmod battery
rmmod thermal
# (should not be loaded anyway)
logger XXXXXXXXXXX
echo 0x21F >/sys/modules/acpi/paramters/debug_level
modprobe processor
echo 0x3 >/sys/modules/acpi/paramters/debug_level
logger YYYYYYYYYYYY
---------------------------------------
Can you attach the output of /var/log/messages between XXXXXXXX and YYYYYYY, pls.
Hmmm, before doing any of this, you should look out for a BIOS update, this again looks like a BIOS issue.
Created attachment 173480 [details] The acpidump
Apparently the requested modules are not present, as in the previous tests:
----------------------------------
# ./test2results.sh
ERROR: Module battery does not exist in /proc/modules
ERROR: Module thermal does not exist in /proc/modules
./test2results.sh: line 6: /sys/modules/acpi/paramters/debug_level: No such file or directory
FATAL: Error inserting processor (/lib/modules/2.6.22.5-16-default/kernel/drivers/acpi/processor.ko): No such device
./test2results.sh: line 8: /sys/modules/acpi/paramters/debug_level: No such file or directory
------------------------------------
There is nothing between
Sep 20 00:26:16 souder-exp jogchum: XXXXXXXXXXX
Sep 20 00:26:16 souder-exp jogchum: YYYYYYYYYYYY
in /var/log/messages
There does not seem to be a BIOS-upgrade for the K8N Master2-FAR on the MSI-site.
regards, Jogchum
Sorry I mis-spelled the paramters, should be parameters, but there should be no need, I expect acpidump is enough.
This is a bug in ACPICA:
Name (_PSS, Package (0x02)
{
    Package (0x06) { 0x0708, 0x0000D6D8, 0x64, 0x09, 0xE020298A, 0x018A },
    Package (0x06) { 0x03E8, 0x00002EE0, 0x64, 0x09, 0xE0202C82, 0x0482 },
    Package (0x06) { 0xFFFF, 0xFFFFFFFF, 0xFF, 0xFF, 0xFFFFFFFF, 0x03FF },
    Package (0x06) { 0xFFFF, 0xFFFFFFFF, 0xFF, 0xFF, 0xFFFFFFFF, 0x03FF },
    Package (0x06) { 0xFFFF, 0xFFFFFFFF, 0xFF, 0xFF, 0xFFFFFFFF, 0x03FF },
    Package (0x06) { 0xFFFF, 0xFFFFFFFF, 0xFF, 0xFF, 0xFFFFFFFF, 0x03FF }
}
I had a similar bug on a machine that had no valid package/AML information inside of a package, but was filled up with zeros. I wonder why it does not work. It was bug #189488 (getting interesting at comment #11). Whether zeros or not, the contents after Package (0x02) should not get evaluated.
Adding Alexej, AFAIK he got my patch forwarded from Bob (and even signed-off?) and it got slightly modified after running in their test suites...
It could be that the parser in the first cycle ignores the declared number of packages and already generates meta-info for the other packages, which later leads to difficulties...
BIOS developers like to fill up CPU frequency data so they do not need to allocate space for different CPUs dynamically. On some machines the info is filled up with the same frequency information, which is then ignored by the kernel; on some they cut it by the package size definition.
It's hard to predict the risk of this, but I expect if we can come up with a safe patch that ignores everything after the second package (the declared number of package elements) it should go in.
I still have another important bug: Alexey, could you already have a look at this, pls (you might want to take over if you have time...).
Len, now it would be convenient if we could work on the latest ACPICA sources...
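The rule that only the declared number of _PSS entries counts can be illustrated with a small sketch. It assumes the first field of each entry is the core frequency in MHz (as in the ACPI _PSS layout) and uses a hypothetical helper name, pss_valid_freqs. It is not the ACPICA parser, only the truncation rule described in this comment.

```shell
#!/bin/bash
# Hypothetical helper: print the core frequency (decimal MHz) of the first
# COUNT _PSS entries and ignore everything after them, however it is filled.
# Usage: pss_valid_freqs COUNT FREQ_HEX...
pss_valid_freqs() {
    local count=$1 i=0 freq
    shift
    for freq in "$@"; do
        # Stop as soon as the declared package size is reached.
        [ "$i" -ge "$count" ] && break
        # Bash arithmetic converts the hex field to decimal.
        echo $(( freq ))
        i=$(( i + 1 ))
    done
}
```

Called with the first field of each entry in the table above, `pss_valid_freqs 2 0x0708 0x03E8 0xFFFF 0xFFFF 0xFFFF 0xFFFF` prints 1800 and 1000; the four 0xFFFF fillers are never looked at.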
Created attachment 173544 [details] Fixed and recompiled DSDT
For verification whether it's really that, could you copy this attachment to e.g. /etc/DSDT.aml (the filename must stay the same), modify
ACPI_DSDT=""
to
ACPI_DSDT="/etc/DSDT.aml"
in /etc/sysconfig/kernel and invoke:
mkinitrd
Then reboot without acpi=off and without CPUFREQ=no
Does it boot?
If you do:
powersave -c
Do you get DYNAMIC as output?
If this gets fixed you should set the entry back to: ACPI_DSDT="" and invoke mkinitrd again, so that the DSDT does not get added to initrd and does not get overridden by the kernel at early boot. The problem is that this table is generated by the BIOS depending on your hardware; if you e.g. add more memory you must not use this table anymore!
It still powers off after I took the steps you described :-( regards, Jogchum
Created attachment 173793 [details] Result of the mkinitrd command
I forgot to add the output of the mkinitrd command. One has to issue it from /, because otherwise sbin/update-bootloader is not found. Is it on purpose that this command has a relative instead of an absolute path? But the output of mkinitrd seems OK to me.
The output of
----------------------------------------
powersave -c
liblazy (liblazy_dbus_send_method_call:97): Received error reply: Method "GetCPUFreqGovernor" with signature "" on interface "org.freedesktop.Hal.Device.CPUFreq" doesn't exist
Could not get current CPUFreq policy.
-----------------------------------------
Logical, because CPUFREQ is set off, I think?
regards, Jogchum
You can check with dmesg |less -> There must be a sentence like DSDT overridden by initrd or similar; you could grep for DSDT or initrd (possibly initramfs instead of initrd). If this line is there, everything should work fine (also without CPUFREQ=off or other workaround boot parameters).
> sbin/update-bootloader is not found
-> strange, works here. AFAIK work is still being done on this package.
> powersave -c
> liblazy (liblazy_dbus_send_method_call:97): Received error reply
What the...
You can test whether this directory exists instead:
ls /sys/devices/system/cpu/cpu0/cpufreq
If it exists you can watch the frequencies switching with:
watch -n1 cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq
> Logical, because CPUFREQ is set off, I think?
Be sure CPUFREQ=off is not passed. At runtime you can see the boot parameters here:
cat /proc/cmdline
you can modify them in /boot/grub/menu.lst.
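Checking whether a given parameter is really on the kernel command line can be scripted. This is a minimal sketch; has_boot_param is a hypothetical helper name, and it assumes the parameters are separated by single spaces, as they normally are in /proc/cmdline.

```shell
#!/bin/bash
# Hypothetical helper: succeed if the command line in $1 contains the exact
# word $2 (so acpi=off does not match acpi=offline).
has_boot_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

On the running system it could be used as `has_boot_param "$(cat /proc/cmdline)" CPUFREQ=off && echo "CPUFREQ=off is still passed"`. The same check works on a kernel line copied from /boot/grub/menu.lst.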
This line is in the dmesg output:
ACPI: Override [DSDT-AWRDACPI] from initramfs - tainting kernel
But still no boot without CPUFREQ=off.
---------------------------------------------------
Regarding
> Be sure CPUFREQ=off is not passed. At runtime you can see the boot parameters
> here:
> cat /proc/cmdline
> you can modify them in /boot/grub/menu.lst.
: without CPUFREQ=off the system does not boot, so....
-----------------------------------------------------
souder-exp:~ # ls /sys/devices/system/cpu/cpu0/cpufreq
ls: cannot access /sys/devices/system/cpu/cpu0/cpufreq: No such file or directory
souder-exp:~ #
regards, Jogchum
That is strange, if the processor module cannot be loaded:
> FATAL: Error inserting processor
> (/lib/modules/2.6.22.5-16-default/kernel/drivers/acpi/processor.ko): No such
> device
powernow-k8 should also not load and cpufreq should be disabled anyway. I wonder why it boots with CPUFREQ=no, but not without.
Can you try things from comment #38 (with CPUFREQ=no):
Copy this code into a file and do a chmod 755 on the file and execute it (hope I got the filenames right now):
---------------------------------------
#!/bin/bash
logger XXXXXXXXXXX
echo 0x21F >/sys/modules/acpi/parameters/debug_level
modprobe processor
echo 0x3 >/sys/modules/acpi/parameters/debug_level
logger YYYYYYYYYYYY
---------------------------------------
Can you attach the output of /var/log/messages between XXXXXXXX and YYYYYYY, pls.
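Cutting the part between the two logger markers out of /var/log/messages could look roughly like this. It is only a sketch: between_markers is a hypothetical helper name, and it returns the lines after the first start marker up to the next end marker, so an older run of the script in the same log would be picked up first.

```shell
#!/bin/bash
# Hypothetical helper: print the lines of log file $1 that lie strictly
# between the first line containing marker $2 and the next line containing
# marker $3. Markers are matched as plain substrings, not as regexes.
between_markers() {
    awk -v start="$2" -v end="$3" '
        index($0, end) && inside { exit }        # end marker reached: stop
        inside                   { print }       # line between the markers
        index($0, start)         { inside = 1 }  # start marker: begin after it
    ' "$1"
}
```

For the script above this would be `between_markers /var/log/messages XXXXXXXXXXX YYYYYYYYYYYY`; empty output means nothing was logged between the two markers.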
I'm a few days away from home and won't be able to test it until tomorrow night. I'm sorry. regards, Jogchum
Actually, it is not
/sys/modules/acpi/parameters/debug_level
but
/sys/module/acpi/parameters/debug_level
But there is nothing written to /var/log/messages between the XXXXXXXXXXX and YYYYYYYYYYYY lines (which themselves are written indeed).
regards, Jogchum
Forgot to tell, the script did not give any (error) messages, and lsmod | grep processor gives
processor 59592 1 thermal
so modprobe apparently does its work, only there's no logging.
#comment 32 states that the processor module could not be loaded (therefore powernow-k8, which needs the processor module, could not be loaded). I expect a boot param like acpi=off has been used accidentally there.
About comment #42: The problem is that our userspace acpica sources are *really* old and the problem I mentioned in that comment should already be fixed. But Intel has not published the fixed sources; we still have to use the code from more than a year ago, which makes this one very hard to debug (-> Len, we have to talk about this again privately, I hope there wasn't some kind of policy change at Intel about this...).
Bug #297119 has a very similar CPUFreq states declaration: it declares a package of size 2. In it there are 2 valid packages and 4 invalid (0xFFFFF...) packages; the latter ones have to be ignored totally. There the invalid packages are exported to the cpufreq layer (!ACPI bug!) and the powernow-k8 module seems to have a sanity check on the exported values and ignores them:
invalid freq entries 3900000 kHz vs. 65535000 kHz.
It's too late now to fix the ACPI parser, so I am trying to find a sanity check for powernow-k8: There is a differentiation depending on a processor capability flag; the variable used is cpu_family and can be CPU_OPTERON and CPU_HW_PSTATE. In bug 297119 we have a non-Opteron, here it is an Opteron, and I therefore expect that fill_powernow_table_pstate (-> non-Opteron case) and fill_powernow_table_fidvid (-> Opteron case) are used to initialise and sanity check these values. The non-Opteron case looks more scary (a read msr is done with the bogus info)...
Anyway, better ignore all bogus info and mark it invalid at the very beginning. If it's this, the attached patch should help and should be safe enough to go in, even for RCx.
I built an rpm for you to test, verification whether everything is fine now would be appreciated (it may take some hours until the ftp server got synced and the file pops up): ftp.suse.com/pub/people/trenn/wrong_acpi_freq_info/kernel-default-2.6.22.5-29.x86_64.rpm
Created attachment 174384 [details] Ignore bogus CPU frequency values wrongly exported by ACPI layer early
I just installed the new kernel (had to remove some ivtv-rpm's that depended on the original kernel, but no further problems) and rebooted with the CPUFREQ=off statement removed from /boot/grub/menu.lst. I'm sorry to say, but still a power-off.... Tried three times, same result. Boot option CPUFREQ=off lets the system boot again. regards, Jogchum
The problems started between alpha7 and beta1. What are the differences in CPU-frequency handling between those releases? Could there be a clue? As always, just my 2 cents. regards, Jogchum
Loading ondemand governor by default came in around that time. Only difference should be that cpufreq already gets active at installation time and that you see the hang there. It shouldn't make a difference at a finished up installation. I'll have a look for other changes...
I put a cpufreq debug enabled kernel here:
ftp.suse.com/pub/people/trenn/cpufreq_debug_10_3/kernel-default-2.6.22.7-31.x86_64.rpm
Can you install that one pls.
Next we should make sure the cpufreq modules can be loaded manually (even with passing CPUFREQ=off, AFAIK only hal start script is looking out for this):
Boot the newly installed kernel with CPUFREQ=off and cpufreq.debug=7
modprobe processor (might already be loaded)
modprobe powernow-k8
modprobe cpufreq_ondemand
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
Now the machine should switch off. You should be able to extract extra cpufreq related debug info from /var/log/messages (from last boot).
If it does not switch off, it possibly might be some strange interference between ACPI modules and cpufreq and we need to dig further.
As said, you might want to contact me via ICQ, I can help then interactively that might speed up things...
System freezes - no power off, but cursor disappears and system is non-responding - after modprobe powernow-k8.
Messages from /var/log/messages:
--------------------------------------------------------------------------------
Sep 25 16:21:55 souder-exp kernel: powernow-k8: Found 2 AMD Opteron(tm) Processor 244 processors (version 2.00.00)
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: trying to register driver powernow-k8
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: adding CPU 0
Sep 25 16:21:55 souder-exp kernel: powernow-k8: 0 : fid 0xa, vid 0x6
Sep 25 16:21:55 souder-exp kernel: powernow-k8: 1 : fid 0x2, vid 0x12
Sep 25 16:21:55 souder-exp kernel: powernow-k8: 0 : fid 0xa (1800 MHz), vid 0x6
Sep 25 16:21:55 souder-exp kernel: powernow-k8: 1 : fid 0x2 (1000 MHz), vid 0x12
Sep 25 16:21:55 souder-exp kernel: powernow-k8: cpu0, init lo 0x60a, hi 0x1
Sep 25 16:21:55 souder-exp kernel: powernow-k8: policy current frequency 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: freq-table: table entry 0: 1800000 kHz, 1546 index
Sep 25 16:21:55 souder-exp kernel: freq-table: table entry 1: 1000000 kHz, 4610 index
Sep 25 16:21:55 souder-exp kernel: freq-table: setting show_table for cpu 0 to ffff810055a648c0
Sep 25 16:21:55 souder-exp kernel: powernow-k8: cpu_init done, current fid 0xa, vid 0x6
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: setting new policy for CPU 0: 1000000 - 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: freq-table: request for verification of policy (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: verification lead to (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: request for verification of policy (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: verification lead to (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: new min and max freqs are 1000000 - 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: governor switch
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: __cpufreq_governor for CPU 0, event 1
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: governor: change or update limits
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: __cpufreq_governor for CPU 0, event 3
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: initialization complete
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: adding CPU 1
-----------------------------------------------------------------------------
I don't remember you mentioning ICQ chat; how can I reach you there, or is that the same as IRC? I'm looking now at channel openSUSE-bugs, but don't see you there. Is there a big time gap? I'm in the Netherlands (as you might have guessed...)
regards, Jogchum
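The fid/vid numbers in this log can be cross-checked with a little arithmetic. The sketch below uses the relations that the logged values themselves imply (fid 0xa is 1800 MHz, fid 0x2 is 1000 MHz, and the table "index" packs vid and fid into one number). The helper names are hypothetical and this is not the driver code.

```shell
#!/bin/bash
# Frequency in MHz for a frequency id, using the relation implied by the log
# (fid 0xa -> 1800 MHz, fid 0x2 -> 1000 MHz): 800 + 100 * fid.
fid_to_mhz() {
    echo $(( 800 + 100 * $1 ))
}

# The "index" shown in the freq-table lines: vid in the high byte,
# fid in the low byte. Arguments: FID VID.
fidvid_index() {
    echo $(( ($2 << 8) | $1 ))
}
```

`fidvid_index 0xa 0x6` gives 1546 and `fidvid_index 0x2 0x12` gives 4610, matching table entries 0 and 1; 1546 is also the 0x60a shown as "init lo". So the table for CPU 0 looks consistent, and the trouble starts only at "adding CPU 1".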
The only things coming in between Alpha7 and Beta1 are using ondemand as the default governor and the nohz patches. First we should check whether it's the ondemand governor breaking things. (Maybe the ondemand governor already tries to set another freq on cpu0 while cpu1 is still initialising and this Opteron does not like that?)
Could it be that Alpha7 was an updated system (still having the userspace governor set per default)? Do you have:
CPUFREQ_ENABLED="userspace"
set in /etc/sysconfig/powersave/cpufreq ?
Ohh... with ondemand per default the userspace workaround does not work anymore for machines freezing with the ondemand governor. It depends on the switching latency of the driver, but if the driver (powernow-k8) is supposed to work with ondemand, the governor is used automatically at driver load time now.
I placed a kernel with cpufreq debug and with the performance governor used per default here:
ftp.suse.com/pub/people/trenn/performance_gov_default_10_3/kernel-default-2.6.22.8-31.x86_64.rpm
modprobe powernow-k8
should work now? and
modprobe cpufreq_ondemand
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
freezes the system?
If not, it could be that the ondemand governor already kicks in for cpu0 while cpu1 still gets initialised...
If yes: We could only blacklist the CPU revision of these Opterons to not use ondemand per default. Because cpufreq (with ondemand per default) is used at installation time now where no userspace tools are available, fiddling around in userspace is not a real option here...
Because of ICQ: I mailed you my data privately, I can also have a look at an irc channel tomorrow if you prefer this one...
modprobe powernow-k8 doesn't freeze or power off the system now.
As for cpufreq_ondemand, this module is not found:
souder-exp:~ # modprobe cpufreq_ondemand
FATAL: Module cpufreq_ondemand not found.
/lib/modules/`uname -r`/kernel/arch/x86_64/kernel/cpufreq/ only has
-rw-r--r-- 1 root root 34280 Sep 25 19:16 acpi-cpufreq.ko
-rw-r--r-- 1 root root 44512 Sep 25 19:16 powernow-k8.ko
I suppose the echo commands don't make sense then? BTW, I have only 2 CPUs (no dual-cores), so there's only cpu0 and cpu1.
I have read my email now, so I saw your invitations to ICQ; sorry! I'm afraid I've never used ICQ; I've started Kopete, but the only thing I can do in this UI is add a contact, AFAICS, so how do I make contact with you this way?
Tomorrow I'll be working: leaving 6:45, coming home around 17:45 if public transport is on time. Timezone is CEST, so that's no problem.
ondemand governor is already compiled in, my fault. Just doing:
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
should activate it.
I could reproduce this once on a 4 socket Opteron Dual Core machine. Debug info or printk seems to prevent the probable lock/race condition. I just want to mention that this looks severe. Investigating...
I tried:
- MUTEX_DEBUG, PROVE_LOCKING configs
- I also tried to 100% reproduce this with some delays, no luck until now
It's probably hanging at:
lock_policy_rwsem_write(cpu);
in cpufreq_add_dev(..) in drivers/cpufreq/cpufreq.c
But this lock is clustered all over the cpufreq core, and I couldn't even reproduce this at all anymore.
We can either remove CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND and set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y again. We would then lose the ability to have cpufreq set up at installation time.
Or people need to add brokenmodules=powernow-k8 to install on affected machines and CPUFREQ=off later until this is finally tracked down.
Jogchum: I won't have time at around 19:00, maybe later, you may want to ping me through mail or icq.
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor is enough to let the system power-off.
This all is very weird.
For the logs in #58, I expect you hit a rare dead-lock condition (which I think I also could run into once; I have now set up machines rebooting all the time to see how often this gets hit, it seems to happen very rarely). I expect this sometimes happens with: ondemand as default governor + slower freq switching Opterons + smp, but this is only a rough guess.
But I have no explanation why your machine reboots. Especially with the debug kernel where you could load the powernow-k8 module (performance governor active), then you activated ondemand. This has nothing to do with the change that we switched to ondemand per default, it's the normal way it should have worked for a long time. Has this been an updated system? (We had problems with ondemand a long time ago and we blacklisted some machines to use the userspace governor instead; that would explain why it worked before.)
A bit of a problem is that because of ondemand per default we cannot blacklist this machine anymore to use the userspace governor. If you still have the kernel from comment #59/#60 it's possible.
Normally this should work: set CPUFREQ_ENABLED="userspace" in /etc/sysconfig/
Be sure you have not started a desktop (or kpowersaved or gnome-power-manager explicitly might activate the ondemand gov). You need to restart the powersaved then.
Whether userspace is active can be checked here:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
If the CPU has load it must switch up: e.g.
cat /dev/zero >/dev/null
should produce 100% load on one processor.
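Checking the active governor on every CPU, rather than only cpu0, could be done with a small loop. This is a sketch under assumptions: show_governors is a hypothetical helper name, and the optional argument for the sysfs directory exists only so the function can be tried against a copy of the tree.

```shell
#!/bin/bash
# Hypothetical helper: print "cpuN: governor" for every CPU directory below
# $1 (default /sys/devices/system/cpu). CPUs without a readable
# cpufreq/scaling_governor file are reported as having no cpufreq.
show_governors() {
    local root=${1:-/sys/devices/system/cpu} dir file
    for dir in "$root"/cpu[0-9]*; do
        [ -d "$dir" ] || continue
        file=$dir/cpufreq/scaling_governor
        if [ -r "$file" ]; then
            echo "${dir##*/}: $(cat "$file")"
        else
            echo "${dir##*/}: no cpufreq"
        fi
    done
}
```

Run as `show_governors` with no argument it reads the live sysfs tree. A CPU that shows "no cpufreq" while another one has a governor would be worth mentioning here.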
(In reply to comment #64 from Thomas Renninger)
> Normally this should work: set CPUFREQ_ENABLED="userspace" in /etc/sysconfig/
It has to be CPUFREQ_CONTROL="userspace" in /etc/sysconfig/cpufreq
Created attachment 175114 [details] Do not use ondemand per default on opterons
This one should be a safe way to use the performance governor per default on Opterons (it's switched to ondemand later via userspace tools). Blacklisting could be done more cleverly (e.g. all smp AMDs without fire and forget or similar).
I think we should not add it that late because of one bug report (I still couldn't run into a machine hang again) and it wouldn't help for this specific machine anyway. Even if the dead lock condition is found and there is a possible fix, modifying the complicated rw semaphore code paths is nothing we want in RC3 or an update...
If this got evaluated a bit further we might want to provide something like the attached patch in an update...
Maybe the power off has to do with nohz changes + tsc (which gets unstable with cpufreq) as time source -> notsc boot parameter could help then. If not, I'd also like to see extended acpi debug output when cpufreq gets activated, maybe we could do that together via chat, you might want to contact me directly via mail. If possible, a 10.2 or AlphaX installed system could also help debugging. If it's still there, just checking whether cpufreq worked there could give a hint, maybe it was never activated. Or maybe HW got some damage between Alpha7 and next update (probably not..., but you never know...)
It doesn't look that severe -> downgrading:
- I let two machines run over night to hit the deadlock case -> no freeze (powernow-k8 cannot be unloaded).
- Jogchum tried a 10.2, cpufreq never worked there because of an ACPI bug (see comment #42) he gets: register performance failed: bad ACPI data
If there really is a deadlock condition we should get some more reports soon, I won't waste time on this any more for now.
It may be that cpufreq never worked on Jogchum's machine. Next we will try whether 10.3 Alpha6 really has cpufreq working. I could imagine the power off comes from a broken voltage regulator. That would mean cpufreq was never activated before and he has broken HW.
It could also be a side effect of the ACPI fix for the rare package declaration (see comment #42) -> but there should not have been changes at this place between Alpha7/Beta1.
Ouch. Thanks to Jogchum I got modprobe powernow-k8 with some more ACPI debug enabled. It's a BIOS bug...: the ACPI tables have cpufreq info for CPU0, but simply miss the cpufreq info for CPU1. Something could be added to get out of the driver more gracefully. Currently the powernow-k8 driver even loads (which it did not with e.g. 10.2):
cpufreq-core: initialization complete
cpufreq-core: adding CPU 1
nsutils-0454 [00] ns_build_internal_name: Returning [ffff81007cfd9560] (rel) "_PCT"
nsutils-0869 [00] ns_get_node : _PCT, AE_NOT_FOUND
processor_perflib-0312 [00] processor_get_performa: ACPI-based processor performance control unavailable
powernow-k8: register performance failed: bad ACPI data
powernow-k8: MP systems not supported by PSB BIOS structure
cpufreq-core: initialization failed
cpufreq-core: driver powernow-k8 up and running
It's a bit strange that behaviour changed even between 10.2 and 10.3, but there is not much we can do here (besides exiting gracefully).
> I let two machines run over night to hit the deadlock case (the one was the
> machine which I thought froze) -> no freeze
Those are still happily rebooting. Maybe this was a false alarm...
Here again the board: MSI K8N Master2FAR
Someone should tell MSI about this issue. Jogchum, you may want to monitor the support sites of MSI and look out for a possibly upcoming BIOS.
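Spotting this pattern in a debug log from another machine could be automated. The sketch below assumes the log contains the same two kinds of lines as the excerpt in this comment ("adding CPU N" and "_PCT, AE_NOT_FOUND"); cpus_missing_pct is a hypothetical helper name.

```shell
#!/bin/bash
# Hypothetical helper: read kernel log text on stdin and print the number of
# each CPU whose "adding CPU N" line is followed by a failed _PCT lookup.
cpus_missing_pct() {
    awk '
        /cpufreq-core: adding CPU [0-9]+/ {
            cpu = $0
            sub(/.*adding CPU /, "", cpu)   # keep what follows the CPU label
            sub(/[^0-9].*/, "", cpu)        # strip anything after the number
        }
        /_PCT, AE_NOT_FOUND/ && cpu != "" { print cpu; cpu = "" }
    '
}
```

It could be fed with `dmesg | cpus_missing_pct` or with a saved log file on stdin. For the excerpt above it prints 1, the CPU for which the BIOS tables carry no cpufreq info.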
Created attachment 175489 [details] Modified DSDT with CPU freq info added for CPU1
You may want to follow comment #42 and try this DSDT, with some luck it works. Still, you should nag MSI for a new BIOS and revert the modified DSDT again after testing by removing the added entry in /etc/sysconfig/kernel and invoking mkinitrd afterwards.