Bug 299891 - Crash in early stage install
Summary: Crash in early stage install
Status: RESOLVED INVALID
Alias: None
Product: openSUSE 10.3
Classification: openSUSE
Component: Installation
Version: Beta 1
Hardware: x86-64 openSUSE 10.3
Priority: P5 - None; Severity: Major
Target Milestone: ---
Assignee: Thomas Renninger
QA Contact: Jiri Srain
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2007-08-13 17:52 UTC by Jogchum Reitsma
Modified: 2007-09-28 14:11 UTC

See Also:
Found By: Beta-Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
coolo: SHIP_STOPPER-


Attachments
The boot.msg created by 10.3alpha7 (121.09 KB, application/octet-stream)
2007-08-15 21:55 UTC, Jogchum Reitsma
hwinfo created by 10.3.alpha7 (105.29 KB, text/plain)
2007-08-15 21:57 UTC, Jogchum Reitsma
Photo of console 4 (688.14 KB, image/jpeg)
2007-08-16 22:14 UTC, Jogchum Reitsma
Screenshot just before entering non-responding state; live CD 10.3beta3plus (1.06 MB, image/jpeg)
2007-09-14 20:06 UTC, Jogchum Reitsma
Screenshot just before power-off; 10.3beta3plus live CD (1.10 MB, image/jpeg)
2007-09-14 20:09 UTC, Jogchum Reitsma
System responses from the tests and relevant lines /var/log/messages (1.26 KB, text/plain)
2007-09-19 17:12 UTC, Jogchum Reitsma
The acpidump (115.49 KB, text/plain)
2007-09-19 22:25 UTC, Jogchum Reitsma
Fixed and recompiled DSDT (22.01 KB, application/octet-stream)
2007-09-20 07:06 UTC, Thomas Renninger
Result of the mkinitrd command (1.08 KB, text/plain)
2007-09-21 07:19 UTC, Jogchum Reitsma
Ignore bogus CPU frequency values wrongly exported by ACPI layer early (2.58 KB, patch)
2007-09-24 15:20 UTC, Thomas Renninger
Do not use ondemand per default on opterons (1.43 KB, patch)
2007-09-27 11:11 UTC, Thomas Renninger
Modified DSDT with CPU freq info added for CPU1 (22.13 KB, application/octet-stream)
2007-09-28 14:11 UTC, Thomas Renninger

Description Jogchum Reitsma 2007-08-13 17:52:33 UTC
I am trying to install 10.3beta1 on a dual Opteron system presently running 10.3alpha7. When booting from the burned DVD, the Welcome splash screen appears, followed by the boot menu screen. When I choose to install the beta release (and press Esc to see the messages on the console), the install procedure gets as far as this:

--------------------------------------------------------------
Loading basic drivers.........................OK
Starting hardware detection ...........OK

(If...................brokenmodules=driver_name)
---------------------------------------------------------------

After that, the system usually powers off instantly, or sometimes hangs after the messages

----------------------------------------------------------------
Micro Star International CK804 IDE
drivers pata, amd74xx, generic
loading pata
----------------------------------------------------------------

When I choose Boot from harddisk, 10.3alpha7 boots OK.
When I choose to do a memory test, the test is done OK
When I choose to do a firmware test, the system hangs as stated before.

System is MSI K8N Master2FAR, two single core opterons, 2 GB mem, 200 GB Maxtor IDE disk, Matrox Parhelia graphics card, Hauppauge WinTV 150 tv-card, Plextor PX  740A dvd writer.

The checksum of the DVD download is OK. I burned the DVD twice, the last time on a 10.2 OS and at the lowest speed. Same result.

See also my post on the Open Suse Beta forum in Suseforums.net
Comment 1 Andreas Jaeger 2007-08-14 08:36:34 UTC
Please add the output of "lspci -nn" to Bug 299010.

*** This bug has been marked as a duplicate of bug 299010 ***
Comment 2 Jogchum Reitsma 2007-08-14 17:45:30 UTC
I've added lspci -nn to bug 299010.

The behaviour reported by the original poster of 299010 is somewhat different from my findings: in my case, the crash appears at a much earlier stage.

But that's just my 2 cents.
Comment 3 Tejun Heo 2007-08-15 06:39:56 UTC
Jogchum, please post '/var/log/boot.msg' and 'hwinfo --all' from the working installation. Also, after the 10.3b1 installation system is fully loaded:

* Press "ctrl-alt-f9".  It will give you a command console.
* mount a usb stick or hard disk partition to /mnt
* "cp /var/log/boot.msg /mnt", "dmesg > /mnt/dmesg.log", "hwinfo --all > /mnt/hwinfo.log"
* Post the results here.

Thanks.
Comment 4 Jogchum Reitsma 2007-08-15 21:55:24 UTC
Created attachment 157786 [details]
The boot.msg created by 10.3alpha7
Comment 5 Jogchum Reitsma 2007-08-15 21:57:36 UTC
Created attachment 157787 [details]
hwinfo created by 10.3.alpha7
Comment 6 Jogchum Reitsma 2007-08-15 22:37:20 UTC
I'm afraid I'm unable to fulfill the last request you made: 

after 10.3b1 installation system is fully loaded...

* Press "ctrl-alt-f9".  It will give you a command console.

When 10.3b1 installation system is fully loaded, I have only a few seconds (less than 5) before the system crashes, and in that time either I get no response to
'ctrl-alt-F9', or I do get a prompt, namely
/#
but then the keyboard is as dead as a doornail (even numlock etc don't work).

I'm giving up for now, and I have only tomorrow night before I go on a three-week holiday.

regards, jogchum
Comment 7 Tejun Heo 2007-08-16 19:11:18 UTC
Console F4 logs kernel messages.  Taking a photo of the console when the machine locks up should give us some clue.  Thanks.
Comment 8 Jogchum Reitsma 2007-08-16 22:14:58 UTC
Created attachment 158046 [details]
Photo of console 4
Comment 9 Jogchum Reitsma 2007-08-16 22:15:44 UTC
After a few tries, going to console 4 succeeded! I've made a (hopefully readable) photo of the screen.

See the attachment.

I won't be able to react for the next three weeks.

regards, jogchum
Comment 10 Tejun Heo 2007-08-17 16:46:59 UTC
Thanks.  Hmmm... Weird.  pata_amd hasn't been changed between a6 and b1.  Please ping me back when you come back from your vacation.  I'll prep debug kernels.  Thanks.
Comment 11 Christoph Thiel 2007-08-28 11:30:55 UTC
Jogchum, did you have a chance to look into this bug on Beta1 or Beta2?

Since there hasn't been any progress on this bug, I'm lowering severity to crit.
Comment 12 Jogchum Reitsma 2007-09-06 09:15:35 UTC
Back from holiday; tried beta2, behaviour is the same...

Tejun, is there anything further I can do, perhaps with the debug kernels you planned to prepare?

regards, Jogchum
Comment 13 Jogchum Reitsma 2007-09-10 16:42:59 UTC
Tried beta3, same result...

regards, Jogchum
Comment 14 Stephan Kulow 2007-09-12 16:20:59 UTC
info provided
Comment 15 Stephan Kulow 2007-09-14 13:14:45 UTC
Can you please try the live CD of beta3plus from: http://ftp.opensuse.org/pub/opensuse/distribution/10.3-Beta3plus/iso/cd/

This has several kernel fixes in this area
Comment 16 Jogchum Reitsma 2007-09-14 20:06:15 UTC
Created attachment 164149 [details]
Screenshot just before entering non-responding state; live CD 10.3beta3plus
Comment 17 Jogchum Reitsma 2007-09-14 20:07:48 UTC
It crashes as well (powers the system off), but at a later stage, or so it seems. On one try, however, the system did not power off but went into a non-responding state; numlock etc. didn't react either.

I made a number of photographs, two of which I'll add. First one (img_3090.jpg) gives the last screen before the no-responding state. The second one (img_3092.jpg) is from another start-up, one which leads to a power off. The shot is taken just before the power off, so it includes the last message line on the console before power-off.

I noted that this CD is an i386 release, not x86-64. But I assume you noted that too.

regards, Jogchum
Comment 18 Jogchum Reitsma 2007-09-14 20:09:36 UTC
Created attachment 164150 [details]
Screenshot just before power-off; 10.3beta3plus live CD
Comment 19 Stephan Kulow 2007-09-15 05:58:51 UTC
I didn't notice the arch, no. But interesting that it doesn't matter to your computer.

Adrian also noted sudden power offs and interestingly also has a matrox graphics card. They seem to dislike our X server or something.
Comment 20 Jogchum Reitsma 2007-09-16 00:17:06 UTC
I noted that in the screenshots I uploaded a few days ago (img_3090.jpg and img_3092.jpg) ACPI was called just before the anomalies (power-off, hung system) appeared. So I installed with the boot option acpi=off, and now the system installs properly.

Two o'clock in the night now, going to bed...

regards, Jogchum 
Comment 21 Tejun Heo 2007-09-16 11:31:22 UTC
So, with ACPI turned off, harddisks are detected and work properly, right?  cc'ing Thomas for ACPI.
Comment 22 Jogchum Reitsma 2007-09-16 14:53:05 UTC
Yes, correct. I have 10.3beta3 installed now.
Funny thing is, now I have 10.3beta3 installed, I don't have to give acpi=off when booting the installed system - it runs fine...

Totally OT on this bug, but I gave the 386 live release from beta3 a go on my Acer 1710 laptop (which runs 10.0 at the moment). It seemed to try to run the X system (runlevel 5 was entered), but without success: after a few flickers on a blackish screen, it gave the login prompt on console 1. It's our 'production' machine, so I can't do too much testing on it. But as said, totally OT here.

regards, Jogchum
Comment 23 Stefan Dirsch 2007-09-16 19:40:05 UTC
> Adrian also noted sudden power offs and interestingly also has a matrox
> graphics card. They seem to dislike our X server or something.

This is a Matrox Parhelia, which is not supported by the mga driver. So it's very unlikely that it is related to the graphics driver. With the fbdev driver in use it would probably happen with any other graphics card.
Comment 24 Thomas Renninger 2007-09-17 12:27:23 UTC
>Funny thing is, now I have 10.3beta3 installed, I don't have to give acpi=off
>when booting the installed system - it runs fine...
OK, then I'll close this one as fixed; please reopen if you see further problems.
>Totally OT
Yeah, very Off-Topic. Better open another bug report if this should get addressed.
Comment 25 Jogchum Reitsma 2007-09-17 15:30:32 UTC
Honestly, I doubt whether this bug should be considered 'fixed'. In the install phase the bug is still there. Now that it is clear that acpi=off is a workaround for the installation process, the bug should be easier to trace, though.

Just my 2 cents...

regards, Jogchum

PS I will open a bug report for the problem I noticed with the Acer (provided this bug has not been reported yet, of course).
Comment 26 Tejun Heo 2007-09-17 17:11:01 UTC
Jogchum, if installation of b3 doesn't work w/o acpi=off, please reopen the bug.
Comment 27 Jogchum Reitsma 2007-09-17 19:30:03 UTC
Indeed, the installation problem is as stated in the original posting: installation only works with acpi=off.

I am reopening the bug.
Comment 28 Thomas Renninger 2007-09-18 10:14:10 UTC
Yes sorry, the bug should not have been closed.
Investigating...

Jogchum: If you pass acpi=off boot parameter when you boot the install system, the parameter should have been added by yast. Can you check in /boot/grub/menu.lst whether your first/default boot entry has an acpi=off parameter added.
If yes, you can remove it and you should run into this problem again (you can add the parameter again later to get a working system...)?

This bug has been declared as a duplicate of bug #299010. There a lot of people were involved and added in CC. Were they dropped by marking this as a duplicate? I already asked because I also saw this in another bug some time ago and I thought this got fixed, shouldn't all the CC'ed people from bug #299010 also be CC'ed here?
Comment 29 Jogchum Reitsma 2007-09-18 17:10:44 UTC
Yes, it is present in the default boot entry. On removal the system reboots (note: it does not power off, at least not on the one try I gave it); when I give acpi=off at the boot prompt the system starts up normally.

regards, Jogchum
Comment 30 Jogchum Reitsma 2007-09-18 17:17:26 UTC
I doubt if 299010 is still seen as a duplicate of this bug; see comment #27 from Tejun Heo on 299010. Behaviour is quite different with me, and my controller is different too.

regards, Jogchum
Comment 31 Thomas Renninger 2007-09-18 17:43:02 UTC
Yep. I tried to reproduce this on a MSI Platinum, but this seems to work fine now.
Could you try (instead of acpi=off):
pci=noacpi
noapic
nolapic
possibly also could help:
pci=nommconf
pci=nomsi
It's enough to find the first one working, in case pci=noacpi works you should also try noapic or nolapic boot param.

Hmm, maybe I am still too disk/irq oriented (the disk is found correctly now?); the fact that the machine powers off looks more like a bug that is not irq-related.

Maybe we should start with this:
does this boot parameter work (without acpi=off):
init=/bin/bash
If yes and the disk is found, I am on the wrong track and the above boot parameters probably do not need to be tested.
I'd suggest not loading any ACPI modules then. Kay showed me how to do this via udev rules, but I forgot again... Kay, could you please tell us?
Comment 32 Jogchum Reitsma 2007-09-18 19:07:01 UTC
Neither

pci=noacpi
noapic (I tried also pci=noapic: not sure if I understood you right here)
nolapic (also pci=nolapic)
pci=nommconf
pci=nomsi

works: in all cases power-off.

init=/bin/bash works, the root device is mounted, and the /home partition (these are the only two partitions on the disk, apart from swap of course) is mountable.

What strikes me is that power-off is always seen just or almost just after the message

Loading CPUFreq modules

Could that give a clue maybe?

regards, Jogchum
Comment 33 Thomas Renninger 2007-09-19 11:03:26 UTC
> Loading CPUFreq modules
> Could that give a clue maybe?
Definitely. I could have seen this before, but was confused by the disk-related duplicates and by the fact that Tejun was assigned to this one -> taking over. Still not sure whether it's cpufreq; it could be other ACPI accesses.

Can you try to boot with the CPUFREQ=off boot parameter?
If this does not work it's probably some other ACPI module.
If this works, do:
- rmmod battery
- rmmod thermal
- echo 0x21F >/sys/module/acpi/parameters/debug_level

Then do the following (always wait a few seconds after each step, to be sure the step just done is not the offender):
- modprobe processor
- modprobe powernow-k8
- modprobe cpufreq_ondemand
- echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

Please extract the logged messages in /var/log/messages by time/date after machine hang and got rebooted and attach them.
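
The manual sequence above could also be scripted, so that each step is announced in the log before it runs and is followed by a pause. This is only a sketch, assuming bash and (for a real run) root; run_step, run_all_steps, DRY_RUN and PAUSE are made-up names for illustration, not part of any tool. By default it only prints what it would do.

```shell
#!/bin/bash
# Sketch: run the cpufreq steps one at a time, logging each step first so
# that the last marker in /var/log/messages names the step that was running
# when the machine went down.
# DRY_RUN, PAUSE, run_step and run_all_steps are illustrative names.

DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = really run them (as root)
PAUSE="${PAUSE:-5}"       # seconds to wait after each step

run_step() {
    local cmd="$1"
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $cmd"
        return 0
    fi
    logger "cpufreq-test: about to run: $cmd"
    sync                  # ask the kernel to write out pending data before the risky step
    sh -c "$cmd"
    sleep "$PAUSE"
}

run_all_steps() {
    run_step "modprobe processor"
    run_step "modprobe powernow-k8"
    run_step "modprobe cpufreq_ondemand"
    run_step "echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor"
}

run_all_steps
```

To run it for real, something like DRY_RUN=0 PAUSE=10 ./steps.sh as root would do (the file name is arbitrary). Whether the marker reaches the disk before a sudden power-off is not guaranteed.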
Comment 34 Thomas Renninger 2007-09-19 11:04:40 UTC
Is this an i386/32 bit or x86_64/64 bit installed system?
If it is an i386 installed system, nohz=disable could help?
Comment 35 Jogchum Reitsma 2007-09-19 11:57:02 UTC
I'll run the tests when I'm home from work.

System and installation is x86_64/64 bit.

regards, Jogchum
Comment 36 Jogchum Reitsma 2007-09-19 17:03:55 UTC
With CPUFREQ=off the system starts up correctly.

None of the commands halts or hangs the system, but see the attachment for the responses to the commands, and the (few) relevant lines in /var/log/messages

regards, Jogchum
Comment 37 Jogchum Reitsma 2007-09-19 17:12:12 UTC
Created attachment 173419 [details]
System responses from the tests and relevant lines /var/log/messages
Comment 38 Thomas Renninger 2007-09-19 19:06:15 UTC
Can you please attach the acpidump output?
It could also be useful if you boot with CPUFREQ=off (this one should be your preferred boot parameter for now and provide you most functionality -> all but cpufreq).
Copy this code into a file and do a chmod 755 on the file and execute it:
---------------------------------------
#!/bin/bash

rmmod battery
rmmod thermal   # should not be loaded anyway
logger XXXXXXXXXXX
echo 0x21F >/sys/modules/acpi/paramters/debug_level
modprobe processor
echo 0x3 >/sys/modules/acpi/paramters/debug_level
logger YYYYYYYYYYYY
---------------------------------------

Can you attach the output of /var/log/messages between XXXXXXXX and YYYYYYY, pls.

Hmmm, before doing any of this, you should look for a BIOS update; this again looks like a BIOS issue.
Comment 39 Jogchum Reitsma 2007-09-19 22:25:13 UTC
Created attachment 173480 [details]
The acpidump
Comment 40 Jogchum Reitsma 2007-09-19 22:38:08 UTC
Apparently the requested modules are not present, as in the previous tests:

----------------------------------
# ./test2results.sh
ERROR: Module battery does not exist in /proc/modules
ERROR: Module thermal does not exist in /proc/modules
./test2results.sh: line 6: /sys/modules/acpi/paramters/debug_level: No such file or directory
FATAL: Error inserting processor (/lib/modules/2.6.22.5-16-default/kernel/drivers/acpi/processor.ko): No such device
./test2results.sh: line 8: /sys/modules/acpi/paramters/debug_level: No such file or directory
------------------------------------

There is nothing between

Sep 20 00:26:16 souder-exp jogchum: XXXXXXXXXXX
Sep 20 00:26:16 souder-exp jogchum: YYYYYYYYYYYY

in /var/log/messages

There does not seem to be a BIOS-upgrade for the K8N Master2-FAR on the MSI-site.

regards, Jogchum
Comment 41 Thomas Renninger 2007-09-20 05:59:01 UTC
Sorry I mis-spelled the paramters, should be parameters, but there should be no need, I expect acpidump is enough.
Comment 42 Thomas Renninger 2007-09-20 06:58:05 UTC
This is a bug in ACPICA:

        Name (_PSS, Package (0x02)
        {
            Package (0x06)
            {
                0x0708, 
                0x0000D6D8, 
                0x64, 
                0x09, 
                0xE020298A, 
                0x018A
            }, 

            Package (0x06)
            {
                0x03E8, 
                0x00002EE0, 
                0x64, 
                0x09, 
                0xE0202C82, 
                0x0482
            }, 

            Package (0x06)
            {
                0xFFFF, 
                0xFFFFFFFF, 
                0xFF, 
                0xFF, 
                0xFFFFFFFF, 
                0x03FF
            }, 

            Package (0x06)
            {
                0xFFFF, 
                0xFFFFFFFF, 
                0xFF, 
                0xFF, 
                0xFFFFFFFF, 
                0x03FF
            }, 

            Package (0x06)
            {
                0xFFFF, 
                0xFFFFFFFF, 
                0xFF, 
                0xFF, 
                0xFFFFFFFF, 
                0x03FF
            }, 

            Package (0x06)
            {
                0xFFFF, 
                0xFFFFFFFF, 
                0xFF, 
                0xFF, 
                0xFFFFFFFF, 
                0x03FF
            }
        }


I had a similar bug on a machine that had no valid package/AML information inside of a package, but was filled up with zeros. I wonder why it does not work. It was bug #189488 (getting interesting at comment #11). 

Whether zeros or not, the contents after package(0x2) should not get evaluated.
Adding Alexey, AFAIK he got my patch forwarded from Bob (and even signed-off?) and it got slightly modified after running in their test suites...

It could be that the parser in the first cycle ignores the amount of packages, already generates meta-info for the other packages, which later leads to difficulties...

BIOS developers like to fill up CPU frequency data to not need to allocate space for different CPUs dynamically. On some machines, the info is often filled up with the same frequency information which is then ignored by the kernel, on some they cut it by package size definition.
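
For readers who want to check the numbers in the _PSS listing above, the fields can be decoded with plain shell arithmetic. This is a sketch only: it assumes the usual ACPI field order for a _PSS entry (core frequency in MHz, power in mW, transition latency in microseconds, then bus master latency, control and status), and decode_pss and pss_is_bogus are made-up helper names. Treating an all-ones frequency as filler is an assumption that fits this DSDT, not a general rule.

```shell
#!/bin/bash
# Sketch: decode the first three fields of an ACPI _PSS entry.
# Assumed field order: CoreFrequency (MHz), Power (mW), TransitionLatency (us), ...
# decode_pss and pss_is_bogus are illustrative helper names.

decode_pss() {
    # $1 = CoreFrequency, $2 = Power, $3 = TransitionLatency (hex, as in the DSDT)
    echo "$(( $1 )) MHz, $(( $2 )) mW, $(( $3 )) us"
}

pss_is_bogus() {
    # Assumption for this DSDT: an all-ones frequency field marks a filler entry.
    [ "$(( $1 ))" -eq "$(( 0xFFFF ))" ]
}

decode_pss 0x0708 0x0000D6D8 0x64
decode_pss 0x03E8 0x00002EE0 0x64
if pss_is_bogus 0xFFFF; then
    echo "0xFFFF entry is filler"
fi
```

The two valid entries decode to 1800 MHz and 1000 MHz, and the 0xFFFF filler corresponds to 65535 MHz.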

It's hard to predict the risk of this, but I expect that if we can come up with a safe patch that ignores everything after the second package (amount of package elements), it should go in.

I still have another important bug: Alexey, could you already have a look at this, pls (you might want to take over if you have time...).

Len, now it would be convenient if we could work on the latest ACPICA sources...
Comment 43 Thomas Renninger 2007-09-20 07:06:14 UTC
Created attachment 173544 [details]
Fixed and recompiled DSDT

For verification whether it's really that, could you copy this attachment to e.g. /etc/DSDT.aml   (the filename must stay the same)
modify ACPI_DSDT="" to ACPI_DSDT="/etc/DSDT.aml" in /etc/sysconfig/kernel and invoke:
mkinitrd
Then reboot without acpi=off and without CPUFREQ=no
Does it boot?
If you do:
powersave -c
Do you get DYNAMIC as output?
If this gets fixed you should set the entry back to ACPI_DSDT="" and invoke mkinitrd again, so that the DSDT is no longer added to the initrd and no longer overridden by the kernel at early boot. The problem is that this table is generated by the BIOS depending on your hardware; if you e.g. add more memory you must not use this table anymore!
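
Switching the ACPI_DSDT entry back and forth can be done with a one-line sed call instead of an editor. A minimal sketch, assuming GNU sed and a file that already contains an ACPI_DSDT="..." line; set_acpi_dsdt is a made-up helper name, and the demo runs on a scratch file rather than on /etc/sysconfig/kernel.

```shell
#!/bin/bash
# Sketch: switch the ACPI_DSDT entry in a sysconfig-style file.
# set_acpi_dsdt is an illustrative helper; it edits only the file it is given.

set_acpi_dsdt() {
    local file="$1" value="$2"
    # Rewrite the whole ACPI_DSDT="..." line; '|' is the delimiter because the value is a path.
    sed -i "s|^ACPI_DSDT=.*|ACPI_DSDT=\"$value\"|" "$file"
}

# Try it on a scratch file first, not on /etc/sysconfig/kernel itself.
scratch="$(mktemp)"
printf 'ACPI_DSDT=""\n' >"$scratch"
set_acpi_dsdt "$scratch" /etc/DSDT.aml
cat "$scratch"
set_acpi_dsdt "$scratch" ""
cat "$scratch"
rm -f "$scratch"
```

On the real file, mkinitrd still has to be run after each change for it to take effect.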
Comment 44 Jogchum Reitsma 2007-09-20 23:04:24 UTC
It still powers off after I took the steps you described :-(

regards, Jogchum
Comment 45 Jogchum Reitsma 2007-09-21 07:19:55 UTC
Created attachment 173793 [details]
Result of the mkinitrd command

I forgot to add the output of the mkinitrd command. One has to issue it from /, because otherwise sbin/update-bootloader is not found. Is it on purpose that this command uses a relative instead of an absolute path?

But the output of mkinitrd seems OK to me.

The output of
----------------------------------------
powersave -c
liblazy (liblazy_dbus_send_method_call:97): Received error reply: Method "GetCPUFreqGovernor" with signature "" on interface "org.freedesktop.Hal.Device.CPUFreq" doesn't exist

Could not get current CPUFreq policy.
-----------------------------------------

Logical, because CPUFREQ is set off, I think?

regards, Jogchum
Comment 46 Thomas Renninger 2007-09-21 07:53:27 UTC
You can check with dmesg |less
-> There must be a sentence like "DSDT overridden by initrd" or similar; you could grep for DSDT or initrd (possibly initramfs instead of initrd).
If this line is there, everything should work fine (also without CPUFREQ=off or other workaround boot parameters).

> sbin/update-bootloader is not found -> strange, works here. AFAIK work is still being done on this package.

> powersave -c
> liblazy (liblazy_dbus_send_method_call:97): Received error reply
What the...
You can test whether this directory exists instead:
ls /sys/devices/system/cpu/cpu0/cpufreq
If it exists you can watch the frequencies switching with:
watch -n1 cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_cur_freq

> Logical, because CPUFREQ is set off, I think?
Be sure CPUFREQ=off is not passed. At runtime you can see the boot parameters here:
cat /proc/cmdline
you can modify them in /boot/grub/menu.lst.
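
A quick way to confirm that none of the workaround parameters is still being passed is to test the command line word by word. This is a sketch; has_boot_param is a made-up helper name, and the parameter list is simply the ones mentioned in this bug.

```shell
#!/bin/bash
# has_boot_param is an illustrative helper: it succeeds if the given
# parameter appears as a whole word on a kernel command line.

has_boot_param() {
    local param="$1" cmdline="$2"
    case " $cmdline " in
        *" $param "*) return 0 ;;
        *)            return 1 ;;
    esac
}

cmdline="$(cat /proc/cmdline 2>/dev/null)"
for p in acpi=off CPUFREQ=off CPUFREQ=no; do
    if has_boot_param "$p" "$cmdline"; then
        echo "workaround still active: $p"
    fi
done
```

If the script prints nothing, the running kernel was booted without any of these parameters.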
Comment 47 Jogchum Reitsma 2007-09-21 09:06:45 UTC
This line is in output dmesg:

ACPI: Override [DSDT-AWRDACPI] from initramfs - tainting kernel

But still no boot without CPUFREQ=off.
---------------------------------------------------

Regarding 

> Be sure CPUFREQ=off is not passed. At runtime you can see the boot parameters
> here:
> cat /proc/cmdline
> you can modify them in /boot/grub/menu.lst.

: without CPUFREQ=off the system does not boot, so....

-----------------------------------------------------

souder-exp:~ # ls /sys/devices/system/cpu/cpu0/cpufreq
ls: cannot access /sys/devices/system/cpu/cpu0/cpufreq: No such file or directory
souder-exp:~ #

regards, Jogchum


Comment 48 Thomas Renninger 2007-09-21 09:21:43 UTC
That is strange, if processor module cannot be loaded:
> FATAL: Error inserting processor
> (/lib/modules/2.6.22.5-16-default/kernel/drivers/acpi/processor.ko): No such
> device

powernow-k8 should also not load and cpufreq should be disabled anyway, I wonder why it boots with CPUFREQ=no, but not without.

Can you try things from comment #38 (with CPUFREQ=no):
Copy this code into a file, do a chmod 755 on the file and execute it (hope I got the filenames right now):
---------------------------------------
#!/bin/bash

logger XXXXXXXXXXX
echo 0x21F >/sys/modules/acpi/parameters/debug_level
modprobe processor
echo 0x3 >/sys/modules/acpi/parameters/debug_level
logger YYYYYYYYYYYY
---------------------------------------

Can you attach the output of /var/log/messages between XXXXXXXX and YYYYYYY,
pls.
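
Cutting the part between the two logger markers out of /var/log/messages can be left to awk. A sketch only; between_markers is a made-up helper name, and the marker strings are the ones used by the script above.

```shell
#!/bin/bash
# between_markers is an illustrative helper: print the lines of a log file
# that lie strictly between each line containing START and the following
# line containing END (the marker lines themselves are not printed).

between_markers() {
    local start="$1" end="$2" file="$3"
    awk -v s="$start" -v e="$end" '
        index($0, e) && inside { inside = 0 }   # stop at the end marker
        inside                 { print }        # copy lines while inside
        index($0, s)           { inside = 1 }   # start after the start marker
    ' "$file"
}

if [ -r /var/log/messages ]; then
    between_markers XXXXXXXXXXX YYYYYYYYYYYY /var/log/messages
fi
```

If the script was run several times, the output contains one chunk per run; empty output means nothing was logged between the markers.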
Comment 49 Jogchum Reitsma 2007-09-21 17:14:29 UTC
I'm away from home for a few days and won't be able to test it until tomorrow night. I'm sorry.

regards, Jogchum
Comment 50 Jogchum Reitsma 2007-09-22 20:31:02 UTC
Actually, it is not

/sys/modules/acpi/parameters/debug_level 

but

/sys/module/acpi/parameters/debug_level

But there is nothing written to /var/log/messages between the XXXXXXXXXXX and YYYYYYYYYYYY lines (which themselves are written indeed).

regards, Jogchum
Comment 51 Jogchum Reitsma 2007-09-23 17:39:18 UTC
Forgot to tell, the script did not give any (error) messages, and

lsmod | grep processor
gives
processor              59592  1 thermal

so modprobe apparently does its work, only there's no logging.
Comment 52 Thomas Renninger 2007-09-24 15:18:30 UTC
Comment #32 states that the processor module could not be loaded (and therefore powernow-k8, which needs the processor module, could not be loaded either). I expect a boot param like acpi=off was used accidentally there.

About comment #42: The problem is that our userspace acpica sources are *really* old and the problem I mentioned in that comment should already be fixed. But Intel has not published the fixed sources, we still have to use the code from more than a year ago which makes this one very hard to debug (-> Len, we have to talk about this again privately, I hope there wasn't some kind of policy change at Intel about this...).

Bug #297119 has a very similar CPUFreq states declaration: it declares a package of size 2. In it there are 2 valid packages and 4 invalid (0xFFFFF...) packages; the latter have to be ignored totally. There the invalid packages are exported to the cpufreq layer (!ACPI bug!), and the powernow-k8 module seems to have a sanity check on the exported values and ignores them: invalid freq entries 3900000 kHz vs. 65535000 kHz.

It's too late to fix the ACPI parser now; I'll try to find a sanity check for powernow-k8:

There is a differentiation depending on a processor capability flag; the variable used is cpu_family and can be CPU_OPTERON and CPU_HW_PSTATE. In bug 297119 we have a non-Opteron, here it is an Opteron, and I therefore expect that:
fill_powernow_table_pstate (-> non-opteron case)
and
fill_powernow_table_fidvid (-> opteron case)
is used to initialise and sanity check these values.

The non-opteron case looks more scary (a read msr is done with the bogus info)... Anyway, better ignore all bogus info and mark it invalid at the very beginning.

If it's this, attached patch should help and should be safe enough to go in, even for RCx.

I built an rpm for you to test, verification whether everything is fine now would be appreciated (it may take some hours until the ftp server got synced and the file pops up):
ftp.suse.com/pub/people/trenn/wrong_acpi_freq_info/kernel-default-2.6.22.5-29.x86_64.rpm
Comment 53 Thomas Renninger 2007-09-24 15:20:00 UTC
Created attachment 174384 [details]
Ignore bogus CPU frequency values wrongly exported by ACPI layer early
Comment 54 Jogchum Reitsma 2007-09-24 18:22:54 UTC
I just installed the new kernel (had to remove some ivtv-rpm's that depended on the original kernel, but no further problems) and rebooted with the CPUFREQ=off statement removed from /boot/grub/menu.lst.

I'm sorry to say, but still a power-off.... Tried three times, same result. Boot option CPUFREQ=off lets the system boot again.

regards, Jogchum
Comment 55 Jogchum Reitsma 2007-09-25 09:25:55 UTC
The problems started between alpha7 and beta1. What are the differences in CPU-frequency handling between those releases? Could there be a clue?

As always, just my 2 cents.

regards, Jogchum
Comment 56 Thomas Renninger 2007-09-25 10:23:08 UTC
Loading ondemand governor by default came in around that time. Only difference should be that cpufreq already gets active at installation time and that you see the hang there. It shouldn't make a difference at a finished up installation. I'll have a look for other changes...
Comment 57 Thomas Renninger 2007-09-25 13:13:25 UTC
I put a cpufreq debug enabled kernel here:
ftp.suse.com/pub/people/trenn/cpufreq_debug_10_3/kernel-default-2.6.22.7-31.x86_64.rpm

Can you install that one pls.

Next we should make sure the cpufreq modules can be loaded manually (even with passing CPUFREQ=off, AFAIK only hal start script is looking out for this):
Boot the newly installed kernel with CPUFREQ=off and cpufreq.debug=7

modprobe processor (might already be loaded)
modprobe powernow-k8
modprobe cpufreq_ondemand
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor

Now the machine should switch off. You should be able to extract extra cpufreq related debug info from /var/log/messages (from last boot).

If it does not switch off, it possibly might be some strange interference between ACPI modules and cpufreq and we need to dig further.
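
The per-CPU echo lines above can be written as a loop that only touches CPUs which actually have a cpufreq directory (this machine has two CPUs, so cpu2 and cpu3 do not exist). A sketch under the assumption of the usual sysfs layout; set_governor and PAUSE are made-up names, and the demo below runs against a scratch directory instead of the real /sys.

```shell
#!/bin/bash
# set_governor is an illustrative helper. It writes the governor name to
# every cpuN/cpufreq/scaling_governor below the given sysfs CPU directory
# and skips CPUs that have no cpufreq directory.

set_governor() {
    local governor="$1" cpu_root="${2:-/sys/devices/system/cpu}"
    local f
    for f in "$cpu_root"/cpu[0-9]*/cpufreq/scaling_governor; do
        [ -e "$f" ] || continue      # glob did not match anything
        echo "setting $governor on $f"
        sleep "${PAUSE:-0}"          # optional pause, to see which CPU triggers the hang
        echo "$governor" >"$f"
    done
}

# Demo against a scratch tree; on the real machine run "set_governor ondemand" as root.
demo="$(mktemp -d)"
mkdir -p "$demo/cpu0/cpufreq" "$demo/cpu1/cpufreq"
echo performance >"$demo/cpu0/cpufreq/scaling_governor"
echo performance >"$demo/cpu1/cpufreq/scaling_governor"
set_governor ondemand "$demo"
cat "$demo"/cpu*/cpufreq/scaling_governor
rm -rf "$demo"
```

The message is printed before the write on purpose, so the last line on the console names the CPU that was being switched when the machine went down.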

As said, you might want to contact me via ICQ, I can help then interactively that might speed up things...
Comment 58 Jogchum Reitsma 2007-09-25 15:19:13 UTC
System freezes - no power off, but cursor disappears and system is non-responding - after modprobe powernow-k8.

Messages from /var/log/messages:
--------------------------------------------------------------------------------
Sep 25 16:21:55 souder-exp kernel: powernow-k8: Found 2 AMD Opteron(tm) Processor 244    processors (version 2.00.00)
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: trying to register driver powernow-k8
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: adding CPU 0
Sep 25 16:21:55 souder-exp kernel: powernow-k8:    0 : fid 0xa, vid 0x6
Sep 25 16:21:55 souder-exp kernel: powernow-k8:    1 : fid 0x2, vid 0x12
Sep 25 16:21:55 souder-exp kernel: powernow-k8:    0 : fid 0xa (1800 MHz), vid 0x6
Sep 25 16:21:55 souder-exp kernel: powernow-k8:    1 : fid 0x2 (1000 MHz), vid 0x12
Sep 25 16:21:55 souder-exp kernel: powernow-k8: cpu0, init lo 0x60a, hi 0x1
Sep 25 16:21:55 souder-exp kernel: powernow-k8: policy current frequency 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: freq-table: table entry 0: 1800000 kHz, 1546 index
Sep 25 16:21:55 souder-exp kernel: freq-table: table entry 1: 1000000 kHz, 4610 index
Sep 25 16:21:55 souder-exp kernel: freq-table: setting show_table for cpu 0 to ffff810055a648c0
Sep 25 16:21:55 souder-exp kernel: powernow-k8: cpu_init done, current fid 0xa, vid 0x6
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: setting new policy for CPU 0: 1000000 - 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: freq-table: request for verification of policy (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: verification lead to (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: request for verification of policy (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: freq-table: verification lead to (1000000 - 1800000 kHz) for cpu 0
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: new min and max freqs are 1000000 - 1800000 kHz
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: governor switch
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: __cpufreq_governor for CPU 0, event 1
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: governor: change or update limits
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: __cpufreq_governor for CPU 0, event 3
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: initialization complete
Sep 25 16:21:55 souder-exp kernel: cpufreq-core: adding CPU 1

-----------------------------------------------------------------------------

I don't remember you mentioning ICQ chat; how can I reach you there, or is that the same as IRC? I'm looking now at the channel openSUSE-bugs, but don't see you there.

Is there a big time-gap? I'm in the Netherlands (as you might have guessed...)

regards, Jogchum                                                                           
Comment 59 Thomas Renninger 2007-09-25 17:31:58 UTC
The only changes that came in between Alpha7 and Beta1 are the switch to ondemand as the default governor and the nohz patches.

First we should check whether it's the ondemand governor breaking things. (Maybe the ondemand governor already tries to set another freq on cpu0 while cpu1 is still initialising, and this Opteron does not like that?)

Could it be that Alpha7 was an updated system (still having the userspace governor set per default)?
Do you have:
CPUFREQ_ENABLED="userspace"
set in /etc/sysconfig/powersave/cpufreq ?

Ohh... with ondemand per default the userspace workaround does not work anymore for machines freezing with ondemand governor. It depends on the switching latency of the driver, but if the driver (powernow-k8) is supposed to work with ondemand, the governor is used automatically at driver load time now.

I placed a kernel with cpufreq debug and with performance governor used per default here:
ftp.suse.com/pub/people/trenn/performance_gov_default_10_3/kernel-default-2.6.22.8-31.x86_64.rpm
modprobe powernow-k8 should work now?
and
modprobe cpufreq_ondemand
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu2/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu3/cpufreq/scaling_governor
freezes the system?

If not it could be that ondemand governor already kicks in for cpu0 while cpu1 still gets initialised...

If yes:
We could only blacklist the CPU revision of these Opterons to not use ondemand per default.
Because cpufreq (with ondemand per default) is used at installation time now where no userspace tools are available, fiddling around in userspace is not a real option here...

Regarding ICQ: I mailed you my data privately. I can also have a look at an IRC channel tomorrow if you prefer that one...
Comment 60 Jogchum Reitsma 2007-09-25 18:46:59 UTC
modprobe powernow-k8 doesn't freeze or power off the system now.
As for cpufreq_ondemand, this module is not found:

souder-exp:~ # modprobe cpufreq_ondemand
FATAL: Module cpufreq_ondemand not found.

/lib/modules/`uname -r`/kernel/arch/x86_64/kernel/cpufreq/ only has

-rw-r--r-- 1 root root 34280 Sep 25 19:16 acpi-cpufreq.ko
-rw-r--r-- 1 root root 44512 Sep 25 19:16 powernow-k8.ko

I suppose the echo commands don't make sense then? BTW, I have only 2 CPUs (no dual-cores), so there's only cpu0 and cpu1.

Read my email now, so I saw your invitations to ICQ; sorry!

I'm afraid I've never used ICQ; I've started Kopete, but the only thing I can do in this UI is add a contact, AFAICS, so how do I make contact with you this way?

Tomorrow I'll be working: leaving 6:45, coming home around 17:45 if public transport is accurate.
Timezone is CEST, so that's no problem.
Comment 61 Thomas Renninger 2007-09-26 14:01:03 UTC
ondemand governor is already compiled in, my fault. Just doing:
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
echo ondemand >/sys/devices/system/cpu/cpu1/cpufreq/scaling_governor
Should activate it.
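
For machines with more CPUs, the same thing can be written as a loop. This is only a sketch: set_governor is a hypothetical helper name, and the optional second argument exists only so the function can be tried against a copy of the directory layout instead of the real /sys. On an affected machine the write is exactly what triggers the problem, so it is meant for reproducing, not as a workaround.

```shell
#!/bin/sh
# set_governor GOVERNOR [CPU_ROOT]
# Hypothetical helper: writes GOVERNOR into every
# cpuN/cpufreq/scaling_governor found under CPU_ROOT
# (default: /sys/devices/system/cpu) and prints what each
# file contains afterwards.
set_governor() {
    gov=$1
    root=${2:-/sys/devices/system/cpu}
    for file in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs without a writable cpufreq interface
        # (this also covers the case where the glob matched nothing).
        [ -w "$file" ] || continue
        cpu=${file%/cpufreq/scaling_governor}
        cpu=${cpu##*/}
        echo "$gov" > "$file"
        echo "$cpu: $(cat "$file")"
    done
}
```

Run as root, for example: set_governor ondemand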

I could reproduce this once on a 4 socket Opteron Dual Core machine.
Debug info or printk seems to prevent the probable lock/race condition.

I just want to mention that this looks severe. Investigating...
Comment 62 Thomas Renninger 2007-09-26 16:11:54 UTC
I tried:
- MUTEX_DEBUG, PROVE_LOCKING configs
- I also tried to 100% reproduce this with some delays, no luck until now

It's probably hanging at:
  lock_policy_rwsem_write(cpu);
in
  cpufreq_add_dev(..) in drivers/cpufreq/cpufreq.c

But this lock is scattered all over the cpufreq core, and I couldn't even reproduce this at all anymore.

We can either remove CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND and set CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y again.
We would then lose the ability to have cpufreq set up at installation time.

Or people need to add brokenmodules=powernow-k8 to install on affected machines and CPUFREQ=off later until this is finally tracked down.

Jogchum: I won't have time at around 19:00, maybe later, you may want to ping me through mail or icq.
Comment 63 Jogchum Reitsma 2007-09-27 09:44:26 UTC
echo ondemand >/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

is enough to let the system power-off.
Comment 64 Thomas Renninger 2007-09-27 10:06:27 UTC
This all is very weird.
For the logs in #58, I expect you hit a rare deadlock condition (which I think I also ran into once; I have now set up machines rebooting all the time to see how often this gets hit, and it seems to happen very rarely). I expect this sometimes happens with:
ondemand as default governor + slower freq switching Opterons + SMP, but this is only a rough guess.

But I have no explanation why your machine reboots.
Especially with the debug kernel where you could load the powernow-k8 module (performance governor active), then you activated ondemand. This has nothing to do with the change that we switched to ondemand per default, it's the normal way it should have worked for a long time.
Has this been an updated system? (We had problems with ondemand a long time ago and we blacklisted some machines to use the userspace governor instead; that would explain why it worked before.)
A bit of a problem is that because of ondemand per default we cannot blacklist this machine anymore to use the userspace governor. If you still have the kernel from comment #59/#60 it's possible.
Normally this should work: set CPUFREQ_ENABLED="userspace" in /etc/sysconfig/
Be sure you have not started a desktop (kpowersaved or gnome-power-manager might explicitly activate the ondemand gov). You need to restart powersaved then. Whether userspace is active can be checked here:
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
If the CPU has load it must switch up: e.g. cat /dev/zero >/dev/null should produce 100% load on one processor.
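
A small sketch for watching this on all CPUs at once. cpufreq_state is a hypothetical helper name; it assumes each cpufreq directory provides the scaling_governor and scaling_cur_freq files, and the optional argument is only there so it can be pointed at a test directory instead of the real /sys.

```shell
#!/bin/sh
# cpufreq_state [CPU_ROOT]
# Hypothetical helper: prints "cpuN <governor> <current freq> kHz"
# for every CPU that has a cpufreq directory under CPU_ROOT
# (default: /sys/devices/system/cpu).
cpufreq_state() {
    root=${1:-/sys/devices/system/cpu}
    for dir in "$root"/cpu[0-9]*/cpufreq; do
        [ -d "$dir" ] || continue
        cpu=${dir%/cpufreq}
        cpu=${cpu##*/}
        echo "$cpu $(cat "$dir/scaling_governor") $(cat "$dir/scaling_cur_freq") kHz"
    done
}
```

Calling cpufreq_state once when idle and once while the cat /dev/zero >/dev/null load is running should show whether the frequency switches up.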
Comment 65 Holger Macht 2007-09-27 10:30:47 UTC
(In reply to comment #64 from Thomas Renninger)
> Normally this should work: set CPUFREQ_ENABLED="userspace" in /etc/sysconfig/

It has to be CPUFREQ_CONTROL="userspace" in /etc/sysconfig/cpufreq
Comment 66 Thomas Renninger 2007-09-27 11:11:47 UTC
Created attachment 175114 [details]
Do not use ondemand per default on opterons

This one should be a safe way to use performance governor per default on opterons (It's switched to ondemand later via userspace tools).
Blacklisting could be done more cleverly (e.g. all SMP AMDs without fire and forget or similar).
I think we should not add it that late because of one bug report (I still haven't run into a machine hang again), and it wouldn't help for this specific machine anyway.

Even if the deadlock condition is found along with a possible fix, modifying the complicated rw semaphore code paths is nothing we want in RC3 or an update... Once this has been evaluated a bit further we might want to provide something like the attached patch in an update...
Comment 67 Thomas Renninger 2007-09-27 11:22:26 UTC
Maybe the power off has to do with nohz changes + tsc (which gets unstable with cpufreq) as time source -> notsc boot parameter could help then.

If not, I'd also like to see extended acpi debug output when cpufreq gets activated, maybe we could do that together via chat, you might want to contact me directly via mail.

If possible, a 10.2 or AlphaX installed system could also help debugging. If it's still there, just checking whether cpufreq worked there could give a hint, maybe it was never activated.
Or maybe HW got some damage between Alpha7 and next update (probably not..., but you never know...)
Comment 68 Thomas Renninger 2007-09-28 09:39:23 UTC
It doesn't look that severe -> downgrading:

- I let two machines run over night to hit the deadlock case -> no freeze
  (powernow-k8 cannot be unloaded).

- Jogchum tried a 10.2, cpufreq never worked there because of an ACPI bug
  (see comment #42) he gets: register performance failed: bad ACPI data

If there really is a deadlock condition we should get some more reports soon, I won't waste time on this any more for now.

It may be that cpufreq never worked on Jogchum's machine. Next we will try whether 10.3 Alpha6 really has cpufreq working.

I could imagine the power off comes from a broken voltage regulator. That would mean cpufreq was never activated before and he has broken HW.
It could also be a side effect of the ACPI fix for the rare package declaration (see comment #42) -> but there should not have been changes at this place between Alpha7/Beta1.
Comment 69 Thomas Renninger 2007-09-28 14:09:14 UTC
Ouch.
Thanks to Jogchum I got modprobe powernow-k8 with some more ACPI debug enabled.
It's a BIOS bug: the ACPI tables have cpufreq info for CPU0, but the cpufreq info for CPU1 is simply missing.

Something could be added to bail out of the driver more gracefully. Currently the powernow-k8 driver even loads (which it did not with e.g. 10.2):

cpufreq-core: initialization complete
cpufreq-core: adding CPU 1
 nsutils-0454 [00] ns_build_internal_name: Returning [ffff81007cfd9560] (rel) "_PCT"
 nsutils-0869 [00] ns_get_node           : _PCT, AE_NOT_FOUND
processor_perflib-0312 [00] processor_get_performa: ACPI-based processor performance control unavailable
powernow-k8: register performance failed: bad ACPI data
powernow-k8: MP systems not supported by PSB BIOS structure
cpufreq-core: initialization failed
cpufreq-core: driver powernow-k8 up and running

It's a bit strange that the behaviour changed even between 10.2 and 10.3, but there is not much we can do here (besides exiting gracefully).
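
To see whether another machine shows the same symptom (driver loaded, but cpufreq initialised only for some CPUs), one can list the CPUs that lack a cpufreq directory. A minimal sketch: cpus_without_cpufreq is a hypothetical helper name, and it assumes that a CPU whose cpufreq initialization failed has no cpufreq directory in sysfs.

```shell
#!/bin/sh
# cpus_without_cpufreq [CPU_ROOT]
# Hypothetical helper: prints the name of every cpuN directory
# under CPU_ROOT (default: /sys/devices/system/cpu) that has no
# cpufreq subdirectory.
cpus_without_cpufreq() {
    root=${1:-/sys/devices/system/cpu}
    for cpu in "$root"/cpu[0-9]*; do
        [ -d "$cpu" ] || continue
        [ -d "$cpu/cpufreq" ] || echo "${cpu##*/}"
    done
}
```

No output means every CPU has a cpufreq interface. With ACPI tables like the ones described here, the expectation would be that cpu1 is listed after powernow-k8 has been loaded.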


> I let two machines run over night to hit the deadlock case (the one was the
> machine which I thought froze) -> no freeze
Those are still happily rebooting. Maybe this was a false alarm...


Here again the board:
MSI K8N Master2FAR

Someone should tell MSI about this issue.
Jogchum, you may want to monitor the support sites of MSI and look out for a possibly upcoming BIOS.
Comment 70 Thomas Renninger 2007-09-28 14:11:37 UTC
Created attachment 175489 [details]
Modified DSDT with CPU freq info added for CPU1

You may want to follow comment #42 and try this DSDT; with some luck it works.
Still, you should nag MSI for a new BIOS and revert the modified DSDT again after testing by removing the added entry in /etc/sysconfig/kernel and invoking mkinitrd afterwards.