Bug 145366

Summary: Hard Lockup on Amilo 1667G
Product: [openSUSE] SUSE LINUX 10.0 Reporter: Gerhard Fuernkranz <nospam456>
Component: KernelAssignee: Andreas Kleen <ak>
Status: RESOLVED WONTFIX QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: david.keck, jacob.shin, mark.langsdorf, nospam456, trenn
Version: Final   
Target Milestone: ---   
Hardware: x86-64   
OS: SuSE Linux 10.0   
Whiteboard:
Found By: Customer Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: hwinfo
lspci output
dmesg output - kernel boot messages
lsmod output
Debug messages when repeatedly toggling the "silent" button
Netconsole output before lockup
One more panic
Don't ack apic when disabled
1st new panic
2nd new panic
3rd new panic - divide error

Description Gerhard Fuernkranz 2006-01-25 00:09:22 UTC
I'm running SuSE 10.0 (32bit) with the latests YOU updates on
a notebook FSC Amilo 1667G, with a Mobile Athlon64 4000+.

After about 15min..2h, the machine locks up completely. No
console messages, no ping, no reaction to keyboard, no console
switching, no SysRq - completely dead.

I'm pretty confident that the lockup has to do with the
PowerNow stuff. The notebook did run stabe for 24h, when
I did not start powersaved (although only @ 800 Mhz),
and it also appears to run stable with powersaved running
and CPU frequency set to maximum (-> powersave -f).
However, with dynamic frequency scaling, I always get a hard
lockup after approx 15min..2h.

What infomation could I provide for diagnosis of the lockup?


Another interesting related issue I noticed:

If BATTERY_SCHEME=performance, and if CPUFREQUENCY="performance"
is set in the "performance" scheme, then the frequency nevertheless
goes down from 2600 to 800 MHz, when I disconnect AC power.
Comment 1 Gerhard Fuernkranz 2006-01-25 00:11:42 UTC
Created attachment 64869 [details]
hwinfo
Comment 2 Gerhard Fuernkranz 2006-01-25 00:12:38 UTC
Created attachment 64870 [details]
lspci output
Comment 3 Gerhard Fuernkranz 2006-01-25 00:14:48 UTC
Created attachment 64872 [details]
dmesg output - kernel boot messages

These messages are captured with "powersave -f",
where the machines runs stable.
Comment 4 Gerhard Fuernkranz 2006-01-25 00:25:45 UTC
Btw: To exclude ndiswrapper, fglrx or vmware modules as
possible reasons for the problem, I also did run the kernel
without these modules loaded (on alpha console only, w/o X11)
and also ran into the lockup after some time.
Comment 5 Gerhard Fuernkranz 2006-01-25 00:38:31 UTC
Created attachment 64876 [details]
lsmod output
Comment 6 Hannes Reinecke 2006-01-25 11:14:49 UTC
Oh, great. Powersaved again.

Andi, could you have a look?
Comment 7 Forgotten User ZhJd0F0L3x 2006-01-25 11:46:40 UTC
powersaved is probably not guilty since we are only doing legal stuff ;-)
Comment 8 Thomas Renninger 2006-01-25 16:35:33 UTC
If BATTERY_SCHEME=performance, and if CPUFREQUENCY="performance"
is set in the "performance" scheme, then the frequency nevertheless
goes down from 2600 to 800 MHz, when I disconnect AC power.
-> BIOS limits CPU frequencies on battery (on some machine you can modify that in BIOS settings). However, cpufreq core could have a little problem with that if BIOS also lowers the freq behind the kernels back -> I have a patch for that in the queue. Hmm but that should not lock up  the machine, but it would be worth a try.

Maybe this only happens if you have unplugged/replugged the AC adapter at least once?

Does it work if you use the userspace governor:
CPUFREQ_CONTROL="userspace"
in /etc/sysconfig/powersave/cpufreq
Comment 9 Gerhard Fuernkranz 2006-01-25 18:32:26 UTC
> BIOS limits CPU frequencies on battery

That's what I basically did susect as well.

> (on some machine you can modify that in BIOS settings).

In my BIOS I did not find such settings, so it might
be hardcoded.

> However, cpufreq core could have a little problem with that if
> BIOS also lowers the freq behind the kernels back -> I have a
> patch for that in the queue. Hmm but that should not lock up
> the machine, but it would be worth a try.
>
> Maybe this only happens if you have unplugged/replugged the
> AC adapter at least once?

To clarify:

The lockup also happens with AC power connected all the time.
It is NOT necessary to disconnect AC power to get the lockup.
It just happens after some time, when dynamic frequency scaling
is turned on, and as said, I have not seen this lockup so far
with powersaved stopped, and the machine also seems to run stable
with "powersave -f" in effect (i.e. with CPU at full speed).

The obvious frequency limiting by the BIOS on battery operation
was just an additional observation, but is likely not the root
cause of my lockup. I can attach the ACPI events logged during
AC disconnection and reconnection though, if this is helpful.

I'll also test whether dynamic frequency scaling via the the
userspace governor makes a difference (i.e. whether it also
leads to a lockup after some time).

Is there any additional info you need from my machine?
Comment 10 Gerhard Fuernkranz 2006-01-26 10:00:41 UTC
I did switch to the userspace governor, and yesterday
evening the machine was indeed running for several hours
without locking up, and it also did survive the last night
without freezing :-) Looks good so far, and I'll keep it
under observation.

Of course, this still does not explain why I've seen
the lockups with the kernel governor. I guess, it is
basically expected to work as well, isn't it?
Comment 11 Thomas Renninger 2006-01-26 13:05:42 UTC
I also expect it not to be the freq limitation...

You see the acpi events that happened in:
/var/log/acpid or just call acpi_listen you see the events popping up there..
would be a good idea, may be some strange processor event is happening shortly before?

How does the machine freeze? Can you still push some NUMLOCK button and the LED is still reacting?

Try to boot with sysreq=1 or better set ENABLE_SYSRQ="yes" in /etc/sysconfig/sysctl (needs rebooting to be activated);
If the machine hangs, can you still send the kernel sysreq requests?
If it hangs and sysreq is active try: SysRq-T. (Best is to use a serial console for that, so that you can C&P the output, if this is not possible you may want to try to write us the last functions the machine is in by hand).

When the LED/sysreq is still working it could just be a semaphore lock?
When not, maybe there happen too much switches in too short time?
Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_*
Especially the sampling_rate_min you should try a value of *100 and *10.

Another debug try would be to install the kernel-debug rpm. When you booted the kernel do: echo 7 >/sys/modules/cpufreq/debug.
In this case the kernel writes out additional messages on every cpufreq change, hopefully we get an idea what is happening then.
Comment 12 Gerhard Fuernkranz 2006-01-26 13:41:23 UTC
> How does the machine freeze? Can you still push some NUMLOCK
> button and the LED is still reacting?

The keyboard LEDs are no longer responding to numlock, etc.

> Try to boot with sysreq=1 or better set ENABLE_SYSRQ="yes" in
> /etc/sysconfig/sysctl (needs rebooting to be activated);
> If the machine hangs, can you still send the kernel sysreq requests?

I already tried to enable SYSRQ - it also did not work
any more when the machine freezes.

> If it hangs and sysreq is active try: SysRq-T. (Best is to use
> a serial console for that, so that you can C&P the output, if
> this is not possible you may want to try to write us the last
> functions the machine is in by hand).

Unfortunately the notebook does not have a serial port at all.

> When not, maybe there happen too much switches in too short time?
> Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_*
> Especially the sampling_rate_min you should try a value of *100 and *10.

I'll try to play with these parameters.

Btw:
If I try the debug kernel, do I need some special settings to get 
all messages written by the kernel synchronously to the console?
(Otherwise, if the messages are logged asynchronously, e.g. via
syslog and the machine freezes, then I guess I will not see the most
recent messages before the freeze any more - but these messages are
likely the most interesting ones)
Comment 13 Thomas Renninger 2006-01-26 14:34:48 UTC
IIRC they are written synchronously.
But you are right, we might miss the most interessting ones.
You might want to setup a net console, you find the docs (it's really easy to setup) in the kernel Documentation/networking dir:
linux-2.6.15/Documentation/networking/netconsole.txt
Of course you need a working network and a second machine.
Boot the debug kernel, setup the netconsole and echo 7 >/sys/modules/cpufreq/debug and we hopefully see the last tries in powernow-k8.
Did the incrementing of sampling_rate_* help?
Comment 14 Forgotten User ZhJd0F0L3x 2006-01-26 14:40:45 UTC
(In reply to comment #11)
> Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_*
> Especially the sampling_rate_min you should try a value of *100 and *10.

Those will be overwritten by the powersave daemon.
But we are running the kernel governor with pretty conservative settings anyway:

seife@strolchi:~> grep . /sys/devices/system/cpu/cpu0/cpufreq/ondemand/*
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load:0
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_down_factor:1
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate:333000
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_max:40000000
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_min:40000
/sys/devices/system/cpu/cpu0/cpufreq/ondemand/up_threshold:80
Comment 15 Thomas Renninger 2006-01-26 15:46:17 UTC
That could be it.
Powersave daemon must not override the sampling rate_min(it may increase the initial sampling_rate_min but not lower it!) as it is calculated by the transition latency of the hardware.

Strange that it is not modified on my machine (AMD64):
cat ondemand/sampling_rate_min
620000
If I calculated right, this refers to a transition latency of 1240 nano seconds.
A value of 40000 assumes a transition latency of 80 nano seconds, quite fast...
And this only refers to the transition latency, the max waiting time between transitions could be much higher.

Please check the value and verify whether the system still freezes with a value of 1000000.
Comment 16 Forgotten User ZhJd0F0L3x 2006-01-26 19:19:49 UTC
(In reply to comment #15)
> That could be it.
> Powersave daemon must not override the sampling rate_min

you can't. It's read-only:
root@strolchi:/sys/devices/system/cpu/cpu0/cpufreq/ondemand> ls -l sampling_rate_min
-r--r--r-- 1 root root 4096 2006-01-26 20:11 sampling_rate_min

> Strange that it is not modified on my machine (AMD64):

no. not strange.

> cat ondemand/sampling_rate_min
> 620000
> If I calculated right, this refers to a transition latency of 1240 nano
> seconds.
> A value of 40000 assumes a transition latency of 80 nano seconds, quite fast...

Pentium M 2GHz

> Please check the value and verify whether the system still freezes with a value
> of 1000000.

you cannot set sampling_rate_min, it is purely for your information.

Also,powersaved cannot do anything stupid:
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate_min
40000
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate
333000
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> echo 39999 > sampling_rate
bash: echo: write error: Invalid argument
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate
333000
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> echo 40000 > sampling_rate
root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate
40000

So if we try to write values that the driver does not like, it just refuses it. If here is a bug, it is either a driver or a BIOS bug. No powersaved bug.
Comment 17 Gerhard Fuernkranz 2006-01-27 10:04:39 UTC
> You might want to setup a net console, you find the docs
> (it's really easy to setup) in the kernel Documentation/networking dir:
> linux-2.6.15/Documentation/networking/netconsole.txt

Thanks for the hint. I was not aware of the existance of
the netconsole before - I only knew the serial console.
Tried it, and seems to work fine (though only over Ethernet,
and not with my WLAN).
Comment 18 Gerhard Fuernkranz 2006-01-27 10:08:56 UTC
> Try to increase: /sys/devices/system/cpu/cpu0/cpufreq
> /ondemand/sampling_rate_*
> Especially the sampling_rate_min you should try a value of *100 and *10.

Unfortunately sampling_rate_min seems to be read-only.
I can chmod the file, and then echo a new value, but nevertheless
it has no effect. Reading it after modification gives still
the old value.

Btw, on my machine, sampling_rate_min is 620000.
Comment 19 Thomas Renninger 2006-01-27 11:57:12 UTC
Unfortunately sampling_rate_min seems to be read-only.
-> OK, sorry for the confusion.
Having a closer look reveals that sampling_rate_min isn't used in the governor anyway.
Maybe it helps increasing sampling_rate, I have 1240000 us, 1,24s. What is your value there?
It is the only thing coming in my mind for now the ondemand could cause a hard lookup: switching too fast.
If this does not work, maybe you can compile your own kernel and we can add some printks to verify that ondemand governor really waits long enough until it does the next switching:

- install kernel-source.rpm
- cd /usr/src/linux
- cp arch/x86_64/defconfig.default .config
- edit .config and modify CONFIG_CPU_FREQ_DEBUG to CONFIG_CPU_FREQ_DEBUG=y
- make all modules_install install
- edit bootloader config to point your old kernel (you never know) and the new
  one that should now be installed in /boot.

echo 7 >/sys/modules/cpufreq/parameters/debug
could already be sufficient to check whether the ondemand governor waits long enough between switches...
If you really have to go that far, please set up a net console and send the whole output until it freezes with this setup, maybe we see something suspicious.

I could imagine that some external event ACPI processor? is forcing the cpufreq-core to reset the speed. In this case it could be that the processor has not calmed down from the last set. Ondemand governor does not know about that, it's workqueue gets activated some us after that, it tries to switch -> clash!
In theory the userspace governor should also be able to freeze the machine in this scenario, but may it's just because it does not switch that often and fast as ondemand governor, so nobody recognised userspace freezes until now?
Comment 20 Gerhard Fuernkranz 2006-01-27 15:11:48 UTC
> - install kernel-source.rpm
> - cd /usr/src/linux
> - cp arch/x86_64/defconfig.default .config

Really x86_64?
I'm running on an Athlon64 hardware, but the 32-bit kernel.

I'm also not sure: Did the Yast Online Update also upgrade my
installed kernel-src package, or only the binary kernel package?
Or in other words: Do the kernel-src and the binary kernel package
still match after the YOU? Or do I get an "old" (i.e. the
original 10.0) kernel, if I compile the sources?

[...]
> If you really have to go that far, please set up a net console

Sure, otherwise it's rather useless. Let's see if I find some
time next weekend to build a debug kernel.

One problem is indeed, that I do not know how to enforce the
lockup. Yesterday evening the machine was even running stable
for a couple of hours with the "ondemand" governor, and I was even
afraid, that I can't reproduce it at all any more. But finally,
I kept it turned on over night - and in the morning it was frozen
again. And at other times, I logged in after booting, and the
machine was frozen after a few minutes. Basically, my evidence that
the problem might likely be related to the "ondemand" governor
comes from the fact that that I have never observed it with fixed
frequency so far. Of course other causes are still not impossible.
With the userspace governor, I have only one night experience
(no lockup) so far, which is probably not long enough to claim
with statisitcal significance, that it always works with the
userspace governor.
Comment 21 Gerhard Fuernkranz 2006-01-27 15:19:10 UTC
> I could imagine that some external event ACPI processor? is forcing
> the cpufreq-core to reset the speed.

As said, disconnecting AC power seems to force the processor to
run at 800 MhZ. It also does generate a "CPU" event in addition
to the AC and battery events. I also have a "silent" button,
which, if turned on, also forces the CPU to run at 800 MhZ. It
also generates CPU events, when it is switch on or off (but no
AC or battery events). But so far, I have not been able to
reproduce/force a lockup simply by repeatedly disconnecting/
reconnecting AC power or by turning the silent button on/off.
Comment 22 Thomas Renninger 2006-01-27 15:27:16 UTC
You can run a 64-bit kernel below an 32-bit userspace linux.
Would be interesting if the problems vanish in this case...

To reproduce, maybe something like helps?:
while true;do dd if=/dev/zero of=/dev/shm/tmp bs=$((1024*1024)) count=1000;sync;sleep 3;done
Depending of how fast your machine is you should modify the values a bit (copying 100MB junks to memory in this case).
watch -n1 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
should show you up/down going freqs.
Comment 23 Gerhard Fuernkranz 2006-01-27 15:42:45 UTC
> As said, disconnecting AC power seems to force the processor to
> run at 800 MhZ. It also does generate a "CPU" event in addition
> to the AC and battery events. I also have a "silent" button,
> which, if turned on, also forces the CPU to run at 800 MhZ. It
> also generates CPU events, when it is switch on or off (but no
> AC or battery events). But so far, I have not been able to
> reproduce/force a lockup simply by repeatedly disconnecting/
> reconnecting AC power or by turning the silent button on/off.

Forgot to mention:
I experimented with the "silent" button in conjunction with the
plerformance governor. When turning the button on/off several
times, I did observer a frequence switch between 2600 and 800 MhZ.

But after a couple of on/off cycles, I ended up in a situation
where the frequency was locked at 800 MhZ, even with released
"silent" button. In this state, I was no longer able to switch
to a higher frequence any more, even not by changing the governor
to powersave back to performance. Only a reboot helped to escape.
Comment 24 Gerhard Fuernkranz 2006-01-27 15:58:51 UTC
(In reply to comment #22)
> You can run a 64-bit kernel below an 32-bit userspace linux.

But I guess, ndiswrapper and fglrx won't work any longer, will they?

For a test, that's ok, but eventuelly, I need them for WLAN, and my
graphics card did not work with the open source xorg driver, but only
with the ATI driver, which implies the fglrx kernel module :-(
Comment 25 Forgotten User ZhJd0F0L3x 2006-01-27 16:04:37 UTC
(In reply to comment #22)
> To reproduce, maybe something like helps?:
> while true;do dd if=/dev/zero of=/dev/shm/tmp bs=$((1024*1024))
> count=1000;sync;sleep 3;done

This will allocate 1G in tmpfs. The machine will probably lock up due to this, unless it has much more than 1G of RAM.

> Depending of how fast your machine is you should modify the values a bit
> (copying 100MB junks to memory in this case).

1G, not 100MB.

Just doing "dd if=/dev/zero bs=1M count=1000 > /dev/null" should do equally well to push the CPU.
Comment 26 Gerhard Fuernkranz 2006-01-27 23:18:04 UTC
Below is what happens when I turn my "silent" button on or off.
Is that the behaviour you would expect?
Does it mean, that the BIOS changes the frequency unexpectedly
unter the kernels back? Or does the BIOS gracefully ask the kernel
to change the frequency anyway?

ACPI events:
------------

silent botton on:
processor CPU1 00000080 00000006

silent button off:
processor CPU1 00000080 00000000

debug messages:
---------------

silent button on:

cpufreq-core: updating policy for CPU 0
cpufreq-core: setting new policy for CPU 0: 800000 - 2600000 kHz
freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0
freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0
freq-table: request for verification of policy (800000 - 800000 kHz) for cpu 0
freq-table: verification lead to (800000 - 800000 kHz) for cpu 0
cpufreq-core: new min and max freqs are 800000 - 800000 kHz
cpufreq-core: governor: change or update limits
cpufreq-core: __cpufreq_governor for CPU 0, event 3
cpufreq-core: target for CPU 0: 800000 kHz, relation 1
powernow-k8: targ: cpu 0, 800000 kHz, min 800000, max 800000, relation 1
powernow-k8: targ: curr fid 0x12, vid 0xa
freq-table: request for target 800000 kHz (relation: 1) for cpu 0
freq-table: target is 6 (800000 kHz, 6144)
powernow-k8: cpu 0 transition to index 6
powernow-k8: table matched fid 0x0, giving vid 0x18
powernow-k8: cpu 0, changing to fid 0x0, vid 0x18
cpufreq-core: notification 0 of frequency transition to 800000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz
powernow-k8: ph1 (cpu0): start, currfid 0x12, currvid 0xa, reqvid 0x18, rvo 0x2
powernow-k8: ph1 maxvid=0x6
powernow-k8: ph1 complete, currfid 0x12, currvid 0xa
powernow-k8: ph2 (cpu0): starting, currfid 0x12, currvid 0xa, reqfid 0x0
powernow-k8: writing fid 0x10, lo 0x10a10, hi 0x190
powernow-k8: writing fid 0xe, lo 0x10a0e, hi 0x190
powernow-k8: writing fid 0xc, lo 0x10a0c, hi 0x190
powernow-k8: writing fid 0xa, lo 0x10a0a, hi 0x190
powernow-k8: writing fid 0x0, lo 0x10a00, hi 0x190
powernow-k8: ph2 complete, currfid 0x0, currvid 0xa
powernow-k8: ph3 (cpu0): starting, currfid 0x0, currvid 0xa
powernow-k8: writing vid 0x18, lo 0x11800, hi 0x1
powernow-k8: ph3 complete, currfid 0x0, currvid 0x18
powernow-k8: transitioned (cpu0): new fid 0x0, vid 0x18
cpufreq-core: notification 1 of frequency transition to 800000 kHz
cpufreq-core: scaling loops_per_jiffy to 3205083 for frequency 800000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz

silent button off:

cpufreq-core: updating policy for CPU 0
cpufreq-core: setting new policy for CPU 0: 800000 - 2600000 kHz
freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0
freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0
freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0
freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0
cpufreq-core: new min and max freqs are 800000 - 2600000 kHz
cpufreq-core: governor: change or update limits
cpufreq-core: __cpufreq_governor for CPU 0, event 3
cpufreq-core: target for CPU 0: 2600000 kHz, relation 1
powernow-k8: targ: cpu 0, 2600000 kHz, min 800000, max 2600000, relation 1
powernow-k8: targ: curr fid 0x0, vid 0x18
freq-table: request for target 2600000 kHz (relation: 1) for cpu 0
freq-table: target is 0 (2600000 kHz, 2578)
powernow-k8: cpu 0 transition to index 0
powernow-k8: table matched fid 0x12, giving vid 0xa
powernow-k8: cpu 0, changing to fid 0x12, vid 0xa
cpufreq-core: notification 0 of frequency transition to 2600000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz
cpufreq-core: scaling loops_per_jiffy to 10416519 for frequency 2600000 kHz
powernow-k8: ph1 (cpu0): start, currfid 0x0, currvid 0x18, reqvid 0xa, rvo 0x2
powernow-k8: ph1 maxvid=0x6
powernow-k8: ph1: curr 0x18, req vid 0xa
powernow-k8: writing vid 0x17, lo 0x11700, hi 0x1
powernow-k8: ph1: curr 0x17, req vid 0xa
powernow-k8: writing vid 0x16, lo 0x11600, hi 0x1
powernow-k8: ph1: curr 0x16, req vid 0xa
powernow-k8: writing vid 0x15, lo 0x11500, hi 0x1
powernow-k8: ph1: curr 0x15, req vid 0xa
powernow-k8: writing vid 0x14, lo 0x11400, hi 0x1
powernow-k8: ph1: curr 0x14, req vid 0xa
powernow-k8: writing vid 0x13, lo 0x11300, hi 0x1
powernow-k8: ph1: curr 0x13, req vid 0xa
powernow-k8: writing vid 0x12, lo 0x11200, hi 0x1
powernow-k8: ph1: curr 0x12, req vid 0xa
powernow-k8: writing vid 0x11, lo 0x11100, hi 0x1
powernow-k8: ph1: curr 0x11, req vid 0xa
powernow-k8: writing vid 0x10, lo 0x11000, hi 0x1
powernow-k8: ph1: curr 0x10, req vid 0xa
powernow-k8: writing vid 0xf, lo 0x10f00, hi 0x1
powernow-k8: ph1: curr 0xf, req vid 0xa
powernow-k8: writing vid 0xe, lo 0x10e00, hi 0x1
powernow-k8: ph1: curr 0xe, req vid 0xa
powernow-k8: writing vid 0xd, lo 0x10d00, hi 0x1
powernow-k8: ph1: curr 0xd, req vid 0xa
powernow-k8: writing vid 0xc, lo 0x10c00, hi 0x1
powernow-k8: ph1: curr 0xc, req vid 0xa
powernow-k8: writing vid 0xb, lo 0x10b00, hi 0x1
powernow-k8: ph1: curr 0xb, req vid 0xa
powernow-k8: writing vid 0xa, lo 0x10a00, hi 0x1
powernow-k8: ph1: changing vid for rvo, req 0x9
powernow-k8: writing vid 0x9, lo 0x10900, hi 0x1
powernow-k8: ph1: changing vid for rvo, req 0x8
powernow-k8: writing vid 0x8, lo 0x10800, hi 0x1
powernow-k8: ph1 complete, currfid 0x0, currvid 0x8
powernow-k8: ph2 (cpu0): starting, currfid 0x0, currvid 0x8, reqfid 0x12
powernow-k8: writing fid 0xa, lo 0x1080a, hi 0x190
powernow-k8: writing fid 0xc, lo 0x1080c, hi 0x190
powernow-k8: writing fid 0xe, lo 0x1080e, hi 0x190
powernow-k8: writing fid 0x10, lo 0x10810, hi 0x190
powernow-k8: writing fid 0x12, lo 0x10812, hi 0x190
powernow-k8: ph2 complete, currfid 0x12, currvid 0x8
powernow-k8: ph3 (cpu0): starting, currfid 0x12, currvid 0x8
powernow-k8: writing vid 0xa, lo 0x10a12, hi 0x1
powernow-k8: ph3 complete, currfid 0x12, currvid 0xa
powernow-k8: transitioned (cpu0): new fid 0x12, vid 0xa
cpufreq-core: notification 1 of frequency transition to 2600000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz
Comment 27 Gerhard Fuernkranz 2006-01-28 17:22:02 UTC
Created attachment 65543 [details]
Debug messages when repeatedly toggling the "silent" button

(In reply to comment #23)

> I experimented with the "silent" button in conjunction with the
> performance governor. When turning the button on/off several
> times, I did observer a frequence switch between 2600 and 800 MhZ.
> 
> But after a couple of on/off cycles, I ended up in a situation
> where the frequency was locked at 800 MhZ, even with released
> "silent" button. In this state, I was no longer able to switch
> to a higher frequence any more, even not by changing the governor
> to powersave back to performance. Only a reboot helped to escape.

This attachment shows the debug messages printed by the kernel, when the frequency lock to 800 MzH occurs as a result of repeatedly turning the "silent" button on/off.
Comment 28 Thomas Renninger 2006-01-28 17:47:19 UTC
to comment #22, it is possible to run a 64-bit kernel below a 32-bit userspace system but you would have sever problems compiling it on a 32-bit system. Better try to install the x86_64 binary kernel-default*.rpm if you want to try that one.
To be honest I haven't tried myself, only heard that it works, if you have problems, maybe it is even easier to try a small x86_64 installation if you still have a free partition.

I just wanted to mention that before you wast time and try it.
I will have a closer look at the logs on Monday...
Comment 29 Gerhard Fuernkranz 2006-01-28 19:50:05 UTC
Created attachment 65544 [details]
Netconsole output before lockup

I just got my lockup again (with a self-compiled 32-bit kernel).
Actually it seems to be a panic. Unfortunately no stack backtrace
in the netconsole output.

[...]
powernow-k8: ph3 complete, currfid 0x0, currvid 0x18
powernow-k8: transitioned (cpu0): new fid 0x0, vid 0x18
cpufreq-core: notification 1 of frequency transition to 800000 kHz
cpufreq-core: scaling loops_per_jiffy to 3205148 for frequency 800000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz
no vm86_info: BAD
Kernel panic - not syncing: Attempted to kill the idle task!
Comment 30 Gerhard Fuernkranz 2006-01-28 20:04:07 UTC
I had another freeze yesterdy night. However, this one was
different than all the other ones I have seen, since it
also garbled my X screen. I had not yet setup netconsole
logging to a file at that time, so I have only the captured
netconsole output from with cut&paste from the xterm running
netcat.

[...]
powernow-k8: ph1: curr 0xd, req vid 0xa
powernow-k8: writing vid 0xc, lo 0x10c00, hi 0x1
powernow-k8: ph1: curr 0xc, req vid 0xa
powernow-k8: writing vid 0xb, lo 0x10b00, hi 0x1
powernow-k8: ph1: curr 0xb, req vid 0xa
powernow-k8: writing vid 0xa, lo 0x10a00, hi 0x1
powernow-k8: ph1: changing vid for rvo, req 0x9
powernow-k8: writing vid 0x9, lo 0x10900, hi 0x1
powernow-k8: ph1: changing vid for rvo, req 0x8
powernow-k8: writing vid 0x8, lo 0x10800, hi 0x1
powernow-k8: ph1 complete, currfid 0x0, currvid 0x8
powernow-k8: ph2 (cpu0): starting, currfid 0x0, currvid 0x8, reqfid 0x12
powernow-k8: writing fid 0xa, lo 0x1080a, hi 0x190
powernow-k8: writing fid 0xc, lo 0x1080c, hi 0x190
powernow-k8: writing fid 0xe, lo 0x1080e, hi 0x190
powernow-k8: writing fid 0x10, lo 0x10810, hi 0x190
powernow-k8: writing fid 0x12, lo 0x10812, hi 0x190
powernow-k8: ph2 complete, currfid 0x12, currvid 0x8
powernow-k8: ph3 (cpu0): starting, currfid 0x12, currvid 0x8
powernow-k8: writing vid 0xa, lo 0x10a12, hi 0x1
powernow-k8: ph3 complete, currfid 0x12, currvid 0xa
powernow-k8: transitioned (cpu0): new fid 0x12, vid 0xa
cpufreq-core: notification 1 of frequency transition to 2600000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz
unexpected IRQ trap at vector 5f
Comment 31 Mark Langsdorf 2006-01-30 16:33:52 UTC
It looks like the PN! driver has completed all operations before the lockup starts.  It's apparently PN! related but I don't think it's a driver bug.
Comment 32 Thomas Renninger 2006-01-30 16:40:55 UTC
But I guess, ndiswrapper and fglrx won't work any longer, will they?
->I got ndiswrapper working on 64bit, you have to search for the Windows 64-bit drivers, I don't know about fglrx.

So the machine also locks up with userspace goveror, it just switches not that often than ondemand, therefore it is hard to run into it?

vm86.c, is this stuff that only gets called on x86_64 machines that run a 32-bit kernel?
So this seems to be a 32-bit kernel on a x86_64 machine problem?
assembler involved -> nothing for me..., Andi do the messages:
no vm86_info: BAD
Kernel panic - not syncing: Attempted to kill the idle task!
and
unexpected IRQ trap at vector 5f
tell you anything?

Hmm the changelog of arch/i386/vm86.c scares me a bit, maybe it's some weird stuff like that?:
22 mar 2002 - Manfred detected the stackfaults, but didn't handle
 *                them correctly. Now the emulation will be in a
 *                consistent state after stackfaults - Kasper Dupont
 *                <kasperd@daimi.au.dk>

Gerhard: Thanks for the good catches!
Comment 33 Andreas Kleen 2006-01-30 17:08:36 UTC
5f is in the middle of the interrupt space and probably not a legal interrupt.

That unexpected interrupt 5f somehow ended up in the vm86 handling code (which
wasn't expected to be running). This  then somehow lead to the panic.

Your system seems be not running with "apic". The unexpected IRQ trap 
at vector message comes from the local APIC code, which got called from
the fallback handler. That's a kernel bug - if we're not in APIC mode
we shouldn't call into the APIC code even for unexpected interrupts.
It tried to access the local APIC, which might then have caused some bad
behaviour (it's not mapped when local APIC is off).  But normally
these unexpected interrupts shouldn't happen in the first place.

The vm86 code should only be entered when the CPU reported an EFLAGS
with the VM bit set. Gerhard - did you run any vm86 applications (like
dosemu or restarting the X server or hwinfo) during the crashes?

Comment 34 Gerhard Fuernkranz 2006-01-30 18:15:25 UTC
(In reply to comment #33)

> Your system seems be not running with "apic".

Was this meant as a question to me, or do you know anyway?
In the former case, how can I see whether it's running in APIC mode?

Normally I do not boot explicitly with "noapic".
(IIRC, I tried "noapic" once, but it did not make a difference)

> The vm86 code should only be entered when the CPU reported an EFLAGS
> with the VM bit set. Gerhard - did you run any vm86 applications (like
> dosemu or restarting the X server or hwinfo) during the crashes?

Not that I'm aware of. I never used dosemu or wine on this machine
so far. I did run VMware trial version on the machine several times,
but no VM was running when the machine crashed.

The X server was running during this crash, but it was not being
restarted at the time of the crash. Since I cannot reproduce the
problem whenever I like, the lockup usually happens when I keep the
machine running over night, when the machine is is mostly idle.
IIRC, I was logged-in into an otherwise ide KDE session, and the
screen saver was running.

Is my understanding correct, that according to the panic message,
obviously the idle task was interrupted by an unexpected interrupt
which eventually lead to the panic? But why should just the idle
task have the VM bit set in EFLAGS?
Comment 35 Andreas Kleen 2006-01-30 18:23:57 UTC
>Was this meant as a question to me, or do you know anyway?
>In the former case, how can I see whether it's running in APIC mode?

I know from your hwinfo, unless you changed the configuration since then.

>Is my understanding correct, that according to the panic message,
>obviously the idle task was interrupted by an unexpected interrupt
>which eventually lead to the panic? But why should just the idle
>task have the VM bit set in EFLAGS?

I have no idea. Especially not why frequency scaling should cause this.
Well, it's possible that the illegal access of the local APIC had 
some very weird side effect. There's a small chance it will go away
when that is fixed.

As to why the 5f interrupt happens - that's probably some hardware
problem. Perhaps it's caused by your chipset which doesn't like
the CPU changing frequency. Or some other hardware flakiness. 

One thing that might be worth trying 
if you want to recompile your kernel. You can add a 
printk("vm86 used in %s:%d\n", current->comm, current->pid);
in arch/i386/kernel/vm86.c:do_sys_vm86 and see if something
unexpected uses vm86 mode during your night.
Comment 36 Gerhard Fuernkranz 2006-01-31 00:19:16 UTC
Created attachment 65796 [details]
One more panic

I got a panic once again. This time vm86 seems not be involved, and there's a stack backtrace from a running process available -> gettimeofday(). But it is once again int 0x5F ...

(Btw, I have added the printks to vm86.c)
Comment 37 Gerhard Fuernkranz 2006-01-31 00:29:54 UTC
(In reply to comment #35)

> I have no idea. Especially not why frequency scaling should cause this.

I must fully admit, I don't understand this either. There seems to be a correlation, though.

> Well, it's possible that the illegal access of the local APIC had 
> some very weird side effect. There's a small chance it will go away
> when that is fixed.

Btw, I'm also wondering, whether vm86.c:do_int() should really return_to_32bit() if the interrupt which cannot be handled is an (asynchronous) hardware interrupt and not a synchronous exception caused by the current process. What do you think? Are there situations, where return_to_32bit() makes sense for a HW int?
Comment 38 Gerhard Fuernkranz 2006-01-31 00:42:35 UTC
To get rid of the invalid APIC acess, when int 5F occurs: Is is sufficient to comment out the call to ack_APIC_irq() in apic.c:ack_bad_irq() as Workaround?
(granted that the APIC is not used on my system)

Btw, I found with Google:
"INT 5F - IRQ15/7 relocated by software; HP 95LX GRAPHICS PRIMITIVES"

But I have no idea, whether/how/why this would apply to my notebook under Linux.
Comment 39 Andreas Kleen 2006-02-01 15:10:35 UTC
Yes, commenting ack_APIC_irq is sufficient.

Your second panic is weird. It is a bounds exception which happens when
someone uses the quite obscure bound checking instruction. The kernel
never uses that.

For me it looks like a hardware problem - perhaps whatever injects the weird
5f interrupt also injects other interrupts that shouldn't be there.

To be honest I doubt we can do much about that from the software side.
I would recommend to not use powersaved.

Ok there is a small chance that commenting out the APIC ack will help,
but if that doesn't help I'm out of ideas.

Comment 40 Andreas Kleen 2006-02-01 15:48:44 UTC
Created attachment 66049 [details]
Don't ack apic when disabled

Here's the correct patch for the ack problem btw.
Comment 41 Andreas Kleen 2006-02-02 13:56:54 UTC
Did not acking the APIC change anything?
Comment 42 Gerhard Fuernkranz 2006-02-03 23:23:18 UTC
(In reply to comment #41)
> Did not acking the APIC change anything?

Obviously not. I got thre more panics, which I'll attach below.
The first one was again a v86 one:

  no vm86_info: BAD
  Kernel panic - not syncing: Attempted to kill the idle task!

The second one

  Kernel panic - not syncing: Attempted to kill the idle task!

(looks like do_exit() was called from the idle task - buy why?)

And the 3rd one was a "divide error" in aes_encrypt(). This
panic includes a call trace. However, maybe I overlooked something,
but I did not find any divide instructions in aes-i586-asm.S.
Hmmmm...
Comment 43 Gerhard Fuernkranz 2006-02-03 23:24:29 UTC
Created attachment 66440 [details]
1st new panic
Comment 44 Gerhard Fuernkranz 2006-02-03 23:25:22 UTC
Created attachment 66441 [details]
2nd new panic
Comment 45 Gerhard Fuernkranz 2006-02-03 23:26:22 UTC
Created attachment 66442 [details]
3rd new panic - divide error
Comment 46 Andreas Kleen 2006-02-04 07:34:55 UTC
Yes, it looks like something is sending random exceptions to the CPU.
My guess is that it's a chipset bug of some kind. I can't offer any other
advice other than not running powersaved which seems to trigger that bug.

Comment 47 Andreas Kleen 2006-02-07 22:13:06 UTC
Maybe we can blacklist the machine, but i will only do it if there's 
at least one other report of it to ensure it's a systematic problem.
Comment 48 Andreas Kleen 2006-02-08 10:59:05 UTC
Jacob, David, FYI. I actually already closed it as wontfix, but it might
point to some issue in powernow-k8 on that machine or hardware issue.

Comment 49 Gerhard Fuernkranz 2006-02-16 19:28:08 UTC
Last weekend I upgraded once again to the latest YOU patchces, and it also did include a kernel upgrade from ...-15.7 to version 2.6.13-15.8. I don't know what you have fixed in this version; but anyway, so far my notebook did successfully survive three nights in dynamic CPU frequency scaling mode without locking up. That's probably not enough evidence yet to be absolutely sure that the problem is gone, but it is at least a very good sign :-)

Btw, many thanks to verybody, for your effords to resolve the problem.