|
Bugzilla – Full Text Bug Listing |
| Summary: | Hard Lockup on Amilo 1667G | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Gerhard Fuernkranz <nospam456> |
| Component: | Kernel | Assignee: | Andreas Kleen <ak> |
| Status: | RESOLVED WONTFIX | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | CC: | david.keck, jacob.shin, mark.langsdorf, nospam456, trenn |
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | SuSE Linux 10.0 | ||
| Whiteboard: | |||
| Found By: | Customer | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: |
hwinfo
lspci output dmesg output - kernel boot messages lsmod output Debug messages when repeatedly toggling the "silent" button Netconsole output before lockup One more panic Don't ack apic when disabled 1st new panic 2nd new panic 3rd new panic - divide error |
||
|
Description
Gerhard Fuernkranz
2006-01-25 00:09:22 UTC
Created attachment 64869 [details]
hwinfo
Created attachment 64870 [details]
lspci output
Created attachment 64872 [details]
dmesg output - kernel boot messages
These messages are captured with "powersave -f",
where the machines runs stable.
Btw: To exclude ndiswrapper, fglrx or vmware modules as possible reasons for the problem, I also did run the kernel without these modules loaded (on alpha console only, w/o X11) and also ran into the lockup after some time. Created attachment 64876 [details]
lsmod output
Oh, great. Powersaved again. Andi, could you have a look? powersaved is probably not guilty since we are only doing legal stuff ;-) If BATTERY_SCHEME=performance, and if CPUFREQUENCY="performance" is set in the "performance" scheme, then the frequency nevertheless goes down from 2600 to 800 MHz, when I disconnect AC power. -> BIOS limits CPU frequencies on battery (on some machine you can modify that in BIOS settings). However, cpufreq core could have a little problem with that if BIOS also lowers the freq behind the kernels back -> I have a patch for that in the queue. Hmm but that should not lock up the machine, but it would be worth a try. Maybe this only happens if you have unplugged/replugged the AC adapter at least once? Does it work if you use the userspace governor: CPUFREQ_CONTROL="userspace" in /etc/sysconfig/powersave/cpufreq > BIOS limits CPU frequencies on battery That's what I basically did susect as well. > (on some machine you can modify that in BIOS settings). In my BIOS I did not find such settings, so it might be hardcoded. > However, cpufreq core could have a little problem with that if > BIOS also lowers the freq behind the kernels back -> I have a > patch for that in the queue. Hmm but that should not lock up > the machine, but it would be worth a try. > > Maybe this only happens if you have unplugged/replugged the > AC adapter at least once? To clarify: The lockup also happens with AC power connected all the time. It is NOT necessary to disconnect AC power to get the lockup. It just happens after some time, when dynamic frequency scaling is turned on, and as said, I have not seen this lockup so far with powersaved stopped, and the machine also seems to run stable with "powersave -f" in effect (i.e. with CPU at full speed). The obvious frequency limiting by the BIOS on battery operation was just an additional observation, but is likely not the root cause of my lockup. I can attach the ACPI events logged during AC disconnection and reconnection though, if this is helpful. I'll also test whether dynamic frequency scaling via the the userspace governor makes a difference (i.e. whether it also leads to a lockup after some time). Is there any additional info you need from my machine? I did switch to the userspace governor, and yesterday evening the machine was indeed running for several hours without locking up, and it also did survive the last night without freezing :-) Looks good so far, and I'll keep it under observation. Of course, this still does not explain why I've seen the lockups with the kernel governor. I guess, it is basically expected to work as well, isn't it? I also expect it not to be the freq limitation... You see the acpi events that happened in: /var/log/acpid or just call acpi_listen you see the events popping up there.. would be a good idea, may be some strange processor event is happening shortly before? How does the machine freeze? Can you still push some NUMLOCK button and the LED is still reacting? Try to boot with sysreq=1 or better set ENABLE_SYSRQ="yes" in /etc/sysconfig/sysctl (needs rebooting to be activated); If the machine hangs, can you still send the kernel sysreq requests? If it hangs and sysreq is active try: SysRq-T. (Best is to use a serial console for that, so that you can C&P the output, if this is not possible you may want to try to write us the last functions the machine is in by hand). When the LED/sysreq is still working it could just be a semaphore lock? When not, maybe there happen too much switches in too short time? Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_* Especially the sampling_rate_min you should try a value of *100 and *10. Another debug try would be to install the kernel-debug rpm. When you booted the kernel do: echo 7 >/sys/modules/cpufreq/debug. In this case the kernel writes out additional messages on every cpufreq change, hopefully we get an idea what is happening then. > How does the machine freeze? Can you still push some NUMLOCK > button and the LED is still reacting? The keyboard LEDs are no longer responding to numlock, etc. > Try to boot with sysreq=1 or better set ENABLE_SYSRQ="yes" in > /etc/sysconfig/sysctl (needs rebooting to be activated); > If the machine hangs, can you still send the kernel sysreq requests? I already tried to enable SYSRQ - it also did not work any more when the machine freezes. > If it hangs and sysreq is active try: SysRq-T. (Best is to use > a serial console for that, so that you can C&P the output, if > this is not possible you may want to try to write us the last > functions the machine is in by hand). Unfortunately the notebook does not have a serial port at all. > When not, maybe there happen too much switches in too short time? > Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_* > Especially the sampling_rate_min you should try a value of *100 and *10. I'll try to play with these parameters. Btw: If I try the debug kernel, do I need some special settings to get all messages written by the kernel synchronously to the console? (Otherwise, if the messages are logged asynchronously, e.g. via syslog and the machine freezes, then I guess I will not see the most recent messages before the freeze any more - but these messages are likely the most interesting ones) IIRC they are written synchronously. But you are right, we might miss the most interessting ones. You might want to setup a net console, you find the docs (it's really easy to setup) in the kernel Documentation/networking dir: linux-2.6.15/Documentation/networking/netconsole.txt Of course you need a working network and a second machine. Boot the debug kernel, setup the netconsole and echo 7 >/sys/modules/cpufreq/debug and we hopefully see the last tries in powernow-k8. Did the incrementing of sampling_rate_* help? (In reply to comment #11) > Try to increase: /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_* > Especially the sampling_rate_min you should try a value of *100 and *10. Those will be overwritten by the powersave daemon. But we are running the kernel governor with pretty conservative settings anyway: seife@strolchi:~> grep . /sys/devices/system/cpu/cpu0/cpufreq/ondemand/* /sys/devices/system/cpu/cpu0/cpufreq/ondemand/ignore_nice_load:0 /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_down_factor:1 /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate:333000 /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_max:40000000 /sys/devices/system/cpu/cpu0/cpufreq/ondemand/sampling_rate_min:40000 /sys/devices/system/cpu/cpu0/cpufreq/ondemand/up_threshold:80 That could be it. Powersave daemon must not override the sampling rate_min(it may increase the initial sampling_rate_min but not lower it!) as it is calculated by the transition latency of the hardware. Strange that it is not modified on my machine (AMD64): cat ondemand/sampling_rate_min 620000 If I calculated right, this refers to a transition latency of 1240 nano seconds. A value of 40000 assumes a transition latency of 80 nano seconds, quite fast... And this only refers to the transition latency, the max waiting time between transitions could be much higher. Please check the value and verify whether the system still freezes with a value of 1000000. (In reply to comment #15) > That could be it. > Powersave daemon must not override the sampling rate_min you can't. It's read-only: root@strolchi:/sys/devices/system/cpu/cpu0/cpufreq/ondemand> ls -l sampling_rate_min -r--r--r-- 1 root root 4096 2006-01-26 20:11 sampling_rate_min > Strange that it is not modified on my machine (AMD64): no. not strange. > cat ondemand/sampling_rate_min > 620000 > If I calculated right, this refers to a transition latency of 1240 nano > seconds. > A value of 40000 assumes a transition latency of 80 nano seconds, quite fast... Pentium M 2GHz > Please check the value and verify whether the system still freezes with a value > of 1000000. you cannot set sampling_rate_min, it is purely for your information. Also,powersaved cannot do anything stupid: root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate_min 40000 root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate 333000 root@strolchi:/sys/.../cpu0/cpufreq/ondemand> echo 39999 > sampling_rate bash: echo: write error: Invalid argument root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate 333000 root@strolchi:/sys/.../cpu0/cpufreq/ondemand> echo 40000 > sampling_rate root@strolchi:/sys/.../cpu0/cpufreq/ondemand> cat sampling_rate 40000 So if we try to write values that the driver does not like, it just refuses it. If here is a bug, it is either a driver or a BIOS bug. No powersaved bug. > You might want to setup a net console, you find the docs
> (it's really easy to setup) in the kernel Documentation/networking dir:
> linux-2.6.15/Documentation/networking/netconsole.txt
Thanks for the hint. I was not aware of the existance of
the netconsole before - I only knew the serial console.
Tried it, and seems to work fine (though only over Ethernet,
and not with my WLAN).
> Try to increase: /sys/devices/system/cpu/cpu0/cpufreq
> /ondemand/sampling_rate_*
> Especially the sampling_rate_min you should try a value of *100 and *10.
Unfortunately sampling_rate_min seems to be read-only.
I can chmod the file, and then echo a new value, but nevertheless
it has no effect. Reading it after modification gives still
the old value.
Btw, on my machine, sampling_rate_min is 620000.
Unfortunately sampling_rate_min seems to be read-only. -> OK, sorry for the confusion. Having a closer look reveals that sampling_rate_min isn't used in the governor anyway. Maybe it helps increasing sampling_rate, I have 1240000 us, 1,24s. What is your value there? It is the only thing coming in my mind for now the ondemand could cause a hard lookup: switching too fast. If this does not work, maybe you can compile your own kernel and we can add some printks to verify that ondemand governor really waits long enough until it does the next switching: - install kernel-source.rpm - cd /usr/src/linux - cp arch/x86_64/defconfig.default .config - edit .config and modify CONFIG_CPU_FREQ_DEBUG to CONFIG_CPU_FREQ_DEBUG=y - make all modules_install install - edit bootloader config to point your old kernel (you never know) and the new one that should now be installed in /boot. echo 7 >/sys/modules/cpufreq/parameters/debug could already be sufficient to check whether the ondemand governor waits long enough between switches... If you really have to go that far, please set up a net console and send the whole output until it freezes with this setup, maybe we see something suspicious. I could imagine that some external event ACPI processor? is forcing the cpufreq-core to reset the speed. In this case it could be that the processor has not calmed down from the last set. Ondemand governor does not know about that, it's workqueue gets activated some us after that, it tries to switch -> clash! In theory the userspace governor should also be able to freeze the machine in this scenario, but may it's just because it does not switch that often and fast as ondemand governor, so nobody recognised userspace freezes until now? > - install kernel-source.rpm > - cd /usr/src/linux > - cp arch/x86_64/defconfig.default .config Really x86_64? I'm running on an Athlon64 hardware, but the 32-bit kernel. I'm also not sure: Did the Yast Online Update also upgrade my installed kernel-src package, or only the binary kernel package? Or in other words: Do the kernel-src and the binary kernel package still match after the YOU? Or do I get an "old" (i.e. the original 10.0) kernel, if I compile the sources? [...] > If you really have to go that far, please set up a net console Sure, otherwise it's rather useless. Let's see if I find some time next weekend to build a debug kernel. One problem is indeed, that I do not know how to enforce the lockup. Yesterday evening the machine was even running stable for a couple of hours with the "ondemand" governor, and I was even afraid, that I can't reproduce it at all any more. But finally, I kept it turned on over night - and in the morning it was frozen again. And at other times, I logged in after booting, and the machine was frozen after a few minutes. Basically, my evidence that the problem might likely be related to the "ondemand" governor comes from the fact that that I have never observed it with fixed frequency so far. Of course other causes are still not impossible. With the userspace governor, I have only one night experience (no lockup) so far, which is probably not long enough to claim with statisitcal significance, that it always works with the userspace governor.
> I could imagine that some external event ACPI processor? is forcing
> the cpufreq-core to reset the speed.
As said, disconnecting AC power seems to force the processor to
run at 800 MhZ. It also does generate a "CPU" event in addition
to the AC and battery events. I also have a "silent" button,
which, if turned on, also forces the CPU to run at 800 MhZ. It
also generates CPU events, when it is switch on or off (but no
AC or battery events). But so far, I have not been able to
reproduce/force a lockup simply by repeatedly disconnecting/
reconnecting AC power or by turning the silent button on/off.
You can run a 64-bit kernel below an 32-bit userspace linux. Would be interesting if the problems vanish in this case... To reproduce, maybe something like helps?: while true;do dd if=/dev/zero of=/dev/shm/tmp bs=$((1024*1024)) count=1000;sync;sleep 3;done Depending of how fast your machine is you should modify the values a bit (copying 100MB junks to memory in this case). watch -n1 cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq should show you up/down going freqs.
> As said, disconnecting AC power seems to force the processor to
> run at 800 MhZ. It also does generate a "CPU" event in addition
> to the AC and battery events. I also have a "silent" button,
> which, if turned on, also forces the CPU to run at 800 MhZ. It
> also generates CPU events, when it is switch on or off (but no
> AC or battery events). But so far, I have not been able to
> reproduce/force a lockup simply by repeatedly disconnecting/
> reconnecting AC power or by turning the silent button on/off.
Forgot to mention:
I experimented with the "silent" button in conjunction with the
plerformance governor. When turning the button on/off several
times, I did observer a frequence switch between 2600 and 800 MhZ.
But after a couple of on/off cycles, I ended up in a situation
where the frequency was locked at 800 MhZ, even with released
"silent" button. In this state, I was no longer able to switch
to a higher frequence any more, even not by changing the governor
to powersave back to performance. Only a reboot helped to escape.
(In reply to comment #22) > You can run a 64-bit kernel below an 32-bit userspace linux. But I guess, ndiswrapper and fglrx won't work any longer, will they? For a test, that's ok, but eventuelly, I need them for WLAN, and my graphics card did not work with the open source xorg driver, but only with the ATI driver, which implies the fglrx kernel module :-( (In reply to comment #22) > To reproduce, maybe something like helps?: > while true;do dd if=/dev/zero of=/dev/shm/tmp bs=$((1024*1024)) > count=1000;sync;sleep 3;done This will allocate 1G in tmpfs. The machine will probably lock up due to this, unless it has much more than 1G of RAM. > Depending of how fast your machine is you should modify the values a bit > (copying 100MB junks to memory in this case). 1G, not 100MB. Just doing "dd if=/dev/zero bs=1M count=1000 > /dev/null" should do equally well to push the CPU. Below is what happens when I turn my "silent" button on or off. Is that the behaviour you would expect? Does it mean, that the BIOS changes the frequency unexpectedly unter the kernels back? Or does the BIOS gracefully ask the kernel to change the frequency anyway? ACPI events: ------------ silent botton on: processor CPU1 00000080 00000006 silent button off: processor CPU1 00000080 00000000 debug messages: --------------- silent button on: cpufreq-core: updating policy for CPU 0 cpufreq-core: setting new policy for CPU 0: 800000 - 2600000 kHz freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0 freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0 freq-table: request for verification of policy (800000 - 800000 kHz) for cpu 0 freq-table: verification lead to (800000 - 800000 kHz) for cpu 0 cpufreq-core: new min and max freqs are 800000 - 800000 kHz cpufreq-core: governor: change or update limits cpufreq-core: __cpufreq_governor for CPU 0, event 3 cpufreq-core: target for CPU 0: 800000 kHz, relation 1 powernow-k8: targ: cpu 0, 800000 kHz, min 800000, max 800000, relation 1 powernow-k8: targ: curr fid 0x12, vid 0xa freq-table: request for target 800000 kHz (relation: 1) for cpu 0 freq-table: target is 6 (800000 kHz, 6144) powernow-k8: cpu 0 transition to index 6 powernow-k8: table matched fid 0x0, giving vid 0x18 powernow-k8: cpu 0, changing to fid 0x0, vid 0x18 cpufreq-core: notification 0 of frequency transition to 800000 kHz userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz powernow-k8: ph1 (cpu0): start, currfid 0x12, currvid 0xa, reqvid 0x18, rvo 0x2 powernow-k8: ph1 maxvid=0x6 powernow-k8: ph1 complete, currfid 0x12, currvid 0xa powernow-k8: ph2 (cpu0): starting, currfid 0x12, currvid 0xa, reqfid 0x0 powernow-k8: writing fid 0x10, lo 0x10a10, hi 0x190 powernow-k8: writing fid 0xe, lo 0x10a0e, hi 0x190 powernow-k8: writing fid 0xc, lo 0x10a0c, hi 0x190 powernow-k8: writing fid 0xa, lo 0x10a0a, hi 0x190 powernow-k8: writing fid 0x0, lo 0x10a00, hi 0x190 powernow-k8: ph2 complete, currfid 0x0, currvid 0xa powernow-k8: ph3 (cpu0): starting, currfid 0x0, currvid 0xa powernow-k8: writing vid 0x18, lo 0x11800, hi 0x1 powernow-k8: ph3 complete, currfid 0x0, currvid 0x18 powernow-k8: transitioned (cpu0): new fid 0x0, vid 0x18 cpufreq-core: notification 1 of frequency transition to 800000 kHz cpufreq-core: scaling loops_per_jiffy to 3205083 for frequency 800000 kHz userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz silent button off: cpufreq-core: updating policy for CPU 0 cpufreq-core: setting new policy for CPU 0: 800000 - 2600000 kHz freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0 freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0 freq-table: request for verification of policy (800000 - 2600000 kHz) for cpu 0 freq-table: verification lead to (800000 - 2600000 kHz) for cpu 0 cpufreq-core: new min and max freqs are 800000 - 2600000 kHz cpufreq-core: governor: change or update limits cpufreq-core: __cpufreq_governor for CPU 0, event 3 cpufreq-core: target for CPU 0: 2600000 kHz, relation 1 powernow-k8: targ: cpu 0, 2600000 kHz, min 800000, max 2600000, relation 1 powernow-k8: targ: curr fid 0x0, vid 0x18 freq-table: request for target 2600000 kHz (relation: 1) for cpu 0 freq-table: target is 0 (2600000 kHz, 2578) powernow-k8: cpu 0 transition to index 0 powernow-k8: table matched fid 0x12, giving vid 0xa powernow-k8: cpu 0, changing to fid 0x12, vid 0xa cpufreq-core: notification 0 of frequency transition to 2600000 kHz userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz cpufreq-core: scaling loops_per_jiffy to 10416519 for frequency 2600000 kHz powernow-k8: ph1 (cpu0): start, currfid 0x0, currvid 0x18, reqvid 0xa, rvo 0x2 powernow-k8: ph1 maxvid=0x6 powernow-k8: ph1: curr 0x18, req vid 0xa powernow-k8: writing vid 0x17, lo 0x11700, hi 0x1 powernow-k8: ph1: curr 0x17, req vid 0xa powernow-k8: writing vid 0x16, lo 0x11600, hi 0x1 powernow-k8: ph1: curr 0x16, req vid 0xa powernow-k8: writing vid 0x15, lo 0x11500, hi 0x1 powernow-k8: ph1: curr 0x15, req vid 0xa powernow-k8: writing vid 0x14, lo 0x11400, hi 0x1 powernow-k8: ph1: curr 0x14, req vid 0xa powernow-k8: writing vid 0x13, lo 0x11300, hi 0x1 powernow-k8: ph1: curr 0x13, req vid 0xa powernow-k8: writing vid 0x12, lo 0x11200, hi 0x1 powernow-k8: ph1: curr 0x12, req vid 0xa powernow-k8: writing vid 0x11, lo 0x11100, hi 0x1 powernow-k8: ph1: curr 0x11, req vid 0xa powernow-k8: writing vid 0x10, lo 0x11000, hi 0x1 powernow-k8: ph1: curr 0x10, req vid 0xa powernow-k8: writing vid 0xf, lo 0x10f00, hi 0x1 powernow-k8: ph1: curr 0xf, req vid 0xa powernow-k8: writing vid 0xe, lo 0x10e00, hi 0x1 powernow-k8: ph1: curr 0xe, req vid 0xa powernow-k8: writing vid 0xd, lo 0x10d00, hi 0x1 powernow-k8: ph1: curr 0xd, req vid 0xa powernow-k8: writing vid 0xc, lo 0x10c00, hi 0x1 powernow-k8: ph1: curr 0xc, req vid 0xa powernow-k8: writing vid 0xb, lo 0x10b00, hi 0x1 powernow-k8: ph1: curr 0xb, req vid 0xa powernow-k8: writing vid 0xa, lo 0x10a00, hi 0x1 powernow-k8: ph1: changing vid for rvo, req 0x9 powernow-k8: writing vid 0x9, lo 0x10900, hi 0x1 powernow-k8: ph1: changing vid for rvo, req 0x8 powernow-k8: writing vid 0x8, lo 0x10800, hi 0x1 powernow-k8: ph1 complete, currfid 0x0, currvid 0x8 powernow-k8: ph2 (cpu0): starting, currfid 0x0, currvid 0x8, reqfid 0x12 powernow-k8: writing fid 0xa, lo 0x1080a, hi 0x190 powernow-k8: writing fid 0xc, lo 0x1080c, hi 0x190 powernow-k8: writing fid 0xe, lo 0x1080e, hi 0x190 powernow-k8: writing fid 0x10, lo 0x10810, hi 0x190 powernow-k8: writing fid 0x12, lo 0x10812, hi 0x190 powernow-k8: ph2 complete, currfid 0x12, currvid 0x8 powernow-k8: ph3 (cpu0): starting, currfid 0x12, currvid 0x8 powernow-k8: writing vid 0xa, lo 0x10a12, hi 0x1 powernow-k8: ph3 complete, currfid 0x12, currvid 0xa powernow-k8: transitioned (cpu0): new fid 0x12, vid 0xa cpufreq-core: notification 1 of frequency transition to 2600000 kHz userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz Created attachment 65543 [details] Debug messages when repeatedly toggling the "silent" button (In reply to comment #23) > I experimented with the "silent" button in conjunction with the > performance governor. When turning the button on/off several > times, I did observer a frequence switch between 2600 and 800 MhZ. > > But after a couple of on/off cycles, I ended up in a situation > where the frequency was locked at 800 MhZ, even with released > "silent" button. In this state, I was no longer able to switch > to a higher frequence any more, even not by changing the governor > to powersave back to performance. Only a reboot helped to escape. This attachment shows the debug messages printed by the kernel, when the frequency lock to 800 MzH occurs as a result of repeatedly turning the "silent" button on/off. to comment #22, it is possible to run a 64-bit kernel below a 32-bit userspace system but you would have sever problems compiling it on a 32-bit system. Better try to install the x86_64 binary kernel-default*.rpm if you want to try that one. To be honest I haven't tried myself, only heard that it works, if you have problems, maybe it is even easier to try a small x86_64 installation if you still have a free partition. I just wanted to mention that before you wast time and try it. I will have a closer look at the logs on Monday... Created attachment 65544 [details]
Netconsole output before lockup
I just got my lockup again (with a self-compiled 32-bit kernel).
Actually it seems to be a panic. Unfortunately no stack backtrace
in the netconsole output.
[...]
powernow-k8: ph3 complete, currfid 0x0, currvid 0x18
powernow-k8: transitioned (cpu0): new fid 0x0, vid 0x18
cpufreq-core: notification 1 of frequency transition to 800000 kHz
cpufreq-core: scaling loops_per_jiffy to 3205148 for frequency 800000 kHz
userspace: saving cpu_cur_freq of cpu 0 to be 800000 kHz
no vm86_info: BAD
Kernel panic - not syncing: Attempted to kill the idle task!
I had another freeze yesterdy night. However, this one was different than all the other ones I have seen, since it also garbled my X screen. I had not yet setup netconsole logging to a file at that time, so I have only the captured netconsole output from with cut&paste from the xterm running netcat. [...] powernow-k8: ph1: curr 0xd, req vid 0xa powernow-k8: writing vid 0xc, lo 0x10c00, hi 0x1 powernow-k8: ph1: curr 0xc, req vid 0xa powernow-k8: writing vid 0xb, lo 0x10b00, hi 0x1 powernow-k8: ph1: curr 0xb, req vid 0xa powernow-k8: writing vid 0xa, lo 0x10a00, hi 0x1 powernow-k8: ph1: changing vid for rvo, req 0x9 powernow-k8: writing vid 0x9, lo 0x10900, hi 0x1 powernow-k8: ph1: changing vid for rvo, req 0x8 powernow-k8: writing vid 0x8, lo 0x10800, hi 0x1 powernow-k8: ph1 complete, currfid 0x0, currvid 0x8 powernow-k8: ph2 (cpu0): starting, currfid 0x0, currvid 0x8, reqfid 0x12 powernow-k8: writing fid 0xa, lo 0x1080a, hi 0x190 powernow-k8: writing fid 0xc, lo 0x1080c, hi 0x190 powernow-k8: writing fid 0xe, lo 0x1080e, hi 0x190 powernow-k8: writing fid 0x10, lo 0x10810, hi 0x190 powernow-k8: writing fid 0x12, lo 0x10812, hi 0x190 powernow-k8: ph2 complete, currfid 0x12, currvid 0x8 powernow-k8: ph3 (cpu0): starting, currfid 0x12, currvid 0x8 powernow-k8: writing vid 0xa, lo 0x10a12, hi 0x1 powernow-k8: ph3 complete, currfid 0x12, currvid 0xa powernow-k8: transitioned (cpu0): new fid 0x12, vid 0xa cpufreq-core: notification 1 of frequency transition to 2600000 kHz userspace: saving cpu_cur_freq of cpu 0 to be 2600000 kHz unexpected IRQ trap at vector 5f It looks like the PN! driver has completed all operations before the lockup starts. It's apparently PN! related but I don't think it's a driver bug. But I guess, ndiswrapper and fglrx won't work any longer, will they? ->I got ndiswrapper working on 64bit, you have to search for the Windows 64-bit drivers, I don't know about fglrx. So the machine also locks up with userspace goveror, it just switches not that often than ondemand, therefore it is hard to run into it? vm86.c, is this stuff that only gets called on x86_64 machines that run a 32-bit kernel? So this seems to be a 32-bit kernel on a x86_64 machine problem? assembler involved -> nothing for me..., Andi do the messages: no vm86_info: BAD Kernel panic - not syncing: Attempted to kill the idle task! and unexpected IRQ trap at vector 5f tell you anything? Hmm the changelog of arch/i386/vm86.c scares me a bit, maybe it's some weird stuff like that?: 22 mar 2002 - Manfred detected the stackfaults, but didn't handle * them correctly. Now the emulation will be in a * consistent state after stackfaults - Kasper Dupont * <kasperd@daimi.au.dk> Gerhard: Thanks for the good catches! 5f is in the middle of the interrupt space and probably not a legal interrupt. That unexpected interrupt 5f somehow ended up in the vm86 handling code (which wasn't expected to be running). This then somehow lead to the panic. Your system seems be not running with "apic". The unexpected IRQ trap at vector message comes from the local APIC code, which got called from the fallback handler. That's a kernel bug - if we're not in APIC mode we shouldn't call into the APIC code even for unexpected interrupts. It tried to access the local APIC, which might then have caused some bad behaviour (it's not mapped when local APIC is off). But normally these unexpected interrupts shouldn't happen in the first place. The vm86 code should only be entered when the CPU reported an EFLAGS with the VM bit set. Gerhard - did you run any vm86 applications (like dosemu or restarting the X server or hwinfo) during the crashes? (In reply to comment #33) > Your system seems be not running with "apic". Was this meant as a question to me, or do you know anyway? In the former case, how can I see whether it's running in APIC mode? Normally I do not boot explicitly with "noapic". (IIRC, I tried "noapic" once, but it did not make a difference) > The vm86 code should only be entered when the CPU reported an EFLAGS > with the VM bit set. Gerhard - did you run any vm86 applications (like > dosemu or restarting the X server or hwinfo) during the crashes? Not that I'm aware of. I never used dosemu or wine on this machine so far. I did run VMware trial version on the machine several times, but no VM was running when the machine crashed. The X server was running during this crash, but it was not being restarted at the time of the crash. Since I cannot reproduce the problem whenever I like, the lockup usually happens when I keep the machine running over night, when the machine is is mostly idle. IIRC, I was logged-in into an otherwise ide KDE session, and the screen saver was running. Is my understanding correct, that according to the panic message, obviously the idle task was interrupted by an unexpected interrupt which eventually lead to the panic? But why should just the idle task have the VM bit set in EFLAGS? >Was this meant as a question to me, or do you know anyway? >In the former case, how can I see whether it's running in APIC mode? I know from your hwinfo, unless you changed the configuration since then. >Is my understanding correct, that according to the panic message, >obviously the idle task was interrupted by an unexpected interrupt >which eventually lead to the panic? But why should just the idle >task have the VM bit set in EFLAGS? I have no idea. Especially not why frequency scaling should cause this. Well, it's possible that the illegal access of the local APIC had some very weird side effect. There's a small chance it will go away when that is fixed. As to why the 5f interrupt happens - that's probably some hardware problem. Perhaps it's caused by your chipset which doesn't like the CPU changing frequency. Or some other hardware flakiness. One thing that might be worth trying if you want to recompile your kernel. You can add a printk("vm86 used in %s:%d\n", current->comm, current->pid); in arch/i386/kernel/vm86.c:do_sys_vm86 and see if something unexpected uses vm86 mode during your night. Created attachment 65796 [details]
One more panic
I got a panic once again. This time vm86 seems not be involved, and there's a stack backtrace from a running process available -> gettimeofday(). But it is once again int 0x5F ...
(Btw, I have added the printks to vm86.c)
(In reply to comment #35) > I have no idea. Especially not why frequency scaling should cause this. I must fully admit, I don't understand this either. There seems to be a correlation, though. > Well, it's possible that the illegal access of the local APIC had > some very weird side effect. There's a small chance it will go away > when that is fixed. Btw, I'm also wondering, whether vm86.c:do_int() should really return_to_32bit() if the interrupt which cannot be handled is an (asynchronous) hardware interrupt and not a synchronous exception caused by the current process. What do you think? Are there situations, where return_to_32bit() makes sense for a HW int? To get rid of the invalid APIC acess, when int 5F occurs: Is is sufficient to comment out the call to ack_APIC_irq() in apic.c:ack_bad_irq() as Workaround? (granted that the APIC is not used on my system) Btw, I found with Google: "INT 5F - IRQ15/7 relocated by software; HP 95LX GRAPHICS PRIMITIVES" But I have no idea, whether/how/why this would apply to my notebook under Linux. Yes, commenting ack_APIC_irq is sufficient. Your second panic is weird. It is a bounds exception which happens when someone uses the quite obscure bound checking instruction. The kernel never uses that. For me it looks like a hardware problem - perhaps whatever injects the weird 5f interrupt also injects other interrupts that shouldn't be there. To be honest I doubt we can do much about that from the software side. I would recommend to not use powersaved. Ok there is a small chance that commenting out the APIC ack will help, but if that doesn't help I'm out of ideas. Created attachment 66049 [details]
Don't ack apic when disabled
Here's the correct patch for the ack problem btw.
Did not acking the APIC change anything? (In reply to comment #41) > Did not acking the APIC change anything? Obviously not. I got thre more panics, which I'll attach below. The first one was again a v86 one: no vm86_info: BAD Kernel panic - not syncing: Attempted to kill the idle task! The second one Kernel panic - not syncing: Attempted to kill the idle task! (looks like do_exit() was called from the idle task - buy why?) And the 3rd one was a "divide error" in aes_encrypt(). This panic includes a call trace. However, maybe I overlooked something, but I did not find any divide instructions in aes-i586-asm.S. Hmmmm... Created attachment 66440 [details]
1st new panic
Created attachment 66441 [details]
2nd new panic
Created attachment 66442 [details]
3rd new panic - divide error
Yes, it looks like something is sending random exceptions to the CPU. My guess is that it's a chipset bug of some kind. I can't offer any other advice other than not running powersaved which seems to trigger that bug. Maybe we can blacklist the machine, but i will only do it if there's at least one other report of it to ensure it's a systematic problem. Jacob, David, FYI. I actually already closed it as wontfix, but it might point to some issue in powernow-k8 on that machine or hardware issue. Last weekend I upgraded once again to the latest YOU patchces, and it also did include a kernel upgrade from ...-15.7 to version 2.6.13-15.8. I don't know what you have fixed in this version; but anyway, so far my notebook did successfully survive three nights in dynamic CPU frequency scaling mode without locking up. That's probably not enough evidence yet to be absolutely sure that the problem is gone, but it is at least a very good sign :-) Btw, many thanks to verybody, for your effords to resolve the problem. |