Bug 105981 - false acpi_thermal_critical alert
Summary: false acpi_thermal_critical alert
Status: RESOLVED DUPLICATE of bug 98178
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Mobile Devices
Version: Beta 2
Hardware: i686 SUSE Other
Importance: P5 - None : Major
Target Milestone: ---
Assignee: Thomas Renninger
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2005-08-19 23:30 UTC by Forgotten User OS1JNCFbCX
Modified: 2005-08-21 12:16 UTC
CC List: 2 users

See Also:
Found By: Beta-Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Output of acpidmp (230.32 KB, application/octet-stream)
2005-08-20 20:47 UTC, Forgotten User OS1JNCFbCX

Description Forgotten User OS1JNCFbCX 2005-08-19 23:30:39 UTC
My ThinkPad T42p received an acpi_thermal_critical alert and thus rebooted. The
syslog says:

Aug 20 01:07:42 sighup syslog-ng[3891]: Changing permissions on special file
/dev/xconsole
Aug 20 01:07:42 sighup syslog-ng[3891]: Changing permissions on special file
/dev/tty10
Aug 20 01:07:42 sighup kernel: acpi_thermal-0463 [861] acpi_thermal_critical :
Critical trip point
Aug 20 01:07:42 sighup kernel: Critical temperature reached (95 C), shutting down.
Aug 20 01:07:42 sighup init: Switching to runlevel: 0

Actually I am pretty sure that this was a false alert because:

1. This machine never before went even near this limit even on very heavy load.

2. At that moment I was playing Lincity-NG and nothing else was running that
could have led to heavy load.

3. The machine did not really feel hot.

4. After immediately rebooting the machine,
/proc/acpi/thermal_zone/THM0/temperature said that the machine is at 50 C. It
seems impossible that the machine could have cooled down from 95 C to 50 C
within the timeframe of one boot cycle.

Unfortunately, this sort of problem is almost impossible to reproduce, so I
don't know whether you can do anything about it.
Comment 1 Forgotten User OS1JNCFbCX 2005-08-20 11:44:14 UTC
Did some further investigation under heavy load.

When monitoring /proc/acpi/thermal_zone/THM0/temperature on older SUSE releases
(e.g. 9.3) the temperature went slowly up under heavy load and went slowly down
when the heavy load was no longer present.

Under 10.0 the value still seems somewhat reasonable when there is no heavy
load, but as soon as I put the machine under heavy load the value starts jumping
up and down at a speed that is almost impossible from a physical point of
view. While jumping up and down, it seems that at some random time it
accidentally hits the critical limit and thus shuts the system down.
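
The "physically impossible" argument above can be checked against logged readings. The following is a minimal Python sketch, not part of the original report. It assumes the temperature file contains a line such as "temperature:             50 C". The names parse_temperature and implausible_jumps and the 5 C/s threshold are hypothetical, chosen only for illustration.

```python
import re

# Assumed line format of /proc/acpi/thermal_zone/THM0/temperature,
# for example "temperature:             50 C".
TEMP_RE = re.compile(r"temperature:\s*(-?\d+)\s*C")


def parse_temperature(text):
    """Return the temperature in degrees C found in text, or None."""
    match = TEMP_RE.search(text)
    return int(match.group(1)) if match else None


def implausible_jumps(samples, max_rate=5.0):
    """Return consecutive (time_s, temp_c) pairs that change too fast.

    max_rate is a hypothetical limit in C per second, not a measured
    property of the T42p.
    """
    flagged = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt > 0 and abs(c1 - c0) / dt > max_rate:
            flagged.append(((t0, c0), (t1, c1)))
    return flagged
```

Feeding it one sample per second would flag a jump from 52 C to 95 C and straight back, while a slow climb of a degree or two per second passes unflagged.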
Comment 2 Danny Al-Gaaf 2005-08-20 16:14:27 UTC
Sounds like an ACPI kernel problem.
Comment 3 Thomas Renninger 2005-08-20 17:23:10 UTC
It seems as if current kernels or the Xorg configuration make a lot of
ThinkPads overheat quite quickly.
Even though I still have no idea what could be causing this, or whether it is
really kernel ACPI related, I would like to find out.

Here is a short discussion among other ThinkPad users suffering from the same problem:
http://mailman.linux-thinkpad.org/pipermail/linux-thinkpad/2005-August/thread.html

If the graphics card/driver matches, can you try this (forwarded message)
(in xorg.conf):
ATI FireGL mobility T2, Option "DynamicClocks" "on". This way the temperature 
of the gfx card goes down from ~ 98°C to 56°C, when idling over night it goes 
down to 48°C and the fan stops completely.

Even though this one and #98178 are duplicates, I would like to leave this one
open as the general ThinkPad overheating bug and the other one as the "ondemand
passive thermal policy broken" bug.
Comment 4 Forgotten User OS1JNCFbCX 2005-08-20 17:44:23 UTC
Ok, will try this.

But note: When the system is idle, I am at about 45°C even without that option
although the fan is still running.
Comment 5 Thomas Renninger 2005-08-20 18:40:22 UTC
Not sure whether this is fixable in time, but if this is really because of a
regression in current Linux kernels (probably ACPI), I consider it severe,
as ThinkPads are the laptops that are known to work really well with Linux.

I have no idea, but I could imagine that something with fan control does not work
as expected. Even if the fan is on, Dirk reports that it could run much faster
(e.g. at boot time, when it is BIOS controlled).

Robert: Can you send me the output of acpidmp, please?

Does it make a difference when the ibm_acpi module is not loaded?
(Try something that produces load to get full performance, e.g.:
cat /dev/zero >/dev/null)
You can then watch your temperature increase nicely with e.g.:
watch -n1 cat /proc/acpi/thermal_zone/*/temperature
or with ibm_acpi module:
watch -n1 cat /proc/acpi/*ibm*/thermal


Comment 6 Thomas Renninger 2005-08-20 18:41:36 UTC
This is not the first report of this kind, adding behlert, increasing severity.
Comment 7 Thomas Renninger 2005-08-20 19:48:19 UTC
Have you already updated the BIOS and embedded controller
(http://www-307.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-50277)
firmware?

If not, could you please send me the output of acpidmp before and after updating
to the latest BIOS/embedded controller firmware? Does updating solve this issue?
Comment 8 Dirk Mueller 2005-08-20 19:52:27 UTC
This should be an exact duplicate of my bug report; Robert also uses a ThinkPad
p-series laptop.
 
Robert, try modprobe ibm_acpi experimental=1 
 
then you can watch the behaviour more closely via:

watch "cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq /proc/acpi/ibm/fan /proc/acpi/ibm/thermal"
 
For more details, see bug report 98178.

What's your trip_point polling frequency? Does the problem also go away if you
lower it to 1 s or something like that?
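
Checking and lowering the polling interval could be scripted roughly as follows. This is a sketch, not a verified interface description. It assumes that the thermal zone exposes a polling_frequency file next to the temperature file, that reading it yields a line like "polling frequency:       10 seconds", and that writing a number of seconds changes it. The function names are hypothetical.

```python
import re

# Assumed location; only the THM0/temperature file is named in this report.
POLL_PATH = "/proc/acpi/thermal_zone/THM0/polling_frequency"

# Assumed read format, for example "polling frequency:       10 seconds".
POLL_RE = re.compile(r"polling frequency:\s*(\d+)\s*seconds")


def parse_polling_frequency(text):
    """Return the polling interval in seconds, or None if none is reported."""
    match = POLL_RE.search(text)
    return int(match.group(1)) if match else None


def set_polling_frequency(seconds, path=POLL_PATH):
    """Write a new polling interval in seconds (needs root on a real system)."""
    if seconds < 0:
        raise ValueError("polling interval must not be negative")
    with open(path, "w") as handle:
        handle.write(f"{seconds}\n")
```

With these assumptions, set_polling_frequency(1) would correspond to the 1 s setting suggested above.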
 
 
 
Comment 9 Dirk Mueller 2005-08-20 20:03:28 UTC
There's a new BIOS, released just yesterday, but it doesn't seem to fix anything
relevant:
 
http://www-307.ibm.com/pc/support/site.wss/document.do?sitestyle=ibm&lndocid=MIGR-50275 
Comment 10 Thomas Renninger 2005-08-20 20:15:54 UTC
I wonder whether the new embedded controller firmware causes this.
The description is talking about getting rid of nasty fan noise.
Do you both have new embedded controller firmware?
Can someone confirm that this only happened with the new firmware?
Comment 11 Forgotten User OS1JNCFbCX 2005-08-20 20:47:32 UTC
Created attachment 46766 [details]
Output of acpidmp

Ok, first the output of acpidmp.

Have not yet loaded ibm_acpi. Will do so now and evaluate...
Comment 12 Forgotten User OS1JNCFbCX 2005-08-20 20:49:15 UTC
The system has the latest BIOS and the latest embedded controller firmware.

I don't think that the embedded controller firmware is responsible because I
never had problems with that in 9.3 or before.
Comment 13 Dirk Mueller 2005-08-20 20:57:40 UTC
OK, 9.3 and older didn't use the kernel ondemand frequency scaler but a
userspace implementation that was far more conservative and therefore didn't
trigger the overheating.

I already downgraded to the kernel from 9.3 and it doesn't make a difference;
kernel ondemand is broken there as well.

Anyway, can we agree that this is a duplicate of 98178?
Comment 14 Forgotten User OS1JNCFbCX 2005-08-20 21:04:56 UTC
Yes, most likely it is a duplicate.

I will investigate further as suggested above later. At the moment I
have to do some other stuff.
Comment 15 Dirk Mueller 2005-08-20 23:22:16 UTC

*** This bug has been marked as a duplicate of 98178 ***
Comment 16 Forgotten User OS1JNCFbCX 2005-08-21 12:16:11 UTC
Ok, I have now tried Option "DynamicClocks" "on".

While there is no significant change when the system is idle, it seems that
applications with high graphics activity no longer trigger the alert.