Bug 175702

Summary: On HP laptops the system shuts down due to bogus EC reads
Product: [openSUSE] SUSE Linux 10.1
Reporter: Antonis Stylianou <astylianou>
Component: Basesystem
Assignee: Thomas Renninger <trenn>
Status: RESOLVED INVALID
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: acpi-bugzilla, luming.yu, sandro.bordacchini, suse-beta
Version: Final
Target Milestone: ---
Hardware: x86
OS: SuSE Linux 10.1
Whiteboard:
Found By: Beta-Customer
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: 2 photos of the boot sequence when the computer starts to shutdown
hwinfo
boot.msg
this files is the chkconfic -i output during safemode
this is the boot.msg when the machine shutsdown boot.omsg
the requested acpidump file
/var/log/messages outpout
Acpidump of my HP dv5175ea
HWinfo output for my hp dv5175ea
syslog when dv5175ea boot in fail safe mode
syslog when dv5175ea shutdown after booting

Description Antonis Stylianou 2006-05-15 14:14:21 UTC
After I installed SUSE 10.1 on an HP dv8000 notebook (dual hard drive, Core Duo system), I realised that I cannot boot the new system. The system enters a shutdown process after mounting the root volume. During shutdown the system fails to cleanly unmount the drives. I am not sure, as I have not had the time to test it, but I think it is an ACPI bug.

I cannot install SUSE 10.0 because it fails to find the hard drive (a bit strange, because older versions of Ubuntu and Fedora have no problem identifying the hard drive). I have installed the 6.04 ubuntu without facing the shoudown problem.
Comment 1 Michael Gross 2006-05-15 14:31:10 UTC
Please be more verbose. What messages are printed exactly? Maybe take a photograph and attach /var/log/boot.msg. Does booting with `acpi=off' change anything?

For the problem with the hdd detection, create another report. We cannot handle more than one issue in one report.
Comment 2 Antonis Stylianou 2006-05-16 12:36:14 UTC
Created attachment 83608 [details]
2 photos of the boot sequence when the computer starts to shutdown 

2 photos of the boot sequence where the computer starts to shut down. I have tried to boot with acpi=off, but it is still the same.
Comment 3 Michael Gross 2006-05-17 10:41:34 UTC
Please avoid compressing attachments if it is not required, this makes the handling much easier here. The screenshots do not cover the whole screen, however I suppose it enters rl 0.

You did not tell me if booting with `acpi=off' changes this behavior.
Comment 4 Antonis Stylianou 2006-05-18 07:46:45 UTC
OK, I may have made a mistake in my previous post. I had some time today, checked this bug again, and it is an ACPI problem: booting with acpi=off resolves the boot problem. However, I am not an ACPI expert, so I cannot fix it on SUSE. Ubuntu 6.04 works fine with ACPI enabled on this machine (at least I get battery status). A friend who has the same machine but with an AMD 64 processor does not seem to have this problem. I believe the only difference between my machine and his is that mine is a Core Duo and has dual hard drives. (I do not know whether HP changed something else on the dv8000.)
Comment 5 Michael Gross 2006-05-18 12:34:57 UTC
Please attach the output of `hwinfo'. You might try updating your BIOS to the latest version; notebook BIOSes are sometimes quite buggy. If this does not fix the problem, I will reassign this for further investigation.
Comment 6 Antonis Stylianou 2006-05-18 23:01:38 UTC
According to HP there are no BIOS upgrades; the laptop is 2 weeks old. The hwinfo output and boot.msg are attached below.
Comment 7 Antonis Stylianou 2006-05-18 23:03:49 UTC
Created attachment 84234 [details]
hwinfo
Comment 8 Antonis Stylianou 2006-05-18 23:04:31 UTC
Created attachment 84235 [details]
boot.msg
Comment 9 Michael Gross 2006-05-22 14:15:13 UTC
Why does the system switch to runlevel 3? Did you do this intentionally? What are the lines before it switches to rl0, are there any errors? Please attach the boot messages if you try to enter rl5. Also try booting with the `safe settings' and see if it happens there, too.
Comment 10 Michael Gross 2006-05-22 14:16:58 UTC
You might also check whether this happens with the standard kernel (non SMP).
Comment 11 Antonis Stylianou 2006-05-22 20:47:28 UTC
I cannot send a boot.msg from when the system switches to runlevel 0, because it shuts down. To get that file out I would have to take out the hard drive and put it in another computer, and I cannot do that.


The boot.msg I sent is from booting with the safe settings, thus rl3 and no ACPI. The pictures I sent at the beginning show the lines before it switches to rl0, and there are no errors.

I have tried to boot with the standard kernel and the problem remains the same.

The problem is not in SMP, it is in ACPI: if I boot with acpi=off the problem is solved, but who wants a laptop with ACPI off? The problem is SUSE-specific: Ubuntu 5.10 and 6.06 do not have it, and I have also tried Fedora Core, which works fine as well.
Comment 12 Christian Boltz 2006-05-22 21:28:53 UTC
(In reply to comment #11)
> I cannot send a boot.msg when the system switches to runlevel 0 because it
> shuts down. In order to get that file out I have to take out the hard
> drive and put it on another computer and I cannot do that.

It's much easier - the previous boot.msg file is available as boot.omsg.
It shouldn't be too hard to get this file ;-)
Comment 13 Michael Gross 2006-05-23 13:30:50 UTC
> The problem is not in SMP, it is in ACPI: if I boot with acpi=off the problem
> is solved, but who wants a laptop with ACPI off?

This is not the point. We know now that the problem is related to the ACPI implementation. I'm reassigning this now.
Comment 15 Thomas Renninger 2006-05-23 15:57:40 UTC
Can you boot into runlevel 1 (just add a "1" as kernel boot parameter)?
If yes, please attach dmesg and acpidump output.

1) This is to find out the service/module that interferes with ACPI subsystem:

If yes, please try to find out which service forces your machine to shutdown.
You can use "chkconfig -l" to get an overview of the services that are started in runlevels 2 and 3. Start them by hand (e.g. rcacpid start). I expect it could be hal or the powersave daemon, but it could also be something else. Do you use lm_sensors? Maybe it's that.

2) This is to find out the ACPI module that could interfere

- Is /var/lib/acpi/laptop_modules empty? If not delete the file's contents (do
  not delete the file itself). Like that the extra ACPI module should not get
  loaded.
- Remove "processor thermal fan" from INITRD_MODULES= in /etc/sysconfig/kernel
  and invoke "mkinitrd"
- Set ACPI_MODULES="NONE" in /etc/sysconfig/powersave/common
Can you boot normally now?
If yes, you might want to boot, then load the ACPI modules (thermal, button, fan, ac, battery, processor, possibly an extra ACPI laptop module) by hand (modprobe). Does the machine now shut down when a certain ACPI module is loaded? If not, does it shut down if you restart powersaved ("rcpowersaved restart") after modprobing the ACPI modules?
Comment 16 Antonis Stylianou 2006-05-26 00:22:49 UTC
Created attachment 85167 [details]
this files is the chkconfic -i output during safemode
Comment 17 Antonis Stylianou 2006-05-26 00:23:47 UTC
Created attachment 85168 [details]
this is the boot.msg when the machine shutsdown boot.omsg
Comment 18 Antonis Stylianou 2006-05-26 00:31:10 UTC
I did No. 2 as explained in comment #15 and identified the thermal module as responsible for the shutdown. After "modprobe thermal" the laptop enters runlevel 0.
Comment 19 Thomas Renninger 2006-05-26 07:53:18 UTC
Your machine shuts down because one thermal zone reports a temperature of 0 degrees.
<0>Critical temperature reached (0 C), shutting down.
<6>ACPI: Thermal Zone [TZ02] (0 C)

Maybe all values (including the critical trip point) of that thermal zone are wrong and exported as zero, so this check becomes true =>
	if (tz->temperature >= tz->trips.critical.temperature) {
		ACPI_WARNING((AE_INFO, "Critical trip point"));

Maybe this is some kind of dummy thermal zone and values of 0 C should be ignored?
Can you attach acpidump, please.
If you provide info, please check the button "This comment provides the needed information. Change the status of this bug back to ASSIGNED." below the Comment field. Like that I know that it's my turn again...
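
The comparison quoted above fires whenever the reported temperature is greater than or equal to the critical trip point, so a zone whose values are equal (for example both reading 0 C) trips it immediately. A minimal Python sketch of that logic; the function name and the 100 C trip point are assumptions for illustration, and this is not the kernel's code:

```python
def critical_trip_reached(temperature: int, critical: int) -> bool:
    """Mirror the quoted check: temperature >= critical trip point."""
    return temperature >= critical


# A healthy zone: 25 C against an assumed 100 C critical trip point.
healthy = critical_trip_reached(25, 100)

# A zone that reports zero for both values trips immediately.
bogus = critical_trip_reached(0, 0)

print(healthy, bogus)
```

This would explain a shutdown at module load time without any real overheating, provided the hypothesis about the zero values holds.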
Comment 20 Antonis Stylianou 2006-05-28 21:49:34 UTC
Created attachment 85451 [details]
the requested acpidump file
Comment 21 Thomas Renninger 2006-05-29 09:14:08 UTC
I wonder why the installation kernel succeeded, there the thermal module should also get loaded, maybe because /sbin/poweroff is missing that early?
Do you always get this shutdown at boot time or only here and there?

I don't see how the temp could get zero here.
It seems as if EC reads related to the "virtual", second thermal device all return zero:
\_SB.PCI0.LPCB.EC0.VTMP and \_SB.PCI0.LPCB.EC0.VTSD
In the critical and temperature related AML functions.

Can you try ec_intr=0 and/or acpi_serialize boot params, please.

If this does still not work, you should try to install and boot a kernel-debug (rpm) kernel. If it boots, you can increase ACPI debug output by e.g. echo 0xFF >/proc/acpi/debug_level
Best is you do as mentioned in comment #15 part 2 (do not load ACPI modules). Then increase acpi debug output, then load the thermal module. Please send dmesg or /var/log/messages output of the part when thermal module got loaded until the machine started to shutdown (there should be a lot of additional debug output now).
If the kernel-debug kernel does not boot, tell me and I can compile you a kernel with only additional ACPI_DEBUG=y config compiled in or I can tell you how to do that.
Comment 22 Antonis Stylianou 2006-05-31 16:23:03 UTC
Created attachment 86279 [details]
/var/log/messages outpout 

The proposed ec_intr=0 and/or acpi_serialize boot params did not work. I installed the debug kernel and this is the output of /var/log/messages.
Comment 23 Sandro Bordacchini 2006-10-06 23:16:14 UTC
Hoping to be useful, I will attach some other logs.
I have the same problem with an HP Pavilion DV5175EA notebook.
I'm using openSUSE 10.1 (DVD install).
Comment 24 Sandro Bordacchini 2006-10-06 23:18:17 UTC
Created attachment 100885 [details]
Acpidump of my HP dv5175ea

(taken with nb booted in failsafe mode)
Comment 25 Sandro Bordacchini 2006-10-06 23:19:27 UTC
Created attachment 100886 [details]
HWinfo output for my hp dv5175ea

(taken with nb booted in failsafe mode)
Comment 26 Sandro Bordacchini 2006-10-06 23:20:36 UTC
Created attachment 100887 [details]
syslog when dv5175ea boot in fail safe mode
Comment 27 Sandro Bordacchini 2006-10-06 23:21:19 UTC
Created attachment 100888 [details]
syslog when dv5175ea shutdown after booting
Comment 28 Thomas Renninger 2006-10-09 09:57:40 UTC
Thanks for the detailed logs/outputs.
I expect this comes from ACPI funcs that get stuck, similar to what is described here:
http://bugzilla.kernel.org/show_bug.cgi?id=5534

I will have a close look at it in the next few days, no time today/tomorrow.
Comment 29 Thomas Renninger 2007-01-05 16:47:04 UTC
Very strange that this always happens.
Can you try to move the thermal module away from /lib/modules/kernel-ver/kernel/drivers/acpi/thermal.ko to e.g. /tmp?
If you then boot, wait some time, copy the module back and load it manually with "modprobe thermal", does it then not shut down?

I added a patch which I think could fix that (as said in comment #28) to the SLED10-SP1 branch. SLED10 has the same code base as 10.1, and AFAIK 10.1 will get the SLED-SP1 kernel sooner or later.
You can try whether it helps by installing this one:
ftp://ftp.suse.com/pub/projects/kernel/kotd/sle10-sp-i386/SLES10_SP1_BRANCH/kernel-smp.i586.rpm

(Best install it with rpm -ivh xy.rpm --force (possibly also --nodeps); you need to adjust /boot/grub/menu.lst by hand to point to the newly installed kernel. That way the old kernel stays installed.)
Comment 30 Sandro Bordacchini 2007-01-06 15:11:26 UTC
Loading the thermal module manually makes the system shut itself down.
From /var/log/messages :

Jan  5 19:57:30 fermi kernel: ACPI: Thermal Zone [TZ01] (25 C)
Jan  5 19:57:30 fermi kernel: ACPI Warning (acpi_thermal-0470): Critical trip point [20060127]
Jan  5 19:57:30 fermi kernel: Critical temperature reached (0 C), shutting down.
Jan  5 19:57:30 fermi kernel: ACPI: Thermal Zone [TZ02] (0 C)
Jan  5 19:57:30 fermi shutdown[3689]: shutting down for system halt
Jan  5 19:57:32 fermi init: Switching to runlevel: 0
Jan  5 19:57:38 fermi kernel: bootsplash: status on console 0 changed to on
Jan  5 19:57:39 fermi auditd[2816]: The audit daemon is exiting.
Jan  5 19:57:39 fermi kernel: audit(1168023459.104:6): audit_pid=0 old=2816 by auid=4294967295
Jan  5 19:57:39 fermi smpppd[2899]: terminating on signal 15
Jan  5 19:57:39 fermi sshd[2896]: Received signal 15; terminating.
Jan  5 19:58:01 fermi kernel: Kernel logging (proc) stopped.
Jan  5 19:58:01 fermi kernel: Kernel log daemon terminating.
Jan  5 19:58:02 fermi syslog-ng[2716]: syslog-ng version 1.6.8 going down
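
To spot the offending zone in a log like the one above, the thermal zone lines can be filtered mechanically. A small Python sketch; the regular expression and the function name are assumptions for illustration, and the sample lines are copied from the excerpt:

```python
import re

# Matches kernel lines such as "ACPI: Thermal Zone [TZ02] (0 C)".
ZONE_RE = re.compile(r"ACPI: Thermal Zone \[(\w+)\] \((-?\d+) C\)")


def thermal_zones(lines):
    """Return {zone_name: temperature_in_celsius} for every zone line found."""
    zones = {}
    for line in lines:
        match = ZONE_RE.search(line)
        if match:
            zones[match.group(1)] = int(match.group(2))
    return zones


sample = [
    "Jan  5 19:57:30 fermi kernel: ACPI: Thermal Zone [TZ01] (25 C)",
    "Jan  5 19:57:30 fermi kernel: Critical temperature reached (0 C), shutting down.",
    "Jan  5 19:57:30 fermi kernel: ACPI: Thermal Zone [TZ02] (0 C)",
]
zones = thermal_zones(sample)
# Zones reporting exactly 0 C are the suspicious ones in this report.
suspicious = [name for name, temp in zones.items() if temp == 0]
print(zones, suspicious)
```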

I was going to upgrade to 10.2; do you think this would resolve the issue?
Comment 31 Antonis Stylianou 2007-01-06 18:25:15 UTC
No, the problem is the same for 10.2, but you can always upgrade to Ubuntu, which is far more stable and far easier.
Comment 32 Thomas Renninger 2007-01-08 12:52:20 UTC
This is an EC _REG problem.
It seems Ubuntu does never invoke:
\_SB_.PCI0.LPCB.EC0_._REG
with a value of "3" passed as first and value "1" passed as second argument.

As soon as _REG is invoked with a value of "3" passed as first and "1" passed as second argument, ECRY becomes "1".

                    Method (_REG, 2, NotSerialized)
                    {
                        If (LEqual (Arg0, 0x03))
                        {
                            Store (Arg1, ECRY)
                        }
                    }
This is the only place where ECRY is written, so for this BIOS this must never happen: as soon as ECRY is "1", all EC access is avoided, either via the ECOK method or by checking LEqual (\_SB.PCI0.LPCB.EC0.ECRY, One) directly.
If this happens, battery, thermal and other info normally read from EC variables is not accessed and bogus values are returned.

E.g. for thermal (temperature and critical trip point):
                If (LEqual (\_SB.PCI0.LPCB.EC0.ECRY, One))
                {
                    Multiply (\_SB.PCI0.LPCB.EC0.VTSD, 0x0A, Local0)
                    Add (Local0, 0x0AAC, Local0)
                    Return (Local0)
                }


This is even reproducible in userspace by adding some Store (xy, debug) statements, recompiling the DSDT and invoking it via acpiexec:
I did that for the suspicious _REG func:

                    Method (_REG, 2, NotSerialized)
                    {
                        Store ("_REG invoked", debug)
                        Store (Arg0, debug)
                        Store (Arg1, debug)
                        If (LEqual (Arg0, 0x03))
                        {
                            Store (Arg1, ECRY)
                        }
                    }

and outcome with acpiexec already at initialisation time is:
[ACPI Debug]  String: [0x0C] "_REG invoked"
[ACPI Debug]  Integer: 0x00000001
[ACPI Debug]  Integer: 0x00000001
[ACPI Debug]  String: [0x0C] "_REG invoked"
[ACPI Debug]  Integer: 0x00000003
[ACPI Debug]  Integer: 0x00000001
[ACPI Debug]  String: [0x0C] "_REG invoked"
[ACPI Debug]  Integer: 0x00000003
[ACPI Debug]  Integer: 0x00000001

Therefore ECRY is "1", and many (all?) methods that try to access the EC later check that it is not "1" and otherwise return static, bogus info.
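
The acpiexec trace above can be replayed against a toy model of the quoted _REG method to see how ECRY ends up as 1. This is a Python sketch under stated assumptions: the class name is hypothetical, ECRY is assumed to start at 0, and only the store into ECRY is modelled, not how other methods interpret the flag:

```python
EMBEDDED_CONTROL = 0x03  # address space ID that the quoted _REG method checks


class EcModel:
    """Toy model of the quoted _REG method; not a real AML interpreter."""

    def __init__(self):
        self.ecry = 0  # assumption: ECRY starts cleared

    def reg(self, arg0: int, arg1: int) -> None:
        # If (LEqual (Arg0, 0x03)) { Store (Arg1, ECRY) }
        if arg0 == EMBEDDED_CONTROL:
            self.ecry = arg1


ec = EcModel()
# Argument pairs as printed in the acpiexec debug output.
for arg0, arg1 in [(1, 1), (3, 1), (3, 1)]:
    ec.reg(arg0, arg1)
print(ec.ecry)
```

In this model only a later call with arguments (3, 0) would clear the flag again.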

I call this a BIOS bug, is there a new BIOS available?

I doubt Ubuntu is going to work (can you confirm that you see this Temperature: 0 C shutdown problem with SUSE and it does work with Ubuntu?) or they might have blacklisted the BIOS.

I just found this in an Ubuntu forum; someone claims a BIOS update helped, so this is the first thing you should try:
https://launchpad.net/ubuntu/+source/linux-source-2.6.17/+bug/58696
Comment 33 Thomas Renninger 2007-01-08 12:55:20 UTC
Long story...: Can you please upgrade BIOS and tell us whether it helps, tks.
Comment 34 Thomas Renninger 2007-01-08 12:57:48 UTC
Here another report where a BIOS update helped:
https://lists.ubuntu.com/archives/kubuntu-users/2006-June/006183.html
(Follow the thread.) I would like to close this one, but I will still wait for confirmation whether it really works and whether a new BIOS is available.
Comment 35 Antonis Stylianou 2007-01-08 17:16:02 UTC
NO, UBUNTU HAS WORKED JUST FINE SINCE 6.04. THIS IS IN THE DESCRIPTION OF THE BUG: "I have installed the 6.04 ubuntu without facing the shoudown problem." I made a mistake with "shoudown"; I meant shutdown. When I reported this bug, the first thing I did was check for a new BIOS. In 2006-05 I was using the latest BIOS, but now there is an upgrade. I cannot test this upgrade in the near future because I am now using Ubuntu.
Comment 36 Thomas Renninger 2007-01-08 17:29:54 UTC
Sandro: Can you try the latest BIOS, pls.

Antonis: Can you at least check whether you get a /proc/acpi directory with Ubuntu, I wonder whether they just blacklisted the machine to boot acpi=off.
Comment 37 Antonis Stylianou 2007-01-08 17:55:07 UTC
Yes, there is a /proc/acpi/ and the thermal_zone is there. Also, I made a mistake in my previous post: I have a small partition with Novell Linux for testing that I forgot about (the problem reproduces on Novell as well). There is an upgrade for my BIOS, but HP does not list it in their automatic hardware update on Windows XP. I will contact HP and let you know about the BIOS upgrade. Keep in mind that when I tried Fedora Core the problem was not reproduced.
Comment 38 Thomas Renninger 2007-01-09 14:56:27 UTC
Does this get read by ACPI/Intel folks if I add acpi-bugzilla@lists.sourceforge.net to our bugzilla? I think so, if not I will use acpi@linux.intel.com again. I didn't know this one is registered, will always add this when I get such reports from now on then.

I just checked: the HP dv8000 from Antonis and the HP Pavilion DV5175EA from Sandro have exactly the same DSDT (only one mem region is declared with a different value), so you suffer from the same problem for sure.

Maybe someone of the Intel guys can shed some light on the ACPI spec concerning the _REG method (this is about _REG method, chapter 6.5.4 ACPI spec 3.0 on page 228/229).
My question is whether _REG(3,0) (Embedded Controller, disconnection handler) must be called after the EC driver has been initialised, or whether it must only be called after the EC driver gets unloaded (therefore never, or only at machine shutdown time). The latter is what I think and what the kernel currently seems to do, in which case this is a BIOS bug.

Ubuntu must include a patch that avoids invoking the EC's _REG method with parameters _REG(3,1), or they call _REG(3,0) at some point. I didn't find a nice way to access the Ubuntu kernel package files; I will try to find out which patch could work around/fix this issue in Ubuntu...
Comment 39 Luming Yu 2007-01-09 15:52:38 UTC
Please take a look at http://bugzilla.kernel.org/show_bug.cgi?id=1690.
If it is not related to this problem, please just let me know.

Thanks,
Luming
Comment 40 Thomas Renninger 2007-01-11 11:10:53 UTC
Sorry I was wrong, it's not the _REG method, everything is fine here.

It's just that the EC read returns totally bogus data (in this case zero for the thermal read \_SB.PCI0.LPCB.EC0.DTSD):

                If (LEqual (\_SB.PCI0.LPCB.EC0.ECRY, One))
                {
                    Multiply (\_SB.PCI0.LPCB.EC0.DTSD, 0x0A, Local0)
                    Add (Local0, 0x0AAC, Local0)
                    Return (Local0)
                }

Therefore 0xAAC is returned which evaluates to a temperature of 0 C.
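
The arithmetic behind that statement: ACPI thermal methods return temperatures in tenths of a kelvin, and 0x0AAC is 2732, i.e. 273.2 K. A short Python sketch of the AML snippet and the conversion back to Celsius; the function names are hypothetical, and it is assumed that DTSD holds whole degrees Celsius:

```python
KELVIN_OFFSET_TENTHS = 0x0AAC  # 2732 tenths of a kelvin, i.e. 273.2 K


def aml_temperature(dtsd: int) -> int:
    """Multiply (DTSD, 0x0A, Local0); Add (Local0, 0x0AAC, Local0)."""
    return dtsd * 0x0A + KELVIN_OFFSET_TENTHS


def tenths_kelvin_to_celsius(value: int) -> float:
    """Convert an ACPI temperature (tenths of a kelvin) to degrees Celsius."""
    return (value - KELVIN_OFFSET_TENTHS) / 10


raw = aml_temperature(0)  # the EC read returned zero
print(hex(raw), tenths_kelvin_to_celsius(raw))
```

With a plausible EC value such as 25, the same functions give 2982 tenths of a kelvin, which converts back to 25 C.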

As this is an HP laptop it could be related to:
http://rudin.suse.de:8888/show_bug.cgi?id=179702

Ubuntu has psmouse compiled as a module, this could be it.
If you also have a 10.2 installed you might want to test a kernel from here:
ftp.suse.com/pub/people/trenn/hp_fixes_final
Otherwise, if the patch is OK and committed to SLE10-SP1, I can point you to a kernel, or you have to wait until the 10.1 update kernel gets synced with the SLE10-SP1 kernel.
Comment 41 Sandro Bordacchini 2007-01-14 16:16:42 UTC
I've just upgraded the BIOS on my HP Pavilion dv5175ea from rev. F.09 to rev. F.22.

The issue is resolved, no more shutdowns: I'm still using the standard "SUSE 10.1" kernel, with no patches.
Comment 42 Thomas Renninger 2007-01-15 09:07:15 UTC
Thanks.