Bug 116763

Summary: HP nx8220 no longer installing as in 9.3
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Stefan Behlert <behlert>
Component: Kernel
Assignee: Thomas Renninger <trenn>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P5 - None
CC: acpi, aj, hare, meissner, trenn
Version: RC 2
Target Milestone: ---
Hardware: Other
OS: All
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Jpeg of point of stop
output of acpidmp
Reverted check of function value - please review someone...

Description Stefan Behlert 2005-09-13 13:44:39 UTC
I have an HP nx8220 here that hangs during kernel initialization when the
installation is started.
With 9.3 the installation worked flawlessly; with 10.0 RC2 I have to disable
ACPI for installation (which results in 'no network found', but that's another
story).
That's a clear regression :(
It looks like interrupts are set up wrongly or with the wrong type.
Comment 1 Andreas Jaeger 2005-09-13 15:03:42 UTC
Did you try the other options, like irqpoll?
Comment 2 Thomas Renninger 2005-09-13 15:07:20 UTC
pci=noacpi boots the machine and installation should work. Is this still a blocker?
Comment 3 Andreas Kleen 2005-09-13 15:14:12 UTC
Please connect a serial console and attach the boot log; also attach the acpidmp output.
Comment 4 Stefan Behlert 2005-09-13 15:17:07 UTC
I'd happily do so if the machine had a serial connector :( 
All these new notebooks don't have one, and we do not have a docking station 
for that type of notebook :( 
 
Comment 5 Andreas Kleen 2005-09-13 15:26:49 UTC
Sigh. We really, really need a USB or FireWire console.

You could boot with vga=0x0f01 (to make the font as small as possible)
and then photograph it (including scrollback).

acpidmp should work with pci=noacpi/acpi=off anyway.
Comment 6 Stefan Behlert 2005-09-13 15:52:50 UTC
OK, booting with vga=0x0f01 and photographing the point where it stopped
worked. Sorry, scrollback is not working.
I'll do an installation with pci=noacpi and then try acpidmp, if that's
OK (or should it work while in linuxrc? I don't think so, does it?)
 
Comment 7 Stefan Behlert 2005-09-13 15:53:50 UTC
Created attachment 49783 [details]
Jpeg of point of stop
Comment 8 Stefan Behlert 2005-09-13 16:10:00 UTC
Created attachment 49785 [details]
output of acpidmp

That's the output with pci=noacpi
Comment 9 Andreas Jaeger 2005-09-13 16:31:32 UTC
Lowering severity.
Comment 11 Stefan Behlert 2005-09-26 07:51:43 UTC
Ok, two weeks have passed. Anything new? 10.0 is hitting the street in a few 
months and it looks more and more like no HP from that generation will work 
without workaround and manual intervention (See also the nc6230 we gave you, 
Thomas). 
Comment 12 Stefan Behlert 2005-09-26 07:53:30 UTC
*** Bug 116971 has been marked as a duplicate of this bug. ***
Comment 13 Thomas Renninger 2005-10-03 16:37:34 UTC
I think I got it. The problem is, I don't know why it works -> adding some info.

The problem is an endless loop when iterating over a list. Adding this debug
output to our (or vanilla 2.6.14-rcX) kernels (watch out for whitespace damage
in the patch):


--- vanilla-linux-2.6.14-rc3/drivers/acpi/scan.c.orig   2005-10-03 18:27:00.000000000 +0200
+++ vanilla-linux-2.6.14-rc3/drivers/acpi/scan.c        2005-10-03 18:27:51.000000000 +0200
@@ -555,6 +555,7 @@

        spin_lock(&acpi_device_lock);
        list_for_each_safe(node, next, &acpi_device_list) {
+               printk(KERN_ERR "Driver attach: prev %p - node %p - next %p\n", node->prev, node, node->next);
                struct acpi_device *dev =
                    container_of(node, struct acpi_device, g_list);


Results in a never ending list iteration (also added some more debug info,
please note the prev/node/next pointers that are all equal):

acpi_bus_match for driver [motherboard] for device [C23D]
Driver attach: prev c18eb820 - node c18eb820 - next c18eb820
    scan-0616 [02] acpi_driver_attach    : Found driver [motherboard] for device [C23D]
acpi_bus_match for driver [motherboard] for device [C23D]
Driver attach: prev c18eb820 - node c18eb820 - next c18eb820
    scan-0616 [02] acpi_driver_attach    : Found driver [motherboard] for device [C23D]
acpi_bus_match for driver [motherboard] for device [C23D]
...
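
For illustration, a minimal user-space sketch of why a node whose prev and next both point back at itself keeps the iteration from ever terminating. The list type and the helper names (init_list_head, list_add_tail_node, bounded_walk, demo_self_linked) are simplified stand-ins written for this example, not the kernel's <linux/list.h>. How the node became self-linked on the real machine is not known from this report; here it is simply assumed to have been re-initialised while still on the list.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's doubly linked list node. */
struct list_head {
    struct list_head *next, *prev;
};

/* An empty list (or a freshly initialised node) points at itself. */
static void init_list_head(struct list_head *h)
{
    h->next = h;
    h->prev = h;
}

/* Append node n at the tail of the list anchored at head. */
static void list_add_tail_node(struct list_head *n, struct list_head *head)
{
    n->prev = head->prev;
    n->next = head;
    head->prev->next = n;
    head->prev = n;
}

/*
 * Walk the list the way a "safe" iteration does (next is fetched before
 * the body runs), but give up after max_steps so the demo terminates.
 * Returns the number of nodes visited, or -1 if the limit was exceeded.
 */
static int bounded_walk(struct list_head *head, int max_steps)
{
    struct list_head *node, *next;
    int steps = 0;

    for (node = head->next, next = node->next;
         node != head;
         node = next, next = node->next) {
        if (++steps > max_steps)
            return -1;  /* never got back to head */
    }
    return steps;
}

/* Build a two-node list, then corrupt it as assumed above. */
static int demo_self_linked(void)
{
    struct list_head head, a, b;

    init_list_head(&head);
    list_add_tail_node(&a, &head);
    list_add_tail_node(&b, &head);
    assert(bounded_walk(&head, 100) == 2);  /* healthy list terminates */

    /* Assumption: b is re-initialised while a still links to it. */
    init_list_head(&b);

    /* a.next is b, b.next is b: the walk can no longer reach head. */
    return bounded_walk(&head, 100);
}
```

Calling demo_self_linked() returns -1: once the walk reaches b, node and next are the same pointer, which matches the identical prev/node/next values in the output above.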

After I realised the bug may not be located where the list is actually touched,
I compared changes with old (2.6.12.6) kernels.

The third shot was the hit, reverting this change makes the machine boot again:

--- vanilla-linux-2.6.14-rc3.orig/drivers/acpi/scan.c   2005-10-03 18:21:35.000000000 +0200
+++ vanilla-linux-2.6.14-rc3/drivers/acpi/scan.c        2005-10-03 18:21:58.000000000 +0200
@@ -1111,7 +1111,7 @@
         *
         * TBD: Assumes LDM provides driver hot-plug capability.
         */
-       result = acpi_bus_find_driver(device);
+       acpi_bus_find_driver(device);

       end:
        if (!result)

Will also attach this as patch that fixes the problem.
However, a kernel hacker has to review -> I don't know why this fixes anything ...
Comment 14 Thomas Renninger 2005-10-03 16:39:02 UTC
Created attachment 51332 [details]
Reverted check of function value - please review someone...
Comment 15 Thomas Renninger 2005-10-06 08:07:23 UTC
I asked the author who made this change (rajesh.shah@intel.com) and it turned
out that it was added accidentally:

FWD:
________________________________________________
Looking at this closely now, checking for the result does appear
to be wrong. Binding a driver for a device should be optional,
and should not fail adding the device to the acpi list. I suspect
a previous iteration through this code failed to find a driver
match, returned failure to the caller and caused bad things to
happen. So, your patch looks good to me.

cc'ing Bob for his expert opinion.
________________________________________________

Olaf, can you add the patch to 10.0 and, if it makes sense, to 10.1 Alpha?
Comment 16 Thomas Renninger 2005-10-06 08:35:54 UTC
*** Bug 98280 has been marked as a duplicate of this bug. ***
Comment 17 Olaf Kirch 2005-10-06 11:38:20 UTC
Done. It's still scary that this mistake can mess up that list. 
There are other conditions where we bail out of that function without 
adding the device entry - will this have similar devastating effects? 
Comment 18 Thomas Renninger 2005-10-07 12:44:52 UTC
No idea, if something similar pops up, I know at least where to search now ...
Comment 19 Andreas Kleen 2005-10-11 00:21:42 UTC
The patch helps an Asus K8N-DL to boot too. Previously it would
also hang with an endless loop in the device list. It still doesn't
boot unfortunately without pci=noacpi, but that's probably a different
problem and it's better to have basic ACPI enabled.
Comment 22 Thomas Renninger 2005-10-14 09:16:40 UTC
*** Bug 117452 has been marked as a duplicate of this bug. ***
Comment 23 Forgotten User ZhJd0F0L3x 2005-10-14 09:55:44 UTC
Thomas, are you sure about this duplicate? These are totally different symptoms:
one machine crashes at boot immediately, the other hangs much later during
package installation...
Comment 24 Thomas Renninger 2005-10-14 10:07:09 UTC
After installation the system crashes on bootup with several stack traces, but I
can't scroll up to see more. I only see a stack trace of ata_piix.

-> You are right it is strange that he could boot the system for installation.

Could it be that the resources of some device (firewire, ..., whatever) were
not requested during installation, but get requested on the reboot after
installation?
Then it would make sense. Still, it is strange that the oopses happen in the
disk driver?!?

-> I installed 10_0 CVS branch kernel and it booted fine without adding
acpi=oldboot.

Matthias: Please reopen if you still see problems with the kernel I installed or
the next YOU update kernel ...
Comment 25 Matthias Boettger 2005-11-22 12:08:30 UTC
Thomas, I only saw this today. You installed the default kernel from the 10_0 CVS branch, but I need the smp kernel (HT CPU). The problem still exists in the -smp kernel. I installed the kernel-smp from dist.
Comment 26 Marcus Meissner 2005-11-30 16:28:09 UTC
This is not in the current 10.0 update (Matthias confirmed it).
Comment 27 Thomas Renninger 2005-12-08 16:49:23 UTC
Sorry for the confusion.
The "HP nx8220 no longer installing as in 9.3" fix is in the update kernel.
It seems as if comment #23 is valid and Matthias' problem is not a duplicate?

I will close this one again and reopen the initial report from Matthias.