Bug 145197

Summary: e100 crashes on resume from suspend-to-RAM
Product: [openSUSE] SUSE Linux 10.1
Reporter: Joachim Gleissner <joachim.gleissner>
Component: Kernel
Assignee: Olaf Kirch <okir>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: dmueller
Version: Beta 1
Target Milestone: ---
Hardware: i386
OS: Other
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Picture of the Oops
             Proposed patch

Description Joachim Gleissner 2006-01-24 15:03:51 UTC
Machine is a Sony Vaio VGN-FS115B. I'll attach a screen shot of the oops.
Comment 1 Joachim Gleissner 2006-01-24 15:05:22 UTC
Created attachment 64761 [details]
Picture of the Oops
Comment 2 Forgotten User ZhJd0F0L3x 2006-01-24 18:26:54 UTC
Exactly the same oops happens after suspend to disk, or after running:
echo -n 2 > $SYSFS_DEVICE_PATH/power/state
echo -n 0 > $SYSFS_DEVICE_PATH/power/state # oopses here

I reproduced this on a second e100 machine (Compaq Armada E500) and got the same oops.
It still worked with 2.6.15-git12-6 (the beta1 kernel).
Comment 3 Pavel Machek 2006-01-24 19:44:02 UTC
Stefan, can you try to generate the diff between the last working kernel and the current one, and see if you can spot something interesting?

Otherwise just add prints into e100_hw_init to see where it dereferences the NULL, and try to fix it. I'm afraid I do not have the right hardware....

(Feel free to reassign to any kernel hacker who has an e100.... Karsten had some collection of network cards?)
Comment 4 Forgotten User ZhJd0F0L3x 2006-01-24 23:01:52 UTC
Karsten, anything you can do to help here?
I also notified lkml and netdev lists about this one.
Comment 5 Olaf Kirch 2006-01-25 10:22:16 UTC
There was an e100 update in 2.6.16-rc1-git3, which seems to have introduced
this problem.

Apparently it dies in e100_exec_cb_wait:


        if ((err = e100_exec_cb(nic, NULL, e100_setup_ucode)))
                DPRINTK(PROBE,ERR, "ucode cmd failed with error %d\n", err);
        /* we see this message in the oops; it returns ENOMEM because
         * nic->cbs_avail == 0 */

        /*...*/
        while (!(cb->status & cpu_to_le16(cb_complete))) {
                msleep(10);
                if (!--counter) break;
        }

I think it dies while dereferencing cb->status, because cb is NULL. That's because
the cbs aren't allocated until later.
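
To make the failure mode concrete, here is a minimal userspace sketch of it. It is not the attached patch and not the driver code: every name (model_nic, model_cb, model_exec_cb, model_exec_cb_wait, MODEL_CB_COMPLETE) is a hypothetical stand-in, and the timeout is an arbitrary assumption. It only shows why the wait loop must not run once queuing the command has already failed.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
#define MODEL_CB_COMPLETE 0x8000

struct model_cb {
	unsigned short status;
};

struct model_nic {
	struct model_cb *cb_to_clean; /* NULL until the cb ring is allocated */
	unsigned int cbs_avail;       /* 0 until the cb ring is allocated */
};

/* Queue a command; fails with -ENOMEM when no cb is available. */
static int model_exec_cb(struct model_nic *nic)
{
	assert(nic != NULL);
	if (nic->cbs_avail == 0)
		return -ENOMEM;
	nic->cbs_avail--;
	return 0;
}

/* Queue a command and poll for completion. */
static int model_exec_cb_wait(struct model_nic *nic)
{
	struct model_cb *cb;
	int counter = 50; /* arbitrary poll limit for this sketch */
	int err = model_exec_cb(nic);

	/* Guard: if queuing failed, there is nothing to wait for. */
	if (err)
		return err;

	cb = nic->cb_to_clean;
	if (cb == NULL)
		return -EINVAL;

	/* Without the guards above, this would dereference a NULL cb. */
	while (!(cb->status & MODEL_CB_COMPLETE)) {
		if (!--counter)
			return -ETIMEDOUT;
	}
	return 0;
}
```

Calling model_exec_cb_wait() on a nic whose ring has not been allocated returns -ENOMEM instead of touching cb->status, which is the behaviour the oops shows is missing.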
Comment 6 Olaf Kirch 2006-01-25 10:23:34 UTC
Created attachment 64889 [details]
Proposed patch
Comment 7 Olaf Kirch 2006-01-25 10:35:15 UTC
kalman-okir-587 kernel-default: IN PROGRESS
 - i386: not started yet

please test
Comment 8 Olaf Kirch 2006-01-25 11:22:37 UTC
[mbuild kalman-okir-587] kernel-default on i386: succeeded
Comment 9 Forgotten User ZhJd0F0L3x 2006-01-25 11:44:59 UTC
Yes, sir. I can boogie.

Works for me on Armada E500 and suspend to RAM.
Comment 10 Olaf Kirch 2006-01-25 12:08:38 UTC
Thanks for confirming.
The fix is in the CVS tree.
Comment 11 Joachim Gleissner 2006-01-25 12:16:01 UTC
Just for the record, it also works on the Sony Vaio now. Thanks!
Comment 12 Dirk Mueller 2006-01-25 15:40:05 UTC
*** Bug 145507 has been marked as a duplicate of this bug. ***
Comment 13 Olaf Kirch 2006-01-26 11:17:05 UTC
Following a comment from Jesse Brandeburg on netdev, I have adapted the
patch to simply not call hw_init inside the resume() function.

I'm currently building a new kernel with this patch, and I would like people
to test this:

 -	suspend to RAM and resume
 -	suspend to disk and resume
 -	ifconfig eth0 down; suspend/resume; ifconfig up
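
Why the third case matters can be seen in a small userspace model. This is only a sketch under assumptions, not the patch itself: the names (pm_nic, pm_up, pm_suspend, pm_resume, pm_ifup) are hypothetical, and it assumes that bringing the interface up is the step that both allocates the cbs and initialises the hardware, so that resume() can leave the hardware alone when the interface is down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the driver state relevant to suspend/resume. */
struct pm_nic {
	bool running;       /* interface is administratively up */
	bool cbs_allocated; /* command block ring exists */
	bool hw_ready;      /* hardware has been initialised */
};

/* Normal bring-up path: allocate the cbs first, then set up the hardware. */
static void pm_up(struct pm_nic *nic)
{
	assert(nic != NULL);
	nic->cbs_allocated = true;
	nic->hw_ready = true;
}

/* Suspend drops the ring and the hardware state. */
static void pm_suspend(struct pm_nic *nic)
{
	assert(nic != NULL);
	nic->cbs_allocated = false;
	nic->hw_ready = false;
}

/* Resume never initialises the hardware on its own; it only re-runs
 * the normal bring-up path if the interface was up. */
static void pm_resume(struct pm_nic *nic)
{
	assert(nic != NULL);
	if (nic->running)
		pm_up(nic);
}

/* Models "ifconfig eth0 up". */
static void pm_ifup(struct pm_nic *nic)
{
	assert(nic != NULL);
	nic->running = true;
	pm_up(nic);
}
```

In this model an interface that was down across suspend/resume stays untouched until pm_ifup() runs, so the hardware is never initialised before the cbs exist.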

mbuild job is

 - queued kernel-default for dist i386
Your jobid is 'kalman-okir-589'. Reports will be sent to okir@suse.de.
Comment 14 Olaf Kirch 2006-01-26 13:06:16 UTC
The mbuild job is

kalman-okir-593 kernel-default: IN PROGRESS
 - i386: building (on bach-1, ETA at 14:08)
Comment 15 Dirk Mueller 2006-01-26 15:07:52 UTC
seems to work here, but now produces this:

Jan 26 15:44:32 schleppi klogd: Debug: sleeping function called from invalid context at mm/slab.c:2515
Jan 26 15:44:32 schleppi klogd: in_atomic():0, irqs_disabled():1
Jan 26 15:44:32 schleppi klogd:  [<c014cc6a>] kmem_cache_alloc+0x1b/0x79
Jan 26 15:44:32 schleppi klogd:  [<c01ce167>] acpi_os_acquire_object+0xb/0x36
Jan 26 15:44:32 schleppi klogd:  [<c01e4f30>] acpi_ut_allocate_object_desc_dbg+0x13/0x49
Jan 26 15:44:32 schleppi klogd:  [<c01e4f7b>] acpi_ut_create_internal_object_dbg+0x15/0x68
Jan 26 15:44:32 schleppi klogd:  [<c01e1169>] acpi_rs_set_srs_method_data+0x3d/0xb7
Jan 26 15:44:32 schleppi klogd:  [<c014bcde>] cache_alloc_debugcheck_after+0xb8/0xea
Jan 26 15:44:32 schleppi klogd:  [<c01e870b>] acpi_pci_link_set+0x40/0x1c0
Jan 26 15:44:32 schleppi klogd:  [<c01e87d1>] acpi_pci_link_set+0x106/0x1c0
Jan 26 15:44:32 schleppi klogd:  [<c01e88e0>] irqrouter_resume+0x55/0x73
Jan 26 15:44:32 schleppi klogd:  [<c020af67>] __sysdev_resume+0x11/0x53
Jan 26 15:44:32 schleppi klogd:  [<c020b0a7>] sysdev_resume+0x16/0x47
Jan 26 15:44:32 schleppi klogd:  [<c020efd2>] device_power_up+0x5/0xa
Jan 26 15:44:32 schleppi klogd:  [<c012ef8c>] swsusp_suspend+0x6b/0x85
Jan 26 15:44:32 schleppi klogd:  [<c012fdf6>] pm_suspend_disk+0x44/0xd1
Jan 26 15:44:32 schleppi klogd:  [<c012e4cc>] enter_state+0x50/0x160
Jan 26 15:44:32 schleppi klogd:  [<c012e664>] state_store+0x88/0x95
Jan 26 15:44:32 schleppi klogd:  [<c012e5dc>] state_store+0x0/0x95
Jan 26 15:44:32 schleppi klogd:  [<c0182666>] subsys_attr_store+0x1e/0x22
Jan 26 15:44:32 schleppi klogd:  [<c0182927>] sysfs_write_file+0x9b/0xc1
Jan 26 15:44:32 schleppi klogd:  [<c018288c>] sysfs_write_file+0x0/0xc1
Jan 26 15:44:32 schleppi klogd:  [<c014f816>] vfs_write+0xa1/0x146
Jan 26 15:44:32 schleppi klogd:  [<c014fd2c>] sys_write+0x3c/0x63
Jan 26 15:44:32 schleppi klogd:  [<c0102a3b>] sysenter_past_esp+0x54/0x79
Comment 16 Olaf Kirch 2006-01-26 15:45:08 UTC
That seems to be a different problem in the generic swsusp code.
Please open a new bug report for this.
Comment 17 Olaf Kirch 2006-01-26 15:45:32 UTC
Anyone else able to confirm that this patch fixes the problem as well?
Comment 18 Forgotten User ZhJd0F0L3x 2006-01-26 15:49:06 UTC
Works fine.
The "sleeping function called from invalid context" warning is known and harmless. It goes away as soon as we disable the debugging again :-)
Comment 19 Forgotten User ZhJd0F0L3x 2006-01-26 15:49:50 UTC
The sleeping in atomic is actually in generic ACPI code; it happens on suspend to RAM and to disk, and the ACPI guys know about it.
Comment 20 Joachim Gleissner 2006-01-26 15:51:22 UTC
Works for me, too.
Comment 21 Olaf Kirch 2006-01-26 16:04:08 UTC
Thanks, updated patch is in CVS.