Bug 148399 - second swsuspend in a row freezes with memory corruption
Status: RESOLVED WONTFIX
Alias: None
Product: SUSE Linux 10.1
Classification: openSUSE
Component: Kernel
Version: Beta 6
Hardware: i686 Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: Pavel Machek
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-06 14:59 UTC by Carl-Daniel Hailfinger
Modified: 2006-04-12 23:57 UTC
0 users

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
screen shot of memory corruption (151.21 KB, image/jpeg)
2006-03-05 17:50 UTC, Carl-Daniel Hailfinger
Description Carl-Daniel Hailfinger 2006-02-06 14:59:46 UTC
Machine is a PIII-1000, kernel is untainted.

The first suspend-to-disk works just fine (besides the warning mentioned in bug 145880). The second suspend-to-disk will freeze/hang with a BUG.

Hand-written log follows:

Stopping tasks: ==========================================|
Shrinking memory... done (0 pages freed)
pnp: Device 00:0a disabled.
pnp: Device 00:09 disabled.
    ACPI-0201: *** Warning: Device is not power manageable
swsusp: Need to copy 52361 pages
swsusp: critical section/: done (52361 pages copied)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
swsusp: Restoring Highmem
Debug: sleeping function called from invalid context at mm/slab.c:2515
in_atomic():0, irqs_disabled():1
kmem_cache_alloc+0x1b/0x79
acpi_os_acquire_object+0xb/0x36
acpi_ut_allocate_object_desc_dbg+0x13/0x49
acpi_ut_create_internal_object_dbg+0x15/0x68
acpi_rs_set_srs_method_data+0x3d/0xb7
cache_alloc_debugcheck_after+0xb8/0xea
acpi_pci_link_set+0x40/0x1c0
acpi_pci_link_set+0x106/0x1c0
irqrouter_resume+0x55/0x73
__sysdev_resume+0x11/0x53
sysdev_resume+0x16/0x47
device_power_up+0x5/0xa
swsusp_suspend+0x6b/0x85
pm_suspend_disk+0x44/0xd1
enter_state+0x50/0x160
state_store+0x88/0x95
state_store+0x0/0x95
subsys_attr_store+0x1e/0x22
sysfs_write_file+0x9b/0xc1
sysfs_write_file+0x0/0xc1
vfs_write+0xa1/0x146
sys_write+0x3c/0x63
syscall_call+0x7/0xb
ACPI-0201: *** Warning: Device is not power manageable
PCI: setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LINKD] -> GSI 5 (level, low) -> IRQ 5
pnp: Device 00:09 activated.
pnp: Device 00:0a activated.
pnp: Device 00:0b does not supported activation.
pnp: Failed to activate device 00:0c.
BUG: soft lockup detected on CPU#0!
Pid: 0, comm: swapper
EIP is at rtl_8169_interrupt+0x36/0x311 [r8169]
handle_IRQ_event
__do_IRQ
do_IRQ
common_interrupt
unix_shutdown
__do_softirq
do_softirq
do_IRQ
common_interrupt
acpi_processor_idle
cpu_idle
start_kernel
Comment 1 Pavel Machek 2006-02-09 09:12:27 UTC
Can you try without rtl8169?

What happens if you simply disable the soft lockup watchdog?
Comment 2 Forgotten User ZhJd0F0L3x 2006-02-09 10:37:53 UTC
r8169 is probably still totally broken wrt. suspend, which means: maybe even try it without r8169 ever loaded. IIRC it even borked the system after unloading it.
Comment 3 Forgotten User ZhJd0F0L3x 2006-02-09 10:38:55 UTC
i will also install 10.1 on my r8169 toughbook and try it there.
Comment 4 Pavel Machek 2006-02-20 13:53:25 UTC
Perhaps we can just blacklist r8169...
Comment 5 Forgotten User ZhJd0F0L3x 2006-02-20 14:58:36 UTC
i blacklisted it, but iirc (there is an old bug with kkeil and pavel in cc) the r8169 is seriously broken, even unloading does not help. it is broken after resume. But i have to retry this, unfortunately i don't have the machine right now.

Holger, please merge the commit into Code10 packages. Thanks.
Comment 6 Forgotten User ZhJd0F0L3x 2006-02-20 16:09:50 UTC
i tried it on the toughbook:
- suspend to disk looked okay, even multiple suspends
- suspend to RAM broke the r8169, so the second suspend failed with
  "NetworkManager not stopped" => all "ifconfig", "ip", whatever that
  wanted to access network hung in state D.

Unloading r8169 fixed suspend to ram => we unload it before suspend to ram and disk, just to be sure.

Somebody should fix r8169. I stared at the code for quite some time, but it did not get better ;-)
Comment 7 Pavel Machek 2006-02-21 23:14:23 UTC
Can you try if unloading r8169 fixes it for you?
Comment 8 Carl-Daniel Hailfinger 2006-02-22 15:47:00 UTC
Unloading r8169 before resume is done automatically with latest powersave.

The second swsuspend in a row now causes slab corruption. Messages follow:

Slab corruption: start=ccce65bc, len=32
Redzone: 0x0/0x0.
Last user: [<00000000>]
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=ccce6576, len=32
Redzone: 0x0/0x0.
Last user: [<00000000>]
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
slab error in cache_alloc_debugcheck_after(): cache 'size-32': double free, or memory outside object was overwritten
BUG: spinlock recursion on CPU#0, myecho/14363
 lock: c02d7d40, .magic: dead4ead, .owner:myecho/14363, .owner_cpu: 0
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
Comment 9 Carl-Daniel Hailfinger 2006-02-22 16:06:08 UTC
If I delete r8169.ko from /lib/modules/, I can progress further.
Now the second resume hangs. Messages:

Stopping tasks: ===========================================================|
Shrinking memory... done (0 pages freed)
Loading image metadata... done (50 pages loaded)
Loading image data pages (50449 pages) ... done
pnp: Device 00:0b disabled
pnp: Device 00:0a disabled

And after that it hangs.
Comment 10 Carl-Daniel Hailfinger 2006-02-23 00:34:24 UTC
Bad bad bad. I get reproducible memory corruption during the second suspend cycle. Sometimes it triggers an oops/bug/slabcorruption during suspend, sometimes during resume. The first cycle is always fine.
Current kernel is 2.6.16-rc4-4-default.

This is a regression from the kernels in SUSE Linux 10.0.
Comment 11 Carl-Daniel Hailfinger 2006-02-23 00:47:24 UTC
Pavel: Any ideas what I can do to find the cause for this memory corruption bug?
Comment 12 Pavel Machek 2006-02-23 15:57:20 UTC
I'd try to suspend from single-user mode (no modules), then try to find out if it is module causing the corruption. I think it is because I believe I can suspend/resume as many times as I want to.
Comment 13 Carl-Daniel Hailfinger 2006-03-05 15:43:18 UTC
Hm. I have now upgraded to 2.6.16-rc5-git2 and it still locks up, even with fewer modules than before. Will try single-user mode next.
Comment 14 Carl-Daniel Hailfinger 2006-03-05 17:42:12 UTC
My findings so far:
Suspend with graphical frontend from inside kde, r8169 loaded: lockup during second suspend.
Suspend with graphical frontend from inside kde, r8169 never loaded: lockup during second resume with spinlock lockup (myecho/...).
Suspend with "powersave -U" from runlevel 3, r8169 auto-unloaded: lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/4881".
Suspend with "powersave -U" from runlevel 3, r8169 never loaded: slab corruption and spinlock lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/3729".

The "myecho" appearing in most backtraces seems to be part of powersave.rpm.

Pavel: Did you try suspending with SUSE or vanilla kernels, and with "powersave -U" or with "echo disk >/sys/power/state"? Is your base system a 10.1 beta?

Oh, and the memory corruption is present for all configurations tested so far. I just have to wait a bit for the "spinlock lockup" messages to repeat and suddenly printk will print garbage on screen, NULL pointers are not NULL anymore etc., but after some time the printk messages will look normal again.
Comment 15 Carl-Daniel Hailfinger 2006-03-05 17:50:07 UTC
Created attachment 71276
screen shot of memory corruption
Comment 16 Carl-Daniel Hailfinger 2006-03-05 19:50:18 UTC
I'm stuck. Suspend and resume from single user mode works just fine even with all modules loaded.

What should I do now? It seems that the hang only happens in runlevel 3/5.
Comment 17 Carl-Daniel Hailfinger 2006-03-05 21:41:25 UTC
OK, this is getting silly. The bug is independent of runlevel, it just depends on the powersave utility.

If I use "powersave -U", I get kernel hangs due to memory corruption. Latest corruption sign was "swap_free: Unused swap offset entry..." in an endless loop.

If I use "echo disk >/sys/power/state", everything works just fine.
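
For anyone wanting to repeat the comparison, the direct sysfs path can be scripted roughly as below. This is only a sketch: `suspend_cycles` and the `POWER_DIR` variable are made-up names for illustration (the override exists so the loop can be dry-run against a scratch directory), and the only interface taken from this report is writing `disk` to `/sys/power/state`.

```shell
#!/bin/sh
# Sketch: trigger suspend-to-disk N times in a row via sysfs, bypassing powersave.
# POWER_DIR and suspend_cycles are illustrative names, not part of any tool.
POWER_DIR="${POWER_DIR:-/sys/power}"

suspend_cycles() {
    count="${1:-2}"
    i=1
    while [ "$i" -le "$count" ]; do
        echo "cycle $i: writing 'disk' to $POWER_DIR/state"
        # On a real system this write blocks until the machine has resumed.
        echo disk > "$POWER_DIR/state" || return 1
        i=$((i + 1))
    done
}
```

Run as root, `suspend_cycles 2` would go through the two back-to-back cycles that trigger the problem with "powersave -U".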
Comment 18 Carl-Daniel Hailfinger 2006-03-05 21:44:33 UTC
seife: please un-blacklist r8169 for suspend-to-disk. It works fine and was just a victim of "powersave -U".

What the hell is "powersave -U" doing to cause memory corruption?
Comment 19 Forgotten User ZhJd0F0L3x 2006-03-05 21:57:57 UTC
Powersave is doing nothing special. It uses "myecho" instead of echo, but this is the most trivial echo replacement you can think of, see
http://forge.novell.com/modules/xfmod/svn/svnbrowse.php?uri=filedetails.php%3Frepname%3Dpowersave%26path%3D%252Ftrunk%252Fpowersave%252Fhelpertools%252Fmyecho.c%26rev%3D0%26sc%3D0

And if userspace can cause memory corruption, it is a kernel bug ;-)

Does the r8169 actually work after suspend/resume or does it just not crash?
Comment 20 Carl-Daniel Hailfinger 2006-03-05 22:31:16 UTC
Yes, r8169 works after resume.

Will retry the experiment with myecho to see if it is to blame.
Comment 21 Carl-Daniel Hailfinger 2006-03-05 22:55:15 UTC
r8169 works fine after resume, I downloaded a few 700 MB .iso files with it and had no corruption. There were a few suspend/resume bugs in r8169, but IIRC these were fixed upstream in 2.6.16-rc1 or so.

myecho works OK. So what else is different when running "powersave -U"?
Comment 22 Forgotten User ZhJd0F0L3x 2006-03-06 05:46:09 UTC
we are stopping the following services:
DEFAULT_S2D_RESTART="slmodemd irda"

and unloading the following modules:
DEFAULT_S2D_UNLOAD="usb_storage sbp2 ohci_hcd uhci_hcd stir4200 ohci1394 ipw2200 rt2500 prism54 ath_pci r8169 lt_modem Intel536  Intel537"

So maybe the unloading of one of these modules is to blame?

Other than that, there is not much special in powersaved wrt. suspend. I _think_ (but am not sure) that we set cpufreq to maximum before suspend.
We prepare GRUB to not show a menu but directly boot the system without delay, but this should not cause this.

You can prevent unloading of _any_ module before suspend in powersaved by
setting UNLOAD_MODULES_BEFORE_SUSPEND2DISK="NONE" in /etc/powersave/sleep,
look in /var/log/suspend2disk.log for anything else that is done during suspend.
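
To narrow down whether one of the modules on that list is involved, it may help to first see which of them are actually loaded on the affected machine. A rough sketch: `loaded_from_list` and the `MODULES_FILE` override are hypothetical names introduced here; the module list is copied from DEFAULT_S2D_UNLOAD above, and `/proc/modules` is the kernel's list of loaded modules.

```shell
#!/bin/sh
# Module list copied from DEFAULT_S2D_UNLOAD as quoted in this comment.
S2D_UNLOAD="usb_storage sbp2 ohci_hcd uhci_hcd stir4200 ohci1394 ipw2200 rt2500 prism54 ath_pci r8169 lt_modem Intel536 Intel537"
# MODULES_FILE is an illustrative override so the check can run on a saved copy.
MODULES_FILE="${MODULES_FILE:-/proc/modules}"

# Print each module from the list that shows up as loaded in MODULES_FILE.
loaded_from_list() {
    for mod in $S2D_UNLOAD; do
        # The first field of every /proc/modules line is the module name.
        if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$MODULES_FILE"; then
            echo "$mod"
        fi
    done
}
```

If `loaded_from_list` prints nothing, module unloading is unlikely to be the difference between "powersave -U" and the plain sysfs write.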
Comment 23 Pavel Machek 2006-03-14 15:06:34 UTC
Carl, could you try Stefan's suggestions?
Comment 24 Carl-Daniel Hailfinger 2006-04-06 00:33:47 UTC
Even if no modules are unloaded, no services stopped and no filesystems unmounted, it still crashes reliably when suspending twice with powersaved, and works perfectly when using "echo disk >/sys/power/state".

The kernel in beta9 reacts differently to the corruption: It will reliably reboot during the second suspend with powersaved.
Comment 25 Carl-Daniel Hailfinger 2006-04-06 01:22:04 UTC
SOLVED!

Setting the shutdown mode to "platform" causes memory corruption.
Setting the shutdown mode to "shutdown" works perfectly dozens of times.

How can you explain that?
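
For comparison outside powersave, the two modes can be selected by hand before suspending. A minimal sketch, assuming the kernel exposes the mode selector as `/sys/power/disk`; `set_disk_mode` and `POWER_DIR` are illustrative names, and only the two modes named above are accepted.

```shell
#!/bin/sh
# POWER_DIR is an illustrative override so the function can be tried on a scratch directory.
POWER_DIR="${POWER_DIR:-/sys/power}"

# Select the suspend-to-disk shutdown mode; rejects anything but the two modes compared here.
set_disk_mode() {
    case "$1" in
        platform|shutdown)
            echo "$1" > "$POWER_DIR/disk" ;;
        *)
            echo "usage: set_disk_mode platform|shutdown" >&2
            return 2 ;;
    esac
}
```

With this, `set_disk_mode shutdown && echo disk > /sys/power/state` would correspond to the working case, and `set_disk_mode platform` followed by the same write to the failing one.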
Comment 26 Forgotten User ZhJd0F0L3x 2006-04-06 06:38:39 UTC
probably your BIOS is just broken.
You might want to bring this up on the acpi-devel list.
Comment 27 Pavel Machek 2006-04-12 20:36:47 UTC
...or probably report to kernel.org bugzilla.

Sorry, I don't think I can debug S4 (==platform mode) problems.
Comment 28 Carl-Daniel Hailfinger 2006-04-12 23:08:11 UTC
Pavel, if you can't debug S4 issues, why is it then the default?
Comment 29 Forgotten User ZhJd0F0L3x 2006-04-12 23:57:23 UTC
Well, it is a trade-off. Some machines break with platform - most of them being desktops with broken BIOSen.
Others break with "shutdown", in subtle ways, e.g. battery status not updating after resume. Most of them being notebooks that assume that the OS does the correct thing. And platform is the correct thing to do.
Of course there could be bugs in the "platform" implementation, this is where you could get help from ACPI guys at bugzilla.kernel.org.

BTW: the kernel has shutdown as default, but powersave sets platform - because it usually is the right thing to do.

I got as many bug reports from machines not working correctly with shutdown as I get now from machines not working correctly with platform.