Bugzilla – Bug 148399
second swsuspend in a row freezes with memory corruption
Last modified: 2006-04-12 23:57:23 UTC
Machine is a PIII-1000, kernel is untainted. The first suspend-to-disk works just fine (besides the warning mentioned in bug 145880). The second suspend-to-disk will freeze/hang with a BUG. Hand-written log follows:

Stopping tasks: ==========================================|
Shrinking memory... done (0 pages freed)
pnp: Device 00:0a disabled.
pnp: Device 00:09 disabled.
ACPI-0201: *** Warning: Device is not power manageable
swsusp: Need to copy 52361 pages
swsusp: critical section/: done (52361 pages copied)
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
swsusp: Restoring Highmem
Debug: sleeping function called from invalid context at mm/slab.c:2515
in_atomic():0, irqs_disabled():1
 kmem_cache_alloc+0x1b/0x79
 acpi_os_acquire_object+0xb/0x36
 acpi_ut_allocate_object_desc_dbg+0x13/0x49
 acpi_ut_create_internal_object_dbg+0x15/0x68
 acpi_rs_set_srs_method_data+0x3d/0xb7
 cache_alloc_debugcheck_after+0xb8/0xea
 acpi_pci_link_set+0x40/0x1c0
 acpi_pci_link_set+0x106/0x1c0
 irqrouter_resume+0x55/0x73
 __sysdev_resume+0x11/0x53
 sysdev_resume+0x16/0x47
 device_power_up+0x5/0xa
 swsusp_suspend+0x6b/0x85
 pm_suspend_disk+0x44/0xd1
 enter_state+0x50/0x160
 state_store+0x88/0x95
 state_store+0x0/0x95
 subsys_attr_store+0x1e/0x22
 sysfs_write_file+0x9b/0xc1
 sysfs_write_file+0x0/0xc1
 vfs_write+0xa1/0x146
 sys_write+0x3c/0x63
 syscall_call+0x7/0xb
ACPI-0201: *** Warning: Device is not power manageable
PCI: setting latency timer of device 0000:00:01.0 to 64
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LINKD] -> GSI 5 (level, low) -> IRQ 5
pnp: Device 00:09 activated.
pnp: Device 00:0a activated.
pnp: Device 00:0b does not supported activation.
pnp: Failed to activate device 00:0c.
BUG: soft lockup detected on CPU#0!
Pid: 0, comm: swapper
EIP is at rtl_8169_interrupt+0x36/0x311 [r8169]
 handle_IRQ_event
 __do_IRQ
 do_IRQ
 common_interrupt
 unix_shutdown
 __do_softirq
 do_softirq
 do_IRQ
 common_interrupt
 acpi_processor_idle
 cpu_idle
 start_kernel
Can you try without rtl8169? What happens if you simply disable the soft lockup watchdog?
r8169 is probably still totally broken wrt. suspend, which means: maybe even try it without r8169 ever loaded. IIRC it even borked the system after unloading it.
i will also install 10.1 on my r8169 toughbook and try it there.
Perhaps we can just blacklist r8169...
i blacklisted it, but iirc (there is an old bug with kkeil and pavel in cc) the r8169 is seriously broken, even unloading does not help. it is broken after resume. But i have to retry this, unfortunately i don't have the machine right now. Holger, please merge the commit into Code10 packages. Thanks.
I tried it on the toughbook:
- suspend to disk looked okay, even multiple suspends
- suspend to RAM broke the r8169, so the second suspend failed with "NetworkManager not stopped" => all "ifconfig", "ip", whatever that wanted to access the network hung in state D.
Unloading r8169 fixed suspend to RAM => we unload it before suspend to RAM and disk, just to be sure. Somebody should fix r8169. I stared at the code for quite some time, but it did not get better ;-)
Can you try if unloading r8169 fixes it for you?
Unloading r8169 before suspend is done automatically with latest powersave. The second swsuspend in a row now causes slab corruption. Messages follow:

Slab corruption: start=ccce65bc, len=32
Redzone: 0x0/0x0. Last user: [<00000000>]
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
Prev obj: start=ccce6576, len=32
Redzone: 0x0/0x0. Last user: [<00000000>]
000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
slab error in cache_alloc_debugcheck_after(): cache 'size-32': double free, or memory outside object was overwritten
BUG: spinlock recursion on CPU#0, myecho/14363
lock: c02d7d40, .magic: dead4ead, .owner:myecho/14363, .owner_cpu: 0
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
If I delete r8169.ko from /lib/modules/, I can progress further. Now the second resume hangs. Messages:

Stopping tasks: ===========================================================|
Shrinking memory... done (0 pages freed)
Loading image metadata... done (50 pages loaded)
Loading image data pages (50449 pages) ... done
pnp: Device 00:0b disabled
pnp: Device 00:0a disabled

And after that it hangs.
Bad bad bad. I get reproducible memory corruption during the second suspend cycle. Sometimes it triggers an oops/bug/slabcorruption during suspend, sometimes during resume. The first cycle is always fine. Current kernel is 2.6.16-rc4-4-default. This is a regression from the kernels in SUSE Linux 10.0.
Pavel: Any ideas what I can do to find the cause for this memory corruption bug?
I'd try to suspend from single-user mode (no modules), then try to find out if it is a module causing the corruption. I think it is, because I believe I can suspend/resume as many times as I want to.
Hm. I now upgraded to 2.6.16-rc5-git2 and it still locks up even with fewer modules than before. Will try single-user mode next.
My findings so far:
- Suspend with graphical frontend from inside kde, r8169 loaded: lockup during second suspend.
- Suspend with graphical frontend from inside kde, r8169 never loaded: lockup during second resume with spinlock lockup (myecho/...).
- Suspend with "powersave -U" from runlevel 3, r8169 auto-unloaded: lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/4881".
- Suspend with "powersave -U" from runlevel 3, r8169 never loaded: slab corruption and spinlock lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/3729".
The "myecho" appearing in most backtraces seems to be part of powersave.rpm.
Pavel: Did you try suspending with SUSE or vanilla kernels, and with "powersave -U" or with "echo disk>/sys/power/state"? Is your base system a 10.1 beta?
Oh, and the memory corruption is present for all configurations tested so far. I just have to wait a bit for the "spinlock lockup" messages to repeat and suddenly printk will print garbage on screen, NULL pointers are not NULL anymore etc., but after some time the printk messages will look normal again.
Created attachment 71276 [details] screen shot of memory corruption
I'm stuck. Suspend and resume from single user mode works just fine even with all modules loaded. What should I do now? It seems that the hang only happens in runlevel 3/5.
OK, this is getting silly. The bug is independent of runlevel, it just depends on the powersave utility. If I use "powersave -U", I get kernel hangs due to memory corruption. Latest corruption sign was "swap_free: Unused swap offset entry..." in an endless loop. If I use "echo disk >/sys/power/state", everything works just fine.
seife: please un-blacklist r8169 for suspend-to-disk. It works fine and was just a victim of "powersave -U". What the hell is "powersave -U" doing to cause memory corruption?
Powersave is doing nothing special. It uses "myecho" instead of echo, but this is the most trivial echo replacement you can think of, see http://forge.novell.com/modules/xfmod/svn/svnbrowse.php?uri=filedetails.php%3Frepname%3Dpowersave%26path%3D%252Ftrunk%252Fpowersave%252Fhelpertools%252Fmyecho.c%26rev%3D0%26sc%3D0 And if userspace can cause memory corruption, it is a kernel bug ;-) Does the r8169 actually work after suspend/resume or does it just not crash?
Yes, r8169 works after resume. Will retry the experiment with myecho to see if it is to blame.
r8169 works fine after resume, I downloaded a few 700 MB .iso files with it and had no corruption. There were a few suspend/resume bugs in r8169, but IIRC these were fixed upstream in 2.6.16-rc1 or so. myecho works OK. So what else is different when running "powersave -U"?
We are stopping the following services:
DEFAULT_S2D_RESTART="slmodemd irda"
and unloading the following modules:
DEFAULT_S2D_UNLOAD="usb_storage sbp2 ohci_hcd uhci_hcd stir4200 ohci1394 ipw2200 rt2500 prism54 ath_pci r8169 lt_modem Intel536 Intel537"
So maybe the unloading of one of these modules is to blame? Other than that, there is not much special in powersaved wrt. suspend. I _think_ (but am not sure) that we set cpufreq to maximum before suspend. We prepare GRUB to not show a menu but directly boot the system without delay, but this should not cause this.
You can prevent unloading of _any_ module before suspend in powersaved by setting
UNLOAD_MODULES_BEFORE_SUSPEND2DISK="NONE"
in /etc/powersave/sleep. Look in /var/log/suspend2disk.log for anything else that is done during suspend.
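To narrow down whether one of the listed modules matters at all, it may help to check which of them are actually loaded before a suspend. A minimal sketch, assuming a POSIX shell; the function name loaded_candidates and the MODULES_FILE override are made up for illustration and are not part of powersave. The module list is copied from DEFAULT_S2D_UNLOAD above.

```shell
#!/bin/sh
# Hypothetical helper, not part of powersave.
# Module list copied from DEFAULT_S2D_UNLOAD as quoted above.
CANDIDATES="usb_storage sbp2 ohci_hcd uhci_hcd stir4200 ohci1394 ipw2200 rt2500 prism54 ath_pci r8169 lt_modem Intel536 Intel537"

# Print every candidate module that appears in the kernel's module list.
# MODULES_FILE is an assumption added so the function can be pointed at a
# saved copy of /proc/modules; it defaults to the real file.
loaded_candidates() {
    modules_file="${MODULES_FILE:-/proc/modules}"
    for mod in $CANDIDATES; do
        # The first whitespace-separated field of each line is the module name.
        if awk -v m="$mod" '$1 == m { found = 1 } END { exit !found }' "$modules_file"; then
            echo "$mod"
        fi
    done
}
```

Running loaded_candidates once before the first suspend and once before the second shows whether the set of loaded modules changed between the two cycles.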
Carl, could you try Stefan's suggestions?
Even with no modules unloaded, no services stopped and no filesystems unmounted, it still crashes reliably when suspending twice with powersaved, and works perfectly when using "echo disk >/sys/power/state". The kernel in beta9 reacts differently to the corruption: It will reliably reboot during the second suspend with powersaved.
SOLVED! Setting the shutdown mode to "platform" causes memory corruption. Setting the shutdown mode to "shutdown" works perfectly dozens of times. How can you explain that?
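For anyone who wants to repeat the comparison without powersave in the loop: as far as I know, kernels of this era select the shutdown mode through /sys/power/disk, and the suspend itself is triggered through /sys/power/state as before. A minimal sketch under that assumption; the function name suspend_with_mode and the SYSFS_ROOT override are made up for illustration, and on a real system this has to run as root.

```shell
#!/bin/sh
# Hypothetical helper, not part of powersave.
# SYSFS_ROOT is an assumption added so the function can be tried against a
# scratch directory; it defaults to the real /sys.
suspend_with_mode() {
    mode="$1"
    sysfs="${SYSFS_ROOT:-/sys}"
    case "$mode" in
        platform|shutdown) ;;
        *) echo "usage: suspend_with_mode platform|shutdown" >&2; return 2 ;;
    esac
    # Select how the machine is powered off once the image is written.
    echo "$mode" > "$sysfs/power/disk" || return 1
    # Trigger suspend-to-disk; on a real sysfs this returns after resume.
    echo disk > "$sysfs/power/state" || return 1
}
```

Calling suspend_with_mode shutdown twice in a row, then suspend_with_mode platform twice in a row, should show the same difference as described above if the mode really is the only variable.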
Probably your BIOS is just broken. You might want to bring this up on the acpi-devel list.
...or probably report to kernel.org bugzilla. Sorry, I don't think I can debug S4 (==platform mode) problems.
Pavel, if you can't debug S4 issues, why is it then the default?
Well, it is a trade-off. Some machines break with platform, most of them being desktops with broken BIOSen. Others break with "shutdown" in subtle ways, e.g. battery status not updating after resume; most of those are notebooks that assume that the OS does the correct thing. And platform is the correct thing to do. Of course there could be bugs in the "platform" implementation; this is where you could get help from the ACPI guys at bugzilla.kernel.org. BTW: the kernel has shutdown as default, but powersave sets platform, because it usually is the right thing to do. I got as many bug reports from machines not working correctly with shutdown as I get now from machines not working correctly with platform.