Bugzilla – Full Text Bug Listing
| Summary: | second swsuspend in a row freezes with memory corruption | | |
|---|---|---|---|
| Product: | [openSUSE] SUSE Linux 10.1 | Reporter: | Carl-Daniel Hailfinger <kernel01> |
| Component: | Kernel | Assignee: | Pavel Machek <pavel> |
| Status: | RESOLVED WONTFIX | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | | |
| Priority: | P5 - None | | |
| Version: | Beta 6 | | |
| Target Milestone: | --- | | |
| Hardware: | i686 | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Found By: | Development | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | screen shot of memory corruption | | |

Description
Carl-Daniel Hailfinger
2006-02-06 14:59:46 UTC
Can you try without rtl8169? What happens if you simply disable the soft lockup watchdog?

r8169 is probably still totally broken wrt. suspend, which means: maybe even try it without r8169 ever loaded. IIRC it even borked the system after unloading it. I will also install 10.1 on my r8169 toughbook and try it there.

Perhaps we can just blacklist r8169...

I blacklisted it, but IIRC (there is an old bug with kkeil and pavel in cc) the r8169 is seriously broken, even unloading does not help. It is broken after resume. But I have to retry this, unfortunately I don't have the machine right now.

Holger, please merge the commit into Code10 packages. Thanks.

I tried it on the toughbook:
- suspend to disk looked okay, even multiple suspends
- suspend to RAM broke the r8169, so the second suspend failed with "NetworkManager not stopped"
=> all "ifconfig", "ip", or anything else that wanted to access the network hung in state D.
Unloading r8169 fixed suspend to RAM
=> we unload it before suspend to RAM and disk, just to be sure.
Somebody should fix r8169. I stared at the code for quite some time, but it did not get better ;-)

Can you try whether unloading r8169 fixes it for you?

Unloading r8169 before resume is done automatically with latest powersave.

The second swsuspend in a row now causes slab corruption. Messages follow:

    Slab corruption: start=ccce65bc, len=32
    Redzone: 0x0/0x0.
    Last user: [<00000000>]
    000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    Prev obj: start=ccce6576, len=32
    Redzone: 0x0/0x0.
    Last user: [<00000000>]
    000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    slab error in cache_alloc_debugcheck_after(): cache 'size-32': double free, or memory outside object was overwritten
    BUG: spinlock recursion on CPU#0, myecho/14363
    lock: c02d7d40, .magic: dead4ead, .owner:myecho/14363, .owner_cpu: 0
    BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
    BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
    BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40
    BUG: spinlock lockup on CPU#0, myecho/14363, c02d7d40

If I delete r8169.ko from /lib/modules/, I can progress further. Now the second resume hangs. Messages:

    Stopping tasks: ===========================================================|
    Shrinking memory... done (0 pages freed)
    Loading image metadata... done (50 pages loaded)
    Loading image data pages (50449 pages) ... done
    pnp: Device 00:0b disabled
    pnp: Device 00:0a disabled

And after that it hangs.

Bad bad bad. I get reproducible memory corruption during the second suspend cycle. Sometimes it triggers an oops/bug/slab corruption during suspend, sometimes during resume. The first cycle is always fine. Current kernel is 2.6.16-rc4-4-default. This is a regression from the kernels in SUSE Linux 10.0.

Pavel: Any ideas what I can do to find the cause for this memory corruption bug?

I'd try to suspend from single-user mode (no modules), then try to find out if it is a module causing the corruption. I think it is, because I believe I can suspend/resume as many times as I want to.

Hm. I now upgraded to 2.6.16-rc5-git2 and it still locks up even with fewer modules than before. Will try single-user mode next.

My findings so far:
- Suspend with graphical frontend from inside kde, r8169 loaded: lockup during second suspend.
- Suspend with graphical frontend from inside kde, r8169 never loaded: lockup during second resume with spinlock lockup (myecho/...).
- Suspend with "powersave -U" from runlevel 3, r8169 auto-unloaded: lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/4881".
- Suspend with "powersave -U" from runlevel 3, r8169 never loaded: slab corruption and spinlock lockup during second resume with message "spinlock lockup detected on CPU#0, myecho/3729".

The "myecho" appearing in most backtraces seems to be part of powersave.rpm.

Pavel: Did you try suspending with SUSE or vanilla kernels and with "powersave -U" or with "echo disk>/sys/power/state"? Is your base system a 10.1 beta?

Oh, and the memory corruption is present for all configurations tested so far. I just have to wait a bit for the "spinlock lockup" messages to repeat and suddenly printk will print garbage on screen, NULL pointers are not NULL anymore etc., but after some time the printk messages will look normal again.

Created attachment 71276
screen shot of memory corruption
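
A small helper can make it easier to tell whether a given suspend cycle left any of the corruption signatures quoted above in the kernel log. The following is a minimal sketch, not part of powersave or the kernel: the function name `count_corruption_signs` is hypothetical, and the pattern list only covers messages that appear in this report.

```shell
#!/bin/sh
# Hypothetical helper (name and pattern list are assumptions for illustration).
# Reads kernel log text on stdin and prints how many lines match one of the
# corruption signatures quoted in this bug report.
count_corruption_signs() {
    # grep -c prints the count; it exits non-zero when the count is 0,
    # so "|| true" keeps the function's exit status at success.
    grep -c -E 'Slab corruption|slab error in|BUG: spinlock (recursion|lockup)' || true
}
```

For example, running `dmesg | count_corruption_signs` after each resume prints the number of matching lines; a non-zero count after the second cycle would match the behaviour described here.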
I'm stuck. Suspend and resume from single user mode works just fine even with all modules loaded. What should I do now? It seems that the hang only happens in runlevel 3/5.

OK, this is getting silly. The bug is independent of runlevel, it just depends on the powersave utility. If I use "powersave -U", I get kernel hangs due to memory corruption. Latest corruption sign was "swap_free: Unused swap offset entry..." in an endless loop. If I use "echo disk >/sys/power/state", everything works just fine. seife: please un-blacklist r8169 for suspend-to-disk. It works fine and was just a victim of "powersave -U". What the hell is "powersave -U" doing to cause memory corruption?

Powersave is doing nothing special. It uses "myecho" instead of echo, but this is the most trivial echo replacement you can think of, see http://forge.novell.com/modules/xfmod/svn/svnbrowse.php?uri=filedetails.php%3Frepname%3Dpowersave%26path%3D%252Ftrunk%252Fpowersave%252Fhelpertools%252Fmyecho.c%26rev%3D0%26sc%3D0
And if userspace can cause memory corruption, it is a kernel bug ;-) Does the r8169 actually work after suspend/resume or does it just not crash?

Yes, r8169 works after resume. Will retry the experiment with myecho to see if it is to blame.

r8169 works fine after resume, I downloaded a few 700 MB .iso files with it and had no corruption. There were a few suspend/resume bugs in r8169, but IIRC these were fixed upstream in 2.6.16-rc1 or so.

myecho works OK. So what else is different when running "powersave -U"?

We are stopping the following services:
DEFAULT_S2D_RESTART="slmodemd irda"
and unloading the following modules:
DEFAULT_S2D_UNLOAD="usb_storage sbp2 ohci_hcd uhci_hcd stir4200 ohci1394 ipw2200 rt2500 prism54 ath_pci r8169 lt_modem Intel536 Intel537"
So maybe the unloading of one of these modules is to blame? Other than that, there is not much special in powersaved wrt. suspend. I _think_ (but am not sure) that we set cpufreq to maximum before suspend. We prepare GRUB to not show a menu but directly boot the system without delay, but this should not cause this. You can prevent unloading of _any_ module before suspend in powersaved by setting UNLOAD_MODULES_BEFORE_SUSPEND2DISK="NONE" in /etc/powersave/sleep. Look in /var/log/suspend2disk.log for anything else that is done during suspend.

Carl, could you try Stefan's suggestions?

Even with no modules unloaded, no services stopped and no filesystems unmounted, it still crashes reliably when suspending twice with powersaved, and works perfectly when using "echo disk >/sys/power/state".

The kernel in beta9 reacts differently to the corruption: it will reliably reboot during the second suspend with powersaved.

SOLVED! Setting the shutdown mode to "platform" causes memory corruption. Setting the shutdown mode to "shutdown" works perfectly dozens of times. How can you explain that?

Probably your BIOS is just broken. You might want to bring this up on the acpi-devel list.

...or probably report to kernel.org bugzilla. Sorry, I don't think I can debug S4 (==platform mode) problems.

Pavel, if you can't debug S4 issues, why is it then the default?

Well, it is a trade-off. Some machines break with platform, most of them being desktops with broken BIOSen. Others break with "shutdown" in subtle ways, e.g. battery status not updating after resume, most of them being notebooks that assume that the OS does the correct thing. And platform is the correct thing to do. Of course there could be bugs in the "platform" implementation; this is where you could get help from the ACPI guys at bugzilla.kernel.org. BTW: the kernel has shutdown as default, but powersave sets platform, because it usually is the right thing to do. I got as many bug reports from machines not working correctly with shutdown as I get now from machines not working correctly with platform.
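
The difference that resolved this report is the suspend-to-disk shutdown mode, which the kernel exposes through `/sys/power/disk` next to `/sys/power/state`. The sketch below shows one way to pick the mode by hand before hibernating, bypassing powersave. It is an illustration under assumptions, not what powersave does internally: the function name `hibernate_with_mode` and the `POWER_DIR` override are hypothetical, and only the two modes discussed in this thread are accepted.

```shell
#!/bin/sh
# POWER_DIR is an assumption for illustration: it defaults to the real sysfs
# directory but can point at a scratch directory for a dry run.
POWER_DIR="${POWER_DIR:-/sys/power}"

# Hypothetical helper: select the suspend-to-disk mode, then start hibernation.
# Accepts only the two modes compared in this report.
hibernate_with_mode() {
    mode="$1"
    case "$mode" in
        shutdown|platform) ;;
        *) echo "unsupported mode: $mode" >&2; return 1 ;;
    esac
    # Choose how the machine is powered down after the image is written.
    echo "$mode" > "$POWER_DIR/disk" || return 1
    # Trigger suspend-to-disk (needs root when POWER_DIR is /sys/power).
    echo disk > "$POWER_DIR/state"
}
```

Running `hibernate_with_mode shutdown` as root twice in a row would exercise the configuration reported as stable above, while `hibernate_with_mode platform` would exercise the one that corrupted memory on this machine.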