Bug 104647

Summary: Suspend to Disk broken for SATA hard drives
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Alexander Schaefer <aschaefer>
Component: Mobile Devices
Assignee: Jens Axboe <axboe>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Major
Priority: P5 - None
CC: aj, behlert, jmayer, kernel01, trenn
Version: Beta 1
Target Milestone: ---
Hardware: Other
OS: All
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Acer dmesg
Acer lsmod
Acer lspic -vv
dmesg FSC
lsmod FSC
lspci -vv FSC
My libata suspend patch
dmesg output

Description Alexander Schaefer 2005-08-15 09:15:58 UTC
When trying to do a suspend to disk from a KDE session, the system freezes
before reaching suspend state, displaying the following messages:

Writing data to swap (40876 pages)...0%
ata1: error occurred, port reset
ata1: status=0x01 { Error }
ata1: called with no error (01)!

sda: Current: sense key: Medium Error
    Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 90811040

Doing a suspend to disk from a minimum environment (boot option init=/bin/bash)
causes the system to suspend as expected, but after waking up, no HDD access is
possible. Error messages:

ata1: error occurred, port reset
ata1: status=0x01 { Error }
ata1: called with no error (01)! <these three lines occur several times.>

SCSI error: <0 0 0 0> return code = 0x8000002
sda: Current: sense key: Medium Error
    Additional sense: Unrecovered read error - auto reallocate failed
end_request: I/O error, dev sda, sector 116893847
ReiserFS: sda4: warning: vs-13070: reiserfs_read_locked_inode: i/o failure
occurred trying to find stat data of [5435 24741 0x0 SD]
 
Suspend to RAM also fails on this system, (suspending works, but system does not
wake up any more.) I am sure that this does not (at least not only) depend on a
display problem, so this has possibly the same cause?

Test system: FSC AMILO Pro 8010, Intel Centrino i915GM chipset, Pentium M 1.73
GHz, 60 GB SATA HDD
Comment 1 Alexander Schaefer 2005-08-26 09:51:57 UTC
Possibly this behavior is caused by the same circumstances as the suspend to 
RAM problem described in Bug #105800 (same test system). 
Comment 2 Alexander Schaefer 2005-08-30 09:22:28 UTC
Still not working in beta3, raised severity (please adjust back if not appropriate)
Comment 3 Stefan Behlert 2005-08-30 09:55:15 UTC
This is very bad indeed.
 
AJ: More and more new Laptops have SATA, and the last reviews I saw that 
tested Distros had at least 40% of the test machines with SATA :( 
I find it strange that we have one Acer where STD works with SATA and the FSC 
where it doesn't. 
Comment 4 Andreas Jaeger 2005-08-30 11:46:27 UTC
Stefan and Jens, please work together on this.
Comment 5 Jens Axboe 2005-08-30 11:48:07 UTC
Andreas, is it using ata_piix or ahci as the disk driver? Please post full dmesg
from a booted system, thanks.
Comment 6 Jens Axboe 2005-08-30 11:48:48 UTC
Stefan, what are the hardware details of the acer STD and FSC?
Comment 7 Stefan Behlert 2005-08-30 11:56:06 UTC
The FSC is the one Andreas mentioned. I'll add hwinfo, lspci and lsmod from 
the Acer in a few minutes (the machine is currently resuming) 
Comment 8 Stefan Behlert 2005-08-30 12:01:16 UTC
Created attachment 48149 [details]
Acer dmesg
Comment 9 Stefan Behlert 2005-08-30 12:01:43 UTC
Created attachment 48150 [details]
Acer lsmod
Comment 10 Stefan Behlert 2005-08-30 12:02:09 UTC
Created attachment 48151 [details]
Acer lspic -vv
Comment 11 Stefan Behlert 2005-08-30 12:03:53 UTC
If you need anything else for the Acer please let me know. Suspend is working 
there except the 'pause' described in bug 113335 before writing to disk.
Comment 12 Alexander Schaefer 2005-08-30 12:20:40 UTC
Created attachment 48153 [details]
dmesg FSC
Comment 13 Alexander Schaefer 2005-08-30 12:21:57 UTC
Created attachment 48154 [details]
lsmod FSC
Comment 14 Alexander Schaefer 2005-08-30 12:23:19 UTC
Created attachment 48155 [details]
lspci -vv FSC
Comment 15 Alexander Schaefer 2005-08-30 12:24:54 UTC
Hope this is enough info about the FSC for the moment... If not, please tell 
me, I'll add everything necessary. 
Comment 16 Jens Axboe 2005-08-30 12:28:51 UTC
Interesting, so the FSC is using ahci while the Acer is using ata_piix. Is it
possible to configure the FSC to not use ahci in the bios, for testing
purposes? Check for any ide options in there.
Comment 17 Alexander Schaefer 2005-08-30 12:40:33 UTC
"AHCI configuration" can be disabled in the BIOS setup, but this does not 
change the behavior of the machine, it still displays the same errors when 
trying to suspend. 
Comment 18 Jens Axboe 2005-08-30 12:41:42 UTC
Have you verified that it actually loads ata_piix instead when you disable ahci?
Comment 19 Alexander Schaefer 2005-08-30 12:56:39 UTC
Regardless whether AHCI is disabled, it loads both modules (possibly because 
of the DVD drive??), but I don't know how to determine which one is actually 
used for the hard drive.  
Comment 20 Jens Axboe 2005-08-30 13:03:14 UTC
The dmesg from comment #12 has:

ahci(0000:00:1f.2) AHCI 0001.0000 32 slots 4 ports 1.5 Gbps 0x5 impl SATA mode
ahci(0000:00:1f.2) flags: 64bit ncq pm led slum part 

which means it actually detected and used ahci. But just grep for the scsi0: xx
line, if xx is ata_piix it is using ata_piix, vice versa for ahci.
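
The check described above can be scripted. A minimal sketch in Python, assuming the host lines look like "scsi0 : ahci" (the exact spacing is a guess, so the pattern tolerates optional whitespace) and that the dmesg text carries no timestamp prefix. The function name scsi_host_drivers and the sample text are hypothetical, made up for illustration:

```python
import re

# Matches lines such as "scsi0 : ahci". The spacing around the colon is an
# assumption, so whitespace is treated as optional.
SCSI_HOST_RE = re.compile(r"^scsi(\d+)\s*:\s*(\S+)", re.MULTILINE)


def scsi_host_drivers(dmesg_text):
    """Return {host_number: driver_name} for every 'scsiN : driver' line."""
    return {int(num): drv for num, drv in SCSI_HOST_RE.findall(dmesg_text)}


# Hypothetical dmesg excerpt, for illustration only.
SAMPLE = """\
ahci(0000:00:1f.2) AHCI 0001.0000 32 slots 4 ports 1.5 Gbps 0x5 impl SATA mode
scsi0 : ahci
scsi1 : ata_piix
"""

if __name__ == "__main__":
    for host, driver in sorted(scsi_host_drivers(SAMPLE).items()):
        print(f"scsi{host}: {driver}")
```

On a live system the text would come from the output of dmesg instead of SAMPLE. The driver listed for the host that owns the hard disk is the one actually in use, regardless of which modules are loaded.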
Comment 21 Alexander Schaefer 2005-08-30 13:17:36 UTC
Grepping for scsi shows that it still uses ahci. Can I do something about 
that? 
Comment 22 Jens Axboe 2005-08-30 13:27:41 UTC
Hmm, and you disabled everything that looks like ahci in the bios? You can try
and make sure that ata_piix is loaded first (or remove ahci and ata_piix, then
load ata_piix again), I forget what the order is by default.
Comment 23 Alexander Schaefer 2005-08-30 13:52:41 UTC
In bios, "AHCI configuration" (the one I changed) was the only option that 
seems to have something to do with AHCI. I removed ahci 
from /etc/sysconfig/kernel now, and a reboot with the new initrd showed the 
following: 
 
-booting up worked without any obvious problems 
-the above described grep in dmesg shows that now ata_piix is definitely used. 
-suspend to disk still does not work, but it generates different error 
messages: 
 
<snip> 
 
Writing data to swap (26976 pages)...   0%<3>ata1: command 0x35 timeout, stat 
0x50 host_stat 0x24 
ata1: command 0x35 timeout, stat 0x50 host_stat 0x24 <repeating continuously 
every few seconds> 
 
</snip> 
Comment 24 Jens Axboe 2005-08-30 15:07:25 UTC
Ok, so it breaks with both ata_piix and ahci on that hardware. It could be that
we are not reenabling the hardware properly, we really do nothing special except
the generic way. Ehm looking at the kernel source branch, the libata suspend
patch isn't even there anymore. Very strange, I will check up on that, but it
definitely could explain your issues!
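
For reading the timeout messages quoted in comment 23: command 0x35 is the ATA WRITE DMA EXT opcode, which fits a failure while writing the image to swap, and "stat" is presumably the raw ATA status register. A small sketch, under that assumption, that names the bits set in such a value. The decode_ata_status helper is hypothetical and not part of libata:

```python
# ATA status register bits, highest bit first.
ATA_STATUS_BITS = [
    (0x80, "BSY"),   # device busy
    (0x40, "DRDY"),  # device ready
    (0x20, "DF"),    # device fault
    (0x10, "DSC"),   # device seek complete
    (0x08, "DRQ"),   # data request
    (0x04, "CORR"),  # corrected data (obsolete in later ATA revisions)
    (0x02, "IDX"),   # index (obsolete in later ATA revisions)
    (0x01, "ERR"),   # error
]


def decode_ata_status(value):
    """Return the names of all status bits set in `value`, highest bit first."""
    return [name for mask, name in ATA_STATUS_BITS if value & mask]


if __name__ == "__main__":
    for stat in (0x50, 0x01):
        print(f"stat 0x{stat:02x}: {' '.join(decode_ata_status(stat))}")
```

Read this way, 0x50 is DRDY plus DSC, meaning the drive reports itself ready with no error bit set, while the 0x01 from the original description is the ERR bit, matching the "{ Error }" annotation in that message.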
Comment 25 Jens Axboe 2005-08-30 15:16:50 UTC
Created attachment 48189 [details]
My libata suspend patch

I'm committing this to CVS now for easier testing, I'll let you know when a
KOTD kernel with it is available.
Comment 26 Jens Axboe 2005-08-30 17:52:50 UTC
kernel-default-2.6.13-20050830151713.i586.rpm from KOTD has it, please install
that kernel ASAP and retest!
Comment 27 Jens Axboe 2005-08-31 08:18:46 UTC
Alexander, have you had a chance to test that kernel yet? It's important that we
move a little fast on this bug, so if you could test I would much appreciate it.
Comment 28 Alexander Schaefer 2005-08-31 10:26:09 UTC
I tried the KOTD now in all variations. The results are as follows:

- When I turn off AHCI in bios and do not load the ahci module, suspending and
resuming both works fine now.

- Situation with AHCI has significantly improved: The system suspends perfectly,
but HDD access is not possible after resuming. This means that the system still
cannot resume at all when running an X server (KDE) because the missing hard
disk leads to immediate crash.

Do you need some further information? (The error messages of the KDE problem
passed too fast to read and I was not able to stop them with <break> or
something similar, but I believe they were similar to the ones I mentioned in
the initial bug creation.)

Sorry for the delay, I'm still quite new here at SuSE and I had to ask for
advice several times... Hope this is no big deal...
Comment 29 Jens Axboe 2005-08-31 10:37:06 UTC
No worries, feel free to ask here as well if you have questions/problems.

So at least ata_piix works now. It actually cannot have worked reliably on the
Acer before; it's only by luck that sata suspend/resume works without the
synchronization stuff I added to libata.

Does ahci work with suspend-to-disk (not sure if you are testing suspend-to-ram
or to disk)?

I'm not sure we can do much about ahci resume yet, I think we need more hardware
details. So it's possible that we can only support suspend on ata_piix for 10.0,
which is not a problem in my opinion since all the notebooks out there should be
able to run in piix mode instead of ahci.
Comment 30 Jens Axboe 2005-08-31 10:40:11 UTC
Wait, I think I see that ahci is missing a hunk. I'll update the patch and ask
you to retest a KOTD soon!
Comment 31 Jens Axboe 2005-08-31 10:41:45 UTC
Ok, committed, I'll watch for a KOTD for you to test. I'll add suspend/resume
support to the other libata drivers as well.
Comment 32 Jens Axboe 2005-08-31 12:30:10 UTC
It's ready, please test kernel-default-2.6.13-20050831104926.i586.rpm right away!
Comment 33 Alexander Schaefer 2005-08-31 13:23:08 UTC
hmmm, I believe there is something broken in that kernel, I can't boot the
system any more regardless whether I have AHCI activated in BIOS or whether AHCI
module is loaded. :-( I only see a black screen just after GRUB. Is there any
information about this that I could provide to you? Otherwise I would remove my
installation and set up beta4 instead of recovering my old system.
Comment 34 Jens Axboe 2005-08-31 13:26:47 UTC
How annoying, perhaps some other patch screwed it up in the meantime. Please try
adding 'apic' as a boot parameter.
Comment 35 Alexander Schaefer 2005-08-31 14:33:18 UTC
Thanks, apic made my system boot again :-)

But the ahci behavior has not improved: suspending (to disk) works fine, but
resuming still doesn't, the system hangs after printing out the following (just
as with the last kernel):

<snip>

...

...

PCI: Setting latency timer of device 0000:00:1f.2 to 64
ACPI: PCI Interrupt 0000:06:09.0[A] -> GSI 16 (level, low) -> IRQ 169
ACPI: PCI Interrupt 0000:06:09.2[C] -> GSI 18 (level, low) -> IRQ 185
ata1: dev 0 configured for UDMA/100
Restarting tasks... done

</snip>

ata_piix works as expected, though. My first suspect was framebuffer, but it is
not even loaded. I guess there is still no hard disk access.

Is there anything else left I can do?

By the way: My suspend to RAM problem is described in bug #105800, but I didn't
do anything more about it yet since seife told me that this seems to be a much
larger problem.
Comment 36 Alexander Schaefer 2005-08-31 15:13:32 UTC
If there is no more information I can provide at the moment, I would like to
install beta4 to verify some other bugs meanwhile. Would that be ok for you?
Comment 37 Stefan Behlert 2005-08-31 15:48:52 UTC
Tested beta4 on the mentioned Acer: suspend2disk is still working.
suspend2ram, tested with init=/bin/bash and acpi_sleep=s3_bios,s3_mode, is not
working (as in beta3). The disk seems to be the problem in that case. (This is
just for completeness.)
  
Comment 38 Jens Axboe 2005-08-31 15:55:46 UTC
Andi, please see comment #33-35, I'm guessing your patches.arch/i386-apic-up
broke the boot of Alexander's notebook.
Comment 39 Jens Axboe 2005-08-31 15:58:16 UTC
Stefan, please try the KOTD I mentioned in comment #32 - or a newer one and see
if that makes suspend-to-ram work any better.

Alexander, when you use ata_piix, what comes after the "Restarting tasks...
done" line? Do you get any disk errors with ahci, or is that the last message
you ever see? Is the machine pingable at that point and responding to key presses?
Comment 40 Alexander Schaefer 2005-08-31 16:05:24 UTC
When I use ata_piix, Restarting tasks... done is the last thing I see (for about
2 secs) before X server gets started. After that, KDE is back and I can work on
as normal. But when using ahci, the system hangs at that point and does not
react to key presses. Just a moment, I'll reproduce the situation to find out
whether ping works...
Comment 41 Jens Axboe 2005-08-31 16:06:46 UTC
That (funny as it may sound) does not sound like a disk problem since at this
point you already paged in lots of data from swap. Can you try ahci with a
minimal boot and see if the disk works after resume?
Comment 42 Alexander Schaefer 2005-08-31 16:19:45 UTC
done, suspending in minimal boot works well - but I have the same error messages
as mentioned in description when trying to access the hard disk after resuming
(e.g. with 'ls' command).
Comment 43 Stefan Behlert 2005-09-01 08:15:46 UTC
Installed kernel-default-2.6.13-20050831104926.i586.rpm. 
suspend2disk: coredump 
suspend2ram: not tested after the results with s2d :( 
 
Jens, I will open an extra bug for the core with that kernel, I don't think we 
should put all in that bug here. 
Comment 44 Stefan Behlert 2005-09-01 08:31:35 UTC
I've added the information for the dump mentioned in comment 43 to bug 114648 
since I think that one should be easier to fix. 
Comment 45 Jens Axboe 2005-09-02 12:37:12 UTC
Stefan, any changes with the newer KOTD kernels?
Comment 46 Stefan Behlert 2005-09-02 12:48:43 UTC
2.6.13-20050901172817-default was the last one I tried on the Acer, there 
everything was fine (see bug 114648). I'll trigger Alexander to test it on the 
FSC. I have not seen any newer KotD. 
Comment 47 Jens Axboe 2005-09-02 12:50:47 UTC
Alexander, please test! I would love to close this bug.
Comment 48 Alexander Schaefer 2005-09-02 14:03:35 UTC
I tried 2.6.13-20050901172817-default, and still not working, I'm afraid :-(

Resume from suspend to disk from KDE hangs after printing "Restarting tasks...
done" and before starting X-server.

Somehow also resume from an "init=/bin/bash" environment does not work any more
- instead of resuming (without hard drive access, but at least with a prompt and
responding as it was before), the system now freezes with the following output :

ACPI: PCI Interrupt 0000:00:1f.1[A] -> GSI 18 (level, low) -> IRQ 185
ACPI-0212: *** Warning: Device is not power manageable
ACPI: PCI Interrupt 0000:00:1f.2[B] -> GSI 19 (level, low) -> IRQ 193
PCI: Setting latency timer of device 0000:00:1f.2 to 64
ACPI: PCI Interrupt 0000:06:09.0[A] -> GSI 16 (level, low) -> IRQ 169

I believe this came in with the last changes, but I'm not completely sure.

Is there another KOTD that could possibly improve my situation?

I'm extremely sorry for messing up your weekend :-(
Comment 49 Jens Axboe 2005-09-02 14:47:11 UTC
That's too bad, I'm still not convinced this is a sata problem, but it's hard to
diagnose this further without actually having the laptop at hand.

Don't worry, you are not messing up my weekend, I won't even be home :-). So
followups will have to wait until Monday.
Comment 50 Thomas Renninger 2005-09-03 18:17:16 UTC
CC'ing Pavel. He's the expert in suspend kernel debugging and hopefully could
help here...
Comment 51 Jörg Mayer 2005-09-05 11:11:52 UTC
I'm testing on a FSC Amilo M1437G and I am seeing the same problem when doing 
suspend to disk: 
Writing data to swap (17715 pages)... 0% 
ata1: command 0x35 timeout, stat 0x50 host_stat 0x4 
[repeat until poweroff] 
I'll try the KOTD when I get home. The file currently on ftp is 
kernel-default-2.6.13-20050904145455.i586.rpm. Is it ok to test that version? 
Comment 52 Jens Axboe 2005-09-05 12:42:21 UTC
Yes, please test that version when you get home!
Comment 53 Jörg Mayer 2005-09-05 18:02:57 UTC
Suspend completed this time.  
Resume too.  
As far as I am concerned, kernel-default-2.6.13-20050904145455.i586.rpm solves 
the problem. 
Comment 54 Jens Axboe 2005-09-06 08:23:22 UTC
I'm downgrading this issue to major, since I don't consider it a blocker
anymore. SATA suspend should work well on most machines now, it's likely that
the remaining problems with some machines aren't SATA related.
Comment 55 Alexander Schaefer 2005-09-06 16:45:08 UTC
Tried suspend to disk with kernel-default-2.6.13-20050904145455 using AHCI, 
still the following result: 
 
Suspend works, but during resume I get lots of these error messages already 
described in the initial bug description - then the system freezes. 
 
Using ata_piix (just removed ahci module from initrd, I don't even need to 
change the ahci bios option), everything works perfectly, even with a running X 
server. 
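
Taking ahci out of the initrd, as done here and in comment 23, comes down to editing the module list in /etc/sysconfig/kernel and regenerating the initrd. A rough sketch of the editing step, assuming the list is kept in a variable named INITRD_MODULES in double quotes. The helper name and the sample line are hypothetical:

```python
import re


def remove_initrd_module(config_text, module):
    """Drop `module` from the INITRD_MODULES="..." line; other lines are untouched."""

    def rewrite(match):
        kept = [name for name in match.group(2).split() if name != module]
        return match.group(1) + '"' + " ".join(kept) + '"'

    return re.sub(
        r'^(INITRD_MODULES=)"([^"]*)"',
        rewrite,
        config_text,
        flags=re.MULTILINE,
    )


# Hypothetical file content, for illustration only.
SAMPLE = 'INITRD_MODULES="ata_piix ahci reiserfs"\n'

if __name__ == "__main__":
    print(remove_initrd_module(SAMPLE, "ahci"), end="")
```

The sketch only transforms text and does not write the file. The change takes effect only after the initrd has been rebuilt (with mkinitrd on SUSE) and the machine has been rebooted.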
Comment 56 Jörg Mayer 2005-09-08 22:01:56 UTC
Just an update (unfortunately):  
I'm currently running beta4 with kernel kernel-default-2.6.13-20050906125922. 
Doing suspend (powersave -U) and resume works fine - but only the first time. 
If I do a powersave -U again on a system that was resumed, several things went 
wrong in several tests. In most cases, the system hung before the line 
mentioning that pages are being written to disk. In one case the system 
managed to succeed on the second suspend only to hang during resume. 
Comment 57 Jörg Mayer 2005-09-08 22:07:50 UTC
Created attachment 49281 [details]
dmesg output

There are several messages like:
ACPI-0212: *** Warning: Device is not power manageable
Comment 58 Thomas Renninger 2005-09-09 05:45:14 UTC
Joerg: Is this an x86_64 machine?
I have the feeling that this problem is unrelated to SATA?
Just an idea, you could change the suspend mode from platform to shutdown in
/etc/sysconfig/powersave/sleep
SUSPEND2DISK_SHUTDOWN_MODE="platform" -> SUSPEND2DISK_SHUTDOWN_MODE="shutdown"
maybe it helps? Better open a new bug for that and assign it to pavel@suse.de.
Comment 59 Jörg Mayer 2005-09-09 06:55:42 UTC
No, it isn't - it's a pentium-m (just have a look at the dmesg output that I 
attached in #57). 
Setting that value doesn't change anything and why should it. The box normally 
fails after initiating the suspend process and closely before actually writing 
out the data to disk. 
This time, I actually got an oops (unfortunately it scrolled out very 
quickly), followed by an endless stream of the following message: 
atkbd.c: Spurious ACK on isa0060/serio0. Some program, like XFree86, might be 
trying access hardware directly. 
Note: The system is in runlevel 3, no X running. 
 
Does it really make sense to open a new bug for that? The problem and the 
hardware are similar to that mentioned above (see #51) 
Comment 60 Carl-Daniel Hailfinger 2005-09-12 12:30:26 UTC
Jörg: Is your problem maybe related to bug 115095 ?
A SysRq-T backtrace captured via serial console when the system hangs during
suspend could help find out what's going on.
Comment 61 Jörg Mayer 2005-09-21 08:56:07 UTC
OK, turned on alt-sysrq and verified that it works. Then started wondering 
about how to get a serial console (read the file in Documentation) but the 
laptop doesn't have a serial console, only usb. Any ideas? 
Comment 62 Pavel Machek 2005-09-21 20:34:50 UTC
Just use a small font and a digital camera. USB is not useful for a serial console.
Comment 63 Jens Axboe 2006-01-26 12:39:11 UTC
Please open a new bug for 10.1-beta2 if the problem still exists there, it should work on ata_piix as well as ahci (10.0 missed an important ahci fix).