Bug 151517 - FSC Lifebook does not resume
Summary: FSC Lifebook does not resume
Status: RESOLVED FIXED
Alias: None
Product: SUSE LINUX 10.0
Classification: openSUSE
Component: Kernel (show other bugs)
Version: Final
Hardware: i686 Linux
: P5 - None : Critical
Target Milestone: ---
Assignee: Hannes Reinecke
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2006-02-16 15:36 UTC by Hannes Reinecke
Modified: 2006-03-17 07:56 UTC (History)
2 users (show)

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
sata-acpi-fsc-debug.patch (2.08 KB, patch)
2006-02-16 15:53 UTC, Hannes Reinecke
Details | Diff
DSDT.dsl (219.17 KB, text/plain)
2006-02-21 15:53 UTC, Hannes Reinecke
Details
enable ACPI via AML debug statement (1.25 KB, patch)
2006-02-21 21:20 UTC, Thomas Renninger
Details | Diff
sata-acpi-suspend.patch (62.42 KB, patch)
2006-02-28 12:01 UTC, Hannes Reinecke
Details | Diff
ahci-update.patch (10.65 KB, patch)
2006-03-17 07:53 UTC, Hannes Reinecke
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Hannes Reinecke 2006-02-16 15:36:40 UTC
FSC Lifebooks C-series (I have an C1320 here) doesn't resume properly.
It definitely needs the SATA ACPI support (cf bugzilla #142375), but even with that it doesn't resume properly.

Well, actually it does resume, but the disk stalls; the _GTF commands are send to the device but the device never responds. So appearently there is some initialisation missing...
Comment 1 Hannes Reinecke 2006-02-16 15:43:00 UTC
This is what I get during bootup:

ata1: SATA link up 1.5 Gbps (SStatus 113)
ata1: dev 0 cfg 00:045a 49:2f00 82:346b 83:7f09 84:6063 85:3469 86:3f09 87:6063
88:203f 93:0000
ata1: dev 0 ATA-7, max UDMA/100, 78140160 sectors: LBA48
ata_acpi_push_id: ap->id: 1, ix = 0, port#: 0, hard_port#: 0
sata_get_dev_handle: SATA dev addr=0x1f0002, handle=0xdf672e28
:\_SB_.PCI0.SATA: matches pcidevfn (0x1f0002)
GOT ONE: (\_SB_.PCI0.SATA.PRT0) root_port = 0x0, port_num = 0xffff
GOT ONE: (\_SB_.PCI0.SATA.PRT0) root_port = 0x0, port_num = 0xffff
THIS ^^^^^ is the requested SATA drive (handle = 0xdf6672a4)
GOT ONE: (\_SB_.PCI0.SATA.PRT2) root_port = 0x2, port_num = 0xffff
ata1: dev 0 configured for UDMA/100
ata_acpi_exec_tfs: ENTER:
ata_acpi_exec_tfs: call get_GTF, ix=0
do_drive_get_GTF: ENTER: ap->id: 1, port#: 0, hard_port#: 0
sata_get_dev_handle: SATA dev addr=0x1f0002, handle=0xdf672e28
do_drive_get_GTF: returning gtf_length=14, gtf_address=0xded46834, obj_loc=0xded
46824
ata_acpi_exec_tfs: call set_taskfiles, ix=0
do_drive_set_taskfiles: ENTER: ap->id: 1, port#: 0, hard_port#: 0
do_drive_set_taskfiles: total GTF bytes=14 (0xe), gtf_count=2, addr=0xded46834
taskfile_load_raw: (0x1f1-1f7): hex: 10 03 00 00 00 a0 ef
call ata_exec_internal:
taskfile_load_raw: (0x1f1-1f7): hex: 00 00 00 00 00 a0 f5
call ata_exec_internal:
ata_acpi_exec_tfs: call get_GTF, ix=1
do_drive_get_GTF: ENTER: ap->id: 1, port#: 0, hard_port#: 0
do_drive_get_GTF: ERR: ata_dev_present: 0, PORT_DISABLED: 0
ata_acpi_exec_tfs: get_GTF error (-19)
ata_acpi_exec_tfs: ret=-19
scsi0 : ahci
Comment 2 Hannes Reinecke 2006-02-16 15:52:08 UTC
It's sending first a 'SET FEATURES' command, with subcommand 'Set CFA Power 1', followed by some vendor specific command '0xf5'. 
Comment 3 Hannes Reinecke 2006-02-16 15:53:41 UTC
Created attachment 68857 [details]
sata-acpi-fsc-debug.patch

Patch to call ata_set_xfermode _after_ the _GTF methods.
Comment 4 Hannes Reinecke 2006-02-16 15:55:48 UTC
When using the above patch, I'm actually able to resume the notebook under ideal conditions (ie syslog & hal off, rmmod ipw2200, runlevel 3).

I'm using these commands:

echo platform > /sys/power/disk
echo disk > /sys/power/state

Jens, given that the _GTF method might be unlocking the device, shouldn't we call set_xfermode after _GTF?
Comment 5 Jens Axboe 2006-02-18 11:40:52 UTC
Hannes, yep makes sense. We should always execute the machine specific parts first, then do what we need after that.
Comment 6 Hannes Reinecke 2006-02-21 15:53:10 UTC
Created attachment 69602 [details]
DSDT.dsl

Disassembled DSDT from the FSC laptop.
Comment 7 Thomas Renninger 2006-02-21 21:20:50 UTC
Created attachment 69656 [details]
enable ACPI via AML debug statement

With this patch you can debug the GTF function without tons of ACPI debug garbage:
Put a:
Store("acpi_dbg_level=0xFFFFF", debug) (or whatever debug value)
at the beginning of the GTF function
and a
Store("acpi_dbg_level=0xF", debug)
at the end.
ials -sa dsdt.dsl
cp DSDT.aml /etc/DSDT.aml
vi /etc/sysconfig/kernel -> ACPI_DSDT="/etc/DSDT.aml"
mkinitrd
Good luck.
Comment 8 Thomas Renninger 2006-02-21 21:22:34 UTC
enable higher ACPI debug tracing via AML debug statement describes it better ...
You must have ACPI_DEBUG=y comipled in!
Comment 9 Hannes Reinecke 2006-02-23 15:57:28 UTC
Hmm. Doesn't appear to be the fault of the ACPI patches.
The machine _does_ resume properly, but upon resume the drive locks up.

Either I get a message like:

ata1: resume device
ata_acpi_exec_tfs: ENTER:
ata_acpi_exec_tfs: call get_GTF, ix=0
do_drive_get_GTF: ENTER: ap->id: 1, port#: 0, hard_port#: 0
sata_get_dev_handle: SATA dev addr=0x1f0002, handle=0xdf66e1a8
do_drive_get_GTF: returning gtf_length=14, gtf_address=0xdea6c1d0, obj_loc=0xdea
6c1c0
ata_acpi_exec_tfs: call set_taskfiles, ix=0
do_drive_set_taskfiles: ENTER: ap->id: 1, port#: 0, hard_port#: 0
do_drive_set_taskfiles: total GTF bytes=14 (0xe), gtf_count=2, addr=0xdea6c1d0
taskfile_load_raw: (0x1f1-1f7): hex: 10 03 00 00 00 a0 ef
call ata_qc_issue:
Uhhuh. NMI received. Dazed and confused, but trying to continue
You probably have a hardware problem with your RAM chips
ata1: port reset, p_is 24000000 is 1 pis 0 cmd 4017 tf d0 ss 113 se 0
taskfile_load_raw: (0x1f1-1f7): hex: 00 00 00 00 00 a0 f5
call ata_qc_issue:
ata1: port reset, p_is 24000000 is 1 pis 0 cmd 4017 tf 1d0 ss 113 se 0

and the entire device is gone after that, or it's more like:

ata1: resume device
ata_acpi_exec_tfs: ENTER:
ata_acpi_exec_tfs: call get_GTF, ix=0
do_drive_get_GTF: ENTER: ap->id: 1, port#: 0, hard_port#: 0
pata_get_dev_handle: enter: dev->bus_id='0000:00:1f.2'
pata_get_dev_handle: dev_handle: 0xdf66e1a8, parent_handle: 0xdf67c628
pata_get_dev_handle: for dev=0x1f.2, addr=0x1f0002, parent=0xdf640800, *handle=
0xdf66e1a8
do_drive_get_GTF: returning gtf_length=14, gtf_address=0xdc2223f0, obj_loc=0xdc
2223e0
ata_acpi_exec_tfs: call set_taskfiles, ix=0
do_drive_set_taskfiles: ENTER: ap->id: 1, port#: 0, hard_port#: 0
do_drive_set_taskfiles: total GTF bytes=14 (0xe), gtf_count=2, addr=0xdc2223f0
taskfile_load_raw: (0x1f1-1f7): hex: 10 03 00 00 00 a0 ef
call ata_qc_issue:
taskfile_load_raw: (0x1f1-1f7): hex: 00 00 00 00 00 a0 f5
call ata_qc_issue:
ata_acpi_exec_tfs: call get_GTF, ix=1
do_drive_get_GTF: ENTER: ap->id: 1, port#: 0, hard_port#: 0
do_drive_get_GTF: ERR: ata_dev_present: 0, PORT_DISABLED: 0
ata_acpi_exec_tfs: get_GTF error (-19)
ata_acpi_exec_tfs: ret=-19
ata1: dev 0 configured for UDMA/100
Restarting tasks... done
ReiserFS: warning: is_tree_node: node level 8240 does not match to the expected
 one 4

and the machine continues operation; only reiserfs is quite unhappy and switches to read-only.

I'm a bit at a loss how to debug this further.
Vojtech, any ideas?
Comment 10 Hannes Reinecke 2006-02-24 09:29:00 UTC
As the patch should be ready by the end of the month, this is getting somewhat urgent.
Comment 11 Vojtech Pavlik 2006-02-24 10:03:52 UTC
Well, I'm no ACPI expert nor SATA expert, so I'm not sure why you've chosen
me to help. ;) Anyway, from the above output it seems that the ACPI BIOS is
asking sata_acpi to send two commands to the drive, of which the second fails
badly, either confusing the BIOS (1st case) or only the drive (2nd case).

I guess Jens Axboe could help you by looking at what the commands are and why
they confuse the drive.

Jens, any idea?
Comment 12 Hannes Reinecke 2006-02-24 14:00:16 UTC
This seems to be the problem:
upon resume, the status is:

ata1: enter resume, p_is 0 is 0 pis 0 cmd 4017 tf 50 ss 113 se 0

then I issue a reset and the status is:

ata1: exit resume, p_is 0 is 0 pis 0 cmd 4017 tf 150 ss 113 se 0
Comment 13 Randy Dunlap 2006-02-24 16:30:30 UTC
Hannes:  That enter/exit resume info didn't tell me anything.
What does it tell you?
Comment 14 Hannes Reinecke 2006-02-27 09:39:31 UTC
It tells me that the disk's gone south. tf bit 8 means disk error.
Comment 15 Hannes Reinecke 2006-02-28 09:04:42 UTC
Ok. Finally nailed him.

It turns out that AHCI is not correctly initialised after all.

There are four registers (PxCLB, PxCLBU, PxFB, PxFBU) which contain the addresses for the DMA memory space. And of course these addresses are not necessarily identical across reboots.
IOW they might be different before and after a resume. So we _need_ to update them after a resume.

What I've actually done was to stop DMA entirely upon suspend (via ahci_port_stop()) and re-enable it upon resume (via ahci_port_start()).
Those two function ensure that the said registers are always up-to-date.

Comment 16 Jens Axboe 2006-02-28 09:23:12 UTC
Encouraging news Hannes, the stop/start of the port sounds like a correct fix too.
Comment 17 Jens Axboe 2006-02-28 09:25:19 UTC
I think it would be best to split out the read of the address and initializing from the port_stop/start routines and call these on suspend and resume. There's no reason to free and reallocate memory on suspend/resume, that might just get us into other problems (and it can't be GFP_KERNEL) currently either from that path.
Comment 18 Hannes Reinecke 2006-02-28 09:39:11 UTC
Ah. Like this?

static void ahci_port_suspend(struct ata_port *ap)
{
	struct device *dev = ap->host_set->dev;
	struct ahci_port_priv *pp = ap->private_data;
	void *mmio = ap->host_set->mmio_base;
	void *port_mmio = ahci_port_base(mmio, ap->port_no);
	u32 tmp;

	tmp = readl(port_mmio + PORT_CMD);
	tmp &= ~(PORT_CMD_START | PORT_CMD_FIS_RX);
	writel(tmp, port_mmio + PORT_CMD);
	readl(port_mmio + PORT_CMD); /* flush */

	/* spec says 500 msecs for each PORT_CMD_{START,FIS_RX} bit, so
	 * this is slightly incorrect.
	 */
	msleep(500);
}

static void ahci_port_stop(struct ata_port *ap)
{
	ahci_port_suspend(ap);

	ap->private_data = NULL;
	dma_free_coherent(dev, AHCI_PORT_PRIV_DMA_SZ,
			  pp->cmd_slot, pp->cmd_slot_dma);
	kfree(pp);
}

Comment 19 Jens Axboe 2006-02-28 09:44:24 UTC
Yep that looks good, and ditto for ahci_port_start() of course, I guess you'll add ahci_port_resume() for that?

The msleep() might be troublesome too, which is a little unfortunate since it's a quite long sleep. Doesn't it complain about non-atomic sleep with that called from  the suspend path?
Comment 20 Hannes Reinecke 2006-02-28 09:56:21 UTC
Not really (this is 10.0. That doesn't complain).

How should we replace the msleep?
Comment 21 Hannes Reinecke 2006-02-28 10:00:33 UTC
mdelay looks like a candidate.
Comment 22 Jens Axboe 2006-02-28 10:01:41 UTC
It might need to become mdelay() - however that is a little nasty on module load, as it will busy block for half a second... It's a little ugly, but we could pass in a 'int is_resume/suspend' parameter and use mdelay() then and msleep() for the module load path.
Comment 24 Hannes Reinecke 2006-02-28 11:59:57 UTC
Even better; the spec actually says what we should wait for.
So we can use udelay(10) within the wait loop and everything should be ok.

I'll attach the patch.
Comment 25 Hannes Reinecke 2006-02-28 12:01:53 UTC
Created attachment 70613 [details]
sata-acpi-suspend.patch

Patch to enable suspend to AHCI w/ ACPI integration.

Not sure whether the ACPI integration is actually needed, but they certainly don't seem to cause any harm.

This is mostly a backport of Randy's original patch with the ahci suspend / resume stuff done by me.
Comment 26 Jens Axboe 2006-02-28 12:02:16 UTC
So the comment is incorrect? 500ms is a huge amount of time.
Comment 27 Hannes Reinecke 2006-02-28 12:04:27 UTC
Yes. The spec says:

Software is allowed to manipulate PxCMD.FRE so that it may move the FIS receive area to a new location. When this bit is cleared to ‘0’, software must first wait for PxCMD.FR to clear to ‘0’, indicating that the DMA engine for FIS reception is in an idle condition. When PxCMD.FR and PxCMD.FRE are both cleared to ‘0’, software may update the values of PxFB and PxFBU. Prior to setting PxCMD.FRE to ‘1’, software shall ensure that PxFB and PxFBU are set to valid values. Software shall not write PxFB and PxFBU while PxCMD.FRE is set to ‘1’.

Might be that there are some timing things somewhere else, but this looks far more reliable.
Comment 28 Jens Axboe 2006-02-28 12:05:44 UTC
Overall that patch looks like a good thing. Can you please split it up, though? Lets keep Randys patch separate, and do the suspend/resume fixes separately from the HPA stuff.

The udelay(10) splitup and check look good.
Comment 29 Hannes Reinecke 2006-02-28 12:21:13 UTC
I'll rediff them with libata-dev and post them to linux-ide.
But that's something I wanted to do anyway, as the ahci reset stuff is too important to _not_ have it.
Comment 30 Hannes Reinecke 2006-02-28 12:25:25 UTC
Adjusting the product.
Comment 31 Randy Dunlap 2006-02-28 16:23:27 UTC
Don't the ahci pci_driver .suspend and .resume method pointers need to be
added also?  At least they are not yet in mainline.
Comment 32 Jens Axboe 2006-03-01 07:31:00 UTC
Yes they do, I think that's already in our tree. Hannes, any reason it's not part of this rollup?
Comment 33 Hannes Reinecke 2006-03-01 07:51:52 UTC
Yes, it is. It's been your previous patch (ahci-suspend) which included those hooks already. The mainline patch will have them.
Comment 34 Hannes Reinecke 2006-03-01 12:32:13 UTC
So I think we can close this as it works as designed now.
Comment 35 Hannes Reinecke 2006-03-17 07:53:07 UTC
Created attachment 73634 [details]
ahci-update.patch

AHCI update backported from mainline.
Comment 36 Hannes Reinecke 2006-03-17 07:56:27 UTC
There is actually a small glitch in the patch from comment #25; the spec mandates that the DMA should only be turned on after port probing. Hence we might get those spurious 'ataX; dma not running' during booting.
The above patch fixes this. I've also backported the mainline version as it's far clearer by design and does have more reliable error messages.
It helps to track down errors (like a missing BIOS reset :-) far better than the original one.