Bug 228344

Summary: Acer Ferrari 1000 resets on resume with CONFIG_MTRR
Product: [openSUSE] openSUSE 10.2 Reporter: Thomas Renninger <trenn>
Component: KernelAssignee: E-mail List <kernel-maintainers>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: Chris.Schlaeger, joachim.deguara, jplack, mark.langsdorf, stefan.fent
Version: Final   
Target Milestone: Future/Later   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: Dmesg with PCI Debug enabled
minium kernel configuration used for testing suspend on Ferrari 1000 with 2.6.20-rc4-git2
kernel config for 2.6.20-rc4-mm1/i386 which gives working suspend from X
"dmesg|grep -B5 MTRR" from 2.6.20-rc5-20070122124535-default (kernel CVS from this monday) x86-64 kernel, booted with mtrr.show=1
roof of concept for fixing save/restore/sync of extended fixed-range MTRRs on AMD64
updated fix with detection of Hammer, includes rewrite of set_fixed_ranges, tested on Ferrari 1000 in 64-bit mode - for review
polished patch, reduces binary size by one for loop for all calls to set_fixed_range
updated fix which works on arch/i386 as well, should be ready for submission, just one minor beautification missing
updated patch for review, possibly mtrr_save_state() could be moved to an x86-generic location, but that's the only thing left for now and this can be discussed on linux-kernel on submission
Program for reading the fixed-range MTRR values man

Description Thomas Renninger 2006-12-13 18:24:32 UTC
When trying to suspend the machine it stops quite early, last message on console:
suspend: Snapshotting the system

dmesg shows some PCI problems and I think they could be related (machine totally freezes, so I expect this happening when suspend tries to free the problematic PCI ports?).
Comment 1 Thomas Renninger 2006-12-13 18:30:05 UTC
Created attachment 109606 [details]
Dmesg with PCI Debug enabled

Debug output did not show much more...
There are two IMO potentially dangerous messages:

PCI: Bus #09 (-#0c) is hidden behind transparent bridge #08 (-#09) (try 'pci=assign-busses')
Please report the result to linux-kernel to fix this permanently

Harmless?
This one seems to be more problematic, should be the PCIe ports?:

PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq".  If it helps, post a report
PCI: Cannot allocate resource region 7 of bridge 0000:00:04.0
PCI: Cannot allocate resource region 8 of bridge 0000:00:04.0
PCI: Cannot allocate resource region 7 of bridge 0000:00:05.0
PCI: Cannot allocate resource region 8 of bridge 0000:00:05.0
PCI: Cannot allocate resource region 9 of bridge 0000:00:05.0
PCI: Cannot allocate resource region 7 of bridge 0000:00:06.0
PCI: Cannot allocate resource region 8 of bridge 0000:00:06.0
PCI-DMA: Disabling IOMMU.


These three get disabled then:
PCI: Bridge: 0000:00:04.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:05.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:06.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
Comment 2 Thomas Renninger 2006-12-13 18:32:34 UTC
Rest of the machine works fine, could it be that the machine freezes when suspend code tries to touch the broken devices?
Comment 3 Pavel Machek 2006-12-13 20:38:10 UTC
Try usual debugging: boot with init=/bin/bash, try suspend from there. If it helps, binary search modules to find the broken one.

We are talking about suspend-to-disk, right?

(What about suspend-to-ram, btw? Has it ever worked before?)
Comment 4 Stefan Fent 2006-12-14 09:07:12 UTC
Yes, suspend-to-disk
Suspend-to-ram doesn't work either (AFAIR, I tried that, too)
Comment 5 Thomas Renninger 2006-12-14 09:54:34 UTC
Did that. Could it be that fb must be off? With vga= removed I now see oopses with init=/bin/bash. First I got an oops and calltrace to thermal resume (I don't see what could go wrong there, but I couldn't scroll up to the beginning...). Then I tried again with thermal module also removed before and it oopsed again. Now I am quite sure it has to do with above PCI warnings at boot time:

Call Trace (written by hand):

pci_restore_msi_state
pci_restore_state
pci_set_master
pcie_portdrv_resume
resume_device
dpm_resume
device_resume
pm_suspend_disk
...

RIP: kfree+0x4d/0x1bb

Again I couldn't scroll up to the very beginning (but call trace is complete), I can try to setup a firewire debug console if really needed...
Comment 6 Pavel Machek 2006-12-14 19:30:57 UTC
It would be nice to find out which device does this... Can you try passing nomsi parameter to see if it helps?
Comment 7 Bernhard Kaindl 2007-01-10 16:59:11 UTC
pci=nomsi changes the oops from the kfree() at the end of pci_restore_msi_state()
to a oops occuring at some random locations, I have seen about 3 locations, so
for me, not only the oops in the kfree in pci_restore_msi_state, but also the
other random oopses look like to be only possible because of corrupted memory.

The first place where something went wrong during boot were these messages, from Comment #1 of Thomas:

PCI: Cannot allocate resource region 7 of bridge 0000:00:04.0

... and so on also for 00.05 and 00.06. They are followed by:

PCI: Bridge: 0000:00:04.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:05.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.
PCI: Bridge: 0000:00:06.0
  IO window: disabled.
  MEM window: disabled.
  PREFETCH window: disabled.

and by:

PCI: Setting latency timer of device 0000:00:04.0 to 64
pcie_portdrv_probe->Dev[5a36:1002] has invalid IRQ. Check vendor BIOS
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:04.0:pcie00]
Allocate Port Service[0000:00:04.0:pcie01]
Allocate Port Service[0000:00:04.0:pcie03]
PCI: Setting latency timer of device 0000:00:05.0 to 64
pcie_portdrv_probe->Dev[5a37:1002] has invalid IRQ. Check vendor BIOS
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:05.0:pcie00]
Allocate Port Service[0000:00:05.0:pcie01]
Allocate Port Service[0000:00:05.0:pcie03]
PCI: Setting latency timer of device 0000:00:06.0 to 64
pcie_portdrv_probe->Dev[5a38:1002] has invalid IRQ. Check vendor BIOS
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:06.0:pcie00]
Allocate Port Service[0000:00:06.0:pcie01]
Allocate Port Service[0000:00:06.0:pcie03]
PCI: Setting latency timer of device 0000:00:07.0 to 64
pcie_portdrv_probe->Dev[5a39:1002] has invalid IRQ. Check vendor BIOS
assign_interrupt_mode Found MSI capability
Allocate Port Service[0000:00:07.0:pcie00]
Allocate Port Service[0000:00:07.0:pcie01]
Allocate Port Service[0000:00:07.0:pcie02]
Allocate Port Service[0000:00:07.0:pcie03]

The bridges in question are shown in lspci as:

00:00.0 Host bridge: ATI Technologies Inc RS480 Host Bridge (rev 10)
00:01.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge
00:04.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge
00:05.0 PCI bridge: ATI Technologies Inc Unknown device 5a37
00:06.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge
00:07.0 PCI bridge: ATI Technologies Inc RS480 PCI Bridge

And lspci -t shows

-[0000:00]-+-00.0
           +-01.0-[0000:01]----05.0
           +-04.0-[0000:02]--
           +-05.0-[0000:03-05]--
           +-06.0-[0000:06]--

I have looked at pcie_portdrv_probe(), which spills the most noticeable warning messages:

static int __devinit pcie_portdrv_probe (struct pci_dev *dev,
                                const struct pci_device_id *id )
{
        int                     status;

        status = pcie_port_device_probe(dev);
        if (status)
                return status;

        if (pci_enable_device(dev) < 0)
                return -ENODEV;

        pci_set_master(dev);
        if (!dev->irq) {
                printk(KERN_WARNING 
                "%s->Dev[%04x:%04x] has invalid IRQ. Check vendor BIOS\n",
                __FUNCTION__, dev->device, dev->vendor);
        }
        if (pcie_port_device_register(dev)) {
                pci_disable_device(dev);
                return -ENOMEM;
        }

        return 0;
}

I have built 2.6.20-rc4-git2 now to test if it helps to abort in this function when the warning is triggered, but with init=/bin/bash, it but it panics suspend2disk after printing

Suspending console(s)

without further messages, even with vanilla code, Caps-Lock blinking and does the automatic reboot on panic which I have set up through sysctl.

pci=noacpi does not even help, only acpi=off makes suspend2disk succeed.

The boot warnings which I quoted also appear on other PCIE machines which Thomas has got, and we want the messages fixed there too.

Pavel, if you have ideas, they would be welcome.
Comment 8 Bernhard Kaindl 2007-01-10 18:10:07 UTC
I have changed pcie_portdrv_probe() to do nothing (verified in the dmesg) on the 3 strange PCI bridges, but it didn't help. 

I have also compared the dmesg from Thomas with the one from 2.6.20-rc4-git2 but didn't get new ideas from that either as 2.6.20-rc4-git2 (with the same config settings) didn't help anyways.

I'm going to rebuild 2.6.20-rc4-git2 with debugging options on and consoles not disabled during suspend and see if I get more info then.
Comment 10 Bernhard Kaindl 2007-01-11 15:30:36 UTC
I've got suspend on 2.6.20-rc4-git2 to succeed by aborting the data copy loop in kernel/power/snapshot.c:copy_data_pages() before it hits the page (src_pfn= 0x4000) which triggered the oops.

With this hack which consists of

if (pfn==0x4000)
    break

the kernel suspends, but on resume it hangs after reading the pages from swap. Once, resume displayed vertical white bars on the display.

Looks like the memory corruption also bites on 2.6.20-rc4-git2 (one could assume), or the kernel data pages used by suspend do no look like the suspend code expects them to be.
Comment 11 Bernhard Kaindl 2007-01-11 16:48:09 UTC
Created attachment 112500 [details]
minium kernel configuration used for testing suspend on Ferrari 1000 with 2.6.20-rc4-git2

It looks like that we can pretty much exclude the PCIE code from the list of suspects because in my current test kernel config, I disabled everything I could,  including PCI Express support, MSI, MMCONFIG and SMP.

All code which accesses the three PCI bridges which caused the warnings can also be removed from the list of subjects because I patched drivers/pci/probe.c to skip them during the probing and are treated as if they do not exist, so that they do not become known to these functions and the kernel (and are also not shown in lspci).
Comment 12 Bernhard Kaindl 2007-01-11 17:03:16 UTC
But what surprises me is that there must be (at least also) an error in the data which suspend uses to verify if it can actually save the page it wants to copy. 

To exclude pages which should not be copied, copy_data_pages calls saveable_page(pfn) before on every page before it starts copying, and double-check, I even added anther check directly into the copy loop, but not even then the check triggers and the kernel still panics in do_copy_page with this check added:

        memory_bm_position_reset(orig_bm);
        memory_bm_position_reset(copy_bm);
        do {
                pfn = memory_bm_next_pfn(orig_bm);
+               if (!saveable_page(pfn)) {
+                       printk("copy_data_pages: not saveable: %p\n", pfn);
+                       break;
+               }
                if (likely(pfn != BM_END_OF_MAP))
                        copy_data_page(memory_bm_next_pfn(copy_bm), pfn);
        } while (pfn != BM_END_OF_MAP);

Very strange, Pavel, any idea?
Comment 13 Bernhard Kaindl 2007-01-12 12:51:43 UTC
I installed 10.2-i386 into a new partition and it was also hanging.
Compiled 2.6.20-rc4-mm1 on it and suspend2disk and -resume works flawlessy there.

Fix for SLED10 SP1 wanted...
Comment 14 Bernhard Kaindl 2007-01-12 14:31:16 UTC
Hm, 10.2-i386 does not hang on supend2disk if done from init=/bin/bash, and suspend2disk succeeds, but resume turn off power after reading the swap image.

Trying to suspend 10.2-i386 from runlevel 1 (sulogin) showed quite me some errors and timeouts USB (ehci-hcd) and messages from pnp for some devices not supporting activation before the code triggers a reboot.

The reboot also happens if I rmmod ehci-hcd and ohci1394 before the echo disk >/sys/power/state. Suspend from X shows no messages at all (black screen for a while) before it triggers the reboot.

Pavel, can't we show the kernel messages also when we suspend from higher runlevels than 1?
Comment 15 Bernhard Kaindl 2007-01-12 17:16:52 UTC
Created attachment 112752 [details]
kernel config for 2.6.20-rc4-mm1/i386 which gives working suspend from X

This is the only kernel config with which I was able to suspend/resume the machine at all so far. It's i386, uniprocessor, but ACPI is on.

Turn e.g. SMP on based on this config, and suspending from X will stop at saving ~23% and suspending from init 3 will stop at ~84% saved.

Even with this kind of config, 2.6.20-rc4-mm1 still panics on the data page copy on x86_64 as ?I described in comment #10, so there are 2 different symtoms with this machine, maybe with the same cause?:

- panic on kernel data page copying on x86_64
- problems with all but a extremely minimal config on i386
Comment 16 Bernhard Kaindl 2007-01-12 17:41:10 UTC
The i386 SMP suspend hang was limited to the -mm1 kernel, plain i386 2.4.20-rc4 also suspends with SMP correctly now.
Comment 17 Pavel Machek 2007-01-12 23:46:49 UTC
If you want to display messages from init 5... it should be enough to set console loglevel correctly, no? (killall klogd?)

Yes, it looks like there's driver problem _plus_ some core problem, too... but 2.6.20-rc4 works in both respects, right?

Is CONFIG_PAE set it any of those kernels? That's known broken and pretty much unfixable.
Comment 18 Bernhard Kaindl 2007-01-17 18:14:50 UTC
Regarding messages: Sure, I can turn up the loglevel, but in general I noticed that during some phases during suspend which can last for seconds, there is no visual feedback (except for the HD LED) that things are still being worked on and the system is indeed working on suspending or resuming and not hung. But that's a candidate for a feature request or for a separate bugzilla entry.

2.6.20-rc4 (I switched to -rc5 meanwhile) is apparently better than any earlyer kernels which I tested, but still has two issues on this laptop:

-When the kernel is compiled with CONFIG_MTRR, at the end of resume, things break: Depending on the other config options/loaded modules/test case, it either triggers a reboot of the system after it prints the message for Booting the 2nd CPU (similar place for non SMP kernel) or it just hangs some seconds after reading the image back from disk.

- When the tg3 module is loaded on suspend, it hangs on suspend, after writing the image.

I have got some MTRR setup info from booting with mtrr.show=1 and will post it tomorrow.

Comment 19 Bernhard Kaindl 2007-01-24 18:13:02 UTC
Created attachment 114778 [details]
"dmesg|grep -B5 MTRR" from 2.6.20-rc5-20070122124535-default (kernel CVS from this monday) x86-64 kernel, booted with mtrr.show=1

Regarding comment #17: This bug is for x86_64, so I mostly debugged using x86_64 and CONFIG_PAE doesn' exist there. I didn't make this clear in the bug's text and since the hardware information of this bug to x86_64, I'll set it.

I'm attaching the debug info info from mtrr.show=1 I'll cc Andi, Bodo told me that he did something with MTRR's and should be able to help with MTRR's.
Comment 20 Bernhard Kaindl 2007-01-25 12:37:18 UTC
Hi Andi,

as Bodo mentioned that you have experiance with MTRR's, could you
have a look into the attachment 114778 [details] of the previous comment?

It's an AMD task statement machine (Ferrari 1000) where resume from
disk with CONFIG_MTRR causes a reboot with power-cylce at the point
where the second CPU is booted (that's the last message which I can
see before the display is reset), but it also happens on a Uniprocessor
when CONFIG_MTRR is on. Processor is (from kernel logs):

AMD Turion(tm) 64 X2 Mobile Technology TL-56 stepping 02

Comment 21 Bodo Bauer 2007-01-25 15:39:37 UTC
I'm leaving Novell. If TPM assistance is needed, please ask Joachim Plack (AMD related issues) or Oliver Ries (general x86_64/i386) for assitance.
Comment 22 Bernhard Kaindl 2007-01-30 16:45:13 UTC
Status summary (for Andi or later Andrea or AMD techs):

On resume, if the kernel is compiled with CONFIG_MTRR (x86_64 and i386),
the last message which I can read before the display is reset, is:

Booting processor 1/2 APIC 0x1

A kernel without CONFIG_MTRR gives these messages afterwards:

Initializing CPU#1
Calibrating delay using timer
specific routine.. 1600.16 BogoMIPS (lpj=3200321)
CPU: L1 I Cache: 64K (64 bytes/line),
D cache 64K (64 bytes/line)
CPU: L2 Cache: 512K (64 bytes/line)
CPU 1/1 -> Node 0
CPU: Physical Processor ID: 0
CPU: Processor Core ID: 1
AMD Turion(tm) 64 X2 Mobile
Technology TL-56 stepping 02
CPU 1: Syncing TSC to CPU 0.
CPU 1: synchronized TSC with CPU 0 (last diff 0 cycles, maxerr 452 cycles)

This also happens on Uniprocessor kernels and other kernels with quite
small configuration, with init=/bin/sh, and with vga=0 on plan vanilla
kernels, so it does not appear to me like an interaction with some
non-required driver.

Andi, any ideas?
Comment 23 Bernhard Kaindl 2007-02-07 16:36:56 UTC
Sorry that this comment is a bit long, but I really wanted to give
a complete update:

Andi suggested debugging the MTRR resume function, but while
trying that on the CVS kernel (which otherwise has working
suspend/resume when CONTIG_MTRR is off because it has no bad
interaction between tg3 and sata_sil), I found that MTRR breaks
suspending with the sata_sil driver, but with the CVS kernel only
so far.

While sata_sil should in theory be robust to suspend independently
of MTRR support in the kernel, this itself is a bug, but this
actually helps me a but to hopefully find out what is wrong with
MTRRs on the Ferrari 1000. At least, they should normally not
have such impact, I guess...

While adopting Andi's development patches to switch from MTRR
to PAT would be too bleeding-edge for general adoption just now,
starting to try out his patches for PAT and disabling MTRR on the
brand new Ferrari 1000 (controlled by DMI) would be an option which
I'd be interested in.

During the last release candidates of 2.6.19, a big fix for the
MTRR code was adopted, so I could also try to revert it an see:
http://lkml.org/lkml/2006/11/15/144

While suspending to disk (but not resuming from disk) worked with
CONFIG_MTRR on on plain vanilla 2.6.20-rc5, I do not think that
this is a bug in sata_sil, because sata_sil otherwise suspends
and resumes fine with CONFIG_MTRR is off and it even works with
CONFIG_MTRR enabled on a similar machine (HP nx6325) with the
same CPU stepping (TL-56 stepping 02) and same ATI RS480 chipset.

The only issue which I still saw with sata_sil in the CVS kernel build
was that if using vesafb and with console log level was 8 and PCI_- and
PM_DEBUG on, it still lost it's disk, probably still because the
intense logging on vesafb takes too long to make the driver recover.

So the sata_sil driver in CVS is apparently improved, but still not useful
for suspend debugging with many messages logged over a slow output
like vesafb. Pure VGA Text mode console logging was fine.

Since the machine does not reset itself but hang, that new symptom is
better because at that point, I may still be able to read the Ferrari's
RAM chips over DMA using the thankfully on-board OHCI1394 controller
over Firewire using Andi's x86_64 port of firescope. That would be
harder to do after a complete machine reset and so I could get debugging
information which the CPU does not even have to output somewhere, it
just has to be written to the RAM chips, from where I can read it over
firewire. Thanks to Andi also for some tips on how to set that up.

If reverting the last MTRR fix does not help and Andi's PAT code
is not yet well enough (a small fix would be also preferred if possible),
I'd have to look at the MTRR initialisation routines, reduce MTRR
initialisation to a minimum, see if that works and start setting
more MTRR registers.
Comment 24 Andreas Kleen 2007-02-07 17:52:15 UTC
I don't think PAT will help because SIL shouldn't write combining and for
non cached mapping the in kernel code already DTRT.


Have you checked if it's suspend or resume yet as I suggested? 
Then disabling the individual mtrrs and finding which one causes it 
should give some clues.

This can be all done without a working console
Comment 25 Bernhard Kaindl 2007-02-09 16:58:08 UTC
Just a quick update:

I have found that the execution of set_fixed_ranges() which sets up the fixed-range MTRRs cause the sata_sil driver to fail by misreading the
size of sda in the resume phase while suspending to disk by returning
from it without setting up the fixed-range MTRRs.

The reset while resuming from disk while initializing the CPU (I think
2nd core, have to check to logs closer) which happened with the vanilla
2.6.20-rc5 is also gone when set_fixed_ranges() is changed to not set the fixed-range MTRRs.

As Andi suggested, I'll try to nail down which of the MTRRs causes the problem.

Another idea which comes to my mind: As the MTRR are AFAICS are initially
set-up by the BIOS, I guess a look for an BIOS upgrade could also be useful,
but in case the BIOS does not allow downgrading, I will not be able to further debug the issue then, so I am a bit hestitant to that. However, it would be good to know if the BIOS changelog has something, I guess.

I do not yet really understand what the kernel does with the fixed-range
MTRRs: I have not seen any possibility in the kernel to change the setup
of the fixed-range MTRRs, so it looks to me that set_fixed_ranges() should
normally do nothing to the MTRRs, as nobody ever asks to change something,
so if everything is correct, it should not try to write to them, it should
normally only verify and restore them in case the initial setup from the
BIOS is lost sometime. I hope that we can sched some light on this....

For a start, I have added some debug printks to the function to see more
what it's doing and where it's causing the reset and the sata_sil trouble.
Comment 26 Pavel Machek 2007-02-13 10:53:58 UTC
I'd do BIOS upgrade if I were you. If it is BIOS problem, and it is fixed in newer BIOS, you do not want to waste your time trying to debug it.
Comment 27 Bernhard Kaindl 2007-02-14 16:40:18 UTC
There is indeed a newer BIOS available from Acer, six-teen versions to be exact,
but only the last version is available, without changelog.

At least the Windows-only Flasher seems to backup the old BIOS image to a backup
file, so I guess I could restore the current BIOS in case.

There is no support for upgrades. If something goes wrong -> not Acer's fault.
I could at least understand if somebody would not like to mess with the BIOS.

Back to teh technical analysis:

In the way I understand the x86-64 manual, the initial setup of Extended
Fixed-Range MTRRs is wrong. On early boot, after identifiying the boot CPU,
I dumped MTRR and SYSCFG MSR contents and found the following
Extended Fixed-Range MTRR setup in the boot CPU:

1) Fixed-Range MTRRs are enabled (MTRR_deftype bits 10 and 11 are set)

2) Extended Fixed-Range MTRRs are enabled and RdMem and Wrmem bits are
   configured to be rw-accessible (The MtrrFixDramEn and MtrrFixModEn bits
   in the SYSCFG MSR are set)

2) All WrMem and RdMem bits are 0, but the MTRRs for 00000-9FFFF (0-640k) are
   set to write-back (WB). To repeat, the entire register's value is 6,
   and Fixed-Range MTRRs are enabled as well.

This combination, according to the page 235 of AMD publication 24593 from
http://www.amd.com/gb-uk/Processors/TechnicalResources/0,,30_182_739_7044,00.html
is not an "allowable combination" and results in "undefined" and "unpredictable"
behaviour.

But when I retrieve the MTRRs after boot, I get this register dump:

CPU 0 MTRR 00000-7FFFF RdMem+Wrmem+write-back
CPU 0 MTRR 80000-9FFFF RdMem+Wrmem+write-back
CPU 0 MTRR C0000-C7FFF RdMem+write-protect
CPU 0 MTRR C8000-CFFFF RdMem+write-protect
CPU 0 MTRR E4000-E7FFF RdMem+write-protect
CPU 0 MTRR E8000-EFFFF RdMem+write-protect
CPU 0 MTRR F0000-F7FFF RdMem+write-protect
CPU 0 MTRR F8000-FFFFF RdMem+write-protect

CPU 1 MTRR 00000-7FFFF write-back
CPU 1 MTRR 80000-9FFFF write-back
CPU 1 MTRR C0000-C7FFF write-protect
CPU 1 MTRR C8000-CFFFF write-protect
CPU 1 MTRR E4000-E7FFF write-protect
CPU 1 MTRR E8000-EFFFF write-protect
CPU 1 MTRR F0000-F7FFF write-protect
CPU 1 MTRR F8000-FFFFF write-protect

While CPU Core 1 still had the same setup as in early boot, the access types
of the Boot CPU Core have drastically changed. A number of memory ranges which
are often used for ROM are changed to a shadowed ROM configuration, as if
something shadowed ROM respective memory-mapped I/O to system memory for
shadowed ROM while the 1st 640k of memory are now properly configured for
write-back.

Now, when the inital MTRR setup of the two CPU core is restored during suspend
after saving+restoring the CPU context in {save,restore}_processor_state, the
bug manifests itself in the form of sata_sil from the kernel CVS failing to
access sda when it's time to write the suspend image or a system reset later
on resume with older kernels.

I have not found Linux kernel code which would set the RdMem and Wrmem bits,
ACPI seems to be not able to do wrmsr but an SMI (system management interrupt)
may do it: http://en.wikipedia.org/wiki/System_Management_Mode

A good fix may however be to save the MTRR state in save_processor_state,
including the extended fixed-range MTRRs (currently nothing regarding MTRRs
is done there) and restore it in restore_processor_state, and ensuring that
all running CPUs have the save MTRR setup.
Comment 28 Bernhard Kaindl 2007-02-19 13:46:50 UTC
I tested saving the Fixed-range MTRR state in save_processor_state, but on restore_processor_state for the second core, an exception is thrown by WRMSR when set_fixed_ranges tries to set the RdMem and/or WrMem attributes in the Extended MTRR Type-Field Format for Fixed-Range MTRRs which are described in section 7.8.1 of AMD publication 24593, Rev. 3.12 ( http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24593.pdf ) which are already set on the boot core.

That kind exception is handled by the MTRR code already, it logs that the MSR for which WRMSR failed and continues and the system resumes otherwise fine.

The code to handle an exception from WRMSR is in mainline because it was a work-around for an BIOS bug regarding MTRRs, found and fixed by SuSE/Andi:
http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.12-rc1/2.6.12-rc1-mm2/broken-out/x86_64-work-around-tyan-bios-mtrr-initialization-bug.patch

I looked up the reasons why WRMSR may throw an exception on page 359 of http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24594.pdf
and AFACS, none of them is present, so at least from that documentation, the exception shouldn't be thrown. I also verified that the MtrrFixDramModEn and MtrrFixDramEn bits are set in the SYSCFG MSR of the second core immediately before trying to set the bits, but both are set, so nothing should prevent to set these bits, but the system throws the exception nonetheless.

I tried to search for AMD erratas and Turion 64 X2 docs which would document this, but didn't find documentation on that.

Mark, may I be right with the following:

May it be that these bits may not be set on the second TL-56 core despite having two a separate L2 cache since this kind of memory configuration (shadowed ROM and write-back memory below 1M) and is more usual for the system BIOS (which will only use the boot core) than by modern operating systems which use both cores? 

As a workaround, the fixed-range MTRRs could be saved&restored using separate backup values per logical CPU. The resulting suspend/resume setup would not change, except that the exceptions would not be thrown and logged and the MTRRs
would be restored as it was before resume.
Comment 29 Andreas Kleen 2007-02-19 14:17:17 UTC
The MTRR save/restore procedure doesn't 100% follow the intel recommendations.
Maybe it's related to that. It was on my todo list to fix that up.

>
>A good fix may however be to save the MTRR state in save_processor_state,
>including the extended fixed-range MTRRs (currently nothing regarding MTRRs
>is done there) and restore it in restore_processor_state, and ensuring that
>all running CPUs have the save MTRR setup.

This sounds like a good idea. Can you try that?
Comment 30 Bernhard Kaindl 2007-02-19 18:10:15 UTC
Yes, looks good.

A quick proof-of-concept, tailored to the dual-core Ferrari 1000 works on it.

Just for additional information, but not of importance, do you have (or has anybody here) a pointer to the recommendations for MTRR save/restore?

In the first step which I described in comment #28, I initialized the MTRRs of the AP with the values which I got from the BP in disable_nonboot_cpus() (because I saw that it runs during suspend) - That was an easy quick test.

Because that caused exceptions on the second core of this Ferrari 1000
(as described), I used a separate MTRR backup for the AP (the AP does not
accept the same MTRR setup on this machine) - and that works fine and
really restores what was there before suspend (which should be safe).

I'll extract the changes and post them for review here.
Comment 31 Bernhard Kaindl 2007-02-20 19:49:41 UTC
I had a simple off-by-one error in checking the MtrrFixDramModEn bit and while trying to set RdMem and WrMem, the bit was not set, so this was the reason why the exception which occurred in wrmsr so everything works as documented and I can set the RdMem and WrMem and bits successfully on the second CPU core.

I now worked on extending my changes which save and restore the extended fixed-range MTRRs which are a special feature of the AMD64 architecture during suspend/resume to also syncronize them between CPUs as described as a "must" in section 7.6.5 "MTRRs in Multi-Processing Environments" of the AMD x86-64 programmer's manual, vol. 2, despite the BIOS not conforming to that, so that part is actually a BIOS workaround, but the MTRR code already has a tradition in doing so, e.g. in comment #28, I mentioned a MTRR BIOS bug workaround by Andi and there are more, e.g for more easyer cases than the Ferrari 1000, the Linux MTRR code already did the right thing, but with the extended fixed-range MTRRs of the AMD64 architecture, and the Ferrari 1000 setting up shadow RAM when enabling ACPI mode made it bit more tricky.

But on the other hand, I think fixing this nearly worst-case scenario was a good test for this code to save/restore and snyc the AMD64 extended fixed-range MTRRs should be working once and for all.

I'll dig out the combined changes and post them for review.
Comment 32 Bernhard Kaindl 2007-02-21 18:09:25 UTC
Created attachment 120361 [details]
roof of concept for fixing save/restore/sync of extended fixed-range MTRRs on AMD64

This is the extracted patch, but it is only a proof of concept.

It works on my the Ferrari 1000 and should to "the right thing"(tm), IMHO.

It's ready for review, but needs more changes, e.g. it would not be accepted into mainline in this state because it accesses the possibly K8-specific SYSCFG MSR directly, but I don't know if this MSR is even there on the Intel 64 CPUs. Maybe AMD knows wether it is, but I have not found it described in the Intel 64 and IA-32 programmer's manual.

In any case these accesses would need to be encapsulated to only be done on x86_64 or even only on AMD64 (K8) CPUs. Otherwise it could be used now and works fine. I also verifified the MTRR state to be sure that not only suspend2disk works but that the MTRRs are indeed in SYNC.

Preferably set_fixed_ranges() should be rewritten to use a new function called update_fixed_range which handles the reoccurring code of reading an MSR, checking if it needs update and updating it if needed, with status returned. That would reduce the code size while adding the support for AMD64 extended fixed range MTRRs.

I could do so as well, but I first wanted to show what I did in essence, and when we know how to draft it finally, I can do it of course.
Comment 33 Bernhard Kaindl 2007-02-22 17:16:25 UTC
Created attachment 120608 [details]
updated fix with detection of Hammer, includes rewrite of set_fixed_ranges, tested on Ferrari 1000 in 64-bit mode - for review

I could not access the K8 SYSCFG MSR on Intel64 (tried it in legacy mode only)
and http://en.wikipedia.org/wiki/EM64T#Differences_between_AMD64_and_Intel_64
also says that it does not exist.

So this updated patch accesses the K8 SYSCFG MSR only on Hammer CPUs, also in legacy mode (I assume that there is nothing different there, but I will test it), but not on Intel 64, Athlon or other CPUs.

It contains a rewrite of set_fixed_ranges() which was a spaghetti-implementation of what could have been written better (smaller in code and source and easy to read) using an inlined function, but without the code for handling WrMem and RdMem, my rewrite has the exact same object size, so there was no big deal in rewriting it, other than readability.

However, by adding support for WrMem/WrMem, the code is smaller when not inlined (speed is not a concern in this code path) so a rewrite to use a function was even better for code size. The patch could be optimised even more for object size, but that is not the main topic now.

Anyway, with this version, result is that the object code with the rewrite is more 100 bytes smaller than that of the previous patch, even with the K8 check which isn't even included in the previous patch, but also the source code is more readable.

This patch works on the Ferrari 1000 in long mode (I verified the MTRRs right after boot and after many resumes to be in sync and not messed up), and will also test what happens in legacy mode.

What remains is a test in 32-bit mode and the patch could be further polished to be ready for submission. Comments welcome.
Comment 34 Bernhard Kaindl 2007-02-23 15:28:16 UTC
Created attachment 120787 [details]
polished patch, reduces binary size by one for loop for all calls to set_fixed_range

Attached updated patch, Description is: "reduces binary size by one for loop for all calls to set_fixed_range."

Contains only small changes. I found that my test program to verify the MTTR contents had a bug so I didn't notice that the patch doesn't syncronize the MTRRs after the SMI for the transition to ACPI mode.

Initially I thought that I could call mtrr_save_fixed_ranges() in the window between acpi_enable() and the init of the APs, but it would be better if do a 

    smp_call_function_single(0, mtrr_save_fixed_ranges, NULL, 1, 1);

in mtrr_ap_init, which works, but it triggers a WARN_ON smp_call_function_single() is called with interrupts disabled because it could deadlock then, so I will probably better lookup the places where the APs are signaled to boot and call mtrr_save_fixed_ranges() there.
Comment 35 Joachim Deguara 2007-02-25 21:58:12 UTC
the same problem, premature shutdown when suspending to disk, happens on the Ferrari 5000 also, but the patch from comment #34 fixes it.  Tested on openSUSE 10.2 with 2.6.21-rc1-git1 kernel.
Comment 36 Bernhard Kaindl 2007-02-26 17:48:11 UTC
Created attachment 121120 [details]
updated fix which works on arch/i386 as well, should be ready for submission, just one minor beautification missing

Attached updated fix which works on arch/i386 as well. It is ready for submission except for one detail which is only beauty(last change to be synced to x86_64) in detail and needs to be tested to work in uniprocessor kernel builds as well and does not break builds of the kernel without MTRR support.

With this patch, I verified that the same issue is also there in legacy mode, and verified that it is fixed with the same change.
Comment 37 Bernhard Kaindl 2007-02-27 17:12:22 UTC
Created attachment 121351 [details]
updated patch for review, possibly mtrr_save_state() could be moved to an x86-generic location, but that's the only thing left for now and this can be discussed on linux-kernel on submission

I tested this patch now on x86_64, works fine and fixes also the PCIE issue which we initially saw on this bug.
Comment 38 Bernhard Kaindl 2007-02-28 17:41:22 UTC
Created attachment 121632 [details]
Program for reading the fixed-range MTRR values man 

Test program for reading current fixed-range MTRR values, enable CONFIG_X86_MSR and load msr.ko.

Assumes that msr.ko device files are created as /dev/cpu/<#CPU>/msr by udev (default)

Example output from broken system (each 64-bit column is from one CPU):

MSR 0x250: 1e1e1e1e1e1e1e1e | 0606060606060606
MSR 0x258: 1e1e1e1e1e1e1e1e | 0606060606060606
MSR 0x259: 0000000000000000 | 0000000000000000
MSR 0x268: 1515151515151515 | 0505050505050505
MSR 0x269: 1515151515151515 | 0505050505050505
MSR 0x26a: 0000000000000000 | 0000000000000000
MSR 0x26b: 0000000000000000 | 0000000000000000
MSR 0x26c: 1515151500000000 | 0505050500000000
MSR 0x26d: 1515151515151515 | 0505050505050505
MSR 0x26e: 1515151515151515 | 0505050505050505
MSR 0x26f: 1515151515151515 | 0505050505050505

In fixed setup, the register dump blocks from all CPUs are identical and do not change e.g. after resume from disk.
Comment 40 Thomas Renninger 2008-08-29 09:39:03 UTC
AFAIK Bernd got this resolved mainline and did forget to close the bug.