Bug 146367

Summary: Kernel Panic on AMD Opteron 4P/8P system with 2GB+ used by PCI devices (in some memory configurations)
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Brian Richardson <brianr>
Component: Kernel
Assignee: Andreas Kleen <ak>
Status: RESOLVED INVALID
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: david.keck, jacob.shin, mark.langsdorf
Version: Final
Target Milestone: ---
Hardware: x86-64
OS: SuSE Linux 10.1
Whiteboard:
Found By: Third Party Developer/Partner
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Brian Richardson 2006-01-27 21:57:05 UTC
Symptom: Kernel panic very early in boot process
> PANIC: early exception rip ffffffff8023d53f error 0 cr2 f048ba
> PANIC: early exception rip ffffffff8011ba8a error 0 cr2 ffffffffff5fd023 

OS: SuSe 10 x86-64
Kernel: Linux version 2.6.13-15-smp (default from installation)
Boot Options: default kernel parameters + APIC ON
Processor Models Tested: Opteron 848 (Rev E4), Opteron 865HE (Rev E6)

System: PANTA "Tabasco", a reconfigurable server that can run with 4 or 8 AMD Opteron Processors.
BIOS: AMIBIOS8, version 0ABHQ012

This system uses a large amount of resources for PCI/PCIe devices. In the 4P or 8P configuration, each system has one ATI graphics adapter and two Mellanox Infiniband Host Controller Adapters (HCA). The memory space claimed by PCI in this configuration is 2048MB.

These configurations will FAIL when "hardware memory hole" is enabled in the BIOS, an option that attempts to reclaim memory lost to PCI/PCIe devices by remapping any overlapped memory to the top of physical memory. This option is controlled by the AMD CPU & memory init code.

In a 4P configuration with only one Infiniband HCA, the PCI memory is 1536MB. In this configuration the failure cannot be reproduced.

Without the PANTA hardware, it may be possible to create a similar configuration using any 4P Opteron motherboard with enough expansion cards installed to use more than 2GB of memory for PCI/PCIe devices. Note that the hardware memory hole is only used on "Rev E" Opterons.

Memory Configurations That PASS with memory hole enabled:
* Any configuration where 4GB of memory is populated on the first CPU (CPU0) and at least 1 DIMM is installed on the other system CPUs
Memory Configurations That FAIL with memory hole enabled:
* Any configuration where less than 4GB of memory is populated on CPU0
(more detailed memory test grid available upon request)

Test conditions used to generate failure with memory hole enabled:
* 4P system, 2GB per CPU (8GB total)
* 8P system, 2GB per CPU (16GB total)
Both configurations can PASS by installing an additional 2GB RAM on CPU0 
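
To make the numbers above easier to compare, here is a minimal Python sketch of where the hole would land in each test configuration. It assumes a simplified model that is not taken from the AMD init code: DRAM is mapped contiguously, node by node, starting at address 0; the PCI window ends exactly at 4GB; and any DRAM at or above the hole base is hoisted upward by the size of the hole. The name `hole_placement` and its return fields are illustrative only. This shows placement arithmetic, not a root cause.

```python
def hole_placement(pci_mb, node_ram_mb):
    """Locate the PCI hole in a simplified, contiguous per-node DRAM layout.

    pci_mb      -- memory space claimed by PCI/PCIe below 4GB, in MB
    node_ram_mb -- list of DRAM sizes per CPU node, in MB (CPU0 first)

    Assumption (not from the report): the hole is [4096 - pci_mb, 4096) MB
    and DRAM at or above the hole base is shifted up by pci_mb.
    """
    hole_base = 4096 - pci_mb
    total = sum(node_ram_mb)

    # Walk the nodes to find the one whose un-hoisted range contains hole_base.
    start = 0
    node = None
    for i, size in enumerate(node_ram_mb):
        if start <= hole_base < start + size:
            node = i
            break
        start += size

    # If any DRAM sits at or above the hole base, the top of memory moves up.
    top = total + pci_mb if total > hole_base else total

    return {
        "hole_base_mb": hole_base,
        "node": node,
        "offset_in_node_mb": hole_base - start if node is not None else None,
        "top_of_memory_mb": top,
    }


if __name__ == "__main__":
    # 4P, 2GB per CPU, two HCAs (2048MB PCI): reported as FAIL
    print(hole_placement(2048, [2048, 2048, 2048, 2048]))
    # 4P, extra 2GB on CPU0 (2048MB PCI): reported as PASS
    print(hole_placement(2048, [4096, 2048, 2048, 2048]))
    # 4P, 2GB per CPU, one HCA (1536MB PCI): failure not reproduced
    print(hole_placement(1536, [2048, 2048, 2048, 2048]))
```

Under these assumptions, the failing 4P case puts the hole base at 2048MB, which is offset 0 of node 1. With 4GB on CPU0 the hole base falls 2048MB into node 0, and with 1536MB of PCI space it falls 512MB into node 1.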

===
Problem reported to mark.langsdorf@amd.com, who recommended reporting to SuSe/Novell for investigation. AMI & PANTA would like to determine whether this is a Linux kernel issue, an AMD CPU limitation, or an AMD memory initialization bug.

This is a complicated problem to describe, due to the unusual hardware configuration. Additional data can be provided upon request.
Comment 1 Chris L Mason 2006-01-27 23:37:12 UTC
Please try our 10.1 beta kernels.  If this still fails there, we can move forward with debugging it.

ftp://ftp.suse.com/pub/projects/kernel/kotd/x86_64/HEAD/kernel-smp.x86_64.rpm

Comment 2 Bodo Bauer 2006-02-08 10:54:06 UTC
Adding David and Jacob as Mark is on sabbatical
Comment 3 Brian Richardson 2006-02-08 18:34:19 UTC
Identical failure on SLES10 Beta (2.6.15-git12-6-smp). Waiting for results on SLES10 Beta3. Similar failure on Win2K3 Server.
Comment 4 Olaf Kirch 2006-03-06 12:07:17 UTC
Andi, Greg, any ideas? Which of you should I assign this to?
Comment 5 Andreas Kleen 2006-03-06 12:36:07 UTC
To me.

I fixed various issues with this recently,
so I would recommend testing the latest kernel (2.6.15-git12 is really old).

But if you get a similar failure on Windows maybe it's not the OS that is
to blame? Does something simple like memtest86 work? If not then likely
it's a hardware/BIOS problem. We had this in the past when MTRRs were not
correctly set up or e820 RAM entries pointed to non-RAM etc. Basically we
require that all RAM entries in e820 point to real usable RAM, and if that's
not the case there's nothing we can do from the OS side.
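
One way to sanity-check that requirement from a boot log is sketched below in Python. It assumes the log contains lines of the form ` BIOS-e820: <start> - <end> (<type>)`, as printed by kernels of this era. The sample text is made up for illustration and is not taken from the affected system; the helper names `parse_e820` and `usable_overlapping` are hypothetical. The check only flags "usable" entries that overlap a given address window (for example the PCI window below 4GB); it cannot prove that an entry is backed by real RAM.

```python
import re

# Matches e.g. " BIOS-e820: 0000000000100000 - 0000000080000000 (usable)"
E820_LINE = re.compile(
    r"BIOS-e820:\s*([0-9a-fA-F]+)\s*-\s*([0-9a-fA-F]+)\s*\((.+?)\)"
)

# Hypothetical sample, NOT from the system in this report.
SAMPLE_LOG = """\
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 0000000000100000 - 0000000080000000 (usable)
 BIOS-e820: 0000000080000000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000280000000 (usable)
"""


def parse_e820(log_text):
    """Return a list of (start, end, type) tuples; end is exclusive."""
    entries = []
    for line in log_text.splitlines():
        m = E820_LINE.search(line)
        if m:
            entries.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return entries


def usable_overlapping(entries, win_start, win_end):
    """Return usable (start, end) ranges that intersect [win_start, win_end)."""
    return [
        (start, end)
        for start, end, kind in entries
        if kind == "usable" and start < win_end and end > win_start
    ]


if __name__ == "__main__":
    entries = parse_e820(SAMPLE_LOG)
    # Window for a 2048MB PCI region ending at 4GB: [2GB, 4GB)
    suspects = usable_overlapping(entries, 0x80000000, 0x100000000)
    for start, end in suspects:
        print("usable entry overlaps PCI window: %#x - %#x" % (start, end))
    if not suspects:
        print("no usable e820 entry overlaps the window")
```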



Comment 6 Olaf Kirch 2006-03-06 12:44:37 UTC
Thanks Andi. Handing over to you.
Comment 7 Andreas Kleen 2006-03-06 12:49:20 UTC
Still need information.
Comment 8 Brian Richardson 2006-03-06 14:57:06 UTC
AMD has verified this problem on another manufacturer's hardware. A future version of their memory/CPU reference code (AGESA 1.32.01) will resolve this issue.

A patch provided by AMD was tested on the PANTA hardware using SuSe 10.0 x64. This resolves the issue described in this report.
Comment 9 Andreas Kleen 2006-03-06 15:06:33 UTC
Can you attach the patch?
Comment 10 Brian Richardson 2006-03-06 15:22:03 UTC
It's a source-level change to AMD confidential code, and our BIOS structure doesn't use a patching system ... so I can't provide anything here. The version of BIOS that resolves the issue is 0ABHQ013.

Since the issue is resolved without changing any SuSe product code, what is the proper resolution to use when closing this bug?
Comment 11 Andreas Kleen 2006-03-07 22:25:50 UTC
Ah thanks - I thought you were referring to a Linux kernel patch.
If it's a BIOS bug we normally close it as INVALID since it's 
not our bug.

But could you please give me a quick summary of what is wrong and why it crashes,
if you know? That's just so that I can more easily recognize the symptoms
when a customer runs into it in the future and blames us.
Thanks.