Bugzilla – Bug 146367
Kernel Panic on AMD Opteron 4P/8P system with 2GB+ used by PCI devices (in some memory configurations)
Last modified: 2006-03-07 22:25:50 UTC
Symptom: Kernel panic very early in boot process
> PANIC: early exception rip ffffffff8023d53f error 0 cr2 f048ba
> PANIC: early exception rip ffffffff8011ba8a error 0 cr2 ffffffffff5fd023

OS: SuSe 10 x86-64
Kernel: Linux version 2.6.13-15-smp (default from installation)
Boot Options: default kernel parameters + APIC ON
Processor Models Tested: Opteron 848 (Rev E4), Opteron 865HE (Rev E6)
System: PANTA "Tabasco", a reconfigurable server that can run with 4 or 8 AMD Opteron Processors.
BIOS: AMIBIOS8, version 0ABHQ012

This system uses a large amount of resources for PCI/PCIe devices. In the 4P or 8P configuration, each system has one ATI graphics adapter and two Mellanox Infiniband Host Controller Adapters (HCA). The memory space claimed by PCI in this configuration is 2048MB.

These configurations will FAIL when "hardware memory hole" is enabled in the BIOS, an option that attempts to reclaim memory lost to PCI/PCIe devices by remapping any overlapped memory to the top of physical memory. This option is controlled by the AMD CPU & memory init code.

In a 4P configuration with only one Infiniband HCA, the PCI memory is 1536MB. In this configuration the failure cannot be reproduced.

Without the PANTA hardware, it may be possible to create a similar configuration using any 4P Opteron motherboard with enough expansion cards installed to use more than 2GB of memory for PCI/PCIe devices. Note that the hardware memory hole is only used on "Rev E" Opterons.
Memory Configurations That PASS with memory hole enabled:
* Any where 4GB memory is populated on the first CPU (CPU0) and at least 1 DIMM is installed on the other system CPUs

Memory Configurations That FAIL with memory hole enabled:
* Any where less than 4GB memory is populated on CPU0 (more detailed memory test grid available upon request)

Test conditions used to generate failure with memory hole enabled:
* 4P system, 2GB per CPU (8GB total)
* 8P system, 2GB per CPU (16GB total)

Both configurations can PASS by installing an additional 2GB RAM on CPU0

===

Problem reported to mark.langsdorf@amd.com, who recommended reporting to SuSe/Novell for investigation. AMI & PANTA would like to verify if this is a LINUX kernel issue, AMD CPU limitation or AMD memory initialization bug.

This is a complicated problem to describe, due to the unusual hardware configuration. Additional data can be provided upon request.
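
For readers unfamiliar with the "hardware memory hole" option, the address arithmetic it implies can be sketched roughly as below. This is only an illustrative model, not the AMD init code: it assumes the PCI/PCIe window ends exactly at the 4GB boundary, that DRAM is mapped contiguously from address 0, and that DRAM above the hole base is shifted up by the size of the hole. The function name memory_hole_layout and its dictionary keys are made up for this example. It describes the layout only and says nothing about why a given configuration fails.

```python
FOUR_GB_MB = 4096  # the 4GB boundary, expressed in MB


def memory_hole_layout(dram_per_cpu_mb, pci_window_mb):
    """Simplified model of memory-hole remapping (all sizes in MB).

    Assumptions (not taken from the BIOS code): the PCI window occupies
    [4GB - pci_window_mb, 4GB) and DRAM is contiguous starting at 0.
    """
    total_mb = sum(dram_per_cpu_mb)
    hole_base_mb = FOUR_GB_MB - pci_window_mb

    # DRAM that would otherwise sit in [hole_base, 4GB) is hidden by PCI space.
    overlapped_mb = max(0, min(total_mb, FOUR_GB_MB) - hole_base_mb)

    # With remapping, every byte of DRAM at or above the hole base moves up
    # by the size of the hole, so the top of memory grows by the same amount.
    if total_mb > hole_base_mb:
        top_of_memory_mb = total_mb + pci_window_mb
    else:
        top_of_memory_mb = total_mb

    return {
        "hole_base_mb": hole_base_mb,
        "overlapped_mb": overlapped_mb,
        "top_of_memory_mb": top_of_memory_mb,
    }


if __name__ == "__main__":
    # Configurations named in this report: 4P and 8P with 2GB per CPU,
    # and the 4P variant with a single HCA (1536MB of PCI space).
    cases = [
        ("4P, 2GB/CPU, 2048MB PCI", [2048] * 4, 2048),
        ("8P, 2GB/CPU, 2048MB PCI", [2048] * 8, 2048),
        ("4P, 2GB/CPU, 1536MB PCI", [2048] * 4, 1536),
    ]
    for label, dram, pci in cases:
        print(label, memory_hole_layout(dram, pci))
```

Under these assumptions, a 2048MB PCI window places the hole base at 2048MB, which in the 2GB-per-CPU test setups coincides with the end of the memory populated on CPU0.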
Please try our 10.1 beta kernels. If this still fails there, we can move forward with debugging it. ftp://ftp.suse.com/pub/projects/kernel/kotd/x86_64/HEAD/kernel-smp.x86_64.rpm
Adding David and Jacob as Mark is on sabbatical
Identical failure on SLES10 Beta (2.6.15-git12-6-smp). Waiting for results on SLES10 Beta3. Similar failure on Win2K3 Server.
Andi, Greg, any ideas? Which of you should I assign this to?
To me. I fixed various issues with this recently, so I would recommend testing the latest kernel (2.6.15-git12 is really old). But if you get a similar failure on Windows, maybe it's not the OS that is to blame? Does something simple like memtest86 work? If not, then it is likely a hardware/BIOS problem. We had this in the past when MTRRs were not correctly set up, or e820 RAM entries pointed to non-RAM, etc. Basically, we require that all RAM entries in e820 point to real usable RAM, and if that's not the case there's nothing we can do from the OS side.
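
One rough way to check the e820 requirement from a saved boot log is sketched below. It assumes the older "BIOS-e820: start - end (type)" line format printed by 2.6-era kernels, and the helper names parse_e820 and usable_overlaps are invented for this example. The sample map is made up for illustration and is not taken from the PANTA system. The script only flags "usable" ranges that overlap a given MMIO window; it cannot prove that the remaining entries point to real RAM.

```python
import re

# Matches lines such as:
#  BIOS-e820: 0000000000100000 - 0000000080000000 (usable)
E820_LINE = re.compile(
    r"BIOS-e820:\s*([0-9a-fA-F]+)\s*-\s*([0-9a-fA-F]+)\s*\((.+?)\)"
)


def parse_e820(log_text):
    """Return a list of (start, end, type) tuples; end is exclusive."""
    entries = []
    for match in E820_LINE.finditer(log_text):
        start = int(match.group(1), 16)
        end = int(match.group(2), 16)
        entries.append((start, end, match.group(3)))
    return entries


def usable_overlaps(entries, window_start, window_end):
    """List 'usable' ranges that intersect [window_start, window_end)."""
    return [
        (start, end)
        for start, end, kind in entries
        if kind == "usable" and start < window_end and end > window_start
    ]


# Hypothetical map for 8GB of RAM with a 2GB hole below 4GB (illustration only).
SAMPLE_LOG = """
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 0000000000100000 - 0000000080000000 (usable)
 BIOS-e820: 0000000080000000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - 0000000280000000 (usable)
"""

if __name__ == "__main__":
    mmio_start, mmio_end = 0x80000000, 0x100000000  # assumed 2GB PCI window
    bad = usable_overlaps(parse_e820(SAMPLE_LOG), mmio_start, mmio_end)
    for start, end in bad:
        print("usable range overlaps MMIO window: %#x - %#x" % (start, end))
    if not bad:
        print("no usable e820 range overlaps the MMIO window")
```

To try it on a real machine, the boot messages (for example from dmesg) would be passed in place of SAMPLE_LOG, with the window bounds adjusted to the PCI space actually claimed.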
Thanks Andi. Handing over to you.
Still need information.
AMD has verified this problem on another manufacturer's hardware. A future version of their memory/CPU reference code (AGESA 1.32.01) will resolve this issue. A patch provided by AMD was tested on the PANTA hardware using SuSe 10.0 x64. This resolves the issue described in this report.
Can you attach the patch?
It's a source-level change to AMD confidential code, and our BIOS structure doesn't use a patching system ... so I can't provide anything here. The version of BIOS that resolves the issue is 0ABHQ013. Since the issue is resolved without changing any SuSe product code, what is the proper resolution to use when closing this bug?
Ah thanks - I thought you were referring to a Linux kernel patch. If it's a BIOS bug we normally close it as INVALID since it's not our bug. But could you please give me a quick summary of what is wrong and why it crashes, if you know? That's just so that I can more easily recognize the symptoms when, in the future, a customer runs into it and blames us. Thanks.