Bug 136990

Summary: [PATCH] Fix ECC error counting for AMD76x chipset, char/ecc.c driver
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Bryce Nesbitt <bryce2>
Component: Kernel
Assignee: Andreas Kleen <ak>
Status: RESOLVED WONTFIX
QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None    
Version: Final   
Target Milestone: ---   
Hardware: i586   
OS: SuSE Linux 10.0   
Whiteboard:
Found By: Other
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Patch AMD 76x ECC counting, verified on AMD 761 Chipset, ASUS A7M266 Mainboard

Description Bryce Nesbitt 2005-12-05 19:02:50 UTC
Summary:
* Patch is relevant to driver "char/ecc.c", for the AMD76x Athlon chipset.  This appears to be a SUSE extension, not part of the kernel mainline.

* Write 1 bits, not 0 bits, to clear the ECC error status flags.
* Without this patch, Linux will detect ONLY the first single-bit and the
  first multi-bit error.  All subsequent errors are ignored by the hardware
  until the register is properly cleared.
* The patch also fast-paths the common polled operation, and simplifies
  the code.
* Includes a reference to the AMD 761 spec sheet documenting the ECC
  register values.

* Note: this module is often not installed by default: do "modprobe ecc"
then "cat /proc/ram" to check your ECC memory for detected soft errors.


----------------------------------------------------------------------------------------
linux:/usr/src # diff -u -b orig/drivers/char/ecc.c linux-2.6.13-15/drivers/char/ecc.c  > /tmp/ecc.diff
linux:/usr/src # cat /tmp/ecc.diff
--- orig/drivers/char/ecc.c     2005-09-13 08:52:29.000000000 -0700
+++ linux-2.6.13-15/drivers/char/ecc.c  2005-12-05 09:36:26.000000000 -0800
@@ -10,6 +10,7 @@
  */
 #define DEBUG  0

+
 #include <linux/config.h>
 #include <linux/version.h>
 #include <linux/module.h>
@@ -22,7 +23,7 @@
 #include <asm/io.h>
 #include <linux/proc_fs.h>

-#define        ECC_VER "0.14 (Oct 10 2001)"
+#define        ECC_VER "0.15 (Dec 1 2005)"
 #define KERN_ECC KERN_ALERT

 static struct timer_list ecctimer;
@@ -1102,15 +1103,20 @@
        }
 }

+
+// Spec source: AMD 761 System Controller/BIOS Guide, 24081D-February 2002
+// http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24081.pdf
 void check_amd76x(void)
 {
-       unsigned long eccstat = pci_dword(0x48);
+       u32 eccstat;
+       pci_read_config_dword(bridge, 0x48, &eccstat);
+       if(eccstat & 0x30)
+       {
        if(eccstat & 0x10)
        {
                /* bits 7-4 of eccstat indicate the row the MBE occurred. */
                int row = (eccstat >> 4) & 0xf;
                printk("<1>ECC: MBE Detected in DRAM row %d\n", row);
-               scrub_needed |= 2;
                bank[row].mbecount++;
        }
        if(eccstat & 0x20)
@@ -1118,22 +1124,9 @@
                /* bits 3-0 of eccstat indicate the row the SBE occurred. */
                int row = eccstat & 0xf;
                printk("<1>ECC: SBE Detected in DRAM row %d\n", row);
-               scrub_needed |= 1;
                bank[row].sbecount++;
        }
-       if (scrub_needed)
-       {
-               /*
-                * clear error flag bits that were set by writing 0 to them
-                * we hope the error was a fluke or something  :) 
-                */
-               unsigned long value = eccstat;
-               if (scrub_needed & 1)
-                       value &= 0xFFFFFDFF;
-               if (scrub_needed & 2)
-                       value &= 0xFFFFFEFF;
-               pci_write_config_dword(bridge, 0x48, value);
-               scrub_needed = 0;
+               pci_write_config_dword(bridge, 0x48, eccstat);  // clear by writing a 1
        }
 }
Comment 1 Bryce Nesbitt 2005-12-05 19:04:47 UTC
Created attachment 59862
Patch AMD 76x ECC counting, verified on AMD 761 Chipset, ASUS A7M266 Mainboard
Comment 2 Hannes Reinecke 2005-12-07 11:28:43 UTC
Andi, can you look over the patch?
Comment 3 Andreas Kleen 2005-12-07 11:44:25 UTC
AFAIK we already dropped the ECC drivers and I don't know of any plans to
add them back. So it's probably a WONTFIX.

I would suggest that the original poster fix the problem in the mainline version of the package, so that if the driver is ever used again the fix will already be in there.