|
Bugzilla – Full Text Bug Listing |
| Summary: | [PATCH] Fix ECC error counting for AMD76x chipset, char/ecc.c driver | ||
|---|---|---|---|
| Product: | [openSUSE] SUSE LINUX 10.0 | Reporter: | Bryce Nesbitt <bryce2> |
| Component: | Kernel | Assignee: | Andreas Kleen <ak> |
| Status: | RESOLVED WONTFIX | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Normal | ||
| Priority: | P5 - None | ||
| Version: | Final | ||
| Target Milestone: | --- | ||
| Hardware: | i586 | ||
| OS: | SuSE Linux 10.0 | ||
| Whiteboard: | |||
| Found By: | Other | Services Priority: | |
| Business Priority: | Blocker: | --- | |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | Patch AMD 76x ECC counting, verified on AMD 761 Chipset, ASUS A7M266 Mainboard | ||
Created attachment 59862 [details]
Patch AMD 76x ECC counting, verified on AMD 761 Chipset, ASUS A7M266 Mainboard
Andi, can you look over the patch? AFAIK we already dropped the ECC drivers and I don't know of any plans to add them back. So it's probably a WONTFIX. For the original poster I would suggest to fix that problem in mainline of the package, so if that driver is ever used again the fix will be in there. |
Summary: * Patch is relevant to driver "char/ecc.c", for the AMD76x Athlon chipset. This appears to be a SUSE extension, not part of the kernel mainline. * Write 1 bits, not 0 bits, to clear the ECC error flag status. * Without this patch, Linux will detect ONLY the first single and multibit error. All subsequent errors are ignored by the hardware until the register is properly cleared. * The patch also fast-paths the common polled operation, and simplifies code. * Includes reference to AMD 761 spec sheet documenting the ECC register values. * Note: this module is often not installed by default: do "modprobe ecc" then "cat /proc/ram" to check your ECC memory for detected soft errors. ---------------------------------------------------------------------------------------- linux:/usr/src # diff -u -b orig/drivers/char/ecc.c linux-2.6.13-15/drivers/char/ecc.c > /tmp/ecc.diff linux:/usr/src # cat /tmp/ecc.diff --- orig/drivers/char/ecc.c 2005-09-13 08:52:29.000000000 -0700 +++ linux-2.6.13-15/drivers/char/ecc.c 2005-12-05 09:36:26.000000000 -0800 @@ -10,6 +10,7 @@ */ #define DEBUG 0 + #include <linux/config.h> #include <linux/version.h> #include <linux/module.h> @@ -22,7 +23,7 @@ #include <asm/io.h> #include <linux/proc_fs.h> -#define ECC_VER "0.14 (Oct 10 2001)" +#define ECC_VER "0.15 (Dec 1 2005)" #define KERN_ECC KERN_ALERT static struct timer_list ecctimer; @@ -1102,15 +1103,20 @@ } } + +// Spec source: AMD 761 System Controller/BIOS Guide, 24081D-February 2002 +// http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/24081.pdf void check_amd76x(void) { - unsigned long eccstat = pci_dword(0x48); + u32 eccstat; + pci_read_config_dword(bridge, 0x48, &eccstat); + if(eccstat & 0x30) + { if(eccstat & 0x10) { /* bits 7-4 of eccstat indicate the row the MBE occurred. */ int row = (eccstat >> 4) & 0xf; printk("<1>ECC: MBE Detected in DRAM row %d\n", row); - scrub_needed |= 2; bank[row].mbecount++; } if(eccstat & 0x20) @@ -1118,22 +1124,9 @@ /* bits 3-0 of eccstat indicate the row the SBE occurred. */ int row = eccstat & 0xf; printk("<1>ECC: SBE Detected in DRAM row %d\n", row); - scrub_needed |= 1; bank[row].sbecount++; } - if (scrub_needed) - { - /* - * clear error flag bits that were set by writing 0 to them - * we hope the error was a fluke or something :) - */ - unsigned long value = eccstat; - if (scrub_needed & 1) - value &= 0xFFFFFDFF; - if (scrub_needed & 2) - value &= 0xFFFFFEFF; - pci_write_config_dword(bridge, 0x48, value); - scrub_needed = 0; + pci_write_config_dword(bridge, 0x48, eccstat); // clear by writing a 1 } }