Bug 1221093 - Reading a WT mapped PCIe BAR generates MCE on icelake and newer
Status: NEW
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP5
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other Other
Priority: P5 - None, Severity: Normal
Target Milestone: ---
Assignee: Kernel Bugs
Reported: 2024-03-06 20:15 UTC by Anthony Tortola
Modified: 2024-03-18 18:02 UTC


Description Anthony Tortola 2024-03-06 20:15:11 UTC
I have an in-house FPGA with a prefetchable BAR:
apxs____9072:~ # lspci -vvs 61:0
61:00.0 Memory controller: Xilinx Corporation Device 7030
        Region 0: Memory at 25ffffe00000 (64-bit, prefetchable) [size=128K]

In the driver, I map the last 32 KiB of the BAR as write-through (WT) using ioremap_wt(). Here is the resulting PAT configuration:
apxs____9072:~ # cat /sys/kernel/debug/x86/pat_memtype_list |grep 0025ff
PAT: [mem 0x000025ffffe00000-0x000025ffffe10000] uncached-minus
PAT: [mem 0x000025ffffe10000-0x000025ffffe18000] write-combining
PAT: [mem 0x000025ffffe18000-0x000025ffffe20000] write-through

I believe I also need to configure an MTRR for that range:
apxs____9072:~ # cat /proc/mtrr
reg00: base=0x000000000 (    0MB), size=524288MB, count=1: write-back
reg01: base=0x080000000 ( 2048MB), size= 2048MB, count=1: uncachable
reg02: base=0x25ffffe18000 (39845886MB), size=   32KB, count=1: write-through
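
As a quick sanity check on the numbers above, the following Python sketch confirms that the three PAT ranges tile the 128 KiB BAR, that the write-through range is its last 32 KiB, and that the MTRR entry covers the same range. The helper names (`ranges_tile`, `mtrr_shape_ok`) are made up for illustration; the addresses are copied from the output above. It relies on the usual rule that a variable-range MTRR has a power-of-two size and a base aligned to that size.

```python
# Values copied from the lspci, pat_memtype_list and /proc/mtrr output above.
BAR_BASE = 0x25FFFFE00000
BAR_SIZE = 128 * 1024

PAT_RANGES = [
    (0x25FFFFE00000, 0x25FFFFE10000, "uncached-minus"),
    (0x25FFFFE10000, 0x25FFFFE18000, "write-combining"),
    (0x25FFFFE18000, 0x25FFFFE20000, "write-through"),
]

MTRR_BASE = 0x25FFFFE18000
MTRR_SIZE = 32 * 1024


def ranges_tile(ranges, base, size):
    """Return True if the ranges cover [base, base + size) with no gaps or overlaps."""
    cursor = base
    for start, end, _ in sorted(ranges):
        if start != cursor or end <= start:
            return False
        cursor = end
    return cursor == base + size


def mtrr_shape_ok(base, size):
    """Variable-range MTRRs need a power-of-two size and a base aligned to that size."""
    return size > 0 and size & (size - 1) == 0 and base % size == 0


wt_start, wt_end, _ = PAT_RANGES[-1]
print("PAT ranges tile the BAR:", ranges_tile(PAT_RANGES, BAR_BASE, BAR_SIZE))
print("WT range is the last 32 KiB:",
      wt_end == BAR_BASE + BAR_SIZE and wt_end - wt_start == MTRR_SIZE)
print("MTRR matches the WT range:",
      (MTRR_BASE, MTRR_SIZE) == (wt_start, wt_end - wt_start))
print("MTRR base/size well-formed:", mtrr_shape_ok(MTRR_BASE, MTRR_SIZE))
print("MTRR base in MB:", MTRR_BASE >> 20)
```

All of the checks pass for the values shown, so the PAT and MTRR settings at least agree with each other.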

At that point, I mmap() those 32 KiB via /sys/devices/pci0000:60/0000:60:01.0/0000:61:00.0/resource0 and read from the mapping.
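
For anyone trying to reproduce this, the access pattern can be sketched roughly as follows. This is not the reporter's test program: the function name `read_u32` and the choice of a single 32-bit read are assumptions, and Python gives no control over the width of the underlying load, so a real test would more likely use C with volatile accesses. Only the sysfs path and the 0x18000/0x8000 window come from the report. On an affected system, running it may well reproduce the crash.

```python
import mmap
import os
import struct

# Path and window taken from the report above.
RESOURCE0 = "/sys/devices/pci0000:60/0000:60:01.0/0000:61:00.0/resource0"
WT_OFFSET = 0x18000  # start of the last 32 KiB of the 128 KiB BAR
WT_LENGTH = 0x8000


def read_u32(path, offset, length, index=0):
    """Map [offset, offset + length) of path read-only and return one little-endian u32."""
    # mmap offsets must be multiples of the allocation granularity, so map
    # from the aligned-down offset and skip the slack inside the mapping.
    slack = offset % mmap.ALLOCATIONGRANULARITY
    fd = os.open(path, os.O_RDONLY)
    try:
        with mmap.mmap(fd, length + slack, access=mmap.ACCESS_READ,
                       offset=offset - slack) as window:
            return struct.unpack_from("<I", window, slack + 4 * index)[0]
    finally:
        os.close(fd)


# Only touch the device if it is actually present on this machine.
if __name__ == "__main__" and os.path.exists(RESOURCE0):
    print(hex(read_u32(RESOURCE0, WT_OFFSET, WT_LENGTH)))
```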

On Intel CPUs up to Cascade Lake and on AMD Genoa, that works fine.

On Ice Lake and newer, the system crashes immediately without even triggering the crash kernel. The BIOS reports a Machine Check Error.

What has changed on the Intel CPU?
Comment 1 Takashi Iwai 2024-03-07 16:15:09 UTC
Something specific to x86, I suppose.  Adding relevant people to Cc.
Comment 2 Jiri Slaby 2024-03-08 07:03:14 UTC
Can you dump the MC registers when that happens so that it is known what the error is?
Comment 3 Anthony Tortola 2024-03-08 21:45:23 UTC
From the developer:

CPU 0: Machine Check Exception: 5 Bank 9: be20000000061136
RIP !INEXACT! 33:<000055f8f7b265a0>
TSC 280beabb3ed ADDR 25ffffe18000 MISC 1004080408300886
PROCESSOR 0:806f8 TIME 1709914912 SOCKET 0 APIC 0 microcode 2b000571
Run the above through 'mcelog --ascii'
Machine check: Processor context corrupt
 
cat mce|mcelog --ascii --intel-cpu 6,143
Hardware event. This is not a software error.
CPU 0 BANK 9 TSC 280beabb3ed
RIP !INEXACT! 33:55f8f7b265a0
MISC 1004080408300886 ADDR 25ffffe18000
TIME 1709914912 Fri Mar  8 11:21:52 2024
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: corrected filtering (some unreported errors in same region)
Data CACHE Level-3 Data-Read Error
STATUS be20000000061136 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 143 Step 8
SOCKET 0 APIC 0 microcode 2b000571
Run the above through 'mcelog --ascii'
Machine check: Processor context corrupt
 
This was produced on an openSUSE 15.5/SLES15 SP5 system and on a system with the latest kernel, 6.7.9.
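
To cross-check the mcelog output by hand, the STATUS and MCGSTATUS values can be split into their architectural fields with a few lines of Python. This is a minimal sketch: the bit positions follow the IA32_MCi_STATUS and IA32_MCG_STATUS layouts in the Intel SDM, the function names are made up for illustration, and the model-specific part of the code is left undecoded.

```python
# Architectural flag bits of IA32_MCi_STATUS (upper bits only).
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}

# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, names):
    """Return the names of all flag bits that are set in value."""
    return [name for bit, name in names.items() if (value >> bit) & 1]


def decode_status(status):
    """Split an MCi_STATUS value into flags, model-specific code and MCA error code."""
    return {
        "flags": set_flags(status, STATUS_FLAGS),
        "mscod": (status >> 16) & 0xFFFF,  # model-specific error code
        "mcacod": status & 0xFFFF,         # architectural MCA error code
    }


decoded = decode_status(0xBE20000000061136)  # STATUS from the log above
for flag in decoded["flags"]:
    print(flag)
print("MSCOD  = 0x%04x" % decoded["mscod"])
print("MCACOD = 0x%04x" % decoded["mcacod"])
print("MCG status:", " ".join(set_flags(0x5, MCG_FLAGS)))  # MCGSTATUS 5
```

The flags that come out (UC, EN, MISCV, ADDRV, PCC, plus RIPV and MCIP for the global status) match the lines mcelog printed.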
Comment 4 Jiri Slaby 2024-03-13 10:55:40 UTC
To me, it looks like WT for PCI devices does not work well (and never did). In this case, it might have caused an L3 cache inconsistency (but how?). And it seems only new processors can detect it somehow? How many new CPUs have you tried? Could they be from a defective batch?

BTW why not using write-back and proper snooping in your device?
Comment 5 Anthony Tortola 2024-03-18 18:02:44 UTC
From the developer:

They have tried many different CPUs with the same result.  Can you elaborate on your comment: "why not using write-back and proper snooping in your device?"?

thanks