Bug 1213533

Summary: Backport irqchip/gicv3: Workaround for NVIDIA erratum T241-FABRIC-4
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP5
Reporter: Matt Ochs <mochs>
Component: Kernel
Assignee: Ivan Ivanov <ivan.ivanov>
Status: VERIFIED FIXED
QA Contact:
Severity: Normal
Priority: P5 - None
CC: afaerber, ddavis, ivan.ivanov, mbenes, mochs, stanimir.varbanov
Version: unspecified
Target Milestone: ---
Hardware: aarch64
OS: SLES 15
See Also: https://bugzilla.suse.com/show_bug.cgi?id=1224448
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Matt Ochs 2023-07-20 21:22:12 UTC
This upstream patch provides a workaround for a hardware erratum and is required to support compute on 3- and 4-node NVIDIA Grace systems.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=35727af2b15d98a2dd2811d631d3a3886111312e

This fix needs to be backported to SLES 15 SP5.
Comment 1 Ivan Ivanov 2023-07-24 07:36:08 UTC
Patches merged in SLE15-SP5 kernel sources. Thank you!
Comment 10 Maintenance Automation 2023-08-03 09:40:45 UTC
SUSE-SU-2023:3172-1: An update that solves seven vulnerabilities, contains two features and has 25 fixes can now be installed.

Category: security (important)
Bug References: 1150305, 1193629, 1194869, 1207894, 1208788, 1211243, 1211867, 1212256, 1212301, 1212525, 1212846, 1212905, 1213059, 1213061, 1213205, 1213206, 1213226, 1213233, 1213245, 1213247, 1213252, 1213258, 1213259, 1213263, 1213264, 1213286, 1213493, 1213523, 1213524, 1213533, 1213543, 1213705
CVE References: CVE-2023-20593, CVE-2023-2985, CVE-2023-3117, CVE-2023-31248, CVE-2023-3390, CVE-2023-35001, CVE-2023-3812
Jira References: PED-4718, PED-4758
Sources used:
openSUSE Leap 15.5 (src): kernel-obs-qa-5.14.21-150500.55.12.1, kernel-source-5.14.21-150500.55.12.1, kernel-obs-build-5.14.21-150500.55.12.1, kernel-livepatch-SLE15-SP5_Update_2-1-150500.11.3.2, kernel-default-base-5.14.21-150500.55.12.1.150500.6.4.2, kernel-syms-5.14.21-150500.55.12.1
Basesystem Module 15-SP5 (src): kernel-source-5.14.21-150500.55.12.1, kernel-default-base-5.14.21-150500.55.12.1.150500.6.4.2
Development Tools Module 15-SP5 (src): kernel-obs-build-5.14.21-150500.55.12.1, kernel-source-5.14.21-150500.55.12.1, kernel-syms-5.14.21-150500.55.12.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5_Update_2-1-150500.11.3.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
Comment 11 Maintenance Automation 2023-08-03 20:30:46 UTC
SUSE-SU-2023:3180-1: An update that solves seven vulnerabilities, contains two features and has 26 fixes can now be installed.

Category: security (important)
Bug References: 1150305, 1193629, 1194869, 1207894, 1208788, 1211243, 1211867, 1212256, 1212301, 1212525, 1212846, 1212905, 1213059, 1213061, 1213205, 1213206, 1213226, 1213233, 1213245, 1213247, 1213252, 1213258, 1213259, 1213263, 1213264, 1213286, 1213311, 1213493, 1213523, 1213524, 1213533, 1213543, 1213705
CVE References: CVE-2023-20593, CVE-2023-2985, CVE-2023-3117, CVE-2023-31248, CVE-2023-3390, CVE-2023-35001, CVE-2023-3812
Jira References: PED-4718, PED-4758
Sources used:
openSUSE Leap 15.5 (src): kernel-source-azure-5.14.21-150500.33.11.1, kernel-syms-azure-5.14.21-150500.33.11.1
Public Cloud Module 15-SP5 (src): kernel-source-azure-5.14.21-150500.33.11.1, kernel-syms-azure-5.14.21-150500.33.11.1

Comment 16 Maintenance Automation 2023-08-14 08:30:30 UTC
SUSE-SU-2023:3302-1: An update that solves 28 vulnerabilities, contains two features and has 115 fixes can now be installed.

Category: security (important)
Bug References: 1150305, 1187829, 1193629, 1194869, 1206418, 1207129, 1207894, 1207948, 1208788, 1210335, 1210565, 1210584, 1210627, 1210780, 1210825, 1210853, 1211014, 1211131, 1211243, 1211738, 1211811, 1211867, 1212051, 1212256, 1212265, 1212301, 1212445, 1212456, 1212502, 1212525, 1212603, 1212604, 1212685, 1212766, 1212835, 1212838, 1212842, 1212846, 1212848, 1212861, 1212869, 1212892, 1212901, 1212905, 1212961, 1213010, 1213011, 1213012, 1213013, 1213014, 1213015, 1213016, 1213017, 1213018, 1213019, 1213020, 1213021, 1213024, 1213025, 1213032, 1213034, 1213035, 1213036, 1213037, 1213038, 1213039, 1213040, 1213041, 1213059, 1213061, 1213087, 1213088, 1213089, 1213090, 1213092, 1213093, 1213094, 1213095, 1213096, 1213098, 1213099, 1213100, 1213102, 1213103, 1213104, 1213105, 1213106, 1213107, 1213108, 1213109, 1213110, 1213111, 1213112, 1213113, 1213114, 1213116, 1213134, 1213167, 1213205, 1213206, 1213226, 1213233, 1213245, 1213247, 1213252, 1213258, 1213259, 1213263, 1213264, 1213272, 1213286, 1213287, 1213304, 1213417, 1213493, 1213523, 1213524, 1213533, 1213543, 1213578, 1213585, 1213586, 1213588, 1213601, 1213620, 1213632, 1213653, 1213705, 1213713, 1213715, 1213747, 1213756, 1213759, 1213777, 1213810, 1213812, 1213856, 1213857, 1213863, 1213867, 1213870, 1213871, 1213872
CVE References: CVE-2022-40982, CVE-2023-0459, CVE-2023-1829, CVE-2023-20569, CVE-2023-20593, CVE-2023-21400, CVE-2023-2156, CVE-2023-2166, CVE-2023-2430, CVE-2023-2985, CVE-2023-3090, CVE-2023-31083, CVE-2023-3111, CVE-2023-3117, CVE-2023-31248, CVE-2023-3212, CVE-2023-3268, CVE-2023-3389, CVE-2023-3390, CVE-2023-35001, CVE-2023-3567, CVE-2023-3609, CVE-2023-3611, CVE-2023-3776, CVE-2023-3812, CVE-2023-38409, CVE-2023-3863, CVE-2023-4004
Jira References: PED-4718, PED-4758
Sources used:
openSUSE Leap 15.5 (src): kernel-livepatch-SLE15-SP5-RT_Update_3-1-150500.11.5.1, kernel-syms-rt-5.14.21-150500.13.11.1, kernel-source-rt-5.14.21-150500.13.11.1
SUSE Linux Enterprise Live Patching 15-SP5 (src): kernel-livepatch-SLE15-SP5-RT_Update_3-1-150500.11.5.1
SUSE Real Time Module 15-SP5 (src): kernel-syms-rt-5.14.21-150500.13.11.1, kernel-source-rt-5.14.21-150500.13.11.1

Comment 17 Matt Ochs 2023-08-16 03:59:42 UTC
Verified:

host-10-176-223-220:~ # dmesg | grep -i gicv3 
[    0.000000] GICv3: GIC: Using split EOI/Deactivate mode
[    0.000000] GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4

Ran a stress workload (memtester + stress-ng + iozone + irqbalance). Without the fix, lockups occurred within 15 minutes of starting the tests. With the fix applied, the same workload has been running for 6+ hours without issue.
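
The dmesg check above can be scripted when many hosts need to be verified. The following is a minimal sketch, not part of the original report: the function names `workaround_enabled` and `read_dmesg` are hypothetical, and the pattern assumes the kernel prints the message in the same form as the log line quoted in this bug (other kernel builds may word it differently).

```python
import re
import subprocess

# Assumption: the workaround is announced with a line like the one quoted above:
#   "GIC: enabling workaround for GICv3: NVIDIA erratum T241-FABRIC-4"
WORKAROUND_RE = re.compile(r"enabling workaround for .*T241-FABRIC-4")


def workaround_enabled(dmesg_text: str) -> bool:
    """Return True if any log line reports the T241-FABRIC-4 workaround."""
    return any(WORKAROUND_RE.search(line) for line in dmesg_text.splitlines())


def read_dmesg() -> str:
    """Return the kernel ring buffer as text (usually requires root)."""
    result = subprocess.run(
        ["dmesg"], capture_output=True, text=True, check=True
    )
    return result.stdout
```

On a target host, `workaround_enabled(read_dmesg())` should return True once the updated kernel is booted on affected hardware. A False result only means the message was not found in the current ring buffer; early boot messages can be lost if the buffer has wrapped.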