Bug 1226114

Summary: MI300A: rasdaemon: Error logs are not captured in rasdaemon upon error injection
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP5
Reporter: Muralidhara MK <muralidhara.mk>
Component: Other
Assignee: E-mail List <maint-coord>
Status: IN_PROGRESS ---
QA Contact:
Severity: Major
Priority: P3 - Medium
CC: aschnell, ddavis, kim.naru
Version: unspecified
Target Milestone: ---
Hardware: x86-64
OS: Linux
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---

Description Muralidhara MK 2024-06-08 16:19:48 UTC
This bug is based on the JIRA issue https://jira.suse.com/browse/AMD-133

 

From the analysis, it seems that the SLES15 kernel has incorporated the following two commits:

[PATCH] tracing/ring-buffer: Have polling block on watermark (kernel.org) 

[RFC PATCH 1/1] tracing: Fix poll() and select() do not work on per_cpu trace_pipe and trace_pipe_raw (kernel.org) 


While the packaged rasdaemon, as hinted by yghannam, lacks the below commit:

rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely · mchehab/rasdaemon@6986d81 · GitHub 

As a result, the buffer_percent file in tracefs (/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent) retains its default value of 50.

Consequently, the poll() that rasdaemon performs on per_cpu/cpuX/trace_pipe_raw in tracefs blocks indefinitely, and rasdaemon does not output decoded error information.
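The state described above can be checked before applying any workaround. The following is a minimal sketch, not part of rasdaemon: check_buffer_percent is a hypothetical helper name, and the default path is the one quoted in this report. It relies on the tracefs semantics that a buffer_percent of 0 wakes waiters as soon as any data is available, while a non-zero value makes them wait until the ring buffer is that full.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with rasdaemon): report whether the
# ring-buffer watermark would keep poll() on trace_pipe_raw waiting.
# $1: path to buffer_percent; defaults to the path quoted in this report.
# Returns 0 if the watermark is 0, 1 if it is non-zero, 2 if unreadable.
check_buffer_percent() {
    bp_file="${1:-/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent}"
    if [ ! -r "$bp_file" ]; then
        echo "cannot read $bp_file (not root, or instance not created?)" >&2
        return 2
    fi
    bp_value=$(cat "$bp_file")
    if [ "$bp_value" = "0" ]; then
        echo "buffer_percent=0: waiters wake as soon as data is available"
        return 0
    fi
    echo "buffer_percent=$bp_value: poll() waits until the buffer is ${bp_value}% full"
    return 1
}
```

Run as root, `check_buffer_percent` with no argument inspects the rasdaemon instance; a return status of 1 matches the failing case described in this report.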

 

Workaround:

rasdaemon can be used on SLES15-SP5 with the following workaround:

$ echo 0 > /sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent

$ systemctl restart rasdaemon.service

 ... rasdaemon captures logs (attached in AMD-133) ...



With this workaround, rasdaemon should log the decoded error information in the journal:

journalctl -f -u rasdaemon.service
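The workaround steps above can be wrapped in a small helper so that a failed write is reported rather than silently ignored. This is only a sketch: set_buffer_percent_zero is a hypothetical name, the default path is the one quoted in this report, and the function assumes it is run as root. Restarting the service is left to the caller so that the order given above is preserved.

```shell
#!/bin/sh
# Hypothetical helper: lower the rasdaemon ring-buffer watermark to 0 and
# confirm that the kernel accepted the new value.
# $1: path to buffer_percent; defaults to the path quoted in this report.
set_buffer_percent_zero() {
    bp_file="${1:-/sys/kernel/debug/tracing/instances/rasdaemon/buffer_percent}"
    if [ ! -w "$bp_file" ]; then
        echo "cannot write $bp_file (run as root?)" >&2
        return 1
    fi
    echo 0 > "$bp_file" || return 1
    # Read the value back instead of assuming the write took effect.
    if [ "$(cat "$bp_file")" != "0" ]; then
        echo "buffer_percent was not updated" >&2
        return 1
    fi
    echo "buffer_percent set to 0"
}
```

A possible invocation, following the same order as the manual steps, is `set_buffer_percent_zero && systemctl restart rasdaemon.service`.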


Please note that this issue is present only in the packaged version of rasdaemon, i.e. 0.6.7.
It should not be present in the latest version of rasdaemon, i.e. 0.8.0.
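As a rough aid, the installed version can be compared against the versions named above. This is a sketch under stated assumptions: the 0.8.0 threshold is taken from this report only, version_lt and may_need_workaround are hypothetical helper names, and a distribution package may carry the fix as a backport without changing its version, so the result is a hint rather than proof.

```shell
#!/bin/sh
# version_lt A B: succeeds when version A sorts strictly before version B.
# Relies on version sort (`sort -V`), as provided by GNU coreutils.
version_lt() {
    [ "$1" != "$2" ] &&
        [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# may_need_workaround VERSION: hypothetical helper. The 0.8.0 threshold is
# an assumption based on this report, not on the rasdaemon changelog.
may_need_workaround() {
    if version_lt "$1" "0.8.0"; then
        echo "rasdaemon $1: may lack the poll() fix, check buffer_percent"
    else
        echo "rasdaemon $1: expected to include the poll() fix"
    fi
}
```

On an RPM-based system the version could be supplied with `may_need_workaround "$(rpm -q --queryformat '%{VERSION}' rasdaemon)"`.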

 

Based on the above, SUSE needs to backport the following patch:

rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely · mchehab/rasdaemon@6986d81 · GitHub
Comment 1 Muralidhara MK 2024-06-25 07:21:15 UTC
Hi,

The patch below has been accepted upstream in https://github.com/mchehab/rasdaemon/

ced615c rasdaemon: Add error decoding for MCA_CTL_SMU extended bits

 

Please backport the pending patch mentioned in AMD-133.
Comment 4 Muralidhara MK 2024-07-16 06:33:43 UTC
Hi,

There is a minor enhancement to a patch already upstreamed in rasdaemon, "ced615c rasdaemon: Add error decoding for MCA_CTL_SMU extended bits".

The enhancement patch is:

73d8177 rasdaemon: mce-amd-smca: Optimizing decoding of MCA_CTL_SMU bits




Could you please merge the patches below?

For polling and capturing logs:
6986d81 rasdaemon: Fix poll() on per_cpu trace_pipe_raw blocks indefinitely

For new GFX bank error decoding support:
ced615c rasdaemon: Add error decoding for MCA_CTL_SMU extended bits
73d8177 rasdaemon: mce-amd-smca: Optimizing decoding of MCA_CTL_SMU bits
Comment 5 Arvin Schnell 2024-07-19 10:36:36 UTC
Added the three patches in https://build.suse.de/request/show/339253.