Bug 1215470 - amdgpu no-retry page fault
Status: RESOLVED FIXED
Duplicates: 1215695
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: x86-64 Other
Importance: P3 - Medium : Major with 5 votes
Target Milestone: ---
Assignee: Thomas Zimmermann
QA Contact: E-mail List
Reported: 2023-09-19 06:55 UTC by Marco Varlese
Modified: 2023-10-25 13:53 UTC
CC List: 6 users

Attachments
amdgpu journalctl logs (20.80 KB, text/plain)
2023-09-19 06:55 UTC, Marco Varlese

Description Marco Varlese 2023-09-19 06:55:25 UTC
Created attachment 869588
amdgpu journalctl logs

With the latest available kernel (6.5.2-1-default), rendering breaks at seemingly random times while using GNOME and any application (e.g. Chrome or Thunderbird). The journalctl logs show the errors included in the attached file.

The issue is *not* present in kernel 6.4.12-1-default which I am now using as a backup plan.
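
For anyone who wants to check whether their own machine logs the same kind of error, here is a minimal Python sketch that filters kernel journal lines. It assumes the relevant messages mention both "amdgpu" and "page fault"; the exact wording in the attached log may differ, and the helper names (find_page_faults, kernel_journal) are made up for illustration.

```python
import re
import subprocess

# Assumption: messages of interest contain "amdgpu" followed by "page fault".
# This pattern is a guess based on the bug title, not on the attached log.
PAGE_FAULT = re.compile(r"amdgpu.*page fault", re.IGNORECASE)


def find_page_faults(lines):
    """Return the lines that look like amdgpu page-fault reports."""
    return [line.rstrip("\n") for line in lines if PAGE_FAULT.search(line)]


def kernel_journal(boot="0"):
    """Read kernel messages for one boot via journalctl (requires systemd)."""
    result = subprocess.run(
        ["journalctl", "-k", "-b", boot, "--no-pager"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

Calling find_page_faults(kernel_journal()) on an affected system should list the matching lines for the current boot, if the assumed pattern fits the actual messages.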
Comment 1 Thomas Zimmermann 2023-09-21 09:19:52 UTC
Hi, do you have the means to test certain patches?

I found 

  e77673d14f2c ("drm/amdgpu: Update invalid PTE flag setting")

in v6.6-rc1 and

  edcfe22985d0 ("drm/amdkfd: Insert missing TLB flush on GFX10 and later")

in v6.6-rc2.

The former looks promising as a fix.
Comment 2 Marco Varlese 2023-09-21 12:58:51 UTC
(In reply to Thomas Zimmermann from comment #1)
> Hi, do you have the means to test certain patches?
> 
> I found 
> 
>   e77673d14f2c ("drm/amdgpu: Update invalid PTE flag setting")
> 
> in v6.6-rc1 and
> 
>   edcfe22985d0 ("drm/amdkfd: Insert missing TLB flush on GFX10 and later")
> 
> in v6.6-rc2.
> 
> The former looks promising as a fix.

The problem with testing is that this issue is happening on my work-laptop which I need up-and-running (for obvious reasons). :(

An option I see is to wait for kernel 6.6 to land in factory and eventually TW, get it via "zypper dup" and see if that helps with the bug. Or hopefully the fix lands on some bug-fix release of 6.5?

Would you be happy enough with that?
Comment 3 Thomas Zimmermann 2023-09-21 15:23:59 UTC
(In reply to Marco Varlese from comment #2)
> (In reply to Thomas Zimmermann from comment #1)
> > Hi, do you have the means to test certain patches?
> > 
> > I found 
> > 
> >   e77673d14f2c ("drm/amdgpu: Update invalid PTE flag setting")
> > 
> > in v6.6-rc1 and
> > 
> >   edcfe22985d0 ("drm/amdkfd: Insert missing TLB flush on GFX10 and later")
> > 
> > in v6.6-rc2.
> > 
> > The former looks promising as a fix.
> 
> The problem with testing is that this issue is happening on my work-laptop
> which I need up-and-running (for obvious reasons). :(
> 
> An option I see is to wait for kernel 6.6 to land in factory and eventually
> TW, get it via "zypper dup" and see if that helps with the bug. Or hopefully
> the fix lands on some bug-fix release of 6.5?
> 
> Would you be happy enough with that?

I haven't seen these patches in linux-stable (yet). I can attempt to backport them into TW. I'll also try to reproduce this locally.
Comment 4 Thomas Zimmermann 2023-09-25 11:31:08 UTC
> 
>   e77673d14f2c ("drm/amdgpu: Update invalid PTE flag setting")
> 

FYI I have sent this patch for inclusion in the stable branch.
Comment 5 Eyad Issa 2023-09-29 17:00:02 UTC
*** Bug 1215695 has been marked as a duplicate of this bug. ***
Comment 6 Eyad Issa 2023-09-29 19:43:31 UTC
Apparently kernel 6.5.5 (from build.o.o devel project) fixed it for me.


~> LANG=C sudo zypper info kernel-default
...
Information for package kernel-default:
---------------------------------------
Repository     : Kernel builds for branch stable (standard)
Name           : kernel-default
Version        : 6.5.5-2.1.g6cf5261
Arch           : x86_64
Vendor         : obs://build.opensuse.org/Kernel
Installed Size : 248.2 MiB
Installed      : Yes
Status         : up-to-date
Source package : kernel-default-6.5.5-2.1.g6cf5261.nosrc
Upstream URL   : https://www.kernel.org/
Summary        : The Standard Kernel
Description    : 
    The standard kernel for both uniprocessor and multiprocessor systems.


    Source Timestamp: 2023-09-25 10:19:02 +0000
    GIT Revision: 6cf5261da0ebc2ca4f200ee6fe0fde9d6c3eff3e
    GIT Branch: stable
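
The kernel releases mentioned in this report can be compared with a small Python sketch like the one below. It only encodes the data points from this thread (6.4.12 working, 6.5.2 broken, 6.5.5 and 6.5.8 fixed). Treating every 6.5.x release from 6.5.5 onwards as fixed is an assumption, and the function names are hypothetical.

```python
import re


def kernel_version(release):
    """Parse the leading 'major.minor.patch' of a kernel release string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def classify(release):
    """Classify a release using only the versions reported in this bug."""
    version = kernel_version(release)
    if version == (6, 4, 12):
        return "reported working"
    if version == (6, 5, 2):
        return "reported broken"
    # Assumption: 6.5.x releases after 6.5.5 keep the fix
    # (6.5.5 and 6.5.8 are the ones actually reported).
    if (6, 5, 5) <= version < (6, 6, 0):
        return "reported fixed"
    return "no data in this report"


for release in ("6.4.12-1-default", "6.5.2-1-default", "6.5.5-2.1.g6cf5261"):
    print(release, "->", classify(release))
```

Releases such as 6.5.3 or 6.5.4 deliberately fall into "no data in this report", since nobody in the thread tested them.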
Comment 7 Marco Varlese 2023-10-25 13:53:45 UTC
I confirm - having run kernel 6.5.8-1-default for some time now - that the bug is no longer there.

I think we can close this bug as resolved.

Thank you for looking into this and fixing it so promptly!