Bug 1219444 - amdgpu critical error
Summary: amdgpu critical error
Status: NEW
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: Leap 15.5
Hardware: x86-64 openSUSE Leap 15.5
Priority: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-01 12:11 UTC by Teuniz XXX
Modified: 2024-02-12 14:16 UTC
CC List: 2 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Output of /var/log/messages (103.84 KB, text/plain)
2024-02-01 12:11 UTC, Teuniz XXX

Description Teuniz XXX 2024-02-01 12:11:18 UTC
Created attachment 872372
Output of /var/log/messages

The kernel crashes approximately every 5 minutes.
I reverted to kernel 5.14.21-150500.55.19-default because with that one it crashes only about once a day.


Operating System: openSUSE Leap 15.5
KDE Plasma Version: 5.27.9
KDE Frameworks Version: 5.103.0
Qt Version: 5.15.8
Kernel Version: 5.14.21-150500.55.44-default (64-bit)
Graphics Platform: X11
Processors: 32 × 13th Gen Intel Core i9-13900K
Memory: 31.0 GiB of RAM
Graphics Processor: AMD Radeon Pro W6600
Manufacturer: HP
Product Name: HP Z2 Tower G9 Workstation Desktop PC


dmesg | grep amdgpu

[    1.540640] [drm] amdgpu kernel modesetting enabled.
[    1.540703] amdgpu: CRAT table not found
[    1.540705] amdgpu: Virtual CRAT table created for CPU
[    1.540712] amdgpu: Topology: Add CPU node
[    1.542670] amdgpu 0000:03:00.0: amdgpu: Fetched VBIOS from VFCT
[    1.542671] amdgpu: ATOM BIOS: 113-D5330400-100
[    1.542770] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    1.542771] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    1.542799] amdgpu 0000:03:00.0: amdgpu: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    1.542800] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    1.542801] amdgpu 0000:03:00.0: amdgpu: AGP: 267894784M 0x0000008400000000 - 0x0000FFFFFFFFFFFF
[    1.542845] [drm] amdgpu: 8176M of VRAM memory ready
[    1.542845] [drm] amdgpu: 15892M of GTT memory ready.
[    1.548699] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    1.548704] amdgpu 0000:03:00.0: amdgpu: PSP runtime database doesn't exist
[    2.854516] amdgpu 0000:03:00.0: amdgpu: STB initialized to 2048 entries
[    2.895100] amdgpu 0000:03:00.0: amdgpu: Will use PSP to load VCN firmware
[    3.094413] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    3.115717] amdgpu 0000:03:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[    3.115740] amdgpu 0000:03:00.0: amdgpu: smu driver if version = 0x0000000f, smu fw if version = 0x00000013, smu fw program = 0, version = 0x003b2b00 (59.43.0)
[    3.115745] amdgpu 0000:03:00.0: amdgpu: SMU driver if version not matched
[    3.115777] amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
[    3.165133] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[    3.268063] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    3.268478] amdgpu: sdma_bitmap: ffff
[    3.302091] amdgpu: HMM registered 8176MB device memory
[    3.302135] amdgpu: SRAT table not found
[    3.302136] amdgpu: Virtual CRAT table created for GPU
[    3.302599] amdgpu: Topology: Add dGPU node [0x73e3:0x1002]
[    3.302601] kfd kfd: amdgpu: added device 1002:73e3
[    3.302617] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 8, active_cu_number 28
[    3.302658] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    3.302659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    3.302659] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    3.302660] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    3.302660] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    3.302661] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    3.302661] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    3.302662] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    3.302662] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    3.302663] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    3.302663] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    3.302664] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    3.302665] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec_0 uses VM inv eng 0 on hub 1
[    3.302665] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.0 uses VM inv eng 1 on hub 1
[    3.302666] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc_0.1 uses VM inv eng 4 on hub 1
[    3.302666] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[    3.303573] [drm] Initialized amdgpu 3.49.0 20150101 for 0000:03:00.0 on minor 0
[    3.308709] fbcon: amdgpudrmfb (fb0) is primary device
[    3.505728] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[    3.505731] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[    3.505733] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x0000073A
[    3.505733] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: DCEDMC (0x3)
[    3.505734] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x0
[    3.505735] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x5
[    3.505735] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[    3.505735] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x1
[    3.505736] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
[    3.524299] amdgpu 0000:03:00.0: [drm] fb0: amdgpudrmfb frame buffer device
[    4.537456] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    5.416287] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[    5.416312] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[    5.416319] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x00000000
[    5.416324] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: unknown (0x0)
[    5.416329] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x0
[    5.416333] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x0
[    5.416336] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x0
[    5.416340] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x0
[    5.416343] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
[   73.156519] amdgpu 0000:03:00.0: amdgpu: [mmhub] page fault (src_id:0 ring:157 vmid:0 pasid:0, for process  pid 0 thread  pid 0)
[   73.156538] amdgpu 0000:03:00.0: amdgpu:   in page starting at address 0x0000000006004000 from client 0x12 (VMC)
[   73.156546] amdgpu 0000:03:00.0: amdgpu: MMVM_L2_PROTECTION_FAULT_STATUS:0x0000073A
[   73.156551] amdgpu 0000:03:00.0: amdgpu:      Faulty UTCL2 client ID: DCEDMC (0x3)
[   73.156562] amdgpu 0000:03:00.0: amdgpu:      MORE_FAULTS: 0x0
[   73.156566] amdgpu 0000:03:00.0: amdgpu:      WALKER_ERROR: 0x5
[   73.156570] amdgpu 0000:03:00.0: amdgpu:      PERMISSION_FAULTS: 0x3
[   73.156578] amdgpu 0000:03:00.0: amdgpu:      MAPPING_ERROR: 0x1
[   73.156582] amdgpu 0000:03:00.0: amdgpu:      RW: 0x0
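
The per-field lines printed under each page fault can be recovered from the raw MMVM_L2_PROTECTION_FAULT_STATUS word. Below is a minimal Python sketch, assuming a bit layout (MORE_FAULTS in bit 0, WALKER_ERROR in bits 1-3, PERMISSION_FAULTS in bits 4-7, MAPPING_ERROR in bit 8, client ID in bits 9-17, RW in bit 18) that is inferred here rather than quoted from the driver source. The FIELDS table and the decode_fault_status name are illustrative only.

```python
# Assumed bit layout of MMVM_L2_PROTECTION_FAULT_STATUS: name -> (shift, width).
# This layout is an assumption, not copied from the amdgpu register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID" shown in the log
    "RW": (18, 1),
}


def decode_fault_status(status: int) -> dict:
    """Split a raw status word into its (assumed) bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


if __name__ == "__main__":
    # Status value logged at 3.505733 and 73.156546 above.
    for name, value in decode_fault_status(0x0000073A).items():
        print(f"{name}: {value:#x}")
```

Under that assumed layout, 0x0000073A decodes to the same field values as the log lines above (WALKER_ERROR 0x5, PERMISSION_FAULTS 0x3, MAPPING_ERROR 0x1, client ID 0x3), while the all-zero status at 5.416319 decodes to all-zero fields.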

uname -a

5.14.21-150500.55.44-default #1 SMP PREEMPT_DYNAMIC Mon Jan 15 10:03:40 UTC 2024 (cc7d8b6) x86_64 x86_64 x86_64 GNU/Linux

lspci

VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Navi 23 WKS-XL [Radeon PRO W6600]
Comment 1 Takashi Iwai 2024-02-02 10:20:15 UTC
Please check with the recent upstream kernel, e.g. in OBS Kernel:stable:Backport repo:
  http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/

If the problem persists, we'd need to report it to the upstream devs.
Comment 2 Teuniz XXX 2024-02-02 12:45:17 UTC
Thanks, I just installed kernel

6.7.3-lp155.2.g0fa3c9e-default #1 SMP PREEMPT_DYNAMIC Thu Feb  1 05:38:11 UTC 2024 (0fa3c9e) x86_64 x86_64 x86_64 GNU/Linux

from that repo you mentioned and it booted without any error messages.
I'll continue to use this kernel and I'll let you know next week how it goes.

Have a nice weekend.
Comment 3 Teuniz XXX 2024-02-09 08:46:30 UTC
After one week of testing, it seems that kernel 
6.7.3-lp155.2.g0fa3c9e-default solves the problem.
I haven't noticed any error messages or instabilities.
Thank you for pointing me to that repo!

The only downside of that kernel is that I can't run VirtualBox. I need to run sudo /usr/sbin/vboxconfig, which tries to compile the vboxdrv kernel module but exits with an error because the newer kernel was built with GCC 13, while the installed compiler is GCC 7.5.

Output of /var/log/virtualbox.log:

=== Building 'vboxdrv' module ===
make[1]: Entering directory '/usr/src/kernel-modules/virtualbox/src/vboxdrv'
make V= CONFIG_MODULE_SIG= CONFIG_MODULE_SIG_ALL= -C /lib/modules/6.7.3-lp155.2.g0fa3c9e-default/build M=/usr/src/kernel-modules/virtualbox/src/vboxdrv SRCROOT=/usr/src/kernel-modules/virtualbox/src/vboxdrv -j32 modules
make[2]: Entering directory '/usr/src/linux-6.7.3-lp155.2.g0fa3c9e-obj/x86_64/default'
warning: the compiler differs from the one used to build the kernel
  The kernel was built by: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
  You are using:           gcc (SUSE Linux) 7.5.0
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/linux/SUPDrv-linux.o
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrv.o
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvGip.o
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvSem.o
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvTracer.o
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPLibAll.o
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/linux/SUPDrv-linux.o] Error 1
make[4]: *** Waiting for unfinished jobs....
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvGip.o] Error 1
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrv.o] Error 1
  CC [M]  /usr/src/kernel-modules/virtualbox/src/vboxdrv/common/string/strformatrt.o
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvSem.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPDrvTracer.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/SUPLibAll.o] Error 1
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
make[4]: *** [/usr/src/linux-6.7.3-lp155.2.g0fa3c9e/scripts/Makefile.build:244: /usr/src/kernel-modules/virtualbox/src/vboxdrv/common/string/strformatrt.o] Error 1
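
The log above repeats the same failure once per object file. As a rough aid for reading such logs, here is a small Python sketch that pulls out the two compiler versions and the distinct rejected options. The function name summarize_build_log and the regular expressions are illustrative assumptions based only on the message wording seen in this log.

```python
import re

# Patterns follow the wording seen in the log above; other tool versions
# may phrase these messages differently (assumption).
BUILT_BY = re.compile(r"The kernel was built by:\s*(.+)")
USING = re.compile(r"You are using:\s*(.+)")
BAD_OPTION = re.compile(r"unrecognized command line option [‘'](.+?)[’']")


def summarize_build_log(text: str) -> dict:
    """Report the kernel compiler, the local compiler and rejected options."""
    built_by = BUILT_BY.search(text)
    using = USING.search(text)
    return {
        "kernel_compiler": built_by.group(1).strip() if built_by else None,
        "local_compiler": using.group(1).strip() if using else None,
        "rejected_options": sorted(set(BAD_OPTION.findall(text))),
    }


# Three lines copied from the log above.
SAMPLE = """\
  The kernel was built by: gcc (SUSE Linux) 13.2.1 20230912 [revision b96e66fd4ef3e36983969fb8cdd1956f551a074b]
  You are using:           gcc (SUSE Linux) 7.5.0
gcc: error: unrecognized command line option ‘-mharden-sls=all’; did you mean ‘-mhard-float’?
"""

if __name__ == "__main__":
    for key, value in summarize_build_log(SAMPLE).items():
        print(f"{key}: {value}")
```

For this log the summary shows a single rejected option, -mharden-sls=all, which the kernel build system passes and GCC 7.5.0 does not recognize.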
Comment 4 Takashi Iwai 2024-02-12 14:16:42 UTC
Yes, the lack of KMPs is unfortunately a known issue with the TW kernel build for Leap.

Honestly speaking, fixing this kind of bug for amdgpu on the SLE15-SP5 kernel is really tough.  It seems to hit only certain models / hardware configs.

You may also try the Leap 15.6 kernel instead of the TW backport kernel; it should be new enough and receive most of the fixes from the latest code.  The vbox driver should be available for Leap 15.6 too, though maybe only at some later point after the kABI freeze.