Bug 1215910

Summary: GPU hang after kernel update to 6.5.4-1-default
Product: [openSUSE] openSUSE Tumbleweed
Reporter: Gabriel Krisman Bertazi <gabriel.bertazi>
Component: Kernel
Assignee: openSUSE Kernel Bugs <kernel-bugs>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Normal
Priority: P5 - None
CC: gabriel.bertazi
Version: Current
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: kernel log

Description Gabriel Krisman Bertazi 2023-10-03 13:44:03 UTC
After updating the Tumbleweed kernel from 6.4.12-1-default to 6.5.4-1-default, I get a GPU hang followed by a failed recovery (graphics are lost completely) a few seconds after GNOME loads, *only* when connected to an external monitor over HDMI. This is a T14s Gen 2 laptop (AMD version).

I'm running with the latest firmware provided by Lenovo.

After reverting to 6.4.12-1-default, the issue seems to go away.

This is the graphics card:

06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev d1) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 5095
	Flags: bus master, fast devsel, latency 0, IRQ 82, IOMMU group 16
	Memory at 860000000 (64-bit, prefetchable) [size=256M]
	Memory at 870000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 1000 [size=256]
	Memory at fd300000 (32-bit, non-prefetchable) [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu


The relevant part of the log follows. The full dmesg is attached.

kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:70:crtc-1] hw_done or flip_done timed out
kernel: amdgpu 0000:06:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
kernel: amdgpu 0000:06:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
kernel: amdgpu 0000:06:00.0: amdgpu: MODE2 reset
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [drm] PCIE GART of 1024M enabled.
kernel: [drm] PTB located at 0x000000F43FC00000
kernel: [drm] PSP is resuming...
kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
kernel: amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
kernel: [drm] DMUB hardware initialized: version=0x01010027
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 9 PID: 7334 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn21/dcn21_link_encoder.c:215 dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu]
kernel: Modules linked in: hid_apple apple_mfi_fastcharge snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi r8153_ecm cdc_ether usbnet r8152 mii usbhid uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device tun ccm af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject mhi_wwan_mbim mhi_wwan_ctrl nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter qrtr binfmt_misc iwlmvm nls_iso8859_1 nls_cp437 vfat fat mac80211 snd_acp3x_rn snd_acp3x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp libarc4 snd_sof snd_ctl_led snd_hda_codec_realtek snd_sof_utils snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core uvcvideo videobuf2_vmalloc
Oct 03 09:15:22  kernel:  snd_hda_intel uvc intel_rapl_msr snd_compress videobuf2_memops snd_intel_dspcfg snd_intel_sdw_acpi videobuf2_v4l2 intel_rapl_common snd_pcm_dmaengine edac_mce_amd snd_hda_codec snd_pci_ps videodev snd_rpl_pci_acp6x snd_acp_pci snd_hda_core kvm_amd snd_pci_acp6x iwlwifi r8169 snd_hwdep videobuf2_common snd_pci_acp5x realtek kvm snd_pcm snd_rn_pci_acp3x think_lmi snd_acp_config mdio_devres mc mhi_pci_generic cfg80211 irqbypass pcspkr thinkpad_acpi firmware_attributes_class wmi_bmof efi_pstore tiny_power_button snd_soc_acpi ledtrig_audio snd_pci_acp3x k10temp mhi snd_timer libphy platform_profile i2c_piix4 snd thermal ac soundcore amd_pmc joydev button fuse configfs dmi_sysfs ip_tables x_tables cmac algif_hash algif_skcipher af_alg dm_crypt essiv authenc trusted asn1_encoder tee bnep btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic rfkill amdgpu crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul i2c_algo_bit drm_ttm_helper ghash_clmulni_intel sha512_ssse3 ttm drm_suballoc_helper
Oct 03 09:15:22  kernel:  amdxcp iommu_v2 xhci_pci drm_buddy xhci_pci_renesas gpu_sched nvme hid_multitouch drm_display_helper xhci_hcd nvme_core aesni_intel ucsi_acpi cec typec_ucsi video hid_generic crypto_simd cryptd usbcore roles ccp rc_core sp5100_tco typec t10_pi battery wmi i2c_hid_acpi i2c_hid serio_raw btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod scsi_common msr efivarfs
kernel: CPU: 9 PID: 7334 Comm: kworker/u32:18 Tainted: G        W          6.5.4-1-default #1 openSUSE Tumbleweed 33e9043b6169f387e828626939b31ae921c14ccd
kernel: Hardware name: LENOVO 20XGS0P30C/20XGS0P30C, BIOS R1NET57W (1.27) 05/15/2023
kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
kernel: RIP: 0010:dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu]
kernel: Code: b6 89 8b 00 00 00 e8 d8 fa 09 00 b8 01 00 00 00 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 47 48 83 c4 10 5b e9 79 ea 6d f8 <0f> 0b 31 c0 eb e0 0f 0b 48 8b 53 60 48 8b 43 68 41 b9 01 00 00 00
kernel: RSP: 0018:ffffa380d7b936e0 EFLAGS: 00010246
kernel: RAX: 0000000000163333 RBX: ffff9103fca81200 RCX: 0000000000000011
kernel: RDX: 0000000000000000 RSI: 0000000000001638 RDI: ffff90ffa3280000
kernel: RBP: ffffa380d7b93818 R08: ffffa380d7b936e4 R09: ffffa380d7b93708
kernel: R10: 0000001400000002 R11: 0000000000000010 R12: 0000000000000008
kernel: R13: 0000000000000008 R14: 0000000000000000 R15: ffff9104fff43900
kernel: FS:  0000000000000000(0000) GS:ffff91064ee80000(0000) knlGS:0000000000000000
kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fb1780432d8 CR3: 0000000215de2000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel:  <TASK>
kernel:  ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? __warn+0x81/0x130
kernel:  ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? report_bug+0x171/0x1a0
kernel:  ? handle_bug+0x3c/0x80
kernel:  ? exc_invalid_op+0x17/0x70
kernel:  ? asm_exc_invalid_op+0x1a/0x20
kernel:  ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dcn21_link_encoder_enable_dp_mst_output+0x1b/0x40 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  enable_dio_dp_link_output+0x41/0x80 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dce110_enable_dp_link_output+0x257/0x270 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dp_enable_link_phy+0x4d/0x90 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  perform_link_training_with_retries+0x1e1/0x560 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? core_link_write_dpcd+0x8f/0x100 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  enable_link_dp+0x13a/0x2c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? core_link_write_dpcd+0x8f/0x100 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  link_set_dpms_on+0xb54/0xca0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dce110_apply_ctx_to_hw+0x4fb/0x6c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dc_commit_state_no_check+0x3cd/0xed0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dc_commit_streams+0x29b/0x400 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  dm_resume+0x44e/0x7c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? srso_alias_return_thunk+0x5/0x7f
kernel:  ? _dev_info+0x70/0x90
kernel:  amdgpu_device_ip_resume_phase2+0x52/0xc0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
PackageKit[7629]: get-updates transaction /54_ecedccda from uid 1000 finished with success after 4476ms
kernel:  amdgpu_do_asic_reset+0x4ca/0x730 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  amdgpu_device_gpu_recover+0x50f/0xce0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? __drm_err+0x20/0xa0
kernel:  amdgpu_job_timedout+0x151/0x240 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel:  ? finish_task_switch.isra.0+0x94/0x2f0
kernel:  drm_sched_job_timedout+0x6a/0x100 [gpu_sched 27f0dfbe2a70efedf8e6c5f739524c2be6a0dc79]
kernel:  process_one_work+0x21d/0x430
kernel:  worker_thread+0x4e/0x3b0
kernel:  ? __pfx_worker_thread+0x10/0x10
kernel:  kthread+0xe8/0x120
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork+0x34/0x50
kernel:  ? __pfx_kthread+0x10/0x10
kernel:  ret_from_fork_asm+0x1b/0x30
kernel:  </TASK>
kernel: ---[ end trace 0000000000000000 ]---
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm] enabling link 2 failed: 15
kernel: [drm] kiq ring mec 2 pipe 1 q 0
kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -110
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
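
The outcome of the reset attempt in the excerpt above can be pulled out mechanically. The following is only a minimal sketch: the function name summarize_reset and the small errno table are illustrative and not part of any amdgpu tooling, and the patterns match only the two message formats quoted in this report. On Linux, errno 110 is ETIMEDOUT, which matches the "timed out" messages earlier in the log.

```python
import re

# Linux errno values seen in the log above (110 is ETIMEDOUT on Linux).
# This table is deliberately tiny and hard-coded for illustration.
ERRNO_NAMES = {110: "ETIMEDOUT"}

RESET_END = re.compile(r"GPU reset end with ret = (-?\d+)")
RECOVERY_FAILED = re.compile(r"GPU Recovery Failed: (-?\d+)")


def summarize_reset(lines):
    """Return (ret_code, errno_name, recovery_failed) for a log excerpt."""
    ret = None
    failed = False
    for line in lines:
        match = RESET_END.search(line)
        if match:
            ret = int(match.group(1))
        if RECOVERY_FAILED.search(line):
            failed = True
    # Kernel return codes are negative errno values.
    name = ERRNO_NAMES.get(-ret, "unknown") if ret is not None and ret < 0 else None
    return ret, name, failed


sample = [
    "kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed",
    "kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -110",
    "kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110",
]
print(summarize_reset(sample))
```
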
Comment 1 Gabriel Krisman Bertazi 2023-10-03 13:48:59 UTC
Ah, after checking the logs a bit more, I see a bunch of these a few seconds before the hang. Maybe this is what caused the GPU reset?

kernel: amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:5 pasid:32772, for process firefox pid 3410 thread firefox:cs0 pid 3489)
kernel: amdgpu 0000:06:00.0: amdgpu:   in page starting at address 0x00003ffb78559000 from IH client 0x1b (UTCL2)
kernel: amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00500431
kernel: amdgpu 0000:06:00.0: amdgpu:          Faulty UTCL2 client ID: IA (0x2)
kernel: amdgpu 0000:06:00.0: amdgpu:          MORE_FAULTS: 0x1
kernel: amdgpu 0000:06:00.0: amdgpu:          WALKER_ERROR: 0x0
kernel: amdgpu 0000:06:00.0: amdgpu:          PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:06:00.0: amdgpu:          MAPPING_ERROR: 0x0
kernel: amdgpu 0000:06:00.0: amdgpu:          RW: 0x0


This one comes from the firefox process, but I've seen the crash even when Firefox wasn't involved.
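
For reference, the decoded fields in the fault message can be reproduced from the raw VM_L2_PROTECTION_FAULT_STATUS value. This is a rough sketch: the bit positions below are an assumption chosen to be consistent with the values the driver printed above (including vmid:5 and client ID 0x2), and should be checked against the register headers for this GPU generation before being relied on. The name decode_fault_status is hypothetical.

```python
# Assumed bit layout: name -> (shift, width). Inferred from the values the
# driver printed in this report, not copied from the register headers.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),  # the "Faulty UTCL2 client ID"
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a raw status word into the named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00500431)
for name, value in decoded.items():
    print(f"{name}: {value:#x}")
```

With the value from the log, this yields PERMISSION_FAULTS 0x3, CID 0x2 and VMID 0x5, matching the driver's own output.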
Comment 2 Gabriel Krisman Bertazi 2023-10-03 13:52:21 UTC
Created attachment 869880
kernel log
Comment 3 Gabriel Krisman Bertazi 2023-10-03 13:55:34 UTC
For completeness, you might see a few of these:

kernel: WARNING: CPU: 11 PID: 3081 at drivers/acpi/platform_profile.c:74 platform_profile_show+0xa6/0xd0 [platform_profile]

during boot in the log from Comment 2. They have been there since I installed TW and are most likely unrelated to this issue (they belong in a separate kernel or firmware bug report). They seemed harmless, so I forgot to investigate or report them.
Comment 4 Frank Krüger 2023-10-03 19:57:41 UTC
Just as a wild guess: Could you boot with amdgpu.mcbp=0?
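
After rebooting with the parameter, it is worth confirming that it actually reached the kernel. A minimal sketch, assuming the command line is read from /proc/cmdline on the running system; the helper name module_param_from_cmdline and the example command line are made up for illustration.

```python
def module_param_from_cmdline(cmdline, name):
    """Return the last value given for `name` on a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # a later occurrence overrides an earlier one
    return value


# Hypothetical command line; on a live system use:
#   open("/proc/cmdline").read()
example = "BOOT_IMAGE=/boot/vmlinuz root=/dev/mapper/system-root splash=silent amdgpu.mcbp=0 quiet"
print(module_param_from_cmdline(example, "amdgpu.mcbp"))
```
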
Comment 5 Frank Krüger 2023-10-10 19:24:40 UTC
No reply from the reporter so far. If amdgpu.mcbp=0 solves the issue, then kernel 6.5.6 has the fix:

commit 2c4cc4d787a5f332f2c61f12cdb31e01da386439
Author: Jiadong Zhu <Jiadong.Zhu@amd.com>
Date:   Wed Jul 26 15:21:48 2023 +0800

drm/amdgpu: set completion status as preempted for the resubmission
Comment 6 Gabriel Krisman Bertazi 2023-10-10 19:35:03 UTC
(In reply to Frank Krüger from comment #5)
> No reply from the reporter so far. If amdgpu.mcbp=0 solves the issue, then
> kernel 6.5.6 has the fix:
> 
> commit 2c4cc4d787a5f332f2c61f12cdb31e01da386439
> Author: Jiadong Zhu <Jiadong.Zhu@amd.com>
> Date:   Wed Jul 26 15:21:48 2023 +0800
> 
> drm/amdgpu: set completion status as preempted for the resubmission

Apologies. This is a workstation and I haven't been able to try it yet. I will try to do so after hours today.
Comment 7 Gabriel Krisman Bertazi 2023-10-11 16:27:27 UTC
Just tried 6.5.6-1-default and it fixed the issue. Thanks. Closing.
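
For anyone landing here later, a quick way to check whether a given kernel release string is at or past 6.5.6 is sketched below. It compares version numbers only: it says nothing about distribution backports, nor about older series such as 6.4.12, which did not show the hang. The helper names kernel_version and at_least are illustrative.

```python
import re


def kernel_version(release):
    """Parse the leading 'major.minor.patch' of a release string into a tuple."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if not match:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def at_least(release, minimum):
    """True if `release` is numerically at or above the `minimum` tuple."""
    return kernel_version(release) >= minimum


# On a live system, platform.release() returns the running kernel's string.
for release in ("6.5.4-1-default", "6.5.6-1-default"):
    print(release, at_least(release, (6, 5, 6)))
```
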