Bugzilla – Bug 1215910
GPU hang after kernel update to 6.5.4-1-default
Last modified: 2023-10-11 16:27:27 UTC
After updating the TW kernel from 6.4.12-1-default to 6.5.4-1-default, I get a GPU hang, followed by a failed recovery (completely lose graphics) a few seconds after loading GNOME, *only* when connected to an external monitor (HDMI). This is a T14s Gen 2 laptop (AMD version). I'm running with the latest firmware provided by Lenovo. After reverting to 6.4.12-1-default, the issue seems to go away.

This is the graphics card:

06:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Vega Mobile Series] (rev d1) (prog-if 00 [VGA controller])
	Subsystem: Lenovo Device 5095
	Flags: bus master, fast devsel, latency 0, IRQ 82, IOMMU group 16
	Memory at 860000000 (64-bit, prefetchable) [size=256M]
	Memory at 870000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 1000 [size=256]
	Memory at fd300000 (32-bit, non-prefetchable) [size=512K]
	Capabilities: <access denied>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

The relevant part of the log is as follows. Full dmesg attached.

kernel: [drm:amdgpu_dm_atomic_check [amdgpu]] *ERROR* [CRTC:70:crtc-1] hw_done or flip_done timed out
kernel: amdgpu 0000:06:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
kernel: amdgpu 0000:06:00.0: amdgpu: failed to write reg 28b4 wait reg 28c6
kernel: amdgpu 0000:06:00.0: amdgpu: MODE2 reset
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset succeeded, trying to resume
kernel: [drm] PCIE GART of 1024M enabled.
kernel: [drm] PTB located at 0x000000F43FC00000
kernel: [drm] PSP is resuming...
kernel: [drm] reserve 0x400000 from 0xf43f800000 for PSP TMR
kernel: amdgpu 0000:06:00.0: amdgpu: RAS: optional ras ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: RAP: optional rap ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resuming...
kernel: amdgpu 0000:06:00.0: amdgpu: SMU is resumed successfully!
kernel: [drm] DMUB hardware initialized: version=0x01010027
kernel: ------------[ cut here ]------------
kernel: WARNING: CPU: 9 PID: 7334 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn21/dcn21_link_encoder.c:215 dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu]
kernel: Modules linked in: hid_apple apple_mfi_fastcharge snd_usb_audio snd_usbmidi_lib snd_ump snd_rawmidi r8153_ecm cdc_ether usbnet r8152 mii usbhid uinput rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device tun ccm af_packet nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject mhi_wwan_mbim mhi_wwan_ctrl nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter qrtr binfmt_misc iwlmvm nls_iso8859_1 nls_cp437 vfat fat mac80211 snd_acp3x_rn snd_acp3x_pdm_dma snd_soc_dmic snd_sof_amd_rembrandt snd_sof_amd_renoir snd_sof_amd_acp snd_sof_pci snd_sof_xtensa_dsp libarc4 snd_sof snd_ctl_led snd_hda_codec_realtek snd_sof_utils snd_hda_codec_generic snd_hda_codec_hdmi snd_soc_core uvcvideo videobuf2_vmalloc
Oct 03 09:15:22 kernel: snd_hda_intel uvc intel_rapl_msr snd_compress videobuf2_memops snd_intel_dspcfg snd_intel_sdw_acpi videobuf2_v4l2 intel_rapl_common snd_pcm_dmaengine edac_mce_amd snd_hda_codec snd_pci_ps videodev snd_rpl_pci_acp6x snd_acp_pci snd_hda_core kvm_amd snd_pci_acp6x iwlwifi r8169 snd_hwdep videobuf2_common snd_pci_acp5x realtek kvm snd_pcm snd_rn_pci_acp3x think_lmi snd_acp_config mdio_devres mc mhi_pci_generic cfg80211 irqbypass pcspkr thinkpad_acpi firmware_attributes_class wmi_bmof efi_pstore tiny_power_button snd_soc_acpi ledtrig_audio snd_pci_acp3x k10temp mhi snd_timer libphy platform_profile i2c_piix4 snd thermal ac soundcore amd_pmc joydev button fuse configfs dmi_sysfs ip_tables x_tables cmac algif_hash algif_skcipher af_alg dm_crypt essiv authenc trusted asn1_encoder tee bnep btusb btrtl btbcm btintel btmtk bluetooth ecdh_generic rfkill amdgpu crct10dif_pclmul crc32_pclmul polyval_clmulni polyval_generic gf128mul i2c_algo_bit drm_ttm_helper ghash_clmulni_intel sha512_ssse3 ttm drm_suballoc_helper
Oct 03 09:15:22 kernel: amdxcp iommu_v2 xhci_pci drm_buddy xhci_pci_renesas gpu_sched nvme hid_multitouch drm_display_helper xhci_hcd nvme_core aesni_intel ucsi_acpi cec typec_ucsi video hid_generic crypto_simd cryptd usbcore roles ccp rc_core sp5100_tco typec t10_pi battery wmi i2c_hid_acpi i2c_hid serio_raw btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod scsi_common msr efivarfs
kernel: CPU: 9 PID: 7334 Comm: kworker/u32:18 Tainted: G W 6.5.4-1-default #1 openSUSE Tumbleweed 33e9043b6169f387e828626939b31ae921c14ccd
kernel: Hardware name: LENOVO 20XGS0P30C/20XGS0P30C, BIOS R1NET57W (1.27) 05/15/2023
kernel: Workqueue: amdgpu-reset-dev drm_sched_job_timedout [gpu_sched]
kernel: RIP: 0010:dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu]
kernel: Code: b6 89 8b 00 00 00 e8 d8 fa 09 00 b8 01 00 00 00 48 8b 54 24 08 65 48 2b 14 25 28 00 00 00 75 47 48 83 c4 10 5b e9 79 ea 6d f8 <0f> 0b 31 c0 eb e0 0f 0b 48 8b 53 60 48 8b 43 68 41 b9 01 00 00 00
kernel: RSP: 0018:ffffa380d7b936e0 EFLAGS: 00010246
kernel: RAX: 0000000000163333 RBX: ffff9103fca81200 RCX: 0000000000000011
kernel: RDX: 0000000000000000 RSI: 0000000000001638 RDI: ffff90ffa3280000
kernel: RBP: ffffa380d7b93818 R08: ffffa380d7b936e4 R09: ffffa380d7b93708
kernel: R10: 0000001400000002 R11: 0000000000000010 R12: 0000000000000008
kernel: R13: 0000000000000008 R14: 0000000000000000 R15: ffff9104fff43900
kernel: FS: 0000000000000000(0000) GS:ffff91064ee80000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007fb1780432d8 CR3: 0000000215de2000 CR4: 0000000000750ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: <TASK>
kernel: ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? __warn+0x81/0x130
kernel: ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? report_bug+0x171/0x1a0
kernel: ? handle_bug+0x3c/0x80
kernel: ? exc_invalid_op+0x17/0x70
kernel: ? asm_exc_invalid_op+0x1a/0x20
kernel: ? dcn21_link_encoder_acquire_phy+0x117/0x160 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dcn21_link_encoder_enable_dp_mst_output+0x1b/0x40 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: enable_dio_dp_link_output+0x41/0x80 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dce110_enable_dp_link_output+0x257/0x270 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dp_enable_link_phy+0x4d/0x90 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: perform_link_training_with_retries+0x1e1/0x560 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? core_link_write_dpcd+0x8f/0x100 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: enable_link_dp+0x13a/0x2c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? core_link_write_dpcd+0x8f/0x100 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: link_set_dpms_on+0xb54/0xca0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dce110_apply_ctx_to_hw+0x4fb/0x6c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dc_commit_state_no_check+0x3cd/0xed0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dc_commit_streams+0x29b/0x400 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: dm_resume+0x44e/0x7c0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? srso_alias_return_thunk+0x5/0x7f
kernel: ? _dev_info+0x70/0x90
kernel: amdgpu_device_ip_resume_phase2+0x52/0xc0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
PackageKit[7629]: get-updates transaction /54_ecedccda from uid 1000 finished with success after 4476ms
kernel: amdgpu_do_asic_reset+0x4ca/0x730 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: amdgpu_device_gpu_recover+0x50f/0xce0 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? __drm_err+0x20/0xa0
kernel: amdgpu_job_timedout+0x151/0x240 [amdgpu bd864c145785aad46d2adae1aced5b7775508b8a]
kernel: ? finish_task_switch.isra.0+0x94/0x2f0
kernel: drm_sched_job_timedout+0x6a/0x100 [gpu_sched 27f0dfbe2a70efedf8e6c5f739524c2be6a0dc79]
kernel: process_one_work+0x21d/0x430
kernel: worker_thread+0x4e/0x3b0
kernel: ? __pfx_worker_thread+0x10/0x10
kernel: kthread+0xe8/0x120
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork+0x34/0x50
kernel: ? __pfx_kthread+0x10/0x10
kernel: ret_from_fork_asm+0x1b/0x30
kernel: </TASK>
kernel: ---[ end trace 0000000000000000 ]---
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed to set fec_ready
kernel: [drm] enabling link 2 failed: 15
kernel: [drm] kiq ring mec 2 pipe 1 q 0
kernel: amdgpu 0000:06:00.0: [drm:amdgpu_ring_test_helper [amdgpu]] *ERROR* ring kiq_0.2.1.0 test failed (-110)
kernel: [drm:amdgpu_gfx_enable_kcq [amdgpu]] *ERROR* KCQ enable failed
kernel: [drm:amdgpu_device_ip_resume_phase2 [amdgpu]] *ERROR* resume of IP block <gfx_v9_0> failed -110
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset(2) failed
kernel: amdgpu 0000:06:00.0: amdgpu: GPU reset end with ret = -110
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* GPU Recovery Failed: -110
kernel: [drm] Skip scheduling IBs!
kernel: [drm] Skip scheduling IBs!
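
When triaging a log like the one above, it can help to pull out only the amdgpu `*ERROR*` entries. The following is a minimal sketch in Python, assuming the log is available with one entry per line (as in the attached dmesg); the helper name `amdgpu_errors` and the regular expression are illustrative and not part of any kernel tooling.

```python
import re

# Matches entries such as:
#   kernel: [drm:dp_set_fec_ready [amdgpu]] *ERROR* dpcd write failed ...
# The pattern is an assumption based on the lines quoted in this report.
ERROR_RE = re.compile(r"\[drm:(?P<func>\w+) \[amdgpu\]\] \*ERROR\* (?P<msg>.*)$")


def amdgpu_errors(log_text):
    """Return (function, message) pairs for every amdgpu *ERROR* entry."""
    found = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if match:
            found.append((match.group("func"), match.group("msg").strip()))
    return found
```

Run against the excerpt in this comment, this should list the `hw_done or flip_done timed out` entry first and the `GPU Recovery Failed: -110` entry last, which makes the order of failures easier to see.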
Ah, after checking the logs a bit more, I see a bunch of these, a few seconds before the hang. Maybe it is what caused the gpu reset?

kernel: amdgpu 0000:06:00.0: amdgpu: [gfxhub0] no-retry page fault (src_id:0 ring:24 vmid:5 pasid:32772, for process firefox pid 3410 thread firefox:cs0 pid 3489)
kernel: amdgpu 0000:06:00.0: amdgpu: in page starting at address 0x00003ffb78559000 from IH client 0x1b (UTCL2)
kernel: amdgpu 0000:06:00.0: amdgpu: VM_L2_PROTECTION_FAULT_STATUS:0x00500431
kernel: amdgpu 0000:06:00.0: amdgpu: Faulty UTCL2 client ID: IA (0x2)
kernel: amdgpu 0000:06:00.0: amdgpu: MORE_FAULTS: 0x1
kernel: amdgpu 0000:06:00.0: amdgpu: WALKER_ERROR: 0x0
kernel: amdgpu 0000:06:00.0: amdgpu: PERMISSION_FAULTS: 0x3
kernel: amdgpu 0000:06:00.0: amdgpu: MAPPING_ERROR: 0x0
kernel: amdgpu 0000:06:00.0: amdgpu: RW: 0x0

This comes from the firefox process, but I've seen it crash even when firefox wasn't involved.
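
The individual fields printed after VM_L2_PROTECTION_FAULT_STATUS are bit fields of that one register value. As a rough sketch, the Python below unpacks 0x00500431 into the same fields. The shift and width of each field are an assumption: they were chosen because they reproduce the values the driver printed in this comment (including vmid:5), and they are not quoted from this report or checked against the register headers.

```python
# (shift, width) per field. ASSUMED layout: it reproduces the values logged
# in this report, but has not been verified against the driver sources.
FIELDS = {
    "MORE_FAULTS": (0, 1),
    "WALKER_ERROR": (1, 3),
    "PERMISSION_FAULTS": (4, 4),
    "MAPPING_ERROR": (8, 1),
    "CID": (9, 9),
    "RW": (18, 1),
    "VMID": (20, 4),
}


def decode_fault_status(status):
    """Split a fault status word into named bit fields."""
    return {
        name: (status >> shift) & ((1 << width) - 1)
        for name, (shift, width) in FIELDS.items()
    }


decoded = decode_fault_status(0x00500431)
```

With this layout, `decoded["PERMISSION_FAULTS"]` is 3, `decoded["CID"]` is 2 and `decoded["RW"]` is 0, matching the logged lines.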
Created attachment 869880: kernel log
For completeness, you might see a few of these during boot in the log in Comment 2:

kernel: WARNING: CPU: 11 PID: 3081 at drivers/acpi/platform_profile.c:74 platform_profile_show+0xa6/0xd0 [platform_profile]

They have been there since I installed TW and are most likely unrelated to this issue (they belong in a separate kernel or firmware bug report). They seemed harmless, so I never got around to investigating or reporting them.
Just as a wild guess: Could you boot with amdgpu.mcbp=0?
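
After rebooting with the parameter added to the kernel command line (typically by editing the boot entry in the bootloader), it is worth confirming that it actually took effect. The snippet below is a small sketch that looks for a `module.name=value` token in a command-line string; the helper name `module_param` is made up for illustration.

```python
def module_param(cmdline, module, name):
    """Return the value of module.name from a kernel command line, or None."""
    key = f"{module}.{name}="
    value = None
    for token in cmdline.split():
        if token.startswith(key):
            value = token[len(key):]  # keep the last occurrence found
    return value


# Usage on a running system (reads the command line the kernel booted with):
#   with open("/proc/cmdline") as f:
#       print(module_param(f.read(), "amdgpu", "mcbp"))
```

If the function returns "0" for the running system, the test requested above was done with mid-command-buffer preemption disabled via the command line.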
No reply from the reporter so far. If amdgpu.mcbp=0 solves the issue, then kernel 6.5.6 has the fix:

commit 2c4cc4d787a5f332f2c61f12cdb31e01da386439
Author: Jiadong Zhu <Jiadong.Zhu@amd.com>
Date:   Wed Jul 26 15:21:48 2023 +0800

    drm/amdgpu: set completion status as preempted for the resubmission
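
To check whether a given kernel release string is at or past the version named above, a simple comparison is enough. This is only a sketch: the 6.5.6 threshold comes from the comment above, the helper names are illustrative, and a plain version comparison cannot tell whether a distribution backported the commit to an older kernel or whether a pre-6.5 kernel was ever affected.

```python
import re

FIXED_IN = (6, 5, 6)  # threshold taken from the comment above


def parse_release(release):
    """Turn a release string such as '6.5.4-1-default' into (6, 5, 4)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(release):
    """True if the release is at least the version said to carry the fix."""
    return parse_release(release) >= FIXED_IN
```

On a running system the release string can be obtained with `platform.release()` from the Python standard library and passed to `has_fix`.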
(In reply to Frank Krüger from comment #5)
> No reply from the reporter so far. If amdgpu.mcbp=0 solves the issue, then
> kernel 6.5.6 has the fix:
>
> commit 2c4cc4d787a5f332f2c61f12cdb31e01da386439
> Author: Jiadong Zhu <Jiadong.Zhu@amd.com>
> Date: Wed Jul 26 15:21:48 2023 +0800
>
> drm/amdgpu: set completion status as preempted for the resubmission

Apologies. This is a workstation and I haven't been able to try it yet. I will try it after hours today.
Just tried 6.5.6-1-default and it fixed the issue. Thanks. Closing.