Bug 1219517 - Kernel 6.7.2 AMD GPU random system freeze or output no video
Summary: Kernel 6.7.2 AMD GPU random system freeze or output no video
Status: RESOLVED INVALID
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-02-03 09:31 UTC by Yunhe Guo
Modified: 2024-02-28 15:53 UTC
CC List: 3 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Yunhe Guo 2024-02-03 09:31:55 UTC
Recently, I experienced two problems:

1. Random system freezes. I can only force a shutdown by pressing the power button.
2. Failure to boot. There is no video signal: no UEFI logo and no GRUB menu.

I am using an AMD GPU. I found that some Arch Linux users have similar issues:

https://bbs.archlinux.org/viewtopic.php?id=292442

Operating System: openSUSE Tumbleweed 20240131
KDE Plasma Version: 5.27.10
KDE Frameworks Version: 5.114.0
Qt Version: 5.15.12
Kernel Version: 6.7.2-1-default (64-bit)
Graphics Platform: Wayland
Processors: 12 × AMD Ryzen 5 5600X 6-Core Processor
Memory: 31.3 GiB of RAM
Graphics Processor: AMD Radeon Graphics RX 6700
Manufacturer: Micro-Star International Co., Ltd.
Product Name: MS-7C94
System Version: 1.0
Comment 1 Yunhe Guo 2024-02-03 09:37:45 UTC
Related post from Reddit:

https://www.reddit.com/r/openSUSE/comments/1ahi5fn/kernel_672_fan_woes/

It is more and more clear that the issue is related to kernel 6.7.2.
Comment 2 Takashi Iwai 2024-02-04 09:15:19 UTC
Please check the behavior with 6.7.3 or later kernel in OBS Kernel:stable repo
  http://download.opensuse.org/repositories/Kernel:/stable/standard/

If the problem persists, verify the latest 6.8-rc kernel in OBS Kernel:HEAD
  http://download.opensuse.org/repositories/Kernel:/HEAD/standard/

If it's still seen in 6.8-rc, report to the upstream devs at gitlab.freedesktop.org issues.
Comment 3 Yunhe Guo 2024-02-20 11:10:36 UTC
(In reply to Takashi Iwai from comment #2)
> Please check the behavior with 6.7.3 or later kernel in OBS Kernel:stable
> repo
>   http://download.opensuse.org/repositories/Kernel:/stable/standard/
> 
> If the problem persists, verify the latest 6.8-rc kernel in OBS Kernel:HEAD
>   http://download.opensuse.org/repositories/Kernel:/HEAD/standard/
> 
> If it's still seen in 6.8-rc, report to the upstream devs at
> gitlab.freedesktop.org issues.

I tried all of these kernel versions but still get random system freezes. Today I finally captured the logs from a freeze (I was playing a YouTube video in Firefox; nothing else was running):

Feb 20 19:00:12 localhost kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 1758335664128 mirror 1 wanted 0x2e8ebbf4 found 0x2ed755e2 level 0
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 1758335664128 (dev /dev/nvme0n1p2 sector 883506304)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 1758335668224 (dev /dev/nvme0n1p2 sector 883506312)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 1758335672320 (dev /dev/nvme0n1p2 sector 883506320)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 1758335676416 (dev /dev/nvme0n1p2 sector 883506328)
Feb 20 19:00:12 localhost kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 491194335232 mirror 1 wanted 0x9cfbbb3e found 0xb0a41980 level 0
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 491194335232 (dev /dev/nvme0n1p2 sector 42204000)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 491194339328 (dev /dev/nvme0n1p2 sector 42204008)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 491194343424 (dev /dev/nvme0n1p2 sector 42204016)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 491194347520 (dev /dev/nvme0n1p2 sector 42204024)
Feb 20 19:00:12 localhost kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 781902266368 mirror 1 wanted 0x6ebf7f50 found 0xafea776b level 0
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 781902266368 (dev /dev/nvme0n1p2 sector 491765984)
Feb 20 19:00:12 localhost kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 781902270464 (dev /dev/nvme0n1p2 sector 491765992)
Feb 20 19:00:12 localhost kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 1756486762496 mirror 1 wanted 0x84dd47fe found 0xea1c76ef level 0
Feb 20 19:00:12 localhost kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 1756486762496 mirror 2 wanted 0x84dd47fe found 0x908fc266 level 0
Feb 20 19:00:12 localhost kernel: BTRFS error (device nvme0n1p2): qgroup scan failed with -5

I guess either my btrfs is broken or my SSD is broken.
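
For anyone triaging a similar log, the lines above can be grouped per logical address to see which blocks failed on which mirrors and which of them have a matching "read error corrected" line. The following is only a minimal sketch, assuming kernel messages in the same format as shown in this comment; the function name summarize_btrfs_log and the SAMPLE_LOG excerpt are illustrative and not part of any btrfs tooling.

```python
import re
from collections import defaultdict

# Patterns assume the message wording seen in the log excerpt in this comment.
CSUM_RE = re.compile(r"checksum verify failed on logical (\d+) mirror (\d+)")
FIXED_RE = re.compile(r"read error corrected: ino \d+ off (\d+)")


def summarize_btrfs_log(lines):
    """Group checksum failures by logical address (hypothetical helper).

    Returns {logical: {"mirrors": [...], "corrected": bool}}, where
    "corrected" is True if a "read error corrected" line reports the
    same offset as the failing logical address.
    """
    failed = defaultdict(set)
    corrected = set()
    for line in lines:
        match = CSUM_RE.search(line)
        if match:
            failed[int(match.group(1))].add(int(match.group(2)))
            continue
        match = FIXED_RE.search(line)
        if match:
            corrected.add(int(match.group(1)))
    return {
        logical: {"mirrors": sorted(mirrors), "corrected": logical in corrected}
        for logical, mirrors in failed.items()
    }


# Shortened excerpt of the log in this comment (timestamps dropped).
SAMPLE_LOG = [
    "kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 491194335232 mirror 1 wanted 0x9cfbbb3e found 0xb0a41980 level 0",
    "kernel: BTRFS info (device nvme0n1p2): read error corrected: ino 0 off 491194335232 (dev /dev/nvme0n1p2 sector 42204000)",
    "kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 1756486762496 mirror 1 wanted 0x84dd47fe found 0xea1c76ef level 0",
    "kernel: BTRFS warning (device nvme0n1p2): checksum verify failed on logical 1756486762496 mirror 2 wanted 0x84dd47fe found 0x908fc266 level 0",
    "kernel: BTRFS error (device nvme0n1p2): qgroup scan failed with -5",
]

for logical, info in summarize_btrfs_log(SAMPLE_LOG).items():
    print(logical, info)
```

In this excerpt, the block at logical 1756486762496 is the only one that failed on both mirror 1 and mirror 2 and has no "corrected" line; it is followed directly by the qgroup scan failure with -5 (the I/O error code, EIO).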
Comment 4 Takashi Iwai 2024-02-20 11:21:28 UTC
Yes, it smells more like a filesystem problem.  Please try to repair the filesystem first.

Do you see any other traces of an amdgpu crash or similar?
Comment 5 Yunhe Guo 2024-02-20 11:25:41 UTC
(In reply to Takashi Iwai from comment #4)
> Yes, it smells more like a filesystem problem.  Please try to repair the
> filesystem first.
> 
> Do you see any other traces of an amdgpu crash or similar?

No. I think it is not related to amdgpu.
Comment 6 Jiri Slaby 2024-02-26 06:57:57 UTC
This might be related:
https://gitlab.freedesktop.org/drm/amd/-/issues/3132

I backported:
commit 3a9626c816db901def438dc2513622e281186d39
Author: Mario Limonciello <mario.limonciello@amd.com>
Date:   Wed Feb 7 23:52:55 2024 -0600

    drm/amd: Stop evicting resources on APUs in suspend

to stable.
Comment 7 Yunhe Guo 2024-02-28 15:53:15 UTC
I can now confirm that the problem was caused by hardware. The CPU or DRAM was seated too loosely, which caused the system failures. Reseating the CPU and DRAM solved the issue.