Bug 1217433

Summary: Leap 15.6 system with AM5 and NVIDIA RTX3070 gets stuck a few times a day after a recent update
Product: [openSUSE] openSUSE Distribution Reporter: Lubos Kocman <lubos.kocman>
Component: X11 3rd Party Driver Assignee: Stefan Dirsch <sndirsch>
Status: RESOLVED INVALID QA Contact: Stefan Dirsch <sndirsch>
Severity: Normal    
Priority: P3 - Medium CC: ddadap, lubos.kocman, tiwai
Version: Leap 15.6   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: dmesg output (not showing anything obvious)
/var/log/messages
updated packages in last two weeks
updated packages in last two weeks
output of nvidia-bug-report.sh

Description Lubos Kocman 2023-11-23 09:17:31 UTC
Created attachment 870924 [details]
dmesg output (not showing anything obvious)

The entire system always froze while google-chrome was active.

lkocman@localhost:~> rpm -qa | grep -i chrome
chrome-gnome-shell-10.1-1.56.x86_64
google-chrome-stable-119.0.6045.159-1.x86_64


I updated the NVIDIA drivers today, but the machine showed the same symptoms yesterday, before the update.

lkocman@localhost:~> rpm -qa | grep nvidia
kernel-firmware-nvidia-20231006-150600.1.1.noarch
nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-video-G06-545.29.02-lp156.18.1.x86_64
nvidia-compute-G06-545.29.02-lp156.18.1.x86_64
nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64
nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-gl-G06-545.29.02-lp156.18.1.x86_64

lkocman@localhost:~> uname -r
6.4.0-150600.1-default


lkocman@localhost:~> rpm -qa | grep kernel-default
kernel-default-optional-5.14.21-150500.55.28.1.x86_64
kernel-default-6.4.0-150600.1.2.x86_64
kernel-default-extra-6.4.0-150600.1.2.x86_64
kernel-default-devel-5.14.21-150500.55.28.1.x86_64
kernel-default-5.14.21-150500.55.28.1.x86_64
kernel-default-optional-6.4.0-150600.1.2.x86_64
kernel-default-extra-5.14.21-150500.55.28.1.x86_64
kernel-default-devel-6.4.0-150600.1.2.x86_64
Comment 1 Lubos Kocman 2023-11-23 09:19:02 UTC
Created attachment 870925 [details]
/var/log/messages

The system got stuck right before I connected to a 10am call today, Nov 23. /var/log/messages shows some more interesting things.
Comment 2 Lubos Kocman 2023-11-23 09:23:23 UTC
I updated the system as well as the flatpaks in the morning. The problematic period was roughly 5 minutes either side of 10:00.
Comment 3 Lubos Kocman 2023-11-23 09:25:30 UTC
I'd say it must be something here (see the GPU stall messages):

https://paste.opensuse.org/pastes/d03de0ffd4e0


Nov 23 09:52:30 localhost google-chrome.desktop[5122]: [1123/095230.734736:ERROR:file_io_posix.cc(152)] open /home/lkocman/.config/google-chrome/Crash Reports/pending/55130046-2a95-4ed2-a1fa-6254c78d9e97.lock: File exists (17)
Nov 23 09:52:30 localhost systemd[2516]: Started Application launched by gnome-shell.
Nov 23 09:52:30 localhost gnome-keyring-daemon[2540]: asked to register item /org/freedesktop/secrets/collection/login/1, but it's already registered
Nov 23 09:52:30 localhost google-chrome.desktop[5122]: [5116:5141:1123/095230.986524:ERROR:nss_util.cc(357)] After loading Root Certs, loaded==false: NSS error code: -8018
Nov 23 09:52:32 localhost chrome[5116]: [5116:5116:1123/095232.674531:WARNING:remote_commands_service.cc(225)] Client is not registered.
Nov 23 09:52:32 localhost google-chrome.desktop[5122]: [5116:5116:1123/095232.841042:ERROR:interface_endpoint_client.cc(702)] Message 3 rejected by interface blink.mojom.Widget
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.134736:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.137989:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.154721:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.171233:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels (this message will no longer repeat)
Nov 23 09:52:41 localhost systemd[2516]: Created slice User Background Tasks Slice.
Comment 4 Lubos Kocman 2023-11-23 09:27:29 UTC
The issue is not easily reproducible; it has happened maybe 3-4 times since yesterday.
Comment 5 Stefan Dirsch 2023-11-23 09:32:00 UTC
@Daniel Does this ring a bell for you?
Comment 6 Lubos Kocman 2023-11-23 09:34:58 UTC
There has been no kernel update for about two weeks, so I think it must be NVIDIA related.


# last log rotation was two weeks ago and I have not seen any issues back then.
localhost:/home/lkocman # ls /var/log/zypper.log*
/var/log/zypper.log  /var/log/zypper.log-20230915.xz  /var/log/zypper.log-20231010.xz  /var/log/zypper.log-20231108.xz



localhost:/home/lkocman # cat /var/log/zypper.log | egrep -i "kernel|nvidia" | egrep "<install>|<uninstall>"
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(13)nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_(14)nvidia-compute-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_rs(18)nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(22)nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(23)nvidia-gl-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(27)nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(28)nvidia-video-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112840)nvidia-compute-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112841)nvidia-compute-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112842)nvidia-driver-G06-kmp-default-535.129.03_k6.4.0_150600.1-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112843)nvidia-gl-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112844)nvidia-gl-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112845)nvidia-video-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112846)nvidia-video-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(13)nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_(14)nvidia-compute-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_rs(18)nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(22)nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(23)nvidia-gl-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(27)nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install>   U_Ts_r(28)nvidia-video-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112862)nvidia-compute-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112863)nvidia-compute-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112864)nvidia-driver-G06-kmp-default-535.129.03_k6.4.0_150600.1-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112865)nvidia-gl-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112866)nvidia-gl-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112867)nvidia-video-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112868)nvidia-video-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
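
The install/uninstall pairs in the log above can be condensed into one "old version -> new version" entry per package. A minimal Python sketch, assuming the zypper.log summary line format quoted in this comment; `parse_summary_line` and `version_changes` are hypothetical helper names, not part of zypper.

```python
import re

# Matches the "<install>" / "<uninstall>" summary lines quoted above.
SUMMARY_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) .*"
    r"<(install|uninstall)>\s+\S+?\(\d+\)(\S+)\(([^()]*)\)\s*$"
)


def parse_summary_line(line):
    """Split one summary line into its fields, or return None if it does not match."""
    m = SUMMARY_RE.match(line)
    if m is None:
        return None
    timestamp, action, nvra, repo = m.groups()
    # Package id has the form name-version-release.arch
    nvr, arch = nvra.rsplit(".", 1)
    name, version, release = nvr.rsplit("-", 2)
    return {
        "timestamp": timestamp,
        "action": action,
        "name": name,
        "version": version,
        "release": release,
        "arch": arch,
        "repo": repo,
    }


def version_changes(lines):
    """Map package name -> {"install": version, "uninstall": version}."""
    changes = {}
    for line in lines:
        rec = parse_summary_line(line)
        if rec is None:
            continue
        changes.setdefault(rec["name"], {})[rec["action"]] = rec["version"]
    return changes
```

Calling `version_changes(open("/var/log/zypper.log"))` would then give, for each package, the version being removed and the version being installed.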
Comment 7 Lubos Kocman 2023-11-23 09:37:25 UTC
Created attachment 870926 [details]
updated packages in last two weeks
Comment 8 Lubos Kocman 2023-11-23 09:39:42 UTC
Created attachment 870927 [details]
updated packages in last two weeks

Let's use this file; it shows the previous versions in a more readable form.
Comment 9 Lubos Kocman 2023-11-23 10:23:06 UTC
As agreed with Stefan, to narrow down the issue I will run chrome without WebGL to see whether the freeze still happens, and then confirm the issue again with WebGL enabled.

This seems to do the trick:
lkocman@localhost:~> google-chrome --args  -disable-webgl

Aside from that, there is no easy way to debug this without a simple reproducer.
Comment 10 Stefan Dirsch 2023-11-26 19:32:59 UTC
I wonder if disabling WebGL helped ...
Comment 11 Lubos Kocman 2023-11-27 12:50:37 UTC
[   59.275604] NVRM: GPU at PCI:0000:01:00: GPU-21498f58-0fa4-a24c-b7c7-ee4bfbd25302
[   59.275607] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[   59.275610] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[   59.275617] NVRM: A GPU crash dump has been created. If possible, please run
               NVRM: nvidia-bug-report.sh as root to collect this data before
               NVRM: the NVIDIA kernel module is unloaded.


So the system freeze happened right after a fresh boot, during GNOME startup. No Chromium was involved (although some apps are Electron apps).
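
The Xid report above can be picked out of a saved dmesg dump mechanically. A minimal Python sketch, assuming the log line format quoted in this comment; `find_xid_events` is a hypothetical helper name and not part of any NVIDIA tool.

```python
import re

# One NVRM Xid report per line, in the dmesg format quoted in this comment.
XID_RE = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+NVRM: Xid \(([^)]+)\): (\d+), (.*)$",
    re.MULTILINE,
)


def find_xid_events(dmesg_text):
    """Return (seconds_since_boot, pci_address, xid_code, message) tuples."""
    return [
        (float(ts), pci, int(code), msg.strip())
        for ts, pci, code, msg in XID_RE.findall(dmesg_text)
    ]
```

Run against the excerpt above, this would report a single event with Xid code 79 for PCI:0000:01:00.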
Comment 12 Lubos Kocman 2023-11-27 12:54:51 UTC
Created attachment 871000 [details]
output of nvidia-bug-report.sh

Attaching output of nvidia-bug-report.sh
Comment 13 Lubos Kocman 2023-12-04 18:39:43 UTC
The machine gets stuck from time to time, at random moments.

I have a dual-boot setup and have not experienced this on Windows, although I spend over 90% of the time in Leap 15.6 :-)
Comment 14 Lubos Kocman 2023-12-05 10:51:50 UTC
I really think it was a HW/connectivity issue; there is a PCIe 4 riser cable and GPU supports involved. I did experience a similar freeze on Windows, so I disassembled the machine, cleaned it up, and reassembled it. I expect things to be operational again.

Seeing a similar freeze after a cold start on another platform makes me quite confident it's not on the Linux driver side.

Thank you
Comment 15 Lubos Kocman 2023-12-13 08:47:17 UTC
Interestingly, the issue still happens after the disassembly etc. See the error related to nvidia_drm:

localhost:/home/lkocman # dmesg | grep nvidia
[    6.728037] nvidia: module license 'NVIDIA' taints kernel.
[    6.728042] nvidia: module license taints kernel.
[    6.872582] nvidia: externally supported module, setting X kernel taint flag.
[    6.874419] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[    6.876731] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[    6.981171] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[    7.040119] nvidia_uvm: externally supported module, setting X kernel taint flag.
[    7.041292] nvidia-uvm: Loaded the UVM driver, major device number 511.
[    7.115788] nvidia_modeset: externally supported module, setting X kernel taint flag.
[    7.115856] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms  545.29.06  Thu Nov 16 01:47:29 UTC 2023
[    7.121349] nvidia_drm: externally supported module, setting X kernel taint flag.
[    7.121509] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[    8.021796] audit: type=1400 audit(1702456391.987:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1113 comm="apparmor_parser"
[    8.021797] audit: type=1400 audit(1702456391.987:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1113 comm="apparmor_parser"
[    9.359733] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[    9.359921] nvidia 0000:01:00.0: vgaarb: deactivate vga console
[    9.478019] fbcon: nvidia-drmdrmfb (fb0) is primary device
[    9.573186] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[   12.792312] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   15.864386] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   18.936439] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   22.264512] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   25.336582] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   28.664656] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[   31.736666] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
Comment 16 Lubos Kocman 2024-01-12 14:48:31 UTC
Confirming that the issue does not happen if the GPU is connected directly to the motherboard. It must be a faulty PCIe 4 riser cable.
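
Since the GPU sits at 0000:01:00.0 according to the NVRM messages earlier in this report, one way to look for a marginal riser is to compare the negotiated PCIe link with the maximum the device advertises. A minimal Python sketch, assuming a Linux kernel that exposes the `current_link_speed`, `max_link_speed`, `current_link_width` and `max_link_width` sysfs attributes; the helper names are hypothetical, and this check was not part of the debugging done in this report.

```python
from pathlib import Path

# sysfs attributes describing the negotiated and the maximum PCIe link.
LINK_ATTRS = (
    "current_link_speed",
    "max_link_speed",
    "current_link_width",
    "max_link_width",
)


def read_link_status(device_dir):
    """Return the PCIe link attributes found under device_dir as a dict."""
    status = {}
    for attr in LINK_ATTRS:
        path = Path(device_dir) / attr
        if path.is_file():
            status[attr] = path.read_text().strip()
    return status


def link_differs_from_max(status):
    """True when the negotiated speed or width differs from the advertised maximum."""
    return (
        status.get("current_link_speed") != status.get("max_link_speed")
        or status.get("current_link_width") != status.get("max_link_width")
    )
```

For this machine the call would be `read_link_status("/sys/bus/pci/devices/0000:01:00.0")`. A lower current speed alone may just reflect GPU power saving at idle, so a mismatch is only a hint and is best checked while the GPU is under load.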