Bugzilla – Bug 1217433
Leap 15.6 system with AM5 and NVIDIA RTX3070 gets stuck few times a day after a recent update
Last modified: 2024-01-12 14:48:31 UTC
Created attachment 870924 [details] dmesg output (not showing anything obvious)

The entire system always froze while google-chrome was active.

lkocman@localhost:~> rpm -qa | grep -i chrome
chrome-gnome-shell-10.1-1.56.x86_64
google-chrome-stable-119.0.6045.159-1.x86_64

I updated the NVIDIA drivers today, but the machine was already showing the same symptoms yesterday, before the update.

lkocman@localhost:~> rpm -qa | grep nvidia
kernel-firmware-nvidia-20231006-150600.1.1.noarch
nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-video-G06-545.29.02-lp156.18.1.x86_64
nvidia-compute-G06-545.29.02-lp156.18.1.x86_64
nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64
nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64
nvidia-gl-G06-545.29.02-lp156.18.1.x86_64

lkocman@localhost:~> uname -r
6.4.0-150600.1-default

lkocman@localhost:~> rpm -qa | grep kernel-default
kernel-default-optional-5.14.21-150500.55.28.1.x86_64
kernel-default-6.4.0-150600.1.2.x86_64
kernel-default-extra-6.4.0-150600.1.2.x86_64
kernel-default-devel-5.14.21-150500.55.28.1.x86_64
kernel-default-5.14.21-150500.55.28.1.x86_64
kernel-default-optional-6.4.0-150600.1.2.x86_64
kernel-default-extra-5.14.21-150500.55.28.1.x86_64
kernel-default-devel-6.4.0-150600.1.2.x86_64
Created attachment 870925 [details] /var/log/messages

The system got stuck right before I connected to a 10am call today, Nov 23. /var/log/messages shows some more interesting things.
I updated the system as well as the flatpaks in the morning. The problematic period was roughly 5 minutes either side of 10:00.
I'd say it must be something here (see the GPU stall messages):
https://paste.opensuse.org/pastes/d03de0ffd4e0

Nov 23 09:52:30 localhost google-chrome.desktop[5122]: [1123/095230.734736:ERROR:file_io_posix.cc(152)] open /home/lkocman/.config/google-chrome/Crash Reports/pending/55130046-2a95-4ed2-a1fa-6254c78d9e97.lock: File exists (17)
Nov 23 09:52:30 localhost systemd[2516]: Started Application launched by gnome-shell.
Nov 23 09:52:30 localhost gnome-keyring-daemon[2540]: asked to register item /org/freedesktop/secrets/collection/login/1, but it's already registered
Nov 23 09:52:30 localhost google-chrome.desktop[5122]: [5116:5141:1123/095230.986524:ERROR:nss_util.cc(357)] After loading Root Certs, loaded==false: NSS error code: -8018
Nov 23 09:52:32 localhost chrome[5116]: [5116:5116:1123/095232.674531:WARNING:remote_commands_service.cc(225)] Client is not registered.
Nov 23 09:52:32 localhost google-chrome.desktop[5122]: [5116:5116:1123/095232.841042:ERROR:interface_endpoint_client.cc(702)] Message 3 rejected by interface blink.mojom.Widget
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.134736:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.137989:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.154721:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels
Nov 23 09:52:33 localhost google-chrome.desktop[5122]: [5161:5161:1123/095233.171233:ERROR:gl_utils.cc(402)] [.RendererMainThread-0x9b0002d7f00]GL Driver Message (OpenGL, Performance, GL_CLOSE_PATH_NV, High): GPU stall due to ReadPixels (this message will no longer repeat)
Nov 23 09:52:41 localhost systemd[2516]: Created slice User Background Tasks Slice.
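For anyone who wants to cut the relevant window out of /var/log/messages instead of reading the whole attachment, here is a minimal Python sketch. It assumes the traditional syslog timestamp prefix seen in the excerpt ("Nov 23 09:52:30"), which carries no year, so the year is passed in. The function name `lines_near` and its parameters are made up for illustration.

```python
from datetime import datetime, timedelta

def lines_near(lines, center, minutes=5, year=2023):
    """Return the syslog lines stamped within +/- `minutes` of `center`.

    Assumes each line starts with a 15-character "Mon DD HH:MM:SS" stamp,
    as in the /var/log/messages excerpt; lines without one are skipped.
    """
    window = timedelta(minutes=minutes)
    selected = []
    for line in lines:
        stamp = line[:15]
        try:
            # The log has no year, so prepend the one supplied by the caller.
            when = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # continuation line or a different format
        if abs(when - center) <= window:
            selected.append(line)
    return selected

SAMPLE = [
    "Nov 23 09:52:30 localhost systemd[2516]: Started Application launched by gnome-shell.",
    "Nov 23 09:52:41 localhost systemd[2516]: Created slice User Background Tasks Slice.",
]

# The excerpt sits at 09:52, so a 10-minute window around 10:00 is needed to catch it.
for entry in lines_near(SAMPLE, datetime(2023, 11, 23, 10, 0), minutes=10):
    print(entry)
```

On a real system the `lines` argument could be an open file handle on /var/log/messages, which usually needs root to read.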
The issue is not easily reproducible; it has happened maybe 3-4 times since yesterday.
@Daniel Does this ring a bell for you?
There has been no kernel update for about two weeks, so I think it must be NVIDIA related. The last log rotation was two weeks ago, and I had not seen any issues back then.

localhost:/home/lkocman # ls /var/log/zypper.log*
/var/log/zypper.log /var/log/zypper.log-20230915.xz /var/log/zypper.log-20231010.xz /var/log/zypper.log-20231108.xz

localhost:/home/lkocman # cat /var/log/zypper.log | egrep -i "kernel|nvidia" | egrep "<install>|<uninstall>"
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(13)nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_(14)nvidia-compute-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_rs(18)nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(22)nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(23)nvidia-gl-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(27)nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(28)nvidia-video-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112840)nvidia-compute-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112841)nvidia-compute-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112842)nvidia-driver-G06-kmp-default-535.129.03_k6.4.0_150600.1-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112843)nvidia-gl-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112844)nvidia-gl-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112845)nvidia-video-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112846)nvidia-video-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(13)nvidia-compute-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_(14)nvidia-compute-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_rs(18)nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(22)nvidia-gl-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(23)nvidia-gl-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(27)nvidia-video-G06-32bit-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <install> U_Ts_r(28)nvidia-video-G06-545.29.02-lp156.18.1.x86_64(NVIDIA:repo-non-free)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112862)nvidia-compute-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112863)nvidia-compute-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112864)nvidia-driver-G06-kmp-default-535.129.03_k6.4.0_150600.1-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112865)nvidia-gl-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112866)nvidia-gl-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112867)nvidia-video-G06-535.129.03-lp156.15.1.x86_64(@System)
2023-11-23 09:07:03 <1> localhost.localdomain(5310) [zypper++] Summary.cc(readPool):171 <uninstall> I_TsU(112868)nvidia-video-G06-32bit-535.129.03-lp156.15.1.x86_64(@System)
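The egrep pipeline above can also be written as a small Python sketch that returns structured records, which makes it easier to see which package versions went in and out on which day. The line layout is inferred only from the excerpt in this comment, so the regular expression is an assumption rather than a description of zypper's log format in general. `LINE_RE` and `package_changes` are names invented for this example. Unlike the egrep, it matches the keywords against the package name only, not the whole line.

```python
import re

# Layout assumed from the excerpt: date, time, ..., <install>|<uninstall>,
# status flags, "(id)", package name-version-release.arch, "(repository)".
LINE_RE = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) .*"
    r"<(?P<action>install|uninstall)>\s+"
    r"\S*?\(\d+\)(?P<package>[^()\s]+)\((?P<repo>[^)]*)\)\s*$"
)

def package_changes(lines, keywords=("kernel", "nvidia")):
    """Return (date, time, action, package, repo) for matching summary lines."""
    changes = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m and any(k in m["package"].lower() for k in keywords):
            changes.append(
                (m["date"], m["time"], m["action"], m["package"], m["repo"])
            )
    return changes

SAMPLE = [
    "2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] "
    "Summary.cc(readPool):171 <install> "
    "U_Ts_rs(18)nvidia-driver-G06-kmp-default-545.29.02_k6.4.0_150600.1-lp156.18.1.x86_64(NVIDIA:repo-non-free)",
    "2023-11-21 13:17:50 <1> localhost.localdomain(20410) [zypper++] "
    "Summary.cc(readPool):171 <uninstall> "
    "I_TsU(112842)nvidia-driver-G06-kmp-default-535.129.03_k6.4.0_150600.1-lp156.15.1.x86_64(@System)",
]

for date, time, action, package, repo in package_changes(SAMPLE):
    print(date, time, action, package, repo)
```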
Created attachment 870926 [details] updated packages in last two weeks
Created attachment 870927 [details] updated packages in last two weeks

Let's use this file; it shows the previous versions in a more readable form.
The agreement with Stefan, to narrow down the issue, is to try using Chrome without WebGL and see whether the freeze still happens, and then to confirm the issue again with WebGL enabled. This seems to do the trick:

lkocman@localhost:~> google-chrome --args -disable-webgl

Aside from that, there is no easy way to debug this without a simple reproducer.
I wonder if disabling WebGL helped ...
[ 59.275604] NVRM: GPU at PCI:0000:01:00: GPU-21498f58-0fa4-a24c-b7c7-ee4bfbd25302
[ 59.275607] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 59.275610] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
[ 59.275617] NVRM: A GPU crash dump has been created. If possible, please run
NVRM: nvidia-bug-report.sh as root to collect this data before
NVRM: the NVIDIA kernel module is unloaded.

So this system freeze happened right after a fresh boot, during GNOME startup. That means no Chromium was involved (some apps are Electron apps).
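To spot these events quickly in a longer dmesg dump, a minimal Python sketch like the following could extract the Xid lines. The pattern is based only on the single Xid line quoted above, so other Xid messages may be laid out differently. `XID_RE` and `find_xid_events` are hypothetical names.

```python
import re

# Pattern assumed from the one line in this report:
# "NVRM: Xid (PCI:0000:01:00): 79, pid=..., name=..., <message>"
XID_RE = re.compile(
    r"NVRM: Xid \((?P<dev>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+), (?P<detail>.*)$"
)
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]")

def find_xid_events(dmesg_text):
    """Return a list of dicts describing each NVRM Xid line found."""
    events = []
    for line in dmesg_text.splitlines():
        m = XID_RE.search(line)
        if not m:
            continue
        ts = TS_RE.match(line.strip())
        events.append({
            "seconds": float(ts.group(1)) if ts else None,  # time since boot
            "device": m["dev"],
            "xid": int(m["xid"]),
            "detail": m["detail"],
        })
    return events

SAMPLE = """\
[ 59.275604] NVRM: GPU at PCI:0000:01:00: GPU-21498f58-0fa4-a24c-b7c7-ee4bfbd25302
[ 59.275607] NVRM: Xid (PCI:0000:01:00): 79, pid='<unknown>', name=<unknown>, GPU has fallen off the bus.
[ 59.275610] NVRM: GPU 0000:01:00.0: GPU has fallen off the bus.
"""

for event in find_xid_events(SAMPLE):
    print(event["seconds"], event["device"], event["xid"], event["detail"])
```

Here the event lands about 59 seconds after boot, which fits a freeze during GNOME startup rather than during a browser session.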
Created attachment 871000 [details] output of nvidia-bug-report.sh

Attaching output of nvidia-bug-report.sh
The machine gets stuck from time to time, at random moments. I have a dual-boot setup and have not experienced this on Windows, although I spend over 90% of my time in Leap 15.6 :-)
I really think it was a HW/connectivity issue; there is a PCIe 4 riser cable and GPU supports involved. I did experience a similar freeze on Windows, so I disassembled the machine, cleaned it up, and reassembled it. I expect things to be operational again. Seeing a similar freeze after a cold start on another platform makes me quite confident it's not on the Linux driver side. Thank you
Interestingly, the issue still happens after the disassembly etc. See the errors related to nvidia_drm:

localhost:/home/lkocman # dmesg | grep nvidia
[ 6.728037] nvidia: module license 'NVIDIA' taints kernel.
[ 6.728042] nvidia: module license taints kernel.
[ 6.872582] nvidia: externally supported module, setting X kernel taint flag.
[ 6.874419] nvidia-nvlink: Nvlink Core is being initialized, major device number 235
[ 6.876731] nvidia 0000:01:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=none
[ 6.981171] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 7.040119] nvidia_uvm: externally supported module, setting X kernel taint flag.
[ 7.041292] nvidia-uvm: Loaded the UVM driver, major device number 511.
[ 7.115788] nvidia_modeset: externally supported module, setting X kernel taint flag.
[ 7.115856] nvidia-modeset: Loading NVIDIA Kernel Mode Setting Driver for UNIX platforms 545.29.06 Thu Nov 16 01:47:29 UTC 2023
[ 7.121349] nvidia_drm: externally supported module, setting X kernel taint flag.
[ 7.121509] [drm] [nvidia-drm] [GPU ID 0x00000100] Loading driver
[ 8.021796] audit: type=1400 audit(1702456391.987:4): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe" pid=1113 comm="apparmor_parser"
[ 8.021797] audit: type=1400 audit(1702456391.987:5): apparmor="STATUS" operation="profile_load" profile="unconfined" name="nvidia_modprobe//kmod" pid=1113 comm="apparmor_parser"
[ 9.359733] [drm] Initialized nvidia-drm 0.0.0 20160202 for 0000:01:00.0 on minor 1
[ 9.359921] nvidia 0000:01:00.0: vgaarb: deactivate vga console
[ 9.478019] fbcon: nvidia-drmdrmfb (fb0) is primary device
[ 9.573186] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[ 12.792312] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 15.864386] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 18.936439] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 22.264512] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 25.336582] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 28.664656] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 31.736666] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
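The seven flip event timeouts in this dmesg output start about 3 seconds after nvidia-drm takes over the framebuffer and then repeat every 3.1 to 3.3 seconds. A short Python sketch for pulling the timestamps and the gaps between them out of a dmesg dump might look like this; it assumes the bracketed seconds-since-boot prefix shown above, and `FLIP_RE`, `flip_timeouts` and `intervals` are names made up for the example.

```python
import re

# Assumes dmesg lines of the form "[ 12.792312] ... Flip event timeout on head 0".
FLIP_RE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\].*Flip event timeout on head (?P<head>\d+)"
)

def flip_timeouts(dmesg_text):
    """Return (seconds_since_boot, head) for each flip event timeout line."""
    found = []
    for line in dmesg_text.splitlines():
        m = FLIP_RE.match(line.strip())
        if m:
            found.append((float(m["ts"]), int(m["head"])))
    return found

def intervals(events):
    """Return the gaps in seconds between consecutive events."""
    times = [ts for ts, _head in events]
    return [later - earlier for earlier, later in zip(times, times[1:])]

SAMPLE = """\
[ 9.573186] nvidia 0000:01:00.0: [drm] fb0: nvidia-drmdrmfb frame buffer device
[ 12.792312] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 15.864386] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
[ 18.936439] [drm:nv_drm_atomic_commit [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Flip event timeout on head 0
"""

events = flip_timeouts(SAMPLE)
print(len(events), "timeouts; gaps:", [round(gap, 3) for gap in intervals(events)])
```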
Confirming that the issue does not happen when the GPU is connected directly to the motherboard. It must be a faulty PCIe 4 riser cable.