Bugzilla – Bug 1226116
nvidia-open-driver-G06 ... 550.90.07: Fails suspend/recover from powersaving with NVreg_PreserveVideoMemoryAllocations (needed for GNOME Wayland)
Last modified: 2024-06-13 21:15:45 UTC
Created attachment 875386 [details]
nvidia-bug-report.log.gz from nvidia-bug-report.sh --safe-mode

[This is on a Lenovo P1 Gen 6 with an "NVIDIA RTX A1000 6GB Laptop GPU".]

When the system goes into suspend, everything looks normal (the screen goes dark, the power light blinks). However, when one presses a key to wake it up:

(1) It shows the terminal, not the Wayland desktop, with two unrelated warnings:
  i2c_hid_acpi i2c-ELAN0686:00: i2c_hid_get_input: incomplete report (31/65280)
  iwlwifi 0000:00:14.3: WRT: Invalid buffer destination

(2) When going to Alt+F2, it shows a login screen, but then nothing works on that screen, including Alt-Shift-F... (SysRq-{S,U,B} does still work).

When skipping (2):

(3) When going to a terminal (e.g. Alt-F1), login is possible.

Getting some diagnostic output:

* nvidia-bug-report.sh stops early because an access to /proc/driver/nvidia/gpus/0000:01:00.0/information (I think it was that file) got stuck (lsof -p<pid> showed that file). Likewise, nvidia-smi did not output anything. Both were interruptible with Ctrl-C.

* A reboot failed with some systemd message related to nvidia-*.service; I thought it was nvidia-powerd.service, but looking at the logs, it could also be nvidia-suspend.service. In any case, it stated something to the effect that disabling (or enabling?) wasn't possible while the service was currently enabled (or disabled?).

* "nvidia-bug-report.sh --safe-mode" did work → see attached file.

* * *

dmesg showed many lines of the form:
  kernel: NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset 0x70e000 returned garbage 0x0

The attached .gz file from "nvidia-bug-report.sh --safe-mode" contains both dmesg and "journalctl -b -0" and has the line above 1,750,353 times.
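For anyone who wants to check a log for the same message, here is a minimal shell sketch. The function name count_bar2_errors is made up for illustration; it only counts lines on standard input that contain the message quoted above, regardless of the hex offset.

```shell
# Count log lines that carry the BAR2 self-test failure quoted above.
# Reads log text on stdin, so it works on live output or an unpacked report.
count_bar2_errors() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when there are no matches (grep would return 1).
    grep -c 'kbusVerifyBar2_GM107: MMUTest BAR0 window offset' || true
}

# Possible uses (not run here):
#   journalctl -k -b -0 | count_bar2_errors
#   zcat nvidia-bug-report.log.gz | count_bar2_errors
```

Matching on the function name and message text rather than the full line also catches the 0x70f000 variant.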
The "journalctl -b -0" output in the attachment has:

  Jun 07 20:59:33 tux.net-b.de /usr/bin/nvidia-powerd[1477]: Dbus Connection is established
  Jun 07 21:29:48 tux.net-b.de suspend[3509]: nvidia-suspend.service
  Jun 07 21:29:48 tux.net-b.de logger[3509]: <13>Jun 7 21:29:48 suspend: nvidia-suspend.service
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvCheckFailedNoLog: Check failed: pMemDesc->_pInternalMapping != NULL @ mem_desc.c:2260
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mem_utils.c:574
  Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Ran out of a critical resource, other than memory [NV_ERR_INSUFFICIENT_RESOURCES] (0x0000001A) returned from memmgrMemCopy(pMemoryManager, &sysSurface, &vidSurface, copySize, TRANSFER_FLAGS_PREFER_CE) @ fbsr_gm107.c:1156

* * *

I think the issue occurred during suspend and not when waking up the system, but I might be mistaken. I thought the wall-clock time showed it, but I am not completely sure, as I woke the system up quite quickly. However, the quoted assertion fails directly after nvidia-suspend.service, which implies that it happens during suspend.

BTW: With the older 550.78 driver, leaving the laptop alone for a while (→ power save mode) ended with a reboot, or with the terminal showing briefly (similar output as above) before rebooting. Thus, the 550.78 issue was definitely a suspend/power-save issue. [The triggered reboot would be harder to diagnose than the issue I have now.]
* * *

Installed nvidia packages (rpm -qa '*nvidia*') - all are now 550.90.07-23.1:

  nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64
  kernel-firmware-nvidia-20240519-1.1.noarch
  nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
  nvidia-gl-G06-550.90.07-23.1.x86_64
  nvidia-video-G06-32bit-550.90.07-23.1.x86_64
  nvidia-compute-utils-G06-550.90.07-23.1.x86_64
  nvidia-video-G06-550.90.07-23.1.x86_64
  nvidia-utils-G06-550.90.07-23.1.x86_64
  libnvidia-egl-wayland1-1.1.13-1.3.x86_64
  nvidia-compute-G06-550.90.07-23.1.x86_64
  nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
  kernel-firmware-nvidia-gspx-G06-550.90.07-1.1.x86_64

* * *

Side remarks:

(a) Contrary to the classic drivers, the open kernel driver offers the pageableMemoryAccess property, which permits, via the Linux kernel's HMM support, migrating memory pages to/from the device when a page is accessed. That's used, e.g., by GCC 15 (mainline) with OpenMP offload support when Unified Shared Memory (USM) has been requested. See https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html / https://gcc.gnu.org/gcc-15/changes.html

(b) The open kernel driver permits showing the screen on both an external monitor and the laptop screen, which didn't work with the default/classic driver.

(c) The more recent classic (non-open-kernel) driver also tended to crash occasionally: either a reboot (typically when doing 'zypper dup'; possibly due to some systemd interaction) or a freeze with a kernel failure (blinking shift lock; not even SysRq worked), which is a known but unsolved issue for the 550 driver according to the Nvidia Linux forum.

Thus, except for the issue reported in this bug, the open kernel driver is better. :-) And it is the future (said to be the default with Nvidia's 555 driver).
Looking at https://forums.developer.nvidia.com/c/gpu-graphics/linux/148

Today, there was a reply to an issue reported by someone else, pointing to
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/472

That issue has plenty of comments and was opened Mar 11, 2023 for 525.85.05. Glancing through that issue:

* I didn't see my assert
* but 'MMUTest BAR0 window offset 0x70e000 returned garbage 0x0' showed up in one comment; the variant with 'f' instead of 'e' in the hex address showed up in another, more recent comment.

Plus:

* On May 21, 2024, a comment was:

> We missed calling it out in the changelog explicitly (oops), but this
> should be fixed with 555.42.02. Please test. I'll leave this
> bug open while 555.xx is still in beta.

A bit later, some user reported:

> I am also getting a crash on suspend once in a blue moon.
> Here's the logs whenever the crash happens: [...]

With the reply (on June 4, 2024):

> Acknowledged the crash on suspend issue, we have filed a bug 4683310
> internally for tracking purpose.
I enabled the NVreg_PreserveVideoMemoryAllocations kernel option and these suspend/hibernate services some time ago.

-------------------------------------------------------------------
Mon Nov 21 12:30:46 UTC 2022 - Stefan Dirsch <sndirsch@suse.com>

- NVreg_PreserveVideoMemoryAllocations kernel option and enabled
  services nvidia-suspend, nvidia-resume and nvidia-hibernate now
  needed for GNOME Wayland (gdm) since

  commit 51181871e9db716546e9593216220389de0d8b03
  Author: Ray Strode <rstrode@redhat.com>
  Date: Fri Mar 4 14:11:03 2022 -0500

      data: Disable wayland on nvidia if suspend is broken

If you don't use GNOME Wayland, you may get rid of the kernel option and disable the services again.
For this you need to edit

  /usr/lib/modprobe.d/50-nvidia-default.conf

and run

  dracut --force
  systemctl disable nvidia-suspend.service
  systemctl disable nvidia-hibernate.service
  systemctl disable nvidia-resume.service

and reboot your system. Does that help?
> For this you need to edit
> /usr/lib/modprobe.d/50-nvidia-default.conf

I guess you mean: /usr/lib/modprobe.d/59-nvidia-default.conf
of nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64

I can confirm that with /proc/driver/nvidia/params showing
  PreserveVideoMemoryAllocations: 0
suspending and waking up works. I also do not see any glitches, but I have not tried much.

nvidia-smi also works.

* * *

Initially, I forgot to run:

  dracut --force
  systemctl disable nvidia-suspend.service
  systemctl disable nvidia-hibernate.service
  systemctl disable nvidia-resume.service

At least 'status' for suspend/resume showed that the services were still enabled, but even with them enabled, it worked as described above.
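A small sketch of how the setting can be read back after a reboot. The helper name preserve_vidmem_value is hypothetical; it assumes the "Name: value" line format shown above for /proc/driver/nvidia/params and reads that text from standard input.

```shell
# Print the value of PreserveVideoMemoryAllocations from text in the
# "Name: value" format of /proc/driver/nvidia/params (read from stdin).
preserve_vidmem_value() {
    # Split each line at the colon (plus any following spaces) and print
    # the value of the one parameter of interest.
    awk -F': *' '$1 == "PreserveVideoMemoryAllocations" { print $2 }'
}

# Possible uses on a system with the driver loaded (not run here):
#   preserve_vidmem_value < /proc/driver/nvidia/params
#   systemctl is-enabled nvidia-suspend.service nvidia-hibernate.service nvidia-resume.service
```

A result of 0 corresponds to the working configuration described above; 1 would mean the option is still active.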
(In reply to Tobias Burnus from comment #4)
> > For this you need to edit
> > /usr/lib/modprobe.d/50-nvidia-default.conf
>
> I guess you mean: /usr/lib/modprobe.d/59-nvidia-default.conf
> of nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64

Yes, you're right. I apologize.

> I can confirm that with /proc/driver/nvidia/params
> PreserveVideoMemoryAllocations: 0
> suspending + unsuspending/waking up works and I also do not see any
> glitches, but I have not tried much.
>
> nvidia-smi also works.
>
> * * *
>
> Initially, I forgot to run:
> dracut --force
> systemctl disable nvidia-suspend.service
> systemctl disable nvidia-hibernate.service
> systemctl disable nvidia-resume.service
>
> and at least 'status' for suspend/resume showed that they were enabled, but
> also with them enabled it did work as described above.

That's interesting, but I'm afraid it's needed when using GNOME Wayland. May I ask which desktop you're using?
> May I ask which desktop you're using? KDE Plasma 6 (Wayland)
I'm adding my contact at nvidia here. Could be interesting for them I believe.
21F. KNOWN ISSUES AND WORKAROUNDS

  o On some systems, where the default suspend mode is '"s2idle"', the
    system may not resume properly due to a known timing issue in the
    kernel. The suspend mode can be verified by reading the contents of
    the file '/sys/power/mem_sleep'. The following upstream kernel
    changes have been proposed to fix the issue:

    https://lore.kernel.org/linux-pci/20190927090202.1468-1-drake@endlessm.com/
    https://lore.kernel.org/linux-pci/20190821124519.71594-1-mika.westerberg@linux.intel.com/

    In the interim, the default suspend mode on the affected systems
    should be set to '"deep"' using the kernel command line parameter
    '"mem_sleep_default"' - 'mem_sleep_default=deep'

I guess it's worth a try, instead of removing PreserveVideoMemoryAllocations and thereby switching back from /proc/driver/nvidia/suspend to the kernel driver callback. So could you please give this kernel option a try and re-add PreserveVideoMemoryAllocations to the nvidia options in the modprobe.d snippet?

Also, I double-checked that PreserveVideoMemoryAllocations is needed when you want to make use of

  nvidia-suspend.service
  nvidia-hibernate.service
  nvidia-resume.service

This is checked by gdm. If these services are not enabled when using the nvidia driver, the Wayland session isn't offered. Only Wayland.
> This is checked by gdm. If these services are not enabled when using nvidia driver
> Wayland session isn't offered. Only Wayland.

Only X11 of course!
Currently:

  $ cat /sys/power/mem_sleep
  [s2idle]

I will check the mem_sleep_default=deep kernel parameter later.

* * *

Let's link the GNOME gdm commit mentioned in this bug, for easier access:
https://gitlab.gnome.org/GNOME/gdm/-/commit/51181871e9db716546e9593216220389de0d8b03
"deep" does not seem to be supported on this system (which has an i7-13800H, Raptor Lake-P 6p+8e cores Host Bridge/DRAM Controller, ...). dmesg confirms: Kernel command line: ... mem_sleep_default=deep But: tux:~ # cat /sys/power/mem_sleep s2idle https://www.kernel.org/doc/Documentation/power/states.txt states: "These modes are "s2idle" (Suspend-To-Idle), "shallow" (Power-On Suspend) and "deep" (Suspend-To-RAM). The "s2idle" mode is always available, while the other ones are only available if supported by the platform (if not supported, the strings representing them are not present in /sys/power/mem_sleep). The string representing the suspend mode to be used subsequently is enclosed in square brackets." Which shows implies that I only have one. Albeit currently it is "s2idle" while before it was "[s2idle]". Especially given that it is a laptop, I wonder why only Suspend-To-Idle / S0 / "s2idle" ("freeze") and not Standby / Power-On Suspend / S1 / "shallow" ("standby") Suspend-to-RAM / S3 / "deep" seem to be available.
Thanks a lot for testing. Yeah, I think newer systems like yours only offer "s2idle" by now.