Bug 1226116 - nvidia-open-driver-G06 ... 550.90.07: Fails suspend/recover from powersaving with NVreg_PreserveVideoMemoryAllocations (needed for GNOME Wayland)
Status: IN_PROGRESS
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: X11 3rd Party Driver
Version: Current
Hardware: Other Other
Priority: P3 - Medium
Severity: Normal
Target Milestone: ---
Assignee: Stefan Dirsch
QA Contact: Stefan Dirsch
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-08 18:56 UTC by Tobias Burnus
Modified: 2024-06-13 21:15 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
nvidia-bug-report.log.gz from nvidia-bug-report.sh --safe-mode (851.02 KB, application/gzip)
2024-06-08 18:56 UTC, Tobias Burnus
Description Tobias Burnus 2024-06-08 18:56:46 UTC
Created attachment 875386 [details]
nvidia-bug-report.log.gz from nvidia-bug-report.sh --safe-mode

[That's on a Lenovo P1 Gen 6 with "NVIDIA RTX A1000 6GB Laptop GPU".]

Entering suspend looks quite normal (the screen goes dark and the power light blinks).

However, when one presses a button:

(1) it shows the terminal Window - not the Wayland desktop -
    with two unrelated warnings:
i2c_hid_acpi i2c-ELAN0686:00: i2c_hid_get_input: incomplete report (31/65280)
iwlwifi 0000:00:14.3: WRT: Invalid buffer destination

(2) Switching to Alt+F2 shows a login screen, but after that nothing
    responds, neither on that screen nor via Alt-Shift-F... (only
    SysRq-{S,U,B} still works).

When skipping (2):
(3) When going to a terminal (e.g. Alt-F1), login is possible.
Getting some diagnostic output:

* nvidia-bug-report.sh stops early because an access to /proc/driver/nvidia/gpus/0000:01:00.0/information (I think it was that file) got stuck (lsof -p<pid> showed that file); likewise, nvidia-smi did not output anything (both could be interrupted with Ctrl-C).

* A reboot failed with some message by systemd related to nvidia-*.service;
  I thought it was nvidia-powerd.service, but looking at the logs, it
  could be also nvidia-suspend.service. In any case, it stated something that
  disabling (or enabling?) wasn't possible while the service was currently
  enabled (or disabled?).

* "nvidia-bug-report.sh --safe-mode"
  this did work → see attached file.

* * *

dmesg showed many lines of the form:


kernel: NVRM: kbusVerifyBar2_GM107: MMUTest BAR0 window offset 0x70e000 returned garbage 0x0


The attached .gz file from "nvidia-bug-report.sh --safe-mode" contains both dmesg and "journalctl -b -0" and has the line above 1,750,353 times.
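
A count like this can be reproduced from the attached archive with a few lines of Python. This is only a minimal sketch: the helper names `count_matching_lines` and `count_in_gzip` are made up for illustration, and it assumes the attachment is a plain gzip-compressed text file.

```python
import gzip
from typing import Iterable

# Substring of the NVRM message quoted above (without the variable offset).
NEEDLE = "kbusVerifyBar2_GM107: MMUTest BAR0 window offset"


def count_matching_lines(lines: Iterable[str], needle: str = NEEDLE) -> int:
    """Return how many of the given lines contain the substring."""
    return sum(1 for line in lines if needle in line)


def count_in_gzip(path: str, needle: str = NEEDLE) -> int:
    """Count matching lines in a gzip-compressed text log, line by line."""
    # errors="replace" avoids aborting on bytes that are not valid UTF-8.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return count_matching_lines(handle, needle)
```

For example, `count_in_gzip("nvidia-bug-report.log.gz")` would scan the attachment without unpacking it to disk first. Note that the archive contains both dmesg and journalctl output, so the same kernel message may be counted once per source.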

The "journalctl -b -0" output it contains (→ attachment) has:


Jun 07 20:59:33 tux.net-b.de /usr/bin/nvidia-powerd[1477]: Dbus Connection is established
Jun 07 21:29:48 tux.net-b.de suspend[3509]: nvidia-suspend.service
Jun 07 21:29:48 tux.net-b.de logger[3509]: <13>Jun  7 21:29:48 suspend: nvidia-suspend.service

Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvCheckFailedNoLog: Check failed: pMemDesc->_pInternalMapping != NULL @ mem_desc.c:2260
Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertFailedNoLog: Assertion failed: 0 @ mem_utils.c:574
Jun 07 21:29:48 tux.net-b.de kernel: NVRM: nvAssertOkFailedNoLog: Assertion failed: Ran out of a critical resource, other than memory [NV_ERR_INSUFFICIENT_RESOURCES] (0x0000001A) returned from memmgrMemCopy(pMemoryManager, &sysSurface, &vidSurface, copySize, TRANSFER_FLAGS_PREFER_CE) @ fbsr_gm107.c:1156

* * *

I think the issue occurred during suspend and not while waking up the system, but I might be mistaken. I thought the wall-clock time showed it, but I am not completely sure because I woke the system up quite quickly; however, the quoted assertion fails directly after nvidia-suspend.service, which implies that it happens during the suspend.

BTW: With the older 550.78 driver, leaving the laptop alone for a while (→ power save mode) ended with a reboot, or with the terminal being shown briefly (similar output as above) before rebooting. Thus, the 550.78 issue was definitely a suspend/power-save issue. [The triggered reboot would be harder to diagnose than the issue I have now.]

* * *

Installed nvidia packages (rpm -qa '*nvidia*') - all are now 550.90.07-23.1:

nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64
kernel-firmware-nvidia-20240519-1.1.noarch
nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
nvidia-gl-G06-550.90.07-23.1.x86_64
nvidia-video-G06-32bit-550.90.07-23.1.x86_64
nvidia-compute-utils-G06-550.90.07-23.1.x86_64
nvidia-video-G06-550.90.07-23.1.x86_64
nvidia-utils-G06-550.90.07-23.1.x86_64
libnvidia-egl-wayland1-1.1.13-1.3.x86_64
nvidia-compute-G06-550.90.07-23.1.x86_64
nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
kernel-firmware-nvidia-gspx-G06-550.90.07-1.1.x86_64

* * *

Side remarks:

(a) Contrary to the classic drivers, the open kernel driver offers the pageableMemoryAccess property, which, via Linux kernel HMM support, permits migrating memory pages to/from the device when a page is accessed. That's used, e.g., by GCC 15 (mainline) with OpenMP offload support when Unified-Shared Memory (USM) has been requested. See https://gcc.gnu.org/onlinedocs/libgomp/nvptx.html / https://gcc.gnu.org/gcc-15/changes.html

(b) The open kernel driver permits showing the screen on both an external monitor and the laptop screen, which didn't work with the default/classic driver.

(c) The more recent classic (non-open-kernel) driver also tended to crash occasionally: either a reboot (typically when doing 'zypper dup'; possibly due to some systemd interaction) or a freeze with a kernel failure (blinking shift lock; not even SysRq worked), which is a known but unsolved issue for the 550 driver according to the Nvidia Linux forum.

Thus, except for the issue reported in this bug, the open kernel driver is better. :-)
And it is the future (said to be the default with Nvidia's 555 driver).
Comment 1 Tobias Burnus 2024-06-08 21:05:59 UTC
Looking at https://forums.developer.nvidia.com/c/gpu-graphics/linux/148
Today, there was a reply to an issue reported by someone else, pointing to
https://github.com/NVIDIA/open-gpu-kernel-modules/issues/472

That issue has plenty of comments and was opened Mar 11, 2023 for 525.85.05.

Glancing through that issue:
* I didn't see my assert
* but 'MMUTest BAR0 window offset 0x70e000 returned garbage 0x0' showed up in one comment; the variant with 'f' instead of 'e' in the hex address showed up in another more recent comment.

Plus:

* On May 21, 2024, a comment said:
> We missed calling it out in the changelog explicitly (oops), but this
> should be fixed with 555.42.02. Please test. I'll leave this
> bug open while 555.xx is still in beta.

A bit later, some user reported:
> I am also getting a crash on suspend once in a blue moon.
> Here's the logs  whenever the crash happens: [...]

With the reply (on June 4, 2024):
> Acknowledged the crash on suspend issue, we have filed a bug 4683310
> internally for tracking purpose.
Comment 2 Stefan Dirsch 2024-06-09 08:00:34 UTC
I enabled the NVreg_PreserveVideoMemoryAllocations kernel option and the suspend/hibernate/resume services some time ago.

-------------------------------------------------------------------
Mon Nov 21 12:30:46 UTC 2022 - Stefan Dirsch <sndirsch@suse.com>

- NVreg_PreserveVideoMemoryAllocations kernel option and enabled
  services nvidia-suspend, nvidia-resume and nvidia-hibernate now
  needed for GNOME Wayland (gdm) since
    commit 51181871e9db716546e9593216220389de0d8b03
    Author: Ray Strode <rstrode@redhat.com>
    Date:   Fri Mar 4 14:11:03 2022 -0500

      data: Disable wayland on nvidia if suspend is broken

If you don't use GNOME Wayland, you may remove the kernel option and disable the services again.
Comment 3 Stefan Dirsch 2024-06-12 10:41:33 UTC
For this you need to edit 

  /usr/lib/modprobe.d/50-nvidia-default.conf

and run

  dracut --force
  systemctl disable nvidia-suspend.service
  systemctl disable nvidia-hibernate.service
  systemctl disable nvidia-resume.service

and reboot your system. Does that help?
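
The edit itself amounts to dropping one token from the `options nvidia ...` line. As a rough sketch only: the exact contents of the packaged file are not shown in this bug, so the sample line in the usage note is an assumption, and `drop_module_option` is a hypothetical helper, not part of any tool. It also assumes option values contain no quoted spaces.

```python
def drop_module_option(conf_text: str, module: str, option: str) -> str:
    """Remove one option from 'options <module> ...' lines of a modprobe.d file.

    Lines for other modules, comments and blank lines are kept unchanged.
    An 'options' line that ends up with no options at all is dropped.
    """
    out = []
    for line in conf_text.splitlines():
        fields = line.split()
        is_target = len(fields) >= 2 and fields[0] == "options" and fields[1] == module
        if not is_target:
            out.append(line)
            continue
        # Compare only the option name, i.e. the part before '='.
        kept = [f for f in fields[2:] if f.split("=", 1)[0] != option]
        if kept:
            out.append(" ".join(fields[:2] + kept))
    return "\n".join(out) + ("\n" if out else "")
```

Given an assumed line such as `options nvidia NVreg_DeviceFileMode=0660 NVreg_PreserveVideoMemoryAllocations=1`, calling `drop_module_option(text, "nvidia", "NVreg_PreserveVideoMemoryAllocations")` leaves `options nvidia NVreg_DeviceFileMode=0660`. The initrd still has to be regenerated with `dracut --force` afterwards, as described above.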
Comment 4 Tobias Burnus 2024-06-12 16:23:50 UTC
> For this you need to edit 
> /usr/lib/modprobe.d/50-nvidia-default.conf

I guess you mean: /usr/lib/modprobe.d/59-nvidia-default.conf
of nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64

I can confirm that with /proc/driver/nvidia/params
  PreserveVideoMemoryAllocations: 0
suspending + unsuspending/waking up works and I also do not see any glitches, but I have not tried much.

nvidia-smi also works.
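
For anyone who wants to script this check: /proc/driver/nvidia/params consists of "Name: value" lines, as in the excerpt above, so it can be read with a small parser. A minimal sketch, with hypothetical function names:

```python
from typing import Dict


def parse_nvidia_params(text: str) -> Dict[str, str]:
    """Parse 'Name: value' lines as found in /proc/driver/nvidia/params."""
    params = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that contain no colon
            params[key.strip()] = value.strip()
    return params


def preserve_allocations_enabled(text: str) -> bool:
    """Return True if PreserveVideoMemoryAllocations has a non-zero value."""
    value = parse_nvidia_params(text).get("PreserveVideoMemoryAllocations", "0")
    try:
        return int(value) != 0
    except ValueError:
        # Treat an unparsable value as "not enabled" in this sketch.
        return False
```

On a running system the input would come from something like `open("/proc/driver/nvidia/params").read()`; with the setting shown above, `preserve_allocations_enabled` returns False.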

* * *

Initially, I forgot to run:
  dracut --force
  systemctl disable nvidia-suspend.service
  systemctl disable nvidia-hibernate.service
  systemctl disable nvidia-resume.service

and at least 'status' for suspend/resume showed that they were enabled, but also with them enabled it did work as described above.
Comment 5 Stefan Dirsch 2024-06-12 19:21:28 UTC
(In reply to Tobias Burnus from comment #4)
> > For this you need to edit 
> > /usr/lib/modprobe.d/50-nvidia-default.conf
> 
> I guess you mean: /usr/lib/modprobe.d/59-nvidia-default.conf
> of nvidia-open-driver-G06-signed-kmp-default-550.90.07_k6.9.3_1-1.1.x86_64

Yes, you're right. I apologize.

> I can confirm that with /proc/driver/nvidia/params
>   PreserveVideoMemoryAllocations: 0
> suspending + unsuspending/waking up works and I also do not see any
> glitches, but I have not tried much.
> 
> nvidia-smi also works.
> 
> * * *
> 
> Initially, I forgot to run:
>   dracut --force
>   systemctl disable nvidia-suspend.service
>   systemctl disable nvidia-hibernate.service
>   systemctl disable nvidia-resume.service
> 
> and at least 'status' for suspend/resume showed that they were enabled, but
> also with them enabled it did work as described above.

That's interesting, but I'm afraid it's needed when using GNOME Wayland. May I ask which desktop you're using?
Comment 6 Tobias Burnus 2024-06-13 06:09:34 UTC
> May I ask which desktop you're using?

KDE Plasma 6 (Wayland)
Comment 7 Stefan Dirsch 2024-06-13 08:09:20 UTC
I'm adding my contact at nvidia here. Could be interesting for them I believe.
Comment 8 Stefan Dirsch 2024-06-13 11:19:36 UTC
21F. KNOWN ISSUES AND WORKAROUNDS

   o On some systems, where the default suspend mode is '"s2idle"', the system
     may not resume properly due to a known timing issue in the kernel. The
     suspend mode can be verified by reading the contents of the file
     '/sys/power/mem_sleep'. The following upstream kernel changes have been
     proposed to fix the issue:

           https://lore.kernel.org/linux-pci/20190927090202.1468-1-drake@endlessm.com/     

           https://lore.kernel.org/linux-pci/20190821124519.71594-1-mika.westerberg@linux.intel.com/     

     In the interim, the default suspend mode on the affected systems should
     be set to '"deep"' using the kernel command line parameter
     '"mem_sleep_default"' -

      'mem_sleep_default=deep'

I guess it's worth a try, instead of removing PreserveVideoMemoryAllocations and thereby switching back from /proc/driver/nvidia/suspend to the kernel driver callback.

So could you please give this kernel option a try and re-add PreserveVideoMemoryAllocations to the nvidia options in the modprobe.d snippet?

Also I double checked that PreserveVideoMemoryAllocations is needed when you want to make use of 

nvidia-suspend.service
nvidia-hibernate.service
nvidia-resume.service

This is checked by gdm. If these services are not enabled when using nvidia driver Wayland session isn't offered. Only Wayland.
Comment 9 Stefan Dirsch 2024-06-13 11:22:27 UTC
> This is checked by gdm. If these services are not enabled when using nvidia driver 
> Wayland session isn't offered. Only Wayland.

Only X11 of course!
Comment 10 Tobias Burnus 2024-06-13 12:06:33 UTC
Currently:
$ cat /sys/power/mem_sleep 
[s2idle]

I will check the mem_sleep_default=deep kernel parameter later.

* * *

Let's link the GNOME gdm commit, mentioned in this bug, for easier access:
https://gitlab.gnome.org/GNOME/gdm/-/commit/51181871e9db716546e9593216220389de0d8b03
Comment 11 Tobias Burnus 2024-06-13 19:30:01 UTC
"deep" does not seem to be supported on this system (which has an i7-13800H, Raptor Lake-P 6p+8e cores Host Bridge/DRAM Controller, ...).

dmesg confirms:

Kernel command line: ... mem_sleep_default=deep

But:
tux:~ # cat /sys/power/mem_sleep 
s2idle

https://www.kernel.org/doc/Documentation/power/states.txt states:

"These modes are "s2idle" (Suspend-To-Idle), "shallow" (Power-On Suspend) and "deep" (Suspend-To-RAM). The "s2idle" mode is always available, while the other ones are only available if supported by the platform (if not supported, the strings representing them are not present in /sys/power/mem_sleep).  The string representing the suspend mode to be used subsequently is enclosed in square brackets."

Which implies that I only have one, although it currently reads "s2idle" while before it was "[s2idle]".
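
The format described in that quote can be parsed mechanically. Here is a minimal sketch (the name `parse_mem_sleep` is made up) that returns the available modes and the bracketed one, if any:

```python
from typing import List, Optional, Tuple


def parse_mem_sleep(text: str) -> Tuple[List[str], Optional[str]]:
    """Parse the contents of /sys/power/mem_sleep.

    Returns (available_modes, selected_mode); selected_mode is None when
    no entry is enclosed in square brackets.
    """
    modes = []
    selected = None
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            selected = token
        modes.append(token)
    return modes, selected
```

Applied to the two outputs seen in this bug, "[s2idle]" gives one available mode that is also selected, while the later "s2idle" gives the same single mode with nothing marked as selected.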

Especially given that it is a laptop, I wonder why only
  Suspend-To-Idle / S0 / "s2idle" ("freeze")
and not
  Standby / Power-On Suspend / S1 / "shallow" ("standby")
  Suspend-to-RAM / S3 / "deep"
seem to be available.
Comment 12 Stefan Dirsch 2024-06-13 21:15:45 UTC
Thanks a lot for testing. Yeah, I think on newer systems like yours only "s2idle" is left by now.