Bug 1226055 - NVIDIA driver 550.90 broken, plus no boot option for kernel 6.9.3
Status: RESOLVED INVALID
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: X11 3rd Party Driver
Version: Current
Hardware: x86-64 openSUSE Tumbleweed
Importance: P3 - Medium : Normal
Target Milestone: ---
Assignee: Stefan Dirsch
QA Contact: Stefan Dirsch
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-06 16:59 UTC by Gerald Chen
Modified: 2024-06-08 09:14 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Gerald Chen 2024-06-06 16:59:54 UTC
Hi. NVIDIA driver 550.90 came out in the Tumbleweed repo, so I upgraded (the system was on snapshot 20240531), and things broke.

After rebooting, the NVIDIA drivers could not be loaded. `nvidia-smi` said `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.`, even though the NVIDIA drivers were indeed installed. There was no `nvidia_drm` in `lsmod`. I tried `dracut --force --regenerate-all` with no improvement. Re-installing the drivers and re-enrolling the public key did not help either. I also tried disabling Secure Boot, which sadly made no difference.

I went back to the snapshot taken before the update and found that, even though the read-only snapshot selection menu in GRUB reported the kernel version as 6.9.3, `uname -r` after booting into the snapshot still said 6.9.1. In fact, that's the case in several snapshots. Also, 6.9.3 could not be found in `YaST Boot Loader` > `Bootloader Options` > `Default Boot Section`. I tried `update-bootloader --reinit` and `update-bootloader --refresh` and then rebooted, with no improvement.

I've updated to snapshot 20240605 and it's still broken.
Comment 1 Stefan Dirsch 2024-06-06 18:28:09 UTC
Hmm, which nvidia packages do you have installed currently? Check this and let me know:

  rpm -qa | grep nvidia
Comment 2 Gerald Chen 2024-06-06 22:39:55 UTC
Currently I have these nvidia packages installed:

nvidia-compute-G06-550.90.07-23.1.x86_64
nvidia-gl-G06-550.90.07-23.1.x86_64
libva-nvidia-driver-0.0.12-1.3.x86_64
nvidia-video-G06-550.90.07-23.1.x86_64
nvidia-driver-G06-kmp-default-550.90.07_k6.9.3_1-23.1.x86_64
nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
kernel-firmware-nvidia-20240519-1.1.noarch
openSUSE-repos-Tumbleweed-NVIDIA-20240516.5431918-2.1.x86_64
nvidia-video-G06-32bit-550.90.07-23.1.x86_64
nvidia-compute-utils-G06-550.90.07-23.1.x86_64
nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
libnvidia-egl-wayland1-1.1.13-1.3.x86_64

libva-nvidia-driver is installed from the X11:XOrg repo.

Thanks in advance.
Comment 3 Stefan Dirsch 2024-06-07 08:09:06 UTC
package-wise it looks good. I suggest to reinstall nvidia-driver-G06-kmp-default and see what happens, like if driver build fails or so.

 sudo rpm -e nvidia-driver-G06-kmp-default --nodeps
 sudo zypper in nvidia-driver-G06-kmp-default

Then reboot and accept the certificate for secureboot. If still nvidia module cannot be loaded try this.

 sudo dmesg -c > /dev/null
 sudo modprobe nvidia
 sudo dmesg

Attach the last dmesg output

Also please attach the output of

 sudo inxi -aG
Comment 4 Scott Bradnick 2024-06-07 15:21:37 UTC
I'm not sure this is helpful, but I'll add it in case it is.

I use vfio on a Dell with a T1000 and pass that discrete card to either a TW Qemu VM or Win11 Qemu VM depending on which one I want to use. Didn't have too much trouble w/ nvidia <= 550.67 and kernel <= 6.9.1. But the combo of 550.90 & 6.9.3 was a much more painful experience (for this machine). I won't claim to have any idea why, but 9 times out of 10 the Dell wouldn't boot to X and w/in 3 minutes would lockup w/ some type of vfio "cold" lockup and I'd have to hard-reset it.

No manner of trying to blacklist vfio would stop it from showing up in lsmod output; neither would commenting out vfio-related items in /etc/modprobe.d and /etc/modules-load.d - it always showed back up. I removed nvidia-driver-G06-kmp-default and tried to reinstall it - locked up again w/ vfio before it seemed like the install completed, but I was prompted w/ a MOK enroll after the hard-reset.

Only success I had was after "# modprobe --remove <each vfio module>" was run, then using `rpm -evh` to remove G06 and reinstalling G06 completed successfully and it seems the system is ACTUALLY not loading vfio as I'd expect considering they're still commented out. System hasn't locked up and seems happier, I'll check another day if I can re-enable vfio and GPU passthrough works again.

Oddly, I have another up-to-date TW system w/ an AMD CPU and a 3070ti that didn't have any of these problems. This Dell is nothing but trouble.
Comment 5 Gerald Chen 2024-06-07 15:32:27 UTC
(In reply to Stefan Dirsch from comment #3)
> package-wise it looks good. I suggest to reinstall
> nvidia-driver-G06-kmp-default and see what happens, like if driver build
> fails or so.
> 
>  sudo rpm -e nvidia-driver-G06-kmp-default --nodeps
>  sudo zypper in nvidia-driver-G06-kmp-default
> 
> Then reboot and accept the certificate for secureboot. If still nvidia
> module cannot be loaded try this.
> 
>  sudo dmesg -c > /dev/null
>  sudo modprobe nvidia
>  sudo dmesg
> 
> Attach the last dmesg output
> 
> Also please attach the output of
> 
>  sudo inxi -aG

Thank you Mr. Dirsch. I reinstalled `nvidia-driver-G06-kmp-default` and re-enrolled the MOK key as suggested. However, after reboot, `modprobe nvidia` said:

modprobe: ERROR: could not find module by name='nvidia'
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)

which seemed bizarre to me. After updating to snapshot 20240606 and rebooting, I tried the procedure again but got the same result.

Also, `inxi -aG` reported:

Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] vendor: Dell driver: N/A
    alternate: nouveau non-free: 550.xx+ status: current (as of 2024-04; EOL~2026-12-xx)
    arch: Ampere code: GAxxx process: TSMC n7 (7nm) built: 2020-2023 pcie: gen: 3 speed: 8 GT/s
    lanes: 8 link-max: gen: 4 speed: 16 GT/s lanes: 16 bus-ID: 01:00.0 chip-ID: 10de:2560
    class-ID: 0300
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series] vendor: Dell driver: amdgpu
    v: kernel arch: GCN-5 code: Vega process: GF 14nm built: 2017-20 pcie: gen: 3 speed: 8 GT/s
    lanes: 16 link-max: gen: 4 speed: 16 GT/s ports: active: eDP-1 empty: none bus-ID: 05:00.0
    chip-ID: 1002:1638 class-ID: 0300 temp: 52.0 C
  Device-3: Microdia Integrated_Webcam_HD driver: uvcvideo type: USB rev: 2.0 speed: 480 Mb/s
    lanes: 1 mode: 2.0 bus-ID: 1-4:6 chip-ID: 0c45:6a09 class-ID: 0e02
  Display: server: X.org v: 1.21.1.12 with: Xwayland v: 24.1.0 compositor: kwin_wayland driver:
    X: loaded: modesetting unloaded: fbdev,vesa dri: radeonsi gpu: amdgpu tty: 213x44
  Monitor-1: eDP-1 model: LG Display 0x067e built: 2020 res: 1920x1080 dpi: 142 gamma: 1.2
    size: 344x194mm (13.54x7.64") diag: 395mm (15.5") ratio: 16:9 modes: max: 1920x1080 min: 640x480
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi device: 1 drv: swrast
    surfaceless: drv: radeonsi inactive: gbm,wayland,x11
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: mesa v: 24.0.8 note: console (EGL sourced)
    renderer: AMD Radeon Graphics (radeonsi renoir LLVM 18.1.6 DRM 3.57 6.9.1-1-default), llvmpipe
    (LLVM 18.1.6 256 bits)
  API: Vulkan v: 1.3.283 layers: 2 device: 0 type: integrated-gpu name: AMD Radeon Graphics
    (RADV RENOIR) driver: N/A device-ID: 1002:1638 surfaces: N/A
Comment 6 Stefan Dirsch 2024-06-07 17:48:09 UTC
Ok. Your GPU is supported by the driver. Please try this.

sudo dmesg -c > /dev/null
sudo modprobe nvidia
sudo dmesg

Thanks!
Comment 7 Gerald Chen 2024-06-07 18:09:27 UTC
(In reply to Stefan Dirsch from comment #6)
> Ok. Your GPU is supported by the driver. Please try this.
> 
> sudo dmesg -c > /dev/null
> sudo modprobe nvidia
> sudo dmesg
> 
> Thanks!

It said:

modprobe: ERROR: could not find module by name='nvidia'
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)

after `modprobe nvidia`.

And the second `dmesg` had no output, as expected.
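
The first modprobe error above suggests that no `nvidia` module is present under the module directory of the running kernel. A minimal sketch for listing which installed kernels do have it, assuming the usual module root `/lib/modules`; `kernels_with_module` is a hypothetical helper, not an existing tool:

```shell
# Hypothetical helper: print every kernel directory under a module root
# that contains a module file named <name>.ko (optionally compressed).
kernels_with_module() {
    modroot=$1
    name=$2
    for dir in "$modroot"/*/; do
        [ -d "$dir" ] || continue
        # Accept regular files and symlinks (e.g. weak-updates links).
        if find "$dir" \( -type f -o -type l \) -name "$name.ko*" 2>/dev/null |
            grep -q .; then
            basename "$dir"
        fi
    done
}

# Example (assumes the default module root):
#   kernels_with_module /lib/modules nvidia
#   uname -r    # compare against the list printed above
```

If the release printed by `uname -r` is not in the list, the driver was most likely built for a different kernel than the one that is running.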
Comment 8 Scott Bradnick 2024-06-07 20:30:46 UTC
Just a little update from me, not that anyone asked :P (I'm not looking to take this bug over, but it's currently the only 6.9.3 bug out there and if anyone else is having problems, hopefully they'd see this and decide if they need to open their own. I apologize, Gerald, if this is worthless chatter in your report).

I don't think there's an issue w/ 550.90, I think there's some oddness w/ 6.9.3, but it appears to be more of an issue w/ prime laptops than desktops w/ discrete cards.

I have a desktop passing a 3070ti to a TW Qemu VM, it was on 6.9.1 w/ 550.78 and all was fine pre-`zypper dup`. After the dup, the worst I could get to happen was that 6.9.1 tried to use 550.78 and reported:

Failed to initialize NVML: Driver/library version mismatch
NVML library version: 550.90

But it's fine w/ 6.9.3 and I'd assume I could have 6.9.1 rebuild w/ 550.90 if for some reason that was a desired setup, which it isn't presently.

The Dell, it's a different story (of course). It's still hit-or-miss on if 6.9.1 or 6.9.3 boot into X and/or throw the "cold" issue w/ the T1000. Right now, I've got it booted in 6.9.3 running a TW VM (on 6.9.3 + 550.90.07, reporting the T1000 via `nvidia-smi -L`) and all seems fine, been up ~20 minutes.

Other than the Dell's inability to consistently boot w/o problem(s), and even though it was working fine even with a VM using the T1000, running `inxi --graphics` either takes >= 30s to run or hangs and basically causes the hard-lock.

I'm about to the point where I just don't turn the thing on ...
Comment 9 Gerald Chen 2024-06-08 06:15:45 UTC
Hey guys! I worked this out! And the real cause behind all this is truly bizarre.

As the title mentioned, there was no boot option for kernel 6.9.3, even though it had been installed and I had done `dracut -f`. And the NVIDIA driver 550.90 was built for kernel 6.9.3, as the full name of the package mentioned in #3 by Mr. Dirsch was "nvidia-driver-G06-kmp-default-550.90.07_k6.9.3_1-23.1.x86_64". I always ended up booting into kernel 6.9.1, so the NVIDIA driver could not be properly loaded, I guess.
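
The mismatch described here can be checked directly from the package name. A minimal sketch, assuming the `_k<kernel version>_<release>-` naming pattern seen in the package list in comment 2 holds in general; `kmp_kernel_version` and `kmp_matches_running` are hypothetical helpers, not part of rpm or the NVIDIA packages:

```shell
# Hypothetical helpers: not part of rpm, zypper or the NVIDIA packages.

# Print the kernel version-release a KMP package name refers to,
# e.g. "..._k6.9.3_1-23.1.x86_64" -> "6.9.3-1".
# Assumes the "_k<version>_<release>-" naming pattern from comment 2.
kmp_kernel_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*_k\([0-9][0-9.]*\)_\([0-9][0-9.]*\)-.*/\1-\2/p'
}

# Succeed if the running kernel release (as printed by `uname -r`)
# starts with the kernel version encoded in the KMP package name.
kmp_matches_running() {
    kver=$(kmp_kernel_version "$1")
    [ -n "$kver" ] || return 2      # package name did not match the pattern
    case "$2" in
        "$kver"-*|"$kver") return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   kmp_matches_running "$(rpm -q nvidia-driver-G06-kmp-default)" "$(uname -r)" ||
#       echo "driver package targets a different kernel than the running one"
```

A mismatch is only a hint, not proof, but with the values from this report (package for 6.9.3-1, running 6.9.1-1-default) the check would fail.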

While investigating the output of `update-bootloader` mentioned in the original post, I happened to notice that the GRUB menu could not be updated because of an error in /etc/grub.d/00_tuned. That script requires /etc/tuned, which I had deleted because I don't use tuned, and the failure prevented the boot option for kernel 6.9.3 from being generated. So I reinstalled tuned to recreate /etc/tuned, re-ran `update-bootloader` and `dracut`, and rebooted. Voilà! My system went back to normal and the NVIDIA driver got properly loaded.
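
For anyone hitting something similar, a quick way to confirm that the boot menu is missing an installed kernel could look like the sketch below. It assumes the generated configuration lives at /boot/grub2/grub.cfg and that entries reference the kernel image as `vmlinuz-<release>`; `grub_cfg_has_kernel` is a hypothetical helper:

```shell
# Hypothetical helper: succeed if the generated GRUB configuration
# contains a reference to the kernel image for the given release.
# Assumes entries name the image "vmlinuz-<release>".
grub_cfg_has_kernel() {
    cfg=$1
    krel=$2
    # -F: treat the release string literally (dots are not wildcards).
    grep -qF -- "vmlinuz-$krel" "$cfg"
}

# Example (paths are assumptions for a default Tumbleweed install):
#   grub_cfg_has_kernel /boot/grub2/grub.cfg 6.9.3-1-default ||
#       echo "no boot entry for 6.9.3-1-default"
#
# Regenerating the menu to stdout and discarding it leaves only the
# messages on stderr, where errors from /etc/grub.d scripts would show up:
#   sudo grub2-mkconfig > /dev/null
```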

Thanks go to Mr. Dirsch and Mr. Bradnick for helping out. I'm changing the status to FIXED.
Comment 10 Stefan Dirsch 2024-06-08 09:14:54 UTC
Ok. Thanks for the feedback!