Bugzilla – Bug 1226055
NVIDIA driver 550.90 broken, plus no boot option for kernel 6.9.3
Last modified: 2024-06-08 09:14:54 UTC
Hi. NVIDIA driver 550.90 came out in the Tumbleweed repo, so I upgraded (the system was on snapshot 20240531) and things broke. After rebooting, the NVIDIA drivers could not be loaded. `nvidia-smi` said `NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.`, though the NVIDIA drivers were indeed installed. There was no `nvidia_drm` in `lsmod`. I tried `dracut --force --regenerate-all` with no improvement. I tried re-installing the drivers and re-enrolling the public key, but that did not help. I also tried disabling secure boot, sadly with no difference. I went back to the pre snapshot from before the update and found that, even though the read-only snapshot selection menu in grub reported the kernel version as 6.9.3, `uname -r` after booting into the snapshot still said 6.9.1. In fact that's the case in several snapshots. And 6.9.3 could not be found in `YaST Boot Loader` > `Bootloader Options` > `Default Boot Section`. I tried `update-bootloader --reinit` and `update-bootloader --refresh` and then rebooted, with no improvement. I’ve updated to the 20240605 snapshot and it's still broken.
Hmm, which nvidia packages do you have installed currently? Check this and let me know: rpm -qa | grep nvidia
Currently I have these nvidia packages installed:

nvidia-compute-G06-550.90.07-23.1.x86_64
nvidia-gl-G06-550.90.07-23.1.x86_64
libva-nvidia-driver-0.0.12-1.3.x86_64
nvidia-video-G06-550.90.07-23.1.x86_64
nvidia-driver-G06-kmp-default-550.90.07_k6.9.3_1-23.1.x86_64
nvidia-gl-G06-32bit-550.90.07-23.1.x86_64
kernel-firmware-nvidia-20240519-1.1.noarch
openSUSE-repos-Tumbleweed-NVIDIA-20240516.5431918-2.1.x86_64
nvidia-video-G06-32bit-550.90.07-23.1.x86_64
nvidia-compute-utils-G06-550.90.07-23.1.x86_64
nvidia-compute-G06-32bit-550.90.07-23.1.x86_64
libnvidia-egl-wayland1-1.1.13-1.3.x86_64

libva-nvidia-driver is installed from the X11:XOrg repo. Thanks in advance.
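For anyone comparing against their own system: the KMP package name in the list above encodes the kernel it was built for (`_k6.9.3_1`), so it can be compared with the output of `uname -r`. Below is a minimal sketch, assuming the `<name>-kmp-<flavor>-<modver>_k<kver>_<krel>-<pkgrel>.<arch>` naming pattern seen in this report holds. `kmp_kernel_release` is a hypothetical helper, not part of any openSUSE tool.

```shell
#!/bin/sh
# Hypothetical helper: derive the kernel release a KMP package targets
# from its package name alone (assumed naming pattern, see lead-in).
kmp_kernel_release() {
    pkg=$1
    # Flavor: the text between "-kmp-" and the next "-" (e.g. "default").
    flavor=${pkg#*-kmp-}
    flavor=${flavor%%-*}
    # Kernel part: the text after the first "_k" up to the next "-" (e.g. "6.9.3_1").
    kver=${pkg#*_k}
    kver=${kver%%-*}
    # Package names use "_" where the kernel release uses "-".
    kver=$(printf '%s' "$kver" | tr '_' '-')
    printf '%s-%s\n' "$kver" "$flavor"
}

# Example use (commented out so the file can be sourced safely):
#   for p in $(rpm -qa | grep -- '-kmp-'); do
#       [ "$(kmp_kernel_release "$p")" = "$(uname -r)" ] || echo "mismatch: $p"
#   done
```

If the derived release differs from `uname -r`, the module was built for a kernel other than the one currently running.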
package-wise it looks good. I suggest to reinstall nvidia-driver-G06-kmp-default and see what happens, like if driver build fails or so.

  sudo rpm -e nvidia-driver-G06-kmp-default --nodeps
  sudo zypper in nvidia-driver-G06-kmp-default

Then reboot and accept the certificate for secureboot. If still nvidia module cannot be loaded try this.

  sudo dmesg -c > /dev/null
  sudo modprobe nvidia
  sudo dmesg

Attach the last dmesg output

Also please attach the output of

  sudo inxi -aG
I'm not sure this is helpful, but I'll add it in case it is. I use vfio on a Dell with a T1000 and pass that discrete card to either a TW Qemu VM or Win11 Qemu VM depending on which one I want to use. Didn't have too much trouble w/ nvidia <= 550.67 and kernel <= 6.9.1. But the combo of 550.90 & 6.9.3 was a much more painful experience (for this machine). I won't claim to have any idea why, but 9 times out of 10 the Dell wouldn't boot to X and w/in 3 minutes would lock up w/ some type of vfio "cold" lockup and I'd have to hard-reset it. No manner of trying to blacklist vfio would stop it from showing up in lsmod output; neither would commenting out vfio-related items in /etc/modprobe.d and /etc/modules-load.d - it always showed back up. I removed nvidia-driver-G06-kmp-default and tried to reinstall it - locked up again w/ vfio before it seemed like the install completed, but I was prompted w/ a MOK enroll after the hard-reset. The only success I had was after "# modprobe --remove <each vfio module>" was run; then using `rpm -evh` to remove G06 and reinstalling G06 completed successfully, and it seems the system is ACTUALLY not loading vfio, as I'd expect considering they're still commented out. System hasn't locked up and seems happier, I'll check another day if I can re-enable vfio and GPU passthrough works again. Oddly, I have another up-to-date TW system w/ an AMD CPU and a 3070ti that didn't have any of these problems. This Dell is nothing but trouble.
(In reply to Stefan Dirsch from comment #3)
> package-wise it looks good. I suggest to reinstall
> nvidia-driver-G06-kmp-default and see what happens, like if driver build
> fails or so.
>
> sudo rpm -e nvidia-driver-G06-kmp-default --nodeps
> sudo zypper in nvidia-driver-G06-kmp-default
>
> Then reboot and accept the certificate for secureboot. If still nvidia
> module cannot be loaded try this.
>
> sudo dmesg -c > /dev/null
> sudo modprobe nvidia
> sudo dmesg
>
> Attach the last dmesg output
>
> Also please attach the output of
>
> sudo inxi -aG

Thank you Mr. Dirsch. I reinstalled `nvidia-driver-G06-kmp-default` and re-enrolled the MOK key as suggested. However, after reboot, `modprobe nvidia` said:

modprobe: ERROR: could not find module by name='nvidia'
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)

which seemed bizarre to me. After updating to snapshot 20240606 and rebooting, I tried the procedure again but got the same result. Also, `inxi -aG` reported:

Graphics:
  Device-1: NVIDIA GA106M [GeForce RTX 3060 Mobile / Max-Q] vendor: Dell
    driver: N/A alternate: nouveau non-free: 550.xx+
    status: current (as of 2024-04; EOL~2026-12-xx) arch: Ampere code: GAxxx
    process: TSMC n7 (7nm) built: 2020-2023 pcie: gen: 3 speed: 8 GT/s
    lanes: 8 link-max: gen: 4 speed: 16 GT/s lanes: 16 bus-ID: 01:00.0
    chip-ID: 10de:2560 class-ID: 0300
  Device-2: AMD Cezanne [Radeon Vega Series / Radeon Mobile Series]
    vendor: Dell driver: amdgpu v: kernel arch: GCN-5 code: Vega
    process: GF 14nm built: 2017-20 pcie: gen: 3 speed: 8 GT/s lanes: 16
    link-max: gen: 4 speed: 16 GT/s ports: active: eDP-1 empty: none
    bus-ID: 05:00.0 chip-ID: 1002:1638 class-ID: 0300 temp: 52.0 C
  Device-3: Microdia Integrated_Webcam_HD driver: uvcvideo type: USB
    rev: 2.0 speed: 480 Mb/s lanes: 1 mode: 2.0 bus-ID: 1-4:6
    chip-ID: 0c45:6a09 class-ID: 0e02
  Display: server: X.org v: 1.21.1.12 with: Xwayland v: 24.1.0
    compositor: kwin_wayland driver: X: loaded: modesetting
    unloaded: fbdev,vesa dri: radeonsi gpu: amdgpu tty: 213x44
  Monitor-1: eDP-1 model: LG Display 0x067e built: 2020 res: 1920x1080
    dpi: 142 gamma: 1.2 size: 344x194mm (13.54x7.64") diag: 395mm (15.5")
    ratio: 16:9 modes: max: 1920x1080 min: 640x480
  API: EGL v: 1.5 hw: drv: amd radeonsi platforms: device: 0 drv: radeonsi
    device: 1 drv: swrast surfaceless: drv: radeonsi
    inactive: gbm,wayland,x11
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: mesa v: 24.0.8
    note: console (EGL sourced) renderer: AMD Radeon Graphics (radeonsi
    renoir LLVM 18.1.6 DRM 3.57 6.9.1-1-default), llvmpipe (LLVM 18.1.6 256 bits)
  API: Vulkan v: 1.3.283 layers: 2 device: 0 type: integrated-gpu
    name: AMD Radeon Graphics (RADV RENOIR) driver: N/A
    device-ID: 1002:1638 surfaces: N/A
Ok. Your GPU is supported by the driver. Please try this.

  sudo dmesg -c > /dev/null
  sudo modprobe nvidia
  sudo dmesg

Thanks!
(In reply to Stefan Dirsch from comment #6)
> Ok. Your GPU is supported by the driver. Please try this.
>
> sudo dmesg -c > /dev/null
> sudo modprobe nvidia
> sudo dmesg
>
> Thanks!

It said:

modprobe: ERROR: could not find module by name='nvidia'
modprobe: ERROR: could not insert 'nvidia': Unknown symbol in module, or unknown parameter (see dmesg)

after `modprobe nvidia`. And the second `dmesg` had no output, as expected.
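Since modprobe reports that it cannot find the module by name, one thing worth checking is whether an `nvidia.ko` file exists at all in the module tree of the running kernel. The following is a minimal sketch, assuming modules live under /lib/modules/<release>/ and may be compressed (for example `nvidia.ko.zst`). `module_present_for_kernel` is a hypothetical helper.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a module file named <mod>.ko (optionally
# compressed) exists anywhere under <modules_dir>/<release>.
module_present_for_kernel() {
    mod=$1
    rel=$2
    modules_dir=${3:-/lib/modules}   # default location is an assumption
    [ -d "$modules_dir/$rel" ] || return 1
    # -name also matches symlinks, such as those placed under weak-updates.
    found=$(find "$modules_dir/$rel" -name "$mod.ko*" 2>/dev/null | head -n 1)
    [ -n "$found" ]
}

# Example use (commented out so the file can be sourced safely):
#   module_present_for_kernel nvidia "$(uname -r)" \
#       || echo "no nvidia module for running kernel $(uname -r)"
```

If the check fails for the running kernel but passes for another installed release, the driver was installed for a kernel that is not the one being booted.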
Just a little update from me, not that anyone asked :P (I'm not looking to take this bug over, but it's currently the only 6.9.3 bug out there and if anyone else is having problems, hopefully they'd see this and decide if they need to open their own. I apologize, Gerald, if this is worthless chatter in your report). I don't think there's an issue w/ 550.90, I think there's some oddness w/ 6.9.3, but it appears to be more of an issue w/ prime laptops than desktops w/ discrete cards. I have a desktop passing a 3070ti to a TW Qemu VM, it was on 6.9.1 w/ 550.78 and all was fine pre-`zypper dup`. After the dup, the worst I could get to happen was that 6.9.1 tried to use 550.78 and reported: Failed to initialize NVML: Driver/library version mismatch NVML library version: 550.90 But it's fine w/ 6.9.3 and I'd assume I could have 6.9.1 rebuild w/ 550.90 if for some reason that was a desired setup, which it isn't presently. The Dell, it's a different story (of course). It's still hit-or-miss on whether 6.9.1 or 6.9.3 boots into X and/or throws the "cold" issue w/ the T1000. Right now, I've got it booted in 6.9.3 running a TW VM (on 6.9.3 + 550.90.07, reporting the T1000 via `nvidia-smi -L`) and all seems fine, been up ~20 minutes. Apart from the Dell's inability to consistently boot w/o problem(s), and even though it was working fine even with a VM using the T1000, running `inxi --graphics` either takes >= 30s to run or hangs and basically causes the hard-lock. I'm about to the point where I just don't turn the thing on ...
Hey guys! I worked this out! And the real cause behind all this is truly bizarre. As the title mentions, there was no boot option for kernel 6.9.3, although it had been installed and I had run `dracut -f`. And the NVIDIA driver 550.90 was built for kernel 6.9.3, as shown by the full name of the package mentioned in #3 by Mr. Dirsch: "nvidia-driver-G06-kmp-default-550.90.07_k6.9.3_1-23.1.x86_64". I was always being booted into kernel 6.9.1, so the NVIDIA driver could not be properly loaded, I guess. While investigating the output of `update-bootloader` as mentioned in the original post, I happened to notice that the grub menu could not be updated because of an error in /etc/grub.d/00_tuned. That script requires /etc/tuned, which I had deleted because I don't use tuned, and this prevented the boot option for kernel 6.9.3 from being generated. So I reinstalled tuned to recreate /etc/tuned, ran `update-bootloader` and `dracut` again, and rebooted. Voilà! My system went back to normal and the NVIDIA driver got properly loaded. Thanks go to Mr. Dirsch and Mr. Bradnick for helping out. I'm changing the status to FIXED.
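A check like the following might surface this kind of missing boot entry earlier. It is a minimal sketch, assuming kernel module trees live under /lib/modules and that the generated config at /boot/grub2/grub.cfg names each kernel release in its `linux` lines. `kernels_missing_from_grub` is a hypothetical helper, and leftover module directories from removed kernels would show up as false positives.

```shell
#!/bin/sh
# Hypothetical helper: print every kernel release that has a module tree
# but is never mentioned in the GRUB configuration file.
kernels_missing_from_grub() {
    modules_dir=${1:-/lib/modules}          # assumed default location
    grub_cfg=${2:-/boot/grub2/grub.cfg}     # assumed default location
    for dir in "$modules_dir"/*/; do
        [ -d "$dir" ] || continue           # skip an unexpanded glob
        rel=$(basename "$dir")
        # Fixed-string match; "--" guards against releases starting with "-".
        grep -qF -- "$rel" "$grub_cfg" || printf '%s\n' "$rel"
    done
}

# Example use (commented out so the file can be sourced safely):
#   sudo sh -c '. ./check.sh; kernels_missing_from_grub'
```

Any release it prints is installed on disk but has no boot entry, which is the situation described above for 6.9.3.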
Ok. Thanks for feedback!