Bugzilla – Bug 1215981
Black Screen during boot on both internal and external screen in kernel 6.5.4-1 on Thinkpad P16 (Discrete Graphics mode)
Last modified: 2024-04-04 10:48:03 UTC
I have a problem similar to #1213693, but on the newer kernel 6.5.4-1, which should contain the fix. #1213693 was caused by commit ca62297b2085 ("drm/edid: Fix csync detailed mode parsing") in v6.4-rc1 and was fixed by reverting it in 50b6f2c82977 ("Revert "drm/edid: Fix csync detailed mode parsing"") in v6.5-rc7.

In my case I cannot see anything once the kernel has been loaded. I have Tumbleweed kernels 6.5.4-1 and 6.5.2-1. The problem is on a Thinkpad P16 with 2 GPUs:

00:02.0 VGA compatible controller: Intel Corporation Alder Lake-HX GT1 [UHD Graphics 770] (rev 0c)
01:00.0 VGA compatible controller: NVIDIA Corporation GA107GLM [RTX A1000 Laptop GPU] (rev a1)

The problem occurs in "Discrete Graphics" (Nvidia only) mode. "Hybrid Graphics" (Intel + Nvidia) works, but I need "Discrete Graphics" because it is the only way to get external screens working (the external output is wired only to nvidia): i.e. on Discrete Graphics there is only the Intel card being used

$ drm_info |grep -i node: -A1
Node: /dev/dri/card0
Driver: i915 (Intel Graphics) version 1.6.0 (20201103)

I tested with the internal screen only and with the internal screen + 2 external screens. I tried disabling plymouth with the rd.plymouth=0 plymouth.enable=0 plymouth=0 cmdline args, I also tried fbcon=map:1, and I also booted to runlevel 1 and 3 instead of the default. None of this helped.
$ rpm -qa |grep -i -e nouveau -e intel -e ^kernel kernel-firmware-nvidia-gsp-G06-525.116.04-2.1.x86_64 kernel-firmware-nvidia-gspx-G06-535.113.01-1.1.x86_64 kernel-firmware-serial-20230829-1.1.noarch libdrm_nouveau2-2.4.116-2.1.x86_64 intel-vaapi-driver-2.4.1-5.11.x86_64 kernel-firmware-mwifiex-20230829-1.1.noarch xf86-video-intel-2.99.917.916_g31486f40-3.6.x86_64 kernel-firmware-platform-20230829-1.1.noarch kernel-firmware-intel-20230829-1.1.noarch kernel-firmware-iwlwifi-20230829-1.1.noarch kernel-firmware-all-20230829-1.1.noarch intel-media-driver-23.3.3-1.1.x86_64 ucode-intel-20230808-1.1.x86_64 kernel-firmware-nvidia-gsp-G06-535.54.03-1.1.x86_64 kernel-firmware-amdgpu-20230829-1.1.noarch kernel-firmware-usb-network-20230829-1.1.noarch kernel-firmware-i915-20230829-1.1.noarch kernel-macros-6.5.4-1.1.noarch kernel-firmware-qcom-20230829-1.1.noarch libvulkan_intel-23.2.0-1699.360.pm.1.x86_64 intel-gpu-tools-1.27.1-2.3.x86_64 kernel-firmware-sound-20230829-1.1.noarch kernel-firmware-ath10k-20230829-1.1.noarch libvdpau_nouveau-23.2.0-1699.360.pm.1.x86_64 kernel-firmware-bnx2-20230829-1.1.noarch Mesa-dri-nouveau-23.2.0-1699.360.pm.1.x86_64 kernel-firmware-dpaa2-20230829-1.1.noarch kernel-firmware-atheros-20230829-1.1.noarch kernel-firmware-radeon-20230829-1.1.noarch kernel-firmware-ueagle-20230829-1.1.noarch kernel-firmware-brcm-20230829-1.1.noarch kernel-firmware-chelsio-20230829-1.1.noarch kernel-firmware-nvidia-20230829-1.1.noarch kernel-firmware-ti-20230829-1.1.noarch kernel-firmware-media-20230829-1.1.noarch kernel-firmware-realtek-20230829-1.1.noarch kernel-firmware-mellanox-20230829-1.1.noarch libdrm_intel1-2.4.116-2.1.x86_64 kernel-firmware-network-20230829-1.1.noarch kernel-firmware-ath11k-20230829-1.1.noarch kernel-firmware-mediatek-20230829-1.1.noarch kernel-firmware-bluetooth-20230829-1.1.noarch kernel-firmware-prestera-20230829-1.1.noarch kernel-firmware-liquidio-20230829-1.1.noarch kernel-firmware-marvell-20230829-1.1.noarch 
kernel-default-6.5.2-1.1.x86_64 kernel-firmware-nfp-20230829-1.1.noarch kernel-default-devel-6.5.2-1.1.x86_64 kernel-devel-6.5.4-1.1.noarch kernel-firmware-qlogic-20230829-1.1.noarch kernel-default-devel-6.5.4-1.1.x86_64 kernel-default-6.5.4-1.1.x86_64 kernel-devel-6.5.2-1.1.noarch $ lsmod |grep -i -e i915 -e nvidia -e nouveau nvidia_drm 94208 0 nvidia_modeset 1794048 1 nvidia_drm nvidia_uvm 3608576 0 i915 4087808 5 drm_buddy 20480 1 i915 i2c_algo_bit 20480 1 i915 drm_display_helper 237568 1 i915 ttm 102400 1 i915 cec 90112 2 drm_display_helper,i915 nvidia 8843264 2 nvidia_uvm,nvidia_modeset video 77824 3 thinkpad_acpi,i915,nvidia_modeset $ modinfo nvidia |grep -i version version: 535.113.01 srcversion: 81566B70A70B0B19F40FD1A vermagic: 6.5.4-1-default SMP preempt mod_unload modversions $ cat /proc/cmdline # but I tested with others, see above BOOT_IMAGE=/boot/vmlinuz-6.5.4-1-default root=/dev/mapper/system-root splash=silent resume=/dev/system/swap mitigations=auto quiet security=apparmor modprobe.blacklist=i915 nosimplefb=1 I use these non-factory repos: https://download.opensuse.org/repositories/X11:/Drivers:/Video:/Redesign/openSUSE_Tumbleweed/ https://download.opensuse.org/repositories/X11:/XOrg/openSUSE_Tumbleweed/ https://download.nvidia.com/opensuse/tumbleweed
Can you access the system remotely? If so, please provide dmesg and hwinfo output.
(In reply to Patrik Jakobsson from comment #1)
> Can you access the system remotely? If so, please provide dmesg and hwinfo
> output.

Unfortunately the system does not reply to ping. I can get to a working system if I switch the BIOS to "Hybrid Graphics". I'm not sure whether the system crashes or whether the network requires nm-applet to start. I'll try to set up networking over a LAN cable and set up SSH so that I can get some logs.
Created attachment 869967 [details] dmesg of the affected system
Created attachment 869968 [details] hwinfo of the affected system
Created attachment 869969 [details]
dmesg of the affected system (cmdline cleanup)

I removed modprobe.blacklist=i915 nosimplefb=1 from the cmdline. Obviously it did not solve the problem; the point was just to use the default cmdline. There are some errors, but I'm not sure about them:

[ 1.464073] BERT: [Hardware Error]: Skipped 1 error records
...
[ 2.052280] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.0
[ 2.052299] pci 0000:01:00.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 2.052345] pci 0000:01:00.0: device [10de:25b9] error status/mask=00100000/00000000
...
[ 9.027482] sof-audio-pci-intel-tgl 0000:00:1f.3: init of i915 and HDMI codec failed
...
[ 12.628660] [drm:nv_drm_load [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to allocate NvKmsKapiDevice
[ 12.629139] [drm:nv_drm_probe_devices [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to register device

The Nvidia card is visible:

$ lspci |grep -i nvidia
01:00.0 VGA compatible controller: NVIDIA Corporation GA107GLM [RTX A1000 Laptop GPU] (rev a1)
01:00.1 Audio device: NVIDIA Corporation Device 2291 (rev a1)
Created attachment 869970 [details] hwinfo of the affected system (cmdline cleanup) The main difference is that modprobe.blacklist=i915 nosimplefb=1 (previous log file) forced efi-framebuffer instead of the default simple-framebuffer and had "Generic Monitor". But output is the same - none.
Created attachment 869971 [details] dmesg on Hybrid Graphics mode (where GUI works, just for a reference)
Created attachment 869972 [details] hwinfo on Hybrid Graphics mode (where GUI works, just for a reference)
[ 12.368440] NVRM: Open nvidia.ko is only ready for use on Data Center GPUs.
[ 12.368442] NVRM: To force use of Open nvidia.ko on other GPUs, see the
[ 12.368442] NVRM: 'OpenRmEnableUnsupportedGpus' kernel module parameter described
[ 12.368443] NVRM: in the README.

So have you set this in modprobe.d/50-nvidia-default.conf?
(In reply to Stefan Dirsch from comment #9)
> [ 12.368440] NVRM: Open nvidia.ko is only ready for use on Data Center
> GPUs.
> [ 12.368442] NVRM: To force use of Open nvidia.ko on other GPUs, see the
> [ 12.368442] NVRM: 'OpenRmEnableUnsupportedGpus' kernel module parameter
> described
> [ 12.368443] NVRM: in the README.
>
> So have you set this in modprobe.d/50-nvidia-default.conf ?

Yes, I remember setting OpenRmEnableUnsupportedGpus=1 in /usr/lib/modprobe.d/50-nvidia-default.conf before (it was in the SUSE internal docs for the laptop), but now I see it's not set. I suspect it was overwritten by an rpm update, so I enabled it again. Setting it really is required:

* Neither Discrete Graphics nor Hybrid Graphics mode can use external screens when OpenRmEnableUnsupportedGpus=1 is not set.
* Discrete Graphics mode now starts normally; I can use X11 based window managers and also Wayland based compositors (tested on sway, which is picky about nvidia proprietary drivers).

I guess we can close this bug.

Maybe we should consider documenting OpenRmEnableUnsupportedGpus=1 somewhere in the openSUSE wiki as well. Or ask Nvidia, which IMHO maintains /usr/lib/modprobe.d/50-nvidia-default.conf, to somehow document which GPUs need this option.
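A quick way to see whether the option is really active in a given config file is to look for an uncommented "options" line. The following is only a minimal sketch; the function name modprobe_option_set is made up for illustration, and it checks a single file rather than everything modprobe reads.

```shell
# Hypothetical helper: return 0 if FILE contains an uncommented
# "options MODULE ... PARAM" line, non-zero otherwise.
# PARAM is matched literally including its value, e.g. "foo=1".
modprobe_option_set() {
    file=$1 module=$2 param=$3
    # A leading "#" makes the line a comment, so the pattern is anchored
    # at the start of the line and only allows whitespace before "options".
    grep -Eq "^[[:space:]]*options[[:space:]]+${module}[[:space:]]+(.*[[:space:]])?${param}([[:space:]]|\$)" "$file"
}
```

Usage, assuming the file path from this report: modprobe_option_set /usr/lib/modprobe.d/50-nvidia-default.conf nvidia NVreg_OpenRmEnableUnsupportedGpus=1 && echo "option is set"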
(In reply to Petr Vorel from comment #10)
> I guess we can close this bug.

Actually, losing the whole output without OpenRmEnableUnsupportedGpus=1 is a new *feature*; maybe the Nvidia driver is broken on the 6.5 kernel (it should be usable, although only on the internal screen).

> Maybe we should consider to document using OpenRmEnableUnsupportedGpus=1
> also somewhere in openSUSE wiki. Or ask Nvidia, which IMHO maintains
> /usr/lib/modprobe.d/50-nvidia-default.conf, to somehow document which GPU
> need this option.

To correct myself: nvidia-open-driver-G06-signed-kmp-default-535.113.01_k6.5.4_1-43.4.x86_64, which contains /usr/lib/modprobe.d/50-nvidia-default.conf, is from obs://build.opensuse.org/X11:Drivers:Video. Shouldn't the config file be in /etc? Or am I supposed to put it into /etc?
Hmm. In theory during an update a file marked as %config in RPM and edited by yourself before should not be overwritten. https://www.cl.cam.ac.uk/~jw35/docs/rpm_config.html I don't think it has changed in the package itself. But you could check if there is a .rpmsave with a timestamp of the update. /usr/lib/modprobe.d is the new location for packaged config files. But you can overwrite things permanently on your system in /etc/modprobe.d using the same filename (IIRC). Usage of the opengpu driver is documented: --> https://en.opensuse.org/SDB:NVIDIA_drivers Open GPU kernel modules versus Proprietary drivers The following article is about installing NVIDIA's Proprietary drivers. For more information about the Open GPU kernel modules, that NVIDIA released in May 2022, read this [openSUSE Blog article][https://sndirsch.github.io/nvidia/2022/06/07/nvidia-opengpu.html]. [...] I doubt nvidia opengpu driver ever worked without that option. It does only on computing cards without graphical output.
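The override approach mentioned above (same file name under /etc/modprobe.d) could be scripted roughly as follows. This is a sketch under the assumption that a same-named file in /etc/modprobe.d takes precedence over the packaged one; the function name make_modprobe_override is invented for illustration.

```shell
# Hypothetical helper: copy a packaged modprobe config into an override
# directory under the same file name and make sure one extra line is present.
#   $1 = packaged file, $2 = override directory, $3 = line to add
make_modprobe_override() {
    src=$1 dst_dir=$2 line=$3
    dst="$dst_dir/$(basename "$src")"
    mkdir -p "$dst_dir" || return 1
    # Keep an existing override; only copy the packaged file the first time.
    [ -e "$dst" ] || cp "$src" "$dst" || return 1
    # Append the line only if it is not already there (exact, whole-line match).
    grep -Fxq -- "$line" "$dst" || printf '%s\n' "$line" >> "$dst"
}
```

Usage (as root), with the paths and option from this report: make_modprobe_override /usr/lib/modprobe.d/50-nvidia-default.conf /etc/modprobe.d 'options nvidia NVreg_OpenRmEnableUnsupportedGpus=1'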
> I don't think it has changed in the package itself. But you could check if there is a .rpmsave with a timestamp of the update. Therefore keeping NEEDINFO open ...
(In reply to Stefan Dirsch from comment #12) > Hmm. In theory during an update a file marked as %config in RPM and edited > by yourself before should not be overwritten. > > https://www.cl.cam.ac.uk/~jw35/docs/rpm_config.html > > I don't think it has changed in the package itself. But you could check if > there is a .rpmsave with a timestamp of the update. Yes, there is 50-nvidia-default.conf.rpmsave with date 29th September, which is *without* "options nvidia NVreg_OpenRmEnableUnsupportedGpus=1" line (not even commented out). That also brought my suspicion that it was overwritten. Also in the file before I edited it was this line commented out (it was also after the installation before I modified it to get GPU working). > /usr/lib/modprobe.d is the new location for packaged config files. But you > can overwrite things permanently on your system in /etc/modprobe.d using the > same filename (IIRC). It's ok if I'm supposed to make this copy (I'll do). I just wanted to point out whole problem in case of any problem/bug in the package itself. > > Usage of the opengpu driver is documented: > > --> https://en.opensuse.org/SDB:NVIDIA_drivers > > Open GPU kernel modules versus Proprietary drivers > The following article is about installing NVIDIA's Proprietary drivers. For > more information about the Open GPU kernel modules, that NVIDIA released in > May 2022, read this [openSUSE Blog > article][https://sndirsch.github.io/nvidia/2022/06/07/nvidia-opengpu.html]. > [...] Yes, I've noticed both of them before. The blog document using this variable and I found it via the official docs. But none of them suggests to move content of /usr/lib/modprobe.d to /etc/modprobe.d (probably general approach which I should have known, but in this case it leads to a broken system). Blog also mentions pci_ids-unsupported [1] in our packaging. I wonder if there could be automation which would on package configure checked this list and enable or disable the variable. 
> I doubt nvidia opengpu driver ever worked without that option. It does only > on computing cards without graphical output. Interesting. This could be mentioned in the blog post. [1] https://build.opensuse.org/package/view_file/X11:Drivers:Video:Redesign/nvidia-open-driver-G06-signed/pci_ids-unsupported
Hmm. So that would mean the packaged file has changed (not sure why though; I'm not aware of any changes I did) and the .rpmsave is the edited one. So apparently you would have removed the line before yourself!?! But you needed to have set it. Hmm ... I'm not happy with the situation with this option. I had the idea to make a subpackage just out of this option, i.e. just one file. Install or uninstall this package to enable the driver or not. I'm afraid I can't enable this option by default as long as nVidia call it alpha quality for cards with display engine. In my blog post I mention, which GPUs are supported by default and which need this option. Pretty obvious I believe.
I just checked that 50-nvidia-default.conf of 535.104.05 and 535.113.01 is identical. So this does not explain why such a .rpmsave file has been created.
(In reply to Stefan Dirsch from comment #16)
> I just checked that 50-nvidia-default.conf of 535.104.05 and 535.113.01 is
> identical. So this does not explain, which such a .rpmsave file has been
> created.

Thanks for all the info. I remember only adding this option. Maybe I really did remove it, but that would have to have been some time ago, not recently. Let's assume it was my fault; I'll watch the next update of the driver.

Also, although I thought that I had booted at least once before with the Nvidia driver without NVreg_OpenRmEnableUnsupportedGpus=1, I'm not sure. I now think a regression in the driver or kernel is unlikely.

Maybe we should close this bug for now; it can be reopened if the problem comes back.
Yes, I would definitely appreciate if you could watch what happens with the next update! And of course this ticket can be reopened if you run into the same situation again with the next update!
After zypper dup the problem is back. The notebook is running, but there is no output. I tried to run without the docking station and without any external screen. dmesg output is visible; the last message is:

nvidia 0000:01:00:0: [drm] fb0: nvidia-drmdrmfb frame buffer device

and it asks for root to fix the problem.

And indeed /usr/lib/modprobe.d/50-nvidia-default.conf is different (- for original, + for new):

-options nvidia-drm modeset=1
-options nvidia NVreg_OpenRMEnableSupporteGpus=1
+options nvidia-drm modeset=1 fbdev=1

1) Do I need to copy my config somewhere in /etc so that it is not overwritten?
2) Auto detection for NVreg_OpenRMEnableSupporteGpus would really help.
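A comparison like the one above can be produced from two copies of the config file (for example the current file and a saved copy). The sketch below is only an illustration; the function name diff_modprobe_options is hypothetical, and it looks at "options" lines only, ignoring comments and everything else.

```shell
# Hypothetical helper: print the "options" lines that differ between two
# modprobe config files, "-" for lines only in OLD, "+" for lines only in NEW.
#   $1 = OLD file, $2 = NEW file (must be two different paths)
diff_modprobe_options() {
    awk -v oldfile="$1" '
        # Skip everything that is not an uncommented "options" line.
        !/^[ \t]*options[ \t]/ { next }
        FILENAME == oldfile {
            if (!($0 in old)) { old[$0] = 1; oldlist[++no] = $0 }
            next
        }
        { if (!($0 in new)) { new[$0] = 1; newlist[++nn] = $0 } }
        END {
            # Keep the original file order within each group.
            for (i = 1; i <= no; i++) if (!(oldlist[i] in new)) print "-" oldlist[i]
            for (i = 1; i <= nn; i++) if (!(newlist[i] in old)) print "+" newlist[i]
        }
    ' "$1" "$2"
}
```

Usage, assuming a saved copy exists next to the packaged file: diff_modprobe_options /usr/lib/modprobe.d/50-nvidia-default.conf.rpmsave /usr/lib/modprobe.d/50-nvidia-default.conf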
Hmm ... NVreg_OpenRMEnableSupporteGpus option is no longer needed. The support for Workstation cards is now considered beta and officially supported. fbdev option is new and eventually enables a Linux console with the nvidia driver (and no longer breaks simpledrm on newer 6.x.y kernels). Do things work again when you remove the fbdev option? I think you need to regenerate the initrd by running 'dracut' to make the changes effective.
(In reply to Stefan Dirsch from comment #20) > Hmm ... > > NVreg_OpenRMEnableSupporteGpus option is no longer needed. The support for > Workstation cards is now considered beta and officially supported. Does this apply to Open GPU kernel modules or to NVIDIA's Proprietary drivers? Your comment #12 suggests it's needed for Open GPU kernel modules which I'm trying to use. Although I need to double check if I installed only Open GPU kernel modules (the open ones) and not NVIDIA's Proprietary drivers. > > fbdev option is new and eventually enables a Linux console with the nvidia > driver (and no longer breaks simpledrm on newer 6.x.y kernels). > > Do things work again when you remove the fbdev option? OK, I'll test "options nvidia-drm modeset=1" (with removed "fbdev=1" from that line and removed "options nvidia NVreg_OpenRMEnableSupporteGpus=1"). But I remember last time "options nvidia-drm modeset=1" only didn't work (NVreg_OpenRMEnableSupporteGpus=1 was required on kernel 6.5 and kernel-firmware-nvidia-gspx-G06-535.113.01). > I think you need to regenerate the initrd by running 'dracut' to make the changes effective. OK, I'll try tomorrow something like: dracut --kver $(uname -r) -f
(In reply to Petr Vorel from comment #21) > (In reply to Stefan Dirsch from comment #20) > > Hmm ... > > > > NVreg_OpenRMEnableSupporteGpus option is no longer needed. The support for > > Workstation cards is now considered beta and officially supported. > > Does this apply to Open GPU kernel modules or to NVIDIA's Proprietary > drivers? Your comment #12 suggests it's needed for Open GPU kernel modules > which I'm trying to use. Although I need to double check if I installed only > Open GPU kernel modules (the open ones) and not NVIDIA's Proprietary drivers. This applies to Open GPU kernel modules. Setting this option is no longer needed for Desktop GPUs since version 545.29.02. > > fbdev option is new and eventually enables a Linux console with the nvidia > > driver (and no longer breaks simpledrm on newer 6.x.y kernels). > > > > Do things work again when you remove the fbdev option? > > OK, I'll test "options nvidia-drm modeset=1" (with removed "fbdev=1" from > that line and removed "options nvidia NVreg_OpenRMEnableSupporteGpus=1"). > But I remember last time "options nvidia-drm modeset=1" only didn't work > (NVreg_OpenRMEnableSupporteGpus=1 was required on kernel 6.5 and > kernel-firmware-nvidia-gspx-G06-535.113.01). See above. > > I think you need to regenerate the initrd by running 'dracut' to make the changes effective. > > OK, I'll try tomorrow something like: > dracut --kver $(uname -r) -f yes. I think this should do the job.
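Since a changed modprobe config only takes effect once the initrd has been regenerated, a rough check for a stale initrd can help. The sketch below compares modification times only, which is a heuristic and not what dracut itself does; the function name initrd_is_stale and the initrd path in the usage line are assumptions.

```shell
# Hypothetical helper: return 0 if any of the given config files is newer
# than the initrd (i.e. the initrd probably needs to be regenerated).
#   $1 = initrd image, remaining arguments = config files to compare
initrd_is_stale() {
    initrd=$1
    shift
    for conf in "$@"; do
        # -nt compares modification times; missing config files are skipped.
        [ -e "$conf" ] && [ "$conf" -nt "$initrd" ] && return 0
    done
    return 1
}
```

Usage, assuming the initrd lives at /boot/initrd-$(uname -r): initrd_is_stale "/boot/initrd-$(uname -r)" /etc/modprobe.d/*.conf /usr/lib/modprobe.d/*.conf && dracut --kver "$(uname -r)" -f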
TL;DR: Probably problem in my setup, we can probably close this. The rest is a description if you find something which I do obviously wrong or if there is something what can be improved. I wonder how can happen that 2 driver versions can coexist together? (kernel-firmware-nvidia-gsp-G06-525.116 vs. kernel-firmware-nvidia-gspx-G06-535 and nvidia-open-driver-G06-signed-kmp-default-535 and nvidia-open-driver-G06-signed-kmp-default-545): $ rpm -qa |grep -i nvidia | sort kernel-firmware-nvidia-20231107-1.1.noarch kernel-firmware-nvidia-gsp-G06-525.116.04-2.1.x86_64 kernel-firmware-nvidia-gsp-G06-535.54.03-1.1.x86_64 kernel-firmware-nvidia-gspx-G06-535.113.01-1.1.x86_64 kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.x86_64 kernel-firmware-nvidia-gspx-G06-535.129.03-11.1.x86_64 kernel-firmware-nvidia-gspx-G06-535.129.03-12.1.x86_64 kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.x86_64 libnvidia-egl-wayland1-1.1.12-1.2.x86_64 libva-nvidia-driver-0.0.10-1.1.x86_64 nvidia-compute-G06-32bit-535.129.03-15.1.x86_64 nvidia-compute-G06-535.129.03-15.1.x86_64 nvidia-gl-G06-32bit-535.129.03-15.1.x86_64 nvidia-gl-G06-535.129.03-15.1.x86_64 nvidia-open-driver-G06-signed-kmp-default-535.129.03_k6.6.1_1-1.2.x86_64 nvidia-open-driver-G06-signed-kmp-default-545.29.02_k6.5.9_1-57.1.x86_64 nvidia-video-G06-32bit-535.129.03-15.1.x86_64 nvidia-video-G06-535.129.03-15.1.x86_64 $ rpm -qi kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.x86_64 Name : kernel-firmware-nvidia-gspx-G06 Version : 545.29.02 Release : 13.1 Architecture: x86_64 Install Date: Út 14. listopadu 2023, 09:27:44 Group : System/Kernel Size : 64294720 License : GPL-2.0-only AND SUSE-Firmware AND GPL-2.0-or-later AND MIT Signature : RSA/SHA256, Po 13. listopadu 2023, 16:53:44, Key ID 590401a1e38fb563 Source RPM : kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.nosrc.rpm Build Date : Po 13. 
listopadu 2023, 16:53:25 Build Host : i04-ch2a Vendor : obs://build.opensuse.org/X11:Drivers:Video URL : https://www.nvidia.com/en-us/drivers/unix/ Summary : Kernel firmware file for open NVIDIA kernel module driver G06 Description : This package contains the versioned kernel firmware file "gsp.bin" for the OpenSource NVIDIA kernel module driver G06. Distribution: X11:Drivers:Video:Redesign / openSUSE_Tumbleweed $ rpm -qi kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.x86_64 Name : kernel-firmware-nvidia-gspx-G06 Version : 535.129.03 Release : 1.1 Architecture: x86_64 Install Date: Pá 10. listopadu 2023, 07:23:53 Group : System/Kernel Size : 61824832 License : GPL-2.0-only AND SUSE-Firmware AND GPL-2.0-or-later AND MIT Signature : RSA/SHA512, Čt 2. listopadu 2023, 20:48:50, Key ID 35a2f86e29b700a4 Source RPM : kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.nosrc.rpm Build Date : Čt 2. listopadu 2023, 20:48:26 Build Host : i04-ch1b Packager : https://bugs.opensuse.org Vendor : openSUSE URL : https://www.nvidia.com/en-us/drivers/unix/ Summary : Kernel firmware file for open NVIDIA kernel module driver G06 Description : This package contains the versioned kernel firmware file "gsp.bin" for the OpenSource NVIDIA kernel module driver G06. Distribution: openSUSE Tumbleweed I suppose this is due multiversion = provides:multiversion(kernel), right? Because I see that both nvidia-open-driver devel [1] and factory [2] have the same newer version, the same applies to kernel-firmware-nvidia-gspx-G06 [3] [4] I removed obs://build.opensuse.org/X11:Drivers:Video and removed packages and install only the latest version. After this, the default value ("options nvidia-drm modeset=1 fbdev=1" and *not* set NVreg_OpenRMEnableSupporteGpus=1) was working for xorg. After installation the still was not working even I run dracut, I needed to ssh to the system, rerun dracut and reboot to get it working. Let's assume I did something wrong, that's why I needed to rerun dracut via ssh. 
But sway did not work. Removing "fbdev=1" made no difference (working xorg, broken sway).

Adding NVreg_OpenRMEnableSupporteGpus=1 is the option which breaks booting.

Sway also needs nvidia-video-G06 (otherwise sway startup freezes) and nvidia-gl-G06 (otherwise sway startup fails) from the proprietary NVIDIA repository.

I.e. both the open kernel driver nvidia-open-driver-G06-signed-kmp-default-545.29.02_k6.6.1_1-1.1.x86_64 and GPU and proprietary NVIDIA OpenGL libraries are needed for sway (while this might be obvious from the blog post [5], it was new for me, because sway claims "don't use nvidia proprietary").

[1] https://build.opensuse.org/package/view_file/X11:Drivers:Video:Redesign/nvidia-open-driver-G06-signed/nvidia-open-driver-G06-signed.changes?expand=1
[2] https://build.opensuse.org/package/view_file/openSUSE:Factory/nvidia-open-driver-G06-signed/nvidia-open-driver-G06-signed.changes?expand=1
[3] https://build.opensuse.org/package/view_file/X11:Drivers:Video:Redesign/kernel-firmware-nvidia-gspx-G06/kernel-firmware-nvidia-gspx-G06.changes?expand=1
[4] https://build.opensuse.org/package/view_file/openSUSE:Factory/kernel-firmware-nvidia-gspx-G06/kernel-firmware-nvidia-gspx-G06.changes?expand=1
[5] https://sndirsch.github.io/nvidia/2022/06/07/nvidia-opengpu.html
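To spot packages that are installed in several versions at once, as in the rpm -qa listing in this comment, the names can be grouped by version. This is a minimal sketch; the function name multi_version_packages is made up, and it only counts distinct versions, not releases.

```shell
# Hypothetical helper: read "NAME VERSION" pairs on stdin and print every
# NAME that appears with more than one distinct VERSION, sorted by name.
multi_version_packages() {
    # sort -u drops exact duplicate pairs, so awk counts distinct versions.
    LC_ALL=C sort -u |
        awk '{ count[$1]++ } END { for (n in count) if (count[n] > 1) print n }' |
        LC_ALL=C sort
}
```

Usage: rpm -qa --qf '%{NAME} %{VERSION}\n' | multi_version_packages

To treat different releases of the same version as separate entries too, '%{NAME} %{VERSION}-%{RELEASE}\n' could be used as the query format instead.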
(In reply to Petr Vorel from comment #23) > TL;DR: Probably problem in my setup, we can probably close this. The rest is > a description if you find something which I do obviously wrong or if there > is something what can be improved. Thanks for the detailed report. Very much appreciated! > I wonder how can happen that 2 driver versions can coexist together? > (kernel-firmware-nvidia-gsp-G06-525.116 vs. > kernel-firmware-nvidia-gspx-G06-535 and > nvidia-open-driver-G06-signed-kmp-default-535 and > nvidia-open-driver-G06-signed-kmp-default-545): > > $ rpm -qa |grep -i nvidia | sort > kernel-firmware-nvidia-20231107-1.1.noarch > kernel-firmware-nvidia-gsp-G06-525.116.04-2.1.x86_64 > kernel-firmware-nvidia-gsp-G06-535.54.03-1.1.x86_64 > kernel-firmware-nvidia-gspx-G06-535.113.01-1.1.x86_64 > kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.x86_64 > kernel-firmware-nvidia-gspx-G06-535.129.03-11.1.x86_64 > kernel-firmware-nvidia-gspx-G06-535.129.03-12.1.x86_64 > kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.x86_64 > libnvidia-egl-wayland1-1.1.12-1.2.x86_64 > libva-nvidia-driver-0.0.10-1.1.x86_64 > nvidia-compute-G06-32bit-535.129.03-15.1.x86_64 > nvidia-compute-G06-535.129.03-15.1.x86_64 > nvidia-gl-G06-32bit-535.129.03-15.1.x86_64 > nvidia-gl-G06-535.129.03-15.1.x86_64 > nvidia-open-driver-G06-signed-kmp-default-535.129.03_k6.6.1_1-1.2.x86_64 > nvidia-open-driver-G06-signed-kmp-default-545.29.02_k6.5.9_1-57.1.x86_64 > nvidia-video-G06-32bit-535.129.03-15.1.x86_64 > nvidia-video-G06-535.129.03-15.1.x86_64 > > > $ rpm -qi kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.x86_64 > Name : kernel-firmware-nvidia-gspx-G06 > Version : 545.29.02 > Release : 13.1 > Architecture: x86_64 > Install Date: Út 14. listopadu 2023, 09:27:44 > Group : System/Kernel > Size : 64294720 > License : GPL-2.0-only AND SUSE-Firmware AND GPL-2.0-or-later AND MIT > Signature : RSA/SHA256, Po 13. 
listopadu 2023, 16:53:44, Key ID > 590401a1e38fb563 > Source RPM : kernel-firmware-nvidia-gspx-G06-545.29.02-13.1.nosrc.rpm > Build Date : Po 13. listopadu 2023, 16:53:25 > Build Host : i04-ch2a > Vendor : obs://build.opensuse.org/X11:Drivers:Video > URL : https://www.nvidia.com/en-us/drivers/unix/ > Summary : Kernel firmware file for open NVIDIA kernel module driver G06 > Description : > This package contains the versioned kernel firmware file "gsp.bin" for > the OpenSource NVIDIA kernel module driver G06. > Distribution: X11:Drivers:Video:Redesign / openSUSE_Tumbleweed > > $ rpm -qi kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.x86_64 > Name : kernel-firmware-nvidia-gspx-G06 > Version : 535.129.03 > Release : 1.1 > Architecture: x86_64 > Install Date: Pá 10. listopadu 2023, 07:23:53 > Group : System/Kernel > Size : 61824832 > License : GPL-2.0-only AND SUSE-Firmware AND GPL-2.0-or-later AND MIT > Signature : RSA/SHA512, Čt 2. listopadu 2023, 20:48:50, Key ID > 35a2f86e29b700a4 > Source RPM : kernel-firmware-nvidia-gspx-G06-535.129.03-1.1.nosrc.rpm > Build Date : Čt 2. listopadu 2023, 20:48:26 > Build Host : i04-ch1b > Packager : https://bugs.opensuse.org > Vendor : openSUSE > URL : https://www.nvidia.com/en-us/drivers/unix/ > Summary : Kernel firmware file for open NVIDIA kernel module driver G06 > Description : > This package contains the versioned kernel firmware file "gsp.bin" for > the OpenSource NVIDIA kernel module driver G06. > Distribution: openSUSE Tumbleweed > > I suppose this is due multiversion = provides:multiversion(kernel), right? Yes, this is exactly the reason. > Because I see that both nvidia-open-driver devel [1] and factory [2] have > the same newer version, the same applies to kernel-firmware-nvidia-gspx-G06 > [3] [4] I removed obs://build.opensuse.org/X11:Drivers:Video and removed > packages and install only the latest version. Yes, you no longer need the devel projects, since the driver+firmware is now included in our products. 
So better remove these.

> After this, the default value ("options nvidia-drm modeset=1 fbdev=1" and
> *not* set NVreg_OpenRMEnableSupporteGpus=1) was working for xorg.

Thanks for confirmation.

> After
> installation the still was not working even I run dracut, I needed to ssh to
> the system, rerun dracut and reboot to get it working. Let's assume I did
> something wrong, that's why I needed to rerun dracut via ssh. But sway did
> not work.

Yeah. You need to reboot now after changing the kernel module config. You can no longer easily unload the driver when the option "fbdev=1" is set, which eventually added a Linux console with this driver.

> Removing "fbdev=1" made no difference (working xorg, broken sway).
>
> Adding NVreg_OpenRMEnableSupporteGpus=1 is the option which breaks booting.

Interesting that having this option still set breaks things. I think it should be removed from the driver.

> For sway are also needed nvidia-video-G06 (otherwise sway startup freezes)
> and nvidia-gl-G06 (sway startup fails) from the proprietary NVIDIA
> repository.
>
> i.e. both kernel open driver
> nvidia-open-driver-G06-signed-kmp-default-545.29.02_k6.6.1_1-1.1.x86_64 and
> GPU
> and proprietary NVIDIA OpenGL libraries are needed for sway (while this
> might be obvious from the blog post [5] it was new for me, because sway
> claims "don't use nvidia proprietary").

Ok. Good to know this. Maybe sway just doesn't work with Mesa's software fallback driver, no matter which KMS driver is in use.
So I'm closing this for now. Of course you can report what happens with the next update. ;-)
(In reply to Petr Vorel from comment #23)
> Adding NVreg_OpenRMEnableSupporteGpus=1 is the option which breaks booting.

I cannot reproduce that issue. Driver 545.29.02 simply ignores this setting.

[ 4.993601] nvidia: unknown parameter 'NVreg_OpenRMEnableSupporteGpus' ignored
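Messages of this kind can be pulled out of the kernel log to list every module option that was not accepted. The kernel prints this message for any parameter name the module does not define, so the same check also catches misspelled option names. A minimal sketch; the function name unknown_module_params is hypothetical.

```shell
# Hypothetical helper: read kernel log lines on stdin and print
# "MODULE PARAM" for each "unknown parameter ... ignored" message.
unknown_module_params() {
    # The optional first group swallows a timestamp or other prefix that
    # ends in "]" or a space; groups 2 and 3 are module and parameter name.
    sed -n "s/^\(.*[] ]\)\{0,1\}\([A-Za-z0-9_]*\): unknown parameter '\([^']*\)' ignored.*\$/\2 \3/p"
}
```

Usage: dmesg | unknown_module_params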
SUSE-RU-2023:4642-1: An update that has two fixes can now be installed. Category: recommended (moderate) Bug References: 1215981, 1217370 Sources used: openSUSE Leap 15.5 (src): nvidia-open-driver-G06-signed-545.29.02-150500.3.18.1 SUSE Linux Enterprise Micro 5.5 (src): nvidia-open-driver-G06-signed-545.29.02-150500.3.18.1 Basesystem Module 15-SP5 (src): nvidia-open-driver-G06-signed-545.29.02-150500.3.18.1 Public Cloud Module 15-SP5 (src): nvidia-open-driver-G06-signed-545.29.02-150500.3.18.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
SUSE-RU-2023:4641-1: An update that has two fixes can now be installed. Category: recommended (moderate) Bug References: 1215981, 1217370 Sources used: openSUSE Leap 15.4 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 SUSE Linux Enterprise Micro for Rancher 5.3 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 SUSE Linux Enterprise Micro 5.3 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 SUSE Linux Enterprise Micro for Rancher 5.4 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 SUSE Linux Enterprise Micro 5.4 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 Basesystem Module 15-SP4 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 Public Cloud Module 15-SP4 (src): nvidia-open-driver-G06-signed-545.29.02-150400.9.32.1 NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
I still experience a black screen very often (e.g. ~ 50% of boots or resumes from suspend). I guess that what I reported as a configuration issue in /usr/lib/modprobe.d/50-nvidia-default.conf (there probably was at least one problem with it) or as a broken "systemctl suspend" is something else. It happens even when I don't do any update or configuration change. OTOH I did some updates, thus it also happened on different kernels and nvidia driver versions.

When there is a black screen, the log is full of these repeating messages:

[ 23.262590] snd_hda_intel 0000:01:00.1: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
[ 23.262597] snd_hda_intel 0000:01:00.1: device [10de:2291] error status/mask=00100000/00000000
[ 23.262602] snd_hda_intel 0000:01:00.1: [20] UnsupReq (First)
[ 23.262606] snd_hda_intel 0000:01:00.1: AER: TLP Header: 60000008 000000ff 00000040 00840000
[ 23.262613] pci 0000:01:00.0: AER: can't recover (no error_detected callback)
[ 23.262615] snd_hda_intel 0000:01:00.1: AER: can't recover (no error_detected callback)
[ 23.262646] pcieport 0000:00:01.0: AER: device recovery failed
[ 23.349965] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.1

I already reported it in comment #5, but in dmesg #7 it appeared only once. Later it became permanent (i.e. the dmesg ring buffer contains only these messages). Is that a hardware error?

Documenting the current state of the config files (IMHO they are correct).
$ rpm -qa |grep -i -e kernel-default -e nvidia | sort
kernel-default-devel-6.6.2-1.1.x86_64
kernel-default-devel-6.6.3-1.1.x86_64
kernel-default-6.6.2-1.1.x86_64
kernel-default-6.6.3-1.1.x86_64
kernel-firmware-nvidia-gspx-G06-545.29.06-1.1.x86_64
kernel-firmware-nvidia-20231128-1.1.noarch
libnvidia-egl-wayland1-1.1.13-1.1.x86_64
libva-nvidia-driver-0.0.11-1.1.x86_64
nvidia-compute-G06-32bit-545.29.06-18.1.x86_64
nvidia-compute-G06-545.29.06-18.1.x86_64
nvidia-driver-G06-kmp-default-545.29.06_k6.6.2_1-18.1.x86_64
nvidia-gl-G06-32bit-545.29.06-18.1.x86_64
nvidia-gl-G06-545.29.06-18.1.x86_64
nvidia-video-G06-32bit-545.29.06-18.1.x86_64
nvidia-video-G06-545.29.06-18.1.x86_64

$ uname -a
Linux p16 6.6.3-1-default #1 SMP PREEMPT_DYNAMIC Wed Nov 29 05:06:07 UTC 2023 (d766c57) x86_64 x86_64 x86_64 GNU/Linux

$ cat /usr/lib/modprobe.d/50-nvidia-default.conf |grep -v ^#
options nvidia NVreg_DeviceFileUID=0 NVreg_DeviceFileGID=485 NVreg_DeviceFileMode=0660 NVreg_PreserveVideoMemoryAllocations=1
options nvidia-drm modeset=1 fbdev=1
install nvidia PATH=$PATH:/bin:/usr/bin; if /sbin/modprobe --ignore-install nvidia; then if /sbin/modprobe nvidia_uvm; then if [ ! -c /dev/nvidia-uvm ]; then mknod -m 660 /dev/nvidia-uvm c $(cat /proc/devices | while read major device; do if [ "$device" = "nvidia-uvm" ]; then echo $major; break; fi ; done) 0; chown :video /dev/nvidia-uvm; fi; if [ ! -c /dev/nvidia-uvm-tools ]; then mknod -m 660 /dev/nvidia-uvm-tools c $(cat /proc/devices | while read major device; do if [ "$device" = "nvidia-uvm" ]; then echo $major; break; fi ; done) 1; chown :video /dev/nvidia-uvm-tools; fi; fi; if [ ! -c /dev/nvidiactl ]; then mknod -m 660 /dev/nvidiactl c 195 255; chown :video /dev/nvidiactl; fi; devid=-1; for dev in $(ls -d /sys/bus/pci/devices/*); do vendorid=$(cat $dev/vendor); if [ "$vendorid" = "0x10de" ]; then class=$(cat $dev/class); classid=${class%%00}; if [ "$classid" = "0x0300" -o "$classid" = "0x0302" ]; then devid=$((devid+1)); if [ ! -c /dev/nvidia${devid} ]; then mknod -m 660 /dev/nvidia${devid} c 195 ${devid}; chown :video /dev/nvidia${devid}; fi; fi; fi; done; /sbin/modprobe nvidia_drm; if [ ! -c /dev/nvidia-modeset ]; then mknod -m 660 /dev/nvidia-modeset c 195 254; chown :video /dev/nvidia-modeset; fi; fi

$ cat /usr/lib/tmpfiles.d/nvidia-logind-acl-trick-G06.conf
L /run/udev/static_node-tags/uaccess/nvidiactl - - - - /dev/nvidiactl
L /run/udev/static_node-tags/uaccess/nvidia-uvm - - - - /dev/nvidia-uvm
L /run/udev/static_node-tags/uaccess/nvidia-uvm-tools - - - - /dev/nvidia-uvm-tools
L /run/udev/static_node-tags/uaccess/nvidia-modeset - - - - /dev/nvidia-modeset
L /run/udev/static_node-tags/uaccess/nvidia0 - - - - /dev/nvidia0

$ cat /usr/lib/modprobe.d/nvidia-default.conf
blacklist nouveau

$ cat /usr/lib/dracut/dracut.conf.d/60-nvidia-default.conf
add_drivers+=" nvidia nvidia-drm nvidia-modeset nvidia-uvm "

$ cat /usr/src/kernel-modules/nvidia-545.29.06-default/dkms.conf |grep -v ^#
PACKAGE_NAME="nvidia"
PACKAGE_VERSION="__VERSION_STRING"
AUTOINSTALL="yes"
MAKE[0]="'make' -j__JOBS NV_EXCLUDE_BUILD_MODULES='__EXCLUDE_MODULES' KERNEL_UNAME=${kernelver} modules"
__DKMS_MODULES
Hmm, snd_hda_intel sounds like the driver for the internal Intel sound chip.

> [ 23.262646] pcieport 0000:00:01.0: AER: device recovery failed
> [ 23.349965] pcieport 0000:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 0000:01:00.1

No idea. Google tells me
https://www.videogames.ai/dmesg-aer-error#:~:text=You%20can%20fix%20the%20problem,and%20disabling%20memory%20mapping%20support.&text=Just%20need%20to%20reboot%20and%20the%20error%20should%20disapear.

Maybe it's worth a try.
AER report is usually harmless, but if it happens even with a newer kernel, it's a regression and should be addressed. (And yes, it's worth to test the boot options to see whether it suppresses or not.)
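For reference, the "error status/mask=00100000/00000000" value in the reported log can be related to the "[20] UnsupReq" line by listing the bits set in the status word. The helper below is only a sketch; the function name aer_set_bits is made up for illustration and is not part of any tool. Note also that the reported device ID [10de:2291] carries vendor ID 10de, the same vendor ID the nvidia modprobe script shown earlier matches against, which suggests that 0000:01:00.1 is a function of the NVIDIA card.

```shell
# Hypothetical helper: print the bit positions that are set in a
# hexadecimal status word (given without the 0x prefix).
aer_set_bits() {
    status=$((0x$1))
    bit=0
    out=""
    while [ "$status" -ne 0 ]; do
        if [ $((status & 1)) -eq 1 ]; then
            # Append the bit number, space-separated.
            out="${out:+$out }$bit"
        fi
        status=$((status >> 1))
        bit=$((bit + 1))
    done
    printf '%s\n' "$out"
}

# Status word taken from the log in this report.
aer_set_bits 00100000
```

For the status word from the log this prints the single bit number 20, matching the "[20] UnsupReq (First)" line. The mask value 00000000 has no bits set, so the helper prints an empty line for it.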
SUSE-RU-2024:0143-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1215981
Sources used:
openSUSE Leap 15.5 (src): nvidia-open-driver-G06-signed-545.29.06-150500.3.21.5
SUSE Linux Enterprise Micro 5.5 (src): nvidia-open-driver-G06-signed-545.29.06-150500.3.21.5
Basesystem Module 15-SP5 (src): nvidia-open-driver-G06-signed-545.29.06-150500.3.21.5
Public Cloud Module 15-SP5 (src): nvidia-open-driver-G06-signed-545.29.06-150500.3.21.5

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
SUSE-RU-2024:0169-1: An update that has one fix can now be installed.

Category: recommended (moderate)
Bug References: 1215981
Sources used:
SUSE Manager Retail Branch Server 4.3 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Manager Server 4.3 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
openSUSE Leap 15.4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Micro for Rancher 5.3 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Micro 5.3 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Micro for Rancher 5.4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Micro 5.4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
Public Cloud Module 15-SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise High Performance Computing ESPOS 15 SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise High Performance Computing LTSS 15 SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Desktop 15 SP4 LTSS 15-SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Server 15 SP4 LTSS 15-SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Linux Enterprise Server for SAP Applications 15 SP4 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2
SUSE Manager Proxy 4.3 (src): nvidia-open-driver-G06-signed-545.29.06-150400.9.35.2

NOTE: This line indicates an update has been released for the listed product(s). At times this might be only a partial fix. If you have questions please reach out to maintenance coordination.
(In reply to Takashi Iwai from comment #37)
> AER report is usually harmless, but if it happens even with a newer kernel,
> it's a regression and should be addressed.
> (And yes, it's worth to test the boot options to see whether it suppresses
> or not.)

So have you tried this meanwhile? Instructions in the link you posted in comment #36.
(In reply to Stefan Dirsch from comment #44)
> (In reply to Takashi Iwai from comment #37)
> > AER report is usually harmless, but if it happens even with a newer kernel,
> > it's a regression and should be addressed.
> > (And yes, it's worth to test the boot options to see whether it suppresses
> > or not.)
>
> So have you tried this meanwhile? Instructions in the link you posted in
> comment #36.

Any news on this one?
@Petr ping ...
I'm sorry, in the meantime I switched to nouveau, but I'll reinstall the nvidia driver and check it.
(In reply to Stefan Dirsch from comment #46)
> (In reply to Stefan Dirsch from comment #44)
> > (In reply to Takashi Iwai from comment #37)
> > > AER report is usually harmless, but if it happens even with a newer kernel,
> > > it's a regression and should be addressed.
> > > (And yes, it's worth to test the boot options to see whether it suppresses
> > > or not.)
> >
> > So have you tried this meanwhile? Instructions in the link you posted in
> > comment #36.
>
> Any news on this one?

Yes, pci=nommconf kernel command parameter suppresses AER error message in dmesg.
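When repeating this test, it can help to confirm that the running kernel was really booted with pci=nommconf before looking at dmesg. A minimal sketch follows; the function name has_kernel_param is an illustrative assumption, and the check only matches the parameter as a whole word, so a combined form such as a comma-separated pci= option list would not be detected.

```shell
# Hypothetical helper: succeed if the exact parameter $1 appears as a
# whitespace-separated word in the kernel command line string $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# /proc/cmdline holds the command line of the running kernel.
if has_kernel_param pci=nommconf "$(cat /proc/cmdline 2>/dev/null)"; then
    echo "pci=nommconf is active"
else
    echo "pci=nommconf is not on the kernel command line"
fi
```

The command line is passed in as an argument rather than read inside the function, so the same check can be run against a saved command line from another boot.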
Just for the record, nouveau kernel driver does not have the problem (going to retest nvidia kernel drivers).
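When comparing nouveau against the nvidia kernel driver, it is worth recording which driver is actually bound to the GPU in each test boot. The sketch below reads the driver symlink that sysfs exposes for a PCI device; the function name bound_driver and its optional sysfs-root argument are assumptions made for illustration, and the address 0000:01:00.0 is the GPU address from the AER log in this report.

```shell
# Hypothetical helper: print the name of the kernel driver bound to the
# PCI device $1. $2 optionally overrides the sysfs root (default /sys).
bound_driver() {
    link="${2:-/sys}/bus/pci/devices/$1/driver"
    # No symlink means no driver is bound (or the device does not exist).
    [ -L "$link" ] || return 1
    basename "$(readlink "$link")"
}

# GPU address taken from the log in this report.
bound_driver 0000:01:00.0 || echo "no driver bound to 0000:01:00.0"
```

Passing the sysfs root as a parameter keeps the helper usable on a copied sysfs tree and makes it easy to check without the hardware present.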
(In reply to Petr Vorel from comment #49)
> (In reply to Stefan Dirsch from comment #46)
> > (In reply to Stefan Dirsch from comment #44)
> > > (In reply to Takashi Iwai from comment #37)
> > > > AER report is usually harmless, but if it happens even with a newer kernel,
> > > > it's a regression and should be addressed.
> > > > (And yes, it's worth to test the boot options to see whether it suppresses
> > > > or not.)
> > >
> > > So have you tried this meanwhile? Instructions in the link you posted in
> > > comment #36.
> >
> > Any news on this one?
>
> Yes, pci=nommconf kernel command parameter suppresses AER error message in
> dmesg.

Thanks for verifying that!
(In reply to Petr Vorel from comment #50)
> Just for the record, nouveau kernel driver does not have the problem (going
> to retest nvidia kernel drivers).

I think with that we should close this bug. I understand that it's a hassle testing again and again a driver when you already found another solution. And since nobody else seems to be affected ...