Bug 1213765 - nvidia driver removal does not run depmod making subsequent dracut calls fail
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: X11 3rd Party Driver
Version: Current
Hardware: Other Other
Importance: P3 - Medium : Normal
Target Milestone: ---
Assignee: Stefan Dirsch
QA Contact: Stefan Dirsch
Reported: 2023-07-28 16:46 UTC by Andrei Borzenkov
Modified: 2023-08-02 12:43 UTC

Description Andrei Borzenkov 2023-07-28 16:46:12 UTC
2023-07-28 18:55:37|command|root@uefi|'zypper' 'rm' '--clean-deps' 'nvidia-driver-G06-kmp-default'|
2023-07-28 18:55:43|remove |nvidia-compute-G06-32bit|535.86.05-10.1|x86_64||
2023-07-28 18:55:45|remove |nvidia-gl-G06-32bit|535.86.05-10.1|x86_64||
2023-07-28 18:55:47|remove |nvidia-video-G06-32bit|535.86.05-10.1|x86_64||
2023-07-28 18:56:07|remove |nvidia-gl-G06|535.86.05-10.1|x86_64||
2023-07-28 18:56:09|remove |nvidia-video-G06|535.86.05-10.1|x86_64||
2023-07-28 18:56:13|remove |libnvidia-egl-wayland1|1.1.12-1.1|x86_64||
2023-07-28 18:56:18|remove |nvidia-compute-G06|535.86.05-10.1|x86_64||
# 2023-07-28 18:56:33 nvidia-driver-G06-kmp-default-535.86.05_k6.4.3_1-10.1.x86_64 removed ok
# warning: /usr/lib/modprobe.d/50-nvidia-default.conf saved as /usr/lib/modprobe.d/50-nvidia-default.conf.rpmsave
# warning: /usr/lib/dracut/dracut.conf.d/60-nvidia-default.conf saved as /usr/lib/dracut/dracut.conf.d/60-nvidia-default.conf.rpmsave
# update-alternatives: warning: alternative /usr/lib/nvidia/alternate-install-present-default (part of link group alternate-install-present) doesn't exist; removing from list of alternatives
2023-07-28 18:56:33|remove |nvidia-driver-G06-kmp-default|535.86.05_k6.4.3_1-10.1|x86_64|root@uefi|


So far so good, but a subsequent dracut run fails with:

dracut[I]: *** Including module: kernel-modules-extra ***
realpath: updates/nvidia-modeset.ko: No such file or directory
realpath: updates/nvidia-peermem.ko: No such file or directory
realpath: updates/nvidia-drm.ko: No such file or directory
realpath: updates/nvidia-uvm.ko: No such file or directory
realpath: updates/nvidia.ko: No such file or directory
dracut[F]: installkernel failed in module kernel-modules-extra

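This is because the updates/ directory is now empty, while modules.dep still lists the removed NVIDIA modules: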

user@uefi:~> ll /usr/lib/modules/6.4.4-1-default/updates/
total 0
user@uefi:~> grep nvidia /usr/lib/modules/6.4.4-1-default/modules.dep
kernel/drivers/net/ethernet/nvidia/forcedeth.ko.zst:
kernel/drivers/usb/typec/altmodes/typec_nvidia.ko.zst: kernel/drivers/usb/typec/altmodes/typec_displayport.ko.zst kernel/drivers/usb/typec/typec.ko.zst
kernel/drivers/i2c/busses/i2c-nvidia-gpu.ko.zst: kernel/drivers/i2c/busses/i2c-ccgx-ucsi.ko.zst
kernel/drivers/platform/x86/nvidia-wmi-ec-backlight.ko.zst: kernel/drivers/acpi/video.ko.zst kernel/drivers/platform/x86/wmi.ko.zst
updates/nvidia-modeset.ko: updates/nvidia.ko kernel/drivers/acpi/video.ko.zst kernel/drivers/platform/x86/wmi.ko.zst
updates/nvidia-peermem.ko:
updates/nvidia-drm.ko: updates/nvidia-modeset.ko updates/nvidia.ko kernel/drivers/acpi/video.ko.zst kernel/drivers/platform/x86/wmi.ko.zst
updates/nvidia-uvm.ko: updates/nvidia.ko
updates/nvidia.ko:
user@uefi:~> ll /usr/lib/modules/6.4.4-1-default/modules.dep
-rw-r--r-- 1 root root 655367 Jul 25 21:47 /usr/lib/modules/6.4.4-1-default/modules.dep
user@uefi:~> 

So modules.dep was not updated when the drivers were removed.
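
The mismatch shown above can be detected mechanically. The following is a minimal sketch, not part of any package: a hypothetical helper, find_stale_modules, that prints every module named on the left-hand side of a modules.dep file whose file no longer exists in the module tree. It assumes the modules.dep layout seen in the output above, where each line starts with a path relative to the tree, followed by a colon.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): list modules.dep entries
# whose module file is missing from the given kernel module tree.
# Usage: find_stale_modules /usr/lib/modules/<kernel-release>
find_stale_modules() {
    tree=$1
    [ -r "$tree/modules.dep" ] || return 1
    # The key of each line is everything before the first colon,
    # i.e. a path relative to the module tree.
    cut -d: -f1 "$tree/modules.dep" | while IFS= read -r mod; do
        [ -n "$mod" ] || continue
        [ -e "$tree/$mod" ] || printf '%s\n' "$mod"
    done
}
```

Called as find_stale_modules /usr/lib/modules/6.4.4-1-default on the system from this report, it would be expected to list the updates/nvidia*.ko entries shown above, since the updates/ directory is empty.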
Comment 1 Stefan Dirsch 2023-07-28 18:57:44 UTC
Hmm. This is the first time I have heard of this being an issue. Indeed, it seems depmod is not being executed in %postun/%preun. May I ask how you ran dracut exactly?
Comment 2 Andrei Borzenkov 2023-07-29 06:12:28 UTC
(In reply to Stefan Dirsch from comment #1)
> May I ask how you ran dracut exactly?

dracut -f --regenerate-all
Comment 3 Andrei Borzenkov 2023-07-29 06:39:17 UTC
Related: on installation of the NVIDIA KMP, a new initrd is generated which includes the NVIDIA modules. On removal of the KMP the initrd is not regenerated, so we are left with an initrd that still includes the NVIDIA drivers. This can be quite surprising for users, and it actually did cause issues. So the package should probably simply call

/usr/lib/module-init-tools/kernel-scriptlets/kmp-postun

which will also take care of depmod. I bet there are some magic RPM macros to take care of it.
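
Until the package handles this itself, the stale state can presumably be repaired by hand. The sketch below is a hypothetical workaround, not something taken from the package: it runs depmod for every installed kernel tree and then the same dracut command used in comment 2. The function name and the RUN variable are illustrative assumptions; setting RUN=echo prints the commands instead of executing them.

```shell
#!/bin/sh
# Hypothetical manual workaround: rebuild modules.dep for every
# installed kernel tree, then regenerate all initrds.
# Set RUN=echo to print the commands instead of executing them.
# The directory argument is only used to discover kernel releases;
# depmod itself looks in its default module directory.
refresh_modules_and_initrd() {
    moddir=${1:-/usr/lib/modules}
    for tree in "$moddir"/*/; do
        # Only directories that already carry a modules.dep are treated
        # as kernel module trees.
        [ -f "${tree}modules.dep" ] || continue
        kver=$(basename "$tree")
        ${RUN:-} depmod -a "$kver"
    done
    ${RUN:-} dracut -f --regenerate-all
}
```

Without RUN=echo this needs root privileges, because both depmod and dracut write to system directories.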
Comment 4 Stefan Dirsch 2023-07-29 08:46:01 UTC
(In reply to Andrei Borzenkov from comment #3)
> Related - on installation of NVIDIA KMP new initrd is generated which
> includes NVIDIA modules. On removal of KMP initrd is not regenerated so we
> are left with initrd including NVIDIA drivers. Which can be quite surprising
> for users and it actually did cause issues. So probably package should
> simply call 
> 
> /usr/lib/module-init-tools/kernel-scriptlets/kmp-postun
> 
> which will also take care of depmod. I bet there are some magic RPM macros
> to take care of it.

Actually this is being done.

postuninstall scriptlet (using /bin/sh):
[...]
run_if_exists /usr/lib/module-init-tools/kernel-scriptlets/kmp-postun --name "nvidia-driver-G06-kmp-default" \
  --version "535.86.05_k6.4.3_1" --release "10.1" --kernelrelease "6.4.3-1" \
  --flavor "default" --usrmerged "01" "$@"
Comment 6 Andrei Borzenkov 2023-07-29 10:49:34 UTC
(In reply to Stefan Dirsch from comment #4)
>   --version "535.86.05_k6.4.3_1" --release "10.1" --kernelrelease "6.4.3-1" 

The kernel release is hardcoded in the RPM, so the scriptlet is skipped for any other kernel release. There is some code in %triggerpostun, but I honestly fail to understand what it does. Oh, and for G06 it is using the wrong package name; it says

triggerpostun scriptlet (using /bin/bash) -- nvidia-gfxG06-kmp-default

while the package is named nvidia-driver-G06-kmp-default. Maybe this trigger is explicitly for migrating from the old package naming convention; I do not know.
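
To make the effect of the hardcoded kernel release concrete, here is a small sketch. It is a hypothetical helper, not code from the package: it lists the module trees whose modules.dep still references updates/nvidia* but whose name does not match the release passed to kmp-postun. It assumes that a tree is named "<kernelrelease>-<flavor>", as in 6.4.4-1-default above.

```shell
#!/bin/sh
# Hypothetical helper: list module trees that still reference NVIDIA
# modules in modules.dep but do not match the kernel release the
# KMP scriptlet was built for.
# Assumption: a tree is named "<kernelrelease>-<flavor>".
# Usage: trees_not_covered <kernelrelease> <flavor> [module-directory]
trees_not_covered() {
    krel=$1
    flavor=$2
    moddir=${3:-/usr/lib/modules}
    for dep in "$moddir"/*/modules.dep; do
        [ -f "$dep" ] || continue
        tree=$(basename "$(dirname "$dep")")
        # The tree matching the hardcoded release is the one kmp-postun handles.
        [ "$tree" = "$krel-$flavor" ] && continue
        grep -q '^updates/nvidia' "$dep" && printf '%s\n' "$tree"
    done
    return 0
}
```

With the values from comment 4, the call would be trees_not_covered 6.4.3-1 default.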
Comment 7 Andrei Borzenkov 2023-07-29 12:36:14 UTC
Thinking more about it: on Tumbleweed the NVIDIA driver is compiled and installed into the kernel tree every time the *kernel* is updated. What should happen when the KMP is removed? Should all compiled binaries installed for every kernel version be removed? Should only the binary from the latest release be removed? I would rather expect all compiled binaries to be removed. This is what, e.g., the Ubuntu nvidia-dkms package does.
Comment 8 Stefan Dirsch 2023-07-29 13:30:33 UTC
(In reply to Andrei Borzenkov from comment #6)
> (In reply to Stefan Dirsch from comment #4)
> >   --version "535.86.05_k6.4.3_1" --release "10.1" --kernelrelease "6.4.3-1" 
> 
> Kernel release is hardcoded in RPM and it is skipped for any other release.
> There is some code in %triggerpostun, but I honestly fail to understand what
> it does. Oh, and for G06 it is using the wrong package name - it says
> 
> triggerpostun scriptlet (using /bin/bash) -- nvidia-gfxG06-kmp-default
> 
> while package is named nvidia-driver-G06-kmp-default. May be this trigger is
> explicitly for migrating from old package naming convention, do not know.

This was needed only for the migration from the old to the new package name.
Comment 9 Stefan Dirsch 2023-07-29 13:35:45 UTC
(In reply to Andrei Borzenkov from comment #7)
> Thinking more about it. On Tumbleweed NVIDIA driver is compiled and
> installed into kernel tree every time *kernel* is updated. What should
> happen when KMP is removed? Should all compiled binaries installed for every
> kernel version be removed? Should only the binary from the latest release be
> removed? I rather expect all compiled binaries should be removed. This is
> what e.g. Ubuntu nvidia-dkms package does.

I tried to remove the nvidia modules for kernel trees that no longer exist, but this failed and I had to revert it.

# rpm --triggers -qp nvidia-driver-G06-kmp-default-535.86.05_k6.4.3_1-10.1.x86_64.rpm
[...]
triggerpostun scriptlet (using /bin/sh) -- kernel-default
#
# Unfortunately doesn't work since kernel updates are not considered "atomar"
# when using YaST/zypper (only safe when using rpm) [boo#1182666]
#
#for dir in $(find /lib/modules  -mindepth 1 -maxdepth 1 -type d); do
#       if [ ! -d $dir/kernel ]; then
#               test -d $dir/updates && rm -f  $dir/updates/nvidia*.ko
#       fi
#done
Comment 10 Stefan Dirsch 2023-08-02 12:42:23 UTC
OK. Indeed, I can easily reproduce this issue with current Tumbleweed.

(In reply to Stefan Dirsch from comment #9)
> (In reply to Andrei Borzenkov from comment #7)
> > Thinking more about it. On Tumbleweed NVIDIA driver is compiled and
> > installed into kernel tree every time *kernel* is updated. What should
> > happen when KMP is removed? Should all compiled binaries installed for every
> > kernel version be removed? Should only the binary from the latest release be
> > removed? I rather expect all compiled binaries should be removed. This is
> > what e.g. Ubuntu nvidia-dkms package does.
> 
> I tried to remove nvidia modules for no longer existing kernel trees, but
> this fails and I needed to revert it.
> 
> # rpm --triggers -qp
> nvidia-driver-G06-kmp-default-535.86.05_k6.4.3_1-10.1.x86_64.rpm
> [...]
> triggerpostun scriptlet (using /bin/sh) -- kernel-default
> #
> # Unfortunately doesn't work since kernel updates are not considered "atomar"
> # when using YaST/zypper (only safe when using rpm) [boo#1182666]
> #
> #for dir in $(find /lib/modules  -mindepth 1 -maxdepth 1 -type d); do
> #       if [ ! -d $dir/kernel ]; then
> #               test -d $dir/updates && rm -f  $dir/updates/nvidia*.ko
> #       fi
> #done

Actually, I am already doing this when the package is removed completely (%postun). For that case I have now added a depmod run for the affected kernel module trees.
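
The actual %postun code is not quoted in this report, so the following is only a sketch of the idea under stated assumptions: remove leftover NVIDIA modules from each module tree and run depmod for every tree that was touched. The function name and the DEPMOD override are hypothetical; the loop is modeled on the commented-out trigger shown above.

```shell
#!/bin/sh
# Sketch only; the real %postun code is not quoted in this report.
# Remove leftover NVIDIA modules from every module tree and rebuild
# the dependency files of each tree that was touched.
# DEPMOD can be overridden (e.g. DEPMOD="echo depmod") for a dry run
# of the depmod step; the module files are still removed.
# The directory argument only selects where module files are looked up;
# depmod itself resolves the release in its default module directory.
cleanup_nvidia_modules() {
    moddir=${1:-/lib/modules}
    for dir in "$moddir"/*/; do
        dir=${dir%/}
        [ -d "$dir/updates" ] || continue
        # Skip trees without NVIDIA modules so depmod runs only where needed.
        ls "$dir"/updates/nvidia*.ko >/dev/null 2>&1 || continue
        rm -f "$dir"/updates/nvidia*.ko
        ${DEPMOD:-depmod} -a "$(basename "$dir")"
    done
}
```

Running depmod in the same step as the removal keeps modules.dep consistent with the contents of updates/, which is what the later dracut run relies on.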
Comment 11 Stefan Dirsch 2023-08-02 12:43:32 UTC
This will get fixed with the following RPM changelog entry:

-------------------------------------------------------------------
Wed Aug  2 12:23:27 UTC 2023 - Stefan Dirsch <sndirsch@suse.com>

- %postun: regenerate modules.dep, etc. to avoid dracut failures
  later (boo#1213765)

Closing ...