Bug 1173248 - Kernel 5.7.2 - nvidia driver hangs with high system load on Optimus system (Intel/NVIDIA combo) in NVIDIA mode
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: X11 3rd Party Driver
Version: Current
Hardware: Other Other
Priority: P3 - Medium
Severity: Major
Target Milestone: ---
Assigned To: Stefan Dirsch
Depends on:
Blocks:
Reported: 2020-06-23 06:17 UTC by Axel Braun
Modified: 2020-07-02 12:59 UTC (History)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Installation log (507.64 KB, text/plain)
2020-06-23 06:17 UTC, Axel Braun
Installation log TW 20200628 (69.32 KB, text/plain)
2020-07-01 08:04 UTC, Axel Braun

Description Axel Braun 2020-06-23 06:17:32 UTC
Created attachment 839025 [details]
Installation log

Installing the current TW image causes a problem when the NVIDIA driver is installed: during installation of nvidia-glG05-440.82-30.1.x86_64, the system hangs under high system load. The driver compilation before that step fails; see attachment.
Comment 1 Axel Braun 2020-06-23 11:11:37 UTC
Looks like there is a patch in between:
https://gitlab.com/snippets/1965550
Comment 2 Stefan Dirsch 2020-06-24 10:25:17 UTC
(In reply to Axel Braun from comment #1)
> Looks like there is a patch in between:
> https://gitlab.com/snippets/1965550

An identical patch is applied to our packages in the current repo.
Comment 3 Stefan Dirsch 2020-06-24 10:26:57 UTC
(In reply to Axel Braun from comment #0)
> Created attachment 839025 [details]
> Installation log
> 
> Installation of current TW image causes issue if Nvidia-driver is installed:
> At installation of nvidia-glG05-440.82-30.1.x86_64 , system hangs under high
> system load. Previous compilation of driver fails, see attachment

Let's see whether the new driver version fixes this issue. I'll work on this later.
Comment 4 Stefan Dirsch 2020-06-25 21:08:19 UTC
Updated packages are on the way now.
Comment 5 Axel Braun 2020-07-01 08:03:36 UTC
Same issue with the update to Snapshot 20200628 (Kernel 5.7.5). Please find the log of yesterday's installation attached.
Comment 6 Axel Braun 2020-07-01 08:04:58 UTC
Created attachment 839251 [details]
Installation log TW 20200628
Comment 7 Stefan Dirsch 2020-07-01 12:29:21 UTC
That's still NVIDIA 440.82. Please try with NVIDIA 440.100 (repos have been updated yesterday).
Comment 8 Axel Braun 2020-07-01 13:30:13 UTC
(In reply to Stefan Dirsch from comment #7)
> That's still NVIDIA 440.82. Please try with NVIDIA 440.100 (repos have been
> updated yesterday).

Please scroll down in the attachment: nvidia 440.100 is loaded, but too late.
kernel-default-devel-5.7.5-1.2.x86_64 is installed in step 82 and compiles against the old nvidia driver.

(176/182) Installieren: nvidia-gfxG05-kmp-default-440.100_k5.7.2_1-26.1.x86_64
comes much later, delivers the new version, and compiles the new modules.

The hang at step 178
(178/182) Installieren: nvidia-glG05-440.100-26.1.x86_64
is probably because it can't unload the nvidia modules completely.

What brings me to this conclusion?
I did the update again today, and before that I switched to the intel graphics:
X1E:/home/docb # prime-select intel
X1E:/home/docb # glxinfo | grep 'OpenGL renderer string'
OpenGL renderer string: Mesa DRI Intel(R) UHD Graphics 630 (CFL GT2)

When doing so, the message
"Can't unload nvidia.drm" (or similar) scrolled through the terminal.
(After switching graphics you need to log out and back in to get into Intel mode.)
My guess is that this causes the system to hang.
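
The version mismatch described in this comment (a KMP named for k5.7.2 next to kernel-default-devel 5.7.5) can be read straight off the package name. A minimal sketch, assuming the "_k<kernel version>_<release>" suffix seen in the package names in this report; kmp_kernel_version is a hypothetical helper name, not part of any SUSE tool:

```shell
#!/bin/sh
# Hypothetical helper: print the kernel version embedded after "_k"
# in a KMP package name, or nothing if the name has no such suffix.
kmp_kernel_version() {
    printf '%s\n' "$1" |
        sed -n 's/.*_k\([0-9][0-9.]*\)_[0-9][0-9]*-.*/\1/p'
}
```

For example, kmp_kernel_version "nvidia-gfxG05-kmp-default-440.100_k5.7.2_1-26.1.x86_64" yields the kernel version the package name refers to, which can then be compared by eye with the output of uname -r.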
Comment 9 Stefan Dirsch 2020-07-01 13:45:31 UTC
> (178/182) Installieren: nvidia-glG05-440.100-26.1.x86_64

Can't find this in the attached logfile. Indeed, it seems the installation of 440.100 worked fine. Maybe you should check whether any of the NVIDIA packages is installed twice in different versions.

tumbleweed/x86_64/x11-video-nvidiaG05-440.100-26.1.x86_64.rpm
tumbleweed/x86_64/nvidia-glG05-440.100-26.1.x86_64.rpm
tumbleweed/x86_64/nvidia-computeG05-440.100-26.1.x86_64.rpm
tumbleweed/x86_64/nvidia-gfxG05-kmp-default-440.100_k5.7.2_1-26.1.x86_64.rpm

These should be installed. Maybe you need to uninstall the mess of nvidia packages and reinstall them properly.
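
One way to run the suggested check for packages installed twice is to list only the package names and look for repeats. A minimal sketch; duplicate_package_names is a hypothetical helper name, and the input is assumed to be one package name per line:

```shell
#!/bin/sh
# Hypothetical helper: read one package name per line on stdin and
# print every name that occurs more than once.
duplicate_package_names() {
    sort | uniq -d
}
```

It could be fed with rpm -qa --qf '%{NAME}\n' | grep nvidia | duplicate_package_names; empty output would mean no nvidia package is installed in more than one version.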
Comment 10 Axel Braun 2020-07-01 15:19:01 UTC
(In reply to Stefan Dirsch from comment #9)
> > (178/182) Installieren: nvidia-glG05-440.100-26.1.x86_64
> 
> Can't find this in the attached logfile. Indeed seems installation of
> 440.100 worked fine. Maybe you should check if none of the NVIDIA packages
> is installed twice in different versions.
> 
> tumbleweed/x86_64/x11-video-nvidiaG05-440.100-26.1.x86_64.rpm
> tumbleweed/x86_64/nvidia-glG05-440.100-26.1.x86_64.rpm
> tumbleweed/x86_64/nvidia-computeG05-440.100-26.1.x86_64.rpm
> tumbleweed/x86_64/nvidia-gfxG05-kmp-default-440.100_k5.7.2_1-26.1.x86_64.rpm

X1E:/home/docb # rpm -qa | grep nvidia
nvidia-glG05-440.100-26.1.x86_64
nvidia-gfxG05-kmp-default-440.100_k5.7.2_1-26.1.x86_64
x11-video-nvidiaG05-440.100-26.1.x86_64
nvidia-computeG05-440.100-26.1.x86_64
kernel-firmware-nvidia-20200610-1.1.noarch

> These should be installed. Maybe you need to uninstall the mess of nvidia
> packages and reinstall them properly.

Hm, that should not be the idea behind zypper dup ;-)
BTW, the message when switching to intel driver is:

modprobe: FATAL: Module nvidia_drm is in use.

Best guess is that this module causes the issue
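
The "Module nvidia_drm is in use" message can be cross-checked against the use counts that lsmod reports. A minimal sketch, assuming the usual lsmod layout (module name, size, use count, users); nvidia_modules_in_use is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read `lsmod` output on stdin and print
# "<module> <use count>" for each nvidia* module whose use count
# (third column) is greater than zero.
nvidia_modules_in_use() {
    awk '$1 ~ /^nvidia/ && $3 > 0 { print $1, $3 }'
}
```

Run as lsmod | nvidia_modules_in_use. Any module listed there is still held by something and cannot be removed by modprobe -r until its users are gone.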
Comment 11 Stefan Dirsch 2020-07-01 16:16:07 UTC
> BTW, the message when switching to intel driver is:

What do you mean by switching to intel? Do you have an Optimus system with an Intel/NVIDIA combo and are you trying to use suse-prime?

> modprobe: FATAL: Module nvidia_drm is in use.
> Best guess is that this module causes the issue
Comment 12 Axel Braun 2020-07-01 20:24:34 UTC
(In reply to Stefan Dirsch from comment #11)
 
> What do you mean with switching to intel? Do you have an Optimus system with
> Intel/NVIDIA combo and are trying to use suse-prime?

Correct. Using suse-prime-bbswitch
Comment 13 Stefan Dirsch 2020-07-01 22:12:25 UTC
OK. The prime-select script tries to unload the nvidia kernel modules for intel mode, which of course cannot work as long as you're sitting on an X server that still needs them. But I think this is a non-issue: after a reboot the nvidia modules should no longer be active and bbswitch should be active instead (NVIDIA and bbswitch don't like each other). By using suse-prime-bbswitch you are apparently trying to disable the NVIDIA GPU completely, which is the purpose of suse-prime-bbswitch.
Comment 14 Stefan Dirsch 2020-07-01 22:17:50 UTC
I'm not sure what you're trying to achieve here. NVIDIA mode or Intel mode or Intel mode with NVIDIA GPU completely off to save more power?
Comment 15 Axel Braun 2020-07-02 07:16:33 UTC
(In reply to Stefan Dirsch from comment #14)
> I'm not sure what you're trying to achieve here. NVIDIA mode or Intel mode
> or Intel mode with NVIDIA GPU completely off to save more power?

The upgrade (zypper dup) should work independently of which GPU is active. I wonder how people who have only an NVIDIA card deal with this issue (changing it to ATI/AMD is not the answer here ;-), as they can't deactivate the nvidia driver. Or maybe they do it in init 3.

So, I'm not sure whether the zypper people should look into this, or how we can find out why zypper hangs.

I'm happy to have some more broken upgrades if it helps....
Comment 16 Stefan Dirsch 2020-07-02 12:38:18 UTC
Ah, it's really the "zypper dup" that hangs your machine. I got confused; I thought the driver would hang the system after the latest update. So the last packages installed were

2020-06-30 19:00:34|install|nvidia-gfxG05-kmp-default|440.100_k5.7.2_1-26.1|x86_64||NVIDIA|ab38a13092e18b471c083d659ba3777ed9a50e2b8e70503a06cc8842cb90687d|
2020-06-30 19:00:36|install|nvidia-computeG05|440.100-26.1|x86_64||NVIDIA|a3cff98fcc7680444ab99a417e5ca49395ca0bb5b546c59ca7942b09801c782c|
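
Lines like the two above can be pulled out of the zypp history with a small filter. A minimal sketch, assuming the pipe-separated layout visible in those lines (timestamp, action, package name, version); nvidia_installs_from_history is a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: read zypp history lines on stdin and print
# "<timestamp> <package> <version>" for every install of a package
# whose name contains "nvidia".
nvidia_installs_from_history() {
    awk -F'|' '$2 == "install" && $3 ~ /nvidia/ { print $1, $3, $4 }'
}
```

On openSUSE the history is usually kept in /var/log/zypp/history, so a typical call would be nvidia_installs_from_history < /var/log/zypp/history (reading it may require root).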

I'm afraid you need to report this to the zypper guys with the zypper log, etc. They will tell you. I suggest opening a new bug once you can reproduce the issue with "zypper dup", because by now you will no longer find the relevant zypper logs (overwritten by subsequent zypper runs). :-(
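
Since later zypper runs can replace the logs, copying them aside right after a failed "zypper dup" keeps them available for the new bug report. A minimal sketch; save_zypper_logs is a hypothetical helper name, and the default paths are an assumption (the usual openSUSE locations), so adjust them if your system differs:

```shell
#!/bin/sh
# Hypothetical helper: copy zypper logs into a directory of your choice.
# Usage: save_zypper_logs DEST [FILE...]
# Without FILE arguments it falls back to the assumed default log paths.
save_zypper_logs() {
    dest=$1
    shift
    [ "$#" -gt 0 ] || set -- /var/log/zypper.log /var/log/zypp/history
    mkdir -p "$dest" || return 1
    for f in "$@"; do
        # Skip files that do not exist or cannot be read.
        if [ -r "$f" ]; then
            cp -p "$f" "$dest/"
        fi
    done
    return 0
}
```

For example, save_zypper_logs "$HOME/zypper-logs-$(date +%Y%m%d)" run as root directly after the hang (or after the next boot, before running zypper again) would keep a dated copy to attach.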
Comment 17 Stefan Dirsch 2020-07-02 12:39:20 UTC
So let's close this one.
Comment 18 Axel Braun 2020-07-02 12:59:48 UTC
(In reply to Stefan Dirsch from comment #16)

> I'm afraid you need to report this to the zypper guys with zypper log, etc.
> They will tell you. I suggest to open a new bug once you can reproduce the
> issue with "zypper dup", because you won't find any longer the appropriate
> zypper logs meanwhile (overwritten by subsequent zypper runs). :-(

OK, will do. Thanks for your help!