Bugzilla – Bug 1227194
‘mount -t cifs’ then ‘ls /mnt’ causes “soft lockup - CPU#0 stuck”. Can’t kill -9 nor init s, etc
Last modified: 2024-07-05 19:42:29 UTC
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
Build Identifier:

Every time I do:

  USS-Liberty:~ # mount -t cifs //192.168.99.78/elements -o username=hattons /mnt/elements
  Password for hattons@//192.168.99.78/elements:

followed by ‘ls /mnt’, I get

  2024-06-27T03:48:38.093101-04:00 USS-Liberty kernel: [ 2664.161932][ C0] watchdog: BUG: soft lockup - CPU#0 stuck for 1643s! [ls:5764]

etc. I can’t ‘kill -9 5764’, ‘reboot’, ‘halt’, ‘init s’, etc. Other CPUs start to get stuck on other processes. I haven’t been using Linux for several years, but in the good old days, this wasn’t supposed to happen.

Reproducible: Always

Steps to Reproduce:
1. # mount -t cifs //192.168.99.78/<share> -o username=<user> /mnt/<mount point>
2. ‘ls /mnt’

Actual Results:
I get repeated messages of the form

  2024-06-27T03:48:38.093101-04:00 USS-Liberty kernel: [ 2664.161932][ C0] watchdog: BUG: soft lockup - CPU#0 stuck for 1643s! [ls:5764]

I can’t ‘kill -9 5764’, ‘reboot’, ‘halt’, ‘init s’, etc. Other CPUs start to get stuck on other processes.

Expected Results:
Expected either the remote share to be mounted and ls to show me the contents of /mnt, or an error indicating a failure of the mount and/or ls commands.

This is very deterministic. I have nothing to compare it to because it is my first Linux install in years, and I don't have a second machine to test with. There is, however, this: https://forums.opensuse.org/t/mount-t-cifs-then-ls-mnt-causes-soft-lockup-cpu-0-stuck-cant-kill-9-nor-init-s-etc/176343/3

Unfortunately, I did a reinstall since I last crashed the machine, so I don't have logs. If you can't reproduce the result, I will try again.
Created attachment 875778
Kernel output from /var/log/messages

Kernel output during cifs mount and ls /mnt
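For anyone collecting similar evidence, the watchdog reports can be pulled out of a log file with a small filter. This is only a sketch: it assumes the syslog line format quoted in this report, and the function names soft_lockups and stuck_tasks are made up for illustration.

```shell
# List watchdog soft-lockup reports found in a log file.
# Assumes the line format quoted in this bug report.
soft_lockups() {
    # $1: path to the log file, e.g. /var/log/messages
    grep -E 'watchdog: BUG: soft lockup - CPU#[0-9]+ stuck for [0-9]+s!' "$1"
}

# Print the unique "command:pid" pairs named in the trailing [command:pid] field.
stuck_tasks() {
    soft_lockups "$1" \
        | sed -E 's/.*\[([^][]+:[0-9]+)\][[:space:]]*$/\1/' \
        | sort -u
}
```

Running stuck_tasks against /var/log/messages (as root, if the file is not world-readable) lists each task the watchdog reported as stuck, which makes it easier to see whether the lockup spreads beyond the initial ls.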
I can reproduce this with kernel 6.4.0-150600.23.7-default and a Windows 11 share. The mount command gives no error. The system freezes after ls /mnt until a power cycle; see the attached log snippet. A cifs share from a FritzBox router works for me with no problems so far.
I confirm. Same here
(In reply to Rolf Wentland from comment #2)
> I can reproduce this with kernel 6.4.0-150600.23.7-default and a Windows 11
> share. Mount command gives no error. System is freezing after ls /mnt until
> power cycle see attached log snippet. A cifs share from a FritzBox router
> works for me with no problems so far.

Are you saying that the lockup only happens when you mount the share that's on Windows 11? I don't know FritzBox, but I'm assuming they use samba, or their own version of it. Do you have other servers you can try mounting cifs shares from?

@Jonas what kind of server are you mounting from?

Internal automated tests are done with Windows Server 2022 and samba shares; such a trivial task would have been caught easily. That is why I'm asking whether a different server yields different results with the cifs.ko client.
Never mind, I can reproduce it. Fix coming soon. Meanwhile, I used the 'nohandlecache' mount option and it didn't hang. Let me know what you get if you try it.
On my system the lockup occurs only with a Windows 11 23H2 machine. A FritzBox is a common SOHO router in Germany that provides a NAS option. The cifs mount works without problems with the router box. I do not know what software is used in the router. Mounting the Windows 11 share with the option 'nohandlecache' does solve the problem.
Thanks for confirming. I checked whether I was using 'nohandlecache' on my test systems, and I wasn't. So the problem seems to be in my build system, which still puzzles me, since I can't reproduce the bug with my self-built kernel. I suspect some kernel config option, but I will update the bug once I have the fix.
I don't know what the other side is - it's in a corporate environment, and my google-fu did not turn up a way to find that information from the client side. But I tried adding 'nohandlecache' to the mount options and it hasn't locked up since. If you can point me to a tutorial or some other guidance, I am happy to provide further info.
I've pushed the fix to our internal branches; Leap 15.6 should get it in the next maintenance update. If anyone wants to build and test the fixed kernel, please let me know so I can push it to a public branch. Meanwhile, use the 'nohandlecache' mount option as a workaround, which should have little to no side effects (aside from some performance loss).
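As a rough illustration of the workaround, the original mount command only needs 'nohandlecache' appended to its -o list. The helper below is a hypothetical sketch (cifs_mount_cmd is not an existing tool); it prints the command for review instead of running it, and the share, user and mount point are the ones from the initial report.

```shell
# Build the mount command with the 'nohandlecache' workaround appended.
# Hypothetical helper: prints the command rather than executing it.
cifs_mount_cmd() {
    # $1: UNC share, $2: username, $3: mount point
    printf 'mount -t cifs %s -o username=%s,nohandlecache %s\n' "$1" "$2" "$3"
}

# Example with the values from the original report; run the printed
# command as root once it looks right.
cifs_mount_cmd //192.168.99.78/elements hattons /mnt/elements
```

Printing first is a deliberate choice here: given that the unpatched kernel can hang hard enough to need a power cycle, it seems safer to double-check that the option is really present before mounting.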