Bug 1227194 - ‘mount -t cifs’ then ‘ls /mnt’ causes “soft lockup - CPU#0 stuck”. Can’t kill -9 nor init s, etc
Status: NEW
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: Kernel
Version: Leap 15.6
Hardware: Other Other
Importance: P5 - None : Critical
Target Milestone: ---
Assignee: Enzo Matsumiya
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-29 01:15 UTC by Steven Hatton
Modified: 2024-07-05 19:42 UTC

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
ematsumiya: needinfo? (R.Wentland)
ematsumiya: needinfo? (gross.jonas)
ematsumiya: needinfo? (hattons)


Attachments
Kernel output from /var/log/messages (20.38 KB, text/plain)
2024-06-29 13:32 UTC, Rolf Wentland

Description Steven Hatton 2024-06-29 01:15:40 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0
Build Identifier: 

Every time I do:

USS-Liberty:~ # mount -t cifs //192.168.99.78/elements -o username=hattons /mnt/elements
Password for hattons@//192.168.99.78/elements:

followed by ‘ls /mnt’, I get

2024-06-27T03:48:38.093101-04:00 USS-Liberty kernel: [ 2664.161932][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 1643s! [ls:5764]

etc.

I can’t ‘kill -9 5764’, ‘reboot’, ‘halt’, ‘init s’, etc. Other CPUs start to get stuck on other processes. I haven’t been using Linux for several years, but in the good old days, this wasn’t supposed to happen.

Reproducible: Always

Steps to Reproduce:
1. # mount -t cifs //192.168.99.78/<share> -o username=<user> /mnt/<mount point>
2. ‘ls /mnt’
Actual Results:  
I get repeated messages of the form

2024-06-27T03:48:38.093101-04:00 USS-Liberty kernel: [ 2664.161932][    C0] watchdog: BUG: soft lockup - CPU#0 stuck for 1643s! [ls:5764]

I can’t ‘kill -9 5764’, ‘reboot’, ‘halt’, ‘init s’, etc. Other CPUs start to get stuck on other processes.

Expected Results:  
I expected either the remote share to be mounted and ls to show me the contents of /mnt, or an error indicating that the mount and/or ls command had failed.

This is very deterministic.  I have nothing to compare it to because it is my first Linux install in years, and I don't have a second machine to test with.  There is, however, this:

https://forums.opensuse.org/t/mount-t-cifs-then-ls-mnt-causes-soft-lockup-cpu-0-stuck-cant-kill-9-nor-init-s-etc/176343/3

Unfortunately, I did a reinstall since I last crashed the machine, so I don't have logs.  If you can't reproduce the result, I will try again.
Comment 1 Rolf Wentland 2024-06-29 13:32:34 UTC
Created attachment 875778 [details]
Kernel output from /var/log/messages

Kernel output during cifs mount and ls /mnt
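
For anyone else collecting the same evidence, here is a minimal sketch for pulling the watchdog lines out of a saved log. The function name `lockup_lines` is hypothetical; the default path /var/log/messages is the one the attachment above was taken from, and it is an assumption that your system logs kernel messages there.

```shell
# Hypothetical helper: print the soft-lockup watchdog lines from a log file.
# Defaults to /var/log/messages; pass another path as the first argument.
lockup_lines() {
    grep -F 'watchdog: BUG: soft lockup' "${1:-/var/log/messages}"
}
```

Called as `lockup_lines | tail -n 20`, it would show the most recent reports after a reboot, provided the log survived the power cycle.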
Comment 2 Rolf Wentland 2024-06-29 13:37:03 UTC
I can reproduce this with kernel 6.4.0-150600.23.7-default and a Windows 11 share. The mount command gives no error. The system freezes after ls /mnt until a power cycle; see the attached log snippet. A cifs share from a FritzBox router has worked for me with no problems so far.
Comment 4 Jonas Groß 2024-07-05 13:57:59 UTC
I can confirm. Same here.
Comment 5 Enzo Matsumiya 2024-07-05 15:20:01 UTC
(In reply to Rolf Wentland from comment #2)
> I can reproduce this with kernel 6.4.0-150600.23.7-default and a Windows 11
> share. The mount command gives no error. The system freezes after ls /mnt
> until a power cycle; see the attached log snippet. A cifs share from a
> FritzBox router has worked for me with no problems so far.

Are you saying that the lockup only happens when you mount the share that's on Windows 11?

I don't know FritzBox, but I'm assuming they use samba, or their own version of it.

Do you have other servers you can try mounting cifs shares from?

@Jonas what kind of server are you mounting from?


Internal automated tests are done with Windows Server 2022 and samba shares, so a failure in such a trivial task would've been caught easily. That is why I'm asking whether a different server yields different results with the cifs.ko client.
Comment 6 Enzo Matsumiya 2024-07-05 15:29:47 UTC
Never mind, I can reproduce it.

Fix coming soon.

Meanwhile, I used the 'nohandlecache' mount option and it didn't hang.  Let me know what you get if you try it.
Comment 7 Rolf Wentland 2024-07-05 16:00:26 UTC
On my system the lockup occurs only with a Windows 11 23H2 machine. A FritzBox is a common SOHO router in Germany that provides a NAS option. Mounting a cifs share from the router works without problems. I do not know what software the router uses. Mounting the Windows 11 share with the option 'nohandlecache' does solve the problem.
Comment 8 Enzo Matsumiya 2024-07-05 16:04:46 UTC
Thanks for confirming.

I checked if I was using 'nohandlecache' on my test systems and I wasn't.

So the problem is in my build system, which still puzzles me, as I can't reproduce the bug with my self-built kernel.

I suspect some kernel config option, but I will update the bug once I have the fix.
Comment 9 Jonas Groß 2024-07-05 16:07:12 UTC
I don't know what the other side is running. It's in a corporate environment, and my Google-fu wasn't enough to find out how to get that information from the client side.
But I tried adding 'nohandlecache' to the mount options and it hasn't locked up since.

If you can point me to a tutorial or some other guidance, I am happy to provide further info.
Comment 10 Enzo Matsumiya 2024-07-05 19:42:29 UTC
I've pushed the fix to our internal branches; Leap 15.6 should get it in the next maintenance update.

If anyone wants to build and test the fixed kernel, please let me know so I can push it to a public branch.

Meanwhile, use the 'nohandlecache' mount option as a workaround, which should have little to no side effects (aside from some performance loss).
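
A minimal sketch of the workaround, reusing the share, user and mount point from the original report. The `cifs_without_workaround` helper is hypothetical, and it assumes the kernel lists `nohandlecache` among the options it reports for a CIFS mount in /proc/mounts; if your kernel does not show it there, the check will not be reliable.

```shell
# Workaround mount (run as root; share, user and mount point are from the report):
#   mount -t cifs //192.168.99.78/elements -o username=hattons,nohandlecache /mnt/elements

# Hypothetical helper: list CIFS mount points whose options lack nohandlecache.
# Reads /proc/mounts by default; fields are: device mountpoint fstype options ...
cifs_without_workaround() {
    awk '$3 == "cifs" && $4 !~ /(^|,)nohandlecache(,|$)/ { print $2 }' "${1:-/proc/mounts}"
}
```

Running `cifs_without_workaround` with no argument after mounting should print nothing if every CIFS mount carries the option. Do not run `ls` on a mount point it lists until that share has been remounted with the option or the fixed kernel is installed.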