Bug 1212257 - CIFS mount with DFS crashing the system
Summary: CIFS mount with DFS crashing the system
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-06-12 20:16 UTC by Luiz Angelo Daros de Luca
Modified: 2023-08-07 18:54 UTC
CC List: 6 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
kernel log (when kde clock was still ticking) (16.49 KB, text/x-log)
2023-06-13 17:07 UTC, Luiz Angelo Daros de Luca

Description Luiz Angelo Daros de Luca 2023-06-12 20:16:26 UTC
Hello,

After upgrading from 20230605-2330 to 20230610-2340, the machine hard-locks during KDE login, seconds after the splash screen appears. I tried both Wayland and X11 with identical results. IceWM works as expected.

There is no log nor crash, just a hard lock. Sysrq does not respond and I need to cold reboot the machine.

My VGA is:

00:02.0 VGA compatible controller: Intel Corporation AlderLake-S GT1 (rev 0c)
(machine Dell Precision 3660)

Some days ago, a similar issue at the same point happened with an HP Workstation with:

00:02.0 VGA compatible controller: Intel Corporation RocketLake-S GT1 [UHD Graphics P750] (rev 04)
(machine HP Z2 G8 Tower Workstation Desktop PC)

But that issue is now fixed. (Maybe that fix broke the Dell setup.)
Comment 1 Luiz Angelo Daros de Luca 2023-06-12 20:34:57 UTC
I tried disabling the compositor but it wasn't enough. I noticed that the crash happened during session restore (when KDE reopens apps). I disabled that from IceWM and could log in to the machine again. It looks like some interaction between KDE and some of the restored apps (Chrome, Firefox, Konsole, Kate, Blink VoIP) crashed the machine.
Comment 2 Luiz Angelo Daros de Luca 2023-06-12 21:16:01 UTC
Some more info:

After I managed to log in without freezing the machine, I worked with it for a couple of hours until the problem happened again. While I was typing some text and pressing space, something blocked and the space repeated indefinitely, as if the release of the key had been lost. The keyboard stopped responding (although sysrq did work). The KDE panel clock was still updating and the kernel was answering ping, but nothing else worked (ssh, for example).
Comment 3 Fabian Vogt 2023-06-13 06:33:07 UTC
> Sysrq does not respond and I need to cold reboot the machine.

-> kernel issue.

Is there something in the journal?
Comment 4 Luiz Angelo Daros de Luca 2023-06-13 17:07:26 UTC
Created attachment 867548
kernel log (when kde clock was still ticking)

Not for the hang during KDE login: userland I/O was dead, and the journal cuts sharply from "business as usual" to "Booting".

The other event, which happened after I successfully logged in to the machine, left some more clues (as even sysrq was working). I'm not a kernel expert, but the errors I see seem only indirectly related to the issue. I got some generic "watchdog: BUG: soft lockup" messages but nothing out of the ordinary before that. Something very low level was broken and the kernel could not even notice it.
Comment 5 Takashi Iwai 2023-06-14 11:18:56 UTC
Looks like some hang up in a CIFS work.
Adding relevant people to Cc.
Comment 6 Luiz Angelo Daros de Luca 2023-06-14 16:49:32 UTC
Some CIFS DFS fixes by Paulo Alcantara landed recently:

https://bugzilla.suse.com/show_bug.cgi?id=1210470

However, the test kernel 6.3.0-1.g527350e-default didn't show this problem before. Maybe something changed between that and the code that landed in the kernel.
Comment 7 Paulo Alcantara 2023-06-27 15:20:43 UTC
User reported this:

DFS namespace on samba:

dfs/a -> msdsf://
dfs/dir/1 -> msdsf://
dfs/dir/2 -> msdsf://
dfs/dir/3 -> msdsf://
dfs/dir/4 -> msdsf://
dfs/dir/5 -> msdsf://
dfs/dir/6 -> msdsf://
dfs/b -> msdsf://
dfs/c -> msdsf://

/etc/security/pam_mount.conf.xml:

<volume options="sec=krb5i,cruid=%(USERUID),uid=%(USERUID),noserverino,file_mode=0700,dir_mode=0700" server="server" path="rede" mountpoint="~/g"
fstype="cifs"/>
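
For reference, the `options` attribute above is a comma-separated list that ends up as CIFS mount options. A minimal Python sketch, assuming the fragment is taken as a standalone element (the `VOLUME` string and the `parse_volume` helper are illustrative and not part of pam_mount), shows how it breaks down:

```python
import xml.etree.ElementTree as ET

# The <volume> element from the report, as a standalone string (illustrative).
VOLUME = (
    '<volume options="sec=krb5i,cruid=%(USERUID),uid=%(USERUID),noserverino,'
    'file_mode=0700,dir_mode=0700" server="server" path="rede" '
    'mountpoint="~/g" fstype="cifs"/>'
)

def parse_volume(xml_text):
    """Return (attributes, options) for one pam_mount <volume> element."""
    elem = ET.fromstring(xml_text)
    options = {}
    for item in elem.attrib["options"].split(","):
        key, _, value = item.partition("=")
        # Flags such as noserverino carry no value; record them as True.
        options[key] = value or True
    return dict(elem.attrib), options

attrs, opts = parse_volume(VOLUME)
print(attrs["fstype"], attrs["mountpoint"], opts["sec"])
```

The `%(USERUID)` placeholders are left untouched here; pam_mount substitutes them at login.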

During or after login

[28870.567865] CIFS: Attempting to mount \\server\dfs
[28870.788120] CIFS: Attempting to mount \\server\dfs
[28870.846423] CIFS: Attempting to mount \\server\dfs
[28870.846458] BUG: kernel NULL pointer dereference, address: 0000000000000008
[28870.846461] #PF: supervisor write access in kernel mode
[28870.846463] #PF: error_code(0x0002) - not-present page
[28870.846465] PGD 0 P4D 0 
[28870.846468] Oops: 0002 [#1] PREEMPT SMP NOPTI
[28870.846471] CPU: 5 PID: 12130 Comm: GlobalQueue[05] Kdump: loaded Tainted: G           O       6.3.7-1-default #1 openSUSE Tumbleweed a577eae57964bb7e83477b5a5645a1781df990f0
[28870.846476] Hardware name: Dell Inc. Precision 3660/01GN0N, BIOS 1.7.1 10/28/2022
[28870.846477] RIP: 0010:__cifs_put_smb_ses+0xd3/0x3f0 [cifs]
[28870.846577] Code: 74 0d e8 60 ff 01 00 48 c7 45 30 00 00 00 00 48 c7 c7 48 87 19 c2 e8 fc 5f 6a f1 48 8b 45 08 48 8b 55 00 48 c7 c7 48 87 19 c2 <48> 89 42 08 48 89 10 48 89 6d 00 48 89 6d 08 e8 99 60 6a f1 48 8b
[28870.846580] RSP: 0018:ffffafd3c8297598 EFLAGS: 00010246
[28870.846582] RAX: 0000000000000000 RBX: ffff8be2cefe4838 RCX: 0000000000000000
[28870.846584] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffc2198748
[28870.846586] RBP: ffff8be2cefe4800 R08: 000000000000001e R09: 0000000000000000
[28870.846587] R10: 0000000000000000 R11: ffff8bea4fabdfe8 R12: ffff8be2cefe4800
[28870.846588] R13: 0000000000000000 R14: ffff8be2cefe3020 R15: 0000000000000000
[28870.846590] FS:  00007fcdee7fc6c0(0000) GS:ffff8be7aaa80000(0000) knlGS:0000000000000000
[28870.846592] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[28870.846593] CR2: 0000000000000008 CR3: 000000078cc86000 CR4: 0000000000f50ee0
[28870.846595] PKRU: 55555554
[28870.846596] Call Trace:
[28870.846599]  <TASK>
[28870.846602]  ? __die+0x23/0x70
[28870.846607]  ? page_fault_oops+0x14d/0x490
[28870.846611]  ? request_key_and_link+0xc6/0x810
[28870.846616]  ? exc_page_fault+0x6e/0x150
[28870.846621]  ? asm_exc_page_fault+0x26/0x30
[28870.846626]  ? __cifs_put_smb_ses+0xd3/0x3f0 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.846710]  ? __cifs_put_smb_ses+0xc4/0x3f0 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.846792]  cifs_get_tcon+0x94/0xba0 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.846875]  cifs_mount_get_tcon+0x5f/0x290 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.846958]  dfs_mount_share+0x35a/0x9b0 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847062]  cifs_mount+0x79/0x300 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847151]  cifs_smb3_do_mount+0x10b/0x720 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847238]  smb3_get_tree+0xce/0x280 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847342]  vfs_get_tree+0x26/0xd0
[28870.847346]  ? smb3_parse_devname+0x120/0x170 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847446]  fc_mount+0x12/0x40
[28870.847451]  cifs_dfs_do_automount.isra.0+0x250/0x2d0 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847555]  cifs_dfs_d_automount+0x24/0x150 [cifs 678cdfd8aedeec26422762522f3eb173d76adbbf]
[28870.847649]  __traverse_mounts+0x8c/0x210
[28870.847654]  step_into+0x33a/0x760
[28870.847657]  ? lookup_fast+0x75/0xf0
[28870.847659]  link_path_walk.part.0.constprop.0+0x240/0x380
[28870.847662]  ? path_init+0x28a/0x3c0
[28870.847665]  path_lookupat+0x3e/0x1a0
[28870.847668]  ? try_to_unlazy+0x5a/0xc0
[28870.847670]  filename_lookup+0xd4/0x1d0
[28870.847674]  ? _copy_to_user+0x25/0x30
[28870.847677]  ? cp_statx+0x191/0x1d0
[28870.847682]  vfs_statx+0x8c/0x160
[28870.847686]  do_statx+0x45/0x80
[28870.847689]  ? __check_object_size+0x233/0x2b0
[28870.847692]  ? strncpy_from_user+0x43/0x140
[28870.847697]  ? getname_flags.part.0+0x4b/0x1c0
[28870.847700]  __x64_sys_statx+0x66/0x80
[28870.847702]  do_syscall_64+0x5d/0x90
[28870.847706]  ? __x64_sys_statx+0x70/0x80
[28870.847709]  ? syscall_exit_to_user_mode+0x1b/0x40
[28870.847713]  ? do_syscall_64+0x6c/0x90
[28870.847716]  ? exc_page_fault+0x6e/0x150
[28870.847720]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[28870.847722] RIP: 0033:0x7fcf2d7057de
[28870.847771] Code: a5 fd ff ff e8 e1 3f 02 00 0f 1f 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 90 90 41 89 ca b8 4c 01 00 00 0f 05 <48> 3d 00 f0 ff ff 77 2a 89 c1 85 c0 74 0f 48 8b 05 f5 45 0e 00 64
[28870.847773] RSP: 002b:00007fcdee7fb9d8 EFLAGS: 00000206 ORIG_RAX: 000000000000014c
[28870.847776] RAX: ffffffffffffffda RBX: 00007fcdd4000e70 RCX: 00007fcf2d7057de
[28870.847778] RDX: 0000000000000900 RSI: 00007fcdd4001388 RDI: 00000000ffffff9c
[28870.847779] RBP: 0000000000510000 R08: 00007fcdee7fba30 R09: 00007fcdd40013e5
[28870.847781] R10: 0000000000000fff R11: 0000000000000206 R12: 00007fcdd4000e50
[28870.847782] R13: 0000000000510000 R14: 00007fcdee7fba10 R15: 0000000000510000
[28870.847785]  </TASK>
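
As a reading aid for the oops above, the faulting instruction can be decoded by hand from the bytes at the `<48>` marker in the kernel `Code:` line. The following Python sketch handles only that single encoding and uses register values copied from the dump; the helper name `decode_mov_store` is made up for illustration:

```python
# Registers and CR2 copied from the first oops above.
REGS = {"rax": 0x0, "rdx": 0x0, "rbp": 0xFFFF8BE2CEFE4800}
CR2 = 0x8

# 64-bit general-purpose registers in ModRM encoding order.
REG_NAMES = ["rax", "rcx", "rdx", "rbx", "rsp", "rbp", "rsi", "rdi"]

def decode_mov_store(code):
    """Decode `REX.W 89 /r` with an 8-bit displacement: mov %reg, disp8(%base).

    Only this one encoding is handled; this is not a general x86 decoder.
    """
    rex, opcode, modrm, disp = code
    assert rex == 0x48 and opcode == 0x89, "not a 64-bit register store"
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    assert mod == 1 and rm != 4, "expected disp8 addressing without SIB"
    return REG_NAMES[reg], REG_NAMES[rm], disp

# Bytes starting at the <48> marker in the kernel "Code:" line.
src, base, disp = decode_mov_store(bytes.fromhex("48894208"))
fault_addr = REGS[base] + disp
print(f"mov %{src},{disp:#x}(%{base}) writes to {fault_addr:#x}")
```

The computed address matches CR2 (0x8) with RDX being zero. The neighbouring bytes (loads from 0x8(%rbp) and 0x0(%rbp) before the marker, stores of %rbp to both slots after it) resemble an inlined list_del_init() on an entry whose next and prev pointers were both NULL. That is an interpretation of the dump, not something confirmed in this report.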

[ 2917.585595] CIFS: Attempting to mount \\server\dfs
[ 2917.598368] BUG: unable to handle page fault for address: ffffffffffffffff
[ 2917.598383] #PF: supervisor write access in kernel mode
[ 2917.598391] #PF: error_code(0x0002) - not-present page
[ 2917.598398] PGD 40223b067 P4D 40223b067 PUD 40223d067 PMD 0 
[ 2917.598414] Oops: 0002 [#2] PREEMPT SMP PTI
[ 2917.598424] CPU: 3 PID: 14461 Comm: bash Tainted: G      D W  O       6.3.6-1-default #1 openSUSE Tumbleweed d92ec5864371d7852882cd4aa0a220829340020d
[ 2917.598437] Hardware name: HP HP EliteDesk 800 G5 Desktop Mini/8595, BIOS R21 Ver. 02.07.01 10/19/2020
[ 2917.598443] RIP: 0010:_raw_spin_lock+0x17/0x30
[ 2917.598460] Code: 00 66 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 65 ff 05 78 36 d8 61 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 05 c3 cc cc cc cc 89 c6 e8 97 01 00 00 90 c3 cc cc
[ 2917.598469] RSP: 0018:ffffa0a1d05275a8 EFLAGS: 00010246
[ 2917.598478] RAX: 0000000000000000 RBX: ffff909888192038 RCX: 0000000000000000
[ 2917.598485] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffffffffffffff
[ 2917.598492] RBP: ffffffffffffffff R08: 000000000000001e R09: 0000000000000000
[ 2917.598499] R10: 0000000000000001 R11: 0000000000000072 R12: ffffffffa0e07e80
[ 2917.598505] R13: 0000000000000101 R14: ffffa0a1d05275b0 R15: 0000000000000000
[ 2917.598512] FS:  00007f29b5451540(0000) GS:ffff909b67780000(0000) knlGS:0000000000000000
[ 2917.598521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2917.598528] CR2: ffffffffffffffff CR3: 00000003d10a0002 CR4: 00000000003706e0
[ 2917.598535] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2917.598541] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 2917.598547] Call Trace:
[ 2917.598554]  <TASK>
[ 2917.598562]  ? __die+0x23/0x70
[ 2917.598576]  ? page_fault_oops+0x14d/0x490
[ 2917.598589]  ? fixup_exception+0x26/0x370
[ 2917.598607]  ? exc_page_fault+0x14b/0x150
[ 2917.598621]  ? asm_exc_page_fault+0x26/0x30
[ 2917.598636]  ? _raw_spin_lock+0x17/0x30
[ 2917.598648]  ? __kmalloc_node_track_caller+0x4e/0x150
[ 2917.598663]  free_cached_dirs+0x39/0x100 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.599166]  tconInfoFree+0x29/0x120 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.599645]  __cifs_put_smb_ses+0xb0/0x3f0 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.600104]  cifs_get_tcon+0x94/0xba0 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.600563]  cifs_mount_get_tcon+0x5f/0x290 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.601023]  dfs_mount_share+0x35a/0x9b0 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.601522]  cifs_mount+0x79/0x300 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.601986]  cifs_smb3_do_mount+0x10b/0x720 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.602443]  smb3_get_tree+0xce/0x280 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.602933]  vfs_get_tree+0x26/0xd0
[ 2917.602945]  ? smb3_parse_devname+0x120/0x170 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.603398]  fc_mount+0x12/0x40
[ 2917.603411]  cifs_dfs_do_automount.isra.0+0x250/0x2d0 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.603895]  cifs_dfs_d_automount+0x24/0x150 [cifs 85eb8ce5b086c416d3fea95341e62ded2b5d0765]
[ 2917.604345]  __traverse_mounts+0x8c/0x210
[ 2917.604363]  step_into+0x33a/0x760
[ 2917.604376]  path_openat+0x13a/0x1120
[ 2917.604390]  ? do_set_pte+0x188/0x230
[ 2917.604403]  do_filp_open+0xb8/0x160
[ 2917.604420]  ? __check_object_size+0x233/0x2b0
[ 2917.604437]  do_sys_openat2+0x95/0x150
[ 2917.604450]  __x64_sys_openat+0x57/0xa0
[ 2917.604461]  do_syscall_64+0x5d/0x90
[ 2917.604473]  ? handle_mm_fault+0x11e/0x310
[ 2917.604486]  ? do_user_addr_fault+0x1e0/0x720
[ 2917.604498]  ? syscall_exit_to_user_mode+0x1b/0x40
[ 2917.604513]  ? exc_page_fault+0x6e/0x150
[ 2917.604526]  entry_SYSCALL_64_after_hwframe+0x72/0xdc
[ 2917.604536] RIP: 0033:0x7f29b558f3b2
[ 2917.604598] Code: 00 48 89 44 24 18 31 c0 41 83 e2 40 75 3a 89 f0 f7 d0 a9 00 00 41 00 74 2f 89 f2 b8 01 01 00 00 48 89 fe bf 9c ff ff ff 0f 05 <48> 3d 00 f0 ff ff 77 3e 48 8b 54 24 18 64 48 2b 14 25 28 00 00 00
[ 2917.604606] RSP: 002b:00007fff511d9c50 EFLAGS: 00000206 ORIG_RAX: 0000000000000101
[ 2917.604616] RAX: ffffffffffffffda RBX: 000055f82a59a690 RCX: 00007f29b558f3b2
[ 2917.604623] RDX: 0000000000090800 RSI: 000055f82a59a690 RDI: 00000000ffffff9c
[ 2917.604629] RBP: 000055f82a584c50 R08: 000055f82a59fbd0 R09: 00007f29b566ec80
[ 2917.604635] R10: 0000000000000000 R11: 0000000000000206 R12: 000055f828a7e5f4
[ 2917.604641] R13: 00007f29b56a9250 R14: 0000000000000000 R15: 000000000000000a
[ 2917.604655]  </TASK>
[ 2917.604660] Modules linked in: netlink_diag rfcomm snd_seq_dummy snd_hrtimer snd_seq snd_seq_device nls_utf8 cifs cifs_arc4 cifs_md4 dns_resolver fscache netfs ccm cmac algif_hash algif_skcipher af_alg nft_limit nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib af_packet wireguard nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject libchacha20poly1305 chacha_x86_64 poly1305_x86_64 curve25519_x86_64 libcurve25519_generic libchacha ip6_udp_tunnel udp_tunnel nf_log_syslog nft_log nft_ct nft_chain_nat joydev nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security hid_generic ch341 usbserial usbhid ip_set nfnetlink ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bpfilter bnep btusb btrtl btbcm btintel btmtk bluetooth sr9700 dm9601 usbnet mii ecdh_generic qrtr snd_sof_pci_intel_cnl snd_sof_intel_hda_common soundwire_intel soundwire_generic_allocation

With debug enabled, the problem doesn't occur. I will look into the code and try to reproduce it.
Comment 8 Luiz Angelo Daros de Luca 2023-06-29 00:49:03 UTC
I renamed the bug to match the real issue. It is a cifs issue with DFS shares.

It does not happen only during login. The system can crash at any time while the CIFS share with DFS is mounted. It happens more frequently during login because restoring the KDE session might access multiple files from that share in parallel.

All coworkers who use Tumbleweed are experiencing similar issues. The system crashes or freezes at unexpected moments. If the share is not mounted, the system is rock solid. We switched to userland CIFS access (kio-cifs or smbclient) to work around the issue.

I got a kdump but crash couldn't open it (it looks like this bug https://www.spinics.net/linux/fedora/redhat-crash-utility/msg10076.html). Anyway, I don't know how useful it would be, as the bug seems to corrupt nearby memory regions. For example, I once got an XFS crash instead of the typical CIFS crash.

All affected systems have plenty of cores (20) and fast M.2 storage. This issue might need more parallelism to appear, and it might not manifest on slower machines with fewer cores (like VMs) or when debug messages are enabled. There are just a few Tumbleweed users and none of them use slower machines. No Leap user has reported this issue.
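
To make the parallelism idea concrete, here is a hypothetical stress sketch in Python. It issues concurrent stat() calls against the DFS link names listed in comment 7, which is the kind of access that triggers the automount path seen in the traces. It is not a confirmed reproducer, the function names are made up, and on an affected kernel it could crash the machine:

```python
import os
from concurrent.futures import ThreadPoolExecutor

# DFS link names taken from the namespace listing in comment 7.
LINKS = ["a", "dir/1", "dir/2", "dir/3", "dir/4", "dir/5", "dir/6", "b", "c"]

def stat_one(path):
    """stat() one path; return (path, error-or-None)."""
    try:
        os.stat(path)
        return path, None
    except OSError as exc:
        return path, exc

def stat_in_parallel(root, links=LINKS, workers=20, rounds=1):
    """Issue stat() on every link under root concurrently, `rounds` times."""
    paths = [os.path.join(root, link) for link in links] * rounds
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps results in the same order as `paths`.
        return list(pool.map(stat_one, paths))
```

A call such as `stat_in_parallel(os.path.expanduser("~/g"), rounds=50)` would exercise the mount point from the pam_mount entry; the worker count of 20 simply mirrors the core count mentioned above.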
Comment 9 Fabian Vogt 2023-06-29 06:24:10 UTC
(In reply to Luiz Angelo Daros de Luca from comment #8)
> I got a kdump but crash couldn't open it (it looks like this bug
> https://www.spinics.net/linux/fedora/redhat-crash-utility/msg10076.html).

It's bug 1190434. You can build the crash package with crash-debuginfo-compressed.patch removed and it'll work.
Comment 10 Paulo Alcantara 2023-07-05 01:11:50 UTC
Hi Luiz,

Please try kernel from [1] with potential fixes.

If it still doesn't work, then try booting the kernel with
'slub_debug=FPUZ' to see if that helps.  Thanks.

[1] https://build.opensuse.org/package/show/home:pauloac:kernel-bsc1212257/kernel-default
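
For anyone following along, the letters in `slub_debug=FPUZ` each select one SLUB debugging feature. A small Python sketch, assuming the option is read from a kernel command line string (the example command line below is made up, and `slub_debug_flags` is an illustrative helper), can confirm what a booted system was started with:

```python
# Meaning of each slub_debug letter, per the kernel's SLUB documentation.
SLUB_FLAGS = {
    "F": "sanity checks",
    "P": "poisoning",
    "U": "user tracking (alloc/free call sites)",
    "Z": "red zoning",
}

def slub_debug_flags(cmdline):
    """Return the slub_debug letters on a kernel command line, or None."""
    for token in cmdline.split():
        if token == "slub_debug":
            return ""  # bare option: the kernel's default debug set
        if token.startswith("slub_debug="):
            # Letters come first; an optional ",<slab names>" suffix may follow.
            return token.split("=", 1)[1].split(",", 1)[0]
    return None

def read_cmdline(path="/proc/cmdline"):
    """Read the command line of the running kernel."""
    with open(path) as fh:
        return fh.read()

# Example with a made-up command line (not taken from the report).
flags = slub_debug_flags("BOOT_IMAGE=/boot/vmlinuz root=/dev/sda2 slub_debug=FPUZ quiet")
for letter in flags:
    print(letter, "->", SLUB_FLAGS.get(letter, "unknown"))
```

On a live system, `slub_debug_flags(read_cmdline())` returns `None` when the option was not passed at boot.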
Comment 11 Paulo Alcantara 2023-07-07 15:35:26 UTC
Hi Luiz,

Were you able to test it?
Comment 12 Luiz Angelo Daros de Luca 2023-07-07 16:36:07 UTC
(In reply to Paulo Alcantara from comment #11)
> Hi Luiz,
> 
> Were you able to test it?

Yes, we installed it on our Tumbleweed workstations. No crashes so far.
Comment 13 Paulo Alcantara 2023-08-07 18:21:03 UTC
Hi Luiz,

The fixes have been released in kernel-default-6.4.8-1.  Please upgrade your kernel.

Feel free to re-open it if you're still observing these crashes with the latest Tumbleweed kernel.
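
A quick way to check whether a machine already runs a kernel at or past that release is to compare version strings. This Python sketch assumes the release string has the `X.Y.Z-N-flavor` shape seen in the oops output; `parse_release` and `has_fix` are illustrative names, and a plain version comparison says nothing about kernels that carry the fixes as backports:

```python
import re

FIXED = (6, 4, 8, 1)  # kernel-default-6.4.8-1, per comment 13

def parse_release(release):
    """Turn e.g. '6.3.7-1-default' into (6, 3, 7, 1)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())

def has_fix(release):
    """True if the release is at or after the fixed kernel version."""
    return parse_release(release) >= FIXED

print(has_fix("6.3.7-1-default"))  # kernel from the first oops
print(has_fix("6.4.8-1-default"))  # kernel named in comment 13
```

For the running kernel, pass `platform.release()` from the standard library as the argument.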
Comment 14 Luiz Angelo Daros de Luca 2023-08-07 18:54:20 UTC
Thanks, Paulo!