Bug 1182929 - kernel fuse crash in heavy use with cvmfs
kernel fuse crash in heavy use with cvmfs
Status: NEW
Classification: openSUSE
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Current
x86-64 openSUSE Tumbleweed
: P5 - None : Normal (vote)
: ---
Assigned To: Luis Henriques
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2021-03-02 14:22 UTC by Stefan Kluth
Modified: 2021-04-05 09:04 UTC (History)
10 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
debug patch with VM_BUG_ON_PAGE in invalidate_inode_pages2_range() (787 bytes, patch)
2021-03-15 14:27 UTC, Luis Henriques
Details | Diff
mm/truncate: Make invalidate_inode_pages2_range work with THPs (1.95 KB, patch)
2021-03-18 17:31 UTC, Luis Henriques
Details | Diff

Note You need to log in before you can comment on or make changes to this bug.
Description Stefan Kluth 2021-03-02 14:22:05 UTC
We use cvmfs to access headers and libraries during a build of our experiment softwar (ATLAS athena). This is a big project and a typical build produces > 100 GB or output. cvmfs is a caching layer and connect via fuse. With the latest tumbleweed kernels we observe after ca. 50% of the build the following:

[97604.721613] CPU: 18 PID: 1584438 Comm: g++ Tainted: P           O 
 5.10.16-1-default #1 openSUSE Tumbleweed
[97604.721618] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a
10/16/2019
[97604.721631] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0
[97604.721637] Code: 0f 48 f0 e9 19 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00
48 89 ee e8 1a c5 02 00 4c 89 ff e8 02 1b 01 00 84 c0 0f 84 ca fe ff ff <0f>
0b 49 8b 57 18 49 39 d4 0f 85 e2 fe ff ff 49 f7 07 00 60 00 00
[97604.721645] RSP: 0018:ffffa613aa54ba40 EFLAGS: 00010202
[97604.721651] RAX: 0000000000000001 RBX: 000000000000000a RCX:
0000000000000200
[97604.721656] RDX: 0000000000000090 RSI: 00affff800010037 RDI:
ffffd880718e0000
[97604.721660] RBP: 0000000000001400 R08: 0000000000001400 R09:
0000000000001a73
[97604.721664] R10: 0000000000000000 R11: 0000000004a684da R12:
ffff8a28d4549d78
[97604.721669] R13: ffffffffffffffff R14: 0000000000000000 R15:
ffffd880718e0000
[97604.721674] FS:  00007f9cdd7fb740(0000) GS:ffff8a5c7f980000(0000)
knlGS:0000000000000000
[97604.721679] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[97604.721683] CR2: 00007f89d3d78d80 CR3: 0000004d8a14e005 CR4:
00000000007706e0
[97604.721688] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[97604.721692] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
97604.721696] PKRU: 55555554
[97604.721699] Call Trace:
[97604.721719]  ? request_wait_answer+0x11a/0x210 [fuse]
[97604.721729]  ? fuse_dentry_delete+0xb/0x20 [fuse]
[97604.721740]  fuse_finish_open+0x85/0x150 [fuse]
[97604.721750]  fuse_open_common+0x1a8/0x1b0 [fuse]
[97604.721759]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
[97604.721766]  do_dentry_open+0x14e/0x380
[97604.721775]  path_openat+0x600/0x10d0
[97604.721782]  ? handle_mm_fault+0x103c/0x1a00
[97604.721791]  ? follow_page_pte+0x314/0x5f0
[97604.721795]  do_filp_open+0x88/0x130
[97604.721803]  ? security_prepare_creds+0x6d/0x90
[97604.721808]  ? __kmalloc+0x11d/0x2a0
[97604.721814]  do_open_execat+0x6d/0x1a0
[97604.721819]  bprm_execve+0x190/0x6b0
[97604.721825]  do_execveat_common+0x192/0x1c0
[97604.721830]  __x64_sys_execve+0x39/0x50
[97604.721836]  do_syscall_64+0x33/0x80
[97604.721843]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[97604.721848] RIP: 0033:0x7f9cdcfe2c37
[97604.721853] Code: ff ff 76 df 89 c6 f7 de 64 41 89 32 eb d5 89 c6 f7 de
64 41 89 32 eb db 66 2e 0f 1f 84 00 00 00 00 00 90 b8 3b 00 00 00 0f 05 <48>
3d 00 f0 ff ff 77 02 f3 c3 48 8b 15 08 12 30 00 f7 d8 64 89 02
[97604.721862] RSP: 002b:00007ffe444f5758 EFLAGS: 00000202 ORIG_RAX:
000000000000003b
[97604.721867] RAX: ffffffffffffffda RBX: 00007f9cdd7fb6a0 RCX:
00007f9cdcfe2c37
[97604.721872] RDX: 00000000020f5300 RSI: 00000000020f3bf8 RDI:
00000000020f36a0
[97604.721876] RBP: 0000000000000001 R08: 0000000000000000 R09:
0000000000000000
[97604.721880] R10: 00007ffe444f4b60 R11: 0000000000000202 R12:
0000000000000000
[97604.721884] R13: 0000000000000001 R14: 00000000020f36a0 R15:
0000000000000000
[97604.721890] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver
nfsv3 nfs fscache libafs(PO) iscsi_ibft iscsi_boot_sysfs rfkill
vboxnetadp(O) vboxnetflt(O) vboxdrv(O) dmi_sysfs intel_rapl_msr
intel_rapl_common isst_if_common joydev ipmi_ssif i40iw ib_uverbs iTCO_wdt
intel_pmc_bxt ib_core hid_generic iTCO_vendor_support skx_edac nfit
libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel acpi_ipmi
usbhid kvm i40e ipmi_si ioatdma mei_me i2c_i801 irqbypass ipmi_devintf mei
i2c_smbus lpc_ich dca efi_pstore pcspkr ipmi_msghandler tiny_power_button
acpi_pad button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl lockd
auth_rpcgss grace sunrpc fuse configfs nfs_ssc ast i2c_algo_bit
drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops
cec rc_core drm_ttm_helper xhci_pci ttm xhci_pci_renesas xhci_hcd
crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel
drm glue_helper crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp
llc
[97604.721991]  dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua
msr efivarfs
[97604.722031] ---[ end trace edcabaccd35272e2 ]---
[97604.727773] RIP: 0010:invalidate_inode_pages2_range+0x366/0x4e0

After this the fuse layer is stuck and the machine needs a reboot.

In order to confirm the problem we installed from SLE 15

skluth@thserv03:~$ uname -a
Linux thserv03 5.3.18-183.g761c100-default #1 SMP Sun Feb 28 15:36:45 UTC 2021 (761c100) x86_64 x86_64 x86_64 GNU/Linux

and with this kernel everything works fine.
Comment 1 Takashi Iwai 2021-03-03 11:59:58 UTC
Is this a recent regression in 5.10.x, or did you see this in the earlier TW kernels, too?

If the problem is reproducible and it's a regression, it'd be great if you can narrow down which kernel version starts showing the problem.
Comment 2 Stefan Kluth 2021-03-04 07:41:32 UTC
We have reproduced the problem many times when we changed versions of cvmfs, or did it on different servers to exclude hardware problems.

Is there a good and reproduceable way to get the older kernels from opensuse as packages?

I did look on OBS but there are only some private projects with kernels.

Are you perhaps aware of recent (in 5.x kernels) modifications to the fuse kernel module? Their introduction into the the 5.x series might give a hint which kernels to try. Otherwise its at best a binary search between 5.3 and 5.10 and every test takes several hours. That would be the brute force approach.
Comment 3 Takashi Iwai 2021-03-04 07:59:18 UTC
I have a collection of kernel packages in OBS home:tiwai:kernel:* project, e.g. home:tiwai:kernel:5.9 contains the last 5.9.x kernel package, home:tiwai:kernel:5.8 for 5.8.x, etc.  Note that it's signed with an unofficial key, so Secure Boot won't work.

We haven't change anything about FUSE in TW packages, AFAIK, so if any, it's an upstream issue.
Comment 4 Stefan Kluth 2021-03-05 11:15:20 UTC
with one of your kernel packages

skluth@thserv04:~$ uname -a
Linux thserv04 5.4.14-1.gfc4ea7a-default #1 SMP Thu Jan 23 08:54:47 UTC 2020 (fc4ea7a) x86_64 x86_64 x86_64 GNU/Linux

we get a spontaneous reboot (!) after the build was at 99%, so having run for > 3 h. The kernel bug showed up much earlier, but the build continued until the reboot. 

So the problem is already present at 5.4.14. We tried again with the SLE 15 kernel

skluth@thserv03:~$ uname -a
Linux thserv03 5.3.18-183.g761c100-default #1 SMP Sun Feb 28 15:36:45 UTC 2021 (761c100) x86_64 x86_64 x86_64 GNU/Linux

and that works fine. 

Here is the bug report in messages:

2021-03-04T14:51:53.623340+01:00 thserv04 kernel: [17926.981555] BUG: unable to handle page fault for address: 0000138800c9e2cf
2021-03-04T14:51:53.623366+01:00 thserv04 kernel: [17926.981568] #PF: supervisor read access in kernel mode
2021-03-04T14:51:53.623370+01:00 thserv04 kernel: [17926.981571] #PF: error_code(0x0000) - not-present page
2021-03-04T14:51:53.623375+01:00 thserv04 kernel: [17926.981574] PGD 0 P4D 0 
2021-03-04T14:51:53.623402+01:00 thserv04 kernel: [17926.981580] Oops: 0000 [#1] SMP PTI
2021-03-04T14:51:53.623410+01:00 thserv04 kernel: [17926.981586] CPU: 26 PID: 1783256 Comm: cvmfs2 Not tainted 5.4.14-1.gfc4ea7a-def
ault #1 openSUSE Tumbleweed (unreleased)
2021-03-04T14:51:53.623412+01:00 thserv04 kernel: [17926.981590] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 10/16/201
9
2021-03-04T14:51:53.623414+01:00 thserv04 kernel: [17926.981603] RIP: 0010:walk_component+0x17c/0x340
2021-03-04T14:51:53.623420+01:00 thserv04 kernel: [17926.981607] Code: 00 45 31 e4 83 e3 02 0f 85 44 ff ff ff 8b 45 48 85 c0 0f 84 3
9 ff ff ff 83 e8 01 48 8d 1c 40 89 45 48 48 c1 e3 04 48 03 5d 50 <48> 8b 43 10 48 85 c0 74 09 48 8b 7b 18 e8 52 ab 92 00 f6 45 38 40
2021-03-04T14:51:53.623422+01:00 thserv04 kernel: [17926.981614] RSP: 0018:ffffa06a6a74fc98 EFLAGS: 00010202
2021-03-04T14:51:53.623426+01:00 thserv04 kernel: [17926.981618] RAX: 0000000000043450 RBX: 0000138800c9e2bf RCX: ffff8c007d3b64f8
2021-03-04T14:51:53.623428+01:00 thserv04 kernel: [17926.981622] RDX: 0000000000000000 RSI: ffffa06a62aefc60 RDI: ffff8c0079a6d200
2021-03-04T14:51:53.623429+01:00 thserv04 kernel: [17926.981626] RBP: ffff8c007d3b6480 R08: 0000000000000000 R09: ffffa06a6a74fc38
2021-03-04T14:51:53.623430+01:00 thserv04 kernel: [17926.981629] R10: ffff8c007d3b6480 R11: ffff8c007d3b64b0 R12: ffff8c0079a6d200
2021-03-04T14:51:53.623433+01:00 thserv04 kernel: [17926.981632] R13: ffffffffb4cd6355 R14: ffff8c0079a6d200 R15: ffff8bc08064ec40
2021-03-04T14:51:53.623434+01:00 thserv04 kernel: [17926.981637] FS:  00007fc3637fe640(0000) GS:ffff8bc080d80000(0000) knlGS:0000000
000000000
2021-03-04T14:51:53.623435+01:00 thserv04 kernel: [17926.981641] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-03-04T14:51:53.623436+01:00 thserv04 kernel: [17926.981644] CR2: 0000138800c9e2cf CR3: 0000007f38442004 CR4: 00000000007606e0
2021-03-04T14:51:53.623436+01:00 thserv04 kernel: [17926.981648] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2021-03-04T14:51:53.623439+01:00 thserv04 kernel: [17926.981652] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2021-03-04T14:51:53.623440+01:00 thserv04 kernel: [17926.981655] PKRU: 55555554
2021-03-04T14:51:53.623441+01:00 thserv04 kernel: [17926.981657] Call Trace:
2021-03-04T14:51:53.623442+01:00 thserv04 kernel: [17926.981672]  ? fuse_request_end+0xb0/0x1e0 [fuse]
2021-03-04T14:51:53.623443+01:00 thserv04 kernel: [17926.981679]  ? fuse_dev_do_write+0x234/0x3d0 [fuse]
2021-03-04T14:51:53.623449+01:00 thserv04 kernel: [17926.981686]  ? fuse_dev_write+0x56/0x90 [fuse]
2021-03-04T14:51:53.623450+01:00 thserv04 kernel: [17926.981694]  ? do_iter_readv_writev+0x158/0x1d0
2021-03-04T14:51:53.623451+01:00 thserv04 kernel: [17926.981699]  ? do_iter_write+0x7d/0x190
2021-03-04T14:51:53.623452+01:00 thserv04 kernel: [17926.981704]  ? vfs_writev+0xa6/0xf0
2021-03-04T14:51:53.623453+01:00 thserv04 kernel: [17926.981712]  ? do_futex+0x157/0x1d0
2021-03-04T14:51:53.623454+01:00 thserv04 kernel: [17926.981716]  ? __x64_sys_futex+0x137/0x165
2021-03-04T14:51:53.623455+01:00 thserv04 kernel: [17926.981720]  ? do_writev+0x6b/0x110
2021-03-04T14:51:53.623456+01:00 thserv04 kernel: [17926.981727]  ? do_syscall_64+0x64/0x220
2021-03-04T14:51:53.623457+01:00 thserv04 kernel: [17926.981735]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
2021-03-04T14:51:53.623461+01:00 thserv04 kernel: [17926.981739] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs joydev intel_rapl_msr intel_rapl_common hid_generic isst_if_common i40iw ib_uverbs ib_core skx_edac nfit iTCO_wdt ipmi_ssif iTCO_vendor_support libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel usbhid kvm i40e irqbypass pcspkr ipmi_si mei_me ioatdma ipmi_devintf i2c_i801 lpc_ich mei dca ipmi_msghandler acpi_pad button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl lockd auth_rpcgss grace sunrpc fuse configfs crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ast i2c_algo_bit drm_vram_helper ttm drm_kms_helper xhci_pci syscopyarea sysfillrect sysimgblt xhci_hcd fb_sys_fops aesni_intel drm glue_helper crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
2021-03-04T14:51:53.623462+01:00 thserv04 kernel: [17926.981816] CR2: 0000138800c9e2cf
2021-03-04T14:51:53.623462+01:00 thserv04 kernel: [17926.981820] ---[ end trace a18de532b1569639 ]---
2021-03-04T14:51:53.623463+01:00 thserv04 kernel: [17926.988869] RIP: 0010:walk_component+0x17c/0x340
2021-03-04T14:51:53.623464+01:00 thserv04 kernel: [17926.988875] Code: 00 45 31 e4 83 e3 02 0f 85 44 ff ff ff 8b 45 48 85 c0 0f 84 39 ff ff ff 83 e8 01 48 8d 1c 40 89 45 48 48 c1 e3 04 48 03 5d 50 <48> 8b 43 10 48 85 c0 74 09 48 8b 7b 18 e8 52 ab 92 00 f6 45 38 40
2021-03-04T14:51:53.623471+01:00 thserv04 kernel: [17926.988881] RSP: 0018:ffffa06a6a74fc98 EFLAGS: 00010202
2021-03-04T14:51:53.623472+01:00 thserv04 kernel: [17926.988885] RAX: 0000000000043450 RBX: 0000138800c9e2bf RCX: ffff8c007d3b64f8
2021-03-04T14:51:53.623473+01:00 thserv04 kernel: [17926.988889] RDX: 0000000000000000 RSI: ffffa06a62aefc60 RDI: ffff8c0079a6d200
2021-03-04T14:51:53.623473+01:00 thserv04 kernel: [17926.988892] RBP: ffff8c007d3b6480 R08: 0000000000000000 R09: ffffa06a6a74fc38
2021-03-04T14:51:53.623474+01:00 thserv04 kernel: [17926.988896] R10: ffff8c007d3b6480 R11: ffff8c007d3b64b0 R12: ffff8c0079a6d200
2021-03-04T14:51:53.623475+01:00 thserv04 kernel: [17926.988900] R13: ffffffffb4cd6355 R14: ffff8c0079a6d200 R15: ffff8bc08064ec40
2021-03-04T14:51:53.623476+01:00 thserv04 kernel: [17926.988904] FS:  00007fc3637fe640(0000) GS:ffff8bc080d80000(0000) knlGS:0000000000000000
2021-03-04T14:51:53.623478+01:00 thserv04 kernel: [17926.988908] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2021-03-04T14:51:53.623478+01:00 thserv04 kernel: [17926.988911] CR2: 0000138800c9e2cf CR3: 0000007f38442004 CR4: 00000000007606e0
2021-03-04T14:51:53.623479+01:00 thserv04 kernel: [17926.988915] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
2021-03-04T14:51:53.623480+01:00 thserv04 kernel: [17926.988919] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
2021-03-04T14:51:53.623481+01:00 thserv04 kernel: [17926.988922] PKRU: 55555554
Comment 5 Takashi Iwai 2021-03-05 11:19:02 UTC
Thanks, that's a good information.

Just to be sure, could you test with OBS home:tiwai:kernel:5.3 as well?  Since SLE15 5.3.x kernel is heavily patched from the vanilla 5.3.x., so rather we might have some fix over the vanilla there.
Comment 6 Stefan Kluth 2021-03-08 16:55:43 UTC
we now ran the full build of the ATLAS athena project for > 3 h on

skluth@thserv03:~$ uname -a
Linux thserv03 5.3.12-1.g60a2268-default #1 SMP Mon Nov 25 08:44:42 UTC 2019 (60a2268) x86_64 x86_64 x86_64 GNU/Linux

and this worked fine w/o fuse crash in the kernel logs.
Comment 7 Takashi Iwai 2021-03-10 14:18:25 UTC
Thanks.  Then it must be indeed something between 5.3.x and 5.4.x.
Comment 9 Luis Henriques 2021-03-11 17:56:10 UTC
There were quite a bit of changes introduced in fuse in 5.4, so it's likely that one of those commits are causing this issue.  Or, most likely, it just increased the chances of a race to occur.  A quick look at these changes made me suspicious of commit e4648309b85a ("fuse: truncate pending writes on O_TRUNC") which touches the fuse_open_common() code.

Another somewhat related commit is b24e7598db62 ("fuse: flush dirty data/metadata before non-truncate setattr"), which seems to be an hack for sync'ing writes.

I'll have to look closer at this code (and these changes), and eventually ping upstream.  Simply reverting the changes doesn't look like a viable option, and the issue seems to still be present in mainline.
Comment 10 Luis Henriques 2021-03-11 18:35:56 UTC
By the way, can you confirm that the first log in the bug description was also a "BUG: unable to handle page fault for address"?
Comment 11 Stefan Kluth 2021-03-11 20:30:31 UTC
referring to my original report from 2021-03-02 14:22:05 UTC I forgot to paste these first two lines

[97604.721590] kernel BUG at mm/truncate.c:763!
[97604.721601] invalid opcode: 0000 [#1] SMP PTI 
...
Comment 12 Jakob Blomer 2021-03-12 08:02:39 UTC
As additional information for debugging, CernVM-FS is a read-only file system, all writable fuse callbacks are unset.
Comment 13 Luis Henriques 2021-03-12 12:21:04 UTC
Thank you.  Just to be sure, would it be possible to try to reproduce this issue with a recent kernel, maybe from the KOTD?  You can grab the latest one from OBS Kernel:HEAD.

If it's still reproducible here, could you please provide the kernel splat?  Thanks.

(The fuse maintainer thinks this may actually be an MM issue.  Thread here: https://lore.kernel.org/linux-fsdevel/YEsryBEFq4HuLKBs@suse.de/)
Comment 14 Stefan Kluth 2021-03-12 16:13:55 UTC
skluth@thserv03:~$ uname -a
Linux thserv03 5.12.0-rc2-1.gd9388fc-default #1 SMP Sun Mar 7 09:22:49 UTC 2021 (d9388fc) x86_64 x86_64 x86_64 GNU/Linux

skluth@thserv03:~$ dmesg 
...
[ 5584.543026] ------------[ cut here ]------------
[ 5584.543032] kernel BUG at mm/truncate.c:678!
[ 5584.543040] invalid opcode: 0000 [#1] SMP PTI
[ 5584.543054] CPU: 7 PID: 1681898 Comm: g++ Not tainted 5.12.0-rc2-1.gd9388fc-default #1 openSUSE Tumbleweed (unreleased)
[ 5584.543062] Hardware name: Supermicro X10DRi/X10DRi, BIOS 2.0 12/28/2015
[ 5584.543066] RIP: 0010:invalidate_inode_pages2_range+0x2f6/0x4a0
[ 5584.543080] Code: 0f 48 f0 e9 23 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00 48 89 ee e8 2a ce 02 00 4c 89 ff e8 22 14 01 00 84 c0 0f 84 d4 fe ff ff <0f> 0b 31 f6 48 8d 7c 24 20 e8 1c b2 35 00 49 83 6c 24 68 01 eb 8c
[ 5584.543088] RSP: 0018:ffffb1e5ead27a40 EFLAGS: 00010202
[ 5584.543095] RAX: 0000000000000001 RBX: 000000000000000c RCX: 0000000000000200
[ 5584.543100] RDX: 0000000000000020 RSI: 00affff8000100b7 RDI: ffffea77e1e28000
[ 5584.543105] RBP: 0000000000000c00 R08: 0000000000000c00 R09: 0000000000001a73
[ 5584.543110] R10: fffffffffffffed9 R11: 000000000016c4fb R12: ffff930d9721a0f8
[ 5584.543115] R13: ffff930d9721a100 R14: 0000000000000000 R15: ffffea77e1e28000
[ 5584.543120] FS:  00007fe695c82740(0000) GS:ffff93077f9c0000(0000) knlGS:0000000000000000
[ 5584.543126] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5584.543131] CR2: 00007ffe011760e0 CR3: 000000413193c006 CR4: 00000000003706e0
[ 5584.543136] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5584.543141] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5584.543146] Call Trace:
[ 5584.543157]  fuse_finish_open+0x82/0x150 [fuse]
[ 5584.543174]  fuse_open_common+0x1a8/0x1b0 [fuse]
[ 5584.543185]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
[ 5584.543195]  do_dentry_open+0x14e/0x380
[ 5584.543204]  path_openat+0xafe/0x10b0
[ 5584.543212]  do_filp_open+0x88/0x130
[ 5584.543220]  ? security_prepare_creds+0x6d/0x90
[ 5584.543229]  ? __kmalloc+0x148/0x2d0
[ 5584.543236]  do_open_execat+0x6d/0x1a0
[ 5584.543242]  bprm_execve+0x128/0x660
[ 5584.543248]  do_execveat_common+0x192/0x1c0
[ 5584.543253]  __x64_sys_execve+0x39/0x50
[ 5584.543259]  do_syscall_64+0x33/0x80
[ 5584.543268]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 5584.543276] RIP: 0033:0x7fe695469c37
[ 5584.543285] Code: Unable to access opcode bytes at RIP 0x7fe695469c0d.
[ 5584.543289] RSP: 002b:00007ffce83793c8 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[ 5584.543295] RAX: ffffffffffffffda RBX: 00007fe695c826a0 RCX: 00007fe695469c37
[ 5584.543300] RDX: 0000000002117e10 RSI: 0000000002124948 RDI: 00000000021244c0
[ 5584.543305] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[ 5584.543310] R10: 00007ffce8378e20 R11: 0000000000000202 R12: 0000000000000000
[ 5584.543314] R13: 0000000000000001 R14: 00000000021244c0 R15: 0000000000000000
[ 5584.543320] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs joydev intel_rapl_msr intel_rapl_common iTCO_wdt intel_pmc_bxt hid_generic iTCO_vendor_support sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm mei_me i2c_i801 ipmi_ssif igb ioatdma usbhid irqbypass pcspkr efi_pstore mxm_wmi i2c_smbus mei lpc_ich dca acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_pad tiny_power_button button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl lockd auth_rpcgss grace sunrpc fuse configfs nfs_ssc ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core xhci_pci drm_ttm_helper xhci_pci_renesas ttm xhci_hcd crct10dif_pclmul crc32_pclmul drm crc32c_intel ghash_clmulni_intel ehci_pci ehci_hcd aesni_intel crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
[ 5584.543436] ---[ end trace 9061ab3412ca7379 ]---
[ 5584.548954] RIP: 0010:invalidate_inode_pages2_range+0x2f6/0x4a0
[ 5584.548974] Code: 0f 48 f0 e9 23 ff ff ff 31 c9 4c 89 e7 ba 01 00 00 00 48 89 ee e8 2a ce 02 00 4c 89 ff e8 22 14 01 00 84 c0 0f 84 d4 fe ff ff <0f> 0b 31 f6 48 8d 7c 24 20 e8 1c b2 35 00 49 83 6c 24 68 01 eb 8c
[ 5584.548983] RSP: 0018:ffffb1e5ead27a40 EFLAGS: 00010202
[ 5584.548989] RAX: 0000000000000001 RBX: 000000000000000c RCX: 0000000000000200
[ 5584.548994] RDX: 0000000000000020 RSI: 00affff8000100b7 RDI: ffffea77e1e28000
[ 5584.548999] RBP: 0000000000000c00 R08: 0000000000000c00 R09: 0000000000001a73
[ 5584.549004] R10: fffffffffffffed9 R11: 000000000016c4fb R12: ffff930d9721a0f8
[ 5584.549009] R13: ffff930d9721a100 R14: 0000000000000000 R15: ffffea77e1e28000
[ 5584.549014] FS:  00007fe695c82740(0000) GS:ffff93077f9c0000(0000) knlGS:0000000000000000
[ 5584.549021] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 5584.549026] CR2: 00007fe695469c0d CR3: 000000413193c006 CR4: 00000000003706e0
[ 5584.549031] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 5584.549036] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 5584.560505] BUG: Bad rss-counter state mm:0000000063c87593 type:MM_ANONPAGES val:8
...
Comment 15 Stefan Kluth 2021-03-13 10:34:54 UTC
after reading this thread

https://lore.kernel.org/linux-fsdevel/YEsryBEFq4HuLKBs@suse.de/

I tried the software build on a physical disk and not on /dev/shm as I did before to speed things up. 

I do get the same kernel bug as in my previous trial with linux 5.12.0-rc2-1.gd9388fc-default.
Comment 16 Vlastimil Babka 2021-03-15 10:46:37 UTC
(In reply to Stefan Kluth from comment #14)
> [ 5584.543026] ------------[ cut here ]------------
> [ 5584.543032] kernel BUG at mm/truncate.c:678!
> [ 5584.543040] invalid opcode: 0000 [#1] SMP PTI

That's this one in invalidate_inode_pages2_range():
BUG_ON(page_mapped(page));

Right after
if (page_mapped(page)) unmap_mapping_pages()

So that's indeed weird.

CCing more people in case it rings a bell.

Just a shot in the dark, does disabling THP make any difference? I.e. after a
echo "never" > /sys/kernel/mm/transparent_hugepage/enabled
Comment 17 Luis Henriques 2021-03-15 10:54:31 UTC
(In reply to Vlastimil Babka from comment #16)
> (In reply to Stefan Kluth from comment #14)
> > [ 5584.543026] ------------[ cut here ]------------
> > [ 5584.543032] kernel BUG at mm/truncate.c:678!
> > [ 5584.543040] invalid opcode: 0000 [#1] SMP PTI
> 
> That's this one in invalidate_inode_pages2_range():
> BUG_ON(page_mapped(page));
> 
> Right after
> if (page_mapped(page)) unmap_mapping_pages()
> 
> So that's indeed weird.
> 
> CCing more people in case it rings a bell.
> 
> Just a shot in the dark, does disabling THP make any difference? I.e. after a
> echo "never" > /sys/kernel/mm/transparent_hugepage/enabled

Thanks Vlastimil.

Also, I'm building a test kernel that replaces this BUG_ON() with a VM_BUG_ON_PAGE(), which should provide us with some more info.  I'll post the link once this is build.
Comment 18 Luis Henriques 2021-03-15 14:27:55 UTC
Created attachment 847225 [details]
debug patch with VM_BUG_ON_PAGE in invalidate_inode_pages2_range()

I've a debug kernel ready in OBS home:henrix:bsc1182929.  It's basically a vanilla 5.12-rc3 with the attached patch, which simply replaces the BUG_ON() in function invalidate_inode_pages2_range() with a VM_BUG_ON_PAGE().  This kernel also has the CONFIG_DEBUG_VM config option enabled.

Stefan, could you please try this kernel and share the kernel log?

[ Here's the link to the .repo file if that makes things easier for you:

https://download.opensuse.org/repositories/home:/henrix:/bsc1182929/standard/home:henrix:bsc1182929.repo

or you can simply download the kernel. ]
Comment 19 Stefan Kluth 2021-03-15 20:10:51 UTC
Hi Luis,

skluth@thserv04:~$ uname -a
Linux thserv04 5.12.0-rc3-1.g008d601-default #1 SMP Mon Mar 15 09:39:40 UTC 2021 (008d601) x86_64 x86_64 x86_64 GNU/Linux

skluth@thserv04:~$ dmesg 
...
[16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
[16247.536399] ------------[ cut here ]------------
[16247.536400] kernel BUG at mm/truncate.c:678!
[16247.536406] invalid opcode: 0000 [#1] SMP PTI
[16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted 5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased)
[16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a 10/16/2019
[16247.536427] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550
[16247.536436] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 01 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 00 <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff
[16247.536444] RSP: 0000:ffffa18cb0af7a40 EFLAGS: 00010246
[16247.536450] RAX: 0000000000000036 RBX: 000000000000000d RCX: ffff8ef13fc9a748
[16247.536455] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8ef13fc9a740
[16247.536460] RBP: ffff8eb2f9a02ef8 R08: ffff8ef23ffb48a8 R09: 000000000004fffb
[16247.536464] R10: 00000000ffff0000 R11: 3fffffffffffffff R12: 0000000000001400
[16247.536468] R13: ffff8eb2f9a02f00 R14: 0000000000000000 R15: ffffd651b1978000
[16247.536473] FS:  00007f97c1717740(0000) GS:ffff8ef13fc80000(0000) knlGS:0000000000000000
[16247.536478] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16247.536483] CR2: 00007fd48a25a7c0 CR3: 00000040aa3ac006 CR4: 00000000007706e0
[16247.536487] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16247.536491] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16247.536495] PKRU: 55555554
[16247.536498] Call Trace:
[16247.536506]  fuse_finish_open+0x82/0x150 [fuse]
[16247.536520]  fuse_open_common+0x1a8/0x1b0 [fuse]
[16247.536530]  ? fuse_open_common+0x1b0/0x1b0 [fuse]
[16247.536540]  do_dentry_open+0x14e/0x380
[16247.536547]  path_openat+0xaf6/0x10a0
[16247.536555]  do_filp_open+0x88/0x130
[16247.536560]  ? security_prepare_creds+0x6d/0x90
[16247.536566]  ? __kmalloc+0x157/0x2e0
[16247.536575]  do_open_execat+0x6d/0x1a0
[16247.536581]  bprm_execve+0x128/0x660
[16247.536587]  do_execveat_common+0x192/0x1c0
[16247.536593]  __x64_sys_execve+0x39/0x50
[16247.536599]  do_syscall_64+0x33/0x80
[16247.536606]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[16247.536614] RIP: 0033:0x7f97c0efec37
[16247.536621] Code: Unable to access opcode bytes at RIP 0x7f97c0efec0d.
[16247.536625] RSP: 002b:00007ffdc2fdea68 EFLAGS: 00000202 ORIG_RAX: 000000000000003b
[16247.536631] RAX: ffffffffffffffda RBX: 00007f97c17176a0 RCX: 00007f97c0efec37
[16247.536635] RDX: 0000000000ea42c0 RSI: 0000000000ea5848 RDI: 0000000000ea5d00
[16247.536639] RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000000
[16247.536643] R10: 00007ffdc2fdde60 R11: 0000000000000202 R12: 0000000000000000
[16247.536647] R13: 0000000000000001 R14: 0000000000ea5d00 R15: 0000000000000000
[16247.536653] Modules linked in: overlay rpcsec_gss_krb5 nfsv4 dns_resolver nfsv3 nfs fscache iscsi_ibft iscsi_boot_sysfs rfkill dmi_sysfs intel_rapl_msr intel_rapl_common joydev isst_if_common ipmi_ssif i40iw ib_uverbs iTCO_wdt intel_pmc_bxt skx_edac ib_core hid_generic iTCO_vendor_support nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel acpi_ipmi kvm usbhid i2c_i801 mei_me i40e irqbypass efi_pstore pcspkr ipmi_si ioatdma i2c_smbus lpc_ich mei intel_pch_thermal dca ipmi_devintf ipmi_msghandler tiny_power_button acpi_pad button nls_iso8859_1 nls_cp437 vfat fat nfsd nfs_acl auth_rpcgss lockd grace sunrpc fuse configfs nfs_ssc ast i2c_algo_bit drm_vram_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec rc_core drm_ttm_helper ttm xhci_pci xhci_pci_renesas drm xhci_hcd crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel aesni_intel crypto_simd cryptd usbcore wmi sg br_netfilter bridge stp llc dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc
[16247.536758]  scsi_dh_alua msr efivarfs
[16247.536800] ---[ end trace e1493f55bf5b3a34 ]---
[16247.544126] RIP: 0010:invalidate_inode_pages2_range+0x3b4/0x550
[16247.544140] Code: 00 00 00 4c 89 e6 e8 eb 0f 03 00 4c 89 ff e8 63 40 01 00 84 c0 0f 84 23 fe ff ff 48 c7 c6 d0 1d f4 b1 4c 89 ff e8 ec 82 02 00 <0f> 0b 48 8b 45 78 48 8b 80 80 00 00 00 48 85 c0 0f 84 fb fe ff ff
[16247.544148] RSP: 0000:ffffa18cb0af7a40 EFLAGS: 00010246
[16247.544153] RAX: 0000000000000036 RBX: 000000000000000d RCX: ffff8ef13fc9a748
[16247.544158] RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8ef13fc9a740
[16247.544162] RBP: ffff8eb2f9a02ef8 R08: ffff8ef23ffb48a8 R09: 000000000004fffb
[16247.544166] R10: 00000000ffff0000 R11: 3fffffffffffffff R12: 0000000000001400
[16247.544170] R13: ffff8eb2f9a02f00 R14: 0000000000000000 R15: ffffd651b1978000
[16247.544175] FS:  00007f97c1717740(0000) GS:ffff8ef13fc80000(0000) knlGS:0000000000000000
[16247.544180] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[16247.544184] CR2: 00007f97c0efec0d CR3: 00000040aa3ac006 CR4: 00000000007706e0
[16247.544188] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[16247.544191] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[16247.544194] PKRU: 55555554
[16247.546763] BUG: Bad rss-counter state mm:00000000060c94f4 type:MM_ANONPAGES val:8
Comment 20 Luis Henriques 2021-03-15 22:03:24 UTC
(In reply to Stefan Kluth from comment #19)
> Hi Luis,
>
> skluth@thserv04:~$ uname -a
> Linux thserv04 5.12.0-rc3-1.g008d601-default #1 SMP Mon Mar 15 09:39:40 UTC
> 2021 (008d601) x86_64 x86_64 x86_64 GNU/Linux
> 
> skluth@thserv04:~$ dmesg 
> ...

Thanks a lot for doing all this testing.  Now, I was expecting some information before the line below.  Would it be possible to get the logs *before* this line?  We should have extra info there for the MM guys to help.

> [16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
> [16247.536399] ------------[ cut here ]------------
> [16247.536400] kernel BUG at mm/truncate.c:678!
> [16247.536406] invalid opcode: 0000 [#1] SMP PTI
> [16247.536416] CPU: 42 PID: 2063761 Comm: g++ Not tainted
> 5.12.0-rc3-1.g008d601-default #1 openSUSE Tumbleweed (unreleased)
> [16247.536423] Hardware name: Supermicro X11DPi-N(T)/X11DPi-N, BIOS 3.1a
> 10/16/2019

...
Comment 21 Stefan Kluth 2021-03-18 09:02:38 UTC

2021-03-15T21:05:43.484788+01:00 thserv04 kernel: [16247.536348] page:00000000dfe36ab1 refcount:673 mapcount:0 mapping:00000000f982a
7f8 index:0x1400 pfn:0x4c65e00
2021-03-15T21:05:43.484810+01:00 thserv04 kernel: [16247.536359] head:00000000dfe36ab1 order:9 compound_mapcount:0 compound_pincount
:0
2021-03-15T21:05:43.484812+01:00 thserv04 kernel: [16247.536361] memcg:ffff8e730012b000
2021-03-15T21:05:43.484814+01:00 thserv04 kernel: [16247.536364] aops:fuse_file_aops [fuse] ino:8b8 dentry name:"cc1plus"
2021-03-15T21:05:43.484815+01:00 thserv04 kernel: [16247.536379] flags: 0xaffff800010037(locked|referenced|uptodate|lru|active|head)
2021-03-15T21:05:43.484820+01:00 thserv04 kernel: [16247.536385] raw: 00affff800010037 ffffd6519ed9c448 ffffd651abea5b08 ffff8eb2f9a
02ef8
2021-03-15T21:05:43.484823+01:00 thserv04 kernel: [16247.536388] raw: 0000000000001400 0000000000000000 000002a1ffffffff ffff8e73001
2b000
2021-03-15T21:05:43.484825+01:00 thserv04 kernel: [16247.536389] page dumped because: VM_BUG_ON_PAGE(page_mapped(page))
Comment 22 Luis Henriques 2021-03-18 17:31:12 UTC
Created attachment 847415 [details]
mm/truncate: Make invalidate_inode_pages2_range work with THPs

I've built another kernel with the attached patch.  It should (hopefully) fix this issue.  This patch is added on top of the previous one, so the VM_BUG_ON_PAGE() is still there.

Stefan, could you please give it a try?  It's in OBS home:henrix:bsc1182929-fix-v01 (https://download.opensuse.org/repositories/home:/henrix:/bsc1182929-fix-v01/standard/home:henrix:bsc1182929-fix-v01.repo)
Comment 23 Luis Henriques 2021-03-18 17:33:14 UTC
(In reply to Luis Henriques from comment #22)
> Created attachment 847415 [details]
> mm/truncate: Make invalidate_inode_pages2_range work with THPs
> 
> I've built another kernel with the attached patch.

I forgot to mention that the patch was provided by Matthew Wilcox here:

https://lore.kernel.org/linux-fsdevel/20210318115543.GM3420@casper.infradead.org/
Comment 24 Stefan Kluth 2021-03-18 22:30:47 UTC
skluth@thserv04:~$ uname -a
Linux thserv04 5.12.0-rc3-1.g8d911cd-default #1 SMP Thu Mar 18 12:08:15 UTC 2021 (8d911cd) x86_64 x86_64 x86_64 GNU/Linux

The full build of the ATLAS software now ran to completion w/o problems.

Looks like a fix for us. Thanks for the efforts of everybody!

Does this fix have a chance of backporting to LTS kernel 5.4.x?

Our application is opensuse tumbleweed, but others would be affected too if they use linux >= 5.4.x
Comment 25 Luis Henriques 2021-03-19 09:03:21 UTC
(In reply to Stefan Kluth from comment #24)
> skluth@thserv04:~$ uname -a
> Linux thserv04 5.12.0-rc3-1.g8d911cd-default #1 SMP Thu Mar 18 12:08:15 UTC
> 2021 (8d911cd) x86_64 x86_64 x86_64 GNU/Linux
> 
> The full build of the ATLAS software now ran to completion w/o problems.

Awesome news!  Thanks for your help.

> Looks like a fix for us. Thanks for the efforts of everybody!
> 
> Does this fix have a chance of backporting to LTS kernel 5.4.x?
> 
> Our application is opensuse tumbleweed, but others would be affected too if
> they use linux >= 5.4.x

Yeah, I'll try to find out when this patch will be merged into mainline and, eventually, Cc'ed into stable kernels.  I'll update this bugzilla as I get more details.
Comment 26 Luis Henriques 2021-04-05 09:04:57 UTC
Quick update: unfortunately, this issue hasn't yet been fixed.  Citing the author:

"This patch is good for that point in the patch series, but I'm not sure it works against upstream in all cases.  I need to spend some time evaluating it."

(See comment #23 for the complete thread.)

So, it will eventually get fixed but it may take some more time.