Bug 1224201

Summary: LTP: openQA test fails in msync04 on ntfs
Product: [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6
Reporter: WEI GAO <wegao>
Component: Kernel
Assignee: Kernel Bugs <kernel-bugs>
Status: NEW ---
QA Contact:
Severity: Normal
Priority: P3 - Medium
CC: avinesh.kumar, hector.oron, jack, pcervinka, petr.vorel
Version: unspecified
Flags: wegao: needinfo?
Target Milestone: ---
Hardware: PowerPC-64
OS: Other
URL: https://openqa.suse.de/tests/14296698/modules/msync04/steps/7
Whiteboard:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---
Attachments: failed strace log
pass strace log
failed ftrace on powerVM env

Description WEI GAO 2024-05-14 07:55:30 UTC
openQA test in scenario sle-15-SP6-Online-ppc64le-ltp_syscalls_spvm@ppc64le-spvm fails in
[msync04](https://openqa.suse.de/tests/14296698/modules/msync04/steps/7)

Since 89.1, ntfs testing is enabled on PowerVM, and msync04 fails there with the following output.

tst_test.c:1690: TINFO: === Testing on ntfs ===
tst_test.c:1107: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
The number of heads was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
tst_test.c:1121: TINFO: Mounting /dev/loop0 to /tmp/LTP_msyQvoay5/msync04 fstyp=ntfs flags=0
tst_test.c:1121: TINFO: Trying FUSE...
msync04.c:60: TFAIL: Expected dirty bit to be set after writing to mmap()-ed area  <<<<<<<<<<<

NOTE: the same test passes on x86 (https://openqa.suse.de/tests/14296768#step/msync04/6):
tst_test.c:1690: TINFO: === Testing on ntfs ===
tst_test.c:1107: TINFO: Formatting /dev/loop0 with ntfs opts='' extra opts=''
The partition start sector was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
The number of sectors per track was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
The number of heads was not specified for /dev/loop0 and it could not be obtained automatically.  It has been set to 0.
To boot from a device, Windows needs the 'partition start sector', the 'sectors per track' and the 'number of heads' to be set.
Windows will not be able to boot from this device.
tst_test.c:1121: TINFO: Mounting /dev/loop0 to /tmp/LTP_msysNuWrm/msync04 fstyp=ntfs flags=0
tst_test.c:1121: TINFO: Trying FUSE...
msync04.c:72: TPASS: msync() working correctly <<<<<<<

NOTE: test description of msync04:
 * Test description: Verify msync() after writing into mmap()-ed file works.
 *
 * Write to mapped region and sync the memory back with file. Check the page
 * is no longer dirty after msync() call.
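
For reference, a dirty-bit check of this kind can be sketched as follows. This is only a minimal sketch, assuming the dirty state is derived from /proc/self/pagemap and /proc/kpageflags (PFN in bits 0-54, "present" in bit 63, DIRTY as kpageflags bit 4). The helper names are hypothetical and are not taken from msync04.c, and reading PFNs normally requires root.

```c
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define PM_PRESENT  (1ULL << 63)        /* pagemap: page is present in RAM */
#define PM_PFN_MASK ((1ULL << 55) - 1)  /* pagemap: bits 0-54 hold the PFN */
#define KPF_DIRTY   4                   /* kpageflags: bit 4 is DIRTY */

/* Extract the PFN from a pagemap entry; 0 if the page is not present. */
static uint64_t pagemap_pfn(uint64_t entry)
{
	return (entry & PM_PRESENT) ? (entry & PM_PFN_MASK) : 0;
}

/* Return 1 if a kpageflags word has the DIRTY bit set, 0 otherwise. */
static int kpageflags_dirty(uint64_t flags)
{
	return (int)((flags >> KPF_DIRTY) & 1);
}

/* Read the 64-bit record number `index` from a /proc file; 0 on success. */
static int read_u64_at(const char *path, uint64_t index, uint64_t *out)
{
	ssize_t n;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	n = pread(fd, out, sizeof(*out), (off_t)(index * sizeof(*out)));
	close(fd);
	return n == (ssize_t)sizeof(*out) ? 0 : -1;
}

/* Hypothetical helper: 1 = dirty, 0 = clean, -1 = state could not be read. */
static int page_dirty(const void *addr)
{
	uint64_t entry, flags, pfn;
	long pagesize = sysconf(_SC_PAGESIZE);

	if (read_u64_at("/proc/self/pagemap", (uintptr_t)addr / pagesize, &entry))
		return -1;
	pfn = pagemap_pfn(entry);
	if (!pfn)	/* not present, or PFN hidden from unprivileged users */
		return -1;
	if (read_u64_at("/proc/kpageflags", pfn, &flags))
		return -1;
	return kpageflags_dirty(flags);
}
```

Whatever page_dirty() returns is only a snapshot of the page state at the moment /proc/kpageflags is read; the kernel is free to write the page back between the store to the mapping and that read.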
Comment 1 WEI GAO 2024-05-14 09:44:45 UTC
Another failure that is also triggered by the ntfs check (enabled on PowerVM since 89.1):
https://openqa.suse.de/tests/14296699#step/fallocate06/6
Comment 2 Petr Cervinka 2024-05-14 13:17:36 UTC
@wei could you please check similar fail on aarch64 baremetal https://openqa.suse.de/tests/14280426#step/msync04/8  ?
Comment 3 WEI GAO 2024-05-15 02:01:52 UTC
(In reply to Petr Cervinka from comment #2)
> @wei could you please check similar fail on aarch64 baremetal
> https://openqa.suse.de/tests/14280426#step/msync04/8  ?

Sure, I will check it.

The result seems to depend on the setup: on arm the qemu run passes but the bare-metal run fails, and likewise on power the qemu environment passes but the PowerVM environment fails.

arm qemu passes:
https://openqa.suse.de/tests/14288381#step/msync04/6

arm bare metal fails:
https://openqa.suse.de/tests/14280426#step/msync04/7

ppc64le-virtio qemu passes:
https://openqa.suse.de/tests/14305717#step/msync04/7

PowerVM fails:
https://openqa.suse.de/tests/14296698#step/msync04/1
Comment 4 WEI GAO 2024-05-15 06:08:58 UTC
Created attachment 874892 [details]
failed strace log
Comment 5 WEI GAO 2024-05-15 06:09:16 UTC
Created attachment 874893 [details]
pass strace log
Comment 6 WEI GAO 2024-05-15 06:11:50 UTC
When I tried to reproduce this manually in the openQA PowerVM environment, I found that the issue does not reproduce 100% of the time. I suspect the test case hits some latency when the dirty bit is updated.
Further investigation is needed.
Comment 7 WEI GAO 2024-05-16 03:12:47 UTC
This also happens on 390kvm:
https://openqa.suse.de/tests/14305377#step/msync04/7
Comment 8 WEI GAO 2024-05-16 03:26:47 UTC
memcontrol02 test also failed on ntfs.
https://openqa.suse.de/tests/14305637#step/memcontrol02/7
Comment 9 Petr Vorel 2024-05-16 08:40:31 UTC
(In reply to WEI GAO from comment #8)
> memcontrol02 test also failed on ntfs.
> https://openqa.suse.de/tests/14305637#step/memcontrol02/7

I wonder if this is related.

Although NTFS on s390x is likely not a widely used combination.

Also, it fails on Tumbleweed on x86_64 when run with debug_pagealloc=on (but not on SLE 15-SP6).

https://openqa.opensuse.org/tests/4187995#step/msync04/8

Both tests format NTFS on a loop device; the underlying filesystem is Btrfs.

Jan, could you please have a look? BTW you commented on a 3-year-old issue (but on XFS and on x86_64). dmesg does not mention anything NTFS related, but I can add logs if you want to see them.

https://lore.kernel.org/ltp/20220125121746.wrs4254pfs2mwexb@quack3.lan/
Comment 10 WEI GAO 2024-05-17 13:02:15 UTC
In the PowerVM environment, adding an extra sync() before the write to the mmap()-ed area in the test case reproduces this issue 100% of the time:
diff --git a/testcases/kernel/syscalls/msync/msync04.c b/testcases/kernel/syscalls/msync/msync04.c
index 72ddc27a4..26b966505 100644
--- a/testcases/kernel/syscalls/msync/msync04.c
+++ b/testcases/kernel/syscalls/msync/msync04.c
@@ -53,6 +53,7 @@ static void test_msync(void)
        mmaped_area = SAFE_MMAP(NULL, pagesize, PROT_READ | PROT_WRITE,
                        MAP_SHARED, test_fd, 0);
        SAFE_CLOSE(test_fd);
+       sync();
        mmaped_area[8] = 'B';
        dirty = get_dirty_bit(mmaped_area);
        if (!dirty) {

I have also attached the ftrace for the line below. In it, balance_dirty_pages() is called within handle_mm_fault(); I suppose this is why the dirty flag is cleared before we check it. I suspect a potential issue here: normally balance_dirty_pages() should not be called at this point, because all dirty pages should already have been cleaned by the preceding sync().

mmaped_area[8] = 'B'; <<<<<
Comment 11 WEI GAO 2024-05-17 13:02:46 UTC
Created attachment 874950 [details]
failed ftrace on powerVM env
Comment 12 Petr Cervinka 2024-05-20 05:36:57 UTC
Also happens on some x86_64 bare-metal machines (https://openqa.suse.de/tests/14339973#step/msync04/7). Could it be related to a larger amount of RAM?
Comment 13 Jan Kara 2024-05-20 08:27:58 UTC
Well, as already explained in the email communication referenced from comment 9, msync04 test is inherently racy and I don't see that anything has changed to address that. Nothing guarantees that the page is not written out before get_dirty_bit() manages to read the page state. Hence the test should be reworked to verify the page contents is on disk when it finds dirty bit isn't set anymore...
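
A rough sketch of what such a rework could look like: once the dirty bit is found clear, read the data back while bypassing the page cache and compare it with what was written. This is an illustration under assumptions, not the actual LTP change: the readback uses O_DIRECT with a weaker buffered fallback for filesystems that reject it, and all function names and the temporary file path are hypothetical.

```c
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0		/* not exposed: degrade to a buffered read */
#endif

/* Read the page containing `off` and compare one byte.
 * Returns 1 on match, 0 on mismatch, -1 on error. */
static int read_byte_matches(const char *path, int extra_flags,
			     off_t off, char expected)
{
	long pagesize = sysconf(_SC_PAGESIZE);
	off_t page_start = off - (off % pagesize);
	void *buf = NULL;
	ssize_t n;
	int fd, ret = -1;

	fd = open(path, O_RDONLY | extra_flags);
	if (fd < 0)
		return -1;
	/* O_DIRECT needs an aligned buffer, offset and length. */
	if (posix_memalign(&buf, (size_t)pagesize, (size_t)pagesize) == 0) {
		n = pread(fd, buf, (size_t)pagesize, page_start);
		if (n > (ssize_t)(off - page_start))
			ret = ((char *)buf)[off - page_start] == expected;
		free(buf);
	}
	close(fd);
	return ret;
}

/* Prefer a direct read; fall back to a buffered read (weaker check). */
static int byte_on_disk_matches(const char *path, off_t off, char expected)
{
	int ret = read_byte_matches(path, O_DIRECT, off, expected);

	if (ret < 0)
		ret = read_byte_matches(path, 0, off, expected);
	return ret;
}

/* Hypothetical end-to-end check: write through a shared mapping,
 * msync(), then verify the byte. Returns 1 on success. */
static int msync_roundtrip_ok(void)
{
	char path[] = "/tmp/msync_sketch_XXXXXX";
	long pagesize = sysconf(_SC_PAGESIZE);
	char *map;
	int fd, ret = -1;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, pagesize) == 0) {
		map = mmap(NULL, (size_t)pagesize, PROT_READ | PROT_WRITE,
			   MAP_SHARED, fd, 0);
		if (map != MAP_FAILED) {
			map[8] = 'B';
			if (msync(map, (size_t)pagesize, MS_SYNC) == 0)
				ret = byte_on_disk_matches(path, 8, 'B');
			munmap(map, (size_t)pagesize);
		}
	}
	close(fd);
	unlink(path);
	return ret;
}
```

In a reworked test, byte_on_disk_matches() would be called on the path where the dirty bit is already clear, so that an early writeback counts as a pass as long as the data is really in the file.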
Comment 14 Petr Vorel 2024-05-20 10:22:30 UTC
(In reply to Jan Kara from comment #13)
> Well, as already explained in the email communication referenced from
> comment 9, msync04 test is inherently racy and I don't see that anything has
> changed to address that. Nothing guarantees that the page is not written out
> before get_dirty_bit() manages to read the page state. Hence the test
> should be reworked to verify the page contents is on disk when it finds
> dirty bit isn't set anymore...

Thanks for pointing it out, I'm sorry I overlooked the actual info you put there already. Although it might be useful to add msync support to the framework in fstests, let's first try to rewrite the LTP msync04 test. I created a ticket for it: https://github.com/linux-test-project/ltp/issues/1157.
Comment 15 WEI GAO 2024-05-22 01:24:58 UTC
(In reply to Petr Vorel from comment #14)
> (In reply to Jan Kara from comment #13)
> > Well, as already explained in the email communication referenced from
> > comment 9, msync04 test is inherently racy and I don't see that anything has
> > changed to address that. Nothing guarantees that the page is not written out
> > before get_dirty_bit() manages to read the page state. Hence the test
> > should be reworked to verify the page contents is on disk when it finds
> > dirty bit isn't set anymore...
> 
> Thanks for pointing it out, I'm sorry to overlook the actual info you put
> there already. Although it might be useful to add msync support to framework
> in fstests, let's first try rewrite LTP msync04 test. I created a ticket for
> it: https://github.com/linux-test-project/ltp/issues/1157.

Thanks to both Petr and Jan, I will try to fix https://github.com/linux-test-project/ltp/issues/1157.