Bugzilla – Bug 1224434
Page cache invalidation failure during direct I/O write
Last modified: 2024-07-19 13:53:35 UTC
## Observation openQA test in scenario opensuse-Tumbleweed-DVD-ppc64le-ltp_aiodio_part1@ppc64le fails in [AD033](https://openqa.opensuse.org/tests/4202483/modules/AD033/steps/8) ## Test suite description LTP_ENV=TMPDIR=/var/tmp is to test on btrfs instead of tmpfs ## Reproducible Fails since (at least) Build [20240131](https://openqa.opensuse.org/tests/3907534) Last good: Unknown ## Further details Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=ppc64le&distri=opensuse&flavor=DVD&machine=ppc64le&test=ltp_aiodio_part1&version=Tumbleweed) -------- LTP Test - https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/io/ltp-aiodio/aiocp.c we run this test with various combinations of write block size and I/O numbers (https://github.com/linux-test-project/ltp/blob/master/runtest/ltp-aiodio.part1) failing scenarios on ppc64le: AD033 aiocp -b 4k -n 4 -f DIRECT AD035 aiocp -b 4k -n 8 -f DIRECT AD037 aiocp -b 4k -n 16 -f DIRECT (few times doesn't fail) AD039 aiocp -b 4k -n 32 -f DIRECT (few times doesn't fail) AD047 aiocp -b 8k -n 4 -f DIRECT AD049 aiocp -b 8k -n 8 -f DIRECT AD051 aiocp -b 8k -n 16 -f DIRECT (few times doesn't fail) AD061 aiocp -b 16k -n 4 -f DIRECT AD063 aiocp -b 16k -n 8 -f DIRECT AD075 aiocp -b 32k -n 4 -f DIRECT dmesg log during each of these failures: OpenQA::run_ltp.pm: Starting AD032 OpenQA::run_ltp.pm: Starting AD033 [ 48.470444][ T62] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 48.471404][ T62] File: /var/tmp/LTP_aioTBsaEH/dstfile.bin PID: 62 Comm: kworker/1:1 OpenQA::run_ltp.pm: Starting AD034 OpenQA::run_ltp.pm: Starting AD035 This issue is not present on aarch64 and x86_64 tumbleweed builds. And on SLE15-SP6, we are not seeing this on ppc64le also.
Since this is on btrfs, we should involve someone from btrfs team. Filipe, can you have a look? Feel free to reassign if somebody else is a better fit. Checking the kernel logs I agree the iomap code is complaining about failure to invalide page cache *after* a direct write. I can also see LTP is declaring the test as broken because fsync(2) after writing the file has failed with EIO: tst_test.c:1734: TINFO: LTP version: 20240524 tst_test.c:1618: TINFO: Timeout per run is 0h 30m 30s aiocp.c:211: TINFO: Maximum AIO blocks: 65536 tst_device.c:545: TINFO: Use BTRFS specific strategy aiocp.c:250: TINFO: Fill srcfile.bin with random data aiocp.c:279: TINFO: Copy srcfile.bin -> dstfile.bin aiocp.c:115: TBROK: fsync(4) failed: EIO (5) And this is weird because it shouldn't be caused by the page cache invalidation. Now I suspect this is connected with the fact that PowerPC64 where this is happening has page size of 64k and so all the failing cases are not page aligned and so individual IOs are going to use the same page and if we fall back to buffered IO for some reason, conflicts between buffered and direct IO can happen. But since this is not happening with SLE15-SP6, something must have changed.
FYI I see in dmesg for all failures: "Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!" OpenQA::run_ltp.pm: Starting AD032 OpenQA::run_ltp.pm: Starting AD033 [ 34.067042][ T378] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 34.067979][ T378] File: /var/tmp/LTP_aioMRsA7l/dstfile.bin PID: 378 Comm: kworker/0:3 OpenQA::run_ltp.pm: Starting AD034 OpenQA::run_ltp.pm: Starting AD035 [ 34.465431][ T378] Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O! [ 34.466377][ T378] File: /var/tmp/LTP_aiocPYepp/dstfile.bin PID: 378 Comm: kworker/0:3 OpenQA::run_ltp.pm: Starting AD036 Observed on 6.10.0-rc7 [1] and 6.9.9 [2] [1] https://openqa.opensuse.org/tests/4349717/file/serial0.txt [2] https://openqa.opensuse.org/tests/4341608/file/serial0.txt
Created attachment 876151 [details] dmesg from 6.10.0-rc7-3.g92abc10-default - kotd (https://openqa.opensuse.org/tests/4349717/file/serial0.txt
"Page cache invalidation failure on direct I/O. Possible data corruption due to collision with buffered I/O!" is from a92853b6746fe ("fs/direct-io.c: keep dio_warn_stale_pagecache() when CONFIG_BLOCK=n") [1] from v5.5, but that commit just adapt the check for NFS (CONFIG_BLOCK=n). It was originally added in 5a9d929d6e132 ("iomap: report collisions between directio and buffered writes to userspace") [2] in v4.16. Therefore I'm surprised it's not in 15-SP6 (checked). [1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a92853b6746fe5ffef20a7c30addf6320561e669 [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5a9d929d6e13278df62bd9e3d3ceae8c87ad1eea
Filipe, can you have a look? Feel free to reassign if somebody else is a better fit.