Bug 1217041 - BTRFS: performance is too slow when running LTP fs test
Summary: BTRFS: performance is too slow when running LTP fs test
Status: RESOLVED FIXED
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other Other
Priority: P2 - High
Severity: Normal
Target Milestone: ---
Assignee: Santosh Pradhan
QA Contact:
URL: https://openqa.suse.de/tests/12789339...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-11-11 15:09 UTC by WEI GAO
Modified: 2023-12-20 00:33 UTC
CC List: 7 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
ext4 vs btrfs LTP log (16.46 KB, text/plain) 2023-11-11 15:09 UTC, WEI GAO
strace on btrfs ltp test (57.02 KB, application/x-7z-compressed) 2023-11-11 15:09 UTC, WEI GAO
strace on btrfs with time spent in system calls (327.47 KB, application/x-zip-compressed) 2023-11-12 10:48 UTC, WEI GAO
strace on ext4 with time spent in system calls (155.00 KB, application/x-7z-compressed) 2023-11-12 10:49 UTC, WEI GAO

Description WEI GAO 2023-11-11 15:09:26 UTC
Created attachment 870707 [details]
ext4 vs btrfs LTP log

## Observation

openQA test in scenario sle-15-SP6-Online-ppc64le-ltp_fs@ppc64le-virtio fails in
[fs_fill](https://openqa.suse.de/tests/12789339/modules/fs_fill/steps/7)
The LTP log failed with the following info:
tst_fill_fs.c:36: TINFO: Creating file mntpoint/subdir/thread2/file1 size 4807935
tst_fill_fs.c:36: TINFO: Creating file mntpoint/subdir/thread2/file2 size 90739786
tst_fill_fs.c:36: TINFO: Creating file mntpoint/subdir/thread3/file1 size 76896492
Test timeouted, sending SIGKILL!
tst_test.c:1629: TINFO: If you are running on slow machine, try exporting LTP_TIMEOUT_MUL > 1
tst_test.c:1630: TBROK: Test killed! (timeout?)

A manual check found:
The same LTP test case takes 3 s on ext4 but about 3 minutes on btrfs.
See the attached log and also the strace logs.
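As an aside, the LTP_TIMEOUT_MUL hint printed in the log above can be applied as follows (a minimal sketch; the fs_fill invocation is commented out and assumed to be run from the LTP build tree):

```shell
# Scale every LTP per-test timeout, as the TINFO hint suggests for slow machines.
export LTP_TIMEOUT_MUL=3          # any multiplier > 1; 3 triples the timeout
echo "LTP_TIMEOUT_MUL=$LTP_TIMEOUT_MUL"
# ./fs_fill                       # hypothetical invocation from the LTP build tree
```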
Comment 1 WEI GAO 2023-11-11 15:09:52 UTC
Created attachment 870708 [details]
strace on btrfs ltp test
Comment 2 WEI GAO 2023-11-12 01:30:20 UTC
Further investigation:
1) Checked the LTP test code: the test case creates multiple threads that write content into the filesystem, expecting to eventually hit ENOSPC.

https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/fs/fs_fill/fs_fill.c

2) Using strace with -T, I found the test case seems to wait on a futex lock for too long:

strace -t -f -o strace_btrfs_with_T_f -T ./fs_fill
awk -F'[<> ?]' '{if ($(NF-1) > 1) print}' strace_btrfs_with_T_f

27724 19:45:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27725, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
27722 19:45:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27724, si_uid=0, si_status=1, si_utime=0, si_stime=0} ---
27728 19:45:54 +++ exited with 127 +++
27727 19:45:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27728, si_uid=0, si_status=127, si_utime=0, si_stime=0} ---
27727 19:45:54 +++ exited with 127 +++
27722 19:45:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27727, si_uid=0, si_status=127, si_utime=0, si_stime=0} ---
27722 19:45:54 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27729, si_uid=0, si_status=0, si_utime=1, si_stime=0} ---
27722 19:45:54 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=27752, si_uid=0} ---
27722 19:45:54 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=27752, si_uid=0} ---
27722 19:45:55 <... wait4 resumed>0x7fffe8926ce0, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <1.622117>
27722 19:45:55 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=27752, si_uid=0} ---
27763 19:50:10 <... unlink resumed>)    = 0 <2.707933>
27752 19:51:53 <... futex resumed>)     = 0 <356.857147>   <<<<<<<<<<<
27752 19:51:55 <... futex resumed>)     = 0 <1.269408>
27722 19:51:55 <... wait4 resumed>0x7fffe8926ce0, 0, NULL) = ? ERESTARTSYS (To be restarted if SA_RESTART is set) <359.480639>  <<<<<<
27722 19:51:55 --- SIGUSR1 {si_signo=SIGUSR1, si_code=SI_USER, si_pid=27752, si_uid=0} ---
27722 19:51:55 --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=27752, si_uid=0, si_status=0, si_utime=0, si_stime=17} ---
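The awk filter above can be sanity-checked against hypothetical lines in the strace -T format (the two sample lines below are made up for illustration; the duration is the trailing <seconds> field that -T appends):

```shell
# Two made-up lines in strace -T format: a fast write and a slow futex wait.
cat > /tmp/strace_sample <<'EOF'
1635  04:46:11 write(3, "x"..., 4096) = -1 ENOSPC (No space left on device) <0.000040>
27752 19:51:53 <... futex resumed>)     = 0 <356.857147>
EOF

# Split on '<', '>', ' ' and '?'; the duration lands in the next-to-last field,
# so this prints only the calls that took more than one second.
awk -F'[<> ?]' '{if ($(NF-1) > 1) print}' /tmp/strace_sample
```

Only the 356-second futex line survives the filter, which is how the slow calls above were isolated.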
Comment 3 WEI GAO 2023-11-12 10:48:35 UTC
Created attachment 870710 [details]
strace on btrfs with time spent in system calls
Comment 4 WEI GAO 2023-11-12 10:49:04 UTC
Created attachment 870711 [details]
strace on ext4 with time spent in system calls
Comment 5 WEI GAO 2023-11-12 11:05:31 UTC
Uploaded straces for both btrfs and ext4 with time spent in system calls.

The issue does not exist if you run ONLY one thread. I suspect some internal lock?

The memory/IO/CPU load is normal during the btrfs test.

sar -h -r -u -b -q -d  5
05:59:45        CPU     %user     %nice   %system   %iowait    %steal     %idle
05:59:50        all      0.0%      0.0%      0.1%      2.1%      0.0%     97.8%

05:59:45          tps      rtps      wtps   bread/s   bwrtn/s
05:59:50       615.60      0.20    615.40     25.60 111107.20

05:59:45    kbmemfree   kbavail kbmemused  %memused kbbuffers  kbcached  kbcommit   %commit  kbactive   kbinact   kbdirty
05:59:50        21.9M    307.0M     92.6M     18.8%    192.0k    274.4M    260.6M     10.3%    170.2M    136.6M      2.4M

05:59:45      runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
05:59:50            0       168      3.39      1.03      0.71         0

05:59:45          tps     rkB/s     wkB/s   areq-sz    aqu-sz     await     svctm     %util DEV
05:59:50       615.60     12.8k     54.3M     90.3k      0.11      0.17      0.42     25.8% vda
05:59:50         0.00      0.0k      0.0k      0.0k      0.00      0.00      0.00      0.0% sr0
05:59:50       465.60      0.0k     25.1M     55.1k      0.13      0.18      1.04     48.4% loop0



strace -t -f -o strace_btrfs -T ./fs_fill

In the "strace on btrfs with time spent in system calls" attachment, you can see that the time spent in the write syscall is large compared with ext4.

For example, in the following trace (PIDs 1924, 1925, 1926 and 1927 are the writer threads in the strace file), a write takes 0.049914 s:
1927  04:49:02 <... write resumed>)     = 4096 <0.049914>
1926  04:49:02 <... write resumed>)     = 4096 <0.049910>
1924  04:49:02 <... write resumed>)     = 4096 <0.049904>


In the ext4 strace, a write takes 0.000040 s:
1635  04:46:11 write(3, "\245\370\336\202\212\224\215\240\2gyS\371w4i\215\333\346\374\242\312\25\320m\f\302S\210\357,\331"..., 4096) = -1 ENOSPC (No space left on device) <0.000040>
Comment 6 David Sterba 2023-11-15 16:58:26 UTC
I don't see it in test case file, what is the partition size? On PPC64 the page size is 64K and this causes trouble on filesystems of ~1G size, the threads could compete for a very small number of pages/blocks to actually write the data. There could still be a problem somewhere but the test scenario should be properly scaled.
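To make the scaling concrete, here is a back-of-envelope sketch (the 300 MB device size is taken from the df output in comment #7; the block counts are idealized and ignore metadata overhead):

```shell
# Idealized number of data blocks on the 300 MB test device at each sectorsize,
# ignoring metadata: larger blocks means far fewer units for the writer
# threads to compete over.
fs_bytes=$((300 * 1024 * 1024))
echo "sectorsize 64K: $((fs_bytes / 65536)) blocks"   # 4800 blocks
echo "sectorsize  4K: $((fs_bytes / 4096)) blocks"    # 76800 blocks
```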
Comment 7 WEI GAO 2023-11-20 01:15:54 UTC
(In reply to David Sterba from comment #6)
> I don't see it in test case file, what is the partition size? On PPC64 the
> page size is 64K and this causes trouble on filesystems of ~1G size, the
> threads could compete for a very small number of pages/blocks to actually
> write the data. There could still be a problem somewhere but the test
> scenario should be properly scaled.

Test env:
getconf PAGE_SIZE
65536

mount
/dev/loop0 on /tmp/LTP_fs_Fr6GS9/mntpoint type btrfs 
(rw,relatime,ssd,discard=async,space_cache=v2,subvolid=5,subvol=/)

df -h /dev/loop0
Filesystem      Size  Used Avail Use% Mounted on
/dev/loop0      300M   29M  214M  12% /tmp/LTP_fs_Fr6GS9/mntpoint

When Use% increases to 14%, the write speed becomes very slow; a few minutes later Use% can reach 98%.

The LTP test has two write patterns. One writes a fixed block size using the writev syscall (4096*512 bytes per writev call); this pattern shows no issue.
The other writes with random sizes; this pattern triggers the issue above.
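The random-size pattern can be sketched roughly as follows (this is not the LTP code; /dev/full stands in for a full filesystem, since every write to it fails with ENOSPC on Linux, and the size arithmetic is purely illustrative):

```shell
# Rough sketch of the "random size" pattern: pick a random target size and
# write it in 4096-byte chunks until the write fails with ENOSPC.
size=$(( (RANDOM % 8192 + 1) * 4096 ))   # illustrative; tst_fill_fs.c goes up to ~100 MB
if ! dd if=/dev/zero of=/dev/full bs=4096 count=$((size / 4096)) 2>/dev/null; then
    echo "ENOSPC after attempting $size bytes"
fi
```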
Comment 8 WEI GAO 2023-11-20 01:32:35 UTC
(In reply to WEI GAO from comment #7)
> [...]
> df -h /dev/loop0
> Filesystem      Size  Used Avail Use% Mounted on
> /dev/loop0      300M   29M  214M  12% /tmp/LTP_fs_Fr6GS9/mntpoint
> [...]

When I increase the device size to 1G, the case can PASS.
Comment 9 Santosh Pradhan 2023-11-20 10:00:01 UTC
Unlike ext4, btrfs sets the sectorsize (FS block size) to PAGE_SIZE (which is 64k in this case). To match ext4, can we retry the test case with a 4k sectorsize (btrfs supports sub-page sectorsizes)?

e.g. mkfs.btrfs -s 4k
Comment 10 WEI GAO 2023-11-20 12:02:57 UTC
(In reply to Santosh Pradhan from comment #9)
> Unlike ext4, btrfs sets the sectorsize (FS block size) to PAGE_SIZE (which
> is 64k in this case). To match with ext4, can we retry the test case with
> sectorsize 4k (support of sub-page sectorsize).
> 
> e.g. mkfs.btrfs -s 4k

With the LTP case set to use fs opts "-s 4k", the test case can pass:

static struct tst_test test = {
        .max_runtime = 60,
        .dev_min_size = 300,
        .dev_fs_opts = (const char *const[]){"-s 4k", NULL}, <<<<<<<<<
        .needs_root = 1,
        .mount_device = 1,
        .mntpoint = MNTPOINT,
        .all_filesystems = 1,
        .setup = setup,
        .cleanup = cleanup,
        .test = testrun,
        .skip_filesystems = (const char *[]) {
                "tmpfs",
                "ext2",
                "ext3",
                "ext4",
                "xfs",
                "vfat",
                "exfat",
//                "btrfs",
                NULL
        },
        .tcnt = 2
};
Comment 11 Santosh Pradhan 2023-11-20 13:36:58 UTC
Did the test case pass with modified option (i.e. -s 4k)?
Comment 12 WEI GAO 2023-11-20 23:52:43 UTC
(In reply to Santosh Pradhan from comment #11)
> Did the test case pass with modified option (i.e. -s 4k)?

Yes, the case passes with the modified option. See https://bugzilla.suse.com/show_bug.cgi?id=1217041#c9
Comment 13 WEI GAO 2023-11-20 23:54:53 UTC
(In reply to WEI GAO from comment #12)
> (In reply to Santosh Pradhan from comment #11)
> > Did the test case pass with modified option (i.e. -s 4k)?
> 
> Yes case is pass with modified option. See
> https://bugzilla.suse.com/show_bug.cgi?id=1217041#c9

Sorry, the comment link above is wrong; the feedback is in comment #10.
Comment 14 Santosh Pradhan 2023-11-21 05:27:25 UTC
Thanks much David and Qu.

Thanks much Wei for all your effort.

Now with a 64KB PAGE_SIZE (on PPC), btrfs needs the filesystem size (partition size) to be bigger than 1GB (as btrfs by default sets the sectorsize to PAGE_SIZE). The test case passes with a 1G or bigger filesystem.

But to compare with ext4 (or others that use a 4K FS block size or sectorsize), the btrfs sectorsize can be tuned to 4K, i.e. a sub-page sectorsize smaller than PAGE_SIZE (e.g. mkfs.btrfs -s 4k). In fact this helps the test case succeed.

As there is not much left to investigate, I would suggest closing the ticket if and only if David, Wei and Goldwyn agree.
Comment 15 Radoslav Tzvetkov 2023-11-29 14:04:55 UTC
Santosh Pradhan, with this said, you can close the ticket. And we hope the tests will be readjusted. Any of the participants can reopen it.

Santosh, Hector do we need to document this in the release notes or docs? If so, please create a task or bug for the doc-team.
Comment 16 WEI GAO 2023-11-30 10:35:01 UTC
https://progress.opensuse.org/issues/151759 for tracking LTP case update
Comment 17 WEI GAO 2023-12-15 11:54:48 UTC
(In reply to Radoslav Tzvetkov from comment #15)
> Santosh Pradhan, with this said, you can close the ticket. And we hope the
> tests to be readjusted. Any of the participants could reopen it. 
> 
> Santosh, Hector do we need to document this in the release notes or docs? If
> so, please create a task or bug for the doc-team.

No idea what happened: the result shows a pass for 44.1. Is there already an update from the kernel side?
https://openqa.suse.de/tests/13003538#step/fs_fill/7
Comment 18 Santosh Pradhan 2023-12-19 18:27:30 UTC
Though the decision was to modify the LTP test case, Wei Gao said he would address it; I have not done anything there. Please check whether the test case is modified in git. I will also take a look and update.
Comment 19 Santosh Pradhan 2023-12-19 18:38:23 UTC
I am also going to check if btrfs code is modified to address the issue for ppc.
Comment 20 WEI GAO 2023-12-20 00:25:55 UTC
(In reply to Santosh Pradhan from comment #18)
> Though the decision was to modify the LTP test case, Wei Gao said he will
> address it. I have not done anything in that. Please check if the test case
> is modified in git, I will also take a look and update.

My patch is still in review backlog.
https://patchwork.ozlabs.org/project/ltp/patch/20231218131142.12464-1-wegao@suse.com/
Comment 21 WEI GAO 2023-12-20 00:33:06 UTC
(In reply to Santosh Pradhan from comment #19)
> I am also going to check if btrfs code is modified to address the issue for
> ppc.

I also scheduled reruns on 44.1; the 11 reruns all failed, so I cannot explain why 44.1 shows a pass result.
https://openqa.suse.de/tests/13119838#next_previous

Also, the latest build 45.1 shows a failure again, so let's keep an eye on this.
https://openqa.suse.de/tests/13085149#