Bugzilla – Full Text Bug Listing
| Summary: | LTP: swapping01, swap performance lower compare with 15sp5 | | |
|---|---|---|---|
| Product: | [openSUSE] PUBLIC SUSE Linux Enterprise Server 15 SP6 | Reporter: | WEI GAO <wegao> |
| Component: | Kernel | Assignee: | Kernel Bugs <kernel-bugs> |
| Status: | IN_PROGRESS --- | QA Contact: | |
| Severity: | Normal | | |
| Priority: | P3 - Medium | CC: | avinesh.kumar, mhocko, petr.vorel, rtsvetkov, swayammitra.tripathy, tiwai, vbabka |
| Version: | unspecified | | |
| Target Milestone: | --- | | |
| Hardware: | Other | | |
| OS: | Other | | |
| URL: | https://openqa.suse.de/tests/12970322/modules/swapping01/steps/7 | | |
| Whiteboard: | | | |
| Found By: | openQA | Services Priority: | |
| Business Priority: | | Blocker: | Yes |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | New log with more debuginfo such as ps-aux, swap trace, swap trace, swapping log, disable_hughpage_log | | |
Description
WEI GAO
2023-12-06 12:02:54 UTC
OK, so I have looked at the test and it effectively does this:

    buf = malloc(1.3 * MemAvailable) // #define COE_SLIGHT_OVER 0.3
    memset(buf)

This is expected to swap out. How much exactly really depends, but the test is setting up a threshold:

    mem_over_max = MemAvailable

If VmSwap > mem_over_max then we have a failure.

On an otherwise idle system one would expect that 30% of the task memory would get swapped out, because MemAvailable should give you a rough estimate of how much you can allocate before you are pushed out to the swap. But this might really depend on the specific numbers.

Do you think you could expand the test by reporting /proc/meminfo before the test starts and at the moment of the failure (before killing the task)?

Also, how reproducible is this? Does this happen every time, even when the test is running on a freshly booted test system?

(In reply to Michal Hocko from comment #1)
> OK, so I have looked at the test and it effectively does this:
> buf = malloc(1.3 * MemAvailable) // #define COE_SLIGHT_OVER 0.3
> memset(buf)
>
> This is expected to swap out. How much exactly really depends, but the test
> is setting up a threshold:
> mem_over_max = MemAvailable
>
> If VmSwap > mem_over_max then we have a failure.
>
> On an otherwise idle system one would expect that 30% of the task memory
> would get swapped out, because MemAvailable should give you a rough estimate
> of how much you can allocate before you are pushed out to the swap. But this
> might really depend on the specific numbers.
>
> Do you think you could expand the test by reporting /proc/meminfo before the
> test starts and at the moment of the failure (before killing the task)?

Sure, I can expand the test to report the meminfo before/after the test, but currently I cannot reproduce this in my local env; I suspect the workload is different compared with the osd env. Needs further checking.

> Also, how reproducible is this? Does this happen every time, even when the
> test is running on a freshly booted test system?

Currently in the openQA env (the openQA worker machine load level may be heavy sometimes) the history results (https://openqa.suse.de/tests/12970322#next_previous) show 4 failures after 10 build runs, a ~40% failure rate. Each openQA run executes the test case on a freshly booted test system; you can check that the openQA case just boots the machine and starts running the case. The 15sp5 openQA history shows a 100% pass rate: https://openqa.suse.de/tests/12988578#next_previous

(In reply to WEI GAO from comment #2)
[...]
> > Do you think you could expand the test by reporting /proc/meminfo before the
> > test starts and at the moment of the failure (before killing the task)?
>
> Sure, I can expand the test to report the meminfo before/after the test

Thanks. I believe that reporting the values used for the allocation scaling would be a general improvement of the test. Could you try to extend the existing LTP test with those?

> but currently I cannot reproduce this in my local env; I suspect the
> workload is different compared with the osd env. Needs further checking.

Could the difference be that the test doesn't fail when executed on its own, but tests running prior to it could have changed the state enough to make the test fail easily?

(In reply to Michal Hocko from comment #3)
> Thanks. I believe that reporting the values used for the allocation scaling
> would be a general improvement of the test. Could you try to extend the
> existing LTP test with those?

https://patchwork.ozlabs.org/project/ltp/list/?series=385996

> Could the difference be that the test doesn't fail when executed on its own,
> but tests running prior to it could have changed the state enough to make
> the test fail easily?

I suppose you mean some other test running before can impact the system VM status? Once my patch gets merged we can get the VM info before/after to confirm the system status.

Another interesting thing I found: this test case runs a full cycle before the real test. That is (copying the description from LTP commit 1b53999fddbdc935e3af03c641de94b398b14246):

Child: touch unreferenced pages (via alloc&write&free 1.3*MemAvailable) [1]
       alloc&write 1.3*MemAvailable
       raise(SIGSTOP)
       ...
Parent: waiting for child suspension
        check child's VmSwap to see if heavy swap occurs in a process
        ...

[1] By performing alloc&write&free, the system pages go through a complete life cycle from buddy-system to active-list to inactive-list and back to buddy-system, so that a page's status is theoretically like: "inactive,unreferenced -> active,referenced -> ... -> inactive,unreferenced". This helps reproduce the situation the targeted kernel commit fixed.

Running with my private branch (with memory info printed out) 19 times in the openQA OSD env, no failure happened:
https://openqa.suse.de/tests/13108514#next_previous

But you can still see failures happen in the kernel job group from time to time:
https://openqa.suse.de/tests/13086075#next_previous

(In reply to WEI GAO from comment #5)
> Running with my private branch (with memory info printed out) 19 times in
> the openQA OSD env, no failure happened:
> https://openqa.suse.de/tests/13108514#next_previous
>
> But you can still see failures happen in the kernel job group from time to time:
> https://openqa.suse.de/tests/13086075#next_previous

I am sorry, but I have a hard time finding the corresponding logs for those failures. Do those contain the meminfo output as well?

(In reply to Michal Hocko from comment #6)
> I am sorry, but I have a hard time finding the corresponding logs for those
> failures. Do those contain the meminfo output as well?

Sorry — currently my private branch (the first link) prints out meminfo BUT cannot reproduce this issue. The official openQA result (the second link) has this issue but without the extra meminfo (without the new patch). Now the patch is merged, and I hope in the next run we can get a failed result with meminfo. I will give you a detailed log (with explanation) once I get it.

Just FYI:
In the normal situation the meminfo will be printed 2 times;
in the failed situation it will be printed 3 times (the last one being the meminfo just before the failure).
```diff
--- a/testcases/kernel/mem/swapping/swapping01.c
+++ b/testcases/kernel/mem/swapping/swapping01.c
@@ -83,7 +83,9 @@ static void test_swapping(void)
 	switch (pid = SAFE_FORK()) {
 	case 0:
+		TST_PRINT_MEMINFO();  <<<< first meminfo
 		do_alloc(0);
+		TST_PRINT_MEMINFO();  <<<< second meminfo
 		do_alloc(1);
 		exit(0);
 	default:
@@ -138,6 +140,7 @@ static void check_swapping(void)
 		swap_free_now = SAFE_READ_MEMINFO("SwapFree:");
 		sleep(1);
 		long diff = labs(swap_free_now - SAFE_READ_MEMINFO("SwapFree:"));
+
 		if (diff < 10)
 			break;
@@ -146,6 +149,7 @@ static void check_swapping(void)
 	swapped = SAFE_READ_PROC_STATUS(pid, "VmSwap:");
 	if (swapped > mem_over_max) {
+		TST_PRINT_MEMINFO();  <<<< third meminfo if the case failed
 		kill(pid, SIGCONT);
 		tst_brk(TFAIL, "heavy swapping detected: "
 			"%ld MB swapped.", swapped / 1024);
```
The case output also needs attention: the test performs many runs, and each time
you see "TPASS: no heavy swapping detected" one run is complete.
Take example:
https://openqa.suse.de/tests/13108523#step/swapping01/6
tst_kconfig.c:87: TINFO: Parsing kernel config '/proc/config.gz'
tst_kernel.c:88: TINFO: uname.machine=x86_64 kernel is 64bit
tst_test.c:1690: TINFO: LTP version: 20230929
tst_test.c:1576: TINFO: Timeout per run is 0h 10m 30s
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<< first meminfo
MemTotal: 989824 kB
MemFree: 82644 kB
MemAvailable: 726208 kB
...
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048 kB
DirectMap4k: 128868 kB
DirectMap2M: 919552 kB
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<< second meminfo
MemTotal: 989824 kB
MemFree: 724952 kB
MemAvailable: 689948 kB
Buffers: 92 kB
Cached: 13604 kB
...
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048 kB
DirectMap4k: 128868 kB
DirectMap2M: 919552 kB
swapping01.c:111: TINFO: available physical memory: 709 MB
swapping01.c:116: TINFO: try to allocate: 921 MB
swapping01.c:147: TINFO: SwapFree difference 28016
swapping01.c:159: TPASS: no heavy swapping detected, 236 MB swapped. <<< when you see TPASS/TFAIL the first run is finished
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<< the test starts the next run
MemTotal: 989824 kB
MemFree: 820772 kB
MemAvailable: 783092 kB
Buffers: 0 kB
Cached: 9480 kB
SwapCached: 376 kB
Active: 11560 kB
One failure happened after another bunch of runs.
https://openqa.suse.de/tests/13112516#step/swapping01/6

tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<1
MemTotal: 989824 kB
MemFree: 806032 kB
MemAvailable: 763508 kB
Buffers: 8 kB
Cached: 14280 kB
SwapCached: 1948 kB
Active: 5960 kB
Inactive: 11896 kB
Active(anon): 108 kB
Inactive(anon): 3444 kB
Active(file): 5852 kB
Inactive(file): 8452 kB
Unevictable: 80 kB
Mlocked: 80 kB
SwapTotal: 2098152 kB
SwapFree: 2045560 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 36 kB
Writeback: 0 kB
AnonPages: 2456 kB
Mapped: 2156 kB
Shmem: 4 kB
KReclaimable: 43192 kB
Slab: 75996 kB
SReclaimable: 43192 kB
SUnreclaim: 32804 kB
KernelStack: 2240 kB
PageTables: 2952 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 2592040 kB
Committed_AS: 113660 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 25592 kB
VmallocChunk: 0 kB
Percpu: 332 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048 kB
DirectMap4k: 104292 kB
DirectMap2M: 944128 kB
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<2
MemTotal: 989824 kB
MemFree: 818340 kB
MemAvailable: 771140 kB
Buffers: 0 kB
Cached: 5108 kB
SwapCached: 1084 kB
Active: 2616 kB
Inactive: 4636 kB
Active(anon): 108 kB
Inactive(anon): 2032 kB
Active(file): 2508 kB
Inactive(file): 2604 kB
Unevictable: 80 kB
Mlocked: 80 kB
SwapTotal: 2098152 kB
SwapFree: 2044888 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 36 kB
Writeback: 0 kB
AnonPages: 2160 kB
Mapped: 2264 kB
Shmem: 4 kB
KReclaimable: 43032 kB
Slab: 75656 kB
SReclaimable: 43032 kB
SUnreclaim: 32624 kB
KernelStack: 2240 kB
PageTables: 2952 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 2592040 kB
Committed_AS: 113660 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 25592 kB
VmallocChunk: 0 kB
Percpu: 332 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048 kB
DirectMap4k: 104292 kB
DirectMap2M: 944128 kB
swapping01.c:111: TINFO: available physical memory: 745 MB
swapping01.c:116: TINFO: try to allocate: 969 MB
swapping01.c:147: TINFO: SwapFree difference 41896
swapping01.c:147: TINFO: SwapFree difference 512
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<3 before the case fails
MemTotal: 989824 kB
MemFree: 65796 kB
MemAvailable: 17800 kB
Buffers: 0 kB
Cached: 3388 kB
SwapCached: 559864 kB
Active: 1964 kB
Inactive: 752564 kB
Active(anon): 92 kB
Inactive(anon): 751072 kB
Active(file): 1872 kB
Inactive(file): 1492 kB
Unevictable: 80 kB
Mlocked: 80 kB
SwapTotal: 2098152 kB
SwapFree: 1242740 kB
Zswap: 0 kB
Zswapped: 0 kB
Dirty: 80 kB
Writeback: 0 kB
AnonPages: 191988 kB
Mapped: 2272 kB
Shmem: 4 kB
KReclaimable: 43184 kB
Slab: 75844 kB
SReclaimable: 43184 kB
SUnreclaim: 32660 kB
KernelStack: 2240 kB
PageTables: 4896 kB
SecPageTables: 0 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 2592040 kB
Committed_AS: 1106224 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 25592 kB
VmallocChunk: 0 kB
Percpu: 332 kB
HardwareCorrupted: 0 kB
AnonHugePages: 2048 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
FileHugePages: 0 kB
FilePmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
Unaccepted: 0 kB
HugePages_Total: 1
HugePages_Free: 1
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 2048 kB
DirectMap4k: 104292 kB
DirectMap2M: 944128 kB
swapping01.c:155: TFAIL: heavy swapping detected: 783 MB swapped
tst_test.c:1622: TINFO: Killed the leftover descendant processes

(In reply to WEI GAO from comment #9)
> One failure happened after another bunch of runs.
> https://openqa.suse.de/tests/13112516#step/swapping01/6
[...]
> CommitLimit: 2592040 kB
> Committed_AS: 113660 kB
[...]
> tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<2
> MemTotal: 989824 kB
> MemFree: 818340 kB
> MemAvailable: 771140 kB
> Buffers: 0 kB
> Cached: 5108 kB
[...]
> Active(anon): 108 kB
> Inactive(anon): 2032 kB
> Active(file): 2508 kB
> Inactive(file): 2604 kB

753MB available for allocations but almost no page cache; with ~40MB of reclaimable slab there is not really much to reclaim here.

[...]
> SwapTotal: 2098152 kB
> SwapFree: 2044888 kB

52MB of swapped-out memory before the memory starts being faulted in.

[...]
> swapping01.c:111: TINFO: available physical memory: 745 MB
> swapping01.c:116: TINFO: try to allocate: 969 MB

969MB cannot fit even when the current memory is reclaimed.

> swapping01.c:147: TINFO: SwapFree difference 41896
> swapping01.c:147: TINFO: SwapFree difference 512
> tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<3 before the case fails
> MemTotal: 989824 kB
> MemFree: 65796 kB
> MemAvailable: 17800 kB
> Buffers: 0 kB
> Cached: 3388 kB
> SwapCached: 559864 kB
> Active: 1964 kB
> Inactive: 752564 kB
> Active(anon): 92 kB
> Inactive(anon): 751072 kB

So we have 730MB of anonymous memory in use

> SwapTotal: 2098152 kB
> SwapFree: 1242740 kB

and 835MB of swapped-out memory. Discounting the 52MB swapped-out portion that was there initially, we have 783MB swapped out + 730MB resident. That doesn't add up to the allocated 969MB. Is it possible that something else is interfering during this test run?

Thanks for your feedback. I will add ps aux to check whether another process has some strange memory allocation, and will give feedback once I have a result.

Created attachment 872039 [details]
New log with more debuginfo such as ps-aux
(In reply to WEI GAO from comment #12)
> Created attachment 872039 [details]
> New log with more debuginfo such as ps-aux

NOTE: the current reproduction rate is 1%.
https://openqa.suse.de/tests/13298632#step/swapping01/6

I added an extra ps aux, but I cannot find any suspicious process. The total RSS is far below the available RAM (966M).

swapping01.c:77: TINFO: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
swapping01.c:77: TINFO: root 2461 12.1 20.6 996556 204788 hvc0 T 22:08 0:00 swapping01 -i 5
swapping01.c:77: TINFO: root 2465 0.0 0.3 14564 3200 hvc0 R 22:08 0:00 ps aux --sort=-%mem
swapping01.c:77: TINFO: root 1731 0.4 0.2 8672 2944 hvc0 S 22:02 0:01 pidstat -p ALL 1
swapping01.c:77: TINFO: root 1488 0.0 0.1 10220 1920 hvc0 Ss 22:02 0:00 /bin/bash
swapping01.c:77: TINFO: root 2439 0.4 0.1 5056 1796 hvc0 S 22:08 0:00 swapping01 -i 5
swapping01.c:77: TINFO: root 1478 0.0 0.1 42140 1664 ? Ss 22:02 0:00 /usr/lib/systemd/systemd --user
swapping01.c:77: TINFO: root 452 0.0 0.1 39588 1536 ? Ss 22:01 0:00 /usr/lib/systemd/systemd-journald
swapping01.c:77: TINFO: root 2437 0.0 0.1 5056 1536 hvc0 S+ 22:08 0:00 swapping01 -i 5
swapping01.c:77: TINFO: nscd 611 0.0 0.1 715692 1288 ? Ssl 22:01 0:00 /usr/sbin/nscd
swapping01.c:77: TINFO: root 494 0.0 0.1 52744 1280 ? Ss 22:01 0:00 /usr/lib/systemd/systemd-udevd
swapping01.c:77: TINFO: root 1 0.5 0.1 45064 1192 ? Ss 22:01 0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize=31
swapping01.c:77: TINFO: message+ 598 0.0 0.1 18324 1024 ? Ss 22:01 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
swapping01.c:77: TINFO: root 1317 0.0 0.1 298904 1024 ? Ssl 22:02 0:00 /usr/sbin/rsyslogd -n -iNONE
swapping01.c:77: TINFO: root 591 0.0 0.1 13504 1000 ? S<sl 22:01 0:00 /sbin/auditd
swapping01.c:77: TINFO: root 634 0.0 0.0 37564 896 ? Ss 22:01 0:00 /usr/lib/systemd/systemd-logind
swapping01.c:77: TINFO: root 1457 0.0 0.0 10016 768 ? Ss 22:02 0:00 /usr/sbin/cron -n
swapping01.c:77: TINFO: root 642 0.0 0.0 18752 640 ? SLs 22:01 0:00 /usr/sbin/wickedd --systemd --foreground
swapping01.c:77: TINFO: postfix 1437 0.0 0.0 49076 512 ? S 22:02 0:00 pickup -l -t fifo -u

Another question from my side: why do we have 562M of SwapCached?

MemTotal: 989824 kB
MemFree: 83764 kB
MemAvailable: 37208 kB
Buffers: 0 kB
Cached: 11920 kB
SwapCached: 575992 kB <<<<
Active: 12980 kB
Inactive: 781792 kB

(In reply to WEI GAO from comment #13)
[...]
> swapping01.c:77: TINFO: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> swapping01.c:77: TINFO: root 2461 12.1 20.6 996556 204788 hvc0 T 22:08 0:00 swapping01 -i 5
> swapping01.c:77: TINFO: root 2439 0.4 0.1 5056 1796 hvc0 S 22:08 0:00 swapping01 -i 5
> swapping01.c:77: TINFO: root 2437 0.0 0.1 5056 1536 hvc0 S+ 22:08 0:00 swapping01 -i 5

Is it expected to have several instances of the test running?

(In reply to Michal Hocko from comment #14)
> (In reply to WEI GAO from comment #13)
> [...]
> > swapping01.c:77: TINFO: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> > swapping01.c:77: TINFO: root 2461 12.1 20.6 996556 204788 hvc0 T 22:08 0:00 swapping01 -i 5
> > swapping01.c:77: TINFO: root 2439 0.4 0.1 5056 1796 hvc0 S 22:08 0:00 swapping01 -i 5
> > swapping01.c:77: TINFO: root 2437 0.0 0.1 5056 1536 hvc0 S+ 22:08 0:00 swapping01 -i 5
>
> Is it expected to have several instances of the test running?

That's the expected result for an LTP test: the LTP framework (main process) forks a process to run the test case (swapping01), and if you check swapping01.c you will find it forks another process to malloc the memory, so three processes in total is normal.

(In reply to WEI GAO from comment #15)
> That's the expected result for an LTP test: the LTP framework (main process)
> forks a process to run the test case (swapping01), and if you check
> swapping01.c you will find it forks another process to malloc the memory, so
> three processes in total is normal.

OK, thanks for the clarification. I had that suspicion but wanted to make sure I am not mixing anything up. I will have a closer look at the specific numbers. RSS is not a great measure here because it doesn't include swapped-out memory, and I am not aware of a way that ps could estimate swapped-out memory. I have this little scriptlet that does a rough estimation. Could you plug it in and report on failure cases please?
$ cat who_swap.sh
#!/bin/bash
for dir in /proc/[0-9][0-9]*
do
	pid=${dir##*/}
	name=`cat $dir/comm`
	awk -v name=$name -v pid=$pid '/^Swap/{s+=$2}END{if (s) printf "%dK %s:%d\n", s, name, pid}' $dir/smaps
done | sort -n -k1
Created attachment 872132 [details]
swap trace
(In reply to WEI GAO from comment #18)
> Created attachment 872132 [details]
> swap trace

Just a note: the swap size is doubled, since your script sums both Swap and SwapPss.

(In reply to WEI GAO from comment #19)
> Just a note: the swap size is doubled, since your script sums both Swap and
> SwapPss.

You are right. My script is really old and pre-dates SwapPss. Anyway, it should be trivial to fix: just change the script to collect only Swap: or only SwapPss:.

(In reply to WEI GAO from comment #18)
> Created attachment 872132 [details]
> swap trace

swapping01.c:193: TINFO: available physical memory: 728 MB
swapping01.c:198: TINFO: try to allocate: 946 MB
swapping01.c:233: TINFO: !!!!!!!!after mem alocate!!!!!!!!!!
swapping01.c:129: TINFO: Name: swapping01
swapping01.c:129: TINFO: Umask: 0022
swapping01.c:129: TINFO: State: T (stopped)
swapping01.c:129: TINFO: Tgid: 4691
swapping01.c:129: TINFO: Ngid: 0
swapping01.c:129: TINFO: Pid: 4691
[...]
swapping01.c:129: TINFO: VmSwap: 775524 kB
[...]
swapping01.c:77: TINFO: 1550997K swapping01:4691
[...]
SwapTotal: 2098152 kB
SwapFree: 1278156 kB

819996 kB swapped out; there are no other significant swap consumers besides the test case. The memory pressure has evicted most of the anonymous memory, though. The test case itself shouldn't have allocated that much, so I still suspect some other source of allocation. I do not see anything outstanding in the meminfo output.

You can drop the swap consumption collecting (my script), but I would like to see /proc/<pid>/smaps of the process. Maybe there is something weird going on there (quite unlikely).

Btw. if you let the test run for a longer time, are those failures showing any patterns? E.g. is it possible that any planned jobs could interfere?

Vlastimil, do you have any idea what kind of debugging we could enable to learn more?

Where exactly are these print hooks placed in the latest trace?
swapping01.c:163: TINFO: !!!!!before check!!!!!
swapping01.c:233: TINFO: !!!!!!!!after mem alocate!!!!!!!!!!

That's not as shown in comment 8, is it? Could you repeat it with this placement?

static void do_alloc(int allow_raise)
{
	long mem_count;
	void *s;

	if (allow_raise == 1)
		tst_res(TINFO, "available physical memory: %ld MB",
			mem_available_init / 1024);
	mem_count = mem_available_init + mem_over;
	if (allow_raise == 1)
		tst_res(TINFO, "try to allocate: %ld MB", mem_count / 1024);
	<-- here
	s = SAFE_MALLOC(mem_count * 1024);
	<-- here
	memset(s, 1, mem_count * 1024);
	<-- here
	if ((allow_raise == 1) && (raise(SIGSTOP) == -1)) {
		tst_res(TINFO, "memory allocated: %ld MB", mem_count / 1024);
		tst_brk(TBROK | TERRNO, "kill");
	}
	free(s);
}

There are two calls of do_alloc(), so we should get 6 reports. Do we know if SAFE_MALLOC() and free() here result in mmap() and munmap(), or does glibc cache the area between the first free() and the second SAFE_MALLOC()? If it does, it could perhaps explain the results. strace could tell, but it should also be more evident from the trace if you add the hooks as suggested above. Thanks!

Created attachment 872211 [details]
swap trace
Apologies for losing track of this bug; I'm analyzing the last trace now.

Yeah, something weird is going on. After the first do_alloc() memsets the memory we have:

SwapFree: 1665084 kB
Inactive(anon): 697936 kB
AnonPages: 546168 kB
SwapCached: 160856 kB

And after the second do_alloc():

SwapFree: 1309036 kB
Inactive(anon): 745548 kB
AnonPages: 190108 kB
SwapCached: 564560 kB

In between those, everything behaves as if "free(s)" freed everything properly, including a full munmap() (the VSZ of the process drops), and not e.g. madvise(MADV_DONTNEED). Note the test could still be more robust against unexpected glibc implementation changes by doing mmap()/munmap() instead of malloc()/free().

So reclaim does behave differently during the second do_alloc(), which is what the test tries to check. I'm not sure if it's the "considers pages referenced" part and where it would get that info from. Hmm, maybe it's workingset shadow nodes for anonymous memory that get unexpectedly carried over to supposedly independent reallocations, and we should be flushing those somewhere? I'll need to check. Also, could we dump /proc/vmstat together with /proc/meminfo at the same places as previously? It might confirm the suspicion. Thanks!

Created attachment 873411 [details]
swapping log
Thanks. It seems the workingset counters don't support my hypothesis, but I noticed some negative correlation with AnonHugePages. What happens if they are disabled? i.e.

echo never > /sys/kernel/mm/transparent_hugepage/enabled

Created attachment 873461 [details]
disable_hughpage_log
(In reply to Vlastimil Babka from comment #27)
> Thanks. It seems the workingset counters don't support my hypothesis, but I
> noticed some negative correlation with AnonHugePages. What happens if they
> are disabled? i.e.
> echo never > /sys/kernel/mm/transparent_hugepage/enabled

After disabling hugepages the issue did not reproduce in 400 runs. I have also attached a log for the run with AnonHugePages disabled.

So is this issue related to hugepages?

(In reply to WEI GAO from comment #31)
> So is this issue related to hugepages?

I think your testing has proved that. I will investigate THP swapping in SP6...

After inspecting the code and counters, I think there's nothing THP-specific in the handling that would trigger this, as all the THPs are being split anyway before swapping out. So I suspect it's all due to performance changes, and the whole problem is non-deterministic and related to swap-out speed vs the test program's loop speed.

When the test program populates/dirties the allocated memory and starts running out of free memory, reclaim will initiate swap-out. Only after that is finished can the pages be reused for newly populated pages. I suspect we might have some borderline situation where swap-out will be fast enough in some cases and not in others, which results in further swap-outs being initiated, and in the end more is swapped out than necessary.

To test this, I recommend adding some sleeps in the loop that dirties the memory. I think a few ms of sleep after each 1MB should be enough, or something like that. Or you could start with more and then gradually scale it back so the test isn't unnecessarily slow.

And if these sleeps remove the issue, then I'd say that should be the permanent fix, not a change in the kernel. The regression the LTP test is checking for was a deterministic bad behavior, which the sleeps wouldn't mask. If this test is indeed affected by performance/timing, this change will make it more robust while keeping the original purpose.
(In reply to Vlastimil Babka from comment #34)
[...]
> To test this, I recommend adding some sleeps in the loop that dirties the
> memory. I think a few ms of sleep after each 1MB should be enough.
[...]

Patch for LTP:
https://patchwork.ozlabs.org/project/ltp/patch/20240418071422.10221-1-wegao@suse.com/

Validation result: pass after 500 runs
https://openqa.suse.de/tests/14056297#next_previous

(In reply to WEI GAO from comment #35)
> Patch for LTP:
> https://patchwork.ozlabs.org/project/ltp/patch/20240418071422.10221-1-wegao@suse.com/
>
> Validation result: pass after 500 runs
> https://openqa.suse.de/tests/14056297#next_previous

Great, thanks!