Bug 1217850 - LTP: swapping01, swap performance lower compared with 15sp5
Summary: LTP: swapping01, swap performance lower compared with 15sp5
Status: IN_PROGRESS
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Kernel
Version: unspecified
Hardware: Other Other
Priority: P3 - Medium
Severity: Normal
Target Milestone: ---
Assignee: Kernel Bugs
QA Contact:
URL: https://openqa.suse.de/tests/12970322...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-12-06 12:02 UTC by WEI GAO
Modified: 2024-04-18 09:29 UTC
CC List: 7 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
New log with more debuginfo such as ps-aux (252.94 KB, text/plain), 2024-01-20 10:35 UTC, WEI GAO
swap trace (38.53 KB, text/plain), 2024-01-24 08:19 UTC, WEI GAO
swap trace (107.51 KB, text/plain), 2024-01-26 05:05 UTC, WEI GAO
swapping log (630.49 KB, text/plain), 2024-03-12 00:19 UTC, WEI GAO
disable_hughpage_log (794.35 KB, text/plain), 2024-03-13 04:09 UTC, WEI GAO

Description WEI GAO 2023-12-06 12:02:54 UTC
The test's failure rate is ~40%, but it already indicates that swap performance is lower than on 15sp5.

Failed case:
openQA test in scenario sle-15-SP6-Online-x86_64-ltp_mm_swapping@64bit fails in
[swapping01](https://openqa.suse.de/tests/12970322/modules/swapping01/steps/7)

15sp5 pass case:
https://openqa.suse.de/tests/12978940


The test case forks a child that mallocs and frees a block of memory, then checks whether the child's VmSwap value in /proc/<pid>/status is still large:
/*\
 * [Description]
 *
 * Detect heavy swapping during first time swap use.
 *
 * This case is used for testing kernel commit:
 * 50a15981a1fa ("[S390] reference bit testing for unmapped pages")
 *
 * The upstream commit fixed an issue on the s390/x platform where heavy
 * swapping might occur under some conditions; however, since the patch
 * was quite general, this testcase is run on all supported
 * platforms to ensure no regression has been introduced.
 *
 * Details of the kernel fix:
 *
 * On x86 a page without a mapper is by definition not referenced / old.
 * The s390 architecture keeps the reference bit in the storage key and
 * the current code will check the storage key for page without a mapper.
 * This leads to an interesting effect: the first time an s390 system
 * needs to write pages to swap it only finds referenced pages. This
 * causes a lot of pages to get added and written to the swap device.
 * To avoid this behaviour change page_referenced to query the storage
 * key only if there is a mapper of the page.
 *
 * [Algorithm]
 *
 * Try to allocate memory whose size is slightly larger than the currently
 * available memory. After the allocation is done, keep looping for a while
 * and calculate the used swap size. The used swap size should be small
 * enough; otherwise it indicates that heavy swapping occurred unexpectedly.
 */
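
In outline, the check boils down to something like this minimal single-process sketch (not the actual LTP source: the real test allocates in a forked child and checks it from the parent, and read_kb() here is a hypothetical helper standing in for SAFE_READ_MEMINFO()/SAFE_READ_PROC_STATUS()):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* hypothetical helper: return the kB value of "field" in "path" */
static long read_kb(const char *path, const char *field)
{
        FILE *f = fopen(path, "r");
        char line[256];
        long val = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                if (!strncmp(line, field, strlen(field))) {
                        sscanf(line + strlen(field), "%ld", &val);
                        break;
                }
        }
        fclose(f);
        return val;
}

int main(void)
{
        long avail = read_kb("/proc/meminfo", "MemAvailable:");

        if (avail < 0)
                return 1;

        long mem_count = avail + avail * 3 / 10; /* COE_SLIGHT_OVER = 0.3 */
        char *s = malloc(mem_count * 1024L);

        if (!s)
                return 1;
        memset(s, 1, mem_count * 1024L);         /* dirty every page */
        /* fail if more than MemAvailable ended up in swap (the
         * mem_over_max threshold discussed below) */
        long swapped = read_kb("/proc/self/status", "VmSwap:");

        printf("%s: %ld MB swapped\n",
               swapped > avail ? "TFAIL" : "TPASS", swapped / 1024);
        free(s);
        return 0;
}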

Reproduce:
git clone 'https://github.com/linux-test-project/ltp.git'
cd ltp
make autotools
./configure
cd testcases/kernel/mem/swapping/
make
./swapping01
Comment 1 Michal Hocko 2023-12-07 09:16:56 UTC
OK, so I have looked at the test and it effectively does this:
buf = malloc(1.3 * MemAvailable) // #define COE_SLIGHT_OVER 0.3
memset(buf)

This is expected to swap out. How much exactly really depends but the test is setting up a threshold to 
mem_over_max = MemAvailable

If VmSwap > mem_over_max then we have a failure.

On an otherwise idle system one would expect that 30% of the task memory would get swapped out because MemAvailable should give you a rough estimate on how much to allocate before you are pushed out to the swap. But this might really depend on specific numbers.
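
To make the numbers concrete (illustrative values, not from a specific run): with MemAvailable = 700 MB, the test allocates 1.3 * 700 = 910 MB and sets mem_over_max = 700 MB. On an idle system only the ~210 MB that do not fit in RAM should end up in swap, well under the 700 MB threshold, so a failure means roughly three times the expected amount was swapped out.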

Do you think you could expand the test by reporting /proc/meminfo before the test starts and at the moment of the failure (before killing the task)?

Also how reproducible is this? Does this happen every time even when the test is running on a freshly booted test system?
Comment 2 WEI GAO 2023-12-08 07:49:02 UTC
(In reply to Michal Hocko from comment #1)
> OK, so I have looked at the test and it effectively does this:
> buf = malloc(1.3 * MemAvailable) // #define COE_SLIGHT_OVER 0.3
> memset(buf)
> 
> This is expected to swap out. How much exactly really depends but the test
> is setting up a threshold to 
> mem_over_max = MemAvailable
> 
> If VmSwap > mem_over_max then we have a failure.
> 
> On an otherwise idle system one would expect that 30% of the task memory
> would get swapped out because MemAvailable should give you a rough estimate
> on how much to allocate before you are pushed out to the swap. But this
> might really depend on specific numbers.
> 
> Do you think you could expand the test by reporting /proc/meminfo before the
> test starts and at the moment of the failure (before killing the task)?
>

Sure, I can expand the test to report the meminfo before/after the test, but currently I cannot reproduce this in my local env; I suspect the workload is different compared with the osd env. Needs further checking.

> Also how reproducible is this? Does this happen every time even when the
> test is running on a freshly booted test system?

Currently, in the openQA env (the openQA worker machine's load can sometimes be heavy), the history results (https://openqa.suse.de/tests/12970322#next_previous)
show 4 failures in 10 build runs, a ~40% failure rate.
Each openQA run executes the test case on a freshly booted test system; you can check that the openQA case just boots the machine and starts running the case.

If you check the 15sp5 openQA history, the results show a 100% pass rate: https://openqa.suse.de/tests/12988578#next_previous
Comment 3 Michal Hocko 2023-12-08 08:57:37 UTC
(In reply to WEI GAO from comment #2)
[...]
> > Do you think you could expand the test by reporting /proc/meminfo before the
> > test starts and at the moment of the failure (before killing the task)?
> >
> 
> Sure, I can expand the test to report the meminfo before/after the test

Thanks. I believe that reporting values used for the allocation scaling would be a general improvement of the test. Could you try to extend the existing LTP test with those?

> , but currently I cannot reproduce this in my local env; I suspect the
> workload is different compared with the osd env. Needs further checking.

Could the difference be that the test doesn't fail when executed on its own but tests running prior to it could have changed the state enough to make the test fail easily?
Comment 4 WEI GAO 2023-12-10 09:54:15 UTC
(In reply to Michal Hocko from comment #3)
> (In reply to WEI GAO from comment #2)
> [...]
> > > Do you think you could expand the test by reporting /proc/meminfo before the
> > > test starts and at the moment of the failure (before killing the task)?
> > >
> > 
> > Sure, I can expand the test to report the meminfo before/after the test
> 
> Thanks. I believe that reporting values used for the allocation scaling
> would be a general improvement of the test. Could you try to extend the
> existing LTP test with those?
> 

https://patchwork.ozlabs.org/project/ltp/list/?series=385996 

> > , but currently I cannot reproduce this in my local env; I suspect the
> > workload is different compared with the osd env. Needs further checking.
> 
> Could the difference be that the test doesn't fail when executed on its own
> but tests running prior to it could have changed the state enough to make
> the test fail easily?

I suppose you mean that some other test running beforehand could impact the system's VM state?
Once my patch is merged, we can get the VM info before/after to confirm the system state.

Another interesting thing I found: this test case runs a full cycle before the real test. That is (copying the description from LTP commit 1b53999fddbdc935e3af03c641de94b398b14246):

    Child:
             touch unreferenced pages (via alloc&write&free 1.3*MemAvailable) [1]
             alloc&write 1.3*MemAvailable
             raise(SIGSTOP)
             ...
    Parent:
             waiting for child suspension
             check child's VmSwap to see if heavy-swap occurs in a process
             ...

    [1] By performing alloc&write&free, the system's pages go through a
    complete life cycle from the buddy system to the active list to the
    inactive list and back to the buddy system, so a page's status
    theoretically goes "inactive,unreferenced -> active,referenced -> ... ->
    inactive,unreferenced", which helps reproduce the situation that the
    targeted kernel commit fixed.
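
A minimal sketch of that suspend-and-check handshake (simplified and illustrative; the child's allocations are elided, and vmswap_kb() is a hypothetical stand-in for LTP's SAFE_READ_PROC_STATUS()):

#include <signal.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

/* hypothetical helper: read VmSwap: (kB) from /proc/<pid>/status */
static long vmswap_kb(pid_t pid)
{
        char path[64], line[128];
        long kb = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "VmSwap: %ld", &kb) == 1)
                        break;
        fclose(f);
        return kb;
}

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {                /* child */
                /* do_alloc(0); do_alloc(1); as in the description above */
                raise(SIGSTOP);        /* tell the parent we are done */
                _exit(0);
        }
        waitpid(pid, NULL, WUNTRACED); /* parent waits for the SIGSTOP */
        printf("child VmSwap: %ld kB\n", vmswap_kb(pid));
        kill(pid, SIGCONT);            /* resume the child */
        waitpid(pid, NULL, 0);
        return 0;
}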
Comment 5 WEI GAO 2023-12-19 01:57:21 UTC
Running with my private branch (with memory info printed out) 19 times in the openQA OSD env, no failure happened.
https://openqa.suse.de/tests/13108514#next_previous

But you can still see failures in the kernel job group from time to time:
https://openqa.suse.de/tests/13086075#next_previous
Comment 6 Michal Hocko 2023-12-19 08:26:17 UTC
(In reply to WEI GAO from comment #5)
> Running with my private branch (with memory info printed out) 19 times in
> the openQA OSD env, no failure happened.
> https://openqa.suse.de/tests/13108514#next_previous
> 
> But you can still see failures in the kernel job group from time to time:
> https://openqa.suse.de/tests/13086075#next_previous

I am sorry, but I have a hard time finding the corresponding logs for those failures. Do they contain the meminfo output as well?
Comment 7 WEI GAO 2023-12-19 08:55:17 UTC
(In reply to Michal Hocko from comment #6)
> (In reply to WEI GAO from comment #5)
> > Running with my private branch (with memory info printed out) 19 times in
> > the openQA OSD env, no failure happened.
> > https://openqa.suse.de/tests/13108514#next_previous
> > 
> > But you can still see failures in the kernel job group from time to time:
> > https://openqa.suse.de/tests/13086075#next_previous
> 
> I am sorry, but I have a hard time finding the corresponding logs for those
> failures. Do they contain the meminfo output as well?

Sorry, currently my private branch (the first link) prints the meminfo out BUT cannot reproduce this issue.
The official openQA results (the second link) show this issue but without the extra meminfo (without the new patch).
Now the patch is merged, and I hope in the next run we can get a failed result with meminfo. I will give you a detailed log (with explanation) once I have it.
Comment 8 WEI GAO 2023-12-19 09:48:16 UTC
Just FYI:
In the normal situation, the meminfo is printed out 2 times;
in the failed situation, the meminfo is printed 3 times (the last one is the meminfo just before the failure).
--- a/testcases/kernel/mem/swapping/swapping01.c
+++ b/testcases/kernel/mem/swapping/swapping01.c
@@ -83,7 +83,9 @@ static void test_swapping(void)

        switch (pid = SAFE_FORK()) {
        case 0:
+               TST_PRINT_MEMINFO();  <<<< first meminfo
                do_alloc(0);
+               TST_PRINT_MEMINFO();  <<<< second meminfo
                do_alloc(1);
                exit(0);
        default:
@@ -138,6 +140,7 @@ static void check_swapping(void)
                swap_free_now = SAFE_READ_MEMINFO("SwapFree:");
                sleep(1);
                long diff = labs(swap_free_now - SAFE_READ_MEMINFO("SwapFree:"));
+
                if (diff < 10)
                        break;

@@ -146,6 +149,7 @@ static void check_swapping(void)

        swapped = SAFE_READ_PROC_STATUS(pid, "VmSwap:");
        if (swapped > mem_over_max) {
+               TST_PRINT_MEMINFO();  <<<< third meminfo if case failed
                kill(pid, SIGCONT);
                tst_brk(TFAIL, "heavy swapping detected: "
                                "%ld MB swapped.", swapped / 1024);

The case output also needs attention: the case performs many runs, and
each time you see "TPASS: no heavy swapping detected" one run has completed.

For example:
https://openqa.suse.de/tests/13108523#step/swapping01/6

tst_kconfig.c:87: TINFO: Parsing kernel config '/proc/config.gz'
tst_kernel.c:88: TINFO: uname.machine=x86_64 kernel is 64bit
tst_test.c:1690: TINFO: LTP version: 20230929
tst_test.c:1576: TINFO: Timeout per run is 0h 10m 30s
tst_safe_macros.c:619: TINFO: === /proc/meminfo ===  <<<< first meminfo 
MemTotal:         989824 kB
MemFree:           82644 kB
MemAvailable:     726208 kB
...
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:            2048 kB
DirectMap4k:      128868 kB
DirectMap2M:      919552 kB
tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<< second meminfo
MemTotal:         989824 kB
MemFree:          724952 kB
MemAvailable:     689948 kB
Buffers:              92 kB
Cached:            13604 kB
...
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:            2048 kB
DirectMap4k:      128868 kB
DirectMap2M:      919552 kB
swapping01.c:111: TINFO: available physical memory: 709 MB
swapping01.c:116: TINFO: try to allocate: 921 MB
swapping01.c:147: TINFO: SwapFree difference 28016
swapping01.c:159: TPASS: no heavy swapping detected, 236 MB swapped. <<< when you see TPASS/TFAIL, one run has finished
tst_safe_macros.c:619: TINFO: === /proc/meminfo ===  <<<< the test starts the next run
MemTotal:         989824 kB
MemFree:          820772 kB
MemAvailable:     783092 kB
Buffers:               0 kB
Cached:             9480 kB
SwapCached:          376 kB
Active:            11560 kB
Comment 9 WEI GAO 2023-12-19 10:41:03 UTC
One failure happened after another batch of runs.
https://openqa.suse.de/tests/13112516#step/swapping01/6

tst_safe_macros.c:619: TINFO: === /proc/meminfo === <<<<1
MemTotal:         989824 kB
MemFree:          806032 kB
MemAvailable:     763508 kB
Buffers:               8 kB
Cached:            14280 kB
SwapCached:         1948 kB
Active:             5960 kB
Inactive:          11896 kB
Active(anon):        108 kB
Inactive(anon):     3444 kB
Active(file):       5852 kB
Inactive(file):     8452 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2098152 kB
SwapFree:        2045560 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                36 kB
Writeback:             0 kB
AnonPages:          2456 kB
Mapped:             2156 kB
Shmem:                 4 kB
KReclaimable:      43192 kB
Slab:              75996 kB
SReclaimable:      43192 kB
SUnreclaim:        32804 kB
KernelStack:        2240 kB
PageTables:         2952 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2592040 kB
Committed_AS:     113660 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       25592 kB
VmallocChunk:          0 kB
Percpu:              332 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:            2048 kB
DirectMap4k:      104292 kB
DirectMap2M:      944128 kB
tst_safe_macros.c:619: TINFO: === /proc/meminfo ====== <<<<2
MemTotal:         989824 kB
MemFree:          818340 kB
MemAvailable:     771140 kB
Buffers:               0 kB
Cached:             5108 kB
SwapCached:         1084 kB
Active:             2616 kB
Inactive:           4636 kB
Active(anon):        108 kB
Inactive(anon):     2032 kB
Active(file):       2508 kB
Inactive(file):     2604 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2098152 kB
SwapFree:        2044888 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                36 kB
Writeback:             0 kB
AnonPages:          2160 kB
Mapped:             2264 kB
Shmem:                 4 kB
KReclaimable:      43032 kB
Slab:              75656 kB
SReclaimable:      43032 kB
SUnreclaim:        32624 kB
KernelStack:        2240 kB
PageTables:         2952 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2592040 kB
Committed_AS:     113660 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       25592 kB
VmallocChunk:          0 kB
Percpu:              332 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:            2048 kB
DirectMap4k:      104292 kB
DirectMap2M:      944128 kB
swapping01.c:111: TINFO: available physical memory: 745 MB
swapping01.c:116: TINFO: try to allocate: 969 MB
swapping01.c:147: TINFO: SwapFree difference 41896
swapping01.c:147: TINFO: SwapFree difference 512
tst_safe_macros.c:619: TINFO: === /proc/meminfo ===<<<<3 before case fail
MemTotal:         989824 kB
MemFree:           65796 kB
MemAvailable:      17800 kB
Buffers:               0 kB
Cached:             3388 kB
SwapCached:       559864 kB
Active:             1964 kB
Inactive:         752564 kB
Active(anon):         92 kB
Inactive(anon):   751072 kB
Active(file):       1872 kB
Inactive(file):     1492 kB
Unevictable:          80 kB
Mlocked:              80 kB
SwapTotal:       2098152 kB
SwapFree:        1242740 kB
Zswap:                 0 kB
Zswapped:              0 kB
Dirty:                80 kB
Writeback:             0 kB
AnonPages:        191988 kB
Mapped:             2272 kB
Shmem:                 4 kB
KReclaimable:      43184 kB
Slab:              75844 kB
SReclaimable:      43184 kB
SUnreclaim:        32660 kB
KernelStack:        2240 kB
PageTables:         4896 kB
SecPageTables:         0 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     2592040 kB
Committed_AS:    1106224 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       25592 kB
VmallocChunk:          0 kB
Percpu:              332 kB
HardwareCorrupted:     0 kB
AnonHugePages:      2048 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
Unaccepted:            0 kB
HugePages_Total:       1
HugePages_Free:        1
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:            2048 kB
DirectMap4k:      104292 kB
DirectMap2M:      944128 kB
swapping01.c:155: TFAIL: heavy swapping detected: 783 MB swapped
tst_test.c:1622: TINFO: Killed the leftover descendant processes
Comment 10 Michal Hocko 2024-01-15 09:31:13 UTC
(In reply to WEI GAO from comment #9)
> One failed happen after another bunch of run.
> https://openqa.suse.de/tests/13112516#step/swapping01/6
[...]
> CommitLimit:     2592040 kB
> Committed_AS:     113660 kB
[...]
> tst_safe_macros.c:619: TINFO: === /proc/meminfo ====== <<<<2
> MemTotal:         989824 kB
> MemFree:          818340 kB
> MemAvailable:     771140 kB
> Buffers:               0 kB
> Cached:             5108 kB
[...]
> Active(anon):        108 kB
> Inactive(anon):     2032 kB
> Active(file):       2508 kB
> Inactive(file):     2604 kB

753MB available for allocations but almost no page cache; with ~40MB of reclaimable slab there is not really much to reclaim here.
  
[...]
> SwapTotal:       2098152 kB
> SwapFree:        2044888 kB

52MB of swapped out memory before the memory starts being faulted in.
[...]

> swapping01.c:111: TINFO: available physical memory: 745 MB
> swapping01.c:116: TINFO: try to allocate: 969 MB

969MB cannot fit even when the current memory is reclaimed.

> swapping01.c:147: TINFO: SwapFree difference 41896
> swapping01.c:147: TINFO: SwapFree difference 512


> tst_safe_macros.c:619: TINFO: === /proc/meminfo ===<<<<3 before case fail
> MemTotal:         989824 kB
> MemFree:           65796 kB
> MemAvailable:      17800 kB
> Buffers:               0 kB
> Cached:             3388 kB
> SwapCached:       559864 kB
> Active:             1964 kB
> Inactive:         752564 kB
> Active(anon):         92 kB
> Inactive(anon):   751072 kB

So we have 730MB of anonymous memory in use

> SwapTotal:       2098152 kB
> SwapFree:        1242740 kB

and 835MB of swapped-out memory. Discounting the 52MB swapped-out portion that was there initially, we have 783MB swapped out + 730MB resident. That doesn't add up to the allocated 969MB. Is it possible that something else is interfering during this test run?
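
Spelling out the arithmetic from the dumps above: swap in use at the failure is SwapTotal - SwapFree = 2098152 - 1242740 = 855412 kB, about 835MB; subtracting the ~52MB already used before the run leaves the 783MB the test reports. Resident anonymous memory is Inactive(anon) + Active(anon) = 751164 kB, about 733MB. 783MB + 733MB is roughly 1516MB attributed to a 969MB allocation, i.e. about 550MB more than the process ever allocated.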
Comment 11 WEI GAO 2024-01-19 02:43:10 UTC
Thanks for your feedback. I will add ps aux to check whether another process has some strange memory allocation, and will give feedback once I have a result.
Comment 12 WEI GAO 2024-01-20 10:35:18 UTC
Created attachment 872039 [details]
New log with more debuginfo such as ps-aux
Comment 13 WEI GAO 2024-01-20 11:05:26 UTC
(In reply to WEI GAO from comment #12)
> Created attachment 872039 [details]
> New log with more debuginfo such as ps-aux

NOTE: the current reproduction rate is 1%. https://openqa.suse.de/tests/13298632#step/swapping01/6

I added an extra ps aux, but I cannot find any suspicious process. The total RSS is far below the available RAM (966M).

swapping01.c:77: TINFO: USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
swapping01.c:77: TINFO: root      2461 12.1 20.6 996556 204788 hvc0    T    22:08   0:00 swapping01 -i 5
swapping01.c:77: TINFO: root      2465  0.0  0.3  14564  3200 hvc0     R    22:08   0:00 ps aux --sort=-%mem
swapping01.c:77: TINFO: root      1731  0.4  0.2   8672  2944 hvc0     S    22:02   0:01 pidstat -p ALL 1
swapping01.c:77: TINFO: root      1488  0.0  0.1  10220  1920 hvc0     Ss   22:02   0:00 /bin/bash
swapping01.c:77: TINFO: root      2439  0.4  0.1   5056  1796 hvc0     S    22:08   0:00 swapping01 -i 5
swapping01.c:77: TINFO: root      1478  0.0  0.1  42140  1664 ?        Ss   22:02   0:00 /usr/lib/systemd/systemd --user
swapping01.c:77: TINFO: root       452  0.0  0.1  39588  1536 ?        Ss   22:01   0:00 /usr/lib/systemd/systemd-journald
swapping01.c:77: TINFO: root      2437  0.0  0.1   5056  1536 hvc0     S+   22:08   0:00 swapping01 -i 5
swapping01.c:77: TINFO: nscd       611  0.0  0.1 715692  1288 ?        Ssl  22:01   0:00 /usr/sbin/nscd
swapping01.c:77: TINFO: root       494  0.0  0.1  52744  1280 ?        Ss   22:01   0:00 /usr/lib/systemd/systemd-udevd
swapping01.c:77: TINFO: root         1  0.5  0.1  45064  1192 ?        Ss   22:01   0:02 /usr/lib/systemd/systemd --switched-root --system --deserialize=31
swapping01.c:77: TINFO: message+   598  0.0  0.1  18324  1024 ?        Ss   22:01   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
swapping01.c:77: TINFO: root      1317  0.0  0.1 298904  1024 ?        Ssl  22:02   0:00 /usr/sbin/rsyslogd -n -iNONE
swapping01.c:77: TINFO: root       591  0.0  0.1  13504  1000 ?        S<sl 22:01   0:00 /sbin/auditd
swapping01.c:77: TINFO: root       634  0.0  0.0  37564   896 ?        Ss   22:01   0:00 /usr/lib/systemd/systemd-logind
swapping01.c:77: TINFO: root      1457  0.0  0.0  10016   768 ?        Ss   22:02   0:00 /usr/sbin/cron -n
swapping01.c:77: TINFO: root       642  0.0  0.0  18752   640 ?        SLs  22:01   0:00 /usr/sbin/wickedd --systemd --foreground
swapping01.c:77: TINFO: postfix   1437  0.0  0.0  49076   512 ?        S    22:02   0:00 pickup -l -t fifo -u

Another question from my side: why do we have 562M of SwapCached?
MemTotal:         989824 kB
MemFree:           83764 kB
MemAvailable:      37208 kB
Buffers:               0 kB
Cached:            11920 kB
SwapCached:       575992 kB <<<<
Active:            12980 kB
Inactive:         781792 kB
Comment 14 Michal Hocko 2024-01-23 08:43:31 UTC
(In reply to WEI GAO from comment #13)
[...] 
> swapping01.c:77: TINFO: USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> swapping01.c:77: TINFO: root      2461 12.1 20.6 996556 204788 hvc0    T    22:08   0:00 swapping01 -i 5
> swapping01.c:77: TINFO: root      2439  0.4  0.1   5056  1796 hvc0     S    22:08   0:00 swapping01 -i 5
> swapping01.c:77: TINFO: root      2437  0.0  0.1   5056  1536 hvc0     S+   22:08   0:00 swapping01 -i 5

Is it expected to have several instances of the test running?
Comment 15 WEI GAO 2024-01-23 10:31:26 UTC
(In reply to Michal Hocko from comment #14)
> (In reply to WEI GAO from comment #13)
> [...] 
> > swapping01.c:77: TINFO: USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > swapping01.c:77: TINFO: root      2461 12.1 20.6 996556 204788 hvc0    T    22:08   0:00 swapping01 -i 5
> > swapping01.c:77: TINFO: root      2439  0.4  0.1   5056  1796 hvc0     S    22:08   0:00 swapping01 -i 5
> > swapping01.c:77: TINFO: root      2437  0.0  0.1   5056  1536 hvc0     S+   22:08   0:00 swapping01 -i 5
> 
> Is it expected to have several instances of the test running?

That's the expected result for an LTP test: the LTP framework (main process) forks a process to run the test case (swapping01), and if you check swapping01.c you will find that it forks another process to malloc the memory, so 3 processes in total is normal.
Comment 16 Michal Hocko 2024-01-23 12:05:33 UTC
(In reply to WEI GAO from comment #15)
> (In reply to Michal Hocko from comment #14)
> > (In reply to WEI GAO from comment #13)
> > [...] 
> > > swapping01.c:77: TINFO: USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> > > swapping01.c:77: TINFO: root      2461 12.1 20.6 996556 204788 hvc0    T    22:08   0:00 swapping01 -i 5
> > > swapping01.c:77: TINFO: root      2439  0.4  0.1   5056  1796 hvc0     S    22:08   0:00 swapping01 -i 5
> > > swapping01.c:77: TINFO: root      2437  0.0  0.1   5056  1536 hvc0     S+   22:08   0:00 swapping01 -i 5
> > 
> > Is it expected to have several instances of the test running?
> 
> That's the expected result for an LTP test: the LTP framework (main
> process) forks a process to run the test case (swapping01), and if you
> check swapping01.c you will find that it forks another process to malloc
> the memory, so 3 processes in total is normal.

OK, thanks for the clarification. I had that suspicion but wanted to make sure I was not mixing anything up. I will have a closer look at the specific numbers. RSS is not a great measure here because it doesn't include swapped-out memory.
Comment 17 Michal Hocko 2024-01-23 12:11:38 UTC
I am not aware of a way for ps to estimate swapped-out memory. I have this little scriptlet that does a rough estimation. Could you plug it in and report it for the failure cases, please?

$ cat who_swap.sh 
#!/bin/bash

for dir in /proc/[0-9][0-9]*
do 
	pid=${dir##*/}
	name=`cat $dir/comm`
	awk -v name=$name -v pid=$pid '/^Swap/{s+=$2}END{if (s) printf "%dK %s:%d\n", s, name, pid}' $dir/smaps
done | sort -n -k1
Comment 18 WEI GAO 2024-01-24 08:19:34 UTC
Created attachment 872132 [details]
swap trace
Comment 19 WEI GAO 2024-01-24 08:25:56 UTC
(In reply to WEI GAO from comment #18)
> Created attachment 872132 [details]
> swap trace

Just a note: the swap size is doubled since your script counts both Swap and SwapPss.
Comment 20 Michal Hocko 2024-01-24 08:48:37 UTC
(In reply to WEI GAO from comment #19)
> (In reply to WEI GAO from comment #18)
> > Created attachment 872132 [details]
> > swap trace
> 
> Just a note: the swap size is doubled since your script counts both Swap
> and SwapPss.

You are right. My script is really old and pre-dates SwapPss. Anyway, it should be trivial to fix: just change the script to collect only Swap: or only SwapPss:, e.g. by matching /^Swap:/ instead of /^Swap/ in the awk.

(In reply to WEI GAO from comment #18)
> Created attachment 872132 [details]
> swap trace

swapping01.c:193: TINFO: available physical memory: 728 MB
swapping01.c:198: TINFO: try to allocate: 946 MB
swapping01.c:233: TINFO: !!!!!!!!after mem alocate!!!!!!!!!!
swapping01.c:129: TINFO: Name:	swapping01
swapping01.c:129: TINFO: Umask:	0022
swapping01.c:129: TINFO: State:	T (stopped)
swapping01.c:129: TINFO: Tgid:	4691
swapping01.c:129: TINFO: Ngid:	0
swapping01.c:129: TINFO: Pid:	4691
[...]
swapping01.c:129: TINFO: VmSwap:	  775524 kB

[...]
swapping01.c:77: TINFO: 1550997K swapping01:4691
[...]
SwapTotal:       2098152 kB
SwapFree:        1278156 kB

819996 kB swapped out.

There are no other significant swap consumers except for the test case. The memory pressure has evicted most of the anonymous memory, though. The test case itself shouldn't have allocated that much, so I still suspect some other source of allocation. I do not see anything outstanding in the meminfo output.

You can drop the swap consumption collection (my script), but I would like to see /proc/<pid>/smaps of the process. Maybe there is something weird going on there (quite unlikely).

Btw, if you let the test run for a longer time, do those failures show any pattern? E.g. is it possible that any planned jobs could interfere?

Vlastimil, do you have any idea what kind of debugging we could enable to learn more?
Comment 21 Vlastimil Babka 2024-01-24 16:20:49 UTC
Where exactly are these print hooks placed in the latest trace?

swapping01.c:163: TINFO: !!!!!before check!!!!!

swapping01.c:233: TINFO: !!!!!!!!after mem alocate!!!!!!!!!!

That's not as shown in comment 8, is it?

Could you repeat it with this placement?

static void do_alloc(int allow_raise)
{
        long mem_count;
        void *s;

        if (allow_raise == 1)
                tst_res(TINFO, "available physical memory: %ld MB",
                                mem_available_init / 1024);

        mem_count = mem_available_init + mem_over;

        if (allow_raise == 1)
                tst_res(TINFO, "try to allocate: %ld MB", mem_count / 1024);
<-- here
        s = SAFE_MALLOC(mem_count * 1024);
<-- here
        memset(s, 1, mem_count * 1024);
<-- here
        if ((allow_raise == 1) && (raise(SIGSTOP) == -1)) {
                tst_res(TINFO, "memory allocated: %ld MB", mem_count / 1024);
                tst_brk(TBROK | TERRNO, "kill");
        }

        free(s);
}

There are two calls of do_alloc() so we should get 6 reports.

Do we know whether SAFE_MALLOC() and free() here result in mmap() and munmap(), or whether glibc caches the area between the first free() and the second SAFE_MALLOC()? If it does, that could perhaps explain the results. strace could tell, but it should also be more evident from the trace if you add the hooks as suggested above. Thanks!
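
One way to get the strace answer without touching the test (an illustrative invocation, not from the thread; -f follows the forked children):

$ strace -f -e trace=mmap,munmap,brk,madvise ./swapping01

If the second SAFE_MALLOC() shows up as a fresh mmap() following a full munmap(), glibc is not caching the area.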
Comment 22 WEI GAO 2024-01-26 05:05:40 UTC
Created attachment 872211 [details]
swap trace
Comment 23 Vlastimil Babka 2024-03-08 14:04:15 UTC
Apologies for losing track of this bug; I'm analyzing the last trace now.
Comment 24 Vlastimil Babka 2024-03-08 16:01:52 UTC
Yeah, something weird is going on. After the first do_alloc() memsets the memory, we have:

SwapFree:        1665084 kB
Inactive(anon):   697936 kB
AnonPages:        546168 kB
SwapCached:       160856 kB

And after the second do_alloc():

SwapFree:        1309036 kB
Inactive(anon):   745548 kB
AnonPages:        190108 kB
SwapCached:       564560 kB

In between those, everything behaves as if free(s) freed everything properly, including a full munmap() (the VSZ of the process drops), and not e.g. madvise(MADV_DONTNEED). Note the test could still be made more robust against unexpected glibc implementation changes by doing mmap()/munmap() instead of malloc()/free() (see the sketch below).

So reclaim does behave differently during the second do_alloc(), which is what the test tries to check. I'm not sure if it's the "considers pages referenced" behavior, and where it would get that information from. Hmm, maybe it's the workingset shadow nodes for anonymous memory that get unexpectedly carried over to supposedly independent reallocations, and we should be flushing those somewhere? I'll need to check.

Also, could we dump /proc/vmstat together with /proc/meminfo at the same places as previously? That might confirm the suspicion. Thanks!
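
For reference, the mmap()/munmap() variant suggested above could look roughly like this (a sketch only; it assumes mem_count is in kB as in the test; anonymous mappings bypass the glibc allocator, so munmap() is guaranteed to return the pages to the kernel):

#include <string.h>
#include <sys/mman.h>

static int do_alloc_mmap(long mem_count)
{
        size_t len = (size_t)mem_count * 1024;
        char *s = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (s == MAP_FAILED)
                return -1;
        memset(s, 1, len);      /* dirty every page, as the test does */
        munmap(s, len);         /* pages go straight back to the kernel */
        return 0;
}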
Comment 26 WEI GAO 2024-03-12 00:19:34 UTC
Created attachment 873411 [details]
swapping log
Comment 27 Vlastimil Babka 2024-03-12 09:01:54 UTC
Thanks. It seems the workingset counters don't support my hypothesis, but I noticed some negative correlation with AnonHugePages. What happens if they are disabled? i.e. echo never > /sys/kernel/mm/transparent_hugepage/enabled
Comment 28 WEI GAO 2024-03-13 04:09:10 UTC
Created attachment 873461 [details]
disable_hughpage_log
Comment 29 WEI GAO 2024-03-13 04:10:34 UTC
(In reply to Vlastimil Babka from comment #27)
> Thanks. It seems the workingset counters don't support my hypothesis, but I
> noticed some negative correlation with AnonHugePages. What happens if they
> are disabled? i.e. echo never > /sys/kernel/mm/transparent_hugepage/enabled

After disabling hugepages, the issue did not reproduce in 400 runs.
I have also attached a log for the run with AnonHugePages disabled.
Comment 31 WEI GAO 2024-03-20 13:56:22 UTC
So this issue is related to hugepages?
Comment 32 Vlastimil Babka 2024-03-20 15:33:27 UTC
(In reply to WEI GAO from comment #31)
> So this issue is related to hugepages?

I think your testing has proved that. I will investigate THP swapping in SP6...
Comment 34 Vlastimil Babka 2024-04-12 08:47:23 UTC
After inspecting the code and counters, I think there's nothing THP-specific in the handling that would trigger this, as all the THPs are being split anyway before swapping out. So I suspect it's all due to performance changes, and the whole problem is non-deterministic and related to swap-out speed vs the test program's loop speed.

When the test program populates/dirties the allocated memory and starts running out of free memory, reclaim will initiate swap-out. Only after that has finished can the pages be reused for newly populated pages. I suspect we might have a borderline situation where swap-out is fast enough in some cases and not in others, which will result in further swap-outs being initiated, so in the end more is swapped out than necessary.

To test this, I recommend adding some sleeps in the loop that dirties the memory. I think a few ms of sleep after each 1 MB should be enough, or something like that. Or you could start with more and then gradually reduce it so the test doesn't take unnecessarily long.

And if these sleeps remove the issue, then I'd say they should be the permanent fix, rather than a change in the kernel. The regression the LTP test checks for is a deterministic bad behavior, so it would not be masked by the sleeps. If this test is indeed affected by performance/timing, this change will make it more robust while keeping its original purpose.
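
A sketch of the throttled dirtying loop suggested above (chunk size and sleep length are illustrative guesses, not values from the thread):

#include <string.h>
#include <unistd.h>

#define CHUNK (1024 * 1024UL)   /* dirty 1 MB at a time */

/* Dirty the buffer in chunks, pausing after each one so swap-out
 * can keep up with the loop. */
static void dirty_slowly(char *buf, size_t len)
{
        size_t off;

        for (off = 0; off < len; off += CHUNK) {
                size_t n = len - off < CHUNK ? len - off : CHUNK;

                memset(buf + off, 1, n);
                usleep(2000);   /* ~2 ms breather per chunk */
        }
}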
Comment 35 WEI GAO 2024-04-18 07:16:09 UTC
(In reply to Vlastimil Babka from comment #34)
> After inspecting the code and counters, I think there's nothing THP-specific
> in the handling that would trigger this, as all the THPs are being split
> anyway before swapping out. So I suspect it's all due to performance
> changes, and the whole problem is non-deterministic and related to swap-out
> speed vs the test program's loop speed.
> 
> When the test program populates/dirties the allocated memory and starts
> running out of free memory, reclaim will initiate swap-out. Only after that
> has finished can the pages be reused for newly populated pages. I suspect we
> might have a borderline situation where swap-out is fast enough in some
> cases and not in others, which will result in further swap-outs being
> initiated, so in the end more is swapped out than necessary.
> 
> To test this, I recommend adding some sleeps in the loop that dirties the
> memory. I think a few ms of sleep after each 1 MB should be enough, or
> something like that. Or you could start with more and then gradually reduce
> it so the test doesn't take unnecessarily long.
> 
> And if these sleeps remove the issue, then I'd say they should be the
> permanent fix, rather than a change in the kernel. The regression the LTP
> test checks for is a deterministic bad behavior, so it would not be masked
> by the sleeps. If this test is indeed affected by performance/timing, this
> change will make it more robust while keeping its original purpose.

Patch for LTP:
https://patchwork.ozlabs.org/project/ltp/patch/20240418071422.10221-1-wegao@suse.com/

Validation result: pass after 500 runs
https://openqa.suse.de/tests/14056297#next_previous
Comment 36 Vlastimil Babka 2024-04-18 09:29:24 UTC
(In reply to WEI GAO from comment #35)
> Patch for LTP:
> https://patchwork.ozlabs.org/project/ltp/patch/20240418071422.10221-1-
> wegao@suse.com/
> 
> Validation result: pass after 500 runs
> https://openqa.suse.de/tests/14056297#next_previous

Great, thanks!