Bug 1159882 - Excessive swapping when buffers / cache expand beyond free physical RAM
Status: REOPENED
: 1177541 (view as bug list)
Classification: openSUSE
Product: openSUSE Distribution
Component: Kernel
Version: Leap 15.2
Hardware: x86-64 openSUSE Leap 15.2
Priority: P3 - Medium
Severity: Major
Target Milestone: Leap 15.2
Assigned To: openSUSE Kernel Bugs
Reported: 2019-12-29 14:34 UTC by robert spitzenpfeil
Modified: 2021-06-24 22:30 UTC (History)
mkoutny: needinfo? (radelahunt)
mhocko: needinfo? (vbabka)


Attachments
vmstat traces - all swap off - all swap on - all swap off again (41.51 KB, application/x-compressed-tar)
2020-01-06 11:04 UTC, robert spitzenpfeil
vmtraces running dd (10.87 KB, application/x-7z-compressed)
2020-01-08 22:58 UTC, robert spitzenpfeil

Description robert spitzenpfeil 2019-12-29 14:34:22 UTC
User-Agent:       Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36
Build Identifier: 

https://forums.opensuse.org/showthread.php/538586-observing-excessive-swapping-when-copying-large-files

---
Same behavior here: Started dd if=/dev/sdc of=/dev/null bs=4M with no swap being active. Everything was fine until I turned swap on. The machine would freeze immediately. Swapoff runs for minutes until finally succeeding. I think it is a kernel bug.
---

Reproducible: Always

Steps to Reproduce:
1. Copy a large file (e.g. 2x RAM size) with cp, or use dd in a way that triggers block device caching and, eventually, swapping
2. Watch the system suffer from high iowait on the device holding root-fs & swap
Actual Results:  
Do NOT increase the cache beyond free physical RAM size by swapping

Expected Results:  
Smooth operation, no lags
Comment 1 robert spitzenpfeil 2019-12-29 14:40:56 UTC
The system I'm currently running tests on:

CPU: Dual Core Intel Core2 T5200 
Kernel: 5.3.12-2-default x86_64 

Mem: 861.9/1969.0 MiB (43.8%) 
Storage: 29.82 GiB (51.0% used)

Drives:    Local Storage: total: 29.82 GiB used: 15.20 GiB (51.0%) 
           ID-1: /dev/sda vendor: SanDisk model: SDSSDRC032G size: 29.82 GiB speed: 1.5 Gb/s serial: <filter> 


Other laptops are affected as well, less so due to being a whole lot newer with more RAM and faster storage.
Comment 2 Michal Hocko 2020-01-06 09:17:19 UTC
Could you collect /proc/vmstat data while you are running the test? E.g. something like:
while true
do
    TS="$(date +%s)"
    cp /proc/vmstat vmstat.$TS
    sleep 1s
done

Also is this a new problem? When have you noticed it? Was it after an update to a maintenance kernel or upgrade to a new major kernel release?
Comment 3 robert spitzenpfeil 2020-01-06 11:03:13 UTC
I first noticed it on my work laptop, when copying a VM image (about 30G) from A to B. That machine has an i7, NVME storage (samsung 960 EVO) and 16GB of RAM.

At one point it was hitting swap massively (high IO, not amount swapped out at any given time), and the GUI would freeze for seconds at a time. I was wondering why the heck it would start using swap at all.

All of this badness goes away when turning swap off completely.

I don't have anything useful to say as to when this may have started, just that I've never before experienced such pathological behaviour when just copying a file! As described in the forum posts, it just doesn't make any sense whatsoever.

I do expect a massive performance penalty when swap is used, that is not the question. The question here is, why it hits swap at all. My suspicion is that buffers grow to a point that triggers swapping, which should never happen.

For convenience I started testing on the old laptop, different hardware, slower, much less RAM, just to rule out some freak HW issue or configuration differences. Both run up to date TW.
Comment 4 robert spitzenpfeil 2020-01-06 11:04:47 UTC
Created attachment 826954 [details]
vmstat traces - all swap off - all swap on - all swap off again
Comment 5 robert spitzenpfeil 2020-01-06 11:08:33 UTC
I ran this during taking the vmstat traces:

dd if=/dev/sda1 of=/dev/null status=progress

All swap space was turned off initially, transfer speed was compatible with a SATA-I connection, about 150MB/s. Then swap was turned on and badness occurred. Transfer speed dropped to less than 50% (swap on same device, 2x IO, might be OK by the numbers), and the GUI became painfully unresponsive.

I ran this running a plasma5 session + tmux.
Comment 6 Michal Hocko 2020-01-06 11:46:51 UTC
(In reply to robert spitzenpfeil from comment #4)
> Created attachment 826954 [details]
> vmstat traces - all swap off - all swap on - all swap off again

This is not very useful without knowing exactly when swap was enabled/disabled. Could you provide a single run of the affected workload without any changes in the configuration, please?
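
One way to make such a trace self-describing is to record the swap state next to every snapshot. The following is only a sketch: the function name snapshot_vm, the swap_state.log file and the optional source-path arguments are illustrative assumptions, not something used in this report. It relies on /proc/swaps having one header line followed by one line per active swap area.

```sh
# Hypothetical helper: save one /proc/vmstat snapshot together with the number
# of active swap areas at that moment, so swapon/swapoff times can be read
# back from the trace afterwards.
# Arguments: output directory, then optional source paths (default to /proc).
snapshot_vm() {
    out_dir=$1
    vmstat_src=${2:-/proc/vmstat}
    swaps_src=${3:-/proc/swaps}
    ts=$(date +%s)
    mkdir -p "$out_dir" || return 1
    cp "$vmstat_src" "$out_dir/vmstat.$ts" || return 1
    # /proc/swaps has one header line; every further line is an active swap area.
    active=$(awk 'NR > 1 { n++ } END { print n + 0 }' "$swaps_src")
    echo "$ts active_swap_areas=$active" >> "$out_dir/swap_state.log"
}
```

It could be called once per second from a loop, e.g. `while true; do snapshot_vm ./trace; sleep 1; done`, while the workload runs.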
Comment 7 robert spitzenpfeil 2020-01-06 12:20:27 UTC
File size is a pretty good indicator.

I will run it again.
Comment 8 robert spitzenpfeil 2020-01-08 19:06:15 UTC
Yet another process that is affected by this:

Restoring a VM (Virtualbox) from a disk image (clonezilla), Host IO cache is on.

Most of the writes end up in buffers, at one point swapping kicks in, just a few 100s of MB, nothing serious. GUI performance gets rather choppy.

Turning swap off instantly resolves it on my i7.
Comment 9 robert spitzenpfeil 2020-01-08 22:58:02 UTC
Created attachment 827191 [details]
vmtraces running dd
Comment 10 Michal Hocko 2020-01-10 08:56:05 UTC
(In reply to robert spitzenpfeil from comment #9)
> Created attachment 827191 [details]
> vmtraces running dd

During the 76s captured here we have
pgalloc_dma 4009 and pgalloc_dma32 771840 allocated. That means 3G worth of memory allocated. pgfree 564330 pages have been freed during that time period and memory reclaim has recycled pgsteal_direct 195381 (direct reclaim) and pgsteal_kswapd  294145 (kswapd) which is 1.9G.  The reclaim effectiveness (scan/steal) was 92% for both the direct reclaim and kswapd. This looks reasonable. 53824 pages have been swapped out (210M) which is not that bad. But more importantly pswpin 2148 (8M) has been swapped back in. This means there were no real refaults from the swap going on so this is not really any form of a swap storm.

In other words, these numbers do not indicate any form of struggling. Either you haven't captured the counters while the system was really struggling, or something else is going on. Can you try to check what those processes are stuck on?
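
The sizes quoted in this analysis follow from the page counts if one assumes 4 KiB pages, the default on x86-64. A small sketch of that conversion (pages_to_mib is an illustrative name, and results are truncated to whole MiB):

```sh
# Convert a page count to whole MiB, assuming 4 KiB pages (x86-64 default).
pages_to_mib() {
    echo $(( $1 * 4 / 1024 ))
}

pages_to_mib $((4009 + 771840))     # pgalloc_dma + pgalloc_dma32   -> 3030
pages_to_mib $((195381 + 294145))   # pgsteal_direct + pgsteal_kswapd -> 1912
pages_to_mib 53824                  # pages swapped out              -> 210
pages_to_mib 2148                   # pswpin                         -> 8
```

These match the roughly 3G allocated, 1.9G reclaimed, 210M swapped out and 8M swapped in mentioned above.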
Comment 11 robert spitzenpfeil 2020-01-10 09:41:41 UTC
Could it be answered why it is swapping at all?

I question the rationale of swapping to increase buffers. As everything works fluently with swap off, I fail to see the necessity for swapping in the first place. My opinion: "just copy the damn data" and limit buffers to actually free RAM, shrink buffers when RAM is required elsewhere, THEN swap if absolutely necessary.

I will acquire some data on my i7 machine while doing the VM restore.
Comment 12 Michal Hocko 2020-01-10 13:41:23 UTC
(In reply to robert spitzenpfeil from comment #11)
> Could it be answered why it is swapping at all?
> 
> I question the rationale of swapping to increase buffers. As everything
> works fluently with swap off, I fail to see the necessity for swapping in
> the first place.

Well, you are right and this is what the memory reclaim implements. In fact the reclaim is heavily page cache biased. There is a heuristic to detect single pagecache access patterns. Anonymous memory is usually reclaimed only when the page cache gets really low.

I haven't checked the details yet (and won't get to analyze further before Tuesday), but there are several things that might be tried in the meantime.

a) rule out the memory cgroup controller. Background: the global reclaim tries to spread the memory pressure evenly across all cgroups. Some of them might be really low on pagecache, so swapout is preferred. This could be ruled out by booting with the cgroup_disable=memory kernel parameter

b) there is more going on and the page cache is really low

c) there is a lot of dirty page cache accumulated. This is not very likely because the reclaim efficiency is high, but it would be good to double check by reducing the amount of dirty data that might accumulate: set /proc/sys/vm/dirty_{background_}bytes to something relatively small (say 300MB for dirty_bytes and 100MB for background)

d) there is a bug in the kernel
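
For item c), the two files correspond to the vm.dirty_bytes and vm.dirty_background_bytes sysctls, which take values in bytes. The sketch below only prints the commands instead of running them, since changing the values needs root. It assumes that "300MB" and "100MB" are meant as MiB, and dirty_limit_cmds is an illustrative name.

```sh
# Print (not run) the sysctl commands for the limits suggested in item c).
# Sizes are given in MiB (an assumption); the kernel expects bytes.
dirty_limit_cmds() {
    dirty_mib=$1
    background_mib=$2
    echo "sysctl -w vm.dirty_bytes=$(( dirty_mib * 1024 * 1024 ))"
    echo "sysctl -w vm.dirty_background_bytes=$(( background_mib * 1024 * 1024 ))"
}

dirty_limit_cmds 300 100
```

Note that writing the *_bytes variants takes precedence over vm.dirty_ratio and vm.dirty_background_ratio, which then read back as 0.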
Comment 13 robert spitzenpfeil 2020-01-28 14:46:36 UTC
This bug report may be related.

https://bugzilla.kernel.org/show_bug.cgi?id=196729
Comment 14 Michal Hocko 2020-01-28 17:39:15 UTC
Apart from the things to try mentioned in comment 12, it would be interesting to see how the system behaves with upstream commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa reverted. Let me know if you need help with that.
Comment 15 robert spitzenpfeil 2020-04-10 19:26:21 UTC
It seems there have been improvements!

So far I cannot reproduce on my i7 machine, which is great.

I will try on the older core2 duo. I've also asked on the forum thread for others to test it again and report back.

Maybe it's gone :-)
Comment 16 robert spitzenpfeil 2020-04-10 20:33:59 UTC
I cannot reproduce on my old laptop either.

This is looking GOOD !
Comment 17 Vlastimil Babka 2020-04-14 12:25:30 UTC
What's the current kernel version that appears fixed?
Comment 18 robert spitzenpfeil 2020-04-14 14:42:15 UTC
I'm on TW 5.6.2-1-default and cannot reproduce on both of my laptops.

Someone in the forum has tested again and it seems to be OK now.

Recently I tested Debian 18.04 LTS and that is borked.


If it matters, I currently use:

vm.swappiness = 10
vm.vfs_cache_pressure = 50

BTW, these settings didn't do anything to the problem when it still existed.
Comment 19 Karl Mistelberger 2020-04-14 15:03:55 UTC
Linux erlangen 5.6.2-1-default #1 SMP Thu Apr 2 06:31:32 UTC 2020 (c8170d6) x86_64 x86_64 x86_64 GNU/Linux

https://forums.opensuse.org/showthread.php/538586-observing-excessive-swapping-when-copying-large-files?p=2932840#post2932840
Comment 20 Miroslav Beneš 2020-08-05 11:47:23 UTC
Robert, I guess you haven't encountered the problem again with 5.6 and later. Should we close with WORKSFORME then?
Comment 21 robert spitzenpfeil 2020-09-10 08:14:23 UTC
I haven't been affected by this in quite a while. Let's close it.
Comment 22 Miroslav Beneš 2020-09-10 09:05:15 UTC
Per comment 21.
Comment 23 Robert Delahunt 2020-09-19 11:10:05 UTC
Confirmed this bug exists in OpenSUSE LEAP 15.2.

https://forums.opensuse.org/showthread.php/544618-Hard-Disk-Activity-Memory-Hole?p=2965585#post2965585

Brand new Dell Inspiron 7591 laptop.  16GB RAM, 1TB SSD.

dd, rsync, and other operations that use lots of disk I/O result in the system dramatically digging into swap.  Even with swappiness=1, swap use (of 1GB swap) increased to 105 MB (10%).

Transcript of forum post:    ----- QUOTE BEGIN

I do not know where else to post this, so here goes. I have a clean basic XFCE installation of OpenSUSE LEAP 15.2. This behavior happens on both the Asus R541U laptop I used to have (8GB RAM, 512MB Swap) and my new Dell Inspiron 7591 (16GB RAM, 1GB swap). I would boot into OpenSUSE to do some "hard drive wrangling", i.e. making disk images of hard drives via USB adapters (dd if=/some/device | gzip -c > imagefile) or zeroizing old disks (dd if=/dev/zero of=/some/device).

As soon as I begin the dd process, my RAM and swap climb through the roof. Almost no applications are open when this occurs. For example:

1) When I was using my Asus to read the 512GB SSD via an adapter to another USB external hard drive (BACKUP) (i.e. dd if=/dev/nvme0n1 | gzip -c > /run/media/robert/BACKUP/Windows/dell7591.img.gz)

2) When I was backing up my files on the Asus (rsync -Hav /home/robert/ /run/media/robert/BACKUP/dell/robert/)

3) When I was copying the dd image to the new 1TB SSD upgrade for my dell (gunzip -c /run/media/robert/BACKUP/Windows/dell7591.img.gz | dd of=/dev/nvme0n1)

4) When I was just now zeroizing the old 512GB SSD via the same USB adapter (dd if=/dev/zero of=/dev/sdb)

5) When I was synchronizing my incremental monthly backups (both 2TB external USB drives running LUKS) (rsync -Hav --delete --progress /run/media/robert/BACKUP/ /run/media/robert/BACKUP2/ )

It always seems connected to rsync/gzip/dd, i.e. heavy use of filesystems. If I boot OpenSUSE and I am just sitting in OpenSUSE using applications, usually it does not cause me to dig into swap.

At the height of the zeroizing action, for example, swap use (16GB RAM, 1GB swap, new Dell Inspiron 7591) climbed to 108MB. It dropped to 11MB.

Given that I have 16GB of RAM, such behavior is absolutely unacceptable. All the I/O should be happening on disks. I have not been able to triangulate, using top, what process is eating RAM so much.

I am using EXT4 exclusively, no BTRFS anywhere.

I have remounted all tmpfs entries to only give them 1GB of RAM to work with, as in the past this has prevented such excessive swappiness (believe it or not; it's difficult to prove; older versions of OpenSUSE, etc).

I am willing to run experiments to see what's going on.

I noticed that there were some btrfs components of systemd that were installed. I uninstalled them, but the problem remains.

I don't understand how even running something complex as rsync + gzip + dd should need to dig into that much system RAM. I mean, I have 16GB!

Have any memory leaks been reported on OpenSUSE? ----- END QUOTE

I am very willing to provide any information to help resolve this apparent memory hole or memory leak.
Comment 24 Robert Delahunt 2020-09-19 11:14:14 UTC
I think I did this properly, but please help me because I'm new to BugZilla.  I saved this to OpenSUSE LEAP 15.2 because I experience it "in the wild" in LEAP 15.2.  Please forgive me if this is not the right way to do it.  Please contact me ASAP if you need anything: I really want to help the community.
Comment 25 Michal Hocko 2020-09-21 07:56:45 UTC
see comment 2
Comment 26 Miroslav Beneš 2020-09-21 07:57:57 UTC
I think it would have been better to open a new bug for Leap 15.2 and link it with this one, but whatever. Let's keep it here.

It is not surprising to see it strikes 15.2 too. The original bug was reported against 5.3 kernel, which is in 15.2. It got somehow fixed in 5.6 at the latest. We may try to find the fixes but they may be too intrusive to backport.

First, it would be nice to walk through the bug and provide the same info Michal and Vlastimil asked for in the original report. That is, vmstat logs, swap on/swap off behaviour and such.
Comment 27 Robert Delahunt 2020-09-21 15:45:10 UTC
I had reinstalled OpenSUSE LEAP 15.2 using the DVD but with network enabled, so what I should've gotten was a fresh installation of the most current stable OpenSUSE LEAP 15.2.

VMSTAT information as requested:

http://www.puresimplicity.net/~delahunt/vmstat/swapon/

http://www.puresimplicity.net/~delahunt/vmstat/swapoff/

Basically, I had reinstalled OpenSUSE LEAP 15.2 with the LUKS-contained LVM of /dev/system/home and /dev/system/swap but I had deleted the LV of swap and ran the system without swap.  So the swapoff is a recording of vmstat while I was doing

dd if=/dev/zero of=/dev/sdb bs=4K status=progress

and the swapon directory is after I went back into the partitioner, recreated the swap LV, turned swap on, then ran the dd command above all over again.

The system immediately dug into swap to about the 40MB mark.  Running free -m second by second, I could see available RAM plummet and swap climb.

I put other assorted diagnostic information in http://www.puresimplicity.net/~delahunt/vmstat such as dmesg, cpuinfo, lsmod, rpms, etc.

I noted that the partitioner installed a package called nvme-cli-1.10-lp152.1.3.x86_64 when I clicked "accept" to add the swap LV.

I am very determined to help get this fixed, so please notify me immediately if there's anything else I can help with.
Comment 28 Robert Delahunt 2020-09-21 18:19:14 UTC
Please note that I had set vm.vfs_cache_pressure=200 on this installation, so both the vmstat results above were while vfs_cache_pressure=200.

I noticed that, while the system ran mostly fine with this set and with no swap, at boot sometimes the system seems to have some sort of deep process bogging down, as the keyboard (for instance) seems to not register every 5th key or so.  I have to very closely watch the asterisks on log-in pages or else it throws off password typing, etc.

Might be an unrelated bug, not sure.  I expect that, because this laptop is brand new, there might be some early kernel bugs or new hardware issues, and I'd absolutely love to help out in any way I can.

Even if it means running a debug kernel, if you can show me how.  I have enabled multiple ACPI and other kernel boot time parameters in the boot command line to see if maybe one of these helps us find the problem.
Comment 29 Andrei Borzenkov 2020-09-21 18:47:42 UTC
(In reply to Miroslav Beneš from comment #26)
> It is not surprising to see it strikes 15.2 too. The original bug was
> reported against 5.3 kernel, which is in 15.2. It got somehow fixed in 5.6
> at the latest. We may try to find the fixes but they may be too intrusive to
> backport.
> 

The commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa mentioned in comment #14 was effectively removed in commit b91ac374346ba206cfd568bb0ab830af6b205cfd which went into 5.5. I actually observed quite similar symptoms in Ubuntu 18.04 as soon as it bumped HWE kernel to 5.3 and had to install 5.5 (5.4 had the same issue).

I do not know if b91ac374346ba206cfd568bb0ab830af6b205cfd alone can be back ported but may be 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa could be reverted as this is what happened in upstream anyway.
Comment 30 Robert Delahunt 2020-09-21 18:51:30 UTC
Would you recommend the user (me) build or install one of the newer Linux kernels?  Do you have a specific version or tree that you would prefer I attempt to build or install?  Please let me know.  I may decide, some time today, to grab the latest stable Linux kernel and build it anyways, just for "fun," after my class today.
Comment 31 Robert Delahunt 2020-09-21 20:33:44 UTC
I ran the dd exercise booting into the OpenSUSE LEAP 15.2 debug kernel.  The dmesg output:

http://www.puresimplicity.net/~delahunt/vmstat/dmesg_debug.txt

Is the result of doing so and then running the dd command.  Logs filled up pretty quick, etc.  Hope this helps someone debug the problem.
Comment 32 Michal Hocko 2020-09-22 08:17:20 UTC
(In reply to Robert Delahunt from comment #30)
> Would you recommend the user (me) build or install one of the newer Linux
> kernels?  Do you have a specific version or tree that you would prefer I
> attempt to build or install?  Please let me know.  I may decide, some time
> today, to grab the latest stable Linux kernel and build it anyways, just for
> "fun," after my class today.

Running with the latest Linus' tree with the same config might tell us more. There are certainly other changes in MM that might make a difference. 2c012a4ad1a2 ("mm: vmscan: scan anonymous pages on file refaults") can definitely cause more swapping. I was not particularly happy about the patch (https://lore.kernel.org/linux-mm/20190712071359.GN29483@dhcp22.suse.cz/). Another option would be trying with that one reverted. Let me know if you need help with that.
Comment 33 Michal Hocko 2020-09-22 09:19:13 UTC
(In reply to Robert Delahunt from comment #27)
> I had reinstalled OpenSUSE LEAP 15.2 using the DVD but with network enabled,
> so what I should've gotten was a fresh installation of the most current
> stable OpenSUSE LEAP 15.2.
> 
> VMSTAT information as requested:
> 
> http://www.puresimplicity.net/~delahunt/vmstat/swapon/
> 
> http://www.puresimplicity.net/~delahunt/vmstat/swapoff/
> 
> Basically, I had reinstalled OpenSUSE LEAP 15.2 with the LUKS-contained LVM
> of /dev/system/home and /dev/system/swap but I had deleted the LV of swap
> and ran the system without swap.  So the swapoff is a recording of vmstat
> while I was doing
> 
> dd if=/dev/zero of=/dev/sdb bs=4K status=progress
> 
> and the swapon directory is after I went back into the partitioner,
> recreated the swap LV, turned swap on, then ran the dd command above all
> over again.

I have looked at swapon data.
                First vmstat    Last vmstat[diff]
pgscan_direct   2383            0
pgscan_kswapd   14565113        424280
pgsteal_kswapd  14513308        423192

No direct reclaim, so kswapd was able to cope with the allocation pace. The overall reclaim efficiency is nice as well (99%) and the reclaim itself has freed 1.6G worth of memory

pgalloc_dma32   2334299         467232
pgalloc_normal  24091786        3564456

while 15.7G of memory was requested during that time period.

pswpin          0               0
pswpout         0               11842

No memory has been swapped in while 46M has been swapped out. This on its own doesn't sound overly excessive. I would be much more worried if pswpin was high because that would suggest that memory actively in use has been swapped out and so the owner would see larger latencies on refault.
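
The efficiency figure can be recomputed from the kswapd deltas as pages stolen per page scanned. A minimal sketch (reclaim_efficiency is an illustrative name; awk is used only for the floating-point division):

```sh
# Reclaim efficiency in percent: pages stolen (reclaimed) per page scanned.
reclaim_efficiency() {
    awk -v steal="$1" -v scan="$2" 'BEGIN {
        if (scan == 0) { print "n/a"; exit }   # nothing scanned, no ratio
        printf "%.1f\n", 100 * steal / scan
    }'
}

reclaim_efficiency 423192 424280    # pgsteal_kswapd and pgscan_kswapd deltas -> 99.7
```

A value close to 100 means almost every scanned page could be reclaimed right away.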

workingset_activate     86      41
workingset_refault      10237   141
workingset_restore      0       0

These are stats for disk based page cache refaults. workingset_refault will tell us how many page cache pages have been reclaimed and then faulted back in again. 141 pages is really minuscule. workingset_activate tells us how many pages were reclaimed recently so we should consider them active. workingset_restore will tell us that the refault is happening on a previously active page. All in all not much of a refault activity. If there is enough clean page cache then we shouldn't swap at all.

There are three jumps in swapout
vmstat.1600702573:pswpout 0
vmstat.1600702574:pswpout 2448
[...]
vmstat.1600702577:pswpout 5555
vmstat.1600702578:pswpout 10605
vmstat.1600702579:pswpout 11330

The biggest one swapped out 5050 pages within one second.
                 vmstat.1600702577    1600702578 [diff]
nr_active_anon          373934        -3376
nr_active_file          131158        0
nr_inactive_anon        48441         8556
nr_inactive_file        3288371       1169
nr_dirty                632308        28

There is plenty of inactive pagecache. A large part is dirty but there should be still a lot of clean page cache to reclaim. The file inactive list is clearly not low on the global level.

workingset_refault      10281          0

no refaults detected so the heuristic from 2c012a4ad1a2 shouldn't trigger.

These are all global numbers. The picture would be quite different if memory cgroups were deployed, though. I asked earlier for the behavior with cgroup_disable=memory on the kernel command line. See comment 12 for more information.
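
Jumps like the pswpout ones above can be spotted by printing the increase of a counter from one snapshot to the next. The sketch below assumes input lines of the form "vmstat.<timestamp>:pswpout <value>", which is what grep prints when given several files; counter_deltas is an illustrative name.

```sh
# Print the increase of one counter between consecutive snapshots.
# Expects lines such as "vmstat.1600702577:pswpout 5555" on stdin, e.g. from:
#   grep '^pswpout ' vmstat.*
# (the glob sorts by time as long as all timestamps have the same length).
counter_deltas() {
    awk '{
        split($1, parts, ":")          # parts[1] = file name, parts[2] = counter
        if (NR > 1) print parts[1], $2 - prev
        prev = $2
    }'
}
```

Fed with the three consecutive samples ending in ...577, ...578 and ...579, it reports increases of 5050 and 725 pages.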
Comment 34 Robert Delahunt 2020-09-22 12:10:22 UTC
Would using a tumbleweed kernel give you good information?  If so, please tell me what specific tumbleweed packages I need to install and what LEAP 15.2 packages I need to remove in order to test this theory.

Also, would you have me grab the latest stable kernel and "yes | make oldconfig" and then see how that goes?  I haven't built a kernel for OpenSUSE LEAP 15.2 before, so I might need a tutorial on how to mkinitrd.

Please notify me.
Comment 35 Michal Hocko 2020-09-22 13:04:05 UTC
(In reply to Robert Delahunt from comment #34)

Please start with cgroup_disable=memory on your current kernel first. If this works around the problem, then my theory about proportional reclaim distributing memory pressure to mostly-anonymous cgroups would be a good fit.

After that is confirmed then it would be great to run with the current kernel. Installing one from http://download.opensuse.org/repositories/Kernel:/stable/standard/ should do it.
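
Before collecting data it may be worth confirming that the parameter really took effect at boot. A sketch that looks for an exact match in /proc/cmdline (has_boot_param is an illustrative name, and the optional second argument exists only so the function can be tried against a test file):

```sh
# Return success if the given parameter appears on the kernel command line.
# The second argument (default /proc/cmdline) is only there for testing.
has_boot_param() {
    param=$1
    cmdline_file=${2:-/proc/cmdline}
    # One parameter per line, then match the whole line as a fixed string.
    tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"
}

if has_boot_param cgroup_disable=memory; then
    echo "memory cgroup controller disabled at boot"
else
    echo "cgroup_disable=memory not found on the kernel command line"
fi
```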
Comment 36 Robert Delahunt 2020-09-22 13:34:39 UTC
http://www.puresimplicity.net/~delahunt/vmstat/cgroup_disable/

Here are some vmstats for the stock OpenSUSE LEAP 15.2 kernel.  I started the dd command then realized I don't have a swap set, so I created one and added it.  As soon as I ran swapon, it climbed to about 20MB or so.  Still a bit better than the excessive swapping, but still, with 16GB of main RAM and nothing but Chrome running, that's excessive.

Let me know if this is enough or you want me to install the latest kernel.  I'm very eager to help.
Comment 37 Michal Hocko 2020-09-22 15:57:22 UTC
(In reply to Robert Delahunt from comment #36)
> http://www.puresimplicity.net/~delahunt/vmstat/cgroup_disable/

Thanks. The data seem to be in line with what we have seen previously:
                        vmstat.1600781117   1600781118 [diff]
nr_active_anon          166943              -3026
nr_active_file          85576               3
nr_inactive_anon        33207               8180
nr_inactive_file        3540169             9504
nr_dirty                667650              -203
workingset_activate     154                 0
workingset_refault      2455                1
workingset_restore      0                   0
pswpin                  0                   0
pswpout                 2448                3118

The inactive list is really large and mostly clean so there shouldn't be any reason to swap out. I suspect the reclaim is confused for some reason. Again anonymous inactive list is low and needs rotation but I fail to see any reason why it should get reclaimed. get_scan_count should opt for page cache reclaim only.
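
As a rough cross-check of "mostly clean", the dirty page count can be compared with the size of the inactive file list. This is only an upper bound under a stated assumption: nr_dirty counts dirty file pages on any list, so the real dirty share of the inactive list can only be lower. dirty_share is an illustrative name.

```sh
# Upper bound on the dirty share of the inactive file list, in percent.
# nr_dirty counts dirty pages on any list, so the real share can only be lower.
dirty_share() {
    awk -v dirty="$1" -v inactive_file="$2" 'BEGIN {
        if (inactive_file == 0) { print "n/a"; exit }
        printf "%.1f\n", 100 * dirty / inactive_file
    }'
}

dirty_share 667650 3540169    # nr_dirty and nr_inactive_file from above -> 18.9
```

So at least about four fifths of the inactive file list should have been clean at that point.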

Could you give the newer kernel a try as noted in previous comment, please?
Comment 38 Robert Delahunt 2020-09-22 16:00:25 UTC
I will install the kernel in the repository in the link.  Please reply soon with what specific packages I need to install from it, and/or any other information, as I have only ever ran a different kernel than stock twice.  "Back in my day" I would compile a static kernel for Slackware-Current.  Now, however, I will need a slight bit of coaching.  If this needs to come over direct email or text or whatever, please let me know.  I will boot back into OpenSUSE LEAP 15.2 and await your instructions while adding the repo.
Comment 39 Michal Hocko 2020-09-22 16:20:53 UTC
(In reply to Robert Delahunt from comment #38)
> I will install the kernel in the repository in the link.  Please reply soon
> with what specific packages I need to install from it, and/or any other
> information, as I have only ever ran a different kernel than stock twice. 
> "Back in my day" I would compile a static kernel for Slackware-Current. 
> Now, however, I will need a slight bit of coaching.  If this needs to come
> over direct email or text or whatever, please let me know.  I will boot back
> into OpenSUSE LEAP 15.2 and await your instructions while adding the repo.

Installing the kernel should be sufficient AFAIK.
Comment 40 Robert Delahunt 2020-09-22 16:25:40 UTC
http://www.puresimplicity.net/~delahunt/vmstat/suse_stable/

I guess I didn't need help.

So I got the new kernel installed and selected it at boot.  Ran the same dd test.  I left it running and the system never used swap.  /proc/sys/vm/swappiness still = 60.
Comment 41 Michal Hocko 2020-09-22 17:05:43 UTC
(In reply to Robert Delahunt from comment #40)
> http://www.puresimplicity.net/~delahunt/vmstat/suse_stable/
> 
> I guess I didn't need help.
> 
> So I got the new kernel installed and selected it at boot.  Ran the same dd
> test.  I left it running and the system never used swap. 
> /proc/sys/vm/swappiness still = 60.

OK, this is good to know. Newer kernels have changes which check refaults on anonymous memory as well so this has likely changed the balance. These would be out of scope for 15.2 unfortunately.

Vlastimil, I remember we have discussed this problem in upstream some time ago. You've had a patch which has disabled the heuristic (2c012a4ad1a2). Testing with that reverted would sound like a good next step.
Comment 42 Vlastimil Babka 2020-09-23 12:02:53 UTC
(In reply to Michal Hocko from comment #41)
> Vlastimil, I remember we have discussed this problem in upstream some time
> ago. You've had a patch which has disabled the heuristic (2c012a4ad1a2).
> Testing with that reverted would sound like a good next step.

I think the past discussion was about us *not* having 2c012a4ad1a2 (in an older kernel) as the problem was different - file pages thrashing while unused anonymous pages sit idly. See https://lore.kernel.org/linux-mm/b7f5e356-1f0a-98be-4a32-09a766c3949b@suse.cz/

Anyway, what is the actual observed issue here? Is it that part of the swap gets used? I think Michal's analysis in comment 33 shows the swapped out pages are not accessed (no increase in pswpin) so it shouldn't actually cause excessive IO. So is it only that the swap being used looks bad?

If there's really observed performance issue (e.g. system being sluggish) while doing the operations listed in comment 23, does disabling swap completely make any difference? If not, we might be looking at a red flag here, IMHO.
Comment 43 Michal Hocko 2020-09-23 12:55:16 UTC
(In reply to Vlastimil Babka from comment #42)
> (In reply to Michal Hocko from comment #41)
> > Vlastimil, I remember we have discussed this problem in upstream some time
> > ago. You've had a patch which has disabled the heuristic (2c012a4ad1a2).
> > Testing with that reverted would sound like a good next step.
> 
> I think the past discussion was about us *not* having 2c012a4ad1a2 (in an
> older kernel) as the problem was different - file pages thrashing while
> unused anonymous pages sit idly. See
> https://lore.kernel.org/linux-mm/b7f5e356-1f0a-98be-4a32-09a766c3949b@suse.
> cz/

Ahh, I remember now.
 
> Anyway, what is the actual observed issue here? Is it that part of the swap
> gets used? I think Michal's analysis in comment 33 shows the swapped out
> pages are not accessed (no increase in pswpin) so it shouldn't actually
> cause excessive IO. So is it only that the swap being used looks bad?

Yes this is the case here. But I am more worried this is a more general problem that might actually hit somewhere else. There shouldn't be really any real reason to swap out anything with that much of easily reclaimable page cache which doesn't refault heavily. Remember this is a simple stream writer usecase. That shouldn't really disrupt anonymous memory users.

I am quite busy now but I will try to prepare a kernel with 2c012a4ad1a2 reverted because that might be easier to adopt in 15.2 resp SLE15-SP2 kernels than the current upstream which is likely fixing the problem by applying the refault logic to the anonymous memory as well.

Thanks Vlastimil!
Comment 44 Robert Delahunt 2020-09-23 13:04:40 UTC
This is a problem because with default kernel VM settings and a swap, a 16GB system using dd/gzip/rsync is heavily impacted.  For instance, I can connect my external 1TB hard drive and (/home LUKS -> external 1TB LUKS) have 100MB or higher swap utilization.  And that's the first command being run when the system is booted.

Changing swappiness to 1 and VFS cache pressure to 200 doesn't eliminate swapping.

System bogs very drastically, even with / being housed on a brand new 1TB Kingston SSD.

I understand that maybe some of this is intrinsic to the older kernel plus the brand new hardware, but still, I've never seen previous versions of OpenSUSE dig so heavily into swap just backing up my stuff to my 1TB external, for instance.

At some points the system lags so bad that the mouse slows and the system (for all intents and purposes) behaves like it's locked up.  Getting to a virtual terminal is possible, so the system isn't locked, but it drags down all of X and XFCE with it.  (Which is noteworthy: user is not using a "larger" WM/DE like KDE/Gnome/MATE.)

So it's basically every disk I/O.  For instance, I got a new MicroSD to put college stuff on (Windows vs Linux, so that my college documents are "portable" in case of a problem or in case I need to do work at school) and even putting maybe 1GB of documents on that 64GB MicroSD caused the system to dig into swap.  So it's literally every Disk I/O.

Running the original OpenSUSE LEAP 15.2 kernel with the swappiness and cache pressure variables modified but without a swap alleviated half the issues, but it still caused (when the system reached the end of RAM and had to "move things around") the system to lag pretty bad.

These issues seem to be completely gone with the bleeding edge kernel.

Please consider this a serious issue.  Maybe on this fast a system, a user would be willing to ignore it.  But it affects OpenSUSE as a whole in that anyone who may be trying OpenSUSE but sees this behavior may just decide to burn a different distribution to DVD and install something else.  Which may affect their perception of SUSE Enterprise Linux as a result.

For me, I seriously had the thought to switch distributions.  And I've been using OpenSUSE since at least 42.3.  Of course, I didn't, but still....

I can't tell you what to do, I would just beg you to consider this a serious issue.
Comment 45 Michal Hocko 2020-09-23 13:38:37 UTC
(In reply to Andrei Borzenkov from comment #29)
> (In reply to Miroslav Beneš from comment #26)
> > It is not surprising to see it strikes 15.2 too. The original bug was
> > reported against 5.3 kernel, which is in 15.2. It got somehow fixed in 5.6
> > at the latest. We may try to find the fixes but they may be too intrusive to
> > backport.
> > 
> 
> The commit 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa mentioned in comment #14
> was effectively removed in commit b91ac374346ba206cfd568bb0ab830af6b205cfd
> which went into 5.5. I actually observed quite similar symptoms in Ubuntu
> 18.04 as soon as it bumped HWE kernel to 5.3 and had to install 5.5 (5.4 had
> the same issue).
> 
> I do not know if b91ac374346ba206cfd568bb0ab830af6b205cfd alone can be back
> ported but may be 2c012a4ad1a2cd3fb5a0f9307b9d219f84eda1fa could be reverted
> as this is what happened in upstream anyway.

I have read through this bugzilla again and noticed that I had missed this comment previously. So reverting 2c012a4ad1a2c is not really straightforward, exactly because of b91ac374346, which the openSUSE-15.2 kernel has as well. And looking closer, it can contribute to the problem itself, mostly because of
+               /*
+                * When refaults are being observed, it means a new
+                * workingset is being established. Deactivate to get
+                * rid of any stale active pages quickly.
+                */
+               refaults = lruvec_page_state(target_lruvec,
+                                            WORKINGSET_ACTIVATE);
+               if (refaults != target_lruvec->refaults ||
+                   inactive_is_low(target_lruvec, LRU_INACTIVE_FILE))
+                       sc->may_deactivate |= DEACTIVATE_FILE;
+               else
+                       sc->may_deactivate &= ~DEACTIVATE_FILE;
[...]
+       if (file >> sc->priority && !(sc->may_deactivate & DEACTIVATE_FILE))
+               sc->cache_trim_mode = 1;
+       else
+               sc->cache_trim_mode = 0;

Note how refaults != target_lruvec->refaults can easily move us to SCAN_FRACT even if there is a lot of page cache after a single activation. 2c012a4ad1a2c was less aggressive in that regard because it only forced active -> inactive rebalance on an activation.
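
The effect of the quoted hunk can be illustrated with a toy model. The Python sketch below is a paraphrase of just those lines, not kernel code: the function name, the flat arguments, the flag value 2 and the priority 12 used in the usage note are illustrative assumptions.

```python
# Toy model of the quoted hunk. Assumption: the flag value is illustrative.
DEACTIVATE_FILE = 0x2


def cache_trim_mode(file_pages, priority, refaults_now, refaults_prev,
                    inactive_is_low):
    """Return True when the hunk would set sc->cache_trim_mode = 1.

    file_pages    stands in for the 'file' variable in the hunk
    refaults_now  stands in for lruvec_page_state(..., WORKINGSET_ACTIVATE)
    refaults_prev stands in for target_lruvec->refaults
    """
    may_deactivate = 0
    if refaults_now != refaults_prev or inactive_is_low:
        may_deactivate |= DEACTIVATE_FILE
    else:
        may_deactivate &= ~DEACTIVATE_FILE
    return bool(file_pages >> priority) and not (may_deactivate & DEACTIVATE_FILE)
```

With a large file page count and an unchanged refault counter, for example cache_trim_mode(3524850, 12, 146, 146, False), the model stays in cache trim mode. Changing the counter by a single activation, cache_trim_mode(3524850, 12, 147, 146, False), turns it off regardless of how much page cache there is.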

I might be misreading this; the logic is quite convoluted. But it should be pretty straightforward to drop this patch and have you retest it.

Ccing Mel as well.
Comment 46 Michal Hocko 2020-09-23 14:02:22 UTC
The kernel with b91ac374346 dropped should appear in https://download.opensuse.org/repositories/home:/mhocko:/bsc1159882/standard once it has been built. Please give it a try with the same setup as previously.
Comment 47 Robert Delahunt 2020-09-24 00:43:52 UTC
Ran the provided kernel.  It seemed to do well.

http://www.puresimplicity.net/~delahunt/vmstat/mhock/

Although swappiness = 1 and vfs cache pressure = 200, it didn't seem to go beyond 6 GB of RAM usage.  No swap seemed to get used.  I was creating 1GB files full of zeros and I ran my rsync command to back up my files.
Comment 48 Robert Delahunt 2020-09-24 00:52:36 UTC
(In reply to Robert Delahunt from comment #47)
> Ran the provided kernel.  It seemed to do well.
> 
> http://www.puresimplicity.net/~delahunt/vmstat/mhock/
> 
> Although swappiness = 1 and vfs cache pressure = 200, it didn't seem to go
> beyond 6 GB of RAM usage.  No swap seemed to get used.  I was creating 1GB
> files full of zeros and I ran my rsync command to back up my files.

DISREGARD, I realized I had booted into Linux to copy my Music (13GB) to my external hard drive (USB-C enclosure for my NVME 512GB SSD).  As I was doing so, I saw free RAM falling and then fired up vmstat again.  Swap usage climbed (literally only used Yast and Chrome after a reboot).  Check the second set of vmstat logs after the time delay.

Sorry about that, I spoke too soon.
Comment 49 Karl Mistelberger 2020-09-24 07:05:38 UTC
(In reply to Robert Delahunt from comment #48)
> DISREGARD, I realized I had booted into Linux to copy my Music (13GB) to my
> external hard drive (USB-C enclosure for my NVME 512GB SSD).

For reliable testing I always copied a large drive:

https://forums.opensuse.org/showthread.php/538586-observing-excessive-swapping-when-copying-large-files?p=2932840#post2932840
Comment 50 Michal Hocko 2020-09-24 13:30:15 UTC
Do you have vmstats from the swapping situation?
Comment 51 Robert Delahunt 2020-09-24 13:36:09 UTC
http://www.puresimplicity.net/~delahunt/vmstat/mhock/

Like I said, the vmstats after the time delay. They are in this directory.

Thanks for your diligence! :-)
Comment 52 Michal Hocko 2020-09-25 14:57:43 UTC
(In reply to Robert Delahunt from comment #51)
> http://www.puresimplicity.net/~delahunt/vmstat/mhock/
> 
> Like I said, the vmstats after the time delay. They are in this directory.

I've misunderstood your comment. Anyway.
The system started swapping out at 1600908544 and continued until 1600908551, growing to 10194, and stayed there for some time before repeating a similar pattern.
                     vmstat.1600908545       vmstat.1600908546 [diff]
nr_active_anon       200295                  -1660
nr_active_file       126906                  6879
nr_inactive_anon     35519                   7892
nr_inactive_file     3524850                 -5187
nr_dirty             27282                   -18238
pswpout              271                     6282
workingset_activate  146                     19
workingset_refault   146                     19

So in overall numbers there is a huge amount of clean page cache. There are some refaults and all of them are even activations. But the number is still very small compared to the actual page cache in general.

pgscan_kswapd         59636                  48139
pgsteal_kswapd        34126                  41061
pgscan_direct         0                      0

kswapd has reclaimed 41k pages, but let me point out that the overall number of anonymous pages has increased in total. So it is not just the streaming IO that is going on. We know that ~15% of the reclaimed memory was anonymous (and swapped out); the rest must have been page cache. If this was fully proportional (swappiness) then the percentage would be different. So I suspect that there is still a prevalent pagecache-only reclaim happening, with some occasional runs based on refault information. We also age the anonymous active list quite a lot, but that shouldn't really lead to swapout on its own. It however points a finger at 2c012a4ad1a2c.
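
The ~15% figure can be reproduced from the two deltas quoted above. A minimal Python sketch, assuming that each pswpout increment corresponds to one reclaimed anonymous page; the function name is made up for illustration:

```python
def swapped_share(pswpout_delta, pgsteal_delta):
    """Fraction of reclaimed pages that went to swap in one interval."""
    return pswpout_delta / pgsteal_delta


# Deltas between vmstat.1600908545 and vmstat.1600908546 from the tables above.
share = swapped_share(pswpout_delta=6282, pgsteal_delta=41061)
print(f"{share:.1%}")  # prints 15.3%
```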

I haven't checked the full data set. It would be worth having another test with 2c012a4ad1a2c reverted before we spend more time on the data. I will upload a new kernel to the same location. Please note that the new kernel will have a different release number (bsc1159882_2).
Comment 53 Michal Hocko 2020-09-29 11:07:57 UTC
(In reply to Michal Hocko from comment #52)
> It would be worth having another test
> with 2c012a4ad1a2c reverted before we spend more time on the data. I will
> upload a new kernel to the same location. Please note that the new kernel
> will have a different release number (bsc1159882_2).

Any news?
Comment 54 Robert Delahunt 2020-09-29 18:22:30 UTC
Sorry, today is crunch time for my graduate college courses.  I should be able to get to it tomorrow, 9/30/20.  I'll do my best to get to it ASAP.  This laptop has had a RAM upgrade to 32GB, by the way, which you'll probably notice in my next vmstat post.
Comment 55 Michal Hocko 2020-09-30 06:58:37 UTC
(In reply to Robert Delahunt from comment #54)
> Sorry, today is crunch time for my graduate college courses.  I should be
> able to get to it tomorrow, 9/30/20.  I'll do my best to get to it ASAP.

No rush.
Comment 56 Robert Delahunt 2020-10-01 15:46:40 UTC
I do not see a release that is listed as _2 at the end, not from Yast Software or your direct link.  Please advise.
Comment 57 Robert Delahunt 2020-10-01 16:04:41 UTC
Nevermind, I re-checked and saw the date stamp was 25 September, so I reinstalled (what should be) the new kernel.  Here are your new vmstats:

http://www.puresimplicity.net/~delahunt/vmstat/mhocko2/

I have 32GB of RAM now but still it dug into about 20 MB of swap, even with swappiness=1.

Changing swappiness to 60 during this operation didn't seem to influence how much swap it was using, as it still hovered around 20MB or so.

It does this both with a file operation (copying large files to an external 512GB SSD in an enclosure) or zeroizing this drive when finished (dd if=/dev/zero of=/dev/sdc1 bs=4K count=1024 etc)

I noticed that running sync after the file copy operation (i.e. after terminating the copy command) dug into swap and took a long while.

Please advise.
Comment 58 Michal Hocko 2020-10-02 09:08:20 UTC
I didn't get to process the data yet and am unlikely to do so sooner than next week. I am quite surprised that you still see a swapout though. I assume you have double checked the correct kernel is booted, right? (sorry about the stupid question but with more kernels involved this can happen).

Have you tried to test with cgroup_disable=memory as well?
Comment 59 Robert Delahunt 2020-10-02 18:49:19 UTC
Your most recent kernel with cgroup_disable=memory shows no swap usage when copying 16GB of data between drives.  There was plenty of time to observe RAM get used up (monitoring free -m every second using a script) but it did not resort to using swap.  I double-checked and swappiness is set to 60 right now, so it would have had plenty of authority to swap out.  vfs_cache_pressure=100 as well.  Default system values.
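
For anyone who wants to watch swap usage without parsing free -m output, the same number can be derived from /proc/meminfo. A minimal Python sketch, assuming the usual SwapTotal and SwapFree fields reported in kB; the function name is made up for illustration:

```python
def swap_used_kib(meminfo_text):
    """Compute used swap in KiB from the contents of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            # The first token after the colon is the numeric value.
            fields[key.strip()] = int(parts[0])
    return fields["SwapTotal"] - fields["SwapFree"]


# Polling once per second on a live system could look like this:
#   import time
#   while True:
#       with open("/proc/meminfo") as f:
#           print(swap_used_kib(f.read()))
#       time.sleep(1)
```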

http://www.puresimplicity.net/~delahunt/vmstat/mhocko3/

There are the vmstat files for your convenience.

Please let me know what else I can do to help eradicate this bug.
Comment 60 Michal Hocko 2020-10-05 06:25:30 UTC
(In reply to Robert Delahunt from comment #59)
> Your most recent kernel with cgroup_disable=memory shows no swap usage when
> copying 16GB of data between drives.

Thanks! Do you happen to use the memory cgroup controller intentionally, or is it being used automagically? I suspect the latter. As already mentioned earlier (comment 12) the global memory pressure is spread over all existing memory cgroups.

Anyway, your earlier tests suggested that cgroup_disable on its own didn't help and we need to have the 2 patches reverted. I will mull this over some more, but unless Vlastimil or Mel oppose, I will go ahead and revert both in the 15sp2 and openSUSE-15.2 branches. For a better experience with cgroups enabled I would recommend using the most recent kernel (e.g. one from our stable repository).
Comment 61 Robert Delahunt 2020-10-05 17:13:10 UTC
I'm just an average Joe, I don't even know what cgroups are for.
Comment 62 Michal Hocko 2020-10-06 07:46:42 UTC
(In reply to Robert Delahunt from comment #61)
> I'm just an average Joe, I don't even know what cgroups are for.

I would suspect some service has enabled the memory controller. Or maybe systemd on your system does that but I believe that the version we have in OS15.2 doesn't do that yet. Michal Koutny would know better and maybe give you better clues how to find out.

For the general cgroups setup, please provide
mount | grep cgroup

Next steps will depend on the output.
Comment 63 Michal Koutný 2020-10-06 11:03:39 UTC
(In reply to Michal Hocko from comment #62)
> For the general cgroups setup, please provide
> mount | grep cgroup
Unless explicitly disabled (with kernel cmdline), the memory hierarchy (root only) would be always mounted. To get the information about fine-grained grouping, I suggest
> find /sys/fs/cgroup/memory -type d

Additionally, if there's a non-trivial structure, you can track the originating service by looking at Memory* directives (MemoryDenyWriteExecute= is irrelevant)
> systemctl cat "*.service" | grep -E "# /|Memory"
Comment 64 Robert Delahunt 2020-10-06 11:58:23 UTC
mount | grep cgroup :

tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)

systemctl cat "*.service" | grep -E "# /|Memory"

# /usr/lib/systemd/system/systemd-update-utmp.service
# /usr/lib/systemd/system/kbdsettings.service
# /usr/lib/systemd/system/auditd.service
## /etc/systemd/system/auditd.service and add network-online.target
# /usr/lib/systemd/system/apparmor.service
# /usr/lib/systemd/system/getty@.service
# /usr/lib/systemd/system/getty@tty1.service.d/noclear.conf
# /usr/lib/systemd/system/systemd-fsck@.service
# /usr/lib/systemd/system/lvm2-pvscan@.service
# /usr/lib/systemd/system/user@.service
# /usr/lib/systemd/system/kmod-static-nodes.service
# /usr/lib/systemd/system/detect-part-label-duplicates.service
# /usr/lib/systemd/system/systemd-user-sessions.service
# /usr/lib/systemd/system/systemd-ask-password-plymouth.service
# /usr/lib/systemd/system/systemd-journald.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/systemd-fsck@.service
# /usr/lib/systemd/system/systemd-udevd.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/rtkit-daemon.service
# /usr/lib/systemd/system/accounts-daemon.service
# /usr/lib/systemd/system/upower.service
MemoryDenyWriteExecute=true
# /usr/lib/systemd/system/firewalld.service
# /usr/lib/systemd/system/smartd.service
# /usr/lib/systemd/system/systemd-backlight@.service
# /usr/lib/systemd/system/lvm2-monitor.service
# /usr/lib/systemd/system/bluetooth.service
# /usr/lib/systemd/system/sshd.service
# /usr/lib/systemd/system/dbus.service
# /usr/lib/systemd/system/systemd-remount-fs.service
# /usr/lib/systemd/system/cron.service
# /usr/lib/systemd/system/avahi-daemon.service
# /usr/lib/systemd/system/systemd-journal-flush.service
# /usr/lib/systemd/system/user@.service
# /usr/lib/systemd/system/colord.service
# /usr/lib/systemd/system/systemd-tmpfiles-setup.service
# /usr/lib/systemd/system/display-manager.service
# /usr/lib/systemd/system/cups.service
# /usr/lib/systemd/system/udisks2.service
# /usr/lib/systemd/system/ModemManager.service
# /usr/lib/systemd/system/wpa_supplicant.service
# /usr/lib/systemd/system/postfix.service
# /run/systemd/generator/systemd-cryptsetup@cr\x2dauto\x2d1.service
# /usr/lib/systemd/system/mcelog.service
# /usr/lib/systemd/system/systemd-logind.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/rsyslog.service
# /usr/lib/systemd/system/NetworkManager.service
# /usr/lib/systemd/system/NetworkManager.service.d/NetworkManager-ovs.conf
# /usr/lib/systemd/system/nscd.service
# /usr/lib/systemd/system/../../dracut/modules.d/98dracut-systemd/dracut-shutdown.service
# /usr/lib/systemd/system/irqbalance.service
# /usr/lib/systemd/system/haveged.service
# /usr/lib/systemd/system/systemd-backlight@.service
# /usr/lib/systemd/system/fwupd.service
# /usr/lib/systemd/system/polkit.service
# /usr/lib/systemd/system/iscsi.service
# /usr/lib/systemd/system/systemd-sysctl.service
# /usr/lib/systemd/system/systemd-sysctl.service.d/50-kernel-uname_r.conf
# /usr/lib/systemd/system/systemd-udev-trigger.service
# /usr/lib/systemd/system/systemd-random-seed.service
# /usr/lib/systemd/system/systemd-fsck-root.service
# /usr/lib/systemd/system/klog.service
# /lib/systemd/system/klog.service
# /usr/lib/systemd/system/systemd-modules-load.service
# /usr/lib/systemd/system/systemd-tmpfiles-setup-dev.service

What are we investigating?

By the way this is the stock kernel with cgroup_disable=memory boot parameter and no swap.
Comment 65 Michal Hocko 2020-10-06 12:21:28 UTC
(In reply to Robert Delahunt from comment #64)
[...] 
> By the way this is the stock kernel with cgroup_disable=memory boot
> parameter and no swap.

Sorry, I should have been more explicit. We are interested in who is using memory cgroup controller. But the cgroup_disable kernel command line makes it disabled.

From the systemctl it seems no service is really trying to use it so I suspect it will be cgroup v1 created automatically and the hierarchy will mirror the systemd organization structure (slices, scopes etc.). Please boot again with the kernel command line parameter dropped.
Comment 66 Robert Delahunt 2020-10-06 12:34:25 UTC
With cgroup_disable=memory removed (not active) in boot parameters....
(By the way, I only started using that parameter during the process of this bug testing, so the initial comments I provided when I first began helping didn't have this...)
This is stock kernel (i.e. opensuse-update)
Linux desktop-01721d1.lan 5.3.18-lp152.44-default #1 SMP Wed Sep 30 18:51:43 UTC 2020 (914f31e) x86_64 x86_64 x86_64 GNU/Linux

mount | grep cgroup > /tmp/cgroup.txt

tmpfs on /sys/fs/cgroup type tmpfs (ro,nosuid,nodev,noexec,mode=755)
cgroup on /sys/fs/cgroup/unified type cgroup2 (rw,nosuid,nodev,noexec,relatime)
cgroup on /sys/fs/cgroup/systemd type cgroup (rw,nosuid,nodev,noexec,relatime,xattr,name=systemd)
cgroup on /sys/fs/cgroup/devices type cgroup (rw,nosuid,nodev,noexec,relatime,devices)
cgroup on /sys/fs/cgroup/rdma type cgroup (rw,nosuid,nodev,noexec,relatime,rdma)
cgroup on /sys/fs/cgroup/perf_event type cgroup (rw,nosuid,nodev,noexec,relatime,perf_event)
cgroup on /sys/fs/cgroup/net_cls,net_prio type cgroup (rw,nosuid,nodev,noexec,relatime,net_cls,net_prio)
cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
cgroup on /sys/fs/cgroup/cpuset type cgroup (rw,nosuid,nodev,noexec,relatime,cpuset)
cgroup on /sys/fs/cgroup/cpu,cpuacct type cgroup (rw,nosuid,nodev,noexec,relatime,cpu,cpuacct)
cgroup on /sys/fs/cgroup/pids type cgroup (rw,nosuid,nodev,noexec,relatime,pids)
cgroup on /sys/fs/cgroup/hugetlb type cgroup (rw,nosuid,nodev,noexec,relatime,hugetlb)
cgroup on /sys/fs/cgroup/freezer type cgroup (rw,nosuid,nodev,noexec,relatime,freezer)
cgroup on /sys/fs/cgroup/blkio type cgroup (rw,nosuid,nodev,noexec,relatime,blkio)

systemctl cat "*.service" | grep -E "# /|Memory"

# /run/systemd/generator/systemd-cryptsetup@cr\x2dauto\x2d1.service
# /usr/lib/systemd/system/kbdsettings.service
# /usr/lib/systemd/system/sshd.service
# /usr/lib/systemd/system/user@.service
# /usr/lib/systemd/system/cups.service
# /usr/lib/systemd/system/../../dracut/modules.d/98dracut-systemd/dracut-shutdown.service
# /usr/lib/systemd/system/systemd-journald.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/NetworkManager.service
# /usr/lib/systemd/system/NetworkManager.service.d/NetworkManager-ovs.conf
# /usr/lib/systemd/system/auditd.service
## /etc/systemd/system/auditd.service and add network-online.target
# /usr/lib/systemd/system/smartd.service
# /usr/lib/systemd/system/lvm2-pvscan@.service
# /usr/lib/systemd/system/ModemManager.service
# /usr/lib/systemd/system/upower.service
MemoryDenyWriteExecute=true
# /usr/lib/systemd/system/systemd-random-seed.service
# /usr/lib/systemd/system/user@.service
# /usr/lib/systemd/system/postfix.service
# /usr/lib/systemd/system/mcelog.service
# /usr/lib/systemd/system/systemd-fsck@.service
# /usr/lib/systemd/system/apparmor.service
# /usr/lib/systemd/system/systemd-fsck-root.service
# /usr/lib/systemd/system/systemd-modules-load.service
# /usr/lib/systemd/system/cron.service
# /usr/lib/systemd/system/rtkit-daemon.service
# /usr/lib/systemd/system/systemd-fsck@.service
# /usr/lib/systemd/system/lvm2-monitor.service
# /usr/lib/systemd/system/irqbalance.service
# /usr/lib/systemd/system/systemd-ask-password-plymouth.service
# /usr/lib/systemd/system/systemd-tmpfiles-setup.service
# /usr/lib/systemd/system/systemd-tmpfiles-setup-dev.service
# /usr/lib/systemd/system/systemd-update-utmp.service
# /usr/lib/systemd/system/avahi-daemon.service
# /usr/lib/systemd/system/systemd-journal-flush.service
# /usr/lib/systemd/system/dbus.service
# /usr/lib/systemd/system/kmod-static-nodes.service
# /usr/lib/systemd/system/firewalld.service
# /usr/lib/systemd/system/systemd-udev-trigger.service
# /usr/lib/systemd/system/systemd-logind.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/accounts-daemon.service
# /usr/lib/systemd/system/fwupd.service
# /usr/lib/systemd/system/systemd-backlight@.service
# /usr/lib/systemd/system/rsyslog.service
# /usr/lib/systemd/system/haveged.service
# /usr/lib/systemd/system/systemd-backlight@.service
# /usr/lib/systemd/system/nscd.service
# /usr/lib/systemd/system/polkit.service
# /usr/lib/systemd/system/systemd-udevd.service
MemoryDenyWriteExecute=yes
# /usr/lib/systemd/system/detect-part-label-duplicates.service
# /usr/lib/systemd/system/getty@.service
# /usr/lib/systemd/system/getty@tty1.service.d/noclear.conf
# /usr/lib/systemd/system/display-manager.service
# /usr/lib/systemd/system/udisks2.service
# /usr/lib/systemd/system/bluetooth.service
# /usr/lib/systemd/system/iscsi.service
# /usr/lib/systemd/system/systemd-user-sessions.service
# /usr/lib/systemd/system/klog.service
# /lib/systemd/system/klog.service
# /usr/lib/systemd/system/wpa_supplicant.service
# /usr/lib/systemd/system/systemd-sysctl.service
# /usr/lib/systemd/system/systemd-sysctl.service.d/50-kernel-uname_r.conf
# /usr/lib/systemd/system/colord.service
# /usr/lib/systemd/system/systemd-remount-fs.service

find /sys/fs/cgroup/memory -type d

/sys/fs/cgroup/memory
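
The same directory check can be scripted so that it is easy to repeat while a copy is running. A minimal Python sketch of what the find command above reports; the function name is made up for illustration, and a missing root simply yields an empty list rather than an error:

```python
import os


def memcg_dirs(root="/sys/fs/cgroup/memory"):
    """List every directory under the cgroup v1 memory hierarchy, root included."""
    return sorted(dirpath for dirpath, _dirs, _files in os.walk(root))


# A result containing only the root itself means no memory cgroups
# have been created below it:
#   memcg_dirs() == ["/sys/fs/cgroup/memory"]
```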
Comment 67 Michal Hocko 2020-10-06 12:40:10 UTC
(In reply to Robert Delahunt from comment #66)
[...]
> cgroup on /sys/fs/cgroup/memory type cgroup (rw,nosuid,nodev,noexec,relatime,memory)
[...]
> find /sys/fs/cgroup/memory -type d
> 
> /sys/fs/cgroup/memory

This is more than surprising, because there are no cgroups created and as such they cannot really influence the reclaim decisions. Could you try to reproduce your original problem and see whether something/somebody creates any directories (cgroups) in that hierarchy?
Comment 68 Robert Delahunt 2020-10-06 12:53:08 UTC
I set up one window doing dd if=/dev/zero of=/dev/sdc bs=4K 
Another monitoring the output of df -h | grep cgroup every second
Another monitoring directories under /sys/fs/cgroup/memory
Another monitoring free -m

As expected, swap use spiked at roughly 200 MB (consider: I have 32 GB RAM)

At no time did cgroup get used or any directories populate under /sys/fs/cgroup/memory

Understand that I have a long history with OpenSUSE memory issues.

In previous versions, I would put entries in /etc/fstab that forcefully remounted all tmpfs mountpoints with a size=512M option to reduce their usage.

In the past, it seemed that helped my system use swap less, though I never collected any scientific data that would empirically prove it helped.  It just seemed that stock OpenSUSE (previous versions) dug into swap more readily, so forcing tmpfs entries to use less space seemed to help.

I have long been suspicious of tmpfs reclaim anyways.

But regardless, it seems nothing is using cgroups.

I have also speculated that OpenSUSE needs a laptop-specific kernel which would alter this behavior.  I mean, does anyone honestly need cgroups on a laptop?

Anyways, one minor note is the kernel you made, mhocko, results in me not having a sound card.

But yeah, back to the original topic, cgroups isn't using anything, but the machine dug into swap predictably, just like before.
Comment 69 Robert Delahunt 2020-10-06 12:54:11 UTC
My experience with OpenSUSE is from v 42.3 through LEAP 15.2 (present).
Comment 70 Michal Hocko 2020-10-06 13:00:09 UTC
(In reply to Robert Delahunt from comment #68)
> I set up one window doing dd if=/dev/zero of=/dev/sdc bs=4K 
> Another monitoring the output of df -h | grep cgroup every second

cgroup uses a virtual filesystem so df will not tell you much.

> Another monitoring directories under /sys/fs/cgroup/memory

I simply do not see how the memcg controller, enabled but not used, can make any difference.

[...]
> But yeah, back to the original topic, cgroups isn't using anything, but the
> machine dug into swap predictably, just like before.

All that with the latest kernel I have provided, right? Is this really repeatable, both with the cgroup controller disabled and enabled?
Comment 71 Robert Delahunt 2020-10-06 13:08:00 UTC
My latest test monitoring /sys/fs/cgroup/memory was with the OpenSUSE LEAP 15.2 stock (update) kernel (see uname -a in previous post).

I can run it all again with your kernel, sure.  But please reply real quick and tell me exactly what data you want, so that I can make sure I test things exactly as you want, with all the data you want.  I don't have your kernel installed (I did a test where I reinstalled LEAP 15.2 without online updates and it seemed to fix most of the other weirdness I experienced in GTK/XFCE apps).

So just please tell me everything you want to know.  I'll install your latest kernel, reboot without the cgroups command line option, and then run all the tests you need, once I get back from the gym.
Comment 72 Michal Hocko 2020-10-06 13:31:41 UTC
(In reply to Robert Delahunt from comment #71)
> My latest test monitoring /sys/fs/cgroup/memory was with the OpenSUSE LEAP
> 15.2 stock (update) kernel (see uname -a in previous post).

OK, that explains it, I guess. Please stick with the test kernel so that we can actually draw any conclusion here. So far I believe that the two identified patches have made swapping much more probable under stream IO. Upstream kernel behaves differently because of later changes. The state we have in 15.2 kernels is half baked and therefore I would rather like to restore the previous behavior.
For that I would like to see confirmed that
a) test kernel (the latest one) doesn't swap under your streaming IO load without cgroups (cgroup_disable=memory) and that this is the case consistently in several runs
b) the same tested _without_ cgroup_disable=memory parameter - aka cgroups enabled by default (check the cgroup hierarchy find /sys/fs/cgroup/memory -type d)

in both cases collect /proc/vmstat as before
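
One way to collect the snapshots is sketched below in Python. It assumes the vmstat.<epoch seconds> file naming seen in the earlier data sets; the function name and its parameters are made up for illustration.

```python
import time
from pathlib import Path


def snapshot_vmstat(dest_dir, src="/proc/vmstat", now=None):
    """Copy src to dest_dir/vmstat.<epoch seconds> and return the new path.

    'src' and 'now' are parameters only so the helper can be exercised
    without a live /proc; the defaults are meant for normal use.
    """
    stamp = int(time.time() if now is None else now)
    dest = Path(dest_dir) / f"vmstat.{stamp}"
    # Read and write explicitly; procfs files report a size of zero.
    dest.write_text(Path(src).read_text())
    return dest


# Taking one snapshot per second during a test run could look like this:
#   while True:
#       snapshot_vmstat("/tmp/vmstat-run")
#       time.sleep(1)
```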
Comment 73 Robert Delahunt 2020-10-06 16:17:02 UTC
uname -a
Linux desktop-01721d1.lan 5.3.18-lp152.2.gb85b477-default #1 SMP Fri Sep 25 14:55:58 UTC 2020 (b85b477) x86_64 x86_64 x86_64 GNU/Linux

dd if=/dev/zero of=/dev/sdc bs=4K status=progress

the find command found no directories in group other than memory

swap use rose to 130 MB and then fluctuated between 80 and 100 MB.

I am in Gnome Classic and literally the only things running are Dropbox and Chrome (and I'm only in one tab replying to this bug request).

Your kernel is loaded.

cgroup_disable=memory is NOT in the boot line (I removed it using advanced options prior to booting the kernel).
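
Whether the parameter is really absent from the running kernel can be verified from /proc/cmdline instead of relying on what was typed at the boot menu. A minimal Python sketch; the function name is made up for illustration, and treating the value as a comma-separated list of controllers is an assumption about how the option is parsed:

```python
def memcg_disabled(cmdline):
    """Return True if the kernel command line disables the memory controller."""
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "cgroup_disable" and sep:
            # Assumption: several controllers may be listed, separated by commas.
            if "memory" in value.split(","):
                return True
    return False


# On a live system:
#   with open("/proc/cmdline") as f:
#       print(memcg_disabled(f.read()))
```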

How else may I help?
Comment 74 Michal Hocko 2020-10-06 17:02:17 UTC
(In reply to Robert Delahunt from comment #73)
[...]
> How else may I help?

Please read comment 72 again.
Comment 75 Robert Delahunt 2020-10-06 17:16:58 UTC
In about five minutes:

This is with vmstats with your kernel with cgroups enabled:

http://www.puresimplicity.net/~delahunt/vmstat/mhocko4/

This is with vmstats with your kernel with cgroups disabled:

http://www.puresimplicity.net/~delahunt/vmstat/mhocko5/
Comment 76 Robert Delahunt 2020-10-06 17:21:57 UTC
Note that when I ran your kernel with cgroup memory disabled (mhocko5) and monitored the /sys/fs/cgroup/memory directory for more directories using find, it said /sys/fs/cgroup/memory didn't exist.

This is different from the opensuse-update kernel (stock update) which still had the directory but it wasn't being used.
Comment 77 Robert Delahunt 2020-10-06 17:23:06 UTC
Both data sets are up.
Comment 78 Abdulrhman Ied 2020-10-10 11:51:38 UTC
I probably have the same issue. 
I had already opened a new bug (see bug#1177541) before I got pointed to this one.
So here is my issue again; I hope it is helpful.

"My system freezes for a few seconds when there is high disk usage, like copying large files, or when opening demanding Chrome web pages (due to swapping?).
It happens on Ext4, Btrfs and XFS, so file system doesn't matter.
It happens on both Gnome and Xfce, so that also doesn't matter.
Windows 10, Fedora and Ubuntu work almost fine on the same device; it's a problem with Leap 15.2 only. So I upgraded my system from Leap 15.2 to TW, and everything works almost fine now.
I booted my device to TW but with the Leap kernel (5.3.18-lp152.44-default), and the freezes happened again.
So it seems to me that it's a kernel issue. My search led me to multiple cases with the same issue (different distros).
See: https://askubuntu.com/questions/1212736/system-freezes-on-disk-i-o
And: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1861359
It seems like newer kernels have this issue fixed, maybe as of version 5.5.6 (as stated in the link), and it seems that Ubuntu successfully backported a fix to kernel 5.4.
This issue is very annoying, I hope openSUSE can backport a fix from upstream.

Reproducible: Always

Steps to Reproduce:
1. Start a disk demanding process, like copying large files
Actual Results:  
Freezes and a laggy mouse cursor 

Expected Results:  
Smooth system

Maybe it could be more obvious in devices with low ram, but Ubuntu 20.04 and Windows 10 work perfectly on the same device."
Comment 79 Robert Delahunt 2020-10-10 11:55:41 UTC
Can confirm and agree with Abdulrhman Ied.  Depending on the day, there would be a 3-5 second lag when my system began digging into swap.  I figured it was just the swap factor and/or my system.

However, I noticed on the MHocko kernels (see comment history) that it happened far less often.
Comment 80 Abdulrhman Ied 2020-10-10 17:51:16 UTC
(In reply to Robert Delahunt from comment #79)
> Can confirm and agree with Abdulrhman Ied.  Depending on the day, there
> would be a 3-5 second lag when my system began digging into swap.  I figured
> it was just the swap factor and/or my system.
> 
> However, I noticed on the MHocko kernels (see comment history) that it
> happened far less often.

I tried the mentioned kernel (5.3.18-lp152.2.gb85b477-preempt) without any further configuration (I just installed the kernel and rebooted into it), and the system still freezes.
Comment 81 Robert Delahunt 2020-10-10 18:05:15 UTC
(In reply to Abdulrhman Ied from comment #80)
> (In reply to Robert Delahunt from comment #79)
> > Can confirm and agree with Abdulrhman Ied.  Depending on the day, there
> > would be a 3-5 second lag when my system began digging into swap.  I figured
> > it was just the swap factor and/or my system.
> > 
> > However, I noticed on the MHocko kernels (see comment history) that it
> > happened far less often.
> 
> I tried the mentioned kernel (5.3.18-lp152.2.gb85b477-preempt) without any
> further configuration (I just installed the kernel and reboot to it), and
> the system still freezes.

Please provide system specifications.
Comment 82 Abdulrhman Ied 2020-10-11 14:30:21 UTC
(In reply to Robert Delahunt from comment #81)

> Please provide system specifications.

OS: openSUSE Leap 15.2 x86_64
Host: HP 15 Notebook PC 099011000000000000
Kernel: 5.3.18-lp152.44-default
CPU: Intel Celeron N2840 (2) @ 2.582GHz
GPU: Intel Atom Processor Z36xxx/Z37xxx Se
Memory: 1426MiB / 1870MiB

I have installed the latest kernel from http://download.opensuse.org/repositories/Kernel:/stable/standard and things are much improved now.
Comment 83 Takashi Iwai 2020-10-12 16:20:43 UTC
*** Bug 1177541 has been marked as a duplicate of this bug. ***
Comment 84 Michal Hocko 2020-10-21 08:54:55 UTC
Sorry for the late reply. I was busy with other issues.

(In reply to Robert Delahunt from comment #75)
> In about five minutes:
> 
> This is with vmstats with your kernel with cgroups enabled:
> 
> http://www.puresimplicity.net/~delahunt/vmstat/mhocko4/
Diff between the first and last snapshot:
                  1602004484   1602004607[diff]
pswpin            0            0
pswpout           0            48703
pgscan_kswapd     0            7177411
pgscan_direct     0            0
pgsteal_kswapd    0            7115294
pgsteal_direct    0            0

> This is with vmstats with your kernel with cgroups disabled:
> 
> http://www.puresimplicity.net/~delahunt/vmstat/mhocko5/
                  1602004764   1602004958[diff]
pswpin            0            0
pswpout           0            75781
pgscan_kswapd     0            13728122
pgscan_direct     0            0
pgsteal_kswapd    0            13664819
pgsteal_direct    0            0

Both do swap out. The latter covers a longer time period (194s vs. 123s) and scans roughly twice as many pages, which results in roughly twice as many pages reclaimed and about 55% more swapout.

From that we can conclude (from a high level) that the swapout reflects the overall reclaim and cgroups enabled/disabled doesn't play any major role here. Which is a good confirmation because it would be really curious to see a difference in the behavior just from having cgroups enabled without being used.
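The comparison above can be re-checked with a few lines of arithmetic. The following is only a sketch: the numbers are copied from the two tables, while the dictionary layout and the helper names (`per_second`, `reclaim_efficiency`, `ratio`) are made up for illustration and are not part of any tool used in this bug.

```python
# Totals copied from the two [diff] tables above; the interval length is
# the difference between the two snapshot timestamps in each table header.
RUNS = {
    "cgroups enabled": {
        "seconds": 1602004607 - 1602004484,
        "pswpout": 48703,
        "pgscan_kswapd": 7177411,
        "pgsteal_kswapd": 7115294,
    },
    "cgroups disabled": {
        "seconds": 1602004958 - 1602004764,
        "pswpout": 75781,
        "pgscan_kswapd": 13728122,
        "pgsteal_kswapd": 13664819,
    },
}


def per_second(run, counter):
    """Average rate of a counter over the run, in pages per second."""
    return run[counter] / run["seconds"]


def reclaim_efficiency(run):
    """Fraction of pages scanned by kswapd that were actually reclaimed."""
    return run["pgsteal_kswapd"] / run["pgscan_kswapd"]


def ratio(counter):
    """How much larger a counter is in the cgroups-disabled run."""
    return RUNS["cgroups disabled"][counter] / RUNS["cgroups enabled"][counter]


for name, run in RUNS.items():
    print(
        f"{name}: {run['seconds']}s, "
        f"swapout {per_second(run, 'pswpout'):.0f} pages/s, "
        f"reclaim efficiency {reclaim_efficiency(run):.3f}"
    )
print(f"pswpout ratio: {ratio('pswpout'):.2f}")
print(f"pgscan_kswapd ratio: {ratio('pgscan_kswapd'):.2f}")
```

Normalised by run length, the swapout rate comes out at roughly 396 pages/s with cgroups enabled and roughly 391 pages/s with cgroups disabled, which matches the reading that the cgroup setting makes no major difference.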

So let's focus on the cgroups enabled case for now. Let's have a look at
                  1602004840   1602004841[diff]
pswpin                 0           0
pswpout                3103        6801
pgsteal_kswapd         206006      148891
pgscan_kswapd          251736      148891
nr_active_anon         339953      -1840
nr_active_file         147150      7
nr_inactive_anon       38139       9382
nr_inactive_file       7263922     -2703
workingset_activate    170         0
workingset_refault     170         57
workingset_restore     0           0

From this we can conclude that
- some active anonymous pages have been rotated to the inactive list, which has grown by much more than that rotation explains, even when we consider the swapout. So there must be some process allocating a nontrivial amount of anonymous memory, and there is more going on than just the IO test case
- There is a ton of inactive page cache to reclaim from
- refaults are quite marginal
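A rough accounting of the one-second diff illustrates the first and last points. This is a sketch under stated assumptions, not an exact model of the kernel's bookkeeping: it assumes 4 KiB pages (the usual x86-64 base page size) and that every page counted in pswpout left the anonymous LRU lists during the interval. The helper name `min_new_anon_pages` is made up for this example.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB base pages, the x86-64 default

# One-second deltas copied from the [diff] column of the table above.
delta = {
    "pswpout": 6801,
    "pgsteal_kswapd": 148891,
    "nr_active_anon": -1840,
    "nr_inactive_anon": 9382,
    "workingset_refault": 57,
}


def min_new_anon_pages(d):
    """Lower bound on anonymous pages allocated during the interval.

    Assumes each swapped-out page was removed from the anon LRU lists.
    Anonymous pages freed by processes are ignored; counting them would
    only raise the bound.
    """
    lru_growth = d["nr_active_anon"] + d["nr_inactive_anon"]
    return lru_growth + d["pswpout"]


new_pages = min_new_anon_pages(delta)
new_mib = new_pages * PAGE_SIZE / 2**20
refault_share = delta["workingset_refault"] / delta["pgsteal_kswapd"]

print(f"at least {new_pages} new anonymous pages (~{new_mib:.0f} MiB) in one second")
print(f"refaults per reclaimed page: {refault_share:.5f}")
```

Under these assumptions the anonymous lists grew by 7542 pages net while 6801 pages were swapped out, so at least 14343 pages (about 56 MiB) of anonymous memory were allocated in that second, and there were only 57 refaults against 148891 reclaimed pages.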

So this is in line with previous observations. I am inclined to drop the two patches mentioned earlier (comment 60) as they are known to contribute considerably. Unless Vlastimil or Mel speak up.

At this moment I am not sure how much more time I can spend on this, so I would recommend using a more recent kernel.

Btw., regarding stalls: the data I have seen so far doesn't indicate any reclaim-induced source of a potential stall. There is neither swap-in nor direct reclaim. So existing reclaim decisions. Maybe in your regular workload there is a considerable swapin (pswpin) going on.
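To check whether a regular workload does show swap-in or direct reclaim, the relevant counters can be sampled from /proc/vmstat while the stalls happen. The following is a minimal sketch, not the script used to collect the traces in this bug; the function names, the choice of counters in `WATCHED`, and the `reader`/`sleep` parameters (there only so the loop can be exercised without a live system) are assumptions of this example.

```python
import time

# Counters discussed in this comment; all are plain "name value" lines
# in /proc/vmstat.
WATCHED = ("pswpin", "pswpout", "pgscan_direct", "pgsteal_direct")


def read_vmstat(path="/proc/vmstat"):
    """Read /proc/vmstat into a dict of counter name -> integer value."""
    counters = {}
    with open(path) as f:
        for line in f:
            name, _, value = line.partition(" ")
            if value.strip().isdigit():
                counters[name] = int(value)
    return counters


def watch(samples=5, interval=1.0, reader=read_vmstat, sleep=time.sleep):
    """Yield the per-interval change of each watched counter."""
    previous = reader()
    for _ in range(samples):
        sleep(interval)
        current = reader()
        yield {k: current.get(k, 0) - previous.get(k, 0) for k in WATCHED}
        previous = current
```

Running `for deltas in watch(): print(deltas)` during a stall would show whether pswpin or the direct-reclaim counters move; nonzero values there would point at a different situation from the traces analysed above.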
Comment 87 Abdulrhman Ied 2020-10-28 19:31:04 UTC
> At this moment I am not sure how much more time I can spend on this so I
> would recommend to use a more recent kernel.

Yeah, I am already using a more recent kernel and it's much better. Thanks for all your time. I hope that this is a temporary situation and that fixes will be backported soon, so we can go back to a higher standard of stability.
Comment 88 Abdulrhman Ied 2021-06-24 22:30:49 UTC
Hello,

After I upgraded to openSUSE Leap 15.3, the issue was resolved.

lsb_release -d
Description:	openSUSE Leap 15.3

uname -r
5.3.18-59.5-default

Thank you very much for all of your efforts.