Bug 1165351 - systemd libpam-systemd immediate memory exhaustion triggers oom upon ssh
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Basesystem
Version: Current
Hardware: Other (Other)
Priority: P5 - None
Severity: Major
Target Milestone: ---
Assigned To: Thomas Blume
Reported: 2020-02-29 16:24 UTC by Luis Chamberlain
Modified: 2022-07-18 08:46 UTC
CC List: 5 users



Attachments:
  Image while system is idle (594.45 KB, image/png), 2020-02-29 16:43 UTC, Luis Chamberlain
  systemd eating memory for breakfast (612.01 KB, image/png), 2020-02-29 16:44 UTC, Luis Chamberlain
  nsswitch.conf (1.16 KB, text/plain), 2020-03-03 13:16 UTC, Franck Bui
  0001-execute-set-rlimits-before-using-libtirpc-which-uses.patch (2.47 KB, patch), 2020-04-08 15:32 UTC, Thomas Blume

Description Luis Chamberlain 2020-02-29 16:24:53 UTC
I have a fresh install of openSUSE Tumbleweed on two HP ProLiant DL380 Gen9 machines, at version 20200222 right now, both with 30 GiB of RAM and using systemd-244-2.1.x86_64. On one of them I noticed that whenever I ssh in as a regular user (but not as root) the out-of-memory killer is triggered. That is, 30 GiB of RAM get eaten up.

The system now has the debug option set in /etc/pam.d/common-session, and sudo journalctl -f reveals:

Feb 29 16:31:18 chivo sshd[16511]: Accepted publickey for mcgrof from 10.100.208.217 port 44016 ssh2: RSA SHA256:tOilLVI1jM/Pvg5uJVQbeXGEAU0mma6Jxma9c5eVX0w
Feb 29 16:31:18 chivo sshd[16511]: pam_systemd(sshd:session): pam-systemd initializing
Feb 29 16:31:18 chivo sshd[16511]: pam_systemd(sshd:session): Asking logind to create session: uid=1000 pid=16511 service=sshd type=tty class=user desktop= seat= vtnr=0 tty= display= remote=yes remote_user= remote_host=10.100.208.217
Feb 29 16:31:18 chivo sshd[16511]: pam_systemd(sshd:session): Session limits: memory_max=n/a tasks_max=n/a cpu_weight=n/a io_weight=n/a runtime_max_sec=n/a
Feb 29 16:31:18 chivo systemd-logind[2184]: New session 62 of user mcgrof.      
Feb 29 16:31:18 chivo systemd[1]: Starting User Manager for UID 1000...         
                                                                                
-- <oom trace> ---
                                                                             
Feb 29 16:31:34 chivo kernel: oom_reaper: reaped process 16513 ((systemd)), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Feb 29 16:31:34 chivo systemd[1]: user@1000.service: Failed with result 'protocol'.
Feb 29 16:31:34 chivo systemd[1]: Failed to start User Manager for UID 1000.    
Feb 29 16:31:34 chivo systemd[1]: Started Session 62 of user mcgrof.            
Feb 29 16:31:34 chivo sshd[16511]: pam_systemd(sshd:session): Reply from logind: id=62 object_path=/org/freedesktop/login1/session/_362 runtime_path=/run/user/1000 session_fd=7 seat= vtnr=0 original_uid=1000
Feb 29 16:31:34 chivo sshd[16511]: pam_unix(sshd:session): session opened for user mcgrof(uid=1000) by (uid=0)


Here's what a vmstat log (one line per second) looks like: first with the system idle, then I create a new ssh session; it takes a while to eat up all the memory, then the oom killer takes over and kills the pam systemd service, and I am able to log in. The blank lines below highlight where the ssh session is stalled and the oom is about to hit:

mcgrof@chivo ~ $ sudo vmstat 1                                                  
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0 248704 31981364    136 160708    2  248    35   252   31   63  0  1 99  0  0
 0  0 248704 31981056    136 160856    0    0     0     0  131  101  0  0 100  0  0
 0  0 248704 31981056    136 160856    0    0     0     0   82   92  0  0 100  0  0
 0  0 248704 31980552    136 160824   52    0   384     0  108  157  0  0 100  0  0
 0  0 248704 31980472    136 161196    0    0     0   232  164  280  0  0 100  0  0
 0  0 248704 31980200    136 161344  128    0   228     0  155  240  0  0 100  0  0
 0  0 248704 31980200    136 161412    0    0     0     0  160  106  0  0 100  0  0
 0  0 248704 31979628    136 161484    0    0     0     0   95  108  0  0 100  0  0
 0  0 248704 31979628    136 161484    0    0     0     0  114  146  0  0 100  0  0
 0  0 248704 31979628    136 161484    0    0     0     0   79   84  0  0 100  0  0
 1  0 248704 31932744    136 161484    0    0    16     0  120  133  0  0 100  0  0
 0  0 248704 31929664    136 161504    0    0     0     0  116  104  0  0 100  0  0
 0  0 248704 31929640    136 161508    0    0     0     0  125  162  0  0 100  0  0
 1  0 248704 28031352    136 163092  416    0  1940     0  574  515  3  2 95  0  0
 1  0 248704 22681916    136 163092    0    0    64   340  473  354  3  2 95  0  0
 1  0 248704 17326052    136 163172    0    0     0     0  364  116  3  2 95  0  0
 1  0 248704 11968392    136 163172    0    0     0     0  364  124  3  2 95  0  0
 1  0 248704 6612528    136 163172    0    0     0     0  374  110  3  2 95  0  0
 1  0 248704 1257168    136 163172    0    0     0     0  366  114  3  2 95  0  0
 1  1 565632 215044    136  82292  192 316912  1864 316912 1678 1135  1  2 91  6  0
 2  1 954816 207848    136  76672  264 389708  1564 389708 30002 12569  0  1 89  9  0
                                                                                
                                                                                
 3  2 353004 24513544    136  78252  140 139752  8480 139752 38917 113910  0 16 83  1  0
 0  0 350444 32073372    136 122312 3040   24 42184    68 4774 14383  0  1 98  1  0
 0  0 350444 32073372    136 122344    8    0    24     0  134  159  0  0 100  0  0


Running $(sudo htop --sort-key PERCENT_MEM  -d 1) reveals the same but in a much more fun way.

FWIW it is using yp, and so I have:

/etc/yp.conf
domain suse.de server <some-private-ip>
domain suse.de server <some-other-ip>

However, leaving this blank, running $(netconfig update -f), and then ssh'ing in shows that it still triggers.

Related public issues:

https://github.com/systemd/systemd/issues/11502
https://bugzilla.redhat.com/show_bug.cgi?id=1666690
https://github.com/systemd/systemd/commit/084eeb865ca63887098e0945fb4e93c852b91b0f

Red Hat's bug reveals that a bad backport of the fix for CVE-2018-16864 itself introduced another CVE, CVE-2019-3815.

I can reproduce this with a freshly created user named test. The root user does not trigger the issue. What is odd, however, is that this issue is not reproducible on the other, twin system.
Comment 2 Anthony Iliopoulos 2020-02-29 16:33:10 UTC
see also: https://github.com/systemd/systemd/issues/9656

(this appears to be the same issue, but there's no conclusion)
Comment 3 Luis Chamberlain 2020-02-29 16:37:56 UTC
(In reply to Luis Chamberlain from comment #0)
> Related public issues:
> 
> https://github.com/systemd/systemd/issues/11502
> https://bugzilla.redhat.com/show_bug.cgi?id=1666690
> https://github.com/systemd/systemd/commit/
> 084eeb865ca63887098e0945fb4e93c852b91b0f
> 
> Red Hat's bug reveals it was a bad backport of the fix for CVE-2018-16864
> and so that itself introduced another CVE, CVE-2019-3815.

Our fix for CVE-2018-16864 was addressed via bsc#1120323 so Cc'ing Franck Bui.
Comment 4 Luis Chamberlain 2020-02-29 16:43:17 UTC
Created attachment 831637 [details]
Image while system is idle

Image while system is idle
Comment 5 Luis Chamberlain 2020-02-29 16:44:04 UTC
Created attachment 831638 [details]
systemd eating memory for breakfast

Here is an image which reveals how systemd eats up memory, 30 GiB worth, in no time.
Comment 6 Luis Chamberlain 2020-02-29 16:45:46 UTC
If one runs htop as root with:

sudo htop --sort-key PERCENT_MEM  -d 1

one can easily attach strace to the (systemd) process that eats up the memory by hitting "s". I have tried that but came up empty. Similarly, running strace on systemd-journald doesn't tell me anything revealing or stupid it may be doing.
Comment 7 Franck Bui 2020-03-03 11:18:41 UTC
After looking at the affected system, it seems to be an issue related to NSS and more specifically to the use of NIS.

Here is the backtrace of the process (child of PID1) just before it gets killed by OOM killer:

> #0  0x00007f9545019217 in pthread_cond_init@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
> #1  0x00007f95439c4f9e in clnt_dg_create () from /lib64/libtirpc.so.3
> #2  0x00007f95439c527a in clnt_tli_create () from /lib64/libtirpc.so.3
> #3  0x00007f95439c5641 in getclnthandle () from /lib64/libtirpc.so.3
> #4  0x00007f95439c640a in __rpcb_findaddr_timed () from /lib64/libtirpc.so.3
> #5  0x00007f95439c7c11 in clnt_tp_create_timed () from /lib64/libtirpc.so.3
> #6  0x00007f95439c7dc9 in clnt_create_timed () from /lib64/libtirpc.so.3
> #7  0x00007f95439e5d8b in yp_bind_client_create_v3 () from /usr/lib64/libnsl.so.2
> #8  0x00007f95439e61db in __yp_bind.part.0 () from /usr/lib64/libnsl.so.2
> #9  0x00007f95439e7160 in yp_all () from /usr/lib64/libnsl.so.2
> #10 0x00007f95439f8121 in _nss_nis_initgroups_dyn () from /lib64/libnss_nis.so.2
> #11 0x00007f9543a0551b in getgrent_next_nss () from /lib64/libnss_compat.so.2
> #12 0x00007f9543a05a45 in _nss_compat_initgroups_dyn () from /lib64/libnss_compat.so.2
> #13 0x00007f9545669876 in internal_getgrouplist () from /lib64/libc.so.6
> #14 0x00007f9545669bbb in initgroups () from /lib64/libc.so.6
> #15 0x000055811e7e4097 in get_supplementary_groups (group=<optimized out>, ngids=<synthetic pointer>, supplementary_gids=<synthetic pointer>, gid=100, user=0x55811fd9c7b0 "mcgrof", c=0x55811fd34ab8)

It seems that something went wrong when the child of PID1 called initgroups(3) as the call never returned.

It can be reproduced easily by starting any service that sets the User= option. For example starting the following test service leads to the same memory exhaustion:

# systemctl cat test
# /etc/systemd/system/test.service
[Unit]
Description=test

[Service]
User=1000
Type=oneshot
ExecStart=/usr/bin/sleep infinity
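
Starting it with "systemctl start test" while watching free or vmstat should then show the same runaway allocation, assuming the same NIS/compat group setup as on the affected host.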
Comment 8 Franck Bui 2020-03-03 11:20:27 UTC
Also, removing the NIS entry "+:::" from /etc/group works around the issue.
Comment 9 Franck Bui 2020-03-03 11:26:55 UTC
I can't reproduce the issue with getent(1). Indeed doing:

 # getent initgroups mcgrof
 # getent group 

works as expected.

So it seems that the problem only happens in the context of PID1.
Comment 10 Franck Bui 2020-03-03 11:30:32 UTC
Thorsten, since you're the libnss_nis maintainer, does that ring a bell, or can you think of any reason why initgroups(3) would loop forever and consume all memory when it's called in PID1's context?
Comment 11 Thorsten Kukuk 2020-03-03 11:42:40 UTC
(In reply to Franck Bui from comment #10)
> Thorsten, since you're libnss_nis maintainer, does that ring you a bell or
> can you think of any reason why initgroups(3) would loop forever and consume
> all memory when it's called in PID1 context ?

From the backtrace it's in libtirpc, not libnss_nis.

I'm not aware of anything specific to the PID1 context; the usual reason for such a crash is that NIS is needed to contact the NIS server, i.e. the typical endless recursion.
Maybe systemd as PID1 does not allow access to everything needed?

What does nsswitch.conf look like on the host? Does removing NIS from hosts, services, protocols and netgroup help?
Comment 12 Franck Bui 2020-03-03 13:14:49 UTC
(In reply to Thorsten Kukuk from comment #11)
> I'm not aware of anything in PID1 context, the usual reasons for such a
> crash are, that NIS is needed to contact the NIS server, so typical endless
> recursion.

Well, the lib is not supposed to crash, so there's definitely a bug here.

> Maybe systemd as PID1 does not allow access to everything needed?

I'm not aware of such restrictions.

> How does nsswitch.conf look like on the host?

I'll attach it.

> Does removing NIS from hosts, services, protocols and netgroup help?

Nope, it doesn't, most likely because "group" uses "compat".

BTW you can have a look at the system, see (private) comment #1.
Comment 13 Franck Bui 2020-03-03 13:16:24 UTC
Created attachment 831792 [details]
nsswitch.conf
Comment 14 Franck Bui 2020-03-03 17:42:48 UTC
After some more debugging I finally found that the bug is essentially the same as the one reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1705641

A more interesting discussion can be found here https://github.com/systemd/systemd/pull/13359 but unfortunately it led nowhere.

Adding Thomas in Cc as he's the libtirpc maintainer.
Comment 15 Thomas Blume 2020-03-04 09:24:11 UTC
(In reply to Franck Bui from comment #14)
> After some more debugging I finally found that the bug is essentially the
> same as the one reported here:
> https://bugzilla.redhat.com/show_bug.cgi?id=1705641
> 
> A more interesting discussion can be found here
> https://github.com/systemd/systemd/pull/13359 but unfortunately it led to
> nowhere.
> 
> Adding Thomas in Cc as he's the libtirpc maintainer.

Indeed:

         return (tbsize = (int)rl.rlim_max);

seems to be a bit excessive.
But I'm not sure how else to determine a good value.
Thorsten, would changing this to the soft limit be appropriate, e.g.:

         return (tbsize = (int)rl.rlim_cur);

?
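
For reference, the function in question follows the classic getdtablesize()-style pattern; a minimal sketch of it (my own sketch, not the exact libtirpc source) looks roughly like this:

-->
/* Minimal sketch of the pattern under discussion, not the actual
 * libtirpc code: the table size is derived from RLIMIT_NOFILE via
 * getrlimit(2) and cached.  rl.rlim_max sizes it for the hard limit;
 * rl.rlim_cur would size it for the soft limit instead. */
#include <sys/resource.h>

static int rpc_dtbsize_sketch(void)
{
        static int tbsize;
        struct rlimit rl;

        if (tbsize)
                return tbsize;
        if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
                return (tbsize = (int)rl.rlim_max);   /* current behaviour */
        return (tbsize = 32);                         /* arbitrary fallback */
}
--<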
Btw. the RedHat bugzilla indicates that enabling nscd would also fix the excessive memory consumption.

What do you think?
Comment 16 Franck Bui 2020-03-04 09:36:42 UTC
(In reply to Thomas Blume from comment #15)
>          return (tbsize = (int)rl.rlim_max);
> 
> seems to be a bit excessive.
> But I'm not sure how else to determine a good value.
> Thorsten, would be changing this to the soft limit appropriate, e.g:
> 
>          return (tbsize = (int)rl.rlim_cur);
> 
> ?

Indeed, that was my thinking too, and it's actually how glibc implements getdtablesize(3). It looks like there might be a bug here, therefore I would suggest sending a patch upstream that makes the code use the soft limit.

But that won't be sufficient, unfortunately, because PID1 initializes the rlimits of its child *after* calling initgroups(3). Therefore the soft limit will still be the one inherited from PID1, which is equal to the hard limit.

I'll send a patch upstream to attempt to fix it.
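
To make the ordering concrete, here is a small standalone illustration (my own sketch, hypothetical and not systemd source): rlimits are inherited across fork(), so anything the child does before applying the service's own limits, such as the initgroups(3) call that ends up in libtirpc, still runs with PID1's limits.

-->
#include <stdio.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static void report(const char *when)
{
        struct rlimit rl;

        getrlimit(RLIMIT_NOFILE, &rl);
        printf("%s: soft=%llu hard=%llu\n", when,
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
}

int main(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                /* window where systemd's child would call initgroups(3):
                 * still running with the limits inherited from the parent */
                report("child before applying service limits");

                /* what applying LimitNOFILE= would later do */
                struct rlimit rl = { .rlim_cur = 1024, .rlim_max = 4096 };
                setrlimit(RLIMIT_NOFILE, &rl);

                report("child after applying service limits");
                _exit(0);
        }
        waitpid(pid, NULL, 0);
        return 0;
}
--<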
Comment 17 Thorsten Kukuk 2020-03-04 09:41:26 UTC
(In reply to Thomas Blume from comment #15)
> (In reply to Franck Bui from comment #14)
> > After some more debugging I finally found that the bug is essentially the
> > same as the one reported here:
> > https://bugzilla.redhat.com/show_bug.cgi?id=1705641
> > 
> > A more interesting discussion can be found here
> > https://github.com/systemd/systemd/pull/13359 but unfortunately it led to
> > nowhere.
> > 
> > Adding Thomas in Cc as he's the libtirpc maintainer.
> 
> Indeed:
> 
>          return (tbsize = (int)rl.rlim_max);
> 
> seems to be a bit excessive.
> But I'm not sure how else to determine a good value.
> Thorsten, would be changing this to the soft limit appropriate, e.g:
> 
>          return (tbsize = (int)rl.rlim_cur);
> 
> ?

As far as I understand the code, no, this will not work. For speed they do something which is very fast but also very memory-consuming: keep a table for all possible file descriptors to look up the data, instead of going through a list of existing file descriptors. If you set the limit lower than possible, you could get an out-of-array access.
The code was written at a time when 1024 file handles was the maximum possible, not for numbers as big as today's.

I would discuss this on the tirpc mailing list. The whole code needs to be rewritten to be less memory consuming.

> Btw. the RedHat bugzilla indicates that enabling nscd would also fix the
> excessive memory consumption.

No, I tried that already yesterday: if the nscd cache is full (somebody ran "getent group" before), it works. If the nscd cache is empty, I saw the same crash.
Comment 18 Thorsten Kukuk 2020-03-04 09:52:23 UTC
(In reply to Franck Bui from comment #16)

> Indeed that was my thinking too and it's actually how glibc implements
> getdtablesize(3). It looks like there might be a bug here therefore I would
> suggest to send a patch upstream that make the code use the soft limit.

glibc uses sunrpc, not ti-rpc. You cannot compare those functions, as the code calling this function is different and uses it differently - with the problem that connections could be ignored if the fd number does not fit. A common problem since the day Linux started supporting more than 1024 file descriptors per process.

But you are right that TI-RPC needs to be adjusted; as written, the code calling __rpc_dtbsize() needs to be rewritten. Best not to allocate that huge array, but in any case it needs some sanity checks on whether the file descriptor is inside the table or outside.
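
To illustrate the kind of sanity check meant here, a hedged sketch (my own illustration, not actual TI-RPC code): any lookup into a per-fd table has to verify that the descriptor is inside the allocated range first, otherwise a high fd number turns into an out-of-bounds access.

-->
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-fd table, as used for the per-fd locks/condvars. */
struct fd_table {
        void **slots;   /* one slot per possible fd, e.g. __rpc_dtbsize() entries */
        int    size;    /* number of slots actually allocated */
};

static bool fd_table_get(const struct fd_table *t, int fd, void **out)
{
        if (fd < 0 || fd >= t->size)
                return false;   /* fd outside the table: refuse instead of overrunning it */
        *out = t->slots[fd];
        return true;
}
--<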
Comment 19 Luis Chamberlain 2020-03-04 13:21:25 UTC
(In reply to Thorsten Kukuk from comment #17)
> (In reply to Thomas Blume from comment #15)
> > (In reply to Franck Bui from comment #14)
> > > After some more debugging I finally found that the bug is essentially the
> > > same as the one reported here:
> > > https://bugzilla.redhat.com/show_bug.cgi?id=1705641
> > > 
> > > A more interesting discussion can be found here
> > > https://github.com/systemd/systemd/pull/13359 but unfortunately it led to
> > > nowhere.
> > > 
> > > Adding Thomas in Cc as he's the libtirpc maintainer.
> > 
> > Indeed:
> > 
> >          return (tbsize = (int)rl.rlim_max);
> > 
> > seems to be a bit excessive.
> > But I'm not sure how else to determine a good value.
> > Thorsten, would be changing this to the soft limit appropriate, e.g:
> > 
> >          return (tbsize = (int)rl.rlim_cur);
> > 
> > ?
> 
> As far as I understand the code, no, this will not work. To be fast they are
> doing something which is very fast, but also very memory consuming: have a
> table for all possible file descriptors to lookup the data instead of going
> through a list of existing file descriptors.

Let us think proactively as well: if this was done, was it inspired by another codebase? Who else might be following a similar practice?

> If you set the limit lower than
> possible, you could get a out of array access.
> The code was written at a time, when 1024 file handles was the maximum
> possible, not for that big numbers as of today.

chivo:~/:[0]# cat /proc/sys/fs/file-max
9223372036854775807

> I would discuss this on the tirpc mailing list. The whole code needs to be
> rewritten to be less memory consuming.

And probably backported... down to what release? :)

A puzzle I don't see yet resolved is that this issue triggers only on one system, and not for another twin system with the same software/hardware. Any ideas on that?

ie, I am not sure if this issue is easily reproducible on all systems.
Comment 20 Thorsten Kukuk 2020-03-04 13:32:24 UTC
(In reply to Luis Chamberlain from comment #19)
> (In reply to Thorsten Kukuk from comment #17)

> > As far as I understand the code, no, this will not work. To be fast they are
> > doing something which is very fast, but also very memory consuming: have a
> > table for all possible file descriptors to lookup the data instead of going
> > through a list of existing file descriptors.
> 
> Let us think proactively as well, if this was done, was it inspired by
> another codebase? Who else might be doing similar practice?

Of course, this was common best practice 25 years ago if you needed performance. 
 
> > If you set the limit lower than
> > possible, you could get a out of array access.
> > The code was written at a time, when 1024 file handles was the maximum
> > possible, not for that big numbers as of today.
> 
> chivo:~/:[0]# cat /proc/sys/fs/file-max
> 9223372036854775807

That's why this is no longer good practice today and needs to be rewritten.
Comment 21 Thomas Blume 2020-03-04 13:46:21 UTC
(In reply to Luis Chamberlain from comment #19)
> A puzzle I don't see yet resolved is that this issue triggers only on one
> system, and not for another twin system with the same software/hardware. Any
> ideas on that?

The issue won't show up if nscd is used and the nscd cache is full.
Comment 22 Luis Chamberlain 2020-03-04 14:00:54 UTC
(In reply to Thomas Blume from comment #21)
> (In reply to Luis Chamberlain from comment #19)
> > A puzzle I don't see yet resolved is that this issue triggers only on one
> > system, and not for another twin system with the same software/hardware. Any
> > ideas on that?
> 
> The issue won't show up if nscd is used and the nscd cache is full.

Chivo is the name of the affected system, tuctuc is the name of the system on which this issue has not appeared yet. Both have the same release of Tumbleweed installed. nscd was disabled on both systems upon a fresh install, and nscd remains disabled on tuctuc.

Any other theory?
Comment 23 Luis Chamberlain 2020-03-04 21:36:04 UTC
Also, if the culprit is the libtirpc function __rpc_dtbsize(), shouldn't we be able to proof-of-concept this easily? And wouldn't this issue be more visible and easier to reproduce elsewhere?

Here are other binaries that depend on libtirpc, though I am not sure whether they use __rpc_dtbsize():

/bin/

rpcinfo


/usr/bin/

cifsdd
dbwrap_tool
guestmount
libguestfs-test-tool
mailq
net
newaliases
nmblookup
oLschema2ldif
pdbedit
profiles
regdiff
regpatch
regshell
regtree
rpcclient
rsmtp
samba-regedit
sharesec
smbcacls

/usr/sbin/

automount
eventlogadm
exim
libvirtd
nmbd
rpc.gssd
rpc.mountd
rpc.statd
rpc.svcgssd
sendmail
showmount
sm-notify
smbd
virtinterfaced
virtlockd
virtlogd
virtlxcd
virtnetworkd
virtnodedevd
virtnwfilterd
virtproxyd
virtqemud
virtsecretd
virtstoraged
virtxend
yp_dump_binding
ypbind
yppoll
ypset
yptest
Comment 24 Luis Chamberlain 2020-03-04 22:12:54 UTC
(In reply to Thorsten Kukuk from comment #20)
> (In reply to Luis Chamberlain from comment #19)
> > (In reply to Thorsten Kukuk from comment #17)
> > > If you set the limit lower than
> > > possible, you could get a out of array access.
> > > The code was written at a time, when 1024 file handles was the maximum
> > > possible, not for that big numbers as of today.
> > 
> > chivo:~/:[0]# cat /proc/sys/fs/file-max
> > 9223372036854775807
> 
> That's why this is no good practice today anymore and needs to be rewritten.

But the above is file-max; I think __rpc_dtbsize() uses RLIMIT_NOFILE, which is different. FWIW:

mcgrof@chivo ~ $ sudo cat /proc/sys/fs/nr_open
1073741816
mcgrof@chivo ~ $ python -c "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
(1024, 524288)
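
For a rough sense of scale (my own back-of-the-envelope, assuming PID1 raised its own hard RLIMIT_NOFILE to the nr_open value above, as recent systemd does, and that libtirpc keeps on the order of one pthread_cond_t of ~48 bytes per possible fd): 1073741816 * 48 bytes is roughly 48 GiB, which is the same order of magnitude as the ~52 GiB total-vm the OOM killer later reports for the killed (systemd) process (see comment #35).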
Comment 25 Franck Bui 2020-03-05 08:04:32 UTC
(In reply to Luis Chamberlain from comment #23)
> Also, if the culprit is libtirpc function __rpc_dtbsize() shouldn't we be
> able to proof of concept this easily? And wouldn't this issue be more
> visible and easier to reproduce elsewhere?

See previous comments, it's due to the fact that __rpc_dtbsize() is called with rlimits inherited from PID1. The rlimits of the started service are initialized *after* the call to initgroups(3).
Comment 26 Thomas Blume 2020-03-05 13:51:30 UTC
(In reply to Thorsten Kukuk from comment #17)
> have a
> table for all possible file descriptors to lookup the data instead of going
> through a list of existing file descriptors. If you set the limit lower than
> possible, you could get a out of array access.
> The code was written at a time, when 1024 file handles was the maximum
> possible, not for that big numbers as of today.
> 
> I would discuss this on the tirpc mailing list. The whole code needs to be
> rewritten to be less memory consuming.

Looking for some more background I've found this:

https://www.shrubbery.net/solaris9ab/SUNWdev/ONCDG/p27.html

-->
The __rpc_dtbsize() function calls the getrlimit() function to determine the maximum value that the system can assign to a newly created file descriptor. The result is cached for efficiency.
--<

If that's the purpose I would assume that:

return (tbsize = (int)rl.rlim_max);

is correct.
So maybe we just need to subtract something from dtbsize in case it has an insane value?
Comment 27 Luis Chamberlain 2020-03-05 15:44:08 UTC
(In reply to Franck Bui from comment #25)
> (In reply to Luis Chamberlain from comment #23)
> > Also, if the culprit is libtirpc function __rpc_dtbsize() shouldn't we be
> > able to proof of concept this easily? And wouldn't this issue be more
> > visible and easier to reproduce elsewhere?
> 
> See previous comments, it's due to the fact __rpc_dtbsize() is called with
> rlimits inherited from PID1.

OK.

> The rlimits of started service are initialized
> *after* the call to initgroups(3).

Are you saying that there is a race between the use of __rpc_dtbsize() and its initialized values?

What explains that this is not reproducible on a twin system, with the same software?
Comment 28 Thomas Blume 2020-03-09 10:37:47 UTC
(In reply to Thomas Blume from comment #26)
> Looking for some more background I've found this:
> 
> https://www.shrubbery.net/solaris9ab/SUNWdev/ONCDG/p27.html
> 
> -->
> The __rpc_dtbsize() function calls the getrlimit() function to determine the
> maximum value that the system can assign to a newly created file descriptor.
> The result is cached for efficiency.
> --<
> 
> If that's the purpose I would assume that:
> 
> return (tbsize = (int)rl.rlim_max);
> 
> is correct.
> So maybe we just need to subtract something from dtbsize in case it has an
> insane value?

There is already a patch to address an integer overflow in dtbsize, see: 
https://bugzilla.redhat.com/show_bug.cgi?id=1600284 (libtirpc git commit 56b780e61ed4bae8b728a600fc5ac8052d0d3582).


-->
+               if ( (size_t) dtbsize > SIZE_MAX/sizeof(cond_t)) {
+                       mutex_unlock(&clnt_fd_lock);
+                       thr_sigsetmask(SIG_SETMASK, &(mask), NULL);
+                       ce->cf_stat = RPC_SYSTEMERROR;
+                       ce->cf_error.re_errno = EOVERFLOW;
+                       goto err;
+               }
--<

Maybe we could add a variable for the available memory to this calculation?
I guess an out-of-array access becomes irrelevant when the system runs out of memory.
Thorsten, I'm just collecting ideas before discussing this upstream; it would be good if you could give your opinion.
Comment 29 Thorsten Kukuk 2020-03-10 08:53:56 UTC
(In reply to Thomas Blume from comment #28)
> Thorsten, I'm just collection ideas before discussion that upstream, would
> be good if you could give your opinion.

The above are all insane hacks to work around a bad design, and they create more problems than they solve.

In my opinion, there are only two possible solutions:
1. Limit the number of possible fds and add checks everywhere in the tirpc code so that we don't access data beyond the array. Quick and easily doable, but it does not solve the insane amount of memory usage and additionally adds the problem that connections could get lost if the file descriptor number gets too high. So the same problem we had with glibc: we minimize the original problem by adding new problems.

2. I seldom agree with Poettering, but here I do: we need to fix the tirpc implementation. Don't allocate an array for all possible fd numbers, but instead use a linked list or something similar. Yes, this will be slower, but if I look at the usage, I don't think this is a real problem on today's CPUs.
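
For what it's worth, a rough sketch of that second option (my own illustration, not a patch): keep per-fd state in a small list allocated on first use instead of an array sized for every possible descriptor. Lookups get slower, but memory scales with the number of fds actually in use.

-->
#include <stdlib.h>

/* Hypothetical replacement for the dense per-fd array. */
struct fd_lock {
        int fd;
        int active;               /* e.g. "a call is active on this fd" */
        struct fd_lock *next;
};

static struct fd_lock *fd_lock_find(struct fd_lock *head, int fd)
{
        for (struct fd_lock *l = head; l; l = l->next)
                if (l->fd == fd)
                        return l;
        return NULL;
}

static struct fd_lock *fd_lock_get(struct fd_lock **head, int fd)
{
        struct fd_lock *l = fd_lock_find(*head, fd);

        if (!l) {
                l = calloc(1, sizeof(*l));    /* allocate on first use only */
                if (!l)
                        return NULL;
                l->fd = fd;
                l->next = *head;
                *head = l;
        }
        return l;
}
--<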
Comment 30 Thomas Blume 2020-03-10 13:39:34 UTC
(In reply to Thorsten Kukuk from comment #29)
 
> 2. I seldom agree with Poettering, but here I do: we need to fix the tirpc
> implementation. Don't allocate an array for all possible fd numbers, but
> instead use a linked list or something similar. Yes, this will be slower,
> but if I look at the usage, I don't think this is a real problem on todays
> CPUs.

Thanks Thorsten, interestingly, even the original patch (from 2007!) states that the implementation is silly:

-->
+/*
+ *      This machinery implements per-fd locks for MT-safety.  It is not
+ *      sufficient to do per-CLIENT handle locks for MT-safety because a
+ *      user may create more than one CLIENT handle with the same fd behind
+ *      it.  Therfore, we allocate an array of flags (vc_fd_locks), protected
+ *      by the clnt_fd_lock mutex, and an array (vc_cv) of condition variables
+ *      similarly protected.  Vc_fd_lock[fd] == 1 => a call is active on some
+ *      CLIENT handle created for that fd.
+ *      The current implementation holds locks across the entire RPC and reply.
+ *      Yes, this is silly, and as soon as this code is proven to work, this
+ *      should be the first thing fixed.  One step at a time.
+ */
--<
Comment 31 Luis Chamberlain 2020-03-13 19:55:47 UTC
(In reply to Luis Chamberlain from comment #27)
> (In reply to Franck Bui from comment #25)
> > (In reply to Luis Chamberlain from comment #23)
> > > Also, if the culprit is libtirpc function __rpc_dtbsize() shouldn't we be
> > > able to proof of concept this easily? And wouldn't this issue be more
> > > visible and easier to reproduce elsewhere?
> > 
> > See previous comments, it's due to the fact __rpc_dtbsize() is called with
> > rlimits inherited from PID1.
> 
> OK.
> 
> > The rlimits of started service are initialized
> > *after* the call to initgroups(3).
> 
> Are you saying that there is a race between the use of __rpc_dtbsize() and
> its initialized values?
> 
> What explains that this is not reproducible on a twin system, with the same
> software?

This goes unanswered. Why is this only occurring on *some* systems, not all? Why can we not easily reproduce with other test cases with 100% certainty?

Also, is there a workaround I can use to move on with life in the meantime with this system?
Comment 32 Thomas Blume 2020-03-17 15:48:41 UTC
(In reply to Luis Chamberlain from comment #31)
>
> > What explains that this is not reproducible on a twin system, with the same
> > software?
> 
> This goes unanswered. Why is this only occurring on *some* systems, not all?
> Why can we not easily reproduce with other test cases with 100% certainty?

I have no answer yet and would like to get access to tuctuc in order to compare it to chivo.
Is that possible?

> Also, is there a work around I can use in the meantime, to move on with life
> in the meantime with this system?

I guess the best for now is to reset LimitNOFILE to the old value, e.g.:

DefaultLimitNOFILE=1024

in /etc/systemd/system.conf
Comment 33 Luis Chamberlain 2020-03-17 16:04:28 UTC
(In reply to Thomas Blume from comment #32)
> (In reply to Luis Chamberlain from comment #31)
> >
> > > What explains that this is not reproducible on a twin system, with the same
> > > software?
> > 
> > This goes unanswered. Why is this only occurring on *some* systems, not all?
> > Why can we not easily reproduce with other test cases with 100% certainty?
> 
> I have no answer yet and like to get access to tuctuc in order to compare it
> to chivo.
> Is that possible?

Absolutely, but please note that tuctuc is actively being used by myself for development purposes right now, so I would prefer that no reboots are done. The credentials are the same, the IP address is the same as chivo's but the last octet ends with 196. I've added the same user to the wheel group so it has sudo privileges.

> > Also, is there a work around I can use in the meantime, to move on with life
> > in the meantime with this system?
> 
> I guess the best for now is to reset LimitNOFILE to the old value, e.g.:
> 
> DefaultLimitNOFILE=1024
> 
> in /etc/system/system.conf

I'll try this. How do I apply this without restarting the system?
Comment 34 Thomas Blume 2020-03-17 16:24:04 UTC
(In reply to Luis Chamberlain from comment #33)
> Absolutely but please note that tuctuc is actively being used by myself for
> development purposes right now, so I would prefer if not reboots are done.
> The credentials are the same, the IP address is the same as chivo but the
> last octet ends with 196. I've enabled the same user into the wheel group so
> it has sudo privileges. 

ok, thanks

> > DefaultLimitNOFILE=1024
> > 
> > in /etc/system/system.conf
> 
> I'll try this. How do I apply this without restarting the system?

Please try:

systemctl daemon-reexec

If that doesn't work, I guess a reboot is needed, since it is the system daemon (i.e. PID 1) itself.
Comment 35 Luis Chamberlain 2020-03-17 17:13:44 UTC
(In reply to Thomas Blume from comment #34)
> (In reply to Luis Chamberlain from comment #33)
> > > DefaultLimitNOFILE=1024
> > > 
> > > in /etc/system/system.conf
> > 
> > I'll try this.

This is not a workaround -- the OOM still occurs. I had restarted the system, and the OOM still triggers *every* single time I ssh in.

mcgrof@chivo ~ $ sudo grep DefaultLimitNOFILE /etc/systemd/system.conf 
#DefaultLimitNOFILE=1024:524288
DefaultLimitNOFILE=1024

I had logged in twice:

mcgrof@chivo ~ $ sudo journalctl -k | grep "Out of memory"
Mar 17 18:07:58 chivo kernel: Out of memory: Killed process 5995 ((systemd)) total-vm:54694764kB, anon-rss:31910968kB, file-rss:2172kB, shmem-rss:0kB, UID:0 pgtables:64580kB oom_score_adj:0
Mar 17 18:08:22 chivo kernel: Out of memory: Killed process 6060 ((systemd)) total-vm:54694764kB, anon-rss:31928396kB, file-rss:2144kB, shmem-rss:0kB, UID:0 pgtables:64588kB oom_score_adj:0


> How do I apply this without restarting the system?
> 
> Please try:
> 
> systemctl daemon-reexec
> 
> If that doesn't work, I guess a reboot is needed, since it is the system
> daemon (e.g. pid 1) itself.

Thanks, to be safe I just rebooted. I'll keep the above in mind for the future. Is there a way to visualize the values at run time which are in effect?
Comment 36 Luis Chamberlain 2020-03-18 03:30:49 UTC
I had to reboot the tuctuc system; now the oom triggers every time upon an ssh login. I didn't do anything other than a reboot.
Comment 37 Thomas Blume 2020-03-18 09:24:54 UTC
(In reply to Luis Chamberlain from comment #36)
> I had to reboot the tuctuc system, now the oom is triggering every time on
> the system upon an ssh login. I didn't do anything other than a reboot.

I took a look at the logs on tuctuc and all the ooms I see were killing the session of user 1000.
loginctl shows 2 sessions of user 1000, but systemd-cgls shows neither a user@1000.service nor an init scope for this user.
On chivo it is almost the same: most ooms are bound to the user 1000 session, only once was there an oom for unbound-anchor.service.
user 1000 on chivo also doesn't have a user@1000.service or an init scope.

I'm not sure whether this is the result of the oom kill or a special setup for this user. Is user 1000 supposed to be a normal user that goes via pam login?

However, it seems that the limit for the system daemon is fine and the problem is with the user daemon (systemd --user).
In that case the workaround given above was indeed wrong.

Please revert the changes in /etc/systemd/system.conf and edit /usr/lib/systemd/system/user-.slice.d/10-defaults.conf instead.
There, please set MemoryHigh and MemoryMax limits according to the documentation in the systemd.resource-control manpage.
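For example, a drop-in along these lines should work (the values are only an illustration; it can also live under /etc/systemd/system/user-.slice.d/ instead of editing the shipped file):

-->
[Slice]
MemoryHigh=3G
MemoryMax=4G
--<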

Please report whether this fixes the ooms.
Comment 38 Franck Bui 2020-03-18 10:00:37 UTC
(In reply to Luis Chamberlain from comment #31)
> Also, is there a work around I can use in the meantime, to move on with life
> in the meantime with this system?

Sorry for the late reply.

At least there's no simple workaround with systemd.

As written in comment #25, the RLIMIT_NOFILE is directly inherited from PID1 when a service is being created, before initgroups(3) is called. Only after that are the service-specific rlimits initialized.
Comment 39 Luis Chamberlain 2020-03-18 13:19:02 UTC
(In reply to Thomas Blume from comment #37)
> (In reply to Luis Chamberlain from comment #36)
> > I had to reboot the tuctuc system, now the oom is triggering every time on
> > the system upon an ssh login. I didn't do anything other than a reboot.
> 
> I took a look in the logs on tuctuc and all ooms I see, were killing the
> session of user1000.
> loginctl shows 2 session of user1000 but systemd-cgls doesn't show an
> user@1000.service nor an init scope for this user.
> On chivo it is almost the same, most ooms are bound to the user1000 session,
> only once there was an oom for unbound-anchor.service.
> user1000 on chivo also doesn't have an user@1000.service and an init scope.
> 
> I'm not sure whether this is the result of the oom kill or a special setup
> for this user.

I haven't done anything custom.

> Is user1000 supposed to be a normal user that goes via pam
> login?

Yes, no NIS stuff.

> However, it seems that the limit for the system daemon is fine and the
> problem is with the user daemon (systemd --user).
> In that case the workaround given was wrong indeed.
> 
> Please revert the changes in /etc/systemd/system.conf and edit
> /usr/lib/systemd/system/user-.slice.d/10-defaults.conf
> 
> instead.
> There please set MemoryHigh and MemoryMax limits according to the
> documentation in the systemd.resource-control manpage.
> 
> Please report whether this fixes the ooms.

Nope, the full 30 GiB *and* then swap are eaten up and then the OOM triggers. I rebooted the system after making the changes and trying.
Comment 40 Thomas Blume 2020-03-18 14:14:35 UTC
(In reply to Luis Chamberlain from comment #39)
> > However, it seems that the limit for the system daemon is fine and the
> > problem is with the user daemon (systemd --user).
> > In that case the workaround given was wrong indeed.
> > 
> > Please revert the changes in /etc/systemd/system.conf and edit
> > /usr/lib/systemd/system/user-.slice.d/10-defaults.conf
> > 
> > instead.
> > There please set MemoryHigh and MemoryMax limits according to the
> > documentation in the systemd.resource-control manpage.
> > 
> > Please report whether this fixes the ooms.
> 
> Nope, the full 30 GiB *and* then swap are eaten up and then the OOM
> triggers. I rebooted the system after making the changes and trying.

Unfortunately, it seems that the session cleanup was successful this time.
I don't see a user@1000.slice on chivo anymore, so I cannot verify the settings. :(
Could you please retry the login on chivo?
Even better would be if you created a local (non-root) user for me, so that I can test myself and don't need to nag you all the time.
Comment 41 Luis Chamberlain 2020-03-18 23:38:39 UTC
(In reply to Thomas Blume from comment #40)
> (In reply to Luis Chamberlain from comment #39)
> > > However, it seems that the limit for the system daemon is fine and the
> > > problem is with the user daemon (systemd --user).
> > > In that case the workaround given was wrong indeed.
> > > 
> > > Please revert the changes in /etc/systemd/system.conf and edit
> > > /usr/lib/systemd/system/user-.slice.d/10-defaults.conf
> > > 
> > > instead.
> > > There please set MemoryHigh and MemoryMax limits according to the
> > > documentation in the systemd.resource-control manpage.
> > > 
> > > Please report whether this fixes the ooms.
> > 
> > Nope, the full 30 GiB *and* then swap are eaten up and then the OOM
> > triggers. I rebooted the system after making the changes and trying.
> 
> Unfortunately, it seems that the session cleanup was successfull this time.
> I don't see a user@1000.slice on chivo anymore, so I cannot verify the
> settings. :(
> Could you please retry the login on chivo?

Yes, it still triggers.

> Even better would be if you create a local (non root) user for me so that I
> can test myself and don't need to nag you all the time.

I did that early on; on February 29 I provided credentials in private comment #1. Let me know if you have issues logging in. The login will take a while, 20-30 seconds, because the system is OOM'ing while you log in. It will succeed after the OOM and let you in afterwards. As I noted earlier, the best way to visualize the OOM is to use htop:

sudo htop --sort-key PERCENT_MEM  -d 1

The user test is part of the wheel group and sudo is set with:

%wheel ALL=(ALL) NOPASSWD: ALL

The credentials apply to both systems, chivo and tuctuc.
Comment 42 Thomas Blume 2020-03-19 08:34:25 UTC
(In reply to Luis Chamberlain from comment #41)
>
> I did that early on, on February 29 I provided credentials on private
> comment #1. Let me know if you have issues logging in. The login will take a
> while, 20-30 seconds because the system is OOM'ing while you log in. It will
> succeed after the OOM and will let you know after. As I noted earlier as
> well the best way to visualize the OOM is to use htop:
> 
> sudo htop --sort-key PERCENT_MEM  -d 1
> 
> The user test is part of the wheel group and sudo is set with:
> 
> %wheel ALL=(ALL) NOPASSWD: ALL
> 
> The credentials apply to both systems, chivo and tuctuc.

Oops, sorry, I had overlooked your test user.
Now it becomes interesting; systemd-analyze dump shows:

-> Unit user@1001.service:
        Description: User Manager for UID 1001
        Instance: 1001
        Unit Load State: loaded
        Unit Active State: failed
        State Change Timestamp: Thu 2020-03-19 09:12:55 CET
        Inactive Exit Timestamp: Thu 2020-03-19 09:12:39 CET
        Active Enter Timestamp: n/a
        Active Exit Timestamp: n/a
        Inactive Enter Timestamp: Thu 2020-03-19 09:12:55 CET
        May GC: no
        Need Daemon Reload: no
        Transient: no
        Perpetual: no
        Garbage Collection Mode: inactive
        Slice: user-1001.slice
[....]
        -> ExecStart:
                Command Line: /usr/lib/systemd/systemd --user
                        PID: 10907
                        Start Timestamp: Thu 2020-03-19 09:12:39 CET
                        Exit Timestamp: Thu 2020-03-19 09:12:55 CET
                        Exit Code: killed
                        Exit Status: 9
        CPUAccounting: no
        IOAccounting: no
        BlockIOAccounting: no
        MemoryAccounting: yes
        TasksAccounting: yes
        IPAccounting: no
        CPUWeight: 18446744073709551615
        StartupCPUWeight: 18446744073709551615
        CPUShares: 18446744073709551615
        StartupCPUShares: 18446744073709551615
        CPUQuotaPerSecSec: infinity
        CPUQuotaPeriodSec: infinity
        AllowedCPUs: 
        AllowedMemoryNodes: 
        IOWeight: 18446744073709551615
        StartupIOWeight: 18446744073709551615
        BlockIOWeight: 18446744073709551615
        StartupBlockIOWeight: 18446744073709551615
        DefaultMemoryMin: 0
        DefaultMemoryLow: 0
        MemoryMin: 0
        MemoryLow: 0
        MemoryHigh: 18446744073709551615
        MemoryMax: 18446744073709551615
[...]
        TasksMax: 18446744073709551615
[...]
--<

whereas the slice shows:

-->
-> Unit user-1001.slice:
        Description: User Slice of UID 1001
[...]
        MemoryHigh: 3221225472
        MemoryMax: 4294967296
[...]
        TasksMax: 10813
[...]
--<

So, the systemd user instance started in user-1001.slice ignores the limits of the slice for memory as well as for TasksMax.
I guess this is the root cause of the issue.
Franck, since the systemd user instance is started below user-1001.slice, I would expect it to respect the memory limit of the slice.
Any idea why it doesn't?
Comment 43 Luis Chamberlain 2020-03-19 16:50:11 UTC
(In reply to Thomas Blume from comment #42)
> So, the systemd user instance started in user-1001.slice ignores the limits
> of the slice for memory as well as for TasksMax.
> I guess this is the root cause of the issue.

Is the theory here still that the real root cause is the tirpc allocation for each possible file descriptor, and that, in addition, we are now observing the user instance ignore the user slice limits on memory, which *should* in theory contain bogus/buggy allocations such as the tirpc issue?

If so, should we really open two bugs, one for libtirpc and one for systemd?

> Franck, since the systemd user instance is started below the user-1001.slice
> I would expect that it respects the memory limit of the slice.
> Any idea why it doesn't?

Any idea if we can work around this by forcing this to a low limit somehow?

And -- what explains why we didn't observe this all the time? I mean even though tuctuc no longer exhibits good behaviour, I'm sure any new install of Tumbleweed shouldn't be OOM'ing. So a new install should demo a case where this is not being observed. So -- *why* is this only observed on some systems but not all?
Comment 44 Thomas Blume 2020-03-19 17:08:25 UTC
(In reply to Luis Chamberlain from comment #43)
> (In reply to Thomas Blume from comment #42)
> > So, the systemd user instance started in user-1001.slice ignores the limits
> > of the slice for memory as well as for TasksMax.
> > I guess this is the root cause of the issue.
> 
> Is the theory here still that the real root cause is the tirpc allocations
> for each possible file descriptor, and also, now we are observing the user
> instance ignore user slice limits on memory which *should* in theory contain
> bogus / buggy allocations such as the tirp issue?
> 
> If so then should we really open two bugs ? One for libtirpc and one for
> systemd?

I'm not sure anymore whether there is really a bug in libtirpc.
I've found that it was implemented that way in 2007 on purpose.
So all current SLES versions carry a libtirpc with this code.
It was never an issue with the correct limits in place.
So, strictly speaking, the libtirpc code is not really good, but it's not a bug either.
Improving this code would rather qualify as a feature request than as a bug.

On the other hand systemd not respecting the limits of the user slice is clearly a bug. That must not happen.

> > Franck, since the systemd user instance is started below the user-1001.slice
> > I would expect that it respects the memory limit of the slice.
> > Any idea why it doesn't?
> 
> Any idea if we can work around this by forcing this to a low limit somehow?
> 
> And -- what explains why we didn't observe this all the time? I mean even
> though tuctuc no longer exhibits good behaviour, I'm sure any new install of
> Tumbleweed shouldn't be OOM'ing. So a new install should demo a case where
> this is not being observed. So -- *why* is this only observed on some
> systems but not all?

Yes, and even stranger is that it doesn't happen for the root user (user 0).
Since it also lives in the user slice, I would expect the same issue.
I can see that even the systemd instance started for user 0 doesn't respect the slice limits:

-->
-> Unit user@0.service:
        Description: User Manager for UID 0
        Instance: 0
        Unit Load State: loaded
[...]
                Command Line: /usr/lib/systemd/systemd --user
                        PID: 14739
                        Start Timestamp: Thu 2020-03-19 18:04:05 CET
        Status Text: Startup finished in 107ms.
        CPUAccounting: no
        IOAccounting: no
        BlockIOAccounting: no
        MemoryAccounting: yes
        TasksAccounting: yes
        IPAccounting: no
        CPUWeight: 18446744073709551615
        StartupCPUWeight: 18446744073709551615
        CPUShares: 18446744073709551615
        StartupCPUShares: 18446744073709551615
        CPUQuotaPerSecSec: infinity
        CPUQuotaPeriodSec: infinity
        AllowedCPUs: 
        AllowedMemoryNodes: 
        IOWeight: 18446744073709551615
        StartupIOWeight: 18446744073709551615
        BlockIOWeight: 18446744073709551615
        StartupBlockIOWeight: 18446744073709551615
        DefaultMemoryMin: 0
        DefaultMemoryLow: 0
        MemoryMin: 0
        MemoryLow: 0
        MemoryHigh: 18446744073709551615
        MemoryMax: 18446744073709551615
        MemorySwapMax: 18446744073709551615
        MemoryLimit: 18446744073709551615
        TasksMax: 18446744073709551615
--<

Still, the login with this user doesn't create an oom.
I currently have no explanation why.
Comment 45 Luis Chamberlain 2020-03-20 04:51:40 UTC
FWIW -- I upgraded to Tumbleweed 20200314 on the tuctuc system and the issue is still present. I had to upgrade for some other development work on the system.
Comment 46 Thomas Blume 2020-03-23 10:39:05 UTC
(In reply to Luis Chamberlain from comment #45)
> FWIW -- I upgraded to Tumbleweed 20200314 on the tuctuc system and the issue
> is still present. I had to upgrade to some other development work on the
> system.

I've tried on a qemu setup:

-->
g18:~ # cat /etc/os-release 
NAME="openSUSE Tumbleweed"
# VERSION="20200314"
ID="opensuse-tumbleweed"
ID_LIKE="opensuse suse"
VERSION_ID="20200314"
--<

with way less memory:

-->
g18:~ # free -m
              total        used        free      shared  buff/cache   available
Mem:           1977         162        1475           0         339        1659
Swap:          1281           0        1281
--<

and couldn't reproduce the issue:

-->
Control group /:
-.slice
├─1021 bpfilter_umh
├─user.slice 
│ ├─user-0.slice 
│ │ ├─session-1.scope 
│ │ │ ├─3023 sshd: root@pts/0
│ │ │ ├─3048 -bash
│ │ │ ├─4854 systemd-cgls
│ │ │ └─4855 less
│ │ └─user@0.service …
│ │   └─init.scope 
│ │     ├─3029 /usr/lib/systemd/systemd --user
│ │     └─3030 (sd-pam)
│ └─user-1000.slice 
│   ├─user@1000.service …
│   │ └─init.scope 
│   │   ├─4610 /usr/lib/systemd/systemd --user
│   │   └─4611 (sd-pam)
│   └─session-2.scope 
│     ├─4592 sshd: tslokal [priv]
│     ├─4617 sshd: tslokal@pts/1
│     └─4618 -bash
[...]
# ps -o pid,user,rss,vsz 4610
  PID USER       RSS    VSZ
 4610 tslokal  10400  16052
--<

You can see that PID 4610 is consuming about 16 MB, which is just fine for 2 GB of RAM.

Maybe it is bound to big-memory machines?
Could you try rebooting one of your machines with the physical memory reduced using the mem kernel parameter?
Comment 47 Luis Chamberlain 2020-03-23 16:23:34 UTC
(In reply to Thomas Blume from comment #46)
> You can see ps 4610 is consuming about 16MB, which is just fine for 2G of
> RAM.
> 
> Maybe it is bound to big memory machines?
> Could you give it a try rebooting one of your machines with reducing the
> physical memory size using the mem kernel parameter?


I enforced mem=2G and the oom still triggers. I left the system booted with 2G; feel free to edit as you see fit. You do this by editing /etc/default/grub and modifying the GRUB_CMDLINE_LINUX_DEFAULT line, which contains the existing mem=2G parameter. To make it take effect on the next boot, run:

sudo update-bootloader


mcgrof@chivo ~ $ free -h
              total        used        free      shared  buff/cache   available
Mem:          1.6Gi       638Mi       915Mi       0.0Ki        89Mi       885Mi
Swap:         1.0Gi        35Mi       988Mi
mcgrof@chivo ~ $ sudo dmesg| grep oom
[ 2465.363645] (systemd) invoked oom-killer: gfp_mask=0x100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[ 2465.705200]  oom_kill_process.cold+0xb/0x10
[ 2468.573932] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[ 2471.181285] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/init.scope,task=(systemd),pid=2376,uid=0
[ 2471.299473] Out of memory: Killed process 2376 ((systemd)) total-vm:54694760kB, anon-rss:952556kB, file-rss:2200kB, shmem-rss:0kB, UID:0 pgtables:3904kB oom_score_adj:0
[ 2471.483078] oom_reaper: reaped process 2376 ((systemd)), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
mcgrof@chivo ~ $ cat /proc/cmdline 
BOOT_IMAGE=/boot/vmlinuz-5.5.4-1-default root=UUID=22cf9d5d-e128-41a2-8311-4f735a1d1352 console=tty0 console=ttyS0,57600 sysrq_always_enabled panic=100 ignore_loglevel unknown_nmi_panic resume=/dev/disk/by-id/scsi-3300143804052b801-part2 mem=2G mitigations=auto crashkernel=512M-2G:128M,2G-64G:256M,64G-:512M
Comment 48 Luis Chamberlain 2020-03-25 12:59:07 UTC
Note: I have a workaround for now thanks to Anthony Iliopoulos; his suggestion was to disable pam_systemd from PAM:

--- /etc/pam.d/common-session-pc.old	2020-03-25 13:42:38.404526416 +0100
+++ /etc/pam.d/common-session-pc	2020-03-25 13:42:47.308353223 +0100
@@ -28,7 +28,7 @@
 # at the start and end of sessions of *any* kind (both interactive and
 # non-interactive
 #
-session	optional	pam_systemd.so  debug
+#session	optional	pam_systemd.so  debug
 session	required	pam_limits.so	
 session	required	pam_unix.so	try_first_pass 
 session	optional	pam_umask.so

I'll also note that this issue is *still* present if you don't do the above but just disable NIS. I disabled NIS by commenting out /etc/yp.conf, stopping ypserv.service and then disabling it. The only effective way to work around this is disabling pam_systemd.so in common-session-pc as noted above.

But without a system to reproduce, we have no way of debugging it further. 

Thomas, I'll leave chivo with the pam_systemd.so enabled, but I'll disable it on tuctuc. I'll keep using tuctuc for development, feel free to use chivo for any further testing / debugging.
Comment 49 Luis Chamberlain 2020-03-25 13:11:17 UTC
(In reply to Luis Chamberlain from comment #48)
> Note: I have a work around for now thanks to Anthony Iliopoulos, his
> suggestion was to disable systemd from pam:
> 
> --- /etc/pam.d/common-session-pc.old	2020-03-25 13:42:38.404526416 +0100
> +++ /etc/pam.d/common-session-pc	2020-03-25 13:42:47.308353223 +0100
> @@ -28,7 +28,7 @@
>  # at the start and end of sessions of *any* kind (both interactive and
>  # non-interactive
>  #
> -session	optional	pam_systemd.so  debug
> +#session	optional	pam_systemd.so  debug
>  session	required	pam_limits.so	
>  session	required	pam_unix.so	try_first_pass 
>  session	optional	pam_umask.so
> 
> I'll also note that this issue is *still* present if you don't do the above,
> but just disable NIS. I disabled NIS by commenting out /etc/yp.conf and
> stopping ypserv.service and then disabling it. The only effective way to
> work around this is disabling pam_Systemd.so from pam from commom-session-pc
> as noted above.
> 
> But without a system to reproduce, we have no way of debugging it further. 
> 
> Thomas, I'll leave chivo with the pam_systemd.so enabled, but I'll disable
> it on tuctuc. I'll keep using tuctuc for development, feel free to use chivo
> for any further testing / debugging.

Odd -- so no, I'm now able to reproduce this with pam_systemd.so disabled but ypbind.service enabled. The only way to really work around this is to disable pam_systemd.so in /etc/pam.d/common-session-pc as noted earlier, but also to run:

systemctl stop ypbind.service
systemctl disable ypbind.service
Comment 50 Luis Chamberlain 2020-03-25 13:14:52 UTC
(In reply to Luis Chamberlain from comment #49)
> (In reply to Luis Chamberlain from comment #48)
> > Note: I have a work around for now thanks to Anthony Iliopoulos, his
> > suggestion was to disable systemd from pam:
> > 
> > --- /etc/pam.d/common-session-pc.old	2020-03-25 13:42:38.404526416 +0100
> > +++ /etc/pam.d/common-session-pc	2020-03-25 13:42:47.308353223 +0100
> > @@ -28,7 +28,7 @@
> >  # at the start and end of sessions of *any* kind (both interactive and
> >  # non-interactive
> >  #
> > -session	optional	pam_systemd.so  debug
> > +#session	optional	pam_systemd.so  debug
> >  session	required	pam_limits.so	
> >  session	required	pam_unix.so	try_first_pass 
> >  session	optional	pam_umask.so
> > 
> > I'll also note that this issue is *still* present if you don't do the above,
> > but just disable NIS. I disabled NIS by commenting out /etc/yp.conf and
> > stopping ypserv.service and then disabling it. The only effective way to
> > work around this is disabling pam_Systemd.so from pam from commom-session-pc
> > as noted above.
> > 
> > But without a system to reproduce, we have no way of debugging it further. 
> > 
> > Thomas, I'll leave chivo with the pam_systemd.so enabled, but I'll disable
> > it on tuctuc. I'll keep using tuctuc for development, feel free to use chivo
> > for any further testing / debugging.
> 
> Odd, so no, I'm now able to reproduce with pam_systemd.so disabled but
> ypbind.service enabled. The only way to really work around this is to
> disable pam_systemd.so from /etc/pam.d/common-session-pc as noted earlier
> but also running:
> 
> systemctl stop ypbind.service
> systemctl disable ypbind.service

I confirm again: disabling ypbind.service but leaving pam_systemd.so enabled still triggers this on chivo. I am leaving the system there with 2G of ram set upon boot.
Comment 51 Franck Bui 2020-03-25 14:01:57 UTC
(In reply to Luis Chamberlain from comment #48)
> Thomas, I'll leave chivo with the pam_systemd.so enabled,

Can I install a test package on chivo and reboot it ?
Comment 52 Thomas Blume 2020-03-25 14:02:37 UTC
(In reply to Luis Chamberlain from comment #48)
> Note: I have a work around for now thanks to Anthony Iliopoulos, his
> suggestion was to disable systemd from pam:
> 
> --- /etc/pam.d/common-session-pc.old	2020-03-25 13:42:38.404526416 +0100
> +++ /etc/pam.d/common-session-pc	2020-03-25 13:42:47.308353223 +0100
> @@ -28,7 +28,7 @@
>  # at the start and end of sessions of *any* kind (both interactive and
>  # non-interactive
>  #
> -session	optional	pam_systemd.so  debug
> +#session	optional	pam_systemd.so  debug
>  session	required	pam_limits.so	
>  session	required	pam_unix.so	try_first_pass 
>  session	optional	pam_umask.so
> 
> I'll also note that this issue is *still* present if you don't do the above,
> but just disable NIS. I disabled NIS by commenting out /etc/yp.conf and
> stopping ypserv.service and then disabling it. The only effective way to
> work around this is disabling pam_Systemd.so from pam from commom-session-pc
> as noted above.
> 
> But without a system to reproduce, we have no way of debugging it further. 
> 
> Thomas, I'll leave chivo with the pam_systemd.so enabled, but I'll disable
> it on tuctuc. I'll keep using tuctuc for development, feel free to use chivo
> for any further testing / debugging.

Thanks Luis. Actually, Franck found a workaround and I've just verified it on chivo.
Starting nscd suppresses the OOM at the test user's login:

-->
chivo:~/:[0]# systemctl start nscd
chivo:~/:[0]# systemctl status nscd
● nscd.service - Name Service Cache Daemon
     Loaded: loaded (/usr/lib/systemd/system/nscd.service; disabled; vendor preset: enabled)
     Active: active (running) since Wed 2020-03-25 14:56:52 CET; 3min 38s ago
[...]
chivo:~/:[130]# systemd-cgls 
Control group /:
-.slice
├─user.slice 
│ ├─user-1001.slice 
│ │ ├─session-18.scope 
│ │ │ ├─24546 sshd: test [priv]
│ │ │ ├─24567 sshd: test@pts/1
│ │ │ └─24570 -bash
│ │ └─user@1001.service …
│ │   └─init.scope 
│ │     ├─24557 /usr/lib/systemd/systemd --user
│ │     └─24560 (sd-pam)
[...]
chivo:~/:[0]# free -m
              total        used        free      shared  buff/cache   available
Mem:           1643         739         715           0         188         739
Swap:          1023          29         994
--<
Comment 53 Luis Chamberlain 2020-03-25 21:54:43 UTC
(In reply to Franck Bui from comment #51)
> (In reply to Luis Chamberlain from comment #48)
> > Thomas, I'll leave chivo with the pam_systemd.so enabled,
> 
> Can I install a test package on chivo and reboot it ?

Yes you can hammer on that system all you like until the issue can be root-caused. Do as you think is needed :)
Comment 54 Franck Bui 2020-03-26 08:28:16 UTC
(In reply to Luis Chamberlain from comment #53)
> Yes you can hammer on that system all you like until the issue can be
> root-caused. Do as you think is needed :)

Are you kidding ?

This issue has been understood since comment #14...
Comment 55 Luis Chamberlain 2020-03-26 15:55:24 UTC
(In reply to Franck Bui from comment #54)
> (In reply to Luis Chamberlain from comment #53)
> > Yes you can hammer on that system all you like until the issue can be
> > root-caused. Do as you think is needed :)
> 
> Are you kidding ?
> 
> This issue has been understood since comment #14...

In comment #44 Thomas mentioned he was not sure whether libtirpc was the real issue here, so I'll defer to the systemd folks to decide. I'm just happy to make the system available for further diagnosis.
Comment 56 Thomas Blume 2020-04-06 13:17:25 UTC
(In reply to Luis Chamberlain from comment #55)
> (In reply to Franck Bui from comment #54)
> > (In reply to Luis Chamberlain from comment #53)
> > > Yes you can hammer on that system all you like until the issue can be
> > > root-caused. Do as you think is needed :)
> > 
> > Are you kidding ?
> > 
> > This issue has been understood since comment #14...
> 
> On comment #44 Thomas mentioned he was not sure if libtirpc was the real
> issue here. So I'll defer to the systemd folks to decide. I'm just happy to
> make the system available to further diagnose.

I've created a debug version of libtirpc3 and did some testing.
After some discussion with Franck, it turns out that, unlike any other daemon on the system, a systemd user daemon has neither user-slice nor system-slice limits.
Instead it inherits the limits of the parent systemd daemon (i.e. PID 1), whose open-file limit equals /proc/sys/fs/nr_open, i.e. effectively no limit.

I don't want to discuss here whether a systemd user daemon should respect slice limits, but I see that all other daemons do (at least the limits of system.slice).
Therefore, I think there is no danger in limiting libtirpc to the LimitNOFILE of system.slice.
If that were not enough for libtirpc3, all daemons that use it and run with the system-slice limits would have the same problem.
Franck, do you see any problem with this approach for the systemd user daemon?
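
To make the inheritance problem above concrete, here is a minimal C sketch of the failure mode. This is not libtirpc's actual code: the per-descriptor structure is invented for illustration, but the pattern of sizing a table from the RLIMIT_NOFILE hard limit inherited from PID 1 is what goes wrong once that limit equals fs.nr_open.

/* Minimal sketch of the failure mode, not libtirpc's code: struct slot is
 * hypothetical, but sizing a per-descriptor table from the RLIMIT_NOFILE
 * hard limit is the pattern that blows up here. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

struct slot { void *handle; int state; };   /* hypothetical per-fd entry */

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return 1;

        size_t n = rl.rlim_max;   /* inherited from PID 1: can be ~1073741816 */
        printf("would allocate %zu slots, roughly %zu MiB\n",
               n, n * sizeof(struct slot) / (1024 * 1024));

        /* On an affected system this single calloc() is already many GiB. */
        struct slot *table = calloc(n, sizeof(*table));
        if (!table)
                return 1;
        free(table);
        return 0;
}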
Comment 57 Thomas Blume 2020-04-08 15:32:31 UTC
Created attachment 835243 [details]
0001-execute-set-rlimits-before-using-libtirpc-which-uses.patch

The attached patch fixed it on chivo.

Luis, I have built a testpackage with the latest systemd and this patch at:

https://download.opensuse.org/repositories/home:/tsaupe:/branches:/Base:/System:/systemd-bsc1165351/openSUSE_Factory/

It would be good if you could test this package too (probably on your other machine or on a new installation).

Franck, if you have no objections, I would send this patch upstream.
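
The idea behind the patch, as also explained later in comment #61 and #64, can be sketched roughly as follows. This is a simplified illustration, not the attached patch itself; the user name and limit values are placeholders. The point is to apply the child's own RLIMIT_NOFILE before calling initgroups(3), because initgroups() can resolve supplementary groups through NSS and NIS, which pulls in libtirpc, and libtirpc sizes its buffers from whatever limit is in effect at that moment.

#include <grp.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Placeholder limits standing in for the unit's configured LimitNOFILE. */
static int setup_child(const char *user, gid_t gid)
{
        struct rlimit nofile = { .rlim_cur = 1024, .rlim_max = 512 * 1024 };

        /* 1. Drop the huge limit inherited from PID 1 first... */
        if (setrlimit(RLIMIT_NOFILE, &nofile) < 0)
                return -1;

        /* 2. ...and only then call into NSS, where libtirpc may be used. */
        if (initgroups(user, gid) < 0)
                return -1;

        return 0;
}

int main(void)
{
        /* Needs root to actually set the groups; placeholder user and gid. */
        return setup_child("nobody", 65534) < 0 ? 1 : 0;
}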
Comment 58 Luis Chamberlain 2020-04-08 18:20:29 UTC
(In reply to Thomas Blume from comment #57)
> Created attachment 835243 [details]
> 0001-execute-set-rlimits-before-using-libtirpc-which-uses.patch
> 
> The attached patch fixed it on chivo.
> 
> Luis, I have built a testpackage with the latest systemd and this patch at:
> 
> https://download.opensuse.org/repositories/home:/tsaupe:/branches:/Base:/
> System:/systemd-bsc1165351/openSUSE_Factory/
> 
> It would be good if you could test this package too (probably on your other
> machine or on a new installation).

I have tested this on the other failing system and confirm that this fixes the issue, thanks!
Comment 59 Franck Bui 2020-04-09 09:21:29 UTC
(In reply to Thomas Blume from comment #57)
> Franck, if you have no objections, I would send this patch upstream.

Sure, go ahead.
Comment 60 Luis Chamberlain 2020-04-19 21:13:30 UTC
(In reply to Franck Bui from comment #59)
> (In reply to Thomas Blume from comment #57)
> > Franck, if you have no objections, I would send this patch upstream.
> 
> Sure, go ahead.

So I still am a bit perplexed about one thing. Why is this issue only being observed now? I checked the code lineage at least in systemd, and this seems to have existed for a long time now.
Comment 61 Thomas Blume 2020-04-20 08:58:31 UTC
(In reply to Luis Chamberlain from comment #60)
> (In reply to Franck Bui from comment #59)
> > (In reply to Thomas Blume from comment #57)
> > > Franck, if you have no objections, I would send this patch upstream.
> > 
> > Sure, go ahead.
> 
> So I still am a bit perplexed about one thing. Why is this issue only being
> observed now? I checked the code lineage at least in systemd, and this seems
> to have existed for a long time now.

IMHO, using the rlimits of the systemd user daemon in libtirpc has never worked, but that was hidden because the NOFILE limit of PID 1 used to be lower as well.
The issue became visible after the NOFILE limit for PID 1 was increased.
Once the rlimits for the systemd user daemon are applied before the libtirpc call, it starts working; this is what my patch does.
However, it seems that upstream still doubts that this is a systemd issue, see:

https://github.com/systemd/systemd/pull/15423

There is no question that the code in libtirpc also needs to be improved, but the trigger of this bug, at least on a standard Tumbleweed installation, is that libtirpc is using an rlimit that hasn't been applied yet.
Comment 62 Jaime Caamaño Ruiz 2020-05-18 13:45:50 UTC
I took this off Thomas' hands.

I have not seen it specifically mentioned here (apologies if it was already known), but this seems to be ultimately caused by the bump of nr_open introduced in systemd 240 [1]. systemd inherits its nofile limit (hard & soft) from nr_open after this bump. Before this change we would be running with an nr_open of 1024*1024=1048576, but from 240 on it is bumped to 1073741816. A possible workaround is to use the build option systemd provides, -Dbump-proc-sys-fs-nr-open, to avoid this bump and remain on the previous value until we have a permanent fix in libtirpc. Franck, what do you think of this?

In the meantime, steved has applied a patch to use the soft limit instead of the hard limit, but this won't fix the issue for systemd PID 1. I will let him know.

[1] https://github.com/systemd/systemd/blob/1742aae2aa8cd33897250d6fcfbe10928e43eb2f/NEWS#L96
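
A rough order-of-magnitude check makes these numbers tangible (the ~32 bytes of per-descriptor state is an assumption for illustration, not libtirpc's measured footprint):

  1,048,576     descriptors * 32 B  ~  32 MiB   (nr_open before systemd 240)
  1,073,741,816 descriptors * 32 B  ~  32 GiB   (nr_open from systemd 240 on)

The second figure is in the same ballpark as the ~30 GiB of RAM the reporter saw consumed at login.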
Comment 63 Franck Bui 2020-05-19 06:47:47 UTC
(In reply to Jaime Caamaño Ruiz from comment #62)
> In the meantime, steved has applied a patch to use the soft limit instead of
> the hard limit, but this wont fix the issue for systemd PID1. I will let him
> know.

Which issue for systemd are you talking about ?

If the patch you mentioned is the one suggested since nearly the beginning of this report (comment #16), then that patch alone should be enough to fix this bug, since AFAIK only libtirpc is affected when the nofile hard limit is bumped.
Comment 64 Jaime Caamaño Ruiz 2020-05-19 09:58:46 UTC
(In reply to Franck Bui from comment #63)
> Which issue for systemd are you talking about ?
> 
> If the patch you mentioned is the one suggested since nearly the beginning
> of this report (comment #16) then this patch alone should be enough to fix
> this bug since AFAIK only libtirpc is affected when the nofile hard limit is
> bumped.

Yes, that is the patch. But note that the soft limit has also been bumped; quoting comment #16 again:

> But that won't be sufficient unfortunately because PID1 initializes the rlimits > of its child *after* calling initgroups(3). Therefore the soft limit will still > be the one used by PID1 which is still equal to the hard limit.

I understand that PID 1 calls initgroups(3) with its own soft limit, which has increased from 1048576 to 1073741816 since version 240.
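
The limitation can be illustrated with a small check (illustrative only, not the libtirpc patch itself): sizing from the soft limit instead of the hard limit only helps if the soft limit is actually lower at the point where the lookup runs, which is not the case for a child of PID 1 before its own limits have been applied.

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
        struct rlimit rl;

        if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
                return 1;

        /* In the scenario from this bug, the user-manager child still runs
         * with PID 1's own limits here, where both values are around
         * 1073741816 (see comments #62 and #64). */
        printf("soft (rlim_cur) = %llu\n", (unsigned long long)rl.rlim_cur);
        printf("hard (rlim_max) = %llu\n", (unsigned long long)rl.rlim_max);

        /* So switching a size calculation from rl.rlim_max to rl.rlim_cur
         * changes nothing for this particular code path. */
        return 0;
}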
Comment 65 Franck Bui 2020-05-25 08:16:07 UTC
(In reply to Jaime Caamaño Ruiz from comment #64)
> I understand that PID1 calls initgroups(3) with it's own soft limit, which
> has increased from 1048576 to 1073741816 since version 240.

Ah right I already forgot...

Thomas, so the only fix (or rather workaround) from libtirpc we can expect is the one that uses the soft limit ?

Can you please provide the link where this was being discussed ?
Comment 66 Franck Bui 2020-05-25 08:22:26 UTC
(In reply to Jaime Caamaño Ruiz from comment #62)
> possible workaround is to use the build option provided by systemd
> Dbump-proc-sys-fs-nr-open, to avoid this bump and remain on the previous
> value, until we have a permanent fix in libtirpc. Frank, what do you think
> of this?

Yes, that seems to be the only workaround we currently have to support systems running NIS (I'm wondering how much NIS is used these days, though), since systemd upstream is not willing to move the call to initgroups() because libtirpc has to be fixed anyway...
Comment 67 Thomas Blume 2020-05-25 08:26:00 UTC
(In reply to Franck Bui from comment #65)
> (In reply to Jaime Caamaño Ruiz from comment #64)
> > I understand that PID1 calls initgroups(3) with it's own soft limit, which
> > has increased from 1048576 to 1073741816 since version 240.
> 
> Ah right I already forgot...
> 
> Thomas, so the only fix (or rather workaround) from libtirpc we can expect
> is the one that uses the soft limit ?
> 
> Can you please provide the link where this was being discussed ?

Hi Franck, the discussion thread was here:

https://sourceforge.net/p/libtirpc/mailman/message/36989375/


Jaime, are you ok to reassign the bug to you?
Comment 68 Franck Bui 2020-05-25 08:44:26 UTC
(In reply to Thomas Blume from comment #67)
> 
> https://sourceforge.net/p/libtirpc/mailman/message/36989375/
> 

"At this point I would rather get NIS up and running now, as well 
as some other people seeing this problem. Patch coming."

So the guy is not aware that this workaround won't work with systems running systemd (or any application that would raise its soft limit)...

Anyway, I'll disable the bump of /proc/sys/fs/nr_open as a temporary workaround, although I fear it's going to stay forever. I already tested it one or two months ago and it did the trick.

But someone should really try to push for a real fix in libtirpc, so please don't close this bug until the real work is done.
Comment 69 Jaime Caamaño Ruiz 2020-05-29 11:18:43 UTC
I submitted a patch with the linked list implementation [1], but even in the best case I don't expect it to be merged any time soon. Franck, should we close this bug? That's your call.

BTW, do we have clear reproduction steps so I can try the patch out?


[1] https://sourceforge.net/p/libtirpc/mailman/message/37024059/
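
For context, the general idea of such a patch can be sketched as follows. This is an illustration of the approach, not the code submitted in [1]; the names and structures are made up. Registered descriptors are kept in a linked list, so memory grows with the number of connections actually in use rather than with the RLIMIT_NOFILE maximum.

#include <stdlib.h>

struct fd_node {
        int fd;
        void *xprt;                 /* hypothetical per-connection handle */
        struct fd_node *next;
};

static struct fd_node *fd_list;

/* O(1) insert; memory is proportional to registered fds, not to the limit. */
static int fd_register(int fd, void *xprt)
{
        struct fd_node *n = malloc(sizeof(*n));
        if (!n)
                return -1;
        n->fd = fd;
        n->xprt = xprt;
        n->next = fd_list;
        fd_list = n;
        return 0;
}

/* Linear lookup by descriptor; fine for the small numbers of fds involved. */
static void *fd_lookup(int fd)
{
        for (struct fd_node *n = fd_list; n; n = n->next)
                if (n->fd == fd)
                        return n->xprt;
        return NULL;
}

int main(void)
{
        int token;
        if (fd_register(3, &token) < 0)
                return 1;
        return fd_lookup(3) == &token ? 0 : 1;
}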
Comment 70 Jaime Caamaño Ruiz 2020-05-29 11:20:23 UTC
(In reply to Thomas Blume from comment #67)
 
> Jaime, are you ok to reassign the bug to you?

I'm ok with that although maybe it is Franck who is going to really solve this.
Comment 71 Franck Bui 2020-05-29 12:09:05 UTC
(In reply to Jaime Caamaño Ruiz from comment #69)
> I submitted a patch with the linked list implementation [1] but I don't
> expect it to be merged any time soon in the best of cases.

Great !

> Franck should we close this bug on your call?

Nope; as I wrote, the fix belongs in libtirpc, so this bug should stay open until that project is fixed.

I simply submitted a workaround for convenience.
Comment 72 Luis Chamberlain 2020-06-01 13:28:35 UTC
Is there a test case that could be written so we can easily determine whether a release has the proper fix merged? Does systemd have a set of tests run prior to release?
Comment 73 Luis Chamberlain 2020-06-09 20:07:42 UTC
Do we have a fix merged on Tumbleweed yet?
Comment 74 Luis Chamberlain 2020-06-10 18:38:54 UTC
FWIW, I'm now on kernel 5.6.14-1-default and libtirpc3-1.2.6-1.14.x86_64 and I don't see the memory-hog issue. However, with NIS enabled and pam_systemd.so still enabled in /etc/pam.d/common-session-pc, I see *huge* delays on login, even if it's not a NIS login, *and* even on sudo su -
Comment 75 Thomas Blume 2020-06-15 12:12:28 UTC
(In reply to Luis Chamberlain from comment #73)
> Do we have a fix merged on Tumbleweed yet?

It was accepted to staging 4 days ago:

https://build.opensuse.org/request/show/809898

Might take a few more days until it gets released.
Comment 76 Luis Chamberlain 2021-01-18 14:38:33 UTC
(In reply to Franck Bui from comment #71)
> (In reply to Jaime Caamaño Ruiz from comment #69)
> > I submitted a patch with the linked list implementation [1] but I don't
> > expect it to be merged any time soon in the best of cases.
> 
> Great !
> 
> > Franck should we close this bug on your call?
> 
> Nope, as I wrote the fix belong to libtirpc, so until this project is fixed
> this bug should stay opened.
> 
> I simply submitted a workaround for convenience.

Any luck with this being fixed properly upstream?
Comment 77 Thomas Blume 2021-01-18 16:42:23 UTC
(In reply to Luis Chamberlain from comment #76)
> (In reply to Franck Bui from comment #71)
> > (In reply to Jaime Caamaño Ruiz from comment #69)
> > > I submitted a patch with the linked list implementation [1] but I don't
> > > expect it to be merged any time soon in the best of cases.
> > 
> > Great !
> > 
> > > Franck should we close this bug on your call?
> > 
> > Nope, as I wrote the fix belong to libtirpc, so until this project is fixed
> > this bug should stay opened.
> > 
> > I simply submitted a workaround for convenience.
> 
> Any luck with this being fixed properly upstream?

Yeah, that has been fixed in the libtirpc code.
Jaime, can you take over?
Comment 78 Jaime Caamaño Ruiz 2021-01-19 13:22:58 UTC
Yes, this should be fixed in libtirpc 1.3.1, which is already included in Tumbleweed.
Comment 79 Thomas Blume 2022-07-18 08:46:23 UTC
Already fixed, closing.