Bugzilla – Bug 1215636
virtlxcd dying constantly killing all containers with it
Last modified: 2023-10-03 16:13:50 UTC
Hello, since a couple of updates I've been having an issue with lxd containers in libvirt/virtmanager. At random times, previously once per day and now even more often, they're being killed for no reason. No significant errors in the logs whatsoever. While investigating I came across virtlxcd and noticed it has a 2 hour timeout, and every time the service dies it takes all containers with it. I don't know why it has a timeout, but I think it's intentional. What I don't think is intentional is killing all the containers, so please advise.
(In reply to Michał Szczepaniak from comment #0)
> Hello, since couple of updates I've been having issue with lxd containers in
> libvirt/virtmanager. At random times, previously once per day now even more
> often, they're being killed for no reason. No significant errors in logs
> whatsoever. While investigating i came across virtlxcd and noticed it has 2
> hour timeout, and every time the service dies it takes all containers with
> it. I don't know why it has timeout, but I think it's intentional what I
> don't think is intentional is killing all containers so please advise

The default timeout is 2 minutes, not 2 hours. Regardless, virtlxcd should not terminate when it is managing active "VMs". The timeout should be inhibited in that case. I've stared at the inhibition code for quite some time and it looks correct. I'll need to reproduce the issue myself and poke around with gdb.

In the meantime, you can override the timeout with 'systemctl edit --full virtlxcd.service' and remove the '--timeout 120' from VIRTLXCD_ARGS. This will prevent the daemon from terminating. Can you check if your containers are fine when virtlxcd is started without the timeout option?
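A drop-in is one possible alternative to editing the full unit file. The sketch below is only an illustration: it assumes the unit sets VIRTLXCD_ARGS through an `Environment=` line (so a later assignment in a drop-in replaces it), and the file name `no-timeout.conf` and the `DROPIN_DIR` variable are made up for this example. By default it writes to a scratch directory so nothing on the system is changed.

```shell
# Sketch only: stage a drop-in that clears VIRTLXCD_ARGS so no --timeout is passed.
# DROPIN_DIR defaults to a scratch directory; to apply it for real, point it at
# /etc/systemd/system/virtlxcd.service.d and run as root.
DROPIN_DIR="${DROPIN_DIR:-$(mktemp -d)}"
DROPIN_FILE="$DROPIN_DIR/no-timeout.conf"   # hypothetical file name

mkdir -p "$DROPIN_DIR"
cat > "$DROPIN_FILE" <<'EOF'
[Service]
# Assumption: the unit defines VIRTLXCD_ARGS via Environment=.
# An empty assignment here replaces the "--timeout 120" default.
Environment=VIRTLXCD_ARGS=
EOF

# Show what was written.
cat "$DROPIN_FILE"

# After installing the file in the real drop-in directory:
#   systemctl daemon-reload
#   systemctl restart virtlxcd.service
```

If the distribution also sets VIRTLXCD_ARGS in an environment file read by the unit, that file would need the same change, since values from an environment file take precedence over `Environment=`.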
Ah, 2 minutes, yeah. I've noticed that even if it was 2h it doesn't always kill the containers. I will try without the timeout, sure.

Still no change, and containers die about once per day, sometimes a couple of times per day.
Another piece of information (because there's never enough information): I have this in cmdline:

splash=silent quiet elevator=noop cgroup_enable=memory systemd.unified_cgroup_hierarchy=0 isolcpus=1,2,3,4,5,7,8,9,10,11 mitigations=auto

I'm including this because previously I had systemd.unified_cgroup_hierarchy=1, and recently I had to switch it off and enable the memory cgroup or the containers wouldn't start, so maybe there are other cgroups I need to enable?
(In reply to Michał Szczepaniak from comment #3)
> cgroup_enable=memory
> systemd.unified_cgroup_hierarchy=0 isolcpus=1,2,3,4,5,7,8,9,10,11

I don't have any of these in the kernel command line of my Tumbleweed host.

> I'm including this because previously I had
> systemd.unified_cgroup_hierarchy=1 and recently i had to switch it off and
> enable the memory cgroup or the containers wouldn't start so maybe there are
> other cgroups i need to enable?

Interesting. Can you start your containers if you remove the above kernel options, but enable DefaultMemoryAccounting as described in the following bug comment?

https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7

BTW, I reproduced virtlxcd terminating after 2 minutes even with a running container. I'll need to investigate further. However, in my case the container did continue to run, although it's quite simple:

<domain type='lxc'>
  <name>vm1</name>
  <memory>500000</memory>
  <os>
    <type>exe</type>
    <init>/bin/sh</init>
  </os>
  <vcpu>1</vcpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/libvirt_lxc</emulator>
    <interface type='network'>
      <source network='default'/>
    </interface>
    <console type='pty' />
  </devices>
</domain>
Even with virtlxcd terminating every 2 minutes the containers stay alive and only die about once per day. Also, if I run systemctl restart virtlxcd it kills all containers immediately. I don't know if it should, but just reporting. I will try the cmdline thing.
Created attachment 869801 [details]
log from when containers died

In case this helps, here's the log from journalctl I caught right when the containers died.
Another interesting thing is that they seem to die when I connect to libvirt via virtmanager. I'm connecting from a different host via ssh. But of course it doesn't happen every time.
Have you tried removing the cgroup_enable, systemd.unified_cgroup_hierarchy, and isolcpus kernel parameters, and overriding DefaultMemoryAccounting=no as I suggested in comment #4? Let me know if you have any questions about that.

(In reply to Michał Szczepaniak from comment #6)
> Created attachment 869801 [details]
> log from when containers died

From this log it appears virtlxcd has crashed. Do you see any coredumps via 'coredumpctl list virtlxcd'? If so, provide the crashing stack trace with 'coredumpctl info virtlxcd'.

(In reply to Michał Szczepaniak from comment #5)
> also if i run systemctl restart virtlxcd it kills all containers
> immediately. Don't know if it should but just reporting

Hmm, I don't see this behavior. My test containers continue running fine across virtlxcd restarts. I'll leave the containers running over the weekend and see if they mysteriously disappear as you've seen.
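For the kernel-parameter question at the top of this comment, a small check can show which of the three parameters are still on the command line after a reboot. This is a minimal sketch; the function name `flagged_params` is made up for illustration, and it reads /proc/cmdline only when no argument is given.

```shell
# List the kernel parameters named in this comment that are still present.
# Takes a command line as its argument; defaults to the running kernel's.
flagged_params() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    # Unquoted on purpose: split the command line into words.
    for word in $cmdline; do
        case "$word" in
            cgroup_enable=*|systemd.unified_cgroup_hierarchy=*|isolcpus=*)
                printf '%s\n' "$word" ;;
        esac
    done
}

# Example using the command line quoted earlier in this report:
flagged_params "splash=silent quiet elevator=noop cgroup_enable=memory systemd.unified_cgroup_hierarchy=0 isolcpus=1,2,3,4,5,7,8,9,10,11 mitigations=auto"
```

Calling `flagged_params` with no argument on the affected host should print nothing once the parameters have been removed and the machine rebooted.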
I will be trying it today. Sorry I couldn't try it earlier; I broke my backups and had to resend everything, which is about a 3 day process.
So far with DefaultMemoryAccounting=yes it hasn't crashed, neither overnight, nor when I'm connecting, nor when I'm restarting, so it's very promising.
Yeah, I think it's solved, thanks for the help! Anything I should do with DefaultMemoryAccounting?
(In reply to Michał Szczepaniak from comment #11)
> Yeah i think it's solved, thanks for help! anything i should do with
> DefaultMemoryAccounting?

I'm not sure what you mean? The libvirt lxc driver expects the memory controller to be available under /sys/fs/cgroup/machine.slice/, which requires overriding the DefaultMemoryAccounting=no setting in /usr/lib/systemd/system.conf.d/__20-defaults-SUSE.conf. See https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7 for how to do that.

I'm going to close this bug for now with status 'resolved -> worksforme'. We didn't really fix anything, only adjusted configuration. Thanks for reporting the issue and the timely responses!
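One way to confirm that the memory controller is actually available under machine.slice is to look at that cgroup's controller list. The sketch below assumes a cgroup v2 (unified hierarchy) layout, where each cgroup has a `cgroup.controllers` file; the function name `has_memory_controller` is made up for illustration.

```shell
# Report whether "memory" is listed in a cgroup's controller list.
# Assumes the cgroup v2 layout; the path can be overridden for testing.
# Returns 0 if listed, 1 if not listed, 2 if the file cannot be read.
has_memory_controller() {
    controllers_file="${1:-/sys/fs/cgroup/machine.slice/cgroup.controllers}"
    [ -r "$controllers_file" ] || return 2   # slice missing or not cgroup v2
    for c in $(cat "$controllers_file"); do
        [ "$c" = memory ] && return 0
    done
    return 1
}

if has_memory_controller; then
    echo "memory controller available in machine.slice"
else
    echo "memory controller not listed (or machine.slice not present)"
fi
```

Note that machine.slice may not exist until a container or VM has been started, so a "not present" result on an idle host is not conclusive. `systemctl show --property=DefaultMemoryAccounting` should also report the value the systemd manager is currently using.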
I was talking more about not modifying files in /usr and using a more permanent config location :D
(In reply to Michał Szczepaniak from comment #13)
> I was more talking about not modifying files in /usr and more permanent
> config location :D

Described here https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7 :-)
Oki thanks a ton for help! Tho I might be back with another issue :P