Bug 1215636 - virtlxcd dying constantly killing all containers with it
Summary: virtlxcd dying constantly killing all containers with it
Status: RESOLVED WORKSFORME
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Virtualization:Tools
Version: Current
Hardware: x86-64 openSUSE Tumbleweed
Importance: P5 - None : Major
Target Milestone: ---
Assignee: James Fehlig
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2023-09-23 20:05 UTC by Michał Szczepaniak
Modified: 2023-10-03 16:13 UTC
CC List: 1 user

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
log from when containers died (6.21 KB, text/x-log)
2023-09-27 23:38 UTC, Michał Szczepaniak

Description Michał Szczepaniak 2023-09-23 20:05:34 UTC
Hello, for the past few updates I've been having an issue with LXC containers in libvirt/virt-manager. At random times (previously about once per day, now even more often) they are killed for no apparent reason, with no significant errors in the logs. While investigating, I came across virtlxcd and noticed it has a 2 hour timeout, and every time the service dies it takes all containers with it. I don't know why it has a timeout, but I assume that is intentional. What I don't think is intentional is killing all the containers, so please advise.
Comment 1 James Fehlig 2023-09-27 20:35:02 UTC
(In reply to Michał Szczepaniak from comment #0)
> Hello, for the past few updates I've been having an issue with LXC
> containers in libvirt/virt-manager. At random times (previously about once
> per day, now even more often) they are killed for no apparent reason, with
> no significant errors in the logs. While investigating, I came across
> virtlxcd and noticed it has a 2 hour timeout, and every time the service
> dies it takes all containers with it. I don't know why it has a timeout,
> but I assume that is intentional. What I don't think is intentional is
> killing all the containers, so please advise.

The default timeout is 2 minutes, not 2 hours. Regardless, virtlxcd should not terminate when it is managing active "VMs". The timeout should be inhibited in that case. I've stared at the inhibition code for quite some time and it looks correct. I'll need to reproduce the issue myself and poke around with gdb.

In the meantime, you can override the timeout with 'systemctl edit --full virtlxcd.service' and remove the '--timeout 120' from VIRTLXCD_ARGS. This will prevent the daemon from terminating. Can you check if your containers are fine when virtlxcd is started without the timeout option?
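
As an alternative to editing the full unit, the same change can be sketched as a drop-in that clears the daemon arguments. This is only a sketch: it assumes the packaged unit sets VIRTLXCD_ARGS through an Environment= line and passes it to the daemon on ExecStart, which 'systemctl cat virtlxcd.service' should confirm or refute. The function name and the file name no-timeout.conf are made up for illustration.

```shell
# Sketch only: write a drop-in that empties VIRTLXCD_ARGS so the daemon
# starts without '--timeout 120'.
# Assumption: the unit contains Environment=VIRTLXCD_ARGS="--timeout 120".
# $1 = drop-in directory (defaults to the usual location for this unit).
write_no_timeout_dropin() {
    dropin_dir="${1:-/etc/systemd/system/virtlxcd.service.d}"
    mkdir -p "$dropin_dir" || return 1
    # no-timeout.conf is a hypothetical name; any *.conf name works.
    printf '[Service]\nEnvironment=VIRTLXCD_ARGS=\n' > "$dropin_dir/no-timeout.conf"
}
```

Run as root with no argument, this writes the file under /etc; 'systemctl daemon-reload' and a restart of virtlxcd are then needed before the change takes effect.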
Comment 2 Michał Szczepaniak 2023-09-27 20:50:40 UTC
Ah, 2 minutes, yeah. I've noticed that even when I thought it was 2 hours, it did not always kill the containers. Sure, I will try without the timeout.

Still no change: the containers die about once per day, sometimes a couple of times per day.
Comment 3 Michał Szczepaniak 2023-09-27 20:55:03 UTC
One more piece of information (because there's never enough information): I have this in the kernel cmdline:

splash=silent quiet elevator=noop cgroup_enable=memory systemd.unified_cgroup_hierarchy=0 isolcpus=1,2,3,4,5,7,8,9,10,11 mitigations=auto

I'm including this because previously I had systemd.unified_cgroup_hierarchy=1, and recently I had to switch it off and enable the memory cgroup or the containers wouldn't start. So maybe there are other cgroups I need to enable?
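
To confirm which of these parameters the running kernel was actually booted with, the command line can be read back from /proc/cmdline. The helper below is a minimal sketch; cmdline_value is a made-up name, and it only handles simple key=value words without quoting.

```shell
# Print the value of a kernel parameter, or nothing if it is absent.
# $1 = parameter name, $2 = command line (defaults to /proc/cmdline).
# The body runs in a subshell so 'set -f' and the variables stay local.
cmdline_value() (
    set -f                      # do not glob-expand words of the cmdline
    key="$1"
    line="${2:-$(cat /proc/cmdline)}"
    for word in $line; do
        case "$word" in
            "$key"=*) printf '%s\n' "${word#*=}" ;;
        esac
    done
)
```

For example, 'cmdline_value systemd.unified_cgroup_hierarchy' prints the configured value on the host above and prints nothing on a host where the parameter is not set.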
Comment 4 James Fehlig 2023-09-27 21:26:59 UTC
(In reply to Michał Szczepaniak from comment #3)
> cgroup_enable=memory
> systemd.unified_cgroup_hierarchy=0 isolcpus=1,2,3,4,5,7,8,9,10,11

I don't have any of these in the kernel command line of my Tumbleweed host.

> I'm including this because previously I had
> systemd.unified_cgroup_hierarchy=1, and recently I had to switch it off and
> enable the memory cgroup or the containers wouldn't start. So maybe there
> are other cgroups I need to enable?

Interesting. Can you start your containers if you remove the above kernel options, but enable DefaultMemoryAccounting as described in the following bug comment?

https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7

BTW, I reproduced virtlxcd terminating after 2 minutes even with a running container. I'll need to investigate further. However, in my case the container did continue to run, although it is quite simple:

<domain type='lxc'>
  <name>vm1</name>
  <memory>500000</memory>
  <os>
    <type>exe</type>
    <init>/bin/sh</init>
  </os>
  <vcpu>1</vcpu>
  <clock offset='utc'/>
  <on_poweroff>destroy</on_poweroff>
  <on_reboot>restart</on_reboot>
  <on_crash>destroy</on_crash>
  <devices>
    <emulator>/usr/libexec/libvirt_lxc</emulator>
    <interface type='network'>
      <source network='default'/>
    </interface>
    <console type='pty' />
  </devices>
</domain>
Comment 5 Michał Szczepaniak 2023-09-27 21:43:31 UTC
Even with virtlxcd terminating every 2 minutes, the containers stay alive and only die about once per day.

Also, if I run systemctl restart virtlxcd, it kills all containers immediately. I don't know if it should, but I'm just reporting it.

I will try the cmdline thing.
Comment 6 Michał Szczepaniak 2023-09-27 23:38:03 UTC
Created attachment 869801 [details]
log from when containers died

In case this helps, here's a log from journalctl that I caught right when the containers died.
Comment 7 Michał Szczepaniak 2023-09-28 15:00:14 UTC
Another interesting thing is that they seem to die when I connect to libvirt via virt-manager.

I'm connecting from a different host via ssh. Of course, it doesn't happen every time.
Comment 8 James Fehlig 2023-09-29 21:33:13 UTC
Have you tried removing the cgroup_enable, systemd.unified_cgroup_hierarchy, and isolcpus kernel parameters, and overriding the default DefaultMemoryAccounting=no (that is, setting DefaultMemoryAccounting=yes), as I suggested in comment #4? Let me know if you have any questions about that.

(In reply to Michał Szczepaniak from comment #6)
> Created attachment 869801 [details]
> log from when containers died

From this log it appears virtlxcd has crashed. Do you see any coredumps via 'coredumpctl list virtlxcd'? If so, provide the crashing stack trace with 'coredumpctl info virtlxcd'.

(In reply to Michał Szczepaniak from comment #5)
> Also, if I run systemctl restart virtlxcd, it kills all containers
> immediately. I don't know if it should, but I'm just reporting it.

Hmm, I don't see this behavior. My test containers continue running fine across virtlxcd restarts. I'll leave the containers running over the weekend and see if they mysteriously disappear as you've seen.
Comment 9 Michał Szczepaniak 2023-09-29 21:36:28 UTC
I will be trying it today. Sorry I couldn't try it earlier; I broke my backups and had to resend everything, which is about a 3 day process.
Comment 10 Michał Szczepaniak 2023-09-30 10:51:20 UTC
So far with DefaultMemoryAccounting=yes it hasn't crashed, neither overnight, nor when I'm connecting, nor when I'm restarting, so it's very promising.
Comment 11 Michał Szczepaniak 2023-10-02 08:11:22 UTC
Yeah, I think it's solved, thanks for the help! Is there anything I should do with DefaultMemoryAccounting?
Comment 12 James Fehlig 2023-10-03 16:10:08 UTC
(In reply to Michał Szczepaniak from comment #11)
> Yeah, I think it's solved, thanks for the help! Is there anything I should
> do with DefaultMemoryAccounting?

I'm not sure what you mean? The libvirt lxc driver expects the memory controller to be available under /sys/fs/cgroup/machine.slice/, which requires overriding the DefaultMemoryAccounting=no setting shipped in /usr/lib/systemd/system.conf.d/__20-defaults-SUSE.conf. See https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7 for how to do that.
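
Whether the memory controller is actually available there can be checked by hand. The sketch below assumes the unified (cgroup v2) hierarchy, where each cgroup directory has a cgroup.controllers file listing its available controllers, and it assumes machine.slice already exists (it may not until something has been started in it). has_cgroup_controller is a made-up helper name.

```shell
# Succeed if controller $1 appears in the space-separated list $2.
# With no $2, read the list from machine.slice (cgroup v2 layout assumed).
# The body runs in a subshell so 'exit' only sets the return status.
has_cgroup_controller() (
    set -f
    want="$1"
    list="${2:-$(cat /sys/fs/cgroup/machine.slice/cgroup.controllers 2>/dev/null)}"
    for c in $list; do
        [ "$c" = "$want" ] && exit 0
    done
    exit 1
)
```

On a host set up as described, 'has_cgroup_controller memory && echo available' should print 'available'; silence suggests memory accounting is still off for that slice.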

I'm going to close this bug for now with status 'resolved -> worksforme'. We didn't really fix anything, only adjusted configuration. Thanks for reporting the issue and the timely responses!
Comment 13 Michał Szczepaniak 2023-10-03 16:11:10 UTC
I was talking more about not modifying files in /usr, and about a more permanent config location :D
Comment 14 James Fehlig 2023-10-03 16:13:02 UTC
(In reply to Michał Szczepaniak from comment #13)
> I was talking more about not modifying files in /usr, and about a more
> permanent config location :D

Described here https://bugzilla.suse.com/show_bug.cgi?id=1214845#c7 :-)
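
For readers who cannot open that comment: the usual systemd way to avoid touching /usr is a drop-in under /etc/systemd/system.conf.d/. The sketch below is not taken from the linked comment and may differ from what it recommends. The function name and the file name zz-memory-accounting.conf are made up; the name is chosen to sort after __20-defaults-SUSE.conf, on the assumption that systemd applies drop-ins in lexicographic order and the last one wins.

```shell
# Sketch: write a systemd manager drop-in that turns memory accounting on
# without modifying anything under /usr.
# $1 = drop-in directory (default /etc/systemd/system.conf.d)
# $2 = file name (hypothetical; assumed to sort after the vendor file)
write_memory_accounting_dropin() {
    dir="${1:-/etc/systemd/system.conf.d}"
    name="${2:-zz-memory-accounting.conf}"
    mkdir -p "$dir" || return 1
    printf '[Manager]\nDefaultMemoryAccounting=yes\n' > "$dir/$name"
}
```

After running it as root and rebooting, 'systemctl show --property=DefaultMemoryAccounting' should report 'yes' if the drop-in took precedence.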
Comment 15 Michał Szczepaniak 2023-10-03 16:13:50 UTC
Oki thanks a ton for help!

Tho I might be back with another issue :P