Bugzilla – Bug 1218540
kubeadm random panic
Last modified: 2024-02-22 07:44:29 UTC
Created attachment 871659 [details]
Output of kubeadm init --v=5 --skip-phases=addon/kube-proxy

kubeadm panics 50% of the time I try to use it.

# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.4", GitCommit:"bae2c62678db2b5053817bc97181fcc2e8388103", GitTreeState:"clean", BuildDate:"2023-11-24T00:00:00Z", GoVersion:"go1.21.4", Compiler:"gc", Platform:"linux/amd64"}

# head -2 /etc/os-release
NAME="openSUSE MicroOS"
# VERSION="20231228"

Attached stdout/stderr.
Created attachment 871770 [details] output excerpt of "kubectl get events -n kube-system"
Comment on attachment 871770 [details]
output excerpt of "kubectl get events -n kube-system"

I tested the "kubeadm init ..." command on multiple fresh installations of openSUSE TW and saw the same flaky behavior (mentioned in the previous comment):

```
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
I0104 11:25:53.099027 1187 kubeletfinalize.go:134] [kubelet-finalize] Restarting the kubelet to enable client certificate rotation
Post "https://192.168.178.74:6443/api/v1/namespaces/kube-system/services?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
unable to create a new DNS service
```

This flaky error occurs because the "kube-apiserver" pods are down (unhealthy, crashing and restarting) at the time of the POST request above. The relevant event logs are attached: https://bugzilla.suse.com/attachment.cgi?id=871770

The apiserver pod logs the following error, which is tracked upstream at https://github.com/kubernetes/kubernetes/issues/76146:

```
E0110 09:04:04.102627 1 controller.go:97] Error removing old endpoints from kubernetes service: no API server IP addresses were listed in storage, refusing to erase all endpoints for the kubernetes Service
```

There is no definitive fix in the tracking issue discussion. A few workarounds are suggested there; I'm testing them, but so far none has worked on my TW installation. I will update this ticket once there is a working solution.

I'm also looking into other tickets relevant to "kubeadm init" runs: boo#1218695, boo#1218687 and boo#1218694.
Quick update: flipping "SystemdCgroup" to true in the default containerd config (/etc/containerd/config.toml) fixes the apiserver crashes.

```
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = true
```

I will send an SR to the containerd package to patch this.
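For anyone who wants to try the workaround locally before a packaged fix lands, here is a minimal sketch of the same flip done programmatically. It assumes the config still contains a literal `SystemdCgroup = false` line, as the sample text below does; the helper name `enable_systemd_cgroup` and the sample are illustrative only and are not part of containerd or kubeadm. It only transforms text and does not touch /etc/containerd/config.toml.

```python
import re

# Matches a "SystemdCgroup = false" assignment at the start of a line,
# keeping the indentation and spacing so only the value changes.
_SYSTEMD_CGROUP_FALSE = re.compile(
    r"^([ \t]*SystemdCgroup[ \t]*=[ \t]*)false\b", re.MULTILINE
)


def enable_systemd_cgroup(config_text: str) -> str:
    """Return config_text with every 'SystemdCgroup = false' set to true.

    Hypothetical helper for illustration; text without such a line is
    returned unchanged.
    """
    return _SYSTEMD_CGROUP_FALSE.sub(r"\g<1>true", config_text)


# Assumed sample resembling the relevant section of the containerd config.
sample = """\
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
  SystemdCgroup = false
"""

patched = enable_systemd_cgroup(sample)
print(patched)
```

After writing the patched text back to the config file, containerd would need to be restarted for the setting to take effect.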
Ping?
I discussed with the containerd package maintainers the possibility of carrying this change as a patch in the Factory containerd package, but that was not considered the ideal approach. We are now waiting for the upstream containerd project to implement the change in this PR: https://github.com/containerd/containerd/pull/9350
Thanks for the update!