Bug 1218540 - kubeadm random panic
Summary: kubeadm random panic
Status: NEW
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Containers
Version: Current
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Containers Team
QA Contact: E-mail List
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-01-04 10:31 UTC by Ricardo Branco
Modified: 2024-02-22 07:44 UTC
CC List: 2 users

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
Output of kubeadm init --v=5 --skip-phases=addon/kube-proxy (16.96 KB, text/plain)
2024-01-04 10:31 UTC, Ricardo Branco
output excerpt of "kubectl get events -n kube-system" (1.96 KB, text/plain)
2024-01-11 06:27 UTC, Priyanka Saggu

Description Ricardo Branco 2024-01-04 10:31:10 UTC
Created attachment 871659
Output of kubeadm init --v=5 --skip-phases=addon/kube-proxy

kubeadm panics 50% of the time I try to use it.

# kubeadm version
kubeadm version: &version.Info{Major:"1", Minor:"28", GitVersion:"v1.28.4", GitCommit:"bae2c62678db2b5053817bc97181fcc2e8388103", GitTreeState:"clean", BuildDate:"2023-11-24T00:00:00Z", GoVersion:"go1.21.4", Compiler:"gc", Platform:"linux/amd64"}

# head -2 /etc/os-release 
NAME="openSUSE MicroOS"
# VERSION="20231228"

Attached stdout/stderr.
Comment 1 Priyanka Saggu 2024-01-11 06:27:17 UTC
Created attachment 871770
output excerpt of "kubectl get events -n kube-system"
Comment 2 Priyanka Saggu 2024-01-11 06:29:38 UTC
Comment on attachment 871770
output excerpt of "kubectl get events -n kube-system"

I tested the "kubeadm init ..." command on multiple fresh installations of openSUSE TW, and saw the same flaky behavior (mentioned in previous comment):

```
[kubelet-finalize] Updating "/etc/kubernetes/kubelet.conf" to point to a rotatable kubelet client certificate and key
I0104 11:25:53.099027    1187 kubeletfinalize.go:134] [kubelet-finalize] Restarting the kubelet to enable client certificate rotation
Post "https://192.168.178.74:6443/api/v1/namespaces/kube-system/services?timeout=10s": net/http: request canceled (Client.Timeout exceeded while awaiting headers)
unable to create a new DNS service
```

This flaky error is due to the "kube-apiserver" pods being down (unhealthy, crashing/restarting) at the time of the above POST request.
(attached relevant event logs - https://bugzilla.suse.com/attachment.cgi?id=871770)
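
To see whether the apiserver is actually up around the time of such a request, one option is to poll its `/readyz` health endpoint. The following is only a minimal sketch: the helper names `apiserver_ready` and `wait_until_ready` are hypothetical, the address is the one from the log above, TLS verification is skipped for brevity, and it assumes unauthenticated access to the health endpoints is left at its default.

```python
import http.client
import ssl
import time
import urllib.request


def apiserver_ready(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on the given health URL answers with HTTP 200."""
    # Assumption: certificate checks are disabled because this is a throwaway
    # probe against a freshly bootstrapped node; do not do this in production.
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, TLS failure or non-2xx answer:
        # treat all of them as "not ready".
        return False


def wait_until_ready(probe, attempts: int = 30, delay: float = 2.0,
                     sleep=time.sleep) -> bool:
    """Call probe() up to `attempts` times, sleeping `delay` seconds after
    each failed attempt. Returns True as soon as one probe succeeds."""
    for _ in range(attempts):
        if probe():
            return True
        sleep(delay)
    return False
```

A call such as `wait_until_ready(lambda: apiserver_ready("https://192.168.178.74:6443/readyz"))` returning `False` while `kubeadm init` is still running would be consistent with the apiserver restarts seen in the events.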

The apiserver pod logs the following error, tracked in the upstream project at https://github.com/kubernetes/kubernetes/issues/76146:

```
E0110 09:04:04.102627       1 controller.go:97] Error removing old endpoints from kubernetes service: no API server IP addresses were listed in storage, refusing to erase all endpoints for the kubernetes Service
```

There isn't a definitive fix available in the tracking issue discussion. There are a few suggested workarounds, which I'm testing, but so far none have worked in my TW installation. I will provide further updates to the ticket once there's a working solution.

I'm also looking into other tickets relevant to "kubeadm init" runs: boo#1218695, boo#1218687 and boo#1218694.
Comment 3 Priyanka Saggu 2024-01-11 10:51:27 UTC
Quick update - 

Flipping "SystemdCgroup" to true in the default containerd config (/etc/containerd/config.toml) fixes the apiserver crashes.


```
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
   SystemdCgroup = true
```

I will send an SR to the containerd package to patch this.
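
Until a packaged fix exists, the change can be applied by hand. As a rough illustration only, the sketch below flips the key in the text of a config file; `enable_systemd_cgroup` is a hypothetical helper, and it assumes the file already contains a `SystemdCgroup` line under the runc options table shown above (as the output of `containerd config default` does).

```python
import re

# Header of the runc options table, copied from the snippet in this comment.
RUNC_OPTIONS_HEADER = (
    '[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]'
)


def enable_systemd_cgroup(config_text: str) -> str:
    """Return config_text with SystemdCgroup set to true in the runc options
    table. Raises ValueError if the key is not found in that table."""
    out = []
    in_runc_options = False
    changed = False
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # A new table header starts; remember whether it is the runc one.
            in_runc_options = stripped == RUNC_OPTIONS_HEADER
        elif in_runc_options and re.match(r"SystemdCgroup\s*=", stripped):
            # Keep the original indentation, replace only the value.
            indent = line[: len(line) - len(line.lstrip())]
            line = f"{indent}SystemdCgroup = true"
            changed = True
        out.append(line)
    if not changed:
        raise ValueError("SystemdCgroup key not found in the runc options table")
    return "\n".join(out) + "\n"
```

The result would be written back to /etc/containerd/config.toml, followed by a restart of containerd (for example with `systemctl restart containerd`) before running `kubeadm init` again. Keys with the same name in other tables are left untouched.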
Comment 4 Felix Niederwanger 2024-02-21 07:37:14 UTC
Ping?
Comment 5 Priyanka Saggu 2024-02-21 10:21:07 UTC
I discussed with the containerd package maintainers the possibility of adding this change as a patch in the Factory containerd package, but that was not the ideal approach.

We are now waiting for the upstream containerd project to implement the change with this PR: https://github.com/containerd/containerd/pull/9350
Comment 6 Felix Niederwanger 2024-02-22 07:44:29 UTC
Thanks for the update!