Bugzilla – Full Text Bug Listing
| Summary: | MicroOS upgrade failed and screwed up nodes by deleting the root filesystem completely | ||
|---|---|---|---|
| Product: | [openSUSE] openSUSE Tumbleweed | Reporter: | Jörn Reder <jreder> |
| Component: | MicroOS | Assignee: | Ignaz Forster <iforster> |
| Status: | RESOLVED FIXED | QA Contact: | E-mail List <qa-bugs> |
| Severity: | Critical | ||
| Priority: | P5 - None | CC: | iforster |
| Version: | Current | ||
| Target Milestone: | --- | ||
| Hardware: | x86-64 | ||
| OS: | openSUSE Tumbleweed | ||
| Whiteboard: | |||
| Found By: | --- | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
Description
Jörn Reder
2023-11-22 19:47:56 UTC
I was able to reproduce this if a `transactional-update cleanup` is called while /.snapshots is not mounted. The cleanup algorithm will then think that the corresponding /etc overlays can be deleted. On the next boot the overlay is no longer found, and thus the mount during the initrd will fail. (The snapshots themselves should still be there, it's "just" the overlay which is lost.)

The immediate action will be to abort if there's no /.snapshots mount.

This does not explain, however, why /.snapshots wasn't mounted in the first place. I guess there's no log any more which could explain that?

Sorry for the delay (I was off for a few days), and thanks for looking into this. It's good news that you could reproduce the problem when the .snapshots mount is missing.

I can provide logs from the health-checker service, but as far as I understand, they do not give us a reason for the missing mountpoint. Hopefully you can read more out of them:

Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Starting MicroOS Health Checker...
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Clearing GRUB flag
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1199]: grub2-editenv: error: cannot open `/boot/grub2/grubenv': Read-only file system.
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Starting health check
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1207]: <10>Nov 19 03:01:11 root: ERROR: "/usr/libexec/health-checker/btrfs-subvolumes-mounted.sh check" failed
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1212]: active
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1198]: Health check failed!
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt root[1233]: Machine didn't come up correctly, do a rollback
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1233]: <9>Nov 19 03:01:11 root: Machine didn't come up correctly, do a rollback
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1234]: mount: /.snapshots: mount point not mounted or bad option.
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1234]: dmesg(1) may have more information after failed mount system call.
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1235]: ERROR: Could not set default subvolume: Inappropriate ioctl for device
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt root[1236]: ERROR: btrfs set-default 449 failed!
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1236]: <10>Nov 19 03:01:11 root: ERROR: btrfs set-default 449 failed!
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt health-checker[1237]: /usr/sbin/health-checker: line 91: telem_send_payload: command not found
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: health-checker.service: Main process exited, code=exited, status=1/FAILURE
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: health-checker.service: Failed with result 'exit-code'.
Nov 19 03:01:11 k8stest1-wk-fsn1-cpx31-xgt systemd[1]: Failed to start MicroOS Health Checker.

BTW: we could rescue most of the machines where this happened by going back to a working snapshot. I wrote a short report of that in the discussion forum of the kube-hetzner project: https://github.com/kube-hetzner/terraform-hcloud-kube-hetzner/discussions/1096#discussioncomment-7652896

Since then all nodes have been running fine again and have performed one transactional-update + reboot without problems. Nevertheless, it would be good if health-checker and transactional-update stopped when the .snapshots mountpoint is missing, as continuing without it can do more harm than good.
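The "abort if /.snapshots is not mounted" guard discussed above could look roughly like the following. This is only a minimal sketch of the idea, not the code that went into transactional-update or health-checker: the function names `mount_points` and `ensure_snapshots_mounted` are hypothetical, and the sketch assumes the mount table is available as text in the usual /proc/self/mounts layout (device, mount point, filesystem type, options, ...).

```python
def mount_points(mounts_text):
    """Return the mount points named in /proc/self/mounts-style text."""
    points = set()
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each entry reads: device mountpoint fstype options dump pass
        if len(fields) >= 2:
            points.add(fields[1])
    return points


def ensure_snapshots_mounted(mounts_text, required="/.snapshots"):
    """Raise RuntimeError unless `required` appears as a mount point.

    Hypothetical guard: a caller would run this before deleting any
    /etc overlays, so that a missing mount stops the cleanup instead
    of making every overlay look orphaned.
    """
    if required not in mount_points(mounts_text):
        raise RuntimeError(f"{required} is not mounted; refusing to clean up")
```

A caller would read /proc/self/mounts, pass its contents to `ensure_snapshots_mounted`, and only proceed with the cleanup if no exception is raised. Checking the mount table rather than just the directory matters here, because the /.snapshots directory can still exist when nothing is mounted on it.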
This is an autogenerated message for OBS integration: This bug (1217416) was mentioned in https://build.opensuse.org/request/show/1154848 Factory / transactional-update

This problem has been fixed in transactional-update version 4.6.0 already.