Bugzilla – Bug 1220541
kexec does a full reboot with kernel 6.7.6-1.1
Last modified: 2024-04-15 22:15:41 UTC
Tested on identical systems with kexec-tools 2.0.27-3.2 (on both) and kernel 6.7.5-1.1 (on working system) and 6.7.6-1.1 (on faulty system). Faulty system (fully updated as of Feb 28 2024) is on Tumbleweed release 20240226. Working system is on release 20240222. Issue reproduced on working system by doing zypper ref and zypper in kernel-default to upgrade just the kernel to the latest version. Issue happens after the next cold boot and persists.
I couldn't find anything obvious between those versions at a quick glance. Is the kdump setup successful on 6.7.6?
@Takashi Yes, the kdump setup/service is fine. No failed units or priority 3 journal errors with either kernel. It's just that kexec does a full reboot instead of, well, kexec'ing. No errors when doing kexec -l or -e, restarting kexec-load.service, or running systemctl kexec. Everything works normally, but kexec does a full firmware reboot instead of a kexec.

My firmware is really slow and it has been a frustrating last 2 days troubleshooting this. I even migrated both my machines from TW to Slowroll hoping to get away from this kernel, but it followed me there too :(

I've rolled back both my machines to their last working snapshot running kernel 6.7.4 (the default after migrating to Slowroll) and did a bunch of tests on my secondary machine. I tried kernel-longterm and removed kernel-default; that seemed to work fine for some time, but after a dup it stopped working too. Then I tried installing vanilla/stable kernels and kdump/kexec-tools from Factory. That did not fix the problem either.

Now I've locked the kernel packages against upgrades, rolled back to the last known good snapshot running kernel 6.7.4, and did a dup; everything is working as expected. Not sure where to go from here, or how long I can keep the kernel packages locked without running into some other problem.
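For anyone retracing the checks described above, here is a minimal sketch of confirming that an image is staged before triggering the kexec. It only reads the kernel's `/sys/kernel/kexec_loaded` flag; the helper name `kexec_is_loaded` and the image/initrd paths in the comments are assumptions for illustration, not something taken from this report. Note that a staged image does not prove the next boot skips the firmware, which is exactly the failure reported here.

```shell
#!/bin/sh
# Sketch: report whether a kexec image is staged, by reading the sysfs flag.
# The optional path argument exists only so the helper can be tried against
# an ordinary file; the default is the real sysfs location.
kexec_is_loaded() {
    flag_file="${1:-/sys/kernel/kexec_loaded}"
    # The kernel exposes "1" here once an image has been loaded with `kexec -l`.
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}

# Typical manual sequence (needs root; the image and initrd paths are assumptions):
#   kexec -l /boot/vmlinuz --initrd=/boot/initrd --reuse-cmdline
#   kexec_is_loaded && echo "image staged"
#   systemctl kexec
```

If the helper succeeds and the machine still goes through the firmware, the problem is in the kexec jump itself rather than in loading the image.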
Some zypper info that might be useful for troubleshooting:

pavin@suse-pc:~> zypper lr -dP
 # | Alias             | Name    | Enabled | GPG Check | Refresh | Priority | Type   | URI                                                                           | Service
---+-------------------+---------+---------+-----------+---------+----------+--------+-------------------------------------------------------------------------------+--------
 8 | packman           | packman | Yes     | (r ) Yes  | Yes     |       90 | rpm-md | https://ftp.gwdg.de/pub/linux/misc/packman/suse/openSUSE_Slowroll/Essentials/ |
 6 | base-update       | base--> | Yes     | (r ) Yes  | Yes     |       95 | rpm-md | https://cdn.opensuse.org/update/slowroll/repo/oss/                            |
 1 | base-debug        | base--> | No      | ----      | ----    |       99 | N/A    | https://cdn.opensuse.org/debug/slowroll/repo/oss/                             |
 2 | base-non-oss      | base--> | Yes     | (r ) Yes  | Yes     |       99 | rpm-md | https://cdn.opensuse.org/slowroll/repo/non-oss/                               |
 3 | base-openh264     | base--> | Yes     | (r ) Yes  | Yes     |       99 | rpm-md | https://codecs.opensuse.org/openh264/openSUSE_Tumbleweed/                     |
 4 | base-oss          | base--> | Yes     | (r ) Yes  | Yes     |       99 | rpm-md | https://cdn.opensuse.org/slowroll/repo/oss/                                   |
 5 | base-source       | base--> | No      | ----      | ----    |       99 | N/A    | https://cdn.opensuse.org/slowroll/repo/src-oss/                               |
 7 | google-chrome     | googl-> | Yes     | (r ) Yes  | No      |       99 | rpm-md | https://dl.google.com/linux/chrome/rpm/stable/x86_64                          |
 9 | shiftkey-packages | GitHu-> | Yes     | (r ) Yes  | No      |       99 | rpm-md | https://rpm.packages.shiftkey.dev/rpm/                                        |
10 | vscode            | Visua-> | Yes     | (r ) Yes  | No      |       99 | rpm-md | https://packages.microsoft.com/yumrepos/vscode                                |
pavin@suse-pc:~>

pavin@suse-pc:~> zypper ll
# | Name    | Type    | Repository | Comment
--+---------+---------+------------+--------
1 | kernel* | package | (any)      |
pavin@suse-pc:~>

pavin@suse-pc:~> sudo zypper dup --dry-run
Please enter the PIN:
Please touch the device.
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...

The following 97 items are locked and will not be changed by any action:
 Available:
  kernel-debug kernel-debug-debuginfo kernel-debug-debugsource kernel-debug-devel kernel-debug-devel-debuginfo kernel-debug-vdso kernel-debug-vdso-debuginfo kernel-default-base kernel-default-base-rebuild kernel-default-debuginfo kernel-default-debugsource kernel-default-devel-debuginfo kernel-default-vdso kernel-default-vdso-debuginfo kernel-devel-longterm kernel-docs kernel-docs-html kernel-firmware kernel-firmware-nvidia-gsp-G06 kernel-firmware-nvidia-gspx-G06 kernel-install-tools kernel-kvmsmall kernel-kvmsmall-debuginfo kernel-kvmsmall-debugsource kernel-kvmsmall-devel kernel-kvmsmall-devel-debuginfo kernel-kvmsmall-vdso kernel-kvmsmall-vdso-debuginfo kernel-longterm kernel-longterm-debuginfo kernel-longterm-debugsource kernel-longterm-devel kernel-longterm-devel-debuginfo kernel-longterm-vdso kernel-longterm-vdso-debuginfo kernel-obs-build kernel-obs-build-debugsource kernel-obs-qa kernel-pae kernel-pae-debuginfo kernel-pae-debugsource kernel-pae-devel kernel-pae-vdso kernel-pae-vdso-debuginfo kernelshark kernelshark-devel kernel-source kernel-source-longterm kernel-source-vanilla kernel-syms kernel-syms-longterm kernel-vanilla kernel-vanilla-debuginfo kernel-vanilla-debugsource kernel-vanilla-devel kernel-vanilla-devel-debuginfo kernel-vanilla-vdso kernel-vanilla-vdso-debuginfo
 Installed:
  kernel-default-6.6.11-1.1 kernel-default-6.7.4-1.1 kernel-default-devel kernel-devel kernel-firmware-all kernel-firmware-amdgpu kernel-firmware-ath10k kernel-firmware-ath11k kernel-firmware-ath12k kernel-firmware-atheros kernel-firmware-bluetooth kernel-firmware-bnx2 kernel-firmware-brcm kernel-firmware-chelsio kernel-firmware-dpaa2 kernel-firmware-i915 kernel-firmware-intel kernel-firmware-iwlwifi kernel-firmware-liquidio kernel-firmware-marvell kernel-firmware-media kernel-firmware-mediatek kernel-firmware-mellanox kernel-firmware-mwifiex kernel-firmware-network kernel-firmware-nfp kernel-firmware-nvidia kernel-firmware-platform kernel-firmware-prestera kernel-firmware-qcom kernel-firmware-qlogic kernel-firmware-radeon kernel-firmware-realtek kernel-firmware-serial kernel-firmware-sound kernel-firmware-ti kernel-firmware-ueagle kernel-firmware-usb-network kernel-macros

The following 57 packages are going to be upgraded:
  alsa-utils apache-commons-logging argyllcms autofs code crash fwupd fwupd-bash-completion gdm gdmflexiserver gdm-schema grub2 grub2-i386-pc grub2-snapper-plugin grub2-systemd-sleep-plugin grub2-x86_64-efi java-11-openjdk java-11-openjdk-headless libaa1 libblkid1 libdecor libdecor-0-0 libfdisk1 libfwupd2 libgdm1 libmount1 libsmartcols1 libsystemd0 libudev1 libutempter0 libuuid1 libvidstab1_1 libwebrtc-audio-processing-1-3 libzck1 libzvbi0 MozillaThunderbird patterns-server-kvm_server patterns-server-kvm_tools shared-mime-info systemd systemd-container systemd-coredump systemd-doc typelib-1_0-Fwupd-2_0 typelib-1_0-Gdm-1_0 ucode-amd udev util-linux util-linux-systemd vorbis-tools vpnc wget xorg-x11-server xorg-x11-server-extra xorg-x11-server-Xvfb xwayland zip

The following 2 patterns are going to be upgraded:
  kvm_server kvm_tools

The following 11 packages are going to be downgraded:
  libqt5-qtwebengine libvapoursynth-65 libvapoursynth-script0 libXvnc1 python3-vapoursynth tigervnc transmission-common transmission-gtk virtiofsd xorg-x11-Xvnc xorg-x11-Xvnc-module

57 packages to upgrade, 11 to downgrade.
Overall download size: 332.6 MiB. Already cached: 0 B. After the operation, additional 980.0 KiB will be used.
Continue? [y/n/v/...? shows all options] (y): n
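The `kernel*` lock listed by `zypper ll` above could be created and later dropped roughly as follows. This is a sketch, not the reporter's exact commands: the function names and the `KLOCK_RUN` variable are hypothetical, and with `KLOCK_RUN` left at its default the functions only print the command instead of running it.

```shell
#!/bin/sh
# Sketch: hold kernel packages back with a zypper package lock.
# KLOCK_RUN is a hypothetical knob: left as "echo" it only prints the command;
# set it to an empty string to really run it.
KLOCK_RUN="${KLOCK_RUN-echo}"

lock_kernels() {
    # `zypper addlock` (alias `al`) accepts a glob; the quotes keep the shell
    # from expanding it against files in the current directory.
    $KLOCK_RUN sudo zypper addlock 'kernel*'
}

unlock_kernels() {
    # `zypper removelock` (alias `rl`) drops the lock once a fixed kernel ships.
    $KLOCK_RUN sudo zypper removelock 'kernel*'
}

# Usage:
#   lock_kernels      # prints: sudo zypper addlock kernel*
#   zypper ll         # the lock should now be listed
```

A glob lock this broad also freezes kernel-firmware and related packages, as the "97 items are locked" list above shows, so it is best treated as a temporary measure.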
I did more testing today.

Kexec reboot working normally on kernels:
6.7.4
6.4.0 ALP kernel (https://download.opensuse.org/repositories/Kernel:/ALP-current/standard/x86_64/kernel-default-6.4.0-118.1.g09c1189.x86_64.rpm)

Kexec reboot does firmware reboot on kernels:
6.7.6
6.6.18 longterm kernel (https://download.opensuse.org/update/slowroll/repo/oss/x86_64/kernel-longterm-6.6.18-1.1.x86_64.rpm)

Let me know if there's anything else I can provide for troubleshooting.
OK, thanks. Since this is a regression in the upstream kernel, the best course is to report it upstream.
https://docs.kernel.org/admin-guide/reporting-regressions.html
https://docs.kernel.org/process/handling-regressions.html

Care to report your problem? They'll likely ask you to perform a git bisection. A hint for building your test kernel quickly is found at
https://docs.kernel.org/admin-guide/quickly-build-trimmed-linux.html
@Takashi Thanks for the references. I'm quite out of my depth here with building kernels and reporting bugs straight to kernel.org. Guess I better get learning 🤓
Updates: I built the kernels 6.7.5 and 6.7.6 from source. Issue reproduced with 6.7.6.
Submitted bug report to upstream: https://lore.kernel.org/regressions/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/
(In reply to Pavin Joseph from comment #7)
> Submitted bug report to upstream:
> https://lore.kernel.org/regressions/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/

Thanks! In the post above, you showed:

Git bisect logs:
git bisect start
# status: waiting for both good and bad commits
# bad: [b631f5b445dc3379f67ff63a2e4c58f22d4975dc] Linux 6.7.6
git bisect bad b631f5b445dc3379f67ff63a2e4c58f22d4975dc
# status: waiting for good commit(s), bad commit known
# good: [004dcea13dc10acaf1486d9939be4c793834c13c] Linux 6.7.5
git bisect good 004dcea13dc10acaf1486d9939be4c793834c13c

... and now git bisect should point to the commit in the middle to be tested. Did you perform testing further?
Hi there,

Did the full bisection and found the culprit. Didn't quite understand the whole procedure until reading this [0] guide. Issue reproduced on mainline and current stable 6.7.7. Submitted response to upstream detailing all this.

Hope it's fixed soon. Let me know if there's anything I can do to improve testing for kexec bugs using openQA or OBS. This bug found its way into kernel-longterm as well, and since kexec is a feature I use almost every day (my personal machine's firmware is quite slow), it's quite concerning that no one caught this in testing.

Bisection logs:
git bisect start
# status: waiting for both good and bad commits
# good: [004dcea13dc10acaf1486d9939be4c793834c13c] Linux 6.7.5
git bisect good 004dcea13dc10acaf1486d9939be4c793834c13c
# status: waiting for bad commit, 1 good commit known
# bad: [b631f5b445dc3379f67ff63a2e4c58f22d4975dc] Linux 6.7.6
git bisect bad b631f5b445dc3379f67ff63a2e4c58f22d4975dc
# good: [00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1] misc: fastrpc: Mark all sessions as invalid in cb_remove
git bisect good 00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1
# bad: [6e85c91e7d63e46de1b4a0cb90212356da8a41cb] io_uring/net: fix multishot accept overflow handling
git bisect bad 6e85c91e7d63e46de1b4a0cb90212356da8a41cb
# good: [fe32ecf2e66f069230628e8917d26911c5fb2482] eventfs: Restructure eventfs_inode structure to be more condensed
git bisect good fe32ecf2e66f069230628e8917d26911c5fb2482
# good: [f385565bd76b581a83b62a5b6f88ea6f149f8b83] ring-buffer: Clean ring_buffer_poll_wait() error return
git bisect good f385565bd76b581a83b62a5b6f88ea6f149f8b83
# good: [992c8a5f10f81af32c3272c200fc003fb7450401] powerpc/64: Set task pt_regs->link to the LR value on scv entry
git bisect good 992c8a5f10f81af32c3272c200fc003fb7450401
# good: [d79adbe1cd67bc76608e036ee2f98b71c083d9ce] x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
git bisect good d79adbe1cd67bc76608e036ee2f98b71c083d9ce
# good: [fa2b524a73545d25ae15e3d2930b9bfa83b40827] KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
git bisect good fa2b524a73545d25ae15e3d2930b9bfa83b40827
# bad: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
git bisect bad 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
# good: [6d10c8c5abd1437dcbc209e307d930da60b86e91] KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
git bisect good 6d10c8c5abd1437dcbc209e307d930da60b86e91
# first bad commit: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

Culprit:
7143c5f4cf2073193eb27c9cdb84fd4655d1802d is the first bad commit
commit 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
Author: Steve Wahl <steve.wahl@hpe.com>
Date:   Fri Jan 26 10:48:41 2024 -0600

    x86/mm/ident_map: Use gbpages only where full GB page should be mapped.

    commit d794734c9bbfe22f86686dc2909c25f5ffe1a572 upstream.

    When ident_pud_init() uses only gbpages to create identity maps, large ranges of addresses not actually requested can be included in the resulting table; a 4K request will map a full GB. On UV systems, this ends up including regions that will cause hardware to halt the system if accessed (these are marked "reserved" by BIOS). Even processor speculation into these regions is enough to trigger the system halt.

    Only use gbpages when map creation requests include the full GB page of space. Fall back to using smaller 2M pages when only portions of a GB page are included in the request.

    No attempt is made to coalesce mapping requests. If a request requires a map entry at the 2M (pmd) level, subsequent mapping requests within the same 1G region will also be at the pmd level, even if adjacent or overlapping such requests could have been combined to map a full gbpage. Existing usage starts with larger regions and then adds smaller regions, so this should not have any great consequence.

    [ dhansen: fix up comment formatting, simplifty changelog ]

    Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 arch/x86/mm/ident_map.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
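The bisection log above ends with a "# first bad commit" line, and that line is what needs to go into the upstream report. A small sketch of pulling the hash out of a saved log follows; the helper name `first_bad_commit` is made up for illustration, and the `v6.7.5`/`v6.7.6` tags in the comments assume a checkout of the stable tree (the log in this comment used full hashes instead).

```shell
#!/bin/sh
# Sketch: print the culprit hash from `git bisect log` output read on stdin.
first_bad_commit() {
    # Matches the "# first bad commit: [<hash>] <subject>" line that git
    # appends to the bisect log once the search has converged.
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p'
}

# Outline of the session that produces such a log (tag names are an assumption):
#   git bisect start
#   git bisect good v6.7.5
#   git bisect bad v6.7.6
#   ... build, boot, try kexec, then mark with `git bisect good` or `git bisect bad` ...
#   git bisect log | first_bad_commit
```

The helper prints nothing if the bisection has not finished, which makes it easy to tell an incomplete log from a complete one.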
Great, could you follow up your reported mail thread for this info, so that the proper upstream devs get involved for fixing the issue?
@Takashi Sure 👍 Please improve testing to catch kexec bugs like this in the future. Building several different kernels and enduring the rather slow boot process on a low-end laptop about 20 times over the last few days is not an experience I want to repeat 🥺
Just wondering, has this been brought up upstream?
(In reply to Michal Hocko from comment #13)
> Just wondering, has this been brought up upstream?

Yes, see comment 7.
As I understand the thread, you have not tried the latest mainline. So is this reproducible with 6.8-rc*?

You can either try:
https://download.opensuse.org/repositories/Kernel:/vanilla/standard/

or build from the clone as in:
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
cd ~/linux/
<build the kernel>
...
(no stable tree adding here)
(In reply to Jiri Slaby from comment #15)
> As I understand the thread, you have not tried the latest mainline. So is
> this reproducible with 6.8-rc*?

I have reproduced the issue on mainline, current stable (6.7.7), and a full git bisection was done between the last known good version 6.7.5 and the first known bad version 6.7.6.

Reverting culprit commit on mainline fixed the issue.
https://lore.kernel.org/regressions/fe72c912-f1a0-4a53-88ab-b85e8c3f7bd9@pavinjoseph.com/T/#m85c3dc66389b405ed8e789d8153e172644c57f23
(In reply to Pavin Joseph from comment #16)
> (In reply to Jiri Slaby from comment #15)
> > As I understand the thread, you have not tried the latest mainline. So is
> > this reproducible with 6.8-rc*?
>
> I have reproduced the issue on mainline, current stable (6.7.7), and a full
> git bisection was done between the last known good version 6.7.5 and the
> first known bad version 6.7.6.
>
> Reverting culprit commit on mainline fixed the issue.

Note 6.7 is *not* mainline. That's why I asked for testing 6.8-rc*.
(In reply to Jiri Slaby from comment #17)
> Note 6.7 is *not* mainline. That's why I asked for testing 6.8-rc*.

Jiri, yes, I understand 😉. I tested with 6.8-rc (the one Linus maintains) and the issue could be reproduced in it.

I followed the updated docs [0]. The steps there test mainline (6.8-rc*) first, then stable (6.7.7), and only then begin the bisection. The final validation step is to revert the identified culprit commit on mainline.

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
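The final validation step mentioned above, reverting the culprit on mainline, might look like the sketch below. The upstream commit ID is the one quoted in the stable commit message earlier in this bug; the function name, the `KREVERT_RUN` variable, and the default tree location `~/linux` are assumptions, and by default the function only prints the command.

```shell
#!/bin/sh
# Sketch: revert the upstream culprit commit in a mainline checkout.
# The hash comes from the "commit ... upstream." line of the stable backport.
UPSTREAM_CULPRIT=d794734c9bbfe22f86686dc2909c25f5ffe1a572

# KREVERT_RUN is a hypothetical knob: left as "echo" the command is only printed;
# set it to an empty string to really run it.
KREVERT_RUN="${KREVERT_RUN-echo}"

revert_culprit() {
    # The tree path is an assumption; pass the location of your own clone.
    tree="${1:-$HOME/linux}"
    $KREVERT_RUN git -C "$tree" revert --no-edit "$UPSTREAM_CULPRIT"
}

# Usage:
#   revert_culprit ~/linux    # then rebuild, boot the result, and retry kexec
```

If kexec works again on the kernel built with the revert, that confirms the bisection result independently of the stable backport.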
Kexec has been fixed in kernel 6.8.5 and LTS kernel 6.6.26. Thank you for everyone's help 😄
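To check whether a given kernel already carries the fix, the two version floors named above (6.8.5 and 6.6.26) can be compared against a version string such as the output of `uname -r`. This is only a sketch: the function names are made up, it relies on GNU `sort -V`, and it answers "unknown" for every series this bug does not state a fix version for.

```shell
#!/bin/sh
# Sketch: classify a kernel version against the fix versions named in this bug.
version_ge() {
    # True when $1 >= $2 under GNU `sort -V` version ordering.
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

kexec_fix_status() {
    # Drop the package release suffix, e.g. "6.6.26-1.1" -> "6.6.26".
    ver="${1%%-*}"
    case "$ver" in
        6.6.*) floor=6.6.26 ;;   # LTS fix version from this bug
        6.8.*) floor=6.8.5 ;;    # stable fix version from this bug
        *) echo "unknown"; return 0 ;;
    esac
    if version_ge "$ver" "$floor"; then
        echo "fixed"
    else
        # Older than the fix; it may also predate the regression itself.
        echo "before-fix"
    fi
}

# Usage:
#   kexec_fix_status "$(uname -r)"
```

"before-fix" does not by itself mean the kernel is affected, since releases from before the culprit was backported also fall into that bucket.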