Bug 1220541 - kexec does a full reboot with kernel 6.7.6-1.1
Status: RESOLVED FIXED
Alias: None
Product: openSUSE Tumbleweed
Classification: openSUSE
Component: Kernel
Version: Current
Hardware: x86-64 openSUSE Tumbleweed
Priority: P2 - High, Severity: Major
Target Milestone: ---
Assignee: openSUSE Kernel Bugs
QA Contact: E-mail List
Blocks: 1220382
Reported: 2024-02-28 10:18 UTC by Pavin Joseph
Modified: 2024-04-15 22:15 UTC
jslaby: needinfo? (me)


Description Pavin Joseph 2024-02-28 10:18:12 UTC
Tested on two identical systems, both with kexec-tools 2.0.27-3.2: kernel 6.7.5-1.1 on the working system and kernel 6.7.6-1.1 on the faulty system.

Faulty system (fully updated as of Feb 28 2024) is on Tumbleweed release 20240226.

Working system is on release 20240222.

Issue reproduced on working system by doing zypper ref and zypper in kernel-default to upgrade just the kernel to the latest version.
Issue happens after the next cold boot and persists.
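
The reproduction steps above can be sketched as a small shell helper. This is only a sketch: the reproduce_cmds name and the default /boot/vmlinuz and /boot/initrd paths are assumptions, not taken from the report. The function only prints the commands, so they can be reviewed before being run as root.

```shell
# Hypothetical helper: print the reproduction commands without running them.
# Assumption: /boot/vmlinuz and /boot/initrd point at the newly installed kernel.
reproduce_cmds() {
    kernel="${1:-/boot/vmlinuz}"
    initrd="${2:-/boot/initrd}"
    printf '%s\n' \
        "zypper ref" \
        "zypper in kernel-default" \
        "kexec -l $kernel --initrd=$initrd --reuse-cmdline" \
        "systemctl kexec"
}
```

A cold boot into the new kernel is needed between the second and third command, since the issue only shows up after that boot.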
Comment 1 Takashi Iwai 2024-02-29 14:17:43 UTC
I couldn't find anything obvious between those versions through a quick glance.
Is the kdump setup successful on 6.7.6?
Comment 2 Pavin Joseph 2024-02-29 17:08:10 UTC
@Takashi Yes, the kdump setup/service is fine. No failed units or priority 3 journal errors with either kernel.

It's just that kexec does a full reboot instead of, well, kexec'ing. There are no errors when running kexec -l or -e, restarting kexec-load.service, or running systemctl kexec. Everything appears to work normally, but the machine goes through a full firmware reboot instead of a kexec. My firmware is really slow and it has been a frustrating last 2 days troubleshooting this.
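
One sanity check that fits the description above is confirming that an image is actually staged before rebooting. A minimal sketch, assuming the kernel exposes /sys/kernel/kexec_loaded; the kexec_is_loaded name is made up here. It does not detect the regression itself, since in this report the image loads fine and only the jump fails.

```shell
# Succeed if a kexec image is currently staged.
# The optional path argument exists only so the check can be tried against a test file.
kexec_is_loaded() {
    flag_file="${1:-/sys/kernel/kexec_loaded}"
    [ -r "$flag_file" ] && [ "$(cat "$flag_file")" = "1" ]
}
```

Usage would be along the lines of: kexec_is_loaded && echo "image staged".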

I even migrated both my machines from TW to Slowroll thinking to get away from this kernel but it followed me there too :(

I've rolled back both my machines to their last working snapshot running kernel 6.7.4 (the default after migrating to Slowroll) and did a bunch of tests on my secondary machine. Tried kernel-longterm and removed kernel-default and it seemed to work fine for some time but after a dup it too stopped working. Then tried installing vanilla/stable kernels and kdump/kexec-tools from factory. That also did not fix the problem.

Now I've locked the kernel packages from upgrading while rolling back to the last known good snapshot running kernel 6.7.4, did dup and everything is working as expected.

Not sure where to go from here or how long I can keep the kernel packages locked without some other problem.
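
For reference, the package lock mentioned above can be sketched with zypper's lock subcommands. The lock_cmds helper is hypothetical and only prints the commands; the kernel* pattern matches the lock listed in the zypper ll output in this comment.

```shell
# Hypothetical helper: print the zypper commands to add, list, and later remove a lock.
lock_cmds() {
    pattern="${1:-kernel*}"
    printf '%s\n' \
        "zypper addlock '$pattern'" \
        "zypper locks" \
        "zypper removelock '$pattern'"
}
```

The last command is the one to run once a fixed kernel is available.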

Some zypper info that might be useful for troubleshooting:

pavin@suse-pc:~> zypper lr -dP
#  | Alias             | Name    | Enabled | GPG Check | Refresh | Priority | Type   | URI                                                                           | Service
---+-------------------+---------+---------+-----------+---------+----------+--------+-------------------------------------------------------------------------------+--------
 8 | packman           | packman | Yes     | (r ) Yes  | Yes     |   90     | rpm-md | https://ftp.gwdg.de/pub/linux/misc/packman/suse/openSUSE_Slowroll/Essentials/ | 
 6 | base-update       | base--> | Yes     | (r ) Yes  | Yes     |   95     | rpm-md | https://cdn.opensuse.org/update/slowroll/repo/oss/                            | 
 1 | base-debug        | base--> | No      | ----      | ----    |   99     | N/A    | https://cdn.opensuse.org/debug/slowroll/repo/oss/                             | 
 2 | base-non-oss      | base--> | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | https://cdn.opensuse.org/slowroll/repo/non-oss/                               | 
 3 | base-openh264     | base--> | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | https://codecs.opensuse.org/openh264/openSUSE_Tumbleweed/                     | 
 4 | base-oss          | base--> | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | https://cdn.opensuse.org/slowroll/repo/oss/                                   | 
 5 | base-source       | base--> | No      | ----      | ----    |   99     | N/A    | https://cdn.opensuse.org/slowroll/repo/src-oss/                               | 
 7 | google-chrome     | googl-> | Yes     | (r ) Yes  | No      |   99     | rpm-md | https://dl.google.com/linux/chrome/rpm/stable/x86_64                          | 
 9 | shiftkey-packages | GitHu-> | Yes     | (r ) Yes  | No      |   99     | rpm-md | https://rpm.packages.shiftkey.dev/rpm/                                        | 
10 | vscode            | Visua-> | Yes     | (r ) Yes  | No      |   99     | rpm-md | https://packages.microsoft.com/yumrepos/vscode                                | 
pavin@suse-pc:~> 
pavin@suse-pc:~> zypper ll

# | Name    | Type    | Repository | Comment
--+---------+---------+------------+--------
1 | kernel* | package | (any)      | 

pavin@suse-pc:~> 
pavin@suse-pc:~> sudo zypper dup --dry-run 
Please enter the PIN: 
Please touch the device.
Loading repository data...
Reading installed packages...
Warning: You are about to do a distribution upgrade with all enabled repositories. Make sure these repositories are compatible before you continue. See 'man zypper' for more information about this command.
Computing distribution upgrade...

The following 97 items are locked and will not be changed by any action:
 Available:
  kernel-debug kernel-debug-debuginfo kernel-debug-debugsource kernel-debug-devel kernel-debug-devel-debuginfo kernel-debug-vdso kernel-debug-vdso-debuginfo
  kernel-default-base kernel-default-base-rebuild kernel-default-debuginfo kernel-default-debugsource kernel-default-devel-debuginfo kernel-default-vdso
  kernel-default-vdso-debuginfo kernel-devel-longterm kernel-docs kernel-docs-html kernel-firmware kernel-firmware-nvidia-gsp-G06 kernel-firmware-nvidia-gspx-G06
  kernel-install-tools kernel-kvmsmall kernel-kvmsmall-debuginfo kernel-kvmsmall-debugsource kernel-kvmsmall-devel kernel-kvmsmall-devel-debuginfo kernel-kvmsmall-vdso
  kernel-kvmsmall-vdso-debuginfo kernel-longterm kernel-longterm-debuginfo kernel-longterm-debugsource kernel-longterm-devel kernel-longterm-devel-debuginfo
  kernel-longterm-vdso kernel-longterm-vdso-debuginfo kernel-obs-build kernel-obs-build-debugsource kernel-obs-qa kernel-pae kernel-pae-debuginfo kernel-pae-debugsource
  kernel-pae-devel kernel-pae-vdso kernel-pae-vdso-debuginfo kernelshark kernelshark-devel kernel-source kernel-source-longterm kernel-source-vanilla kernel-syms
  kernel-syms-longterm kernel-vanilla kernel-vanilla-debuginfo kernel-vanilla-debugsource kernel-vanilla-devel kernel-vanilla-devel-debuginfo kernel-vanilla-vdso
  kernel-vanilla-vdso-debuginfo
 Installed:
  kernel-default-6.6.11-1.1 kernel-default-6.7.4-1.1 kernel-default-devel kernel-devel kernel-firmware-all kernel-firmware-amdgpu kernel-firmware-ath10k
  kernel-firmware-ath11k kernel-firmware-ath12k kernel-firmware-atheros kernel-firmware-bluetooth kernel-firmware-bnx2 kernel-firmware-brcm kernel-firmware-chelsio
  kernel-firmware-dpaa2 kernel-firmware-i915 kernel-firmware-intel kernel-firmware-iwlwifi kernel-firmware-liquidio kernel-firmware-marvell kernel-firmware-media
  kernel-firmware-mediatek kernel-firmware-mellanox kernel-firmware-mwifiex kernel-firmware-network kernel-firmware-nfp kernel-firmware-nvidia kernel-firmware-platform
  kernel-firmware-prestera kernel-firmware-qcom kernel-firmware-qlogic kernel-firmware-radeon kernel-firmware-realtek kernel-firmware-serial kernel-firmware-sound
  kernel-firmware-ti kernel-firmware-ueagle kernel-firmware-usb-network kernel-macros

The following 57 packages are going to be upgraded:
  alsa-utils apache-commons-logging argyllcms autofs code crash fwupd fwupd-bash-completion gdm gdmflexiserver gdm-schema grub2 grub2-i386-pc grub2-snapper-plugin
  grub2-systemd-sleep-plugin grub2-x86_64-efi java-11-openjdk java-11-openjdk-headless libaa1 libblkid1 libdecor libdecor-0-0 libfdisk1 libfwupd2 libgdm1 libmount1
  libsmartcols1 libsystemd0 libudev1 libutempter0 libuuid1 libvidstab1_1 libwebrtc-audio-processing-1-3 libzck1 libzvbi0 MozillaThunderbird patterns-server-kvm_server
  patterns-server-kvm_tools shared-mime-info systemd systemd-container systemd-coredump systemd-doc typelib-1_0-Fwupd-2_0 typelib-1_0-Gdm-1_0 ucode-amd udev util-linux
  util-linux-systemd vorbis-tools vpnc wget xorg-x11-server xorg-x11-server-extra xorg-x11-server-Xvfb xwayland zip

The following 2 patterns are going to be upgraded:
  kvm_server kvm_tools

The following 11 packages are going to be downgraded:
  libqt5-qtwebengine libvapoursynth-65 libvapoursynth-script0 libXvnc1 python3-vapoursynth tigervnc transmission-common transmission-gtk virtiofsd xorg-x11-Xvnc
  xorg-x11-Xvnc-module

57 packages to upgrade, 11 to downgrade.
Overall download size: 332.6 MiB. Already cached: 0 B. After the operation, additional 980.0 KiB will be used.
Continue? [y/n/v/...? shows all options] (y): n
Comment 3 Pavin Joseph 2024-03-01 07:08:36 UTC
I did more testing today.

Kexec reboot working normally on kernels:
6.7.4
6.4.0 ALP kernel (https://download.opensuse.org/repositories/Kernel:/ALP-current/standard/x86_64/kernel-default-6.4.0-118.1.g09c1189.x86_64.rpm)

Kexec reboot does firmware reboot on kernels:
6.7.6
6.6.18 longterm kernel (https://download.opensuse.org/update/slowroll/repo/oss/x86_64/kernel-longterm-6.6.18-1.1.x86_64.rpm)

Let me know if there's anything else I can provide for troubleshooting.
Comment 4 Takashi Iwai 2024-03-01 07:46:57 UTC
OK, thanks.

Since this is a regression in the upstream kernel, the best course is to report it upstream.
  https://docs.kernel.org/admin-guide/reporting-regressions.html
  https://docs.kernel.org/process/handling-regressions.html
Care to report your problem?

They'll likely ask you to perform git bisection.  A hint for building your test kernel quickly is found at
  https://docs.kernel.org/admin-guide/quickly-build-trimmed-linux.html
Comment 5 Pavin Joseph 2024-03-01 08:45:58 UTC
@Takashi Thanks for the references.
I'm quite out of my depth here with building kernels and reporting bugs straight to kernel.org.

Guess I better get learning 🤓
Comment 6 Pavin Joseph 2024-03-01 13:43:41 UTC
Updates:
I built the kernels 6.7.5 and 6.7.6 from source.
Issue reproduced with 6.7.6.
Comment 7 Pavin Joseph 2024-03-01 14:23:41 UTC
Submitted bug report to upstream:
https://lore.kernel.org/regressions/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/
Comment 8 Takashi Iwai 2024-03-01 14:43:36 UTC
(In reply to Pavin Joseph from comment #7)
> Submitted bug report to upstream:
> https://lore.kernel.org/regressions/3a1b9909-45ac-4f97-ad68-d16ef1ce99db@pavinjoseph.com/

Thanks!

In the post above, you showed:
Git bisect logs:
git bisect start
# status: waiting for both good and bad commits
# bad: [b631f5b445dc3379f67ff63a2e4c58f22d4975dc] Linux 6.7.6
git bisect bad b631f5b445dc3379f67ff63a2e4c58f22d4975dc
# status: waiting for good commit(s), bad commit known
# good: [004dcea13dc10acaf1486d9939be4c793834c13c] Linux 6.7.5
git bisect good 004dcea13dc10acaf1486d9939be4c793834c13c

... and now git bisect should point to the commit in the middle to be tested.
Did you perform testing further?
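
As a rough guide to how much testing is left at this point: each bisection round halves the remaining range, so the number of build-and-boot rounds grows with the base-2 logarithm of the commit count. A minimal sketch; bisect_rounds is a made-up helper and git's own estimate can differ by a step.

```shell
# Estimate how many build-and-boot rounds a bisection needs for n candidate commits.
# Each round halves the range, so this is ceil(log2(n)).
bisect_rounds() {
    n="$1"
    rounds=0
    span=1
    while [ "$span" -lt "$n" ]; do
        span=$((span * 2))
        rounds=$((rounds + 1))
    done
    echo "$rounds"
}
```

For an arbitrary example of 500 commits, bisect_rounds 500 prints 9.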
Comment 9 Pavin Joseph 2024-03-02 08:32:36 UTC
Hi there,

Did the full bisection and found the culprit. Didn't quite understand the whole procedure until reading this [0] guide.

Issue reproduced on mainline and current stable 6.7.7.

Submitted response to upstream detailing all this.
Hope it's fixed soon. Let me know if there's anything I can do to improve testing for kexec bugs using openQA or OBS. This bug found its way into kernel-longterm as well, and since kexec is a feature I use almost every day (my personal machine's firmware is quite slow), it's quite concerning that no one caught this in testing.

Bisection logs:
git bisect start
# status: waiting for both good and bad commits
# good: [004dcea13dc10acaf1486d9939be4c793834c13c] Linux 6.7.5
git bisect good 004dcea13dc10acaf1486d9939be4c793834c13c
# status: waiting for bad commit, 1 good commit known
# bad: [b631f5b445dc3379f67ff63a2e4c58f22d4975dc] Linux 6.7.6
git bisect bad b631f5b445dc3379f67ff63a2e4c58f22d4975dc
# good: [00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1] misc: fastrpc: Mark all sessions as invalid in cb_remove
git bisect good 00c48bfbd6b29b8ebf64edd059dbf9e95cedd5b1
# bad: [6e85c91e7d63e46de1b4a0cb90212356da8a41cb] io_uring/net: fix multishot accept overflow handling
git bisect bad 6e85c91e7d63e46de1b4a0cb90212356da8a41cb
# good: [fe32ecf2e66f069230628e8917d26911c5fb2482] eventfs: Restructure eventfs_inode structure to be more condensed
git bisect good fe32ecf2e66f069230628e8917d26911c5fb2482
# good: [f385565bd76b581a83b62a5b6f88ea6f149f8b83] ring-buffer: Clean ring_buffer_poll_wait() error return
git bisect good f385565bd76b581a83b62a5b6f88ea6f149f8b83
# good: [992c8a5f10f81af32c3272c200fc003fb7450401] powerpc/64: Set task pt_regs->link to the LR value on scv entry
git bisect good 992c8a5f10f81af32c3272c200fc003fb7450401
# good: [d79adbe1cd67bc76608e036ee2f98b71c083d9ce] x86/Kconfig: Transmeta Crusoe is CPU family 5, not 6
git bisect good d79adbe1cd67bc76608e036ee2f98b71c083d9ce
# good: [fa2b524a73545d25ae15e3d2930b9bfa83b40827] KVM: x86: make KVM_REQ_NMI request iff NMI pending for vcpu
git bisect good fa2b524a73545d25ae15e3d2930b9bfa83b40827
# bad: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
git bisect bad 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
# good: [6d10c8c5abd1437dcbc209e307d930da60b86e91] KVM: x86/pmu: Fix type length error when reading pmu->fixed_ctr_ctrl
git bisect good 6d10c8c5abd1437dcbc209e307d930da60b86e91
# first bad commit: [7143c5f4cf2073193eb27c9cdb84fd4655d1802d] x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
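
A log in this format can be summarized with standard text tools. A minimal sketch, assuming the log is fed on stdin (for example from git bisect log); both function names are made up for illustration.

```shell
# Print the hash from the "# first bad commit: [<hash>] <subject>" line of a bisect log.
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p'
}

# Count the recorded "git bisect good|bad <hash>" commands,
# which includes the two starting endpoints.
count_tested() {
    grep -c -E '^git bisect (good|bad) '
}
```

Applied to the log above, count_tested reports 11: the two endpoints plus nine intermediate builds.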


Culprit:
7143c5f4cf2073193eb27c9cdb84fd4655d1802d is the first bad commit
commit 7143c5f4cf2073193eb27c9cdb84fd4655d1802d
Author: Steve Wahl <steve.wahl@hpe.com>
Date:   Fri Jan 26 10:48:41 2024 -0600

    x86/mm/ident_map: Use gbpages only where full GB page should be mapped.
    
    commit d794734c9bbfe22f86686dc2909c25f5ffe1a572 upstream.
    
    When ident_pud_init() uses only gbpages to create identity maps, large
    ranges of addresses not actually requested can be included in the
    resulting table; a 4K request will map a full GB.  On UV systems, this
    ends up including regions that will cause hardware to halt the system
    if accessed (these are marked "reserved" by BIOS).  Even processor
    speculation into these regions is enough to trigger the system halt.
    
    Only use gbpages when map creation requests include the full GB page
    of space.  Fall back to using smaller 2M pages when only portions of a
    GB page are included in the request.
    
    No attempt is made to coalesce mapping requests. If a request requires
    a map entry at the 2M (pmd) level, subsequent mapping requests within
    the same 1G region will also be at the pmd level, even if adjacent or
    overlapping such requests could have been combined to map a full
    gbpage.  Existing usage starts with larger regions and then adds
    smaller regions, so this should not have any great consequence.
    
    [ dhansen: fix up comment formatting, simplifty changelog ]
    
    Signed-off-by: Steve Wahl <steve.wahl@hpe.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/20240126164841.170866-1-steve.wahl%40hpe.com
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 arch/x86/mm/ident_map.c | 23 ++++++++++++++++++-----
 1 file changed, 18 insertions(+), 5 deletions(-)
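
The rule in the commit message (use a gbpage only when the whole 1 GiB page falls inside the requested range) can be illustrated with a little address arithmetic. This is an illustration of the described behaviour only, not the kernel's code; use_gbpage and its arguments are made up for this sketch.

```shell
# Illustrative only: does the 1 GiB page containing addr lie fully inside [start, end)?
GB=$((1 << 30))

use_gbpage() {
    start=$(($1)); end=$(($2)); addr=$(($3))
    page_start=$((addr & ~(GB - 1)))   # round addr down to a 1 GiB boundary
    page_end=$((page_start + GB))
    [ "$page_start" -ge "$start" ] && [ "$page_end" -le "$end" ]
}
```

With these definitions, use_gbpage 0x40000000 0x80000000 0x40000000 succeeds because the request covers the full page, while a 4K request such as use_gbpage 0x40001000 0x40002000 0x40001000 fails and would fall back to smaller pages.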

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
Comment 10 Takashi Iwai 2024-03-02 08:39:18 UTC
Great, could you follow up your reported mail thread for this info, so that the proper upstream devs get involved for fixing the issue?
Comment 11 Pavin Joseph 2024-03-02 08:45:44 UTC
@Takashi Sure 👍

Please improve testing to catch kexec bugs like this in the future. Building several different kernels and enduring the rather slow boot process on a low-end laptop about 20 times over the last few days is not an experience I want to repeat 🥺
Comment 13 Michal Hocko 2024-03-04 17:08:08 UTC
Just wondering, has this been brought up upstream?
Comment 14 Takashi Iwai 2024-03-04 17:17:42 UTC
(In reply to Michal Hocko from comment #13)
> Just wondering, has this been brought up upstream?

Yes, see comment 7.
Comment 15 Jiri Slaby 2024-03-05 08:01:38 UTC
As I understand the thread, you have not tried the latest mainline. So is this reproducible with 6.8-rc*?

You can either try:
https://download.opensuse.org/repositories/Kernel:/vanilla/standard/

or you build from the clone as in:
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git ~/linux/
cd ~/linux/
<build the kernel>
...
(no stable tree adding here)
Comment 16 Pavin Joseph 2024-03-05 08:26:38 UTC
(In reply to Jiri Slaby from comment #15)
> As I understand the thread, you have not tried the latest mainline. So is
> this reproducible with 6.8-rc*?

I have reproduced the issue on mainline, current stable (6.7.7), and a full git bisection was done between the last known good version 6.7.5 and the first known bad version 6.7.6.

Reverting culprit commit on mainline fixed the issue.

https://lore.kernel.org/regressions/fe72c912-f1a0-4a53-88ab-b85e8c3f7bd9@pavinjoseph.com/T/#m85c3dc66389b405ed8e789d8153e172644c57f23
Comment 17 Jiri Slaby 2024-03-05 11:05:04 UTC
(In reply to Pavin Joseph from comment #16)
> (In reply to Jiri Slaby from comment #15)
> > As I understand the thread, you have not tried the latest mainline. So is
> > this reproducible with 6.8-rc*?
> 
> I have reproduced the issue on mainline, current stable (6.7.7), and a full
> git bisection was done between the last known good version 6.7.5 and the
> first known bad version 6.7.6.
> 
> Reverting culprit commit on mainline fixed the issue.

Note 6.7 is *not* mainline. That's why I asked for testing 6.8-rc*.
Comment 18 Pavin Joseph 2024-03-05 14:02:55 UTC
(In reply to Jiri Slaby from comment #17)
> Note 6.7 is *not* mainline. That's why I asked for testing 6.8-rc*.

Jiri, yes, I understand 😉. I tested with 6.8-rc (the one Linus maintains) and the issue could be reproduced in it.

I followed the updated docs [0].
Its steps go through mainline (6.8-rc*), stable (6.7.7), and only then does it begin the bisection. The final step for validation is to revert the identified culprit commit on mainline.

[0]: https://www.leemhuis.info/files/misc/How%20to%20bisect%20a%20Linux%20kernel%20regression%20%e2%80%94%20The%20Linux%20Kernel%20documentation.html
Comment 19 Pavin Joseph 2024-04-15 22:15:41 UTC
Kexec has been fixed in kernel 6.8.5 and LTS kernel 6.6.26.
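
A quick way to compare a kernel version against the two fixed releases named above is sketched below. The kexec_fix_status helper is hypothetical and only classifies the 6.6 and 6.8 series; anything else prints "unknown", and a version older than the fix may simply predate the regression.

```shell
# Classify a kernel version against the fixed releases named in this report
# (6.8.5 for stable, 6.6.26 for the 6.6 LTS series).
kexec_fix_status() {
    ver="${1%%-*}"                      # drop a package suffix such as "-1.1"
    major="${ver%%.*}"
    rest="${ver#*.}"
    minor="${rest%%.*}"
    patch="${rest#*.}"
    [ "$patch" = "$rest" ] && patch=0   # e.g. "6.8" has no patch level
    patch="${patch%%.*}"
    case "$major.$minor" in
        6.6) min=26 ;;
        6.8) min=5 ;;
        *)   echo unknown; return 0 ;;
    esac
    if [ "$patch" -ge "$min" ]; then echo has-fix; else echo older-than-fix; fi
}
```

It could be called as kexec_fix_status "$(uname -r)".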
Thank you for everyone's help 😄