Bug 1226320 - [Build 93.5] Continuous migration from sles15sp5 to sles15sp6 failed to boot with "grub_verify_string" not found
Summary: [Build 93.5] Continuous migration from sles15sp5 to sles15sp6 failed to boot ...
Status: NEW
Alias: None
Product: PUBLIC SUSE Linux Enterprise Server 15 SP6
Classification: openSUSE
Component: Bootloader
Version: unspecified
Hardware: Other Other
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assignee: Bootloader Maintainers
QA Contact:
URL: https://openqa.suse.de/tests/14612205...
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2024-06-14 09:01 UTC by Huajian Luo
Modified: 2024-07-03 05:23 UTC (History)
3 users

See Also:
Found By: openQA
Services Priority:
Business Priority:
Blocker: Yes
Marketing QA Status: ---
IT Deployment: ---


Attachments
grub screenshot (15.03 KB, image/png)
2024-06-14 09:06 UTC, Huajian Luo
Details
y2log (8.35 MB, application/x-bzip)
2024-06-14 09:55 UTC, Huajian Luo
Details
yast2_migration-tree-disk (1.66 KB, text/plain)
2024-06-18 05:06 UTC, Huajian Luo
Details
yast2_migration-grub_installdevice (45 bytes, text/plain)
2024-06-18 05:07 UTC, Huajian Luo
Details

Description Huajian Luo 2024-06-14 09:01:38 UTC
## Error message:

error: ../../grub-core/kern/dl.c:380: symbol 'grub_verify_string' not found


## test steps:
This is a continuous migration test from sles15sp3 -> sles15sp5 -> sles15sp6:
1) first we fresh-installed sles15sp3 https://openqa.suse.de/tests/14607835
2) migrated to sles15sp5 https://openqa.suse.de/tests/14608719
3) migrated to sles15sp6; the migration finished, but the system failed to
   boot and showed the error message above.
4) we also hit the same failure from sles15sp4->sles15sp5->sles15sp6
   https://openqa.suse.de/tests/14616792

I've tried to get a y2log but failed. 


## Observation

openQA test in scenario sle-15-SP6-Migration-from-SLE15-SPx-x86_64-migr_sles_continuous_15sp3_15sp5_ph1@64bit fails in
[yast2_migration](https://openqa.suse.de/tests/14612205/modules/yast2_migration/steps/19)

## Test suite description
Online continuous migration from SLE 15 SP3 to current SLE 15 version. This second step starts on the intermediate SLE 15 SP5 version.


## Reproducible

Fails since (at least) Build [93.1](https://openqa.suse.de/tests/14597014)


## Expected result

Last good: [92.1](https://openqa.suse.de/tests/14386740) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.suse.de/tests/latest?arch=x86_64&distri=sle&flavor=Migration-from-SLE15-SPx&machine=64bit&test=migr_sles_continuous_15sp3_15sp5_ph1&version=15-SP6)
Comment 1 Huajian Luo 2024-06-14 09:06:39 UTC
Created attachment 875485 [details]
grub screenshot
Comment 2 Huajian Luo 2024-06-14 09:15:30 UTC
We are trying to save the y2log but have so far failed; we can provide more log files from a manual test later.
Comment 3 Huajian Luo 2024-06-14 09:55:23 UTC
Created attachment 875487 [details]
y2log
Comment 4 Michael Chang 2024-06-17 04:40:33 UTC
Hi Huajian,

This looks like a setup issue on legacy PC BIOS, where the boot device is different from the one where grub is actually installed. The reason this surfaced recently is that the patch [1] was reverted between SP5 and SP6, so we no longer try to cover up such issues across a version update.

Would you please provide the full y2log for analysis?

[1] https://build.suse.de/projects/SUSE:SLE-15-SP5:Update/packages/grub2/files/0046-squash-verifiers-Move-verifiers-API-to-kernel-image.patch?expand=1

Thanks.
Comment 5 Michael Chang 2024-06-17 04:58:55 UTC
(In reply to Michael Chang from comment #4)

[snip]

> The reason this surfaced recently is that the patch [1] was reverted
> between SP5 and SP6, so we no longer try to cover up such issues across a
> version update.

This cannot explain the regression between builds 92.1 and 93.1, as the patch was removed at the very beginning of SLE-15-SP6 development. Still, the observation is valid: the ABI mismatch is most likely a setup issue that the patch no longer covers up. Could it be that the SUT underwent some modification in the test image itself?

Thanks.
Comment 6 Huajian Luo 2024-06-17 07:38:02 UTC
https://bugzilla.suse.com/attachment.cgi?id=875487 is the full y2log I've uploaded.
We can't connect to openqa.suse.de from Beijing today.

If it's a setup issue: the system can still boot after the migration from sles15sp3 to sles15sp5. Now that you no longer cover up this error, how can we work around it before/after the migration so that it boots in this scenario?

Thank you very much for the help.
Comment 7 Michael Chang 2024-06-17 08:52:02 UTC
(In reply to Huajian Luo from comment #6)
> https://bugzilla.suse.com/attachment.cgi?id=875487 is the full y2log I've
> uploaded.
> We can't connect to openqa.suse.de from Beijing today.
> 
> If it's a setup issue: the system can still boot after the migration from
> sles15sp3 to sles15sp5. Now that you no longer cover up this error, how can
> we work around it before/after the migration so that it boots in this
> scenario?

From SLE-15-SP5 to SLE-15-SP6, the grub version is bumped from 2.06 to 2.12. We cannot realistically cover up all ABI changes, as there could be more than just "grub_verify_string" in a version update.

I looked a bit into the log, and indeed grub is installed to the partition, so chances are the MBR is still occupied by the old grub from before the migration.

# <<<<<<<<<<<<<<<<
# target = i386-pc
# + /usr/sbin/grub2-install --target=i386-pc --force --skip-fs-probe /dev/vda2
# Installing for i386-pc platform.
# Installation finished. No error reported.

Is it possible to use the disk's MBR as the grub install device and retest? That would eliminate a potential inconsistency between the boot device and the setup device.
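[Editor's note] As a minimal sketch, that suggestion amounts to replacing the entries in /etc/default/grub_installdevice with the whole-disk device. A throwaway temp file stands in for the real file below so the snippet is runnable anywhere; the paths are the ones from this bug, and on the SUT you would edit the real file and then re-run the bootloader installation:

```shell
# Sketch only: point the install device at the disk's MBR instead of the
# partition. /dev/vda2 and the by-path name are taken from this bug report;
# a temp file stands in for /etc/default/grub_installdevice.
conf=$(mktemp)
printf '/dev/vda2\n/dev/disk/by-path/pci-0000:00:09.0\n' > "$conf"  # current entries
printf '/dev/vda\n' > "$conf"                                       # whole-disk MBR instead
cat "$conf"
# On the real system: edit /etc/default/grub_installdevice accordingly and
# re-run update-bootloader so pbl reinstalls grub to the MBR.
```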

Thanks.
Comment 8 Michael Chang 2024-06-17 08:55:55 UTC
(In reply to Michael Chang from comment #7)
> (In reply to Huajian Luo from comment #6)

[snip]

Btw, there's also this error in update-bootloader/pbl after running grub2-install.

# /dev/disk/by-path/pci-0000:00:09.0: not a block device

Hi Steffen,

Do you have any idea what results in the message above?
Thanks.
Comment 9 Huajian Luo 2024-06-17 09:22:48 UTC
Actually it failed after migration in `update-bootloader`, so as an end user, how can I apply a workaround to fix it in the openQA test?

> # update-bootloader: 2024-06-14 05:45:20 <3> update-bootloader-0160 run_command.338: '/usr/lib/bootloader
> /grub2/install' failed with exit code 1, output:
> # <<<<<<<<<<<<<<<<
> # target = i386-pc
> # + /usr/sbin/grub2-install --target=i386-pc --force --skip-fs-probe /dev/vda2
> # Installing for i386-pc platform.
> # Installation finished. No error reported.
> # /dev/disk/by-path/pci-0000:00:09.0: not a block device
> # >>>>>>>>>>>>>>>>

Thanks!
Comment 10 Michael Chang 2024-06-17 09:34:00 UTC
You may boot into a rescue shell, mount the root partition, chroot, and re-install grub to the MBR there.

Assuming /dev/vda2 is the root partition, the command sequence would be like:

> mount /dev/vda2 /mnt
> mount --bind /proc /mnt/proc
> mount --bind /sys /mnt/sys
> mount --bind /dev /mnt/dev
> chroot /mnt
> mount -a
> /usr/sbin/grub2-install /dev/vda

Thanks.
Comment 11 Michael Chang 2024-06-17 09:51:35 UTC
Yes, the error could be related. Could you please attach the file /etc/default/grub_installdevice?

I suspect that there are two device entries in /etc/default/grub_installdevice. One is /dev/vda2, which is a partition boot record, and the other is /dev/disk/by-path/pci-0000:00:09.0, which is the MBR device. For some reason the latter is missing, so it is skipped and the disk's MBR is not updated, leading to this ABI mismatch error.

For example (pseudo output):

> cat /etc/default/grub_installdevice:
> /dev/vda2
> /dev/disk/by-path/pci-0000:00:09.0

In addition, please also provide output of:

> tree /dev/disk

So we can get a better overview of the persistent device names on the system.
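[Editor's note] A quick sanity check along these lines is to verify that every entry in /etc/default/grub_installdevice still resolves to a block device, which is the same check behind pbl's "not a block device" message. A sample file with the entries from this bug is used so the sketch runs anywhere; on the SUT you would read the real file instead:

```shell
# Hypothetical sketch: flag stale entries in grub_installdevice. The two
# entries are taken from this bug; a temp file stands in for the real
# /etc/default/grub_installdevice.
conf=$(mktemp)
printf '/dev/vda2\n/dev/disk/by-path/pci-0000:00:09.0\n' > "$conf"
while read -r dev; do
  if [ -b "$dev" ]; then
    echo "ok: $dev"
  else
    echo "not a block device: $dev"   # mirrors the pbl error in comment 8
  fi
done < "$conf"
```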

Thanks.
Comment 12 Huajian Luo 2024-06-18 05:06:20 UTC
I've obtained the two outputs you requested.
> cat /etc/default/grub_installdevice
> 
> /dev/vda2
> /dev/disk/by-path/pci-0000:00:09.0
Comment 13 Huajian Luo 2024-06-18 05:06:59 UTC
Created attachment 875536 [details]
yast2_migration-tree-disk
Comment 14 Huajian Luo 2024-06-18 05:07:45 UTC
Created attachment 875537 [details]
yast2_migration-grub_installdevice
Comment 15 Huajian Luo 2024-06-18 05:08:55 UTC
I've uploaded yast2_migration-tree-disk and yast2_migration-grub_installdevice for more information. Please take a look again. Thanks.
Comment 16 Michael Chang 2024-06-18 06:53:28 UTC
(In reply to Huajian Luo from comment #15)
> I've uploaded yast2_migration-tree-disk and
> yast2_migration-grub_installdevice for more information. Please take a look
> again. Thanks.

It seems the migration changed the disk's PCI device number from pci-0000:00:09.0 to pci-0000:00:08.0. To confirm that, could you please check the same tree output on SP4 or SP5 (before the migration), so we can be sure it is a regression in SP6?
Thanks.
Comment 17 Michael Chang 2024-06-18 07:19:16 UTC
In fact, the PCI device number is assigned by the virtualization guest's config, which looks like this:

> <devices>
>   <disk type='file' device='disk'>
>    <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>

Probably the first thing to check is that the PCI device number in the virtualization config has not changed over the entire test session; otherwise we have to identify when the by-path/ symlink started to carry mismatched info.
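[Editor's note] One way to automate that check is to parse the guest's domain XML (e.g. from `virsh dumpxml <guest>`) and print each disk's PCI address in by-path notation, comparing the result across the test session. The XML below is a made-up minimal fragment modeled on the one quoted above:

```python
# Sketch: list each disk's PCI address from libvirt domain XML so it can be
# compared before/after migration. The XML here is a hypothetical example;
# feed in real `virsh dumpxml` output on an actual host.
import xml.etree.ElementTree as ET

domain_xml = """
<domain>
  <devices>
    <disk type='file' device='disk'>
      <target dev='vda'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x09' function='0x0'/>
    </disk>
  </devices>
</domain>
"""

addrs = []
root = ET.fromstring(domain_xml)
for disk in root.iter("disk"):
    a = disk.find("address")
    addrs.append("{} pci-{:04x}:{:02x}:{:02x}.{:x}".format(
        disk.find("target").get("dev"),
        int(a.get("domain"), 16), int(a.get("bus"), 16),
        int(a.get("slot"), 16), int(a.get("function"), 16)))
print(addrs)  # a stable config should print the same list on every run
```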
Comment 18 Huajian Luo 2024-06-20 09:43:38 UTC
This is the `tree /dev/disk` output before migration to sles15sp6:

```
/dev/disk
├── by-id
│   ├── scsi-0QEMU_QEMU_CD-ROM_cd0 -> ../../sr0
│   ├── virtio-hd0 -> ../../vda
│   ├── virtio-hd0-part1 -> ../../vda1
│   ├── virtio-hd0-part2 -> ../../vda2
│   ├── virtio-hd0-part3 -> ../../vda3
│   └── virtio-hd0-part4 -> ../../vda4
├── by-label
│   └── SLE-15-SP6-Online-x86_6493.5.001 -> ../../sr0
├── by-partuuid
│   ├── 63d306cd-116a-403a-9393-3bc95f8d4583 -> ../../vda1
│   ├── 6edff295-a7de-44e5-9b03-c3ab299882ac -> ../../vda2
│   ├── 80f58702-7aab-45e9-9bfb-65f7a2bb5ea1 -> ../../vda3
│   └── b579fea2-a650-4bef-a73b-2aca2dacd193 -> ../../vda4
├── by-path
│   ├── pci-0000:00:07.0-scsi-0:0:0:0 -> ../../sr0
│   ├── pci-0000:00:08.0 -> ../../vda
│   ├── pci-0000:00:08.0-part1 -> ../../vda1
│   ├── pci-0000:00:08.0-part2 -> ../../vda2
│   ├── pci-0000:00:08.0-part3 -> ../../vda3
│   ├── pci-0000:00:08.0-part4 -> ../../vda4
│   ├── virtio-pci-0000:00:08.0 -> ../../vda
│   ├── virtio-pci-0000:00:08.0-part1 -> ../../vda1
│   ├── virtio-pci-0000:00:08.0-part2 -> ../../vda2
│   ├── virtio-pci-0000:00:08.0-part3 -> ../../vda3
│   └── virtio-pci-0000:00:08.0-part4 -> ../../vda4
└── by-uuid
    ├── 2024-06-13-18-26-24-34 -> ../../sr0
    ├── 53d13e14-06f1-414d-8c7c-905b9ab08783 -> ../../vda2
    ├── 6e7c1f4c-6363-4294-a7fd-a7c9b1cc0d36 -> ../../vda4
    └── 8892f6d7-60f6-4910-a066-c5ec679b93a6 -> ../../vda3

5 directories, 26 files
```
Comment 19 Michael Chang 2024-06-21 03:42:13 UTC
(In reply to Huajian Luo from comment #18)
> This is the `tree /dev/disk` output before migration to sles15sp6:

Is it 15sp5 or 15sp4? The disk's by-path name matches that of 15sp6, so it's still unclear where the by-path name in /etc/default/grub_installdevice originates from.

Did you use a pre-built image for testing? If so, I wonder why the by-uuid name isn't being used, as the by-path name might differ between the build host and the deployed target.
Thanks.
Comment 20 Steffen Winterfeldt 2024-06-21 15:14:48 UTC
The yast log in comment 3 clearly shows vda linked to 0000:00:09.0 in the
original setup. So the grub config is correct.

What I think happened is an openQA VM config change. Note that even
unrelated changes, like adding a network interface or any other device
attached to the PCI bus, may change the enumeration.

I'd check with lspci, for example.
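[Editor's note] That check can be scripted by diffing lspci captures taken before and after the migration. The two lists below are abbreviated hypothetical captures (with a virtio RNG present only in the first), not the real output from this bug:

```python
# Sketch: diff two lspci captures to spot enumeration shifts. Removing the
# (hypothetical) RNG device shifts the virtio block device from 00:09.0 to
# 00:08.0, which is exactly the kind of change that breaks by-path names.
import difflib

before = [
    "00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI",
    "00:08.0 Unclassified device: Red Hat, Inc. Virtio RNG",
    "00:09.0 SCSI storage controller: Red Hat, Inc. Virtio block device",
]
after = [
    "00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI",
    "00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device",
]

diff = list(difflib.unified_diff(before, after, "before", "after", lineterm=""))
print("\n".join(diff))
```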
Comment 21 Huajian Luo 2024-07-01 09:09:24 UTC
This is the lspci output before and after migration:
-----before--migration------------------------------------------------
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
00:06.0 Communication controller: Red Hat, Inc. Virtio console
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
--------------------------------------------------------------------
-----After--migration-----------------------------------------------
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
00:03.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family) High Definition Audio Controller (rev 01)
00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
00:05.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
00:06.0 Communication controller: Red Hat, Inc. Virtio console
00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
-----------------------------------------------------------------------
Comment 22 Michael Chang 2024-07-03 05:23:54 UTC
(In reply to Huajian Luo from comment #21)
> This is the lspci output before and after migration:
> -----before--migration------------------------------------------------
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
> 00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:03.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family)
> High Definition Audio Controller (rev 01)
> 00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
> 00:05.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
> 00:06.0 Communication controller: Red Hat, Inc. Virtio console
> 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
> 00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
> --------------------------------------------------------------------
> -----After--migration-----------------------------------------------
> 00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
> 00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
> 00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
> 00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
> 00:02.0 VGA compatible controller: Device 1234:1111 (rev 02)
> 00:03.0 Audio device: Intel Corporation 82801FB/FBM/FR/FW/FRW (ICH6 Family)
> High Definition Audio Controller (rev 01)
> 00:04.0 Ethernet controller: Red Hat, Inc. Virtio network device
> 00:05.0 USB controller: Red Hat, Inc. QEMU XHCI Host Controller (rev 01)
> 00:06.0 Communication controller: Red Hat, Inc. Virtio console
> 00:07.0 SCSI storage controller: Red Hat, Inc. Virtio SCSI
> 00:08.0 SCSI storage controller: Red Hat, Inc. Virtio block device
> -----------------------------------------------------------------------

The lspci output did not list the virtio RNG device observed in the y2log attached in comment #3. With the virtio RNG device attached, the address of the virtio block device was shifted from 00:08.0 to 00:09.0, among others. [1]

To get a better understanding, you may check the device type through its [vid:pid] [2].

Therefore I agree with Steffen in comment #20. The VM config change during the migration test trips up the device naming based on the PCI path initially used for installation. Could you please check whether it can be avoided?

Thanks.

[1]
> 2024-06-13T18:29:15.128686-04:00 susetest kernel: [    0.694608] pci 0000:00:05.0: [1af4:1005] type 00 class 0x00ff00
> 2024-06-13T18:29:15.128687-04:00 susetest kernel: [    0.696507] pci 0000:00:05.0: reg 0x10: [io  0xc120-0xc13f]
> 2024-06-13T18:29:15.128688-04:00 susetest kernel: [    0.700225] pci 0000:00:05.0: reg 0x20: [mem 0xfe004000-0xfe007fff 64bit pref]
> 2024-06-13T18:29:15.128688-04:00 susetest kernel: [    0.703795] pci 0000:00:06.0: [1b36:000d] type 00 class 0x0c0330
> 2024-06-13T18:29:15.128688-04:00 susetest kernel: [    0.705031] pci 0000:00:06.0: reg 0x10: [mem 0xfebd4000-0xfebd7fff 64bit]
> 2024-06-13T18:29:15.128689-04:00 susetest kernel: [    0.708299] pci 0000:00:07.0: [1af4:1003] type 00 class 0x078000
> 2024-06-13T18:29:15.128689-04:00 susetest kernel: [    0.710120] pci 0000:00:07.0: reg 0x10: [io  0xc080-0xc0bf]
> 2024-06-13T18:29:15.128690-04:00 susetest kernel: [    0.711778] pci 0000:00:07.0: reg 0x14: [mem 0xfebda000-0xfebdafff]
> 2024-06-13T18:29:15.128690-04:00 susetest kernel: [    0.715041] pci 0000:00:07.0: reg 0x20: [mem 0xfe008000-0xfe00bfff 64bit pref]
> 2024-06-13T18:29:15.128691-04:00 susetest kernel: [    0.717717] pci 0000:00:08.0: [1af4:1004] type 00 class 0x010000
> 2024-06-13T18:29:15.128691-04:00 susetest kernel: [    0.720778] pci 0000:00:08.0: reg 0x10: [io  0xc0c0-0xc0ff]
> 2024-06-13T18:29:15.128691-04:00 susetest kernel: [    0.724258] pci 0000:00:08.0: reg 0x14: [mem 0xfebdb000-0xfebdbfff]
> 2024-06-13T18:29:15.128692-04:00 susetest kernel: [    0.727779] pci 0000:00:08.0: reg 0x20: [mem 0xfe00c000-0xfe00ffff 64bit pref]
> 2024-06-13T18:29:15.128692-04:00 susetest kernel: [    0.730824] pci 0000:00:09.0: [1af4:1001] type 00 class 0x010000
> 2024-06-13T18:29:15.128693-04:00 susetest kernel: [    0.732729] pci 0000:00:09.0: reg 0x10: [io  0xc000-0xc07f]
> 2024-06-13T18:29:15.128694-04:00 susetest kernel: [    0.735203] pci 0000:00:09.0: reg 0x14: [mem 0xfebdc000-0xfebdcfff]
> 2024-06-13T18:29:15.128694-04:00 susetest kernel: [    0.738210] pci 0000:00:09.0: reg 0x20: [mem 0xfe010000-0xfe013fff 64bit pref]

[2]
https://www.qemu.org/docs/master/specs/pci-ids.html
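[Editor's note] The [vid:pid] check in comment 22 can be sketched as follows. The log lines are trimmed copies of those in [1], and the ID table is a small subset of the QEMU PCI IDs documented in [2]:

```python
# Sketch: map PCI addresses to virtio device types via the [vid:pid] pairs
# printed in the kernel log. Only three IDs are listed; see [2] for the rest.
import re

log = """\
pci 0000:00:05.0: [1af4:1005] type 00 class 0x00ff00
pci 0000:00:08.0: [1af4:1004] type 00 class 0x010000
pci 0000:00:09.0: [1af4:1001] type 00 class 0x010000
"""

VIRTIO = {
    "1af4:1001": "virtio block device",
    "1af4:1004": "virtio SCSI",
    "1af4:1005": "virtio RNG",
}

found = {m.group(1): m.group(2)
         for m in re.finditer(r"pci (\S+): \[([0-9a-f]{4}:[0-9a-f]{4})\]", log)}
for addr, ids in sorted(found.items()):
    print(addr, ids, VIRTIO.get(ids, "unknown"))
```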