Bug 1159236 - Virtual machines which change disk type become unbootable due to resume partition path changing.
Status: RESOLVED FEATURE
Classification: openSUSE
Product: openSUSE Distribution
Component: Bootloader
Version: Leap 15.1
Hardware: Other Other
Importance: P5 - None : Normal
Target Milestone: ---
Assigned To: Jiri Srain
Reported: 2019-12-15 01:35 UTC by William Brown
Modified: 2020-01-30 11:33 UTC (History)
Description William Brown 2019-12-15 01:35:03 UTC
When changing a virtual machine's disk driver for the root volume, e.g. from virtio to SCSI, the path of the disks changes.

However, in openSUSE Leap 15.1 the hibernation (resume) partition is referenced by this path, for example:

menuentry 'openSUSE Leap 15.1'  --class opensuse --class gnu-linux --class gnu --class os $menuentry_id_option 'gnulinux-simple-661b1791-6e12-4c88-8a31-32322c30d045' {
	load_video
	set gfxpayload=keep
	insmod gzio
	insmod part_gpt
	insmod btrfs
	set root='hd0,gpt2'
	if [ x$feature_platform_search_hint = xy ]; then
	  search --no-floppy --fs-uuid --set=root --hint='hd0,gpt2'  661b1791-6e12-4c88-8a31-32322c30d045
	else
	  search --no-floppy --fs-uuid --set=root 661b1791-6e12-4c88-8a31-32322c30d045
	fi
	echo	'Loading Linux 4.12.14-lp151.28.36-default ...'
	linux	/boot/vmlinuz-4.12.14-lp151.28.36-default root=UUID=661b1791-6e12-4c88-8a31-32322c30d045  ${extra_cmdline} console=ttyS0,115200 resume=/dev/disk/by-path/pci-0000:00:07.0-part3 splash=silent quiet showopts
	echo	'Loading initial ramdisk ...'
	initrd	/boot/initrd-4.12.14-lp151.28.36-default
}

Note the following line:

	linux	/boot/vmlinuz-4.12.14-lp151.28.36-default root=UUID=661b1791-6e12-4c88-8a31-32322c30d045  ${extra_cmdline} console=ttyS0,115200 resume=/dev/disk/by-path/pci-0000:00:07.0-part3 splash=silent quiet showopts

When the virtio disk path or type changes, this path no longer exists; the partition has moved to a new path:

ls -al /dev/disk/by-path/
...
pci-0000:00:0c.0-part3 -> ../../vda3

As a result, this renders the virtual machine unable to boot unless manual intervention is performed.
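
A minimal sketch of how the stale path could be spotted from a running or rescue system. The helper names resume_arg and resume_target_exists are hypothetical and not part of any openSUSE tool; the first only parses a kernel command-line string, the second only tests whether the named device node is present.

```shell
# Hypothetical helper: print the value of the resume= parameter found in a
# kernel command-line string (prints nothing if there is none).
resume_arg() {
  printf '%s\n' "$1" | tr ' ' '\n' | sed -n 's/^resume=//p' | head -n 1
}

# Hypothetical helper: succeed only if the given device path, or the
# /dev/disk/by-uuid entry for a UUID=... spec, exists on this system.
resume_target_exists() {
  case "$1" in
    UUID=*) [ -e "/dev/disk/by-uuid/${1#UUID=}" ] ;;
    *)      [ -e "$1" ] ;;
  esac
}

# Usage (not run here):
#   dev=$(resume_arg "$(cat /proc/cmdline)")
#   resume_target_exists "$dev" || echo "resume device $dev is missing"
```

With the command line quoted above, resume_arg would print the by-path name, and the existence check would fail once the disk type has changed.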

I would like to request a number of outcomes to resolve this:

* Virtual machines should not be considered targets for hibernation by default (hibernation makes sense on a laptop or desktop, not a VM).
* The "resume" path should be changed to a stable identifier such as a UUID.
* Disabling of "resume" should be documented at https://en.opensuse.org/SDB:Suspend_to_disk to allow existing VMs to have hibernation "unconfigured".
* grub2-mkconfig should be fixed to detect the currently active swap (it still lists the previous path).
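
As an illustration of the second request, the following is a sketch only of rewriting a path-based resume= value to a UUID-based one. The helper name resume_to_uuid is hypothetical, and the assumption that the parameter sits in the GRUB_CMDLINE_LINUX_DEFAULT line of /etc/default/grub is typical for openSUSE but is not confirmed by this report.

```shell
# Hypothetical helper: replace the value of resume= in a command-line string
# with a UUID-based spec. $1 = command line, $2 = UUID of the swap partition.
resume_to_uuid() {
  printf '%s\n' "$1" | sed -E "s#(^|[ \"])resume=[^ \"]*#\1resume=UUID=$2#"
}

# Usage (not run here; needs root, and /dev/vda3 is only the device name
# taken from the listing in this report):
#   uuid=$(blkid -s UUID -o value /dev/vda3)
#   resume_to_uuid "$(grep '^GRUB_CMDLINE_LINUX_DEFAULT=' /etc/default/grub)" "$uuid"
```

The helper only prints the rewritten line; writing it back to /etc/default/grub and regenerating the menu is left to the administrator.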
Comment 1 Jiri Srain 2019-12-18 08:43:36 UTC
Thanks for bringing suggestions to resolve the issue. Let me comment on them:

1. Sounds like a reasonable approach, I also cannot see a use case for hibernating a virtual machine. And, if I for any reason want to: This is something that the virtualization environment could do for me, that makes more sense

2. You have this control - during installation or at any later time. The resume parameter comes from /etc/default/grub. Actually, I'm a bit surprised that only swap (for resume line) is affected, and not e.g. most of fstab

3. Not sure what exactly you mean - you can disable resume via changing the GRUB configuration (/etc/default/grub, grub2-mkconfig), or you can put noresume boot option to the cmdline to overcome a failing system

4. Does not help - it does not detect anything, only puts to the generated menu what is read from /etc/default/grub


In general: This is a problem which may happen easily also for physical machine - because of changing hardware, and I fear that there is not a general solution (even UUID will not help if you have to replace failed drive).
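
Point 3 above (disabling resume through the GRUB configuration) could look roughly like the sketch below. The helper name drop_resume is hypothetical, and the output path /boot/grub2/grub.cfg in the commented usage is the usual openSUSE location, stated here as an assumption rather than taken from this report.

```shell
# Hypothetical helper: drop any resume=... parameter (and one following
# space) from a command-line string, leaving all other parameters alone.
drop_resume() {
  printf '%s\n' "$1" | sed -E 's#(^|[ "])resume=[^ "]* ?#\1#'
}

# Usage (not run here; needs root):
#   1. Edit the kernel command line in /etc/default/grub so that it matches
#      the output of drop_resume for the current line.
#   2. Regenerate the menu:
#        grub2-mkconfig -o /boot/grub2/grub.cfg
```

For a one-off boot of a failing system, the same removal can be done by editing the linux line of the menu entry at the GRUB prompt; whether appending noresume instead is sufficient is not settled in this report.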
Comment 2 William Brown 2019-12-19 05:34:07 UTC
(In reply to Jiri Srain from comment #1)
> Thanks for bringing suggestions to resolve the issue. Let me comment on them:
> 
> 1. Sounds like a reasonable approach, I also cannot see a use case for
> hibernating a virtual machine. And, if I for any reason want to: This is
> something that the virtualization environment could do for me, that makes
> more sense

Cool, will you use this bz or another to follow up on the progress to this? 

> 
> 2. You have this control - during installation or at any later time. The
> resume parameter comes from /etc/default/grub. Actually, I'm a bit surprised
> that only swap (for resume line) is affected, and not e.g. most of fstab

The system installer (yast or other) wrote the path-id instead of a UUID here. I'm saying the installer should write this as the swap device UUID. 

> 
> 3. Not sure what exactly you mean - you can disable resume via changing the
> GRUB configuration (/etc/default/grub, grub2-mkconfig), or you can put
> noresume boot option to the cmdline to overcome a failing system

I couldn't find the noresume option in googling, and to recover I had to delete the resume= line. I also couldn't find that it was /etc/default/grub for a while, and when I did, because it's a transactional server (or kubic could be affected too) the process involved a "transactional-update shell" then the changes needed.

So what I'm saying is that the process to unconfigure this should be documented, rather than just tribal knowledge :) 

> 
> 4. Does not help - it does not detect anything, only puts to the generated
> menu what is read from /etc/default/grub

Okay, see point 2 then :) 

> 
> 
> In general: This is a problem which may happen easily also for physical
> machine - because of changing hardware, and I fear that there is not a
> general solution (even UUID will not help if you have to replace failed
> drive).

Perhaps this means there should be a "noresume" option in the advanced menu of grub then?
Comment 3 Jiri Srain 2019-12-19 07:23:42 UTC
(In reply to William Brown from comment #2)
> (In reply to Jiri Srain from comment #1)
> > Thanks for bringing suggestions to resolve the issue. Let me comment on them:
> > 
> > 1. Sounds like a reasonable approach, I also cannot see a use case for
> > hibernating a virtual machine. And, if I for any reason want to: This is
> > something that the virtualization environment could do for me, that makes
> > more sense
> 
> Cool, will you use this bz or another to follow up on the progress to this? 

Since I have another ticket on my radar about architectures, let's keep this one open for now; once I have a clear idea about the expected behavior, I will probably create a Jira ticket.

> > 2. You have this control - during installation or at any later time. The
> > resume parameter comes from /etc/default/grub. Actually, I'm a bit surprised
> > that only swap (for resume line) is affected, and not e.g. most of fstab
> 
> The system installer (yast or other) wrote the path-id instead of a UUID
> here. I'm saying the installer should write this as the swap device UUID. 

I would need to see the installation logs. Installer proposes a naming scheme, but you can always pick a different one.

> > 3. Not sure what exactly you mean - you can disable resume via changing the
> > GRUB configuration (/etc/default/grub, grub2-mkconfig), or you can put
> > noresume boot option to the cmdline to overcome a failing system
> 
> I couldn't find the noresume option in googling, and to recover I had to
> delete the resume= line. I also couldn't find that it was /etc/default/grub
> for a while, and when I did, because it's a transactional server (or kubic
> could be affected too) the process involved a "transactional-update shell"
> then the changes needed.
> 
> So what I'm saying is that the process to unconfigure this should be
> documented, rather than just tribal knowledge :) 

This is what I meant; you can remove it in /etc/default/grub and re-generate the menu.

Understood that transactional server does not make it any easier (like any change in the configuration).

> > 4. Does not help - it does not detect anything, only puts to the generated
> > menu what is read from /etc/default/grub
> 
> Okay, see point 2 then :) 
> 
> > 
> > 
> > In general: This is a problem which may happen easily also for physical
> > machine - because of changing hardware, and I fear that there is not a
> > general solution (even UUID will not help if you have to replace failed
> > drive).
> 
> Perhaps this means there should be a "noresume" option in the advanced menu
> of grub then?

Well, we used to have a "FailSafe" option in the boot menu, with parameters mostly tweaking kernel settings (like ACPI) and which - over time - was reported to make things worse than default settings - that's why we dropped it.

Recently, PM brought an idea to disable swap completely in the (no more existing) FailSafe section. I guess that we are ahead of a discussion what it should consist of - and noresume is clearly one of relevant ideas.

BTW: Does it help for you to manually append 'noresume' option on the kernel command-line?
Comment 4 Jiri Srain 2019-12-19 09:30:43 UTC
See also http://bugzilla.suse.com/show_bug.cgi?id=1159294
Comment 5 William Brown 2019-12-19 23:33:40 UTC
(In reply to Jiri Srain from comment #3)
> (In reply to William Brown from comment #2)
> > (In reply to Jiri Srain from comment #1)
> > > Thanks for bringing suggestions to resolve the issue. Let me comment on them:
> > > 
> > > 1. Sounds like a reasonable approach, I also cannot see a use case for
> > > hibernating a virtual machine. And, if I for any reason want to: This is
> > > something that the virtualization environment could do for me, that makes
> > > more sense
> > 
> > Cool, will you use this bz or another to follow up on the progress to this? 
> 
> Since I have another ticket on my radar about architectures, let's keep this
> one open for now; once I have a clear idea about the expected behavior, I
> will probably create a Jira ticket.

Great, thanks! 

> 
> > > 2. You have this control - during installation or at any later time. The
> > > resume parameter comes from /etc/default/grub. Actually, I'm a bit surprised
> > > that only swap (for resume line) is affected, and not e.g. most of fstab
> > 
> > The system installer (yast or other) wrote the path-id instead of a UUID
> > here. I'm saying the installer should write this as the swap device UUID. 
> 
> I would need to see the installation logs. Installer proposes a naming
> scheme, but you can always pick a different one.

I used the installer's default scheme, which appears to be UUID. That makes it even more concerning that the installer inserts path-based IDs instead of UUIDs in the default install case, given that it causes this. So I think that's a YaST bug.

> 
> > > 3. Not sure what exactly you mean - you can disable resume via changing the
> > > GRUB configuration (/etc/default/grub, grub2-mkconfig), or you can put
> > > noresume boot option to the cmdline to overcome a failing system
> > 
> > I couldn't find the noresume option in googling, and to recover I had to
> > delete the resume= line. I also couldn't find that it was /etc/default/grub
> > for a while, and when I did, because it's a transactional server (or kubic
> > could be affected too) the process involved a "transactional-update shell"
> > then the changes needed.
> > 
> > So what I'm saying is that the process to unconfigure this should be
> > documented, rather than just tribal knowledge :) 
> 
> This is what I meant; you can remove it in /etc/default/grub and re-generate
> the menu.
> 
> Understood that transactional server does not make it any easier (like any
> change in the configuration).

Okay, I'll just add it to the wiki myself. :) 

> 
> > > 4. Does not help - it does not detect anything, only puts to the generated
> > > menu what is read from /etc/default/grub
> > 
> > Okay, see point 2 then :) 
> > 
> > > 
> > > 
> > > In general: This is a problem which may happen easily also for physical
> > > machine - because of changing hardware, and I fear that there is not a
> > > general solution (even UUID will not help if you have to replace failed
> > > drive).
> > 
> > Perhaps this means there should be a "noresume" option in the advanced menu
> > of grub then?
> 
> Well, we used to have a "FailSafe" option in the boot menu, with parameters
> mostly tweaking kernel settings (like ACPI) and which - over time - was
> reported to make things worse than default settings - that's why we dropped
> it.
> 
> Recently, PM brought an idea to disable swap completely in the (no more
> > existing) FailSafe section. I guess that we are ahead of a discussion what
> it should consist of - and noresume is clearly one of relevant ideas.
> 
> BTW: Does it help for you to manually append 'noresume' option on the kernel
> command-line?

I haven't tried; I don't have access to the affected machine for a few weeks now due to travel. But I can easily find out later.
Comment 6 William Brown 2019-12-19 23:46:00 UTC
https://bugzilla.suse.com/show_bug.cgi?id=1159595 covers the installer incorrectly using paths instead of UUIDs.
Comment 7 Jiri Srain 2020-01-30 11:33:25 UTC
Thanks for the other bug you created.

For changing the default proposal based on virtualization, Lukas created a Jira epic:

https://jira.suse.com/browse/PM-1544

based on per-architecture decision.

Let's continue the discussion there; I'm closing this one as a FEATURE request.