Bug 1137064

Summary: Upgrade from 15.0 to 15.1 breaks grub2 on system with boot RAID
Product: [openSUSE] openSUSE Distribution
Reporter: Peter Loibl <loiblp>
Component: Bootloader
Assignee: Jiri Srain <jsrain>
Status: NEW ---
QA Contact: Jiri Srain <jsrain>
Severity: Major
Priority: P5 - None
CC: jreidinger, loiblp, mchang
Version: Leap 15.1
Flags: jsrain: needinfo? (loiblp)
Target Milestone: ---
Hardware: x86
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: Archive containing the broken files from Leap 15.1, the corrected ones and the originals from Leap 15.0

Description Peter Loibl 2019-06-02 15:10:00 UTC
Created attachment 806601 [details]
Archive containing the broken files from Leap 15.1, the corrected ones and the originals from Leap 15.0

The server uses a RAID 0 boot device and a RAID 6 data device.

Upgrade from Leap 15.0 to Leap 15.1 caused the server to stop after the initial reboot at the step where the initrd should be loaded. grub2 does not respond to any actions, i.e. it is dead. It will wait there for hours (I tried).

After several attempts at reinstalling grub2 and the kernel, the boot problem was found to be located in /boot/grub2/grub.cfg (please refer to grub.cfg.bad at 15.1/boot/grub in the attached archive). The "linux" statement contained far too many "resume" entries. Removing these duplicates reduced the file size from 169k to 9k, and the server could be booted.

The original error, however, seems to be located in the variable GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub (please refer to grub.bad in /etc/default in the attached archive).
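
A quick way to confirm this kind of damage is to count repeated parameters in the variable. The following is only a minimal sketch: the function name `duplicated_params` is made up for illustration, and it assumes the value is a single shell-quoted string on one line, as is usual in /etc/default/grub.

```python
import shlex
from collections import Counter


def duplicated_params(grub_default_text, variable="GRUB_CMDLINE_LINUX_DEFAULT"):
    """Return {parameter: count} for parameters that occur more than once.

    Hypothetical helper: assumes the variable is assigned on a single line
    with an optionally quoted value, e.g. VAR="a b c".
    """
    prefix = variable + "="
    for line in grub_default_text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # shlex.split removes the surrounding shell quotes from the value.
            words = shlex.split(line[len(prefix):])
            params = " ".join(words).split()
            counts = Counter(params)
            return {p: n for p, n in counts.items() if n > 1}
    return {}


if __name__ == "__main__":
    with open("/etc/default/grub") as f:
        for param, count in duplicated_params(f.read()).items():
            print(f"{count:5d} x {param}")
```

An empty result means no parameter is repeated verbatim. Parameters that share a key but differ in value (two different "resume=" devices, for example) are not reported by this sketch.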

The attached archive also contains the corresponding files from Leap 15.0 (taken from the backup made one day earlier).

1. Why is the variable GRUB_CMDLINE_LINUX_DEFAULT filled with so many identical entries during the update?
2. Why is the variable GRUB_CMDLINE_LINUX_DEFAULT used without a proper check for errors and appended many times over in /boot/grub2/grub.cfg?
3. Why does grub2 just stop without giving an error message and without timeout? Do we have a buffer overflow here?
Comment 1 Jiri Srain 2019-06-03 07:18:07 UTC
Well, it clearly is a bug that /etc/default/grub contains that many copies of the same append. Could you please attach the logs from the upgrade (assuming that you upgraded from 15.0 to 15.1 with YaST, meaning by booting from the installation media)?

https://en.opensuse.org/openSUSE:Report_a_YaST_bug

The log should show how/when the command-line became broken.


About the GRUB2 behavior when the cmdline is that long: I added Michael to CC, maybe he can comment on the limits. However, it is not the root cause of the problem.
Comment 2 Peter Loibl 2019-06-03 18:21:33 UTC
The logfile is too large to upload. Pls. find the file at https://my.hidrive.com/lnk/2LhJvNbR
Comment 3 Josef Reidinger 2019-06-04 11:59:37 UTC
Hi Peter, let me explain how this works now in 15.1. During the upgrade we read the original configuration, allow the user to change it (because of the new CPU mitigation option), and then write it. The issue is that it is written after the upgrade. And we do not want to lose any modifications made by RPM post-install scripts, so for safety we simply append the kernel command line (other options are just rewritten). As you can see, in 15.0 you already had a fairly long append line (I cannot see from the logs why), and it got duplicated, which results in this issue. We definitely need to find a way to merge duplicates reasonably. Perhaps keep only the last occurrence of each parameter?
Comment 4 Peter Loibl 2019-06-06 16:27:35 UTC
Hi Josef, thanks for clarifying. I just want to stress three points:

1. I never edited /etc/default/grub, so there must have been some underlying bug in the past ... but this is not the issue here

2. I have some 25 years of development experience in an area where software bugs are not really acceptable. So, my quality expectations might be on the upper end. But your bug caused me to search for several hours for a solution ... Appending parameters to an existing configuration without checking what is already there has never been a good implementation decision (what about conflicting parameters?)

3. Why does grub just die and not even time out? Sounds pretty much like a good buffer overflow! Anyone looking for a nice scenario for a not so easy to find code injection attack vector? (Exactly that is the reason why I set the severity to major)
Comment 5 Josef Reidinger 2019-06-07 06:52:17 UTC
(In reply to Peter Loibl from comment #4)
> Hi Josef, thanks for clarifying. I just want to stress three points:
> 
> 1. I never edited /etc/default/grub, so there must have been some underlying
> bug in the past ... but this is not the issue here
> 
> 2. I have some 25 years of development experience in an area where
> software bugs are not really acceptable. So, my quality expectations might
> be on the upper end. But your bug caused me to search for several hours for
> a solution ... Appending parameters to an existing configuration without
> checking what is already there has never been a good implementation decision
> (what about conflicting parameters?)

The kernel does not have conflicting params besides "noresume", which is handled now. Others are simply overwritten: if there is "quiet verbose quiet", the last one wins. But I agree we need to address this issue, and the solution should be to remove duplicate params (so if there is e.g. "quiet verbose quiet", reduce it to "verbose quiet", always keeping just the last occurrence of a param).
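
The "keep only the last occurrence" idea could look roughly like the sketch below. This is an illustration only, not the YaST implementation: the function name is hypothetical, and it assumes parameters are whitespace-separated and that only exact duplicates are merged (two "resume=" entries with different values would both be kept).

```python
def keep_last_occurrence(cmdline):
    """Drop repeated parameters, keeping only the last occurrence of each.

    Assumption: parameters are separated by whitespace and contain no
    quoted values with embedded spaces.
    """
    seen = set()
    kept = []
    # Walk from the end so the first copy we meet is the last occurrence.
    for param in reversed(cmdline.split()):
        if param not in seen:
            seen.add(param)
            kept.append(param)
    kept.reverse()  # restore the original left-to-right order
    return " ".join(kept)


print(keep_last_occurrence("quiet verbose quiet"))  # prints "verbose quiet"
```

Running the same function over the value again leaves it unchanged, so applying it on every write would keep repeated appends from accumulating.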

> 
> 3. Why does grub just die and not even time out? Sounds pretty
> much like a good buffer overflow! Anyone looking for a nice scenario for a
> not so easy to find code injection attack vector? (Exactly that is the
> reason why I set the severity to major)

Question for Michael: can you comment on the grub behavior here? Or maybe it does not die, and the kernel simply dies when it gets those params?
Comment 6 Michael Chang 2019-06-20 07:12:28 UTC
(In reply to Josef Reidinger from comment #5)
> (In reply to Peter Loibl from comment #4)
> > Hi Josef, thanks for clarifying. I just want to stress three points:
> > 
> > 1. I never edited /etc/default/grub, so there must have been some underlying
> > bug in the past ... but this is not the issue here
> > 
> > 2. I have some 25 years of development experience in an area where
> > software bugs are not really acceptable. So, my quality expectations might
> > be on the upper end. But your bug caused me to search for several hours for
> > a solution ... Appending parameters to an existing configuration without
> > checking what is already there has never been a good implementation decision
> > (what about conflicting parameters?)
> 
> The kernel does not have conflicting params besides "noresume", which is
> handled now. Others are simply overwritten: if there is "quiet verbose
> quiet", the last one wins. But I agree we need to address this issue, and
> the solution should be to remove duplicate params (so if there is e.g.
> "quiet verbose quiet", reduce it to "verbose quiet", always keeping just
> the last occurrence of a param).
> 
> > 
> > 3. Why does grub just die and not even time out? Sounds pretty
> > much like a good buffer overflow! Anyone looking for a nice scenario for a
> > not so easy to find code injection attack vector? (Exactly that is the
> > reason why I set the severity to major)
> 
> Question for Michael: can you comment on the grub behavior here? Or maybe it
> does not die, and the kernel simply dies when it gets those params?

As far as I can see, grub would just discard any command line parameters that exceed the maximum command line size of the loaded kernel, as a result of a bounds check. An overflow is not likely here.

It is not clear to me at which point grub died. Was it at the very beginning or at the stage of loading the kernel and initrd? Was there any message left on the screen when it died?