Bugzilla – Bug 104065
Grub Error 18 on vmware gsxserver
Last modified: 2007-08-11 15:38:00 UTC
Grub Error 18 when installing Beta1 onto a standard VMware GSX Server image. The system boots, loads Grub, and halts with Error 18 before loading any kernel. No grub shell is available; the system hangs.

After CD1 is installed and the system reboots, I get a Grub Error 18 on VMware GSX Server, installing on a 20 GB virtual hard disk using the SCSI BusLogic driver. The initrd is created fine (verified by doing a rescue chroot, recreating the initrd, and installing grub again). Error 18 happens when the bootloader is beyond 1024 cylinders. I have never seen it happen before on GSX Server, so I assume it is grub related.

The default partitioning chosen:
/dev/sda1  swap  1 GB
/dev/sda2  /     18.9 GB

To verify whether the above error is genuine and related to the 1024-cylinder limit, I created the following partitions on my second trial:
/dev/sda1  /boot  512M
/dev/sda2  swap   10248M
/dev/sda3  /      the rest

This time the install is fine, so I assume that the 1024 limit is actually the reason for the error. I have installed many, many other Linuxes under VMware before (including my own distro Yoper, which I developed under VMware) and I have never seen an Error 18 -- seen others, but not that one. Hope this helps.
The generated /etc/grub.conf for the failing case would be nice, as well as the dump of the failing MBR. I suspect it's related to #103031; investigating...
No info within a week -- cannot be that critical. Meanwhile I'm pretty sure this is related. I _guess_ (since no info from reporter) that a generic MBR was installed and the virtual disk is erroneously partitioned and/or the virtual BIOS is flaky.
Created attachment 46477 [details] bootsector
Created attachment 46478 [details] grub config
Good things take a while ;) ... sorry for the delay. I created the dump with:

dd if=/dev/sda of=firstsect.img bs=512 count=1

/etc/grub.conf reads:

setup --stage2=/boot/grub/stage2 (hd0,1) (hd0,1)
quit
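As a side note, a dump like this can be sanity-checked for the standard boot-sector signature (0x55 0xAA at offset 510). A minimal sketch: the filename firstsect.img comes from the dd command in this comment, but a synthetic 512-byte image is built first here so the check runs anywhere; on the real system you would point it at firstsect.img instead.

```shell
# Build a synthetic 512-byte "boot sector" so this sketch is self-contained;
# replace $img with firstsect.img to check a real dump.
img=mbr-test.img
dd if=/dev/zero of="$img" bs=512 count=1 2>/dev/null
# Write the 0x55 0xAA signature (octal 125 252) at offset 510.
printf '\125\252' | dd of="$img" bs=1 seek=510 conv=notrunc 2>/dev/null

# Extract the last two bytes and compare against the expected signature.
sig=$(od -An -tx1 -j510 -N2 "$img" | tr -d ' \n')
if [ "$sig" = "55aa" ]; then
    echo "boot signature OK"
else
    echo "boot signature missing ($sig)"
fi
```

A dump that fails this check was not written as a valid boot sector at all, which would point away from the BIOS-addressing theory.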
OK. I may not have access to the (virtual) machine, but I'm nevertheless very confident that the virtual SCSI BIOS doesn't support LBA reads. No, this is not related to #103031; this is "only" a dysfunctional BIOS paired with a bad partition layout. Without LBA, the BIOS can only access roughly the first 8 GB of the disk (using extended CHS addressing). If anything that's required for booting (stage2, kernel, initrd files) lies beyond that, the bootstrap fails. This goes even beyond your (correct) assumption about the 1024-cylinder limit. I suggest you use a /boot partition at the beginning of the disk.
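The "first 8 GB" figure in this comment follows from the classic INT 13h CHS geometry limit; the arithmetic can be sketched with shell arithmetic (1024 cylinders x 255 heads x 63 sectors/track is the maximum legacy geometry, with 512-byte sectors):

```shell
# Maximum legacy INT 13h CHS geometry: 1024 cylinders, 255 heads,
# 63 sectors per track, 512 bytes per sector.
chs_sectors=$((1024 * 255 * 63))
chs_bytes=$((chs_sectors * 512))
echo "CHS-addressable sectors: $chs_sectors"   # 16450560
echo "CHS-addressable bytes:   $chs_bytes"     # 8422686720 (~7.8 GiB)
```

That 8,422,686,720-byte boundary is the commonly quoted "8 GB limit"; anything a non-LBA BIOS must read at boot has to sit below it.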
I might be wrong, but your answer makes logically NO sense. I have been running VMware for many years, also in many ESX/GSX environments in large corporations with very large disks, and I can assure you that I have never ever received this error, even when I did not set up /boot on the first partition. SUSE and other distros don't do this by default anyhow, so you would expect this to be tested again and again by many people.

Logically your assertion does not make sense and I cannot accept it. You will see this bug come up again and again with VMware users who use the default partitioning and a larger disk file. If you don't fix it now, you will have to fix it at some point. With SLES this was NOT a problem.
Just double-checked with NLD, and it shows a related issue: grub fails to install, and when installed manually it also gives Error 18. But when I use lilo, all is fine, and regardless of where /boot is the system runs fine. I used the rescue disc to create a lilo.conf on the Error 18 failing system. Wouldn't that prove it is a grub issue and not a VMware BIOS issue?
Not quite. Lilo has a different philosophy, and gathers _no_ information at boot time, but relies on the settings determined when the lilo binary was run. Grub works with the BIOS to find a way to get the kernel into memory; if this cooperation is bad, grub will fail. On the matter of default partitioning I'm also amongst the ones saying that it's, well, sub-optimal, to put it mildly. I don't have vmware here (don't like it & don't want it), but if there's a test I can think of I'll ask you to run that. In the meantime, can you attach the lilo.conf you succeeded with?
Here is the lilo.conf:

# Modified by YaST2. Last modification on Thu Aug 18 17:06:35 2005
timeout = 80
prompt
default = Linux
boot = /dev/sda
image = /boot/vmlinuz
label = Linux
initrd = /boot/initrd
optional
root = /dev/sda2
vga = 0x332
append = "selinux=0 splash=silent resume=/dev/sda1 elevator=as showopts"
BTW ... why should lilo work and grub not? Seems like grub is buggy or is missing something that lilo has? Of course lilo.conf does not contain the sentence "Here the lilo.conf."; it actually reads:

timeout = 80
prompt
default = Linux
boot = /dev/sda
image = /boot/vmlinuz
label = Linux
initrd = /boot/initrd
optional
root = /dev/sda2
vga = 0x332
append = "selinux=0 splash=silent resume=/dev/sda1 elevator=as showopts"
Re-read the first paragraph of comment #9.
OK, of course ... It seems odd to me, however, that a modern BIOS would have such an issue. For example, on the Novell internal linux-desktop mailing list someone asked about the same error just today: on an IBM T41p, after installing a new kernel, he got that error. It seems to me that this "philosophy" rejects even new machines and new BIOSes. If that is the case, I would consider this "philosophy" buggy, and someone (you?) should write a patch to change its behaviour. I find it very disturbing that the default bootloader we are using behaves so flakily when default partitioning is selected (/boot not as the first partition). Either way this causes trouble in normal userland, and a lot of users are just thrown in at the deep end and give up on Linux, all because of the "philosophy" of the "GRand" Unified Bootloader. IMHO this is a bug, not a feature.
"modern bios" is an oxymoron. Additionally, you'd be surprised how buggy a BIOS can be. Which SLES are you talking about in comment #7 ? NLD is based on SLES9. Which bug was filed for the T41p?
Can you do me a big favor and tell me what /sys/firmware/edd/int13_dev80/sectors contains for the failing disk?
nld:~ # cat /sys/firmware/edd/int13_dev80/sectors 0x4ffd625
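For reference, that hex value can be decoded with shell arithmetic. This assumes the conventional 512-byte sector size (an assumption on my part; EDD reports the sector size separately):

```shell
# Decode the EDD-reported sector count (0x4ffd625 from the comment above),
# assuming 512-byte sectors.
sectors=$((0x4ffd625))
bytes=$((sectors * 512))
echo "sectors: $sectors"   # 83875365
echo "bytes:   $bytes"     # 42944186880, just under 40 GiB
```

By this reading the BIOS advertises roughly 40 GiB via INT 13h; I'm only showing the conversion here, not drawing conclusions about the 20 GB virtual disk.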
Why did I get this bug back? I'm familiar with neither grub nor vmware.
*** Bug 131587 has been marked as a duplicate of this bug. ***
*** Bug 130000 has been marked as a duplicate of this bug. ***
This issue was recently brought to VMware's attention. Can someone provide the following info:
- Version and build number of the VMware GSX Server where the problem was discovered?
- Host operating system type and version?
- For the SUSE LINUX 10 install, are you using the virtual SCSI or virtual IDE driver?
Thx!
3.1.0-9089
Host: SLES9-SP3
Virtual SCSI
Setting up a duplicate environment to reproduce the problem. I'll report any findings on this bug.
*** Bug 138425 has been marked as a duplicate of this bug. ***
*** Bug 159426 has been marked as a duplicate of this bug. ***
No reaction for more than 2 months, therefore closing as CANTFIX (aka WONTFIX). If you can provide the needed information, feel free to reopen the bug.