Bug 392198

Summary: 11.0 beta3 freeze 75% of the time at start of boot
Product: [openSUSE] openSUSE 11.0 Reporter: Lee Matheson <lee_matheson>
Component: Kernel Assignee: Greg Kroah-Hartman <gregkh>
Status: RESOLVED INVALID QA Contact: Jiri Srain <jsrain>
Severity: Blocker    
Priority: P5 - None CC: kdmoeller, lee_matheson, nikok79
Version: RC 1 Flags: coolo: SHIP_STOPPER-
Target Milestone: ---   
Hardware: i686   
OS: openSUSE 11.0   
Whiteboard:
Found By: --- Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---

Description Lee Matheson 2008-05-19 17:28:16 UTC
It appears that either bug 373012 is back, or a very similar one is randomly present, on 11.0 Beta3.  

Original report:
https://bugzilla.novell.com/show_bug.cgi?id=373012

Since installing 11.0 beta3 (on the same hardware as in the above bug report), about 75% of the re-boots fail with a freeze (similar to what I saw on the alpha releases), and only 1 out of 4 re-boots (or power-ons) succeeds on Beta3.  This is the same hardware that showed the 11.0 alpha problem.

The symptoms are 11.0 beta3 freezes immediately after:
Probing EDD (edd=off to disable).

I tried setting edd=off as a menu.lst boot parameter and it made no difference.  The PC still freezes on boot approximately 3 out of 4 times. (Note the randomness: sometimes it boots right away, other times it takes 7 or 8 tries to get a boot.)

I'll open a new bug report on this, and reference this older report (in case they are related).

Before this boot failure, and independent of it as near as I can determine, Grub also malfunctions about 75% to 80% of the time: the grub graphic menu fails, and the PC only boots to a text grub menu, with grub error 7 (and the subsequent message "graphics initialization failed").  I don't know if that is related to the boot failure above, but it does not seem to correlate with it (other than failing about as often).  After error 7, the PC will still go to the text boot menu in Grub.  [I can raise a separate bug report on Grub if deemed useful].

In the 25% or so of cases where the PC correctly boots linux (and X), the grub graphic interface sometimes works, but most of the time it does not.
Comment 1 Jiri Srain 2008-05-20 08:00:08 UTC
When exactly does the boot freeze? Does it load the bootloader? Could you tell us the last thing the boot process reports? (You can add splash=verbose to the kernel command line to disable the graphic splash screen.)
Comment 2 Lee Matheson 2008-05-20 20:28:37 UTC
Per my original post, it freezes IMMEDIATELY after: 
"[Linux-initrd @ 0x37b59000,0x496337 bytes]"
"Probing EDD (edd=off to disable)..."

When it freezes the entire screen goes black.  This "Probing EDD (edd=off to disable)" scrolls by so fast, I was only able to capture it by taking a video of the boot, and then playing it back.

When it freezes, there is no grub splash.  Just a black screen.

Is there a boot code I can apply (to have things stored in a log file)?  

Or after a failed boot, would it help if I then do a hardware reset, and re-boot to a live CD/DVD (such as knoppix) and then go into the hard drive and copy some openSUSE log file? (if so, which file?).
Comment 3 Lee Matheson 2008-05-22 09:33:31 UTC
Some more information on this bug ..

I am using 11.0 beta3 with kernel (from "uname -a"):
 Linux linux 2.6.25.3-2-pae #1 SMP 2008-05-10 07:46:36 +0200 i686 athlon i386 GNU/Linux

The selection of pae was done automatically by the openSUSE installer, which was surprising to me, as this old athlon-1100 PC has only 1 GByte of RAM.

I tried taking a look at the log files, when a boot failed, by booting to openSUSE-10.3 (which successfully dual boots 100% of the time on this PC), and then mounted the 11.0 beta3 / partition, and checked /var/log on the 11.0 beta3 partition.  Both "boot.msg" and "messages" had no entries for the failed boot, ... as their last entries were the previous shutdown.  I checked the time stamp of all log messages in /var/log and there were in fact no log messages with a time stamp that corresponded to the failed boot.
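
For anyone repeating this check, the timestamp comparison can be scripted. The following is a minimal sketch, assuming the 11.0 partition is mounted at the hypothetical path /mnt/beta3 and that the approximate time of the failed boot is known; both values are placeholders and not taken from this report.

```python
import os
from datetime import datetime


def files_modified_since(log_dir, cutoff):
    """Return (path, mtime) pairs for files under log_dir changed at or after cutoff."""
    hits = []
    for root, _dirs, names in os.walk(log_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                mtime = datetime.fromtimestamp(os.stat(path).st_mtime)
            except OSError:
                continue  # unreadable or vanished file; skip it
            if mtime >= cutoff:
                hits.append((path, mtime))
    # Oldest first, so the first entries are the ones closest to the failed boot.
    return sorted(hits, key=lambda item: item[1])


if __name__ == "__main__":
    # Hypothetical mount point and failed-boot time; adjust both.
    log_dir = "/mnt/beta3/var/log"
    cutoff = datetime(2008, 5, 22, 9, 0, 0)
    for path, mtime in files_modified_since(log_dir, cutoff):
        print(mtime.isoformat(sep=" "), path)
```

An empty result corresponds to what is described above: nothing under /var/log was written during the failed boot.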

I then tried rebooting back to 11.0 beta3, and using the boot code "acpi=off".  This made no difference, and the boot still failed.

After repeated attempts at a normal boot (without the "acpi=off" boot code), the boot finally did succeed.  

Again, openSUSE-10.3 reliably boots 100% of the time.  ... 

I'll try installing an updated 11.0 beta3 kernel (using the "rpm -ihv" command, and hand edit the /boot/grub/menu.lst file to provide a kernel boot selection), in order to see if an updated kernel makes any difference.  
Comment 4 Lee Matheson 2008-05-22 11:00:39 UTC
Ok, I went here: http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/suse/i586/
and downloaded kernel-default-2.6.25.4-2.i586.rpm and also downloaded kernel-pae-2.6.25.4-2.i586.rpm

I installed both rpms with "rpm -ivh" such that both kernels were available, in addition to my original kernel.  I checked /boot/grub/menu.lst to confirm it allowed me to boot to each of the available kernels.
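
The menu.lst check described above can also be done with a short script. This is only a sketch: it assumes the GRUB legacy layout of "title" lines followed by a "kernel" line, and the sample entries (device names, paths, titles) are made up for illustration rather than copied from this PC.

```python
def list_boot_entries(menu_text):
    """Return (title, kernel_line) pairs found in GRUB legacy menu.lst text."""
    entries = []
    title = None
    for raw in menu_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        keyword = parts[0].lower()
        rest = parts[1] if len(parts) > 1 else ""
        if keyword == "title":
            title = rest
        elif keyword == "kernel" and title is not None:
            entries.append((title, rest))
    return entries


# Hypothetical menu.lst excerpt; device names and paths are assumptions.
SAMPLE = """\
title openSUSE 11.0 - 2.6.25.4-2-pae
    root (hd0,1)
    kernel /boot/vmlinuz-2.6.25.4-2-pae root=/dev/sda2 splash=verbose
    initrd /boot/initrd-2.6.25.4-2-pae
title openSUSE 11.0 - 2.6.25.4-2-default
    root (hd0,1)
    kernel /boot/vmlinuz-2.6.25.4-2-default root=/dev/sda2 splash=verbose
    initrd /boot/initrd-2.6.25.4-2-default
"""

if __name__ == "__main__":
    for title, kernel in list_boot_entries(SAMPLE):
        # The first token of the kernel line is the kernel image path.
        print(title, "->", kernel.split()[0])
```

In real use, the text would be read from /boot/grub/menu.lst instead of the SAMPLE string.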

Some comments: my first attempt to boot to my baseline kernel-pae-2.6.25.3-2 took 14 tries before the kernel would boot. On 13 attempts it froze as described in a previous post, ... ie the problem is "highly" repeatable.  Once it succeeded in booting (on the 14th attempt), I installed the two kernels.  

I then rebooted to the "kernel-default-2.6.25.4-2" kernel.  It froze just like the "kernel-pae-2.6.25.3-2" kernel.  I then tried a reboot of the "kernel-default-2.6.25.4-2" kernel with "failsafe" settings.  It also froze just like the "kernel-pae-2.6.25.3-2" kernel.   I then tried a reboot to the kernel-pae-2.6.25.4-2 kernel. It booted successfully on the 1st attempt, but then froze on the 2nd through 5th attempts, and then booted successfully on the 6th attempt.

ie. the 2.6.25.4-2 kernel has similar buggy behaviour to the 2.6.25.3-2 kernel.
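
As a rough sanity check on these counts, the arithmetic can be sketched as follows. It assumes each boot attempt is independent and freezes with a fixed probability, which is a modelling assumption and not something established in this report; the 75% figure is the estimate from the description.

```python
def expected_attempts(p_fail):
    """Mean number of attempts until the first successful boot."""
    return 1.0 / (1.0 - p_fail)


def prob_first_n_fail(n, p_fail):
    """Probability that the first n attempts all freeze."""
    return p_fail ** n


def prob_first_success_on(attempt, p_fail):
    """Probability that the first successful boot is exactly on this attempt."""
    return prob_first_n_fail(attempt - 1, p_fail) * (1.0 - p_fail)


if __name__ == "__main__":
    p = 0.75  # freeze rate estimated in the description
    print("expected attempts per success:", expected_attempts(p))
    print("P(13 freezes in a row):", round(prob_first_n_fail(13, p), 4))
    print("P(first success on attempt 14):", round(prob_first_success_on(14, p), 4))
```

Under these assumptions a run of 13 freezes has a probability of roughly 2%, so the 14-attempt run hints that the real freeze rate may be somewhat higher than 75%, or that attempts are not fully independent.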

I do not know what I can do next.  openSUSE-10.3 boots with no problem on this PC. 

I suppose I could go into the PC BIOS and try to change settings such as Internal Cache (currently writeback), or System BIOS cacheable (currently Enabled), or C000-32k-shadow (currently cached) or APIC Function (currently enabled), etc .... but I would like some feedback and suggestions to help me focus on this.  And again, openSUSE-10.3 has no problems with these BIOS settings (neither did 9.3, 10.0, 10.1, nor 10.2).

This is time consuming and if this is a waste of effort, it would be useful if I could be advised of that.

Thank you.
Comment 5 Greg Kroah-Hartman 2008-05-22 16:56:20 UTC
Should be fixed in next release.
Comment 6 Lee Matheson 2008-05-23 18:41:53 UTC
Thank you, I am glad to read that a fix is planned.

I also tried the kernel-pae-2.6.25.4-8.1.i586.rpm from here:
http://download.opensuse.org/repositories/Kernel:/HEAD/openSUSE_Factory/
and I obtained the same anomalous behaviour, ... ie 4 failed boots (with an identical freeze early in boot process) followed by one successful boot on the 5th attempt.

I'm looking forward to the updated kernel with the fix.
Comment 7 Niko Koliqi 2008-05-24 17:03:37 UTC
kernel-default-2.6.25.4-9.1 did not work for me either (Vaio vgn-fs740); I hope the next one will be better.
Can somebody tell me where to find a kernel with sources? I'm on 2.6.25.3-2-default now and I need the kernel sources but don't know where to find them. Thanks 
Comment 8 Lee Matheson 2008-05-27 21:52:37 UTC
I tried the kernel-pae-2.6.25.4-11, and also kernel-default-2.6.25.4-11 from here:  http://download.opensuse.org/distribution/SL-OSS-factory/inst-source/suse/i586/
They both exhibited the same anomalous behaviour as above, booting successfully only once out of every approximately 5 attempts.  They froze during boot at the exact same place every time.

I also tried the kernel-pae-2.6.25.4-11 (with acpi=off), the kernel-default-2.6.25.4-11 (with safe settings), and the kernel-vanilla-2.6.25.4-11, and in all cases the same anomalous behaviour was observed, with a freeze in the exact same place.

I'm looking forward to RC1 later this week, which I hope has a kernel build that contains the fix.
Comment 9 Lars Marowsky-Bree 2008-05-27 22:12:20 UTC
Can you please try the kernel from ftp://ftp.suse.com/pub/projects/kotd/HEAD/ instead?
Comment 10 Lee Matheson 2008-05-28 18:42:47 UTC
Ok, after reading Comment#9 (above) I went to this URL:

ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/  and downloaded and installed with "rpm -ivh --oldpackage <kernel-as-below.rpm>":
kernel-default-2.6.25.4-HEAD_20080526132305.i586.rpm     27-May-2008 08:39  21.7M 
kernel-pae-2.6.25.4-HEAD_20080526132305.i586.rpm         26-May-2008 16:15  21.8M
kernel-vanilla-2.6.25.4-HEAD_20080526132305.i586.rpm     26-May-2008 16:16  21.7M 
.... installing one at a time, and then checking menu.lst to ensure it was updated to allow a boot.

I obtained the same freeze symptoms as before, where I tried booting:
1st: 1 x kernel-default-2.6.25.4-HEAD_20080526132305.i586.rpm
2nd: 1 x kernel-default-2.6.25.4-HEAD_20080526132305.i586.rpm [fail safe settings]
3rd: 1 x kernel-pae-2.6.25.4-HEAD_20080526132305.i586.rpm
4th: 1 x kernel-vanilla-2.6.25.4-HEAD_20080526132305.i586.rpm
6th-9th:  4 x kernel-default-2.6.25.4-HEAD_20080526132305.i586.rpm  [ie 4 more attempts of this build]
10th-11th: 2 x kernel-pae-2.6.25.4-HEAD_20080526132305.i586.rpm [ie 2 more attempts of this build]
12th:  1 x kernel-pae-2.6.25.4-HEAD_20080526132305.i586.rpm ... and on this 12th attempt the PC did not freeze in the same repeatable place, but booted successfully (finally).  But that's rather unsatisfactory, having to boot 12 times for 1 success.

This suggests to me the "HEAD_20080526132305.i586" version of the kernel has the same problem.  

I hope I grabbed the correct kernel to test from ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/.  There were a lot of kernels on that site.
Comment 11 Lee Matheson 2008-05-29 19:52:57 UTC
I noted there was the newer -2.6.25.4-HEAD_20080528142504 build of kernel-default, kernel-pae, and kernel-vanilla.  So I installed those with "rpm -ivh --oldpackage <kernel-as-below>":
kernel-default-2.6.25.4-HEAD_20080528142504.i586.rpm  28-May-2008 18:10  21.7M
kernel-pae-2.6.25.4-HEAD_20080528142504.i586.rpm    28-May-2008 17:26  21.8M 
kernel-vanilla-2.6.25.4-HEAD_20080528142504.i586.rpm  28-May-2008 18:11  21.7M
...
all of those kernels froze 16 out of 18 attempts (total) during boot at exactly the same place as all the previous attempts in this thread.  Specifically attempted was:
1st: kernel-default - froze
2nd: kernel-failsafe - booted
3rd-4th:  kernel-failsafe - froze both occasions
5th:  kernel-vanilla - froze
6th:  kernel-pae - froze
7th-15th: kernel-default - froze all 9 attempts
16th-17th:  kernel-pae - froze both attempts
18th:  kernel-pae - booted.
Stopped the test as behaviour is the same as previous anomalous behaviour.
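
The "16 out of 18" figure can be cross-checked against the attempt list above with a small tally. This is just a bookkeeping sketch; the tuples are transcribed from that list and the short kernel labels are shorthand for the menu entries.

```python
from collections import Counter

# (first attempt, last attempt, kernel entry, outcome), transcribed from the list above
ATTEMPTS = [
    (1, 1, "default", "froze"),
    (2, 2, "failsafe", "booted"),
    (3, 4, "failsafe", "froze"),
    (5, 5, "vanilla", "froze"),
    (6, 6, "pae", "froze"),
    (7, 15, "default", "froze"),
    (16, 17, "pae", "froze"),
    (18, 18, "pae", "booted"),
]


def tally(attempts):
    """Count boot attempts per (kernel entry, outcome) pair."""
    counts = Counter()
    for first, last, kernel, outcome in attempts:
        counts[(kernel, outcome)] += last - first + 1
    return counts


if __name__ == "__main__":
    counts = tally(ATTEMPTS)
    total = sum(counts.values())
    froze = sum(n for (_kernel, outcome), n in counts.items() if outcome == "froze")
    print("froze %d of %d attempts (%.0f%%)" % (froze, total, 100.0 * froze / total))
    for (kernel, outcome), n in sorted(counts.items()):
        print("  %-8s %-6s %d" % (kernel, outcome, n))
```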

This was the last attempt I will make on 11.0 beta3.  I am now downloading 11.0 RC1 via BitTorrent, and I will install that and give it a try.  I hope it will work, although I confess my confidence that this is fixed (as previously reported) is low.
Comment 12 Lee Matheson 2008-05-30 16:19:44 UTC
I just finished installing 11.0 RC1. 

The anomalous behaviour is still present, and it made the installation unpleasant.  During the installation, after the software install, and as part of the 1st boot before the "configuration" step (where X is configured, etc.), the PC froze immediately upon reboot, with freeze characteristics IDENTICAL to the above.  

I had to hit the hardware reset button 6 times, before I obtained a successful reboot.  Normally, I would NEVER do that, but my experience with this BUG to date suggested to me that if I kept hitting reset the PC may eventually not freeze upon boot, but may finish the boot, and eventually it did.

I have selected to "reopen" this Bug, and place it as a blocker, and marked it as being found in RC1.  IMHO this BUG would have blocked most users.   

Note, 10.3 does not have this problem booting from this PC (neither did 9.x, 10.0, 10.1, nor 10.2)

Is there anything else I can do to help?  

There is a lot of information about my PC in the report I raised when I first saw this bug during the alpha releases:
https://bugzilla.novell.com/show_bug.cgi?id=373012
Comment 13 Greg Kroah-Hartman 2008-05-30 16:31:41 UTC
This is very odd.

Have you run memtest to verify that there are no problems with your hardware?

How about trying to boot with "nohz=off"?
Comment 14 Lee Matheson 2008-05-30 20:00:31 UTC
I tried booting 11.0 RC1 with "nohz=off".  Same freeze.

I'm currently running a hardware memory test.  Thus far no errors after 32 minutes.  

Earlier this PM, after reading comment#13, I opened the case, checked all connectors, cards, ... etc ... everything is properly seated.  There are no visible problems.  

As a test I rebooted to openSUSE-10.3 from the grub text menu.  It booted successfully 6 out of 6 tries (100% of the time).  openSUSE-11.0 alpha/beta/RC1, on the other hand, freezes 4 out of every 5 boot tries.  Sometimes worse.

Possibly ( ? ) related, is a problem with Grub, in that most of the time (90% approx) it boots with 
(1) a text display indicating an error, then 
http://picpaste.com/1grub-initial-view.jpg
http://picpaste.com/pics/1grub-initial-view.1212176826.jpg

(2) an error message "graphics initialization failed", followed by 
http://picpaste.com/2grub-graphic-initialization-failed.jpg
http://picpaste.com/pics/2grub-graphic-initialization-failed.1212176919.jpg

(3) the typical grub text menu.  
Could this be a symptom of a graphic card failure, that is in turn affecting 11.0, but does not affect 10.3 ?
http://picpaste.com/3grub-text-boot-menu.jpg
http://picpaste.com/pics/3grub-text-boot-menu.1212176968.jpg

The other 10% grub boots with a proper graphic menu selection.

As opposed to vga=normal or vga=031a, is it possible trying different VGA codes in the grub menu would make a difference to both the grub boot, and the kernel freeze/boot?
Comment 15 Stephan Kulow 2008-06-01 21:23:29 UTC
*** Bug 396200 has been marked as a duplicate of this bug. ***
Comment 16 Lee Matheson 2008-06-02 06:05:47 UTC
I do not believe it is clear that https://bugzilla.novell.com/show_bug.cgi?id=396200 is a duplicate bug.  

While both PCs (in Bug 396200 and 392198) have nvidia cards, I note Bug 396200 was able to boot with the vanilla kernel.  That has never been the case with bug 392198, as I have tried a number of different vanilla kernels, all of which exhibited the same identical failure on 11.0 beta3.  I have not tried a vanilla kernel with RC1, but given my (lack of) success with the previous ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/i386/  kernels, I have no reason to believe the RC1 vanilla kernel will make any difference.

If someone points me to a URL/directory where I can find a vanilla kernel for 11.0 RC1, I am willing to try that (or any other video).

Also, if there are any other tests/configurations that are recommended for me to try, please advise, and I will attempt them.  Thank you.
Comment 17 Greg Kroah-Hartman 2008-06-10 17:33:55 UTC
you can install the vanilla kernel with:
  zypper install kernel-vanilla

Try running that kernel and letting us know what happens.
Comment 18 Lee Matheson 2008-06-12 20:49:38 UTC
I installed the kernel-vanilla, and it also still freezes.  However I did not install it via "zypper install kernel-vanilla".

Instead, between Comment #16 and now (prior to my reading Comment#17) I had updated 11.0 RC1 with:
  zypper refresh
  zypper dist-upgrade
That installed kernel-pae-2.6.25.4-10
... 
I obtained a freeze 4 out of 5 times with kernel-pae-2.6.25.4-10.
I also installed kernel-default-2.6.25.4-10 with "rpm -ivh <package>" and I still obtained a freeze 4 out of 5 times.

I then noticed Comment#17, and installed kernel-vanilla-2.6.25.4-10 also with rpm -ivh <package>.  It also has the same freeze behavior (although after a few freezes on boot, I gave up trying to boot with that vanilla kernel).

As an aside (likely unrelated), after the "zypper dist-upgrade", Grub (grub-0.97-126) now starts with a graphical boot menu every time (as opposed to before, when I would obtain a "graphic initialization error" and grub would go to the grub text menu 9 out of 10 times).  However the grub menu freezes (ie keyboard freezes) 50% of the time upon a mouse press. .... I am also using a KVM switch ... I may take that out (and connect direct to keyboard/video/mouse) to see if that makes any difference to the Grub boot. 

After selection by grub (successfully 50% of the attempts), I can still boot successfully 100% of the time to openSUSE-10.3.  So that still makes me think the problem is 11.0 related. 
Comment 19 Lee Matheson 2008-06-13 20:31:25 UTC
The sentence in post#18: "However the grub menu freezes(ie keyboard freezes) 50% of the time upon a mouse press."

should read:

"However the grub menu freezes (ie keyboard freezes) 50% of the time upon a key press."
Comment 20 Lee Matheson 2008-06-15 09:38:44 UTC
I removed the KVM switch, and connected Keyboard, Monitor, and Mouse direct to the PC.  I then rebooted dozens of times, trying to boot the kernel-pae-2.6.25.4-10 kernel.  

The behaviour was identical (with the freeze in the exact same place with the same characteristics), ie 50% of the time grub will freeze, and when grub does not freeze, the kernel boot will freeze 80% of the time.  That suggests to me that the KVM switch is not a factor in this problem.

One note, ... I do not have a PS2 mouse, but rather I have been using a Logitech USB mouse through the KVM (and for the direct tests without the KVM).  I used a USB-to-PS2 adapter to connect the mouse to the KVM.  For the direct "Keyboard, Monitor, and Mouse" to PC tests, I tried testing with the USB mouse connected direct to a USB port, and also tested with the USB mouse connected direct to a PS2 mouse port (via the USB-to-PS2 adapter).  It made no difference.  The behaviour was identical (with the freeze in the exact same place with the same characteristics) in all cases.

When the 11.0 grub menu lets me (and does not freeze), openSUSE-10.3 still boots every time, with no problem. This is in contrast to 11.0: when the grub menu lets me (and does not freeze), the 11.0 kernel will freeze 80% of the time.  I may replace the 11.0 installed grub menu with a 10.3 installed grub menu, to see if I can take "grub" out of the picture.

Other than that, I am pretty much out of ideas for testing.
Comment 21 Lee Matheson 2008-06-17 05:14:00 UTC
I've been wondering if this bug report could be related to a hard drive problem (and hence not an openSUSE-11.0 fault), with a problem with the hard drive's MBR? I plan to test that possibility, by either this weekend or next, where I plan to change out the hard drive on this PC with a spare drive I have, and install 11.0 on the spare drive.  Then compare the boot behaviour. [... or does someone know an easier way to check this ... ]?
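
One lighter-weight check than swapping drives might be to read the MBR sector repeatedly and compare the results. The sketch below is only that, a sketch: it assumes the disk shows up as a readable device node (the name passed on the command line is up to the user, and reading it normally needs root), it checks only the standard 0x55 0xAA boot signature at offset 510, and repeated reads may be served from the operating system's cache rather than the platter. A clean result therefore does not prove the sector is healthy.

```python
import hashlib
import sys

SECTOR_SIZE = 512  # classic MBR sector size


def read_mbr(device_path):
    """Read the first sector of a device or disk image file."""
    with open(device_path, "rb") as dev:
        data = dev.read(SECTOR_SIZE)
    if len(data) != SECTOR_SIZE:
        raise ValueError("short read: got %d bytes" % len(data))
    return data


def has_boot_signature(mbr):
    """True if the sector ends with the 0x55 0xAA boot signature."""
    return mbr[510:512] == b"\x55\xaa"


def check_mbr_stability(device_path, reads=20):
    """Read sector 0 several times; return the set of distinct SHA-256 digests."""
    digests = set()
    for _ in range(reads):
        digests.add(hashlib.sha256(read_mbr(device_path)).hexdigest())
    return digests


if __name__ == "__main__" and len(sys.argv) == 2:
    device = sys.argv[1]  # device node or image file chosen by the user
    print("boot signature present:", has_boot_signature(read_mbr(device)))
    seen = check_mbr_stability(device)
    print("distinct contents over repeated reads:", len(seen))
```

More than one distinct digest, or a read error, would point at the drive; a single digest only means this particular read path returned consistent data.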
Comment 22 Greg Kroah-Hartman 2008-06-18 20:40:05 UTC
As this is a problem of locking up _before_ the kernel runs, I really don't expect it to be a kernel issue itself, but rather a hardware issue.

We are going to need some kind of kernel log message before I can have any chance of fixing it in the kernel.
Comment 23 Lee Matheson 2008-06-19 20:01:42 UTC
As the initiator, I changed this bug report to "INVALID". And my apologies for this bug report.  This is a PC hardware problem.

My suspicions in Comment #21 proved correct: when I changed out the 80GByte hard drive for a 40GByte hard drive, and installed openSUSE-11.0 GM on the 40 GByte drive, the problem disappeared.  I performed 10 reboots, with no problem in any of the 10 reboots. Grub came up and performed correctly, and the kernel booted properly.

My view now is the MBR in the previous 80GByte hard drive had an "intermittent" fault that occurred 90% of the time.  My guess is the reason openSUSE-10.3 booted ok, is because its code was located in a "healthier" location of the hard drive.  Perhaps the only puzzle is why the hardware health checks of the hard drive did not pick up anything.  

Thank you to the openSUSE developers, for implementing a faster installation and also retaining the faster reboots on openSUSE-11.0, else this test would have been unbearable.

If nothing else, I have learned now how to characterize a hard drive with a failing MBR.

But I think I wasted far too much of everyone's time, and my sincere apologies for this.

This BUG report is withdrawn by initiator (or called invalid).