Bug 114804

Summary: HS40 resets on initial welcome and/or startup screen
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Murlin Wenzel <mwenzel>
Component: Installation
Assignee: Steffen Winterfeldt <snwint>
Status: RESOLVED FIXED
QA Contact: Klaus Kämpf <kkaempf>
Severity: Major
Priority: P5 - None
CC: kstansel, patrick.donckers, Richard.Beal, trenn
Version: Beta 4
Target Milestone: ---
Hardware: i686
OS: All
Whiteboard:
Found By: Third Party Developer/Partner
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: test iso, test iso 2, test iso 3, test iso 4, test iso 5

Description Murlin Wenzel 2005-09-01 17:39:26 UTC
When booting an HS40 blade, the system will reset as soon as the welcome splash
screen starts appearing, or it will get as far as the initial boot selection
screen (not sure what your term for this screen is) and then reset.  After the
reset, the BIOS error logs report IERR CPU errors and CPUs being disabled.
Comment 1 Lukas Ocilka 2005-09-02 06:48:15 UTC
Might be a problem in linuxrc, or the kernel...
Comment 2 Steffen Winterfeldt 2005-09-02 09:20:06 UTC
You're talking about the boot loader? 
 
Maybe related to bug 81046. 
 
I'd say that is a hardware/BIOS problem. What are IERR errors and what do they
indicate?
Comment 3 Steffen Winterfeldt 2005-09-21 09:31:30 UTC
*** Bug 118024 has been marked as a duplicate of this bug. ***
Comment 4 Murlin Wenzel 2005-09-21 16:12:33 UTC
This should be the open bug.  My bugzilla account died a horrible death.
Comment 5 Steffen Winterfeldt 2005-09-22 08:46:08 UTC
Well, then see comment 2. 
Comment 6 Murlin Wenzel 2005-09-22 14:22:10 UTC
This does indeed sound very similar to #81046, but with entirely different
hardware.  Is there a way to do the initial boot in non-gui mode, or would I
need a custom cd?  I'd like to try and narrow this down.
Comment 7 Steffen Winterfeldt 2005-09-22 14:50:11 UTC
Hold down SHIFT when isolinux starts. But then the problem will go away. 
 
What I suspect _could_ cause this is a nonblocked interrupt (like NMI)
because the graphics code runs without a properly set up IDT for some time.
 
Hence my question what these IERRs are.
Comment 8 Murlin Wenzel 2005-09-26 20:22:47 UTC
I did verify that I can hold down SHIFT and get the install started. Of course,
everything crashes as expected on a reboot.  Has the startup code changed that
much since SLES9?  That works fine on this same hardware.

I did find out that the IERRs are either CPU faults (not likely) or PCI device
timeouts/errors (most likely).  Keep in mind that these blades are completely USB
based (floppy, CD-ROM, keyboard, mouse).  Until an OS stack is loaded and
running, the system is running legacy emulation, which generally requires SMI
handlers or INT handlers.
Comment 9 Steffen Winterfeldt 2005-09-27 09:00:31 UTC
Yes, the code has changed a lot. 
Comment 10 Steffen Winterfeldt 2005-09-27 14:57:07 UTC
What cpu is running there? Could you attach /proc/cpuinfo?  
Comment 11 Steffen Winterfeldt 2005-09-28 10:27:59 UTC
Created attachment 51038 [details]
test iso

Please run the attached test iso on that machine. It (hopefully) will print a
colored stripe on the screen when it hangs.

What color is it?
Comment 12 Murlin Wenzel 2005-09-30 00:01:49 UTC
Sorry I couldn't get to this sooner.  After a couple of tries, here is what I got.

1.  Welcome screen appeared with all languages... Second screen appeared (should
be the install menu) with just the SUSE logo in the upper right corner.  System
did a hard reset with IERR.

2.  Welcome screen started to appear (background and 2 languages). System did a
hard reset with IERR.

I never did see any stripe.  I can tell you the system is running 4 2.8 GHz Xeon
MP processors.  If you still need the cpuinfo, I'll have to install something
else to get the files.
Comment 13 Steffen Winterfeldt 2005-09-30 09:14:10 UTC
That would indicate that there are no unexpected interrupts or faults 
occurring. 
 
Is there any way to get the instruction pointer where the IERR happens? 
Comment 14 Steffen Winterfeldt 2005-09-30 09:21:38 UTC
Created attachment 51228 [details]
test iso 2

In any case, here's another try: This ISO will show a debug window in the upper
left corner.

You can single step by pressing Enter or Space
(or about any other key except Esc (which would end single step mode)).

What's the number after 'ip' when it crashes?
Comment 15 Murlin Wenzel 2005-10-03 17:45:34 UTC
I managed to get at least one crash with the single step code.

ip 47d:  4e.7

The second time I hit a reset, the number right after ip changed (couldn't get
it fast enough) but the 2nd number was still 4e.7.  I hope this means something
to you.

Could we get any more useful info if I installed the box in text mode (shift key)
and somehow set up the bootloader to run in text mode?

I'll try the single step again and see what I can get.
Comment 16 Steffen Winterfeldt 2005-10-04 09:01:10 UTC
Created attachment 51371 [details]
test iso 3

Please try this one. Any better or does it at least crash at a different place?
Comment 17 Murlin Wenzel 2005-10-04 17:13:00 UTC
I was holding down the spacebar to single step. This one appeared to get
further.  The latest IP when it reset was 1bd5: 4e.7.  It appeared to be starting
to build the menu on the second screen.

This has got to be some type of ugly SMI/protected mode timing issue.  I just
rebooted several times and hit <ESC> to get out of single step mode.  2 times
the startup succeeded and I was able to interact with the boot menu/options.  Of
course I couldn't really do anything else.  I was getting excited until I
rebooted 2 more times and the system just reset both times.
Comment 18 Steffen Winterfeldt 2005-10-05 12:20:30 UTC
Created attachment 51483 [details]
test iso 4

The welcome texts were displayed in random order. I replaced the PRNG with a
dummy in this iso. Does it still crash at different places?
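
Replacing the PRNG with a dummy makes the drawing order reproducible, so a crash at varying places can no longer be put down to the texts being drawn in a different order each time. A minimal sketch of that idea, written in Python purely for illustration (the boot graphics code itself is not shown in this report; `welcome_order`, `dummy_rand`, and the sample texts are hypothetical names and data):

```python
import random

def welcome_order(texts, rand):
    """Return texts in an order chosen by rand(n), which yields an index in [0, n)."""
    remaining = list(texts)
    ordered = []
    while remaining:
        # Pick the next text to draw using the supplied generator.
        ordered.append(remaining.pop(rand(len(remaining))))
    return ordered

def dummy_rand(n):
    """Stand-in for the PRNG: always selects the first remaining item."""
    return 0

# Sample data only, not taken from the test ISO.
texts = ["Welcome", "Willkommen", "Bienvenue"]

print(welcome_order(texts, random.randrange))  # order may differ between runs
print(welcome_order(texts, dummy_rand))        # same order on every run
```

With the dummy in place, two runs that reset at different points have executed the same drawing sequence, which points at timing or hardware rather than at the code path.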
Comment 19 Steffen Winterfeldt 2005-10-05 12:34:38 UTC
After reading some processor docs I no longer think an SMI could cause 
this. I'm trapping all interrupts now and you would see the mentioned 
colored stripe whenever an unexpected int or trap happens. 
 
There is a tiny window of a few instructions where an NMI could cause 
a cpu reset. I'm going to address this but it needs some work. 
 
My favorite theory at the moment is that some BIOS functions destroy 
registers they should not. I started to save regs around some of them 
in test iso 3 (and hoped it would help a bit more). 
Comment 20 Steffen Winterfeldt 2005-10-10 09:41:07 UTC
Murlin, in view of bug 81046 comment 20, are we talking about an ATI 
graphics card here? 
Comment 21 Murlin Wenzel 2005-10-10 14:37:49 UTC
My BIOS setup reports this as an ATI embedded RADEON 7000.  I haven't seen an
IBM server in ages that didn't have some type of embedded adapter.

I know I can use the 'shift' key during initial boot to start the install in text
mode.  Is there something similar, or a way to configure grub to boot in text
mode?  I could at least get the latest code installed that way and look for
other problems.
Comment 22 Steffen Winterfeldt 2005-10-10 14:59:26 UTC
Sure, just remove the gfxmenu line in /boot/grub/menu.lst.
 
According to the mentioned report it should be avoidable by doing 
dword aligned accesses. I'll see what I can do. 
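
Removing the gfxmenu line can be done in any text editor. For anyone who wants to script it, here is a minimal sketch in Python; the function name `strip_gfxmenu` and the sample file content are assumptions for illustration, and real menu.lst files differ per installation:

```python
def strip_gfxmenu(menu_text):
    """Return menu.lst content with every gfxmenu directive line removed."""
    kept = []
    for line in menu_text.splitlines(keepends=True):
        words = line.split()
        if words and words[0] == "gfxmenu":
            continue  # drop the graphical menu directive
        kept.append(line)
    return "".join(kept)

# Illustrative content only; the device and path are assumptions.
sample = (
    "default 0\n"
    "timeout 8\n"
    "gfxmenu (hd0,1)/boot/message\n"
    "title Linux\n"
)

print(strip_gfxmenu(sample), end="")
```

To apply this to a real system, read /boot/grub/menu.lst, keep a backup copy, and write the filtered text back. The sketch deliberately works on a string so that it never touches the boot configuration by itself.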
Comment 23 Murlin Wenzel 2005-10-10 20:44:11 UTC
Using some of the various tricks you showed me, I was able to get the initial
install to complete.  I had to edit menu.lst before the reboot or I just kept
hitting the same GUI welcome screen restart.  After I got the bootloader working
I ended up with the same problem described in bugzilla #114800.  I guess it
shouldn't really have surprised me that the same problem showed up on this
hardware since it has the same CSB6 IDE chipset.
Comment 24 Steffen Winterfeldt 2005-10-11 14:45:20 UTC
Created attachment 53644 [details]
test iso 5

Please test this iso. It uses only dword aligned video memory reads. Hopefully
this helps.
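
For readers unfamiliar with the term, "dword aligned reads" means every access to video memory fetches a full 32-bit value from an address that is a multiple of 4, and narrower values are extracted afterwards. The actual change in test iso 5 is not shown in this report; the following is only a model of the address arithmetic in Python, with a bytearray standing in for the frame buffer and hypothetical function names:

```python
import struct

def read_byte_aligned(vmem, addr):
    """Read one byte at addr using only a 4-byte-aligned 32-bit read."""
    base = addr & ~3  # round down to the enclosing dword boundary
    # Little-endian 32-bit read, matching x86 byte order.
    (dword,) = struct.unpack_from("<I", vmem, base)
    shift = (addr - base) * 8  # position of the wanted byte inside the dword
    return (dword >> shift) & 0xFF

def read_word_aligned(vmem, addr):
    """Read a 16-bit little-endian value at addr from aligned dword reads only."""
    low = read_byte_aligned(vmem, addr)
    high = read_byte_aligned(vmem, addr + 1)
    return low | (high << 8)

# Stand-in for mapped video memory; its length is a multiple of 4.
vmem = bytearray(range(16))

print(hex(read_byte_aligned(vmem, 5)))  # byte inside the second dword
print(hex(read_word_aligned(vmem, 3)))  # 16-bit value straddling two dwords
```

A value that straddles a dword boundary costs two aligned reads here. Real code would normally cache the last dword fetched, which may be part of why the aligned version can also turn out faster.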
Comment 25 Murlin Wenzel 2005-10-11 15:19:12 UTC
IT HELPS.  So far 10 reboots, 10 successful boots to the menu screen.  Now I just
have to get past bugzilla #114800.
Comment 26 Steffen Winterfeldt 2005-10-11 15:24:22 UTC
Great! And it even speeds things up considerably. 
Comment 27 Murlin Wenzel 2005-10-11 15:32:37 UTC
How hard would it be to come up with a SL10 cd1 with these updates?
Comment 28 Steffen Winterfeldt 2005-10-11 15:48:15 UTC
10.0 should be no problem (comment 24 is 10.1). I'll put something together 
tomorrow. 
Comment 29 Steffen Winterfeldt 2005-10-12 13:25:01 UTC
Here is an updated 10.0 boot & rescue ISO: 
 
ftp://ftp.suse.com/pub/people/snwint/10.0/SUSE-10.0-boot.iso 
Comment 30 Steffen Winterfeldt 2005-10-12 13:29:05 UTC
*** Bug 116934 has been marked as a duplicate of this bug. ***
Comment 31 Steffen Winterfeldt 2005-11-15 14:12:39 UTC
Read more about this in bug 129724.
Comment 32 Steffen Winterfeldt 2005-11-15 14:13:57 UTC
*** Bug 133280 has been marked as a duplicate of this bug. ***
Comment 33 Steffen Winterfeldt 2006-03-15 11:07:28 UTC
*** Bug 158061 has been marked as a duplicate of this bug. ***