Bug 115891

Summary: Won't boot after first reboot on initial install - blank screen - kernel panic
Product: [openSUSE] SUSE LINUX 10.0 Reporter: craig gardner <cgardner>
Component: Kernel Assignee: Andreas Kleen <ak>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Critical    
Priority: P5 - None CC: acpi
Version: Beta 4   
Target Milestone: ---   
Hardware: x86-64   
OS: All   
Whiteboard:
Found By: Development Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: screen log

Description craig gardner 2005-09-08 15:16:24 UTC
Beta 4 installs from DVD just fine, but after the packages are installed and the
system reboots for the first time, selecting "Linux" from grub makes the system
hang, showing a blank screen.

This is an HP DL385, dual-core 1.8 GHz Opteron, with two processors.  SCSI disks
using hardware RAID.

Grub comes up fine, and shows the three options: Linux, Linux (Failsafe), and
Floppy.  (There's even the nice background showing the Provo campus.)

Selecting "Linux" results in the blank screen, no keyboard interrupt, and the
hung system.  Since I have no keyboard control at this point, I can't even try
to switch to any other console to see kernel/console messages.

Selecting "Linux (Failsafe)" seems to work, though.  So perhaps this is an ACPI
problem.
Comment 1 Hubert Mantel 2005-09-08 15:18:57 UTC
The only difference is: during installation a UP kernel is used, but the
installed system tries to boot an SMP kernel.
Comment 2 craig gardner 2005-09-08 16:15:12 UTC
*** Bug 115885 has been marked as a duplicate of this bug. ***
Comment 3 craig gardner 2005-09-08 16:27:11 UTC
I've looked at several other similar bug reports, and have tried a variety of 
things to work past this problem.   
 
For example, I first tried setting the following as grub boot options: 
 
early_printk=vga vga=1 
 
But that doesn't work, because it hangs before anything can get logged, or at 
least before I can switch to the log console. 
 
So I tried these grub boot options: 
 
pci=noacpi noapic acpi_irq_balance 
 
These also didn't change anything. 
 
So I looked at the options that are set by "Linux (failsafe)", and wanted to 
find the combination of options that would make it work.  I guessed that it 
has something to do with power management, so I tried: 
 
apm=off acpi=off 
 
That worked!  But I wanted to get the smallest number of options to make it 
easier to debug.  So I tried: 
 
apm=off 
 
That didn't work.  The server still hung.  So I tried: 
 
acpi=off 
 
And that worked!  The acpi switch being off is the lone switch that makes the 
difference. 
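
The narrowing-down above can be sketched as a small greedy minimization. This is only an illustration: `minimize_options` and `boots` are hypothetical names, and `boots` stands in for a manual reboot test. It is hard-coded here to mirror the outcome reported in this comment (the system came up only with acpi=off).

```python
def minimize_options(options, boots):
    """Greedily drop every option that is not needed for a successful boot."""
    needed = list(options)
    for opt in list(options):
        trial = [o for o in needed if o != opt]
        if boots(trial):
            # Still boots without this option, so it is not required.
            needed = trial
    return needed


def boots(opts):
    # Assumption: stand-in for rebooting by hand with these options.
    # Mirrors the result reported above: only acpi=off matters.
    return "acpi=off" in opts


print(minimize_options(["apm=off", "acpi=off"], boots))
```

Each call to `boots` corresponds to one reboot, so the number of reboots grows linearly with the number of candidate options.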
 
I'm attaching /var/log/messages and dmesg. 
Comment 6 Thomas Renninger 2005-09-08 17:03:41 UTC
Ohoh, maybe this is one of the ACPI modules now loaded through initrd.
I suspect the processor module...

I didn't manage to write the code to disable them via boot param.
Increasing severity; maybe there is still time for a fast hack (just declaring
three global variables for __setup and bailing out of the init functions of the
fan/thermal/processor modules ...).

Could you please try:
boot into working system.
delete thermal,fan,processor modules from:
INITRD_MODULES="sata_promise sata_via via82cxxx processor thermal fan reiserfs ..."
in /etc/sysconfig/kernel
then invoke mkinitrd and try whether you can boot.

early_printk=vga vga=1
-> you should see something; if not, delete the other vga=XXX and splash=
options in /boot/grub/menu.lst and the output should appear.
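
The INITRD_MODULES edit requested above could be scripted roughly as follows. This is a sketch, not a tested tool: `drop_initrd_modules` is a hypothetical helper, and the sample line is a shortened copy of the one quoted in this comment (the elided tail is left out). mkinitrd still has to be run afterwards, as described.

```python
import re


def drop_initrd_modules(text, unwanted):
    """Return text with the unwanted names removed from INITRD_MODULES="..."."""
    def rewrite(match):
        kept = [m for m in match.group(2).split() if m not in unwanted]
        return match.group(1) + " ".join(kept) + match.group(3)

    # Only the INITRD_MODULES assignment is rewritten; other lines pass through.
    return re.sub(r'^(INITRD_MODULES=")([^"]*)(")', rewrite, text, flags=re.M)


# Shortened sample of the line in /etc/sysconfig/kernel.
sample = 'INITRD_MODULES="sata_promise sata_via via82cxxx processor thermal fan reiserfs"\n'
print(drop_initrd_modules(sample, {"processor", "thermal", "fan"}), end="")
```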
Comment 7 craig gardner 2005-09-08 17:57:06 UTC
Removed thermal, fan and processor from INITRD_MODULES.  Then ran mkinitrd. 
 
Rebooted. 
 
No improvement.  Still doesn't work. 
Comment 8 craig gardner 2005-09-08 18:16:51 UTC
I got the early_printk to work, thanks to your help, by removing vga=XXX and 
splash= from menu.lst.  Here's the abbreviated output: 
 
ACPI: Looking for DSDT in initrd... not found! 
 not found! 
Using local APIC timer interrupts. 
Detected 12.528 MHz APIC timer. 
cpu_up: attempt to bring up CPU 1 failed 
Unable to handle kernel paging request at 0000006f812c5160 RIP: 
<ffffffff8016f4cb>{free_block+123} 
PGD 0 
Oops: 0000 [1] SMP 
 
All the register and callback data goes here.... I can include it if you 
want/need. 
 
<0> Kernel panic - not syncing: Attempted to kill init! 
 
 
Comment 9 Andreas Kleen 2005-09-09 07:16:07 UTC
Please include the full log.

That's easiest if you use earlyprintk=serial,ttyS0,baud and a null modem
cable to another machine. 
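
A rough sketch of how the kernel arguments might be adjusted for serial capture. Everything concrete here is an assumption: the helper name `serial_debug_cmdline`, the baud rate 115200, and the sample arguments are illustrative only. Dropping vga= and splash= follows the earlier advice given for the VGA case; whether that is required for serial output is not stated in this report.

```python
def serial_debug_cmdline(cmdline, baud=115200):
    """Drop vga=/splash= options and append an earlyprintk serial option."""
    kept = [
        tok for tok in cmdline.split()
        if not tok.startswith(("vga=", "splash=", "earlyprintk="))
    ]
    # Assumption: first serial port, baud rate chosen by the caller.
    kept.append(f"earlyprintk=serial,ttyS0,{baud}")
    return " ".join(kept)


# Hypothetical kernel arguments; the real ones are in /boot/grub/menu.lst.
print(serial_debug_cmdline("root=/dev/sda2 vga=0x317 splash=silent"))
```

The receiving machine needs its terminal program set to the same baud rate.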
Comment 10 craig gardner 2005-09-09 14:16:14 UTC
I was hoping you wouldn't ask me to do that.  ;-) 
 
But now that I've found a null modem cable, I've got the output: 
 
time.c: Using 3.579545 MHz PM timer. 
time.c: Detected 1804.115 MHz processor. 
Console: colour VGA+ 80x25 
Dentry cache hash table entries: 262144 (order: 9, 2097152 bytes) 
Inode-cache hash table entries: 131072 (order: 8, 1048576 bytes) 
Memory: 1019548k/1048544k available (2418k kernel code, 0k reserved, 932k 
data, 212k init) 
Calibrating delay using timer specific routine.. 3615.70 BogoMIPS 
(lpj=7231418) 
Security Framework v1.0.0 initialized 
SELinux:  Disabled at boot. 
Mount-cache hash table entries: 256 
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line) 
CPU: L2 Cache: 1024K (64 bytes/line) 
CPU 0(2) -> Node 0 -> Core 0 
mtrr: v2.0 (20020519) 
checking if image is initramfs... it is 
ACPI: Looking for DSDT in initrd... not found! 
 not found! 
Using local APIC timer interrupts. 
Detected 12.528 MHz APIC timer. 
cpu_up: attempt to bring up CPU 1 failed 
Unable to handle kernel paging request at 000000ef81cb1fb0 RIP: 
<ffffffff8016f4cb>{free_block+123} 
PGD 0 
Oops: 0000 [1] SMP 
CPU 0 
Modules linked in: 
Pid: 1, comm: swapper Not tainted 2.6.13-3-smp 
RIP: 0010:[<ffffffff8016f4cb>] <ffffffff8016f4cb>{free_block+123} 
RSP: 0000:ffff81003ffb1e58  EFLAGS: 00010012 
RAX: 0000001e002fca9e RBX: ffff810002532100 RCX: 000f0017e54f0000 
RDX: 0000000000000001 RSI: 0000000000000010 RDI: f000ff54f0000073 
RBP: 0000000000000010 R08: ffff81003ffb0000 R09: 0000000000000001 
R10: 0000000000019a28 R11: 0000000000000000 R12: ffff810002532508 
R13: 0000000000000000 R14: 0000000000000001 R15: ffff810002532528 
FS:  0000000000000000(0000) GS:ffffffff80508800(0000) knlGS:0000000000000000 
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b 
CR2: 000000ef81cb1fb0 CR3: 0000000000101000 CR4: 00000000000006e0 
Process swapper (pid: 1, threadinfo ffff81003ffb0000, task ffff81003ffaf500) 
Stack: ffffffff7fffffff 0000000000000000 ffff810002532100 ffff810002532568 
       0000000000000001 0000000000000000 ffffffff804cb460 ffffffff80170167 
       0000000100000001 ffffffff803d3620 
Call Trace:<ffffffff80170167>{cpuup_callback+455} 
<ffffffff8014a56f>{notifier_call_chain+31} 
       <ffffffff80156a01>{cpu_up+225} <ffffffff8010c15c>{init+268} 
       <ffffffff8010f92e>{child_rip+8} <ffffffff8010c050>{init+0} 
       <ffffffff8010f926>{child_rip+0} 
 
Code: 48 8b 14 c5 c0 ca 4c 80 48 8d 04 cd 00 00 00 00 48 c1 e1 06 
RIP <ffffffff8016f4cb>{free_block+123} RSP <ffff81003ffb1e58> 
CR2: 000000ef81cb1fb0 
 <0>Kernel panic - not syncing: Attempted to kill init! 
 
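
For reading traces like the one above, a small parser can pull the symbol names out of the `<address>{symbol+offset}` entries. This is a minimal sketch: `parse_frames` is a hypothetical helper, and treating the offsets as decimal is an assumption about this kernel's oops format.

```python
import re

# Matches entries such as <ffffffff80170167>{cpuup_callback+455}
FRAME = re.compile(r"<([0-9a-f]{16})>\{(\w+)\+(\d+)\}")


def parse_frames(text):
    """Return (address, symbol, offset) tuples for every <addr>{sym+off} entry."""
    return [(addr, sym, int(off)) for addr, sym, off in FRAME.findall(text)]


# First lines of the call trace from the log above.
trace = """Call Trace:<ffffffff80170167>{cpuup_callback+455}
<ffffffff8014a56f>{notifier_call_chain+31}
       <ffffffff80156a01>{cpu_up+225} <ffffffff8010c15c>{init+268}"""

for addr, sym, off in parse_frames(trace):
    print(f"{sym:22s} +{off:<4d} at 0x{addr}")
```

Read top to bottom, the frames show the faulting free_block call being reached from cpuup_callback during cpu_up, which matches the "attempt to bring up CPU 1 failed" line just before the oops.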
Comment 11 craig gardner 2005-09-09 14:17:42 UTC
Change status back to ASSIGNED. 
Comment 12 Andreas Kleen 2005-09-09 14:20:16 UTC
Can you give me the full log starting with the beginning of the boot?
Comment 13 craig gardner 2005-09-09 14:39:09 UTC
Sorry.  Adding screen log.  Attached. 
Comment 14 craig gardner 2005-09-09 14:40:52 UTC
Created attachment 49414
screen log
Comment 15 Andreas Kleen 2005-09-12 13:05:11 UTC
Your BIOS is somewhat broken and generates an invalid SRAT table. The fallback
error-handling code had a quirk that led to the crash later. 

It should work if you boot with numa=noacpi 

The fix probably didn't make RC2, but you can get a later KOTD (kernel of the
day) with that change and then drop the command-line option again.

- patches.arch/srat-fallback: (#115891) Backport some x86-64 srat
fallback fixes
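
Adding the workaround now and dropping it again once a fixed kernel is installed amounts to toggling one token on the kernel line. A minimal sketch, assuming a hypothetical helper `set_boot_option` and made-up sample arguments:

```python
def set_boot_option(cmdline, option, enabled=True):
    """Add or remove a single option on a kernel command line."""
    # Remove any existing copy first so the option never appears twice.
    tokens = [tok for tok in cmdline.split() if tok != option]
    if enabled:
        tokens.append(option)
    return " ".join(tokens)


# Hypothetical kernel arguments, for illustration only.
with_fix = set_boot_option("root=/dev/sda2 showopts", "numa=noacpi")
print(with_fix)
print(set_boot_option(with_fix, "numa=noacpi", enabled=False))
```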