Bug 130693

Summary: Wrong use of resume parameter on raid devices in initrd prevents booting
Product: [openSUSE] SUSE LINUX 10.0
Reporter: Rob Lucke <Rob.Lucke>
Component: Kernel
Assignee: Hannes Reinecke <hare>
Status: RESOLVED FIXED
QA Contact: E-mail List <qa-bugs>
Severity: Critical
Priority: P5 - None
CC: behlert, hare, krzysiek-novell, nfbrown, trenn
Version: Final
Keywords: Install
Target Milestone: ---
Hardware: HP
OS: SuSE Linux 10.0
Whiteboard:
Found By: Other Services
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: output of acpidmp for HP x4100
contents of /proc/cpuinfo for HP xw4100
System configuration information about xw8000 system
Kernel patch to start md arrays read-only
Diff to mkinitrd to work with preceding kernel patch

Description Rob Lucke 2005-10-26 03:34:28 UTC
I upgraded three of my machines to SUSE 10.0 -- the laptop installation went just fine.  Both SMP systems failed.  In both cases, the first CD-ROM loaded and the installation failed on the first reboot.  Booting in "Safe Mode" was required to continue the installation and was required *after* the installation to get the systems to boot live, under their own power (so to speak).

One system is a dual-processor 500 MHz Pentium III (HP X500 workstation) and the other is a single processor (hyperthreaded) 3.00 GHz Pentium IV (HP xw4100 workstation).  Both systems behaved exactly the same, resisting any boot attempts except in safe mode. I am attaching the output from acpidmp for the xw4100 system along with the contents of /proc/cpuinfo.  The X500 has 1GB of RAM and is using the SMP kernel, while the xw4100 has 4GB and is using the BIGSMP kernel (both chosen by the installation process).

The output from both systems' boot processes stops shortly after installation of the "processor" module.  In checking report #129954, the symptoms are very similar.  I get messages from the thermal and fan modules stating "no such device", then a kernel panic (no output to console) and a complete hang.

To fix this situation, I modified the initrd file's init script to comment out the loading of the processor, thermal, and fan modules.  The systems both boot with the normal GRUB stanza, as long as I substitute the modified initrd. 

I don't have enough information to tell if the installation of the processor, thermal, and fan modules is correct for these hardware configurations.  If the installation from the initrd is correct, then there is some fault between the modules and the hardware -- I don't know enough about the new ACPI stuff at this point to know what to suggest (other than don't install the modules from inside the initrd).  I mentioned my laptop above, a Sony VAIO -- the installation of the modules works just fine for it.
Comment 1 Rob Lucke 2005-10-26 03:35:51 UTC
Created attachment 55486 [details]
output of acpidmp for HP x4100
Comment 2 Rob Lucke 2005-10-26 03:36:32 UTC
Created attachment 55487 [details]
contents of /proc/cpuinfo for HP xw4100
Comment 3 Klaus Kämpf 2005-10-26 07:29:47 UTC
Kernel ?!
Comment 4 Rob Lucke 2005-10-26 09:14:37 UTC
This may be a kernel issue, because of the processor/thermal/fan module issues.  However, whatever builds the initrd during the installation process might need to "think" a little more about automatically loading the acpi-related modules from the initrd.

NOTE: I just installed another dual-CPU HP xw8000 system and got exactly the same issue.  Kernel panic and hang after loading the processor/thermal/fan modules.

At least the behavior is consistent with HP workstations, both real old (Pentium III) X500 and brand new (Pentium IV) xw4100, xw8000.
Comment 5 Rob Lucke 2005-10-26 09:18:25 UTC
One last comment before I call it a night.  These systems worked just fine with SUSE 9.3 Professional.  I did have to add "acpi=off" to the boot options under 9.3 -- that seems to have no effect on this issue; it was the first thing I tried.
Comment 6 Thomas Renninger 2005-10-26 09:30:39 UTC
With acpi=off the processor and other ACPI modules are definitely not loaded.
But we may see here two separate bugs...

Can you boot with pci=noacpi and 1 (just add a one) as boot params?
This should at least work on the P4?

If this works on the P4, disable the powersaved service (chkconfig powersaved off) and reboot normally; in that case this is a duplicate of a bug I am currently working on.

For the PIII it might be something else.
Comment 7 Rob Lucke 2005-10-26 16:43:45 UTC
I tried booting with "pci=noacp 1" as F2 options to the "standard" boot stanza.  This fails with the same error I am seeing everywhere else.  Here is a transcript of the last lines of the output:

[...]
Loading processor
ACPI: CPU0 (power states: C1[C1])
ACPI: Processor [CPU0] supports 8 throttling states
ACPI: CPU1 (power states: C1[C1])
ACPI: Processor [CPU1] supports 8 throttling states
Loading thermal
Loading fan
Loading raid1
md: raid1 personality registered as nr3
Waiting for device /dev/md2 to appear: ok
no record for 'md2' in database
Attempting manual resume

Kernel panic - not syncing - I/O error reading memory image

[power button time - RwL]

As you can see, the processor/thermal/fan modules appear to be loaded all the time.  The system never gets to the point that it pivots to the new root disk.  Another note on these systems: every one of them has a RAID 1 set as the system disk.  I have an xw8000 (native SCSI), an xw4100 (native SCSI), and the X500 (native SATA) -- each has two system disks and is set up in RAID 1.

I have tried every possible combination of the parameters specified in the "Safe Mode" stanza in conjunction with the "SUSE Linux 10" normal boot.  Those being "acpi=off apm=off selinux=0 nosmp noapic maxcpus=0 edd=off vga=normal ide=nodma noresume 3".  Single instances or combinations don't work.  The only way to boot is 1) Safe Mode, or 2) initrd modified to not load processor/thermal/fan.

In addition to the above options, I have also (based on information in other bug reports) tried "acpi=ht" (for the xw4100), "numa=noacpi", "acpi=oldboot", and various combinations of these and the Safe Mode options.  Still no dice.  The rescue system also boots and mounts the disks just fine, that shows that the installation process is doing the right thing on the disks and with RAID.
Comment 8 Rob Lucke 2005-10-26 17:14:31 UTC
oops, I meant "pci=noacpi" in the last message
Comment 9 Rob Lucke 2005-10-26 19:53:37 UTC
Okay, a recap:

HP X500 dual Pentium III             ---- Boots "safe" or with modified initrd
HP xw4100 3.0 GHz Pentium IV HT      ---- Boots "safe" or with modified initrd

I just installed my dual 3.06 GHz Pentium IV HT with SUSE 10.  It won't boot except in safe mode.  Period.  It gets the same message as described previously.  Modified the initrd, but that doesn't work at all.  Tried "pci=noacpi 1" and that doesn't work.  The BIOS in this machine is completely up to date.  So ...

HP xw8000 dual 3.06 GHz Pentium IV HT ---- Boots "safe" only

All of these systems have SMP in common.  Two of them have hyperthreading.  The dual non-HT system boots with mods, as does the single CPU HT machine.  The dual CPU HT machine won't boot except "safe".  They also have RAID 1 system disks (partitioned) in common.  The RAID configuration is identical: separate /boot, swap, and / partitions.

This is beginning to look like an SMP kernel issue, not necessarily an ACPI issue.  The other possibility is some SMP issue with RAID 1 system disks.  They all boot "safe", which includes "nosmp maxcpus=0".

Comments, next steps?  I will make the xw8000 the test machine, because the others are my file server and business server, respectively.  The xw8000 can be the test bed.
Comment 10 Rob Lucke 2005-10-26 21:53:00 UTC
Created attachment 55644 [details]
System configuration information about xw8000 system

Detailed configuration information, including output from acpidmp, lshal, and /proc files.
Comment 11 Rob Lucke 2005-10-26 22:15:45 UTC
The powersaved is off on all of these systems.
Comment 12 Rob Lucke 2005-10-26 23:30:20 UTC
Okay, I found it.  It is not ACPI at all, but an artifact of the resume feature.

THEORY:
If the resume device is specified (/dev/md1), then the initrd's init script executes discover_resume_device.  There appears to be a timing issue, maybe with RAID, udev, and the script -- the removal of the processor/fan/thermal module loads sped things up enough on two of the systems that the device was not found within the timeout, allowing the boot to proceed.  If the device "shows up" in time, there is something else happening that interferes with the boot.  

I have not dug up the details on this, but I am "instrumenting" the init script.  By removing the "resume=<device>" option and replacing it with "noresume", I can boot the problem systems.  This is also something that the "safe mode" boot stanza does.
Comment 13 Rob Lucke 2005-10-27 02:39:28 UTC
By instrumenting the initrd's init script, I think I found the cause of the issue.  First, the installation process seems to always place "resume=<swapdev>" into the kernel options, unless the menu entry is "safe mode".

The presence of the "resume=/dev/md1" kernel parameter activates the portions of the init script that try to detect the resume device.  We need to make a distinction between the parameter string that represents the device "/dev/md1", the actual device file name (which is present in /dev) /dev/md1, and the active device itself as it is instantiated by the udev file system and represented by the /sys/block/md1 file and associated information.

If you look for the udev_wait_for_device function in the initrd's init script, you will see that the script is testing for "/dev/$root".  The udev_discover_resume function is passing the "/dev/md1" string with the path stripped: "md1" to udev_wait_for_device.  This test will always succeed, as long as there is a device file in /dev.  I added the ls command to the initrd and verified that there is a /dev/md1 present by default.  Should this be looking at the udev device information instead?

Note that the presence of the /dev/md1 device file (under /dev) does not mean the device is active as far as udev or the kernel is concerned.  udev_discover_resume, however, trundles off as if the actual /dev/md1 device was active and readable.  When it calls "udevinfo -q path -n $resume", the error message "no record for 'md1' in database" is accurate:  there is no udev device under /sys/block named md1 (or md anything).  This is because the raidautorun call is not made until *after* the call to udev_discover_resume in the init script.  Indeed, examining the contents of /proc/mdstat at this point shows that the raid1 module is present, but there are no active devices.  The udevfs will not have created any devices at this point.
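The distinction drawn above -- a node under /dev versus a device the kernel has actually activated -- can be sketched as a small shell helper. This is a hypothetical illustration, not the actual mkinitrd code; the function name and the second parameter are invented here (the parameter only exists so the sketch is self-contained and testable):

```shell
# Sketch: a node under /dev does not prove the kernel has activated the
# device; an active block device also appears under /sys/block.  The
# real init script only tests for the /dev node, which is why its check
# "succeeds" even though md1 is not assembled yet.
is_block_device_active() {
    name=${1##*/}              # strip any leading path: /dev/md1 -> md1
    sysfs=${2:-/sys/block}     # overridable purely for illustration
    [ -d "$sysfs/$name" ]
}
```

With a check like this, an unassembled /dev/md1 would correctly report "not active" even though its device node exists.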

The call to udevinfo "path=$(/sbin/udevinfo -q path -n $resume)" fails (generating the error message), so the backup processing (correctly) sets the device major (9) and minor (1) number variables from the /dev/md1 device file.  These values get echoed into /sys/power/resume as "9:1" -- this appears to be where things go completely, er, south.  It may well be that removing the processor module (it was the only one that ever loaded successfully) somehow affects the behavior when the resume device info is echoed into /sys/power/resume -- it will take a kernel guy to answer why that might be.

I am going to be incommunicado tomorrow morning (10/27) PST, due to a clustering seminar I am giving.  I think this should be enough information to generate proposed fixes, and I will try to move the raidautorun ahead of the resume stuff if I get time.

There are two questions that beg asking: 1) does it make sense to resume from a RAID device, and 2) does it make sense for the install procedure to always place the resume option into the boot parameters.  If the answer to 2) is "yes", then 1) needs to work properly.  I think this is easily fixed, but Lucke's first law states "Ignorance simplifies any problem."

Have a good morning.
Comment 14 Thomas Renninger 2005-10-27 07:56:08 UTC
First: Thank you for the very detailed description! It could have taken weeks to realise it is the resume parameter and not the processor module, given that we cannot access the system. I still wonder why/how the processor module should influence this.

What about SLES9? We do not load the processor module in the initrd there, but resume from a raid device would still fail?

CC'ing Pavel; this might also be interesting for you?
Comment 15 Rob Lucke 2005-10-27 08:32:14 UTC
This kind of troubleshooting is the heart of what I do for a living.  I helped HP build a 1980-CPU, 13 TFLOP, Itanium-2 Linux cluster with pre-production hardware and pre-release IA-64 Linux.  It is a wonder I am still sane and willing to dive into stuff like this.  8^)

I've also done some pretty extensive re-engineering of the startup process in SLES9, and I don't recall ever running into resume anywhere.  That kernel was 2.6.1, so maybe it was added later. The script was called "linuxrc" in the initrd there, and as I recall, it only did a few module loads and the pivot_root call.  It was not nearly as complex as the new "init" script for SUSE Linux 10.

I don't know why the processor module would affect the behavior if it is not installed.  It certainly didn't on one of the three systems.  I have not had any reason to dig into the ACPI stuff yet, so I am not that familiar with the architecture and its behavior.

Maybe there is some connection with the processor module and the /sys/power/resume device and the crash -- it still could be a timing issue.  After all, should echoing MAJOR:MINOR for a non-active device into /sys/power/resume cause a kernel panic with no oops message?  8^)  Some questions have no answer.

Have a good afternoon.
Comment 16 Thomas Renninger 2005-10-27 08:58:43 UTC
The processor module should have nothing to do with /sys/power/* initialisation, and also nothing to do with raid initialisation; it should be totally unrelated to this stuff. The only possibility I see is a timing issue.
What is rather scary is that this means raid systems on SLES9 run by luck -- or rather: resume params are set incorrectly but it just doesn't matter (if I interpret your assumptions right).
Comment 17 Stefan Behlert 2005-10-27 09:10:27 UTC
Kay, can you look at comment 13 please? I'm lost with that udev :)
Comment 18 Hannes Reinecke 2005-10-27 09:19:50 UTC
Resume on md does _not_ currently work with SL10.0. Working on it.
Comment 19 Hannes Reinecke 2005-10-27 09:22:56 UTC
Just remove 'resume=xxx' from the boot commandline and the system should boot.
Comment 20 Hannes Reinecke 2005-10-27 14:18:47 UTC
Ad comment #15:
> After all, should echoing MAJOR:MINOR for a non-active device into
> /sys/power/resume cause a kernel panic with no oops message?
Well, this echo triggers the _kernel_ to search for a resume device. Of course all hell might break loose if the resume device is in principle there but not properly configured (as is the case with md devices).

Still wondering why it's my fault, though ... Sounds more like the fault of either md or acpi module.
Comment 21 Rob Lucke 2005-10-27 21:27:39 UTC
I think two things need to happen on this situation:

1) the initrd script needs to be fixed to properly look at the right thing -- if the device is not there, then it is not there and the code should not echo the wrong thing to the right place.  If the script was checking for a live device instead of the /dev entry, then resume wouldn't work, but the kernel wouldn't lock up either.

2) The default of adding the "resume=" option to the kernel boot parameters needs to be re-examined in the light of RAID devices.  If resume is not supported in the general case (RAID, raw, ...), then the installation process needs to be updated and the restriction (resume is not supported from RAID) needs to be documented somewhere.  This will require installation intelligence to determine the type of device specified for swap (or allowing a special designation for the resume device).

These two things are the expedient thing to do, given that rolling the kernel to catch this type of error (echoing the wrong thing to /sys/power/resume) is not going to happen in a very timely fashion (an assumption on my part).  If I were the kernel guys, I would say "Just don't echo the wrong thing into the device --  your bad, why us?".

If I get time, I will test moving the raidautorun call to before the udev_discover_resume routine.  This may or may not start a modification cascade of some type, but it *could* actually be an easy fix ...  At a minimum, I would be looking at fixing the udev_wait_for_device function to properly return a "not found" when the device isn't really active -- it should not be looking exclusively at /dev file entries.  This would eliminate the hang, not require modification of the kernel or install process, and would result in happier customers.  The down side would be that resume to/from RAID would not work, silently.
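A repaired udev_wait_for_device along the lines proposed here might poll sysfs with a timeout instead of just testing the /dev entry. The following is a hypothetical sketch only -- the function name, the timeout default, and the sysfs parameter are all invented for illustration and are not the actual mkinitrd code:

```shell
# Sketch: wait until the kernel reports the device under /sys/block,
# returning non-zero on timeout instead of falsely reporting success
# just because a stale /dev node exists.
wait_for_active_device() {
    name=${1##*/}              # /dev/md1 -> md1
    timeout=${2:-10}           # seconds to wait before giving up
    sysfs=${3:-/sys/block}     # overridable purely for illustration
    while [ "$timeout" -gt 0 ]; do
        [ -d "$sysfs/$name" ] && return 0
        sleep 1
        timeout=$((timeout - 1))
    done
    return 1                   # caller should then skip resume, not guess
}
```

On a "not found" return, the script could silently skip resume, which matches the behavior described here: no resume from RAID, but no kernel lockup either.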
Comment 22 Rob Lucke 2005-10-27 22:19:51 UTC
Okay, moved the raidautorun call.  The script is now (mostly) original, but the changed section is:

echo "Loading raid1"
modprobe raid1 $params

echo "raidautorun ..."
raidautorun
echo "done..."

Note that the raid1 module is loaded and the raid devices are started.  At this point, the resume device (I ASSuME) will not need file system modules (jbd, ext3, etc) for access to the resume device.  Before the root device and others are accessed, the jbd and ext3 modules are loaded in my copy of init.  This boots just fine, even with the resume= option in place.

I won't generate the patch, because I hacked on the init script and don't want to screw things up for you.  Is there any other information that you need?
Comment 23 Thomas Renninger 2005-10-28 14:47:31 UTC
Ohh this is mine again... 
I will add an "ignore the resume parameter in raid case" hook into mkinitrd, then.

Hare: Shall we also not support resume from raid partition for SLES10 then, or do you have an idea how to properly solve this?
Comment 24 Stefan Behlert 2005-10-28 15:54:29 UTC
Resume on RAIDs is a 'need-to-have'. Whom do we need for that? Is it mostly a udev-problem or swsusp-problem?
Comment 25 Thomas Renninger 2005-10-29 14:10:01 UTC
I expect this is a (mk)initrd problem. If Rob's investigations are right, the raid is set up after the initrd tries to echo the resume partition into /sys/power/resume. The kernel then looks for the resume flag in a swap partition that has not been set up correctly yet.
A quick look in mkinitrd seems to confirm that this is true:

(line 2311):
# wait for the resume device
        udev_discover_resume
(line 2317):
    # Load fs modules _after_ resume
...
somewhat later comes the raid stuff.
Still wondering why the processor module has anything to do with that... Even if loading it takes a while, it is done long before.

What about the lvm/evms stuff? If swap is placed there, resume is also not possible, if I interpret the code right?
Hannes?
Comment 26 Pavel Machek 2005-10-30 11:25:40 UTC
#24: why do we need to resume from RAIDs? It is not exactly easy to do... I'd prefer not to do that.

I do not think processor module has anything to do with this one, except changing timing slightly.
Comment 27 Pavel Machek 2005-10-30 11:43:26 UTC
*** Bug 121829 has been marked as a duplicate of this bug. ***
Comment 28 Neil Brown 2005-10-31 01:07:31 UTC
We want to resume from raid for the same reason we want raid for anything else -
reliability in the face of devices failure.

I'm not at all comfortable about simply moving the 'raidautorun' call to before
the resume is attempted just yet. 
The problem is that when you start an md/raid array currently, it will write out
new superblocks and could even start a sync.  However you really don't want to
write ANYTHING to ANY device before resuming as that changes state.

I have had in mind for some time that it would be nice to be able to start
md arrays in a 'readonly' mode that didn't write anything anywhere.  Now I have
a clear case where there is a problem, so I will find a fix. Probably it will
be some module parameter accessed at /sys/modules/md-mod/parameters/start-ro

If set to 1, all devices are started read-only, but flip to rw on the first
write request... Something like that.

For now, I recommend avoiding the attempt to resume from an md array.
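If a read-only start parameter like the one proposed here existed, an initrd could use it roughly as below. This is a sketch of the idea only: the parameter path follows common sysfs module-parameter layout and may differ from the final patch, and the helper takes the path as an argument purely so the sketch is self-contained:

```shell
# Sketch: enable the proposed read-only start mode before assembling
# arrays, so that attempting resume does not write to any member disk.
enable_md_start_ro() {
    param=${1:-/sys/module/md-mod/parameters/start_ro}  # assumed path
    [ -w "$param" ] || return 1   # parameter absent: kernel lacks the patch
    echo 1 > "$param"
}

# Hypothetical use in the init script:
#   if enable_md_start_ro; then
#       raidautorun          # arrays come up read-only
#       # ... attempt resume; first write flips the array to read-write ...
#   else
#       : # refuse to resume from md on kernels without the parameter
#   fi
```

The design choice matches comment 30: the presence of the parameter acts as the feature test, so older kernels degrade to "no resume from md" rather than panicking.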
Comment 29 Neil Brown 2005-10-31 04:23:17 UTC
Created attachment 55990 [details]
Kernel patch to start md arrays read-only

This patch (which I'll send upstream shortly and see about checking in to the suse kernel) allows arrays
to be started read-only.  With this in place, I believe it is safe to resume from an md/raid array.
Comment 30 Neil Brown 2005-10-31 04:27:35 UTC
Created attachment 55991 [details]
Diff to mkinitrd to work with preceding kernel patch

This is a patch against mkinitrd (as in 10.0) which
should make it create initrd images that work better
if swap (and resume) are on md/raid.

If the module parameter implemented in the previous patch is present, then we enable start_ro and start md arrays before attempting resume.  If it isn't present, we refuse to try to resume from an md array.

Note that I think I found a typo in mkinitrd.
It sets resume_mode to 'no' and later tests if it is 'off'.  I think these should both be the same, and the patch makes them both 'off'.
Comment 31 Pavel Machek 2005-11-01 22:32:30 UTC
Well, enabling write access on first write is a nice hack... but it would be cleaner to just start arrays in read-only mode, and only enable writes on explicit request.

initrd can do that just after attempting resume. Maybe that makes patch #55990 unnecessary?

Next, if we really want to support this, it should wait for 10.1. It needs testing, and it would be nice if it was in mainline before we start using it. 
Comment 32 Thomas Renninger 2005-11-02 08:45:31 UTC
So we need:
  a) ignore the resume parameter for raid swaps on 10.0 (and possibly a decision by a 
     project manager whether this is worth a YOU update)
  b) a proper solution as suggested in #31 or #30 for 10.1/stable
right?
Comment 33 Forgotten User ZhJd0F0L3x 2005-11-02 11:08:22 UTC
(In reply to comment #32)
> So we need:
>   a) ignore resume parameter for raid swaps on 10.0 (and possibly a decission
> of a 
>      project manager whether this is worth a YOU update)

i think an SDB article is the correct "fix".

>   b) a proper solution as suggested in #31 or #30 for 10.1/stable
> right?

yes, this is 10.1 material.


regarding comment #26: while resume from raid is not exactly high priority for me, resume from cryptoswap should be, and i assume raid is just a special case of dm, isn't it?
Comment 34 Neil Brown 2005-11-02 11:22:24 UTC
(In reply to comment #33)
>
> 
> regarding comment #26: while resume from raid is not exactly high priority for
> me, resume from cryptoswap should be and i assume raid is just a special case
> of dm, isn't it?
> 

No, raid isn't just a special case of dm.
'dm' and 'md' are two completely independent modules in Linux.
dm supports various mappings of one set of devices to another, including
the equivalent of 'raid0', and (I think) raid1, and multipath and
lvm.

md supports raid0, linear, raid1, raid5, raid6, and multipath (though it 
does that last very poorly and I doubt anyone uses it).

'raid' in the current context is md/raid, not dm.

The problem that this bugzilla is about with md would equally exist
with dm if someone placed swap (and hence suspend/resume) on a dm 
device.
The solution would be conceptually similar (start dm devices in read-only mode before attempting resume) but very different in the details.  I don't know those details.

Comment 35 Pavel Machek 2005-11-02 11:30:09 UTC
#33: there was a "cryptoloop howto" from ast circulating around lkml (IIRC). I may be able to find it. But this is probably a better discussion for a research mailing list than for bugzilla...
Comment 36 Hannes Reinecke 2005-12-19 09:30:49 UTC
Hmm.

Really hmm.

Neil, with your patch we might be trying to call 'raidautorun' on devices which don't even exist yet, as they are still in the process of being discovered / activated.

For MD to work properly we would need a way to activate from hotplug events.

Better to discuss this offline.
Comment 37 Hannes Reinecke 2006-01-16 14:38:12 UTC
Fixed up mkinitrd for 10.1 to properly enable raid devices.