Bugzilla – Full Text Bug Listing
| Summary: | boot stops on mounting a software RAID - Invalid root filesystem | | |
|---|---|---|---|
| Product: | [openSUSE] openSUSE 11.1 | Reporter: | Michael McCarthy <sysop> |
| Component: | Installation | Assignee: | Michal Marek <mmarek> |
| Status: | RESOLVED FIXED | QA Contact: | Jiri Srain <jsrain> |
| Severity: | Critical | | |
| Priority: | P2 - High | CC: | andreas.osterburg, archie.cobbs, aschnell, cus, daved, dhaselwood, esacchi, fhassel, forgotten_bKE5XLoalW, forgotten_Drfk9mafMw, forgotten_E_KYbzzvNl, hare, hpj, joe_morris, lmuelle, mamdoh, martin_schaub, mmarek, mvancura, nfbrown, peter.schlaf, petr.m, richlv, sfurdal, sperkins, whiplash |
| Version: | Final | | |
| Target Milestone: | --- | | |
| Hardware: | i386 | | |
| OS: | Other | | |
| Whiteboard: | | | |
| Found By: | Beta-Customer | Services Priority: | |
| Business Priority: | | Blocker: | --- |
| Marketing QA Status: | --- | IT Deployment: | --- |
| Attachments: | save_y2log; boot.msg (the reporter's boot.msg file); GBs boot.msg; the fix as a patch; md-restart-uevent (script that waits until the MD array is started, for udev) | | |
Description
Michael McCarthy
2008-11-16 13:27:26 UTC
Created attachment 252501 [details]
save_y2log
This is the save_y2log
Created attachment 252502 [details]
This is the boot.msg file
NOTE: The system was stopped at the /bin/sh prompt when I inserted a USB flash drive. The following lines in the boot.msg file indicate the time when the system was stopped and the USB stick inserted prior to resuming with <CTRL-D>.
<6>usb 1-1: new high speed USB device using ehci_hcd and address 2
<6>usb 1-1: configuration #1 chosen from 1 choice
<6>usb 1-1: New USB device found, idVendor=0930, idProduct=6534
<6>usb 1-1: New USB device strings: Mfr=1, Product=2, SerialNumber=3
<6>usb 1-1: Product: USB Flash Memory
<6>usb 1-1: Manufacturer: M-Sys
<6>usb 1-1: SerialNumber: 087043505130A995
The installation was completed, but any time the system is rebooted, the stop occurs. These are the last couple of lines from the boot screen:

PM: Starting manual resume from disk
Waiting for device /dev/md0 to appear: ok
invalid root filesystem - exiting to /bin/sh
$

After <CTRL-D> the boot resumes and the following line appears:

$ exit
mounting root on /dev/md0

and the boot continues. It appears that the check on the root filesystem is incorrect or is done at the wrong time. Possibly it is checking the wrong device.

Looks like a kernel problem then. Adjusting summary: every boot stops.

I think this might be the same as bug 433980 and bug 435778. I'm guessing some udev raciness, but I'm no expert there.

Sounds like an initramfs problem. The initramfs cannot just wait for /dev/md* devices to appear. Because of the weird lifetime rules of md and its legacy device-creation interface, the device node is always there, and does not only show up when the device becomes available. I guess it needs some special handling to check when the device becomes "online". I cannot tell for sure; I do not have any RAID rootfs setup. Neil, can you explain what the initramfs should do here? I don't know about DM-RAID.

According to bug 435778, this has been fixed. It needs a 'udevadm settle' before asking udev what the filesystem type is. Hannes added 'wait_for_events' in mkinitrd-boot.sh for mdadm, just after assembling the root device.

Since bug 435778 is only internally accessible, could you spell out here what the resolution is? At present I have to hit Ctrl-D to boot every time, and if this has been resolved (but no updates have been available), I would gladly tweak whatever is needed until there is an update.

This bit me hard on updating from 10.3 to 11.1. What file needs 'udevadm settle' added to it? I tried adding 'wait_for_events' at the bottom of /lib/mkinitrd/scripts/boot-md.sh, but it didn't fix it here. Perhaps it is a different file. I would appreciate knowing what to do to fix mine, since this is resolved.
Thanks. That is the correct file and (close enough to) the correct fix. We put "wait_for_events" just before the final "fi" of that file. Once you do that, you need to recreate the initrd; simply running mkinitrd as root should do this. Then try to reboot.

Sorry to be a pain in the neck, but adding wait_for_events to /lib/mkinitrd/scripts/boot-md.sh before the last fi and running mkinitrd still did not fix it. You also mention in Comment #9 that there is a 'udevadm settle' also added. I cannot see anywhere in that file where it should go. Perhaps in another file?

I just checked boot-udev.sh, and it looks to me like wait_for_events is a variable meaning udevadm settle with a timeout set to 30 in setup-udev.sh. Is it possible the timeout is too quick for my system? I will test and let you know.

Well, changing the timeout to 60 didn't change anything. The message "invalid root filesystem -- exiting to /bin/sh" is found in line 88 of boot-mount.sh, where it looks like it is in the filesystem-checking area. I suppose it cannot do a fsck, even though it looks like it has assembled the RAID 1, and not found the resume, before that message. If wait_for_events is the fix, it isn't working here. Just to triple-check, these are my last few lines from boot-md.sh:

if [ "$md_dev" ] ; then
/sbin/mdadm $mdconf --auto=md $md_dev || /sbin/mdadm -Ac partitions $mdarg --auto=md $md_dev
fi
wait_for_events
fi

(beware of word wrapping, that is only 5 lines). Thanks for any ideas. Should I reopen this bug?

I've added you to CC for bug 435778 so you should be able to read it. Comment #7 might be an interesting starting point. If you see ID_FS_TYPE=whatever then it seems likely that it is the same race. If you do not, the problem must be elsewhere.

Thanks for that. This is what I found.
udevadm info -q path -n /dev/md0 gave me the correct /devices/virtual/block/md0.

udevadm info -q env -p /devices/virtual/block/md0 gave me 'no record for '/devices/virtual/block/md0' in the database' (note: you had a typo).

After some testing, I tried udevadm info -q all -n /dev/md0. That gave me some output, but ID_FS_TYPE was missing. I entered

echo change > /sys/block/md0/uevent

and repeated udevadm info -q all -n /dev/md0. This time I did have ID_FS_TYPE as well as others. Not quite sure what this reveals yet, but it gives me more things to check. Either hitting Ctrl-D or entering exit continues the boot, but boot.md shows as failed, and it also shows failed (but apparently succeeds) when it later tries to mount my RAID 1 home partition md1. So I will look again at the mkinitrd scripts. Thanks for your help so far.

OK, still haven't found the problem. I just found mdadm-3.0-12.1.x86_64.rpm in CTiel's home 11.1 hotfix directory. It improved some things; boot.md no longer shows as failed, but I still get the invalid root filesystem error and it stops booting. I tried adding 'echo change > /sys/block/$md_dev/uevent' just before wait_for_events, and saw that the variable worked, but got an error that the file was not valid or something like that, basically showing me that md0 was not there yet. It was before the message "waiting for device /dev/md0 to appear : ok", so I am assuming it has to be a bug in a different script. 10.3 worked great. I did test again, and 'udevadm info -q env -n /dev/md0' after failure to boot still does not contain ID_FS_TYPE, but after echo change > /sys/block/md0/uevent, it does.

Well, after looking around, and figuring out there was a boot folder that had the order of the scripts executing, I decided (since it failed right after "resume device not found, ignoring") it had to be something in boot-mount.sh. I compared it with the one from 10.3, and too much has changed for me to understand what I could change.
So I added the echo line (and the subsequent test still failed, but I did have the ID_FS_TYPE this time), then added a sleep line after it to give it a bit more time. Now it will boot, but this is at best a kludge or desperate workaround. To see what I changed:
# And now for the real thing
if ! discover_root ; then
    echo "not found -- exiting to /bin/sh"
    cd /
    PATH=$PATH PS1='$ ' /bin/sh -i
fi
echo change > /sys/block/md0/uevent
sleep 1
sysdev=$(/sbin/udevadm info -q path -n $rootdev)
# Fallback if rootdev is not controlled by udev
if [ $? -ne 0 ] && [ -b "$rootdev" ] ; then
    devn=$(devnumber $rootdev)
    maj=$(devmajor $devn)
    min=$(devminor $devn)
    if [ -e /sys/dev/block/$maj:$min ] ; then
        sysdev=$(cd -P /sys/dev/block/$maj:$min ; echo $PWD)
    fi
It surely isn't a fix, but at least now it will boot without me needing to hit Ctrl-D. Hope this helps to track down a real fix.
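The workaround above boils down to re-issuing a "change" uevent so udev re-probes the device and fills in the ID_FS_* properties. As a minimal sketch only (this is not part of the original scripts; the function name is invented, and the sysfs root is a parameter purely so the logic can be exercised outside a real /sys):

```shell
#!/bin/sh
# Hypothetical sketch of the workaround's core step: write "change" to the
# device's uevent file so udev re-runs its rules (including vol_id) for it.
# sysroot is parameterized only for testing; on a live system it is /sys.
retrigger_md_uevent() {
    sysroot=$1    # normally /sys
    mddev=$2      # e.g. md0
    uevent="$sysroot/block/$mddev/uevent"
    [ -w "$uevent" ] || return 1   # device not registered in sysfs yet
    echo change > "$uevent"
}
```

On a real system, `retrigger_md_uevent /sys md0` corresponds to the echo line added to boot-mount.sh, with the sleep giving udev time to process the resulting event.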
This problem looks similar: https://bugzilla.novell.com/show_bug.cgi?id=460917

Until the real developers are able to track this down, I thought of a slightly less kludgy fix. I took out what I added to boot-mount.sh, and added these 2 lines in the opposite order just before the wait_for_events in boot-md.sh. This is also working, and since it uses the md variable, it should work a bit better than the hard-coded md0, which meant I needed to put this fix in boot-md.sh and not boot-mount.sh. Here is what it looks like now at the end (watch the wrapping):

if [ "$md_dev" ] ; then
/sbin/mdadm $mdconf --auto=md $md_dev || /sbin/mdadm -Ac partitions $mdarg --auto=md $md_dev
fi
sleep 1
echo change > /sys/block/md$md_minor/uevent
wait_for_events
fi

Hope this will help you, Petr, until a real fix comes along.

This is a better workaround than earlier, and it allows mine to boot right up. I tried undoing the workaround and applied the patch from https://bugzilla.novell.com/show_bug.cgi?id=460917 but it did not help me at all. My problem is definitely that ID_FS_TYPE was missing. I just checked, after booting with the workaround, and on the booted system I get:

jmorris:/ # udevadm info -q env -n /dev/md0
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=ffb096e5:5d1a78ab:71771454:6b84c526
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_UUID_ENC=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_LABEL=root
ID_FS_LABEL_ENC=root
ID_FS_LABEL_SAFE=root
jmorris:/ # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
MD_DEVNAME=1

So even after booting, the echo command would seem to be needed to give md1 the FS info.
jmorris:/ # echo change > /sys/block/md1/uevent
jmorris:/ # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=44d4e1ac-8ce7-49ec-b64f-2c084a817515
ID_FS_UUID_ENC=44d4e1ac-8ce7-49ec-b64f-2c084a817515
ID_FS_LABEL=home
ID_FS_LABEL_ENC=home
ID_FS_LABEL_SAFE=home
FSTAB_NAME=/dev/md1
FSTAB_DIR=/home
FSTAB_TYPE=ext3
FSTAB_OPTS=acl,user_xattr
FSTAB_FREQ=1
FSTAB_PASSNO=2

Interestingly enough, KDE popped up a dialog box after entering the echo command. It sure seems to be a difficult bug to locate.

It looks as if either md doesn't generate change events when it becomes ready, or the udev rules need to be modified to update ID_FS_XXX for md. Kay? Neil?

Per comment 21, NEEDINFO.

FWIW, modification according to https://bugzilla.novell.com/show_bug.cgi?id=445490#19 fixed it on 3 different systems for me. Joe, thanks a lot. BTW, I would raise the severity to critical or even blocker, since it definitely inhibits common install scenarios for a lot of folks.

I have had the same result: https://bugzilla.novell.com/show_bug.cgi?id=445490#19 fixed the problem.

*** Bug 460917 has been marked as a duplicate of this bug. ***

We see this in comment#3: "Waiting for device /dev/md0 to appear: ok". As said in comment#7, we cannot just wait for the /dev/md* device node to appear; it is always there. We need to loop until the device is usable. The lifetime of md devices is "broken" from udev's view; they need static device nodes because of their legacy behavior. They do not fit into today's kernel/udev device model, and need to be special-cased in the initramfs.

Michal, that needs changes in mkinitrd-boot.sh in the mdadm package. You're listed as maintainer of that package. Can you help here? I'm not really familiar with md, so you're the better person here ...
:)

If we booted by-uuid, instead of using the questionable md kernel device name, it would just work, right?

I do not believe so. For example, this is from my running system:
joe@jmorris:~> ls -l /dev/disk/by-uuid/
total 0
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 801c2029-6789-4117-a461-da745a19f062 -> ../../sda1
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 a2d3f2bf-eaec-45a0-b843-55b15f037d83 -> ../../md0
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 a97bf970-827c-458d-8d59-764fc0381da8 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 e3e86c6b-4445-4611-90a8-fb7b2df6e5af -> ../../sda7
Even though my home is on md1, as you can see above it is still absent in by-uuid. I believe md0 is only there because of the added
sleep 1
echo change > /sys/block/md$md_minor/uevent
that I added to boot-mount.sh; see Comment #19.
But, looking around at /dev/disk/by-id, I see:
joe@jmorris:~> ls -l /dev/disk/by-id/
total 0
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-HDS722516VLAT20_VNR40AC4CMNT6S-part6 -> ../../sdb6
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 ata-ST3200822A_3LJ07EHY-part7 -> ../../sda7
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 edd-int13_dev80 -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev80-part7 -> ../../sda7
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 edd-int13_dev81 -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 edd-int13_dev81-part6 -> ../../sdb6
lrwxrwxrwx 1 root root 9 2009-01-09 05:21 md-uuid-50317547:316ba81e:fc3a8342:011169ec -> ../../md1
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 md-uuid-ffb096e5:5d1a78ab:71771454:6b84c526 -> ../../md0
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S -> ../../sdb
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part2 -> ../../sdb2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part5 -> ../../sdb5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_HDS722516VLAT20_VNR40AC4CMNT6S-part6 -> ../../sdb6
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY -> ../../sda
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part1 -> ../../sda1
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part2 -> ../../sda2
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part6 -> ../../sda6
lrwxrwxrwx 1 root root 10 2009-01-09 13:20 scsi-SATA_ST3200822A_3LJ07EHY-part7 -> ../../sda7
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 usb-IC_USB_Storage-CFC_20020509145305401-0:0 -> ../../sdc
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 usb-IC_USB_Storage-MMC_20020509145305401-0:2 -> ../../sde
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 usb-IC_USB_Storage-MSC_20020509145305401-0:3 -> ../../sdf
lrwxrwxrwx 1 root root 9 2009-01-09 13:20 usb-IC_USB_Storage-SMC_20020509145305401-0:1 -> ../../sdd
That at least recognizes md1 as well as md0, though I am not sure how it would change the fact that the filesystem info is still not output, which is what is causing this part to fail, i.e.
jmorris:/home/joe # udevadm info -q env -n /dev/md0
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=ffb096e5:5d1a78ab:71771454:6b84c526
ID_FS_USAGE=filesystem
ID_FS_TYPE=ext3
ID_FS_VERSION=1.0
ID_FS_UUID=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_UUID_ENC=a2d3f2bf-eaec-45a0-b843-55b15f037d83
ID_FS_LABEL=root
ID_FS_LABEL_ENC=root
ID_FS_LABEL_SAFE=root
jmorris:/home/joe # udevadm info -q env -n /dev/md1
MD_LEVEL=raid1
MD_DEVICES=2
MD_METADATA=0.90
MD_UUID=50317547:316ba81e:fc3a8342:011169ec
MD_DEVNAME=1
Without the ID_FS_TYPE info the fsck cannot work, and booting stops.
I may need to eat my words. I checked our 10.3 server earlier with udevinfo (10.3 doesn't have udevadm) and found out only md0 on that box had ID_FS_XXXX info (it has 4 RAID 1 arrays; I am waiting for this bug to be fixed before I upgrade it to 11.1). It also only had the link to md0 in by-uuid, but by-id had them all. I would be willing to test whether booting by-id would work. What all would need to be changed? menu.lst? fstab? Rebuild the initrd? Anything else?

Just tried the YaST Partitioner; by-id is greyed out, so it is not an option. UUID is an option, but I cannot imagine trying to remember its name for any commands: /dev/disk/by-uuid/a2d3f2bf-eaec-45a0-b843-55b15f037d83. From the above, though, it looks like it uses ID_FS_UUID to get it, not MD_UUID. Without the echo change > /sys/block/md0/uevent, though, there is no ID_FS_UUID present. There is MD_UUID.

BTW, I haven't been able to reproduce even bug #460917 yet, so no progress so far. Any help from the udev experts would be appreciated :).

Michal, I bet that installing 10.2 with an md and then upgrading to 11.1 will suffice.

Since I can easily reproduce this problem (my RAID was originally built back around 9.2, with upgrades to 9.3, 10.1, 10.2, 10.3, and now 11.1), I would be happy to try booting by-uuid as mentioned in Comment #28 if I had some info as to what needs to be changed, and if it does not depend on the ID_FS_UUID info, which I know already does not work.

Got bitten by this one recently.. comment 19 seems to help.

Created attachment 267459 [details]
GBs boot.msg
The same here with my DELL Dimension 8200. The relevant lines in my boot.msg read as follows:

Trying manual resume from /dev/md0
resume device /dev/md0 not found (ignoring)
Trying manual resume from /dev/md0
resume device /dev/md0 not found (ignoring)
Waiting for device /dev/md1 to appear: ok
invalid root filesystem -- exiting to /bin/sh
$

Here I have to hit Ctrl+D (or type "exit" followed by <ENTER>) to continue booting.

Any progress with this bug? This bug is holding me back from upgrading our office server, which has 4 md RAID 1 partitions, and even though the workaround is working OK, I know it is a kludge and not a fix. Just asking. If there is anything I could do to help... Does anyone know if a newly created md RAID 1 partition with 11.1 will boot right up without the workaround? Just grabbing at straws.

Joe, it escapes me why this glitch holds you back from doing anything with 11.1. Well yes, it's nagging to boot into the rescue system, mount the devices by hand, edit the offender, call mkinitrd, and reboot.. Even a newly created RAID 1 will fail without the fix. But being as alarmed as you are, how about doing the upgrade, waiting for the "will reboot in 10s" message, cancelling it, applying the attached patch, and calling mkinitrd, and all should be well? OTOH, you will only circumvent this issue, even if a fix is committed and rolled out, if you manually add the correct update repo during the upgrade/install.. Thus the former approach is still more appealing in my book.

Created attachment 268090 [details]
the fix as a patch
Just switch to a text console, change to the install root path during install (inquire with df -h), and apply with patch -p0 < mkinitrd-boot-md-fix.diff
It's a lot more difficult to describe the procedure than to do it all the way..
Kay, I'm probably not the right person for this bug :(. Could you have a look?

See comment #7; it's nothing udev could fix. Md needs special workarounds. It needs to be checked that the kernel md code sends the "change" event at the proper time, when the device is ready to be read from userspace, if we are not sure that is correct already. Also, the initramfs needs special handling of md devices: loop until the device is readable before investigating it; it cannot just wait for the device to appear.

And this looks like the ideal bug to get Milan started ...

So... I'm seeing this very issue on a brand-new clean install of 11.1 (so re: #32, an upgrade has nothing to do with it). I have two SATA disks, each of which has three partitions. All partitions are RAID autodetect, configured as RAID 1: the first set for /boot, the second for swap, and the third for the root partition. I did all this at install time, and hit this bug when the system did its first reboot to complete the installation. So I believe that the requirement for reproducing this is 1) having your root partition handled by MD RAID, and possibly 2) having it not be /dev/md0.

(In reply to comment #29)
> Without the ID_FS_TYPE info the fsck cannot work and stops booting.

I'll second that; the init dies in /boot/83-mount.sh, lines 79-90:

if [ -z "$rootfstype" -a -x /sbin/udevadm -a -n "$sysdev" ]; then
    eval $(/sbin/udevadm info -q env -p $sysdev | sed -n '/ID_FS_TYPE/p')
    rootfstype=$ID_FS_TYPE
    [ -n "$rootfstype" ] && [ "$rootfstype" = "unknown" ] && $rootfstype= ID_FS_TYPE=
fi
# check filesystem if possible
if [ -z "$rootfstype" ]; then
    echo "invalid root filesystem -- exiting to /bin/sh"
    cd /
    PATH=$PATH PS1='$ ' /bin/sh -i

The patch in #39 worked perfectly, though.

Thanks for the info that it does happen on a new install and not just an upgrade. As far as which dev / is on, it failed on mine with / being on md0, so that isn't it.
I suspect the only relevant fact is that the root partition is on an md RAID device, and it does not by default contain the ID_FS_TYPE info, which causes the fsck to fail. I have been using my (IMHO) rather kludgy fix for over 2 months with no apparent ill effects, and it does allow the system to work effectively normally. Thanks for your input and info, Gordon.

New SLES/SLED 11: the same problem.

I have a fix for this problem in a different direction. Instead of fixing it in udev, I changed the /lib/mkinitrd/setup/91-mount.sh file by saving the variable rootfstype at the end. I am new to openSUSE, therefore I don't have a deep understanding of the system. But I think this line should be there, even when this bug is solved by udev. The reason is that in my opinion it makes no sense to save the name of the fsck command in the configuration file and then try to get the filesystem type at boot time.

openSUSE Factory: the boot resumes without this error.

(In reply to comment #19) This fix worked nicely for many weeks, but with some of the latest patches I installed some days ago the fix was cancelled. Booting stops again with the error "invalid root filesystem". When I looked into my boot-mount.sh today, the first two of the three added lines were gone.

GONE: sleep 1
GONE: echo change > /sys/block/md$md_minor/uevent
STILL THERE: wait_for_events

After adding the two missing lines again and calling mkinitrd (from a root console) the problem was gone.

I can confirm the patches undid the fix on the machine involved in my report #24. But in the meantime I have installed 11.1 on two other machines with md root filesystems where this problem did not arise. This is correlated with the problem machine having a 64-bit AMD Athlon processor, and the two problem-free machines having Intel Core 2 Quad processors.

*** Bug 433980 has been marked as a duplicate of this bug. ***

I got bitten too while upgrading to 11.1. The patch from #24 fixed it. Thank you, Joe, for the patch from Comment #19.
I managed it for *SUSE*10 in another bug (502714 - can't make it public, sorry for that). For *SUSE*11 this is part of the mdadm package, so I'm reassigning this bug to the mdadm maintainer.

Again, why can't md send out 'change' events once it detects the device is usable? That would be a kernel patch, but would save us quite a lot of hassle in userspace. drivers/md/md.c:md_new_event() seems like an ideal place for this ...

Actually we have this:

kobject_uevent(&mddev->gendisk->dev.kobj, KOBJ_CHANGE);

at the end of drivers/md/md.c:do_md_run(), so we should be getting CHANGE events for this case.

Created attachment 296894 [details]
md-restart-uevent
Send 'change' uevent when array is restarted.
Maybe the above helps. Can you test with it and have 'udevadm monitor --env' running? This will show us all events being generated.

After upgrading to the latest kernel on openSUSE 11.1 the same problem
occurred.
The problem behind it is very simple, and so is the solution.
The main problem is that all udev rules are fired when a new MD
array is assembled, but the array does not necessarily have to be started at that moment.
The critical section is within the udev rules file "64-md-raid.rules" (/lib/udev/rules.d/).
==> HERE IS THE PROBLEMATIC RULE:
IMPORT{program}="vol_id --export $tempnode"
vol_id will be called on an array that is assembled, but not started,
so the determination of the filesystem type fails (env var ID_FS_TYPE),
and then the remaining parts of the initramfs fail.
This can be seen when you call "mdadm --detail $tempnode" and pipe it to /dev/console within these rules (a separate script is needed).
It says "status: clean, Not Started"
The solution is simple: Wait until the array is started.
My quick solution is the following:
A new udev program called "md-stat.sh", which waits until the array is ready.
Run it before vol_id is called (file is attached).
So add the following rule before the vol_id rule:
IMPORT{program}="md-stat.sh $tempnode"
(the dirtiest hack is to wait 3 seconds [IMPORT{program}="/bin/sleep 3"])
4 steps are needed to fix:
1. Put md-stat.sh to /lib/udev
2. Modify /lib/udev/rules.d/64-md-raid.rules
3. run "mkinitrd"
4. reboot ;-)
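For illustration only, a minimal sketch of what such a wait helper might do (this is not the attached md-stat.sh, and the function names are invented; it assumes the kernel's md/array_state sysfs attribute, which reports "inactive" or "clear" until the array is started):

```shell
#!/bin/sh
# Hypothetical sketch (not the attached md-stat.sh): poll md/array_state in
# sysfs until the array reports a started state, so vol_id can read it.
md_state_ready() {
    # "clear" and "inactive" mean assembled but not started; anything else
    # (readonly, clean, active, ...) means the array can be read.
    case "$1" in
        ''|clear|inactive) return 1 ;;
        *)                 return 0 ;;
    esac
}

wait_for_md_started() {
    statefile=$1      # e.g. /sys/block/md0/md/array_state
    tries=${2:-30}    # give up after this many one-second polls
    while [ "$tries" -gt 0 ]; do
        if md_state_ready "$(cat "$statefile" 2>/dev/null)"; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

A rule like IMPORT{program}="md-stat.sh $tempnode", placed before the vol_id import as described above, would then block until this wait succeeds.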
Created attachment 297396 [details]
Script that waits until MD-array is started (for UDEV)
Had the same problem ("invalid root filesystem") during boot with openSUSE 11.1 and an AMD Athlon 2100+. My Intel Core i7 920 and Pentium III computers have no such problem. However, I installed your patch on my AMD system and it worked! No hang at boot time! No [Ctrl + D] anymore! Thank you very much!!
I built 11.1 test kernels with the patch from comment #57. Those of you who can reproduce the bug, please try them _without_ the mkinitrd / udev workarounds. The kernels are here: http://labs.suse.cz/mmarek/bnc445490/

Hi Michal, could you please do the same with the patch from comment #18 of bug #509495? I'm fairly sure that will fix the problem. Thanks. I don't think that bug is world-readable, so I'll paste it in below for others to see.

From: NeilBrown <neilb@suse.de>
References: bnc#509495
Subject: Update size of md device as soon as it is successfully assemble.

It is important that we get the size of an md device 'right' before generating the KOBJ_CHANGE message. If we don't, then if a program runs as a result of that message and opens the md device before the mdadm which created the device closes it, the new program will see a size of zero, which will be confusing. So call revalidate_disk, which checks and updates the size.

Signed-off-by: NeilBrown <neilb@suse.de>
---
 drivers/md/md.c | 1 +
 1 file changed, 1 insertion(+)
--- linux-2.6.27-SLE11_BRANCH.orig/drivers/md/md.c
+++ linux-2.6.27-SLE11_BRANCH/drivers/md/md.c
@@ -3809,6 +3809,7 @@ static int do_md_run(mddev_t * mddev)
 	md_wakeup_thread(mddev->thread);
 	md_wakeup_thread(mddev->sync_thread); /* possibly kick off a reshape */
+	revalidate_disk(mddev->gendisk);
 	mddev->changed = 1;
 	md_new_event(mddev);
 	sysfs_notify(&mddev->kobj, NULL, "array_state");

OK, I finally built packages with Neil's patch: http://labs.suse.cz/mmarek/bnc445490/neil/ The top of the rpm changelog is:

* Tue Jul 07 2009 mmarek@suse.cz
- patches.fixes/md-update-size: Update size of md device as soon as it is successfully assemble. (bnc#509495).

Please install these packages and revert all the previous workarounds; hopefully this is the real fix.

Just to let you know: I undid the workaround from Comment #19, installed the kernel packages for my machine from Comment #66, and it rebooted with no problems. Looks like this finally fixes this bug. Thanks!!!
Will this be rolled into a kernel update via the update repo?

(In reply to comment #67)
> Looks like this finally fixes this bug. Thanks!!! Will this be rolled into a kernel update via the update repo?

Thanks for testing. Yes, I have checked the fix in, so it should be in any future update. I don't know when that is scheduled.

This bug appears to still be present. I did a NET install 10/27/2009 and upon boot it failed exactly as described above. At first I thought the install was bad, but with a little luck I stumbled onto getting it to continue booting. Following the suggestion in post #16 fixed the problem.

Kernel: 2.6.27.2-9-pae
/md0 - SUSE 10.3
/md1 - swap
/md2 - SUSE 11.1

Maybe 11.1 on the 3rd RAID 1 partition is the cause(?)

Were you installing 11.1? If so, you will need to update to the latest kernel. Right, the fix went into the 2.6.27.29-0.1 update kernel; watch for the following in the rpm changelog:

* Thu Jul 09 2009 nfbrown@suse.de
- patches.fixes/md-update-size: Update size of md device as soon as it is successfully assemble. (bnc#509495).

Yes, I was installing 11.1 (and from the network). When I rebooted I encountered the problem. I assumed that with the network install the kernel would be recent. I ran the update and it went from 2.6.27.7-9 to 2.6.27.37-0.1, and it now boots OK. So it looks like the bug will be with us until the kernel in the initial install is more recent than 2.6.27.29-0.1.

*** Bug 490008 has been marked as a duplicate of this bug. ***

*** Bug 484897 has been marked as a duplicate of this bug. ***

Guys, I am new to SUSE Linux. I don't want to seem like a bonehead here, but I have been fighting this issue for days now. I have a brand new all-Intel server with RAID 10, 4 x 2 TB hard drives, and it runs great until I install updates and then reboot; now it exits to the $ prompt, and I'm lost as to what to do next to fix it. I see the solution above but am not sure how to implement it.
By the way, the OS is SUSE 11 SP1.

Do you mean "openSUSE 11.1" or "SLES11 SP1" or "SLED11 SP1"? "SUSE 11 sp1" isn't a valid name. What kernel are you running? Type uname -a at the "$" prompt and report the output.

SLES-11 SP1. It says: Linux (none) 2.6.27.54-0.2-default #1 SMP 2010-10-19 18:40:07 +0200 x86_64 x86_64 x86_64 GNU/Linux
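For anyone landing on this bug later, a small sketch of checking whether a running 11.1 kernel is at least the 2.6.27.29-0.1 update that first carried the fix, per the comments above. This is illustrative only: the function name is invented, and it assumes GNU coreutils' sort -V for version ordering.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given kernel version string is at
# least 2.6.27.29-0.1, the 11.1 update kernel with the md fix (see above).
# Relies on GNU "sort -V"; flavor suffixes like "-default" should be
# stripped first, e.g.: uname -r | sed 's/-[a-z]*$//'
kernel_has_md_fix() {
    running=$1
    fixed=2.6.27.29-0.1
    # The version that sorts first is the older one.
    oldest=$(printf '%s\n%s\n' "$running" "$fixed" | sort -V | head -n1)
    [ "$oldest" = "$fixed" ]
}
```

By this check, the 2.6.27.54-0.2 kernel reported in the final comment would already contain the fix, so a RAID box stopping at the $ prompt on that kernel likely has a different cause.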