Bug 665720 - services randomly fail to start when parallel boot is enabled
Status: RESOLVED WONTFIX
Duplicates: 680297
Alias: None
Product: openSUSE 11.4
Classification: openSUSE
Component: Kernel
Version: Factory
Hardware: x86 Other
Priority: P3 - Medium, Severity: Major, with 7 votes
Target Milestone: ---
Assignee: E-mail List
QA Contact: E-mail List
 
Reported: 2011-01-20 02:40 UTC by Felix Miata
Modified: 2012-08-02 16:00 UTC
CC List: 10 users


Attachments
/var/log/boot.msg from hung init (66.68 KB, text/plain), 2011-03-17 03:45 UTC, Felix Miata
boot.msg from stalled boot before disabling dbus service (193 bytes, text/plain), 2011-03-29 12:57 UTC, Felix Miata
boot.msg from stalled boot before disabling dbus service (65.29 KB, text/plain), 2011-03-29 15:55 UTC, Felix Miata
boot.omsg created by (hung) last boot described in comment 32 (65.97 KB, text/plain), 2011-04-04 12:09 UTC, Felix Miata
output from hwinfo on host big31 (540.66 KB, text/plain), 2011-04-04 12:12 UTC, Felix Miata
output from chkconfig --list on host big31 (4.88 KB, text/plain), 2011-04-04 12:14 UTC, Felix Miata
shutdown tail of boot.omsg (5.34 KB, text/plain), 2011-04-04 18:14 UTC, Felix Miata
services list on 11.4 on big31's non-RAID HD (4.31 KB, text/plain), 2011-04-18 22:30 UTC, Felix Miata
loaded modules list on 11.4 on big31's non-RAID HD (5.87 KB, text/plain), 2011-04-18 22:30 UTC, Felix Miata
services list on 11.4 on big31 on RAID1 (5.02 KB, text/plain), 2011-04-18 22:31 UTC, Felix Miata
loaded modules list on 11.4 on big31 on RAID1 (6.14 KB, text/plain), 2011-04-18 22:31 UTC, Felix Miata
boot.omsg from boot subsequent to last halted boot (56.97 KB, text/plain), 2011-04-18 22:31 UTC, Felix Miata

Description Felix Miata 2011-01-20 02:40:33 UTC
http://lists.opensuse.org/opensuse-factory/2011-01/msg00022.html is where I first posted this to mailing list two weeks ago, but that thread has generated no responses from anyone but me. At least two systems exhibiting this had fresh installs considerably before this started, updated via zypper dup numerous times either from a milestone iso or from online Factory repos. That which I know for sure is being ignored as of kernel-2.6.37-18-desktop:

     KBD_DELAY="250"
     KBD_RATE="20"
     KBD_NUMLOCK="yes"

Delay doesn't seem too far off, but rate is considerably higher than 20, and NUM, which is on on POST via BIOS, stays off in all vts after the kernel has turned it off.
Comment 3 Felix Miata 2011-01-23 03:16:42 UTC
This problem evaporates if I run kernel-vanilla-2.6.37-18.1
Comment 4 Felix Miata 2011-01-23 04:48:17 UTC
I noticed also with desktop kernel that rpcbind does not start during init either, but I'm not sure whether rpcbind fails every boot or just sometimes, and I suppose that's likely to be a different bug.
Comment 5 Felix Miata 2011-01-27 19:37:04 UTC
dup after M6 availability announcement made this go away on first host tested (gx150).
Comment 6 Felix Miata 2011-01-29 07:24:21 UTC
After M6+ dup on host kt880 this problem remains using 2.6.37-rc7-desktop, but is gone using 2.6.37-20-vanilla.
Comment 7 Jeff Mahoney 2011-02-07 18:45:11 UTC
I'm not able to reproduce this with 2.6.37-20-desktop.

What is your $KBD_TTY set to?
Comment 8 Felix Miata 2011-02-08 00:57:16 UTC
Note: all comments below result from booting runlevel 3 and staying there.

AMD/rv200 host m7ncd: 'echo $KBD_TTY' from root login on tty2 produces blank line in response. /etc/sysconfig/keyboard contains 'KBD_TTY="tty1 tty2 tty3 tty4 tty5 tty6"'. Last 4 boots, all after fresh dup, this host didn't exhibit this problem. All with desktop kernel.

Intel/mga host gx150 same as above for 2 boots between last dup on 27 Jan fresh dup after comment 7, but first boot after the dup, /etc/sysconfig/keyboard was ignored. Next boot to vanilla 38rc it was obeyed. Next boot to 37-20-desktop it was again obeyed, but next again ignored, and next again obeyed, before shutdown to....

AMD/mga host kt880: same as m7ncd with 37-rc7-desktop on only boot before fresh (locked kernel) dup. First boot after dup keyboard was obeyed, next boot ignored. Next I in'd 37-20-desktop and booted again, and keyboard was ignored. Next boot to 37-20-desktop.

Intel/rv380 host big31: Some differences from m7ncd. Using -default here instead of -desktop kernel. /etc/sysconfig/keyboard has 'KBD_TTY=""' (as in original RPM version of /etc/sysconfig/keyboard {as keyboa01rd} timestamped Jan 30 00:58), while echo $KBD_TTY returns a blank line. Keyboard file was obeyed with initial 2.6.37-20-desktop BOTD, and first boot after fresh dup. This system is not rebooting via the reboot command, returning 'pidofproc: pidofproc: cannot stat /sbin/splash: No such file or directory.' I did 'touch /sbin/splash' without making it executable, and that prevents reboots without producing a cannot stat message; chmod 755 on the file didn't help. I have to use the reset button every time. All boots this system today have obeyed /etc/sysconfig/keyboard.

BTW, I usually avoid tty1, logging on tty2 and/or tty3 and/or others and exhausting them before logging in on tty1.
BTW2, whenever keyboard is ignored, bug 400552 is avoided.
Comment 9 Jeff Mahoney 2011-02-10 00:12:30 UTC
echo $KBD_TTY should always return a blank line. /etc/sysconfig/keyboard isn't sourced into the user's shell environment - it's only sourced into /etc/init.d/kbd.

KBD_TTY directly affects which ttys have kbdrate applied to them, so if it's empty, kbdrate isn't running on any of them. I'm not sure what you mean when describing the file on big31.

When you run kbdrate -r 20 -d 250 on a system that it isn't working on, does it fix it? If not, can you provide the output of strace <kbdrate command>?

Likewise with setleds +num.
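
The relationship described in this comment can be sketched as a dry run. This is a minimal sketch, not the actual /etc/init.d/kbd: the function name kbd_commands is hypothetical, and it assumes that kbdrate and setleds act on the tty supplied as standard input. It only prints the commands that the settings would imply, so an empty KBD_TTY prints nothing, which matches the behaviour described above.

```shell
# Hypothetical dry-run helper (not the real /etc/init.d/kbd).
# Prints one kbdrate line, and optionally one setleds line, per tty in KBD_TTY.
kbd_commands() {
    conf="$1"    # path to a sysconfig-style file, e.g. /etc/sysconfig/keyboard
    (
        # Start from empty values so settings missing from the file stay empty.
        KBD_TTY="" KBD_RATE="" KBD_DELAY="" KBD_NUMLOCK=""
        [ -r "$conf" ] && . "$conf"
        for tty in $KBD_TTY; do
            echo "kbdrate -r $KBD_RATE -d $KBD_DELAY < /dev/$tty"
            if [ "$KBD_NUMLOCK" = "yes" ]; then
                echo "setleds -D +num < /dev/$tty"
            fi
        done
    )
}
```

Running `kbd_commands /etc/sysconfig/keyboard` on an affected host shows at a glance whether the file would result in any kbdrate call at all. The sourcing happens in a subshell so the variables do not leak into the login shell.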
Comment 10 Felix Miata 2011-02-15 05:39:23 UTC
RC1 seems to have made this problem disappear on other hosts. Big31 is down for several more days waiting on a part that may have been responsible for a probable digression in comment 8. IIRC, big31 simply wouldn't exhibit this problem.
Comment 11 Felix Miata 2011-02-20 04:24:17 UTC
(In reply to comment #9)
> When you run kbdrate -r 20 -d 250 on a system that it isn't working on, does it
> fix it? If not, can you provide the output of strace <kbdrate command>?
 
> Likewise with setleds +num.

I finally remembered to try this on a M6 system (kt880) before duping to RC1. kbdrate & setleds from root prompt do what they're supposed to do. I did double check, and the problem is only exhibited on alternate boots, except if because on current boot it was ignored I've run kbdrate & setleds manually, in which case on subsequent boot it was ignored again, breaking the alternate boot pattern.

All the above only applied if restart was via command 'init 6' (>6 successive boots). Then I switched to restart via command 'reboot' for the next several boots, in which keyboard file was always ignored, and after switching back to 'init 6' for three more boots; then since (>8 successive boots), the alternate boot pattern has been back.

In all of this testing specifically for this bug I've only been booting into runlevel 3, via 3 on Grub cmdline, in part because of common loss of sync between NUM state and NUM led when switching back & forth between X & consoles.

Then on an init 6 restart I removed the 3 from Grub's cmdline. KDM came up, I switched to tty2, found NUM lit with NUM state on and kbd repeat at 20, logged in root, ran an rpm query, logged out, went back to KDM, and selected restart. On restart I again removed 3 from Grub cmdline before proceeding, but this time I was left on tty1, with NUM led off to see "...runlevel 5 has been reached", and no access to X's tty7. I logged in root on tty2, then 'init 3; init 5' to reach KDM, switched back to tty2 to find NUM state off but NUM led on, and kbd rate maximized. I exited login, switched to KDM, selected reboot, removed 3 from Grub cmdline again, and found myself left on tty1 in runlevel 5 same as last restart. I repeated this 5/KDM restart process once more with same result.

Tired of this, I changed repos to RC1 iso only, zypper ref; zypper al kernel-desktop; zypper dup, and repeated booting via command 'reboot'. /etc/sysconfig/keyboard was obeyed on 3-4 successive runlevel 3 boots, then not on the next. So, I 'zypper rl kernel-desktop', switched repo from RC1 iso to factory's OSS & non-OSS, 'zypper ref', 'zypper in kernel-desktop', and rebooted via command 'reboot'. That resulted in no boot, just GRUB. 

I fixed Grub using Knoppix CD, then booted RC1, and rebooted 4 times using 'init 6'. /etc/sysconfig/keyboard's NUM, delay and rate settings were disobeyed every time. Then I restarted using 'reboot', and they were obeyed. I did it again, and they were not, and again twice more, both not. Then I tried 'init 6' again, and they were obeyed, and again thrice, but next two not after turning NUM on manually each time, and next not after turn NUM on manually, then turning it back off before init 6. Next I left NUM off, but after init 6 again disobeyed 7 cycles before being obeyed again.

Next disobeyed runlevel 3 boot, after login I did 'init 1', typed root password, 'init 3', and enjoyed normal runlevel 3 function. Next I booted via 'reboot' directly to runlevel 1, typed root password, 'init 3', and happiness again, at least 7 straight cycles.

From where I sit, this is seriously weird, and very annoying, since network does not come up automatically on every normal (3) boot when keyboard settings are ignored.
Comment 12 Felix Miata 2011-02-21 00:11:06 UTC
I tried on another K7 host (kt400) already on RC1 with 2.6.37-22-desktop. 6 straight reboots via 'reboot', and always keyboard file obeyed. On reboot 7, it wasn't.
Comment 13 Felix Miata 2011-02-26 03:59:25 UTC
On P4 host gx260 @RC1 booted to dup to RC2 after first locking installed kernel, initial boot and reboot via 'reboot' after dup completed, keyboard was obeyed. Then I unlocked installed kernel and ran dup again to get RC2 kernel and kmp. First boot on new kernel (via 'reboot') keyboard was not obeyed, but on next it was.
Comment 14 Felix Miata 2011-03-17 03:45:43 UTC
Created attachment 419846 [details]
/var/log/boot.msg from hung init

On host big31 I installed kernel-default-2.6.37.4-1 to see if it would help this, but apparently not through the first couple of boots. After one that seemed OK, attempting to mount noauto nfs mounts caused the session to hang, and subsequent attempt to reboot to hang as well. Later I had one that never did complete after announcing switch to runlevel 3, and this attachment is from that one.

I've now changed RUN_PARALLEL in /etc/sysconfig/boot to no, which so far seems to be a workaround.
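
The workaround amounts to a one-line change in /etc/sysconfig/boot. A minimal sketch of that edit follows; the helper name set_run_parallel is hypothetical, and it assumes the file already contains a line beginning with RUN_PARALLEL=. On a real system it needs root, and editing the file by hand works just as well.

```shell
# Hypothetical helper: rewrite the RUN_PARALLEL line of a sysconfig-style file.
# Assumes the file already contains a line starting with RUN_PARALLEL=.
set_run_parallel() {
    file="$1"     # e.g. /etc/sysconfig/boot (needs root on a real system)
    value="$2"    # "yes" or "no"
    tmp="$file.tmp.$$"
    # Write the edited copy first, then copy it back over the original so
    # the original file keeps its owner and mode.
    sed "s/^RUN_PARALLEL=.*/RUN_PARALLEL=\"$value\"/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rm -f "$tmp"
}
```

For example, `set_run_parallel /etc/sysconfig/boot no` disables parallel boot for the next start, and the same call with `yes` restores it for retesting.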
Comment 15 Felix Miata 2011-03-17 05:41:15 UTC
*** Bug 680297 has been marked as a duplicate of this bug. ***
Comment 16 Lars Müller 2011-03-17 08:48:50 UTC
What makes you think this is a dup as you wrote in comment 15?

BTW 8 weeks ago nobody talked about RUN_PARALLEL of /etc/sysconfig/boot.  A comment like #2 in bug#680297 sounds rude.  It's ok to close a bug as DUP but there is no need to pee at me. :/
Comment 17 Felix Miata 2011-03-17 13:00:18 UTC
(In reply to comment #16)
> What makes you think this is a dup as you wrote in comment 15?

You mean besides the fact that sometimes services start and sometimes not, or that the services that start or not aren't always the same, or that sometimes init never finishes?

> BTW 8 weeks ago nobody talked about RUN_PARALLEL of /etc/sysconfig/boot.

Which is an unfortunate result of the lack of attention to this bug or the Factory mailing list thread(s?) I started back then.

>  A comment like #2 in bug#680297 sounds rude.

That comment was an expression of frustration that so little attention was given to a bug that might have been fixed or a workaround provided in relnotes before 11.4 went GA. No such intent was meant, and I'm sorry you feel that way.
Comment 18 Stephan Barth 2011-03-28 09:52:56 UTC
I also ran into this bug. No progress yet?
Comment 19 Lars Müller 2011-03-28 13:07:12 UTC
@Felix: Please add pointers to the archive of the threads at http://lists.opensuse.org/ you talk about in comment #17  The input from them might help too.

For the one system I've seen this happening with I'm sure it uses a non default set of packages.  Might some magic package be missing?
Comment 20 Mark Goldstein 2011-03-28 13:27:26 UTC
There were a number of people reporting about similar problems on openSUSE list.
In my case, as I wrote in https://bugzilla.novell.com/show_bug.cgi?id=680297,
the boot process just locked up.
Right after installation it looked OK. It is an old desktop I'm using to test some stuff. I installed more or less default system with KDE4 and GNOME.
I know I can only work with proprietary NVidia drivers (none of the open source drivers could ever work with my screen correctly). In this case I tried the default first, saw that it can't detect anything and also the screen flickers as other people reported. So I installed nvidia driver (legacy, since this machine has an old video card) and started configuring the X using run-level 3. I also knew from my experience with 11.3 on the same machine, that I have to add nopat option to the kernel. Everything looked OK again, but the next day when I turned the machine on the problem started. Sometimes the boot process worked OK, but in most cases it either locked up, or ended up without networking and X.
So at least in my case it looked rather like some kind of racing between the different components. For that reason I tried disabling RUN_PARALLEL and it helped.
Comment 21 Felix Miata 2011-03-28 14:30:25 UTC
http://lists.opensuse.org/opensuse-testing/2011-01/msg00001.html seems to be my first reported encounter, though not recognized as involved with parallel boot at the time.

http://lists.opensuse.org/opensuse-factory/2011-01/msg00022.html also wasn't recognized as a parallel boot issue at the time, but IIRC it's the one that prompted me to file this bug.
Comment 22 Jeff Mahoney 2011-03-28 14:35:18 UTC
Parallel boot isn't a kernel issue. Re-assigning to sysvinit maintainer.
Comment 23 Dr. Werner Fink 2011-03-28 14:50:18 UTC
Guess: duplicate of bug #642289
before complaining, please test out
https://build.opensuse.org/package/show?package=sysvinit&project=home%3AWernerFink%3Abranches%3AopenSUSE%3A11.4%3AUpdate%3ATest

*** This bug has been marked as a duplicate of bug 642289 ***
Comment 24 Felix Miata 2011-03-28 14:52:51 UTC
(In reply to comment #22)
> Parallel boot isn't a kernel issue.

Per comment 1 vanilla kernel is alternate solution, so why couldn't it be a kernel issue?
Comment 25 Felix Miata 2011-03-29 12:57:19 UTC
Created attachment 421896 [details]
boot.msg from stalled boot before disabling dbus service

https://bugzilla.novell.com/show_bug.cgi?id=642289#add_comment
(In reply to comment #58)
> (In reply to comment #57)

> My guess is more dbus and ConsoleKit.  For a try please disable dbus
> for next boot
 
>         insserv -fr dbus

"Next" boot so as to attempt to do that completely hung at Starting D-Bus daemon. Any chance this is a kernel issue related to (enabled) HyperThreading CPU rather than a single core or multicore CPU? Most mounts md RAID?

> ... as this is not a solution we have to investigate this precisely.
 
> Beside this with a working blogd you may have a look into /var/log/boot.msg
> you can compare the time stamps in the <notice> entries for the
> dbus service.

Attached
Comment 26 Felix Miata 2011-03-29 13:12:06 UTC
After booting to 11.2 and chrooting to 11.4 to insserv -fr dbus, next 11.4 boot stalled just after starting runlevel 3 anyway:
INIT: Entering runlevel 3
Boot logging started...
Master Resource Control: previous runlevel: N, switching to runlevel: 3
Master Resource Control: Running /etc/init.d/before.local
Comment 27 Dr. Werner Fink 2011-03-29 15:39:20 UTC
(In reply to comment #25)

Felix: this attachment is not a boot.msg (IMHO):

cndrvcups-common-1.90-1.i386
cndrvcups-ufr2-uk-1.90-1.i386
cups-1.4.6-6.1.i586
cups-client-1.4.6-6.1.i586
cups-libs-1.4.6-6.1.i586
python-cups-1.9.52-4.1.i586
python-cupshelpers-1.2.5-4.1.i586

.. that's all and looks like a rpm list.
Comment 28 Felix Miata 2011-03-29 15:55:23 UTC
Created attachment 421946 [details]
boot.msg from stalled boot before disabling dbus service

Oops. LMB to select boot.msg from file list probably moved mouse pointer and I didn't notice.
Comment 29 Dr. Werner Fink 2011-03-29 17:16:47 UTC
AFAICS the blogd works flawlessly, that is, no doubled log lines anymore.
IMHO this is a problem with dbus or maybe with haldaemon.

On the other hand the following services are started in parallel

  acpid
  cpufreq
  dbus
  earlysyslog
  fbset
  ibm-prtm
  irqbindall
  microcode.ctl
  random
  rtcheck
  set_kthread_prio
  slert

one of them does fool the system (IMHO).
Comment 30 Dr. Werner Fink 2011-04-04 07:37:24 UTC
>> You have to specify a comment when changing the status of a bug from RESOLVED 
>> to REOPENED.
Comment 31 Dr. Werner Fink 2011-04-04 07:40:30 UTC
Felix? Now what is going on? As blogd works now I'd like to ask
what has caused the remaining problem on our system.
Comment 32 Felix Miata 2011-04-04 09:41:39 UTC
(In reply to comment #31)
> Felix? Now what is going on? As blogd works now I'd like to ask
> what has caused the remaining problem on our system.

host big31 http://www.smolts.org/client/show_all/pub_f2e7a2ea-9a3d-4f4e-9ee5-4a2252755bac currently on kernel 2.6.37.4-1-default as indicated in comment 28 attachment, and unlike my other test systems runs on md RAID1, simply stops part way through init on a random basis, that is, several or more boots proceed normally, but eventually boot simply won't finish. Since comment 31 I booted it 6 straight times successfully with parallel boot disabled, then saw in the boot.msg between EOF and tail of comment 28 attachment 14 lines, 5 of which were blank, all about acpid and cpufreq. Then I reenabled parallel boot, and booted it 4 straight times normally & successfully. #5 was very slow after:
Master Resource Control: previous runlevel: N, switching to runlevel: ... 3
Master Resource Control: Running /etc/init.d/before.local ... done
acpid: starting up with proc fs
acpid: 2 rules loaded
acpid: waiting for events: event logging is off
Starting acpid ... done
Starting D-Bus daemon ... done
Loading CPUFreq modules - hardware support not available ... skipped
Checking/updating CPU microcode ... done
Starting syslog services ... done 

That's where it sat 10+ minutes after trying to boot. Then it sat several more minutes, as CAD failed to actually cause a reboot after "INIT: Switching to runlevel: 6; INIT: Sending processes the TERM signal; INIT: Sending processes the KILL signal". I hit the reset button, then chose 2.6.37-20-vanilla, which booted normally. I then replaced 2.6.37-20-vanilla with 2.6.37.6-4-vanilla and booted normally 4 straight times, which seemed to me to indicate this was indeed an openSUSE kernel bug, until boot #5 hung even earlier than with desktop kernel, at "Starting D-Bus daemon ... done", and CAD failed to reboot again. I did notice that shutdown/reboot from boot #4 had a noticeable pause.

11.2 on same system always boots completely.

# zypper wp blogd
...
No providers of 'blogd' found.
#

??? Where does blogd come from???
Comment 33 Dr. Werner Fink 2011-04-04 11:27:37 UTC
You may try as root

 type -p blogd

which should result in /sbin/blogd and with

 rpm -qf /sbin/blogd

you should see at least sysvinit-tools.

You may also debug your system by disabling script by script to
identify the cause of this random hang (compare with comment #29)
Beside this you may use /var/log/boot.omsg after a reboot to
see the former boot/shutdown messages and /var/log/boot.msg to
investigate the current boot messages.

Clearly you should use the blogd from the attachments of the
bug #642289
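
A quick way to see where a previous boot stopped is to pull the last service start message out of the saved log. The following is only a sketch: the helper name last_started is hypothetical, and it assumes the log contains "Starting ..." lines like those quoted in comment 32. With parallel boot the last line printed is not necessarily the script that hung, so treat the result as a hint.

```shell
# Hypothetical helper: show the last "Starting ..." line of a boot log,
# which hints at where a stalled boot stopped making progress.
last_started() {
    log="${1:-/var/log/boot.omsg}"    # defaults to the previous boot's log
    grep 'Starting ' "$log" | tail -n 1
}
```

Calling `last_started` after a reset inspects the hung boot, while `last_started /var/log/boot.msg` inspects the current one.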
Comment 34 Dr. Werner Fink 2011-04-04 11:38:46 UTC
Besides this, you may also provide some more information
about the hardware, setup, and configuration of the
affected system.
Comment 35 Felix Miata 2011-04-04 12:09:45 UTC
Created attachment 422960 [details]
boot.omsg created by (hung) last boot described in comment 32

45536 byte /sbin/blogd from bug 642289 was used here rather than original of 34780 bytes, as well as parallel boot enabled.
Comment 36 Felix Miata 2011-04-04 12:12:21 UTC
Created attachment 422961 [details]
output from hwinfo on host big31
Comment 37 Felix Miata 2011-04-04 12:14:15 UTC
Created attachment 422963 [details]
output from chkconfig --list on host big31
Comment 38 Felix Miata 2011-04-04 12:28:19 UTC
(In reply to comment #33)
> You may try as root

>  type -p blogd

OK
 
> which should result in /sbin/blogd and with

>  rpm -qf /sbin/blogd
 
> you should see at least sysvinit-tools.

OK
 
> You may also debug your system by disabling script by script to
> identify the cause of this random hang (compare with comment #29)

Can you look at comment 37 attachment and suggest which to try or which order to try?

> Beside this you may use /var/log/boot.omsg after a reboot to
> see the former boot/shutdown messages and /var/log/boot.msg to
> investigate the current boot messages.

Comment 35 attachment provided this, which seems to add little to comment 32.

> Clearly you should use the blogd from the attachments of the
> bug #642289

Are you sure? I am already using that blogd now, but zypper lu shows sysvinit & sysvinit-tools installed are 2.88-37.40.1 while 2.88-37.43.1 are available? Should I install those plus available aaa_base, aaa_base-extras and other available updates before proceeding further?

If my comment 25 question was answered, I'm missing it.
Comment 39 Dr. Werner Fink 2011-04-04 12:56:18 UTC
I'm absolutely sure that the blogd should be the one from bug #642289,
otherwise you will be fooled by a crashing blogd; compare with
http://bugzilla.novell.com/show_bug.cgi?id=642289#c60

As first few steps I would disable fbset and then disable all
modules for KMS/DRM/GPU support, next irqbindall, set_kthread_prio,
...
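
Before disabling scripts one by one, it may help to check which of the names from comment 29 actually exist as init scripts on the affected host. The sketch below is an illustration only: the helper name candidate_scripts is hypothetical, and it merely prints what it finds together with the insserv call already used earlier in this report. It does not change anything.

```shell
# Hypothetical helper: report which candidate names exist as init scripts.
# dir is the init script directory (/etc/init.d on the systems discussed here).
candidate_scripts() {
    dir="$1"; shift
    for name in "$@"; do
        if [ -f "$dir/$name" ]; then
            echo "present: $name (could be disabled with: insserv -fr $name)"
        else
            echo "absent:  $name"
        fi
    done
}
```

For example, `candidate_scripts /etc/init.d fbset irqbindall set_kthread_prio` lists which of the three are installed, so only those need to be tried.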
Comment 40 Felix Miata 2011-04-04 18:14:52 UTC
Created attachment 423036 [details]
shutdown tail of boot.omsg

I'm not fully up to speed on disabling things not listed in chkconfig output.

I took CPUFREQ=off off cmdline and executed 'insserv -fr fbset' and got 4 good boots with 4th a delayed shutdown, 5th delayed by fsck on 2 ext2 partitions not cleanly umounted, and recovering journal on 4 md devices, but otherwise normal, with 12 unhalted boots before executing 'insserv -d fbset' & booting again, 8 straight times complete before halting starting D-Bus on #9. Next I went back to 'insserv -fr fbset' with CPUFREQ=off restored to cmdline. 8 good boots, then on #9 a long pause at acpid and then halt at starting syslog services. Maybe while I try nomodeset on cmdline you can answer about DRM/GPU control, irqbindall, set_kthread_prio.
Comment 41 Felix Miata 2011-04-04 18:22:35 UTC
Hung at Starting acpid on very first nomodeset boot.
Comment 42 Felix Miata 2011-04-04 21:48:24 UTC
2.6.38.2-3-default hangs starting either acpid or D-Bus about as often as booting successfully to completion.
Comment 43 Dr. Werner Fink 2011-04-06 08:15:22 UTC
Hmmm ... the mainboard from BIOSTAR could have a problem
in Advanced Configuration and Power Interface (ACPI) or
the Desktop Management Interface (DMI) ...

Question: Are there any updates for the BIOS of the mainboard
available?
Comment 44 Felix Miata 2011-04-06 08:40:55 UTC
(In reply to comment #43)
> Hmmm ... the mainboard from BIOSTAR could have a problem
> in Advanced Configuration and Power Interface (ACPI) or
> the Desktop Management Interface (DMI) ...

> Question: Are there any updates for the BIOS of the mainboard
> available?

No BIOS update is available that addresses anything resembling issues described in this bug. http://www.biostar.com.tw/app/en/mb/bios.php?S_ID=355

11.0 and 11.2 always boot right up, and IIRC, so does Knoppix from CD and DVD v 5.3.1, 6.0 & 6.2.

I've not yet seen explained how to disable DRM/GPU control, irqbindall, set_kthread_prio. I only know how to disable services that obviously exist in chkconfig --list output, plus nomodeset on cmdline. If there's some doc that explains, please name it.

I'm also waiting for an answer to whether there's any point to trying disabling HyperThreading in the BIOS.

Note too all my other systems for which blogd solved the parallel boot issues are both slower and older than big31, not running RAID, and not HyperThreading.
Comment 45 Dr. Werner Fink 2011-04-06 10:00:11 UTC
(In reply to comment #44)
> No BIOS update is available that addresses anything resembling issues
> described in this bug. http://www.biostar.com.tw/app/en/mb/bios.php?S_ID=355

The hwinfo you've attached shows:

  DMI: BIOSTAR Group G31-M7 TE/G31-M7 TE, BIOS 080014  05/19/2008

and maybe there have been some fixes in the meanwhile not mentioned.

> 11.0 and 11.2 always boot right up, and IIRC, so does Knoppix from CD and
> DVD v 5.3.1, 6.0 & 6.2.

The problem happens by using the kernel from 11.4 not from 11.0, nor 11.2,
nor from Knoppix. This could be a bug of the kernel as well as a bug
of the mainboard or a bug handling ACPI events of this mainboard or
all of them.

> I've not yet seen explained how to disable DRM/GPU control, irqbindall,
> set_kthread_prio. I only know how to disable services that obviously exist in
> chkconfig --list output, plus nomodeset on cmdline. If there's some doc that
> explains, please name it.

For drm use 

   lsmod | grep drm

and blacklist the module which uses drm_kms_helper in
/etc/modprobe.d/99-local.conf, run mkinitrd afterwards to make this also
available in the initrd.  Also disable KMS by setting NO_KMS_IN_INITRD
to "yes" in /etc/sysconfig/kernel
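
The lookup step can be sketched as a small filter. This assumes the usual lsmod layout, in which the fourth column of a module's row lists the modules using it; the helper name drm_blacklist_lines is hypothetical. It only prints candidate lines, which should be reviewed before being appended to /etc/modprobe.d/99-local.conf.

```shell
# Hypothetical helper: read lsmod-style output on stdin and print a
# "blacklist" line for each module listed as a user of drm_kms_helper.
drm_blacklist_lines() {
    awk '$1 == "drm_kms_helper" && NF >= 4 {
             n = split($4, users, ",")
             for (i = 1; i <= n; i++) print "blacklist " users[i]
         }'
}
```

Typical use would be `lsmod | drm_blacklist_lines`. If nothing is printed, either drm_kms_helper is not loaded or no module is using it.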

> I'm also waiting for an answer to whether there's any point to trying
> disabling HyperThreading in the BIOS.

This is openSUSE not SLES

> Note too all my other systems for which blogd solved the parallel boot issues
> are both slower and older than big31, not running RAID, and not
> HyperThreading.

/usr/src/linux/Documentation/kernel-parameters.txt parameter acpi= ...

  [...]

        acpi=           [HW,ACPI,X86]
                        Advanced Configuration and Power Interface
                        Format: { force | off | strict | noirq | rsdt }
                        force -- enable ACPI if default was off
                        off -- disable ACPI if default was on
                        noirq -- do not use ACPI for IRQ routing
                        strict -- Be less tolerant of platforms that are not
                                strictly ACPI specification compliant.
                        rsdt -- prefer RSDT over (default) XSDT
                        copy_dsdt -- copy DSDT to memory

                        See also Documentation/power/pm.txt, pci=noacpi
  [...]

Besides this, HyperThreading seems to be disabled according to the hwinfo from
attachment #422961 [details]

  [...]
  CPU0: Hyper-Threading is disabled
  [...]
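
When testing the acpi= values quoted above, it is easy to lose track of which value a given boot actually used. A minimal sketch for reading it back follows; the helper name cmdline_param is hypothetical, and it assumes the parameter appears as a space-separated name=value word in /proc/cmdline.

```shell
# Hypothetical helper: print the value of a name=value kernel parameter.
# Reads /proc/cmdline by default; prints nothing if the parameter is absent.
cmdline_param() {
    name="$1"
    file="${2:-/proc/cmdline}"
    # One word per line, keep the value of the last matching word.
    tr ' ' '\n' < "$file" | sed -n "s/^$name=//p" | tail -n 1
}
```

For example, `cmdline_param acpi` prints `noirq` after booting with acpi=noirq and prints nothing when the parameter was not given, which makes it convenient to note next to each boot result.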
Comment 46 Dr. Werner Fink 2011-04-07 12:39:13 UTC
Felix? do you have any news?  That is do you have done some tests with
the acpi= kernel parameter?
Comment 47 Felix Miata 2011-04-07 13:16:45 UTC
(In reply to comment #46)
> Felix? do you have any news?  That is do you have done some tests with
> the acpi= kernel parameter?

1-Over half of my machines share physical space, that is, most I use to test with have to be physically moved from storage to a single spot where connection to power, keyboard, trackball & display is possible. Big31 lost its right to that space due to the time lag between the comment 40, 41, 42 group and comment 43 so that I could follow up in timely fashion on bugs not exhibited on big31.

2-Before big31 can get its space back I have to finish with the machine now in shared space, and, must find out how to answer all your questions. I still don't know about how to disable every one of the items listed in comment 29, some of which I asked about again in later comment(s), such as irqbindall & set_kthread_prio.

3-I have my doubts about a BIOS update released more than 4 years after manufacture fixing anything to do with ACPI, particularly since older kernels don't have apparent ACPI trouble.

So, no, not yet.
Comment 48 Dr. Werner Fink 2011-04-12 09:44:31 UTC
Even if triggered by RUN_PARALLEL="yes" this is IMHO a kernel
related problem.  I also have reports from users by PM that there
is a network problem with a breakdown of the network throughput
of the kernel used with 11.4.
Comment 49 Ivan Ganev 2011-04-14 06:43:27 UTC
I have the same problem, after adding some services to runlevel 5 it began NOT to execute (if parallel booting is enabled). The services I have in runlevel 5 are:
acpid
alsasound
apache2
auditd
avahi-daemon
bluez-coldplug
cpufreq
cron
cups
dbus
earlysyslogd
earlyxdm
fbset
kbd
mysql
network
network-remotefs
nfs-server
nscd
postfix
random
rpcbind
smartd
smb
splash
splash_early
sshd
stoppreload
syslog
xdm

Some of these are added to other runlevels as well, but that's by default I guess. After turning parallel booting off my system began to execute runlevel 5 again and now it works well.
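
A list like the one above can be produced mechanically, which makes it easier to compare two hosts or a before and after state. The sketch below assumes chkconfig --list prints one service per line followed by fields such as 5:on; that layout and the helper name runlevel_on are assumptions here, not something taken from the attachments.

```shell
# Hypothetical helper: read "chkconfig --list"-style lines on stdin and
# print the names of services switched on in the given runlevel.
runlevel_on() {
    level="$1"
    awk -v want="$level:on" '{
        for (i = 2; i <= NF; i++) if ($i == want) print $1
    }'
}
```

Typical use would be `chkconfig --list | runlevel_on 5`, optionally redirected to a file so that two captures can be compared with diff.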
Comment 50 Dr. Werner Fink 2011-04-14 07:59:57 UTC
(In reply to comment #49)

Which services do you have added to runlevel 5?

Do you have some special settings like an elevator= entry in the kernel's
command line? -> /usr/src/linux/Documentation/kernel-parameters.txt

Please try to add nopreload to the kernel's command line and retry
parallel boot.

Beside this please consider to attach also your hwinfo of your system.
Comment 51 Felix Miata 2011-04-17 12:45:55 UTC
After recent zypper dup, the blogd fix referred to in comment 39 was replaced with an April 6 blogd 38904 bytes that brought back the boot hangs, even with the Tumbleweed 2.6.38.2-19-default kernel.

I flashed the BIOS. Date is now 09/22/2009. That resulted in a lot of BIOS options that weren't there previously. Among them were ACPI choices. I selected v3.0. Now I have a network connection problem to solve. ifconfig -a in 11.0 & 11.4 report the PCI e100 configured as expected as eth0, but it acts as though the cable is disconnected, which is exactly what Knoppix 6.4/2.6.37 reports. Pinging my router results in "network unreachable". Reflashing and selecting ACPI v1.0 hasn't helped. I'll have to come back to this when I have more time.
Comment 52 Felix Miata 2011-04-18 22:30:44 UTC
Created attachment 425510 [details]
services list on 11.4 on big31's non-RAID HD
Comment 53 Felix Miata 2011-04-18 22:30:58 UTC
Created attachment 425511 [details]
loaded modules list on 11.4 on big31's non-RAID HD

Intel/rv380 host big31:
Removing the e100 NIC and using the onboard r8101e fixed networking. The e100 AFAIK had been working fine the day before my previous comment.

While big31 was down for HD replacement in February, I disconnected the other RAID HD, installed a spare HD, and put whatever milestone or RC was available at the time on it. I made a few tests with it, then put it aside to work on other hosts.

After comment 51 I started with some experiments with using the temporary HD on eSATA as sdc. During that process I brought its 11.4 current, and installed the then-latest desktop kernel from Tumbleweed. I left parallel boot turned on, and the rpm'd blogd dated April 6th in place. In all I booted probably at least 10 times without encountering this problem before updating to a newer Tumbleweed default kernel. So to recap, on it I used 2.6.37.2-3-desktop, 2.6.38.2-19-desktop & finally 2.6.38.3-20-default probably at _least_ a baker's dozen times in all without any apparent boot delays or halts.
Comment 54 Felix Miata 2011-04-18 22:31:09 UTC
Created attachment 425512 [details]
services list on 11.4 on big31 on RAID1
Comment 55 Felix Miata 2011-04-18 22:31:22 UTC
Created attachment 425513 [details]
loaded modules list on 11.4 on big31 on RAID1

First boot back to 11.4 on RAID after disconnecting sdc was normal. I then restored the April 6 blogd file, restored BOOT_PARALLEL="yes", and installed the latest Tumbleweed 2.6.38.3-20-desktop kernel. I booted 2.6.38.3-20 7 times successfully before a subsequent boot stopped completely at "blogd: system console stolen at line 266!; blogd: Can not read from fd 0: Input/output error; Starting D-Bus daemon" for at least 10 or 15 minutes. I then hit CAD to get it moving again, but had to hit the reset button to get past "INIT: Sending processes the KILL signal".
Comment 56 Felix Miata 2011-04-18 22:31:37 UTC
Created attachment 425514 [details]
boot.omsg from boot subsequent to last halted boot

I now find the likelihood that this is related to ACPI rather slim, unless it has to do with services enabled or kernel modules required when running RAID that aren't required or used when not running RAID. I'm not sure whether all I've provided completely answers the need for info, but am checking it off for now pending comments from the info requester about my comments since his request.
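For anyone wanting to compare the RAID and non-RAID module lists directly, a minimal sketch follows. It assumes two saved `lsmod` outputs as text; the function names and the sample listings are made up for illustration and are not taken from the attachments.

```python
def module_names(lsmod_text):
    """Return the set of module names (first column) from `lsmod` output."""
    names = set()
    for line in lsmod_text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module  Size  Used by" header row.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def only_in_first(first, second):
    """Sorted list of modules loaded in `first` but not in `second`."""
    return sorted(module_names(first) - module_names(second))


# Hypothetical sample listings, NOT the contents of the attached files.
raid_boot = """Module                  Size  Used by
raid1                  30000  2
md_mod                120000  1 raid1
e100                   40000  0
"""
plain_boot = """Module                  Size  Used by
e100                   40000  0
"""

print(only_in_first(raid_boot, plain_boot))
```

The same set difference could be applied to the first column of the two saved services lists to see which services are enabled only on the RAID install.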
Comment 57 Felix Miata 2011-04-19 06:11:12 UTC
Since making comment 56, I created another partition on big31's sdc and installed Factory/2.6.38-4-default via HTTP. I specified one of the md partitions to mount on boot, but otherwise only partitions on sdc are mounted automatically. lsmod shows 50 lines of loaded modules; grepping for "raid" matches only one, instead of the several on md2's 11.4. I've booted it at least 14 times without any apparent slowdowns, pauses or halting.
Comment 58 Dieter Jurzitza 2011-04-19 18:08:34 UTC
Hi folks,
just out of curiosity: this sounds somewhat like another incarnation of #335267 - the latter one having a beard so long that Methuselah would be really jealous of it :-(.

The fact that deactivating RUN_PARALLEL helps points in this direction, too (for me, always for me ...). Only my two cents here - and no one really provided a solution for #335267, though I have seen it since 10.2 and with 11.4, too (and I am not alone).

Maybe RUN_PARALLEL has conceptual deficiencies that haven't been considered so far - but sorry, I can only report and test; I can't offer real technical help.

Take care



Dieter Jurzitza
Comment 59 Dr. Werner Fink 2011-04-20 07:55:24 UTC
(In reply to comment #58)

Parallel boot of really independent services should work; otherwise
we see a) a missed dependency, or b) a broken calculation of the
dependencies (or a forgotten re-calculation), or c) a problem
due to the CPU and/or I/O load.  Besides, parallel boot will
be the future with systemd.

Point a) should be fixed in the boot/rc scripts.
Point b) seems not to be the problem as the algorithm used by
insserv and startpar is well known and well understood.
Point c) ... hmmm ...
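
To illustrate point a), here is a minimal sketch of how a dependency table can be turned into groups of services that may start in parallel. It is a generic level-by-level topological sort, not the actual insserv/startpar code, and the service names and dependency tables are hypothetical.

```python
def start_waves(requires):
    """Group services into waves. All services in one wave may start in
    parallel once every earlier wave has finished.

    `requires` maps a service name to the set of services it needs.
    """
    services = set(requires)
    for deps in requires.values():
        services |= set(deps)
    remaining = {s: set(requires.get(s, ())) for s in services}

    waves, done = [], set()
    while remaining:
        # A service is ready when all of its requirements have started.
        ready = sorted(s for s, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("dependency cycle among: "
                             + ", ".join(sorted(remaining)))
        waves.append(ready)
        done.update(ready)
        for s in ready:
            del remaining[s]
    return waves


# Hypothetical tables, not taken from any real init script headers.
declared = {"syslog": set(), "dbus": {"syslog"},
            "network": {"dbus"}, "xdm": {"dbus"}}
# Same table with the dbus -> syslog dependency forgotten (case a).
missing = dict(declared, dbus=set())

print(start_waves(declared))  # [['syslog'], ['dbus'], ['network', 'xdm']]
print(start_waves(missing))   # [['dbus', 'syslog'], ['network', 'xdm']]
```

With the dependency missing, dbus and syslog end up in the same wave, so which one is up first depends on timing. That would show up as a random failure only with parallel boot, since a serial boot always runs them in one fixed order.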
Comment 60 Mario Guzman 2011-05-04 22:48:11 UTC
Just wanted to add that I too have this bug, installed the latest yast updates as of today since 11.4 was unusable for me (no network, required NOMODESET, slow, etc.). BTW, if it helps I tested all pre-GA releases and RC1/RC2, all worked fine until GA. GA broke 11.4 for me.
Comment 61 Jeff Mahoney 2012-08-02 15:59:58 UTC
With the coming release of openSUSE 12.2, openSUSE kernel developers are focusing their efforts there. Reports against openSUSE 11.4 and prior will not get the attention needed to resolve them before openSUSE 12.2 is released and openSUSE 11.4 becomes unmaintained.

Please re-test with openSUSE 12.1 or openSUSE RC2+ and re-open with an updated Product if you still encounter your issue.

We apologize for this issue not getting the attention it deserves but we are focusing our resources in the area where they will have the most impact for our users.  We're working hard to make openSUSE 12.2 the best openSUSE release yet!