Bug 217563

Summary: HAL doesn't always start properly
Product: [openSUSE] openSUSE 10.2 Reporter: Magnus Boman <mboman>
Component: Basesystem Assignee: Danny Al-Gaaf <dalgaaf>
Status: VERIFIED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: andreas.hanke, fred.blaise, markus.kriewald, nix, wittemar
Version: Beta 1 plus   
Target Milestone: ---   
Hardware: Other   
OS: Other   
Whiteboard:
Found By: Other Services Priority:
Business Priority: Blocker: ---
Marketing QA Status: --- IT Deployment: ---
Attachments: boot.msg
messages
bootchart graph of a boot process where hald disappeared
bootchart graph of a boot process where hald survived
hal_output.txt

Description Magnus Boman 2006-11-02 21:35:58 UTC
Sometimes when booting up the machine, I've noticed that NetworkManager can't find any network cards. A restart normally solves this. I found out that HAL/DBUS doesn't always start properly.
I'm attaching boot.msg and messages in case that helps.

mblxws01:/home/mboman/tmp # ps auxw|grep -i hal
101       2983  0.0  0.0   2028   880 ?        S    15:37   0:00 hald-addon-keyboard: listening on /dev/input/event1
root      4038  0.0  0.0   2860   752 pts/0    R+   15:41   0:00 grep -i hal

mblxws01:/home/mboman/tmp # ps auxw|grep -i dbus
100       2505  0.0  0.0   3552  1008 ?        Ss   15:37   0:00 /usr/bin/dbus-daemon --system
mboman    3721  0.0  0.0   3772   836 ?        Ss   15:40   0:00 /usr/bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session
root      4064  0.0  0.0   2856   744 pts/0    R+   15:42   0:00 grep -i dbus
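
Note on the check above: `grep -i hal` also matches helpers such as hald-addon-keyboard and the grep process itself, so it is easy to misread. A minimal sketch of a stricter check, assuming the main daemon's command name is exactly `hald` (the function name is made up for illustration):

```shell
#!/bin/sh
# Succeeds only if the process list on stdin contains the main "hald"
# daemon itself, not just a helper such as hald-addon-keyboard.
# Assumption: the daemon's command name is exactly "hald".
hald_main_running() {
    grep -qx 'hald'
}

# Usage: feed it the bare command names of all processes.
if ps -e -o comm= | hald_main_running; then
    echo "hald is running"
else
    echo "hald is NOT running"
fi
```

With the process list from the description (only hald-addon-keyboard present), this would report that hald is not running.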
Comment 1 Magnus Boman 2006-11-02 21:36:18 UTC
Created attachment 103606 [details]
boot.msg
Comment 2 Magnus Boman 2006-11-02 21:36:57 UTC
Created attachment 103607 [details]
messages
Comment 4 Danny Al-Gaaf 2006-11-04 14:25:51 UTC
Please change this line in /etc/init.d/haldaemon:

   HALDAEMON_PARA="--daemon=yes --retain-privileges";
   
to:

   HALDAEMON_PARA="--daemon=yes --retain-privileges --verbose=yes --use-syslog";

and attach the part of /var/log/messages since boot if this happens again.
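
For anyone who prefers not to edit the file by hand, the change could be sketched roughly as follows. This is only an illustration: it assumes the HALDAEMON_PARA line looks exactly as quoted in this comment, and the helper name is invented. It prints the modified script instead of changing it in place.

```shell
#!/bin/sh
# Print the init script given as $1 with the two debug flags from
# this comment appended to the HALDAEMON_PARA line. The file itself is
# not modified; redirect the output to a new file to keep the result.
# Assumption: the line matches the form quoted in the bug report.
enable_hal_debug() {
    sed 's/^\([[:space:]]*HALDAEMON_PARA="--daemon=yes --retain-privileges\)"/\1 --verbose=yes --use-syslog"/' "$1"
}

# Usage: show only the changed line, if the script exists on this system.
if [ -r /etc/init.d/haldaemon ]; then
    enable_hal_debug /etc/init.d/haldaemon | grep 'HALDAEMON_PARA=' \
        || echo "no HALDAEMON_PARA line found"
fi
```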

Comment 7 Timo Hoenig 2006-11-05 18:38:53 UTC
I ran into this as well. It seems to be a race, as it cannot be reproduced reliably.  However, D-Bus was always working fine; only HAL did not run.

Adjusting summary.

-> Beta1 Plus
Comment 8 Timo Hoenig 2006-11-05 18:46:16 UTC
Just for the log:

The change as proposed by Danny (comment #4) makes it impossible to reproduce the problem at any time for me (HAL always gets started properly).
Comment 9 Andreas Hanke 2006-11-06 03:57:57 UTC
(In reply to comment #8)
> The change as proposed by Danny (comment #4) makes it impossible to reproduce
> the problem at any time for me (HAL always gets started properly).

This sounds very familiar, it's the same in bug 218184: hald doesn't start properly, but as soon as the debug parameters are added, it does.

Adding myself to CC (for a reason, please don't remove me again, thanks).
Comment 10 Timo Hoenig 2006-11-06 08:07:57 UTC
*** Bug 218184 has been marked as a duplicate of this bug. ***
Comment 11 Peter Nixon 2006-11-08 16:49:59 UTC
I have been seeing this bug also for approximately the last month. I run the latest Factory updated on a daily basis with smart. I see the problem on about 20% of boots, however it is MUCH more likely to happen if I have just done a "smart upgrade"
Comment 12 Andreas Hanke 2006-11-10 19:22:40 UTC
Created attachment 104740 [details]
bootchart graph of a boot process where hald disappeared
Comment 13 Andreas Hanke 2006-11-10 19:25:00 UTC
Created attachment 104741 [details]
bootchart graph of a boot process where hald survived
Comment 14 Timo Hoenig 2006-11-11 18:35:13 UTC
Andreas, thanks a lot for the graphs -- that's a great idea to narrow down the cause of this bug.

Did anyone run into this with Beta2?  So far, I did not run into this issue on my systems running Beta2.
Comment 15 Peter Nixon 2006-11-11 19:13:50 UTC
I am still seeing this problem with latest Factory (is it in sync with Beta2?).

# date
Sat Nov 11 21:05:42 EET 2006
# smart update;smart upgrade -y
Loading cache...
Updating cache...                                ################################################################### [100%]

Fetching information for 'SUSE Factory'...
-> ftp://mirrors.kernel.org/opensuse/distribution/SL-OSS-factory/inst-source/media.1/media
media                                            ################################################################### [ 100%]

Updating cache...                                ################################################################### [100%]

Channels have no new packages.
Saving cache...

Loading cache...
Updating cache...                                ################################################################### [100%]

Computing transaction...
No interesting upgrades available.

Comment 16 Timo Hoenig 2006-11-11 19:20:03 UTC
Peter, can you please try whether HAL survives if you delay the start?  You can test that by replacing

 startproc -p $HALDAEMON_PID $HALDAEMON_BIN $HALDAEMON_PARA

with

 sleep 5 && tartproc -p $HALDAEMON_PID $HALDAEMON_BIN $HALDAEMON_PARA

in '/etc/init.d/haldaemon'.

Thanks!
Comment 17 Timo Hoenig 2006-11-11 19:21:30 UTC
(In reply to comment #16)

>  sleep 5 && tartproc -p $HALDAEMON_PID $HALDAEMON_BIN $HALDAEMON_PARA

Of course, this should read

 sleep 5 && startproc -p $HALDAEMON_PID $HALDAEMON_BIN $HALDAEMON_PARA
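
To tell whether hald survived after such a delayed start, a check like the one below could be added after the startproc line. It is a sketch under assumptions: the helper name is invented, and it assumes $HALDAEMON_PID names the PID file that is passed to startproc -p.

```shell
#!/bin/sh
# Return 0 if the process with PID $1 is still alive after $2 seconds,
# 1 otherwise. kill -0 sends no signal; it only tests whether the
# process exists and may be signalled, so run this as root.
survived() {
    sleep "$2"
    kill -0 "$1" 2>/dev/null
}

# Usage (hypothetical, inside the init script after the startproc line;
# assumes $HALDAEMON_PID is the path of the PID file):
#   survived "$(cat "$HALDAEMON_PID")" 5 || echo "hald died after start"
```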
Comment 18 Andreas Hanke 2006-11-11 20:30:32 UTC
Knowing that the desired way to debug hald is --daemon=yes --verbose=yes --use-syslog, I have ignored this because it makes the problem irreproducible. Instead I have changed the startproc invocation to be as follows:



HALDAEMON_PARA="--daemon=no"
startproc -l /tmp/hal_output.txt -p $HALDAEMON_PID $HALDAEMON_BIN $HALDAEMON_PARA



You can find my /tmp/hal_output.txt attached. Maybe it's at least a bit useful.
Comment 19 Andreas Hanke 2006-11-11 20:31:00 UTC
Created attachment 104803 [details]
hal_output.txt
Comment 20 Andreas Hanke 2006-11-11 20:40:09 UTC
** ERROR **: file blockdev.c: line 835 (hotplug_event_begin_add_blockdev): assertion failed: (d_it != NULL)
aborting...
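
Others who want to check whether they hit the same abort could search their captured hald output or /var/log/messages for the function name in the message. A small sketch (the helper name is made up; the search string is taken from the error above):

```shell
#!/bin/sh
# Print every line of log file $1 that mentions the function from the
# failed assertion above, prefixed with its line number.
find_blockdev_assert() {
    grep -n -F 'hotplug_event_begin_add_blockdev' "$1"
}

# Usage: check the captured hald output, if present.
if [ -r /tmp/hal_output.txt ]; then
    find_blockdev_assert /tmp/hal_output.txt || echo "assertion not found"
fi
```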
Comment 21 Peter Nixon 2006-11-11 21:02:53 UTC
I also see this error in /var/log/messages when it doesn't work.

I have made the change requested in comment #16.
As the problem is difficult to reproduce reliably, I can't tell if it made any difference. I will report if it reoccurs.
(Note: The problem most reliably occurs on the first and second reboot after a "smart upgrade". Maybe something starts up a bit slower the first few times after it has been upgraded?)
Comment 22 Peter Nixon 2006-11-11 21:07:29 UTC
Just a quick note: In my opinion the Severity of this bug should be upgraded. It causes me major annoyance, but would send a non-expert linux user running for another platform if it affects them...

As it is, I can't figure out a reliable way to stop it happening or to reproduce it. When it happens, it's possible to reboot 3 or 4 times without fixing it, at which point I usually revert to a manual "ifconfig up" on an ethernet cable. One or two reboots later it usually fixes itself and I return to using wifi as normal.
Comment 23 Andreas Hanke 2006-11-11 21:15:16 UTC
Forget about smart, it has absolutely and definitely nothing to do with this and just causes confusion here.

Only the engineers should touch the "Severity" field. Be patient, I'm very confident that this report will be handled properly nevertheless.

I think now it's time to wait and see whether the information about the failed assertion in file blockdev.c: line 835 goes into the right direction.
Comment 24 Danny Al-Gaaf 2006-11-11 21:28:03 UTC
Hm ... the g_assert() call is IMO really strange, and it is the only place in the whole code where the entire daemon dies because a device could not be found.

And the code does not look really safe, because it only tries to get the parent device from the gdl and not from the tdl. That could be a little bit racy.

I'll take a look at this.
Comment 25 Timo Hoenig 2006-11-11 21:34:43 UTC
Danny, we should really make HAL issue such warnings using syslog.  It would have saved us a lot of time.
Comment 26 Marcel Witte 2006-11-11 21:40:28 UTC
I reported the bug 218184 (Comment #10)

Here it seems that hald isn't crashing anymore since I upgraded to Beta2 with smart...
Comment 27 Danny Al-Gaaf 2006-11-11 22:02:05 UTC
Could you check whether this still happens with the package from
http://beta.suse.com/private/dkukawka/hal/testpackages/hal-0.5.8_git20061106-6/ ?
Comment 30 Marcel Witte 2006-11-11 22:22:20 UTC
Danny, do you mean me? I've installed hal-0.5.8_git20061106-4.x86_64 and it's working now.
Comment 31 Andreas Hanke 2006-11-11 22:31:30 UTC
Marcel, you wrote in comment 26 that it worked for you even with stock Beta2 packages, but not for me. So your information from comment 30 doesn't really apply, sorry.

I'm testing hal-0.5.8_git20061106-5.i586.rpm right now on the very same machine where stock Beta2 had the problem. So far it looks good, but I have rebooted only 5 times and would like to test it more.
Comment 32 Marcel Witte 2006-11-11 23:10:49 UTC
Sorry for my English... I just wanted to know if I should try hal-0.5.8_git20061106-5.x86_64.rpm, even though it has been working since the update to Beta2 with hal-0.5.8_git20061106-4.x86_64...
I hate English ;-)
Comment 33 Danny Al-Gaaf 2006-11-11 23:22:49 UTC
mboman could also no longer reproduce the bug. If the bug occurs again, please reopen it. I submitted a new package to STABLE.
Comment 34 Andreas Hanke 2006-11-11 23:37:37 UTC
I have tested the test package hal-0.5.8_git20061106-5.i586.rpm by rebooting the system 30 times after installing it. There was not a single failure.

For verification, I downgraded hal to the stock Beta2 package and then it failed again on the first attempt already.

So assuming that the new hal submission has the patch from the test package in it, the bug is fixed.
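
A reboot series like this can be tallied with a small script. The following is only a sketch, not what was used for the test above: it assumes each boot appends one word, "ok" or "fail", to a log file, and both the function name and the log path in the comment are hypothetical.

```shell
#!/bin/sh
# Count "ok" and "fail" lines in boot log $1 and print a summary.
# The log format (one word per boot) is an assumption of this sketch.
tally_boots() {
    # grep -c prints 0 but exits non-zero when nothing matches,
    # hence the "|| true".
    ok=$(grep -c -x 'ok' "$1" || true)
    fail=$(grep -c -x 'fail' "$1" || true)
    echo "boots=$((ok + fail)) ok=$ok fail=$fail"
}

# Recording one result per boot could look like this (hypothetical path,
# run late in the boot sequence):
#   if ps -e -o comm= | grep -qx hald; then echo ok; else echo fail; fi \
#       >> /var/log/hald-boot-check.log
# and afterwards:
#   tally_boots /var/log/hald-boot-check.log
```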
Comment 35 Peter Nixon 2006-11-12 00:37:24 UTC
I have also upgraded to your package and after 10 reboots have as yet been unable to reproduce a failure.. Looks good..
Comment 36 Markus Kriewald 2006-11-16 21:07:06 UTC
*** Bug 220912 has been marked as a duplicate of this bug. ***