Bug 390676

Summary: NIS/autofs not starting properly
Product: [openSUSE] openSUSE 11.0
Reporter: Karl Eichwalder <ke>
Component: Network
Assignee: Marius Tomaschewski <mt>
Status: RESOLVED FIXED
QA Contact: Jiri Srain <jsrain>
Severity: Normal
Priority: P5 - None
CC: coolo, kukuk, mt, radmanic, varkoly
Version: Beta 3
Target Milestone: ---
Hardware: Other
OS: Other
Whiteboard:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---
Attachments: proposed patch for sysconfig

Description Karl Eichwalder 2008-05-15 10:06:42 UTC
After the update to 11.0 beta3, my home directory does not exist (NIS/autofs does not work properly).  More info later.
Comment 1 Andreas Jaeger 2008-05-15 11:52:34 UTC
Waiting for more info ;)
Comment 2 Karl Eichwalder 2008-05-15 12:17:06 UTC
:)

Booting again, everything got initialized properly.

It is probably not a new bug introduced with beta3.  Maybe it isn't even related to the update procedure.  It also happened in the past, and it happened to Tanja during 11.0 alpha/beta test installations.

It is rather annoying, though ;)
Comment 3 Thorsten Kukuk 2008-05-15 19:28:44 UTC
Then please provide the information.

But I guess this is because of the massive changes of the installation workflow => something for the responsible PrjMgr, but nothing for me.
Comment 4 Stephan Kulow 2008-05-15 19:35:34 UTC
Thorsten, there were no changes in the update workflow and Karl did an update (he says)
Comment 5 Thorsten Kukuk 2008-05-15 20:03:18 UTC
There were no changes to NIS either, and according to Karl it only happens with the update workflow.

But we really need more information from Karl; the report as it stands is useless and does not contain any details.

If the home directory does not exist, does this mean Karl could log in? So NIS was working?
Comment 6 Thorsten Kukuk 2008-05-15 20:05:10 UTC
Why does bugzilla reset NEEDINFO itself???
Comment 7 Karl Eichwalder 2008-05-15 20:13:32 UTC
Strange.  I clicked "Throw away my changes, and revisit bug 390676" after #c6.
This is the answer I wrote:

    Yes, at least, user 'ke' was known.  I declined to log in, because I
    was quite in a hurry at noon today, and I assumed rebooting would probably
    help.  And actually, it did.

    If you think it helps I can attach all of /var/log or just
    /var/log/messages.

    As said, over the last weeks Tanja noticed something similar after new
    installations.  It usually helps to stop/start rcnetwork/rcypbind/rcautofs. 
    All of them, or only a subset, I'm not sure right now.
Comment 8 Thorsten Kukuk 2008-05-16 10:00:15 UTC
All these systems seem to use DHCP. And on all systems where I am able to reproduce it, the reason was always the same: network initialisation and DHCP need longer than the
time allowed.
Comment 9 Marius Tomaschewski 2008-05-16 11:53:38 UTC
This is a general problem with our init scripts/setup & network.

The autofs script/service, like many other services, needs a
working network or it usually just fails.
The LSB $network dependency in autofs just says "start network
before autofs", but does not provide a more detailed dependency (and,
if I'm not wrong, not even a successful exit from the network service).

This is the reason why the ifservices(5) functionality exists.
It allows you to define, that a service depends on an interface.

Using ifservices, you'll get as usual:

[...]
    br0       Ports: [eth0]
    br0       forwarddelay (see man ifcfg-bridge) ... ready
    br0       (DHCP) . . . . . no IP address yet... backgrounding.   waiting
Setting up service network  .  .  .  .  .  .  .  .  .  .  .  .  .  . done.

And as soon as dhcp has the IP and completed the interface setup,
the services from /etc/sysconfig/network/ifservices[-br0] will be
started.

So I consider the bug a configuration problem and resolve it as
WONTFIX. If you like, create a feature request to find a better
solution and change it to FEATURE then.
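
The idea described in comment 9 (start a service only once the interface it depends on is actually configured, rather than merely after the network script has run) can be sketched as follows. This is a minimal illustration and not the ifservices implementation: the class name `InterfaceServices` and its methods are hypothetical, and the real ifservices file format is not shown in this report.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Set


class InterfaceServices:
    """Hypothetical registry that defers service start until an interface is ready."""

    def __init__(self) -> None:
        # Start callbacks waiting for each interface to be configured.
        self._pending: DefaultDict[str, List[Callable[[], None]]] = defaultdict(list)
        # Interfaces that have already completed their setup.
        self._ready: Set[str] = set()

    def register(self, interface: str, start: Callable[[], None]) -> None:
        """Start the service now if the interface is ready, otherwise queue it."""
        if interface in self._ready:
            start()
        else:
            self._pending[interface].append(start)

    def interface_ready(self, interface: str) -> None:
        """Mark the interface as configured and start every queued service."""
        self._ready.add(interface)
        for start in self._pending.pop(interface, []):
            start()


if __name__ == "__main__":
    services = InterfaceServices()
    services.register("br0", lambda: print("starting autofs"))
    # Nothing is started until DHCP has finished configuring br0.
    services.interface_ready("br0")
```

The point of the design is that a plain ordering dependency ("network before autofs") says nothing about whether an address has been obtained, whereas a readiness callback does.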
Comment 10 Thorsten Kukuk 2008-05-16 12:05:38 UTC
This is no new feature, this is a clear regression.

We changed something in the system so that dhcp suddenly no longer gets the IP early enough. This was no problem (or only very seldom) with previous releases, but is now reported by a lot of people.

And this has nothing to do with NIS and/or autofs. These two services are only the ones people notice prominently later (cannot log in), since they have this "stupid" splash screen hiding the huge number of init scripts failing during boot.
Comment 11 Marius Tomaschewski 2008-05-16 12:29:44 UTC
IMO this is a very old and common problem when the dhcp server is
slow and does not answer within 5 seconds. What I can do is to increase
the default time we wait for dhcpcd to complete:

## Type:        integer
## Default:     5
#
# When the DHCP client is started at boot time, the boot process will stop
# until the interface is successfully configured, but at most for
# DHCLIENT_WAIT_AT_BOOT seconds.
#
DHCLIENT_WAIT_AT_BOOT="5"

(the default of 5 seconds is many years old) then the network script
waits longer and usually you'll get an IP:

    br0       (DHCP) . . . . . . . . . . . . . . . . . . . . . . IP/Network: '192.168.110.1' / '255.255.255.0' 
Setting up service network  .  .  .  .  .  .  .  .  .  .  .  .  .  .  done


Well... but since you consider this a regression, I am reassigning
it to the maintainer of dhcpcd.
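
The behaviour that the DHCLIENT_WAIT_AT_BOOT comment describes (block the boot for at most N seconds, then carry on without an address) can be sketched as a bounded polling loop. This is an illustration only, not the sysconfig network script: `wait_for_lease`, `has_lease`, and the `clock`/`sleep` parameters are hypothetical names introduced so the logic can be exercised without a real DHCP client.

```python
import time
from typing import Callable


def wait_for_lease(
    has_lease: Callable[[], bool],
    wait_at_boot: float,
    poll_interval: float = 0.5,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until has_lease() is true or wait_at_boot seconds have elapsed.

    Returns True if the interface was configured in time, and False if the
    caller should continue booting while DHCP keeps running in the background.
    """
    deadline = clock() + wait_at_boot
    while True:
        if has_lease():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_interval)
```

With a lease that arrives after 8 seconds, a 5-second limit returns False (the "no IP address yet... backgrounding" case), so services that need the network then start too early; a 15-second limit returns True.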
Comment 12 Peter Varkoly 2008-05-16 13:18:29 UTC
In /mirror/SuSE/ftp.suse.com/pub/people/varkoly/dhcpcd I've a new version of dhcpcd. Please test this.
Comment 13 Karl Eichwalder 2008-05-19 09:14:09 UTC
I installed it but I do not have time (nor skill) to do detailed testing.
Comment 14 Marius Tomaschewski 2008-05-19 14:30:53 UTC
Created attachment 216440
proposed patch for sysconfig
Comment 15 Marius Tomaschewski 2008-05-19 14:33:15 UTC
Please let me know whether I should submit the above patch to 11.0.
Comment 16 Thorsten Kukuk 2008-05-19 15:38:31 UTC
Yes, please submit, I hope it will fix the issues.
Comment 17 Milisav Radmanic 2008-05-20 08:02:03 UTC
I have done a few tests on 10.3 and 11.0 Beta 3 to evaluate the issue. My results show that on 10.3 it took no more than 5 seconds to receive an IP address from the DHCP server. On 11.0 Beta 3 it was even within 2 seconds, and reliably so.
I even did this test on Mac OS X (Leopard), and it obtained an address within 3 seconds repeatedly.

I think this is neither a specific issue of the DHCP client nor of our local DHCP server setup. On 10.3, for example, I sometimes can't log in although networking was set up successfully; only a restart of KDM resolves the issue. I think this is because KDM starts up earlier than the LDAP service.

It may be that the DHCP client contributes to the reported issue, but it is not the sole cause of it. I'm reassigning this to the Project Manager to process this bug any further.
Comment 20 Stephan Kulow 2008-05-20 10:08:49 UTC
The "KDM starts early" behaviour is a feature. So far I have received only a few reports from NIS users that there is a problem. Technically we could disable early start for NIS completely; so far we only do it for autologin.

But you can't compare dhcp speed under normal system load with dhcp speed during booting. During boot, 5 s are over pretty quickly. So I think the sysconfig patch will help; the only other way I can think of is giving the dhcp client more priority.
Comment 21 Stephan Kulow 2008-05-20 10:15:09 UTC
I think #14 will make the problem unlikely enough
Comment 22 Marius Tomaschewski 2008-05-23 11:04:05 UTC
Submitted patch from comment #14 to stable:

- Increased DHCLIENT_WAIT_AT_BOOT to 15 and added a comment noting
  that RFC 2131 specifies that the dhcp client should wait a
  random time between one and ten seconds to desynchronize the
  use of DHCP at startup (bnc#390676).

See also bug #393801 (may be a duplicate).
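
The arithmetic behind the new default can be sketched as follows. The sketch assumes the start-up delay mentioned in the changelog is drawn uniformly from the 1 to 10 second range; that distribution is an assumption for illustration, and the report does not state whether or how dhcpcd applies the delay. The function name `fraction_of_delays_exceeding` is hypothetical.

```python
def fraction_of_delays_exceeding(
    budget_s: float,
    min_delay_s: float = 1.0,
    max_delay_s: float = 10.0,
) -> float:
    """Fraction of start-up delays that are longer than the wait budget.

    Assumes the delay is uniformly distributed between min_delay_s and
    max_delay_s (an assumption, not something stated in the bug report).
    """
    if max_delay_s <= min_delay_s:
        raise ValueError("max_delay_s must be greater than min_delay_s")
    fraction = (max_delay_s - budget_s) / (max_delay_s - min_delay_s)
    # Clamp to the valid range for budgets outside the delay interval.
    return min(1.0, max(0.0, fraction))
```

Under that assumption, 5/9 of the possible delays (about 56%) would by themselves exhaust a 5-second wait before any DHCP exchange begins, while a 15-second wait always leaves at least 5 seconds for the exchange itself.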
Comment 23 Peter Varkoly 2008-05-23 13:25:49 UTC
Primarily this is an infrastructure problem. In this case, increasing DHCLIENT_WAIT_AT_BOOT may help.