Bug 1149603 - nscd startup fails if /var/run in tmpfs (regression related to bug #600804 ?)
nscd startup fails if /var/run in tmpfs (regression related to bug #600804 ?)
Status: RESOLVED FIXED
Classification: openSUSE
Product: openSUSE Distribution
Classification: openSUSE
Component: Basesystem
Leap 15.1
Other Other
: P5 - None : Normal (vote)
: ---
Assigned To: Oliver Kurz
E-mail List
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2019-09-05 12:33 UTC by Oliver Kurz
Modified: 2019-10-09 12:28 UTC (History)
2 users (show)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Attachments
system journal of openqaworker-arm-1 form 2019-09-28, covering "ordering cycles" (500.82 KB, text/x-log)
2019-10-08 15:12 UTC, Oliver Kurz
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Oliver Kurz 2019-09-05 12:33:17 UTC
## Observation

As observed on three aarch64 machines but I think this is not architecture-specific, the nscd service fails after bootup where /var/run is a symlink to /run which is mounted as tmpfs.

This looks just like
https://bugzilla.suse.com/show_bug.cgi?id=600804

Maybe the problem is related to order of setting up services and mount points.


## Workaround

Restarting the service usually works. Most likely creating the directory in the systemd service if it does not exist would work.
Comment 1 Andreas Schwab 2019-09-05 12:48:15 UTC
Does /run already exist at this point?
Comment 2 Andreas Schwab 2019-09-05 12:52:24 UTC
I mean, does /run/nscd already exist?  It should be created by /usr/lib/tmpfiles.d/nscd.conf, and nscd does not touch anything else under /run.
Comment 3 Oliver Kurz 2019-09-05 13:19:33 UTC
I can check that. For now I have implemented a workaround with bug detection in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/162 . Maybe after the next days and some reboots I can give an answer to your question.
Comment 4 Oliver Kurz 2019-09-14 16:38:19 UTC
I could not find this error again. Let's treat this is WORKSFORME unless I find this again.
Comment 5 Oliver Kurz 2019-09-30 08:46:50 UTC
On the three ARM machines we have I have found this problem on all at least once:

```
sudo salt '*arm*' cmd.run 'journalctl -u nscd | grep 1149603'
openqaworker-arm-1.suse.de:
    Sep 28 14:23:40 linux-rz6p mkdir_nscd[1143]: boo#1149603: /run/nscd does not exist
openqaworker-arm-2.suse.de:
    Sep 19 18:38:39 linux-79cz mkdir_nscd[1576]: boo#1149603: /run/nscd does not exist
openqaworker-arm-3.suse.de:
    Sep 17 20:20:37 openqaworker-arm-3 mkdir_nscd[1662]: boo#1149603: /run/nscd does not exist
    Sep 26 12:07:28 openqaworker-arm-3 mkdir_nscd[1546]: boo#1149603: /run/nscd does not exist
```

So to answer your question: Clearly the directory `/run/ncsd` does not exist when the above scripts are triggered as workaround. What else can I check? I should have the full journals available.

The file `/usr/lib/tmpfiles.d/nscd.conf` exists correctly and has content:

```
d /run/nscd 0755 root root
```

On openqaworker-arm-1.suse.de where /run/nscd was not existing on "Sep 28 14:23:40" I would see the following output from systemd-tmpfiles-setup.servic

```
> sudo journalctl -u systemd-tmpfiles-setup.service
-- Reboot --
Sep 27 07:58:20 linux-rz6p systemd[1]: Starting Create Volatile Files and Directories...
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/net-snmp.conf:1] Line references path below legacy directory /var/run/, updating /var/run/net-snmp → /run/net-snmp; please update the tmpfiles.d/ drop-i>
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/salt.conf:2] Line references path below legacy directory /var/run/, updating /var/run/salt → /run/salt; please update the tmpfiles.d/ drop-in file accor>
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/salt.conf:3] Line references path below legacy directory /var/run/, updating /var/run/salt/master → /run/salt/master; please update the tmpfiles.d/ drop>
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/salt.conf:4] Line references path below legacy directory /var/run/, updating /var/run/salt/minion → /run/salt/minion; please update the tmpfiles.d/ drop>
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/samba.conf:1] Line references path below legacy directory /var/run/, updating /var/run/samba → /run/samba; please update the tmpfiles.d/ drop-in file ac>
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/tmp.conf:13] Duplicate line for path "/var/tmp", ignoring.
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/var.conf:21] Duplicate line for path "/var/lib", ignoring.
Sep 27 07:58:20 linux-rz6p systemd-tmpfiles[1132]: [/usr/lib/tmpfiles.d/var.conf:23] Duplicate line for path "/var/spool", ignoring.
Sep 27 07:58:20 linux-rz6p systemd[1]: Started Create Volatile Files and Directories.
-- Reboot --
Sep 28 14:23:29 linux-rz6p systemd[1]: systemd-tmpfiles-setup.service: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with sysinit.target/start
```

One can see that the bootup on "Sep 27" was fine – except for many warnings – but on "Sep 28" it actually failed and I can confirm the same on another machine. systemd-tmpfiles-setup.service actually fails to startup, that's why we have this problem. So it seems the problem is actually in systemd-tmpfiles-setup.service so let me assign the bug to systemd maintainers. The warnings seem to be "acceptable" according to my research, e.g. following https://lists.freedesktop.org/archives/systemd-devel/2018-May/040726.html and https://bugs.launchpad.net/ubuntu/+source/rsyslog/+bug/1484027
Comment 6 Franck Bui 2019-10-08 13:42:46 UTC
(In reply to Oliver Kurz from comment #5)
> Sep 28 14:23:29 linux-rz6p systemd[1]: systemd-tmpfiles-setup.service: Job
> systemd-tmpfiles-setup.service/start deleted to break ordering cycle
> starting with sysinit.target/start
> ```

Do you still have the complete journal logs of the boot when this happened so we can see what other units are involved in this ordering cycle ?

If so can you attach it ?

Thanks.
Comment 7 Oliver Kurz 2019-10-08 15:12:38 UTC
Created attachment 820804 [details]
system journal of openqaworker-arm-1 form 2019-09-28, covering "ordering cycles"

Please find the full log attached

All mentions of "ordering cycle" are as follows:

```
Sep 28 14:23:29 linux-rz6p systemd[1]: haveged.service: Current command vanished from the unit file, execution of the command list won't be resumed.
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found ordering cycle on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on systemd-journal-catalog-update.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on var-lib-openqa-nvme.mount/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on openqa_nvme_format.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sockets.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on dbus.socket/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Breaking ordering cycle by deleting job systemd-journal-catalog-update.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: systemd-journal-catalog-update.service: Job systemd-journal-catalog-update.service/start deleted to break ordering cycle starting with sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found ordering cycle on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on systemd-tmpfiles-setup.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on var-lib-openqa-nvme.mount/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on openqa_nvme_format.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sockets.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on dbus.socket/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Breaking ordering cycle by deleting job systemd-tmpfiles-setup.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: systemd-tmpfiles-setup.service: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found ordering cycle on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on systemd-machine-id-commit.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on var-lib-openqa-nvme.mount/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on openqa_nvme_format.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sockets.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on dbus.socket/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Breaking ordering cycle by deleting job systemd-machine-id-commit.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: systemd-machine-id-commit.service: Job systemd-machine-id-commit.service/start deleted to break ordering cycle starting with sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found ordering cycle on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on paths.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on ca-certificates.path/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on var-lib-openqa-nvme.mount/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on openqa_nvme_format.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Found dependency on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: basic.target: Breaking ordering cycle by deleting job paths.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: paths.target: Job paths.target/start deleted to break ordering cycle starting with basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found ordering cycle on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on var-lib-openqa-nvme.mount/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on openqa_nvme_format.service/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on basic.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sockets.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on dbus.socket/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Found dependency on sysinit.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: sysinit.target: Breaking ordering cycle by deleting job local-fs.target/start
Sep 28 14:23:29 linux-rz6p systemd[1]: local-fs.target: Job local-fs.target/start deleted to break ordering cycle starting with sysinit.target/start
Sep 28 14:23:29 linux-rz6p auditd[878]: Couldn't open log file /var/log/audit/audit.log (Read-only file system)
```

I have the suspicion that our custom service "openqa_nvme_format.service" is the culprit. It's also mentioned within the above log excerpt but that is not obvious for me from the log messages. Maybe you can teach me how to read them properly in this regard? :)
Comment 8 Franck Bui 2019-10-08 15:56:26 UTC
(In reply to Oliver Kurz from comment #7)
> I have the suspicion that our custom service "openqa_nvme_format.service" is
> the culprit.

Agreed.

> It's also mentioned within the above log excerpt but that is
> not obvious for me from the log messages. Maybe you can teach me how to read
> them properly in this regard? :)

I can try...

> sysinit.target: Found ordering cycle on sysinit.target/start

So sysinit.target depends (maybe indirectly) on a unit which in its turn depends on sysinit.target hence the cycle.

Then systemd lists the potential culprit which is likely to be openqa_nvme_format.service.

Where can the unit file can be found ?
Comment 9 Oliver Kurz 2019-10-08 16:10:26 UTC
(In reply to Franck Bui from comment #8)
> (In reply to Oliver Kurz from comment #7)
> > I have the suspicion that our custom service "openqa_nvme_format.service" is
> > the culprit.
> 
> Agreed.
> 
> > It's also mentioned within the above log excerpt but that is
> > not obvious for me from the log messages. Maybe you can teach me how to read
> > them properly in this regard? :)
> 
> I can try...
> 
> > sysinit.target: Found ordering cycle on sysinit.target/start
> 
> So sysinit.target depends (maybe indirectly) on a unit which in its turn
> depends on sysinit.target hence the cycle.
> 
> Then systemd lists the potential culprit which is likely to be
> openqa_nvme_format.service.
> 
> Where can the unit file can be found ?

https://gitlab.suse.de/openqa/salt-states-openqa/blob/master/openqa/nvme_store/openqa_nvme_format.service

```
[Unit]
Description=Format NVMe before mounting it
Before=var-lib-openqa-nvme.mount

[Service]
Type=oneshot
ExecStart=/sbin/mkfs.ext2 -F /dev/nvme0n1p1

[Install]
WantedBy=multi-user.target
```
Comment 10 Franck Bui 2019-10-08 16:34:14 UTC
(In reply to Oliver Kurz from comment #9)
> [Unit]
> Description=Format NVMe before mounting it
> Before=var-lib-openqa-nvme.mount
> 
> [Service]

So this unit wants to be started before var-lib-openqa-nvme.mount which is supposed to be started before local-fs.target which in its turn is supposed to be started before sysinit.target.

OTOH it also implicitly uses DefaultDependencies=yes (the default) which means that it wants to be started after basic.target (see man system.unit). However basic.target is run after sysinit.target.

Hence the issue.

So the unit probably needs to disable the default dependencies with "DefaultDependencies=no".

Let me know if it helps.
Comment 11 Oliver Kurz 2019-10-09 12:28:23 UTC
Of course! I just read the past days about this flag as well. This helped when I also added the same line to other service files we have created. Thanks a lot.

Fixed in https://gitlab.suse.de/openqa/salt-states-openqa/merge_requests/192