Bug 1221458

Summary: [Build 20240314] test docker_firewall: fails to restart docker
Product: [openSUSE] openSUSE Tumbleweed Reporter: Dominique Leuenberger <dimstar>
Component: ContainersAssignee: Containers Team <containers-bugowner>
Status: RESOLVED FIXED QA Contact: E-mail List <qa-bugs>
Severity: Normal    
Priority: P5 - None CC: dimstar, fbui, rbranco, systemd-maintainers
Version: Current   
Target Milestone: ---   
Hardware: Other   
OS: Other   
URL: https://openqa.opensuse.org/tests/4015303/modules/docker_firewall/steps/143
Whiteboard:
Found By: openQA Services Priority:
Business Priority: Blocker: Yes
Marketing QA Status: --- IT Deployment: ---

Description Dominique Leuenberger 2024-03-15 09:56:51 UTC
## Observation

The test fails at

"systemctl --no-pager restart docker; echo yoQuJ-\$?-"

Job for docker.service failed.
See "systemctl status docker.service" and "journalctl -xeu docker.service" for details.

unfortunately, that job does not (yet) export a journal, making debugging a bit harder than it should be


openQA test in scenario opensuse-Tumbleweed-JeOS-for-kvm-and-xen-x86_64-jeos-container_host@64bit_virtio-2G fails in
[docker_firewall](https://openqa.opensuse.org/tests/4015303/modules/docker_firewall/steps/143)

## Test suite description
Container host tests on JeOS image.


## Reproducible

Fails since (at least) Build [20240314](https://openqa.opensuse.org/tests/4014897)


## Expected result

Last good: [20240313](https://openqa.opensuse.org/tests/4012987) (or more recent)


## Further details

Always latest result in this scenario: [latest](https://openqa.opensuse.org/tests/latest?arch=x86_64&distri=opensuse&flavor=JeOS-for-kvm-and-xen&machine=64bit_virtio-2G&test=jeos-container_host&version=Tumbleweed)
Comment 1 Dominique Leuenberger 2024-03-15 09:59:56 UTC
earlier in the test suite, the restarting of docker service still worked

it's only once the firewall setup is being done we fail to restart the service
Comment 2 Dan Čermák 2024-03-15 10:20:37 UTC
(In reply to Dominique Leuenberger from comment #1)
> earlier in the test suite, the restarting of docker service still worked
> 
> it's only once the firewall setup is being done we fail to restart the
> service

Do you have the journal or an error message from that test run?
Comment 3 Dominique Leuenberger 2024-03-15 10:41:49 UTC
(In reply to Dan Čermák from comment #2)
> (In reply to Dominique Leuenberger from comment #1)
> > earlier in the test suite, the restarting of docker service still worked
> > 
> > it's only once the firewall setup is being done we fail to restart the
> > service
> 
> Do you have the journal or an error message from that test run?

based on comment#0:

unfortunately, that job does not (yet) export a journal, making debugging a bit harder than it should be

I filed https://progress.opensuse.org/issues/157330 for openQA to fix that
Comment 4 Ricardo Branco 2024-03-15 11:00:26 UTC
(In reply to Dan Čermák from comment #2)
> (In reply to Dominique Leuenberger from comment #1)
> > earlier in the test suite, the restarting of docker service still worked
> > 
> > it's only once the firewall setup is being done we fail to restart the
> > service
> 
> Do you have the journal or an error message from that test run?

https://openqa.opensuse.org/tests/4015685#step/docker_firewall/138

× docker.service - Docker Application Container Engine
     Loaded: loaded (/usr/lib/systemd/system/docker.service; enabled; preset: disabled)
     Active: failed (Result: start-limit-hit) since Fri 2024-03-15 10:26:33 UTC; 60ms ago
   Duration: 20.308s
TriggeredBy: × docker.socket
       Docs: http://docs.docker.com
    Process: 6825 ExecStart=/usr/bin/dockerd --add-runtime oci=/usr/sbin/docker-runc $DOCKER_NETWORK_OPTIONS $DOCKER_OPTS (code=exited, status=0/SUCCESS)
   Main PID: 6825 (code=exited, status=0/SUCCESS)
        CPU: 1.373s

Mar 15 10:26:32 localhost.localdomain dockerd[6825]: time="2024-03-15T10:26:32.444638010Z" level=info msg="stopping healthcheck following graceful shutdown" module=libcontainerd
Mar 15 10:26:32 localhost.localdomain dockerd[6825]: time="2024-03-15T10:26:32.444820153Z" level=info msg="stopping event stream following graceful shutdown" error="context canceled" module=libcontainerd namespace=plugins.moby
Mar 15 10:26:32 localhost.localdomain dockerd[6831]: time="2024-03-15T10:26:32.445037306Z" level=debug msg="received signal" signal=terminated
Mar 15 10:26:32 localhost.localdomain dockerd[6831]: time="2024-03-15T10:26:32.445154898Z" level=debug msg="sd notification" notified=false state="STOPPING=1"
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Deactivated successfully.
Mar 15 10:26:33 localhost.localdomain systemd[1]: Stopped Docker Application Container Engine.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Consumed 1.373s CPU time.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Start request repeated too quickly.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Failed with result 'start-limit-hit'.
Mar 15 10:26:33 localhost.localdomain systemd[1]: Failed to start Docker Application Container Engine.

---

░░ Subject: Unit succeeded
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit docker.service has successfully entered the 'dead' state.
Mar 15 10:26:33 localhost.localdomain systemd[1]: Stopped Docker Application Container Engine.
░░ Subject: A stop job for unit docker.service has finished
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A stop job for unit docker.service has finished.
░░ 
░░ The job identifier is 10208 and the job result is done.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Consumed 1.373s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit docker.service completed and consumed the indicated resources.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Start request repeated too quickly.
Mar 15 10:26:33 localhost.localdomain systemd[1]: docker.service: Failed with result 'start-limit-hit'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ The unit docker.service has entered the 'failed' state with result 'start-limit-hit'.
Mar 15 10:26:33 localhost.localdomain systemd[1]: Failed to start Docker Application Container Engine.
░░ Subject: A start job for unit docker.service has failed
░░ Defined-By: systemd
░░ Support: https://lists.freedesktop.org/mailman/listinfo/systemd-devel
░░ 
░░ A start job for unit docker.service has finished with a failure.
░░ 
░░ The job identifier is 10208 and the job result is failed.
Comment 5 Dominique Leuenberger 2024-03-15 13:54:40 UTC
>      Active: failed (Result: start-limit-hit) since Fri 2024-03-15 10:26:33 UTC; 60ms ago

Makes me wonder if systemd (updated to 255.3 in this snapshot) is being more aggressive in stopping us from repeatedly restarting the docker service.

@Franck: are you already aware of any issues around rate-limiting service restarts?
Comment 6 Dominique Leuenberger 2024-03-15 15:19:42 UTC
docker.service comes with its own set of limits:

# Restart the docker process if it exits prematurely.
Restart=on-failure
StartLimitBurst=3
StartLimitInterval=60s

Seems we might just be too fast now (or systemd properly honoring it finally)
Comment 7 Dominique Leuenberger 2024-03-15 22:27:01 UTC
https://openqa.opensuse.org/tests/4017525/file/docker_firewall-journal.log

There we now have a full journal
Comment 8 Dominique Leuenberger 2024-03-16 10:05:24 UTC
https://github.com/os-autoinst/os-autoinst-distri-opensuse/pull/18900

Adding a drop-in configuration to change the service definition. Allow 'more frequent restarts' as performed by the test
Comment 9 Dominique Leuenberger 2024-03-16 14:57:44 UTC
Fixed and confirmed:
https://openqa.opensuse.org/tests/4018060