fix(healthcheck): cleanup transient units on container exit and start #4935

Open
haytok wants to merge 2 commits into containerd:main from haytok:systemd-run-in-an-idempotent-state
haytok (Member) commented May 25, 2026

Details are described in the commit message of adfb7d4.


My investigation into this issue is as follows:

> sudo nerdctl run --debug -d --name hoge \
  --health-cmd "echo hoge" --health-interval=3s --health-start-period=60s --health-retries=2 \
  alpine sleep 1
...

> sleep 10

> sudo nerdctl ps -a --filter "name=hoge"
CONTAINER ID    IMAGE                              COMMAND      CREATED               STATUS                           PORTS    NAMES
dd94022f7dd0    docker.io/library/alpine:latest    "sleep 1"    About a minute ago    Exited (0) About a minute ago             hoge
> CID=$(sudo nerdctl ps -aq --no-trunc --filter "name=hoge")

In the output below, the message Active: failed (Result: exit-code) on the
service unit indicates that the health check failed because the container
had already entered a stopped state, while the timer unit is still active.

> sudo systemctl status ${CID}.timer
● dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer - /usr/local/bin/nerdctl --debug=true container healthcheck dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606>
     Loaded: loaded (/run/systemd/transient/dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer; transient)
  Transient: yes
     Active: active (running) since Sun 2026-05-24 18:22:01 JST; 2min 50s ago
 Invocation: ac06f3332a7c4223aa15b9ab7758b4e3
    Trigger: n/a
   Triggers: ● dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service

May 24 18:22:01 lima-haytok systemd[1]: Started dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer - /usr/local/bin/nerdctl --debug=true container healthcheck dd94022f
> sudo systemctl status ${CID}.service
× dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service - /usr/local/bin/nerdctl --debug=true container healthcheck dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c6>
     Loaded: loaded (/run/systemd/transient/dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service; transient)
  Transient: yes
     Active: failed (Result: exit-code) since Sun 2026-05-24 18:29:14 JST; 922ms ago
...
TriggeredBy: ● dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer
    Process: 1067977 ExecStart=/usr/local/bin/nerdctl --debug=true container healthcheck dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918 (code=exited, status=1/FAILURE)
...

May 24 18:29:14 lima-haytok systemd[1]: Started dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service - /usr/local/bin/nerdctl --debug=true container healthcheck dd9402>
May 24 18:29:14 lima-haytok nerdctl[1067977]: time="2026-05-24T18:29:14+09:00" level=fatal msg="container is not running (status: stopped)"
May 24 18:29:14 lima-haytok systemd[1]: dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service: Main process exited, code=exited, status=1/FAILURE
May 24 18:29:14 lima-haytok systemd[1]: dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.service: Failed with result 'exit-code'.

haytok (Member, Author) commented May 25, 2026

The following tests related to this fix are failing, so I will look into them:

TestContainerHealthCheckBasic
TestContainerHealthCheckBasic/Health_check_on_stopped_container

haytok added 2 commits May 25, 2026 23:55
Suppose a container with healthcheck enabled has exited.

```bash
> sudo nerdctl ps -a --filter "name=hoge"
CONTAINER ID    IMAGE                              COMMAND      CREATED               STATUS                           PORTS    NAMES
dd94022f7dd0    docker.io/library/alpine:latest    "sleep 1"    About a minute ago    Exited (0) About a minute ago             hoge
```

When we try to run `nerdctl start` on that container, the following error
occurs, and the container cannot be started.

```bash
> sudo nerdctl start hoge
FATA[0000] 1 errors:
failed to create healthcheck timer: systemd-run failed: exit status 1
output: Failed to start transient timer unit: Unit dd94022f7dd0f8b60cb49e803c52de3a7a49c2b39276883ed7c606e13ca1a918.timer was already loaded or has a fragment file.
```

The failure is caused by the systemd transient timer unit that was created
to execute health checks and is still loaded after the container exited.

According to `systemctl status`, the transient timer unit is still
`active`, while the transient service unit that executes the healthcheck
command has failed.

In nerdctl, container health checks are performed by running the
`systemd-run` command, which creates a transient service unit and a
transient timer unit. These units periodically execute the `exec` command
on the target container to run the command specified with the
`--health-cmd` option.
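
To make the mechanism concrete, here is a minimal sketch of the kind of
`systemd-run` command line involved. The helper name
`healthcheck_timer_cmd` is hypothetical, and the timer options
(`--on-active` and `--on-unit-active`) are an assumption chosen to
illustrate periodic execution. The options nerdctl actually passes may
differ. Only the `nerdctl container healthcheck` command and the use of
the container ID as the unit name are taken from the `systemctl status`
output.

```bash
# Hypothetical helper: prints the command line instead of running it.
# The scheduling options are assumed, not confirmed by this change.
healthcheck_timer_cmd() {
  cid="$1"       # full container ID, reused as the unit name
  interval="$2"  # value of --health-interval, for example 3s
  echo systemd-run --unit "$cid" \
    --on-active="$interval" --on-unit-active="$interval" \
    nerdctl container healthcheck "$cid"
}
```

When timer options are given together with `--unit`, `systemd-run` creates
both `<unit>.timer` and `<unit>.service`, which corresponds to the pair of
units seen in the `systemctl status` output.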

However, the current implementation does not account for the case where
the container has exited.

Therefore, this commit ensures that transient units are deleted when a
container with a health check enabled exits. It also checks for leftover
transient units when restarting a stopped container with a health check
enabled.

The specific approach is as follows:

- Use the `--collect` option of the `systemd-run` command so that the
  transient service unit can be garbage-collected even when it is in a
  failed state.
- Delete the transient timer unit when the process exits and the container
  is in a stopped state.
- Before creating a new transient timer unit in CreateTimer, check whether
  a transient timer unit with the same name already exists and remove it
  if so.

Note that if the `--collect` option is specified when executing the
`systemd-run` command, deleting the transient timer unit will cause it to
be unloaded by systemd's garbage collection.
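
The approach above can be sketched in shell as follows. This is only an
illustration under stated assumptions, not the code in this change: the
function names are hypothetical, the timer options are assumed, and the
`reset-failed` call is an extra fallback for units that may have been
created without `--collect`.

```bash
# Sketch only: function names and timer options are assumptions.

# Stop a leftover timer and clear a failed service so that the unit name
# can be reused. Errors are ignored because the units may not exist.
cleanup_healthcheck_units() {
  cid="$1"
  systemctl stop "${cid}.timer" 2>/dev/null || true
  systemctl reset-failed "${cid}.service" 2>/dev/null || true
}

# Remove leftovers first, then create the transient units.
# --collect allows systemd to garbage-collect them even after a failure.
create_healthcheck_timer() {
  cid="$1"
  interval="$2"
  cleanup_healthcheck_units "$cid"
  systemd-run --collect --unit "$cid" \
    --on-active="$interval" --on-unit-active="$interval" \
    nerdctl container healthcheck "$cid"
}
```

Calling `cleanup_healthcheck_units` when the container exits and
`create_healthcheck_timer` on start keeps the sequence idempotent, so a
repeated start does not fail with "was already loaded or has a fragment
file".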

References:
- https://www.freedesktop.org/software/systemd/man/latest/systemd-run.html#-G
- https://www.freedesktop.org/software/systemd/man/latest/systemd.unit.html#CollectMode=

Signed-off-by: Hayato Kiwata <dev@haytok.jp>
Signed-off-by: Hayato Kiwata <dev@haytok.jp>
haytok force-pushed the systemd-run-in-an-idempotent-state branch from 6e3f5ba to 08b99f8 on May 25, 2026 14:56
haytok (Member, Author) commented May 25, 2026

I have resolved the CI failures mentioned above.
