
fix: reap zombie worker processes in containerized environments #2184

Open
MetaAlms wants to merge 1 commit into TEN-framework:main from MetaAlms:fix/reap-zombie-workers


MetaAlms commented Jun 8, 2026


Problem

In containers that run without a dedicated init process as PID 1, TEN Agent Server spawns [main] worker processes via tman run start. When a worker is stopped, tman exits without calling wait() on its child [main], leaving zombie processes that accumulate over time.

Root Cause

The server starts tman with Setpgid: true, then kills the whole process group via syscall.Kill(-pid, SIGKILL). tman is terminated before it can reap its child [main], so the kernel re-parents the child to PID 1. In a container, PID 1 is the Go server itself, which waits only on the processes it started directly (through its cmd.Wait() goroutine) and never calls waitpid() for re-parented children.

Fix

Register a SIGCHLD handler in worker_linux.go that calls waitpid(-1, WNOHANG) in a loop to reap any zombie children re-parented to PID 1.
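A minimal sketch of such a reaper is shown below. It is not the code from this PR: the function names reapZombies and startReaper are assumptions made for illustration, and only the use of SIGCHLD with a non-blocking Wait4(-1, ...) loop comes from the description.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// reapZombies (hypothetical name) collects every child that has already
// exited, without blocking, and returns how many were reaped.
func reapZombies() int {
	reaped := 0
	for {
		var status syscall.WaitStatus
		pid, err := syscall.Wait4(-1, &status, syscall.WNOHANG, nil)
		if err == syscall.EINTR {
			continue // interrupted by a signal, try again
		}
		// pid == 0: children exist but none has exited yet.
		// err != nil (typically ECHILD): there are no children at all.
		if err != nil || pid <= 0 {
			return reaped
		}
		reaped++
	}
}

// startReaper (hypothetical name) reaps children whenever SIGCHLD arrives.
// SIGCHLD deliveries can coalesce, so each wake-up drains all exited children.
func startReaper() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGCHLD)
	go func() {
		for range sigs {
			reapZombies()
		}
	}()
}

func main() {
	startReaper()
	fmt.Println("reaped at startup:", reapZombies())
}
```

One design point worth checking in review: Wait4 with pid -1 reaps any child, including processes the server started itself. If the reaper collects such a process first, a concurrent cmd.Wait() on it can fail with a "no child processes" error, so callers may need to tolerate that case.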

Verification

  • Built on Linux (Docker): go build ./... passes
  • Deployed and tested on a live environment: voice sessions start and stop correctly, and the zombie process count remains zero after multiple sessions
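The description does not say how the zombie count was measured. One possible way on Linux, sketched here under that assumption, is to scan /proc for processes in state Z. The names procState and countZombies are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// procState (hypothetical helper) extracts the one-letter state from the
// contents of /proc/<pid>/stat. The state follows the last ')' because the
// command name may itself contain spaces or parentheses.
func procState(stat string) (byte, bool) {
	i := strings.LastIndexByte(stat, ')')
	if i < 0 || i+2 >= len(stat) {
		return 0, false
	}
	return stat[i+2], true
}

// countZombies (hypothetical helper) counts processes whose state is 'Z'.
// On systems without /proc it finds no files and returns 0.
func countZombies() int {
	paths, _ := filepath.Glob("/proc/[0-9]*/stat")
	n := 0
	for _, p := range paths {
		data, err := os.ReadFile(p)
		if err != nil {
			continue // the process may have exited while we were scanning
		}
		if s, ok := procState(string(data)); ok && s == 'Z' {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println("zombies:", countZombies())
}
```

Running this inside the container before and after a few sessions gives a quick check that the count stays flat.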

Related

This is a standard Linux pattern for init-less containers. The handler is placed in worker_linux.go (linux || darwin build tag) since it relies on syscall.SIGCHLD and syscall.Wait4.

TEN workers are spawned via tman, which forks a [main] subprocess. When tman receives SIGKILL during stop(), it has no chance to wait() for [main]. [main] becomes an orphan, and such orphans accumulate as zombies. Register a SIGCHLD handler to automatically reap all orphaned child processes via waitpid(-1, WNOHANG).