python.runScript() deadlocks when Python subprocess produces >208KB stderr output #3356

@NicholasZolton

Summary

python.runScript() from @trigger.dev/python permanently deadlocks the Python subprocess if it writes more than ~208KB to stderr. The root cause is that stderr is connected via a blocking Unix socketpair with a finite kernel buffer, and the trigger-dev-worker intermediary process does not drain it fast enough (or at all under certain conditions).

This is a silent, permanent hang — no error, no timeout, no crash. The task appears to be running but never completes.

Environment

  • @trigger.dev/sdk v4 (latest)
  • @trigger.dev/python extension with custom Python 3.13 layer
  • Machine preset: large-1x
  • Self-hosted Trigger.dev on EKS

Reproduction

Any python.runScript() call where the Python process produces substantial stderr output during startup will deadlock. In our case, importing numpy, scikit-learn, xgboost, and mlflow during task startup is sufficient — especially because the parent environment leaks OTEL_LOG_LEVEL=DEBUG (see issue 2 below).

Minimal reproduction:

// In a Trigger.dev task
import { python } from "@trigger.dev/python";

const result = await python.runScript("my_script.py", [], {
  env: { PYTHONUNBUFFERED: "1" },
});

# my_script.py
import sys
# Write more than ~208KB to stderr → deadlock
sys.stderr.write("x" * 300_000)
print('{"ok": true}')  # never reached
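The same hang can be approximated locally without Trigger.dev. The sketch below is only a stand-in for the worker, not its actual code: it hands a child Python process one end of a Unix socketpair as stderr and deliberately never reads the other end. The helper name `child_hangs` and the explicit `SO_SNDBUF` value are illustrative assumptions; the buffer is shrunk only so that the result does not depend on the host's `net.core.wmem_default`. Linux is assumed.

```python
import socket
import subprocess
import sys

# Same payload as my_script.py: 300,000 bytes to stderr, then one line on stdout.
CHILD_CODE = "import sys; sys.stderr.write('x' * 300_000); print('done')"


def child_hangs(timeout_s: float = 5.0) -> bool:
    """Return True if the child is still blocked after timeout_s seconds."""
    parent_end, child_end = socket.socketpair()  # blocking AF_UNIX stream pair
    # Assumption for determinism: cap the send buffer well below 300,000 bytes.
    child_end.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 65536)
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.DEVNULL,
        stderr=child_end.fileno(),  # child's fd 2 is now the socket
    )
    child_end.close()  # the child holds its own copy
    try:
        proc.wait(timeout=timeout_s)  # parent_end is never read
        return False
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        return True
    finally:
        parent_end.close()


if __name__ == "__main__":
    print("child hung:", child_hangs())
```

Reading from `parent_end` in the parent while the child runs should let the child finish, which is the behaviour the drain suggestion further down relies on.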

Diagnosis

We diagnosed this on a live runner pod using /proc forensics:

Process tree

PID 8:  node              (Trigger.dev runtime)      wchan=ep_poll
PID 19: trigger-dev-wor   (intermediary worker)      wchan=ep_poll
PID 30: python            (user subprocess)          wchan=sock_alloc_send_pskb  ← STUCK

Python's file descriptors

fd 0 → socket (blocking, Unix)   ← stdin
fd 1 → socket (blocking, Unix)   ← stdout
fd 2 → socket (blocking, Unix)   ← stderr  ← BLOCKED HERE

All three stdio fds are blocking Unix socketpairs (flags: 02, no O_NONBLOCK) connected to the trigger-dev-worker process. The kernel send buffer (net.core.wmem_default) is 212992 bytes (~208KB).
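The fd checks above can be scripted. This is a minimal sketch assuming a Linux `/proc` filesystem; `describe_fd` and `read_wchan` are hypothetical helper names, not part of any Trigger.dev tooling. It reads the same `flags:` field from `/proc/<pid>/fdinfo/<fd>` (an octal value) and tests it against `O_NONBLOCK`.

```python
import os
import socket


def describe_fd(fd: int, pid: str = "self") -> dict:
    """Report what an fd points at and whether O_NONBLOCK is set."""
    target = os.readlink(f"/proc/{pid}/fd/{fd}")  # e.g. "socket:[12345]"
    flags = 0
    with open(f"/proc/{pid}/fdinfo/{fd}") as fdinfo:
        for line in fdinfo:
            if line.startswith("flags:"):
                flags = int(line.split()[1], 8)  # the kernel prints octal
    return {
        "target": target,
        "is_socket": target.startswith("socket:"),
        "nonblocking": bool(flags & os.O_NONBLOCK),
    }


def read_wchan(pid: str) -> str:
    """Kernel symbol the process is sleeping in (may be '0' or restricted)."""
    with open(f"/proc/{pid}/wchan") as wchan:
        return wchan.read().strip()


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(describe_fd(a.fileno()))  # a freshly created socketpair end is blocking
    a.close()
    b.close()
```

Passing the subprocess PID as a string (for example `describe_fd(2, pid="30")`) inspects another process, given sufficient permissions.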

When Python fills this buffer with stderr writes, the write() syscall blocks in sock_alloc_send_pskb and never returns. The process hangs indefinitely.
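To see how much an undrained socketpair accepts on a given host, a rough measurement is sketched below. It switches the writer to non-blocking mode so that a full buffer raises `BlockingIOError` instead of hanging; `measure_send_capacity` is a hypothetical name, and the number of bytes accepted may differ from `SO_SNDBUF` because of kernel bookkeeping overhead.

```python
import socket


def measure_send_capacity(chunk_size: int = 4096) -> tuple:
    """Return (SO_SNDBUF value, bytes accepted before the socket would block)."""
    writer, reader = socket.socketpair()
    sndbuf = writer.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
    writer.setblocking(False)  # a blocking socket would hang here instead
    chunk = b"x" * chunk_size
    accepted = 0
    try:
        while True:
            accepted += writer.send(chunk)  # reader never calls recv()
    except BlockingIOError:
        pass  # buffer is full; a blocking write() would now sleep
    finally:
        writer.close()
        reader.close()
    return sndbuf, accepted


if __name__ == "__main__":
    sndbuf, accepted = measure_send_capacity()
    print(f"SO_SNDBUF={sndbuf} bytes, accepted={accepted} bytes")
```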

Verification

We confirmed by redirecting stderr to /dev/null before the heavy imports:

import os, sys
os.dup2(os.open(os.devnull, os.O_WRONLY), 2)
sys.stderr = open(2, "w")
# Heavy imports now complete in 34 seconds instead of hanging forever

Two Issues

1. stderr uses blocking sockets with no drain guarantee

The trigger-dev-worker process should either:

  • Use non-blocking sockets for the Python subprocess's stdio (falling back to dropping/buffering on backpressure)
  • Actively drain stderr in a dedicated read loop, independent of stdout processing
  • Provide a configurable stderr option (e.g., stderr: "null" or stderr: "pipe" with async drain)
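The second option, a dedicated stderr read loop, can be illustrated as follows. This is a Python sketch of the general technique only; the real worker is a Node process and its internals are not shown in this report, so `run_and_drain` is purely a hypothetical helper. A background thread keeps reading stderr while the main thread reads stdout, so neither stream can fill up and block the child.

```python
import subprocess
import sys
import threading


def run_and_drain(argv: list) -> tuple:
    """Run argv, draining stderr on its own thread; return (rc, stdout, stderr)."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr_chunks = []

    def drain_stderr() -> None:
        # Read until EOF regardless of what happens on stdout.
        for chunk in iter(lambda: proc.stderr.read(65536), b""):
            stderr_chunks.append(chunk)

    drainer = threading.Thread(target=drain_stderr, daemon=True)
    drainer.start()
    stdout = proc.stdout.read()  # stdout is consumed independently
    drainer.join()
    returncode = proc.wait()
    return returncode, stdout, b"".join(stderr_chunks)


if __name__ == "__main__":
    code = "import sys; sys.stderr.write('x' * 300_000); print('ok')"
    rc, out, err = run_and_drain([sys.executable, "-c", code])
    print(rc, out, len(err))
```

In practice the drained chunks would be forwarded to the log pipeline rather than accumulated in memory; the list here just keeps the sketch short.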

2. Parent environment leaks into python.runScript()

The env option in python.runScript() merges with process.env rather than replacing it. This leaks Trigger.dev's internal environment variables into the Python subprocess:

OTEL_EXPORTER_OTLP_ENDPOINT=http://trigger-otel.trigger.svc.cluster.local:3000/otel
OTEL_LOG_LEVEL=DEBUG
OTEL_RESOURCE_ATTRIBUTES=exec_env=trigger,...
TRIGGER_OTEL_EXPORTER_OTLP_ENDPOINT=...

Any OTEL-aware Python library (e.g., mlflow, opentelemetry-sdk) picks these up and:

  • Initializes tracing connections to the OTEL collector
  • Produces verbose debug-level log output to stderr (due to OTEL_LOG_LEVEL=DEBUG)

This directly exacerbates issue 1: the debug logs fill the stderr buffer faster.

Suggestion: Either replace the env entirely when env is provided, or filter out OTEL_* and TRIGGER_* prefixed vars from the inherited environment.
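The filtering variant of this suggestion could look roughly like the sketch below. It is written in Python for illustration; `build_child_env` and its parameters are hypothetical and do not reflect how `@trigger.dev/python` currently builds the environment. Inherited variables with a blocked prefix are dropped first, then the caller's `env` is applied on top, so anything the caller sets explicitly is still passed through.

```python
import os


def build_child_env(
    parent_env: dict,
    overrides: dict,
    blocked_prefixes: tuple = ("OTEL_", "TRIGGER_"),
) -> dict:
    """Drop inherited vars with a blocked prefix, then apply explicit overrides."""
    env = {
        key: value
        for key, value in parent_env.items()
        if not key.startswith(blocked_prefixes)  # str.startswith accepts a tuple
    }
    env.update(overrides)  # explicit caller values always win
    return env


if __name__ == "__main__":
    child_env = build_child_env(dict(os.environ), {"PYTHONUNBUFFERED": "1"})
    print(sorted(k for k in child_env if k.startswith(("OTEL_", "TRIGGER_"))))
```

One consequence of filtering `TRIGGER_*` is that a script relying on an inherited variable such as `TRIGGER_RUN_ID` would need it passed explicitly through `env`.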

Our Workaround

We applied three workarounds on our side:

env: {
  // Prevent OpenBLAS/OpenMP thread deadlocks in containers
  OPENBLAS_NUM_THREADS: "1",
  OMP_NUM_THREADS: "1",
  MKL_NUM_THREADS: "1",
  // Suppress inherited OTEL env vars
  OTEL_SDK_DISABLED: "true",
  OTEL_EXPORTER_OTLP_ENDPOINT: "",
  OTEL_LOG_LEVEL: "",
},

# In our Python entrypoint, before any other imports:
import os
import sys

if os.environ.get("TRIGGER_RUN_ID"):
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull_fd, 2)
    os.close(devnull_fd)
    sys.stderr = open(2, "w")

This fixes the hang but means we lose all Python log output in the Trigger.dev dashboard.
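A possible variation, which we have not validated in production, is to point fd 2 at a regular file instead of /dev/null. Writes to a file do not wait on a reader, so the socket backpressure should not apply, and the output stays available for later inspection. The function name and the log path below are hypothetical.

```python
import os
import sys


def redirect_stderr_to_file(path: str) -> None:
    """Point fd 2 at an append-only log file instead of the inherited socket."""
    sys.stderr.flush()  # push out anything already buffered
    log_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    if log_fd != 2:
        os.dup2(log_fd, 2)  # fd 2 now refers to the file
        os.close(log_fd)


if __name__ == "__main__":
    # Hypothetical path; any writable location in the container would do.
    redirect_stderr_to_file("/tmp/python_stderr.log")
    print("this line goes to the log file", file=sys.stderr)
```

The calling task would still need to read the file and forward its contents for the logs to appear in the dashboard.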
