Summary
`python.runScript()` from `@trigger.dev/python` permanently deadlocks the Python subprocess if it writes more than ~208KB to stderr. The root cause is that stderr is connected via a blocking Unix socketpair with a finite kernel buffer, and the `trigger-dev-worker` intermediary process does not drain it fast enough (or at all under certain conditions).
This is a silent, permanent hang — no error, no timeout, no crash. The task appears to be running but never completes.
Environment
- `@trigger.dev/sdk` v4 (latest)
- `@trigger.dev/python` extension with custom Python 3.13 layer
- Machine preset: `large-1x`
- Self-hosted Trigger.dev on EKS
Reproduction
Any `python.runScript()` call where the Python process produces substantial stderr output during startup will deadlock. In our case, importing `numpy`, `scikit-learn`, `xgboost`, and `mlflow` during task startup is sufficient, especially because the parent environment leaks `OTEL_LOG_LEVEL=DEBUG` (see issue 2 below).
Minimal reproduction:
```ts
// In a Trigger.dev task
const result = await python.runScript("my_script.py", [], {
  env: { PYTHONUNBUFFERED: "1" },
});
```
```python
# my_script.py
import sys

# Write more than ~208KB to stderr → deadlock
sys.stderr.write("x" * 300_000)
print('{"ok": true}')  # never reached
```
Diagnosis
We diagnosed this on a live runner pod using `/proc` forensics:
Process tree
```text
PID 8:  node            (Trigger.dev runtime)   wchan=ep_poll
PID 19: trigger-dev-wor (intermediary worker)   wchan=ep_poll
PID 30: python          (user subprocess)       wchan=sock_alloc_send_pskb  ← STUCK
```
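For reference, the wait channels and fd details can be pulled with plain `/proc` reads. A sketch of the commands we used (shown here against the current shell's own PID; on the pod we ran them against PIDs 8, 19, and 30):

```shell
pid=$$                                   # substitute the stuck python PID (30 in our case)
cat "/proc/$pid/wchan"; echo             # kernel wait channel, e.g. sock_alloc_send_pskb
ls -l "/proc/$pid/fd"                    # stdio fds -> socket:[inode] on the runner
cat "/proc/$pid/fdinfo/2"                # flags: 02 means blocking (no O_NONBLOCK)
cat /proc/sys/net/core/wmem_default      # kernel socket send buffer size
```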
Python's file descriptors
```text
fd 0 → socket (blocking, Unix)  ← stdin
fd 1 → socket (blocking, Unix)  ← stdout
fd 2 → socket (blocking, Unix)  ← stderr  ← BLOCKED HERE
```
All three stdio fds are blocking Unix socketpairs (`flags: 02`, no `O_NONBLOCK`) connected to the `trigger-dev-worker` process. The kernel send buffer (`net.core.wmem_default`) is 212992 bytes (~208KB).
When Python fills this buffer with stderr writes, the `write()` syscall blocks in `sock_alloc_send_pskb` and never returns. The process hangs indefinitely.
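The ~208KB ceiling is easy to reproduce outside of Trigger.dev: fill one end of a Unix socketpair in non-blocking mode and count the bytes accepted before the kernel would otherwise block the writer. A sketch (the exact capacity depends on `net.core.wmem_default` and kernel buffer accounting):

```python
import socket

# A SOCK_STREAM Unix socketpair, like the worker's stdio wiring.
a, b = socket.socketpair()
a.setblocking(False)  # non-blocking, so we get an error instead of the hang

total = 0
try:
    while True:
        total += a.send(b"x" * 65536)  # send() returns the bytes accepted
except BlockingIOError:
    pass  # buffer full: a blocking write would stall right here

print(total)  # ≈ 212992 on our nodes; varies with kernel settings
```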
Verification
We confirmed this by redirecting stderr to `/dev/null` before the heavy imports:
```python
import os, sys

# Point fd 2 at /dev/null (closing the temporary fd so it doesn't leak),
# then rebuild sys.stderr on top of the redirected descriptor.
devnull_fd = os.open(os.devnull, os.O_WRONLY)
os.dup2(devnull_fd, 2)
os.close(devnull_fd)
sys.stderr = open(2, "w", closefd=False)

# Heavy imports now complete in 34 seconds instead of hanging forever
```
Two Issues
1. stderr uses blocking sockets with no drain guarantee
The `trigger-dev-worker` process should do one of the following:
- Use non-blocking sockets for the Python subprocess's stdio (falling back to dropping or buffering on backpressure)
- Actively drain stderr in a dedicated read loop, independent of stdout processing
- Provide a configurable `stderr` option (e.g., `stderr: "null"` or `stderr: "pipe"` with an async drain)
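For illustration, the dedicated-read-loop fix can be sketched in Python (this is a sketch of the technique, not `trigger-dev-worker`'s actual code): a background thread consumes stderr continuously, so the child can never block on it, no matter how much it writes:

```python
import subprocess
import sys
import threading

def run_with_drained_stderr(cmd):
    """Run cmd, draining stderr on a separate thread so the child never blocks."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    chunks = []

    def drain():
        # Consume stderr continuously, independent of stdout processing.
        for chunk in iter(lambda: proc.stderr.read(65536), b""):
            chunks.append(chunk)

    t = threading.Thread(target=drain, daemon=True)
    t.start()
    out = proc.stdout.read()  # safe: stderr is being drained concurrently
    proc.wait()
    t.join()
    return out, b"".join(chunks)

# Child writes 300KB to stderr (well past any pipe/socket buffer) and still completes:
out, err = run_with_drained_stderr([
    sys.executable, "-c",
    "import sys; sys.stderr.write('x' * 300_000); print('ok')",
])
```

Without the drain thread, the same child deadlocks exactly as described above: the parent waits on stdout while the child blocks writing stderr.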
2. Parent environment leaks into `python.runScript()`
The `env` option in `python.runScript()` merges with `process.env` rather than replacing it. This leaks Trigger.dev's internal environment variables into the Python subprocess:
```text
OTEL_EXPORTER_OTLP_ENDPOINT=http://trigger-otel.trigger.svc.cluster.local:3000/otel
OTEL_LOG_LEVEL=DEBUG
OTEL_RESOURCE_ATTRIBUTES=exec_env=trigger,...
TRIGGER_OTEL_EXPORTER_OTLP_ENDPOINT=...
```
Any OTEL-aware Python library (e.g., `mlflow`, `opentelemetry-sdk`) picks these up and:
- Initializes tracing connections to the OTEL collector
- Produces verbose debug-level log output to stderr (due to `OTEL_LOG_LEVEL=DEBUG`)
This directly exacerbates issue #1: the debug logs fill the stderr buffer faster.
Suggestion: Either replace the environment entirely when `env` is provided, or filter out `OTEL_*` and `TRIGGER_*` prefixed vars from the inherited environment.
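The filtering variant is a one-liner. A hypothetical helper (not part of the SDK) showing the intended behavior:

```python
import os

def sanitized_env(extra=None):
    """Build a child environment that drops trigger.dev / OTEL internals
    instead of inheriting them, then layers the caller's vars on top."""
    env = {
        k: v
        for k, v in os.environ.items()
        if not k.startswith(("OTEL_", "TRIGGER_"))
    }
    env.update(extra or {})
    return env

os.environ["OTEL_LOG_LEVEL"] = "DEBUG"  # simulate the leaked parent var
child_env = sanitized_env({"PYTHONUNBUFFERED": "1"})
```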
Our Workaround
We applied three workarounds on our side:
```ts
env: {
  // Prevent OpenBLAS/OpenMP thread deadlocks in containers
  OPENBLAS_NUM_THREADS: "1",
  OMP_NUM_THREADS: "1",
  MKL_NUM_THREADS: "1",
  // Suppress inherited OTEL env vars
  OTEL_SDK_DISABLED: "true",
  OTEL_EXPORTER_OTLP_ENDPOINT: "",
  OTEL_LOG_LEVEL: "",
},
```
```python
# At the top of our Python entrypoint, before any heavy imports:
import os
import sys

if os.environ.get("TRIGGER_RUN_ID"):
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull_fd, 2)
    os.close(devnull_fd)
    sys.stderr = open(2, "w", closefd=False)
```
This fixes the hang but means we lose all Python log output in the Trigger.dev dashboard.