python.runScript() deadlocks when Python subprocess produces >208KB stderr output

## Summary

`python.runScript()` from `@trigger.dev/python` permanently deadlocks the Python subprocess if it writes more than ~208KB to stderr. The root cause is that stderr is connected via a **blocking Unix socketpair** with a finite kernel buffer, and the `trigger-dev-worker` intermediary process does not drain it fast enough (or at all under certain conditions).

This is a silent, permanent hang — no error, no timeout, no crash. The task appears to be running but never completes.

## Environment

- `@trigger.dev/sdk` v4 (latest)
- `@trigger.dev/python` extension with custom Python 3.13 layer
- Machine preset: `large-1x`
- Self-hosted Trigger.dev on EKS

## Reproduction

Any `python.runScript()` call where the Python process produces substantial stderr output during startup will deadlock. In our case, importing `numpy`, `scikit-learn`, `xgboost`, and `mlflow` during task startup is sufficient — especially because the parent environment leaks `OTEL_LOG_LEVEL=DEBUG` (see issue 2 below).

Minimal reproduction:

```typescript
// In a Trigger.dev task
const result = await python.runScript("my_script.py", [], {
  env: { PYTHONUNBUFFERED: "1" },
});
```

```python
# my_script.py
import sys
# Write more than ~208KB to stderr → deadlock
sys.stderr.write("x" * 300_000)
print('{"ok": true}')  # never reached
```

## Diagnosis

We diagnosed this on a live runner pod using `/proc` forensics:

### Process tree

```
PID 8:  node              (Trigger.dev runtime)      wchan=ep_poll
PID 19: trigger-dev-wor   (intermediary worker)      wchan=ep_poll
PID 30: python            (user subprocess)          wchan=sock_alloc_send_pskb  ← STUCK
```

### Python's file descriptors

```
fd 0 → socket (blocking, Unix)   ← stdin
fd 1 → socket (blocking, Unix)   ← stdout
fd 2 → socket (blocking, Unix)   ← stderr  ← BLOCKED HERE
```

All three stdio fds are **blocking** Unix socketpairs (`flags: 02`, no `O_NONBLOCK`) connected to the `trigger-dev-worker` process. The kernel send buffer (`net.core.wmem_default`) is 212992 bytes (~208KB).
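For anyone who wants to repeat the check, here is a minimal sketch of how the flags can be read from `/proc` on Linux. The helper names `parse_fdinfo_flags` and `describe_fd` are ours and purely illustrative. The sketch assumes the usual `/proc/<pid>/fdinfo/<fd>` layout, where `flags:` is printed in octal.

```python
import os
import re


def parse_fdinfo_flags(fdinfo_text: str) -> int:
    """Extract the octal `flags:` value from the contents of /proc/<pid>/fdinfo/<fd>."""
    match = re.search(r"^flags:\s*([0-7]+)$", fdinfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no flags line found in fdinfo text")
    return int(match.group(1), 8)


def describe_fd(pid: int, fd: int) -> dict:
    """Summarise one file descriptor of a running process (Linux only)."""
    with open(f"/proc/{pid}/fdinfo/{fd}") as f:
        flags = parse_fdinfo_flags(f.read())
    return {
        "target": os.readlink(f"/proc/{pid}/fd/{fd}"),  # e.g. "socket:[12345]"
        "flags_octal": oct(flags),
        "nonblocking": bool(flags & os.O_NONBLOCK),
    }


if __name__ == "__main__":
    # Inspect our own stdio as a smoke test; pass the stuck PID instead on a runner pod.
    for fd in (0, 1, 2):
        print(fd, describe_fd(os.getpid(), fd))
```

A value of `02` is just `O_RDWR`. A non-blocking descriptor would also carry the `O_NONBLOCK` bit (`04000` on Linux).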

When Python fills this buffer with stderr writes, the `write()` syscall blocks in `sock_alloc_send_pskb` and never returns. The process hangs indefinitely.
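The finite capacity is easy to observe outside Trigger.dev. The sketch below (the function name `measure_socketpair_capacity` is hypothetical) creates a socketpair that nobody reads from and counts how many bytes the writer can queue. It switches the writer to non-blocking mode only so that the limit surfaces as an exception. With a blocking socket, as in the runner, the same `send` would sleep instead. The exact count depends on kernel settings and per-packet accounting, so it may not match `wmem_default` byte for byte.

```python
import socket


def measure_socketpair_capacity(chunk_size: int = 4096) -> int:
    """Return how many bytes a Unix socketpair accepts when the peer never reads."""
    writer, reader = socket.socketpair()
    try:
        # Non-blocking only so the demo can see the limit instead of hanging.
        writer.setblocking(False)
        chunk = b"x" * chunk_size
        sent = 0
        while True:
            try:
                sent += writer.send(chunk)
            except BlockingIOError:
                # A blocking socket would sleep here until the peer reads.
                return sent
    finally:
        writer.close()
        reader.close()


if __name__ == "__main__":
    print(f"accepted {measure_socketpair_capacity()} bytes before the write would block")
```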

### Verification

We confirmed by redirecting stderr to `/dev/null` before the heavy imports:

```python
import os, sys
os.dup2(os.open(os.devnull, os.O_WRONLY), 2)
sys.stderr = open(2, "w")
# Heavy imports now complete in 34 seconds instead of hanging forever
```

## Two Issues

### 1. stderr uses blocking sockets with no drain guarantee

The `trigger-dev-worker` process should either:
- Use **non-blocking** sockets for the Python subprocess's stdio (falling back to dropping/buffering on backpressure)
- **Actively drain** stderr in a dedicated read loop, independent of stdout processing
- Provide a configurable `stderr` option (e.g., `stderr: "null"` or `stderr: "pipe"` with async drain)
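
To illustrate the second option, here is a minimal sketch of a dedicated stderr drain loop. It is written in Python with `subprocess` pipes rather than the worker's actual socketpair setup, so it shows the technique and not the `trigger-dev-worker` implementation. `run_with_drained_stderr` and `CHILD` are hypothetical names.

```python
import subprocess
import sys
import threading

# Child that mirrors the reproduction: a large stderr write, then a JSON line on stdout.
CHILD = 'import sys; sys.stderr.write("x" * 300_000); print(\'{"ok": true}\')'


def run_with_drained_stderr(argv: list[str]) -> tuple[int, str, int]:
    """Run argv, draining stderr on its own thread so the child never blocks on it."""
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stderr_bytes = 0

    def drain() -> None:
        nonlocal stderr_bytes
        # read1() returns as soon as some data is available; b"" means EOF.
        for chunk in iter(lambda: proc.stderr.read1(65536), b""):
            stderr_bytes += len(chunk)  # a real worker would forward this to its logs

    drainer = threading.Thread(target=drain, daemon=True)
    drainer.start()
    stdout = proc.stdout.read().decode()  # stdout handling is independent of stderr
    returncode = proc.wait()
    drainer.join()
    return returncode, stdout, stderr_bytes


if __name__ == "__main__":
    print(run_with_drained_stderr([sys.executable, "-c", CHILD]))
```

Because stderr is read continuously regardless of what happens on stdout, the child's 300,000-byte write completes and the JSON line is reached.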

### 2. Parent environment leaks into `python.runScript()`

The `env` option in `python.runScript()` **merges** with `process.env` rather than replacing it. This leaks Trigger.dev's internal environment variables into the Python subprocess:

```
OTEL_EXPORTER_OTLP_ENDPOINT=http://trigger-otel.trigger.svc.cluster.local:3000/otel
OTEL_LOG_LEVEL=DEBUG
OTEL_RESOURCE_ATTRIBUTES=exec_env=trigger,...
TRIGGER_OTEL_EXPORTER_OTLP_ENDPOINT=...
```

Any OTEL-aware Python library (e.g., `mlflow`, `opentelemetry-sdk`) picks these up and:
- Initializes tracing connections to the OTEL collector
- Produces verbose debug-level log output to stderr (due to `OTEL_LOG_LEVEL=DEBUG`)

This directly exacerbates issue 1: the debug logs fill the stderr buffer faster.

**Suggestion:** Either replace the env entirely when `env` is provided, or filter out `OTEL_*` and `TRIGGER_*` prefixed vars from the inherited environment.
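
The filtering rule could look roughly like the sketch below. It is written in Python only to show the logic; the real change would presumably live in the SDK code that spawns the subprocess, which we have not inspected. `filter_inherited_env` is a hypothetical name, and the prefix list is the one suggested above.

```python
import os
from collections.abc import Mapping


def filter_inherited_env(
    parent: Mapping[str, str],
    overrides: Mapping[str, str],
    blocked_prefixes: tuple[str, ...] = ("OTEL_", "TRIGGER_"),
) -> dict[str, str]:
    """Drop blocked variables from the parent env, then apply the caller's overrides."""
    inherited = {
        key: value
        for key, value in parent.items()
        if not key.startswith(blocked_prefixes)
    }
    # Caller-supplied values always win, even if they use a blocked prefix.
    return {**inherited, **overrides}


if __name__ == "__main__":
    child_env = filter_inherited_env(os.environ, {"PYTHONUNBUFFERED": "1"})
    print(sorted(child_env))
```

Applying the overrides last means a caller can still opt in to a specific `OTEL_*` variable explicitly.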

## Our Workaround

We applied three workarounds on our side:

```typescript
env: {
  // Prevent OpenBLAS/OpenMP thread deadlocks in containers
  OPENBLAS_NUM_THREADS: "1",
  OMP_NUM_THREADS: "1",
  MKL_NUM_THREADS: "1",
  // Suppress inherited OTEL env vars
  OTEL_SDK_DISABLED: "true",
  OTEL_EXPORTER_OTLP_ENDPOINT: "",
  OTEL_LOG_LEVEL: "",
},
```

```python
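# In our Python entrypoint, before any other imports:
import os
import sys

if os.environ.get("TRIGGER_RUN_ID"):
    # Point fd 2 at /dev/null so stderr writes can never block on the socket
    devnull_fd = os.open(os.devnull, os.O_WRONLY)
    os.dup2(devnull_fd, 2)
    os.close(devnull_fd)
    sys.stderr = open(2, "w")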
```

This fixes the hang but means we lose all Python log output in the Trigger.dev dashboard.
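
A possible variation that keeps the output somewhere is to point fd 2 at a file instead of `/dev/null`. We have not run this in production; it is a sketch under that caveat. `redirect_stderr_to_file` and the log path are hypothetical, and the file would still not appear in the dashboard unless something reads it back afterwards.

```python
import os
import sys


def redirect_stderr_to_file(path: str) -> int:
    """Point fd 2 at `path` (append mode) and return a duplicate of the previous fd 2."""
    saved_fd = os.dup(2)  # keep the old target so it can be restored later
    log_fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    os.dup2(log_fd, 2)
    os.close(log_fd)
    # Line-buffered wrapper; closefd=False so garbage collection never closes fd 2.
    sys.stderr = open(2, "w", buffering=1, closefd=False)
    return saved_fd


if __name__ == "__main__":
    redirect_stderr_to_file("/tmp/python-stderr.log")  # hypothetical path
    print("this line lands in the log file", file=sys.stderr)
```

Writes to a regular file do not wait on a reader, so the socket buffer limit no longer applies.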
