feat: MLflow tracing with async Stop hook (opt-in) #139

Closed
dgokeeffe wants to merge 1 commit into datasciencemonkey:main from dgokeeffe:feat/mlflow-tracing

Conversation

@dgokeeffe
Contributor

Summary

  • Enables opt-in MLflow tracing for Claude Code sessions via MLFLOW_CLAUDE_TRACING_ENABLED=true in app.yaml
  • Stop hook delegates to mlflow-trace-stop.sh, which backgrounds the handler via nohup timeout 30 … & disown — returns in <1s so the Stop chain (brain-push, /til, etc.) is not blocked
  • Handler receives hook-event JSON via a temp file captured synchronously before backgrounding (naive nohup would redirect stdin to /dev/null)
  • Hard 30s ceiling on the backgrounded flush to prevent stuck handlers leaking memory/CPU
  • Pins mlflow-skinny and mlflow-tracing to 3.11.1 to match the Apps runtime image (version mismatch caused silent import failures)
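
For reviewers, the backgrounding pattern described in the bullets above can be sketched roughly as follows. This is not the contents of mlflow-trace-stop.sh: the function name `stop_hook_async` and the convention of passing the handler command as arguments are assumptions made for illustration, and the real handler invocation stays elided (`…`) in this description. The sketch also wraps `timeout` in `sh -c` so the temp file is removed once the handler exits, which is a choice of this example rather than something the PR states. It assumes GNU `timeout` and `nohup` are on the PATH.

```bash
# Hypothetical sketch of an async Stop-hook wrapper (not the PR's script).
# Usage: <hook JSON on stdin> | stop_hook_async HANDLER [ARGS...]
stop_hook_async() {
  local payload
  payload=$(mktemp) || return 1

  # Capture the hook-event JSON synchronously, while stdin is still attached.
  cat > "$payload"

  # Run the handler detached, with a hard 30s ceiling, reading the saved
  # payload on stdin. The temp file is deleted when the handler exits or
  # is killed by timeout.
  nohup sh -c 'payload=$1; shift
               timeout 30 "$@" < "$payload"
               rm -f "$payload"' _ "$payload" "$@" >/dev/null 2>&1 &

  # disown is a bash builtin; the failure is ignored in shells that lack it.
  disown 2>/dev/null || true
}
```

Because the function returns as soon as the background job is launched, later hooks in the Stop chain do not wait on the handler.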

Tracing is disabled by default — no behaviour change for existing deployments.

Test plan

  • uv run pytest tests/test_mlflow_tracing.py — all pass locally
  • Deploy with MLFLOW_CLAUDE_TRACING_ENABLED=true, run a session, confirm trace appears in MLflow experiment
  • Verify Stop hook returns in <2s (doesn't block session teardown)
  • Verify that without the env var, no hook is registered in ~/.claude/settings.json
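
One way to check the timing item above is a small helper that fails when a command takes longer than a given budget. This is a sketch only: `returns_within` is a hypothetical name, and it relies on GNU `date` for nanosecond timestamps.

```bash
# Hypothetical helper: succeed only if CMD returns in under LIMIT_MS.
# Usage: returns_within LIMIT_MS CMD [ARGS...]
returns_within() {
  local limit_ms=$1 start end
  shift
  start=$(date +%s%N)   # nanoseconds since the epoch (GNU date)
  "$@" >/dev/null 2>&1  # stdin is inherited, so hook JSON can be piped in
  end=$(date +%s%N)
  [ $(( (end - start) / 1000000 )) -lt "$limit_ms" ]
}
```

For example, `printf '{}' | returns_within 2000 path/to/mlflow-trace-stop.sh` would exercise the 2s budget; the script path here is a placeholder, since this description does not say where the script is installed.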

This pull request and its description were written by Isaac.

Enables opt-in MLflow tracing for Claude Code sessions. Key design:

- setup_mlflow.py registers a Stop hook when MLFLOW_CLAUDE_TRACING_ENABLED=true
- Hook delegates to mlflow-trace-stop.sh which backgrounds the handler via
  `nohup timeout 30 ... & disown`, returning in <1s so the Stop chain
  (brain-push, /til, etc.) is not blocked
- Handler receives hook-event JSON via a temp file captured synchronously
  before backgrounding (naive nohup would redirect stdin to /dev/null)
- Hard 30s ceiling on the backgrounded flush to prevent stuck handlers
  leaking memory/CPU across sessions
- Pins mlflow-skinny and mlflow-tracing to 3.11.1 to match the Apps
  runtime image (version mismatch caused silent import failures)

Tracing is disabled by default — set MLFLOW_CLAUDE_TRACING_ENABLED=true
in app.yaml to opt in.

Tests: TestStopHook and TestSettingsMerge updated to match shell-script
       delegation model; TestAppOwnerExport mocks app_state.set_app_owner
       to avoid ~/.coda writes in unit test context.

Co-authored-by: Isaac
@dgokeeffe
Contributor Author

Migrating to the new repo home. This work continues at databrickslabs/coding-agents-databricks-apps#15 (also resolves the MLflow tracing bug filed at databrickslabs/coding-agents-databricks-apps#9). Closing this stale duplicate.

@dgokeeffe dgokeeffe closed this May 6, 2026