fix: prevent settings.json race that drops MLflow env vars by datasciencemonkey · Pull Request #154 · datasciencemonkey/coding-agents-databricks-apps

datasciencemonkey · 2026-04-30T11:08:33Z

Summary

setup_claude.py and _configure_all_cli_auth() were overwriting ~/.claude/settings.json from scratch, nuking MLflow env vars added by setup_mlflow.py
setup_mlflow.py ran in parallel with setup_claude.py, creating a race condition on the same file

Changes

setup_claude.py: read-merge-write instead of overwrite — preserves existing env vars (hooks, MLflow config, etc.)
app.py:_configure_all_cli_auth(): same read-merge-write pattern
app.py:run_setup(): setup_mlflow.py now runs sequentially after the parallel agent setup batch, not inside it
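
The read-merge-write pattern named above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the helper name `merge_settings`, the choice to merge only the nested `env` mapping key by key, and the temp-file write are all assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def merge_settings(path: Path, updates: dict) -> dict:
    """Read existing JSON settings, merge `updates` in, and write the result back.

    Hypothetical helper: name and merge rules are assumptions for illustration.
    """
    try:
        current = json.loads(path.read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        current = {}  # missing or unreadable file: start from an empty config
    if not isinstance(current, dict):
        current = {}

    # Top-level keys from `updates` win, but the nested "env" mapping is
    # merged key by key so entries written by other scripts survive.
    merged = {**current, **updates}
    old_env = current.get("env") if isinstance(current.get("env"), dict) else {}
    merged["env"] = {**old_env, **updates.get("env", {})}

    # Write to a temp file in the same directory, then rename over the target,
    # so a reader never sees a half-written file.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(merged, handle, indent=2)
    os.replace(tmp_name, path)
    return merged
```

Merging on its own does not make two concurrent writers safe, since both can read the same old state before either writes. That is why the PR also moves setup_mlflow.py out of the parallel batch.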

Test plan

CoDA tests pass except for 4 pre-existing test isolation failures unrelated to this fix (131/135)
Gateway discovery tests pass in isolation (7/7)
Telemetry tests pass (17/17)
Deploy to Databricks Apps and verify MLflow env vars persist in ~/.claude/settings.json after setup completes

Fixes #153

Two changes:

1. setup_claude.py and _configure_all_cli_auth() now read-merge-write settings.json instead of overwriting it. This preserves env vars added by other setup scripts (e.g. setup_mlflow.py).
2. setup_mlflow.py now runs sequentially after the parallel agent setup batch, not inside it. This eliminates the race where setup_claude.py and setup_mlflow.py both write settings.json concurrently.

Fixes #153
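
The ordering change could look something like the sketch below. It assumes the setup scripts are launched as subprocesses from a thread pool; the actual mechanism in app.py:run_setup() is not shown in this PR description, and `run_script`, `PARALLEL_SCRIPTS` and `SEQUENTIAL_SCRIPTS` are hypothetical names.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical grouping: the agent setup scripts run together, and
# setup_mlflow.py runs only after all of them have finished.
PARALLEL_SCRIPTS = ["setup_claude.py", "setup_codex.py", "setup_gemini.py", "setup_opencode.py"]
SEQUENTIAL_SCRIPTS = ["setup_mlflow.py"]


def run_script(name: str) -> int:
    """Run one setup script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, name], capture_output=True, text=True).returncode


def run_setup(runner=run_script) -> dict:
    """Run the parallel batch, wait for it, then run the sequential scripts."""
    with ThreadPoolExecutor() as pool:
        results = dict(zip(PARALLEL_SCRIPTS, pool.map(runner, PARALLEL_SCRIPTS)))
    # Leaving the with-block waits for every parallel task, so nothing below
    # can overlap with a script from the batch above.
    for name in SEQUENTIAL_SCRIPTS:
        results[name] = runner(name)
    return results
```

Passing the runner as a parameter keeps the ordering logic testable without launching real scripts.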

- setup_codex.py, setup_gemini.py, setup_opencode.py: retry npm install up to 3 times with 5s delay, print full stderr/stdout on failure
- tests/test_session_limit.py: clear leaked sessions before slot-freeing test
- .gitignore: add codex and agent-plane-ref entries
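
A retry wrapper along these lines would give the behavior described for the npm installs (up to 3 attempts, 5s apart, full output printed on failure). This is a sketch under assumptions: the function name `run_with_retry` and its parameters are hypothetical, and the real scripts may structure this differently.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, delay=5.0, sleep=time.sleep) -> bool:
    """Run `cmd` up to `attempts` times and return True on the first success.

    Hypothetical helper. `sleep` is injectable so tests need not wait.
    """
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return True
        # Print everything the command produced so failures are diagnosable.
        print(f"attempt {attempt}/{attempts} failed (exit code {result.returncode})")
        print("stdout:", result.stdout)
        print("stderr:", result.stderr)
        if attempt < attempts:
            sleep(delay)  # pause before the next attempt, not after the last
    return False
```

A call might look like `run_with_retry(["npm", "install", "-g", "some-cli-package"])`, where the package name is a placeholder.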

datasciencemonkey self-assigned this Apr 30, 2026

datasciencemonkey added the bug ("Something isn't working") label Apr 30, 2026

datasciencemonkey force-pushed the fix/mlflow-settings-race-153 branch from 11f539e to 0cbe1b6 Compare April 30, 2026 16:34

datasciencemonkey added 5 commits April 30, 2026 13:34

fix: clear stale sessions before TestListSessions to fix test isolation

43fbfe7

The test_returns_empty_list test failed when run after other test classes because sessions leaked across classes. Adding session cleanup at fixture setup (not just teardown) fixes the ordering dependency.
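
The idea of cleaning up at setup as well as teardown is sketched below with the standard library's unittest, which may not be the fixture mechanism the project's tests use. The `SESSIONS` store, `list_sessions` and `TestCreateSession` are hypothetical stand-ins; only the test name test_returns_empty_list comes from the commit message.

```python
import unittest

SESSIONS = {}  # hypothetical module-level session store shared by all tests


def list_sessions():
    """Return all known sessions (stand-in for the code under test)."""
    return list(SESSIONS.values())


class TestCreateSession(unittest.TestCase):
    def test_creates_session(self):
        # Deliberately leaves a session behind to mimic a leaking test class.
        SESSIONS["a"] = {"id": "a"}
        self.assertEqual(len(list_sessions()), 1)


class TestListSessions(unittest.TestCase):
    def setUp(self):
        SESSIONS.clear()                 # clear at setup: drop leftovers from earlier classes
        self.addCleanup(SESSIONS.clear)  # and again at teardown

    def test_returns_empty_list(self):
        self.assertEqual(list_sessions(), [])
```

With cleanup only at teardown, test_returns_empty_list would depend on every earlier test class cleaning up after itself. Clearing at setup removes that ordering dependency.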

fix: remove duplicate skills copy in setup_gemini.py

c580943

Skills already live in ~/.agents/skills/ (copied by setup_codex.py). Copying them again into ~/.gemini/skills/ caused Gemini CLI to log "Skill conflict detected" warnings for every skill on startup.

chore: bump version to 0.18.1

46f05c6

datasciencemonkey force-pushed the fix/mlflow-settings-race-153 branch from f2c5ac1 to 46f05c6 Compare April 30, 2026 17:34

datasciencemonkey merged commit dccea2d into main Apr 30, 2026
1 check passed

datasciencemonkey merged 5 commits into main from fix/mlflow-settings-race-153
