fix: prevent settings.json race that drops MLflow env vars #154

Merged
datasciencemonkey merged 5 commits into main from fix/mlflow-settings-race-153
Apr 30, 2026

@datasciencemonkey
Owner

Summary

  • setup_claude.py and _configure_all_cli_auth() were overwriting ~/.claude/settings.json from scratch, nuking MLflow env vars added by setup_mlflow.py
  • setup_mlflow.py ran in parallel with setup_claude.py, creating a race condition on the same file

Changes

  1. setup_claude.py: read-merge-write instead of overwrite — preserves existing env vars (hooks, MLflow config, etc.)
  2. app.py:_configure_all_cli_auth(): same read-merge-write pattern
  3. app.py:run_setup(): setup_mlflow.py now runs sequentially after the parallel agent setup batch, not inside it
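
The read-merge-write pattern in changes 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the helper name `merge_settings`, the key-by-key merge of the `env` block, and the temp-file-plus-rename write are all assumptions made for the example.

```python
import json
import os
import tempfile
from pathlib import Path


def merge_settings(path: Path, updates: dict) -> dict:
    """Read settings.json, merge `updates` into it, and write it back.

    Hypothetical helper: the name and the merge policy are assumptions.
    """
    try:
        existing = json.loads(path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        existing = {}  # missing or unreadable file: start from empty
    if not isinstance(existing, dict):
        existing = {}

    # Top-level keys: new values win, untouched keys are preserved.
    merged = {**existing, **updates}
    # "env" is merged key by key so vars written by other scripts survive.
    merged["env"] = {**(existing.get("env") or {}), **(updates.get("env") or {})}

    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temp file in the same directory, then rename into place,
    # so a concurrent reader never sees a half-written file.
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(merged, f, indent=2)
    os.replace(tmp, path)
    return merged
```

Merging alone does not remove the race: two writers that read the same old file can still lose each other's updates, which is why change 3 also moves setup_mlflow.py out of the parallel batch.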

Test plan

  • CoDA test suite: 131 of 135 tests pass; the 4 failures are pre-existing test isolation failures unrelated to this fix
  • Gateway discovery tests pass in isolation (7/7)
  • Telemetry tests pass (17/17)
  • Deploy to Databricks Apps and verify MLflow env vars persist in ~/.claude/settings.json after setup completes

Fixes #153

@datasciencemonkey self-assigned this Apr 30, 2026
@datasciencemonkey added the bug label (Something isn't working) Apr 30, 2026
@datasciencemonkey force-pushed the fix/mlflow-settings-race-153 branch from 11f539e to 0cbe1b6 April 30, 2026 16:34
Two changes:

1. setup_claude.py and _configure_all_cli_auth() now read-merge-write
   settings.json instead of overwriting it. This preserves env vars
   added by other setup scripts (e.g. setup_mlflow.py).

2. setup_mlflow.py now runs sequentially after the parallel agent setup
   batch, not inside it. This eliminates the race where setup_claude.py
   and setup_mlflow.py both write settings.json concurrently.
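
A rough sketch of that ordering, assuming each setup script is wrapped in a zero-argument callable. The function name `run_in_phases` and its signature are hypothetical and do not reflect the real `run_setup()` in app.py:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_in_phases(parallel_steps: Iterable[Callable[[], None]],
                  sequential_steps: Iterable[Callable[[], None]]) -> None:
    """Run independent steps concurrently, then the remaining steps one by one."""
    steps = list(parallel_steps)
    with ThreadPoolExecutor(max_workers=max(1, len(steps))) as pool:
        futures = [pool.submit(step) for step in steps]
        for fut in futures:
            fut.result()  # re-raise any exception from a parallel step
    # The pool has fully drained here, so nothing below overlaps with it.
    for step in sequential_steps:
        step()
```

With the MLflow step passed in `sequential_steps`, it can only touch settings.json after every writer in the parallel batch has finished.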

Fixes #153

The test_returns_empty_list test failed when run after other test
classes because sessions leaked across classes. Cleaning up sessions
at fixture setup (not just at teardown) removes the ordering dependency.
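
The idea of cleaning up on entry as well as on exit can be illustrated as below. This is a sketch only: it uses the standard library's `unittest` rather than the project's actual fixtures, and the `SESSIONS` registry and `LeakyTests` class are hypothetical stand-ins.

```python
import unittest

# Hypothetical module-level session registry standing in for the app's real one.
SESSIONS = {}


class LeakyTests(unittest.TestCase):
    def test_creates_session(self):
        SESSIONS["abc"] = object()  # deliberately left behind, no teardown


class ListSessionsTests(unittest.TestCase):
    def setUp(self):
        SESSIONS.clear()                 # clean on entry: do not trust earlier classes
        self.addCleanup(SESSIONS.clear)  # and still clean on exit

    def test_returns_empty_list(self):
        self.assertEqual(list(SESSIONS), [])
```

In a pytest yield-fixture the equivalent would be to run the cleanup both before and after the `yield`.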
- setup_codex.py, setup_gemini.py, setup_opencode.py: retry npm install
  up to 3 times with 5s delay, print full stderr/stdout on failure
- tests/test_session_limit.py: clear leaked sessions before slot-freeing test
- .gitignore: add codex and agent-plane-ref entries
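
The retry behaviour in the first bullet might look something like this sketch. The helper name `run_with_retries` and its parameters are assumptions for illustration; the setup scripts' real code may be structured differently.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, attempts=3, delay=5.0):
    """Run `cmd`, retrying on a non-zero exit; return the last CompletedProcess."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        # Print the full output so a failed install can be diagnosed from logs.
        print(f"attempt {attempt}/{attempts} failed (exit {result.returncode})",
              file=sys.stderr)
        print(result.stdout, file=sys.stderr)
        print(result.stderr, file=sys.stderr)
        if attempt < attempts:
            time.sleep(delay)  # wait before the next try
    return result
```

A caller would pass the install command as a list, for example `run_with_retries(["npm", "install", "-g", "<package>"])`, where `<package>` is a placeholder.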
Skills already live in ~/.agents/skills/ (copied by setup_codex.py).
Copying them again into ~/.gemini/skills/ caused Gemini CLI to log
"Skill conflict detected" warnings for every skill on startup.
@datasciencemonkey force-pushed the fix/mlflow-settings-race-153 branch from f2c5ac1 to 46f05c6 April 30, 2026 17:34
@datasciencemonkey merged commit dccea2d into main Apr 30, 2026
1 check passed
Development

Successfully merging this pull request may close these issues.

fix: MLflow env vars overwritten in settings.json due to race condition
