
fix: core app reliability - auth, session management, gateway, .databricksignore #19

Open
dgokeeffe wants to merge 1 commit into main from
fix/core-app-reliability

Conversation

Collaborator

@dgokeeffe dgokeeffe commented May 6, 2026

Priority

P2: multiple latent reliability fixes (case-insensitive auth, gateway auto-discovery, session-limit cap, fresh-token-on-rotation). Bundled because the fixes share auth/session/gateway surface area and the tests were written holistically. Note: requirements.lock and requirements.txt should be regenerated from pyproject.toml via uv lock && uv pip compile before merge; they were taken wholesale from main during the rebase rather than manually merged.


Summary

Migrates datasciencemonkey PR #138 to the new repo home, rebased onto databrickslabs/main.

Bundle of reliability fixes:

  • Case-insensitive email auth: Databricks SSO headers can deliver mixed-case email; the current strict comparison locks legitimate users out.
  • 30-char app-name truncation fix: long app names break the databricks apps get lookup, and the deploy script silently fails.
  • AI Gateway auto-discovery from DATABRICKS_WORKSPACE_ID with reachability probe before use; falls back to direct serving endpoints if unreachable.
  • MAX_CONCURRENT_SESSIONS cap (default 5) to prevent resource exhaustion on a single-container app.
  • Fresh proxy token on PAT rotation: the proxy currently emits 401s after a rotation until the next app restart.
  • Owner resolution: 6-attempt retry with exponential backoff for SP propagation delay; supports the spawner pattern (description: "owner:{email}").
  • app_state.py: lightweight persistence for owner + last rotation under ~/.coda/.
  • .databricksignore: excludes .venv/ and caches from databricks sync (huge upload otherwise).
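
For reviewers, the owner-resolution item above can be pictured roughly as follows. This is a minimal sketch, not the code in this branch: `resolve_owner_with_retry`, `owner_from_description`, the injected `lookup` callable, and the 1-second base delay are all hypothetical. Only the 6 attempts, the exponential backoff, and the `owner:{email}` description pattern come from the summary.

```python
import time
from typing import Callable, Optional


def resolve_owner_with_retry(
    lookup: Callable[[], Optional[str]],
    attempts: int = 6,
    base_delay: float = 1.0,  # assumed value; the PR does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call `lookup` up to `attempts` times, doubling the wait between tries."""
    for attempt in range(attempts):
        owner = lookup()
        if owner:
            return owner
        # No point sleeping after the final failed attempt.
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return None


def owner_from_description(description: str) -> Optional[str]:
    """Parse the spawner pattern 'owner:{email}' from an app description."""
    prefix = "owner:"
    if description.startswith(prefix):
        email = description[len(prefix):].strip()
        return email.lower() or None
    return None
```

Injecting `sleep` keeps the retry loop testable without real waits; with the assumed defaults, a lookup that never succeeds waits 1 + 2 + 4 + 8 + 16 seconds in total before giving up.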

Rebase notes

10 conflicts resolved against current main:

  • Trivial / kept main's version: .gitignore (codex/.agents ignores), Makefile (.PHONY includes test), app.yaml (databricks-gpt-5-5; dropped redundant MLFLOW_CLAUDE_TRACING_ENABLED since "feat: MLflow tracing with async Stop hook (opt-in)" #15 owns that), pyproject.toml (kept v0.18.1 + cryptography 46.0.7), GH workflow files (commit-pinned setup-uv, databrickslabs-protected-runner-group), tests/test_session_limit.py (kept main's session-clear test isolation).
  • requirements.txt + requirements.lock: took main's versions (27 conflict blocks total in auto-generated lock files; not worth a manual merge). TODO: regenerate via uv lock && uv pip compile pyproject.toml -o requirements.txt before merge if any new deps from this branch aren't transitively included.
  • app.py (4 blocks):
    1. Settings.json write in _configure_all_cli_auth: kept main's read-merge-write (PR #153 race fix). The branch's full overwrite was redundant; theme/permissions set by setup_claude.py at startup are preserved naturally by the merge pattern.
    2. Owner resolution loop: combined branch's 6-retry + description-fallback structure with main's set_product_info(w) telemetry call.
    3. PAT-fallback owner: kept main's set_product_info(w) after WorkspaceClient(...).
    4. Shell env strip: kept main's GEMINI_API_KEY pop (so Gemini reads from config, not stale env after rotation).
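
To illustrate why the read-merge-write pattern in item 1 preserves keys written by other startup code, here is a minimal sketch. It is not the implementation from PR #153: `merge_settings` is a hypothetical helper, the shallow merge is an assumption, and the temp-file rename is an extra safeguard that the PR text does not mention.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Dict


def merge_settings(path: Path, updates: Dict[str, Any]) -> Dict[str, Any]:
    """Read existing JSON (if any), overlay `updates`, and write the result back."""
    try:
        current = json.loads(path.read_text())
        if not isinstance(current, dict):
            current = {}
    except (FileNotFoundError, json.JSONDecodeError):
        current = {}

    # Shallow merge (assumption): keys in `updates` win, all other keys survive.
    merged = {**current, **updates}

    # Write to a temp file in the same directory, then rename over the target,
    # so readers never observe a half-written file.
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(merged, tmp, indent=2)
    os.replace(tmp_name, path)
    return merged
```

A full overwrite would drop any key the writer does not know about. The merge keeps keys such as theme and permissions unless the caller explicitly replaces them.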

Test plan

  • uv run pytest tests/ --ignore=tests/gates: 184 tests should pass
  • tests/test_gateway_discovery, test_session_limit, test_clipboard_addon exercise the new behaviour
  • Deploy to a daveok / dogfood instance, verify setup completes and databricks apps get coding-agents returns OK (no 30-char truncation)
  • Mixed-case email (User@Company.com) authorises correctly
  • Confirm .venv/ absent from workspace after databricks sync
  • Regenerate requirements.lock + requirements.txt from pyproject.toml before merge if any new deps need pinning
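
The mixed-case email check in the test plan amounts to comparing normalized addresses. The sketch below assumes lowercase normalization, as the commit message describes; `normalize_email` and `is_authorized` are hypothetical names, not necessarily the functions in app.py.

```python
from typing import Optional


def normalize_email(value: str) -> str:
    """Trim surrounding whitespace and lowercase the address."""
    return value.strip().lower()


def is_authorized(header_email: Optional[str], owner_email: Optional[str]) -> bool:
    """Return True when the SSO header email matches the owner, ignoring case."""
    if not header_email or not owner_email:
        return False  # a missing value on either side never authorizes
    return normalize_email(header_email) == normalize_email(owner_email)
```

With this comparison, `is_authorized("User@Company.com", "user@company.com")` is true, while an empty or missing header is rejected.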

Closes #10

This pull request and its description were written by Isaac.

- Normalize emails to lowercase for case-insensitive SSO header auth
- Fix 30-character Databricks App name truncation in setup
- Cap concurrent sessions via MAX_CONCURRENT_SESSIONS (default 5)
- Inject fresh token in content-filter proxy on PAT rotation (avoids 401s)
- Auto-discover AI Gateway from DATABRICKS_WORKSPACE_ID; probe reachability
  before using auto-discovered URL to avoid silent fallback
- Add app_state.py for persistent owner/rotation tracking (~/.coda/)
- Add .databricksignore to exclude .venv and caches from workspace sync
- Session management, clipboard, and rendering fixes (v0.17.0)
- Add PR template with dogfood testing checklist
- Update GitHub Actions: dependency audit + auto-lockfile update workflows

Tests: test_gateway_discovery (new), test_session_limit (new),
       test_clipboard_addon (new), full existing test suite unaffected.

Co-authored-by: Isaac
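
As a rough picture of the session cap listed in the commit message, the sketch below reads MAX_CONCURRENT_SESSIONS with a default of 5 and refuses new sessions once the limit is reached. `SessionRegistry`, `try_open`, and `max_sessions_from_env` are hypothetical names. The fallback to the default on invalid values is an assumption, and the app's real session bookkeeping may differ.

```python
import os
import threading


def max_sessions_from_env(default: int = 5) -> int:
    """Read MAX_CONCURRENT_SESSIONS, falling back to `default` on bad input."""
    raw = os.environ.get("MAX_CONCURRENT_SESSIONS", "")
    try:
        value = int(raw)
    except ValueError:
        return default
    return value if value > 0 else default


class SessionRegistry:
    """Track open session ids and refuse new ones beyond a fixed limit."""

    def __init__(self, limit: int) -> None:
        self._limit = limit
        self._sessions = set()
        self._lock = threading.Lock()

    def try_open(self, session_id: str) -> bool:
        with self._lock:
            if session_id in self._sessions:
                return True  # already counted; reopening is a no-op
            if len(self._sessions) >= self._limit:
                return False
            self._sessions.add(session_id)
            return True

    def close(self, session_id: str) -> None:
        with self._lock:
            self._sessions.discard(session_id)
```

The lock matters because check-then-add must be atomic; without it, two concurrent requests could both pass the limit check.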

Development

Successfully merging this pull request may close these issues.

Core app reliability - auth, session management, gateway, .databricksignore