feat: add zero-PAT spawner for one-click CODA provisioning #92

Draft

dgokeeffe wants to merge 48 commits into datasciencemonkey:main from dgokeeffe:feat/zero-pat-spawner

Conversation

@dgokeeffe
Contributor

Summary

  • Adds a spawner app (spawner/) that provisions individual coding-agents-{user} Databricks Apps with a single click — the user only provides their PAT via the web UI
  • Spawner handles identity resolution (SCIM), secret scope creation, app creation (owned by the user), SP ACL grants, and deployment from a shared workspace template
  • Updates the main app.py owner resolution to support APP_OWNER_EMAIL env var and owner:{email} in app description, so spawned child apps know their owner without requiring a PAT for that lookup

What's in spawner/

  • app.py: Flask app with the /api/provision endpoint and the full provisioning pipeline
  • app.yaml: Databricks App config exposing the ADMIN_TOKEN secret
  • static/index.html: single-page UI with deploy progress and app listing
  • Makefile: deploy/manage targets (make deploy, make redeploy, make logs)
  • README.md: architecture docs and usage guide

Token model

  • Admin PAT (stored in spawner secret scope) — used for privileged ops (scope creation, ACLs, deployment)
  • User PAT (stored in per-user secret scope) — used for app creation (so user owns it) and as runtime DATABRICKS_TOKEN
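
A minimal sketch of how the two tokens might be wired up, assuming the admin PAT reaches the spawner as the ADMIN_TOKEN environment variable (per app.yaml above) and the user PAT arrives in the provision request; client construction only, no provisioning logic:

```python
import os
from databricks.sdk import WorkspaceClient

def admin_client() -> WorkspaceClient:
    # Privileged client: scope creation, ACL grants, deployments.
    # ADMIN_TOKEN is injected by app.yaml from the spawner's secret scope.
    return WorkspaceClient(host=os.environ["DATABRICKS_HOST"],
                           token=os.environ["ADMIN_TOKEN"])

def user_client(user_pat: str) -> WorkspaceClient:
    # User-scoped client: app creation, so the spawned app is owned by the user,
    # and the same PAT becomes the child app's runtime DATABRICKS_TOKEN.
    return WorkspaceClient(host=os.environ["DATABRICKS_HOST"], token=user_pat)
```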

Test plan

  • Deploy spawner with make deploy PROFILE=<profile> ADMIN_PAT=<token>
  • Visit spawner UI, paste a user PAT, verify provisioning completes
  • Confirm spawned app is RUNNING and accessible at coding-agents-{username}
  • Verify owner resolution in child app picks up the owner:{email} from description
  • Test /api/apps endpoint lists all spawned apps
  • Test /api/status returns correct state for authenticated user

This pull request was AI-assisted by Isaac.

Adds a spawner app that provisions per-user coding-agents instances
without requiring users to provide a PAT at provisioning time.

Spawner flow:
1. User clicks "Provision" — identity from SSO (X-Forwarded-Email)
2. Admin SP creates the app with owner:{email} in description
3. Grants child app's SP access to UC Volume for offline installs
4. Deploys from shared template at /Workspace/Shared/apps/coding-agents
5. User opens their app and pastes PAT there (rotation starts)
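
A rough sketch of that orchestration, in the order of the flow above; `app_name_from_email`, `create_app`, `grant_sp_volume_access`, and `deploy_app` are the helpers named elsewhere in this PR, but their signatures and return shapes here are assumptions:

```python
def provision(email: str) -> dict:
    email = email.lower()                      # identity from X-Forwarded-Email (SSO)
    app_name = app_name_from_email(email)      # e.g. coding-agents-{username}
    # Admin SP creates the app and records the owner in the description.
    app = create_app(app_name, description=f"owner:{email}")
    # Grant the child app's service principal access to the UC Volume (offline installs).
    grant_sp_volume_access(app["service_principal_id"])   # field name is an assumption
    deploy_app(app_name, source_path="/Workspace/Shared/apps/coding-agents")
    return {"app_name": app_name}
```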

CODA get_token_owner() updated with new resolution chain:
  APP_OWNER_EMAIL env var → app description → app.creator → PAT fallback
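
A sketch of that resolution chain; get_app_description(), get_app_creator(), and resolve_owner_from_pat() are hypothetical stand-ins for the actual lookups:

```python
import os

def get_token_owner() -> str | None:
    # 1. Explicit env var wins (set by the spawner on child apps).
    owner = os.environ.get("APP_OWNER_EMAIL")
    if owner:
        return owner.lower()
    # 2. owner:{email} marker written into the app description at provision time.
    desc = get_app_description() or ""         # hypothetical lookup via the Apps API
    for token in desc.split():
        if token.startswith("owner:"):
            return token.removeprefix("owner:").lower()
    # 3. Fall back to the app's creator, then to the PAT-based lookup.
    return get_app_creator() or resolve_owner_from_pat()
```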

No secret scopes. No PAT storage. No DATABRICKS_TOKEN resource binding.

Co-authored-by: Isaac
@mpkrass7 mpkrass7 self-requested a review April 1, 2026 13:45
Collaborator

@mpkrass7 mpkrass7 left a comment


@dgokeeffe I'm sorry man, but I can't approve a 1,300-line AI-generated PR without you at least including some screenshots or a video showing what this actually does and proving it works. The checkboxes in your PR should also be checked off by you, which again serves as validation that you actually ran those checks.

@datasciencemonkey feel free to veto my veto

Lastly, it would be good practice to open up and validate bigger features as a GitHub issue before dev-ing a big PR for them. I think a spawner app sounds great as a quality-of-life improvement; it would just be nice to see a conversation somewhere to the tune of "Hey, this is something we should do", "Yeah, I agree, customer X asked for this", "So did customer Y", "Cool, let's knock it out". It's possible you guys already had that conversation; if so, just put it in the repo somewhere as an issue.

@dgokeeffe dgokeeffe marked this pull request as draft April 1, 2026 23:38
dgokeeffe and others added 25 commits April 6, 2026 13:05
The auto-generated tests assumed the PRD's PAT-based architecture (mint_pat,
store_pat_in_secret_scope, link_secret_to_app) but the implementation uses
M2M OAuth with owner-in-description. Rewrote all 12 AC gate tests to verify
the actual code: get_admin_token, create_app, grant_sp_volume_access, async
provisioning flow, and Flask route handlers.

- AC-2: async provision + poll-based progress
- AC-3: M2M OAuth token caching + fallback
- AC-4: create_app with owner:{email} in description
- AC-5: 409 idempotency on app creation
- AC-6: app_name_from_email (NEW - was missing entirely)
- AC-7: UC Volume permissions for child SP
- AC-10: error recording in provision jobs
- AC-11: completed job returns app_url via poll endpoint
- Added make test target to spawner/Makefile

25 passed, 1 skipped (AC-12 manual), 0 failed.

Co-authored-by: Isaac
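
One of the rewritten gates might look roughly like this (a sketch only: the exact behaviour of app_name_from_email and the test module layout are assumptions based on the coding-agents-{username} convention used elsewhere in this PR):

```python
# spawner/tests/test_ac_gates.py (illustrative path)
from app import app_name_from_email  # spawner/app.py

def test_ac6_app_name_from_email_prefix_and_case():
    # AC-6 (illustrative): spawned app names follow coding-agents-{username}
    # and are safe to compare case-insensitively.
    name = app_name_from_email("RC.Guan@example.com")
    assert name.startswith("coding-agents-")
    assert name == name.lower()
```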
Remove PAT input field — the backend already uses SP M2M OAuth.
Add editable email field pre-populated from SSO so admins can
provision CODA instances for other users. App name preview updates
live as the email changes.

Co-authored-by: Isaac
databricks sync requires --exclude-from flag (no auto-read of ignore
files). Added .syncignore at repo root and updated Makefile sync
targets to use it.

Co-authored-by: Isaac
Databricks Apps runtime detects pyproject.toml + uv.lock and tries
the uv-based install path, which fails with "fork/exec .venv/bin/pip:
no such file or directory" on fresh deploys (known platform bug
ES-1783185). Spawner only needs requirements.txt.

Also added uv.lock to .syncignore for template syncs.

Co-authored-by: Isaac
Databricks app creation returns immediately but the service principal
takes 10-30s to propagate. This caused two failures:
1. PRINCIPAL_DOES_NOT_EXIST when granting UC Volume permissions
2. get_token_owner() returning None when child app boots before SP OAuth is ready

Adds exponential backoff retry (5s base, 6 attempts, ~3min max) to both
paths. Also PATCHes app description on re-provision (409) to ensure
owner:{email} is always set.

Co-authored-by: Isaac
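
A minimal sketch of that backoff pattern (5s base, 6 attempts), with the exception handling simplified for illustration:

```python
import time

def retry_until_principal_exists(fn, attempts: int = 6, base_delay: float = 5.0):
    """Retry fn() while the child app's service principal is still propagating."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:                        # e.g. PRINCIPAL_DOES_NOT_EXIST from the grants API
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))   # 5s, 10s, 20s, ... (~3 min total)
```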
…eway resource

The sync-template Makefile target was generating an app.yaml with
`uv run gunicorn` which conflicts with the runtime's pip install step
(fork/exec .venv/bin/pip: no such file or directory). Also removed
DATABRICKS_GATEWAY_HOST valueFrom since child apps don't have that
resource registered — setup scripts already fall back to DATABRICKS_HOST.

Co-authored-by: Isaac
Git URLs force source builds — cryptography needs Rust, which the
Databricks Apps runtime can't download (no outbound internet to
static.rust-lang.org). PyPI wheels are pre-built and the CVE-fixed
versions are already published there.

Co-authored-by: Isaac
applyTheme() was never called during initialization — only on user
interaction. The body stayed at browser defaults (white background)
until the user manually changed the theme. Now calls applyTheme()
on load and sets a dark CSS default to prevent FOUC.

Co-authored-by: Isaac
The spawner UI previously disabled the deploy button while provisioning,
forcing one-at-a-time. Now each provision gets its own status card,
the button resets immediately, and you can fire off multiple provisions
in parallel. Also adds POST /api/provision-bulk for batch deploys
from a list of emails.

Co-authored-by: Isaac
The stop hook was failing with "No module named mlflow" because
`uv run python` resolved packages from the user's project directory,
not the app's. Fixed by using `--project` to pin resolution to the
app directory where mlflow[genai] is declared.

Re-enables tracing by default (was disabled as a band-aid in b8a06c9).
Hook is only registered when tracing is enabled, and can be opted out
via MLFLOW_CLAUDE_TRACING_ENABLED=false.

Co-authored-by: Isaac
Stop hook stalls on transcript processing, so tracing is now opt-in
via MLFLOW_CLAUDE_TRACING_ENABLED=true. Hook infrastructure kept
intact with --project fix for when it's re-enabled.

Also runs `gh auth setup-git` during git config setup so git
operations use gh as the credential helper.

Co-authored-by: Isaac
…ettings

- Strip all DATABRICKS_* env vars from PTY sessions so SDK reads
  entirely from ~/.databrickscfg (fixes Azure auth fallback bug)
- Add `wsync` command to ~/.local/bin for manual workspace sync
- Increase PAT rotation interval to 3h with 4h lifetime (1h buffer)
- Enable auto mode, dark theme, and pre-approve common CLI tools

Co-authored-by: Isaac
# Conflicts:
#	app.py
#	content_filter_proxy.py
#	requirements.txt
SSO headers can return emails with different casing than what the spawner
stores in the app description (e.g. RC.Guan@ vs rc.guan@), causing
authorization to fail. Normalize to lowercase at all email ingestion
points: get_token_owner(), get_request_user(), WebSocket auth, and
spawner routes.

Co-authored-by: Isaac
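
The normalization itself is a one-liner; the fix is applying it at every ingestion boundary. A sketch, with the header name taken from the commit above:

```python
def normalize_email(raw: str | None) -> str | None:
    # SSO headers may carry mixed case (RC.Guan@...); the spawner stores lowercase,
    # so lowercase at the boundary and compare apples to apples everywhere.
    return raw.strip().lower() if raw else None

# e.g. in get_request_user():
#   email = normalize_email(request.headers.get("X-Forwarded-Email"))
```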
Workspaces without AI Gateway enabled get a FailedToOpenSocket error
because get_gateway_host() auto-constructs a gateway URL from
DATABRICKS_WORKSPACE_ID without checking if the host exists.

Add a lightweight probe (2s timeout GET) for auto-discovered URLs.
Cache the result in _GATEWAY_RESOLVED env var so subprocesses skip
re-probing. Explicit DATABRICKS_GATEWAY_HOST is still trusted without
probing.

Co-authored-by: Isaac
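
A sketch of the probe-and-cache behaviour described above, assuming a stdlib urllib probe; the probed URL and error handling are simplified:

```python
import os
import urllib.request

def gateway_reachable(gateway_url: str, timeout: float = 2.0) -> bool:
    cached = os.environ.get("_GATEWAY_RESOLVED")
    if cached is not None:
        return cached == "1"                     # subprocesses skip re-probing
    try:
        urllib.request.urlopen(gateway_url, timeout=timeout)
        ok = True
    except Exception:                            # FailedToOpenSocket, timeouts, HTTP errors
        ok = False
    os.environ["_GATEWAY_RESOLVED"] = "1" if ok else "0"
    return ok
```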
The Databricks Apps API rate-limits deploy requests. Redeploy-all now
deploys in batches of 3 with a 2s pause between batches, and retries
individual 429 responses with exponential backoff (2s, 4s, 8s, 16s, 32s).

Co-authored-by: Isaac
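
A sketch of the batching half of that change; _deploy_with_backoff is the retry helper named in a later commit, assumed here to handle the per-request 429/400 backoff:

```python
import time

def redeploy_all(app_names: list[str], batch_size: int = 3, pause: float = 2.0):
    for i in range(0, len(app_names), batch_size):
        for name in app_names[i:i + batch_size]:
            _deploy_with_backoff(name)           # retries 429s with 2s/4s/8s/16s/32s backoff
        time.sleep(pause)                        # stay under the Apps API rate limit
```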
Adds a "Bulk Deploy" card to the spawner UI with a textarea that
accepts emails in any format (one per line, comma/semicolon separated,
or Outlook "Name <email>" format). Parses, deduplicates, and calls
the existing /api/provision-bulk endpoint with per-app progress cards.

Co-authored-by: Isaac
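
A sketch of the parsing and deduplication step, assuming the textarea arrives as one text blob; the Outlook handling is a simple angle-bracket regex rather than full address parsing:

```python
import re

def parse_emails(blob: str) -> list[str]:
    out, seen = [], set()
    for chunk in re.split(r"[\n,;]+", blob):
        m = re.search(r"<([^>]+)>", chunk)       # Outlook "Name <email>" format
        email = (m.group(1) if m else chunk).strip().lower()
        if email and "@" in email and email not in seen:
            seen.add(email)
            out.append(email)
    return out
```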
Bundles the workshop repo (attendee-facing files only, 4.3MB) under
projects/coles-vibe-workshop/ so it deploys with the app source.
A new "projects" setup step copies it to ~/projects/ and git-inits it
at startup, so attendees land in a ready-to-go workshop repo with
lab guides, starter kit, demo pipeline, and CLAUDE.md pre-loaded.

Facilitator-only files (scripts, speaker notes, pre-workshop setup)
are excluded from the embedded copy.

Co-authored-by: Isaac
Includes bdd-scaffold, bdd-features, bdd-steps, and bdd-run skills
from the Vibe plugin ecosystem so attendees can use /bdd-scaffold,
/bdd-features etc. when working in the workshop repo.

Co-authored-by: Isaac
Root cause: bulk provision fired all threads simultaneously with no
concurrency limit, and neither create_app nor deploy_app retried on
400/429 API errors. With 54 users, this overwhelmed the Apps API.

Three fixes:
1. Semaphore (_PROVISION_CONCURRENCY=3) gates provision_app_async so
   only 3 apps provision concurrently — rest queue up and proceed as
   slots free.
2. create_app() and deploy_app() now retry with exponential backoff
   on 429 and 400 responses (up to 5 attempts, 2s/4s/8s/16s/32s).
3. Bulk endpoint no longer calls check_existing_app per email in the
   request handler (was 54 sequential API calls blocking the response).
   Existing apps are handled by create_app's 409 path instead.

Also updated _deploy_with_backoff (redeploy-all) to retry 400s.

Co-authored-by: Isaac
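
A sketch of the semaphore gate, assuming each provision runs on its own background thread; provision() stands in for the full pipeline:

```python
import threading

_PROVISION_CONCURRENCY = 3
_provision_slots = threading.Semaphore(_PROVISION_CONCURRENCY)

def provision_app_async(email: str):
    def worker():
        with _provision_slots:                  # at most 3 concurrent provisions; the rest queue
            provision(email)
    threading.Thread(target=worker, daemon=True).start()
```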
- Switch default Opus model to databricks-claude-opus-4-7 across
  setup_claude.py, setup_opencode.py, app.py, app.yaml(.template),
  README.md, tests, and opencode.json.
- Add deny rules for process kills, catastrophic rm, credential
  destruction, shared Workspace deletion, cross-tenant app deletion,
  and system-level destructive commands.
- Whitelist writes to /Workspace/Shared/apps/coding-agents so the
  shared template sync is unblocked.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
dgokeeffe and others added 22 commits April 17, 2026 16:25
Claude Code's auto-memory lives in ~/.claude/projects/{slug}/memory/ and
accumulates user/project/feedback memories that make future sessions
smarter. On ephemeral Databricks Apps compute those files vanish on
every redeploy unless we persist them.

- claude_brain_sync.py pushes each project's memory dir to
  /Workspace/Users/{email}/.coda/claude-brain/projects/ via databricks
  sync, and pulls it back via workspace export-dir on boot.
- Stop hook fire-and-forgets the push so closing a session is never
  blocked on network.
- setup_claude.py pulls the brain from workspace (timeout-bounded) so
  accumulated memory survives redeploys.
- SessionStart hooks: staleness check on memory frontmatter + git
  activity context loader (both ported from local env, Linux-adapted).
- PostToolUse hook stamps last_verified on memory edits (GNU sed).
- /til slash command for end-of-session crystallization; Stop hook
  nudges when meaningful work (commits or 3+ file changes) happened.

Scope is Claude Code only by design — Codex/Gemini/OpenCode configs
are untouched. --profile flag on claude_brain_sync.py lets admins
debug against a named profile without disturbing [DEFAULT].

Co-authored-by: Isaac
Workshop attendees benefit from the Insight blocks that the
Explanatory style emits — they're educational framing that
explains *why* code is written a certain way, not just *what*
it does. Fits CODA's learning-oriented use case.

Co-authored-by: Isaac
Adds /setup-track-env skill that creates isolated uv virtual environments
for each workshop track (DE, DS, Analyst) with pinned latest packages:
- pyspark 4.1, mlflow 3.11, scikit-learn 1.8, xgboost 3.2
- behave + pytest-asyncio shared across all tracks
- Documents PySpark/Java limitation in Databricks App environment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds 4 BDD skills for Behave + Databricks testing:
- bdd-scaffold: set up Behave project with Databricks SDK wiring
- bdd-features: generate Gherkin feature files from requirements
- bdd-steps: implement step definitions using Statement Execution API
- bdd-run: execute Behave with tag filtering and reporting

Includes test-suite with ephemeral schema lifecycle, Unity Catalog
operations, and SQL data assertion patterns.

Source: https://github.com/dgokeeffe/databricks-bdd-tools

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ired)

Three tiers of local tests that run without Java or a Databricks cluster:

- DE: test_pipeline_local.py (13 pure Python tests, 0.03s) + BDD feature
- DS: test_features_local.py (24 tests, 0.04s) + test_model_local.py
      (15 sklearn/xgboost tests, 12s) + BDD feature (5 scenarios)
- BDD: Behave feature files with step defs for DE pipeline and DS features

Teams validate logic locally, then wire into PySpark on the cluster.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the 4 subagents, 5 lifecycle hooks, and /til slash command
out of loose top-level dirs (agents/, claude-hooks/, claude-commands/)
into a proper Claude Code plugin at:

  coda-marketplace/plugins/coda-essentials/

The marketplace is registered via extraKnownMarketplaces (directory
source) and auto-enabled via enabledPlugins in the settings.json
written at setup time. Because the marketplace lives inside the CODA
source tree, every databricks apps deploy ships it automatically —
no external repo, no CI pull, no network dependency.

- Agents and slash commands are auto-discovered by Claude Code from
  the plugin's agents/ and commands/ dirs.
- Hook scripts still need explicit settings.json registration, so
  setup_claude.py points them at the plugin's hooks/ dir directly.
- setup_claude.py drops its copy-agents / copy-hooks / copy-commands
  blocks; the plugin resolver replaces all three.

Structure mirrors the vibe marketplace pattern (.claude-plugin/
marketplace.json at the root, per-plugin .claude-plugin/plugin.json)
so future CODA plugins (e.g., a databricks-skills pack wrapping
ai-dev-kit) can be added without changing setup_claude.py.

Co-authored-by: Isaac
Move the 23 Databricks platform skills CODA was bundling as stale
snapshots in .claude/skills/ into a proper marketplace plugin:

  coda-marketplace/plugins/coda-databricks-skills/skills/

Source: databricks-solutions/ai-dev-kit databricks-skills/ at
commit 326c8e7 (2026-04-xx). Gains vs the prior snapshot:

- 4 new skills not previously bundled:
    databricks-ai-functions, databricks-bdd-testing,
    databricks-execution-compute, databricks-iceberg
- upstream renames applied:
    databricks-asset-bundles -> databricks-bundles
    databricks-synthetic-data-generation -> databricks-synthetic-data-gen
- drift corrections inside existing skills (e.g., docs updated
  create_or_update_pipeline -> manage_pipeline(action="create_or_update")
  for the current MCP tool API).

Plugin is registered in coda-marketplace/.claude-plugin/marketplace.json
and auto-enabled via setup_claude.py's enabledPlugins. Because plugin
and .claude/skills/ loading paths would double-register identically
named skills, the 23 stale copies are removed from .claude/skills/.
Remaining .claude/skills/ entries (BDD scaffold tools, Superpowers
workflow skills, refresh-databricks-skills meta-skill, databricks-app-apx)
are orthogonal to ai-dev-kit and stay put.

Future ai-dev-kit refreshes become a single-plugin content update:
copy databricks-skills/ into the plugin, bump plugin version, commit.

Co-authored-by: Isaac
When CODA is deployed to an enterprise workspace with a restrictive
outbound allowlist (common in Azure/MRA environments), docs.databricks.com
is typically blocked while learn.microsoft.com is allowed. The
databricks-docs skill points agents at docs.databricks.com URLs, so
WebFetch calls hang and time out with no useful error.

At PAT-configuration time, setup_claude.py now probes docs.databricks.com
with a 3s timeout. If unreachable, it appends a one-paragraph fallback
note to ~/.claude/CLAUDE.md telling agents to substitute:
  docs.databricks.com/azure/en/X -> learn.microsoft.com/en-us/azure/databricks/X
  docs.databricks.com/aws/en/X   -> learn.microsoft.com/en-us/azure/databricks/X

Microsoft Learn mirrors the Azure Databricks docs one-to-one, so the
substitution is content-equivalent. The note is gated by a marker
comment so re-running setup never duplicates it.

Stdlib-only (urllib.request). No additional CODA dependencies.
Self-healing in workspaces where egress is later opened up — the note
stays until ~/.claude/CLAUDE.md is cleared, but Claude will only act on
it if URLs are actually blocked.

Co-authored-by: Isaac
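
A stdlib-only sketch of the probe plus marker-gated append; the marker string and note wording here are placeholders, not the actual text setup_claude.py writes:

```python
import os
import urllib.request

MARKER = "<!-- coda-docs-egress-fallback -->"    # illustrative marker, not the real one
NOTE = (
    f"\n{MARKER}\n"
    "docs.databricks.com is unreachable from this workspace; when fetching docs,\n"
    "substitute learn.microsoft.com/en-us/azure/databricks/... URLs instead.\n"
)

def maybe_add_docs_fallback(claude_md: str = "~/.claude/CLAUDE.md") -> None:
    try:
        urllib.request.urlopen("https://docs.databricks.com", timeout=3)
        return                                   # reachable: no note needed
    except Exception:
        pass
    path = os.path.expanduser(claude_md)
    existing = open(path).read() if os.path.exists(path) else ""
    if MARKER not in existing:                   # marker gate: re-running setup never duplicates
        with open(path, "a") as fh:
            fh.write(NOTE)
```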
These are hosted on the open internet (mcp.deepwiki.com, mcp.exa.ai)
and won't resolve in air-gapped or secure-egress CODA deployments.
Default state is now zero public MCPs — team-memory continues to be
wired when TEAM_MEMORY_MCP_URL is set, and the workspace-reachable
Databricks docs fallback lives in the CLAUDE.md egress note.

Set ENABLE_PUBLIC_MCPS=true to restore deepwiki + exa for deployments
where the runtime can reach those domains.

Co-authored-by: Isaac
Databricks Apps base image pre-installs mlflow==3.11.1 which pins
mlflow-skinny==3.11.1 and mlflow-tracing==3.11.1 as hard equalities.
CODA was installing mlflow-skinny==3.10.1 on top, creating a runtime
version skew that pip flagged at build time and that could bite at
import time (setup_mlflow.py imports from mlflow.claude_code.hooks).

Bumping the pin aligns the three packages and eliminates the warning.
No code changes needed — mlflow-skinny 3.11.x is API-compatible for
the claude_code.hooks surface we use.

(dash 2.18.1 Flask<3.1 / Werkzeug<3.1 warnings from the same build
log are artifacts of dash being pre-installed on the Apps runtime
without CODA using it — not actionable from CODA's side.)

Co-authored-by: Isaac
Second half of the Apps runtime dep alignment. The base image preinstalls
mlflow==3.11.1 which pins mlflow-tracing==3.11.1 as a hard equality, but
the runtime also had mlflow-tracing 3.10.1 installed alongside. CODA's
pyproject.toml only pinned mlflow-skinny, not mlflow-tracing, so the
3.10.1 version persisted after our mlflow-skinny bump and pip kept
flagging the conflict.

Pinning mlflow-tracing==3.11.1 as a direct dep forces pip to upgrade it
during install, eliminating the last conflict warning. Verified clean
in the redeploy to daveok — no [BUILD] ERROR lines, all three mlflow
packages at 3.11.1.

Co-authored-by: Isaac
Workshop quiz-app accepted arbitrary UTF-8 for team names (only stripped
+ truncated to 30 chars) and rendered them via raw innerHTML template
literal interpolation across 4 sites in static/index.html (team-chip,
two leaderboard rows, podium). A 30-char payload like
`<img src=x onerror=fetch('/api/control/reset',{method:'POST'})>`
fits the budget and executes in every host + player browser on the
next /api/state poll.

Fixes:

- server: validate team names against ^[A-Za-z0-9 _-]{1,30}$ in
  /api/teams, rejecting HTML-punctuation payloads at the boundary
- server: add Content-Security-Policy (default-src self) +
  X-Content-Type-Options + Referrer-Policy via middleware — inline
  <script> from any future innerHTML slip is blocked by the browser
- client: add esc() HTML-entity-escape helper and wrap every
  ${t.name} interpolation with it, as belt-and-braces defence
  against future regressions on the server-side allowlist

Impact amplified by supply-chain: quiz-app is embedded in CODA and
copied into every spawned instance's ~/projects/, so every attendee
who deploys their workshop app inherits the fix.

Found by /security-review on branch feat/zero-pat-spawner.

Co-authored-by: Isaac
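
A sketch of the server-side allowlist check, written as a Flask-style handler since the quiz-app's actual framework isn't shown in this PR:

```python
import re
from flask import Flask, abort, request

app = Flask(__name__)
TEAM_NAME_RE = re.compile(r"^[A-Za-z0-9 _-]{1,30}$")

@app.post("/api/teams")
def create_team():
    name = (request.get_json(silent=True) or {}).get("name", "")
    if not TEAM_NAME_RE.fullmatch(name):
        # Reject HTML-punctuation payloads at the boundary; the client-side esc()
        # helper stays as defence in depth against future regressions.
        abort(400, "team name may only contain letters, digits, spaces, _ and -")
    return {"ok": True, "name": name}
```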
Reads MLflow traces captured by setup_mlflow.py and reports:
- prompt-cache hit rate (cache_read / (cache_read + input_tokens))
- cached tokens served
- estimated $ saved vs uncached (Opus default, model-aware)
- diagnosis if hit rate < 50% (< 1024-token prefix, > 5 min idle,
  system prompt churn, trace sampling)

Gives CODA users observability on how well the marketplace-stable
context layer is caching in practice — the marketplace refactor
makes skills + CLAUDE.md deterministic across sessions, which
should produce high hit rates. This command tells you whether
that theory matches reality.

Gracefully no-ops when MLFLOW_CLAUDE_TRACING_ENABLED is off
(flag the user to enable it rather than failing).

Co-authored-by: Isaac
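
The hit-rate arithmetic itself is simple; a sketch over per-turn token counts, where the field names are placeholders rather than the MLflow trace schema:

```python
def cache_hit_rate(turns: list[dict]) -> float | None:
    # Each turn dict is assumed to carry cache_read and input_tokens counts.
    cache_read = sum(t.get("cache_read", 0) for t in turns)
    fresh_input = sum(t.get("input_tokens", 0) for t in turns)
    total = cache_read + fresh_input
    return cache_read / total if total else None   # None when there's nothing to measure
```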
Flip MLFLOW_CLAUDE_TRACING_ENABLED=true in app.yaml so the Stop hook
that calls mlflow.claude_code.hooks.stop_hook_handler registers
automatically in every new CODA deploy, and so /cache-stats has
trace data to read.

Guard against the known "Stop hook stalls on transcript processing"
issue by adding a 15-second timeout to the hook spec. If the handler
hangs, Claude Code will abort it and continue the Stop chain — one
dropped trace beats a hung session.

Rollback is one flag if stalls return: flip MLFLOW_CLAUDE_TRACING_ENABLED
back to false via CLI env update, no redeploy of code needed.

Co-authored-by: Isaac
The inline 15s-timeout approach blocked the session-close Stop chain
for up to 15 seconds on every turn even when the handler worked. With
the flag now default-on in app.yaml, that's paid on every user, every
turn — not acceptable for the 90% of flushes that take <1s.

New behaviour: Stop hook invokes coda-essentials/hooks/mlflow-trace-stop.sh
which backgrounds the transcript flush via `nohup timeout 30 ... & disown`.

- Async: wrapper returns in <1s, Stop chain unblocks immediately.
- Bounded: GNU `timeout 30` kills the backgrounded handler if it
  hangs — prevents the "hung background Python eating memory for
  the life of the container" failure mode the user flagged.
- Observable: stdout/stderr land in ~/.mlflow-hook.log for forensics
  when traces don't land.
- Fits plugin architecture: the script ships in coda-essentials/hooks/
  alongside push-brain-to-workspace.sh (same pattern) instead of being
  an inline command string in setup_mlflow.py.

Co-authored-by: Isaac
stop_hook_handler() reads the hook-event JSON from stdin. The previous
wrapper backgrounded the whole thing with `nohup ... & disown`, which
redirected stdin to /dev/null — so the handler failed on every session
close with:

  "Failed to parse hook input: Expecting value: line 1 column 1 (char 0)"

Fix: read stdin synchronously into a temp file (fast, one read), then
background a subshell that pipes that file into the Python handler via
`< "$STDIN_FILE"`. The wrapper still returns in <1s, the handler still
has a 30s ceiling, and now it actually gets the JSON it needs.

Verified against today's daveok redeploy where the broken version
produced the error on every Claude Code -p exit.

Co-authored-by: Isaac
settings.json's extraKnownMarketplaces + enabledPlugins alone is not
enough to activate a plugin — Claude Code also reads two state files
under ~/.claude/plugins/ at startup:

  - known_marketplaces.json: declares marketplace sources
  - installed_plugins.json: declares which plugins are installed and
    where to find them on disk (version 2 schema)

Without those, enabledPlugins entries are silently ignored and slash
commands like /cache-stats and /til from the coda-essentials plugin
fail with "Unknown command".

Fix: setup_claude.py now writes both state files explicitly pointing
at the bundled marketplace at /app/python/source_code/coda-marketplace/.
Because this is a directory-source marketplace, installLocation == source
path, so no copy/extract step is needed — the state files just tell
Claude Code where to scan.

Verified by inspecting the equivalent files a working Claude Code
install maintains (~/.claude/plugins/ on a dev machine running plugins
from vibe-ebc-fix via the same directory-source mechanism).

Co-authored-by: Isaac
… loader

Claude Code's plugin loader requires plugins to live at the cache path:
  ~/.claude/plugins/cache/<marketplace>/<plugin>/<version>/

even for directory-source marketplaces. The source path is recorded as
the marketplace's installLocation, but plugin.installPath in
installed_plugins.json MUST point at the cache copy — not the source.

Previous attempt pointed installPath directly at the source tree, so
/cache-stats (and /til, and the plugin skills) all failed with
"Unknown command" despite the state files being present and the JSON
parsing clean.

Verified by inspecting a working fe-vibe@directory-source install on
a dev machine: marketplace installLocation = ~/Repos/vibe-ebc-fix,
but plugin installPath = ~/.claude/plugins/cache/fe-vibe/fe-html-slides/1.1.4/.

Fix: setup_claude.py now shutil.copytree()s each marketplace plugin
into the expected cache layout at boot, then points the state file
installPaths at those cache copies. Hook commands in settings.json
also redirected at the cache path so the Stop chain loads from the
same tree Claude Code scans.

Tiny disk overhead (~1.9MB for coda-databricks-skills + KB for
coda-essentials) paid once per container boot, ephemeral.

Co-authored-by: Isaac
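
A sketch of the boot-time cache copy described above; the marketplace layout comes from earlier commits, while the function signature is illustrative:

```python
import shutil
from pathlib import Path

def install_plugin_cache(marketplace_root: str, marketplace: str,
                         plugin: str, version: str) -> Path:
    # Claude Code's loader expects ~/.claude/plugins/cache/<marketplace>/<plugin>/<version>/
    dest = Path.home() / ".claude/plugins/cache" / marketplace / plugin / version
    shutil.copytree(Path(marketplace_root) / "plugins" / plugin, dest, dirs_exist_ok=True)
    return dest   # recorded as the plugin's installPath in installed_plugins.json
```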
The plugin-loader pathway (cache + state files) still doesn't surface
/cache-stats or /til as slash commands in the Databricks Apps runtime's
Claude Code install, even with installPath correctly pointing at the
cache under ~/.claude/plugins/cache/coda/...

Rather than keep chasing the plugin-loader discrepancy, mirror the
coda-essentials plugin's commands/ and agents/ directly into
~/.claude/commands/ and ~/.claude/agents/ as well. User-level paths
are canonical and always scanned by Claude Code regardless of plugin
state (verified by the fact that user local /til etc. works from
~/.claude/commands/til.md with no plugin wrapping).

The marketplace stays the source of truth for the content and for
any future customer who has a working plugin loader; this just
guarantees that slash commands actually work on CODA today.

Co-authored-by: Isaac
Move Python packaging / library floors / Unity Catalog / terminal-editor
conventions into ~/.claude/CLAUDE.md at app startup (idempotent via
<!-- coda-fork-directives --> marker), mirroring the existing Azure-docs
egress-fallback pattern. User-level CLAUDE.md loads for every Claude
session regardless of cwd, so spawned CODA instances pick up fork-wide
conventions without relying on the project CLAUDE.md being in scope.

Also add an `editors` setup step that probes `command -v` for nine
common editors (micro, nano, vim, vi, emacs, ed, pico, joe, mcedit) and
writes the result to ~/.local/share/coda/editors.txt so users and Claude
can see what's actually installed in the container.

Co-authored-by: Isaac
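
A sketch of the editors step using shutil.which (the Python analogue of `command -v`); the output path follows the commit above:

```python
import shutil
from pathlib import Path

EDITORS = ["micro", "nano", "vim", "vi", "emacs", "ed", "pico", "joe", "mcedit"]

def record_available_editors() -> list[str]:
    found = [e for e in EDITORS if shutil.which(e)]
    out = Path.home() / ".local/share/coda/editors.txt"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(found) + "\n")      # visible to both users and Claude
    return found
```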
Without this file, `databricks sync` would upload the full repo tree
including ~900MB of .venv/, which is pointless (Databricks Apps builds
its own venv at deploy time from pyproject.toml) and slow.

Co-authored-by: Isaac