feature: prime train routes full_finetune TOMLs to hosted endpoint#592
feature: prime train routes full_finetune TOMLs to hosted endpoint#592JannikSt wants to merge 12 commits into
Conversation
…o hosted endpoint `prime train <toml>` stays the single entry point. When the TOML carries `type = "full_finetune"` (or a `[hosted]` block, or a `[deployment]` block matching prime-rl's qwen30b_math/rl.toml shape), the CLI routes to the new public API at /api/v1/training/runs instead of the LoRA shared-cluster path. Backwards compatible: configs without these markers run through the existing flow unchanged. * api/training.py: new HostedTrainingClient + build_payload_from_toml (whitelist-maps prime-rl example fields onto the API payload). * api/rl.py: surface `kind` on RLRun so `prime train delete` can route to the right endpoint based on run kind. * commands/rl.py: peek the TOML before strict RLConfig parse; on full-FT hand off to _dispatch_full_finetune_run with shared env-file/secrets plumbing. Delete looks up kind and dispatches via HostedTrainingClient for DEDICATED_FULL_FT runs. Tested end-to-end against local backend on rft-freyr cluster: dispatch + status mirroring + completion + delete all clean.
Drop the friction of looking up a cluster cuid for the common single- cluster setup. Backend now auto-picks the first uncordoned PrimeCluster when the field is omitted, so `prime train backend/examples/training/ reverse_text.toml` is zero-config.
…eyRef The platform materialises the per-run RFT API token into the per-run k8s Secret on dispatch and the chart binds it to the orchestrator pod's PRIME_API_KEY env var. The token already lives where prime-monitor needs it — surfacing it on stdout just makes it easy to leak into shared shell history or CI logs.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b88b5f7d0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
- Drop prime_cluster_id from CLI plumbing entirely. Backend auto-picks
the first uncordoned PrimeCluster; the CLI never threads a cluster
id, removing a footgun (mistargeting a config to the wrong cluster).
Drops the [hosted] discriminator path too — type = 'full_finetune'
or a [deployment] block remain the only triggers.
- codex P1 / cursor: --output json on the full-FT path no longer
short-circuits with a {would_dispatch} preview. Mirrors the LoRA
path's 'create then format' contract — automation that pipes the
JSON to grab run_id now actually dispatches the run.
- codex P2: env_file (deprecated, singular) is loaded BEFORE env_files
(canonical, plural) so env_files wins on key collision. Matches the
LoRA path's documented precedence.
build_payload_from_toml used to whitelist ~12 individual fields; the backend rebuilt a minimal TOML from them, dropping anything outside the whitelist (custom optim schedules, eval configs, custom scheduler params, …). E2E prime-rl runs that depended on those knobs silently diverged from `uv run rl @ rl.toml` behaviour. Now: ship the whole TOML as 'config' (companion to platform PR #1824 faa934d56). The backend's build_values takes the config dict directly so the same TOML works locally with prime-rl and remote-dispatched through prime-cli with no fork in behaviour.
Unrelated to the full-FT training payload change; accidentally picked up by 'git add -A'. Restoring inference.py to its 4b9be8f state. The streaming improvements will land in a sibling PR off main.
Cursor caught: when rl_client.get_run fails (e.g. pydantic ValidationError because a DEDICATED_FULL_FT row doesn't carry the LoRA-required RLRun fields), the prior except APIError block silently set kind=None and routed delete to the LoRA endpoint. The hosted helm release + namespace would stay live with no signal back. Restructure: try hosted_client.delete_run first; on HTTP 404 the backend's kind gate told us 'not a DEDICATED_FULL_FT', so we fall through to the LoRA path. Removes the get_run discriminator roundtrip entirely — and any pydantic surprise it could have raised.
Cursor flagged: distinguishing 404 by 'HTTP 404' not in str(e) is fragile — depends on the message format never changing AND on no unrelated error body coincidentally containing the substring. - Add NotFoundError(APIError) subclass; APIClient raises it for 404 responses (sync + async paths). - HostedTrainingClient.create_run / delete_run no longer catch Exception and rewrap into a generic APIError — typed APIError subclasses propagate so callers can branch by class. - prime train delete uses except NotFoundError as the 'not a hosted run, try LoRA' fallback signal.
Cursor caught: `if output != "json" and not yes` meant a researcher piping JSON for scripting would have full-FT runs auto-launched without ack. The LoRA path always prompts unless --yes regardless of output format. Match that contract — gate confirmation purely on --yes.
`_dispatch_full_finetune_run` was building the payload without a teamId, so every dispatched run landed on the caller's personal account even when `prime config view` showed a team selected. The LoRA path (rl.py:1216) already passes `app_config.team_id` — match that behaviour for the full-FT path. Backend's `CreateDedicatedRunRequest.team_id` is Optional, so the prior omission was silently wrong instead of returning a 400.
Backend's /api/v1/rft/runs/{run_id}/logs now accepts component +
env_name params (dedicated full-FT). Surface them through the CLI:
prime train logs <run_id> -c trainer
prime train logs <run_id> -c inference
prime train logs <run_id> -c env-server --env <name>
Legacy --env <name>/<idx> still routes through the env-server-logs
endpoint (shared-RFT pods, cluster_id-backed lookup). Dedicated
env-server (slug, no slash) goes through the unified /logs route.
Per-rank --pod-index intentionally not exposed yet: the chart's
torchrun --local-ranks-filter=0 already collapses in-pod rank fan-out
to rank 0 stdout, and Loki's pod-label indexing in this tenant
doesn't actually filter the prime-job-* streams — per-pod inspection
on multi-node runs is kubectl + the PVC log files for now.
…full-finetune # Conflicts: # packages/prime/src/prime_cli/api/rl.py # packages/prime/src/prime_cli/commands/rl.py
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 2d15d7a. Configure here.
| env_name=env_name_q, | ||
| env_index=env_index_q, | ||
| tail_lines=t, | ||
| ) |
There was a problem hiding this comment.
Legacy env-server log filters silently dropped
Medium Severity
The new legacy shared-RFT env-server branch (env with "/" qualifier) no longer passes search, regex, normalized_level, or since_seconds to rl_client.get_env_server_logs, even though the method accepts all four. The old code forwarded these filters, so users combining --env name/1 with --search, --level, or --since will now have their filters silently ignored — a regression from the refactor.
Reviewed by Cursor Bugbot for commit 2d15d7a. Configure here.


Note
Medium Risk
Adds new run-dispatch and deletion paths hitting
/training/runsand changes log routing for multiple components; mistakes could start/stop the wrong run type or fetch incorrect logs, but changes are scoped to CLI command behavior.Overview
prime trainnow detects full finetune TOMLs (type="full_finetune"or[deployment]GPU fields) and dispatches them to a new hosted full-FT endpoint (POST /training/runs) instead of the existing LoRA shared/rft/runsflow, including team association and env-file secret collection while intentionally not printing returned per-run tokens.Run deletion now first attempts the hosted full-FT delete (
DELETE /training/runs/{id}) and falls back to the LoRA path only on typedNotFoundError(new 404 exception class). Log fetching is expanded to support component-scoped logs (orchestrator/trainer/inference/env-server) withenv_nameselection, andRLRungains an optionalkinddiscriminator to represent these run types.Reviewed by Cursor Bugbot for commit 2d15d7a. Bugbot is set up for automated code reviews on this repo. Configure here.