Skip to content

feature: prime train routes full_finetune TOMLs to hosted endpoint#592

Open
JannikSt wants to merge 12 commits into
mainfrom
feature/train-hosted-full-finetune
Open

feature: prime train routes full_finetune TOMLs to hosted endpoint#592
JannikSt wants to merge 12 commits into
mainfrom
feature/train-hosted-full-finetune

Conversation

@JannikSt
Copy link
Copy Markdown
Member

@JannikSt JannikSt commented May 2, 2026

Note

Medium Risk
Adds new run-dispatch and deletion paths hitting /training/runs and changes log routing for multiple components; mistakes could start/stop the wrong run type or fetch incorrect logs, but changes are scoped to CLI command behavior.

Overview
prime train now detects full finetune TOMLs (type="full_finetune" or [deployment] GPU fields) and dispatches them to a new hosted full-FT endpoint (POST /training/runs) instead of the existing LoRA shared /rft/runs flow, including team association and env-file secret collection while intentionally not printing returned per-run tokens.

Run deletion now first attempts the hosted full-FT delete (DELETE /training/runs/{id}) and falls back to the LoRA path only on typed NotFoundError (new 404 exception class). Log fetching is expanded to support component-scoped logs (orchestrator/trainer/inference/env-server) with env_name selection, and RLRun gains an optional kind discriminator to represent these run types.

Reviewed by Cursor Bugbot for commit 2d15d7a. Bugbot is set up for automated code reviews on this repo. Configure here.

JannikSt added 3 commits May 1, 2026 11:28
…o hosted endpoint

`prime train <toml>` stays the single entry point. When the TOML carries
`type = "full_finetune"` (or a `[hosted]` block, or a `[deployment]`
block matching prime-rl's qwen30b_math/rl.toml shape), the CLI routes
to the new public API at /api/v1/training/runs instead of the LoRA
shared-cluster path. Backwards compatible: configs without these
markers run through the existing flow unchanged.

* api/training.py: new HostedTrainingClient + build_payload_from_toml
  (whitelist-maps prime-rl example fields onto the API payload).
* api/rl.py: surface `kind` on RLRun so `prime train delete` can route
  to the right endpoint based on run kind.
* commands/rl.py: peek the TOML before strict RLConfig parse; on full-FT
  hand off to _dispatch_full_finetune_run with shared env-file/secrets
  plumbing. Delete looks up kind and dispatches via HostedTrainingClient
  for DEDICATED_FULL_FT runs.

Tested end-to-end against local backend on rft-freyr cluster: dispatch
+ status mirroring + completion + delete all clean.
Drop the friction of looking up a cluster cuid for the common single-
cluster setup. Backend now auto-picks the first uncordoned PrimeCluster
when the field is omitted, so `prime train backend/examples/training/
reverse_text.toml` is zero-config.
…eyRef

The platform materialises the per-run RFT API token into the per-run
k8s Secret on dispatch and the chart binds it to the orchestrator pod's
PRIME_API_KEY env var. The token already lives where prime-monitor
needs it — surfacing it on stdout just makes it easy to leak into
shared shell history or CI logs.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7b88b5f7d0

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
JannikSt added 3 commits May 2, 2026 16:45
- Drop prime_cluster_id from CLI plumbing entirely. Backend auto-picks
  the first uncordoned PrimeCluster; the CLI never threads a cluster
  id, removing a footgun (mistargeting a config to the wrong cluster).
  Drops the [hosted] discriminator path too — type = 'full_finetune'
  or a [deployment] block remain the only triggers.

- codex P1 / cursor: --output json on the full-FT path no longer
  short-circuits with a {would_dispatch} preview. Mirrors the LoRA
  path's 'create then format' contract — automation that pipes the
  JSON to grab run_id now actually dispatches the run.

- codex P2: env_file (deprecated, singular) is loaded BEFORE env_files
  (canonical, plural) so env_files wins on key collision. Matches the
  LoRA path's documented precedence.
build_payload_from_toml used to whitelist ~12 individual fields; the
backend rebuilt a minimal TOML from them, dropping anything outside
the whitelist (custom optim schedules, eval configs, custom scheduler
params, …). E2E prime-rl runs that depended on those knobs silently
diverged from `uv run rl @ rl.toml` behaviour.

Now: ship the whole TOML as 'config' (companion to platform PR #1824
faa934d56). The backend's build_values takes the config dict directly
so the same TOML works locally with prime-rl and remote-dispatched
through prime-cli with no fork in behaviour.
Unrelated to the full-FT training payload change; accidentally picked
up by 'git add -A'. Restoring inference.py to its 4b9be8f state. The
streaming improvements will land in a sibling PR off main.
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Cursor caught: when rl_client.get_run fails (e.g. pydantic
ValidationError because a DEDICATED_FULL_FT row doesn't carry the
LoRA-required RLRun fields), the prior except APIError block silently
set kind=None and routed delete to the LoRA endpoint. The hosted
helm release + namespace would stay live with no signal back.

Restructure: try hosted_client.delete_run first; on HTTP 404 the
backend's kind gate told us 'not a DEDICATED_FULL_FT', so we fall
through to the LoRA path. Removes the get_run discriminator
roundtrip entirely — and any pydantic surprise it could have raised.
Comment thread packages/prime/src/prime_cli/commands/rl.py Outdated
Cursor flagged: distinguishing 404 by 'HTTP 404' not in str(e) is
fragile — depends on the message format never changing AND on no
unrelated error body coincidentally containing the substring.

- Add NotFoundError(APIError) subclass; APIClient raises it for 404
  responses (sync + async paths).
- HostedTrainingClient.create_run / delete_run no longer catch
  Exception and rewrap into a generic APIError — typed APIError
  subclasses propagate so callers can branch by class.
- prime train delete uses except NotFoundError as the 'not a hosted
  run, try LoRA' fallback signal.
Comment thread packages/prime/src/prime_cli/commands/rl.py
JannikSt added 2 commits May 5, 2026 18:23
Cursor caught: `if output != "json" and not yes` meant a researcher
piping JSON for scripting would have full-FT runs auto-launched
without ack. The LoRA path always prompts unless --yes regardless of
output format. Match that contract — gate confirmation purely on
--yes.
`_dispatch_full_finetune_run` was building the payload without a
teamId, so every dispatched run landed on the caller's personal
account even when `prime config view` showed a team selected. The
LoRA path (rl.py:1216) already passes `app_config.team_id` — match
that behaviour for the full-FT path.

Backend's `CreateDedicatedRunRequest.team_id` is Optional, so the
prior omission was silently wrong instead of returning a 400.
willccbb
willccbb previously approved these changes May 19, 2026
Backend's /api/v1/rft/runs/{run_id}/logs now accepts component +
env_name params (dedicated full-FT). Surface them through the CLI:

  prime train logs <run_id> -c trainer
  prime train logs <run_id> -c inference
  prime train logs <run_id> -c env-server --env <name>

Legacy --env <name>/<idx> still routes through the env-server-logs
endpoint (shared-RFT pods, cluster_id-backed lookup). Dedicated
env-server (slug, no slash) goes through the unified /logs route.

Per-rank --pod-index intentionally not exposed yet: the chart's
torchrun --local-ranks-filter=0 already collapses in-pod rank fan-out
to rank 0 stdout, and Loki's pod-label indexing in this tenant
doesn't actually filter the prime-job-* streams — per-pod inspection
on multi-node runs is kubectl + the PVC log files for now.
…full-finetune

# Conflicts:
#	packages/prime/src/prime_cli/api/rl.py
#	packages/prime/src/prime_cli/commands/rl.py
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 2d15d7a. Configure here.

env_name=env_name_q,
env_index=env_index_q,
tail_lines=t,
)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Legacy env-server log filters silently dropped

Medium Severity

The new legacy shared-RFT env-server branch (env with "/" qualifier) no longer passes search, regex, normalized_level, or since_seconds to rl_client.get_env_server_logs, even though the method accepts all four. The old code forwarded these filters, so users combining --env name/1 with --search, --level, or --since will now have their filters silently ignored — a regression from the refactor.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 2d15d7a. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants