diff --git a/docs/superpowers/specs/2026-06-20-model-aware-turn-budgets-design.md b/docs/superpowers/specs/2026-06-20-model-aware-turn-budgets-design.md new file mode 100644 index 0000000..054eb4b --- /dev/null +++ b/docs/superpowers/specs/2026-06-20-model-aware-turn-budgets-design.md @@ -0,0 +1,235 @@ +# Model-aware turn budgets — design + +**Issue:** [#1 — Turn-budget exhaustion is the dominant failure mode](https://github.com/VisionForge-OU/foreman/issues/1) +**Date:** 2026-06-20 +**Status:** approved-for-planning + +## Problem + +In the dogfood soak test **21 of 43 agent runs (49%) ended `killed_turns`**. The +per-run turn budget is a single fixed number (`max_turns`) with **no relationship +to the model running the work**. A 30-turn cap suits a frontier model but starves +a small/cheap model (haiku at `effort=low`), which then retries / extends / +escalates — each killed run still costing $0.17–0.26. This is the root cause +behind F1 (build:stuck), F3 (grill fail), the over-cost of F5, and most of the +wall-clock. No skill-text patch can fix it; only the budget policy can. + +Today every run is assembled into a `RunSpec` carrying `model=` and `budget=` +**side by side but independently** (`pipeline._spawn`, `issue_run`, scheduler +sites). The model and the turn budget never inform each other — that is the gap. + +## Goals + +1. Make the per-run turn budget **model-aware**: a small model gets more turns than + a frontier model by default, so it stops running out mid-task. +2. **Per-phase scaling**: heavy phases (grill, slicer) get more turns than a single + TDD slice. +3. **Wall-clock + cost extension ceiling**: when a turn-killed run is resumed, the + stop decision is governed by cumulative wall-clock and cost across the extension + chain, not by an arbitrary re-run-from-a-turn-cap count. +4. **Loud `killed_turns`**: surface the kill reason and extension-chain stats in the + TUI and the build report (display-only; the outcome-taxonomy work is issue #2). + +## Non-goals + +- The outcome taxonomy / `foreman retro` clustering of kills — that is **issue #2**. + This issue only makes kills *visible*; it does not stamp outcome labels or change + retro clustering. +- Triage / fast-path for trivial requests — **issue #5**. +- Any change to the grader rubric or the evaluator's pass/fail logic. + +## Design + +### 1. New module `src/foreman/turns.py` (pure policy) + +The single owner of "how many turns should this run get?" — no I/O, no state, just +functions over `(model, phase, configured_budget, overrides)`. Isolated so it can be +unit-tested exhaustively and reasoned about in one screen. + +```python +# Built-in defaults (operator-overridable via config; see §3). +TURN_TIERS = {"small": 60, "large": 30} # tier -> turn floor +PHASE_FACTOR = { # multiplier on the tier floor + "planner": 1.0, "grill": 1.5, "slicer": 1.5, + "worker": 1.0, "e2e": 1.25, "init": 1.0, "grader": 1.0, +} +SMALL_HINTS = ("haiku", "mini", "small", "flash", "lite", "nano") # substring, case-insensitive +LARGE_HINTS = ("sonnet", "opus", "fable") # known frontier families +DEFAULT_PHASE_FACTOR = 1.0 # unknown phase +DEFAULT_TIER = "small" # unknown model -> generous + +def classify_model(model: str) -> str: + """A small-model hint wins first, then a known frontier family; an + unrecognised id falls back to 'small' (fail safe — give an unknown model MORE + turns, not fewer).""" + +def effective_turns(model, phase, configured, *, overrides, tiers=None, factors=None) -> int: + """Resolve the effective max_turns for a run.""" +``` + +**Resolution rule** (reconciles the three precedence decisions): + +1. **Exact pin.** If `model` ∈ `overrides` (the `turn_budget_by_model` config map), + return that integer **verbatim** — bypassing tier, phase factor, and floor. This + is the operator's precise escape hatch; an overridden model gets the same turn + count in every phase. +2. **Otherwise** — `max(configured, round(tier_floor(model) * phase_factor(phase)))`. + The tier value is a **floor**: it can only raise a too-small configured budget, + never reduce a deliberately large one. + +Worked examples (default config, base `max_turns = 80`): + +| model | phase | tier floor | × factor | floor result | configured | **effective** | +|-------|-------|-----------|----------|--------------|-----------|---------------| +| haiku | worker | 60 | ×1.0 | 60 | 80 | **80** (unchanged) | +| haiku | grill | 60 | ×1.5 | 90 | 80 | **90** (raised) | +| haiku | worker (budget lowered to 30) | 60 | ×1.0 | 60 | 30 | **60** (floored ↑) | +| opus | worker | 30 | ×1.0 | 30 | 80 | **80** (unchanged) | +| haiku (pinned 45) | grill | — | — | — | — | **45** (exact pin) | + +> The floor's biggest payoff is **protection when an operator lowers the budget to +> save cost** (exactly what the dogfood harness did at 30) — a small model is then +> still guaranteed ≥60, never the punishing cap that caused 49% `killed_turns`. Even +> at the default base of 80, phase scaling lifts grill/slicer to 90 for small models, +> matching the evidence that grill/slicer burn turns hardest. + +### 2. Wiring at the `RunSpec` seams + +A thin helper in `turns.py`: + +```python +def resolve_budget(config, model, phase, base) -> Budget: + return replace(base, max_turns=effective_turns( + model, phase, base.max_turns, + overrides=config.turn_budget_by_model, + tiers=config.turn_tiers, factors=config.phase_turn_factors)) +``` + +Called at each spec-assembly site, passing the phase and the model already chosen +there: + +| site | phase arg | model | +|------|-----------|-------| +| `pipeline._spawn` | `ctx.kind` (planner/grill/slicer) | `ctx.model` | +| `issue_run` worker spec | `"worker"` | `model_worker` | +| `scheduler` init spec | `"init"` | `model_planner` | +| `scheduler` e2e spec | `"e2e"` | `model_worker` | +| `scheduler` grader specs (evaluator/auditor/code/security) | `"grader"` | respective model | + +Policy lives only in `turns.py`; call sites pass `(model, phase)` and receive an +adjusted `Budget`. Grader sites get the floor too (harmless — `max(configured, …)` +can only help), keeping one consistent rule everywhere. + +### 3. Config surface (`config.py` + installer template) + +New fields on `Config` (all with safe defaults; round-tripped in `to_dict` / +`from_dict`; validated): + +```yaml +# Per-model exact turn pins (escape hatch; bypasses tiers/phase scaling/floor). +turn_budget_by_model: {} # e.g. { claude-haiku-4-5: 80 } + +# Optional overrides of the built-in tier floors and phase multipliers +# (merged over the defaults — you only specify what you change). +turn_tiers: {} # e.g. { small: 80, large: 40 } +phase_turn_factors: {} # e.g. { grill: 2.0 } + +# Wall-clock + cost ceiling for the turn-extension chain (see §4). +extension_wall_min: 30 +extension_cost_usd: 3.00 +max_turn_extensions: 6 # backstop only (was 2) +``` + +Validation: tier floors and pins are positive ints; phase factors are positive +floats; `extension_wall_min` ≥ 0; `extension_cost_usd` > 0; `max_turn_extensions` +≥ 0 (0 ⇒ no count backstop, wall/cost only). The installer YAML documents each. + +### 4. Wall-clock + cost extension ceiling + +`should_extend()` (`runner.py:45`) is the shared owner of the extend-vs-give-up +decision for all three loops. Extend its signature with cumulative guards: + +```python +def should_extend(terminal_reason, *, has_session, extensions, max_extensions, + auto_extend, requested_more=False, + chain_wall_min=0.0, chain_cost_usd=0.0, + wall_ceiling_min=None, cost_ceiling_usd=None) -> bool: + if not auto_extend or not has_session: + return False + if max_extensions and extensions >= max_extensions: # 0 => no count backstop + return False + if wall_ceiling_min is not None and chain_wall_min >= wall_ceiling_min: + return False + if cost_ceiling_usd is not None and chain_cost_usd >= cost_ceiling_usd: + return False + return requested_more or terminal_reason == KILLED_TURNS +``` + +Each of the three extension loops (`pipeline._spawn`, `issue_run`, the non-worker +agent loop in `scheduler`) accumulates across the chain: + +```python +chain_cost_usd += result.record.cost_usd +chain_wall_min += run_duration_min(result.record) # finished - started +``` + +and passes the new args + the config ceilings into `should_extend`. The effect: +a turn-killed run keeps resuming the **same session** with a healthy turn grant +until it completes or the cumulative wall/cost ceiling bites — the count is just a +runaway backstop. + +`run_duration_min` is derived from the `RunRecord.started` / `finished` ISO +timestamps already persisted (no new timing plumbing). + +### 5. Loud `killed_turns` (display-only) + +- **TUI** — `controller.worker_finished` (`tui/controller.py:202`): when + `terminal_reason != "completed"`, append it with a ⚠ marker plus the + extension-chain summary: + `■ finished: tests_failing ⚠ killed_turns · 3 extensions · 18.2 min · $0.41`. + (Phase-A spawns get the analogous treatment where they log.) +- **Report** — `report.render()` (the object returned by `Conductor.build`): a + "Turn-killed runs" section listing any run that ended `killed_turns` with its + chain stats, so an unattended operator sees it without grepping `runs/`. + +This is presentation only — it reads `terminal_reason`, which already exists. It +does **not** assign outcome labels or touch retro (issue #2). + +## Components & boundaries + +| unit | responsibility | depends on | +|------|----------------|-----------| +| `turns.py` | pure turn-budget policy (tiers, phase factors, classify, effective_turns, resolve_budget) | `models.Budget`, `config.Config` (read-only) | +| `runner.should_extend` | extend-vs-stop decision incl. wall/cost guards | — (pure) | +| extension loops (3) | accumulate chain wall/cost, call `should_extend`, build specs via `resolve_budget` | `turns`, `runner` | +| `config.Config` | new fields + validation + round-trip | `models.Budget` | +| `controller` / report | render terminal_reason + chain stats | `RunRecord` | + +Each is independently testable; `turns.py` and `should_extend` are pure. + +## Testing strategy + +- **`tests/test_turns.py`** — `classify_model` for haiku / sonnet / opus / fable / + mini / unknown; exact-pin precedence (bypasses everything); floor vs configured; + phase factors incl. unknown phase → 1.0; config overrides of tiers/factors merge + over defaults; rounding. +- **`tests/test_should_extend.py`** (extend existing) — stops on wall ceiling; stops + on cost ceiling; stops on count backstop; `max_extensions=0` disables the count + backstop; continues while all under; non-turn kills never extend; cost/timeout/ + stuck never extend. +- **Wiring tests** — a haiku worker spec receives the floored/scaled `max_turns`; a + grill spec gets ×1.5; a frontier model with a high configured budget is unchanged; + a pinned model gets the exact value in every phase. +- **Config tests** — new fields round-trip through `to_dict`/`from_dict`; invalid + values raise `ConfigError`; installer template parses. +- **TUI/report tests** — the killed_turns callout renders with chain stats; a clean + run shows no callout. + +## Rollout / compatibility + +- All new config fields default to empty/identity, so **existing `.foreman/config.yaml` + files behave exactly as before** except for: (a) grill/slicer on a small model rise + to 90 turns at the default base, (b) extensions now also stop on wall/cost, (c) + `max_turn_extensions` default 2 → 6 (only affects installs that don't set it). +- No on-disk schema change; `RunRecord` already carries the fields the loud reporting + reads. diff --git a/src/foreman/config.py b/src/foreman/config.py index a087fb9..324b271 100644 --- a/src/foreman/config.py +++ b/src/foreman/config.py @@ -107,8 +107,21 @@ class Config: # asks via request_more_turns) Foreman can resume the SAME session with more # turns up to ``max_turn_extensions`` times before escalating to a human. auto_extend_turns: bool = True - max_turn_extensions: int = 2 + # Backstop only (issue #1): wall-clock + cost are the primary extension limits. + max_turn_extensions: int = 6 # 0 ⇒ no count backstop (wall/cost only) turn_extension_size: int = 0 # 0 ⇒ reuse run_budget.max_turns per extension + # Cumulative ceilings for the turn-extension chain (issue #1). A turn-killed run + # keeps resuming the SAME session until it completes or these bite. + extension_wall_min: int = 30 + extension_cost_usd: float = 3.0 + + # Model-aware turn budgets (issue #1). ``turn_budget_by_model`` pins an exact + # turn count per model (escape hatch; bypasses tiers/phase-scaling/floor). + # ``turn_tiers`` / ``phase_turn_factors`` override the built-in tables in + # ``turns.py`` (merged over the defaults — only specify what you change). + turn_budget_by_model: dict[str, int] = field(default_factory=dict) + turn_tiers: dict[str, int] = field(default_factory=dict) + phase_turn_factors: dict[str, float] = field(default_factory=dict) # ---- accessors used by the runner / scheduler ---- def command(self, name: str) -> Optional[str]: @@ -143,6 +156,19 @@ def validate(self) -> None: errs.append("max_turn_extensions must be >= 0") if self.turn_extension_size < 0: errs.append("turn_extension_size must be >= 0") + if self.extension_wall_min < 0: + errs.append("extension_wall_min must be >= 0") + if self.extension_cost_usd <= 0: + errs.append("extension_cost_usd must be > 0") + for model, turns in self.turn_budget_by_model.items(): + if int(turns) <= 0: + errs.append(f"turn_budget_by_model[{model!r}] must be > 0") + for tier, floor in self.turn_tiers.items(): + if int(floor) <= 0: + errs.append(f"turn_tiers[{tier!r}] must be > 0") + for phase, factor in self.phase_turn_factors.items(): + if float(factor) <= 0: + errs.append(f"phase_turn_factors[{phase!r}] must be > 0") if self.limits.daily_cost_usd <= 0: errs.append("limits.daily_cost_usd must be > 0") if not self.required_skills: @@ -194,6 +220,11 @@ def to_dict(self) -> dict[str, Any]: "auto_extend_turns": self.auto_extend_turns, "max_turn_extensions": self.max_turn_extensions, "turn_extension_size": self.turn_extension_size, + "extension_wall_min": self.extension_wall_min, + "extension_cost_usd": self.extension_cost_usd, + "turn_budget_by_model": dict(self.turn_budget_by_model), + "turn_tiers": dict(self.turn_tiers), + "phase_turn_factors": dict(self.phase_turn_factors), } @@ -242,8 +273,17 @@ def from_dict(d: dict[str, Any]) -> Config: e2e_enabled=bool(d.get("e2e_enabled", True)), permission_mode=str(d.get("permission_mode", "acceptEdits")), auto_extend_turns=bool(d.get("auto_extend_turns", True)), - max_turn_extensions=int(d.get("max_turn_extensions", 2)), + max_turn_extensions=int(d.get("max_turn_extensions", 6)), turn_extension_size=int(d.get("turn_extension_size", 0)), + extension_wall_min=int(d.get("extension_wall_min", 30)), + extension_cost_usd=float(d.get("extension_cost_usd", 3.0)), + turn_budget_by_model={ + str(k): int(v) for k, v in (d.get("turn_budget_by_model") or {}).items() + }, + turn_tiers={str(k): int(v) for k, v in (d.get("turn_tiers") or {}).items()}, + phase_turn_factors={ + str(k): float(v) for k, v in (d.get("phase_turn_factors") or {}).items() + }, ) if d.get("evaluator_budget"): cfg.evaluator_budget = Budget.from_dict(d.get("evaluator_budget")) diff --git a/src/foreman/installer.py b/src/foreman/installer.py index 48082b6..a1a2a3e 100644 --- a/src/foreman/installer.py +++ b/src/foreman/installer.py @@ -95,6 +95,25 @@ # a failed context). Set to `resume` to continue the prior session instead. retry_strategy: fresh +# Model-aware turn budgets (issue #1). A small/cheap model needs more turns than a +# frontier one; out of the box the budget auto-scales by model tier and pipeline +# phase so cheap models stop running out of turns mid-task. +# - turn_budget_by_model: pin an EXACT turn count for a model (escape hatch; +# bypasses tiers, phase scaling, and the floor). +# - turn_tiers / phase_turn_factors: override the built-in tier floors +# (small: 60, large: 30) and phase multipliers (grill/slicer ×1.5, e2e ×1.25); +# merged over the defaults — only list what you change. +# turn_budget_by_model: {{ claude-haiku-4-5: 80 }} +# turn_tiers: {{ small: 60, large: 30 }} +# phase_turn_factors: {{ grill: 1.5, slicer: 1.5, e2e: 1.25 }} + +# When a run is killed at its turn cap, Foreman resumes the SAME session with more +# turns until it completes or these cumulative ceilings bite. max_turn_extensions is +# only a runaway backstop (0 ⇒ wall/cost are the sole limit). +extension_wall_min: 30 +extension_cost_usd: 3.0 +max_turn_extensions: 6 + # WS4.3: run a specialist janitor pass (dedup → conventions → docs) after every N # merged feature issues, each gated by the same verification pipeline. janitor_enabled: true diff --git a/src/foreman/issue_run.py b/src/foreman/issue_run.py index 7b723ae..ebdcb56 100644 --- a/src/foreman/issue_run.py +++ b/src/foreman/issue_run.py @@ -15,14 +15,14 @@ from dataclasses import replace from typing import Optional -from . import git_ops, hooks, janitor as janitor_mod, locks, prompts, vendored +from . import git_ops, hooks, janitor as janitor_mod, locks, prompts, turns, vendored from .agents import installer as agents_installer from .backend import RunSpec from .context import distiller, initializer from .context.assembler import AssembledPrompt, estimate_tokens from .models import Issue, IssueStatus from .retro import metrics as metrics_mod -from .runner import should_extend, KILLED_USER, KILLED_TURNS +from .runner import should_extend, run_duration_min, KILLED_USER, KILLED_TURNS from .verification import merge_gate @@ -93,6 +93,9 @@ async def run(self) -> str: # Turn-budget extensions used so far this run (in-memory: a crash tears down # the session/worktree, so the count is meaningless after). turn_extensions = 0 + # Cumulative wall-clock + cost of the turn-extension chain (issue #1). + chain_wall_min = 0.0 + chain_cost_usd = 0.0 prd_sections = self.s._prd_sections(slug, issue) feature_state = initializer.read_feature_state( self.s.store.paths.feature_state_file(slug) @@ -106,16 +109,21 @@ async def run(self) -> str: evidence_dir.mkdir(parents=True, exist_ok=True) # WS3.1: every session runs the feature bootstrap first. await self.s._run_init_sh(slug, wt) + # Model-aware turn budget (issue #1): scale the issue's budget by the + # worker model + the build phase before each run. + base_budget = turns.resolve_budget( + self.s.config, self.s.config.model_worker, "worker", issue.budget + ) # Turn-budget extension: this loop continues the SAME session with a # fresh turn allowance, rather than a fresh-context retry. if turn_extensions > 0: ext_turns = (self.s.config.turn_extension_size - or self.s.config.run_budget.max_turns) - run_budget = replace(issue.budget, max_turns=ext_turns) + or base_budget.max_turns) + run_budget = replace(base_budget, max_turns=ext_turns) resume_id = prior_session_id # force resume regardless of retry_strategy else: ext_turns = 0 - run_budget = issue.budget + run_budget = base_budget # WS3.3: fresh session by default; `resume` continues prior context. resume_id = (prior_session_id if self.s.config.retry_strategy == "resume" else None) @@ -185,6 +193,8 @@ async def run(self) -> str: wants_more = bool(summary and summary.request_more_turns and not summary.escalate) hard_turns = result.record.terminal_reason == KILLED_TURNS + chain_wall_min += run_duration_min(result.record) + chain_cost_usd += result.record.cost_usd if should_extend( result.record.terminal_reason, has_session=bool(prior_session_id), @@ -192,6 +202,10 @@ async def run(self) -> str: max_extensions=self.s.config.max_turn_extensions, auto_extend=self.s.config.auto_extend_turns, requested_more=wants_more, + chain_wall_min=chain_wall_min, + chain_cost_usd=chain_cost_usd, + wall_ceiling_min=self.s.config.extension_wall_min, + cost_ceiling_usd=self.s.config.extension_cost_usd, ): turn_extensions += 1 self.s._log( diff --git a/src/foreman/pipeline.py b/src/foreman/pipeline.py index 28e1de9..30b7084 100644 --- a/src/foreman/pipeline.py +++ b/src/foreman/pipeline.py @@ -18,11 +18,11 @@ from pathlib import Path from typing import Callable, Optional -from . import frontmatter, prompts, vendored +from . import frontmatter, prompts, turns, vendored from .backend import AgentBackend, RunSpec from .config import Config from .models import DocStatus, GatedDoc, Phase -from .runner import AgentRunner, RunResult, should_extend +from .runner import AgentRunner, RunResult, run_duration_min, should_extend from .skill_invocation import SkillInvocation from .state import FileStore @@ -88,16 +88,21 @@ def ensure_skills_installed(self) -> None: # ------------------------------------------------------------------ # async def _spawn(self, slug: str, ctx: SpawnContext, budget=None) -> RunResult: budget = budget or self.config.run_budget + # Model-aware turn budget (issue #1): scale by the model + this phase. + budget = turns.resolve_budget(self.config, ctx.model, ctx.kind, budget) # Phase-A agents emit no FOREMAN-SUMMARY, so the only extension trigger here is - # a hard turn cut-off: resume the SAME session with more turns, up to the cap, - # rather than handing back a half-written draft (the planner routinely needs it). + # a hard turn cut-off: resume the SAME session with more turns, until the + # cumulative wall/cost ceiling bites, rather than handing back a half-written + # draft (the planner routinely needs it). extensions = 0 + chain_wall_min = 0.0 + chain_cost_usd = 0.0 session_id: Optional[str] = None prompt = ctx.prompt while True: run_id = f"{self._run_id_clock()}-{ctx.label}" if extensions > 0: - ext_turns = self.config.turn_extension_size or self.config.run_budget.max_turns + ext_turns = self.config.turn_extension_size or budget.max_turns run_budget = replace(budget, max_turns=ext_turns) else: run_budget = budget @@ -124,12 +129,18 @@ async def _spawn(self, slug: str, ctx: SpawnContext, budget=None) -> RunResult: if result.final_text: self.store.write_run_summary(slug, run_id, result.final_text) + chain_wall_min += run_duration_min(result.record) + chain_cost_usd += result.record.cost_usd if should_extend( result.record.terminal_reason, has_session=bool(result.record.session_id), extensions=extensions, max_extensions=self.config.max_turn_extensions, auto_extend=self.config.auto_extend_turns, + chain_wall_min=chain_wall_min, + chain_cost_usd=chain_cost_usd, + wall_ceiling_min=self.config.extension_wall_min, + cost_ceiling_usd=self.config.extension_cost_usd, ): extensions += 1 session_id = result.record.session_id diff --git a/src/foreman/retro/driver.py b/src/foreman/retro/driver.py index 08aaf6e..e82af65 100644 --- a/src/foreman/retro/driver.py +++ b/src/foreman/retro/driver.py @@ -42,6 +42,10 @@ async def analyze( runner = runner or AgentRunner(backend) records = _records_for(store, slugs) clusters = retro_mod.cluster_failures(records) + # Flywheel-blindness fix: a high kill rate (e.g. killed_turns) is a first-class + # proposal trigger drafted deterministically, so the dominant failure is never + # silently ignored even when the analysis agent proposes nothing. + kill_proposals = retro_mod.propose_for_clusters(clusters, len(records)) per_feature = [metrics_mod.load_feature_metrics(store, s) for s in slugs] digest = metrics_mod.trend(per_feature) prompt = retro_mod.build_analysis_prompt(clusters, digest) @@ -54,7 +58,13 @@ async def analyze( budget=config.evaluator_budget, label="retro", agent=RETRO_AGENT, ) result = await runner.run(spec, run_id=run_id) - proposals = retro_mod.parse_proposals(result.final_text) + # Deterministic kill-rate proposals first (never lost), then the agent's — deduped + # by (target, title) so an agent that also names a kill fix doesn't double it up. + agent_proposals = retro_mod.parse_proposals(result.final_text) + seen = {(p.target, p.title) for p in kill_proposals} + proposals = kill_proposals + [ + p for p in agent_proposals if (p.target, p.title) not in seen + ] return proposals, clusters, per_feature diff --git a/src/foreman/retro/metrics.py b/src/foreman/retro/metrics.py index a9ea218..97b59b3 100644 --- a/src/foreman/retro/metrics.py +++ b/src/foreman/retro/metrics.py @@ -27,9 +27,44 @@ HUMAN_REJECTED = "human_rejected" # rendered as human_rejected() LEGACY = "legacy" # pre-WS6 / unlabelled runs +# Terminal-reason outcomes. Every run ends with a runner terminal_reason; these are +# the taxonomy labels for the non-delivery terminal states so a killed/erroring run +# is never left blank/legacy (the flywheel-blindness fix — a turn-budget kill must be +# a first-class, clusterable outcome, not an invisible one). They match the +# ``runner`` terminal-reason constants verbatim. +COMPLETED = "completed" # ran to completion (a phase agent, or a bounced attempt) +KILLED_TURNS = "killed_turns" # cut off at the turn budget (the dominant dogfood failure) +KILLED_COST = "killed_cost" +KILLED_TIMEOUT = "killed_timeout" +KILLED_STUCK = "killed_stuck" +KILLED_USER = "killed_user" # operator-initiated kill — terminal, but not a failure +ERROR = "error" # the agent run errored out + # Labels that count as a delivered issue. SUCCESS_LABELS = (SUCCESS_FIRST_TRY, SUCCESS_AFTER_RETRY) +# Kill outcomes the flywheel treats as recurring FAILURES worth clustering/proposing +# on. ``completed`` is terminal-but-fine and ``killed_user`` is a deliberate human +# action, so neither is a failure-kill. +KILL_OUTCOMES = (KILLED_TURNS, KILLED_COST, KILLED_TIMEOUT, KILLED_STUCK, ERROR) + +# Every recognized terminal-reason outcome (the ones a run may legitimately carry +# straight from its terminal_reason, before any richer taxonomy stamp). +_TERMINAL_OUTCOMES = frozenset((COMPLETED,) + KILL_OUTCOMES + (KILLED_USER,)) + + +def terminal_outcome(terminal_reason: Any) -> str: + """Map a runner ``terminal_reason`` to a non-blank taxonomy outcome label. + + Recognized reasons (``completed`` / the kill reasons / ``error``) map to + themselves; a blank or unrecognized reason maps to :data:`LEGACY`. This is the + default outcome stamped on any run a richer terminal point never labels + (turn-extension intermediates, phase agents, bounced attempts) so no terminal + run is ever persisted blank. + """ + reason = str(terminal_reason or "").strip() + return reason if reason in _TERMINAL_OUTCOMES else LEGACY + def label_success(attempts: int) -> str: """First-try success vs. success-after-retry(n). diff --git a/src/foreman/retro/retro.py b/src/foreman/retro/retro.py index 60834a0..7e5b8d6 100644 --- a/src/foreman/retro/retro.py +++ b/src/foreman/retro/retro.py @@ -59,8 +59,15 @@ def _signature(outcome: str) -> Optional[str]: their reason; evaluator bounces and regressions form their own buckets. """ stem = _metrics.base_label(outcome) - if stem in _metrics.SUCCESS_LABELS or stem == _metrics.LEGACY: + # Non-failure terminals never cluster: successes, legacy/unlabelled, a clean + # completion, and an operator-initiated kill (a deliberate human action). + if stem in _metrics.SUCCESS_LABELS or stem in ( + _metrics.LEGACY, _metrics.COMPLETED, _metrics.KILLED_USER + ): return None + # Kill reasons (killed_turns/cost/timeout/stuck, error) are first-class failure + # signatures — the dominant dogfood failure (killed_turns) clusters here via the + # bare-stem catch-all below, so the flywheel can finally see it. if stem == _metrics.EVALUATOR_BOUNCE: return "evaluator_bounce" if stem in (_metrics.ESCALATED, _metrics.HUMAN_REJECTED): @@ -134,6 +141,138 @@ def to_dict(self) -> dict[str, Any]: } +# --------------------------------------------------------------------------- # +# Deterministic kill-rate proposals — the flywheel must never miss the dominant +# failure just because it is a turn/cost/timeout kill rather than an escalation. +# --------------------------------------------------------------------------- # +# A kill cluster at/above this share of all runs is itself a finding: the flywheel +# drafts the corresponding fix directly, without waiting on the analysis agent. +KILL_RATE_PROPOSAL_THRESHOLD = 0.20 +# Below this absolute count a kill is treated as noise, not a recurring pattern. +KILL_RATE_MIN_COUNT = 2 + +# One concrete, reviewable proposal per recurring kill reason. The {count}/{total}/ +# {pct} fields are filled from the cluster so the rationale cites the real numbers. +_KILL_PROPOSAL_TEMPLATES: dict[str, dict[str, str]] = { + _metrics.KILLED_TURNS: { + "target": "config:turn_budget", + "title": "Raise the worker turn budget — recurring turn-budget kills", + "rationale": ( + "{count} of {total} runs ({pct}%) ended killed_turns — the turn budget is " + "cutting workers off before they finish, the dominant failure. Raise the " + "model-aware turn budget (config.run_budget.max_turns / " + "turns.resolve_budget) and/or the turn-extension ceiling " + "(config.extension_wall_min, config.max_turn_extensions) so a worker can " + "finish or hand off cleanly instead of being cut off mid-slice." + ), + "diff": ( + "# config / turns.resolve_budget (the worker turn budget)\n" + "- max_turns: \n" + "+ max_turns: \n" + " # or raise extension_wall_min / max_turn_extensions so the existing\n" + " # turn-extension chain runs longer before it gives up." + ), + }, + _metrics.KILLED_COST: { + "target": "config:cost_budget", + "title": "Re-tune the worker cost budget — recurring cost kills", + "rationale": ( + "{count} of {total} runs ({pct}%) ended killed_cost. Either raise " + "config.run_budget.max_cost_usd for these slices or cut the per-turn cost " + "(smaller assembled prompt / cheaper model for the phase)." + ), + "diff": ( + "# config.run_budget.max_cost_usd (or the assembled-prompt size)\n" + "- max_cost_usd: \n" + "+ max_cost_usd: # or reduce prompt_tokens / model cost" + ), + }, + _metrics.KILLED_TIMEOUT: { + "target": "config:timeout", + "title": "Raise the wall-clock timeout — recurring timeout kills", + "rationale": ( + "{count} of {total} runs ({pct}%) ended killed_timeout. Raise " + "config.run_budget.timeout_min for these slices or split them smaller." + ), + "diff": ( + "# config.run_budget.timeout_min\n" + "- timeout_min: \n" + "+ timeout_min: # or slice the work smaller" + ), + }, + _metrics.KILLED_STUCK: { + "target": "prompt:worker", + "title": "Address stuck workers — recurring no-progress kills", + "rationale": ( + "{count} of {total} runs ({pct}%) ended killed_stuck (consecutive turns " + "with no file/test progress). Sharpen the worker prompt's next-step " + "guidance, or raise config.stuck_turns if it is firing too early." + ), + "diff": ( + "# prompt:worker (progress guidance) or config.stuck_turns\n" + "- (stuck detection fires after N idle turns)\n" + "+ (make the next concrete step explicit, or raise stuck_turns)" + ), + }, + _metrics.ERROR: { + "target": "prompt:worker", + "title": "Investigate agent-run errors — recurring error terminations", + "rationale": ( + "{count} of {total} runs ({pct}%) ended error (the agent run itself " + "errored). Inspect the run transcripts for the common failure and harden " + "the worker prompt / run environment against it." + ), + "diff": ( + "# investigate run transcripts; harden prompt:worker or the run env\n" + "- (agent run errors out)\n" + "+ (handle the recurring error mode)" + ), + }, +} + + +def propose_for_clusters( + clusters: list[FailureCluster], + total_runs: int, + *, + rate_threshold: float = KILL_RATE_PROPOSAL_THRESHOLD, + min_count: int = KILL_RATE_MIN_COUNT, +) -> list[PatchProposal]: + """Deterministic patch proposals for kill clusters above a rate threshold (WS6). + + The analysis agent only ever sees clustered failures; before the flywheel-blindness + fix a turn/cost/timeout kill never even formed a cluster, so the dominant failure + was silently ignored ("retro found nothing" while half the runs were killed). This + makes a high kill rate a first-class proposal trigger: when a kill cluster accounts + for at least ``rate_threshold`` of all runs (and at least ``min_count`` runs), the + flywheel drafts the corresponding fix directly — no model needed — so it can never + miss the turn-budget failure again. Each proposal still goes through the same + hash-sealed review gate and bench requirement as any other. + """ + if total_runs <= 0: + return [] + out: list[PatchProposal] = [] + for c in clusters: + template = _KILL_PROPOSAL_TEMPLATES.get(c.pattern) + if template is None: + continue + if c.count < min_count or (c.count / total_runs) < rate_threshold: + continue + pct = round(100 * c.count / total_runs) + out.append( + PatchProposal( + target=template["target"], + title=template["title"], + rationale=template["rationale"].format( + count=c.count, total=total_runs, pct=pct + ), + diff=template["diff"], + version_bump=1, + ) + ) + return out + + def build_analysis_prompt(clusters: list[FailureCluster], runs_digest: str) -> str: """The prompt the ``kind="retro"`` analysis agent gets. diff --git a/src/foreman/runner.py b/src/foreman/runner.py index ca91806..c180d3b 100644 --- a/src/foreman/runner.py +++ b/src/foreman/runner.py @@ -50,6 +50,10 @@ def should_extend( max_extensions: int, auto_extend: bool, requested_more: bool = False, + chain_wall_min: float = 0.0, + chain_cost_usd: float = 0.0, + wall_ceiling_min: Optional[float] = None, + cost_ceiling_usd: Optional[float] = None, ) -> bool: """Decide 'resume the SAME session with more turns' vs 'give up' (R5/§9, WS3.3). @@ -58,14 +62,42 @@ def should_extend( A turn cut-off (``KILLED_TURNS``) — or an explicit worker request via ``requested_more`` — extends; cost/timeout/stuck/error kills never do. An - extension is only possible when auto-extend is enabled, a resumable session - exists, and the per-run extension cap has not yet been reached. + extension is only possible when auto-extend is enabled and a resumable session + exists. + + The primary limit (issue #1) is the cumulative wall-clock + cost of the + extension chain: extension stops once ``chain_wall_min`` reaches + ``wall_ceiling_min`` or ``chain_cost_usd`` reaches ``cost_ceiling_usd`` (each + consulted only when its ceiling is provided). ``max_extensions`` is a runaway + backstop on the resume count; ``0`` disables it (wall/cost only). """ - if not auto_extend or not has_session or extensions >= max_extensions: + if not auto_extend or not has_session: + return False + if max_extensions and extensions >= max_extensions: + return False + if wall_ceiling_min is not None and chain_wall_min >= wall_ceiling_min: + return False + if cost_ceiling_usd is not None and chain_cost_usd >= cost_ceiling_usd: return False return requested_more or terminal_reason == KILLED_TURNS +def run_duration_min(record: RunRecord) -> float: + """Wall-clock minutes between a run's ``started`` and ``finished`` timestamps. + + Returns 0.0 when either timestamp is missing or unparseable — used to + accumulate the wall-clock of a turn-extension chain (issue #1). + """ + if not record.started or not record.finished: + return 0.0 + try: + start = datetime.fromisoformat(record.started.replace("Z", "+00:00")) + end = datetime.fromisoformat(record.finished.replace("Z", "+00:00")) + except ValueError: + return 0.0 + return max(0.0, (end - start).total_seconds() / 60.0) + + EventCallback = Callable[[StreamEvent], None] diff --git a/src/foreman/scheduler.py b/src/foreman/scheduler.py index db20f9b..304971a 100644 --- a/src/foreman/scheduler.py +++ b/src/foreman/scheduler.py @@ -22,7 +22,7 @@ from . import ( audit as audit_mod, conflicts, git_ops, janitor as janitor_mod, - locks, notify as notify_mod, prd, prompts, vendored, + locks, notify as notify_mod, prd, prompts, turns, vendored, ) from .agents import evaluator as evaluator_mod from .agents import installer as agents_installer @@ -35,7 +35,7 @@ from .issue_run import IssueRun from .ledger import CostLedger from .models import DocStatus, IssueStatus, Issue -from .runner import AgentRunner, RunResult, should_extend +from .runner import AgentRunner, RunResult, run_duration_min, should_extend, KILLED_TURNS from .skill_invocation import SkillInvocation from .state import FileStore from .verify import verify @@ -74,6 +74,9 @@ class BuildReport: e2e: Optional[str] = None audit: Optional[str] = None # WS5.1: satisfied | amendment_drafted(n) | ... stopped_reason: str = "" + # Issue #1: runs that ended killed_turns — (label, num_turns) — surfaced loudly so + # an unattended operator sees the dominant failure without grepping runs/. + turn_killed: list[tuple[str, int]] = field(default_factory=list) def render(self) -> str: lines = [f"# Build report — {self.slug}", ""] @@ -87,6 +90,9 @@ def render(self) -> str: if self.janitor: lines.append("- Janitor passes:") lines += [f" - {iid} ({kind}): {outcome}" for iid, kind, outcome in self.janitor] + if self.turn_killed: + lines.append("- Turn-killed runs (ran out of turns):") + lines += [f" - {label}: {n} turns" for label, n in self.turn_killed] lines.append(f"- Total cost: ${self.total_cost_usd:.4f}") lines.append(f"- Retries: {self.retries} · Escalations: {len(self.escalated)}") if self.e2e: @@ -318,6 +324,12 @@ def _tally(self, slug: str, report: BuildReport) -> None: report.blocked.append(i.id) report.retries = sum(i.attempts for i in state.issues if not i.is_janitor) report.total_cost_usd = self.feature_cost(slug) + # Issue #1: surface every run that ran out of turns, loudly, in the report. + report.turn_killed = [ + (str(r.get("label") or r.get("run_id") or "?"), int(r.get("num_turns") or 0)) + for r in self.store.usage_records(slug) + if r.get("terminal_reason") == KILLED_TURNS + ] # ------------------------------------------------------------------ # # Single issue lifecycle (with retries) @@ -335,17 +347,22 @@ async def _work_issue( async def _run_agent_with_extensions( self, slug: str, *, label: str, monitor_id: Optional[str], base_budget, - build_spec, continuation: str = "", + build_spec, continuation: str = "", phase: str = "grader", model: str = "", ): """Run a non-worker agent (evaluator / e2e / auditor) with bounded turn-budget - extensions: on a hard turn cut-off, resume the SAME session with more turns up - to ``max_turn_extensions`` before giving up — so a long evaluation/audit/e2e + extensions: on a hard turn cut-off, resume the SAME session with more turns + until the cumulative wall/cost ceiling bites — so a long evaluation/audit/e2e isn't lost to the budget. Persists each run's record + summary and accrues cost. + ``base_budget`` is first scaled for ``model`` + ``phase`` (issue #1). ``build_spec(session_id, budget) -> RunSpec`` builds the spec per attempt. Returns ``(final RunResult, final run_id)``. """ + if model: + base_budget = turns.resolve_budget(self.config, model, phase, base_budget) extensions = 0 + chain_wall_min = 0.0 + chain_cost_usd = 0.0 session_id: Optional[str] = None while True: run_id = f"{self._run_id_clock()}-{label}" @@ -372,12 +389,18 @@ async def _run_agent_with_extensions( self.store.write_run_summary(slug, run_id, result.final_text) self.ledger.add(result.record.cost_usd) + chain_wall_min += run_duration_min(result.record) + chain_cost_usd += result.record.cost_usd if should_extend( result.record.terminal_reason, has_session=bool(result.record.session_id), extensions=extensions, max_extensions=self.config.max_turn_extensions, auto_extend=self.config.auto_extend_turns, + chain_wall_min=chain_wall_min, + chain_cost_usd=chain_cost_usd, + wall_ceiling_min=self.config.extension_wall_min, + cost_ceiling_usd=self.config.extension_cost_usd, ): extensions += 1 session_id = result.record.session_id @@ -410,6 +433,7 @@ def _build(session_id, budget): result, run_id = await self._run_agent_with_extensions( slug, label=f"{issue.id}-eval", monitor_id=f"{issue.id}-eval", base_budget=self.config.evaluator_budget, build_spec=_build, + phase="grader", model=self.config.model_evaluator, continuation=prompts.agent_continuation( "grading this slice and emit the required " "verdict JSON block, then stop. Do not start over."), @@ -451,6 +475,7 @@ def _build(session_id, b): ) result, run_id = await self._run_agent_with_extensions( slug, label=label, monitor_id=label, base_budget=budget, build_spec=_build, + phase="grader", model=model, continuation=prompts.agent_continuation(task), ) verdict = parse(result.final_text) @@ -547,7 +572,9 @@ async def _run_initializer(self, slug: str) -> None: kind="initializer", slug=slug, repo_root=self.store.paths.root, cwd=self.store.paths.root, prompt=prompt, model=self.config.model_planner, effort=self.config.effort, permission_mode=self.config.permission_mode, - budget=self.config.run_budget, label="init", + budget=turns.resolve_budget( + self.config, self.config.model_planner, "init", self.config.run_budget + ), label="init", extra_dirs=[self.store.paths.feature_dir(slug)], ) if self.monitor: @@ -720,7 +747,7 @@ def _build(session_id, budget): ) await self._run_agent_with_extensions( slug, label="e2e", monitor_id="e2e", base_budget=self.config.run_budget, - build_spec=_build, + build_spec=_build, phase="e2e", model=self.config.model_worker, continuation=prompts.agent_continuation("the e2e flow and stop."), ) if e2e_cmd: @@ -762,6 +789,7 @@ def _build(session_id, budget): result, run_id = await self._run_agent_with_extensions( slug, label="audit", monitor_id="audit", base_budget=self.config.evaluator_budget, build_spec=_build, + phase="grader", model=self.config.model_auditor, continuation=prompts.agent_continuation( "the audit and emit the required audit JSON " "block, then stop. Do not start over."), diff --git a/src/foreman/state.py b/src/foreman/state.py index 7b8a44b..e31dde5 100644 --- a/src/foreman/state.py +++ b/src/foreman/state.py @@ -412,8 +412,17 @@ def _queue_confirmed(self, slug: str) -> bool: def write_run_record(self, slug: str, record: RunRecord) -> None: rdir = self.paths.run_dir(slug, record.run_id) rdir.mkdir(parents=True, exist_ok=True) + data = record.to_dict() + # Flywheel-blindness fix: a run a richer terminal point never labels + # (turn-extension intermediates, phase agents, bounced attempts) is stamped + # with its terminal_reason here, so no terminal run is ever persisted blank. + # A specific taxonomy stamp (success/escalated/evaluator_bounce) is written + # with a non-blank outcome and so passes through unchanged. + if not data.get("outcome"): + from .retro import metrics as _metrics # lazy: avoids an import cycle + data["outcome"] = _metrics.terminal_outcome(data.get("terminal_reason")) self.paths.run_usage(slug, record.run_id).write_text( - json.dumps(record.to_dict(), indent=2) + json.dumps(data, indent=2) ) def write_run_summary(self, slug: str, run_id: str, summary: str) -> None: diff --git a/src/foreman/tui/controller.py b/src/foreman/tui/controller.py index 4fcf49f..e1d82ec 100644 --- a/src/foreman/tui/controller.py +++ b/src/foreman/tui/controller.py @@ -205,8 +205,14 @@ def worker_finished(self, issue_id: str, status: str, result) -> None: wl.cost = result.record.cost_usd or wl.cost wl.prompt_tokens = result.record.prompt_tokens or wl.prompt_tokens tok = f", {wl.prompt_tokens} ctx-tok" if wl.prompt_tokens else "" - wl.append(f"■ finished: {status} (${wl.cost:.4f}, {result.record.num_turns} turns{tok})") - self._emit(f" ■ worker {issue_id}: {status} " + # Issue #1: make a non-clean terminal reason (esp. killed_turns) LOUD instead + # of silently hiding behind the issue status. + reason = result.record.terminal_reason + flag = f" ⚠ {reason}" if reason and reason != "completed" else "" + wl.append( + f"■ finished: {status}{flag} (${wl.cost:.4f}, {result.record.num_turns} turns{tok})" + ) + self._emit(f" ■ worker {issue_id}: {status}{flag} " f"(${wl.cost:.4f}, {result.record.num_turns} turns{tok})") def escalated(self, issue_id: str, reason: str) -> None: diff --git a/src/foreman/turns.py b/src/foreman/turns.py new file mode 100644 index 0000000..a2661d4 --- /dev/null +++ b/src/foreman/turns.py @@ -0,0 +1,97 @@ +"""Model-aware turn-budget policy (issue #1). + +The single owner of "how many turns should this run get?". Pure functions over +``(model, phase, configured_budget, overrides)`` — no I/O, no state — so the +policy can be unit-tested exhaustively and reasoned about in one screen. + +Resolution rule: + +1. **Exact pin** — if the model has an entry in ``turn_budget_by_model`` + (``overrides``), that integer is returned verbatim, bypassing tier, phase + factor, and floor. The operator's precise escape hatch. +2. **Otherwise** — ``max(configured, round(tier_floor × phase_factor))``. The + tier floor can only raise a too-small configured budget, never reduce a + deliberately large one. + +A 30-turn cap suits a frontier model but starves a small/cheap one; the dogfood +soak test ended 49% of runs ``killed_turns`` for exactly this reason. +""" + +from __future__ import annotations + +from dataclasses import replace +from typing import Mapping, Optional + +from .models import Budget + +# Built-in defaults — operator-overridable via config (turn_tiers / phase_turn_factors). +TURN_TIERS: dict[str, int] = {"small": 60, "large": 30} # tier -> turn floor +PHASE_FACTOR: dict[str, float] = { # multiplier on the tier floor + "planner": 1.0, + "grill": 1.5, + "slicer": 1.5, + "worker": 1.0, + "e2e": 1.25, + "init": 1.0, + "grader": 1.0, +} + +# Substring hints (case-insensitive) used to classify a model id into a tier. +SMALL_HINTS: tuple[str, ...] = ("haiku", "mini", "small", "flash", "lite", "nano") +LARGE_HINTS: tuple[str, ...] = ("sonnet", "opus", "fable") + +DEFAULT_PHASE_FACTOR: float = 1.0 # unknown phase +DEFAULT_TIER: str = "small" # unknown model -> generous (more turns, not fewer) + + +def classify_model(model: str) -> str: + """Classify a model id into a turn tier. + + A small-model hint wins first, then a known frontier family; an unrecognised + id falls back to ``small`` so an unknown model gets MORE turns, not fewer. + """ + m = (model or "").lower() + if any(h in m for h in SMALL_HINTS): + return "small" + if any(h in m for h in LARGE_HINTS): + return "large" + return DEFAULT_TIER + + +def effective_turns( + model: str, + phase: str, + configured: int, + *, + overrides: Optional[Mapping[str, int]] = None, + tiers: Optional[Mapping[str, int]] = None, + factors: Optional[Mapping[str, float]] = None, +) -> int: + """Resolve the effective ``max_turns`` for a run (see module docstring).""" + overrides = overrides or {} + if model in overrides: + return int(overrides[model]) # exact pin — verbatim, every phase + + tier_table = {**TURN_TIERS, **(tiers or {})} + factor_table = {**PHASE_FACTOR, **(factors or {})} + floor = tier_table[classify_model(model)] + factor = factor_table.get(phase, DEFAULT_PHASE_FACTOR) + scaled = round(floor * factor) + return max(int(configured), scaled) + + +def resolve_budget(config, model: str, phase: str, base: Budget) -> Budget: + """A copy of ``base`` with ``max_turns`` resolved for this model + phase. + + The single wiring helper called at every ``RunSpec`` assembly site. Only + ``max_turns`` changes; ``max_cost_usd`` / ``timeout_min`` are preserved. + """ + return replace( + base, + max_turns=effective_turns( + model, phase, base.max_turns, + overrides=config.turn_budget_by_model, + tiers=config.turn_tiers, + factors=config.phase_turn_factors, + ), + ) diff --git a/tests/test_budget_policy.py b/tests/test_budget_policy.py index ceec18a..0b6e9f6 100644 --- a/tests/test_budget_policy.py +++ b/tests/test_budget_policy.py @@ -5,8 +5,10 @@ extension cap is not yet reached. """ +from foreman.models import RunRecord from foreman.runner import ( should_extend, + run_duration_min, KILLED_TURNS, KILLED_COST, KILLED_TIMEOUT, @@ -51,3 +53,66 @@ def test_explicit_request_extends_even_on_non_turn_terminal(): def test_explicit_request_still_blocked_without_budget(): assert should_extend(COMPLETED, **{**BASE, "extensions": 3}, requested_more=True) is False assert should_extend(COMPLETED, **{**BASE, "has_session": False}, requested_more=True) is False + + +# --------------------------------------------------------------------------- # +# Wall + cost extension ceiling (issue #1) — the primary stop; count is a backstop. +# --------------------------------------------------------------------------- # +def test_under_all_ceilings_still_extends(): + assert should_extend( + KILLED_TURNS, **BASE, + chain_wall_min=10.0, chain_cost_usd=0.5, + wall_ceiling_min=30.0, cost_ceiling_usd=3.0, + ) is True + + +def test_wall_ceiling_stops_extension(): + assert should_extend( + KILLED_TURNS, **BASE, + chain_wall_min=30.0, chain_cost_usd=0.5, + wall_ceiling_min=30.0, cost_ceiling_usd=3.0, + ) is False + + +def test_cost_ceiling_stops_extension(): + assert should_extend( + KILLED_TURNS, **BASE, + chain_wall_min=5.0, chain_cost_usd=3.0, + wall_ceiling_min=30.0, cost_ceiling_usd=3.0, + ) is False + + +def test_zero_max_extensions_disables_count_backstop(): + # Count backstop off (0): a turn-kill keeps extending on wall/cost alone, + # even after many prior extensions. + args = {**BASE, "max_extensions": 0, "extensions": 99} + assert should_extend( + KILLED_TURNS, **args, + chain_wall_min=5.0, chain_cost_usd=0.5, + wall_ceiling_min=30.0, cost_ceiling_usd=3.0, + ) is True + # …but the wall ceiling still bites with the count backstop off. + assert should_extend( + KILLED_TURNS, **args, + chain_wall_min=40.0, chain_cost_usd=0.5, + wall_ceiling_min=30.0, cost_ceiling_usd=3.0, + ) is False + + +def test_ceilings_default_off_preserve_legacy_behavior(): + # No ceilings passed (None) ⇒ wall/cost are not consulted; count rules. + assert should_extend(KILLED_TURNS, **BASE) is True + + +# --------------------------------------------------------------------------- # +# run_duration_min — minutes between a run's start/finish ISO timestamps +# --------------------------------------------------------------------------- # +def test_run_duration_min_from_timestamps(): + r = RunRecord(run_id="x", label="l", + started="2026-06-20T10:00:00Z", finished="2026-06-20T10:09:00Z") + assert run_duration_min(r) == 9.0 + + +def test_run_duration_min_missing_finish_is_zero(): + r = RunRecord(run_id="x", label="l", started="2026-06-20T10:00:00Z") + assert run_duration_min(r) == 0.0 diff --git a/tests/test_config.py b/tests/test_config.py index 2713c09..af9601d 100644 --- a/tests/test_config.py +++ b/tests/test_config.py @@ -67,3 +67,76 @@ def test_gate_review_flags_load_from_yaml(tmp_path): assert cfg.security_review_enabled is True assert cfg.model_security_reviewer == "claude-opus-4-8" assert cfg.security_review_budget.max_turns == 12 + + +# --------------------------------------------------------------------------- # +# Model-aware turn budgets (issue #1) +# --------------------------------------------------------------------------- # +def test_turn_budget_defaults(): + cfg = Config() + assert cfg.turn_budget_by_model == {} + assert cfg.turn_tiers == {} + assert cfg.phase_turn_factors == {} + # wall + cost are the primary extension limits; the count is a backstop. + assert cfg.extension_wall_min == 30 + assert cfg.extension_cost_usd == 3.0 + assert cfg.max_turn_extensions == 6 + + +def test_turn_budget_fields_load_from_yaml(tmp_path): + src = tmp_path / "config.yaml" + src.write_text( + "turn_budget_by_model:\n claude-haiku-4-5: 80\n" + "turn_tiers:\n small: 70\n" + "phase_turn_factors:\n grill: 2.0\n" + "extension_wall_min: 45\n" + "extension_cost_usd: 4.5\n" + "max_turn_extensions: 4\n" + ) + cfg = config.load(src) + assert cfg.turn_budget_by_model == {"claude-haiku-4-5": 80} + assert cfg.turn_tiers == {"small": 70} + assert cfg.phase_turn_factors == {"grill": 2.0} + assert cfg.extension_wall_min == 45 + assert cfg.extension_cost_usd == 4.5 + assert cfg.max_turn_extensions == 4 + + +def test_turn_budget_fields_roundtrip(): + cfg = Config( + turn_budget_by_model={"claude-haiku-4-5": 80}, + turn_tiers={"small": 70, "large": 35}, + phase_turn_factors={"grill": 2.0}, + extension_wall_min=45, + extension_cost_usd=4.5, + ) + again = config.from_dict(cfg.to_dict()) + assert again.to_dict() == cfg.to_dict() + assert again.turn_budget_by_model == {"claude-haiku-4-5": 80} + assert again.turn_tiers == {"small": 70, "large": 35} + assert again.phase_turn_factors == {"grill": 2.0} + + +def test_negative_extension_wall_rejected(): + with pytest.raises(ConfigError): + Config(extension_wall_min=-1).validate() + + +def test_non_positive_extension_cost_rejected(): + with pytest.raises(ConfigError): + Config(extension_cost_usd=0).validate() + + +def test_non_positive_tier_floor_rejected(): + with pytest.raises(ConfigError): + Config(turn_tiers={"small": 0}).validate() + + +def test_non_positive_phase_factor_rejected(): + with pytest.raises(ConfigError): + Config(phase_turn_factors={"grill": 0}).validate() + + +def test_non_positive_model_pin_rejected(): + with pytest.raises(ConfigError): + Config(turn_budget_by_model={"claude-haiku-4-5": 0}).validate() diff --git a/tests/test_controller.py b/tests/test_controller.py index 688c0d4..79e559d 100644 --- a/tests/test_controller.py +++ b/tests/test_controller.py @@ -38,6 +38,41 @@ async def test_controller_demo_drives_pipeline(tmp_path): assert c.feature_cost(slug) > 0 +@pytest.mark.asyncio +async def test_worker_finished_surfaces_killed_turns(tmp_path): + """Issue #1: a turn-killed run is loud in the worker log (no longer silent).""" + from foreman.models import RunRecord + from foreman.runner import RunResult + + c = Controller(tmp_path, demo=True) + rec = RunRecord( + run_id="r1-ISS-001", label="ISS-001", + started="2026-06-20T10:00:00Z", finished="2026-06-20T10:05:00Z", + num_turns=30, cost_usd=0.25, terminal_reason="killed_turns", + ) + c.worker_finished("ISS-001", "tests_failing", RunResult(rec, "", None, None)) + line = c.workers["ISS-001"].lines[-1] + assert "killed_turns" in line + assert "⚠" in line + + +@pytest.mark.asyncio +async def test_worker_finished_clean_run_has_no_warning(tmp_path): + from foreman.models import RunRecord + from foreman.runner import RunResult + + c = Controller(tmp_path, demo=True) + rec = RunRecord( + run_id="r1-ISS-001", label="ISS-001", + started="2026-06-20T10:00:00Z", finished="2026-06-20T10:05:00Z", + num_turns=12, cost_usd=0.10, terminal_reason="completed", + ) + c.worker_finished("ISS-001", "done", RunResult(rec, "", None, None)) + line = c.workers["ISS-001"].lines[-1] + assert "⚠" not in line + assert "completed" not in line # a clean finish stays terse + + @pytest.mark.asyncio async def test_request_changes_on_prd_amendment_creates_fix_issues(tmp_path): """H6: requesting changes on a PRD that carries an auto-drafted amendment diff --git a/tests/test_integration_ws56.py b/tests/test_integration_ws56.py index a404d3e..fbd4d31 100644 --- a/tests/test_integration_ws56.py +++ b/tests/test_integration_ws56.py @@ -219,6 +219,53 @@ async def retro_script(spec): assert store.paths.skill_changelog_file.exists() +# --- Flywheel-blindness fix: a high kill rate yields a drafted proposal (AC5) --- # + +@pytest.mark.asyncio +async def test_retro_proposes_on_killed_runs_even_when_agent_is_silent(tmp_path): + """Regression for the central Phase-2 finding: a campaign dominated by + killed_turns must yield a NON-EMPTY retro proposal set — even when the analysis + agent proposes nothing — instead of the false 'retro found nothing'.""" + from foreman.retro import driver + from foreman.demo_scripts import _init, _result + from foreman.stream_parser import parse_event + + store = FileStore(tmp_path / "repo", clock=lambda: "2026-01-01T00:00:00Z") + slug = store.create_feature("killed campaign", "soak test") + # Seed a fixture of killed_turns runs (the 49% dogfood pattern) + one success. + seed = [("001-ISS-001", "killed_turns"), ("002-ISS-002", "killed_turns"), + ("003-ISS-003", "killed_turns"), ("004-ISS-004", "killed_turns"), + ("005-ISS-005", "success_first_try")] + for rid, outcome in seed: + store.paths.run_dir(slug, rid).mkdir(parents=True, exist_ok=True) + store.paths.run_usage(slug, rid).write_text(json.dumps({ + "run_id": rid, "label": rid.split("-", 1)[1], + "issue_id": rid.split("-", 1)[1], "outcome": outcome, + "terminal_reason": outcome, + })) + + # A retro analysis agent that finds nothing worth proposing. + async def silent_retro(spec): + yield _init(spec) + yield parse_event({"type": "assistant", "message": {"content": [{ + "type": "text", + "text": '```json\n{"schema":"foreman-retro/v1","proposals":[]}\n```', + }], "usage": {"input_tokens": 1}}}) + yield _result() + + backend = MockBackend({"retro": silent_retro}) + proposals, clusters, _ = await driver.analyze(store, _config(), backend, slugs=[slug]) + + # The dominant killed_turns cluster surfaced... + assert any(c.pattern == "killed_turns" for c in clusters) + # ...and the flywheel drafted the turn-budget fix despite the silent agent. + assert proposals, "killed_turns must not be silently ignored" + assert any("turn" in p.title.lower() for p in proposals) + # The drafted proposal survives the hash-sealed gate as an in_review doc. + names = driver.draft(store, proposals) + assert names and driver.load(store, names[0]).status == "in_review" + + def test_cli_has_retro_and_bench(): from foreman.cli import build_parser p = build_parser() diff --git a/tests/test_metrics.py b/tests/test_metrics.py index bc9160d..7ba664e 100644 --- a/tests/test_metrics.py +++ b/tests/test_metrics.py @@ -55,6 +55,34 @@ def test_is_success(): assert not M.is_success("legacy") +# --------------------------------------------------------------------------- # +# Kill-reason outcomes (flywheel-blindness fix): every terminal run, incl. the +# turn/cost/timeout/error kills, is a first-class outcome — no more blank/legacy. +# --------------------------------------------------------------------------- # +def test_terminal_outcome_maps_kill_reasons_to_themselves(): + assert M.terminal_outcome("killed_turns") == "killed_turns" + assert M.terminal_outcome("killed_cost") == "killed_cost" + assert M.terminal_outcome("killed_timeout") == "killed_timeout" + assert M.terminal_outcome("killed_stuck") == "killed_stuck" + assert M.terminal_outcome("error") == "error" + assert M.terminal_outcome("completed") == "completed" + + +def test_terminal_outcome_blank_or_unknown_is_legacy(): + assert M.terminal_outcome("") == "legacy" + assert M.terminal_outcome(None) == "legacy" + assert M.terminal_outcome("nonsense-reason") == "legacy" + + +def test_kill_outcomes_are_failures_not_successes(): + assert "killed_turns" in M.KILL_OUTCOMES # the dominant dogfood failure + for k in M.KILL_OUTCOMES: + assert not M.is_success(k) + # completed / killed_user are terminal but NOT failure-kills. + assert "completed" not in M.KILL_OUTCOMES + assert "killed_user" not in M.KILL_OUTCOMES + + # --------------------------------------------------------------------------- # # from_record # --------------------------------------------------------------------------- # diff --git a/tests/test_pipeline.py b/tests/test_pipeline.py index 5b27aad..0feb779 100644 --- a/tests/test_pipeline.py +++ b/tests/test_pipeline.py @@ -248,6 +248,36 @@ async def planner_script(spec): assert plan.body.strip() == "# revised plan" # agent frontmatter stripped +@pytest.mark.asyncio +async def test_planner_budget_is_model_floored(repo): + """Issue #1 wiring: a small-tier planner model receives the tier-floored turn + budget at the real spawn site, even when run_budget.max_turns is set lower.""" + from foreman.demo_scripts import _init, _result, _assistant + + counter = itertools.count(1) + store = FileStore(repo, clock=lambda: f"2026-01-01T00:00:{next(counter):02d}Z") + cfg = Config() + cfg.model_planner = "claude-haiku-4-5" # small tier (floor 60) + cfg.run_budget.max_turns = 30 # below the floor → should be raised + slug = store.create_feature("Add tagging", "tags on notes") + seen_turns = [] + + async def planner_script(spec): + seen_turns.append(spec.budget.max_turns) + yield _init(spec) + draft = store.paths.doc_draft_file(slug, "plan") + draft.parent.mkdir(parents=True, exist_ok=True) + draft.write_text("# Implementation Plan\n\nThe plan body.") + yield _assistant(text="wrote the plan draft") + yield _result() + + rc = itertools.count(1) + pipe = Pipeline(store, cfg, MockBackend({"planner": planner_script}), + run_id_clock=lambda: f"r{next(rc):04d}") + await pipe.run_planner(slug) + assert seen_turns == [60] # 30 floored up to the small-tier floor + + @pytest.mark.asyncio async def test_planner_resumes_on_turn_kill(repo): """Phase-A: a planner cut off by the turn budget is resumed (same session) to @@ -259,6 +289,9 @@ async def test_planner_resumes_on_turn_kill(repo): store = FileStore(repo, clock=lambda: f"2026-01-01T00:00:{next(counter):02d}Z") cfg = Config() cfg.run_budget.max_turns = 2 # tiny → first run is cut off + # This test targets the extension loop; keep the tiny budget by disabling the + # model-aware turn floor (issue #1) so it actually reaches the runner. + cfg.turn_tiers = {"small": 1, "large": 1} slug = store.create_feature("Add tagging", "tags on notes") sessions = [] diff --git a/tests/test_retro.py b/tests/test_retro.py index abcd6b2..c15c4eb 100644 --- a/tests/test_retro.py +++ b/tests/test_retro.py @@ -50,6 +50,66 @@ def test_cluster_failures_empty(): assert R.cluster_failures([{"outcome": "success_first_try"}]) == [] +def _kill_records(): + return [ + {"run_id": "k1", "issue_id": "ISS-001", "outcome": "killed_turns"}, + {"run_id": "k2", "issue_id": "ISS-002", "outcome": "killed_turns"}, + {"run_id": "k3", "issue_id": "ISS-003", "outcome": "killed_turns"}, + {"run_id": "k4", "issue_id": "ISS-004", "outcome": "killed_cost"}, + {"run_id": "c1", "label": "planner", "outcome": "completed"}, # not a failure + {"run_id": "u1", "issue_id": "ISS-005", "outcome": "killed_user"}, # deliberate kill + {"run_id": "s1", "issue_id": "ISS-006", "outcome": "success_first_try"}, + ] + + +def test_cluster_failures_surfaces_kill_reasons(): + """The dominant dogfood failure (killed_turns) must now form a cluster — the + flywheel-blindness fix. completed/killed_user/success never cluster.""" + clusters = R.cluster_failures(_kill_records()) + patterns = {c.pattern: c.count for c in clusters} + assert patterns.get("killed_turns") == 3 + assert patterns.get("killed_cost") == 1 + assert "completed" not in patterns + assert "killed_user" not in patterns + assert "success_first_try" not in patterns + # sorted by descending count -> the turn-budget kill leads. + assert clusters[0].pattern == "killed_turns" + + +# --------------------------------------------------------------------------- # +# propose_for_clusters — a high kill rate is a first-class proposal trigger (AC4) +# --------------------------------------------------------------------------- # +def test_high_kill_rate_drafts_a_proposal(): + clusters = R.cluster_failures(_kill_records()) # killed_turns = 3 of 7 runs (43%) + proposals = R.propose_for_clusters(clusters, total_runs=7) + assert proposals, "a dominant killed_turns cluster must draft a proposal" + turn_props = [p for p in proposals if "turn" in p.title.lower()] + assert turn_props, "the killed_turns cluster must propose the turn-budget fix" + p = turn_props[0] + assert "3" in p.rationale # cites the cluster count + assert p.diff.strip() # concrete + reviewable + assert R.PatchProposal is type(p) # a real proposal that can go through the gate + + +def test_low_kill_rate_drafts_nothing(): + # One stray kill among many runs is below the rate threshold -> no proposal. + clusters = R.cluster_failures( + [{"run_id": "k1", "issue_id": "ISS-001", "outcome": "killed_turns"}] + + [{"run_id": f"s{i}", "issue_id": f"ISS-{i:03d}", + "outcome": "success_first_try"} for i in range(2, 21)] + ) + assert R.propose_for_clusters(clusters, total_runs=20) == [] + + +def test_propose_for_clusters_ignores_non_kill_clusters(): + # Escalations are the analysis agent's job, not the deterministic kill trigger. + clusters = R.cluster_failures([ + {"run_id": "e1", "issue_id": "ISS-001", "outcome": "escalated(budget)"}, + {"run_id": "e2", "issue_id": "ISS-002", "outcome": "escalated(budget)"}, + ]) + assert R.propose_for_clusters(clusters, total_runs=2) == [] + + # --------------------------------------------------------------------------- # # build_analysis_prompt # --------------------------------------------------------------------------- # diff --git a/tests/test_scheduler.py b/tests/test_scheduler.py index ce18bc1..57fbcb0 100644 --- a/tests/test_scheduler.py +++ b/tests/test_scheduler.py @@ -87,6 +87,25 @@ async def test_report_includes_retries_count(tmp_path): assert "Total cost:" in rendered # cost + escalations already present +def test_report_lists_turn_killed_runs(): + """Issue #1: turn-killed runs get a loud callout in the build report.""" + from foreman.scheduler import BuildReport + + rep = BuildReport(slug="x", merged=["ISS-001"], + turn_killed=[("ISS-001", 30), ("grill", 90)]) + out = rep.render() + assert "Turn-killed runs" in out + assert "ISS-001" in out and "30 turns" in out + assert "grill" in out and "90 turns" in out + + +def test_report_no_turn_killed_section_when_clean(): + from foreman.scheduler import BuildReport + + rep = BuildReport(slug="x", merged=["ISS-001"]) + assert "Turn-killed" not in rep.render() + + @pytest.mark.asyncio async def test_build_requires_queue_confirmation(tmp_path): repo, store, slug = await _prepare_feature(tmp_path) @@ -590,12 +609,19 @@ async def script(spec): scripts = demo_scripts() scripts["tdd:ISS-001"] = script - sched = _scheduler(store, _config(), scripts=scripts) - await sched.build(slug) + cfg = _config() + # Targets the extension loop; disable the model-aware turn floor (issue #1) so the + # tiny issue budget reaches the runner and trips KILLED_TURNS. + cfg.turn_tiers = {"small": 1, "large": 1} + sched = _scheduler(store, cfg, scripts=scripts) + report = await sched.build(slug) iss1 = store.load_issue(slug, "ISS-001") assert iss1.status == IssueStatus.MERGED assert iss1.attempts == 0 assert sessions[1] == "demo-tdd" # resumed after the cut-off + # Issue #1: the killed-turns run is surfaced loudly in the report. + assert ("ISS-001", 3) in report.turn_killed + assert "Turn-killed runs" in report.render() @pytest.mark.asyncio @@ -619,6 +645,8 @@ async def script(spec): scripts["tdd:ISS-001"] = script cfg = _config() cfg.auto_extend_turns = False + # Keep the tiny budget reaching the runner (disable the issue #1 turn floor). + cfg.turn_tiers = {"small": 1, "large": 1} sched = _scheduler(store, cfg, scripts=scripts) await sched.build(slug) iss1 = store.load_issue(slug, "ISS-001") @@ -685,6 +713,8 @@ async def test_evaluator_resumes_on_turn_kill_then_grades(tmp_path): cfg = _config() cfg.evaluator_budget = Budget(max_turns=2, max_cost_usd=0, timeout_min=20) # tiny → cut off cfg.turn_extension_size = 30 + # Keep the tiny evaluator budget reaching the runner (disable the issue #1 floor). + cfg.turn_tiers = {"small": 1, "large": 1} sessions = [] def eval_script(spec): diff --git a/tests/test_state.py b/tests/test_state.py index a489e6f..e84ef72 100644 --- a/tests/test_state.py +++ b/tests/test_state.py @@ -2,7 +2,7 @@ import pytest -from foreman.models import Budget, DocStatus, Issue, IssueStatus, Phase +from foreman.models import Budget, DocStatus, Issue, IssueStatus, Phase, RunRecord from foreman.state import FileStore @@ -162,3 +162,35 @@ def test_doc_with_noncanonical_status_loads_tolerantly(store): assert plan.status == DocStatus.DRAFTING # unknown status -> non-approved default assert plan.version == 1 # garbage version -> default assert st.phase == Phase.PLAN_REVIEW # feature still derivable, no crash + + +# --------------------------------------------------------------------------- # +# Run-record outcome stamping: every persisted terminal run is non-blank (AC1). +# A run a richer terminal point never labels (turn-extension intermediates, phase +# agents, bounced attempts) defaults to its terminal_reason — no more blank/legacy. +# --------------------------------------------------------------------------- # +def test_write_run_record_stamps_kill_reason_outcome(store): + slug = store.create_feature("F", "d") + rec = RunRecord(run_id="r1", label="ISS-001", started="t", + terminal_reason="killed_turns") # outcome left blank + store.write_run_record(slug, rec) + (data,) = store.usage_records(slug) + assert data["outcome"] == "killed_turns" + + +def test_write_run_record_stamps_phase_agent_completion(store): + slug = store.create_feature("F", "d") + rec = RunRecord(run_id="r1", label="planner", started="t", + terminal_reason="completed") # planner/grill/slicer were blank + store.write_run_record(slug, rec) + (data,) = store.usage_records(slug) + assert data["outcome"] == "completed" + + +def test_write_run_record_preserves_explicit_outcome(store): + slug = store.create_feature("F", "d") + rec = RunRecord(run_id="r1", label="ISS-001", started="t", + terminal_reason="completed", outcome="success_first_try") + store.write_run_record(slug, rec) + (data,) = store.usage_records(slug) + assert data["outcome"] == "success_first_try" # a richer taxonomy stamp wins diff --git a/tests/test_turns.py b/tests/test_turns.py new file mode 100644 index 0000000..f0c66f8 --- /dev/null +++ b/tests/test_turns.py @@ -0,0 +1,135 @@ +"""turns.py — model-aware turn-budget policy (issue #1). + +Pure functions over (model, phase, configured_budget, overrides). The effective +turn budget is an exact per-model pin if one is configured, otherwise the larger +of the configured budget and the phase-scaled tier floor. +""" + +from foreman.config import Config +from foreman.models import Budget +from foreman.turns import ( + TURN_TIERS, + PHASE_FACTOR, + classify_model, + effective_turns, + resolve_budget, +) + + +# --------------------------------------------------------------------------- # +# classify_model +# --------------------------------------------------------------------------- # +def test_haiku_is_small(): + assert classify_model("claude-haiku-4-5") == "small" + assert classify_model("claude-haiku-4-5-20251001") == "small" + + +def test_frontier_families_are_large(): + assert classify_model("claude-sonnet-4-6") == "large" + assert classify_model("claude-opus-4-8") == "large" + assert classify_model("claude-fable-5") == "large" + + +def test_small_hints_match_anywhere_case_insensitive(): + assert classify_model("Some-MINI-model") == "small" + assert classify_model("vendor-small-v2") == "small" + + +def test_unknown_model_defaults_small_for_generosity(): + # Fail safe: an unrecognised id gets MORE turns, not fewer. + assert classify_model("my-custom-model") == "small" + assert classify_model("") == "small" + + +# --------------------------------------------------------------------------- # +# effective_turns — exact pin +# --------------------------------------------------------------------------- # +def test_exact_pin_bypasses_tier_phase_and_floor(): + overrides = {"claude-haiku-4-5": 45} + # grill would normally scale (×1.5); the pin is honored verbatim. + assert effective_turns("claude-haiku-4-5", "grill", 80, overrides=overrides) == 45 + # and the same in every phase + assert effective_turns("claude-haiku-4-5", "worker", 80, overrides=overrides) == 45 + + +# --------------------------------------------------------------------------- # +# effective_turns — floor semantics +# --------------------------------------------------------------------------- # +def test_tier_floor_raises_a_too_small_budget(): + # haiku small floor (60) beats a configured 30. + assert effective_turns("claude-haiku-4-5", "worker", 30) == 60 + + +def test_larger_configured_budget_is_never_reduced(): + assert effective_turns("claude-haiku-4-5", "worker", 80) == 80 + assert effective_turns("claude-opus-4-8", "worker", 80) == 80 + + +def test_frontier_floor_is_lower_than_small(): + # opus large floor (30); configured 30 -> 30 (unchanged). + assert effective_turns("claude-opus-4-8", "worker", 30) == 30 + + +# --------------------------------------------------------------------------- # +# effective_turns — phase scaling +# --------------------------------------------------------------------------- # +def test_grill_phase_scales_above_worker(): + # small floor 60 × grill 1.5 = 90, beats configured 30. + assert effective_turns("claude-haiku-4-5", "grill", 30) == 90 + + +def test_e2e_factor_rounds(): + # 60 × 1.25 = 75 + assert effective_turns("claude-haiku-4-5", "e2e", 30) == 75 + + +def test_unknown_phase_uses_factor_one(): + assert effective_turns("claude-haiku-4-5", "mystery", 30) == 60 + + +# --------------------------------------------------------------------------- # +# effective_turns — config overrides of the built-in tables (merged) +# --------------------------------------------------------------------------- # +def test_tier_override_merges_over_defaults(): + assert effective_turns("claude-haiku-4-5", "worker", 30, tiers={"small": 80}) == 80 + # an unspecified tier still uses the default + assert effective_turns("claude-opus-4-8", "worker", 10, tiers={"small": 80}) == 30 + + +def test_phase_factor_override_merges_over_defaults(): + assert effective_turns("claude-haiku-4-5", "grill", 30, factors={"grill": 2.0}) == 120 + + +# --------------------------------------------------------------------------- # +# table sanity +# --------------------------------------------------------------------------- # +def test_default_tables_present(): + assert TURN_TIERS["small"] > TURN_TIERS["large"] + assert PHASE_FACTOR["grill"] > PHASE_FACTOR["worker"] + + +# --------------------------------------------------------------------------- # +# resolve_budget — Config-driven, preserves the non-turn Budget fields +# --------------------------------------------------------------------------- # +def test_resolve_budget_floors_small_model_worker(): + base = Budget(max_turns=30, max_cost_usd=2.0, timeout_min=20) + b = resolve_budget(Config(), "claude-haiku-4-5", "worker", base) + assert b.max_turns == 60 + assert b.max_cost_usd == 2.0 # untouched + assert b.timeout_min == 20 # untouched + + +def test_resolve_budget_grill_scales(): + b = resolve_budget(Config(), "claude-haiku-4-5", "grill", Budget(max_turns=30)) + assert b.max_turns == 90 + + +def test_resolve_budget_honors_config_pin(): + cfg = Config(turn_budget_by_model={"claude-haiku-4-5": 45}) + b = resolve_budget(cfg, "claude-haiku-4-5", "grill", Budget(max_turns=30)) + assert b.max_turns == 45 + + +def test_resolve_budget_frontier_unchanged_at_high_base(): + b = resolve_budget(Config(), "claude-opus-4-8", "worker", Budget(max_turns=80)) + assert b.max_turns == 80