diff --git a/CLAUDE.md b/CLAUDE.md index 40910496..71c4f1ce 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -60,6 +60,7 @@ Quick reference (agent_docs/): Full documentation (docs/): - principles.md - Agent programming paradigm - architecture.md - Extension model and browser sandbox +- cost-optimization.md - Prompt/token cost playbook + Max-plan price ramp - SPEC.md - Technical specification - API_REFERENCE.md - API docs - DEVELOPMENT.md - Dev guide @@ -241,6 +242,20 @@ Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't unde For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor surrounding code, add abstractions for one-time operations, or create helpers that are used once. Three similar lines of code is better than a premature abstraction. +## Cost Optimization + +Assume token costs only go up: Max-plan usage ramps from ~80% off to full price over ~3 months (codified in `src/core/models/provider-pricing.ts` → `MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`). The cheapest token is the one you don't send. Full playbook: `docs/cost-optimization.md`. + +Codified defaults (cheapest lever first): +- **Route by complexity.** Keep `multiProvider` on — `getOptimalProvider()`/`scoreComplexity()` send simple tasks to cheap models, hard ones to Anthropic. Opus only for CAREFUL/ARCHITECT; Sonnet default; Haiku/cheap providers for AUTOMATE. Output tokens cost 5× input. +- **Tune `effort` before model.** Default `high` for coding; `medium` for cost-sensitive; `max`/`xhigh` only for correctness-critical. Pair with adaptive thinking. +- **Protect the prompt cache.** Keep the prefix byte-stable — no timestamps/UUIDs/IDs in the system prompt, no mid-session tool/model swaps (full rebuild). Cache reads are ~0.1×. +- **Batch non-interactive work** via `AnthropicBatchClient` (50% off). +- **Cap context** via `ContextBudgetManager`; keep the token-optimization hooks on (dedup/prewarm/script-suggest). +- **Close the loop:** `conductor learn --evolve` (GEPA) + `stackmemory optimize traces` shrink prompts permanently. + +Guardrails (never trade for cost): the sensitive-content guard must keep forcing Anthropic for secrets/PII; correctness tiers stay on the capable model; never truncate inputs silently — cap deliberately via the budget manager. + ## Session Budget - Max 1 major topic per session — split unrelated work into separate sessions diff --git a/docs/cost-optimization.md b/docs/cost-optimization.md new file mode 100644 index 00000000..3d3c8a4f --- /dev/null +++ b/docs/cost-optimization.md @@ -0,0 +1,181 @@ +# Cost Optimization + +Codified practices for keeping StackMemory's prompt and token spend low — and for +staying ahead of rising costs as the Max-plan discount expires. + +> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps +> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every +> token of agent effort gets ~5× more expensive over that window. This is +> codified once in `src/core/models/provider-pricing.ts` +> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and +> this playbook all read from it. Treat optimizations below as "nice to have" +> today and "load-bearing" by September. + +## Guiding principle + +**Be model-agnostic; route by task value, not by reflex.** The expensive +failure mode is defaulting every workload to the most capable (most expensive) +model "to be safe" — that burns budget with zero governance. Instead: + +- **Match model to task.** Cheap/high-volume models for inference and simple + transforms; a premium model (Opus) only for agent workflows where reliability + pays for itself; the most expensive tier *only* when the incremental capability + demonstrably justifies a multiple-× token premium. +- **Govern, audit, control.** Spend should be measurable per run, attributable + to a task tier, and bounded by a budget — not an untracked aggregate. Cost + controls below are pointless without the trace-level visibility to enforce them. + +This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers +(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value. + +## The cost model + +Two distinct meters run: + +1. **API spend** (third-party + Anthropic API) — billed per token via + `MODEL_PRICING`. Cost-aware routing already optimizes this. +2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the + Max plan. Today heavily discounted; ramping to full price. Optimize this by + spending *fewer tokens per task* (tighter prompts, less context, fewer turns) + and *fewer tasks on the expensive tier* (route, cache, batch). + +Current list prices (per 1M tokens, sourced 2026-05-26): + +| Model | Input | Output | Context | +| ---------------- | ----- | ------ | ------- | +| Opus 4.6/4.7/4.8 | $5 | $25 | 1M | +| Sonnet 4.6 | $3 | $15 | 1M | +| Haiku 4.5 | $1 | $5 | 200K | + +Output tokens cost **5×** input. Cache reads cost **~0.1×** input; batch is +**50% off**. The cheapest token is the one you don't send. + +## Where spend happens (inventory) + +| Surface | File(s) | Dominant cost | Lever | +| --- | --- | --- | --- | +| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort | +| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression | +| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression | +| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing | +| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on | +| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim | + +## Codified practices (ranked by impact) + +### 1. Route by complexity — don't pay Opus for lint fixes +`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks +to cheap providers and high-complexity to Anthropic, gated by the `multiProvider` +feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard +(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to +save money. + +- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve + Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×. +- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather + than switching the main loop's model — switching mid-session breaks the prompt + cache (see #4). + +### 2. Tune `effort`, not model, first +On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality +dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble. +Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and +reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking +(`thinking: {type: "adaptive"}`) so the model self-limits reasoning. + +### 3. Spend fewer output tokens (5× input) +- **Lower-effort, terser agents.** Add a silence-default to the conductor + template: no narration between tool calls, one-or-two-sentence wrap-ups. +- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real + ceiling, then let `effort`/`task_budget` moderate actual usage. +- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model + sees a countdown and wraps up gracefully instead of being hard-truncated. + +### 4. Prompt caching — keep the prefix frozen +Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix** +(`tools` → `system` → `messages`): +- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject + volatile context later in `messages`. +- Don't reorder/add tools or switch models mid-session (full cache rebuild). +- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent + invalidator. See the audit table in the `claude-api` skill (`prompt-caching`). +- Pre-warm only when first-request latency is user-visible and traffic is bursty. + +### 5. Cap context aggressively +`ContextBudgetManager` (Ralph) already truncates, compresses, and +priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp: +- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on. +- Prefer digests/summaries over raw frame dumps for rehydration. +- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all + resident in context. + +### 6. Batch the non-interactive work +`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive — +backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a +batch, not a live request. + +### 7. Let the hooks do their job +The token-optimization hooks (#14) already save ~22% (324K tokens on the +benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads), +`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`, +`script-suggest`. Don't disable them; extend them when new waste patterns show up +in `scripts/benchmark-hooks.ts`. + +### 8. Close the learning loop +`conductor learn --evolve` (GEPA) mutates the prompt template from failure data, +and `stackmemory optimize traces` surfaces repeated, wasteful patterns from +`traces.db`. Run these regularly — a shorter, higher-success prompt is a +permanent per-run discount that compounds as prices rise. + +## 3-month phased playbook + +The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+** += full price. Escalate effort to match. + +**Phase 1 — now (≈80% off): instrument & default-good.** +- Confirm `multiProvider` on; verify cost-aware routing decisions in traces. +- Land the terser conductor template + `effort` defaults. +- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline + tokens/task number to measure against. + +**Phase 2 — ~month 1–2 (≈40–70%): squeeze.** +- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and + verify hit rates. +- Move all non-interactive workloads to the Batches API. +- Run a GEPA pass; adopt the winning template. + +**Phase 3 — ~month 3 (full price): enforce.** +- Treat budgets as hard limits, not hints. Alert when a run exceeds its + `effectiveCost` budget. +- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default; + Haiku/cheap providers for AUTOMATE. +- Re-baseline `count_tokens` against current models (token counting shifts + between model versions — don't apply a blanket multiplier). + +## Guardrails (don't optimize these away) + +- **Security routing** — the sensitive-content guard must keep forcing Anthropic + for secrets/PII regardless of cost. +- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a + cheap wrong answer that needs a re-run costs more than one right answer. +- **No silent truncation** — cap context deliberately via the budget manager; + never truncate inputs blindly. + +## Quick reference + +```bash +stackmemory conductor learn --evolve # mutate prompt template from failures +stackmemory optimize traces # find repeated wasteful patterns +node scripts/benchmark-hooks.ts # measure hook token savings +stackmemory conductor trace-stats # aggregate token usage +``` + +```ts +import { + effectiveSpendMultiplier, + effectiveCost, +} from './core/models/provider-pricing.js'; + +effectiveSpendMultiplier(); // today's cost factor along the ramp (0.2 → 1.0) +effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost +``` diff --git a/src/core/models/__tests__/provider-pricing.test.ts b/src/core/models/__tests__/provider-pricing.test.ts new file mode 100644 index 00000000..d71a6a30 --- /dev/null +++ b/src/core/models/__tests__/provider-pricing.test.ts @@ -0,0 +1,99 @@ +import { describe, it, expect } from 'vitest'; +import { + MODEL_PRICING, + calculateCost, + formatCost, + effectiveSpendMultiplier, + effectiveCost, + MAX_PLAN_DISCOUNT_RAMP, + type DiscountRamp, +} from '../provider-pricing.js'; + +describe('provider-pricing table', () => { + it('prices Opus 4.x at $5/$25 per 1M', () => { + for (const id of [ + 'anthropic/claude-opus-4-8', + 'anthropic/claude-opus-4-7', + 'anthropic/claude-opus-4-6', + ]) { + expect(MODEL_PRICING[id]).toEqual({ + inputPer1M: 5.0, + outputPer1M: 25.0, + source: 'platform.claude.com', + }); + } + }); + + it('prices Sonnet 4.6 and Haiku 4.5 at current rates', () => { + expect(MODEL_PRICING['anthropic/claude-sonnet-4-6'].outputPer1M).toBe(15.0); + expect(MODEL_PRICING['anthropic/claude-haiku-4-5-20251001']).toEqual({ + inputPer1M: 1.0, + outputPer1M: 5.0, + source: 'platform.claude.com', + }); + }); + + it('calculates cost from token counts', () => { + const c = calculateCost('anthropic', 'claude-opus-4-8', 1_000_000, 1_000_000); + expect(c).not.toBeNull(); + expect(c!.totalCost).toBeCloseTo(30.0, 6); // $5 in + $25 out + }); + + it('returns null for unknown models', () => { + expect(calculateCost('acme', 'gpt-9', 1, 1)).toBeNull(); + }); + + it('formats sub-cent and larger costs distinctly', () => { + expect(formatCost(0.000123)).toBe('$0.000123'); + expect(formatCost(1.5)).toBe('$1.5000'); + }); +}); + +describe('Max-plan discount ramp', () => { + const ramp: DiscountRamp = { + start: '2026-06-06', + end: '2026-09-06', + startMultiplier: 0.2, + endMultiplier: 1.0, + }; + + it('is 80% off at (or before) the ramp start', () => { + expect(effectiveSpendMultiplier(new Date('2026-06-06'), ramp)).toBeCloseTo(0.2); + expect(effectiveSpendMultiplier(new Date('2026-01-01'), ramp)).toBeCloseTo(0.2); + }); + + it('is full price at (or after) the ramp end', () => { + expect(effectiveSpendMultiplier(new Date('2026-09-06'), ramp)).toBeCloseTo(1.0); + expect(effectiveSpendMultiplier(new Date('2027-01-01'), ramp)).toBeCloseTo(1.0); + }); + + it('interpolates linearly mid-ramp', () => { + // ~halfway through the ~3-month window + const mid = effectiveSpendMultiplier(new Date('2026-07-22'), ramp); + expect(mid).toBeGreaterThan(0.5); + expect(mid).toBeLessThan(0.65); + }); + + it('falls back to full price on a misconfigured ramp', () => { + const bad: DiscountRamp = { ...ramp, start: '2026-09-06', end: '2026-06-06' }; + expect(effectiveSpendMultiplier(new Date('2026-07-01'), bad)).toBe(1.0); + }); + + it('exposes a default ramp ending at full price', () => { + expect(MAX_PLAN_DISCOUNT_RAMP.endMultiplier).toBe(1.0); + }); + + it('effectiveCost scales list cost by the ramp multiplier', () => { + const r = effectiveCost( + 'anthropic', + 'claude-opus-4-8', + 1_000_000, + 0, + new Date('2026-06-06') + ); + expect(r).not.toBeNull(); + expect(r!.listCost).toBeCloseTo(5.0, 6); + expect(r!.effectiveCost).toBeCloseTo(1.0, 6); // 20% of list at ramp start + expect(r!.multiplier).toBeCloseTo(0.2, 6); + }); +}); diff --git a/src/core/models/provider-pricing.ts b/src/core/models/provider-pricing.ts index 66446ee9..03de7ef4 100644 --- a/src/core/models/provider-pricing.ts +++ b/src/core/models/provider-pricing.ts @@ -4,7 +4,8 @@ * Cost per 1M tokens (USD) for each provider+model pair. * Used by provider benchmarks and cost-aware routing. * - * Prices sourced 2026-02-13. Update periodically. + * Anthropic prices sourced 2026-05-26 (platform.claude.com pricing). + * Other providers sourced 2026-02-13. Update periodically. */ export interface ModelPricing { @@ -17,7 +18,28 @@ export interface ModelPricing { * Pricing table keyed by "provider/model" */ export const MODEL_PRICING: Record = { - // Anthropic (direct API) + // Anthropic (direct API) — Opus 4.x share one price; 1M context, no + // long-context premium. Sourced platform.claude.com 2026-05-26. + 'anthropic/claude-opus-4-8': { + inputPer1M: 5.0, + outputPer1M: 25.0, + source: 'platform.claude.com', + }, + 'anthropic/claude-opus-4-7': { + inputPer1M: 5.0, + outputPer1M: 25.0, + source: 'platform.claude.com', + }, + 'anthropic/claude-opus-4-6': { + inputPer1M: 5.0, + outputPer1M: 25.0, + source: 'platform.claude.com', + }, + 'anthropic/claude-sonnet-4-6': { + inputPer1M: 3.0, + outputPer1M: 15.0, + source: 'platform.claude.com', + }, 'anthropic/claude-sonnet-4-5-20250929': { inputPer1M: 3.0, outputPer1M: 15.0, @@ -29,9 +51,9 @@ export const MODEL_PRICING: Record = { source: 'anthropic.com', }, 'anthropic/claude-haiku-4-5-20251001': { - inputPer1M: 0.8, - outputPer1M: 4.0, - source: 'anthropic.com', + inputPer1M: 1.0, + outputPer1M: 5.0, + source: 'platform.claude.com', }, // OpenAI (direct API) @@ -89,3 +111,81 @@ export function formatCost(usd: number): string { if (usd < 0.01) return `$${usd.toFixed(6)}`; return `$${usd.toFixed(4)}`; } + +/** + * Max-plan discount ramp. + * + * Codifies the planning assumption that Max-plan usage starts at an 80% + * discount and ramps linearly to full price over ~3 months. This is the + * single source of truth for "how expensive is a token of effort *today*" + * so cost-aware routing, budgets, and the optimization playbook + * (docs/cost-optimization.md) all agree. + * + * `effectiveSpendMultiplier()` returns a factor in [startMultiplier, 1.0]: + * 0.20 → paying 20% of list (80% off), 1.0 → full price. + * + * Override the dates/multipliers via env for testing or a different ramp: + * STACKMEMORY_COST_RAMP_START / _END (ISO dates), + * STACKMEMORY_COST_RAMP_START_MULTIPLIER (0–1). + */ +export interface DiscountRamp { + start: string; // ISO date — ramp begins (startMultiplier in effect) + end: string; // ISO date — ramp complete (full price) + startMultiplier: number; // fraction of list price at `start` (0.2 = 80% off) + endMultiplier: number; // fraction of list price at `end` (1.0 = full price) +} + +export const MAX_PLAN_DISCOUNT_RAMP: DiscountRamp = { + start: process.env.STACKMEMORY_COST_RAMP_START ?? '2026-06-06', + end: process.env.STACKMEMORY_COST_RAMP_END ?? '2026-09-06', + startMultiplier: Number( + process.env.STACKMEMORY_COST_RAMP_START_MULTIPLIER ?? '0.2' + ), + endMultiplier: 1.0, +}; + +/** + * Effective spend multiplier for a given date along the discount ramp. + * Linear interpolation, clamped to [startMultiplier, endMultiplier]. + */ +export function effectiveSpendMultiplier( + date: Date = new Date(), + ramp: DiscountRamp = MAX_PLAN_DISCOUNT_RAMP +): number { + const start = Date.parse(ramp.start); + const end = Date.parse(ramp.end); + const now = date.getTime(); + + if (!Number.isFinite(start) || !Number.isFinite(end) || end <= start) { + return ramp.endMultiplier; // misconfigured ramp → assume full price + } + if (now <= start) return ramp.startMultiplier; + if (now >= end) return ramp.endMultiplier; + + const progress = (now - start) / (end - start); + return ( + ramp.startMultiplier + + progress * (ramp.endMultiplier - ramp.startMultiplier) + ); +} + +/** + * Cost adjusted for where we are on the Max-plan discount ramp. + * Use this for budgeting/forecasting; use calculateCost() for raw list cost. + */ +export function effectiveCost( + provider: string, + model: string, + inputTokens: number, + outputTokens: number, + date: Date = new Date() +): { listCost: number; effectiveCost: number; multiplier: number } | null { + const list = calculateCost(provider, model, inputTokens, outputTokens); + if (!list) return null; + const multiplier = effectiveSpendMultiplier(date); + return { + listCost: list.totalCost, + effectiveCost: list.totalCost * multiplier, + multiplier, + }; +}