docs(cost): codify prompt/token cost playbook + Max-plan price ramp (#18)

jonathanpeterwu · web-flow · commit 64a3674986e6 · 2026-06-08T23:20:30.000-04:00
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -60,6 +60,7 @@ Quick reference (agent_docs/):
 Full documentation (docs/):
 - principles.md - Agent programming paradigm
 - architecture.md - Extension model and browser sandbox
+- cost-optimization.md - Prompt/token cost playbook + Max-plan price ramp
 - SPEC.md - Technical specification
 - API_REFERENCE.md - API docs
 - DEVELOPMENT.md - Dev guide
@@ -241,6 +242,20 @@ Quality gates scale with tier — don't over-engineer AUTOMATE tasks, don't unde
 
 For AUTOMATE and STANDARD tiers: make only the requested changes. Don't refactor surrounding code, add abstractions for one-time operations, or create helpers that are used once. Three similar lines of code is better than a premature abstraction.
 
+## Cost Optimization
+
+Assume token costs only go up: Max-plan usage ramps from ~80% off to full price over ~3 months (codified in `src/core/models/provider-pricing.ts` → `MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`). The cheapest token is the one you don't send. Full playbook: `docs/cost-optimization.md`.
+
+Codified defaults (cheapest lever first):
+- **Route by complexity.** Keep `multiProvider` on — `getOptimalProvider()`/`scoreComplexity()` send simple tasks to cheap models, hard ones to Anthropic. Opus only for CAREFUL/ARCHITECT; Sonnet default; Haiku/cheap providers for AUTOMATE. Output tokens cost 5× input.
+- **Tune `effort` before model.** Default `high` for coding; `medium` for cost-sensitive; `max`/`xhigh` only for correctness-critical. Pair with adaptive thinking.
+- **Protect the prompt cache.** Keep the prefix byte-stable — no timestamps/UUIDs/IDs in the system prompt, no mid-session tool/model swaps (full rebuild). Cache reads are ~0.1×.
+- **Batch non-interactive work** via `AnthropicBatchClient` (50% off).
+- **Cap context** via `ContextBudgetManager`; keep the token-optimization hooks on (dedup/prewarm/script-suggest).
+- **Close the loop:** `conductor learn --evolve` (GEPA) + `stackmemory optimize traces` shrink prompts permanently.
+
+Guardrails (never trade for cost): the sensitive-content guard must keep forcing Anthropic for secrets/PII; correctness tiers stay on the capable model; never truncate inputs silently — cap deliberately via the budget manager.
+
 ## Session Budget
 
 - Max 1 major topic per session — split unrelated work into separate sessions
diff --git a/docs/cost-optimization.md b/docs/cost-optimization.md
@@ -0,0 +1,181 @@
+# Cost Optimization
+
+Codified practices for keeping StackMemory's prompt and token spend low — and for
+staying ahead of rising costs as the Max-plan discount expires.
+
+> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps
+> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every
+> token of agent effort gets ~5× more expensive over that window. This is
+> codified once in `src/core/models/provider-pricing.ts`
+> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and
+> this playbook all read from it. Treat optimizations below as "nice to have"
+> today and "load-bearing" by September.
+
+## Guiding principle
+
+**Be model-agnostic; route by task value, not by reflex.** The expensive
+failure mode is defaulting every workload to the most capable (most expensive)
+model "to be safe" — that burns budget with zero governance. Instead:
+
+- **Match model to task.** Cheap/high-volume models for inference and simple
+  transforms; a premium model (Opus) only for agent workflows where reliability
+  pays for itself; the most expensive tier *only* when the incremental capability
+  demonstrably justifies a multiple-× token premium.
+- **Govern, audit, control.** Spend should be measurable per run, attributable
+  to a task tier, and bounded by a budget — not an untracked aggregate. Cost
+  controls below are pointless without the trace-level visibility to enforce them.
+
+This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers
+(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value.
+
+## The cost model
+
+Two distinct meters run:
+
+1. **API spend** (third-party + Anthropic API) — billed per token via
+   `MODEL_PRICING`. Cost-aware routing already optimizes this.
+2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the
+   Max plan. Today heavily discounted; ramping to full price. Optimize this by
+   spending *fewer tokens per task* (tighter prompts, less context, fewer turns)
+   and *fewer tasks on the expensive tier* (route, cache, batch).
+
+Current list prices (per 1M tokens, sourced 2026-05-26):
+
+| Model            | Input | Output | Context |
+| ---------------- | ----- | ------ | ------- |
+| Opus 4.6/4.7/4.8 | $5    | $25    | 1M      |
+| Sonnet 4.6       | $3    | $15    | 1M      |
+| Haiku 4.5        | $1    | $5     | 200K    |
+
+Output tokens cost **5×** input. Cache reads cost **~0.1×** input; batch is
+**50% off**. The cheapest token is the one you don't send.
+
+## Where spend happens (inventory)
+
+| Surface | File(s) | Dominant cost | Lever |
+| --- | --- | --- | --- |
+| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort |
+| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression |
+| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression |
+| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing |
+| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on |
+| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim |
+
+## Codified practices (ranked by impact)
+
+### 1. Route by complexity — don't pay Opus for lint fixes
+`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks
+to cheap providers and high-complexity to Anthropic, gated by the `multiProvider`
+feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard
+(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to
+save money.
+
+- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve
+  Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×.
+- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather
+  than switching the main loop's model — switching mid-session breaks the prompt
+  cache (see #4).
+
+### 2. Tune `effort`, not model, first
+On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality
+dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble.
+Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and
+reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking
+(`thinking: {type: "adaptive"}`) so the model self-limits reasoning.
+
+### 3. Spend fewer output tokens (5× input)
+- **Lower-effort, terser agents.** Add a silence-default to the conductor
+  template: no narration between tool calls, one-or-two-sentence wrap-ups.
+- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real
+  ceiling, then let `effort`/`task_budget` moderate actual usage.
+- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model
+  sees a countdown and wraps up gracefully instead of being hard-truncated.
+
+### 4. Prompt caching — keep the prefix frozen
+Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix**
+(`tools` → `system` → `messages`):
+- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject
+  volatile context later in `messages`.
+- Don't reorder/add tools or switch models mid-session (full cache rebuild).
+- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent
+  invalidator. See the audit table in the `claude-api` skill (`prompt-caching`).
+- Pre-warm only when first-request latency is user-visible and traffic is bursty.
+
+### 5. Cap context aggressively
+`ContextBudgetManager` (Ralph) already truncates, compresses, and
+priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp:
+- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on.
+- Prefer digests/summaries over raw frame dumps for rehydration.
+- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all
+  resident in context.
+
+### 6. Batch the non-interactive work
+`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive —
+backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a
+batch, not a live request.
+
+### 7. Let the hooks do their job
+The token-optimization hooks (#14) already save ~22% (324K tokens on the
+benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads),
+`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`,
+`script-suggest`. Don't disable them; extend them when new waste patterns show up
+in `scripts/benchmark-hooks.ts`.
+
+### 8. Close the learning loop
+`conductor learn --evolve` (GEPA) mutates the prompt template from failure data,
+and `stackmemory optimize traces` surfaces repeated, wasteful patterns from
+`traces.db`. Run these regularly — a shorter, higher-success prompt is a
+permanent per-run discount that compounds as prices rise.
+
+## 3-month phased playbook
+
+The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+**
+= full price. Escalate effort to match.
+
+**Phase 1 — now (≈80% off): instrument & default-good.**
+- Confirm `multiProvider` on; verify cost-aware routing decisions in traces.
+- Land the terser conductor template + `effort` defaults.
+- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline
+  tokens/task number to measure against.
+
+**Phase 2 — ~month 1–2 (≈40–70%): squeeze.**
+- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and
+  verify hit rates.
+- Move all non-interactive workloads to the Batches API.
+- Run a GEPA pass; adopt the winning template.
+
+**Phase 3 — ~month 3 (full price): enforce.**
+- Treat budgets as hard limits, not hints. Alert when a run exceeds its
+  `effectiveCost` budget.
+- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default;
+  Haiku/cheap providers for AUTOMATE.
+- Re-baseline `count_tokens` against current models (token counting shifts
+  between model versions — don't apply a blanket multiplier).
+
+## Guardrails (don't optimize these away)
+
+- **Security routing** — the sensitive-content guard must keep forcing Anthropic
+  for secrets/PII regardless of cost.
+- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a
+  cheap wrong answer that needs a re-run costs more than one right answer.
+- **No silent truncation** — cap context deliberately via the budget manager;
+  never truncate inputs blindly.
+
+## Quick reference
+
+```bash
+stackmemory conductor learn --evolve     # mutate prompt template from failures
+stackmemory optimize traces              # find repeated wasteful patterns
+node scripts/benchmark-hooks.ts          # measure hook token savings
+stackmemory conductor trace-stats        # aggregate token usage
+```
+
+```ts
+import {
+  effectiveSpendMultiplier,
+  effectiveCost,
+} from './core/models/provider-pricing.js';
+
+effectiveSpendMultiplier();              // today's cost factor along the ramp (0.2 → 1.0)
+effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost
+```
diff --git a/src/core/models/__tests__/provider-pricing.test.ts b/src/core/models/__tests__/provider-pricing.test.ts
@@ -0,0 +1,99 @@
+import { describe, it, expect } from 'vitest';
+import {
+  MODEL_PRICING,
+  calculateCost,
+  formatCost,
+  effectiveSpendMultiplier,
+  effectiveCost,
+  MAX_PLAN_DISCOUNT_RAMP,
+  type DiscountRamp,
+} from '../provider-pricing.js';
+
+describe('provider-pricing table', () => {
+  it('prices Opus 4.x at $5/$25 per 1M', () => {
+    for (const id of [
+      'anthropic/claude-opus-4-8',
+      'anthropic/claude-opus-4-7',
+      'anthropic/claude-opus-4-6',
+    ]) {
+      expect(MODEL_PRICING[id]).toEqual({
+        inputPer1M: 5.0,
+        outputPer1M: 25.0,
+        source: 'platform.claude.com',
+      });
+    }
+  });
+
+  it('prices Sonnet 4.6 and Haiku 4.5 at current rates', () => {
+    expect(MODEL_PRICING['anthropic/claude-sonnet-4-6'].outputPer1M).toBe(15.0);
+    expect(MODEL_PRICING['anthropic/claude-haiku-4-5-20251001']).toEqual({
+      inputPer1M: 1.0,
+      outputPer1M: 5.0,
+      source: 'platform.claude.com',
+    });
+  });
+
+  it('calculates cost from token counts', () => {
+    const c = calculateCost('anthropic', 'claude-opus-4-8', 1_000_000, 1_000_000);
+    expect(c).not.toBeNull();
+    expect(c!.totalCost).toBeCloseTo(30.0, 6); // $5 in + $25 out
+  });
+
+  it('returns null for unknown models', () => {
+    expect(calculateCost('acme', 'gpt-9', 1, 1)).toBeNull();
+  });
+
+  it('formats sub-cent and larger costs distinctly', () => {
+    expect(formatCost(0.000123)).toBe('$0.000123');
+    expect(formatCost(1.5)).toBe('$1.5000');
+  });
+});
+
+describe('Max-plan discount ramp', () => {
+  const ramp: DiscountRamp = {
+    start: '2026-06-06',
+    end: '2026-09-06',
+    startMultiplier: 0.2,
+    endMultiplier: 1.0,
+  };
+
+  it('is 80% off at (or before) the ramp start', () => {
+    expect(effectiveSpendMultiplier(new Date('2026-06-06'), ramp)).toBeCloseTo(0.2);
+    expect(effectiveSpendMultiplier(new Date('2026-01-01'), ramp)).toBeCloseTo(0.2);
+  });
+
+  it('is full price at (or after) the ramp end', () => {
+    expect(effectiveSpendMultiplier(new Date('2026-09-06'), ramp)).toBeCloseTo(1.0);
+    expect(effectiveSpendMultiplier(new Date('2027-01-01'), ramp)).toBeCloseTo(1.0);
+  });
+
+  it('interpolates linearly mid-ramp', () => {
+    // ~halfway through the ~3-month window
+    const mid = effectiveSpendMultiplier(new Date('2026-07-22'), ramp);
+    expect(mid).toBeGreaterThan(0.5);
+    expect(mid).toBeLessThan(0.65);
+  });
+
+  it('falls back to full price on a misconfigured ramp', () => {
+    const bad: DiscountRamp = { ...ramp, start: '2026-09-06', end: '2026-06-06' };
+    expect(effectiveSpendMultiplier(new Date('2026-07-01'), bad)).toBe(1.0);
+  });
+
+  it('exposes a default ramp ending at full price', () => {
+    expect(MAX_PLAN_DISCOUNT_RAMP.endMultiplier).toBe(1.0);
+  });
+
+  it('effectiveCost scales list cost by the ramp multiplier', () => {
+    const r = effectiveCost(
+      'anthropic',
+      'claude-opus-4-8',
+      1_000_000,
+      0,
+      new Date('2026-06-06')
+    );
+    expect(r).not.toBeNull();
+    expect(r!.listCost).toBeCloseTo(5.0, 6);
+    expect(r!.effectiveCost).toBeCloseTo(1.0, 6); // 20% of list at ramp start
+    expect(r!.multiplier).toBeCloseTo(0.2, 6);
+  });
+});
diff --git a/src/core/models/provider-pricing.ts b/src/core/models/provider-pricing.ts