|
| 1 | +# Cost Optimization |
| 2 | + |
| 3 | +Codified practices for keeping StackMemory's prompt and token spend low — and for |
| 4 | +staying ahead of rising costs as the Max-plan discount expires. |
| 5 | + |
| 6 | +> **Planning assumption.** Max-plan usage starts at an **80% discount** and ramps |
| 7 | +> **linearly to full price over ~3 months** (≈2026-06-06 → 2026-09-06). Every |
| 8 | +> token of agent effort gets ~5× more expensive over that window. This is |
| 9 | +> codified once in `src/core/models/provider-pricing.ts` |
| 10 | +> (`MAX_PLAN_DISCOUNT_RAMP` / `effectiveSpendMultiplier()`); routing, budgets, and |
| 11 | +> this playbook all read from it. Treat optimizations below as "nice to have" |
| 12 | +> today and "load-bearing" by September. |
| 13 | +
|
| 14 | +## Guiding principle |
| 15 | + |
| 16 | +**Be model-agnostic; route by task value, not by reflex.** The expensive |
| 17 | +failure mode is defaulting every workload to the most capable (most expensive) |
| 18 | +model "to be safe" — that burns budget with zero governance. Instead: |
| 19 | + |
| 20 | +- **Match model to task.** Cheap/high-volume models for inference and simple |
| 21 | + transforms; a premium model (Opus) only for agent workflows where reliability |
| 22 | + pays for itself; the most expensive tier *only* when the incremental capability |
| 23 | + demonstrably justifies a multiple-× token premium. |
| 24 | +- **Govern, audit, control.** Spend should be measurable per run, attributable |
| 25 | + to a task tier, and bounded by a budget — not an untracked aggregate. Cost |
| 26 | + controls below are pointless without the trace-level visibility to enforce them. |
| 27 | + |
| 28 | +This is the same axis as the [Task Delegation Model](../CLAUDE.md) tiers |
| 29 | +(AUTOMATE → ARCHITECT): route effort — and spend — by complexity and value. |
| 30 | + |
| 31 | +## The cost model |
| 32 | + |
| 33 | +Two distinct meters run: |
| 34 | + |
| 35 | +1. **API spend** (third-party + Anthropic API) — billed per token via |
| 36 | + `MODEL_PRICING`. Cost-aware routing already optimizes this. |
| 37 | +2. **Max-plan agent effort** — the conductor spawns Claude Code agents on the |
| 38 | + Max plan. Today heavily discounted; ramping to full price. Optimize this by |
| 39 | + spending *fewer tokens per task* (tighter prompts, less context, fewer turns) |
| 40 | + and *fewer tasks on the expensive tier* (route, cache, batch). |
| 41 | + |
| 42 | +Current list prices (per 1M tokens, sourced 2026-05-26): |
| 43 | + |
| 44 | +| Model | Input | Output | Context | |
| 45 | +| ---------------- | ----- | ------ | ------- | |
| 46 | +| Opus 4.6/4.7/4.8 | $5 | $25 | 1M | |
| 47 | +| Sonnet 4.6 | $3 | $15 | 1M | |
| 48 | +| Haiku 4.5 | $1 | $5 | 200K | |
| 49 | + |
| 50 | +Output tokens cost **5×** input. Cache reads cost **~0.1×** input; batch is |
| 51 | +**50% off**. The cheapest token is the one you don't send. |
| 52 | + |
| 53 | +## Where spend happens (inventory) |
| 54 | + |
| 55 | +| Surface | File(s) | Dominant cost | Lever | |
| 56 | +| --- | --- | --- | --- | |
| 57 | +| Conductor agent runs | `~/.stackmemory/conductor/prompt-template.md` | Output tokens, turn count | Prompt diet, GEPA, effort | |
| 58 | +| Context rehydration | `src/core/context/`, `src/core/digest/` | Input tokens | Budget caps, compression | |
| 59 | +| Ralph swarm iterations | `src/integrations/ralph/context/context-budget-manager.ts` | Input tokens/iteration | `maxTokens`, compression | |
| 60 | +| LLM retrieval | `src/core/retrieval/llm-*.ts` | Input + output | Cheap-model routing | |
| 61 | +| Hook overhead | `src/hooks/` (dedup, prewarm, script-suggest) | Duplicate reads, bad tool choice | Already enabled — keep on | |
| 62 | +| MCP tool surface | `src/integrations/mcp/tool-definitions.ts` | Input tokens (schemas in context) | Tool search / trim | |
| 63 | + |
| 64 | +## Codified practices (ranked by impact) |
| 65 | + |
| 66 | +### 1. Route by complexity — don't pay Opus for lint fixes |
| 67 | +`getOptimalProvider()` / `scoreComplexity()` already route low-complexity tasks |
| 68 | +to cheap providers and high-complexity to Anthropic, gated by the `multiProvider` |
| 69 | +feature flag. **Keep `multiProvider` enabled.** The sensitive-content guard |
| 70 | +(`detectSensitiveContent`) forces Anthropic for secrets/PII — never weaken it to |
| 71 | +save money. |
| 72 | + |
| 73 | +- Default the conductor's *simple* tiers (AUTOMATE/STANDARD) to Sonnet, reserve |
| 74 | + Opus for CAREFUL/ARCHITECT. Opus↔Sonnet is a 1.7× swing; Opus↔Haiku is 5×. |
| 75 | +- Use **subagents on a cheaper model** for fan-out (Explore/grep/read) rather |
| 76 | + than switching the main loop's model — switching mid-session breaks the prompt |
| 77 | + cache (see #4). |
| 78 | + |
| 79 | +### 2. Tune `effort`, not model, first |
| 80 | +On Opus 4.6+/Sonnet 4.6, `output_config: {effort: ...}` is the cheapest quality |
| 81 | +dial. `low`/`medium` mean fewer, more-consolidated tool calls and less preamble. |
| 82 | +Default to **`high`** for coding, drop to `medium` for cost-sensitive routes, and |
| 83 | +reserve `max`/`xhigh` for correctness-critical work. Pair with adaptive thinking |
| 84 | +(`thinking: {type: "adaptive"}`) so the model self-limits reasoning. |
| 85 | + |
| 86 | +### 3. Spend fewer output tokens (5× input) |
| 87 | +- **Lower-effort, terser agents.** Add a silence-default to the conductor |
| 88 | + template: no narration between tool calls, one-or-two-sentence wrap-ups. |
| 89 | +- **Don't lowball `max_tokens`** — truncation forces a full re-run. Set a real |
| 90 | + ceiling, then let `effort`/`task_budget` moderate actual usage. |
| 91 | +- Use **Task Budgets** (`task_budget`, beta) for long agentic loops so the model |
| 92 | + sees a countdown and wraps up gracefully instead of being hard-truncated. |
| 93 | + |
| 94 | +### 4. Prompt caching — keep the prefix frozen |
| 95 | +Cache reads are ~0.1× input. The entire win depends on a **byte-stable prefix** |
| 96 | +(`tools` → `system` → `messages`): |
| 97 | +- No `Date.now()`, UUIDs, or per-session IDs in the system prompt — inject |
| 98 | + volatile context later in `messages`. |
| 99 | +- Don't reorder/add tools or switch models mid-session (full cache rebuild). |
| 100 | +- Verify with `usage.cache_read_input_tokens`; zero across repeats = a silent |
| 101 | + invalidator. See the audit table in the `claude-api` skill (`prompt-caching`). |
| 102 | +- Pre-warm only when first-request latency is user-visible and traffic is bursty. |
| 103 | + |
| 104 | +### 5. Cap context aggressively |
| 105 | +`ContextBudgetManager` (Ralph) already truncates, compresses, and |
| 106 | +priority-weights context with a `DEFAULT_MAX_TOKENS` budget. As prices ramp: |
| 107 | +- Lower per-iteration `maxTokens` budgets; keep `compressionEnabled` on. |
| 108 | +- Prefer digests/summaries over raw frame dumps for rehydration. |
| 109 | +- Trim the MCP tool surface or adopt tool-search so 56 tool schemas aren't all |
| 110 | + resident in context. |
| 111 | + |
| 112 | +### 6. Batch the non-interactive work |
| 113 | +`AnthropicBatchClient` runs at **50% off**. Anything not latency-sensitive — |
| 114 | +backfills, bulk enrichment, digest regeneration, eval sweeps — belongs in a |
| 115 | +batch, not a live request. |
| 116 | + |
| 117 | +### 7. Let the hooks do their job |
| 118 | +The token-optimization hooks (#14) already save ~22% (324K tokens on the |
| 119 | +benchmark): `dedup-reads` (escalates to `[STOP]` at 5+ duplicate reads), |
| 120 | +`desire-path-hook` (auto-routes Bash→Glob/Read/Grep), `prewarm-tools`, |
| 121 | +`script-suggest`. Don't disable them; extend them when new waste patterns show up |
| 122 | +in `scripts/benchmark-hooks.ts`. |
| 123 | + |
| 124 | +### 8. Close the learning loop |
| 125 | +`conductor learn --evolve` (GEPA) mutates the prompt template from failure data, |
| 126 | +and `stackmemory optimize traces` surfaces repeated, wasteful patterns from |
| 127 | +`traces.db`. Run these regularly — a shorter, higher-success prompt is a |
| 128 | +permanent per-run discount that compounds as prices rise. |
| 129 | + |
| 130 | +## 3-month phased playbook |
| 131 | + |
| 132 | +The ramp is roughly: **month 0** ≈ 20% of list, **month 1.5** ≈ 60%, **month 3+** |
| 133 | += full price. Escalate effort to match. |
| 134 | + |
| 135 | +**Phase 1 — now (≈80% off): instrument & default-good.** |
| 136 | +- Confirm `multiProvider` on; verify cost-aware routing decisions in traces. |
| 137 | +- Land the terser conductor template + `effort` defaults. |
| 138 | +- Add cost-per-run to trace stats so the ramp is visible. Establish a baseline |
| 139 | + tokens/task number to measure against. |
| 140 | + |
| 141 | +**Phase 2 — ~month 1–2 (≈40–70%): squeeze.** |
| 142 | +- Tighten `ContextBudgetManager` budgets; expand prompt-caching coverage and |
| 143 | + verify hit rates. |
| 144 | +- Move all non-interactive workloads to the Batches API. |
| 145 | +- Run a GEPA pass; adopt the winning template. |
| 146 | + |
| 147 | +**Phase 3 — ~month 3 (full price): enforce.** |
| 148 | +- Treat budgets as hard limits, not hints. Alert when a run exceeds its |
| 149 | + `effectiveCost` budget. |
| 150 | +- Down-tier aggressively: Opus only for ARCHITECT/CAREFUL; Sonnet default; |
| 151 | + Haiku/cheap providers for AUTOMATE. |
| 152 | +- Re-baseline `count_tokens` against current models (token counting shifts |
| 153 | + between model versions — don't apply a blanket multiplier). |
| 154 | + |
| 155 | +## Guardrails (don't optimize these away) |
| 156 | + |
| 157 | +- **Security routing** — the sensitive-content guard must keep forcing Anthropic |
| 158 | + for secrets/PII regardless of cost. |
| 159 | +- **Correctness tiers** — CAREFUL/ARCHITECT work stays on the capable model; a |
| 160 | + cheap wrong answer that needs a re-run costs more than one right answer. |
| 161 | +- **No silent truncation** — cap context deliberately via the budget manager; |
| 162 | + never truncate inputs blindly. |
| 163 | + |
| 164 | +## Quick reference |
| 165 | + |
| 166 | +```bash |
| 167 | +stackmemory conductor learn --evolve # mutate prompt template from failures |
| 168 | +stackmemory optimize traces # find repeated wasteful patterns |
| 169 | +node scripts/benchmark-hooks.ts # measure hook token savings |
| 170 | +stackmemory conductor trace-stats # aggregate token usage |
| 171 | +``` |
| 172 | + |
| 173 | +```ts |
| 174 | +import { |
| 175 | + effectiveSpendMultiplier, |
| 176 | + effectiveCost, |
| 177 | +} from './core/models/provider-pricing.js'; |
| 178 | + |
| 179 | +effectiveSpendMultiplier(); // today's cost factor along the ramp (0.2 → 1.0) |
| 180 | +effectiveCost('anthropic', 'claude-opus-4-8', inTok, outTok); // ramp-adjusted cost |
| 181 | +``` |
0 commit comments