Diff-aware context-caching proxy for Llama 4 Scout (10M context).
A local, OpenAI-compatible proxy that turns a 10M-token context window into a cheap, fast one. It watches your git repository, splits the codebase into an immutable snapshot (stable per commit) and the recent working diff (volatile), and reassembles every prompt so the big stable part sits at the front — exactly where an inference server's prefix / KV cache can reuse it.
client (OpenAI SDK)
│ POST /v1/chat/completions
▼
┌─────────────────────────────────────────────────────────┐
│ scout-diff-cache │
│ │
│ git HEAD ──▶ snapshot (cached per commit) ─┐ │
│ git diff ──▶ working diff (recomputed) ├─▶ rebuilt │
│ client messages ────────────────────────────┘ prompt │
└─────────────────────────────────────────────────────────┘
│ POST {TARGET_API_URL}/chat/completions
▼
upstream model server (vLLM / llama.cpp / TGI …)
Because the snapshot block is byte-identical for a given commit, the upstream server matches it as a cached prefix and only pays to process the small diff plus the live conversation — large latency and cost savings on a 10M-token codebase.
Prefix caches match the longest identical leading span of tokens. The proxy therefore emits messages in this order:
| # | Message | Volatility | Cacheable? |
|---|---|---|---|
| 1 | system: codebase snapshot @<commit> | changes only on commit | ✅ yes |
| 2 | system: working diff | changes on every edit | ❌ no |
| 3… | original client messages | the live conversation | ❌ no |
Put the diff before the snapshot and you'd invalidate the cache on every keystroke. Order is the whole trick.
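To make the ordering concrete, here is a minimal sketch of the reassembly step. The names `buildPrompt` and `sharedPrefixLength` and the message labels are hypothetical, not the proxy's actual promptBuilder API. The comparison works on whole messages as a rough stand-in for the token-level matching a real prefix cache performs.

```typescript
// Hypothetical types and helpers for illustration only.
type Role = "system" | "user" | "assistant";
interface ChatMessage {
  role: Role;
  content: string;
}

// Stable snapshot first, volatile diff second, live conversation last.
function buildPrompt(
  commit: string,
  snapshot: string,
  diff: string,
  client: ChatMessage[],
): ChatMessage[] {
  return [
    { role: "system", content: `codebase snapshot @${commit}\n${snapshot}` },
    { role: "system", content: `working diff\n${diff}` },
    ...client,
  ];
}

// Counts the leading messages that are identical in both prompts.
// Real prefix caches compare tokens, not messages; this is an approximation.
function sharedPrefixLength(a: ChatMessage[], b: ChatMessage[]): number {
  let n = 0;
  while (n < a.length && n < b.length) {
    const x = a[n];
    const y = b[n];
    if (!x || !y || x.role !== y.role || x.content !== y.content) break;
    n++;
  }
  return n;
}

const question: ChatMessage[] = [
  { role: "user", content: "Where is the cache invalidated?" },
];
// Same commit, same snapshot text, but the working diff changed between requests.
const before = buildPrompt("abc123", "export const a = 1;", "diff v1", question);
const after = buildPrompt("abc123", "export const a = 1;", "diff v2", question);

console.log(sharedPrefixLength(before, after)); // 1: only the snapshot message is shared
```

If the diff were emitted first, the same comparison would return 0 after every edit, which is the cache invalidation the ordering avoids.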
Requirements: Node.js ≥ 20, a running OpenAI-compatible model server, and a git repository to watch.
npm install
cp .env.example .env # then edit TARGET_API_URL / GIT_REPO_PATH
npm run dev # hot-reload dev server (tsx)
# or
npm run build && npm start # production
Point any OpenAI client at the proxy:
curl http://127.0.0.1:8787/v1/chat/completions \
-H 'content-type: application/json' \
-d '{
"model": "llama-4-scout",
"stream": true,
"messages": [{ "role": "user", "content": "Where is the cache invalidated?" }]
}'
The proxy injects the snapshot + diff automatically; your client sends only the actual question.
A multi-stage Dockerfile (git included for simple-git, runs as
a non-root user, with a /health healthcheck) and an example
docker-compose.yml are provided.
Every push to main publishes a multi-tag image to the GitHub Container
Registry via CI:
docker pull ghcr.io/nagayu/scout-diff-cache:latest
docker run --init -p 8787:8787 \
-e TARGET_API_URL=http://host.docker.internal:8000/v1 \
-e GIT_REPO_PATH=/repo \
-v "$PWD:/repo:ro" \
ghcr.io/nagayu/scout-diff-cache:latest
docker compose up --build
Mount the repository you want injected at /repo and point TARGET_API_URL at your model server (use host.docker.internal to reach the host). Or run the image directly:
docker build -t scout-diff-cache .
docker run --init -p 8787:8787 \
-e TARGET_API_URL=http://host.docker.internal:8000/v1 \
-e GIT_REPO_PATH=/repo \
-v "$PWD:/repo:ro" \
scout-diff-cache
This proxy reads every tracked file in the repository and sends it to the upstream model server. Before pointing it at a repo, understand the implications:
- Secrets in the repo leak. If your repository tracks a .env, private keys, credentials, or customer data, those bytes are embedded in the prompt and transmitted upstream. Add such paths to EXCLUDE_PATH_PATTERNS (it already excludes lockfiles and node_modules/), and prefer a model server you control. The proxy honors your exclude list but does not scan for secrets; that is your responsibility.
- No authentication. The proxy itself is unauthenticated and binds to 127.0.0.1 by default. Do not bind to 0.0.0.0 on an untrusted network without putting an authenticating reverse proxy in front of it.
- Trust the upstream. TARGET_API_KEY and your entire codebase are sent to TARGET_API_URL. Only use an endpoint you trust.
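As an illustration of how an exclude list can keep sensitive paths out of the snapshot, here is a small sketch. The matching rules are assumptions, not the proxy's documented behavior: patterns are comma-separated, a pattern ending in `/` matches a single directory name at any depth, and any other pattern matches a file's full path or base name. The functions `parsePatterns` and `isExcluded` and the example `.env` and `secrets/` entries are hypothetical.

```typescript
// Assumed format: comma-separated patterns, whitespace ignored.
function parsePatterns(raw: string): string[] {
  return raw
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// Assumed semantics (the real matcher may use globs instead):
//  - "name/" excludes anything under a directory called "name"
//  - anything else excludes an exact path or base-name match
function isExcluded(path: string, patterns: string[]): boolean {
  const normalized = path.replace(/\\/g, "/");
  const segments = normalized.split("/");
  const baseName = segments[segments.length - 1] ?? "";
  const parents = segments.slice(0, -1);
  return patterns.some((p) => {
    if (p.endsWith("/")) {
      return parents.includes(p.slice(0, -1));
    }
    return normalized === p || baseName === p;
  });
}

const patterns = parsePatterns("node_modules/, dist/, .env, secrets/");
const tracked = [
  "src/index.ts",
  ".env",
  "secrets/prod.pem",
  "packages/a/node_modules/x/index.js",
];
const embedded = tracked.filter((f) => !isExcluded(f, patterns));

console.log(embedded); // only src/index.ts remains
```

A filter like this only removes paths you name. It does nothing about a credential pasted into an ordinary source file, which is why the exclude list is not a substitute for secret scanning.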
- Snapshot reads the working tree, not HEAD. Clean files match HEAD exactly; uncommitted edits to a file appear in both the snapshot and the diff. This is a deliberate trade-off for speed and simplicity.
- git diff HEAD excludes untracked file contents. New (untracked) files are listed under "changed files" but their contents are not in the patch until staged. Run git add to include them.
- No timeout once a stream has started. UPSTREAM_TIMEOUT_MS bounds the time-to-first-byte; a stream that stalls mid-response relies on the client to disconnect (which the proxy detects and propagates upstream).
All configuration is via environment variables (validated at startup — the
process refuses to boot on a bad value). See .env.example for
the full list with defaults. Key ones:
| Variable | Default | Purpose |
|---|---|---|
| PORT / HOST | 8787 / 127.0.0.1 | where the proxy listens |
| TARGET_API_URL | http://127.0.0.1:8000/v1 | upstream OpenAI-compatible base URL |
| TARGET_API_KEY | — | optional bearer token forwarded upstream |
| GIT_REPO_PATH | cwd | repository to inject as context |
| CACHE_TTL_MS | 300000 | snapshot freshness window |
| MAX_FILE_BYTES | 524288 | per-file embed ceiling |
| MAX_SNAPSHOT_BYTES | 67108864 | total embed memory guard |
| INJECT_CONTEXT | true | set false for transparent pass-through |
| EXCLUDE_PATH_PATTERNS | node_modules/,dist/,… | paths to omit from the snapshot |
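The fail-fast startup validation can be pictured with the sketch below. The project validates with zod; this version is hand-rolled so it runs without dependencies, and the `loadConfig` name, the numeric bounds, and the accepted boolean spellings are assumptions. Only the variable names and defaults come from the table.

```typescript
// Dependency-free stand-in for the zod-based validation in config/index.ts.
interface ProxyConfig {
  port: number;
  host: string;
  targetApiUrl: string;
  cacheTtlMs: number;
  injectContext: boolean;
}

type Env = Record<string, string | undefined>;

function intVar(env: Env, name: string, fallback: number, min: number, max: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const value = Number(raw);
  if (!Number.isInteger(value) || value < min || value > max) {
    throw new Error(`${name} must be an integer in [${min}, ${max}], got "${raw}"`);
  }
  return value;
}

function boolVar(env: Env, name: string, fallback: boolean): boolean {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  if (raw === "true") return true;
  if (raw === "false") return false;
  throw new Error(`${name} must be "true" or "false", got "${raw}"`);
}

// Throws on the first bad value so the process never starts half-configured.
function loadConfig(env: Env): ProxyConfig {
  const targetApiUrl = env["TARGET_API_URL"] ?? "http://127.0.0.1:8000/v1";
  if (!/^https?:\/\/\S+$/.test(targetApiUrl)) {
    throw new Error(`TARGET_API_URL must be an http(s) URL, got "${targetApiUrl}"`);
  }
  return {
    port: intVar(env, "PORT", 8787, 1, 65535),
    host: env["HOST"] ?? "127.0.0.1",
    targetApiUrl,
    cacheTtlMs: intVar(env, "CACHE_TTL_MS", 300000, 0, Number.MAX_SAFE_INTEGER),
    injectContext: boolVar(env, "INJECT_CONTEXT", true),
  };
}

const defaults = loadConfig({});
console.log(defaults.port, defaults.host); // 8787 127.0.0.1
```

In a real entry point you would pass `process.env` and let the thrown error terminate the process with a readable message.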
| Method & path | Description |
|---|---|
| POST /v1/chat/completions | OpenAI-compatible; streaming + non-streaming |
| GET /v1/models | advertises llama-4-scout |
| GET /health | liveness probe |
Errors use the OpenAI error envelope ({ "error": { message, type, code } })
with precise status codes: 400 invalid request, 500 git failure, 502
upstream error, 503 upstream unreachable, 504 upstream timeout.
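The mapping from failure kind to status code and envelope might look like the sketch below. The status codes are the ones listed above; the `ErrorKind` names, the `toErrorResponse` helper, and the values placed in `type` and `code` are hypothetical and may not match the proxy's AppError hierarchy.

```typescript
// Hypothetical failure kinds; one per documented status code.
type ErrorKind =
  | "invalid_request"
  | "git_failure"
  | "upstream_error"
  | "upstream_unreachable"
  | "upstream_timeout";

const STATUS_BY_KIND: Record<ErrorKind, number> = {
  invalid_request: 400,
  git_failure: 500,
  upstream_error: 502,
  upstream_unreachable: 503,
  upstream_timeout: 504,
};

// Shape of the OpenAI-style error envelope.
interface ErrorEnvelope {
  error: { message: string; type: string; code: string };
}

function toErrorResponse(
  kind: ErrorKind,
  message: string,
): { status: number; body: ErrorEnvelope } {
  const status = STATUS_BY_KIND[kind];
  // Assumed values for `type`; the real proxy may use different strings.
  const type = status === 400 ? "invalid_request_error" : "api_error";
  return { status, body: { error: { message, type, code: kind } } };
}

const timeout = toErrorResponse("upstream_timeout", "upstream did not respond in time");
console.log(timeout.status); // 504
console.log(JSON.stringify(timeout.body));
```

Keeping the table in one `Record` keyed by a union type means the compiler flags any failure kind that lacks a status code.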
src/
index.ts entry point + graceful shutdown
server.ts Fastify app, routes, SSE streaming, error handler
config/index.ts env loading & validation (zod)
types/index.ts all shared type definitions
utils/
git.ts repo inspection, snapshot & diff extraction
logger.ts pino structured logging
errors.ts typed AppError hierarchy
services/
cache.ts single-slot snapshot cache (commit + TTL)
context.ts orchestrates git + cache (build coalescing)
promptBuilder.ts message reconstruction
proxy.ts upstream forwarding (timeout/abort aware)
validation.ts request validation (zod)
tests/ vitest unit tests
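A single-slot snapshot cache keyed by commit and bounded by a TTL, as cache.ts is described above, could be sketched like this. The class name `SingleSlotCache`, its method names, and the injected clock are assumptions for illustration; the build coalescing that context.ts handles is left out.

```typescript
interface Snapshot {
  commit: string;
  text: string;
}

// Holds at most one snapshot. A lookup hits only when the commit matches
// and the entry is younger than the TTL (compare CACHE_TTL_MS).
class SingleSlotCache {
  private slot: { snapshot: Snapshot; storedAt: number } | undefined;
  private readonly ttlMs: number;
  private readonly now: () => number;

  // The clock is injectable so expiry can be exercised without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  get(commit: string): Snapshot | undefined {
    const slot = this.slot;
    if (!slot) return undefined;
    const fresh = this.now() - slot.storedAt < this.ttlMs;
    return slot.snapshot.commit === commit && fresh ? slot.snapshot : undefined;
  }

  // Storing a new snapshot replaces whatever was in the slot.
  set(snapshot: Snapshot): void {
    this.slot = { snapshot, storedAt: this.now() };
  }
}

let clock = 0;
const cache = new SingleSlotCache(300000, () => clock);
cache.set({ commit: "abc123", text: "snapshot body" });

const hit = cache.get("abc123"); // same commit, within TTL
const otherCommit = cache.get("def456"); // different commit: miss
clock = 300000;
const expired = cache.get("abc123"); // TTL elapsed: miss

console.log(hit !== undefined, otherCommit === undefined, expired === undefined);
```

One slot is enough here because only the current HEAD is ever requested; a new commit simply overwrites the previous entry.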
npm run typecheck # strict tsc, no emit
npm run test # vitest
npm run lint # eslint
MIT