NagaYu/scout-diff-cache

scout-diff-cache

Diff-aware context-caching proxy for Llama 4 Scout (10M context).

A local, OpenAI-compatible proxy that makes a 10M-token context window cheaper and faster to use. It watches your git repository, splits the codebase into an immutable snapshot (stable per commit) and the recent working diff (volatile), and reassembles every prompt so the big stable part sits at the front, exactly where an inference server's prefix / KV cache can reuse it.

client (OpenAI SDK)
      │  POST /v1/chat/completions
      ▼
┌─────────────────────────────────────────────────────────┐
│  scout-diff-cache                                         │
│                                                           │
│   git HEAD ──▶ snapshot (cached per commit)  ─┐           │
│   git diff ──▶ working diff (recomputed)      ├─▶ rebuilt │
│   client messages ────────────────────────────┘   prompt │
└─────────────────────────────────────────────────────────┘
      │  POST {TARGET_API_URL}/chat/completions
      ▼
upstream model server (vLLM / llama.cpp / TGI …)

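Because the snapshot block is byte-identical for a given commit, an upstream server with prefix caching enabled can match it as a cached prefix and only has to process the small diff plus the live conversation. On a codebase approaching 10M tokens, that avoids re-processing the bulk of the prompt on every request, which is where the latency and cost savings come from.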

Why prefix ordering matters

Prefix caches match the longest identical leading span of tokens. The proxy therefore emits messages in this order:

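| #  | Message                             | Volatility             | Cacheable? |
|----|-------------------------------------|------------------------|------------|
| 1  | system: codebase snapshot @<commit> | changes only on commit | ✅ yes     |
| 2  | system: working diff                | changes on every edit  | ❌ no      |
| 3… | original client messages            | the live conversation  | ❌ no      |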

Put the diff before the snapshot and every edit to the working tree would change the leading tokens, so the snapshot could never be matched as a cached prefix. Order is the whole trick.
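The ordering above can be illustrated with a minimal sketch. The names `buildPrompt` and `sharedPrefixLength` are hypothetical and are not taken from the project's source; the comparison works on whole messages as a stand-in for the token-level matching a real prefix cache performs.

```typescript
type Role = "system" | "user" | "assistant";

interface ChatMessage {
  role: Role;
  content: string;
}

// Hypothetical helper: stable snapshot first, volatile diff second, conversation last.
function buildPrompt(
  commit: string,
  snapshot: string,
  workingDiff: string,
  clientMessages: ChatMessage[],
): ChatMessage[] {
  return [
    { role: "system", content: `codebase snapshot @${commit}\n${snapshot}` },
    { role: "system", content: `working diff\n${workingDiff}` },
    ...clientMessages,
  ];
}

// Counts how many leading messages two prompts have in common.
function sharedPrefixLength(a: ChatMessage[], b: ChatMessage[]): number {
  let i = 0;
  while (
    i < a.length &&
    i < b.length &&
    a[i].role === b[i].role &&
    a[i].content === b[i].content
  ) {
    i++;
  }
  return i;
}

const question: ChatMessage[] = [
  { role: "user", content: "Where is the cache invalidated?" },
];
const beforeEdit = buildPrompt("abc123", "src/a.ts: ...", "", question);
const afterEdit = buildPrompt("abc123", "src/a.ts: ...", "+ new line", question);

// The snapshot message still matches after an edit; only the diff onward differs.
console.log(sharedPrefixLength(beforeEdit, afterEdit)); // 1
```

With the diff placed first instead, the shared prefix in this example would drop to zero messages after any edit.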

Quick start

Requirements: Node.js ≥ 20, a running OpenAI-compatible model server, and a git repository to watch.

npm install
cp .env.example .env          # then edit TARGET_API_URL / GIT_REPO_PATH
npm run dev                   # hot-reload dev server (tsx)
# or
npm run build && npm start    # production

Point any OpenAI client at the proxy:

curl http://127.0.0.1:8787/v1/chat/completions \
  -H 'content-type: application/json' \
  -d '{
    "model": "llama-4-scout",
    "stream": true,
    "messages": [{ "role": "user", "content": "Where is the cache invalidated?" }]
  }'

The proxy injects the snapshot + diff automatically — your client only sends the actual question.
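For a TypeScript client, the same request might look like the sketch below. It assumes the default host and port from the configuration table, uses a non-streaming request for simplicity, and relies on the global `fetch` available in Node.js 18 and later. The helper names `buildChatRequest` and `ask` are illustrative only.

```typescript
interface ChatRequestBody {
  model: string;
  stream: boolean;
  messages: { role: "system" | "user" | "assistant"; content: string }[];
}

interface BuiltRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Builds the same request as the curl example, but non-streaming.
function buildChatRequest(question: string): BuiltRequest {
  const body: ChatRequestBody = {
    model: "llama-4-scout",
    stream: false,
    messages: [{ role: "user", content: question }],
  };
  return {
    url: "http://127.0.0.1:8787/v1/chat/completions",
    init: {
      method: "POST",
      headers: { "content-type": "application/json" },
      body: JSON.stringify(body),
    },
  };
}

// Sends the question through the proxy and returns the first choice's text.
async function ask(question: string): Promise<string> {
  const { url, init } = buildChatRequest(question);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`proxy returned HTTP ${res.status}`);
  }
  const data = (await res.json()) as {
    choices: { message: { content: string } }[];
  };
  return data.choices[0].message.content;
}
```

With the proxy running, `ask("Where is the cache invalidated?")` resolves to the model's answer. The client never sends the snapshot or the diff itself.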

Docker

A multi-stage Dockerfile (git included for simple-git, runs as a non-root user, with a /health healthcheck) and an example docker-compose.yml are provided.

Prebuilt image (GHCR)

Every push to main publishes a multi-tag image to the GitHub Container Registry via CI:

docker pull ghcr.io/nagayu/scout-diff-cache:latest
docker run --init -p 8787:8787 \
  -e TARGET_API_URL=http://host.docker.internal:8000/v1 \
  -e GIT_REPO_PATH=/repo \
  -v "$PWD:/repo:ro" \
  ghcr.io/nagayu/scout-diff-cache:latest

Build locally

docker compose up --build

Mount the repository you want injected at /repo and point TARGET_API_URL at your model server (use host.docker.internal to reach the host). Or run the image directly:

docker build -t scout-diff-cache .
docker run --init -p 8787:8787 \
  -e TARGET_API_URL=http://host.docker.internal:8000/v1 \
  -e GIT_REPO_PATH=/repo \
  -v "$PWD:/repo:ro" \
  scout-diff-cache

⚠️ Security & privacy

This proxy reads every tracked file in the repository and sends it to the upstream model server. Before pointing it at a repo, understand the implications:

  • Secrets in the repo leak. If your repository tracks a .env, private keys, credentials, or customer data, those bytes are embedded in the prompt and transmitted upstream. Add such paths to EXCLUDE_PATH_PATTERNS (it already excludes lockfiles and node_modules/), and prefer a model server you control. The proxy honors your exclude list but does not scan for secrets — that is your responsibility.
  • No authentication. The proxy itself is unauthenticated and binds to 127.0.0.1 by default. Do not bind to 0.0.0.0 on an untrusted network without putting an authenticating reverse proxy in front of it.
  • Trust the upstream. TARGET_API_KEY and your entire codebase are sent to TARGET_API_URL. Only use an endpoint you trust.

Known limitations

  • Snapshot reads the working tree, not HEAD. Clean files match HEAD exactly; uncommitted edits to a file appear in both the snapshot and the diff. This is a deliberate trade-off for speed and simplicity.
  • git diff HEAD excludes untracked file contents. New (untracked) files are listed under "changed files" but their contents are not in the patch until staged. Run git add to include them.
  • No timeout once a stream has started. UPSTREAM_TIMEOUT_MS bounds the time-to-first-byte; a stream that stalls mid-response relies on the client to disconnect (which the proxy detects and propagates upstream).
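The last limitation follows naturally from how a time-to-first-byte timeout is usually built. The sketch below is one common pattern, not the project's actual code: the timer is cleared as soon as the upstream call resolves (for `fetch`, when response headers arrive), so nothing bounds the body stream afterwards. `withFirstByteTimeout` is a hypothetical name.

```typescript
// Aborts the call if `start` has not resolved within `ms`.
// Once it resolves, the timer is cleared and the stream is no longer bounded.
async function withFirstByteTimeout<T>(
  start: (signal: AbortSignal) => Promise<T>,
  ms: number,
): Promise<T> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), ms);
  try {
    return await start(controller.signal);
  } finally {
    // Headers arrived or the call failed: either way, stop the clock.
    clearTimeout(timer);
  }
}

// Example use with fetch (assumes an upstream at this URL):
// const res = await withFirstByteTimeout(
//   (signal) => fetch("http://127.0.0.1:8000/v1/chat/completions", { method: "POST", signal }),
//   30_000,
// );
```

Bounding a stalled stream as well would need a second, per-chunk idle timer, which this sketch deliberately leaves out.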

Configuration

All configuration is via environment variables (validated at startup — the process refuses to boot on a bad value). See .env.example for the full list with defaults. Key ones:

| Variable              | Default                  | Purpose                                 |
|-----------------------|--------------------------|-----------------------------------------|
| PORT / HOST           | 8787 / 127.0.0.1         | where the proxy listens                 |
| TARGET_API_URL        | http://127.0.0.1:8000/v1 | upstream OpenAI-compatible base URL     |
| TARGET_API_KEY        |                          | optional bearer token forwarded upstream |
| GIT_REPO_PATH         | cwd                      | repository to inject as context         |
| CACHE_TTL_MS          | 300000                   | snapshot freshness window               |
| MAX_FILE_BYTES        | 524288                   | per-file embed ceiling                  |
| MAX_SNAPSHOT_BYTES    | 67108864                 | total embed memory guard                |
| INJECT_CONTEXT        | true                     | set false for transparent pass-through  |
| EXCLUDE_PATH_PATTERNS | node_modules/,dist/,…    | paths to omit from the snapshot         |
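Fail-fast validation of this kind can be sketched without dependencies as shown below. The project itself uses zod (see the project layout), so this is only an illustration of the behavior. The accepted boolean spellings and the error messages are assumptions, and only a subset of the variables is covered.

```typescript
interface ProxyConfig {
  port: number;
  host: string;
  targetApiUrl: string;
  cacheTtlMs: number;
  injectContext: boolean;
}

type Env = Record<string, string | undefined>;

function parsePositiveInt(name: string, raw: string | undefined, fallback: number): number {
  if (raw === undefined || raw === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n <= 0) {
    throw new Error(`${name} must be a positive integer, got "${raw}"`);
  }
  return n;
}

// Assumption: only the literal strings "true" and "false" are accepted.
function parseBoolean(name: string, raw: string | undefined, fallback: boolean): boolean {
  if (raw === undefined || raw === "") return fallback;
  if (raw === "true") return true;
  if (raw === "false") return false;
  throw new Error(`${name} must be "true" or "false", got "${raw}"`);
}

// Throws on the first bad value so the process can refuse to boot.
function loadConfig(env: Env): ProxyConfig {
  const targetApiUrl = env.TARGET_API_URL ?? "http://127.0.0.1:8000/v1";
  new URL(targetApiUrl); // throws TypeError if the URL is malformed
  return {
    port: parsePositiveInt("PORT", env.PORT, 8787),
    host: env.HOST ?? "127.0.0.1",
    targetApiUrl,
    cacheTtlMs: parsePositiveInt("CACHE_TTL_MS", env.CACHE_TTL_MS, 300000),
    injectContext: parseBoolean("INJECT_CONTEXT", env.INJECT_CONTEXT, true),
  };
}
```

Calling `loadConfig(process.env)` once at startup, before the server starts listening, gives the "refuses to boot on a bad value" behavior described above.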

API

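| Method & path             | Description                                  |
|---------------------------|----------------------------------------------|
| POST /v1/chat/completions | OpenAI-compatible; streaming + non-streaming |
| GET /v1/models            | advertises llama-4-scout                     |
| GET /health               | liveness probe                               |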

Errors use the OpenAI error envelope ({ "error": { message, type, code } }) with precise status codes: 400 invalid request, 500 git failure, 502 upstream error, 503 upstream unreachable, 504 upstream timeout.
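The status mapping above can be written down as a small lookup. In this sketch the status codes come from the list above, while the `ErrorKind` names and the strings placed in `type` and `code` are hypothetical; the project's actual values may differ.

```typescript
type ErrorKind =
  | "invalid_request"
  | "git_failure"
  | "upstream_error"
  | "upstream_unreachable"
  | "upstream_timeout";

// Status codes as documented; the kind names are illustrative.
const STATUS_BY_KIND: Record<ErrorKind, number> = {
  invalid_request: 400,
  git_failure: 500,
  upstream_error: 502,
  upstream_unreachable: 503,
  upstream_timeout: 504,
};

interface ErrorEnvelope {
  error: { message: string; type: string; code: string };
}

// Builds the HTTP status plus an OpenAI-style error body for a failure kind.
function toErrorResponse(
  kind: ErrorKind,
  message: string,
): { status: number; body: ErrorEnvelope } {
  return {
    status: STATUS_BY_KIND[kind],
    body: { error: { message, type: kind, code: kind } },
  };
}

console.log(JSON.stringify(toErrorResponse("upstream_timeout", "no response in time")));
```

Keeping the mapping in one table means a client can branch on the HTTP status alone and still read a human-friendly `message` from the body.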

Project layout

src/
  index.ts              entry point + graceful shutdown
  server.ts             Fastify app, routes, SSE streaming, error handler
  config/index.ts       env loading & validation (zod)
  types/index.ts        all shared type definitions
  utils/
    git.ts              repo inspection, snapshot & diff extraction
    logger.ts           pino structured logging
    errors.ts           typed AppError hierarchy
  services/
    cache.ts            single-slot snapshot cache (commit + TTL)
    context.ts          orchestrates git + cache (build coalescing)
    promptBuilder.ts    message reconstruction
    proxy.ts            upstream forwarding (timeout/abort aware)
    validation.ts       request validation (zod)
tests/                  vitest unit tests
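To make the notes on cache.ts and context.ts concrete, here is a rough sketch of a single-slot cache keyed by commit with a TTL, plus build coalescing so concurrent requests for the same commit share one build. Class and method names are hypothetical and the real modules may be organized quite differently.

```typescript
interface Snapshot {
  commit: string;
  text: string;
}

class SnapshotCache {
  private slot: { snapshot: Snapshot; storedAt: number } | null = null;
  private inFlight: Promise<Snapshot> | null = null;
  private inFlightCommit: string | null = null;
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = Date.now) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Returns the cached snapshot only if the commit matches and the TTL has not elapsed.
  get(commit: string): Snapshot | null {
    const s = this.slot;
    if (s === null) return null;
    const fresh = this.now() - s.storedAt < this.ttlMs;
    return s.snapshot.commit === commit && fresh ? s.snapshot : null;
  }

  // Coalesces concurrent builds: callers for the same commit share one promise.
  async getOrBuild(commit: string, build: () => Promise<string>): Promise<Snapshot> {
    const cached = this.get(commit);
    if (cached !== null) return cached;
    if (this.inFlight !== null && this.inFlightCommit === commit) {
      return this.inFlight;
    }
    const pending: Promise<Snapshot> = build()
      .then((text) => {
        const snapshot: Snapshot = { commit, text };
        this.slot = { snapshot, storedAt: this.now() }; // overwrite the single slot
        return snapshot;
      })
      .finally(() => {
        if (this.inFlight === pending) {
          this.inFlight = null;
          this.inFlightCommit = null;
        }
      });
    this.inFlight = pending;
    this.inFlightCommit = commit;
    return pending;
  }
}
```

A single slot is enough here because only the current HEAD is ever served; a new commit simply replaces the previous snapshot.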

Development

npm run typecheck   # strict tsc, no emit
npm run test        # vitest
npm run lint        # eslint

License

MIT
