jjang-ai / vmlx

vMLX - JANGTQ Uber Compressed MLX Models - L2 disk cache (survives restart) + L1 paged cache (very fast TTFT) + hybrid SSM scheduler + continuous batching, and more.

macbook persistent-memory mlx openai-api llm lmstudio anthropic-api mcp-server kvcache-optimization kvcache-compression openclaw kvcache-reuse openclaw-agent prefix-cache mlxllm mlxstudio vmlx omlx omlx-alternative

Updated Jul 19, 2026
Python

huiliyi37 / Tianshu-Tui

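Tianshu (天枢) is a full-featured, high-performance terminal runtime (TUI) for coding agents. Rather than treating the large model as a mere "tool", as conventional AI coding assistants do, it is built on a Cognitive Virtual Machine (CVM), a self-awareness layer, and pheromone-style (stigmergy) self-decaying memory, so that the AI acts as a "development partner" with independent judgment and cognitive safeguards. It also includes prefix-cache engineering optimizations for DeepSeek V4 (measured steady-state hit rate of 95–99% in long sessions).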

typescript terminal tui ai-agent deepseek context-management coding-agent prefix-cache

Updated Jul 19, 2026
TypeScript

Venkat2811 / wombatkv

Object-storage-native KV cache for LLM inference & RL. Cross-restart, cross-conversation, cross-engine via shared S3 bucket.

caching machine-learning metal amd s3 inference pytorch nvidia object-storage ds4 kv-cache llm vllm sglang prefix-cache

Updated Jul 6, 2026
Rust

jianzhichun / permafrost

Freeze Claude Code's prompt prefix so DeepSeek's automatic cache always hits — alignment proxy + coalescing + keepalive, installable as a CC plugin. Measured 64% cheaper on real Claude Code traffic.

proxy cost-optimization cache-optimization llm deepseek claude-code claude-code-plugin prompt-cache prefix-cache

Updated Jun 23, 2026
Python
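
Provider-side prefix caches of the kind these projects target generally reuse only the leading part of a prompt that is identical to an earlier request, so anything volatile near the start of the prompt defeats reuse. The sketch below illustrates that idea only; it is not permafrost's implementation. The function names are hypothetical, and whitespace splitting stands in for a real tokenizer.

```python
from typing import Sequence


def shared_prefix_len(a: Sequence[str], b: Sequence[str]) -> int:
    """Return how many leading items are identical in both sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reusable_fraction(prev: str, new: str) -> float:
    """Fraction of the new prompt's tokens covered by the shared prefix.

    Assumption: whitespace splitting approximates tokenization.
    """
    prev_toks, new_toks = prev.split(), new.split()
    if not new_toks:
        return 0.0
    return shared_prefix_len(prev_toks, new_toks) / len(new_toks)


system = "You are a coding assistant. Follow the repo conventions."

# Volatile content first: the timestamp differs on every request,
# so the two prompts diverge at the very first token.
drifting_a = f"time=12:00:01 {system} user: fix the bug"
drifting_b = f"time=12:00:02 {system} user: fix the bug"

# Stable content first, volatile content last: only the final token differs.
frozen_a = f"{system} user: fix the bug time=12:00:01"
frozen_b = f"{system} user: fix the bug time=12:00:02"

drift_ratio = reusable_fraction(drifting_a, drifting_b)   # 0.0
frozen_ratio = reusable_fraction(frozen_a, frozen_b)      # 13 of 14 tokens
print(f"drifting: {drift_ratio:.2f}, frozen: {frozen_ratio:.2f}")
```

Moving the timestamp to the end of the prompt is the whole difference between the two cases: the content is the same, but the shared prefix goes from zero tokens to all but the last.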

tzz123-hub / DevWhale

DevWhale: an AI-driven desktop development workbench, closely tailored to DeepSeek V4 with targeted cache optimizations. Built on Electron + React + TypeScript, with streaming agent chat, a Monaco editor, an xterm.js terminal, multimodal input covering 60+ file formats, and multi-model switching.

electron monaco-editor deepseek coding-agents tool-calling deepseek-api prefix-cache deepseek-v4-pro

Updated Jun 14, 2026
TypeScript

qujing226 / kvtide

KVTide is a Kubernetes-native LLM serving system exploring cache-aware scheduling and proactive peer-to-peer KV mobility.

scheduler distributed-computing inference golang-server mlsys ai-infra dynamic-batching llm-serving llm-inference connectrpc prefix-cache prefill-decode

Updated Jul 18, 2026
Go

weksbwrx62862 / deepseek-cache-optimizer

DeepSeek Cache Optimizer v1.1: Reasonix four pillars + semantic compression (hit rate +30%)

mimo cache-optimization ai-agent cost-reduction deepseek hermes-agent prefix-cache hermes-plugin

Updated Jul 19, 2026
Python

davccavalcante / racs

RACS (Remote Agent Context Store): prefix-cache management for production agents. Stability-aware prompt planning, provider-faithful cache directives for Anthropic, OpenAI, Gemini, Bedrock and more, TTL keep-warm scheduling, prefix-drift detection, and hit-ratio and savings analytics. Zero dependencies, TypeScript, edge-ready.

Updated Jul 6, 2026
TypeScript

davccavalcante / treasury

Cost governance for Massive Intelligence (IM) agent orchestration: hard per-request, per-task, and per-day USD budgets with depletion events, pre-flight cost forecasting, prefix-cache break-even planning across sixteen provider profiles, and cheapest-first cascade routing. Zero dependencies, node-free core, deterministic, TypeScript-first.

Updated Jul 1, 2026
HTML

JohnScheuer / prefix-cache-sim

Event-driven simulator for prefix KV-cache eviction policies in LLM serving systems

cmake simulator radix-tree cpp20 kv-cache llm vllm cache-eviction sglang prefix-cache

Updated Jul 8, 2026
Python

SuperMarioYL / cachepin

cachepin

go proxy cache kv-cache llm token-cost coding-agent prefix-cache ai-radar

Updated Jul 17, 2026
Go

JohnScheuer / llm-serving-sim

End-to-end LLM serving simulator integrating scheduling, prefix caching, tensor allocation, and KV-cache management. 168-run sweep (72 baseline + 96 pressure). Key finding: ChunkedPrefill + LFU cache achieves 41% lower TTFT p95 and 94% prefix hit rate, but hits OOM first under memory pressure.

simulation gpu scheduler inference memory-management systems-programming serving cpp20 kv-cache llm continuous-batching prefix-cache

Updated Jul 10, 2026
C++

SuperMarioYL / cacheguard

CacheGuard (缓存卫士, "Cache Guard"): a drop-in proxy that keeps DeepSeek's server-side prefix cache stable in front of any coding agent, so cache-hit pricing never silently breaks.

go cli moe llm-proxy deepseek token-cost coding-agent prefix-cache ai-radar

Updated Jun 22, 2026
Go

SuperMarioYL / dscache

dscache

python agent cli deepseek llm-cost prefix-cache token-accounting ai-radar

Updated Jul 17, 2026
Python

armanas / BCR-memory-2

Correctness-fixed Rust/PyO3 flat-array DFA prefix cache — rewrite of BCR-memory v1 with regression tests for four bugs and an SGLang/vLLM head-to-head harness.

rust prefix-trie pyo3 kv-cache llm vllm sglang prefix-cache

Updated Apr 17, 2026
Python

fengrunda / hermes-deepseek-cache

Hermes plugin: DeepSeek prefix-cache wire shaping (Reasonix-inspired)

plugin hermes llm deepseek prefix-cache

Updated Jul 10, 2026
Python

JohnScheuer / prefix-cache-real

Measures real prefix cache costs on GPU across 3 experimental versions. Key findings: 2.41x speedup with prefix=512; multi-turn speedup grows to 2.06x over 10 turns; batch sharing breakeven at n=2 (prefix=512) vs n=12 (prefix=128); LFU cache with Zipf alpha=2.0 achieves 82% hit rate with only 4 cache slots.

python gpu inference pytorch transformer profiling serving multi-turn kv-cache llm prefix-cache rtx-2070

Updated Jul 13, 2026
Python
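
For readers unfamiliar with the LFU policy mentioned in the description above, here is a minimal sketch of a fixed-slot cache that evicts the least frequently used prefix. It is an illustration under stated assumptions, not code from the repository: the class name, the prefix IDs, and the request trace are hypothetical, and the trace is a tiny hand-written one rather than a Zipf workload, so it makes no attempt to reproduce the reported hit rate.

```python
from typing import Dict, Hashable


class LFUPrefixCache:
    """Fixed-slot cache that evicts the least frequently used prefix."""

    def __init__(self, slots: int) -> None:
        if slots < 1:
            raise ValueError("slots must be at least 1")
        self.slots = slots
        # Access counts for resident prefixes only; evicted entries lose their count.
        self.freq: Dict[Hashable, int] = {}
        self.hits = 0
        self.misses = 0

    def access(self, prefix_id: Hashable) -> bool:
        """Record one request for prefix_id; return True on a cache hit."""
        if prefix_id in self.freq:
            self.freq[prefix_id] += 1
            self.hits += 1
            return True
        self.misses += 1
        if len(self.freq) >= self.slots:
            # Evict the resident prefix with the lowest count
            # (ties go to the one inserted earliest).
            victim = min(self.freq, key=self.freq.get)
            del self.freq[victim]
        self.freq[prefix_id] = 1
        return False

    def hit_rate(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


# Hypothetical trace: prefix "a" is popular, "b" and "c" are occasional.
cache = LFUPrefixCache(slots=2)
for prefix in ["a", "a", "a", "b", "c", "a", "b"]:
    cache.access(prefix)

print(f"hits={cache.hits} misses={cache.misses} hit_rate={cache.hit_rate():.2f}")
```

On this trace the popular prefix "a" stays resident throughout, while "b" and "c" evict each other from the second slot, which is the behavior that makes LFU attractive when request popularity is heavily skewed.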

vltech55 / bastion-gateway

Production LLM gateway: OpenAI-compatible API in front of OpenAI, Anthropic, and Bedrock. Ordered-fallback routing with per-provider circuit breakers, Redis prefix cache, Prometheus + Grafana, Kubernetes.

kubernetes grafana routing fallback prometheus bedrock circuit-breaker observability openai-proxy openai-compatible-api llm-gateway anthropic-proxy prefix-cache

Updated Jun 9, 2026
Python

wzqhbustb / kv_cache

The cache of the LLM's memory

rust caching unix-domain-socket inference radix-tree shared-memory tiered-storage kv-cache llm prefix-cache

Updated Jul 2, 2026
Rust

Neyyeby / edge-llm-deploy-bench

Block-aligned resident prefix cache for llama.cpp on Jetson, with CUDA validation, multi-model long-prefix benchmarks, and cache stress testing.

jetson kv-cache llama-cpp prefix-cache

Updated Jul 18, 2026
C++

prefix-cache

Here are 20 public repositories matching this topic...
