All notable changes to Dendrite are documented here. Format follows Keep a Changelog.
- `PrefixCache` (`cache/prefix_cache.rs`) — cascade attention block reuse via `RadixTree` + `BlockPool`
  - `commit()` pins completed-sequence blocks; `lookup()` returns a `PrefixHit` with the matched token count
  - `evict_lru(n)` / `evict_older_than(duration)` — LRU and TTL eviction back to the pool
  - `PrefixCacheStats` — hit_rate, tokens_saved, eviction counts
  - Impact: a 4096-token shared system prompt across 64 requests → 262K fewer KV ops per step
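To make the commit/lookup flow concrete, here is a toy model of the idea. The real `PrefixCache` pins `BlockPool` blocks under a `RadixTree`; this sketch uses a plain map, and the method signatures are assumptions made for illustration:

```rust
use std::collections::HashMap;

/// Toy model of the commit/lookup flow. The real PrefixCache pins BlockPool
/// blocks under a RadixTree; this uses a plain map and assumed signatures.
struct ToyPrefixCache {
    committed: HashMap<Vec<u32>, Vec<usize>>, // prompt tokens -> pinned block ids
}

impl ToyPrefixCache {
    fn new() -> Self {
        Self { committed: HashMap::new() }
    }

    /// Pin the blocks of a completed sequence under its token prefix.
    fn commit(&mut self, tokens: Vec<u32>, blocks: Vec<usize>) {
        self.committed.insert(tokens, blocks);
    }

    /// Longest committed prefix matching `tokens` (the "hit" token count).
    fn lookup(&self, tokens: &[u32]) -> usize {
        self.committed
            .keys()
            .filter(|p| tokens.starts_with(p.as_slice()))
            .map(|p| p.len())
            .max()
            .unwrap_or(0)
    }
}

fn main() {
    let mut cache = ToyPrefixCache::new();
    cache.commit(vec![1, 2, 3, 4], vec![0]); // shared system prompt
    assert_eq!(cache.lookup(&[1, 2, 3, 4, 9]), 4); // 4 tokens reused
    // The impact figure: 4096 shared tokens x 64 requests = 262,144 KV ops.
    assert_eq!(4096 * 64, 262_144);
}
```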
- `CompressedKvCache` (`model/kv_cache.rs`) — drop-in replacement for `KvCache` with transparent compression
  - `CompressedLayerCache` / `CompressedKvCache` — same `append(k, v) → (k, v)` API
  - Pluggable via the `KvCompressor` trait (TurboQuant, PolarQuant, Identity)
- TurboQuant KV compression (`cache/compress.rs`)
  - `PolarQuantCompressor` — L2-normalize + fixed angular grid (2/4/8-bit), no per-block scale constants
  - `QjlCompressor` — deterministic Rademacher projection (seeded), stores sign bits, zero storage overhead
  - `TurboQuantCompressor` — two-stage: PolarQuant primary + QJL residual
  - `KvCompressor` trait + `IdentityCompressor` baseline
  - Measured: 3.05x compression vs FP16 at head_dim=128 (4-bit + 64-dim QJL)
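The core PolarQuant idea can be sketched in a few lines: L2-normalize so values live on the unit sphere, then snap each coordinate to a fixed b-bit grid, so no per-block scale constant needs to be stored. This is a loose illustration, not the crate's implementation; codes are held in `i16` here for simplicity:

```rust
/// Loose sketch of the PolarQuant scheme: L2-normalize, then snap each
/// coordinate to a fixed b-bit grid over [-1, 1]. No per-block scale
/// constant is stored; the norm itself is discarded in this toy version.
fn polar_quant(v: &[f32], bits: u32) -> Vec<i16> {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt().max(1e-12);
    let levels = (1i32 << bits) - 1; // e.g. 15 grid steps for 4-bit
    v.iter()
        .map(|x| {
            let u = (x / norm).clamp(-1.0, 1.0); // unit-sphere coordinate
            (((u + 1.0) / 2.0 * levels as f32).round() as i32 - levels / 2) as i16
        })
        .collect()
}

/// Inverse mapping back to approximate unit-sphere coordinates.
fn polar_dequant(codes: &[i16], bits: u32) -> Vec<f32> {
    let levels = (1i32 << bits) - 1;
    codes
        .iter()
        .map(|&c| (c as i32 + levels / 2) as f32 / levels as f32 * 2.0 - 1.0)
        .collect()
}

fn main() {
    // [3, 4] normalizes to [0.6, 0.8]; the 4-bit grid recovers it closely.
    let codes = polar_quant(&[3.0, 4.0], 4);
    let approx = polar_dequant(&codes, 4);
    assert!((approx[0] - 0.6).abs() < 0.1 && (approx[1] - 0.8).abs() < 0.1);
}
```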
- Launch documentation (`docs/launch/`)
  - `blog-post.md` — technical deep-dive for the Dendrite launch
  - `hn-post.md` — Hacker News "Show HN" draft
  - `twitter-thread.md` — 9-tweet thread
- `ContinuousBatcher` (`scheduler/`) — Orca algorithm implementation
  - Mixed decode + chunked-prefill steps per iteration
  - Per-request state machine: Waiting → Running → Done
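The per-request lifecycle can be sketched as a three-state enum. The trigger conditions below (a scheduling call and a per-step decode budget) are assumptions for illustration, not Dendrite's exact logic:

```rust
/// Sketch of the per-request state machine: Waiting -> Running -> Done.
/// The transition triggers here are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RequestState {
    Waiting,
    Running,
    Done,
}

struct Request {
    state: RequestState,
    remaining_tokens: u32, // decode budget left for this request
}

impl Request {
    /// Scheduler admits the request into the running batch.
    fn schedule(&mut self) {
        if self.state == RequestState::Waiting {
            self.state = RequestState::Running;
        }
    }

    /// One decode step; the request finishes when its budget is exhausted.
    fn step(&mut self) {
        if self.state == RequestState::Running {
            self.remaining_tokens -= 1;
            if self.remaining_tokens == 0 {
                self.state = RequestState::Done;
            }
        }
    }
}

fn main() {
    let mut req = Request { state: RequestState::Waiting, remaining_tokens: 2 };
    req.schedule();
    assert_eq!(req.state, RequestState::Running);
    req.step();
    req.step();
    assert_eq!(req.state, RequestState::Done);
}
```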
- Criterion benchmark suite (`benches/scheduler.rs`)
  - `decode_throughput`, `prefill_latency`, `mixed_step_overhead`, `add_request` groups
- `BlockPool` batch allocation (`cache/pool.rs`)
  - `allocate_batch(n)` / `free_batch(ids)` — atomic batch ops (3.2x faster than N× `allocate`)
  - `PoolStats` with `watermark_used`, `watermark_free`
  - `with_headroom()` constructor — reserves 2% for CoW emergencies
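The batch-allocation shape can be illustrated with a toy free-list pool. The all-or-nothing semantics and signatures below are assumptions; a plausible reason batch ops win over N single calls is that bookkeeping is amortized over the whole batch:

```rust
/// Toy free-list pool mirroring the allocate_batch / free_batch shape from
/// the changelog; signatures and all-or-nothing semantics are assumptions.
struct ToyBlockPool {
    free: Vec<usize>, // stack of free block ids
}

impl ToyBlockPool {
    fn new(capacity: usize) -> Self {
        Self { free: (0..capacity).rev().collect() }
    }

    /// Either all `n` blocks are handed out, or none are.
    fn allocate_batch(&mut self, n: usize) -> Option<Vec<usize>> {
        if self.free.len() < n {
            return None;
        }
        Some(self.free.split_off(self.free.len() - n))
    }

    fn free_batch(&mut self, ids: Vec<usize>) {
        self.free.extend(ids);
    }

    fn available(&self) -> usize {
        self.free.len()
    }
}

fn main() {
    let mut pool = ToyBlockPool::new(8);
    let batch = pool.allocate_batch(3).unwrap();
    assert_eq!(batch.len(), 3);
    assert_eq!(pool.available(), 5);
    assert!(pool.allocate_batch(6).is_none()); // would overcommit: refused
    pool.free_batch(batch);
    assert_eq!(pool.available(), 8);
}
```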
- Example gallery (`examples/`)
  - `continuous_batching.rs` — invariant verification, ~7µs/step
  - `memory_pool.rs` — 3.2x batch-speedup demo, CoW headroom
- `Fp8Attention` and `Fp8SwiGluMlp` (`model/fp8_layer.rs`)
  - MXFP8-quantized attention and FFN layers
  - Pure-Rust dequant-on-forward; <5% MAE vs FP16 baseline
- TE FFI stub (`dendrite-ffi/src/te_ffi.rs`)
  - `is_te_available()`, `te_fp8_gemm()` interface defined; wires to the real NVIDIA Transformer Engine when CUDA is available
  - `quantize_weights()` — batch FP8 quantization for weight matrices
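The spirit of block-scaled quantization behind formats like MXFP8 — each small block of weights sharing one scale — can be sketched as follows. For portability this toy uses `i8` integer codes rather than a true FP8 floating-point encoding, and `quantize_weights()`'s actual signature is not reproduced here:

```rust
/// Sketch of block-scaled 8-bit quantization in the spirit of MXFP8: each
/// small block of weights shares one scale. This toy uses i8 integer codes,
/// not an actual FP8 floating-point encoding.
fn quantize_block(w: &[f32]) -> (f32, Vec<i8>) {
    let max = w.iter().fold(0.0f32, |m, x| m.max(x.abs())).max(1e-12);
    let scale = max / 127.0; // shared per-block scale
    let codes = w.iter().map(|x| (x / scale).round() as i8).collect();
    (scale, codes)
}

fn dequantize_block(scale: f32, codes: &[i8]) -> Vec<f32> {
    codes.iter().map(|&c| c as f32 * scale).collect()
}

fn main() {
    let w = [0.5, -1.0, 0.25];
    let (scale, codes) = quantize_block(&w);
    let back = dequantize_block(scale, &codes);
    for (a, b) in w.iter().zip(&back) {
        assert!((a - b).abs() < 0.01); // small round-trip error
    }
}
```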
- `ConstrainedDecoder` (`grammar/`) — structured output via llguidance
  - Token-level mask application during beam search
  - JSON / regex / CFG constraint support
  - `TokenMask` — GPU-compatible token validity mask; `from_llg_bytes()` constructor for llguidance integration
- Tokenizer bridge (`grammar/tokenizer_bridge.rs`)
  - `apply_llg_mask()` — applies an llguidance token mask to a logit slice
  - Supports the HuggingFace tokenizer format
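Mask application itself is simple: disallowed tokens get their logits forced to negative infinity so sampling can never pick them. The plain boolean slice below is a stand-in for llguidance's actual bit-packed mask format:

```rust
/// Sketch of applying a token validity mask to a logit slice: disallowed
/// tokens are forced to -inf so softmax/argmax can never select them. A
/// bool slice stands in for llguidance's bit-packed mask representation.
fn apply_mask(logits: &mut [f32], allowed: &[bool]) {
    for (logit, &ok) in logits.iter_mut().zip(allowed) {
        if !ok {
            *logit = f32::NEG_INFINITY;
        }
    }
}

fn main() {
    let mut logits = [1.0, 5.0, 3.0];
    apply_mask(&mut logits, &[true, false, true]);
    // Token 1 had the highest logit but is masked; argmax is now token 2.
    let argmax = logits
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.partial_cmp(b.1).unwrap())
        .unwrap()
        .0;
    assert_eq!(argmax, 2);
}
```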
- `RadixTree` (`cache/radix.rs`) — O(L) prefix lookup for KV block sharing
  - `find_prefix()`, `find_exact()`, `remove_by_value()` (eviction support)
  - `RadixTreeStats` — utilization metrics
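A minimal token trie shows why lookup is O(L): one node hop per query token. The real `RadixTree` additionally path-compresses runs of tokens, which this sketch omits:

```rust
use std::collections::HashMap;

/// Minimal token trie standing in for the RadixTree: find_prefix walks one
/// node per query token, so lookup is O(L). The real tree path-compresses
/// runs of tokens; this sketch does not.
#[derive(Default)]
struct TrieNode {
    children: HashMap<u32, TrieNode>,
    value: Option<usize>, // e.g. a block-table id
}

impl TrieNode {
    fn insert(&mut self, tokens: &[u32], value: usize) {
        let mut node = self;
        for &t in tokens {
            node = node.children.entry(t).or_default();
        }
        node.value = Some(value);
    }

    /// Longest stored prefix of `tokens` that carries a value, returned as
    /// (matched token count, value).
    fn find_prefix(&self, tokens: &[u32]) -> Option<(usize, usize)> {
        let (mut node, mut best) = (self, None);
        for (depth, &t) in tokens.iter().enumerate() {
            match node.children.get(&t) {
                Some(next) => {
                    node = next;
                    if let Some(v) = node.value {
                        best = Some((depth + 1, v));
                    }
                }
                None => break,
            }
        }
        best
    }
}

fn main() {
    let mut root = TrieNode::default();
    root.insert(&[1, 2, 3], 7); // shared prompt prefix -> a value
    assert_eq!(root.find_prefix(&[1, 2, 3, 4]), Some((3, 7)));
    assert_eq!(root.find_prefix(&[9]), None);
}
```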
- `PagedKvCache` (`cache/paged.rs`) — vLLM-style paged attention
  - `PagePool` with a `PageTable` per sequence
  - `slice_set` for token-level KV writes
- Beam/tree search (`search/`, `tree/`)
  - `BeamSearch` with configurable width and length penalty
  - `TreeState` — fork/merge token trees with O(1) CoW
  - Golden harness for determinism testing
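A length penalty in beam search typically divides a hypothesis's summed log-probability by a length-dependent factor. The GNMT-style formula below is one common choice; whether `BeamSearch` uses exactly this form is an assumption:

```rust
/// GNMT-style length penalty: score = sum_log_prob / ((5 + len) / 6)^alpha.
/// A common formulation, used here for illustration only; Dendrite's exact
/// penalty is an assumption.
fn length_penalized(sum_log_prob: f32, len: usize, alpha: f32) -> f32 {
    let lp = ((5.0 + len as f32) / 6.0).powf(alpha);
    sum_log_prob / lp
}

fn main() {
    // alpha = 0 disables the penalty entirely.
    assert_eq!(length_penalized(-6.0, 7, 0.0), -6.0);
    // With alpha > 0, longer hypotheses are divided by a larger factor, so
    // the same total log-prob scores better (less negative).
    assert!(length_penalized(-6.0, 7, 0.6) > -6.0);
}
```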
- Transformer model (`model/`)
  - `TransformerLayer`, `TransformerModel` with RoPE, RMSNorm, SwiGLU
  - `KvCache` / `LayerCache` — grow-on-append per-layer cache
  - FP8 linear layer (`quantization/fp8_linear.rs`)
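Of the pieces above, RMSNorm is the easiest to show in a few lines: scale by the reciprocal root-mean-square, with no mean subtraction (unlike LayerNorm). The epsilon value below is an assumption:

```rust
/// RMSNorm: x * gain / rms(x), with no mean-centering (unlike LayerNorm).
/// The epsilon used in the demo is an assumption, not the crate's value.
fn rms_norm(x: &[f32], gain: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    x.iter().zip(gain).map(|(v, g)| v * inv_rms * g).collect()
}

fn main() {
    let out = rms_norm(&[3.0, 4.0], &[1.0, 1.0], 1e-6);
    // With unit gain, the output has RMS ~1.
    let rms = (out.iter().map(|v| v * v).sum::<f32>() / 2.0).sqrt();
    assert!((rms - 1.0).abs() < 1e-3);
}
```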
- Reference attention backend (`attention/backend.rs`)
  - Scaled dot-product attention with causal masking
  - Flash attention FFI interface (GPU path)
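A reference causal SDPA fits in a short function. The single-head, `Vec<Vec<f32>>` layout below is chosen for clarity and does not mirror the backend's actual API:

```rust
/// Reference scaled dot-product attention with a causal mask: single head,
/// row-major [seq][dim] layout. Illustrative only; the actual backend API
/// in attention/backend.rs is not reproduced here.
fn causal_sdpa(q: &[Vec<f32>], k: &[Vec<f32>], v: &[Vec<f32>]) -> Vec<Vec<f32>> {
    let scale = 1.0 / (q[0].len() as f32).sqrt();
    q.iter()
        .enumerate()
        .map(|(i, qi)| {
            // Causal mask: position i only scores against positions <= i.
            let scores: Vec<f32> = k[..=i]
                .iter()
                .map(|kj| qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale)
                .collect();
            // Numerically stable softmax over the visible scores.
            let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
            let exps: Vec<f32> = scores.iter().map(|s| (s - max).exp()).collect();
            let denom: f32 = exps.iter().sum();
            // Weighted sum of the visible value rows.
            let mut out = vec![0.0; v[0].len()];
            for (w, vj) in exps.iter().zip(&v[..=i]) {
                for (o, x) in out.iter_mut().zip(vj) {
                    *o += w / denom * x;
                }
            }
            out
        })
        .collect()
}

fn main() {
    let t = vec![vec![1.0], vec![0.0]];
    let out = causal_sdpa(&t, &t, &t);
    assert!((out[0][0] - 1.0).abs() < 1e-6); // pos 0 attends only to itself
    assert!((out[1][0] - 0.5).abs() < 1e-6); // pos 1: equal scores -> mean
}
```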
- `BlockPool` (`cache/pool.rs`) — pre-allocated block manager with CoW
  - `Block` + `BlockId`, reference-counted
  - `BlockTable` — per-sequence logical→physical mapping
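Copy-on-write over reference-counted blocks works like this: a write to a block held by more than one sequence first forks it into a fresh copy. A toy version (the real `BlockPool`/`BlockTable` API is richer and its shape is not reproduced here):

```rust
use std::collections::HashMap;

/// Sketch of copy-on-write block sharing: blocks are reference-counted, and
/// a write to a shared block (refcount > 1) forks it first. Illustrative
/// only; the real BlockPool/BlockTable API is richer.
struct CowPool {
    refcounts: HashMap<usize, u32>,
    next_id: usize,
}

impl CowPool {
    fn new() -> Self {
        Self { refcounts: HashMap::new(), next_id: 0 }
    }

    fn allocate(&mut self) -> usize {
        let id = self.next_id;
        self.next_id += 1;
        self.refcounts.insert(id, 1);
        id
    }

    /// Another sequence starts sharing this block.
    fn share(&mut self, id: usize) {
        *self.refcounts.get_mut(&id).unwrap() += 1;
    }

    /// Block id to write into: the same block if exclusively owned,
    /// otherwise a fresh copy (copy-on-write).
    fn write(&mut self, id: usize) -> usize {
        if self.refcounts[&id] == 1 {
            id
        } else {
            *self.refcounts.get_mut(&id).unwrap() -= 1;
            self.allocate()
        }
    }
}

fn main() {
    let mut pool = CowPool::new();
    let b = pool.allocate(); // refcount 1
    pool.share(b); // second sequence shares it: refcount 2
    let forked = pool.write(b); // shared -> fork to a fresh block
    assert_ne!(forked, b);
    assert_eq!(pool.write(b), b); // now exclusively owned -> write in place
}
```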
- `Scheduler` (`scheduler/`) — request lifecycle management
  - Waiting/running/done state machine
  - Token budget enforcement
- Project infrastructure
  - Cargo workspace: `dendrite`, `dendrite-core`, `dendrite-ffi`
  - CI, benchmarks, property tests (proptest)
  - `ROADMAP.md` with a 14-milestone plan
| Metric | Value |
|---|---|
| Tests | 359 passing |
| Crates | 3 (dendrite, dendrite-core, dendrite-ffi) |
| Milestones complete | M1–M7 ✅ |
| Active milestone | M8 (Launch) |
| Compression | 3x vs FP16 (TurboQuant, CPU) |
| Benchmark | ~7µs/step continuous batching (CPU) |