
feat: quant-server-unified — server built directly on quant.h #79

Merged
unamedkr merged 1 commit into main from feat/unified-server
Apr 12, 2026

Conversation

@unamedkr (Collaborator)

Summary

New server binary that compiles against quant.h directly, eliminating the sync-divergence bug between quant.h and libturboquant (#77, #78).

Problem

quant-server (built from libturboquant) produces garbage output for SmolLM2-1.7B due to numerical instability at layer 7 (max=18,359). The same model works correctly when run via quant.h directly. Root cause: the split libturboquant sources have diverged from quant.h.

Solution

A single-file server (tools/quant_server_unified.c) that #includes "quant.h" directly:

cc -O2 -o quant-server-unified tools/quant_server_unified.c -lm -lpthread
./quant-server-unified model.gguf -p 8080 -j 8
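Once running, the server accepts standard OpenAI-style request bodies on /v1/chat/completions. A minimal chat request might look like this (field values are illustrative, not taken from the PR):

```json
{
  "model": "SmolLM2-1.7B",
  "messages": [
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false,
  "max_tokens": 64
}
```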

Benchmark (Apple M3, 16GB)

| Model        | libturboquant server        | unified server |
|--------------|-----------------------------|----------------|
| SmolLM2-1.7B | GARBAGE (layer 7 explosion) | 23 tok/s       |
| Phi-3.5-mini | CRASH / garbage             | 6.5 tok/s      |

Features

  • /v1/chat/completions (OpenAI compatible)
  • /v1/models, /health
  • SSE streaming (stream: true)
  • CORS headers
  • Auto-detect Phi-3 chat template vs ChatML
  • Template token filtering
  • Mutex-serialized inference
  • Graceful port-in-use error

Test plan

  • SmolLM2-1.7B: 56 tokens, coherent output, 23 tok/s
  • Phi-3.5-mini: 60 tokens, coherent output, 6.5 tok/s
  • SSE streaming: correct chunk format, [DONE] signal
  • Health check: returns version
  • CORS: preflight OPTIONS returns 204
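For reference, the "correct chunk format" being checked is the OpenAI-style SSE wire format: each chunk is a `data:` line carrying a `chat.completion.chunk` object, and the stream ends with a `[DONE]` sentinel. The JSON payloads below are illustrative:

```
data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"}}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```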

Fixes #77
Refs #78

🤖 Generated with Claude Code

Commit message:

New server binary that compiles against quant.h instead of
libturboquant, eliminating the sync-divergence bug (#77, #78).

Key results (Apple M3, 16GB):
  SmolLM2-1.7B: 23 tok/s (was: garbage via libturboquant)
  Phi-3.5-mini:  6.5 tok/s (was: crash or garbage via libturboquant)

Build:
  cc -O2 -o quant-server-unified tools/quant_server_unified.c -lm -lpthread

Features:
- OpenAI-compatible API (/v1/chat/completions, /v1/models, /health)
- SSE streaming (stream: true)
- CORS headers
- Auto-detect Phi-3 chat template vs ChatML
- Template token filtering (<|im_end|>, <|end|>, etc.)
- Mutex-serialized inference (safe for concurrent HTTP clients)
- Graceful port-in-use error

No libturboquant dependency. No Metal/CUDA (pure CPU NEON).
Single file, zero external dependencies beyond libc.

Fixes #77 (SmolLM2 numerical instability in libturboquant)
Refs #78 (quant.h as single source of truth)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
unamedkr merged commit 27671f5 into main on Apr 12, 2026. 2 of 3 checks passed.


Development

Successfully merging this pull request may close these issues:

  • SmolLM2-1.7B server inference regression after 91814d4 (Phi-3 CPU fallback)
