diff --git a/README.ko.md b/README.ko.md
index 0c69f90..fe56c0f 100644
--- a/README.ko.md
+++ b/README.ko.md
@@ -28,28 +28,30 @@
 ```bash
 pip install quantcpp
-quantcpp pull llama3.2:1b            # HuggingFace에서 다운로드
-quantcpp run llama3.2:1b             # 대화형 채팅
-quantcpp serve llama3.2:1b -p 8080   # OpenAI 호환 HTTP 서버 (SSE 스트리밍)
+quantcpp pull phi-3.5-mini           # HuggingFace에서 다운로드 (~2.4 GB)
+quantcpp run phi-3.5-mini            # 대화형 채팅
+quantcpp serve phi-3.5-mini -p 8080  # OpenAI 호환 HTTP 서버 (SSE 스트리밍)
 quantcpp client "안녕"               # 스트리밍 클라이언트 → :8080 서버
 quantcpp list                        # 캐시된 모델 목록
 ```
 
-짧은 별칭: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. `run`/`serve` 첫 실행 시 자동 다운로드. `serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다 — 클라이언트가 `"stream": true`를 보내면 SSE 토큰 단위 스트리밍, 생략하면 단일 JSON 응답. 내장 `quantcpp client`는 두 모드 모두 지원 (기본: 스트리밍, `--no-stream`: 단일 응답).
+추천 기본 모델: **Phi-3.5-mini** (3.8B params, vocab 32K). registry의 모든 모델 중 가장 작은 vocab(32K)이라 토큰당 `lm_head` matmul이 가장 빠릅니다 — 노트북에서 속도와 품질의 최적 조합입니다. 다른 별칭: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. `run`/`serve` 첫 실행 시 자동 다운로드.
+
+`serve`는 OpenAI 호환 `POST /v1/chat/completions` 엔드포인트를 8080 포트에 제공합니다 — 클라이언트가 `"stream": true`를 보내면 SSE 토큰 단위 스트리밍, 생략하면 단일 JSON 응답. 내장 `quantcpp client`는 두 모드 모두 지원 (기본: 스트리밍, `--no-stream`: 단일 응답).
 
 **한 줄 질문:**
 
 ```bash
-quantcpp run llama3.2:1b "중력이란 무엇인가요?"
+quantcpp run phi-3.5-mini "중력이란 무엇인가요?"
 ```
 
 **Python API (3줄):**
 
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("중력이란 무엇인가요?"))
 ```
 
-API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
+API 키 없음. GPU 없음. 설정 없음. 모델은 `~/.cache/quantcpp/`에 캐시됩니다. 지원되는 architecture와 모델 선택 가이드는 [`docs/supported_models.md`](docs/supported_models.md)를 참고하세요. [브라우저에서 바로 체험 →](https://quantumaikr.github.io/quant.cpp/) · [**작동 원리 가이드 →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
diff --git a/README.md b/README.md
index 703c293..621ea99 100644
--- a/README.md
+++ b/README.md
@@ -41,28 +41,30 @@
 ```bash
 pip install quantcpp
-quantcpp pull llama3.2:1b            # download from HuggingFace
-quantcpp run llama3.2:1b             # interactive chat
-quantcpp serve llama3.2:1b -p 8080   # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull phi-3.5-mini           # download from HuggingFace (~2.4 GB)
+quantcpp run phi-3.5-mini            # interactive chat
+quantcpp serve phi-3.5-mini -p 8080  # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                 # streaming client → server on :8080
 quantcpp list                        # show cached models
 ```
 
-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
+Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run` / `serve`.
+
+The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).
 
 **One-shot question:**
 
 ```bash
-quantcpp run llama3.2:1b "What is gravity?"
+quantcpp run phi-3.5-mini "What is gravity?"
 ```
 
 **Python API (3 lines):**
 
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("What is gravity?"))
 ```
 
-Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. See [`docs/supported_models.md`](docs/supported_models.md) for the architecture support matrix and model selection guide. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
diff --git a/bindings/python/README.md b/bindings/python/README.md
index 306e7ab..7d9ea2d 100644
--- a/bindings/python/README.md
+++ b/bindings/python/README.md
@@ -33,14 +33,30 @@ pip install .
 
 ## Usage
 
-### Basic question answering
+### Quick start (auto-download)
 
 ```python
 from quantcpp import Model
+m = Model.from_pretrained("Phi-3.5-mini")  # ~2.4 GB, downloaded once and cached
+print(m.ask("What is 2+2?"))
+```
+
+`from_pretrained` accepts any name from `quantcpp.available_models()`.
+**Phi-3.5-mini** is the recommended default — 3.8B params with the smallest
+vocab (32K) in the registry, which makes the per-token `lm_head` matmul
+the fastest of any model we ship. Other ready-to-use names:
+
+- `SmolLM2-1.7B` — lightweight all-rounder (1.7 GB, vocab 49K)
+- `Llama-3.2-1B` — smallest download (750 MB) but slower at inference
+- `SmolLM2-135M` — 138 MB demo model, low quality
+- `Qwen3.5-0.8B`
+
+You can also load any local GGUF file directly:
+
+```python
 m = Model("model.gguf")
-answer = m.ask("What is 2+2?")
-print(answer)
+print(m.ask("What is 2+2?"))
 ```
 
 ### Streaming generation
@@ -50,10 +66,30 @@
 for token in m.generate("Once upon a time"):
     print(token, end="", flush=True)
 ```
 
+### Multi-turn chat with KV cache reuse
+
+```python
+m = Model.from_pretrained("Phi-3.5-mini")
+history = ""
+while True:
+    user = input("\nYou: ")
+    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
+    print("AI: ", end="", flush=True)
+    reply = ""
+    for tok in m.chat(history):
+        print(tok, end="", flush=True)
+        reply += tok
+    history += reply + "<|end|>\n"
+```
+
+`m.chat()` reuses the KV cache across turns — turn N's prefill cost is
+O(new tokens), not O(history). Catch `quantcpp.ChatContextOverflow` if
+the conversation exceeds the model's context window.
+
 ### Context manager
 
 ```python
-with Model("model.gguf") as m:
+with Model.from_pretrained("Phi-3.5-mini") as m:
     print(m.ask("Explain gravity in one sentence"))
 ```
@@ -92,6 +128,12 @@ Load a GGUF model file and create an inference context.
 - `n_threads` -- CPU thread count.
 - `kv_compress` -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).
 
+### `Model.from_pretrained(name) -> Model`
+
+Download a registered model from HuggingFace (cached at
+`~/.cache/quantcpp/`) and return an open Model. See
+`quantcpp.available_models()` for the registry.
+
 ### `Model.ask(prompt) -> str`
 
 Generate a complete response. Returns the full text.
@@ -100,6 +142,14 @@ Generate a complete response. Returns the full text.
 Stream tokens one at a time. Yields individual token strings.
 
+### `Model.chat(prompt) -> Iterator[str]`
+
+Stream tokens with KV cache reuse across calls — turn N pays only for
+the new bytes since turn N-1. Pass `prompt=None` (or call
+`Model.reset_chat()`) to start a fresh session. Raises
+`quantcpp.ChatContextOverflow` when the history exceeds the model's
+context window (the C side has already auto-reset by then).
+
 ### `Model.close()`
 
 Release resources. Called automatically via `with` or garbage collection.
diff --git a/bindings/python/quantcpp/__init__.py b/bindings/python/quantcpp/__init__.py
index 886d21e..25dcfa7 100644
--- a/bindings/python/quantcpp/__init__.py
+++ b/bindings/python/quantcpp/__init__.py
@@ -4,14 +4,20 @@
 Quick start:
 
     from quantcpp import Model
-    m = Model.from_pretrained("SmolLM2-1.7B")
+    m = Model.from_pretrained("Phi-3.5-mini")
     print(m.ask("What is gravity?"))
 
 Model selection guide:
-    SmolLM2-1.7B  (1.7 GB, vocab 49K)  — recommended. ~12 tok/s on Apple M3.
-    Llama-3.2-1B  (750 MB, vocab 128K) — smaller download but slower
+    Phi-3.5-mini  (2.4 GB, vocab 32K)  — DEFAULT. 3.8B params with the
+                                         smallest lm_head in the registry,
+                                         producing the best speed/quality
+                                         combo. Coherent multi-paragraph
+                                         output even at Q4_K_M.
+    SmolLM2-1.7B  (1.7 GB, vocab 49K)  — lightweight all-rounder. ~12 tok/s
+                                         on Apple M3, smaller download.
+    Llama-3.2-1B  (750 MB, vocab 128K) — smallest download but slower
                                          due to large vocab (~2 tok/s on M3).
-    SmolLM2-135M  (138 MB, vocab 49K)  — demo only, low quality output.
+    SmolLM2-135M  (138 MB, vocab 49K)  — demo only, low quality output.
 
 Larger vocab = slower lm_head matmul → smaller params with smaller
 vocab often beats larger params with larger vocab. See docs/supported_models.md
@@ -65,47 +71,48 @@ class ChatContextOverflow(RuntimeError):
 # Verify both fields against the actual HuggingFace listing before
 # adding new entries — there is no integrity check at runtime.
 _MODEL_REGISTRY = {
-    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
-    # model is too small to produce coherent output for general chat.
-    # Listed only so users can verify the install/load path quickly.
-    "SmolLM2-135M": (
-        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
-        "smollm2-135m-instruct-q8_0.gguf",
-        135,
+    # ── DEFAULT ──
+    # Phi-3.5-mini-instruct (3.8B params, vocab 32K). Set as default on
+    # 2026-04-12 after end-to-end Phi-3 architecture support landed
+    # (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab is the
+    # smallest of the registry, which makes the lm_head matmul the
+    # fastest per-token. Combined with 3.8B params it produces the
+    # best quality-per-token of any model we ship.
+    "Phi-3.5-mini": (
+        "bartowski/Phi-3.5-mini-instruct-GGUF",
+        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
+        2400,
     ),
-    # Recommended default for first-time users on Apple Silicon / typical
-    # laptops. vocab 49K keeps the lm_head matmul small, so even on a
-    # mid-range M-series chip we measure ~12 tok/s — comfortable for
-    # interactive chat. Same llama arch family as SmolLM2-135M, so it
-    # exercises the most-tested code path.
+    # Lightweight all-rounder for users who want a smaller download
+    # than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
+    # on a mid-range M-series chip we measure ~12 tok/s — comfortable
+    # for interactive chat. Same llama arch family as SmolLM2-135M.
     "SmolLM2-1.7B": (
         "bartowski/SmolLM2-1.7B-Instruct-GGUF",
         "SmolLM2-1.7B-Instruct-Q8_0.gguf",
         1700,
     ),
-    "Qwen3.5-0.8B": (
-        "unsloth/Qwen3.5-0.8B-GGUF",
-        "Qwen3.5-0.8B-Q4_K_M.gguf",
-        508,
-    ),
-    # Smaller download than SmolLM2-1.7B but slower at inference time
-    # because of the 128K Llama-3 vocab (~5x slower lm_head matmul on M3).
-    # Kept in the registry for users who specifically want a Llama model.
+    # Smallest download in the "actually usable" tier. Slower at
+    # inference time because of the 128K Llama-3 vocab (~5x slower
+    # lm_head matmul on M3). Kept in the registry for users who
+    # specifically want a Llama model.
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
        "llama-3.2-1b-instruct-q4_k_m.gguf",
         750,
     ),
-    # Phi-3.5-mini-instruct (3.8B params, vocab 32K).
-    # Added 2026-04-12 after end-to-end Phi-3 architecture support
-    # landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
-    # is the smallest of the registry, which makes the lm_head matmul
-    # the fastest per-token. Combined with 3.8B params it's the best
-    # quality-per-token model we ship.
-    "Phi-3.5-mini": (
-        "bartowski/Phi-3.5-mini-instruct-GGUF",
-        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
-        2400,
+    "Qwen3.5-0.8B": (
+        "unsloth/Qwen3.5-0.8B-GGUF",
+        "Qwen3.5-0.8B-Q4_K_M.gguf",
+        508,
+    ),
+    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
+    # model is too small to produce coherent output for general chat.
+    # Listed only so users can verify the install/load path quickly.
+    "SmolLM2-135M": (
+        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
+        "smollm2-135m-instruct-q8_0.gguf",
+        135,
     ),
 }
 
@@ -208,9 +215,9 @@ class Model:
 
     Examples
     --------
-    >>> m = Model.from_pretrained("SmolLM2-1.7B")
+    >>> m = Model.from_pretrained("Phi-3.5-mini")
     >>> m.ask("What is gravity?")
-    'Gravity is a force that attracts ...'
+    'Gravity is a fundamental force that attracts ...'
 
     >>> with Model("model.gguf") as m:
     ...     for tok in m.generate("Once upon a time"):
diff --git a/bindings/python/quantcpp/cli.py b/bindings/python/quantcpp/cli.py
index 1877dc5..3c99cdd 100644
--- a/bindings/python/quantcpp/cli.py
+++ b/bindings/python/quantcpp/cli.py
@@ -337,13 +337,16 @@ def cmd_client(args):
 
 
 def cmd_chat_default(args):
-    """Backwards-compatible default: auto-download SmolLM2-1.7B and chat.
-
-    Default switched from Llama-3.2-1B to SmolLM2-1.7B (2026-04-12) after
-    user feedback that Llama-3.2-1B's 128K vocab makes it ~5x slower at
-    interactive chat than SmolLM2-1.7B's 49K vocab on Apple Silicon.
+    """Backwards-compatible default: auto-download Phi-3.5-mini and chat.
+
+    Default progression:
+        Llama-3.2-1B → SmolLM2-1.7B (2026-04-12, vocab fix)
+                     → Phi-3.5-mini (2026-04-12, after Phi-3 arch support
+                       landed). Phi-3.5-mini has the smallest vocab in
+                       the registry (32K) AND 3.8B params, giving the
+                       best speed/quality combo we ship.
     """
-    args.model = args.model or "SmolLM2-1.7B"
+    args.model = args.model or "Phi-3.5-mini"
     args.threads = getattr(args, "threads", 4)
     args.max_tokens = getattr(args, "max_tokens", 256)
     args.temperature = getattr(args, "temperature", 0.7)
@@ -367,19 +370,20 @@ def main():
       client PROMPT    Send a request to a running serve (default: SSE streaming)
 
     examples:
-      quantcpp pull smollm2                # recommended: small vocab → fast
+      quantcpp pull phi-3.5-mini           # recommended default (32K vocab → fast)
       quantcpp list
-      quantcpp run smollm2
-      quantcpp run smollm2 "What is gravity?"
-      quantcpp serve smollm2 --port 8080
+      quantcpp run phi-3.5-mini
+      quantcpp run phi-3.5-mini "What is gravity?"
+      quantcpp serve phi-3.5-mini --port 8080
      quantcpp client "What is gravity?"   # streams from :8080
       quantcpp client "Hi" --url http://localhost:8081
       quantcpp client "Hi" --no-stream     # single JSON response
 
    backwards-compat (no subcommand):
-      quantcpp                             # default chat with SmolLM2-1.7B
+      quantcpp                             # default chat with Phi-3.5-mini
       quantcpp "What is gravity?"          # one-shot
-      quantcpp --model llama3.2:1b         # different model
+      quantcpp --model smollm2             # lightweight alternative
+      quantcpp --model llama3.2:1b         # smallest download
 
     """,
 )
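
Both README hunks in this patch describe the `serve` endpoint's two response modes: `"stream": true` yields SSE token-by-token chunks, omitting it yields a single JSON response. As a reviewer's sanity check of the client side, here is a minimal, stand-alone sketch of parsing an OpenAI-style SSE stream body. It assumes the conventional chat-completions chunk shape (`choices[0].delta.content` and a `data: [DONE]` sentinel), which the patch claims compatibility with but does not show; the helper name `extract_deltas` is ours, not quantcpp API.

```python
import json

def extract_deltas(sse_body: str) -> str:
    """Collect assistant text from an OpenAI-style SSE stream body.

    Each event line looks like:  data: {"choices":[{"delta":{"content":"..."}}]}
    and the stream terminates with:  data: [DONE]
    """
    out = []
    for line in sse_body.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines between events
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        out.append(delta.get("content", ""))
    return "".join(out)

# Example stream shaped as the README describes (token-level SSE events):
body = (
    'data: {"choices":[{"delta":{"content":"Gra"}}]}\n\n'
    'data: {"choices":[{"delta":{"content":"vity"}}]}\n\n'
    "data: [DONE]\n\n"
)
print(extract_deltas(body))  # → Gravity
```

The same payload shape should apply whether the tokens come from `quantcpp serve` or any other OpenAI-compatible endpoint; the built-in `quantcpp client` presumably does the equivalent internally.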
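
The new multi-turn example in `bindings/python/README.md` builds the Phi-style transcript (`<|user|>…<|end|>\n<|assistant|>\n`, replies closed with `<|end|>\n`) by hand-concatenating strings. That bookkeeping can be factored into a tiny helper; a sketch using only the tag strings taken from the example — the class `PhiChatHistory` and its methods are hypothetical illustration, not quantcpp API:

```python
class PhiChatHistory:
    """Accumulate a Phi-3.5-style chat transcript string.

    Mirrors the manual string building in the README's multi-turn
    example: each user turn appends <|user|>\n...<|end|>\n<|assistant|>\n
    and each assistant reply is closed with <|end|>\n.
    """

    def __init__(self) -> None:
        self.text = ""

    def user(self, message: str) -> str:
        # Returns the full transcript, ready to pass to m.chat(...)
        self.text += f"<|user|>\n{message}<|end|>\n<|assistant|>\n"
        return self.text

    def assistant(self, reply: str) -> None:
        # Record the streamed reply once generation finishes
        self.text += reply + "<|end|>\n"

h = PhiChatHistory()
prompt = h.user("What is 2+2?")
h.assistant("4")
print(prompt.endswith("<|assistant|>\n"))  # → True
```

Since the docs state `m.chat()` diffs the new prompt against the cached prefix, always passing the full accumulated `h.user(...)` string keeps turn N's prefill proportional to the new turn only.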