Merged
16 changes: 9 additions & 7 deletions README.ko.md
@@ -28,28 +28,30 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b          # download from HuggingFace
-quantcpp run llama3.2:1b           # interactive chat
-quantcpp serve llama3.2:1b -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull phi-3.5-mini          # download from HuggingFace (~2.4 GB)
+quantcpp run phi-3.5-mini           # interactive chat
+quantcpp serve phi-3.5-mini -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hello"            # streaming client → server on :8080
 quantcpp list                      # list cached models
 ```
 
-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`. `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080 — SSE token-by-token streaming when the client sends `"stream": true`, a single JSON response when it is omitted. The built-in `quantcpp client` supports both modes (default: streaming; `--no-stream`: single response).
+Recommended default model: **Phi-3.5-mini** (3.8B params, vocab 32K). It has the smallest vocab (32K) of all models in the registry, so its per-token `lm_head` matmul is the fastest — the best speed/quality combination on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-downloads on first `run`/`serve`.
+
+`serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080 — SSE token-by-token streaming when the client sends `"stream": true`, a single JSON response when it is omitted. The built-in `quantcpp client` supports both modes (default: streaming; `--no-stream`: single response).
 
 **One-shot question:**
 ```bash
-quantcpp run llama3.2:1b "What is gravity?"
+quantcpp run phi-3.5-mini "What is gravity?"
 ```
 
 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("What is gravity?"))
 ```
 
-No API key. No GPU. No configuration. Models are cached in `~/.cache/quantcpp/`. [Try it in the browser →](https://quantumaikr.github.io/quant.cpp/) · [**How-it-works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+No API key. No GPU. No configuration. Models are cached in `~/.cache/quantcpp/`. See [`docs/supported_models.md`](docs/supported_models.md) for supported architectures and a model selection guide. [Try it in the browser →](https://quantumaikr.github.io/quant.cpp/) · [**How-it-works guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
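The PR's recurring claim — smaller vocab → faster per-token `lm_head` — is easy to sanity-check with back-of-envelope arithmetic. The hidden sizes below are the published configs for these models; treat them as approximations, since they are not stated anywhere in this diff:

```python
# Per-token lm_head cost is one [hidden_dim x vocab_size] matrix-vector
# product: hidden_dim * vocab_size multiply-accumulates (MACs).
MODELS = {
    # name: (hidden_dim, vocab_size) — approximate published configs
    "Phi-3.5-mini": (3072, 32_064),
    "SmolLM2-1.7B": (2048, 49_152),
    "Llama-3.2-1B": (2048, 128_256),
}

def lm_head_macs(hidden_dim: int, vocab_size: int) -> int:
    return hidden_dim * vocab_size

for name, (h, v) in MODELS.items():
    print(f"{name:13s} {lm_head_macs(h, v) / 1e6:6.1f}M MACs per token")

# Llama-3.2-1B's output head does ~2.7x the raw work of Phi-3.5-mini's,
# despite Llama having far fewer total parameters.
ratio = lm_head_macs(2048, 128_256) / lm_head_macs(3072, 32_064)
print(f"Llama-3.2-1B / Phi-3.5-mini head cost: {ratio:.1f}x")
```

Raw MAC counts understate the measured gap the diff quotes (~5x on M3), since a 128K-row head also blows out the cache, but the direction of the claim checks out.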
16 changes: 9 additions & 7 deletions README.md
@@ -41,28 +41,30 @@
 ```bash
 pip install quantcpp
 
-quantcpp pull llama3.2:1b          # download from HuggingFace
-quantcpp run llama3.2:1b           # interactive chat
-quantcpp serve llama3.2:1b -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull phi-3.5-mini          # download from HuggingFace (~2.4 GB)
+quantcpp run phi-3.5-mini           # interactive chat
+quantcpp serve phi-3.5-mini -p 8080 # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"               # streaming client → server on :8080
 quantcpp list                      # show cached models
 ```
 
-Short aliases: `smollm2:135m`, `qwen3.5:0.8b`, `llama3.2:1b`. Auto-pulls on first `run`/`serve`. The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for a single response).
+Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run`/`serve`.
+
+The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for a single response).
 
 **One-shot question:**
 ```bash
-quantcpp run llama3.2:1b "What is gravity?"
+quantcpp run phi-3.5-mini "What is gravity?"
 ```
 
 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Llama-3.2-1B")
+m = Model.from_pretrained("Phi-3.5-mini")
 print(m.ask("What is gravity?"))
 ```
 
-Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
+Downloads on first use, cached at `~/.cache/quantcpp/`. No API key, no GPU. See [`docs/supported_models.md`](docs/supported_models.md) for the architecture support matrix and model selection guide. [Try in browser →](https://quantumaikr.github.io/quant.cpp/) · [**Interactive Guide →**](https://quantumaikr.github.io/quant.cpp/guide/)
 
 ---
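The streaming contract that `serve` documents can be exercised without a live server. This sketch builds the request body and parses SSE chunks; the chunk shape follows the OpenAI chat-completions format the README claims compatibility with, so the exact field names are an assumption about quantcpp's output rather than something this diff specifies:

```python
import json

def build_chat_payload(prompt: str, stream: bool = True) -> dict:
    """Request body for POST /v1/chat/completions (OpenAI-compatible shape)."""
    body = {
        "model": "phi-3.5-mini",
        "messages": [{"role": "user", "content": prompt}],
    }
    if stream:
        body["stream"] = True  # omit the key entirely for a single JSON response
    return body

def sse_tokens(lines):
    """Yield content deltas from an SSE stream of chat-completion chunks."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alives and SSE comments
        data = line[len("data:"):].strip()
        if data == "[DONE]":  # OpenAI-style end-of-stream sentinel
            break
        chunk = json.loads(data)
        delta = chunk["choices"][0].get("delta", {})
        if "content" in delta:
            yield delta["content"]

# Example with a canned stream (no server needed):
canned = [
    'data: {"choices":[{"delta":{"content":"Gra"}}]}',
    'data: {"choices":[{"delta":{"content":"vity"}}]}',
    "data: [DONE]",
]
print("".join(sse_tokens(canned)))  # prints "Gravity"
```

A real client would feed `sse_tokens` from the response body line iterator of an HTTP library instead of a canned list.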
58 changes: 54 additions & 4 deletions bindings/python/README.md
@@ -33,14 +33,30 @@ pip install .
 
 ## Usage
 
-### Basic question answering
+### Quick start (auto-download)
+
+```python
+from quantcpp import Model
+
+m = Model.from_pretrained("Phi-3.5-mini")  # ~2.4 GB, downloaded once and cached
+print(m.ask("What is 2+2?"))
+```
+
+`from_pretrained` accepts any name from `quantcpp.available_models()`.
+**Phi-3.5-mini** is the recommended default — 3.8B params with the smallest
+vocab (32K) in the registry, which makes the per-token `lm_head` matmul
+the fastest of any model we ship. Other ready-to-use names:
+
+- `SmolLM2-1.7B` — lightweight all-rounder (1.7 GB, vocab 49K)
+- `Llama-3.2-1B` — smallest download (750 MB) but slower at inference
+- `SmolLM2-135M` — 138 MB demo model, low quality
+- `Qwen3.5-0.8B` — 508 MB Q4_K_M build
+
+You can also load any local GGUF file directly:
 
 ```python
 m = Model("model.gguf")
-answer = m.ask("What is 2+2?")
-print(answer)
+print(m.ask("What is 2+2?"))
 ```
 
 ### Streaming generation
@@ -50,10 +66,30 @@ for token in m.generate("Once upon a time"):
     print(token, end="", flush=True)
 ```
 
+### Multi-turn chat with KV cache reuse
+
+```python
+m = Model.from_pretrained("Phi-3.5-mini")
+history = ""
+while True:
+    user = input("\nYou: ")
+    history += f"<|user|>\n{user}<|end|>\n<|assistant|>\n"
+    print("AI: ", end="", flush=True)
+    reply = ""
+    for tok in m.chat(history):
+        print(tok, end="", flush=True)
+        reply += tok
+    history += reply + "<|end|>\n"
+```
+
+`m.chat()` reuses the KV cache across turns — turn N's prefill cost is
+O(new tokens), not O(history). Catch `quantcpp.ChatContextOverflow` if
+the conversation exceeds the model's context window.
+
 ### Context manager
 
 ```python
-with Model("model.gguf") as m:
+with Model.from_pretrained("Phi-3.5-mini") as m:
     print(m.ask("Explain gravity in one sentence"))
 ```

@@ -92,6 +128,12 @@ Load a GGUF model file and create an inference context.
 
 - `n_threads` -- CPU thread count.
 - `kv_compress` -- KV cache compression mode (0=off, 1=4-bit, 2=delta+3-bit).
 
+### `Model.from_pretrained(name) -> Model`
+
+Download a registered model from HuggingFace (cached at
+`~/.cache/quantcpp/`) and return an open `Model`. See
+`quantcpp.available_models()` for the registry.
+
 ### `Model.ask(prompt) -> str`
 
 Generate a complete response. Returns the full text.
@@ -100,6 +142,14 @@ Generate a complete response. Returns the full text.
 
 Stream tokens one at a time. Yields individual token strings.
 
+### `Model.chat(prompt) -> Iterator[str]`
+
+Stream tokens with KV cache reuse across calls — turn N pays only for
+the new bytes since turn N-1. Pass `prompt=None` (or call
+`Model.reset_chat()`) to start a fresh session. Raises
+`quantcpp.ChatContextOverflow` when the history exceeds the model's
+context window (the C side has already auto-reset by then).
+
 ### `Model.close()`
 
 Release resources. Called automatically via `with` or garbage collection.
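The O(new tokens) prefill claim for `Model.chat()` comes from prefix reuse: positions whose tokens match the cached sequence keep their KV entries, so only the unseen suffix pays prefill cost. A toy model of that bookkeeping — this is an illustration of the idea, not the library's C implementation:

```python
def reusable_prefix(cached, prompt):
    """Longest common prefix between the cached token sequence and the
    new prompt — these positions keep their KV-cache entries."""
    n = 0
    for a, b in zip(cached, prompt):
        if a != b:
            break
        n += 1
    return n

def prefill_cost(cached, prompt):
    """Tokens that must actually be prefilled this turn."""
    return len(prompt) - reusable_prefix(cached, prompt)

# Whitespace split stands in for real tokenization here.
turn1 = "<|user|> hi <|assistant|>".split()
turn2 = "<|user|> hi <|assistant|> hello ! <|user|> why ? <|assistant|>".split()

print(prefill_cost([], turn1))     # prints 3 — full prefill on the first turn
print(prefill_cost(turn1, turn2))  # prints 6 — only the new suffix on turn 2
```

A real implementation also has to handle the overflow case — when the history exceeds the context window, quantcpp raises `ChatContextOverflow` as the API notes describe.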
79 changes: 43 additions & 36 deletions bindings/python/quantcpp/__init__.py
@@ -4,14 +4,20 @@
 Quick start:
 
     from quantcpp import Model
-    m = Model.from_pretrained("SmolLM2-1.7B")
+    m = Model.from_pretrained("Phi-3.5-mini")
     print(m.ask("What is gravity?"))
 
 Model selection guide:
-    SmolLM2-1.7B (1.7 GB, vocab 49K)  — recommended. ~12 tok/s on Apple M3.
-    Llama-3.2-1B (750 MB, vocab 128K) — smaller download but slower
+    Phi-3.5-mini (2.4 GB, vocab 32K)  — DEFAULT. 3.8B params with the
+                                        smallest lm_head in the registry,
+                                        producing the best speed/quality
+                                        combo. Coherent multi-paragraph
+                                        output even at Q4_K_M.
+    SmolLM2-1.7B (1.7 GB, vocab 49K)  — lightweight all-rounder. ~12 tok/s
+                                        on Apple M3, smaller download.
+    Llama-3.2-1B (750 MB, vocab 128K) — smallest download but slower
+                                        due to large vocab (~2 tok/s on M3).
     SmolLM2-135M (138 MB, vocab 49K)  — demo only, low quality output.
 
 Larger vocab = slower lm_head matmul → smaller params with smaller vocab
 often beats larger params with larger vocab. See docs/supported_models.md
@@ -65,47 +71,48 @@ class ChatContextOverflow(RuntimeError):
 # Verify both fields against the actual HuggingFace listing before
 # adding new entries — there is no integrity check at runtime.
 _MODEL_REGISTRY = {
-    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
-    # model is too small to produce coherent output for general chat.
-    # Listed only so users can verify the install/load path quickly.
-    "SmolLM2-135M": (
-        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
-        "smollm2-135m-instruct-q8_0.gguf",
-        135,
+    # ── DEFAULT ──
+    # Phi-3.5-mini-instruct (3.8B params, vocab 32K). Set as default on
+    # 2026-04-12 after end-to-end Phi-3 architecture support landed
+    # (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab is the
+    # smallest of the registry, which makes the lm_head matmul the
+    # fastest per-token. Combined with 3.8B params it produces the
+    # best quality-per-token of any model we ship.
+    "Phi-3.5-mini": (
+        "bartowski/Phi-3.5-mini-instruct-GGUF",
+        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
+        2400,
     ),
-    # Recommended default for first-time users on Apple Silicon / typical
-    # laptops. vocab 49K keeps the lm_head matmul small, so even on a
-    # mid-range M-series chip we measure ~12 tok/s — comfortable for
-    # interactive chat. Same llama arch family as SmolLM2-135M, so it
-    # exercises the most-tested code path.
+    # Lightweight all-rounder for users who want a smaller download
+    # than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
+    # on a mid-range M-series chip we measure ~12 tok/s — comfortable
+    # for interactive chat. Same llama arch family as SmolLM2-135M.
     "SmolLM2-1.7B": (
         "bartowski/SmolLM2-1.7B-Instruct-GGUF",
         "SmolLM2-1.7B-Instruct-Q8_0.gguf",
         1700,
     ),
-    "Qwen3.5-0.8B": (
-        "unsloth/Qwen3.5-0.8B-GGUF",
-        "Qwen3.5-0.8B-Q4_K_M.gguf",
-        508,
-    ),
-    # Smaller download than SmolLM2-1.7B but slower at inference time
-    # because of the 128K Llama-3 vocab (~5x slower lm_head matmul on M3).
-    # Kept in the registry for users who specifically want a Llama model.
+    # Smallest download in the "actually usable" tier. Slower at
+    # inference time because of the 128K Llama-3 vocab (~5x slower
+    # lm_head matmul on M3). Kept in the registry for users who
+    # specifically want a Llama model.
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
         "llama-3.2-1b-instruct-q4_k_m.gguf",
         750,
     ),
-    # Phi-3.5-mini-instruct (3.8B params, vocab 32K).
-    # Added 2026-04-12 after end-to-end Phi-3 architecture support
-    # landed (fused QKV / fused gate+up FFN / LongRoPE). The 32K vocab
-    # is the smallest of the registry, which makes the lm_head matmul
-    # the fastest per-token. Combined with 3.8B params it's the best
-    # quality-per-token model we ship.
-    "Phi-3.5-mini": (
-        "bartowski/Phi-3.5-mini-instruct-GGUF",
-        "Phi-3.5-mini-instruct-Q4_K_M.gguf",
-        2400,
+    "Qwen3.5-0.8B": (
+        "unsloth/Qwen3.5-0.8B-GGUF",
+        "Qwen3.5-0.8B-Q4_K_M.gguf",
+        508,
+    ),
+    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
+    # model is too small to produce coherent output for general chat.
+    # Listed only so users can verify the install/load path quickly.
+    "SmolLM2-135M": (
+        "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
+        "smollm2-135m-instruct-q8_0.gguf",
+        135,
     ),
 }

@@ -208,9 +215,9 @@ class Model:
 
     Examples
     --------
-    >>> m = Model.from_pretrained("SmolLM2-1.7B")
+    >>> m = Model.from_pretrained("Phi-3.5-mini")
     >>> m.ask("What is gravity?")
-    'Gravity is a force that attracts ...'
+    'Gravity is a fundamental force that attracts ...'
 
     >>> with Model("model.gguf") as m:
     ...     for tok in m.generate("Once upon a time"):
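Registry entries in the diff above are `(hf_repo, gguf_filename, approx_size_mb)` tuples. A small helper over a copied-out registry shows one way the size field can be put to work — `largest_model_under` is a hypothetical illustration, not a quantcpp API:

```python
# (hf_repo, gguf_filename, approx_size_mb) — values copied from the diff above.
REGISTRY = {
    "Phi-3.5-mini": ("bartowski/Phi-3.5-mini-instruct-GGUF",
                     "Phi-3.5-mini-instruct-Q4_K_M.gguf", 2400),
    "SmolLM2-1.7B": ("bartowski/SmolLM2-1.7B-Instruct-GGUF",
                     "SmolLM2-1.7B-Instruct-Q8_0.gguf", 1700),
    "Llama-3.2-1B": ("hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
                     "llama-3.2-1b-instruct-q4_k_m.gguf", 750),
    "Qwen3.5-0.8B": ("unsloth/Qwen3.5-0.8B-GGUF",
                     "Qwen3.5-0.8B-Q4_K_M.gguf", 508),
    "SmolLM2-135M": ("Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
                     "smollm2-135m-instruct-q8_0.gguf", 135),
}

def largest_model_under(budget_mb: int) -> str:
    """Pick the biggest registered model whose download fits the budget."""
    fits = {name: size for name, (_, _, size) in REGISTRY.items()
            if size <= budget_mb}
    if not fits:
        raise ValueError(f"no registered model fits in {budget_mb} MB")
    return max(fits, key=fits.get)

print(largest_model_under(1000))  # prints "Llama-3.2-1B"
```

Note this ranks by download size only; as the docstring's selection guide stresses, vocab size matters as much as parameter count for per-token speed.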
28 changes: 16 additions & 12 deletions bindings/python/quantcpp/cli.py
@@ -337,13 +337,16 @@ def cmd_client(args):
 
 
 def cmd_chat_default(args):
-    """Backwards-compatible default: auto-download SmolLM2-1.7B and chat.
-
-    Default switched from Llama-3.2-1B to SmolLM2-1.7B (2026-04-12) after
-    user feedback that Llama-3.2-1B's 128K vocab makes it ~5x slower at
-    interactive chat than SmolLM2-1.7B's 49K vocab on Apple Silicon.
+    """Backwards-compatible default: auto-download Phi-3.5-mini and chat.
+
+    Default progression:
+        Llama-3.2-1B → SmolLM2-1.7B (2026-04-12, vocab fix)
+                     → Phi-3.5-mini (2026-04-12, after Phi-3 arch support
+                       landed). Phi-3.5-mini has the smallest vocab in
+                       the registry (32K) AND 3.8B params, giving the
+                       best speed/quality combo we ship.
     """
-    args.model = args.model or "SmolLM2-1.7B"
+    args.model = args.model or "Phi-3.5-mini"
     args.threads = getattr(args, "threads", 4)
     args.max_tokens = getattr(args, "max_tokens", 256)
    args.temperature = getattr(args, "temperature", 0.7)
@@ -367,19 +370,20 @@ def main():
       client PROMPT     Send a request to a running serve (default: SSE streaming)
 
 examples:
-  quantcpp pull smollm2                  # recommended: small vocab → fast
+  quantcpp pull phi-3.5-mini             # recommended default (32K vocab → fast)
   quantcpp list
-  quantcpp run smollm2
-  quantcpp run smollm2 "What is gravity?"
-  quantcpp serve smollm2 --port 8080
+  quantcpp run phi-3.5-mini
+  quantcpp run phi-3.5-mini "What is gravity?"
+  quantcpp serve phi-3.5-mini --port 8080
   quantcpp client "What is gravity?"     # streams from :8080
   quantcpp client "Hi" --url http://localhost:8081
   quantcpp client "Hi" --no-stream       # single JSON response
 
 backwards-compat (no subcommand):
-  quantcpp                               # default chat with SmolLM2-1.7B
+  quantcpp                               # default chat with Phi-3.5-mini
   quantcpp "What is gravity?"            # one-shot
-  quantcpp --model llama3.2:1b           # different model
+  quantcpp --model smollm2               # lightweight alternative
+  quantcpp --model llama3.2:1b           # smallest download
 """,
     )