
Commit 7ea9394

Merge pull request #9: changed model to q4 version from XyLearningProgramming/feature/model-update
changed model to q4 version
2 parents fa67d59 + f6a6011 commit 7ea9394

10 files changed: 344 additions & 15 deletions


README.md

Lines changed: 45 additions & 0 deletions
````diff
@@ -62,6 +62,51 @@ All observability components are configurable and enabled by default:
 - **Prometheus Metrics** - Available at `/metrics` (latency, throughput, token rates, memory usage)
 - **OpenTelemetry Tracing** - Distributed tracing with request flow visualization
 
+## Model Choice
+
+Default model: **Qwen3-0.6B-Q4_K_M** (484 MB) from [`second-state/Qwen3-0.6B-GGUF`](https://huggingface.co/second-state/Qwen3-0.6B-GGUF).
+
+Previously the default was Qwen3-0.6B-Q8_0 (805 MB) from the [official Qwen repo](https://huggingface.co/Qwen/Qwen3-0.6B-GGUF). The switch to Q4_K_M better fits deployment on resource-constrained VPS nodes (1 CPU / 1 GB RAM each).
+
+### Why Qwen3-0.6B
+
+0.6B parameters is the largest Qwen3 tier that fits on a 1 GB node. The next step up (Qwen3-1.7B) needs roughly 1 GB for model weights alone even at aggressive quantization, leaving nothing for the OS, kubelet, or KV cache.
+
+### Why Q4_K_M over Q8_0
+
+| | Q8_0 | Q4_K_M |
+|---|---|---|
+| File size | 805 MB | 484 MB |
+| Est. RAM (with `use_mlock`, 4096 ctx) | ~750 MB | ~550 MB |
+| Quality vs F16 | ~99.9% | ~99% |
+| Inference speed (CPU) | Slower (more data through cache) | **~40-50% faster** |
+
+For a 0.6B model the quality bottleneck is parameter count, not quantization precision; the difference between Q4 and Q8 is negligible in practice. Q4_K_M ("K_M" = mixed precision, with important layers kept at higher precision) is the community-recommended sweet spot for balanced quality and performance.
+
+The RAM savings (~200 MB) are significant on a 1 GB node: the pod's memory request drops from ~750 Mi to ~600 Mi, leaving headroom for the OS and co-located workloads.
+
+### Resource estimates
+
+Current Helm resource settings (`deploy/helm/values.yaml`):
+
+| Setting | Value | Rationale |
+|---|---|---|
+| Memory request | 600 Mi | Steady-state with the model locked in RAM via `use_mlock` |
+| Memory limit | 700 Mi | ~100 Mi headroom over steady-state |
+| CPU request | 200m | Meaningful reservation for inference on a 1-core VPS |
+| CPU limit | 1 | Matches the physical core count |
+
+### Switching models
+
+To use a different quantization, update `scripts/download.sh` and set `SLM_MODEL_PATH`:
+
+```bash
+# In .env or as an environment variable
+SLM_MODEL_PATH=/app/models/Qwen3-0.6B-Q8_0.gguf
+```
+
+Available quantizations at [`second-state/Qwen3-0.6B-GGUF`](https://huggingface.co/second-state/Qwen3-0.6B-GGUF): Q2_K (347 MB) through F16 (1.51 GB).
+
 ## Configuration
 
 Configure via environment variables (prefix: `SLM_`) or `.env` file. See [`./slm_server/config.py`](./slm_server/config.py) for all options.
````
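As a sanity check on the RAM column above: resident memory is roughly the weight file (pinned via `use_mlock`) plus the KV cache, which grows linearly with context length. Below is a minimal sketch of that arithmetic; the hyperparameters passed in are illustrative placeholders, not values read from the GGUF metadata, and the table's figures are measured estimates rather than outputs of this formula.

```python
# Rough llama.cpp memory model: weights + KV cache (+ runtime overhead).
# The layer count, KV-head count, and head dim below are placeholders;
# read the real values from the GGUF header before trusting the result.

def kv_cache_bytes(n_layers: int, n_ctx: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors: one slot per layer per token, fp16 by default."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem

for n_ctx in (1024, 2048, 4096):
    kv_mb = kv_cache_bytes(n_layers=28, n_ctx=n_ctx, n_kv_heads=8,
                           head_dim=128) / 1e6  # placeholder dims
    print(f"n_ctx={n_ctx}: KV cache ~{kv_mb:.0f} MB on top of the weights")
```

The takeaway is that `SLM_N_CTX` is a memory knob as much as a quality knob: halving the context roughly halves the KV cache.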

deploy/helm/values.yaml

Lines changed: 7 additions & 5 deletions
```diff
@@ -62,7 +62,7 @@ autoscaling:
 # Example configuration for SLM server settings
 env: {}
   # Application settings
-  # SLM_MODEL_PATH: "/app/models/Qwen3-0.6B-Q8_0.gguf"
+  # SLM_MODEL_PATH: "/app/models/Qwen3-0.6B-Q4_K_M.gguf"
   # SLM_N_CTX: "4096"
   # SLM_N_THREADS: "2"
   # SLM_SEED: "42"
@@ -79,13 +79,15 @@ env: {}
 
 # Resource requests and limits for the container.
 # See https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/
+# Tuned for Qwen3-0.6B-Q4_K_M (484 MB) on 1-CPU / 1 GB VPS nodes.
+# Previous values for Q8_0 (805 MB): limits cpu=3/mem=800Mi, requests cpu=50m/mem=32Mi
 resources:
   limits:
-    cpu: 3
-    memory: 800Mi
+    cpu: 1
+    memory: 700Mi
   requests:
-    cpu: 50m
-    memory: 32Mi
+    cpu: 200m
+    memory: 600Mi
 
 # Readiness and liveness probes configuration
 probes:
```

scripts/download.sh

Lines changed: 7 additions & 3 deletions
```diff
@@ -5,7 +5,11 @@ set -ex
 # Get the absolute path of the directory where the script is located
 SCRIPT_DIR=$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" &> /dev/null && pwd)
 
-REPO_URL="https://huggingface.co/Qwen/Qwen3-0.6B-GGUF"
+# Original (official Qwen repo, Q8_0 only):
+#   https://huggingface.co/Qwen/Qwen3-0.6B-GGUF -> Qwen3-0.6B-Q8_0.gguf
+# Switched to second-state community repo for Q4_K_M quantization.
+# See README.md "Model Choice" section for rationale.
+REPO_URL="https://huggingface.co/second-state/Qwen3-0.6B-GGUF"
 # Set model directory relative to the script's location
 MODEL_DIR="$SCRIPT_DIR/../models"
 
@@ -14,8 +18,8 @@ mkdir -p "$MODEL_DIR"
 
 # --- Files to download ---
 FILES_TO_DOWNLOAD=(
-    "Qwen3-0.6B-Q8_0.gguf"
-    # "params"
+    "Qwen3-0.6B-Q4_K_M.gguf"
+    # Previous default: "Qwen3-0.6B-Q8_0.gguf" (805 MB, from Qwen/Qwen3-0.6B-GGUF)
 )
 
 echo "Downloading Qwen3-0.6B-GGUF model and params files..."
```

slm_server/app.py

Lines changed: 26 additions & 0 deletions
```diff
@@ -2,6 +2,7 @@
 import json
 import traceback
 from http import HTTPStatus
+from pathlib import Path
 from typing import Annotated, AsyncGenerator, Generator, Literal
 
 from fastapi import Depends, FastAPI, HTTPException
@@ -14,6 +15,8 @@
 from slm_server.model import (
     ChatCompletionRequest,
     EmbeddingRequest,
+    ModelInfo,
+    ModelListResponse,
 )
 from slm_server.trace import setup_tracing
 from slm_server.utils import (
@@ -189,6 +192,29 @@ async def create_embeddings(
     return embedding_result
 
 
+@app.get("/api/v1/models", response_model=ModelListResponse)
+async def list_models(
+    settings: Annotated[Settings, Depends(get_settings)],
+) -> ModelListResponse:
+    """List available models (OpenAI-compatible). Returns the single loaded model."""
+    model_id = Path(settings.model_path).stem
+    try:
+        created = int(Path(settings.model_path).stat().st_mtime)
+    except (OSError, ValueError):
+        created = 0
+    return ModelListResponse(
+        object="list",
+        data=[
+            ModelInfo(
+                id=model_id,
+                object="model",
+                created=created,
+                owned_by=settings.model_owner,
+            )
+        ],
+    )
+
+
 @app.get("/health")
 async def health():
     return "ok"
```

slm_server/config.py

Lines changed: 6 additions & 1 deletion
```diff
@@ -13,7 +13,8 @@
 DOTENV_PATH = PROJECT_ROOT / ".env"
 
 
-MODEL_PATH_DEFAULT = str(MODELS_DIR / "Qwen3-0.6B-Q8_0.gguf")
+MODEL_PATH_DEFAULT = str(MODELS_DIR / "Qwen3-0.6B-Q4_K_M.gguf")
+MODEL_OWNER_DEFAULT = "second-state"
 
 
 class LoggingSettings(BaseModel):
@@ -56,6 +57,10 @@ class Settings(BaseSettings):
     )
 
     model_path: str = Field(MODEL_PATH_DEFAULT, description="Model path for llama_cpp.")
+    model_owner: str = Field(
+        MODEL_OWNER_DEFAULT,
+        description="Owner label for /models list. Set SLM_MODEL_OWNER to override.",
+    )
     n_ctx: int = Field(
         4096, description="Maximum context window (input + generated tokens)."
     )
```
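As the field description says, the new default can be overridden through the environment. A minimal sketch, assuming `Settings` picks up the documented `SLM_` prefix and that the package is importable from the working directory; the values are placeholders:

```python
# Override model_owner (and model_path) via environment variables before
# Settings is instantiated; values here are placeholders.
import os

os.environ["SLM_MODEL_OWNER"] = "my-org"
os.environ["SLM_MODEL_PATH"] = "/app/models/Qwen3-0.6B-Q8_0.gguf"

from slm_server.config import Settings

settings = Settings()
assert settings.model_owner == "my-org"
```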

slm_server/model.py

Lines changed: 17 additions & 0 deletions
```diff
@@ -88,3 +88,20 @@ class EmbeddingRequest(BaseModel):
     model: str | None = Field(
         default=None, description="Model name, not important for our server"
     )
+
+
+# OpenAI-compatible list models API
+class ModelInfo(BaseModel):
+    """Single model entry for GET /api/v1/models."""
+
+    id: str = Field(description="Model identifier for use in API endpoints")
+    object: str = Field(default="model", description="Object type")
+    created: int = Field(description="Unix timestamp when the model was created")
+    owned_by: str = Field(description="Organization that owns the model")
+
+
+class ModelListResponse(BaseModel):
+    """Response for GET /api/v1/models."""
+
+    object: str = Field(default="list", description="Object type")
+    data: list[ModelInfo] = Field(description="List of available models")
```
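For reference, the wire format these two models serialize to, sketched with placeholder values (pydantic v2's `model_dump_json` is assumed, consistent with the `str | None` syntax above):

```python
# Build and print the OpenAI-style /models payload by hand;
# id/created/owned_by are placeholders.
from slm_server.model import ModelInfo, ModelListResponse

payload = ModelListResponse(
    data=[ModelInfo(id="Qwen3-0.6B-Q4_K_M", created=0, owned_by="second-state")]
)
print(payload.model_dump_json(indent=2))
# {"object": "list", "data": [{"id": "Qwen3-0.6B-Q4_K_M",
#  "object": "model", "created": 0, "owned_by": "second-state"}]}
```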

slm_server/trace.py

Lines changed: 9 additions & 0 deletions
```diff
@@ -1,4 +1,5 @@
 import base64
+import logging
 
 from fastapi import FastAPI
 from opentelemetry import trace
@@ -11,12 +12,20 @@
 
 from slm_server.config import TraceSettings
 
+logger = logging.getLogger(__name__)
+
 
 def setup_tracing(app: FastAPI, settings: TraceSettings) -> None:
     """Initialize OpenTelemetry tracing with optional Grafana Tempo export."""
     if not settings.enabled:
         return
 
+    if not settings.endpoint or not settings.username or not settings.password:
+        logger.warning(
+            "Grafana Tempo endpoint or credentials not configured, skipping tracing"
+        )
+        return
+
     # Define your service name in a Resource
     resource = Resource.create(
         attributes={
```
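The guard's effect, sketched below; this assumes `TraceSettings` leaves `endpoint`, `username`, and `password` unset by default, which the diff does not show:

```python
# With tracing enabled but no Tempo endpoint/credentials configured,
# setup_tracing now logs a warning and returns early instead of wiring
# up an exporter that would fail at runtime.
import logging

from fastapi import FastAPI
from slm_server.config import TraceSettings
from slm_server.trace import setup_tracing

logging.basicConfig(level=logging.WARNING)
app = FastAPI()
setup_tracing(app, TraceSettings(enabled=True))  # warns and no-ops
```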

tests/test_app.py

Lines changed: 69 additions & 6 deletions
```diff
@@ -8,7 +8,8 @@
 from opentelemetry.sdk.trace.export.in_memory_span_exporter import InMemorySpanExporter
 from opentelemetry.trace import set_tracer_provider
 
-from slm_server.app import DETAIL_SEM_TIMEOUT, app, get_llm
+from slm_server.app import DETAIL_SEM_TIMEOUT, app, get_llm, get_settings
+from slm_server.config import Settings
 
 # Create a mock Llama instance
 mock_llama = MagicMock()
@@ -266,11 +267,10 @@ def test_metrics_endpoint_integration():
     assert "python_info" in content
     assert "process_virtual_memory_bytes" in content
 
-    # Verify custom SLM metrics are present (even if empty)
-    assert "slm_completion_duration_seconds" in content
-    assert "slm_tokens_total" in content
-    assert "slm_completion_tokens_per_second" in content
-    assert "slm_first_token_delay_ms" in content
+    # NOTE: SLM-specific metrics (slm_completion_duration_seconds, slm_tokens_total,
+    # etc.) are only registered when tracing is fully configured with endpoint and
+    # credentials. In the test environment tracing is not configured, so these
+    # metrics are not expected here. They are tested via test_trace.py.
 
 
 def test_streaming_call_with_tracing_integration():
@@ -733,3 +733,66 @@ def test_request_validation_and_defaults():
     assert call_args[1]["stream"] is False  # Default value
 
 
+def test_list_models_structure():
+    """GET /api/v1/models returns OpenAI-compatible list with one model."""
+    response = client.get("/api/v1/models")
+    assert response.status_code == 200
+    data = response.json()
+    assert data["object"] == "list"
+    assert isinstance(data["data"], list)
+    assert len(data["data"]) == 1
+    model = data["data"][0]
+    assert model["object"] == "model"
+    assert "id" in model and isinstance(model["id"], str)
+    assert "created" in model and isinstance(model["created"], int)
+    assert model["owned_by"] == "second-state"
+
+
+def test_list_models_with_overridden_settings():
+    """GET /api/v1/models uses model_path and model_owner from settings."""
+    settings = Settings(
+        model_path="/tmp/SomeModel.gguf",
+        model_owner="custom-org",
+    )
+
+    def override_settings():
+        return settings
+
+    app.dependency_overrides[get_settings] = override_settings
+    try:
+        response = client.get("/api/v1/models")
+        assert response.status_code == 200
+        data = response.json()
+        assert data["object"] == "list"
+        assert len(data["data"]) == 1
+        model = data["data"][0]
+        assert model["id"] == "SomeModel"
+        assert model["object"] == "model"
+        assert model["owned_by"] == "custom-org"
+        assert model["created"] == 0  # file does not exist
+    finally:
+        app.dependency_overrides.pop(get_settings, None)
+
+
+def test_list_models_created_from_existing_file(tmp_path):
+    """GET /api/v1/models returns file mtime as created when model file exists."""
+    model_file = tmp_path / "RealModel.gguf"
+    model_file.write_bytes(b"\x00")
+
+    settings = Settings(model_path=str(model_file))
+
+    def override_settings():
+        return settings
+
+    app.dependency_overrides[get_settings] = override_settings
+    try:
+        response = client.get("/api/v1/models")
+        assert response.status_code == 200
+        model = response.json()["data"][0]
+        assert model["id"] == "RealModel"
+        assert model["created"] > 0
+        assert model["created"] == int(model_file.stat().st_mtime)
+    finally:
+        app.dependency_overrides.pop(get_settings, None)
```

tests/test_metrics.py

Lines changed: 37 additions & 0 deletions
```diff
@@ -0,0 +1,37 @@
+from unittest.mock import MagicMock, patch
+
+from fastapi import FastAPI
+from fastapi.testclient import TestClient
+
+from slm_server.config import MetricsSettings
+from slm_server.metrics import setup_metrics
+
+
+def test_setup_metrics_disabled():
+    """When metrics are disabled, no /metrics endpoint is added."""
+    app = FastAPI()
+    setup_metrics(app, MetricsSettings(enabled=False))
+    client = TestClient(app)
+
+    response = client.get("/metrics")
+    assert response.status_code == 404
+
+
+def test_setup_metrics_enabled_does_not_raise():
+    """When metrics are enabled, setup_metrics instruments the app without error."""
+    app = FastAPI()
+    with (
+        patch("slm_server.metrics.Instrumentator") as mock_inst,
+        patch("slm_server.metrics.system_cpu_usage", return_value=lambda info: None),
+        patch("slm_server.metrics.system_memory_usage", return_value=lambda info: None),
+    ):
+        mock_instance = MagicMock()
+        mock_inst.return_value = mock_instance
+        mock_instance.instrument.return_value = mock_instance
+
+        setup_metrics(app, MetricsSettings(enabled=True, endpoint="/metrics"))
+
+        mock_inst.assert_called_once()
+        mock_instance.add.assert_called()
+        mock_instance.instrument.assert_called_once_with(app)
+        mock_instance.expose.assert_called_once_with(app, endpoint="/metrics")
```
