Description
Qwen3.5-4B (DeltaNet hybrid) loads successfully and produces coherent output via quant_generate, but quant_ask returns empty/garbage tokens. This breaks the server API since quant-server uses quant_ask for non-streaming requests.
Evidence
quant_generate (CLI) — WORKS
$ ./qwen35_test Qwen3.5-4B-Q4_K_M.gguf "Hello, who are you?"
--- response ---
I am Qwen3.5, a large language model developed by Alibaba Cloud.
I can help you with various tasks such as answering questions,
solving problems, and generating content.
quant_ask (server) — BROKEN
$ curl localhost:8080/v1/chat/completions \
-d '{"messages":[{"role":"user","content":"What is gravity?"}]}'
{"content": " -\n.\n- \n1. - "} # empty/whitespace tokens
Streaming via server — ALSO BROKEN
data: {"delta":{"content":"!"}}
data: {"delta":{"content":"!"}}
data: {"delta":{"content":"!"}}
# Repeating "!" tokens
Root Cause Hypothesis
quant_ask and quant_generate may handle the prompt/tokenization differently. Possible causes:
- quant_ask may apply its own chat template that conflicts with the ChatML template already in the prompt
- quant_ask may not properly initialize DeltaNet state (conv buffer, delta state) between calls
- BOS token handling may differ between the two paths
Steps to Reproduce
// This works:
quant_generate(ctx, "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n", cb, NULL);
// This produces garbage:
char* result = quant_ask(ctx, "<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n");
Environment
- quant.cpp: latest main (1e1ea2c)
- Model: unsloth/Qwen3.5-4B-GGUF (Q4_K_M, 2.6GB)
- Architecture: qwen35 (DeltaNet hybrid, 8 attn + 24 DeltaNet layers)
- OS: macOS 15 (Apple M3, 16GB)
Reported by ClawTeam Claw-4 (Optimizer) + Claw-5 (Researcher)