rocm: fix gfx1151 correctness (precise expert-selection math), enable indexer, enable agent#311
rocm: fix gfx1151 correctness (precise expert-selection math), enable indexer, enable agent#311alantsev wants to merge 197 commits into
Conversation
Implements the Responses API endpoint that Codex CLI (and other modern OpenAI tooling) speaks instead of /v1/chat/completions. The wire format is documented in OpenAI's Responses API; this implementation has been iterated against the Codex CLI binary's SSE parser shape until no remaining schema gaps were found. Request parsing (parse_responses_request, parse_responses_input): - Accepts the typed input array (message, function_call, function_call_output, reasoning, custom_tool_call(_output), local_shell_call(_output), web_search_call(_output), tool_search_call(_output), image_generation_call(_output), compaction, context_compaction). - Maps hosted-tool history to function_call/function_call_output so prior actions survive across turns; rejects unknown item types and non-completed status with 400 to avoid silent context loss. - Strict content-array parsing: only string|null|array of recognized text blocks (input_text/output_text/text/summary_text/ reasoning_text); rejects non-text modalities (input_image/file/ audio) instead of accepting an empty prompt. - Merges adjacent function_call items into the preceding assistant message so text + tool-call turns render as a single assistant block. - Honors reasoning.effort (incl. "minimal"/"none") and gates reasoning summary surface on reasoning.summary opt-in. - Rejects previous_response_id, conversation, and forced tool_choice explicitly (constrained decoding / persisted state not supported). Output (responses_sse_*, responses_final_response): - Emits the full streaming lifecycle: response.created, output_item.added/.done, reasoning_summary_part.added/.done, reasoning_summary_text.delta/.done, content_part.added/.done, output_text.delta/.done, function_call_arguments.delta/.done, response.completed. - Branches the terminal event by finish reason: response.failed for errors and response.incomplete with reason "max_tokens" for length. - Every event carries sequence_number; every output_text part carries annotations:[]; function_call output_item.added ships with an empty arguments string (full args arrive via function_call_arguments.done and output_item.done), and item ids are stable across added/done. - Tracks whether </think> was actually observed so a truncated stream marks the reasoning item incomplete instead of "completed". - Recovers gracefully when the DSML tool parse fails after the model was suppressed at the tool marker: the suppressed tail is flushed as additional output_text deltas so the streamed message matches output_item.done. Tested by 25 rounds of /codex:adversarial-review against the same client this is meant to feed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Broaden the DS4 imatrix prompt dataset with provider-neutral agent/tool traffic, multi-language programming prompts, algorithm recall, Bash scripting, and multilingual translation tasks. Remove duplicate rendered prompts and avoid provider-specific client references in the generated calibration corpus. This improves calibration coverage without claiming to fix a distributed GGUF bug.
Fold the successful CUDA selector/top-k/indexed-attention changes into one clean commit. This excludes rejected experiment commits and the local prefill-slope work log.\n\nMeasured on GB10 with speed-bench/promessi_sposi.txt, 2048-token append chunks: 32K prefill improved from 255.61 tok/s on origin/main to 346.49 tok/s. Full-curve average improved from 316.39 tok/s to 369.76 tok/s. 32K full prompt + 128-token generation prefill improved from 312.87 tok/s to 368.43 tok/s, while generation stayed neutral at 12.49 -> 12.48 tok/s.\n\nCorrectness: make cuda-regression; ./ds4_test --logprob-vectors --tool-call-quality; ./ds4_test --server --metal-kernels.
Build score_official against the CUDA runtime on Linux and select the CUDA backend there, while keeping the existing Metal path on macOS.\n\nCorrectness: make -C gguf-tools quality-score; gguf-tools/quality-testing/score_official ds4flash.gguf /tmp/ds4_quality_smoke/manifest.tsv /tmp/ds4_quality_smoke/scores.tsv 16384.
Replace the default long-context continuation check with a deterministic prose-story retrieval test. The fixture embeds spelled-out person-number assignments in a long rendered prompt, and ds4_test now validates the generated Name=number list instead of brittle sampled prose.
Preserve Responses namespace metadata and tool_search calls while rendering DSML-safe internal tool names. Replay function_call, hosted tool, and tool_search_output items into the shared chat/tool path so Codex and Pi can round-trip tool calls without losing KV-cache prefix reuse. Document the /v1/responses endpoint and add server unit coverage for namespace, tool_search, and replay output shapes.
This reverts commit 2a7a5f3. There was no ack from the user. Don't want to take a fix that is astronautically produced from an unclear error trace.
Project sampled DSML tool calls to Anthropic SSE tool_use blocks while keeping raw DSML as the parser/cache source of truth. Reuse streamed tool ids for final parsed calls so tool_result continuation still matches live state.
Keep normal CUDA context buffers on device allocations, but route very large KV-cache tensors through managed memory so million-token contexts do not starve unified-memory systems during graph/session allocation. The fallback is scoped to the long-lived KV/cache tensors and logs when it is used because it may reduce performance. Tested on 0.180 with: - make cpu - make -B cuda-spark - make cuda-regression - ./ds4_test --server --metal-kernels - ./ds4_test --logprob-vectors --tool-call-quality - ds4-bench ctx-alloc 32768, 250000, and 1000000 - ds4-server --ctx 1000000 startup smoke (cherry picked from commit 0b248a65c07d21f2fc8ff4815bd8b75af26719f9)
Parse Anthropic tool_use blocks by their own type field instead of relying on the enclosing message role being parsed first. Some clients serialize messages as content-before-role, which made full-history tool_result replays look like unknown live-only continuations. Fixes antirez#127.
Return a 400 error with error type "context_exceeded" when prompt tokens exceed
context size. The response includes both n_prompt_tokens and n_ctx fields so
clients can determine exactly why the request failed and how far over the limit
they went.
Error response format:
{
"error": {
"message": "Prompt tokens (N) exceeds context size (M)",
"type": "context_exceeded",
"n_prompt_tokens": N,
"n_ctx": M
}
}
dwarfstar is typoed to drawfstar
fix typo in readme
Add coordinator/worker distributed layer execution, pipelined prefill, worker routing, telemetry, activation transport width, and KV mismatch recovery for DeepSeek Flash/Pro.
DGX Spark (GB10, sm_121, 121 GiB UMA, driver 580+) sits in an unusual
spot for CUDA inference: ATS (Address Translation Service) lets the
GPU consume host-mmap'd weights directly, but at significantly lower
effective bandwidth than HBM-resident copies. For an 80 GB IQ2XXS
DeepSeek V4 Flash checkpoint, the difference is the model running
versus the model being usable.
This commit adds:
- Startup HBM cache that copies hot tensor spans (attn projections,
MoE shared experts, output projection) into device memory at engine
init, capped by a configurable budget (defaults sized to leave
headroom for KV cache and a second model load). Cold MoE routed
experts stay ATS-mapped.
- Factored the cudaMalloc+memcpy populate path into a helper and
reordered cuda_model_range_ptr so the HBM-resident lookup is a
single hash-keyed read that wins over the UVA-mapped pointer on
the hot decode path.
- GPU argmax kernel; the prior fallback misused indexer scoring as
an argmax which double-paid the dispatcher cost on N=1 decode.
- Pair-fused Q_A + KV_A matmuls in qkv_rms_fused decode path
(one shared weight load per row, two outputs).
- Parallelized matmul_q8_0_hc_expand epilogue across n_hc lanes
(n_hc parallel residual loads + writes vs n_hc^2 serial reads).
- HBM cache also populated for the MTP support model.
- Drop `cudaHostRegisterReadOnly` flag — unsupported on GB10.
- Drop `!mtp_ready` gate from accelerator_cache_model_tensors so
the MTP support model gets the same HBM-cache treatment.
Bench (DGX Spark / GB10, ds4flash, n=256, "knight" prompt, 3-run mean):
Plain decode before: ~13.9 t/s (ATS-mapped weights, all paths)
Plain decode after: ~16.13 t/s (HBM-resident hot spans + small-N kernel fuses)
Adds `speed-bench/gb10.csv` per CONTRIBUTING.md convention so the
2048..65536 sweep is preserved alongside the existing m2_ultra.csv
and m4_max.csv. Generated via:
./ds4-bench -m ds4flash.gguf \
--prompt-file speed-bench/promessi_sposi.txt \
--ctx-start 2048 --ctx-max 65536 --step-incr 2048 \
--gen-tokens 128 --csv speed-bench/gb10.csv
Hardware: NVIDIA DGX Spark (GB10 / sm_121), driver 580.142, CUDA 13.0
Model: DeepSeek-V4-Flash-IQ2XXS-w2Q2K-AProjQ8-SExpQ8-OutQ8-chat-v2-imatrix.gguf
# Conflicts: # Makefile
(cherry picked from commit e00ad3085c8edbd6c98a50ba4ad49a66c2b23984)
(cherry picked from commit 0b3efaf86f61421330e90629508adbd6228b4a8b)
# Conflicts: # ds4_cuda.cu
Add shared help text across the CLI, server, agent, bench, and eval tools. Expand distributed-mode guidance, clean up endpoint naming, and use a TTY-only 256-color layout with clearer section titles, option arguments, separators, examples, and explanatory text.
CPU support for Q4_K routed experts (fixes antirez#171)
- enable indexer - use precise math functions on the expert selection path - enable agent for the rocm build the test run ``` $ ./ds4_test long-context: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.451s ds4: cuda backend initialized for graph diagnostics ds4-test: long-context prefill 0/30474 ds4-test: long-context prefill 8192/30474 ds4-test: long-context prefill 16384/30474 ds4-test: long-context prefill 24576/30474 ds4-test: long-context prefill 30474/30474 long-context: OK tool-call-quality: ds4-test: tool-call quality fast path ds4-test: tool-call quality exact path ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.000s ds4: cuda backend initialized for graph diagnostics tool-call-quality: OK logprob-vectors: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s ds4: cuda backend initialized for graph diagnostics ds4-test: vector short_italian_fact ds4-test: vector short_code_completion ds4-test: vector short_reasoning_plain ds4-test: vector long_memory_archive skipped (API/official graph mismatch) ds4-test: vector long_code_audit logprob-vectors: OK local-golden-vectors: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=4/5 top20_overlap=17/20 top64_overlap=55/64 top20_max_abs=1.67544 local-golden-vectors: OK metal-short-prefill: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics metal-short-prefill: OK metal-kernels: ds4: CUDA registered 0.00 GiB model mapping for device access ds4: CUDA registered 0.00 GiB model mapping for device access ds4: CUDA registered 0.00 GiB model mapping for device access metal-kernels: OK metal-tensor-equivalence: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence candidate route=auto ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.0155 cand=-15.0155 abs=0 id=1 ref=19.9358 cand=19.9358 abs=0 id=2 ref=-55.9084 cand=-55.9084 abs=0 id=3 ref=17.8982 cand=17.8982 abs=0 id=4 ref=26. 0747 cand=26.0747 abs=0 ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-2.66161 cand=-2.66161 abs=0 id=1 ref=21.3162 cand=21.3162 abs=0 id=2 ref=-45.7824 cand=-45.7824 abs=0 id=3 ref=10.9651 cand=10.9651 abs=0 id=4 ref= 25.8229 cand=25.8229 abs=0 ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-3.03494 cand=-3.03494 abs=0 id=1 ref=23.3849 cand=23.3849 abs=0 id=2 ref=-42.7991 cand=-42.7991 abs=0 id=3 ref=16.0927 cand=16.0927 abs=0 id=4 ref= 18.5051 cand=18.5051 abs=0 ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=4/5 overlap=18/20 max_rank_delta=4 rms=0.582494 max_abs=3.37462 top20_max_abs=1.52215 ds4-test: Tensor equivalence long_memory_archive largest deltas: id=126759 ref=-20.0995 cand=-23.4742 abs=3.37462 id=25690 ref=-20.7083 cand=-17.7369 abs=2.97143 id=61664 ref=-2.01424 cand=0.917232 abs=2.93147 id=78827 ref=- 22.8887 cand=-25.7399 abs=2.8512 id=71753 ref=4.75498 cand=7.58581 abs=2.83083 ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=18/20 max_rank_delta=3 rms=0.433541 max_abs=2.06283 top20_max_abs=1.08788 ds4-test: Tensor equivalence long_code_audit largest deltas: id=113847 ref=-11.6684 cand=-13.7312 abs=2.06283 id=123327 ref=-16.5266 cand=-18.47 abs=1.94345 id=113254 ref=-2.13925 cand=-4.08151 abs=1.94226 id=103699 ref=-0.1 64321 cand=-2.08601 abs=1.92169 id=56957 ref=-6.75335 cand=-4.83732 abs=1.91603 ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=4/5 min_overlap=18/20 worst_rank_delta=4 worst_rms=0.582494 worst_max_abs=3.37462 worst_top20_max_abs=1. 52215 metal-tensor-equivalence: OK server: server: OK ds4 tests: ok ``` the determenistic test run (with DS4_CUDA_MOE_NO_ATOMIC_DOWN=1) ``` $ DS4_CUDA_MOE_NO_ATOMIC_DOWN=1 ./ds4_test long-context: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.315s ds4: cuda backend initialized for graph diagnostics ds4-test: long-context prefill 0/30474 ds4-test: long-context prefill 8192/30474 ds4-test: long-context prefill 16384/30474 ^[^[ds4-test: long-context prefill 24576/30474 ds4-test: long-context prefill 30474/30474 long-context: OK tool-call-quality: ds4-test: tool-call quality fast path ds4-test: tool-call quality exact path ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.000s ds4: cuda backend initialized for graph diagnostics tool-call-quality: OK logprob-vectors: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics ds4-test: vector short_italian_fact ds4-test: vector short_code_completion ds4-test: vector short_reasoning_plain ds4-test: vector long_memory_archive skipped (API/official graph mismatch) ds4-test: vector long_code_audit logprob-vectors: OK local-golden-vectors: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics ds4-test: local golden long_story_4096 top1 ref=4371 cand=4371 top5_overlap=4/5 top20_overlap=16/20 top64_overlap=52/64 top20_max_abs=2.3658 local-golden-vectors: OK metal-short-prefill: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.345s ds4: cuda backend initialized for graph diagnostics metal-short-prefill: OK metal-kernels: ds4: CUDA registered 0.00 GiB model mapping for device access ds4: CUDA registered 0.00 GiB model mapping for device access ds4: CUDA registered 0.00 GiB model mapping for device access metal-kernels: OK metal-tensor-equivalence: ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence candidate route=auto ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.346s ds4: cuda backend initialized for graph diagnostics ds4-test: Tensor equivalence short_italian_fact top1 ref=108149 cand=108149 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_italian_fact largest deltas: id=0 ref=-15.0155 cand=-15.0155 abs=0 id=1 ref=19.9358 cand=19.9358 abs=0 id=2 ref=-55.9084 cand=-55.9084 abs=0 id=3 ref=17.8982 ca nd=17.8982 abs=0 id=4 ref=26.0747 cand=26.0747 abs=0 ds4-test: Tensor equivalence short_code_completion top1 ref=9854 cand=9854 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_code_completion largest deltas: id=0 ref=-2.66161 cand=-2.66161 abs=0 id=1 ref=21.3162 cand=21.3162 abs=0 id=2 ref=-45.7824 cand=-45.7824 abs=0 id=3 ref=10.9651 cand=10.9651 abs=0 id=4 ref=25.8229 cand=25.8229 abs=0 ds4-test: Tensor equivalence short_reasoning_plain top1 ref=926 cand=926 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence short_reasoning_plain largest deltas: id=0 ref=-3.03494 cand=-3.03494 abs=0 id=1 ref=23.3849 cand=23.3849 abs=0 id=2 ref=-42.7991 cand=-42.7991 abs=0 id=3 ref=16.0927 cand=16.0927 abs=0 id=4 ref=18.5051 cand=18.5051 abs=0 ds4-test: Tensor equivalence long_memory_archive top1 ref=32111 cand=32111 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence long_memory_archive largest deltas: id=0 ref=-8.42831 cand=-8.42831 abs=0 id=1 ref=19.284 cand=19.284 abs=0 id=2 ref=-50.653 cand=-50.653 abs=0 id=3 ref=10.6968 cand= 10.6968 abs=0 id=4 ref=21.0302 cand=21.0302 abs=0 ds4-test: Tensor equivalence long_code_audit top1 ref=671 cand=671 top5_overlap=5/5 overlap=20/20 max_rank_delta=0 rms=0 max_abs=0 top20_max_abs=0 ds4-test: Tensor equivalence long_code_audit largest deltas: id=0 ref=-4.50487 cand=-4.50487 abs=0 id=1 ref=19.7669 cand=19.7669 abs=0 id=2 ref=-47.0626 cand=-47.0626 abs=0 id=3 ref=16.7405 cand= 16.7405 abs=0 id=4 ref=23.0197 cand=23.0197 abs=0 ds4-test: Tensor summary route=auto cases=5 capture_fail=0 logits_fail=0 greedy_fail=0 top1_mismatch=0 min_top5_overlap=5/5 min_overlap=20/20 worst_rank_delta=0 worst_rms=0 worst_max_abs=0 worst_ top20_max_abs=0 metal-tensor-equivalence: OK server: server: OK ds4 tests: ok ``` the benchmark run ``` $ ./ds4-bench -m ds4flash.gguf --prompt-file speed-bench/promessi_sposi.txt --ctx-start 2048 --ctx-max 65536 --step-incr 2048 --gen-tokens 128 ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.348s ds4: cuda backend initialized for graph diagnostics ds4-bench: context buffers 1742.43 MiB (ctx=65665, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=16418) ctx_tokens,prefill_tokens,prefill_tps,gen_tokens,gen_tps,kvcache_bytes 2048,2048,66.87,128,11.79,52184460 4096,2048,65.46,128,10.13,80373132 6144,2048,65.30,128,10.07,108561804 8192,2048,65.23,128,9.98,136750476 ... ``` the eval run (until the first failure) ``` $ ./ds4-eval --nothink --temp 3 --min-p 0.25 ds4: CUDA backend initialized on AMD Radeon 8060S Graphics (sm_115) ds4: CUDA registered 80.76 GiB model mapping for device access ds4: CUDA preparing model tensor mappings: 80.24 GiB ds4: CUDA q8 fp16 cache limit reached; using q8 kernels (request=64.00 MiB cached=7.94 GiB limit=8.00 GiB) ds4: CUDA startup model preparation covered 80.76 GiB of tensor spans in 0.314s ds4: cuda backend initialized for graph diagnostics ds4-eval: context auto-sized to 16777 tokens (largest prompt=777 tokens, case=70, generation budget=16000) ds4-eval: model shape DeepSeek V4 Flash ds4-eval: context buffers 718.95 MiB (ctx=16777, backend=cuda, prefill_chunk=4096, raw_kv_rows=4352, compressed_kv_rows=4196) ds4-eval: 14/92 passed, 1 failed, runtime 00h:12m # state prompt gen total given correct test 1 PASSED 201 462 663 B B GPQA Diamond/recNu3MXkvWUzHZr9 2 PASSED 149 64 213 C C SuperGPQA/001b51d76b4d422988f2c11f104a2c6c 3 PASSED 81 408 489 70 70 AIME2025/aime2025-01 4 PASSED 313 170 483 C C GPQA Diamond/recoiTJPGUmzAkief 5 PASSED 272 372 644 J J SuperGPQA/b7e20eac98764fb0bf30e8366d951daa 6 PASSED 146 788 934 468 468 AIME2025/aime2025-16 7 PASSED 156 496 652 B B GPQA Diamond/rec4UqStf9WUVif1f 8 PASSED 127 55 182 E E SuperGPQA/4a1d1780a93f4093b6fb7d3c314cbea8 9 PASSED 633 2537 3170 588 588 AIME2025/aime2025-02 10 PASSED 182 382 564 B B GPQA Diamond/recgI6tUQ7RLJRWGx 11 PASSED 137 85 222 A A SuperGPQA/6082513c8dba4ec68aa68f1bf5854d09 12 PASSED 165 612 777 16 16 AIME2025/aime2025-03 13 PASSED 149 901 1050 A A GPQA Diamond (modified)/recDytVnNYZe2HuUU 14 PASSED 167 76 243 J J SuperGPQA/bebf1ed45ae14ad7b4f205f3909cb58a 15 FAILED 305 880 1185 96 82 AIME2025/aime2025-18 16 STOPPED 131 14 145 - D GPQA Diamond/recNFJjE5PPTqVJGv 17 PENDING 0 0 0 - I SuperGPQA/7ca71b86327744b78e93185a45bc5cef ... ```
|
Tested this branch with a 7900xtx (RDNA3), output is correct, speed is slow (as a single 24GB GPU with 128GB DRAM on Zen2 threadripper CPU, this is expected): Running the tests will take a bit too long in general I think... but happy to spot-check a specific test if that helps. |
|
Thanks @harryhaaren , can you please run it with DS4_ROCM_NO_F16_PAIR_WARP_MATMUL=1 option? This is the only change I made in the kernel logic - this option will turn it off, so you will be running the original matmul_f16_pair_ordered_chunks_kernel kernel. Thanks again. |
|
btw, the whole codebase (i.e. the ds4_cuda kernels) implies that you have a unified memory architecture. |
rocm - fixing the correctness issues
(rebased on main@upstream)
the test run
the determenistic test run (with DS4_CUDA_MOE_NO_ATOMIC_DOWN=1)
the benchmark run
the eval run (until the first failure)