Gemma4 MTP by am17an · Pull Request #17 · am17an/llama.cpp

am17an · 2026-05-19T15:56:42Z

Works with both gemma-31B and gemma-26B, but the MoE model is slower. On my DGX Spark I see a ~2-2.5x speedup on the dense model. The main problem is sharing the memory context between the two llama_contexts, so the current approach is pretty hacky, and the ubatch splitting is not very clean either.

Replicated the AIME-26 results for Gemma-31B with -np 4
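
For context on where a speedup like this comes from, here is a minimal sketch of generic greedy speculative verification: the draft head proposes several tokens, the target model checks them in one batch, and the longest matching prefix is kept. This is an illustration only. `count_accepted` and `tokens_per_step` are hypothetical helpers, not functions from this PR or from llama.cpp, and greedy sampling is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of llama.cpp): greedy speculative verification.
// draft[i]  : i-th token proposed by the draft (MTP) head
// target[i] : token the target model picks at the same position after seeing
//             the prefix plus draft[0..i-1]
// Precondition: target has exactly one more entry than draft.
// Returns the number of draft tokens accepted.
inline size_t count_accepted(const std::vector<int32_t> & draft,
                             const std::vector<int32_t> & target) {
    assert(target.size() == draft.size() + 1);
    size_t n = 0;
    while (n < draft.size() && draft[n] == target[n]) {
        ++n;
    }
    return n;
}

// Tokens emitted per target forward pass: the accepted draft tokens plus the
// one token the target produces itself at the first mismatch (or after the
// last draft token when everything matched).
inline size_t tokens_per_step(const std::vector<int32_t> & draft,
                              const std::vector<int32_t> & target) {
    return count_accepted(draft, target) + 1;
}
```

With three draft tokens of which two match, one target pass yields three tokens instead of one, which is why the gain depends on the acceptance rate and on how cheap the draft pass is.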

This commit attempts to clarify a code comment in graph_mtp regarding where the MTP layer is stored. The motivation for this is that it was not obvious to me what the original comment meant and hopefully this makes it clearer.

am17an · 2026-05-19T16:58:01Z

+    // of streams (one per active draft seq); q->ne[2] is not divisible by the full
+    // n_stream and the view collapses tokens. Slice k/v down to exactly the streams
+    // referenced by this ubatch. Requires those streams to form a contiguous range.
+    if (k->ne[3] > 1 && (uint32_t) k->ne[3] != ubatch.n_seqs_unq) {


@ggerganov this part
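
The contiguity requirement in the comment above can be sketched in isolation. The code below is a hedged illustration, not the PR's code: `stream_slice`, `contiguous_streams`, and `slice_offset` are hypothetical names, and plain integers stand in for the ggml tensor fields (`ne[3]` as the stream count, `nb3` as the byte stride of the stream dimension).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical description of a slice along the stream dimension.
struct stream_slice {
    uint32_t first; // first stream index referenced by the ubatch
    uint32_t count; // number of unique streams referenced
};

// Returns the slice if the unique stream ids form a contiguous range,
// otherwise std::nullopt (a single view cannot cover a range with gaps).
inline std::optional<stream_slice> contiguous_streams(std::vector<uint32_t> ids) {
    if (ids.empty()) {
        return std::nullopt;
    }
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());

    const uint32_t first = ids.front();
    const uint32_t count = static_cast<uint32_t>(ids.size());
    if (ids.back() - first + 1 != count) {
        return std::nullopt;
    }
    return stream_slice{first, count};
}

// Byte offset of the slice start, given the stride nb3 of the stream dimension.
inline size_t slice_offset(const stream_slice & s, size_t nb3) {
    assert(s.count > 0);
    return static_cast<size_t>(s.first) * nb3;
}
```

A caller would take the view only when `contiguous_streams` returns a value, using `slice_offset` as the starting offset and `count` as the new size of the stream dimension.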

* app : introduce the llama unified executable Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use serve for server Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Hide completion and bench, add help command Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Remove STATIC Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use -impl targets instead of -lib Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Revert "Remove STATIC" This reverts commit cc44cac. --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>

…DIA GPUs (Hopper+) (ggml-org#22522) * Adds initial PDL setup. * Adds PDL barriers based on simple heuristic: place "sync" before first input pointer access, and "launch" after last write, e.g. to tensors like dst. * Further optimization pass of the first half of kernels * Optimized PDL barriers for the second batch of kernels * Further refinements after rebase. * Moves pdl logic to separate function, removes some whitespace * Strips post-hoc PDL logic * Adds stream capture PDL setup. Enrolls quantize_q8_1 to leverage pdl to overlap execution with previous kernels * Enrolls mul_mat_vec_q, rms_norm_f32 and k_bin_bcast (partly) into PDL * Enrolls mmvf, rope, set-rows and topk kernels for gpt-oss into PDL * Introduce ggml_cuda_kernel_launch, to abstract away cudaLaunchKernelEx, to enable hip/musa compatibility * Enrolls cpy_scalar_contiguous, k_get_rows_float and rms_norm_f32 * Enrolls flash_attn_combine_results * Fix: Drops needless and broken check of CUDA arch for PDL. PDL either works or is without effect. * Enrolls flash-attention kernels to pdl * Fix: inlines ggml_cuda_kernel_launch, and uses perfect forwarding for kernels args. This fixes PDL. * Perf: Enrolls k_bin_bcast variadic template invocation into PDL, via and template alias and template expansion * Enrolls all remaining kernels for qwen3-coder-next into PDL * Remove all PDL LC calls to create a baseline * Added LC according to internal guidance and tested kernel performance. * Enrols missing qwen3-5 kernels passively into PDL. * Kernel optimizations (LC signals) for qwen3.5 * Enrolls ssm-scan kernels into PDL * Adds GGML_CUDA_PDL command line option to toggle PDL. * Fix: Ada and lower compilation by guarding PDL calls correctly * Cleanup: Removes commented out GGML_CUDA_PDL_LC * Cleanup: Removes experimental comments * Adds 90-virtual to build script so that Hopper GPUs can leverage PDL. 
* Adds stricter checks to enable PDL, adds env-check to disable it, and removes now superfluous compile option to enable PDL. * Fix: Correct PDL en/disablement based on device-side arch check. Host side check is UB. Required moving from macros to inlined functions * Fix: default-disable PDL. Enable by setting GGML_CUDA_ENABLE_PDL=1 * Enable PDL by default for Hopper+ devices * Enrolls softcap_f32 and two flash_attn kernels into PDL. * Improves flash attn PDL barrier placement * Fix: Perf regression on ada; excludes ada and below from PDL launches * Improves some sync barrier placements * Drops superfluous constructor * Adds #endif guard comments * Reverts experimental change to top-k-moe.cu, which moved expensive allocations in front of the PDL barrier. It did not have a meaningful impact. * Exchanges GGML_CUDA_DISABLE_PDL with GGML_CUDA_PDL. IFF GGML_CUDA_PDL=0 PDL is disabled * Revert "Drops superfluous constructor". Adds const to remaining arguments This reverts commit 12b1d25. * Cleanup: Removes and fixes some comments and whitespace * Clarifies comment of sync-barrier position * Relocates and refactors PDL launch functions and accessories * Adds error checking to the regular kernel launch path * Drops "auto" in favor of "ggml_cuda_kernel_params" * Adds "const" to ggml_cuda_kernel_launch_params * [Whitespace] Adds final newline to common.cuh to make editorconfig CI job happy

* hmx-mm: update debug logging in hmx-mm * hmx-mm: update dequant logic to use HVX_vector_x2/4 * hmx-mm: remove non-pipelined version of the quantize matmul It seems that we don't really need non-pipelined version * hmx-mm: use activation depth mode and update naming Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com> * hex-mm: minor hmx matmul naming updates * hmx-mm: remove unused vars * snapdragon: scripts bump default ubatch-size to 1K * hexagon: combine HMX and power and clock settings into a single set_power call * hmx-mm: remove leftover of the scale repl helper * hexagon: fix editconf error --------- Co-authored-by: Kim-Chyan Gan <kgan@qti.qualcomm.com>

…refactor (ggml-org#23345) * mtmd : deepseek-ocr fixes, improvements and refactoring - image processing changes to achieve full parity with Pillow (reference impl) - SAM mask casting only when flash-attn is on - SAM refactor (build_sam() extracted so deepseek-ocr-2 can reuse it) - llama-chat changes to fix server/WebUI issue (new media_markers_first()) - adapted test-chat-template and added test cases for deepseek-ocr - changed regression test for deepseek-ocr to use CER+chrF scores for ground-truth comparison; removed embedding-model - ty.toml ignore unresolved-import for tools/mtmd/tests/** * image-text reordering fix removed * refactor bool add_padding + pad_rounding enum into a single pad_style enum

…l-org#23763) * ci : fix undefined sanitizer build to use Debug build type only * ci : ccache the server builds * cont : remove ui dependency + reuse ccache for both ubuntu jobs * tmp : force ccache save * Revert "tmp : force ccache save" This reverts commit a857b03. * cont : no need for node.js

* ci : server windows set build type explicitly * cont : try windows-2025 * ci : use llvm * cont : use ninja * cont : fix shell * ci : set number of jobs correctly * ci : fix windows with vulkan ccache by using llvm * ci : server ccache only on master * ocd : fix job names [no release]

* feat: extend repeat op for vulkan * feat: add repeat_f16 vulkan pipeline * fix: ensure same dst and src types * fix: use type_size instead of data types * fix: use int16 and int32 for repeat shader op * chore: rename repeat_f* to repeat_i* * chore: rename repeat vulkan pipelines

…22887) * vulkan: Switch MUL_MAT_VEC to 4 K per iteration for F16/32 Against mesa git, this shows a 4.8% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. Note that this breaks some tests until the last commit which fixes OOB A reads. * vulkan: Use aligned loads in mul_mat_vec when available Against mesa git, this shows a 3.3% performance improvement for tg128 on Qwen3.5-9B:BF16 on Intel BMG. * Make explicit that `num_rows` is <= `NUM_ROWS` in mul_mat_vec Mesa's UUB logic can't see through conditionals, limiting its ability to understand the bounds on the `num_rows` field in the cleanup run. Making it explicit that `num_rows` is, indeed, always <= `NUM_ROWS` helps mesa make slightly better codegen. Against mesa git, this currently shows a 1% performance improvement in tg128 on Qwen3.5-9B:BF16 on Intel BMG. * vulkan: Fix OOB A reads in MUL_MAT_VEC for odd sizes There was a TODO to fix the OOB reads from the A matrix which we do here. It is within performance noise (+<0.1%) in tg128 for Qwen3.5-9B:BF16 on Intel BMG.

* hex-mm: add support for Q4_1 matmul/matvec, hvx-only for now * hmx-mm: add support for Q4_1 * hex-mm: use Q8_1 dynamic quantization to avoid having to compute sums in the vec_dot * hexagon: fix repack scratch buffer overflow * hex-mm: fix Q4_1 repack buffer sizing * hexagon: flip the build order for mm and fa (seems to help LTO) * hex-mm: add vec_dot 4x1s and minor HMX cleanup after adding Q4_1 * hex-mm: fix fp16 vec_dot fallback to 2x1 and another issue that could cause incorrect output * hexagon: resurrect early-wake and add support for polling for op-batch completions With Q4_1 ggml-hexagon now claims pretty much the entire graphs which gives the CPU more time to chilax. This is a good thing! But it does add extra latency for the pure benchmark runs. Early wakeup helps recover the latency a bit in the normals runs and op-batch polling is just for benchmarking. --------- Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>

github-actions bot added the examples, python, server, and model labels on May 19, 2026

danbev and others added 2 commits May 19, 2026 18:41

hexagon: enable support for NORM op (ggml-org#23319)

ac76808

am17an commented May 19, 2026

CISC and others added 21 commits May 19, 2026 21:16

convert : update mtp related help (ggml-org#23334)

b7393a4

* update mtp related help * remove outdated experimental text

common: fix --fit verbosity with --verbosity 4 (ggml-org#23282)

7256fce

common: fix --help for --verbosity (ggml-org#23278)

57cb35c

github: mention --log-file in issue templates (ggml-org#23277)

a807867

refactor: Chat Screen UI rendering (ggml-org#23333)

67ace02

hexagon: add MROPE and IMROPE support in HTP rope op (ggml-org#23317)

17d22a3

opencl: add MoE support for q4_k, q5_k, q6_k on Adreno (ggml-org#23303)

b28a2f3

* opencl: add q4_k moe support * opencl: add q5_k moe support * opencl: add q6_k moe support * opencl: adjust format --------- Co-authored-by: Li He <lih@qti.qualcomm.com>

ggml-cuda: tune RDNA3 Q6_K MMVQ nwarps (ggml-org#23349)

b39a7bf

snapdragon: update toolchain to v0.6 (ggml-org#23369)

871b0b7

* snapdragon: update compiler flags to enable all CPU features * snapdragon: update readme to point to toolchain v0.6 * snapdragon: bump toolchain docker to v0.6

metal : optimize pad + cpy (ggml-org#23354)

57ebaf4

* metal : optimize pad * metal : optimize cpy * cont : better row packing in threadgroup

fix: Div wrapper no pointer events on hidden (ggml-org#23390)

585080d

ui: Refactor isMobile as reactive value in viewport store (ggml-org#23330)

5028447

* refactor: `isMobile` as reactive value in `viewport` store * refactor: Use Svelte media query for the viewport store

docker : copy conversion files (ggml-org#23370)

7e50ef7

mtmd: fit_params now take into account mmproj (ggml-org#21489)

e2b129e

* mtmd: fit_params now take into account mmproj * rename alloc_compute_meta to reserve_compute_meta * rm unused functions * add ggml_backend_dev_t support * add debug log

refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406)

e6b4acf

feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396)

6ce9671

vulkan: optimize operations in the IM2COL shader (ggml-org#22685)

acd604f

* vulkan: optimize operations in the IM2COL shader * Add comments and improve the code formatting

am17an force-pushed the gemma4-mtp branch from cd2e5b2 to a03120c on May 20, 2026 16:28

common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)

510b5c2

ggml_backend_dev_by_name always appends a nullptr sentinel to the devices vector. Skipping nullptr entries prevents assertion failure in ggml_backend_dev_name. Assisted-by: llama.cpp:local pi
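
The fix described in this commit message follows a common pattern: skip null entries before asking each device for its name. The sketch below only illustrates that pattern. `fake_device`, `device_name`, and `devices_to_string` are hypothetical stand-ins and do not reproduce the actual `get_devices_str` implementation or the ggml backend API.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a backend device handle; the real type is opaque.
struct fake_device {
    std::string name;
};

// Mirrors the behaviour described in the commit message: asking for the name
// of a null device trips an assertion.
inline const std::string & device_name(const fake_device * dev) {
    assert(dev != nullptr);
    return dev->name;
}

// Join device names, skipping null entries such as a trailing sentinel.
inline std::string devices_to_string(const std::vector<const fake_device *> & devs) {
    std::string out;
    for (const fake_device * dev : devs) {
        if (dev == nullptr) {
            continue; // sentinel or empty slot: nothing to name
        }
        if (!out.empty()) {
            out += ", ";
        }
        out += device_name(dev);
    }
    return out;
}
```

Without the null check, the first sentinel reached in the loop would hit the assertion in `device_name`, which is the crash the commit message describes.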

rgerganov and others added 24 commits May 27, 2026 08:06

server : fix the log message when using SSL (ggml-org#23393)

7085492

When llama-server is started with SSL key and cert, the log says that it listens on http instead of https. This patch fixes this.

convert: add MiniCPM5 tokenizer support (ggml-org#23384)

9777256

Add minicpm5 pre-tokenizer hash via convert_hf_to_gguf_update.py and implement hardcoded regex handling in llama-vocab.cpp, consistent with other BPE pre-tokenizers. Co-authored-by: zhangtao <zhangtao2@modelbest.cn>

docs : fix duplicated "the" in granitevision and model-conversion docs (ggml-org#23767)

1d971bb

Co-authored-by: Kai Tanaka <275430420+quyentonndbs@users.noreply.github.com>

vulkan: avoid preferring transfer queue on AMD UMA devices (ggml-org#22455)

4d8cc0c

ci : remove wasm test (ggml-org#23733)

b3a739c

* run tests in correct build folder * remove wasm test

common : fix env names to all have LLAMA_ARG_ prefix (ggml-org#23778)

6b4e4bd

ci : bump cuda release to 13.3 (ggml-org#23749)

2d0656f

CUDA: restrict PDL to CTK >= 12.3 due to MSVC issues (ggml-org#23742)

fda8528

pyproject : add conversion folder and update dependencies (ggml-org#23746)

87b0a60

* add conversion folder and update dependencies * limit python version for triton * update dev-dependencies section

vendor : update cpp-httplib to 0.46.0 (ggml-org#23650)

617255d

ci : move ARM jobs to self-hosted + disable kleidiai mac release (ggml-org#23780)

ba4dd0b

* ci : move ARM jobs to 3rd-party runners + disable kleidiai release * cont : fix deps + fix names * ocd : fix names * cont : fix PR links

vulkan: use GL_NV_cooperative_matrix_decode_vector for faster matmul (ggml-org#23541)

b36eefc

ggml-webgpu: Fix how to dispatch WG to some ops (ggml-org#23750)

c40006a

ggml-webgpu: remove legacy constants (ggml-org#23672)

f12cc6d

llama: Gemma 4 MTP

dfc02c9

fix multi-seq

ee1ee38

add assert that draft + shared kv should be on same device

01134dd

add Q rot when cache is quantized

1e4fb9f

add temp hack to not use fit with gemma4, rm later

c073320

am17an force-pushed the gemma4-mtp branch from 4b1d1ae to c073320 on May 28, 2026 04:57

github-actions bot added the SYCL, AMD ZenDNN, and android labels on May 28, 2026

am17an added 2 commits May 28, 2026 13:41

add exception in test-llama-archs

79098ee

move assistant to separate file

e21d64b

am17an wants to merge 143 commits into master from gemma4-mtp

20 participants
