Add GLM4-MOE Mode w/Disaggregated Prefill and Decode Support #988

Open
vbaddi wants to merge 8 commits into quic:release/v1.22.0_tmp from vbaddi:feat/enable_glm4_moe

vbaddi commented May 14, 2026

Summary

Adds GLM4-MOE support for disaggregated serving with chunked prefill.

Supported

  • GLM4-MOE decode path
  • Chunked prefill MoE path with packed expert dispatch
  • KV-blocked attention path
  • Disaggregated prefill/decode serving example
  • ONNX subfunction export for decode and prefill

Tested

  • Added GLM4-MOE prefill/blocked export tests
  • Verified packed MoE custom-op counts for prefill_seq_len=512, packed chunk size 256
  • Ran the GLM4-MOE disaggregated example end-to-end with a tiny config
 pytest -q tests/transformers/models/test_moe_prefill_blocked.py
 python examples/disagg_serving/glm4_moe_disagg_mode_with_chunking.py
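
For the tested configuration above, the chunk arithmetic can be sketched as follows. This is only an illustration: `packed_chunk_slices` is a hypothetical helper, not a function from this repository, and it assumes that packed chunks are contiguous, fixed-size slices of the prefill sequence.

```python
def packed_chunk_slices(prefill_seq_len: int, chunk_size: int) -> list[tuple[int, int]]:
    """Hypothetical helper: (start, end) token ranges covered by each packed chunk.

    Assumes chunks are contiguous and fixed-size, with a shorter final chunk
    when prefill_seq_len is not a multiple of chunk_size.
    """
    if prefill_seq_len <= 0 or chunk_size <= 0:
        raise ValueError("prefill_seq_len and chunk_size must be positive")
    return [
        (start, min(start + chunk_size, prefill_seq_len))
        for start in range(0, prefill_seq_len, chunk_size)
    ]


# Tested configuration from this PR: prefill_seq_len=512, packed chunk size 256.
slices = packed_chunk_slices(512, 256)
print(slices)       # [(0, 256), (256, 512)]
print(len(slices))  # 2
```

Under that assumption, a 512-token prefill with a packed chunk size of 256 yields two chunk slices, so an export traced at the full `prefill_seq_len` would be expected to contain dispatch ops for both.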

vbaddi marked this pull request as draft May 14, 2026 14:41
quic-rishinr requested review from ochougul and quic-rishinr and removed request for ochougul May 15, 2026 10:30
vbaddi force-pushed the feat/enable_glm4_moe branch from 6e91468 to 77e65e9 May 15, 2026 14:41
vbaddi marked this pull request as ready for review May 18, 2026 21:36
vbaddi mentioned this pull request May 27, 2026
quic-rishinr changed the base branch from main to release/v1.22.0_tmp May 27, 2026 05:14
vbaddi added 8 commits May 27, 2026 10:55
Enable GLM4-MOE chunked prefill MoE, KV-blocked attention, and disaggregated
serving export with subfunctions.

  - GLM4-MOE decode path
  - Chunked prefill MoE path with packed expert dispatch
  - KV-blocked attention path
  - Disaggregated prefill/decode serving example
  - ONNX subfunction export for decode and prefill

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Use the headpar_offline KV-blocking path by default for GQA-compatible KV
blocking, with fallback to the previous online implementation for
unsupported masking/bias cases.

Revert to the previous commit if this fails. WIP.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Route KV-containing combined blocking modes through the
headpar_offline path when supported, and pass user-tiled
compile flags explicitly in the GLM4 MoE disagg example.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…on export and update example

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Trace chunked prefill exports with the requested prefill_seq_len so that packed
MoE dispatch unrolls all packed chunks, restore the torch.full_like index init,
and add ONNX coverage for the second packed chunk slice.

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…ss/qwen3/pr935

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>
vbaddi force-pushed the feat/enable_glm4_moe branch from 8452e31 to 5c632b7 May 27, 2026 05:52

2 participants