feat(moe): NSP-blocked expert dispatch for Qwen3MOE and GPT-OSS prefill by vbaddi · Pull Request #935 · quic/efficient-transformers

vbaddi · 2026-04-21T21:01:31Z

Adds NSP-parallel expert-blocked dispatch to the chunked prefill MoE path for Qwen3MOE and GPT-OSS, replacing the sequential per-expert loop with a batched packed-prefix approach.
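The packed-prefix idea can be illustrated with a minimal, pure-Python sketch for a single expert. This is not the code from this PR: the helper names (`packed_positions`, `dispatch_one_expert`) are hypothetical, scalars stand in for hidden-state vectors, and the use of INT32_MAX as a "not routed" marker is an assumption based on the sentinel handling mentioned in the commit messages.

```python
from itertools import accumulate

# Assumed convention: INT32_MAX marks a token that is not routed to this expert.
INT32_MAX = 2**31 - 1


def packed_positions(mask):
    """Map each routed token to its slot in the packed prefix via a running sum."""
    running = list(accumulate(int(m) for m in mask))  # inclusive cumsum of the 0/1 mask
    return [c - 1 if m else INT32_MAX for m, c in zip(mask, running)]


def dispatch_one_expert(tokens, mask, expert_fn):
    """Scatter routed tokens into a dense prefix, run the expert, gather results back."""
    pos = packed_positions(mask)
    num_routed = sum(int(m) for m in mask)
    packed = [0.0] * num_routed
    for tok, p in zip(tokens, pos):
        if p != INT32_MAX:  # scatter: sentinel entries write nothing
            packed[p] = tok
    packed_out = [expert_fn(x) for x in packed]  # the expert only sees routed tokens
    # gather: unrouted tokens contribute zero for this expert
    return [packed_out[p] if p != INT32_MAX else 0.0 for p in pos]


tokens = [1.0, 2.0, 3.0, 4.0]
mask = [True, False, True, True]  # routing decision for one expert
positions = packed_positions(mask)
outputs = dispatch_one_expert(tokens, mask, lambda x: 10.0 * x)
```

In the sequential formulation this work is repeated expert by expert. The blocked path described here batches it across experts so that groups of experts can run on separate NSPs.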

Configuration:
  export EXPERT_BLOCKING_NUM_NSP=16   # default: 1 NSP per expert (best perf at T=256)
  export EXPERT_BLOCKING_NUM_NSP=8    # 2 NSPs per expert
  export EXPERT_BLOCKING_NUM_NSP=2    # for testing

Falls back to the original per-expert loop if num_experts % EXPERT_BLOCKING_NUM_NSP != 0.

Test:
  EXPERT_BLOCKING_NUM_NSP=2 pytest tests/transformers/models/test_moe_prefill_blocked.py -v
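The divisibility rule and the fallback can be sketched as follows. `plan_expert_blocks` is a hypothetical helper, not a function from this PR, and the contiguous assignment of experts to blocks is an assumption made for illustration. The expert count of 128 is likewise only an example value.

```python
def plan_expert_blocks(num_experts, num_nsp):
    """Return one list of expert ids per block, or None if the blocked path does not apply."""
    if num_nsp <= 0 or num_experts % num_nsp != 0:
        return None  # caller falls back to the sequential per-expert loop
    per_block = num_experts // num_nsp
    # Assumed layout: block i owns a contiguous range of expert ids.
    return [list(range(i * per_block, (i + 1) * per_block)) for i in range(num_nsp)]


blocks = plan_expert_blocks(128, 16)   # 16 blocks of 8 experts each
fallback = plan_expert_blocks(128, 3)  # 128 % 3 != 0, so the blocked path is skipped
```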

Update (0429):
Set `export EXPERT_BLOCKING_PACKED_CHUNK_SIZE=256` when the chunk PL is 512.

Update (0525):
Configuration is now compile-API driven:

num_cores controls NSP parallelism.
moe_prefill_packed_chunk_size controls packed chunk size.
No EXPERT_BLOCKING_NUM_NSP / EXPERT_BLOCKING_PACKED_CHUNK_SIZE env vars are required.

Example:

  qeff_model.compile(
      prefill_seq_len=512,
      ctx_len=...,
      num_cores=16,
      prefill_only=True,
      enable_chunking=True,
      moe_prefill_packed_chunk_size=256,
      ...
  )
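With prefill_seq_len=512 and moe_prefill_packed_chunk_size=256, the packed dispatch covers the prefill chunk in two packed slices. The sketch below shows one way those slices could be derived. `packed_chunk_slices` is a hypothetical helper, and truncating the last slice when the sizes do not divide evenly is an assumption, not documented behavior.

```python
def packed_chunk_slices(prefill_seq_len, packed_chunk_size):
    """Return (start, end) pairs covering [0, prefill_seq_len) in packed-chunk steps."""
    if prefill_seq_len <= 0 or packed_chunk_size <= 0:
        raise ValueError("prefill_seq_len and packed_chunk_size must be positive")
    return [
        (start, min(start + packed_chunk_size, prefill_seq_len))
        for start in range(0, prefill_seq_len, packed_chunk_size)
    ]


slices = packed_chunk_slices(512, 256)       # [(0, 256), (256, 512)]
single_slice = packed_chunk_slices(512, 512)  # [(0, 512)]
```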

Notes:

Qwen3-MoE and GPT-OSS disaggregated serving examples are updated to use PL=512 and packed chunk size=256.
The optimized path requires num_experts % num_cores == 0.
Qwen3-MoE and GPT-OSS now use the same packed-chunk flow as the standalone benchmark.
torch.clamp is retained to stay aligned with the benchmark path, using tensor bounds to avoid QAIC Clip dtype issues.
Subfunction-specific ReduceSum/Einsum cleanup is deferred and will be handled separately.

Validation:

pytest -q \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_qwen3moe_prefill_chunked_subfunction_export_contains_cumsum_custom_ops \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_blocked_forward_parity \
  tests/transformers/models/test_moe_prefill_blocked.py::test_gptoss_prefill_chunked_export_traces_packed_chunks

Also verified a non-subfunction QAIC compile of tiny Qwen3-MoE and GPT-OSS configs with:

prefill_seq_len=512
moe_prefill_packed_chunk_size=256
num_cores=2

Add expert-blocked NSP-parallel prefill forward to QEffPrefillChunkedQwen3MoeSparseMoeBlock and QEffPrefillOnlyChunkedGptOssMLP. Controlled via EXPERT_BLOCKING_NUM_NSP env var. Fix CtxScatterFunc3D/CtxGatherFunc3D eager forward for INT32_MAX sentinel handling. Add disagg-mode tests for both models with tiny configs. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Root cause:
- The CtxGather3D ONNX symbolic expanded ctx_indices to Shape(data)[:2] ([batch, seq_len]), which is wrong for packed dispatch.
- In expert-blocked MoE prefill, ctx_indices is intentionally [batch, packed_chunk_size] (e.g. [16, 256]) while data stays [batch, seq_len, ...] (e.g. [16, 512, ...]).
- This caused invalid Expand attempts ([16,256] -> [16,512]) and QAIC compile/runtime failure on /model/layers.0/mlp/CtxGather3D/....

Fix:
- Update the CtxGather3D expand target to take the batch dim from data and the index-seq dim from ctx_indices.
- The new expand shape is [batch_size(data), idx_seq_len(ctx_indices)], preserving the packed chunk length.

Signed-off-by: vtirumal <vtirumal@qti.qualcomm.com>
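The shape mismatch can be reproduced with plain tuples. The helpers below are hypothetical and only model the shape arithmetic, not the actual ONNX symbolic. `can_expand` is a simplified same-rank check (each source dim must equal the target dim or be 1), and the hidden size of 64 is an illustrative value.

```python
def can_expand(src_shape, dst_shape):
    """Simplified same-rank Expand check: each src dim equals the dst dim or is 1."""
    return len(src_shape) == len(dst_shape) and all(
        s == d or s == 1 for s, d in zip(src_shape, dst_shape)
    )


def old_expand_target(data_shape, idx_shape):
    """Previous behavior: take both dims from data, i.e. Shape(data)[:2]."""
    return tuple(data_shape[:2])


def new_expand_target(data_shape, idx_shape):
    """Fixed behavior: batch dim from data, index-seq dim from ctx_indices."""
    return (data_shape[0], idx_shape[1])


data_shape = (16, 512, 64)  # [batch, seq_len, hidden]
idx_shape = (16, 256)       # [batch, packed_chunk_size]

old_ok = can_expand(idx_shape, old_expand_target(data_shape, idx_shape))  # [16,256] -> [16,512]
new_ok = can_expand(idx_shape, new_expand_target(data_shape, idx_shape))  # [16,256] -> [16,256]
```

Here `old_ok` is False and `new_ok` is True. An index tensor with batch dim 1 would still expand to the data batch under the new target, so a batch-broadcast case keeps working.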

…port Add missing CustomOpTransform mappings for CtxScatterFunc3DInt and generalized 3D scatter/gather ops, plus a prefill-only subfunction export regression test to verify the ONNX graph includes the required CtxScatter3DInt/CtxScatter3D/CtxGather3D ops. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

…on export Replace MoE prefill sum reductions with equivalent einsum forms and rewrite int32 clamp bounds using where to avoid QAIC subfunction compile failures for GPT-OSS and Qwen3-MoE. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Trace chunked prefill exports with the requested prefill_seq_len so packed MoE dispatch unrolls all packed chunks, restore torch.full_like index init, and add ONNX coverage for the second packed chunk slice. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

Move MoE prefill blocking from env vars to compile/export API. Derive packed chunk iterations from compile prefill_seq_len while keeping ONNX export tracing small, and make the optimized GPT-OSS/Qwen3-MoE chunked forward path default. Update tests and examples for the new API. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>


vbaddi assigned vbaddi and quic-mamta Apr 21, 2026

vbaddi added the enhancement label (New feature or request) Apr 21, 2026

anujgupt-github reviewed Apr 21, 2026


Comment thread QEfficient/transformers/models/gpt_oss/modeling_gpt_oss.py Outdated

ochougul reviewed Apr 23, 2026


Comment thread QEfficient/customop/ctx_scatter_gather.py

vbaddi added 6 commits April 30, 2026 07:17

nit: weights re-route fixes

a5bd93a

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit: weights re-route fixes v1

c4ef4c8

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0423): gpt oss moe fixed and nit

290839e

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0424): ctx batch idx cast to int32

2804851

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

nit(0429): qwen3_moe, gpt_oss: port cumsum scatter-gather-update MoE …

6b049bc

…prefill Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

vbaddi force-pushed the feat/prefill_moe branch from a0fe82c to 6b049bc April 30, 2026 01:49

vbaddi and others added 7 commits April 30, 2026 07:36

nit(0429): update modeling files

1ae7b23

Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

fix(0525): Align MoE prefill blocking with bench path

22ac72d

Use API-driven packed chunk config, align Qwen3/GPT-OSS MoE prefill scatter-gather flow with benchmark, and update disagg examples for PL512/packed256. Signed-off-by: vbaddi <vbaddi@qti.qualcomm.com>

quic-rishinr changed the base branch from main to release/v1.22.0_tmp May 27, 2026 05:56

feat(moe): NSP-blocked expert dispatch for Qwen3MOE and GPT-OSS prefill #935
vbaddi wants to merge 13 commits into quic:release/v1.22.0_tmp from vbaddi:feat/prefill_moe

5 participants
