feat(speculative): add Qwen3 dense target support for EAGLE-1/2/3#2313

Open
khazic wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
khazic:khazic/feat/eagle-qwen3-support

@khazic
Contributor

@khazic khazic commented May 25, 2026

What does this PR do?

Add Qwen3 (Qwen3ForCausalLM) as a supported target for EAGLE-1 /
EAGLE-2 / EAGLE-3 training. Stacked on top of #2312 (Phi-3 support);
the actual Qwen3-specific delta is one registry entry + three example
configs + docstring updates.

Depends on #2312. Until #2312 lands, the diff shown here also
includes the Phi-3 PR's commits as a prefix. After #2312 merges, this
branch will be rebased onto main and become a 3-line incremental PR.
The Qwen3-specific commit is b3d018c8.

Changelog (Qwen3 delta only)

  • components/speculative/eagle/registry.py: append
    Qwen3ForCausalLM to _DENSE_ARCHITECTURES. Qwen3 already works
    through the existing config-driven draft path -- it decouples
    head_dim from hidden_size / num_attention_heads, which the
    attention layer already reads via
    getattr(config, "head_dim", ...); attention_bias and
    mlp_bias are exposed on Qwen3Config so they are read normally.
  • Add example YAMLs:
    examples/speculative/eagle{1,2,3}/qwen3_eagle{1,2,3}_perfectblend.yaml.
  • Update draft / recipe docstrings to mention Qwen3 alongside Llama and Phi-3.

No code-path changes were required. The registry dispatch already
exists from #2312.
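
To make the "config-driven" claim concrete, the following sketch shows one way such reads could look. It is not the draft module's code: resolve_attention_settings is a hypothetical helper, the fallback expression for head_dim is an assumption (the PR elides it as "..."), and the config values are illustrative rather than taken from any checkpoint.

```python
from types import SimpleNamespace


def resolve_attention_settings(config):
    """Read attention settings from a config, with fallbacks for absent fields."""
    # Assumed fallback: derive head_dim the classic way only when it is not set.
    default_head_dim = config.hidden_size // config.num_attention_heads
    head_dim = getattr(config, "head_dim", None) or default_head_dim
    return {
        "head_dim": head_dim,
        # Absent bias flags default to False.
        "attention_bias": getattr(config, "attention_bias", False),
        "mlp_bias": getattr(config, "mlp_bias", False),
    }


# Illustrative values: head_dim is set independently of hidden_size / heads.
decoupled = SimpleNamespace(
    hidden_size=1024, num_attention_heads=16, head_dim=128, attention_bias=False
)
# No explicit head_dim: it is derived as 1024 // 16.
derived = SimpleNamespace(hidden_size=1024, num_attention_heads=16)

print(resolve_attention_settings(decoupled))
print(resolve_attention_settings(derived))
```

The point of the pattern is that a config which sets head_dim explicitly is honored as-is, while one that omits it still resolves to the conventional value.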

Verification

End-to-end smoke test on 8 x H100:

  • Target: Qwen/Qwen3-8B (15.26 GB, 8.19 B params, model_type=qwen3).
  • Dataset: PerfectBlend (200-sample slice).
  • EAGLE-3 over 25 optimizer steps:
2026-05-25 20:34:04 INFO Training start: start_epoch=0 num_epochs=1 batches_per_epoch=25
2026-05-25 20:34:06 INFO epoch=0 step=1  train_loss=9.846308  train_acc=0.000000
2026-05-25 20:34:06 INFO epoch=0 step=2  train_loss=8.477696  train_acc=0.029953
2026-05-25 20:34:07 INFO epoch=0 step=3  train_loss=8.218086  train_acc=0.053741
...
2026-05-25 20:34:13 INFO epoch=0 step=23 train_loss=6.182835  train_acc=0.113525
2026-05-25 20:34:13 INFO epoch=0 step=24 train_loss=6.219086  train_acc=0.098724
2026-05-25 20:34:14 INFO epoch=0 step=25 train_loss=6.175844  train_acc=0.093933
2026-05-25 20:34:14 INFO Epoch 0 done: total_batches_seen=25 global_step=25

Loss decreases 9.85 -> 6.18 over 25 steps (~37% drop), accuracy ticks
up from 0 to ~0.09. No TypeError / AttributeError at
draft construction, target load (Liger applied to model type: qwen3
without complaint), or training step.
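
The quoted drop can be re-derived from the log itself. The snippet below is a small sketch that parses the first and last step lines from the excerpt above and computes the relative loss decrease; the regular expression and the helper parse_step are illustrative and not part of the training code.

```python
import re

# First and last step lines copied from the log excerpt above.
log_lines = [
    "2026-05-25 20:34:06 INFO epoch=0 step=1  train_loss=9.846308  train_acc=0.000000",
    "2026-05-25 20:34:14 INFO epoch=0 step=25 train_loss=6.175844  train_acc=0.093933",
]

pattern = re.compile(r"step=(\d+)\s+train_loss=([\d.]+)\s+train_acc=([\d.]+)")


def parse_step(line):
    """Extract (step, loss, acc) from one training log line."""
    match = pattern.search(line)
    if match is None:
        raise ValueError(f"unrecognized log line: {line!r}")
    return int(match.group(1)), float(match.group(2)), float(match.group(3))


first_step, first_loss, _ = parse_step(log_lines[0])
last_step, last_loss, last_acc = parse_step(log_lines[1])

# Relative decrease in loss between the first and last logged step.
relative_drop = (first_loss - last_loss) / first_loss
print(f"steps {first_step}-{last_step}: loss drop {relative_drop:.1%}, final acc {last_acc:.3f}")
```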

Before your PR is "Ready for review"

Pre checks:

  • Contributor guidelines followed
  • Did you write any new necessary tests? No -- end-to-end repro needs
    multi-GPU + a real Qwen3 target; smoke-test evidence above.
  • Example configs added under examples/speculative/eagle{1,2,3}/.

@copy-pr-bot

copy-pr-bot Bot commented May 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@HuiyingLi
Contributor

/ok to test b3d018c

@HuiyingLi
Contributor

/ok to test ee3bdc3

@khazic khazic force-pushed the khazic/feat/eagle-qwen3-support branch from fa4074f to 0be62cc Compare May 25, 2026 16:00
@HuiyingLi
Contributor

/ok to test 0be62cc

khazic added 2 commits May 26, 2026 09:56
Register ``Qwen3ForCausalLM`` in the EAGLE dense draft dispatch table.
Qwen3 already works through the existing config-driven draft path:
``head_dim`` is read via ``getattr(config, "head_dim", ...)`` (Qwen3
decouples it from ``hidden_size / num_attention_heads``), and
``attention_bias`` / ``mlp_bias`` are read via ``getattr(..., False)``,
so values set on Qwen3's config are honored and missing ones default to
``False``. No code-path changes required; just an allowlist entry plus
example configs and docstrings.

  - registry.py: append "Qwen3ForCausalLM" to ``_DENSE_ARCHITECTURES``.
  - Add example YAMLs: ``qwen3_eagle{1,2,3}_perfectblend.yaml``.
  - Update docstrings (draft modules + recipes) to mention Qwen3.

End-to-end smoke-tested on 8x H100 with Qwen/Qwen3-8B target on a
PerfectBlend 200-sample slice (EAGLE-3, 25 steps): loss decreases
9.85 -> 6.18 (~37% drop), train_acc ticks up from 0 to ~0.09. No
construction-time / load-time errors.

Signed-off-by: khazic <khazzz1c@gmail.com>

The EAGLE-3 draft reads ACT2FN[config.hidden_act] from the target config,
but the EAGLE-1/2 draft hardcodes nn.SiLU(). All currently registered
dense architectures (Llama / Phi-3 / Qwen3) happen to use silu, so the
hardcode is correct today.

However, the dense registry is intended to grow to cover non-SiLU
families next (e.g. Gemma uses gelu_pytorch_tanh). With the hardcode
in place, registering such an architecture would silently mismatch the
target's activation: no crash, no error, and training still converges, but
draft hidden states drift from the target's and the speculative acceptance
rate drops with nothing pointing to the cause.

Read hidden_act from config so the draft matches the target by
construction and adding new architectures stays a one-line registry
change.

Signed-off-by: khazic <khazzz1c@gmail.com>
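
To illustrate why a hardcoded activation would be a silent mismatch, here is a dependency-free sketch. It is not the draft module's code: ACTIVATIONS is a scalar stand-in for a name-to-activation table such as ACT2FN (which maps names to torch modules), and activation_for is a hypothetical helper.

```python
import math


def silu(x: float) -> float:
    """SiLU: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GELU."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Scalar stand-in for an activation lookup table keyed by config name.
ACTIVATIONS = {"silu": silu, "gelu_pytorch_tanh": gelu_tanh}


def activation_for(config_hidden_act: str):
    """Pick the activation named by the target config instead of hardcoding one."""
    return ACTIVATIONS[config_hidden_act]


x = 1.0
hardcoded = silu(x)  # what a hardcoded SiLU would compute
from_config = activation_for("gelu_pytorch_tanh")(x)  # what the target expects
print(f"silu({x}) = {hardcoded:.4f}, gelu_pytorch_tanh({x}) = {from_config:.4f}")
```

Both functions accept the same inputs and return values of the same shape, so nothing fails at construction or during training; only the numbers differ, which is why looking the activation up by name is the safer default.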
@khazic khazic force-pushed the khazic/feat/eagle-qwen3-support branch from 0be62cc to ef5eb6c Compare May 26, 2026 01:57
@HuiyingLi
Contributor

/ok to test ef5eb6c
