[Feature] Support training with Qwen3.5-4B #1844

Open
HAOCHENYE wants to merge 6 commits into InternLM:main from HAOCHENYE:qwen3_5-4B

@HAOCHENYE
Collaborator

  • [Fix] Forward rms_norm_type to DenseDecoderLayer in Dense.build_layers
  • [Feature] Add Qwen3.5-4B dense VLM support
  • [Feature] Add XTUNER_HF_IMPL op-level switch for HF-exact parity
  • [Test] Add Qwen3.5-4B dense parity and save_hf tests
  • [Fix] Select torch.compile fullgraph per layer type in Dense.fully_shard

HAOCHENYE added 5 commits May 26, 2026 21:02

[Fix] Forward rms_norm_type to DenseDecoderLayer in Dense.build_layers

Dense.build_layers omitted rms_norm_type when constructing DenseDecoderLayer,
so a zero_centered dense model silently used the default RMSNorm for the
per-layer input/post norms. MoE.build_layers already forwards it. No-op for
existing dense models, which use the default rms_norm_type.
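
The fix amounts to passing one more keyword argument through the layer constructor. A minimal sketch of the pattern, using stand-in classes: `DecoderLayerSketch` and `build_layers_sketch` are hypothetical and do not reflect XTuner's actual signatures, and the `"default"` value is an assumed placeholder for whatever the real default is.

```python
from dataclasses import dataclass


@dataclass
class DecoderLayerSketch:
    """Stand-in for a decoder layer; not XTuner's DenseDecoderLayer."""

    layer_idx: int
    rms_norm_type: str = "default"  # assumed default, for illustration only


def build_layers_sketch(num_layers, rms_norm_type="default"):
    # Before the fix, the equivalent of this call left out rms_norm_type,
    # so every layer silently fell back to the constructor default.
    return [
        DecoderLayerSketch(layer_idx=i, rms_norm_type=rms_norm_type)
        for i in range(num_layers)
    ]


layers = build_layers_sketch(num_layers=4, rms_norm_type="zero_centered")
print({layer.rms_norm_type for layer in layers})
```

With the argument forwarded, every layer reports the configured norm type; callers that never set it see no change.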

[Feature] Add Qwen3.5-4B dense VLM support

Qwen3.5-4B is a dense VLM whose text tower is a hybrid of GatedDeltaNet linear
attention and gated MHA full attention (every 4th layer), reusing the existing
Dense body's per-layer attention dispatch. Adds the dense text tower
(Qwen3_5_VLTextDense + config), the Qwen3_5_VLDense4BConfig compose config
(reusing the existing Qwen3.5 vision/projector towers), and registration.

MTP is deferred: the checkpoint's mtp.* keys are skipped on load (matching HF)
and not re-saved. See docs/design/model/qwen3_5_dense_4b.md.
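
As a rough illustration of the two ideas above, the hybrid layer schedule and the skipped `mtp.*` keys can be sketched as follows. The function names, the assumption that the full-attention layer is the last one in each group of four, the `"linear_attention"` label, and the example key names are all illustrative guesses, not taken from the PR's code.

```python
def layer_types_sketch(num_layers, full_attention_interval=4):
    # Assumption: the last layer in each group of `full_attention_interval`
    # uses full attention; all other layers use linear attention.
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def drop_mtp_keys(state_dict):
    # Skip every checkpoint entry under the "mtp." prefix.
    return {k: v for k, v in state_dict.items() if not k.startswith("mtp.")}


print(layer_types_sketch(8))

# Hypothetical key names, used only to exercise the filter.
checkpoint = {"model.norm.weight": 1, "mtp.fc.weight": 2}
print(sorted(drop_mtp_keys(checkpoint)))
```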

[Feature] Add XTUNER_HF_IMPL op-level switch for HF-exact parity

XTUNER_HF_IMPL selects HuggingFace-exact op implementations at the ops layer so
decoder layers can be aligned bitwise against transformers: get_attn_impl_fn
forces the eager attention path (XTuner's eager_attention matches HF's
eager_attention_forward bitwise, fp32 softmax + dense causal mask), and
get_rms_norm_fn forces the native torch path over triton. Both read the env var
live so tests can toggle it per model instance. Not for production training.
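
The "read the env var live" behaviour can be sketched as below. This is only an illustration of the pattern: `hf_impl_enabled`, `select_attn_impl`, and `select_rms_norm_impl` are hypothetical names (not the real `get_attn_impl_fn` / `get_rms_norm_fn`), they return labels rather than callables, and the set of values treated as "on" is an assumption, since the PR text does not say which values enable the switch.

```python
import os


def hf_impl_enabled():
    # Look the variable up on every call instead of caching it at import
    # time, so a test can flip it between constructing two models.
    # Assumption: "1", "true", and "yes" enable the switch.
    return os.environ.get("XTUNER_HF_IMPL", "0").lower() in {"1", "true", "yes"}


def select_attn_impl(default="flash_attention"):
    # Force the eager path when the switch is on.
    return "eager" if hf_impl_enabled() else default


def select_rms_norm_impl(default="triton"):
    # Force the native torch path when the switch is on.
    return "native" if hf_impl_enabled() else default


os.environ["XTUNER_HF_IMPL"] = "1"
print(select_attn_impl(), select_rms_norm_impl())
os.environ.pop("XTUNER_HF_IMPL", None)
print(select_attn_impl(), select_rms_norm_impl())
```

Reading the variable at call time costs a dictionary lookup per call, which is negligible next to model construction and keeps the toggle usable inside a single test process.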

[Test] Add Qwen3.5-4B dense parity and save_hf tests

Adds tests for decoder-layer bitwise parity (all linear + full layers and the
final-norm hidden state match HF bitwise under XTUNER_HF_IMPL), model forward
parity on the default flash path (within 1e-2), and a save_hf round-trip over
non-mtp keys (MTP is deferred). Reads the checkpoint from QWEN3_5_DENSE_4B_PATH.
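
The two parity levels mentioned above differ in kind: one demands identical bit patterns, the other accepts a bounded numerical gap. A dependency-free sketch of the distinction follows; the helper names are hypothetical, and treating "within 1e-2" as an absolute tolerance is an assumption. On tensors, `torch.equal` and `torch.testing.assert_close` are the usual counterparts.

```python
import struct


def bitwise_equal(a, b):
    # Compare float32 bit patterns; any difference in the packed bytes fails.
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}f", *a) == struct.pack(f"{len(b)}f", *b)


def within_tolerance(a, b, atol=1e-2):
    # Assumption: the 1e-2 bound is applied as an absolute tolerance.
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))


reference = [1.0, -0.5, 0.25]
close_but_not_exact = [1.001, -0.5, 0.25]

print(bitwise_equal(reference, reference))
print(bitwise_equal(reference, close_but_not_exact))
print(within_tolerance(reference, close_but_not_exact))
```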

[Fix] Select torch.compile fullgraph per layer type in Dense.fully_shard

Hybrid dense models (e.g. Qwen3.5) mix gated-MHA full-attention layers with
GatedDeltaNet linear-attention layers. GatedDeltaNet writes seq_ctx.seq_idx
inside the activation-checkpoint region; compiling that layer with
fullgraph=True turns the checkpoint into a HigherOrderOperator that rejects the
side effect (torch._dynamo SideEffects). Pick fullgraph per layer type so linear
layers compile with fullgraph=False (the write graph-breaks) while full-attention
layers keep fullgraph=True. No-op for pure-MHA dense models (all full_attention).
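
The per-layer selection described above reduces to a small mapping from layer type to the `fullgraph` flag. A minimal sketch, assuming the layer types are the strings `"full_attention"` and `"linear_attention"` (the latter label is a guess) and using hypothetical helper names; the resulting value would be passed as `torch.compile(layer, fullgraph=...)`.

```python
def fullgraph_for(layer_type):
    # Linear-attention layers must be allowed to graph-break, because they
    # write to shared context inside the checkpointed region.
    return layer_type != "linear_attention"


def compile_kwargs_per_layer(layer_types):
    # One kwargs dict per layer, ready to be splatted into the compile call.
    return [{"fullgraph": fullgraph_for(t)} for t in layer_types]


hybrid = ["linear_attention"] * 3 + ["full_attention"]
print(compile_kwargs_per_layer(hybrid))

# A pure-MHA model keeps fullgraph=True everywhere, so nothing changes.
print(compile_kwargs_per_layer(["full_attention"] * 2))
```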

HAOCHENYE changed the title from "[Feature] Support training with Qwen3.5B" to "[Feature] Support training with Qwen3.5-4B" on May 26, 2026