[Feature] Support training with Qwen3.5-4B by HAOCHENYE · Pull Request #1844 · InternLM/xtuner

HAOCHENYE · 2026-05-26T23:01:59Z

[Fix] Forward rms_norm_type to DenseDecoderLayer in Dense.build_layers
[Feature] Add Qwen3.5-4B dense VLM support
[Feature] Add XTUNER_HF_IMPL op-level switch for HF-exact parity
[Test] Add Qwen3.5-4B dense parity and save_hf tests
[Fix] Select torch.compile fullgraph per layer type in Dense.fully_shard

Dense.build_layers omitted rms_norm_type when constructing DenseDecoderLayer, so a zero_centered dense model silently used the default RMSNorm for the per-layer input/post norms. MoE.build_layers already forwards it. No-op for existing dense models, which use the default rms_norm_type.
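The bug pattern described above can be illustrated with a minimal, self-contained sketch. `LayerSketch`, `build_layers_before`, and `build_layers_after` are hypothetical stand-ins, not XTuner's actual classes or signatures, and the literal string "default" for the default norm type is an assumption.

```python
from dataclasses import dataclass


@dataclass
class LayerSketch:
    # Hypothetical stand-in for DenseDecoderLayer; only the norm type matters here.
    rms_norm_type: str = "default"  # assumed name of the default norm type


def build_layers_before(num_layers: int, rms_norm_type: str) -> list[LayerSketch]:
    # Bug pattern: the argument is accepted but never forwarded,
    # so every layer silently falls back to the default norm.
    return [LayerSketch() for _ in range(num_layers)]


def build_layers_after(num_layers: int, rms_norm_type: str) -> list[LayerSketch]:
    # Fix pattern: forward the configured norm type to each layer.
    return [LayerSketch(rms_norm_type=rms_norm_type) for _ in range(num_layers)]
```

With the default norm type both builders produce identical layers, which is why the fix is a no-op for existing dense models.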

Qwen3.5-4B is a dense VLM whose text tower is a hybrid of GatedDeltaNet linear attention and gated MHA full attention (every 4th layer), reusing the existing Dense body's per-layer attention dispatch. Adds the dense text tower (Qwen3_5_VLTextDense + config), the Qwen3_5_VLDense4BConfig compose config (reusing the existing Qwen3.5 vision/projector towers), and registration. MTP is deferred: the checkpoint's mtp.* keys are skipped on load (matching HF) and not re-saved. See docs/design/model/qwen3_5_dense_4b.md.
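As a rough illustration of the hybrid layout and the deferred MTP handling, the sketch below derives a per-layer type list and filters `mtp.*` keys from a state dict. The helper names, the "linear_attention" label, and the assumption that the full-attention layers are the 4th, 8th, 12th, ... (1-indexed) are illustrative only; the actual config fields in this PR may differ.

```python
def layer_types(num_layers: int, full_attention_interval: int = 4) -> list[str]:
    # Assumption: every `full_attention_interval`-th layer (1-indexed) uses
    # full attention; all other layers use GatedDeltaNet linear attention.
    return [
        "full_attention" if (i + 1) % full_attention_interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]


def drop_mtp_keys(state_dict: dict) -> dict:
    # Skip MTP weights on load: keep every key that is not under the "mtp." prefix.
    return {k: v for k, v in state_dict.items() if not k.startswith("mtp.")}
```

For example, `layer_types(8)` yields three linear-attention layers followed by one full-attention layer, twice.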

XTUNER_HF_IMPL selects HuggingFace-exact op implementations at the ops layer so decoder layers can be aligned bitwise against transformers. get_attn_impl_fn forces the eager attention path (XTuner's eager_attention matches HF's eager_attention_forward bitwise: fp32 softmax plus a dense causal mask), and get_rms_norm_fn forces the native torch path over triton. Both read the environment variable live so tests can toggle it per model instance. Not intended for production training.
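The "read live" behaviour can be sketched as follows. This is not the code from the PR: the helper names, the returned labels, and the convention that the value "1" enables the switch are all assumptions made for illustration.

```python
import os


def use_hf_impl() -> bool:
    # Read the variable on every call rather than caching it at import time,
    # so a test can flip it between model instances.
    # Assumption: the value "1" enables the switch.
    return os.environ.get("XTUNER_HF_IMPL", "0") == "1"


def select_attn_impl(default: str = "flash") -> str:
    # Hypothetical selector: force the eager path when the switch is on.
    return "eager" if use_hf_impl() else default


def select_rms_norm_impl(default: str = "triton") -> str:
    # Hypothetical selector: force the native torch path when the switch is on.
    return "native" if use_hf_impl() else default
```

Because nothing is cached, setting `os.environ["XTUNER_HF_IMPL"] = "1"` before building one model and removing it before building the next changes the selection without reloading any module.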

Adds decoder-layer bitwise parity (all linear + full layers and final-norm hidden match HF bitwise under XTUNER_HF_IMPL), model forward parity on the default flash path (within 1e-2), and a save_hf round-trip over non-mtp keys (MTP is deferred). Reads the checkpoint from QWEN3_5_DENSE_4B_PATH.
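The two parity levels mentioned above differ in strictness. A minimal sketch using plain Python floats instead of tensors; the helper names are hypothetical, and treating 1e-2 as an absolute tolerance is an assumption, since the description does not say whether the bound is absolute or relative.

```python
import struct


def bitwise_equal(a: list[float], b: list[float]) -> bool:
    # Compare the raw IEEE-754 bytes, so even a one-ulp difference fails.
    if len(a) != len(b):
        return False
    return struct.pack(f"{len(a)}d", *a) == struct.pack(f"{len(b)}d", *b)


def within_tolerance(a: list[float], b: list[float], atol: float = 1e-2) -> bool:
    # Looser check: every element may differ by at most `atol`
    # (assumed to be an absolute tolerance).
    return len(a) == len(b) and all(abs(x - y) <= atol for x, y in zip(a, b))
```

Two outputs that differ only in the last few bits pass `within_tolerance` but fail `bitwise_equal`, which is the gap between the flash-path check and the XTUNER_HF_IMPL check.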

Hybrid dense models (e.g. Qwen3.5) mix gated-MHA full-attention layers with GatedDeltaNet linear-attention layers. GatedDeltaNet writes seq_ctx.seq_idx inside the activation-checkpoint region; compiling that layer with fullgraph=True turns the checkpoint into a HigherOrderOperator that rejects the side effect (torch._dynamo SideEffects). Pick fullgraph per layer type so linear layers compile with fullgraph=False (the write graph-breaks) while full-attention layers keep fullgraph=True. No-op for pure-MHA dense models (all full_attention).
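The per-layer selection rule can be sketched without torch as a pure function from layer type to the fullgraph flag. The function names and the "linear_attention" label are assumptions; in the PR the resulting flag would be passed to torch.compile inside Dense.fully_shard.

```python
def fullgraph_for(layer_type: str) -> bool:
    # Full-attention layers have no side effect inside the checkpoint region,
    # so they can keep fullgraph=True.
    if layer_type == "full_attention":
        return True
    # Linear-attention layers write seq_ctx.seq_idx, which must be allowed
    # to graph-break, so they compile with fullgraph=False.
    if layer_type == "linear_attention":
        return False
    raise ValueError(f"unknown layer type: {layer_type!r}")


def compile_plan(layer_types: list[str]) -> list[bool]:
    # One fullgraph flag per layer, in layer order.
    return [fullgraph_for(t) for t in layer_types]
```

For a pure-MHA model every entry in the plan is True, matching the no-op claim above.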

HAOCHENYE added 5 commits May 26, 2026 21:02

HAOCHENYE changed the title ~~[Feature] Support training with Qwen3.5B~~ [Feature] Support training with Qwen3.5-4B May 26, 2026

[Docs] Update claude code add hf model skills

d697761

hhaAndroid approved these changes May 27, 2026

HAOCHENYE wants to merge 6 commits into InternLM:main from HAOCHENYE:qwen3_5-4B

2 participants
