[Feature] Support training with Qwen3.5-4B #1844
Open
HAOCHENYE wants to merge 6 commits into
Collaborator
HAOCHENYE
commented
May 26, 2026
- [Fix] Forward rms_norm_type to DenseDecoderLayer in Dense.build_layers
- [Feature] Add Qwen3.5-4B dense VLM support
- [Feature] Add XTUNER_HF_IMPL op-level switch for HF-exact parity
- [Test] Add Qwen3.5-4B dense parity and save_hf tests
- [Fix] Select torch.compile fullgraph per layer type in Dense.fully_shard
Dense.build_layers omitted rms_norm_type when constructing DenseDecoderLayer, so a zero_centered dense model silently used the default RMSNorm for the per-layer input/post norms. MoE.build_layers already forwards it. No-op for existing dense models, which use the default rms_norm_type.
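A minimal sketch of the failure mode described above, using stand-in classes rather than XTuner's real ones. `LayerSketch`, `build_layers_buggy`, `build_layers_fixed`, and the `"default"` value are hypothetical names chosen for illustration. Only `rms_norm_type` and `zero_centered` come from the commit message.

```python
from dataclasses import dataclass


@dataclass
class LayerSketch:
    """Stand-in for a decoder layer; it only records which norm variant it got."""

    layer_idx: int
    rms_norm_type: str = "default"  # assumed name for the default variant


def build_layers_buggy(num_layers: int, rms_norm_type: str) -> list[LayerSketch]:
    # rms_norm_type is accepted but never passed on, so every layer
    # silently falls back to the default variant. No error is raised.
    return [LayerSketch(layer_idx=i) for i in range(num_layers)]


def build_layers_fixed(num_layers: int, rms_norm_type: str) -> list[LayerSketch]:
    # Forwarding the argument makes the layer norms follow the model config.
    return [
        LayerSketch(layer_idx=i, rms_norm_type=rms_norm_type)
        for i in range(num_layers)
    ]
```

Because the buggy builder still produces a working model, a check of this kind on the constructed layers is probably the cheapest way to catch the mismatch.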
Qwen3.5-4B is a dense VLM whose text tower is a hybrid of GatedDeltaNet linear attention and gated MHA full attention (every 4th layer), reusing the existing Dense body's per-layer attention dispatch. Adds the dense text tower (Qwen3_5_VLTextDense + config), the Qwen3_5_VLDense4BConfig compose config (reusing the existing Qwen3.5 vision/projector towers), and registration. MTP is deferred: the checkpoint's mtp.* keys are skipped on load (matching HF) and not re-saved. See docs/design/model/qwen3_5_dense_4b.md.
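The hybrid layout and the MTP handling can be sketched as below. This is an illustration under assumptions, not the PR's code: the helper names, the `full_attention_interval` parameter, and the choice that the full-attention layers sit at 0-based indices 3, 7, 11, ... are all assumed. Only the "every 4th layer" rule and the `mtp.` key prefix come from the description.

```python
def layer_types(num_layers: int, full_attention_interval: int = 4) -> list[str]:
    """Return the attention kind per layer for a hybrid text tower.

    Assumption: the last layer of each group of `full_attention_interval`
    layers is the full-attention one.
    """
    return [
        "full_attention"
        if (i + 1) % full_attention_interval == 0
        else "linear_attention"
        for i in range(num_layers)
    ]


def drop_mtp_keys(state_dict: dict) -> dict:
    """Skip checkpoint entries whose key starts with 'mtp.'."""
    return {k: v for k, v in state_dict.items() if not k.startswith("mtp.")}
```

With this layout, `layer_types(8)` yields three linear-attention layers followed by one full-attention layer, twice over.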
XTUNER_HF_IMPL selects HuggingFace-exact op implementations at the ops layer so decoder layers can be aligned bitwise against transformers: get_attn_impl_fn forces the eager attention path (XTuner's eager_attention matches HF's eager_attention_forward bitwise, fp32 softmax + dense causal mask), and get_rms_norm_fn forces the native torch path over triton. Both read the env var live so tests can toggle it per model instance. Not for production training.
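A rough sketch of what reading the switch live could look like. The function names, the returned labels, and the assumption that the value `"1"` enables the switch are all hypothetical. The real `get_attn_impl_fn` and `get_rms_norm_fn` signatures are not shown in this description.

```python
import os


def hf_impl_enabled() -> bool:
    # Read the variable on every call instead of caching it at import time,
    # so a test can flip it between two model instances.
    # Assumption: "1" turns the switch on; the accepted values may differ.
    return os.environ.get("XTUNER_HF_IMPL", "0") == "1"


def select_rms_norm_impl() -> str:
    # HF-exact mode prefers the native torch path over the triton kernel.
    return "native" if hf_impl_enabled() else "triton"


def select_attn_impl(configured: str = "flash_attention") -> str:
    # HF-exact mode forces eager attention regardless of the configured path.
    return "eager" if hf_impl_enabled() else configured
```

Reading the variable per call trades a small lookup cost for testability, which fits a switch that is not intended for production training.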
Adds decoder-layer bitwise parity (under XTUNER_HF_IMPL, every linear-attention and full-attention layer and the final-norm hidden state match HF bitwise), model forward parity on the default flash path (within 1e-2), and a save_hf round-trip over non-mtp keys (MTP is deferred). The tests read the checkpoint from QWEN3_5_DENSE_4B_PATH.
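The two acceptance criteria differ in kind: the decoder-layer check demands exact equality, while the flash-path check allows a tolerance. A plain-Python sketch of both, over flat lists of floats instead of tensors (the helper names are hypothetical, and the actual tests presumably compare tensors):

```python
def exact_match(a: list[float], b: list[float]) -> bool:
    """True only if every element is identical; no tolerance is applied."""
    return len(a) == len(b) and all(x == y for x, y in zip(a, b))


def max_abs_diff(a: list[float], b: list[float]) -> float:
    """Largest element-wise absolute difference between two equal-length lists."""
    return max(abs(x - y) for x, y in zip(a, b))


def within_tolerance(a: list[float], b: list[float], atol: float = 1e-2) -> bool:
    """Looser check of the kind used when kernels differ numerically."""
    return len(a) == len(b) and max_abs_diff(a, b) <= atol
```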
Hybrid dense models (e.g. Qwen3.5) mix gated-MHA full-attention layers with GatedDeltaNet linear-attention layers. GatedDeltaNet writes seq_ctx.seq_idx inside the activation-checkpoint region; compiling that layer with fullgraph=True turns the checkpoint into a HigherOrderOperator that rejects the side effect (torch._dynamo SideEffects). Pick fullgraph per layer type so linear layers compile with fullgraph=False (the write causes a graph break) while full-attention layers keep fullgraph=True. No-op for pure-MHA dense models, where every layer is full_attention.
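The per-layer selection can be sketched as follows. This is a simplified illustration, not the body of `Dense.fully_shard`: `compile_layers` and the `(layer_type, layer)` pair layout are hypothetical, and the compile function is injected so the sketch runs without PyTorch installed.

```python
from typing import Any, Callable


def compile_layers(
    layers: list[tuple[str, Any]],
    compile_fn: Callable[..., Any],
) -> list[Any]:
    """Compile each layer with a fullgraph flag chosen from its layer type."""
    compiled = []
    for layer_type, layer in layers:
        # Only full-attention layers are free of the in-checkpoint write,
        # so only they can be compiled as a single graph.
        fullgraph = layer_type == "full_attention"
        compiled.append(compile_fn(layer, fullgraph=fullgraph))
    return compiled
```

In real use, `compile_fn` would likely be `torch.compile`, which accepts a `fullgraph` keyword argument. A model whose layers are all `full_attention` gets `fullgraph=True` everywhere, which is why the change is a no-op for pure-MHA models.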
hhaAndroid
approved these changes
May 27, 2026