
fix: Mixtral compatibility with transformers >=5.0 fused MoE API #451

Open
Medhatt21 wants to merge 1 commit into ModelTC:main from Medhatt21:fix/mixtral-transformers-5x-compat

Conversation

@Medhatt21

Summary

Fixes Mixtral quantization, which is broken by the transformers 5.x API changes. Loading a Mixtral model currently crashes with:

'MixtralDecoderLayer' object has no attribute 'block_sparse_moe'

Problem

In transformers >=5.0, the Mixtral architecture changed significantly:

| Component | transformers <5.0 | transformers >=5.0 |
| --- | --- | --- |
| MoE container | block.block_sparse_moe | block.mlp |
| Experts | ModuleList of MixtralBLockSparseTop2MLP (each with w1, w2, w3 as nn.Linear) | MixtralExperts with fused 3D nn.Parameter tensors (gate_up_proj, down_proj) |
| Gate | nn.Linear | MixtralTopKRouter |
| Expert access | experts[i].w1 | Not subscriptable |
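The structural difference in the table can be sketched with minimal stand-in classes. These are hypothetical mocks for illustration only; the real classes (MixtralDecoderLayer, MixtralExperts, MixtralTopKRouter) live in transformers, and the real attributes are nn.Module/nn.Parameter objects rather than strings.

```python
class LegacyExpert:
    """transformers <5.0: one module per expert with w1/w2/w3 layers."""
    def __init__(self):
        self.w1 = "w1-linear"  # nn.Linear in the real model
        self.w2 = "w2-linear"
        self.w3 = "w3-linear"

class LegacyMoE:
    """transformers <5.0: subscriptable list of per-expert modules."""
    def __init__(self, n_experts=8):
        self.gate = "gate-linear"  # nn.Linear
        self.experts = [LegacyExpert() for _ in range(n_experts)]

class FusedExperts:
    """transformers >=5.0: fused 3D parameter tensors, not subscriptable."""
    def __init__(self):
        self.gate_up_proj = "3d-parameter"  # nn.Parameter [n_experts, ...]
        self.down_proj = "3d-parameter"

class FusedMoE:
    """transformers >=5.0: router gate plus fused experts."""
    def __init__(self):
        self.gate = "topk-router"  # MixtralTopKRouter
        self.experts = FusedExperts()

legacy, fused = LegacyMoE(), FusedMoE()
print(legacy.experts[0].w1)                    # per-expert access works pre-5.0
print(hasattr(fused.experts, "__getitem__"))   # False: fused experts cannot be indexed
```

This is why code doing `experts[i].w1` cannot work unchanged on the fused API.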

Fix

  • Detect old vs new API via hasattr(block, 'block_sparse_moe')
  • get_extra_modules(): returns the MoE container from the appropriate attribute
  • get_moe_gate(): new method to access the gate regardless of API version
  • get_subsets_in_block(): dispatches to _get_subsets_legacy() (old per-expert w1/w2/w3) or _get_subsets_fused() (new fused experts)
  • For the fused API, attention layers are quantized per-subset; the MoE block is passed via get_extra_modules for activation-aware hooks
  • Legacy path preserved unchanged for older transformers versions

Test plan

  • Verify mistralai/Mixtral-8x7B-v0.1 loads without error on transformers 5.x
  • Verify attention layers are quantized correctly (GPTQ/RTN/SmoothQuant)
  • Verify MoE block is passed through as extra_modules for SmoothQuant-style calibration
  • Verify backward compatibility on transformers <5.0 (legacy block_sparse_moe path)
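As a quick pre-flight for the test plan, one could map the installed transformers version to the expected MoE attribute before loading the model. This helper is a sketch, not part of the PR; the 5.0 threshold comes from the PR description, and the naive major-version parse assumes a standard version string.

```python
def moe_attr_for(version: str) -> str:
    """Which attribute holds the Mixtral MoE container for a given
    transformers version string (API change at 5.0 per the PR)."""
    major = int(version.split(".")[0])
    return "block_sparse_moe" if major < 5 else "mlp"

print(moe_attr_for("4.46.0"))  # block_sparse_moe
print(moe_attr_for("5.0.0"))   # mlp
```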

Made with Cursor

In transformers 5.x, MixtralDecoderLayer renamed block_sparse_moe to
mlp, replaced the ModuleList of individual expert modules with a fused
MixtralExperts class (3D nn.Parameter tensors), and changed the gate
from nn.Linear to MixtralTopKRouter.

This broke all Mixtral quantization with:
  'MixtralDecoderLayer' object has no attribute 'block_sparse_moe'

Fix:
- Add _has_legacy_moe() to detect old vs new API via hasattr
- get_extra_modules: returns block_sparse_moe (old) or mlp (new)
- get_moe_gate: returns the gate from the appropriate MoE container
- get_subsets_in_block: dispatches to _get_subsets_legacy (old per-expert
  w1/w2/w3 Linear modules) or _get_subsets_fused (new fused experts)
- For the fused API, attention layers are quantized per-subset; the MoE
  block is passed as extra_modules for activation-aware hooks

The legacy path is preserved unchanged for older transformers versions.
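The subset dispatch the commit describes might look like the sketch below. The subset dicts and layer-name strings are illustrative guesses at the shape of such data; only the dispatch logic and the old-vs-new expert layout are taken from the commit message, and llmc's real subset format may differ.

```python
def get_subsets_in_block(block, legacy: bool):
    # Dispatch on API version, as the commit describes.
    return _get_subsets_legacy(block) if legacy else _get_subsets_fused(block)

def _get_subsets_legacy(block, n_experts=8):
    # Old API: each expert is its own module with w1/w2/w3 nn.Linear
    # layers, so every expert contributes quantizable linear subsets.
    subsets = [
        {"layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]},
        {"layers": ["self_attn.o_proj"]},
    ]
    for i in range(n_experts):  # 8 experts in Mixtral-8x7B
        subsets.append({"layers": [f"block_sparse_moe.experts.{i}.w1",
                                   f"block_sparse_moe.experts.{i}.w3"]})
        subsets.append({"layers": [f"block_sparse_moe.experts.{i}.w2"]})
    return subsets

def _get_subsets_fused(block):
    # New API: experts are fused 3D parameters, so only attention layers
    # are quantized per-subset; the MoE block is instead exposed via
    # get_extra_modules for activation-aware hooks.
    return [
        {"layers": ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]},
        {"layers": ["self_attn.o_proj"]},
    ]

print(len(get_subsets_in_block(None, legacy=True)))   # 18
print(len(get_subsets_in_block(None, legacy=False)))  # 2
```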

Made-with: Cursor