Rough implementation of the DeepSeek-V2 paper, including Multi-Head Latent Attention (MLA) with decoupled RoPE, and the DeepSeekMoE FFN replacement with shared + routed experts (using top-k routing). Minimal sketches of both pieces are below.
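Here's roughly what the MLA part looks like: a minimal PyTorch sketch, not the exact code in this repo. The class/function names (`MLA`, `rope`) and all dimensions are made up for illustration; it keeps the paper's structure of low-rank query/KV latents plus a separate "decoupled" RoPE path (per-head RoPE queries, one shared RoPE key per token).

```python
# Minimal MLA sketch (assumed names/dims, single unbatched sequence, no KV cache).
import math
import torch
import torch.nn as nn

def rope(x: torch.Tensor) -> torch.Tensor:
    # Standard rotary embedding over the last dim; x is (..., seq, dim), dim even.
    seq, dim = x.shape[-2], x.shape[-1]
    half = dim // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype, device=x.device) / half))
    angles = torch.arange(seq, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

class MLA(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_c=64, d_cq=192, d_head=64, d_rope=32):
        super().__init__()
        self.n_heads, self.d_head, self.d_rope = n_heads, d_head, d_rope
        # Query path: down-project to a query latent, then up-project per head.
        self.w_dq = nn.Linear(d_model, d_cq, bias=False)
        self.w_uq = nn.Linear(d_cq, n_heads * d_head, bias=False)
        self.w_qr = nn.Linear(d_cq, n_heads * d_rope, bias=False)  # decoupled RoPE queries
        # KV path: one low-rank latent c_kv per token is all that needs caching.
        self.w_dkv = nn.Linear(d_model, d_c, bias=False)
        self.w_uk = nn.Linear(d_c, n_heads * d_head, bias=False)
        self.w_uv = nn.Linear(d_c, n_heads * d_head, bias=False)
        self.w_kr = nn.Linear(d_model, d_rope, bias=False)  # one shared RoPE key per token
        self.w_o = nn.Linear(n_heads * d_head, d_model, bias=False)

    def forward(self, h):  # h: (seq, d_model)
        seq = h.shape[0]
        c_q = self.w_dq(h)
        q_c = self.w_uq(c_q).view(seq, self.n_heads, self.d_head)
        q_r = rope(self.w_qr(c_q).view(seq, self.n_heads, self.d_rope)
                   .transpose(0, 1)).transpose(0, 1)
        c_kv = self.w_dkv(h)
        k_c = self.w_uk(c_kv).view(seq, self.n_heads, self.d_head)
        v = self.w_uv(c_kv).view(seq, self.n_heads, self.d_head)
        k_r = rope(self.w_kr(h))  # (seq, d_rope), shared across all heads
        # Concatenate content + positional parts; broadcast the shared RoPE key to every head.
        q = torch.cat([q_c, q_r], dim=-1)
        k = torch.cat([k_c, k_r[:, None, :].expand(-1, self.n_heads, -1)], dim=-1)
        scores = torch.einsum("qhd,khd->hqk", q, k) / math.sqrt(self.d_head + self.d_rope)
        mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool, device=h.device), diagonal=1)
        out = torch.einsum("hqk,khd->qhd", scores.masked_fill(mask, float("-inf")).softmax(-1), v)
        return self.w_o(out.reshape(seq, -1))

out = MLA()(torch.randn(16, 512))  # smoke test
```

The point of the decoupled path (per the paper) is that RoPE is position-dependent, so it can't be folded into the low-rank up-projections; keeping it on a small separate dimension means inference only has to cache `c_kv` and `k_r` per token instead of full per-head keys/values.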
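And here's the DeepSeekMoE FFN part: again a hedged sketch rather than this repo's exact code. The names (`Expert`, `MoE`) and the expert counts / dims are illustrative; the structure matches the paper's formulation, where shared experts run on every token and routed experts are selected top-k per token, weighted by the router's softmax scores.

```python
# Minimal shared + routed MoE sketch (assumed names/dims); token-level top-k routing.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Expert(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden, bias=False)
        self.down = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.up(x)))

class MoE(nn.Module):
    def __init__(self, d_model=512, d_hidden=256, n_shared=2, n_routed=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.shared = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_shared))
        self.routed = nn.ModuleList(Expert(d_model, d_hidden) for _ in range(n_routed))
        self.router = nn.Linear(d_model, n_routed, bias=False)

    def forward(self, x):  # x: (tokens, d_model)
        # Shared experts always process every token.
        out = sum(e(x) for e in self.shared)
        # Routed experts: top-k per token, weighted by the router's softmax probabilities.
        probs = self.router(x).softmax(dim=-1)         # (tokens, n_routed)
        weights, idx = probs.topk(self.top_k, dim=-1)  # (tokens, top_k)
        for i, expert in enumerate(self.routed):
            token_ids, slot = (idx == i).nonzero(as_tuple=True)
            if token_ids.numel():
                out[token_ids] += weights[token_ids, slot, None] * expert(x[token_ids])
        return x + out  # residual, as in the paper's formulation
```

Looping over experts and gathering just the tokens routed to each one keeps every expert's forward dense, which is the usual trick for making top-k routing cheap without custom kernels.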
One thing I didn't do (for later): the paper adds an auxiliary loss (on top of the usual cross-entropy loss) to keep the MoE router from sending tokens to the same few experts (basically a load balancer). A sketch of what that would look like is below.
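For reference, here's a hedged sketch of the expert-level balance loss from the paper (there are also device- and communication-level variants I'm ignoring): `f` is each expert's share of routed tokens, scaled so a perfectly uniform router gives `f = 1` everywhere, and `p` is the router's mean softmax probability per expert. The function name and the default `alpha` are my choices, not this repo's.

```python
# Expert-level balance loss sketch: alpha * sum_i f_i * P_i.
import torch

def expert_balance_loss(probs: torch.Tensor, idx: torch.Tensor, n_routed: int,
                        top_k: int, alpha: float = 0.003) -> torch.Tensor:
    # probs: (tokens, n_routed) router softmax; idx: (tokens, top_k) selected experts.
    tokens = probs.shape[0]
    counts = torch.zeros(n_routed, dtype=probs.dtype, device=probs.device)
    counts.scatter_add_(0, idx.reshape(-1),
                        torch.ones(tokens * top_k, dtype=probs.dtype, device=probs.device))
    f = counts * n_routed / (top_k * tokens)  # load fraction; uniform routing gives all-ones
    p = probs.mean(dim=0)                     # mean router probability per expert
    return alpha * (f * p).sum()              # uniform routing => loss == alpha
```

Note the token counts in `f` are non-differentiable; the gradient flows through `p`, which is enough to push the router toward a balanced assignment. You'd just add this (once per MoE layer) to the cross-entropy loss during training.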