issue/349 - Support GLM4 model by rubik-hua · Pull Request #370 · InfiniTensor/InfiniLM

rubik-hua · 2026-05-12T08:58:02Z

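Following the review comments on #352, I refactored the change. The points addressed are:
(1) Adding a new model should not modify existing model code; do not modify the code under llama_legacy.
(2) Remove the changes to config_factory.cpp and rank_worker.cpp.
(3) Following the existing implementations (outside the llama_legacy folder), mlp, model, and causal_lm should be able to use the existing modules.
(4) Add the following files under glm4: glm4_decoder_layer.cpp/hpp and glm4_for_causal_lm.cpp/hpp. If necessary, add glm4_attention.cpp/hpp to handle the RoPE changes.
(5) In csrc/models/glm4/glm4_for_causal_lm.cpp, define a dedicated Glm4ForCausalLM class; do not use infinilm::models::llama::LlamaForCausalLM.
(6) RoPE type: extend the get_rope function in https://github.com/InfiniTensor/InfiniLM/blob/main/csrc/layers/rotary_embedding/rotary_embedding.cpp so that it handles the GPT_J type and the "partial_rotary_factor" hyperparameter.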
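
To make item (6) concrete, here is a minimal sketch of how the GPT_J and GPT_NEOX conventions are commonly understood to differ, and what a partial rotary dimension means. It is not the InfiniLM or infinicore implementation: `RopeAlgo`, `apply_rope`, and the `theta` default are hypothetical names chosen for illustration, and the function works on a plain `std::vector<float>` holding one head.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the two pairing conventions (not the infinicore enum).
enum class RopeAlgo { GPT_J, GPT_NEOX };

// Rotate the first `rotary_dim` elements of one head vector in place.
// Assumes rotary_dim is even and rotary_dim <= x.size().
// Elements at index >= rotary_dim pass through unchanged (partial rotary).
inline void apply_rope(std::vector<float> &x, std::size_t rotary_dim,
                       std::size_t position, RopeAlgo algo, float theta = 10000.0f) {
    const std::size_t half = rotary_dim / 2;
    for (std::size_t i = 0; i < half; ++i) {
        // GPT_J pairs adjacent elements (2i, 2i+1); GPT_NEOX pairs (i, i + half).
        const std::size_t a = (algo == RopeAlgo::GPT_J) ? 2 * i : i;
        const std::size_t b = (algo == RopeAlgo::GPT_J) ? 2 * i + 1 : i + half;
        const float freq = std::pow(theta, -2.0f * static_cast<float>(i) / static_cast<float>(rotary_dim));
        const float angle = static_cast<float>(position) * freq;
        const float c = std::cos(angle);
        const float s = std::sin(angle);
        const float xa = x[a];
        const float xb = x[b];
        x[a] = xa * c - xb * s;
        x[b] = xa * s + xb * c;
    }
}
```

Under these assumptions, the only difference between the two algorithms is which index pairs are rotated together, and a partial_rotary_factor below 1 simply shrinks `rotary_dim` so that the tail of each head is left untouched.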

1. Screenshot of the test_infer.py model test:
Command: python examples/test_infer.py --device nvidia --model=/data/rubik/models/GLM-4-9B-0414/

2. Starting the inference service
Command: python python/infinilm/server/inference_server.py --device nvidia --model=/data/rubik/models/GLM-4-9B-0414/

Client command: python scripts/test_perf.py --verbose
Screenshot of part of the client output:

In addition, the code in csrc/layers/rotary_embedding/rotary_embedding.cpp was modified. The algo parameter defaults to infinicore::nn::RoPE::Algo algo = infinicore::nn::RoPE::Algo::GPT_NEOX, so the logic is the same as before.
Two models that previously worked were run to verify this:


pengcheng888 · 2026-05-12T10:51:24Z

+    std::shared_ptr<infinilm::layers::attention::AttentionLayer> attn_;
+    ::infinilm::backends::AttentionBackend attention_backend_;
+    std::shared_ptr<infinicore::nn::RoPE> rotary_emb_;
+    std::shared_ptr<infinicore::nn::RMSNorm> norm_;


The norm_ member does not appear to be used; please remove it.

pengcheng888 · 2026-05-12T10:51:44Z

+
+    infinicore::Tensor forward(const infinicore::Tensor &positions, infinicore::Tensor &hidden_states);
+
+    void set_rotary_emb(const std::shared_ptr<infinicore::nn::RoPE> &rotary_emb);


Glm4Attention::set_rotary_emb does not appear to be used; please remove it.

pengcheng888 · 2026-05-12T10:54:29Z

+
+class Glm4ForCausalLM : public InfinilmModel {
+public:
+    Glm4ForCausalLM(std::shared_ptr<infinilm::config::ModelConfig> model_config,


The ForCausalLM object no longer needs the rank_info and attention_backend parameters; please remove them.

pengcheng888 · 2026-05-12T10:56:05Z

+    return {logits};
+}
+
+void Glm4ForCausalLM::reset_cache(const cache::CacheConfig *cache_config) {


InfinilmModel provides a default implementation of reset_cache, so there is no need to override it. Please remove Glm4ForCausalLM::reset_cache.

pengcheng888 · 2026-05-12T10:59:10Z

+
+using Glm4Model = infinilm::layers::causal_lm_templates::TextModel<Glm4DecoderLayer>;
+
+class Glm4ForCausalLM : public InfinilmModel {


Consider reusing the existing modules to reduce redundant code.

When defining Glm4ForCausalLM, consider using the following code:
using Glm4ForCausalLM = infinilm::layers::causal_lm_templates::TextCausalLM;
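
A rough sketch of the alias-based reuse suggested here, using hypothetical stand-in templates in a `sketch` namespace. The comment above names `TextCausalLM` without template arguments; the sketch assumes it is parameterized on the model type, which may not match the real `causal_lm_templates` API.

```cpp
#include <string>

namespace sketch {

// Hypothetical generic building blocks standing in for the shared templates.
template <typename DecoderLayer>
struct TextModel {
    std::string describe() const {
        return std::string("TextModel<") + DecoderLayer::name() + ">";
    }
};

template <typename Model>
struct TextCausalLM {
    Model model;
    std::string describe() const {
        return "TextCausalLM<" + model.describe() + ">";
    }
};

// The only model-specific piece: the decoder layer.
struct Glm4DecoderLayer {
    static const char *name() { return "Glm4DecoderLayer"; }
};

// Both higher-level types become aliases; no new class body is written.
using Glm4Model = TextModel<Glm4DecoderLayer>;
using Glm4ForCausalLM = TextCausalLM<Glm4Model>;

} // namespace sketch
```

With this shape, constructor arguments, forward, and cache handling all live in the shared templates, and the GLM4 directory only has to supply what differs in its decoder layer.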

pengcheng888 · 2026-05-12T11:01:39Z

+        import json
+        with open(config_path, "r") as f:
+            hf_config = json.load(f)
+            model_type = hf_config.get("model_type", "")


model_type should be available from the model.hf_config variable. Please remove the code that reads config.json.

pengcheng888 · 2026-05-12T11:03:13Z

+        raise ValueError(f"Cannot split {name} with shape {tensor.shape}")
+    return torch.split(tensor, sizes, dim=0)
+
+def _remap_glm4_weights(state_dict):


Could _split_first_dim be moved inside _remap_glm4_weights, so that it is used only internally by _remap_glm4_weights?

pengcheng888 · 2026-05-12T11:03:59Z

+
+    auto q_in = infinicore::Tensor::empty({batch_size, num_attention_heads_, seq_len, head_dim_}, q->dtype(), q->device())
+                    ->permute({0, 2, 1, 3});


Why are the new variables q_in, k_in, and v_in defined here and filled with copy_from?

Is it because the dimensions do not match?

pengcheng888 · 2026-05-12T11:05:06Z

+
+namespace infinilm::models::glm4 {
+
+std::vector<infinicore::Tensor> glm4_allocate_kv_cache_tensors(


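InfiniLM provides a default KV cache allocation function, default_allocate_kv_cache_tensors, in csrc/models/infinilm_model.cpp.

csrc/models/infinilm_model.cpp also provides a default implementation of the virtual function InfinilmModel::reset_cache, which calls default_allocate_kv_cache_tensors.

A model needs its own allocator only when default_allocate_kv_cache_tensors does not meet its requirements.

I compared glm4_allocate_kv_cache_tensors with default_allocate_kv_cache_tensors, and the key code is the same.
Please remove the glm4_allocate_kv_cache_tensors.cpp/hpp files to reduce redundant code.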
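
The pattern described above can be sketched as follows. All types here (`CacheConfigSketch`, `TensorSketch`, `ModelBase`, `Glm4Sketch`) are hypothetical stand-ins, and the allocator body is a placeholder rather than the real default_allocate_kv_cache_tensors logic.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-ins; the real InfiniLM types and signatures differ.
struct CacheConfigSketch {
    std::size_t num_layers;
    std::size_t max_tokens;
};
using TensorSketch = std::vector<float>;

// Shared allocator: one K and one V buffer per layer (placeholder logic).
inline std::vector<TensorSketch> default_allocate_kv_cache_tensors(const CacheConfigSketch &cfg) {
    return std::vector<TensorSketch>(cfg.num_layers * 2, TensorSketch(cfg.max_tokens, 0.0f));
}

class ModelBase {
public:
    virtual ~ModelBase() = default;

    // Default behaviour, inherited unless a model genuinely needs something else.
    virtual void reset_cache(const CacheConfigSketch &cfg) {
        kv_cache_ = default_allocate_kv_cache_tensors(cfg);
    }

    std::size_t cache_tensor_count() const { return kv_cache_.size(); }

protected:
    std::vector<TensorSketch> kv_cache_;
};

// No reset_cache override and no model-specific allocator.
class Glm4Sketch : public ModelBase {};
```

Calling reset_cache on a `Glm4Sketch` dispatches to the base implementation, which is why a model-specific copy of the allocator adds nothing when its key logic is identical.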

wooway777 · 2026-05-12T11:09:14Z

+    const auto &dtype{model_config->get_dtype()};
+
+    INFINICORE_NN_MODULE_INIT(model, model_config, device);
+    INFINICORE_NN_MODULE_INIT(lm_head, lm_head, hidden_size, vocab_size, false, dtype, device);


There seems to be an extra "lm_head, " here.

pengcheng888 · 2026-05-12T11:10:07Z

+// Compute the number of dimensions that actually take part in the rotation
+inline size_t get_rotary_dim(size_t head_dim, double partial_rotary_factor) {
+    if (partial_rotary_factor <= 0.0 || partial_rotary_factor >= 1.0) {
+        return head_dim;


Declare get_rotary_dim in the .hpp and move its implementation to rotary_embedding.cpp.
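
A possible shape for that split, shown in one block with the two files marked by comments. The quoted hunk stops after the fallback branch, so everything past `return head_dim;` is an assumption (scale, truncate, round down to an even count), and the namespace is omitted.

```cpp
#include <cstddef>

// --- rotary_embedding.hpp: declaration only ---
std::size_t get_rotary_dim(std::size_t head_dim, double partial_rotary_factor);

// --- rotary_embedding.cpp: definition ---
std::size_t get_rotary_dim(std::size_t head_dim, double partial_rotary_factor) {
    // Out-of-range factors fall back to rotating the whole head, as in the hunk.
    if (partial_rotary_factor <= 0.0 || partial_rotary_factor >= 1.0) {
        return head_dim;
    }
    // Assumption: scale and truncate, then round down to an even count so the
    // rotated elements can be paired.
    const auto scaled = static_cast<std::size_t>(static_cast<double>(head_dim) * partial_rotary_factor);
    return scaled - (scaled % 2);
}
```

Moving the body out of the header also means the `inline` keyword is no longer needed, since the definition then appears in exactly one translation unit.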

wooway777 · 2026-05-12T11:22:02Z

+        break;
+    }
+    case backends::AttentionBackend::PAGED_ATTN: {
+        auto paged_kv_cache_config = dynamic_cast<const cache::PagedKVCacheConfig *>(cache_config);


Adding --enable-paged-attn causes a segmentation fault (core dumped).
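
The cause of the crash is not established in this thread. One thing worth ruling out is an unchecked `dynamic_cast`, which yields a null pointer when the config object is not of the paged type. The sketch below uses hypothetical config types (`CacheConfigBase`, `StaticConfigSketch`, `PagedConfigSketch`) and a hypothetical helper, not the real `cache::` classes.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical polymorphic config hierarchy for illustration only.
struct CacheConfigBase {
    virtual ~CacheConfigBase() = default;
};
struct StaticConfigSketch : CacheConfigBase {
    std::size_t max_len = 0;
};
struct PagedConfigSketch : CacheConfigBase {
    std::size_t block_size = 16;
};

// Fail with a clear error instead of dereferencing a null pointer.
inline std::size_t paged_block_size(const CacheConfigBase *cache_config) {
    auto paged = dynamic_cast<const PagedConfigSketch *>(cache_config);
    if (paged == nullptr) {
        throw std::invalid_argument("PAGED_ATTN backend requires a paged KV cache config");
    }
    return paged->block_size;
}
```

If the real failure lies elsewhere, a guard like this would at least turn a mismatch between the backend flag and the cache config into a readable error.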

pengcheng888 · 2026-05-12T11:33:30Z

Please replace the Chinese comments with concise English comments, because some platforms do not render Chinese text fully.

pengcheng888 · 2026-05-13T01:45:56Z

+        INFINICORE_NN_PARAMETER_INIT(kv_cache_v_scale, ({1}, infinicore::DataType::F32, device, 0, 0, 1));
+    }
+}
+


This forward is for the static cache; please implement the paged path as well.

pengcheng888 · 2026-05-13T01:47:43Z

+
+namespace infinilm::models::glm4 {
+
+class Glm4Attention : public infinicore::nn::Module {


The Glm4Attention module name should match the file name; please rename glm4_attention_layer.cpp/hpp to glm4_attention.cpp/hpp.

rubik-hua requested a review from a team May 12, 2026 08:58

pengcheng888 reviewed May 12, 2026

wooway777 reviewed May 12, 2026

pengcheng888 reviewed May 12, 2026

wooway777 reviewed May 12, 2026

pengcheng888 reviewed May 13, 2026

