issue/349 - Support GLM4 model #370

Open
rubik-hua wants to merge 1 commit into InfiniTensor:main from rubik-hua:glm4

Conversation

@rubik-hua

Refactored according to the review comments in PR #352. Suggested changes for reference:
(1) Adding a new model should not modify existing model code; do not change the code in the llama_legacy folder.
(2) Please remove the changes to config_factory.cpp and rank_worker.cpp.
(3) Follow the existing implementations (outside the llama_legacy folder); the mlp, model, and causal_lm parts should be able to reuse existing modules.
(4) Please add the following files under glm4: glm4_decoder_layer.cpp/hpp and glm4_for_causal_lm.cpp/hpp. If necessary, add glm4_attention.cpp/hpp to handle the RoPE changes.
(5) In csrc/models/glm4/glm4_for_causal_lm.cpp, define a dedicated Glm4ForCausalLM class; do not use infinilm::models::llama::LlamaForCausalLM.
(6) RoPE type: extend the get_rope function in https://github.com/InfiniTensor/InfiniLM/blob/main/csrc/layers/rotary_embedding/rotary_embedding.cpp to handle the GPT_J algorithm and the "partial_rotary_factor" hyperparameter (see the sketch below).
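
For item (6), a minimal sketch of how the extended get_rope selection might look. This is only an illustration under assumptions: RoPE::Algo::GPT_J, the two config getters, and get_rotary_dim are inferred from the discussion in this PR, and the surrounding signature is invented; the real code in rotary_embedding.cpp will differ.

// Hypothetical sketch: pick the RoPE algorithm and rotary dimension from config.
// Assumes RoPE::Algo::GPT_J exists alongside GPT_NEOX, and that ModelConfig
// exposes getters similar to model_config->get_dtype() seen elsewhere in this PR.
std::shared_ptr<infinicore::nn::RoPE> get_rope(
    const infinilm::config::ModelConfig &config, size_t head_dim) {
    auto algo = infinicore::nn::RoPE::Algo::GPT_NEOX; // default keeps old behavior
    if (config.get_model_type() == "glm4") {          // hypothetical getter
        algo = infinicore::nn::RoPE::Algo::GPT_J;     // GLM4 uses GPT-J-style rotation
    }
    // "partial_rotary_factor" < 1.0 rotates only a prefix of each head's dims.
    double factor = config.get_partial_rotary_factor(); // hypothetical getter
    size_t rotary_dim = get_rotary_dim(head_dim, factor);
    return std::make_shared<infinicore::nn::RoPE>(rotary_dim, algo); // ctor args assumed
}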

1. test_infer.py test screenshot:
Command: python examples/test_infer.py --device nvidia --model=/data/rubik/models/GLM-4-9B-0414/
(screenshot)
2. Inference server startup
Command: python python/infinilm/server/inference_server.py --device nvidia --model=/data/rubik/models/GLM-4-9B-0414/
(screenshot)
Client command: python scripts/test_perf.py --verbose
Partial client output screenshots:
(screenshots)

In addition, because the code in csrc/layers/rotary_embedding/rotary_embedding.cpp was modified, the algo parameter defaults to infinicore::nn::RoPE::Algo::GPT_NEOX, so the logic stays the same as before.
Two previously working models were re-run for verification:
(screenshots)

@rubik-hua rubik-hua requested a review from a team May 12, 2026 08:58
std::shared_ptr<infinilm::layers::attention::AttentionLayer> attn_;
::infinilm::backends::AttentionBackend attention_backend_;
std::shared_ptr<infinicore::nn::RoPE> rotary_emb_;
std::shared_ptr<infinicore::nn::RMSNorm> norm_;
Collaborator

The norm_ variable does not appear to be used; please remove it.


infinicore::Tensor forward(const infinicore::Tensor &positions, infinicore::Tensor &hidden_states);

void set_rotary_emb(const std::shared_ptr<infinicore::nn::RoPE> &rotary_emb);
Collaborator

The Glm4Attention::set_rotary_emb function does not seem to be used anywhere; please remove it.


class Glm4ForCausalLM : public InfinilmModel {
public:
Glm4ForCausalLM(std::shared_ptr<infinilm::config::ModelConfig> model_config,
Collaborator

The ForCausalLM object no longer needs the rank_info and attention_backend parameters; please remove them.

return {logits};
}

void Glm4ForCausalLM::reset_cache(const cache::CacheConfig *cache_config) {
Collaborator

InfinilmModel provides a default implementation of reset_cache, so there is no need to override it. Please remove Glm4ForCausalLM::reset_cache.


using Glm4Model = infinilm::layers::causal_lm_templates::TextModel<Glm4DecoderLayer>;

class Glm4ForCausalLM : public InfinilmModel {
Collaborator

Recommend reusing existing modules to reduce redundant code.

When defining Glm4ForCausalLM, consider using the following:
using Glm4ForCausalLM = infinilm::layers::causal_lm_templates::TextCausalLM;

import json
with open(config_path, "r") as f:
hf_config = json.load(f)
model_type = hf_config.get("model_type", "")
Collaborator
@pengcheng888 May 12, 2026

model_type should be obtainable from the model.hf_config variable; please remove the code that reads config.json.

raise ValueError(f"Cannot split {name} with shape {tensor.shape}")
return torch.split(tensor, sizes, dim=0)

def _remap_glm4_weights(state_dict):
Collaborator

Could the _split_first_dim function be moved into _remap_glm4_weights so that it is only used internally there?

Comment on lines +112 to +114

auto q_in = infinicore::Tensor::empty({batch_size, num_attention_heads_, seq_len, head_dim_}, q->dtype(), q->device())
->permute({0, 2, 1, 3});
Collaborator

Why are the new variables q_in, k_in, v_in defined here and filled via copy_from?

Collaborator

Is it because the dimensions don't match?
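
If the copies exist only to reorder axes, it may be possible to avoid them. A sketch of the idea, assuming the activations arrive as [batch, seq, heads * dim] and that the infinicore tensor API exposes a view (or reshape) method alongside the permute shown in the snippet; neither assumption is confirmed by this diff:

// Hypothetical zero-copy alternative: reinterpret the layout and permute
// instead of allocating q_in and copying into it. Valid only if q is contiguous.
auto q_in = q->view({batch_size, seq_len, num_attention_heads_, head_dim_}) // assumed API
                ->permute({0, 2, 1, 3}); // -> [batch, heads, seq, dim], no data copy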


namespace infinilm::models::glm4 {

std::vector<infinicore::Tensor> glm4_allocate_kv_cache_tensors(
Collaborator

InfiniLM provides a default KV cache creation function, default_allocate_kv_cache_tensors, in csrc/models/infinilm_model.cpp.

csrc/models/infinilm_model.cpp also provides a default implementation of the virtual function InfinilmModel::reset_cache, which calls default_allocate_kv_cache_tensors.

It only needs to be redefined when default_allocate_kv_cache_tensors does not meet the model's requirements.

I compared glm4_allocate_kv_cache_tensors with default_allocate_kv_cache_tensors; the key code is identical.
Recommend deleting the glm4_allocate_kv_cache_tensors.cpp/hpp files to reduce redundant code.

const auto &dtype{model_config->get_dtype()};

INFINICORE_NN_MODULE_INIT(model, model_config, device);
INFINICORE_NN_MODULE_INIT(lm_head, lm_head, hidden_size, vocab_size, false, dtype, device);
Collaborator

There seems to be an extra "lm_head, " here.
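
Presumably the intended call drops the duplicated first argument, matching the member-name-then-constructor-arguments pattern of the model init on the line above (an inference from that pattern, not a confirmed macro signature):

INFINICORE_NN_MODULE_INIT(lm_head, hidden_size, vocab_size, false, dtype, device);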

// Compute the number of dimensions that actually take part in rotation
inline size_t get_rotary_dim(size_t head_dim, double partial_rotary_factor) {
if (partial_rotary_factor <= 0.0 || partial_rotary_factor >= 1.0) {
return head_dim;
Collaborator

Declare get_rotary_dim in the .hpp and move the implementation into rotary_embedding.cpp.
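
A sketch of the requested split. The function body after the early return is truncated in this diff, so the scaling line below is an assumption (the real code may round or enforce evenness differently):

// rotary_embedding.hpp: declaration only
size_t get_rotary_dim(size_t head_dim, double partial_rotary_factor);

// rotary_embedding.cpp: implementation
size_t get_rotary_dim(size_t head_dim, double partial_rotary_factor) {
    if (partial_rotary_factor <= 0.0 || partial_rotary_factor >= 1.0) {
        return head_dim; // out-of-range factor: rotate the full head dimension
    }
    // Assumed continuation of the truncated body: scale head_dim by the factor.
    return static_cast<size_t>(head_dim * partial_rotary_factor);
}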

break;
}
case backends::AttentionBackend::PAGED_ATTN: {
auto paged_kv_cache_config = dynamic_cast<const cache::PagedKVCacheConfig *>(cache_config);
Collaborator

Running with --enable-paged-attn causes a segmentation fault (core dumped).
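
One plausible failure mode, given the dynamic_cast in the quoted snippet: if cache_config is not actually a cache::PagedKVCacheConfig, the cast yields nullptr and any later dereference crashes. A defensive sketch (a diagnostic guess; the real cause needs a backtrace):

auto paged_kv_cache_config = dynamic_cast<const cache::PagedKVCacheConfig *>(cache_config);
if (paged_kv_cache_config == nullptr) {
    // Fail loudly instead of dereferencing a null pointer (needs <stdexcept>).
    throw std::runtime_error("PAGED_ATTN backend requires a PagedKVCacheConfig");
}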

@pengcheng888
Collaborator

Recommend changing the Chinese comments to concise English ones, since some platforms do not display Chinese characters correctly.

INFINICORE_NN_PARAMETER_INIT(kv_cache_v_scale, ({1}, infinicore::DataType::F32, device, 0, 0, 1));
}
}

Collaborator

This forward covers only the static cache; recommend implementing the paged variant as well.
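
A sketch of what dispatching on the backend inside forward might look like; forward_static and forward_paged are hypothetical helper names, and attention_backend_ is the member quoted earlier in this review:

// Hypothetical dispatch between paged and static attention paths.
switch (attention_backend_) {
case backends::AttentionBackend::PAGED_ATTN:
    return forward_paged(positions, hidden_states);  // hypothetical helper
default:
    return forward_static(positions, hidden_states); // hypothetical helper
}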


namespace infinilm::models::glm4 {

class Glm4Attention : public infinicore::nn::Module {
Collaborator

The Glm4Attention module name should match the file name; recommend renaming glm4_attention_layer.cpp/hpp to glm4_attention.cpp/hpp.
