refactor: unify linear/quantization architecture and remove deprecated interfaces #366
qinyiqun wants to merge 1 commit into
Conversation
refactor: unify linear/quantization architecture and remove deprecated interfaces

- Move linear module from InfiniCore to InfiniLM with quantization-based dispatch
- Add GPTQ->GPTQ_QY weight conversion gated by QY device type
- Implement fused linear weight splitting and re-registration
- Fix TP split dimensions for all quantization schemes
- Add alpha scaling parameter and logical dim size delegation
- Move set_zeros/set_minus_one to utils.hpp as shared utilities
Open question for discussion: should modules in InfiniLM be declared and initialized via macros, or exist as smart pointers?
k_dim_},
dtype_,
rank_info.device);
set_zeros(k_caches_);
Does the kv cache need to be zeroed? Is this a requirement of some platform?
Domestic chips can have dirty memory: malloc'd memory that is not zeroed still contains the previous data.
There are four functions in kv_cache.cpp that allocate kv cache, so all four of them need this added.
rank_info_.device,
pending_cache_config_ != nullptr ? pending_cache_config_.get() : nullptr);
} else {
std::vector<std::string> classic_models = {"llama", "qwen2", "minicpm", "fm9g", "fm9g7b"};
Please don't delete this classic_models code for now. If the llama_legacy folder is to be removed, that should be a separate PR.
break;
}
}
auto register_fn = [this](const std::string &n, infinicore::nn::Parameter p) { this->register_parameter(n, std::move(p)); };
Can the register_fn variable be moved into the init_kv_cache_quant_params function?
break;
}
}
init_kv_cache_quant_params(register_fn, device, kv_cache_k_scale_, kv_cache_v_scale_);
Could this function be made a private method of the Attention class and called internally?
break;
}
}
auto register_fn = [this](const std::string &n, infinicore::nn::Parameter p) { this->register_parameter(n, std::move(p)); };
Isn't this register_fn variable unused here?
@@ -19,12 +19,12 @@ MoeMLP::MoeMLP(std::shared_ptr<infinilm::config::ModelConfig> model_config,
auto quant_scheme = model_config->get_quant_scheme();
auto quantization_method = model_config->get_quantization_method();
switch (quant_scheme) {
Isn't the switch (quant_scheme) already hidden inside linear? Why is there another switch (quant_scheme) here?
Can ColumnParallelLinear and RowParallelLinear be constructed with the quantization parameter instead?
It looks like this file was added after my changes; I'll fix it.
break;
}
}
infinilm::layers::attention::init_kv_cache_quant_params(register_fn, device, kv_cache_k_scale_, kv_cache_v_scale_);
Should the init_kv_cache_quant_params function be reused here?
std::shared_ptr<infinilm::config::ModelConfig> model_config,
engine::distributed::RankInfo rank_info = engine::distributed::RankInfo(),
const cache::CacheConfig *cache = nullptr,
backends::AttentionBackend attention_backend = backends::AttentionBackend::Default);
If this constructor is deleted as well, llama_legacy becomes unreachable.
} // namespace infinilm::nn

#include "fused_linear.hpp"
const size_t block_per_req = nblocks;
input.block_tables = block_tables_holder_->as_strided({b, block_per_req}, {(ptrdiff_t)block_per_req, 1});
input.slot_mapping = infinicore::Tensor::empty({b}, infinicore::DataType::I64, infinicore::context::getDevice());
set_zeros(input.slot_mapping.value());
Why was this line deleted? Doesn't input.slot_mapping need to be reset anymore?
Summary
Motivation
Linear/quantization should be moved to InfiniLM from InfiniCore.
Closes #
Type of Change
- feat — new feature / new model
- fix — bug fix
- perf — performance improvement (no behavioral change)
- refactor — code restructuring without behavior change
- test — adding or fixing tests only
- docs — documentation only
- build/ci — build system or CI configuration
- chore — tooling, formatting, or other non-code changes

Test Results of Involved Models on Supported Platforms (Please attach screenshots)
Benchmark / Performance Impact
Notes for Reviewers
Checklist
Title, Branch, and Commits
- (e.g. feat(nvidia): …, fix(cuda/gemm): …).
- <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
- (CONTRIBUTING.md §Pull Requests).
- main — the branch is rebased cleanly on top of the current main.
- No fixup!/squash!/wip commits remain.

Scope and Design
- (CONTRIBUTING.md §Code/General).
- No printf/std::cout/print(...) left behind, or TODO without an owner and issue link.

General Code Hygiene (applies to all languages)
- (CONTRIBUTING.md §Code/General).
- (CONTRIBUTING.md §Code/General).
- (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
- (CONTRIBUTING.md §Code/General).
- (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)
- (CONTRIBUTING.md §C++).
- (CONTRIBUTING.md §C++).
- No new/delete; RAII / smart pointers / existing allocators are used.
- scripts/format.py.
- csrc/models/llama_legacy/.

Python Specific (if Python files changed)
- (CONTRIBUTING.md §Python).
- (CONTRIBUTING.md §Python).
- scripts/format.py.
- python/infinilm/auto_config.py.

Testing
- (examples/test_infer.py), or specify the reason for skipping.
- (examples/bench.py), or specify the reason for skipping.
- (test/bench/test_benchmark.py), or specify the reason for skipping.
- (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling
Documentation
- README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
- ! or BREAKING CHANGE: footer.

Security and Safety