49 commits
75ec504
Support MThreads (MUSA) GPU (#1162)
yeahdongcn Jan 6, 2026
11b4de3
Rl weight (#1143)
shihaobai Dec 8, 2025
04b214b
fix-internvl (#1171)
SangChengC Jan 9, 2026
b485822
add att backend (#1165)
hiworldwzj Jan 9, 2026
5d2f630
fix unit test (#1173)
hiworldwzj Jan 10, 2026
b79f44b
refactor norm and add platform
shihaobai Jan 11, 2026
46a16bb
merge main
shihaobai Jan 11, 2026
5740a2e
norm
shihaobai Jan 12, 2026
0e609c8
Embedding and LMHead
sufubao Jan 12, 2026
39a738b
fix LMHeadWeight
sufubao Jan 12, 2026
74adfc5
mm weight refactor
shihaobai Jan 12, 2026
65512a4
Merge branch 'weight_refactor' of https://github.com/ModelTC/lightllm…
shihaobai Jan 12, 2026
0cd8fca
MOE
sufubao Jan 12, 2026
cbd1726
fix gemma norm & slicer
shihaobai Jan 12, 2026
12ddd8a
fix
shihaobai Jan 12, 2026
6c06c23
Merge branch 'weight_refactor' of https://github.com/ModelTC/lightllm…
shihaobai Jan 12, 2026
ade109f
remove data_type
shihaobai Jan 12, 2026
45a415a
remove fused_moe_weight_tp
shihaobai Jan 12, 2026
6e227b1
qk norm
shihaobai Jan 12, 2026
eadccca
remove PlatformAwareOp.__init__()
shihaobai Jan 12, 2026
7ea5e77
fix model call
sufubao Jan 13, 2026
057f742
remove torchao
sufubao Jan 13, 2026
cee6e23
[MUSA] Add shell script to generate requirements-musa.txt and update …
yeahdongcn Jan 13, 2026
e2b3305
quantization draft
sufubao Jan 13, 2026
f0481a8
fix openai v1 (#1178)
sufubao Jan 13, 2026
2fbd2b8
add diverse_stage2 add optimize diverse_stage1 (#1174)
WANDY666 Jan 15, 2026
3e2e030
refactor quantization (draft)
shihaobai Jan 15, 2026
55fdd2f
fix
shihaobai Jan 15, 2026
208f1b0
unit_test
shihaobai Jan 15, 2026
206b170
check image tag and image num (#1176)
SangChengC Jan 16, 2026
7a0a4d7
fix
shihaobai Jan 19, 2026
d5ac1a8
update docs
shihaobai Jan 19, 2026
c1274ea
fix pre-weight
shihaobai Jan 19, 2026
1bd148d
fix cpu kv cache offload async error (#1180)
hiworldwzj Jan 20, 2026
148e7f1
fix deepseek
shihaobai Jan 20, 2026
0f6687c
Merge remote-tracking branch 'origin/main' into weight_refactor
sufubao Jan 20, 2026
dd04d13
Merge branch 'weight_refactor' of https://github.com/ModelTC/lightllm…
shihaobai Jan 20, 2026
cb4aecc
fix unitest
shihaobai Jan 20, 2026
9242821
[MUSA] Add shell script to generate requirements-musa.txt and update …
yeahdongcn Jan 13, 2026
35902c4
fix openai v1 (#1178)
sufubao Jan 13, 2026
0f89a7a
add diverse_stage2 add optimize diverse_stage1 (#1174)
WANDY666 Jan 15, 2026
89e7b9e
check image tag and image num (#1176)
SangChengC Jan 16, 2026
0777e28
fix cpu kv cache offload async error (#1180)
hiworldwzj Jan 20, 2026
8781d63
Merge branch 'main' into weight_refactor
shihaobai Jan 21, 2026
ac2ec92
refactor fuse_moe
shihaobai Jan 22, 2026
e8463f1
redunancy_expert(draft)
shihaobai Jan 22, 2026
ec0fe0f
remove weight_ep
shihaobai Jan 22, 2026
4b27aac
add redundancy assert
shihaobai Jan 23, 2026
351387d
fix mm weight with bias
shihaobai Jan 23, 2026
1 change: 1 addition & 0 deletions .gitignore
@@ -6,3 +6,4 @@ dist
.idea
.vscode
tmp/
requirements-musa.txt
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -10,4 +10,4 @@ repos:
rev: 6.1.0
hooks:
- id: flake8
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231']
args: ['--max-line-length=120', '--ignore=TYP001, E722, C901, E203, E266, E402, E302, E241, E902, E731, F403, E701, F405, F401, W292, W293, W503, W606, E231, F541']
12 changes: 8 additions & 4 deletions docs/CN/source/getting_started/installation.rst
@@ -27,7 +27,7 @@ Lightllm is an inference framework developed in pure Python, whose operators are written in Triton
$ # Before starting, make sure your Docker settings allocate enough shared memory; otherwise the
$ # service may fail to start properly.
$ # 1. For text-only serving, at least 2GB of shared memory is recommended; if you have plenty of memory, 16GB or more is recommended.
$ # 2. For multimodal serving, at least 16GB of shared memory is recommended; adjust according to your actual setup.
$ # 2. For multimodal serving, at least 16GB of shared memory is recommended; adjust according to your actual setup.
$ # If you do not have enough shared memory, try lowering the --running_max_req_size parameter when starting the service; this reduces
$ # the number of concurrent requests the service handles, but lowers shared-memory usage. For multimodal serving, you can also lower the --cache_capacity
$ # parameter to reduce shared-memory usage.
@@ -38,7 +38,7 @@ Lightllm is an inference framework developed in pure Python, whose operators are written in Triton
You can also build the image manually from source and run it; manual builds are recommended because updates are frequent:

.. code-block:: console

$ # Enter the root directory of the repository
$ cd /lightllm
$ # Build the image manually; the docker directory contains build files for different scenarios, build the one you need.
@@ -52,7 +52,7 @@ Lightllm is an inference framework developed in pure Python, whose operators are written in Triton
Alternatively, you can use the script to launch the image and run it in one step:

.. code-block:: console

$ # Show the script arguments
$ python tools/quick_launch_docker.py --help

@@ -80,6 +80,10 @@ Lightllm is an inference framework developed in pure Python, whose operators are written in Triton
$ # Install lightllm dependencies (CUDA 12.4)
$ pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu124
$
$ # Install lightllm dependencies (MThreads MUSA GPU)
$ ./generate_requirements_musa.sh
$ pip install -r requirements-musa.txt
$
$ # Install lightllm
$ python setup.py install

@@ -97,6 +101,6 @@ Lightllm is an inference framework developed in pure Python, whose operators are written in Triton
.. code-block:: console

$ pip install -U --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/Triton-Nightly/pypi/simple/ triton-nightly --no-deps

For the underlying cause, see the `issue <https://github.com/triton-lang/triton/issues/3619>`_ and `fix PR <https://github.com/triton-lang/triton/pull/3638>`_

37 changes: 0 additions & 37 deletions docs/CN/source/models/add_new_model.md
@@ -162,19 +162,6 @@ class BloomPreAndPostLayerWeight(PreAndPostLayerWeight):
self.tp_rank_: split_vob_size * (self.tp_rank_ + 1), :])
self.lm_head_weight_ = self.wte_weight_
return

def verify_load(self):
errors = "weights load not ok"
weights = [self.pre_norm_weight_,
self.pre_norm_bias_,
self.final_norm_weight_,
self.final_norm_bias_,
self.wte_weight_,
self.lm_head_weight_]
for i in range(len(weights)):
assert weights[i] is not None, "index:" + str(i) + " " + errors
return

~~~

***transformer_layer_weight.py***
@@ -204,30 +191,6 @@ class BloomTransformerLayerWeight(TransformerLayerWeight):
self._load_qkvo_weights(weights)
self._load_ffn_weights(weights)
return

def verify_load(self):
errors = "weights load not ok"
weights = [self.att_norm_weight_,
self.att_norm_bias_,
self.q_weight_,
self.k_weight_,
self.v_weight_,
self.q_bias_,
self.k_bias_,
self.v_bias_,
self.o_weight_,
self.o_bias_,

self.ffn_norm_weight_,
self.ffn_norm_bias_,
self.ffn_1_weight_,
self.ffn_1_bias_,
self.ffn_2_weight_,
self.ffn_2_bias_,
]
for i in range(len(weights)):
assert weights[i] is not None, "index:" + str(i) + " " + errors
return

def _load_qkvo_weights(self, weights):
if f"h.{self.layer_num_}.input_layernorm.weight" in weights:
47 changes: 7 additions & 40 deletions docs/CN/source/tutorial/api_server_args_zh.rst
@@ -183,22 +183,6 @@ PD disaggregation mode parameters
When set to True, --nccl_host must equal config_server_host, and --nccl_port must be unique for the config_server;
do not use the same nccl_port for different inference nodes, as that is a serious error

Attention type selection parameters
-----------------------------------

.. option:: --mode

Model inference mode; multiple values can be specified:

* ``triton_int8kv``: store the kv cache in int8 to increase token capacity, using triton kernels
* ``ppl_int8kv``: store the kv cache in int8, using fast ppl kernels
* ``ppl_fp16``: use the fast ppl fp16 decode attention kernel
* ``triton_flashdecoding``: flashdecoding mode for long contexts, currently supports llama llama2 qwen
* ``triton_gqa_attention``: fast kernel for models using GQA
* ``triton_gqa_flashdecoding``: fast flashdecoding kernel for models using GQA
* ``triton_fp8kv``: store the kv cache in float8, currently only used for deepseek2

You need to read the source code to confirm exactly which modes each model supports

Scheduling parameters
---------------------
@@ -327,17 +311,9 @@ Attention type selection parameters

The inference backend will use micro-batch overlap mode for decode

.. option:: --enable_flashinfer_prefill
.. option:: --llm_kv_type

The inference backend will use flashinfer attention kernels for prefill

.. option:: --enable_flashinfer_decode

The inference backend will use flashinfer attention kernels for decode

.. option:: --enable_fa3

The inference backend will use the fa3 attention kernel for prefill and decode
The data type the inference backend uses to store the kv cache; allowed values are "None", "int8kv", "int4kv", "fp8kv"
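
For illustration only, a minimal launch sketch combining this option with a standard server start (the model path and port below are placeholders, not taken from this PR):

.. code-block:: console

    $ # store the kv cache in int8 to fit more tokens in the same GPU memory (illustrative)
    $ python -m lightllm.server.api_server --port 8088 \
        --model_dir /path/your-model \
        --llm_kv_type int8kv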

.. option:: --disable_cudagraph

@@ -373,17 +349,14 @@ Attention type selection parameters
.. option:: --quant_type

Quantization method; allowed values:

* ``ppl-w4a16-128``
* ``flashllm-w6a16``
* ``ao-int4wo-[32,64,128,256]``
* ``ao-int8wo``
* ``ao-fp8w8a16``
* ``ao-fp6w6a16``

* ``vllm-w8a8``
* ``vllm-fp8w8a8``
* ``vllm-fp8w8a8-b128``
* ``deepgemm-fp8w8a8-b128``
* ``triton-fp8w8a8-block128``
* ``awq``
* ``awq_marlin``
* ``none`` (default)
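
As a hedged sketch only (the model path is a placeholder and ``vllm-fp8w8a8`` is just one value picked from the list above), quantized serving might be enabled like this:

.. code-block:: console

    $ # serve the model with fp8 W8A8 weight/activation quantization (illustrative)
    $ python -m lightllm.server.api_server --port 8088 \
        --model_dir /path/your-model \
        --quant_type vllm-fp8w8a8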

.. option:: --quant_cfg
@@ -395,13 +368,7 @@ Attention type selection parameters
.. option:: --vit_quant_type

ViT quantization method; allowed values:

* ``ppl-w4a16-128``
* ``flashllm-w6a16``
* ``ao-int4wo-[32,64,128,256]``
* ``ao-int8wo``
* ``ao-fp8w8a16``
* ``ao-fp6w6a16``

* ``vllm-w8a8``
* ``vllm-fp8w8a8``
* ``none`` (default)
69 changes: 43 additions & 26 deletions docs/CN/source/tutorial/deepseek_deployment.rst
@@ -33,12 +33,14 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3

**Parameter description:**
- `LOADWORKER=18`: number of model-loading threads, speeds up loading
- `--tp 8`: tensor parallelism degree, uses 8 GPUs
- `--enable_fa3`: enable Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: use Flash Attention 3.0 for prefill
- `--llm_decode_att_backend fa3`: use Flash Attention 3.0 for decode
- `--port 8088`: service port
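
Once the server is up, a quick smoke test can be sent to it. The sketch below assumes the default ``/generate`` HTTP endpoint and its JSON fields; adjust the path and fields to whatever your LightLLM version actually exposes:

.. code-block:: console

    $ # illustrative request; the endpoint path and JSON fields are assumptions, not defined in this PR
    $ curl http://127.0.0.1:8088/generate \
        -H "Content-Type: application/json" \
        -d '{"inputs": "What is AI?", "parameters": {"max_new_tokens": 32}}'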

1.2 Single-node DP + EP mode (Data Parallel + Expert Parallel)
@@ -51,17 +53,20 @@ LightLLM supports the following deployment modes:
.. code-block:: bash

# H200 single-node DeepSeek-R1 DP + EP mode
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 8 \
--dp 8 \
--enable_fa3
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--enable_ep_moe

**Parameter description:**
- `MOE_MODE=EP`: set expert parallel mode
- `--enable_ep_moe`: enable expert parallel mode
- `--tp 8`: tensor parallelism degree
- `--dp 8`: data parallelism degree, usually set to the same value as tp
- `--enable_fa3`: enable Flash Attention 3.0
- `--llm_prefill_att_backend fa3`: use Flash Attention 3.0 for prefill
- `--llm_decode_att_backend fa3`: use Flash Attention 3.0 for decode

**Optional optimization parameters:**
- `--enable_prefill_microbatch_overlap`: enable prefill micro-batch overlap
@@ -85,7 +90,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
@@ -101,7 +107,8 @@ LightLLM supports the following deployment modes:
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
@@ -125,15 +132,16 @@ LightLLM supports the following deployment modes:
# H200 multi-node DeepSeek-R1 EP mode, Node 0
# Usage: sh multi_node_ep_node0.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 0 \
--nccl_host $nccl_host \
--nccl_port 2732
--nccl_port 2732 --enable_ep_moe

**Node 1 launch command:**

@@ -142,15 +150,16 @@ LightLLM supports the following deployment modes:
# H200 multi-node DeepSeek-R1 EP mode, Node 1
# Usage: sh multi_node_ep_node1.sh <nccl_host>
export nccl_host=$1
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
--model_dir /path/DeepSeek-R1 \
--tp 16 \
--dp 16 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--nnodes 2 \
--node_rank 1 \
--nccl_host $nccl_host \
--nccl_port 2732
--nccl_port 2732 --enable_ep_moe

**Optional optimization parameters:**
- `--enable_prefill_microbatch_overlap`: enable prefill micro-batch overlap
@@ -187,18 +196,20 @@ The PD (Prefill-Decode) disaggregated mode deploys the prefill and decode stages separately, which can
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--tp 8 \
--dp 8 \
--host $host \
--port 8019 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
--pd_master_port 60011 \
--enable_ep_moe
# To enable micro-batch overlap, uncomment the following line
#--enable_prefill_microbatch_overlap

@@ -211,18 +222,20 @@ The PD (Prefill-Decode) disaggregated mode deploys the prefill and decode stages separately, which can
export host=$1
export pd_master_ip=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--tp 8 \
--dp 8 \
--host $host \
--port 8121 \
--nccl_port 12322 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--pd_master_ip $pd_master_ip \
--pd_master_port 60011
--pd_master_port 60011 \
--enable_ep_moe
# To enable micro-batch overlap, uncomment the following line
#--enable_decode_microbatch_overlap

@@ -279,36 +292,40 @@ The PD (Prefill-Decode) disaggregated mode deploys the prefill and decode stages separately, which can
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "prefill" \
--host $host \
--port 8019 \
--tp 8 \
--dp 8 \
--nccl_port 2732 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--disable_cudagraph \
--config_server_host $config_server_host \
--config_server_port 60088
--config_server_port 60088 \
--enable_ep_moe
# To enable micro-batch overlap, uncomment the following line
#--enable_prefill_microbatch_overlap

# Decode service
export host=$1
export config_server_host=$2
nvidia-cuda-mps-control -d
MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
LOADWORKER=18 python -m lightllm.server.api_server \
--model_dir /path/DeepSeek-R1 \
--run_mode "decode" \
--host $host \
--port 8121 \
--nccl_port 12322 \
--tp 8 \
--dp 8 \
--enable_fa3 \
--llm_prefill_att_backend fa3 \
--llm_decode_att_backend fa3 \
--config_server_host $config_server_host \
--config_server_port 60088
--config_server_port 60088 \
--enable_ep_moe
# To enable micro-batch overlap, uncomment the following line
#--enable_decode_microbatch_overlap
