feat: support chunkprefill and prefill cuda graph by Simon12345777 · Pull Request #371 · InfiniTensor/InfiniLM

Simon12345777 · 2026-05-12T09:51:36Z

Summary

C++ side: introduces a chunked-prefill compilation path. Adds csrc/engine/compiler/chunk_prefill_compiler.{hpp,cpp}; GeneralCompiler now composes a ChunkPrefillCompiler and matches it before decode in get_compiled(); a new enable_chunk_prefill_graph switch is threaded through InferEngine → RankWorker → GeneralCompiler.
The pybind11 and Python InferEngine wrapper layers (csrc/pybind11/engine/engine.hpp, python/infinilm/infer_engine.py) pass enable_chunk_prefill_graph through to the underlying engine.
Script layer port: the chunked-prefill scheduling logic of infer_task.py and launch_server.py in InfiniGraph/InfiniLM/scripts (setup_chunked_prefill / advance_prefill_chunk, --chunk-size, the priority worker_loop) is ported as-is to hds/InfiniLM/scripts/, keeping the existing hds KVCache.init signature for compatibility with create_kv_cache() in jiuge*/deepseek/qwen3vl.
Python server-side chunked-prefill scheduling (wired to enable_chunk_prefill_graph):
llm/request.py: InferenceRequest gains chunk_size and chunk_prefill_offset, plus is_chunking() / chunk_is_last().
llm/scheduler.py: adds chunking_queue; schedule() uses three priority levels: running (decode) > chunking (continuation chunks) > waiting (new requests). When a long prompt enters chunking, it is returned as a single request with batch=1, matching the (batch_size, chunk_size) signature of the graphs precompiled on the C++ side.
processors/basic_llm_processor.py: the paged branch slices input_ids / position_ids / slot_mapping by chunk_prefill_offset / chunk_size and sets the correct past_kv_lengths / total_kv_lengths.
llm/llm.py::_update_requests recognizes intermediate chunk steps: it does not consume the sampled token and does not trigger reset_req_blocks; it only advances the offset and then calls requeue_chunking. The last chunk takes the normal path and clears the chunk state.
Configuration plumbing: EngineConfig / LLM / LLMEngine / AsyncLLMEngine / InferenceServer gain chunk_size; BaseConfig gains --chunk-size (default 512) and --enable-chunk-prefill-graph.
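
The scheduling items above (chunk state on the request, the running > chunking > waiting priority, and requeueing on intermediate chunks) can be illustrated with a minimal sketch. This is not the PR's code: `Request`, `Scheduler`, and `update` are hypothetical stand-ins, the queues are plain deques, and block management, sampling, and decode termination are left out entirely.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass
class Request:
    """Hypothetical stand-in for InferenceRequest; only chunk state is modeled."""

    req_id: int
    prompt_len: int
    chunk_size: int = 512
    chunk_prefill_offset: int = 0
    prefill_done: bool = False

    def is_chunking(self) -> bool:
        # A prompt longer than one chunk is prefilled piece by piece.
        return not self.prefill_done and self.prompt_len > self.chunk_size

    def chunk_is_last(self) -> bool:
        # True when the chunk starting at the current offset reaches the prompt end.
        return self.chunk_prefill_offset + self.chunk_size >= self.prompt_len


class Scheduler:
    """Simplified three-level priority: running > chunking > waiting."""

    def __init__(self) -> None:
        self.running: Deque[Request] = deque()
        self.chunking_queue: Deque[Request] = deque()
        self.waiting: Deque[Request] = deque()

    def schedule(self) -> List[Request]:
        if self.running:
            # 1. Decode steps of already prefilled requests go first.
            return list(self.running)
        if self.chunking_queue:
            # 2. Continue a partially prefilled prompt, alone (batch=1).
            return [self.chunking_queue.popleft()]
        if self.waiting:
            # 3. Only then admit a new request.
            return [self.waiting.popleft()]
        return []

    def update(self, batch: List[Request]) -> None:
        for req in batch:
            if req.prefill_done:
                continue  # Decode step; token handling is omitted here.
            if req.is_chunking() and not req.chunk_is_last():
                # Intermediate chunk: advance the offset and requeue.
                req.chunk_prefill_offset += req.chunk_size
                self.chunking_queue.append(req)
            else:
                # Last (or only) chunk: clear chunk state and start decoding.
                req.chunk_prefill_offset = 0
                req.prefill_done = True
                self.running.append(req)
```

In this sketch a 1200-token prompt with chunk_size 512 is scheduled three times for prefill (offsets 0, 512, 1024) before it moves to running, and a shorter request that arrived behind it stays in waiting the whole time. A real scheduler would also batch decode requests and retire finished ones.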

Motivation

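On the InfiniGraph branch, csrc already has a chunk-prefill graph compilation path, together with the matching chunked scheduling in infer_task.py / launch_server.py, but the hds branch only has conventional prefill. This PR moves the whole chain to hds: it adds ChunkPrefillCompiler and the enable_chunk_prefill_graph switch at the lower level, and implements chunked-prefill scheduling natively in the new inference_server.py path, so that turning the switch on hits the precompiled graphs. As a result, it:

reduces the transient GPU memory pressure during the time to first token (a long prompt no longer has to enter forward all at once);
reuses the (batch_size, chunk_size) graphs precompiled on the C++ side, consistently with the paged KV cache.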
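
To make the first point concrete, the sketch below shows how one chunk of a long prompt might be cut out of the per-request tensors. It is an illustration only: `slice_chunk` is a hypothetical helper, plain lists stand in for tensors, and the reading of past_kv_lengths as "tokens already in the cache" and total_kv_lengths as "tokens in the cache after this chunk" is an assumption rather than something the PR states. Whether a short final chunk is padded up to chunk_size to match the precompiled graph is also not covered here.

```python
from typing import Dict, List, Union


def slice_chunk(
    input_ids: List[int],
    slot_mapping: List[int],
    offset: int,
    chunk_size: int,
) -> Dict[str, Union[List[int], int]]:
    """Return the inputs for the chunk that starts at `offset` (hypothetical helper)."""
    end = min(offset + chunk_size, len(input_ids))
    return {
        "input_ids": input_ids[offset:end],
        # Positions continue from where the previous chunk stopped.
        "position_ids": list(range(offset, end)),
        "slot_mapping": slot_mapping[offset:end],
        # Assumption: tokens of earlier chunks are already in the KV cache.
        "past_kv_lengths": offset,
        # Assumption: cache length once this chunk has been written.
        "total_kv_lengths": end,
    }


def iter_chunks(input_ids: List[int], slot_mapping: List[int], chunk_size: int):
    """Yield chunk inputs in order until the whole prompt is covered."""
    for offset in range(0, len(input_ids), chunk_size):
        yield slice_chunk(input_ids, slot_mapping, offset, chunk_size)
```

Each forward pass then sees at most chunk_size new tokens, which is what keeps the per-step activation footprint bounded regardless of prompt length.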

Closes #

Type of Change

feat — new feature / new model
fix — bug fix
perf — performance improvement (no behavioral change)
refactor — code restructuring without behavior change
test — adding or fixing tests only
docs — documentation only
build / ci — build system or CI configuration
chore — tooling, formatting, or other non-code changes
Breaking change

Test Results of Involved Models on Supported Platforms (Please attach screenshots)

Benchmark / Performance Impact

Notes for Reviewers

Checklist

Every contributor must verify every item below before requesting
review. Tick each box only after the check has actually been performed —
do not tick speculatively. If an item truly does not apply, replace the
checkbox with N/A and briefly explain why in an inline comment.

Title, Branch, and Commits

PR title follows Conventional Commits (e.g. feat(nvidia): …, fix(cuda/gemm): …).
Branch name follows <type>/xxx-yyyy-zzzz where <type> matches the PR title's Conventional Commits type and words are joined with hyphens (see CONTRIBUTING.md §Branches).
Each commit message follows Conventional Commits.
Small PR is a single squashable commit; or, for a large PR, every commit is meaningful, well-formed, and independently reviewable (see CONTRIBUTING.md §Pull Requests).
No stray merge commits from main — the branch is rebased cleanly on top of the current main.
No fixup! / squash! / wip commits remain.
Existing PR/branch/commit that followed the legacy issue format.

Scope and Design

Changes are minimal — nothing unrelated to the stated motivation was added (CONTRIBUTING.md §Code/General).
No dead code, commented-out blocks, debug prints, printf/std::cout/print(...) left behind, or TODO without an owner and issue link.
No unrelated formatting churn that would obscure the diff.
Public API changes (if any) are intentional, documented, and reflected in affected callers/tests.

General Code Hygiene (applies to all languages)

The code is self-explanatory; comments were added only where the why is non-obvious (CONTRIBUTING.md §Code/General).
Every modified or added file ends with a single trailing newline (CONTRIBUTING.md §Code/General).
No trailing whitespace, tab/space mixing, or stray BOMs.
Identifiers in comments and error messages are wrapped in backticks (e.g. the `seqlens_k` tensor) (CONTRIBUTING.md §Code/General).
All comments and error messages are in English (CONTRIBUTING.md §Code/General).
Comments and error messages are complete sentences — capitalized first letter, terminal punctuation — unless the language/framework convention says otherwise (CONTRIBUTING.md §Code/General; §Python).

C++ Specific (if C++ files changed)

Code follows the Google C++ Style Guide strictly.
Error and warning message wording follows the LLVM Coding Standards (CONTRIBUTING.md §C++).
Constructor initializer list order matches member declaration order (CONTRIBUTING.md §C++).
No raw new/delete; RAII / smart pointers / existing allocators are used.
Changed files are formatted by scripts/format.py.
No changes/reference to csrc/models/llama_legacy/.

Python Specific (if Python files changed)

Code is PEP 8 compliant.
Comments are complete English sentences, starting with a capital letter and ending with punctuation; Markdown backticks are used for code references (CONTRIBUTING.md §Python).
Docstrings (if any) follow PEP 257 (CONTRIBUTING.md §Python).
Changed files are formatted by scripts/format.py.
No changes/reference to python/infinilm/auto_config.py.

Testing

For any platform that could not be tested, an explicit reason is given in the table and a reviewer with access has been tagged.
Passed single request test (examples/test_infer.py), or specify the reason for skipping.
Passed offline performance test (examples/bench.py), or specify the reason for skipping.
Passed sanity test (test/bench/test_benchmark.py), or specify the reason for skipping.
Passed service test (python/infinilm/server/inference_server.py + scripts/test_perf.py), or specify the reason for skipping.

Build, CI, and Tooling

The project builds cleanly from a fresh directory on at least one affected platform.

Documentation

README.md, CONTRIBUTING.md, or inline docs updated when behavior, build flags, or developer workflow changed.
Any user-visible breaking change is called out explicitly under "Motivation" and in the commit/PR title with a ! or BREAKING CHANGE: footer.

Security and Safety

No secrets, access tokens, internal URLs, customer data, or personal hardware identifiers have been committed.
Third-party code is license-compatible and attributed.
No unsafe pointer arithmetic, uninitialized reads, or missing bounds checks were introduced.

wooway777 · 2026-05-12T11:22:49Z

@@ -1,12 +1,13 @@
 #pragma once

+#include "chunk_prefill_compiler.hpp"


Where is this file?

wooway777

Please add the launch command and test screenshots, and please support the flash attn backend.

add chunkprefill and prefill cuda graph

f2c8bab

Simon12345777 requested a review from a team May 12, 2026 09:51

wooway777 requested review from PanZezhong1725 and wooway777 May 12, 2026 10:45

wooway777 reviewed May 12, 2026

add chunk_prefill_compiler.cpp/.hpp

bb68ca5

wooway777 requested changes May 13, 2026

Simon12345777 wants to merge 2 commits into InfiniTensor:main from Simon12345777:main
