issue/277 perf: batch async dispatch, fix GIL release, fix build_model_inputs#278

Open
ma-hang wants to merge 1 commit into main from issue/277

ma-hang (Contributor) commented Mar 24, 2026

Test platform

A100

Before optimization

============ Serving Benchmark Result ============
Successful requests:                     16
Failed requests:                         0
Maximum request concurrency:             16
Benchmark duration (s):                  27.37
Total input tokens:                      4080
Total generated tokens:                  16508
Request throughput (req/s):              0.58
Output token throughput (tok/s):         603.17
Peak output token throughput (tok/s):    704.00
Peak concurrent requests:                16.00
Total token throughput (tok/s):          752.24
---------------Time to First Token----------------
Mean TTFT (ms):                          1059.22
Median TTFT (ms):                        1056.99
P99 TTFT (ms):                           1236.73
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          25.48
Median TPOT (ms):                        25.66
P99 TPOT (ms):                           25.73
---------------Inter-token Latency----------------
Mean ITL (ms):                           25.65
Median ITL (ms):                         25.53
P99 ITL (ms):                            67.47
==================================================

After optimization

============ Serving Benchmark Result ============
Successful requests:                     16        
Failed requests:                         0         
Maximum request concurrency:             16        
Benchmark duration (s):                  20.38     
Total input tokens:                      4080      
Total generated tokens:                  16400     
Request throughput (req/s):              0.79      
Output token throughput (tok/s):         804.65    
Peak output token throughput (tok/s):    1008.00   
Peak concurrent requests:                16.00     
Total token throughput (tok/s):          1004.83   
---------------Time to First Token----------------
Mean TTFT (ms):                          697.80    
Median TTFT (ms):                        723.32    
P99 TTFT (ms):                           725.15    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          19.22     
Median TPOT (ms):                        19.23     
P99 TPOT (ms):                           19.51     
---------------Inter-token Latency----------------
Mean ITL (ms):                           19.23     
Median ITL (ms):                         19.32     
P99 ITL (ms):                            21.85     
==================================================
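The relative change between the two reports can be worked out with a small helper. This is only a sketch for reading the numbers: `percent_change`, `Metric`, and `kMetrics` are illustrative names that do not come from the PR, and the figures are copied from the two reports above.

```cpp
#include <cassert>
#include <cmath>

// Relative change from `before` to `after`, in percent.
// Positive: the value went up. Negative: the value went down.
double percent_change(double before, double after) {
    assert(before != 0.0);  // a zero baseline has no relative change
    return (after - before) / before * 100.0;
}

// Illustrative container; the numbers are taken from the reports above.
struct Metric {
    const char* name;
    double before;
    double after;
};

const Metric kMetrics[] = {
    {"Output token throughput (tok/s)", 603.17, 804.65},
    {"Mean TTFT (ms)", 1059.22, 697.80},
    {"Mean TPOT (ms)", 25.48, 19.22},
    {"P99 ITL (ms)", 67.47, 21.85},
};
```

Calling `percent_change(m.before, m.after)` for each entry gives roughly +33% output token throughput, about -34% mean TTFT, about -25% mean TPOT, and about -68% P99 ITL. The two runs generated slightly different token counts (16508 vs 16400), so the per-token rates are a fairer comparison than the raw benchmark duration.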

ma-hang linked an issue Mar 24, 2026 that may be closed by this pull request
@pengcheng888 (Collaborator)

Adding a test screenshot.

 .def(
-    "forward", [](InferEngine &self, const InferEngine::Input &input) -> InferEngine::Output { return self.forward(input); }, "Run inference on all ranks with arbitrary arguments")
+    "forward", [](InferEngine &self, const InferEngine::Input &input) -> InferEngine::Output {
+        py::gil_scoped_release release;
Collaborator

What is `py::gil_scoped_release release;`?
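For context on the question: in pybind11, `py::gil_scoped_release` is an RAII guard. Its constructor releases the Python GIL and its destructor re-acquires it, so other Python threads can run while the enclosing C++ scope does its work. The sketch below only mimics that shape with a stand-in lock so it can run without Python; `FakeGil`, `ScopedRelease`, and `forward_with_release` are hypothetical names, not pybind11 or project APIs.

```cpp
#include <cassert>

// Hypothetical stand-in for the interpreter lock. NOT the real GIL.
struct FakeGil {
    bool held = false;
    void acquire() { assert(!held); held = true; }
    void release() { assert(held); held = false; }
};

// Mimics the shape of py::gil_scoped_release:
// release the lock on construction, take it back on destruction.
class ScopedRelease {
public:
    explicit ScopedRelease(FakeGil& gil) : gil_(gil) { gil_.release(); }
    ~ScopedRelease() { gil_.acquire(); }
    ScopedRelease(const ScopedRelease&) = delete;
    ScopedRelease& operator=(const ScopedRelease&) = delete;

private:
    FakeGil& gil_;
};

// Hypothetical long-running call. The caller must hold `gil` on entry.
// Returns whether the lock was still held while the work was running.
bool forward_with_release(FakeGil& gil) {
    ScopedRelease release(gil);        // lock is dropped here
    bool held_during_work = gil.held;  // the heavy work would go here
    return held_during_work;
}                                      // lock is re-acquired here
```

With the lock held on entry, `forward_with_release` reports that the lock was free during the work and leaves it held again on return, including when the scope is left by an exception. Code inside such a scope must not touch Python objects, because the lock protecting them is not held at that point.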

Development

Successfully merging this pull request may close these issues.

[DEV] Inference service performance optimization
