Add /get_ppl endpoint #4679
Pull request overview
This PR adds a dedicated perplexity scoring API (POST /get_ppl) and wires end-to-end support for returning prompt cross-entropy loss (ce_loss) from both TurboMind and PyTorch backends, enabling PPL computation without exporting full logits to CPU.
Changes:
- Add OpenAI-style /get_ppl endpoint + request/response models, backed by AsyncEngine.async_get_ppl.
- Implement backend-side prompt CE-loss accumulation (TurboMind: CUDA kernel + output processor; PyTorch: per-step CE reduction with chunked-prefill boundary handling).
- Simplify Pipeline.get_ppl to call async_get_ppl per sequence (removing the previous batching/long-text implementation).
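As a rough illustration of how a summed prompt cross-entropy (ce_loss) relates to perplexity, here is a minimal sketch. It assumes the loss is in nats and is normalized by the number of scored tokens; the helper name `perplexity_from_ce_loss` is hypothetical and not taken from the PR.

```python
import math


def perplexity_from_ce_loss(ce_loss: float, num_scored_tokens: int) -> float:
    """Convert a summed (unnormalized) cross-entropy loss into perplexity.

    Assumption: ce_loss is in nats and is averaged over the scored tokens.
    """
    if num_scored_tokens <= 0:
        raise ValueError("need at least one scored token")
    return math.exp(ce_loss / num_scored_tokens)


# A model that is uniform over a 4-token vocabulary pays log(4) nats per token,
# so three scored tokens sum to 3 * log(4).
ppl = perplexity_from_ce_loss(ce_loss=3 * math.log(4), num_scored_tokens=3)
print(ppl)
```

Returning only the scalar sum keeps the per-request output small, which matches the stated goal of avoiding a full logits export to CPU.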
Reviewed changes
Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/csrc/unittests/test_logprob_kernels.cu | Removed logprob kernel unit tests (file deleted). |
| tests/csrc/unittests/CMakeLists.txt | Stop building/linking the removed logprob kernel unit test. |
| src/turbomind/python/bind.cpp | Expose GenerationConfig.return_ppl to Python via pybind. |
| src/turbomind/models/output_processor.cc | Add CE-loss target collection, CE-loss computation from chunked logits, and output emission. |
| src/turbomind/models/CMakeLists.txt | Link cross_entropy_kernels into the models target. |
| src/turbomind/kernels/logprob_kernels.cu | Remove old logprob kernel implementation (file deleted). |
| src/turbomind/kernels/cross_entropy_kernels.h | Replace previous declaration with invokeCrossEntropyLoss API. |
| src/turbomind/kernels/cross_entropy_kernels.cu | New CUDA kernel to compute and accumulate cross-entropy loss on device. |
| src/turbomind/kernels/CMakeLists.txt | Replace logprob_kernels library target with cross_entropy_kernels. |
| src/turbomind/engine/request.h | Add return_ppl to config and per-request CE-loss state (input_ce_loss, ce_loss). |
| src/turbomind/engine/request.cc | Include return_ppl in config logging output. |
| src/turbomind/engine/model_request.cc | Allocate ce_loss output tensor when return_ppl is requested. |
| src/turbomind/engine/engine.cc | Treat return_ppl as incompatible with prefix caching in validation. |
| lmdeploy/turbomind/turbomind.py | Plumb ce_loss output into EngineOutput on FINISH. |
| lmdeploy/serve/openai/protocol.py | Add GetPPLRequest/GetPPLResponse protocol models. |
| lmdeploy/serve/openai/api_server.py | Add /get_ppl route that tokenizes (when needed) and calls async_get_ppl. |
| lmdeploy/serve/core/async_engine.py | Add async_get_ppl API and store speculative_config for validation. |
| lmdeploy/pytorch/messages.py | Add out_ce_loss flag and per-sequence CE-loss accumulator. |
| lmdeploy/pytorch/engine/model_agent/scoring.py | New helper to compute prompt CE-loss (supports chunked-prefill boundary token). |
| lmdeploy/pytorch/engine/model_agent/agent.py | Compute/propagate CE-loss from logits; request logits when CE-loss is needed. |
| lmdeploy/pytorch/engine/inputs_maker.py | Add return_ce_loss flag propagation for prefill input creation. |
| lmdeploy/pytorch/engine/engine.py | Extend inference output container with ce_loss. |
| lmdeploy/pytorch/engine/engine_loop.py | Accumulate and emit CE-loss per session on FINISH. |
| lmdeploy/pytorch/engine/engine_instance.py | Forward ce_loss from response payload into InferOutput. |
| lmdeploy/pipeline.py | Re-implement get_ppl using async_get_ppl per sequence. |
| lmdeploy/messages.py | Add GenerationConfig.return_ppl and EngineOutput.ce_loss fields/documentation. |
    gen_config = GenerationConfig(max_new_tokens=max_new_tokens, return_ppl=True, top_k=1)
    async with self.safe_run(handle,
                             session=session,
                             input_ids=input_ids,
                             gen_config=gen_config,
                             stream_output=False,
                             sequence_start=True,
                             sequence_end=True,
                             step=session.step) as gen:
        async for outputs in gen:
            pass
        ce_loss = outputs.ce_loss

    # compute summed, unnormalized prompt cross-entropy
    ce_loss = None
    if return_ce_loss and logits is not None:
last_logits is the input to async_sampling_logits, and it is also a slice of logits. Sampling may update last_logits in place, which would also change the logits used here.
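The "summed, unnormalized prompt cross-entropy" in the hunk above, together with the chunked-prefill boundary handling mentioned in the overview, can be sketched in plain Python. This is only an illustration under stated assumptions: logits[i] is taken to predict token_ids[i + 1], and all function names are hypothetical rather than those in scoring.py.

```python
import math


def token_ce(logits_row, target):
    """Cross-entropy at one position: logsumexp(logits) - logits[target]."""
    m = max(logits_row)
    lse = m + math.log(sum(math.exp(x - m) for x in logits_row))
    return lse - logits_row[target]


def summed_prompt_ce(logits, token_ids):
    """Sum CE over the whole prompt; logits[i] predicts token_ids[i + 1]."""
    return sum(token_ce(logits[i], token_ids[i + 1])
               for i in range(len(token_ids) - 1))


def summed_prompt_ce_chunked(logits, token_ids, chunk_size):
    """Same sum, accumulated chunk by chunk as in chunked prefill."""
    total = 0.0
    n = len(token_ids)
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        chunk_logits = logits[start:end]
        # Targets are shifted by one, so the last position of a chunk is
        # scored against the first token of the next chunk (boundary token).
        targets = token_ids[start + 1:end + 1]
        # zip stops early in the final chunk: the last prompt position has
        # no next token and is therefore not scored.
        for row, tgt in zip(chunk_logits, targets):
            total += token_ce(row, tgt)
    return total


logits = [[0.1, 0.2, 0.7], [1.0, 0.0, -1.0], [0.3, 0.3, 0.3],
          [2.0, 1.0, 0.0], [0.0, 0.5, 0.2]]
token_ids = [0, 2, 1, 1, 0]
full = summed_prompt_ce(logits, token_ids)
chunked = summed_prompt_ce_chunked(logits, token_ids, chunk_size=2)
print(full, chunked)
```

Computing the reduction from logits before any in-place sampling step touches them is one way to sidestep the aliasing concern raised in the comment.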
        The input may be raw text or token ids. Text is tokenized with
        ``tokenizer.encode`` (no chat template applied).
        """
        model: str | None = None

        usage: UsageInfo


    class GetPPLRequest(BaseModel):
Let's rename it to "PPLRequest".
        input: str | list[int]


    class GetPPLResponse(BaseModel):
        created: int = Field(default_factory=lambda: int(time.time()))
        model: str = None
        ppl: float
        usage: UsageInfo
Since usage doesn't play an important role in the 'ppl' endpoint, we may remove it.
    request_input = request.input
    model_name = request.model or async_engine.model_name
Let's validate request.model
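One possible shape for that validation, as a minimal sketch: reject a model name the server does not serve instead of silently falling back. The helper `resolve_model_name` and its parameters are hypothetical, not part of the PR or of the existing server code.

```python
def resolve_model_name(requested, served_models, default):
    """Return the model name to report, rejecting unknown names.

    Hypothetical helper: `served_models` stands in for whatever list of
    model names the server exposes.
    """
    if requested is None:
        # No model given: fall back to the engine's own model name.
        return default
    if requested not in served_models:
        raise ValueError(f"The model {requested!r} does not exist.")
    return requested


print(resolve_model_name(None, ["internlm"], "internlm"))
print(resolve_model_name("internlm", ["internlm"], "internlm"))
```

In an HTTP handler the ValueError would typically be mapped to a 4xx error response rather than propagated.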
    # TurboMind needs one decode token to drive the request to
    # FINISH. The in-engine CE reduction still scores prompt tokens
    # only.
    max_new_tokens = 1 if self.backend == 'turbomind' else 0
Do we still have to differentiate between the engines? Can't we use the same max_new_tokens (1 is preferred) for both engines?
Also, async_get_logits has the same per-engine special case. Consider handling it as well.
    return output
    engine = self.async_engine
    async def _gather():
        return await asyncio.gather(*[engine.async_get_ppl(ids) for ids in input_ids])
Batch get_ppl concurrency is unbounded.
Consider reusing the limiter as Pipeline.infer does.
    async def _gather():
        sem = self._get_limiter()
        async def _one(ids):
            await sem.acquire()
            try:
                return await engine.async_get_ppl(ids)
            finally:
                sem.release()
        return await asyncio.gather(*[_one(ids) for ids in input_ids])
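For reference, the same bounding pattern can be tried outside the pipeline. The sketch below is self-contained and makes two assumptions: a plain asyncio.Semaphore stands in for the limiter, and `fake_get_ppl` is a hypothetical stub for engine.async_get_ppl that also records peak concurrency.

```python
import asyncio


async def bounded_gather(coro_fn, items, limit):
    """Run coro_fn over items with at most `limit` calls in flight."""
    sem = asyncio.Semaphore(limit)

    async def _one(item):
        # `async with` acquires the semaphore and releases it on exit,
        # even if coro_fn raises.
        async with sem:
            return await coro_fn(item)

    # gather returns results in the order of the input items.
    return await asyncio.gather(*[_one(item) for item in items])


async def _demo():
    active = 0
    peak = 0

    async def fake_get_ppl(ids):
        # Stub scorer: tracks how many calls overlap, then returns a float.
        nonlocal active, peak
        active += 1
        peak = max(peak, active)
        await asyncio.sleep(0.01)
        active -= 1
        return float(len(ids))

    batches = [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
    results = await bounded_gather(fake_get_ppl, batches, limit=2)
    return results, peak


results, peak = asyncio.run(_demo())
print(results, peak)
```

Using `async with` is equivalent to the explicit acquire/try/finally/release in the suggestion, just shorter.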
Usage
Accuracy