Add /get_ppl endpoint by irexyc · Pull Request #4679 · InternLM/lmdeploy

irexyc · 2026-06-15T12:42:11Z

Usage

curl -X POST 'http://127.0.0.1:23333/get_ppl' \
    -H 'Content-Type: application/json' \
    -d '{
      "input": [9707, 11, 1246, 525, 498, 3351, 30]
    }'


curl -X POST 'http://127.0.0.1:23333/get_ppl' \
    -H 'Content-Type: application/json' \
    -d '{"input": "Hello, how are you today? Hello"}'
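
The same two calls can be made from Python. The following is a minimal sketch using only the standard library; the helper names `build_ppl_request` and `get_ppl` are hypothetical and not part of lmdeploy, and the host and port are simply the ones used in the curl commands above. The response is returned as a plain dict because its exact fields are not shown in this description.

```python
import json
import urllib.request


def build_ppl_request(prompt, base_url="http://127.0.0.1:23333"):
    """Build a POST request for /get_ppl.

    `prompt` is either raw text (str) or a list of token ids, mirroring
    the two curl examples. Helper name and defaults are illustrative.
    """
    if not isinstance(prompt, str):
        prompt = [int(t) for t in prompt]
    body = json.dumps({"input": prompt}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/get_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_ppl(prompt, base_url="http://127.0.0.1:23333", timeout=30.0):
    """Send the request and return the decoded JSON response as a dict."""
    request = build_ppl_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running, `get_ppl([9707, 11, 1246, 525, 498, 3351, 30])` and `get_ppl("Hello, how are you today? Hello")` would correspond to the two commands above.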

Accuracy

dataset	version	metric	mode	old pytorch	new pytorch	Δ pytorch	old turbomind	new turbomind	Δ turbomind
race-high	9387ad	accuracy	ppl	62.75	62.75	0.00	62.61	62.72	+0.11
GPQA_diamond	4b5a83	accuracy	ppl	31.31	30.81	-0.50	32.32	31.31	-1.01
winogrande	252f01	accuracy	ll	57.06	57.06	0.00	57.38	57.38	0.00
mmlu-other	-	accuracy	ppl	49.74	49.74	0.00	49.83	49.47	-0.36

Copilot

Pull request overview

This PR adds a dedicated perplexity scoring API (POST /get_ppl) and wires end-to-end support for returning prompt cross-entropy loss (ce_loss) from both TurboMind and PyTorch backends, enabling PPL computation without exporting full logits to CPU.

Changes:

Add OpenAI-style /get_ppl endpoint + request/response models, backed by AsyncEngine.async_get_ppl.
Implement backend-side prompt CE-loss accumulation (TurboMind: CUDA kernel + output processor; PyTorch: per-step CE reduction with chunked-prefill boundary handling).
Simplify Pipeline.get_ppl to call async_get_ppl per sequence (removing the previous batching/long-text implementation).
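
For readers unfamiliar with the relationship between the prompt CE loss and perplexity, here is a small pure-Python sketch. It assumes the usual definition (perplexity is the exponential of the mean next-token cross-entropy) and that the summed loss is divided by the number of scored tokens; how the PR itself normalizes is not shown here, and the function names are illustrative only.

```python
import math


def prompt_ce_loss(logits, token_ids):
    """Summed (unnormalized) next-token cross-entropy over a prompt.

    logits[t] is the vocabulary-sized score vector produced after reading
    token_ids[t]; it is scored against the next token, token_ids[t + 1].
    The last row has no target and is skipped.
    """
    total = 0.0
    for row, target in zip(logits[:-1], token_ids[1:]):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
    return total


def perplexity(ce_loss, num_scored_tokens):
    """Perplexity from a summed CE loss (assumed normalization)."""
    return math.exp(ce_loss / num_scored_tokens)
```

With uniform logits over a vocabulary of size V, every scored token contributes log V, so the perplexity comes out as V regardless of prompt length.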

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

File	Description
tests/csrc/unittests/test_logprob_kernels.cu	Removed logprob kernel unit tests (file deleted).
tests/csrc/unittests/CMakeLists.txt	Stop building/linking the removed logprob kernel unit test.
src/turbomind/python/bind.cpp	Expose `GenerationConfig.return_ppl` to Python via pybind.
src/turbomind/models/output_processor.cc	Add CE-loss target collection, CE-loss computation from chunked logits, and output emission.
src/turbomind/models/CMakeLists.txt	Link `cross_entropy_kernels` into the models target.
src/turbomind/kernels/logprob_kernels.cu	Remove old logprob kernel implementation (file deleted).
src/turbomind/kernels/cross_entropy_kernels.h	Replace previous declaration with `invokeCrossEntropyLoss` API.
src/turbomind/kernels/cross_entropy_kernels.cu	New CUDA kernel to compute and accumulate cross-entropy loss on device.
src/turbomind/kernels/CMakeLists.txt	Replace `logprob_kernels` library target with `cross_entropy_kernels`.
src/turbomind/engine/request.h	Add `return_ppl` to config and per-request CE-loss state (`input_ce_loss`, `ce_loss`).
src/turbomind/engine/request.cc	Include `return_ppl` in config logging output.
src/turbomind/engine/model_request.cc	Allocate `ce_loss` output tensor when `return_ppl` is requested.
src/turbomind/engine/engine.cc	Treat `return_ppl` as incompatible with prefix caching in validation.
lmdeploy/turbomind/turbomind.py	Plumb `ce_loss` output into `EngineOutput` on FINISH.
lmdeploy/serve/openai/protocol.py	Add `GetPPLRequest`/`GetPPLResponse` protocol models.
lmdeploy/serve/openai/api_server.py	Add `/get_ppl` route that tokenizes (when needed) and calls `async_get_ppl`.
lmdeploy/serve/core/async_engine.py	Add `async_get_ppl` API and store `speculative_config` for validation.
lmdeploy/pytorch/messages.py	Add `out_ce_loss` flag and per-sequence CE-loss accumulator.
lmdeploy/pytorch/engine/model_agent/scoring.py	New helper to compute prompt CE-loss (supports chunked-prefill boundary token).
lmdeploy/pytorch/engine/model_agent/agent.py	Compute/propagate CE-loss from logits; request logits when CE-loss is needed.
lmdeploy/pytorch/engine/inputs_maker.py	Add `return_ce_loss` flag propagation for prefill input creation.
lmdeploy/pytorch/engine/engine.py	Extend inference output container with `ce_loss`.
lmdeploy/pytorch/engine/engine_loop.py	Accumulate and emit CE-loss per session on FINISH.
lmdeploy/pytorch/engine/engine_instance.py	Forward `ce_loss` from response payload into `InferOutput`.
lmdeploy/pipeline.py	Re-implement `get_ppl` using `async_get_ppl` per sequence.
lmdeploy/messages.py	Add `GenerationConfig.return_ppl` and `EngineOutput.ce_loss` fields/documentation.
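
The "chunked-prefill boundary token" mentioned for the PyTorch backend can be illustrated with a short sketch. When a prompt is prefilled in chunks, the last logits row of one chunk has its target token in the next chunk. The code below shows one possible way to handle this (carrying the last row forward); it is an assumption for illustration and not the logic in `scoring.py`, and all names are hypothetical.

```python
import math


def _nll(row, target):
    """Negative log-likelihood of `target` under softmax(row)."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row)) - row[target]


def chunked_prompt_ce_loss(logits, token_ids, chunk_size):
    """Accumulate the summed prompt CE loss one chunk at a time.

    Inside a chunk, logits[t] scores token_ids[t + 1]. The last row of a
    chunk is kept and scored against the first token of the next chunk.
    """
    total = 0.0
    pending_row = None  # last logits row of the previous chunk
    for start in range(0, len(token_ids), chunk_size):
        ids = token_ids[start:start + chunk_size]
        rows = logits[start:start + chunk_size]
        if pending_row is not None:
            total += _nll(pending_row, ids[0])  # boundary token
        for row, target in zip(rows[:-1], ids[1:]):
            total += _nll(row, target)
        pending_row = rows[-1]
    # The final pending row has no target (end of prompt) and is dropped.
    return total
```

The result should not depend on `chunk_size`: using a single chunk that covers the whole prompt gives the same sum as any smaller chunking, up to floating-point rounding.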

+                gen_config = GenerationConfig(max_new_tokens=max_new_tokens, return_ppl=True, top_k=1)
+                async with self.safe_run(handle,
+                                         session=session,
+                                         input_ids=input_ids,
+                                         gen_config=gen_config,
+                                         stream_output=False,
+                                         sequence_start=True,
+                                         sequence_end=True,
+                                         step=session.step) as gen:
+                    async for outputs in gen:
+                        pass
+                    ce_loss = outputs.ce_loss


grimoire · 2026-06-16T07:23:57Z


+        # compute summed, unnormalized prompt cross-entropy
+        ce_loss = None
+        if return_ce_loss and logits is not None:


last_logits is the input of async_sampling_logits, and it is also a slice of logits. During sampling, last_logits might be updated in place, so the CE loss should be computed before sampling.
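
The aliasing hazard can be reproduced without any tensor library. The sketch below uses a standard-library `memoryview` as a stand-in for a tensor slice (an assumption made purely for illustration; the real code operates on tensors), and a hypothetical in-place greedy filter in place of the actual sampling code.

```python
import math
from array import array


def nll(row, target):
    """Negative log-likelihood of `target` under softmax(row)."""
    m = max(row)
    return m + math.log(sum(math.exp(x - m) for x in row)) - row[target]


def mask_all_but_argmax_(row):
    """Hypothetical greedy filter that overwrites `row` in place."""
    best = max(range(len(row)), key=lambda i: row[i])
    for i in range(len(row)):
        if i != best:
            row[i] = float("-inf")
    return best


# Two rows of logits with vocabulary size 3, stored contiguously.
logits = array("d", [2.0, 1.0, 0.5, 0.2, 1.5, 0.1])
last_logits = memoryview(logits)[3:6]  # a view of the final row, not a copy

loss_before = nll(last_logits, 0)  # scoring first sees the original values
mask_all_but_argmax_(last_logits)  # the filter also mutates `logits`
loss_after = nll(last_logits, 0)   # scoring afterwards sees the masked row
```

Here `loss_before` is finite while `loss_after` is infinite, because the target's logit was overwritten with negative infinity through the shared buffer.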

lvhan028 · 2026-06-16T07:50:56Z

+    The input may be raw text or token ids. Text is tokenized with
+    ``tokenizer.encode`` (no chat template applied).
+    """
+    model: str | None = None


Can model be None?

lvhan028 · 2026-06-16T07:52:08Z

    usage: UsageInfo


+class GetPPLRequest(BaseModel):


Let's rename it "PPLRequest"

lvhan028 · 2026-06-16T07:52:18Z

+    input: str | list[int]
+
+
+class GetPPLResponse(BaseModel):


PPLResponse

lvhan028 · 2026-06-16T07:53:46Z

+    created: int = Field(default_factory=lambda: int(time.time()))
+    model: str = None
+    ppl: float
+    usage: UsageInfo


Since usage doesn't play an important role in the 'ppl' endpoint, we may remove it.

lvhan028 · 2026-06-16T07:59:40Z

+    request_input = request.input
+    model_name = request.model or async_engine.model_name


Let's validate request.model
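
A validation step along these lines could look like the sketch below. The helper `validate_model` and its arguments are hypothetical; in particular, treating a missing model name as "use the default" is an assumption that mirrors the `request.model or async_engine.model_name` fallback in the diff, not a confirmed design decision.

```python
def validate_model(requested, served_models, default):
    """Return the model name to use, or raise ValueError if it is unknown.

    requested: model name from the request, or None when omitted.
    served_models: names the server is willing to answer for.
    default: name to fall back to when the request omits the model.
    """
    if not requested:
        return default
    if requested not in served_models:
        raise ValueError(f"The model {requested!r} does not exist.")
    return requested
```

In a route handler, the `ValueError` would typically be turned into an HTTP error response rather than propagated.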

lvhan028 · 2026-06-16T08:27:57Z

+                # TurboMind needs one decode token to drive the request to
+                # FINISH. The in-engine CE reduction still scores prompt tokens
+                # only.
+                max_new_tokens = 1 if self.backend == 'turbomind' else 0


Do we have to differentiate between the engines now? Can't we use the same max_new_tokens (1 is preferred) for both engines?

The same case also appears in async_get_logits, so it may be worth taking into consideration as well.

lvhan028 · 2026-06-16T08:47:55Z

-        return output
+        engine = self.async_engine
+        async def _gather():
+            return await asyncio.gather(*[engine.async_get_ppl(ids) for ids in input_ids])


Batch get_ppl concurrency is unbounded.
Consider reusing the limiter as Pipeline.infer does.

async def _gather():
    sem = self._get_limiter()

    async def _one(ids):
        await sem.acquire()
        try:
            return await engine.async_get_ppl(ids)
        finally:
            sem.release()

    return await asyncio.gather(*[_one(ids) for ids in input_ids])
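
The idea behind the suggestion can be tried in isolation. The following self-contained sketch bounds concurrency with a plain `asyncio.Semaphore`; it assumes nothing about what `_get_limiter` returns, and `gather_bounded`, `fake_get_ppl`, and `stats` are hypothetical names used only for the demonstration.

```python
import asyncio


async def gather_bounded(coro_fns, limit):
    """Run zero-argument coroutine functions with at most `limit` in flight.

    Results are returned in input order, as with asyncio.gather.
    """
    sem = asyncio.Semaphore(limit)

    async def _one(fn):
        async with sem:  # released even if fn() raises
            return await fn()

    return await asyncio.gather(*[_one(fn) for fn in coro_fns])


stats = {"in_flight": 0, "peak": 0}


async def fake_get_ppl(ids):
    """Stand-in for an engine call; records how many calls overlap."""
    stats["in_flight"] += 1
    stats["peak"] = max(stats["peak"], stats["in_flight"])
    await asyncio.sleep(0.01)
    stats["in_flight"] -= 1
    return float(len(ids))


def run_demo():
    batch = [[1] * n for n in range(1, 9)]
    fns = [lambda ids=ids: fake_get_ppl(ids) for ids in batch]
    return asyncio.run(gather_bounded(fns, limit=3))
```

Calling `run_demo()` processes eight inputs while `stats["peak"]` never exceeds the limit of 3, which is the property the review comment asks for.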

irexyc added 4 commits June 11, 2026 07:53

support compute ce_loss in pytorch backend

12a6e1b

support compute ce_loss in turbomind backend

2cedd4e

remove old get_ppl function

41a5e97

remove logprob_kernels

8b6e861

Copilot AI review requested due to automatic review settings June 15, 2026 12:42

Copilot started reviewing on behalf of irexyc June 15, 2026 12:42

Copilot AI reviewed Jun 15, 2026

fix comments

92edf17

lvhan028 added the enhancement label (New feature or request) Jun 15, 2026

lvhan028 requested review from grimoire and lvhan028 June 15, 2026 13:33

lvhan028 reviewed Jun 16, 2026

Comment thread lmdeploy/pipeline.py Outdated

remove unused

dd75ade

grimoire reviewed Jun 16, 2026

lvhan028 reviewed Jun 16, 2026

irexyc added 5 commits June 16, 2026 09:32

compute ce loss before sampling

9bf6378

simplify request/response structure

ff3ff14

concurrency for get_ppl

3485627

add testcase

c168fba

Merge remote-tracking branch 'lmdeploy/main' into get_ppl

29edfa0

grimoire approved these changes Jun 16, 2026

Copilot AI mentioned this pull request Jun 16, 2026

fix: add return_ce_loss attribute to _DummySeq test stub #4683

Closed

fix ci

5a0546c

waynehacking8 mentioned this pull request Jun 16, 2026

[Bugfix] Fix double-counted max_q_seqlen in decode delta kv_seqlens #4685

Open

irexyc wants to merge 12 commits into InternLM:main from irexyc:get_ppl

4 participants
