Add /get_ppl endpoint #4679

Open
irexyc wants to merge 12 commits into InternLM:main from irexyc:get_ppl

irexyc commented Jun 15, 2026

Collaborator

Usage

curl -X POST 'http://127.0.0.1:23333/get_ppl' \
    -H 'Content-Type: application/json' \
    -d '{
      "input": [9707, 11, 1246, 525, 498, 3351, 30]
    }'


curl -X POST 'http://127.0.0.1:23333/get_ppl' \
    -H 'Content-Type: application/json' \
    -d '{"input": "Hello, how are you today? Hello"}'
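The two curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library. It assumes the server address from the curl examples and a JSON response carrying a `ppl` field, as in the `GetPPLResponse` snippet quoted later in this thread; the helper names `build_ppl_request` and `get_ppl` are hypothetical, and the response shape may change since reviewers asked for renames.

```python
import json
import urllib.request


def build_ppl_request(text_or_ids, base_url="http://127.0.0.1:23333"):
    """Build a POST request for /get_ppl.

    `text_or_ids` is either raw text (str) or a list of token ids.
    """
    if not isinstance(text_or_ids, str):
        text_or_ids = [int(t) for t in text_or_ids]
    body = json.dumps({"input": text_or_ids}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/get_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def get_ppl(text_or_ids, base_url="http://127.0.0.1:23333", timeout=30.0):
    """Send the request and return the `ppl` field (assumed response shape)."""
    req = build_ppl_request(text_or_ids, base_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())["ppl"]
```

With a server running, `get_ppl("Hello, how are you today? Hello")` or `get_ppl([9707, 11, 1246, 525, 498, 3351, 30])` would mirror the two curl calls.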

Accuracy

| dataset | version | metric | mode | old pytorch | new pytorch | Δ pytorch | old turbomind | new turbomind | Δ turbomind |
|---|---|---|---|---|---|---|---|---|---|
| race-high | 9387ad | accuracy | ppl | 62.75 | 62.75 | 0.00 | 62.61 | 62.72 | +0.11 |
| GPQA_diamond | 4b5a83 | accuracy | ppl | 31.31 | 30.81 | -0.50 | 32.32 | 31.31 | -1.01 |
| winogrande | 252f01 | accuracy | ll | 57.06 | 57.06 | 0.00 | 57.38 | 57.38 | 0.00 |
| mmlu-other | - | accuracy | ppl | 49.74 | 49.74 | 0.00 | 49.83 | 49.47 | -0.36 |

Copilot AI review requested due to automatic review settings June 15, 2026 12:42

Copilot AI left a comment

Contributor

Pull request overview

This PR adds a dedicated perplexity scoring API (POST /get_ppl) and wires end-to-end support for returning prompt cross-entropy loss (ce_loss) from both TurboMind and PyTorch backends, enabling PPL computation without exporting full logits to CPU.

Changes:

  • Add OpenAI-style /get_ppl endpoint + request/response models, backed by AsyncEngine.async_get_ppl.
  • Implement backend-side prompt CE-loss accumulation (TurboMind: CUDA kernel + output processor; PyTorch: per-step CE reduction with chunked-prefill boundary handling).
  • Simplify Pipeline.get_ppl to call async_get_ppl per sequence (removing the previous batching/long-text implementation).
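To make the relationship between the summed prompt cross-entropy (`ce_loss`) and perplexity concrete, here is a small pure-Python sketch. It is not the PR's kernel or helper: the function names are hypothetical, and dividing by the number of scored tokens (prompt length minus one) is an assumption about the normalization, which this page does not show.

```python
import math


def summed_prompt_ce(logits, token_ids):
    """Sum of -log p(token[t+1] | tokens up to t) over the prompt.

    logits[t] is the vocabulary-sized score vector produced at position t.
    The first token has no prediction, so len(token_ids) - 1 tokens are scored.
    """
    total = 0.0
    for t in range(len(token_ids) - 1):
        row = logits[t]
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[token_ids[t + 1]]
    return total


def ppl_from_ce(ce_sum, num_scored):
    """Perplexity as exp of the mean cross-entropy (assumed normalization)."""
    return math.exp(ce_sum / num_scored)
```

For uniform logits over a vocabulary of size V, every scored token costs log V, so the perplexity comes out as V. Only one float per sequence has to leave the device this way, which is the point of not exporting full logits.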

Reviewed changes

Copilot reviewed 26 out of 26 changed files in this pull request and generated 2 comments.

Show a summary per file
| File | Description |
|---|---|
| tests/csrc/unittests/test_logprob_kernels.cu | Removed logprob kernel unit tests (file deleted). |
| tests/csrc/unittests/CMakeLists.txt | Stop building/linking the removed logprob kernel unit test. |
| src/turbomind/python/bind.cpp | Expose GenerationConfig.return_ppl to Python via pybind. |
| src/turbomind/models/output_processor.cc | Add CE-loss target collection, CE-loss computation from chunked logits, and output emission. |
| src/turbomind/models/CMakeLists.txt | Link cross_entropy_kernels into the models target. |
| src/turbomind/kernels/logprob_kernels.cu | Remove old logprob kernel implementation (file deleted). |
| src/turbomind/kernels/cross_entropy_kernels.h | Replace previous declaration with invokeCrossEntropyLoss API. |
| src/turbomind/kernels/cross_entropy_kernels.cu | New CUDA kernel to compute and accumulate cross-entropy loss on device. |
| src/turbomind/kernels/CMakeLists.txt | Replace logprob_kernels library target with cross_entropy_kernels. |
| src/turbomind/engine/request.h | Add return_ppl to config and per-request CE-loss state (input_ce_loss, ce_loss). |
| src/turbomind/engine/request.cc | Include return_ppl in config logging output. |
| src/turbomind/engine/model_request.cc | Allocate ce_loss output tensor when return_ppl is requested. |
| src/turbomind/engine/engine.cc | Treat return_ppl as incompatible with prefix caching in validation. |
| lmdeploy/turbomind/turbomind.py | Plumb ce_loss output into EngineOutput on FINISH. |
| lmdeploy/serve/openai/protocol.py | Add GetPPLRequest/GetPPLResponse protocol models. |
| lmdeploy/serve/openai/api_server.py | Add /get_ppl route that tokenizes (when needed) and calls async_get_ppl. |
| lmdeploy/serve/core/async_engine.py | Add async_get_ppl API and store speculative_config for validation. |
| lmdeploy/pytorch/messages.py | Add out_ce_loss flag and per-sequence CE-loss accumulator. |
| lmdeploy/pytorch/engine/model_agent/scoring.py | New helper to compute prompt CE-loss (supports chunked-prefill boundary token). |
| lmdeploy/pytorch/engine/model_agent/agent.py | Compute/propagate CE-loss from logits; request logits when CE-loss is needed. |
| lmdeploy/pytorch/engine/inputs_maker.py | Add return_ce_loss flag propagation for prefill input creation. |
| lmdeploy/pytorch/engine/engine.py | Extend inference output container with ce_loss. |
| lmdeploy/pytorch/engine/engine_loop.py | Accumulate and emit CE-loss per session on FINISH. |
| lmdeploy/pytorch/engine/engine_instance.py | Forward ce_loss from response payload into InferOutput. |
| lmdeploy/pipeline.py | Re-implement get_ppl using async_get_ppl per sequence. |
| lmdeploy/messages.py | Add GenerationConfig.return_ppl and EngineOutput.ce_loss fields/documentation. |

Comment thread lmdeploy/serve/core/async_engine.py Outdated
Comment on lines +887 to +898
gen_config = GenerationConfig(max_new_tokens=max_new_tokens, return_ppl=True, top_k=1)
async with self.safe_run(handle,
                         session=session,
                         input_ids=input_ids,
                         gen_config=gen_config,
                         stream_output=False,
                         sequence_start=True,
                         sequence_end=True,
                         step=session.step) as gen:
    async for outputs in gen:
        pass
    ce_loss = outputs.ce_loss
Comment thread lmdeploy/pipeline.py
lvhan028 added the enhancement label Jun 15, 2026
lvhan028 requested review from grimoire and lvhan028 June 15, 2026 13:33
Comment thread lmdeploy/pipeline.py Outdated

# compute summed, unnormalized prompt cross-entropy
ce_loss = None
if return_ce_loss and logits is not None:

Collaborator

last_logits is the input to async_sampling_logits, and it is also a slice of logits. During sampling, last_logits might be updated in place.
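The aliasing hazard described here can be illustrated with standard-library buffers as a stand-in for tensors (basic slicing of a PyTorch tensor likewise returns a view that shares storage). This is only an illustration under that analogy: the variable names echo the comment, the values are made up, and the actual order of CE computation and sampling in the PR is not visible on this page.

```python
from array import array

vocab = 3
# Two rows of logits, flattened: row 0 = [0.5, 1.5, -2.0], row 1 = [3.0, 0.25, 1.0]
logits = array("d", [0.5, 1.5, -2.0, 3.0, 0.25, 1.0])

# A memoryview slice shares memory with `logits`, like a tensor view would.
last_logits = memoryview(logits)[-vocab:]

# Safe option: copy the values needed for CE before sampling touches them.
snapshot = list(last_logits)

# Sampling-style in-place edit, e.g. masking out a token.
last_logits[0] = float("-inf")

# The edit is visible through `logits`; only the snapshot kept the old value.
mutated_value = logits[3]
preserved_value = snapshot[0]
```

The design choice this points at is either computing the cross-entropy before sampling runs, or working on a copy of the slice.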

Comment thread lmdeploy/serve/openai/protocol.py Outdated
The input may be raw text or token ids. Text is tokenized with
``tokenizer.encode`` (no chat template applied).
"""
model: str | None = None

Collaborator

Can model be None?

Comment thread lmdeploy/serve/openai/protocol.py Outdated
usage: UsageInfo


class GetPPLRequest(BaseModel):

Collaborator

Let's rename it "PPLRequest"

Comment thread lmdeploy/serve/openai/protocol.py Outdated
input: str | list[int]


class GetPPLResponse(BaseModel):

Collaborator

PPLResponse

Comment thread lmdeploy/serve/openai/protocol.py Outdated
created: int = Field(default_factory=lambda: int(time.time()))
model: str = None
ppl: float
usage: UsageInfo

Collaborator

Since usage doesn't play an important role in the 'ppl' endpoint, we may remove it.

Comment thread lmdeploy/serve/openai/api_server.py Outdated
Comment on lines +1198 to +1199
request_input = request.input
model_name = request.model or async_engine.model_name

Collaborator

Let's validate request.model
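One possible shape for such a check is sketched below. `check_model` is a hypothetical helper, not an existing lmdeploy function, and treating a missing `model` as "use the default served model" is an assumption that another review comment in this thread explicitly questions.

```python
from http import HTTPStatus


def check_model(requested, served_names):
    """Return None if the model name is acceptable, else (status, message)."""
    if requested is None:
        # Assumption: omitting `model` selects the default served model.
        return None
    if requested in served_names:
        return None
    return (HTTPStatus.NOT_FOUND, f"The model {requested!r} does not exist.")
```

A route handler could call this before tokenizing and return the error tuple as an HTTP error response instead of silently falling back to `async_engine.model_name`.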

Comment thread lmdeploy/serve/core/async_engine.py Outdated
Comment on lines +881 to +884
# TurboMind needs one decode token to drive the request to
# FINISH. The in-engine CE reduction still scores prompt tokens
# only.
max_new_tokens = 1 if self.backend == 'turbomind' else 0

lvhan028 Jun 16, 2026

Collaborator

Do we have to differentiate between the engines now? Can't we use the same max_new_tokens (1 is preferred) for both engines?

Collaborator

Also, async_get_logits has the same case. Please take it into consideration.

Comment thread lmdeploy/pipeline.py Outdated
return output
engine = self.async_engine
async def _gather():
    return await asyncio.gather(*[engine.async_get_ppl(ids) for ids in input_ids])

lvhan028 Jun 16, 2026

Collaborator

Batch get_ppl concurrency is unbounded.
Consider reusing the limiter as Pipeline.infer does.

async def _gather():
    sem = self._get_limiter()

    async def _one(ids):
        await sem.acquire()
        try:
            return await engine.async_get_ppl(ids)
        finally:
            sem.release()

    return await asyncio.gather(*[_one(ids) for ids in input_ids])
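A self-contained way to see the effect of this suggestion is sketched below. It assumes the limiter behaves like a plain `asyncio.Semaphore` (the return type of `self._get_limiter()` is not shown here), and `bounded_gather` and `fake_get_ppl` are hypothetical stand-ins, with the latter replacing `engine.async_get_ppl`.

```python
import asyncio


async def bounded_gather(coro_fn, items, limit):
    """Run coro_fn(item) for every item with at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def _one(item):
        # `async with` acquires and always releases, like try/finally.
        async with sem:
            return await coro_fn(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*[_one(item) for item in items])


async def _demo():
    in_flight = 0
    peak = 0

    async def fake_get_ppl(ids):
        # Stand-in for engine.async_get_ppl; tracks concurrent calls.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return float(len(ids))

    batches = [[1], [1, 2], [1, 2, 3], [1, 2, 3, 4]]
    results = await bounded_gather(fake_get_ppl, batches, limit=2)
    return results, peak


results, peak = asyncio.run(_demo())
```

Results still come back in input order, while the number of requests submitted to the engine at once never exceeds the limit.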

Labels

enhancement New feature or request

4 participants