
Fix off-by-one in completion_tokens count in generate_stream#3843

Open
Chessing234 wants to merge 1 commit into lm-sys:main from Chessing234:fix/completion-tokens-off-by-one

Conversation

@Chessing234

Summary

In generate_stream(), the loop for i in range(max_new_tokens) generates a token and appends it to output_ids before yielding. However, the usage dict reports completion_tokens: i instead of i + 1. Since i is 0-indexed and the token has already been appended by the time the yield executes, every streaming chunk and the final response undercount by exactly 1.

Example: After the first token is generated (i=0), the response reports completion_tokens: 0 instead of 1. After N tokens, it reports N-1 instead of N.

Fix: Use i + 1 for completion_tokens and total_tokens in both the streaming yield (line 282-283) and the final yield (line 303-304).

Reference: The vLLM worker (vllm_worker.py) correctly uses len(output.token_ids) for the same field, confirming the intended semantics.
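To illustrate the counting logic, here is a minimal, self-contained sketch of the loop shape described above (hypothetical names; not the actual FastChat code), showing why `i + 1` matches the number of tokens already appended at yield time:

```python
def generate_stream_sketch(prompt_tokens, max_new_tokens):
    """Simplified stand-in for the generate_stream() loop structure."""
    output_ids = []
    for i in range(max_new_tokens):
        token = i  # stand-in for a sampled token id
        output_ids.append(token)  # token is appended *before* the yield
        completion_tokens = i + 1  # fixed: was `i`, which lagged by one
        yield {
            "usage": {
                "prompt_tokens": prompt_tokens,
                "completion_tokens": completion_tokens,
                "total_tokens": prompt_tokens + completion_tokens,
            }
        }


chunks = list(generate_stream_sketch(prompt_tokens=5, max_new_tokens=3))
# With the fix, each chunk's completion_tokens equals len(output_ids)
# at the moment of the yield: 1, 2, 3 — matching the vLLM semantics
# of len(output.token_ids).
```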

Test plan

  • Verify with a simple generation request that completion_tokens matches the actual number of tokens in the response
  • Confirm total_tokens = prompt_tokens + completion_tokens holds

🤖 Generated with Claude Code

The loop `for i in range(max_new_tokens)` appends a token before
yielding, but reports `completion_tokens: i` instead of `i + 1`.
Since i is 0-indexed and the token is already appended, this
undercounts by 1 on every streaming chunk and on the final response.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
