add enable_prefill_decode_mixed start args #1315

Merged
hiworldwzj merged 2 commits into main from wzj_dev
May 22, 2026

Conversation

@hiworldwzj
Collaborator

No description provided.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces the --enable_prefill_decode_mixed flag, enabling the simultaneous execution of prefill and decode requests in a single scheduling step when run_mode is normal. This feature aims to enhance throughput by allowing decode progress even when new prefill requests are present. The implementation includes documentation updates, CLI argument additions, and logic in the base backend to merge request types when batch limits allow. Review feedback highlights the need for more precise token counting for decode requests to accommodate speculative decoding and suggests enforcing the documented incompatibility with microbatch overlap features.

Comment on lines +685 to +688
if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
    if prefill_tokens + len(decode_reqs) <= self.batch_max_tokens:
        prefill_reqs.extend(decode_reqs)
        decode_reqs = []
Contributor

high

Using len(decode_reqs) as the token count for decode requests is only accurate if each request requires exactly one token. However, when features like Multi-Token Prediction (MTP) or speculative decoding are enabled, a decode request may require multiple tokens per step. It is safer to calculate the total required tokens using decode_need_token_num() to ensure the batch_max_tokens limit is not exceeded.

Suggested change
- if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
-     if prefill_tokens + len(decode_reqs) <= self.batch_max_tokens:
-         prefill_reqs.extend(decode_reqs)
-         decode_reqs = []
+ if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
+     decode_tokens = sum(req.decode_need_token_num() for req in decode_reqs)
+     if prefill_tokens + decode_tokens <= self.batch_max_tokens:
+         prefill_reqs.extend(decode_reqs)
+         decode_reqs = []
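
To make the difference between the two checks concrete, here is a minimal, self-contained sketch. `FakeReq` and `try_merge` are hypothetical stand-ins written for this illustration and are not taken from the repository; only the names `decode_need_token_num()` and `batch_max_tokens` mirror the review comment, and the token counts are invented.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class FakeReq:
    """Hypothetical stand-in for a scheduler request (not the real class)."""
    tokens_per_step: int = 1  # values > 1 model MTP / speculative decoding

    def decode_need_token_num(self) -> int:
        return self.tokens_per_step


def try_merge(
    prefill_reqs: List[FakeReq],
    prefill_tokens: int,
    decode_reqs: List[FakeReq],
    batch_max_tokens: int,
) -> Tuple[List[FakeReq], List[FakeReq]]:
    """Fold decode requests into the prefill batch only if the summed
    per-request token need still fits under batch_max_tokens."""
    if prefill_reqs and decode_reqs:
        decode_tokens = sum(r.decode_need_token_num() for r in decode_reqs)
        if prefill_tokens + decode_tokens <= batch_max_tokens:
            return prefill_reqs + decode_reqs, []
    return prefill_reqs, decode_reqs


prefill = [FakeReq()]
decode = [FakeReq(tokens_per_step=3) for _ in range(4)]  # needs 4 * 3 = 12 tokens

# The original check counts one token per decode request: 90 + 4 = 94 <= 100.
naive_fits = 90 + len(decode) <= 100

# The suggested check sums the real need: 90 + 12 = 102 > 100, so no merge.
merged, remaining = try_merge(prefill, 90, decode, 100)
print(naive_fits, len(merged), len(remaining))
```

With these invented numbers the original check would admit the merge and exceed the limit by two tokens, while the summed check leaves the decode requests in their own batch.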

Comment on lines +191 to +192
if args.enable_prefill_decode_mixed:
    assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
Contributor

medium

The documentation for --enable_prefill_decode_mixed explicitly states that it cannot be used together with --enable_prefill_microbatch_overlap or --enable_decode_microbatch_overlap. This constraint should be enforced here to prevent invalid configurations.

Suggested change
- if args.enable_prefill_decode_mixed:
-     assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
+ if args.enable_prefill_decode_mixed:
+     assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
+     assert not args.enable_prefill_microbatch_overlap and not args.enable_decode_microbatch_overlap, \
+         "--enable_prefill_decode_mixed cannot be used with microbatch overlap"

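A standalone sketch of how such a startup check could behave is shown below. The flag names come from this pull request, but the parser setup is an assumption made for illustration: the flags are modeled as `store_true` switches and `--run_mode` defaults to `normal`, which may not match the project's actual argument definitions. The sketch also raises `ValueError` instead of using `assert`, a deliberate deviation from the suggestion, because assertions are skipped when Python runs with `-O`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical minimal parser; the real project defines many more options."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_mode", default="normal")  # default is assumed
    parser.add_argument("--enable_prefill_decode_mixed", action="store_true")
    parser.add_argument("--enable_prefill_microbatch_overlap", action="store_true")
    parser.add_argument("--enable_decode_microbatch_overlap", action="store_true")
    return parser


def validate_args(args: argparse.Namespace) -> None:
    """Reject the flag combinations that the review comment calls invalid."""
    if not args.enable_prefill_decode_mixed:
        return
    if args.run_mode != "normal":
        raise ValueError("--enable_prefill_decode_mixed only supports run_mode normal")
    if args.enable_prefill_microbatch_overlap or args.enable_decode_microbatch_overlap:
        raise ValueError(
            "--enable_prefill_decode_mixed cannot be used with microbatch overlap"
        )


# A valid configuration passes silently.
ok_args = build_parser().parse_args(["--enable_prefill_decode_mixed"])
validate_args(ok_args)

# Combining the flag with a microbatch overlap option is rejected.
bad_args = build_parser().parse_args(
    ["--enable_prefill_decode_mixed", "--enable_decode_microbatch_overlap"]
)
try:
    validate_args(bad_args)
except ValueError as exc:
    print(f"rejected: {exc}")
```
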
@hiworldwzj hiworldwzj merged commit eaa3b28 into main May 22, 2026
1 check passed
@hiworldwzj hiworldwzj deleted the wzj_dev branch May 22, 2026 07:59