add enable_prefill_decode_mixed start args #1315
Code Review
This pull request introduces the --enable_prefill_decode_mixed flag, enabling the simultaneous execution of prefill and decode requests in a single scheduling step when run_mode is normal. This feature aims to enhance throughput by allowing decode progress even when new prefill requests are present. The implementation includes documentation updates, CLI argument additions, and logic in the base backend to merge request types when batch limits allow. Review feedback highlights the need for more precise token counting for decode requests to accommodate speculative decoding and suggests enforcing the documented incompatibility with microbatch overlap features.
    if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
        if prefill_tokens + len(decode_reqs) <= self.batch_max_tokens:
            prefill_reqs.extend(decode_reqs)
            decode_reqs = []
Using len(decode_reqs) as the token count for decode requests is only accurate if each request requires exactly one token. However, when features like Multi-Token Prediction (MTP) or speculative decoding are enabled, a decode request may require multiple tokens per step. It is safer to calculate the total required tokens using decode_need_token_num() to ensure the batch_max_tokens limit is not exceeded.
Suggested change:
    - if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
    -     if prefill_tokens + len(decode_reqs) <= self.batch_max_tokens:
    -         prefill_reqs.extend(decode_reqs)
    -         decode_reqs = []
    + if self.args.enable_prefill_decode_mixed and len(prefill_reqs) > 0 and len(decode_reqs) > 0:
    +     decode_tokens = sum(req.decode_need_token_num() for req in decode_reqs)
    +     if prefill_tokens + decode_tokens <= self.batch_max_tokens:
    +         prefill_reqs.extend(decode_reqs)
    +         decode_reqs = []
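The difference between the two token counts can be shown with a small standalone sketch. It is illustrative only: `StubReq` and `merge_if_fits` are hypothetical names invented here, and only `decode_need_token_num()` and the `batch_max_tokens` budget come from the review above. The stub assumes each decode request reports a fixed number of tokens per step.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class StubReq:
    """Hypothetical stand-in for a scheduler request (not the project's class)."""
    name: str
    tokens_per_step: int = 1  # values > 1 model MTP / speculative decoding

    def decode_need_token_num(self) -> int:
        # Tokens this request consumes in one decode step.
        return self.tokens_per_step


def merge_if_fits(
    prefill_reqs: List[StubReq],
    decode_reqs: List[StubReq],
    prefill_tokens: int,
    batch_max_tokens: int,
) -> Tuple[List[StubReq], List[StubReq]]:
    """Return (prefill_reqs, decode_reqs) after an optional merge.

    Decode requests are folded into the prefill batch only when the
    combined token count stays within batch_max_tokens.
    """
    if not prefill_reqs or not decode_reqs:
        return prefill_reqs, decode_reqs
    # Sum the real per-request cost instead of assuming one token each.
    decode_tokens = sum(r.decode_need_token_num() for r in decode_reqs)
    if prefill_tokens + decode_tokens <= batch_max_tokens:
        return prefill_reqs + decode_reqs, []
    return prefill_reqs, decode_reqs


if __name__ == "__main__":
    prefill = [StubReq("p0")]
    decode = [StubReq("d0", 3), StubReq("d1", 3)]
    # len(decode) == 2 would give 10 + 2 <= 12 and merge, but the real cost is 10 + 6.
    merged, remaining = merge_if_fits(prefill, decode, prefill_tokens=10, batch_max_tokens=12)
    print(len(merged), len(remaining))
```

With a budget of 12 tokens, a count based on `len(decode_reqs)` would admit the merge even though the two decode requests need six tokens between them, which is the overflow the comment warns about.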
    if args.enable_prefill_decode_mixed:
        assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
The documentation for --enable_prefill_decode_mixed explicitly states that it cannot be used together with --enable_prefill_microbatch_overlap or --enable_decode_microbatch_overlap. This constraint should be enforced here to prevent invalid configurations.
Suggested change:
    - if args.enable_prefill_decode_mixed:
    -     assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
    + if args.enable_prefill_decode_mixed:
    +     assert args.run_mode == "normal", "--enable_prefill_decode_mixed only supports run_mode normal"
    +     assert not args.enable_prefill_microbatch_overlap and not args.enable_decode_microbatch_overlap, \
    +         "--enable_prefill_decode_mixed cannot be used with microbatch overlap"
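For reference, the same constraint can be sketched as a standalone check. This is a minimal illustration, not the project's startup code: `build_parser` and `validate_args` are hypothetical helpers, the flag names are taken from the review above, and the `normal` default for `--run_mode` is an assumption. It raises `ValueError` rather than using `assert`, because assertions are skipped when Python runs with `-O`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser holding only the flags discussed in this review."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--run_mode", default="normal")  # default is an assumption
    parser.add_argument("--enable_prefill_decode_mixed", action="store_true")
    parser.add_argument("--enable_prefill_microbatch_overlap", action="store_true")
    parser.add_argument("--enable_decode_microbatch_overlap", action="store_true")
    return parser


def validate_args(args: argparse.Namespace) -> None:
    """Reject flag combinations the documentation describes as unsupported."""
    if not args.enable_prefill_decode_mixed:
        return
    if args.run_mode != "normal":
        raise ValueError("--enable_prefill_decode_mixed only supports run_mode normal")
    if args.enable_prefill_microbatch_overlap or args.enable_decode_microbatch_overlap:
        raise ValueError("--enable_prefill_decode_mixed cannot be used with microbatch overlap")


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--enable_prefill_decode_mixed", "--enable_decode_microbatch_overlap"]
    )
    try:
        validate_args(args)
    except ValueError as exc:
        print(f"invalid configuration: {exc}")
```

Whether to keep `assert` for consistency with the surrounding checks or switch to an explicit exception is a design choice for the maintainers; the sketch only shows that both conditions are checked before the server starts.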
No description provided.