2 changes: 1 addition & 1 deletion examples/specdec_bench/run.py
@@ -265,7 +265,7 @@ def run_simple(args):
         type=str,
         required=False,
         default="EAGLE3",
-        choices=["EAGLE3", "EAGLE", "DRAFT_TARGET", "NGRAM", "MTP", "NONE"],
+        choices=["EAGLE3", "EAGLE", "DRAFT_TARGET", "NGRAM", "MTP", "DFLASH", "NONE"],
         help="Speculative algorithm to use",
     )
     parser.add_argument("--model_dir", type=str, required=True, help="Path to the model directory")
76 changes: 40 additions & 36 deletions examples/specdec_bench/specdec_bench/models/sglang.py
@@ -43,44 +43,48 @@ def __init__(
             speculative_algorithm = "LOOKAHEAD"
         elif speculative_algorithm == "NONE":
             speculative_algorithm = None
+
+        engine_kwargs = dict(
+            model_path=model_dir,
+            skip_tokenizer_init=True,
+            trust_remote_code=kwargs.get("trust_remote_code", False),
+            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
+            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
+            tp_size=kwargs.get("tensor_parallel_size", 1),
+            ep_size=kwargs.get("moe_expert_parallel_size", 1),
+            torch_compile_max_bs=max_concurrent_requests,
+            max_running_requests=max_concurrent_requests,
+            attention_backend=kwargs.get("attention_backend"),
+            enable_torch_compile=kwargs.get("enable_torch_compile", False),
+            cuda_graph_max_bs=max_concurrent_requests,
+            disable_cuda_graph=False,
+            disable_cuda_graph_padding=True,
Comment on lines +59 to +61
[IMPORTANT Compatibility] disable_cuda_graph_padding=True is now applied to all SGLang paths — non-speculative, EAGLE3, EAGLE, MTP, DRAFT_TARGET, NGRAM — not just DFLASH.

Before this PR, this kwarg wasn't passed, so SGLang's default (False, padding enabled) applied. Disabling padding can force more CUDA-graph recompilations / runs at non-bucketed batch sizes, which may shift latency/throughput numbers for the other algorithms that this benchmark is meant to compare.

Since the PR description states this flag is needed specifically to avoid bucket-padding mismatches during DFLASH replay, suggest gating it on DFLASH (or making it kwargs-overridable) so existing EAGLE/MTP/etc. benchmark results remain comparable to runs from before this change. For example, drop disable_cuda_graph_padding from the shared engine_kwargs dict and only set it inside the if speculative_algorithm == "DFLASH": branch.
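A minimal sketch of the suggested gating, assuming the kwargs plumbing from this diff (build_engine_kwargs is a hypothetical standalone helper, not part of the PR, and only a subset of the shared keys is shown; the real fix would edit __init__ in place):

def build_engine_kwargs(model_dir, max_concurrent_requests, speculative_algorithm, **kwargs):
    """Sketch: shared SGLang engine kwargs with CUDA-graph padding gated on DFLASH."""
    engine_kwargs = {
        "model_path": model_dir,
        "skip_tokenizer_init": True,
        "max_running_requests": max_concurrent_requests,
        "cuda_graph_max_bs": max_concurrent_requests,
        "disable_cuda_graph": False,
        # disable_cuda_graph_padding intentionally not set here, so non-DFLASH
        # algorithms keep SGLang's default (padding enabled) and remain
        # comparable to pre-change benchmark runs.
    }
    if speculative_algorithm == "DFLASH":
        # Avoids bucket-padding mismatches during DFLASH replay; still
        # overridable from the runtime-params YAML via kwargs.
        engine_kwargs["disable_cuda_graph_padding"] = kwargs.get(
            "disable_cuda_graph_padding", True
        )
    return engine_kwargs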

+        )
Comment on lines +47 to +62
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix CI blocker: replace dict(...) with a literal.

Line 47 triggers Ruff C408 in the pipeline, so this currently fails code-quality checks.

Suggested patch
-        engine_kwargs = dict(
-            model_path=model_dir,
-            skip_tokenizer_init=True,
-            trust_remote_code=kwargs.get("trust_remote_code", False),
-            mem_fraction_static=kwargs.get("mem_fraction_static", 0.8),
-            disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-            tp_size=kwargs.get("tensor_parallel_size", 1),
-            ep_size=kwargs.get("moe_expert_parallel_size", 1),
-            torch_compile_max_bs=max_concurrent_requests,
-            max_running_requests=max_concurrent_requests,
-            attention_backend=kwargs.get("attention_backend"),
-            enable_torch_compile=kwargs.get("enable_torch_compile", False),
-            cuda_graph_max_bs=max_concurrent_requests,
-            disable_cuda_graph=False,
-            disable_cuda_graph_padding=True,
-        )
+        engine_kwargs = {
+            "model_path": model_dir,
+            "skip_tokenizer_init": True,
+            "trust_remote_code": kwargs.get("trust_remote_code", False),
+            "mem_fraction_static": kwargs.get("mem_fraction_static", 0.8),
+            "disable_overlap_schedule": kwargs.get("disable_overlap_schedule", False),
+            "tp_size": kwargs.get("tensor_parallel_size", 1),
+            "ep_size": kwargs.get("moe_expert_parallel_size", 1),
+            "torch_compile_max_bs": max_concurrent_requests,
+            "max_running_requests": max_concurrent_requests,
+            "attention_backend": kwargs.get("attention_backend"),
+            "enable_torch_compile": kwargs.get("enable_torch_compile", False),
+            "cuda_graph_max_bs": max_concurrent_requests,
+            "disable_cuda_graph": False,
+            "disable_cuda_graph_padding": True,
+        }
🧰 Tools
🪛 GitHub Actions: Code Quality / code-quality

[error] 47-62: ruff-check failed (hook id: ruff-check). C408 Unnecessary dict() call (rewrite as a literal) at examples/specdec_bench/specdec_bench/models/sglang.py:47:25.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/specdec_bench/specdec_bench/models/sglang.py` around lines 47-62, the engine_kwargs construction uses dict(...), which triggers Ruff C408. Replace the dict(...) call with a dict literal for engine_kwargs (the block that sets model_path, skip_tokenizer_init, trust_remote_code, mem_fraction_static, disable_overlap_schedule, tp_size, ep_size, torch_compile_max_bs, max_running_requests, attention_backend, enable_torch_compile, cuda_graph_max_bs, disable_cuda_graph, and disable_cuda_graph_padding) so the same key/value pairs are expressed with a { ... } literal instead of dict(...).

         if speculative_algorithm is not None:
             # https://github.com/sgl-project/sglang/pull/3582
-            self.model = sgl.Engine(
-                model_path=model_dir,
-                skip_tokenizer_init=True,
-                trust_remote_code=kwargs.get("trust_remote_code", False),
-                mem_fraction_static=0.8,
-                disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-                tp_size=kwargs.get("tensor_parallel_size", 1),
-                ep_size=kwargs.get("moe_expert_parallel_size", 1),
-                speculative_algorithm=speculative_algorithm,
-                speculative_num_steps=kwargs.get("speculative_num_steps", 3),
-                speculative_eagle_topk=kwargs.get("speculative_eagle_topk", 1),
-                speculative_num_draft_tokens=kwargs.get("speculative_num_draft_tokens", 4),
-                speculative_draft_model_path=kwargs.get("draft_model_dir"),
-                torch_compile_max_bs=max_concurrent_requests,
-                max_running_requests=max_concurrent_requests,
-                attention_backend=kwargs.get("attention_backend"),
-                enable_torch_compile=kwargs.get("enable_torch_compile", False),
-                cuda_graph_max_bs=max_concurrent_requests,
-                disable_cuda_graph=False,
-            )
-        else:
-            self.model = sgl.Engine(
-                model_path=model_dir,
-                skip_tokenizer_init=True,
-                trust_remote_code=kwargs.get("trust_remote_code", False),
-                mem_fraction_static=0.8,
-                disable_overlap_schedule=kwargs.get("disable_overlap_schedule", False),
-                tp_size=kwargs.get("tensor_parallel_size", 1),
-                ep_size=kwargs.get("moe_expert_parallel_size", 1),
-                torch_compile_max_bs=max_concurrent_requests,
-                max_running_requests=max_concurrent_requests,
-                attention_backend=kwargs.get("attention_backend"),
-                enable_torch_compile=kwargs.get("enable_torch_compile", False),
-                cuda_graph_max_bs=max_concurrent_requests,
-                disable_cuda_graph=False,
-            )
+            engine_kwargs["speculative_algorithm"] = speculative_algorithm
+            engine_kwargs["speculative_draft_model_path"] = kwargs.get("draft_model_dir")
+            if speculative_algorithm == "DFLASH":
+                engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 8)
+                if "speculative_dflash_draft_window_size" in kwargs:
+                    engine_kwargs["speculative_dflash_draft_window_size"] = kwargs[
+                        "speculative_dflash_draft_window_size"
+                    ]
+                print(
+                    f"[specdec_bench] DFLASH ignores --draft_length / speculative_num_steps / "
+                    f"speculative_eagle_topk; effective draft block = "
+                    f"speculative_num_draft_tokens={engine_kwargs['speculative_num_draft_tokens']}"
+                )
Comment on lines +73 to +77
[SUGGESTION] Minor UX nit: this print message tells users that DFLASH ignores --draft_length and instructs them to set speculative_num_draft_tokens, but speculative_num_draft_tokens is not a CLI flag in run.py; it is only reachable via engine_args in the --runtime_params YAML. Consider mentioning that explicitly in the print, or adding a CLI flag (one possible shape is sketched below); otherwise users are left guessing how to override the default of 8.
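For illustration, a hedged sketch of what such a flag could look like in run.py's argument parser (this --speculative_num_draft_tokens flag is an assumption, not something this PR adds):

# Hypothetical CLI flag for run.py; not part of this PR.
parser.add_argument(
    "--speculative_num_draft_tokens",
    type=int,
    required=False,
    default=None,  # None = keep per-algorithm defaults (8 for DFLASH, 4 otherwise)
    help="Override the number of speculative draft tokens (the DFLASH draft block size)",
)

The wrapper would then forward the value into engine_kwargs only when it is not None, preserving the current defaults.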

+            else:
+                engine_kwargs["speculative_num_draft_tokens"] = kwargs.get("speculative_num_draft_tokens", 4)
+                engine_kwargs["speculative_num_steps"] = kwargs.get("speculative_num_steps", 3)
+                engine_kwargs["speculative_eagle_topk"] = kwargs.get("speculative_eagle_topk", 1)
+
+        # extra engine arg needed for qwen3.5
+        if "mamba_scheduler_strategy" in kwargs:
+            engine_kwargs["mamba_scheduler_strategy"] = kwargs["mamba_scheduler_strategy"]
+
+        self.model = sgl.Engine(**engine_kwargs)

         self.sampling_config = sampling_kwargs
6 changes: 6 additions & 0 deletions examples/specdec_bench/specdec_bench/models/vllm.py
@@ -63,6 +63,12 @@ def __init__(self, model_dir, max_concurrent_requests, sampling_kwargs, **kwargs
"method": "mtp",
"num_speculative_tokens": kwargs.get("speculative_num_steps", 3),
}
elif kwargs.get("speculative_algorithm") == "DFLASH":
specdec = {
"method": "dflash",
"model": kwargs.get("draft_model_dir"),
"num_speculative_tokens": kwargs.get("speculative_num_draft_tokens", 8),
}
elif kwargs.get("speculative_algorithm") == "NONE":
specdec = None

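For context on how this branch would be consumed downstream, a sketch under assumptions: it presumes the wrapper hands specdec to vLLM's speculative_config (that call is outside this diff), that the installed vLLM accepts a speculative_config dict on LLM(), and that a "dflash" method is registered in the build in use; the paths are placeholders.

from vllm import LLM

# Mirrors the new DFLASH branch above; paths stand in for draft_model_dir / model_dir.
specdec = {
    "method": "dflash",
    "model": "/path/to/dflash-draft-model",
    "num_speculative_tokens": 8,  # DFLASH default from this diff
}

llm = LLM(
    model="/path/to/target-model",
    speculative_config=specdec,
)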