feat(launcher): add DFlash support for DeepSeek-V4-Flash target model #1379

Open
ChenhanYu wants to merge 6 commits into main from chenhany/dflash-deepseek-v4-flash

Conversation

ChenhanYu (Collaborator) commented Apr 30, 2026

Add launcher examples and infrastructure to run DeepSeek-V4-Flash as the DFlash speculative decoding target model in vLLM (container: deepseekv4-cu130).

New files:

  • examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py Runtime patch script (P0–P11) applied to the vLLM container at job startup. Adds the EAGLE3/DFlash aux-hidden-state interface to DeepseekV4ForCausalLM, fixes KV cache group allocation for heterogeneous DFlash draft layers, handles the fp8_ds_mla dtype, and bypasses the fp8_fp4_paged_mqa_logits kernel that overflows H100 shared memory with block_size=256 × MLA 576 bytes/token. Includes a PR contribution strategy note in the module docstring.

  • examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml CW-DFW H100 smoke test: 8xH100, TP=8, fp8 KV cache, block_size=256, gpu_memory_utilization=0.85 (leaves ~4 GB headroom for Triton JIT), 15 speculative tokens, draft model z-lab/DeepSeek-V4-Flash-DFlash.

  • examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/ (vllm + sglang MTP smoke tests) MTP smoke test YAMLs for B200, B300, and CW-DFW H100.

  • common/specdec/sglang_smoke_test.sh SGLang speculative decoding smoke test.

  • common/specdec/read_vllm_files.sh Diagnostic: dump relevant vLLM source lines.

common/specdec/vllm_smoke_test.sh:
Add VLLM_PATCH_SCRIPT hook — if set, run the specified Python script before starting vLLM. Allows per-model patches without modifying the common script.
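For reference, the hook amounts to a few lines near the top of the smoke test (a minimal sketch; the exact logging and error handling in the script may differ):

# Sketch of the VLLM_PATCH_SCRIPT hook; the failure handling shown is illustrative.
if [ -n "${VLLM_PATCH_SCRIPT:-}" ]; then
    echo "Running patch script: ${VLLM_PATCH_SCRIPT}"
    python3 "${VLLM_PATCH_SCRIPT}" || { echo "ERROR: patch script failed"; exit 1; }
fi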

common/vllm/query.sh:
Port B200/data-synthesis fixes: ulimit -u unlimited, DEEPGEMM_TMPDIR redirect, COPY_MODEL_TO_TMPFS for NFS stale-handle avoidance, and FORCE_AF_V2 patch for torch inductor compatibility with DeepSeek V4 fp8 ops.
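The tmpfs copy follows the usual pattern (a sketch; the /dev/shm target and variable names are illustrative, and the real script also reuses an existing copy rather than re-copying):

# Sketch of COPY_MODEL_TO_TMPFS, assuming MODEL points at a local checkpoint dir.
if [ "${COPY_MODEL_TO_TMPFS:-0}" = "1" ]; then
    TMPFS_MODEL="/dev/shm/$(basename "${MODEL}")"
    if [ ! -f "${TMPFS_MODEL}/config.json" ]; then
        echo "Copying model to tmpfs: ${TMPFS_MODEL}"
        cp -r "${MODEL}" "${TMPFS_MODEL}"
    fi
    MODEL="${TMPFS_MODEL}"   # serve from tmpfs instead of NFS
fi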

core.py / slurm_config.py:
Expose retries, requeue, and additional_parameters in SlurmConfig and slurm_factory so long-running jobs can use Slurm's native requeue mechanism.
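At the Slurm level, requeue=True corresponds to the native requeue options (an illustrative sbatch-level sketch, not the launcher's generated submission script; JOB_ID is a placeholder):

# What requeue amounts to in plain Slurm terms.
sbatch --requeue long_job.sh    # job is eligible for automatic requeue after node failure or preemption
scontrol requeue "${JOB_ID}"    # operators can also requeue a job by hand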

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • DeepSeek‑V4 Flash speculative decoding support and a smoke-test pipeline for end-to-end validation
    • Runtime options for vLLM servers: model placement to tmpfs, GPU memory tuning, server timeouts, tokenizer/mode and KV-cache dtype controls
    • Job retry and requeue support for more reliable Slurm launches
  • Chores

    • Installer-time model/launcher patches and helper scripts to improve compatibility and startup stability

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

Adds vLLM runtime and smoke-test enhancements plus DeepSeek‑V4‑Flash DFlash support: new shell utilities and runtime flags, an idempotent Python patcher that edits vLLM and related modules to enable DFlash, new example DFlash model and draft creation scripts/config, and SLURM config fields with executor wiring updates.

Changes

vLLM runtime & smoke-test

  • Runtime flags / env (tools/launcher/common/specdec/vllm_smoke_test.sh, tools/launcher/common/vllm/query.sh): Adds many new env vars (KV cache dtype, block size, TP size, tokenizer mode, GPU mem util, COPY_MODEL_TO_TMPFS, SERVER_TIMEOUT, DATA_PARALLEL_SIZE, etc.) and ulimit/TMPDIR/REDIRECT behavior.
  • Runtime patching & patch hooks (same two scripts): Introduces the FORCE_AF_V2 flow (auto-load .pth + in-place source/regex edits via Python), optional transformers upgrade and model-type registration, vLLM custom-op monkey patching, and VLLM_PATCH_SCRIPT hook invocation.
  • Utility / extraction tool (tools/launcher/common/specdec/read_vllm_files.sh): New script to extract labeled line ranges from specific Torch files with strict error handling and NOT FOUND fallbacks.
  • Operational wiring (same two scripts): Integrates the above flags/hooks into the server startup sequence, extends the readiness timeout, and hardens token parsing using python3 -S.

DeepSeek‑V4‑Flash DFlash example & patching

  • Model draft config / data shape (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json): Adds the DFlashDraftModel config and dflash_config (target layers, block size, dtype, auto_map).
  • Draft model implementation (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py): New DFlash draft model: rotary embedding helper, Qwen3DFlashAttention, decoder layer, and DFlashDraftModel with caching/forward behavior.
  • Draft creation script (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh): New script that produces a randomly initialized HF draft checkpoint (copies config/dflash.py, instantiates AutoModel with trust_remote_code, casts to bfloat16, saves).
  • Idempotent vLLM patcher (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py): Installer-time patcher applying ~12 targeted edits across vLLM modules: registers deepseek_v4 as a DFlash target, injects aux hidden-state collection and outer setters/getters, relaxes kv-cache grouping/validation and page-size assumptions, fixes FP8 dtype mapping, adds a sparse-attention fallback, guards topk prefill None cases, deletes .pyc files, and reports per-patch status.
  • Smoke-test pipeline (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml): Adds a 2-task Slurm pipeline: task_0 creates the draft checkpoint, task_1 runs the vLLM smoke test with SPEC_METHOD=dflash and points to the patch script; configures GPU count, TP, KV cache dtype, and resource settings.

SLURM configuration & executor wiring

  • Config data shape (tools/launcher/slurm_config.py): Adds SlurmConfig fields retries: int = 0, requeue: bool = False, and additional_parameters: dict = None; updates the slurm_factory signature to accept retries and requeue.
  • Executor wiring (tools/launcher/core.py): build_slurm_executor now uses slurm_config.retries for retries and merges {"requeue": True} into the additional Slurm parameters when slurm_config.requeue is truthy.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security Anti-Patterns: ❌ Error. Hardcoded trust_remote_code=True in create_dflash_draft.sh without inline justification, violating the SECURITY.md requirement for safe transformers loading. Resolution: add an inline comment confirming the locally generated config/model are trusted, or expose the flag as a caller-configurable parameter defaulting to False.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed (skipped: CodeRabbit's high-level summary is enabled).
  • Title check: ✅ Passed. The PR title accurately summarizes the main objective: adding DFlash support for DeepSeek-V4-Flash as a target model in the launcher infrastructure.
  • Linked Issues check: ✅ Passed (skipped: no linked issues were found for this pull request).
  • Out of Scope Changes check: ✅ Passed (skipped: no linked issues were found for this pull request).

coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py (2)

82-93: 💤 Low value

Consider adding error handling for write failures.

The _patch_file function doesn't handle potential exceptions from path.write_text() (e.g., permission denied, disk full). While the container environment is likely controlled, unhandled write failures could silently corrupt state.

💡 Proposed defensive error handling
 def _patch_file(path: pathlib.Path, old: str, new: str, sentinel: str, label: str) -> bool:
     src = path.read_text()
     if sentinel in src:
         print(f"{label}: already patched")
         return True
     if old not in src:
         print(f"WARNING: {label}: pattern not found — skipping")
         return False
-    path.write_text(src.replace(old, new, 1))
-    _delete_pyc(path.stem)
-    print(f"{label}: OK")
-    return True
+    try:
+        path.write_text(src.replace(old, new, 1))
+        _delete_pyc(path.stem)
+        print(f"{label}: OK")
+        return True
+    except OSError as e:
+        print(f"ERROR: {label}: write failed — {e}")
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 82 - 93, The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.

408-414: 💤 Low value

Consider using existing torch import instead of inline import.

The injected code uses import torch as _t inline. Since flash_attn.py already imports torch, this redundant import could be avoided by using the existing module-level torch.

💡 Simplified patch without redundant import
     _fa_new = re.sub(
         r"([ \t]*)raise ValueError\(f\"Unrecognized FP8 dtype: \{kv_cache_dtype\}\"\)",
         lambda m: (
             m.group(1) + "# _dflash_fp8_ds_mla_patch: fp8_ds_mla is e4m3fn stored by the compressor\n"
             + m.group(1) + "if kv_cache_dtype == \"fp8_ds_mla\":\n"
-            + m.group(1) + "    import torch as _t; return _t.float8_e4m3fn\n"
+            + m.group(1) + "    return torch.float8_e4m3fn\n"
             + m.group(1) + "raise ValueError(f\"Unrecognized FP8 dtype: {kv_cache_dtype}\")"
         ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 408 - 414, The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.
tools/launcher/core.py (1)

275-276: 💤 Low value

Long line reduces readability; consider splitting.

Line 276 combines dict unpacking, conditional expression, and getattr in a single expression. While functional, this is hard to parse at a glance.

💡 Proposed refactor for readability
         retries=slurm_config.retries,
-        additional_parameters={**(slurm_config.additional_parameters or {}), **({"requeue": True} if getattr(slurm_config, "requeue", False) else {})},
+        additional_parameters={
+            **(slurm_config.additional_parameters or {}),
+            **({"requeue": True} if getattr(slurm_config, "requeue", False) else {}),
+        },
         packager=packager,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/core.py` around lines 275 - 276, The one-line dict merge for
additional_parameters is hard to read; extract the merge into a small, clear
pre-computation: create a local variable (e.g., merged_additional_parameters) by
first copying slurm_config.additional_parameters or {} and then, if
getattr(slurm_config, "requeue", False) is True, set
merged_additional_parameters["requeue"] = True; finally pass
merged_additional_parameters to the function call instead of the long one-liner.
This keeps the logic around slurm_config, requeue, and additional_parameters
explicit and easy to read.
tools/launcher/common/specdec/vllm_smoke_test.sh (1)

549-560: 💤 Low value

Consider using arrays for optional arguments to avoid word-splitting fragility.

The unquoted ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} rely on word splitting, which works for simple flags but can break if values contain spaces or special characters. While currently functional, using arrays throughout would be more robust.

♻️ Example: array-based approach
-# Build optional args
-OPTIONAL_ARGS=""
+# Build optional args as array
+OPTIONAL_ARGS=()
 if [ -n "${REASONING_PARSER:-}" ]; then
-    OPTIONAL_ARGS="${OPTIONAL_ARGS} --reasoning-parser ${REASONING_PARSER}"
+    OPTIONAL_ARGS+=(--reasoning-parser "${REASONING_PARSER}")
 fi
 # ... similar for other flags ...

-VLLM_CMD=(vllm serve "${MODEL}"
-    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
-    ${PARALLELISM_ARGS}
-    --port "${PORT}"
-    ${OPTIONAL_ARGS})
+VLLM_CMD=(vllm serve "${MODEL}"
+    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
+    "${PARALLELISM_ARGS[@]}"
+    --port "${PORT}"
+    "${OPTIONAL_ARGS[@]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/vllm_smoke_test.sh` around lines 549 - 560, The
VLLM command currently injects ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} via
unquoted string expansion which is fragile; change those to arrays and append
them into VLLM_CMD as array elements (e.g., ensure PARALLELISM_ARGS and
OPTIONAL_ARGS are declared as arrays and replace unquoted expansions with
VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=( "${OPTIONAL_ARGS[@]}" )),
keep the conditional appends for --speculative-config and --compilation-config
as array additions (VLLM_CMD+=(--speculative-config "$SPEC_CONFIG") etc.), and
finally invoke "${VLLM_CMD[@]}" to preserve proper tokenization.
tools/launcher/common/vllm/query.sh (1)

103-316: 💤 Low value

Consider extracting shared FORCE_AF_V2 patching logic.

This FORCE_AF_V2 patching block (lines 139-316) is nearly identical to the implementation in vllm_smoke_test.sh (lines 85-316). While duplicating the logic in each script ensures independence and avoids cross-script dependencies, consider extracting to a shared Python script if maintenance burden increases.

The exec(PATCH_CODE) on line 235 is safe since PATCH_CODE is a hardcoded string constant, not external input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/vllm/query.sh` around lines 103 - 316, The FORCE_AF_V2
patching block is duplicated; extract the shared logic into a single reusable
script/module (e.g., a Python file that exposes and installs the patch) and have
both query.sh and vllm_smoke_test.sh load and run it instead of embedding
PATCH_CODE inline. Move the large PATCH_CODE string and the install/exec logic
(symbols: PATCH_CODE, vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and
the FORCE_AF_V2 conditional) into the shared script, update each shell script to
check FORCE_AF_V2 and call that shared script (or import/execute it) so the
behavior remains identical while centralizing maintenance. Ensure the new shared
file is installed into a location accessible to the runtime and keep the
existing runtime prints and exception handling unchanged.
tools/launcher/common/specdec/sglang_smoke_test.sh (1)

65-67: ⚡ Quick win

Avoid unconditional pre-release transformers upgrades in smoke runs.

Line [65]-Line [67] mutates runtime dependencies on every execution (--pre), which makes results non-reproducible and network-sensitive. Gate this behind an opt-in env var and require a tested spec/range.

Suggested fix
-echo "Upgrading transformers (--pre for deepseek_v4 support)..."
-pip install --upgrade --pre transformers -q || echo "WARNING: transformers upgrade failed, continuing"
+if [ "${UPGRADE_TRANSFORMERS:-0}" = "1" ]; then
+    : "${TRANSFORMERS_SPEC:?Set TRANSFORMERS_SPEC to a tested version/range}"
+    echo "Upgrading transformers (${TRANSFORMERS_SPEC})..."
+    pip install --upgrade "${TRANSFORMERS_SPEC}" -q || {
+        echo "ERROR: transformers upgrade failed"
+        exit 1
+    }
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 65 - 67, The
smoke test unconditionally runs pip install --upgrade --pre transformers which
mutates runtime deps each run; change the logic around the echo/"pip install
--upgrade --pre transformers -q" invocation (the upgrade step labeled "Upgrading
transformers (--pre for deepseek_v4 support)...") to only run when an opt-in
environment variable (e.g., UPGRADE_TRANSFORMERS or USE_PRETRANSFORMERS) is set,
and require a constrained, tested spec (e.g., a VERSION or SPEC env var or
documented version range) instead of always using --pre; update the message to
reflect the gated behavior and skip the network upgrade by default to keep smoke
runs reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 35-42: The cleanup() function should guard use of SERVER_PID and
only remove tmpfs copies this script created: check SERVER_PID exists/is non-empty
using parameter expansion (e.g., test -n "${SERVER_PID-}" or similar) before
running kill/kill -9 to avoid failures under set -u, and introduce a
TMPFS_MODEL_OWNED (or TMPFS_MODEL_CREATED) flag when you create or copy the
tmpfs model in the code that discovers TMPFS_MODEL (the logic around the TMPFS
discovery at the block that sets TMPFS_MODEL near where you check Lines 49-51);
then in cleanup() only rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so
shared tmpfs paths are not deleted by concurrent jobs.
- Around line 90-104: The loop that tries to write _deepseek_v4_patch.py and
deepseek_v4.pth may fail silently and only exec(STUB) patches the current
interpreter, so the spawned "python -m sglang.launch_server" process won't get
the patch; add a boolean flag (e.g., wrote_pth = False) before iterating
site.getsitepackages() + [site.getusersitepackages()], set it to True when
writing deepseek_v4.pth succeeds, and after the loop check the flag and call
sys.exit(1) (and print an error) if no writable site-packages were found so the
script fails fast instead of continuing to exec(STUB) and spawning the server
without the .pth patch.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml`:
- Line 24: Remove the unrecognized environment variable
SGLANG_APPLY_CONFIG_BACKUP from the YAML or, if it is intended to be supported,
add handling and documentation: update sglang_smoke_test.sh to read and act on
SGLANG_APPLY_CONFIG_BACKUP (e.g., apply a backup config path or "none"
semantics) and add a comment and README entry describing the variable and valid
values; ensure the YAML entry "- SGLANG_APPLY_CONFIG_BACKUP: \"none\"" is either
deleted or retained only after the script and docs are updated to reference
SGLANG_APPLY_CONFIG_BACKUP.

---

Nitpick comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 65-67: The smoke test unconditionally runs pip install --upgrade
--pre transformers which mutates runtime deps each run; change the logic around
the echo/"pip install --upgrade --pre transformers -q" invocation (the upgrade
step labeled "Upgrading transformers (--pre for deepseek_v4 support)...") to
only run when an opt-in environment variable (e.g., UPGRADE_TRANSFORMERS or
USE_PRETRANSFORMERS) is set, and require a constrained, tested spec (e.g., a
VERSION or SPEC env var or documented version range) instead of always using
--pre; update the message to reflect the gated behavior and skip the network
upgrade by default to keep smoke runs reproducible.

In `@tools/launcher/common/specdec/vllm_smoke_test.sh`:
- Around line 549-560: The VLLM command currently injects ${PARALLELISM_ARGS}
and ${OPTIONAL_ARGS} via unquoted string expansion which is fragile; change
those to arrays and append them into VLLM_CMD as array elements (e.g., ensure
PARALLELISM_ARGS and OPTIONAL_ARGS are declared as arrays and replace unquoted
expansions with VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=(
"${OPTIONAL_ARGS[@]}" )), keep the conditional appends for --speculative-config
and --compilation-config as array additions (VLLM_CMD+=(--speculative-config
"$SPEC_CONFIG") etc.), and finally invoke "${VLLM_CMD[@]}" to preserve proper
tokenization.

In `@tools/launcher/common/vllm/query.sh`:
- Around line 103-316: The FORCE_AF_V2 patching block is duplicated; extract the
shared logic into a single reusable script/module (e.g., a Python file that
exposes and installs the patch) and have both query.sh and vllm_smoke_test.sh
load and run it instead of embedding PATCH_CODE inline. Move the large
PATCH_CODE string and the install/exec logic (symbols: PATCH_CODE,
vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and the FORCE_AF_V2
conditional) into the shared script, update each shell script to check
FORCE_AF_V2 and call that shared script (or import/execute it) so the behavior
remains identical while centralizing maintenance. Ensure the new shared file is
installed into a location accessible to the runtime and keep the existing
runtime prints and exception handling unchanged.

In `@tools/launcher/core.py`:
- Around line 275-276: The one-line dict merge for additional_parameters is hard
to read; extract the merge into a small, clear pre-computation: create a local
variable (e.g., merged_additional_parameters) by first copying
slurm_config.additional_parameters or {} and then, if getattr(slurm_config,
"requeue", False) is True, set merged_additional_parameters["requeue"] = True;
finally pass merged_additional_parameters to the function call instead of the
long one-liner. This keeps the logic around slurm_config, requeue, and
additional_parameters explicit and easy to read.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`:
- Around line 82-93: The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.
- Around line 408-414: The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e768f4bf-2470-478f-a408-7f7a00c11644

📥 Commits

Reviewing files that changed from the base of the PR and between bb08094 and 0b2f2b4.

📒 Files selected for processing (17)
  • tools/launcher/common/specdec/read_vllm_files.sh
  • tools/launcher/common/specdec/sglang_smoke_test.sh
  • tools/launcher/common/specdec/vllm_smoke_test.sh
  • tools/launcher/common/vllm/query.sh
  • tools/launcher/core.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3_data_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
  • tools/launcher/slurm_config.py

Comment on lines +35 to +42
cleanup() {
    kill "$SERVER_PID" 2>/dev/null || true
    sleep 2
    kill -9 "$SERVER_PID" 2>/dev/null || true
    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
        echo "Removing tmpfs model copy: $TMPFS_MODEL"
        rm -rf "$TMPFS_MODEL"
    fi

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cleanup for unset PID and avoid deleting shared tmpfs copies.

With set -u, Line [36] can fail when the script exits before Line [141] assigns SERVER_PID. Also, tmpfs copies discovered at Line [49]-Line [51] are reused but still removed at exit, which can break concurrent jobs using the same path.

Suggested fix
 TMPFS_MODEL=""
+SERVER_PID=""
+TMPFS_MODEL_OWNED=0
 cleanup() {
-    kill "$SERVER_PID" 2>/dev/null || true
-    sleep 2
-    kill -9 "$SERVER_PID" 2>/dev/null || true
-    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
+    if [ -n "${SERVER_PID:-}" ] && kill -0 "${SERVER_PID}" 2>/dev/null; then
+        kill "${SERVER_PID}" 2>/dev/null || true
+        sleep 2
+        kill -9 "${SERVER_PID}" 2>/dev/null || true
+    fi
+    if [ "${TMPFS_MODEL_OWNED:-0}" = "1" ] && [ -n "${TMPFS_MODEL:-}" ] && [ -d "${TMPFS_MODEL}" ]; then
         echo "Removing tmpfs model copy: $TMPFS_MODEL"
         rm -rf "$TMPFS_MODEL"
     fi
 }
@@
     if [ -d "$TMPFS_MODEL" ] && [ -f "$TMPFS_MODEL/config.json" ]; then
         echo "Using existing tmpfs model copy: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=0
     else
@@
         cp -r "$MODEL" "$TMPFS_MODEL"
         echo "Model copy done: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=1
     fi

Also applies to: 46-57, 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 35 - 42, The
cleanup() function should guard use of SERVER_PID and only remove tmpfs copies
this script created: check SERVER_PID exists/is non-empty using parameter
expansion (e.g., test -n "${SERVER_PID-}" or similar) before running kill/kill -9
to avoid failures under set -u, and introduce a TMPFS_MODEL_OWNED (or
TMPFS_MODEL_CREATED) flag when you create or copy the tmpfs model in the code
that discovers TMPFS_MODEL (the logic around the TMPFS discovery at the block
that sets TMPFS_MODEL near where you check Lines 49-51); then in cleanup() only
rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so shared tmpfs paths are
not deleted by concurrent jobs.

Comment on lines +90 to +104
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    if not os.path.isdir(sp):
        continue
    try:
        with open(os.path.join(sp, '_deepseek_v4_patch.py'), 'w') as f:
            f.write(STUB)
        with open(os.path.join(sp, 'deepseek_v4.pth'), 'w') as f:
            f.write('import _deepseek_v4_patch\n')
        print(f"[patch] Wrote deepseek_v4.pth to {sp}")
        break
    except Exception as e:
        print(f"[patch] Could not write to {sp}: {e}")

exec(STUB)
PYEOF

⚠️ Potential issue | 🟠 Major


Fail fast when the .pth patch cannot be installed for spawned processes.

If all writable site-packages attempts fail (lines 90–102), the script continues silently. exec(STUB) at line 103 only patches the current interpreter; the spawned python -m sglang.launch_server process (line 131) depends on the .pth file for automatic patch application at startup. Without the file, the server starts without deepseek_v4 registered, causing opaque failures downstream.

Track whether .pth was successfully written and exit with an error if all site-packages locations are non-writable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 90 - 104,
The loop that tries to write _deepseek_v4_patch.py and deepseek_v4.pth may fail
silently and only exec(STUB) patches the current interpreter, so the spawned
"python -m sglang.launch_server" process won't get the patch; add a boolean flag
(e.g., wrote_pth = False) before iterating site.getsitepackages() +
[site.getusersitepackages()], set it to True when writing deepseek_v4.pth
succeeds, and after the loop check the flag and call sys.exit(1) (and print an
error) if no writable site-packages were found so the script fails fast instead
of continuing to exec(STUB) and spawning the server without the .pth patch.

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.13%. Comparing base (0678136) to head (1b123da).
⚠️ Report is 64 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1379      +/-   ##
==========================================
- Coverage   74.46%   69.13%   -5.34%     
==========================================
  Files         464      491      +27     
  Lines       50089    58234    +8145     
==========================================
+ Hits        37300    40261    +2961     
- Misses      12789    17973    +5184     
  • unit: 52.55% <ø> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

ChenhanYu and others added 5 commits April 30, 2026 11:19
SGLang MTP smoke tests have not been run and verified. Remove them until
they are tested end-to-end on a cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
These files belong to the draft model folder and were not created or
verified in this PR. Remove until they have dedicated test coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
The z-lab/DeepSeek-V4-Flash-DFlash checkpoint is a 4-layer Qwen3-based
scaffold created from z-lab's DFlash architecture reference with random
weights. The smoke test verifies the inference pipeline starts and
generates tokens, not acceptance rate or generation quality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
Adds a 2-task pipeline variant of vllm_dflash_smoke_test_cw_dfw.yaml:
- task_0 creates a randomly-initialized DFlash draft checkpoint from
  the bundled config.json + dflash.py (4-layer Qwen3 scaffold based on
  z-lab's DFlash architecture), saving it to /scratchspace/dflash_draft
- task_1 runs the existing vllm_smoke_test.sh using that checkpoint

Bundles dflash_draft/config.json and dflash_draft/dflash.py so the test
is fully self-contained — no pre-trained draft checkpoint required.
The random-weight draft validates the DFlash inference stack (patching,
KV cache allocation, speculative decoding loop) without needing a
trained checkpoint.

Signed-off-by: chenhany <chenhany@nvidia.com>
DeepSeek-V4-Flash has compress_ratios={0:3, 4:21, 128:20} so vLLM's
is_deepseek_v4 flag fires, but the DFlash speculative decoding path
does not populate c128a_prefill_topk_indices during prefill metadata
build. This caused an AttributeError on the first inference request.

P12 in patch_vllm_dflash.py patches _forward_prefill to detect the
None case and fall back to SWA-only mode (topk_indices_buffer, top_k=0,
N=0) rather than crashing. This is sufficient for the smoke test which
validates the inference stack, not acceptance rate or quality.

Also: add draft_model to GlobalVariables in core.py, and tune smoke
test env vars (GPU_MEM_UTIL=0.80, MAX_BATCHED_TOKENS=16384) to match
values verified on CW-DFW H100 8-GPU node.

Smoke test result: 8 passed, 0 failed (cicd_1777582482 on CW-DFW).

Signed-off-by: Han-Yun Chen <chenhany@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh`:
- Around line 25-33: The script currently calls AutoConfig.from_pretrained and
AutoModel.from_config with trust_remote_code=True which triggers remote-code
execution via the auto_map in the config; instead, import and instantiate the
local DFlash model class directly (the class defined in dflash.py that was
copied earlier) rather than using AutoConfig/AutoModel, or add a
caller-controlled flag to disable trust_remote_code and default it to false;
specifically replace the AutoConfig/AutoModel usage with a direct import of the
local DFlash class (referencing dflash.py and the class name used there) and
call its constructor with the config loaded without invoking trust_remote_code,
or expose a CLI/ENV variable to gate the trust_remote_code boolean when using
AutoModel.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py`:
- Around line 4-15: The file-level header lacks explicit third-party provenance
for the z-lab-derived scaffold; add a clear provenance block at the top of
dflash.py that states the upstream repository URL and exact commit hash used,
the original project's copyright and license (including SPDX identifier), and a
short excerpt or link to the original license text, while keeping the existing
NVIDIA Apache-2.0 header; ensure the block references "z-lab DFlash reference
implementation" and the original source commit so the file is license-compliant
and traceable.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`:
- Around line 564-576: When topk_indices is None (the c128a_prefill_topk_indices
fallback) don't silently switch to SWA-only; instead require an explicit opt-in
via an environment flag (e.g., something like DEEPSEEK_ALLOW_SWA_FALLBACK)
before setting swa_only=True and using self.topk_indices_buffer; if the env flag
is absent raise or log a one-time error/warning and abort the run. Update the
branch that currently handles topk_indices is None (references: topk_indices,
topk_indices_buffer, swa_only, num_decode_tokens) to check the env flag first,
emit the one-time message (or raise), and only then assign topk_indices =
self.topk_indices_buffer[num_decode_tokens:], set swa_only and top_k/N
accordingly. Ensure the message clearly names c128a_prefill_topk_indices and the
env flag so users can opt into smoke-test behavior.
- Around line 71-75: The hardcoded VLLM path should be derived from the
installed vllm package instead of a fixed string; replace the literal VLLM =
pathlib.Path("/usr/local/.../vllm") with logic that imports the vllm module and
sets VLLM = pathlib.Path(vllm.__file__).parent (or the appropriate package
parent) and handle ImportError to surface a clear message if vllm is not
installed; update any references to VLLM accordingly (look for the VLLM symbol
in this file).

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml`:
- Around line 18-29: The tuning notes for gpu_memory_utilization and
max_num_batched_tokens are out of sync with task_1 settings: update either the
explanatory comments (gpu_memory_utilization=0.85, max_num_batched_tokens=32768)
to reflect the actual task variables GPU_MEM_UTIL=0.80 and
MAX_BATCHED_TOKENS=16384, or change task_1's GPU_MEM_UTIL and MAX_BATCHED_TOKENS
to match the documented values; ensure consistency for the same settings
referenced elsewhere (also update the repeated notes around the block mentioned
at lines 69-71) so gpu_memory_utilization / GPU_MEM_UTIL and
max_num_batched_tokens / MAX_BATCHED_TOKENS are identical across comments and
task_1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f3104469-0f5f-4db2-81fa-d14bec5bb486

📥 Commits

Reviewing files that changed from the base of the PR and between ffe00b6 and 1b123da.

📒 Files selected for processing (6)
  • tools/launcher/core.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
✅ Files skipped from review due to trivial changes (1)
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json

Comment on lines +25 to +33
python3 - <<'EOF'
import os, sys, torch
sys.path.insert(0, os.environ.get("OUT", "/scratchspace/dflash_draft"))
from transformers import AutoConfig, AutoModel

out = os.environ.get("OUT", "/scratchspace/dflash_draft")
print(f"Initializing random DFlash draft model from config: {out}/config.json")
config = AutoConfig.from_pretrained(out, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True).to(torch.bfloat16)

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win


Avoid executing checkpoint code just to initialize random weights.

This always enables remote-code execution via the auto_map field in the config. Since dflash.py is already copied locally in the same script, instantiate the class directly from that file instead of going through AutoConfig/AutoModel(..., trust_remote_code=True), or at minimum make the flag caller-controlled and default it off.

Possible fix
 import os, sys, torch
+import importlib.util
+from pathlib import Path
 sys.path.insert(0, os.environ.get("OUT", "/scratchspace/dflash_draft"))
-from transformers import AutoConfig, AutoModel
+from transformers import Qwen3Config
 
 out = os.environ.get("OUT", "/scratchspace/dflash_draft")
 print(f"Initializing random DFlash draft model from config: {out}/config.json")
-config = AutoConfig.from_pretrained(out, trust_remote_code=True)
-model = AutoModel.from_config(config, trust_remote_code=True).to(torch.bfloat16)
+spec = importlib.util.spec_from_file_location("dflash", Path(out) / "dflash.py")
+module = importlib.util.module_from_spec(spec)
+assert spec.loader is not None
+spec.loader.exec_module(module)
+config = Qwen3Config.from_pretrained(out)
+model = module.DFlashDraftModel(config).to(torch.bfloat16)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh`
around lines 25 - 33, The script currently calls AutoConfig.from_pretrained and
AutoModel.from_config with trust_remote_code=True which triggers remote-code
execution via the auto_map in the config; instead, import and instantiate the
local DFlash model class directly (the class defined in dflash.py that was
copied earlier) rather than using AutoConfig/AutoModel, or add a
caller-controlled flag to disable trust_remote_code and default it to false;
specifically replace the AutoConfig/AutoModel usage with a direct import of the
local DFlash class (referencing dflash.py and the class name used there) and
call its constructor with the config loaded without invoking trust_remote_code,
or expose a CLI/ENV variable to gate the trust_remote_code boolean when using
AutoModel.

Comment on lines +4 to +15
# DFlash draft model architecture for DeepSeek-V4-Flash.
#
# This is a 4-layer Qwen3-based scaffold used as the DFlash speculative decoding
# draft model. The architecture is adapted from z-lab's DFlash reference implementation.
#
# Each decoder layer performs cross-attention over the target model's hidden states
# (aux outputs from selected target layers) alongside the draft model's own hidden states.
# The `fc` + `hidden_norm` project the concatenated target hidden states down to hidden_size
# before cross-attention.
#
# This file is bundled with the smoke test config so that a randomly-initialized draft
# checkpoint can be created at test time without requiring a pre-trained checkpoint.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit third-party provenance for the z-lab-derived scaffold.

The header says this file is adapted from z-lab's DFlash reference implementation, but the file only carries the NVIDIA Apache header. If any implementation here was copied or closely adapted, please add the upstream repo + commit and the original license/SPDX details in this file so the example stays license-compliant.

As per coding guidelines, **/*.{py,cpp,hpp,h,cc,cxx,c,ts,tsx,js,jsx}: When copying code from third-party sources, include a reference link with commit hash to the source, the original repository's Copyright/License, and the NVIDIA Apache 2.0 Copyright/License header in the file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py`
around lines 4 - 15, The file-level header lacks explicit third-party provenance
for the z-lab-derived scaffold; add a clear provenance block at the top of
dflash.py that states the upstream repository URL and exact commit hash used,
the original project's copyright and license (including SPDX identifier), and a
short excerpt or link to the original license text, while keeping the existing
NVIDIA Apache-2.0 header; ensure the block references "z-lab DFlash reference
implementation" and the original source commit so the file is license-compliant
and traceable.

Comment on lines +71 to +75
import pathlib
import re
import sys

VLLM = pathlib.Path("/usr/local/lib/python3.12/dist-packages/vllm")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win


Resolve the vLLM package root dynamically.

Hardcoding /usr/local/lib/python3.12/dist-packages/vllm ties the script to a single Python version and installation layout. On the next base-image update or non-standard vLLM installation, this path will fail silently or not be found. Derive the path from the imported module instead.

Suggested fix
 import pathlib
 import re
 import sys
+import vllm
 
-VLLM = pathlib.Path("/usr/local/lib/python3.12/dist-packages/vllm")
+VLLM = pathlib.Path(vllm.__file__).resolve().parent
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 71 - 75, The hardcoded VLLM path should be derived from the
installed vllm package instead of a fixed string; replace the literal VLLM =
pathlib.Path("/usr/local/.../vllm") with logic that imports the vllm module and
sets VLLM = pathlib.Path(vllm.__file__).parent (or the appropriate package
parent) and handle ImportError to surface a clear message if vllm is not
installed; update any references to VLLM accordingly (look for the VLLM symbol
in this file).

Comment on lines +564 to +576
" if topk_indices is None: # _p12_c128a_prefill_fallback_patch\n"
" # c128a_prefill_topk_indices unavailable; degrade to SWA-only.\n"
" assert self.topk_indices_buffer is not None\n"
" topk_indices = self.topk_indices_buffer[num_decode_tokens:]\n"
" swa_only = True\n"
" top_k = 0\n"
" N = 0\n"
" else:\n"
" top_k = topk_indices.shape[-1]\n"
" # Compressed region must fit the full compressed pool (seq_len //\n"
" # compress_ratio), not just top_k. top_k bounds how many indices\n"
" # the indexer selects, not the pool size it indexes into.\n"
" N = (self.max_model_len + self.compress_ratio - 1) // self.compress_ratio"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the SWA-only fallback behind an explicit smoke-test opt-in.

When c128a_prefill_topk_indices is missing, this silently changes attention semantics and still lets the run continue. That hides real metadata bugs in any non-smoke-test usage. Please require an explicit env flag and emit a one-time warning or error otherwise.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 564 - 576, When topk_indices is None (the
c128a_prefill_topk_indices fallback) don't silently switch to SWA-only; instead
require an explicit opt-in via an environment flag (e.g., something like
DEEPSEEK_ALLOW_SWA_FALLBACK) before setting swa_only=True and using
self.topk_indices_buffer; if the env flag is absent raise or log a one-time
error/warning and abort the run. Update the branch that currently handles
topk_indices is None (references: topk_indices, topk_indices_buffer, swa_only,
num_decode_tokens) to check the env flag first, emit the one-time message (or
raise), and only then assign topk_indices =
self.topk_indices_buffer[num_decode_tokens:], set swa_only and top_k/N
accordingly. Ensure the message clearly names c128a_prefill_topk_indices and the
env flag so users can opt into smoke-test behavior.

Comment on lines +18 to +29
# Key config choices:
# block_size=256 — required: SWA layers use window_size=256 so the page constraint
# max(sm_page_sizes) ≤ max(all_page_sizes) forces block_size ≥ 256
# gpu_memory_utilization=0.85 — leaves ~4 GB headroom per GPU for Triton JIT compilation
# of the DeepSeek compressor kernel on first inference
# enforce_eager — disables CUDA graph capture; avoids NoneType tensor errors
# in deepseek_v4_attention CUDA graph execution with DFlash patches
# max_num_batched_tokens=32768 — vLLM's default; required so spec-decoding token budget
# (max_num_batched_tokens minus per-request draft overhead)
# stays positive. Setting this too low (e.g. 4096) causes
# "max_num_scheduled_tokens is set to negative value" error.
#

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the tuning notes in sync with the actual task settings.

The comments describe gpu_memory_utilization=0.85 and max_num_batched_tokens=32768, but task_1 sets GPU_MEM_UTIL=0.80 and MAX_BATCHED_TOKENS=16384. Please update either the notes or the values so operators debug the right configuration.

Also applies to: 69-71

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml`
around lines 18 - 29, The tuning notes for gpu_memory_utilization and
max_num_batched_tokens are out of sync with task_1 settings: update either the
explanatory comments (gpu_memory_utilization=0.85, max_num_batched_tokens=32768)
to reflect the actual task variables GPU_MEM_UTIL=0.80 and
MAX_BATCHED_TOKENS=16384, or change task_1's GPU_MEM_UTIL and MAX_BATCHED_TOKENS
to match the documented values; ensure consistency for the same settings
referenced elsewhere (also update the repeated notes around the block mentioned
at lines 69-71) so gpu_memory_utilization / GPU_MEM_UTIL and
max_num_batched_tokens / MAX_BATCHED_TOKENS are identical across comments and
task_1.
