feat(launcher): add DFlash support for DeepSeek-V4-Flash target model #1379

Open
ChenhanYu wants to merge 6 commits into main from chenhany/dflash-deepseek-v4-flash

Conversation

ChenhanYu (Collaborator) commented Apr 30, 2026

Add launcher examples and infrastructure to run DeepSeek-V4-Flash as the DFlash speculative decoding target model in vLLM (container: deepseekv4-cu130).

New files:

  • examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py Runtime patch script (P0–P11) applied to the vLLM container at job startup. Adds the EAGLE3/DFlash aux-hidden-state interface to DeepseekV4ForCausalLM, fixes KV cache group allocation for heterogeneous DFlash draft layers, handles the fp8_ds_mla dtype, and bypasses the fp8_fp4_paged_mqa_logits kernel that overflows H100 shared memory with block_size=256 × MLA 576 bytes/token. Includes a PR contribution strategy note in the module docstring.

  • examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml CW-DFW H100 smoke test: 8xH100, TP=8, fp8 KV cache, block_size=256, gpu_memory_utilization=0.85 (leaves ~4 GB headroom for Triton JIT), 15 speculative tokens, draft model z-lab/DeepSeek-V4-Flash-DFlash.

  • examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/ (vllm + sglang MTP smoke tests) MTP smoke test YAMLs for B200, B300, and CW-DFW H100.

  • common/specdec/sglang_smoke_test.sh SGLang speculative decoding smoke test.

  • common/specdec/read_vllm_files.sh Diagnostic: dump relevant vLLM source lines.

common/specdec/vllm_smoke_test.sh:
Add VLLM_PATCH_SCRIPT hook — if set, run the specified Python script before starting vLLM. Allows per-model patches without modifying the common script.
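For reference, the hook amounts to a few lines near the top of the smoke test (a minimal sketch; the exact logging and error handling in the script may differ):

# Sketch of the VLLM_PATCH_SCRIPT hook; the failure handling shown is illustrative.
if [ -n "${VLLM_PATCH_SCRIPT:-}" ]; then
    echo "Running patch script: ${VLLM_PATCH_SCRIPT}"
    python3 "${VLLM_PATCH_SCRIPT}" || { echo "ERROR: patch script failed"; exit 1; }
fi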

common/vllm/query.sh:
Port B200/data-synthesis fixes: ulimit -u unlimited, DEEPGEMM_TMPDIR redirect, COPY_MODEL_TO_TMPFS for NFS stale-handle avoidance, and FORCE_AF_V2 patch for torch inductor compatibility with DeepSeek V4 fp8 ops.
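The tmpfs copy follows the usual pattern (a sketch; the /dev/shm target and variable names are illustrative, and the real script also reuses an existing copy rather than re-copying):

# Sketch of COPY_MODEL_TO_TMPFS, assuming MODEL points at a local checkpoint dir.
if [ "${COPY_MODEL_TO_TMPFS:-0}" = "1" ]; then
    TMPFS_MODEL="/dev/shm/$(basename "${MODEL}")"
    if [ ! -f "${TMPFS_MODEL}/config.json" ]; then
        echo "Copying model to tmpfs: ${TMPFS_MODEL}"
        cp -r "${MODEL}" "${TMPFS_MODEL}"
    fi
    MODEL="${TMPFS_MODEL}"   # serve from tmpfs instead of NFS
fi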

core.py / slurm_config.py:
Expose retries, requeue, and additional_parameters in SlurmConfig and slurm_factory so long-running jobs can use Slurm's native requeue mechanism.
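At the Slurm level, requeue=True corresponds to the native requeue options (an illustrative sbatch-level sketch, not the launcher's generated submission script; JOB_ID is a placeholder):

# What requeue amounts to in plain Slurm terms.
sbatch --requeue long_job.sh    # job is eligible for automatic requeue after node failure or preemption
scontrol requeue "${JOB_ID}"    # operators can also requeue a job by hand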

What does this PR do?

Type of change: ?

Usage

# Add a code snippet demonstrating how to use this

Testing

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ / ❌ / N/A
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ / ❌ / N/A
  • Did you write any new necessary tests?: ✅ / ❌ / N/A
  • Did you update Changelog?: ✅ / ❌ / N/A

Additional Information

Summary by CodeRabbit

  • New Features

    • DeepSeek‑V4 Flash speculative decoding support and a smoke-test pipeline for end-to-end validation
    • Runtime options for vLLM servers: model placement to tmpfs, GPU memory tuning, server timeouts, tokenizer/mode and KV-cache dtype controls
    • Job retry and requeue support for more reliable Slurm launches
  • Chores

    • Installer-time model/launcher patches and helper scripts to improve compatibility and startup stability

copy-pr-bot Bot commented Apr 30, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai Bot commented Apr 30, 2026

📝 Walkthrough

Walkthrough

Adds vLLM runtime and smoke-test enhancements plus DeepSeek‑V4‑Flash DFlash support: new shell utilities and runtime flags, an idempotent Python patcher that edits vLLM and related modules to enable DFlash, new example DFlash model and draft creation scripts/config, and SLURM config fields with executor wiring updates.

Changes

vLLM runtime & smoke-test

  • Runtime flags / env (tools/launcher/common/specdec/vllm_smoke_test.sh, tools/launcher/common/vllm/query.sh): Adds many new env vars (KV cache dtype, block size, TP size, tokenizer mode, GPU mem util, COPY_MODEL_TO_TMPFS, SERVER_TIMEOUT, DATA_PARALLEL_SIZE, etc.) and ulimit/TMPDIR/REDIRECT behavior.
  • Runtime patching & patch hooks (same two scripts): Introduces the FORCE_AF_V2 flow (auto-load .pth + in-place source/regex edits via Python), optional transformers upgrade and model-type registration, vLLM custom-op monkey patching, and VLLM_PATCH_SCRIPT hook invocation.
  • Utility / extraction tool (tools/launcher/common/specdec/read_vllm_files.sh): New script to extract labeled line ranges from specific Torch files with strict error handling and NOT FOUND fallbacks.
  • Operational wiring (same two scripts): Integrates the above flags/hooks into the server startup sequence, extends the readiness timeout, and hardens token parsing using python3 -S.

DeepSeek‑V4‑Flash DFlash example & patching

  • Model draft config / data shape (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json): Adds the DFlashDraftModel config and dflash_config (target layers, block size, dtype, auto_map).
  • Draft model implementation (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py): New DFlash draft model: rotary embedding helper, Qwen3DFlashAttention, decoder layer, and DFlashDraftModel with caching/forward behavior.
  • Draft creation script (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh): New script that produces a randomly initialized HF draft checkpoint (copies config/dflash.py, instantiates AutoModel with trust_remote_code, casts to bfloat16, saves).
  • Idempotent vLLM patcher (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py): Installer-time patcher applying ~12 targeted edits across vLLM modules: registers deepseek_v4 as a DFlash target, injects aux hidden-state collection and outer setters/getters, relaxes kv-cache grouping/validation and page-size assumptions, fixes FP8 dtype mapping, adds a sparse-attention fallback, guards topk prefill None cases, deletes .pyc files, and reports per-patch status.
  • Smoke-test pipeline (tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml): Adds a 2-task Slurm pipeline: task_0 creates the draft checkpoint, task_1 runs the vLLM smoke test with SPEC_METHOD=dflash and points to the patch script; configures GPU count, TP, KV cache dtype, and resource settings.

SLURM configuration & executor wiring

  • Config data shape (tools/launcher/slurm_config.py): Adds SlurmConfig fields retries: int = 0, requeue: bool = False, and additional_parameters: dict = None; updates the slurm_factory signature to accept retries and requeue.
  • Executor wiring (tools/launcher/core.py): build_slurm_executor now uses slurm_config.retries for retries and merges {"requeue": True} into the additional Slurm parameters when slurm_config.requeue is truthy.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes


Important

Pre-merge checks failed

Please resolve all errors before merging. Addressing warnings is optional.

❌ Failed checks (1 error, 1 warning)

  • Security Anti-Patterns: ❌ Error. Hardcoded trust_remote_code=True in create_dflash_draft.sh without inline justification, violating the SECURITY.md requirement for safe transformers loading. Resolution: add an inline comment confirming the locally generated config/model are trusted, or expose the flag as a caller-configurable parameter defaulting to False.
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 50.00%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.
✅ Passed checks (4 passed)
  • Description Check: ✅ Passed (skipped: CodeRabbit's high-level summary is enabled).
  • Title check: ✅ Passed. The PR title accurately summarizes the main objective: adding DFlash support for DeepSeek-V4-Flash as a target model in the launcher infrastructure.
  • Linked Issues check: ✅ Passed (skipped: no linked issues were found for this pull request).
  • Out of Scope Changes check: ✅ Passed (skipped: no linked issues were found for this pull request).

coderabbitai Bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (6)
tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py (2)

82-93: 💤 Low value

Consider adding error handling for write failures.

The _patch_file function doesn't handle potential exceptions from path.write_text() (e.g., permission denied, disk full). While the container environment is likely controlled, unhandled write failures could silently corrupt state.

💡 Proposed defensive error handling
 def _patch_file(path: pathlib.Path, old: str, new: str, sentinel: str, label: str) -> bool:
     src = path.read_text()
     if sentinel in src:
         print(f"{label}: already patched")
         return True
     if old not in src:
         print(f"WARNING: {label}: pattern not found — skipping")
         return False
-    path.write_text(src.replace(old, new, 1))
-    _delete_pyc(path.stem)
-    print(f"{label}: OK")
-    return True
+    try:
+        path.write_text(src.replace(old, new, 1))
+        _delete_pyc(path.stem)
+        print(f"{label}: OK")
+        return True
+    except OSError as e:
+        print(f"ERROR: {label}: write failed — {e}")
+        return False
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 82 - 93, The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.

408-414: 💤 Low value

Consider using existing torch import instead of inline import.

The injected code uses import torch as _t inline. Since flash_attn.py already imports torch, this redundant import could be avoided by using the existing module-level torch.

💡 Simplified patch without redundant import
     _fa_new = re.sub(
         r"([ \t]*)raise ValueError\(f\"Unrecognized FP8 dtype: \{kv_cache_dtype\}\"\)",
         lambda m: (
             m.group(1) + "# _dflash_fp8_ds_mla_patch: fp8_ds_mla is e4m3fn stored by the compressor\n"
             + m.group(1) + "if kv_cache_dtype == \"fp8_ds_mla\":\n"
-            + m.group(1) + "    import torch as _t; return _t.float8_e4m3fn\n"
+            + m.group(1) + "    return torch.float8_e4m3fn\n"
             + m.group(1) + "raise ValueError(f\"Unrecognized FP8 dtype: {kv_cache_dtype}\")"
         ),
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 408 - 414, The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.
tools/launcher/core.py (1)

275-276: 💤 Low value

Long line reduces readability; consider splitting.

Line 276 combines dict unpacking, conditional expression, and getattr in a single expression. While functional, this is hard to parse at a glance.

💡 Proposed refactor for readability
         retries=slurm_config.retries,
-        additional_parameters={**(slurm_config.additional_parameters or {}), **({"requeue": True} if getattr(slurm_config, "requeue", False) else {})},
+        additional_parameters={
+            **(slurm_config.additional_parameters or {}),
+            **({"requeue": True} if getattr(slurm_config, "requeue", False) else {}),
+        },
         packager=packager,
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/core.py` around lines 275 - 276, The one-line dict merge for
additional_parameters is hard to read; extract the merge into a small, clear
pre-computation: create a local variable (e.g., merged_additional_parameters) by
first copying slurm_config.additional_parameters or {} and then, if
getattr(slurm_config, "requeue", False) is True, set
merged_additional_parameters["requeue"] = True; finally pass
merged_additional_parameters to the function call instead of the long one-liner.
This keeps the logic around slurm_config, requeue, and additional_parameters
explicit and easy to read.
tools/launcher/common/specdec/vllm_smoke_test.sh (1)

549-560: 💤 Low value

Consider using arrays for optional arguments to avoid word-splitting fragility.

The unquoted ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} rely on word splitting, which works for simple flags but can break if values contain spaces or special characters. While currently functional, using arrays throughout would be more robust.

♻️ Example: array-based approach
-# Build optional args
-OPTIONAL_ARGS=""
+# Build optional args as array
+OPTIONAL_ARGS=()
 if [ -n "${REASONING_PARSER:-}" ]; then
-    OPTIONAL_ARGS="${OPTIONAL_ARGS} --reasoning-parser ${REASONING_PARSER}"
+    OPTIONAL_ARGS+=(--reasoning-parser "${REASONING_PARSER}")
 fi
 # ... similar for other flags ...

-VLLM_CMD=(vllm serve "${MODEL}"
-    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
-    ${PARALLELISM_ARGS}
-    --port "${PORT}"
-    ${OPTIONAL_ARGS})
+VLLM_CMD=(vllm serve "${MODEL}"
+    --max-num-batched-tokens "${MAX_BATCHED_TOKENS:-32768}"
+    "${PARALLELISM_ARGS[@]}"
+    --port "${PORT}"
+    "${OPTIONAL_ARGS[@]}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/vllm_smoke_test.sh` around lines 549 - 560, The
VLLM command currently injects ${PARALLELISM_ARGS} and ${OPTIONAL_ARGS} via
unquoted string expansion which is fragile; change those to arrays and append
them into VLLM_CMD as array elements (e.g., ensure PARALLELISM_ARGS and
OPTIONAL_ARGS are declared as arrays and replace unquoted expansions with
VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=( "${OPTIONAL_ARGS[@]}" )),
keep the conditional appends for --speculative-config and --compilation-config
as array additions (VLLM_CMD+=(--speculative-config "$SPEC_CONFIG") etc.), and
finally invoke "${VLLM_CMD[@]}" to preserve proper tokenization.
tools/launcher/common/vllm/query.sh (1)

103-316: 💤 Low value

Consider extracting shared FORCE_AF_V2 patching logic.

This FORCE_AF_V2 patching block (lines 139-316) is nearly identical to the implementation in vllm_smoke_test.sh (lines 85-316). While duplicating the logic in each script ensures independence and avoids cross-script dependencies, consider extracting to a shared Python script if maintenance burden increases.

The exec(PATCH_CODE) on line 235 is safe since PATCH_CODE is a hardcoded string constant, not external input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/vllm/query.sh` around lines 103 - 316, The FORCE_AF_V2
patching block is duplicated; extract the shared logic into a single reusable
script/module (e.g., a Python file that exposes and installs the patch) and have
both query.sh and vllm_smoke_test.sh load and run it instead of embedding
PATCH_CODE inline. Move the large PATCH_CODE string and the install/exec logic
(symbols: PATCH_CODE, vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and
the FORCE_AF_V2 conditional) into the shared script, update each shell script to
check FORCE_AF_V2 and call that shared script (or import/execute it) so the
behavior remains identical while centralizing maintenance. Ensure the new shared
file is installed into a location accessible to the runtime and keep the
existing runtime prints and exception handling unchanged.
tools/launcher/common/specdec/sglang_smoke_test.sh (1)

65-67: ⚡ Quick win

Avoid unconditional pre-release transformers upgrades in smoke runs.

Line [65]-Line [67] mutates runtime dependencies on every execution (--pre), which makes results non-reproducible and network-sensitive. Gate this behind an opt-in env var and require a tested spec/range.

Suggested fix
-echo "Upgrading transformers (--pre for deepseek_v4 support)..."
-pip install --upgrade --pre transformers -q || echo "WARNING: transformers upgrade failed, continuing"
+if [ "${UPGRADE_TRANSFORMERS:-0}" = "1" ]; then
+    : "${TRANSFORMERS_SPEC:?Set TRANSFORMERS_SPEC to a tested version/range}"
+    echo "Upgrading transformers (${TRANSFORMERS_SPEC})..."
+    pip install --upgrade "${TRANSFORMERS_SPEC}" -q || {
+        echo "ERROR: transformers upgrade failed"
+        exit 1
+    }
+fi
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 65 - 67, The
smoke test unconditionally runs pip install --upgrade --pre transformers which
mutates runtime deps each run; change the logic around the echo/"pip install
--upgrade --pre transformers -q" invocation (the upgrade step labeled "Upgrading
transformers (--pre for deepseek_v4 support)...") to only run when an opt-in
environment variable (e.g., UPGRADE_TRANSFORMERS or USE_PRETRANSFORMERS) is set,
and require a constrained, tested spec (e.g., a VERSION or SPEC env var or
documented version range) instead of always using --pre; update the message to
reflect the gated behavior and skip the network upgrade by default to keep smoke
runs reproducible.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 35-42: The cleanup() function should guard use of SERVER_PID and
only remove tmpfs copies this script created: check SERVER_PID exists/is non-empty
using parameter expansion (e.g., test -n "${SERVER_PID-}" or similar) before
running kill/kill -9 to avoid failures under set -u, and introduce a
TMPFS_MODEL_OWNED (or TMPFS_MODEL_CREATED) flag when you create or copy the
tmpfs model in the code that discovers TMPFS_MODEL (the logic around the TMPFS
discovery at the block that sets TMPFS_MODEL near where you check Lines 49-51);
then in cleanup() only rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so
shared tmpfs paths are not deleted by concurrent jobs.
- Around line 90-104: The loop that tries to write _deepseek_v4_patch.py and
deepseek_v4.pth may fail silently and only exec(STUB) patches the current
interpreter, so the spawned "python -m sglang.launch_server" process won't get
the patch; add a boolean flag (e.g., wrote_pth = False) before iterating
site.getsitepackages() + [site.getusersitepackages()], set it to True when
writing deepseek_v4.pth succeeds, and after the loop check the flag and call
sys.exit(1) (and print an error) if no writable site-packages were found so the
script fails fast instead of continuing to exec(STUB) and spawning the server
without the .pth patch.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml`:
- Line 24: Remove the unrecognized environment variable
SGLANG_APPLY_CONFIG_BACKUP from the YAML or, if it is intended to be supported,
add handling and documentation: update sglang_smoke_test.sh to read and act on
SGLANG_APPLY_CONFIG_BACKUP (e.g., apply a backup config path or "none"
semantics) and add a comment and README entry describing the variable and valid
values; ensure the YAML entry "- SGLANG_APPLY_CONFIG_BACKUP: \"none\"" is either
deleted or retained only after the script and docs are updated to reference
SGLANG_APPLY_CONFIG_BACKUP.

---

Nitpick comments:
In `@tools/launcher/common/specdec/sglang_smoke_test.sh`:
- Around line 65-67: The smoke test unconditionally runs pip install --upgrade
--pre transformers which mutates runtime deps each run; change the logic around
the echo/"pip install --upgrade --pre transformers -q" invocation (the upgrade
step labeled "Upgrading transformers (--pre for deepseek_v4 support)...") to
only run when an opt-in environment variable (e.g., UPGRADE_TRANSFORMERS or
USE_PRETRANSFORMERS) is set, and require a constrained, tested spec (e.g., a
VERSION or SPEC env var or documented version range) instead of always using
--pre; update the message to reflect the gated behavior and skip the network
upgrade by default to keep smoke runs reproducible.

In `@tools/launcher/common/specdec/vllm_smoke_test.sh`:
- Around line 549-560: The VLLM command currently injects ${PARALLELISM_ARGS}
and ${OPTIONAL_ARGS} via unquoted string expansion which is fragile; change
those to arrays and append them into VLLM_CMD as array elements (e.g., ensure
PARALLELISM_ARGS and OPTIONAL_ARGS are declared as arrays and replace unquoted
expansions with VLLM_CMD+=( "${PARALLELISM_ARGS[@]}" ) and VLLM_CMD+=(
"${OPTIONAL_ARGS[@]}" )), keep the conditional appends for --speculative-config
and --compilation-config as array additions (VLLM_CMD+=(--speculative-config
"$SPEC_CONFIG") etc.), and finally invoke "${VLLM_CMD[@]}" to preserve proper
tokenization.

In `@tools/launcher/common/vllm/query.sh`:
- Around line 103-316: The FORCE_AF_V2 patching block is duplicated; extract the
shared logic into a single reusable script/module (e.g., a Python file that
exposes and installs the patch) and have both query.sh and vllm_smoke_test.sh
load and run it instead of embedding PATCH_CODE inline. Move the large
PATCH_CODE string and the install/exec logic (symbols: PATCH_CODE,
vllm_force_af_v2_runtime, _install, exec(PATCH_CODE), and the FORCE_AF_V2
conditional) into the shared script, update each shell script to check
FORCE_AF_V2 and call that shared script (or import/execute it) so the behavior
remains identical while centralizing maintenance. Ensure the new shared file is
installed into a location accessible to the runtime and keep the existing
runtime prints and exception handling unchanged.

In `@tools/launcher/core.py`:
- Around line 275-276: The one-line dict merge for additional_parameters is hard
to read; extract the merge into a small, clear pre-computation: create a local
variable (e.g., merged_additional_parameters) by first copying
slurm_config.additional_parameters or {} and then, if getattr(slurm_config,
"requeue", False) is True, set merged_additional_parameters["requeue"] = True;
finally pass merged_additional_parameters to the function call instead of the
long one-liner. This keeps the logic around slurm_config, requeue, and
additional_parameters explicit and easy to read.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`:
- Around line 82-93: The _patch_file function can raise exceptions from
path.write_text (e.g., permission errors or disk full) which are not currently
handled; update _patch_file to catch exceptions around the write and _delete_pyc
call (wrap the path.write_text and _delete_pyc(path.stem) in a try/except), log
or print a clear error message including the exception and label, and return
False on failure so callers know the patch failed; ensure sentinel/old checks
remain unchanged and only the write/delete operations are guarded.
- Around line 408-414: The injected replacement in the re.sub lambda currently
does an inline "import torch as _t" when handling kv_cache_dtype ==
"fp8_ds_mla"; change it to use the existing module-level torch symbol instead
(e.g., replace the inline import/alias and return _t.float8_e4m3fn with a direct
return of torch.float8_e4m3fn) so the patch uses the already-imported torch in
flash_attn.py and avoids a redundant inline import; update the lambda string in
the _fa_new assignment that matches the raise ValueError to reference torch
directly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e768f4bf-2470-478f-a408-7f7a00c11644

📥 Commits

Reviewing files that changed from the base of the PR and between bb08094 and 0b2f2b4.

📒 Files selected for processing (17)
  • tools/launcher/common/specdec/read_vllm_files.sh
  • tools/launcher/common/specdec/sglang_smoke_test.sh
  • tools/launcher/common/specdec/vllm_smoke_test.sh
  • tools/launcher/common/vllm/query.sh
  • tools/launcher/core.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/hf_offline_eagle3_data_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/sglang_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b200.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_b300.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash-DFlash/vllm_mtp_smoke_test_cw_dfw.yaml
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
  • tools/launcher/slurm_config.py

Comment on lines +35 to +42
cleanup() {
    kill "$SERVER_PID" 2>/dev/null || true
    sleep 2
    kill -9 "$SERVER_PID" 2>/dev/null || true
    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
        echo "Removing tmpfs model copy: $TMPFS_MODEL"
        rm -rf "$TMPFS_MODEL"
    fi

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Guard cleanup for unset PID and avoid deleting shared tmpfs copies.

With set -u, Line [36] can fail when the script exits before Line [141] assigns SERVER_PID. Also, tmpfs copies discovered at Line [49]-Line [51] are reused but still removed at exit, which can break concurrent jobs using the same path.

Suggested fix
 TMPFS_MODEL=""
+SERVER_PID=""
+TMPFS_MODEL_OWNED=0
 cleanup() {
-    kill "$SERVER_PID" 2>/dev/null || true
-    sleep 2
-    kill -9 "$SERVER_PID" 2>/dev/null || true
-    if [ -n "$TMPFS_MODEL" ] && [ -d "$TMPFS_MODEL" ]; then
+    if [ -n "${SERVER_PID:-}" ] && kill -0 "${SERVER_PID}" 2>/dev/null; then
+        kill "${SERVER_PID}" 2>/dev/null || true
+        sleep 2
+        kill -9 "${SERVER_PID}" 2>/dev/null || true
+    fi
+    if [ "${TMPFS_MODEL_OWNED:-0}" = "1" ] && [ -n "${TMPFS_MODEL:-}" ] && [ -d "${TMPFS_MODEL}" ]; then
         echo "Removing tmpfs model copy: $TMPFS_MODEL"
         rm -rf "$TMPFS_MODEL"
     fi
 }
@@
     if [ -d "$TMPFS_MODEL" ] && [ -f "$TMPFS_MODEL/config.json" ]; then
         echo "Using existing tmpfs model copy: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=0
     else
@@
         cp -r "$MODEL" "$TMPFS_MODEL"
         echo "Model copy done: $TMPFS_MODEL"
+        TMPFS_MODEL_OWNED=1
     fi

Also applies to: 46-57, 141-141

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 35 - 42, The
cleanup() function should guard use of SERVER_PID and only remove tmpfs copies
this script created: check SERVER_PID exists/is non-empty using parameter
expansion (e.g., test -n "${SERVER_PID-}" or similar) before running kill/kill -9
to avoid failures under set -u, and introduce a TMPFS_MODEL_OWNED (or
TMPFS_MODEL_CREATED) flag when you create or copy the tmpfs model in the code
that discovers TMPFS_MODEL (the logic around the TMPFS discovery at the block
that sets TMPFS_MODEL near where you check Lines 49-51); then in cleanup() only
rm -rf "$TMPFS_MODEL" when TMPFS_MODEL_OWNED is true so shared tmpfs paths are
not deleted by concurrent jobs.

Comment on lines +90 to +104
for sp in site.getsitepackages() + [site.getusersitepackages()]:
    if not os.path.isdir(sp):
        continue
    try:
        with open(os.path.join(sp, '_deepseek_v4_patch.py'), 'w') as f:
            f.write(STUB)
        with open(os.path.join(sp, 'deepseek_v4.pth'), 'w') as f:
            f.write('import _deepseek_v4_patch\n')
        print(f"[patch] Wrote deepseek_v4.pth to {sp}")
        break
    except Exception as e:
        print(f"[patch] Could not write to {sp}: {e}")

exec(STUB)
PYEOF

⚠️ Potential issue | 🟠 Major


Fail fast when the .pth patch cannot be installed for spawned processes.

If all writable site-packages attempts fail (lines 90–102), the script continues silently. exec(STUB) at line 103 only patches the current interpreter; the spawned python -m sglang.launch_server process (line 131) depends on the .pth file for automatic patch application at startup. Without the file, the server starts without deepseek_v4 registered, causing opaque failures downstream.

Track whether .pth was successfully written and exit with an error if all site-packages locations are non-writable.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tools/launcher/common/specdec/sglang_smoke_test.sh` around lines 90 - 104,
The loop that tries to write _deepseek_v4_patch.py and deepseek_v4.pth may fail
silently and only exec(STUB) patches the current interpreter, so the spawned
"python -m sglang.launch_server" process won't get the patch; add a boolean flag
(e.g., wrote_pth = False) before iterating site.getsitepackages() +
[site.getusersitepackages()], set it to True when writing deepseek_v4.pth
succeeds, and after the loop check the flag and call sys.exit(1) (and print an
error) if no writable site-packages were found so the script fails fast instead
of continuing to exec(STUB) and spawning the server without the .pth patch.

codecov Bot commented Apr 30, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.13%. Comparing base (0678136) to head (1b123da).
⚠️ Report is 64 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1379      +/-   ##
==========================================
- Coverage   74.46%   69.13%   -5.34%     
==========================================
  Files         464      491      +27     
  Lines       50089    58234    +8145     
==========================================
+ Hits        37300    40261    +2961     
- Misses      12789    17973    +5184     
  • unit: 52.55% <ø> (+0.11%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.

ChenhanYu and others added 5 commits April 30, 2026 11:19
SGLang MTP smoke tests have not been run and verified. Remove them until
they are tested end-to-end on a cluster.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
These files belong to the draft model folder and were not created or
verified in this PR. Remove until they have dedicated test coverage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
The z-lab/DeepSeek-V4-Flash-DFlash checkpoint is a 4-layer Qwen3-based
scaffold created from z-lab's DFlash architecture reference with random
weights. The smoke test verifies the inference pipeline starts and
generates tokens, not acceptance rate or generation quality.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
Adds a 2-task pipeline variant of vllm_dflash_smoke_test_cw_dfw.yaml:
- task_0 creates a randomly-initialized DFlash draft checkpoint from
  the bundled config.json + dflash.py (4-layer Qwen3 scaffold based on
  z-lab's DFlash architecture), saving it to /scratchspace/dflash_draft
- task_1 runs the existing vllm_smoke_test.sh using that checkpoint

Bundles dflash_draft/config.json and dflash_draft/dflash.py so the test
is fully self-contained — no pre-trained draft checkpoint required.
The random-weight draft validates the DFlash inference stack (patching,
KV cache allocation, speculative decoding loop) without needing a
trained checkpoint.

Signed-off-by: chenhany <chenhany@nvidia.com>
DeepSeek-V4-Flash has compress_ratios={0:3, 4:21, 128:20} so vLLM's
is_deepseek_v4 flag fires, but the DFlash speculative decoding path
does not populate c128a_prefill_topk_indices during prefill metadata
build. This caused an AttributeError on the first inference request.

P12 in patch_vllm_dflash.py patches _forward_prefill to detect the
None case and fall back to SWA-only mode (topk_indices_buffer, top_k=0,
N=0) rather than crashing. This is sufficient for the smoke test which
validates the inference stack, not acceptance rate or quality.

Also: add draft_model to GlobalVariables in core.py, and tune smoke
test env vars (GPU_MEM_UTIL=0.80, MAX_BATCHED_TOKENS=16384) to match
values verified on CW-DFW H100 8-GPU node.

Smoke test result: 8 passed, 0 failed (cicd_1777582482 on CW-DFW).

Signed-off-by: Han-Yun Chen <chenhany@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: chenhany <chenhany@nvidia.com>
coderabbitai Bot left a comment

Actionable comments posted: 5

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh`:
- Around line 25-33: The script currently calls AutoConfig.from_pretrained and
AutoModel.from_config with trust_remote_code=True which triggers remote-code
execution via the auto_map in the config; instead, import and instantiate the
local DFlash model class directly (the class defined in dflash.py that was
copied earlier) rather than using AutoConfig/AutoModel, or add a
caller-controlled flag to disable trust_remote_code and default it to false;
specifically replace the AutoConfig/AutoModel usage with a direct import of the
local DFlash class (referencing dflash.py and the class name used there) and
call its constructor with the config loaded without invoking trust_remote_code,
or expose a CLI/ENV variable to gate the trust_remote_code boolean when using
AutoModel.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py`:
- Around line 4-15: The file-level header lacks explicit third-party provenance
for the z-lab-derived scaffold; add a clear provenance block at the top of
dflash.py that states the upstream repository URL and exact commit hash used,
the original project's copyright and license (including SPDX identifier), and a
short excerpt or link to the original license text, while keeping the existing
NVIDIA Apache-2.0 header; ensure the block references "z-lab DFlash reference
implementation" and the original source commit so the file is license-compliant
and traceable.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`:
- Around line 564-576: When topk_indices is None (the c128a_prefill_topk_indices
fallback) don't silently switch to SWA-only; instead require an explicit opt-in
via an environment flag (e.g., something like DEEPSEEK_ALLOW_SWA_FALLBACK)
before setting swa_only=True and using self.topk_indices_buffer; if the env flag
is absent raise or log a one-time error/warning and abort the run. Update the
branch that currently handles topk_indices is None (references: topk_indices,
topk_indices_buffer, swa_only, num_decode_tokens) to check the env flag first,
emit the one-time message (or raise), and only then assign topk_indices =
self.topk_indices_buffer[num_decode_tokens:], set swa_only and top_k/N
accordingly. Ensure the message clearly names c128a_prefill_topk_indices and the
env flag so users can opt into smoke-test behavior.
- Around line 71-75: The hardcoded VLLM path should be derived from the
installed vllm package instead of a fixed string; replace the literal VLLM =
pathlib.Path("/usr/local/.../vllm") with logic that imports the vllm module and
sets VLLM = pathlib.Path(vllm.__file__).parent (or the appropriate package
parent) and handle ImportError to surface a clear message if vllm is not
installed; update any references to VLLM accordingly (look for the VLLM symbol
in this file).

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml`:
- Around line 18-29: The tuning notes for gpu_memory_utilization and
max_num_batched_tokens are out of sync with task_1 settings: update either the
explanatory comments (gpu_memory_utilization=0.85, max_num_batched_tokens=32768)
to reflect the actual task variables GPU_MEM_UTIL=0.80 and
MAX_BATCHED_TOKENS=16384, or change task_1's GPU_MEM_UTIL and MAX_BATCHED_TOKENS
to match the documented values; ensure consistency for the same settings
referenced elsewhere (also update the repeated notes around the block mentioned
at lines 69-71) so gpu_memory_utilization / GPU_MEM_UTIL and
max_num_batched_tokens / MAX_BATCHED_TOKENS are identical across comments and
task_1.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f3104469-0f5f-4db2-81fa-d14bec5bb486

📥 Commits

Reviewing files that changed from the base of the PR and between ffe00b6 and 1b123da.

📒 Files selected for processing (6)
  • tools/launcher/core.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml
✅ Files skipped from review due to trivial changes (1)
  • tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/config.json

Comment on lines +25 to +33
python3 - <<'EOF'
import os, sys, torch
sys.path.insert(0, os.environ.get("OUT", "/scratchspace/dflash_draft"))
from transformers import AutoConfig, AutoModel

out = os.environ.get("OUT", "/scratchspace/dflash_draft")
print(f"Initializing random DFlash draft model from config: {out}/config.json")
config = AutoConfig.from_pretrained(out, trust_remote_code=True)
model = AutoModel.from_config(config, trust_remote_code=True).to(torch.bfloat16)

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win


Avoid executing checkpoint code just to initialize random weights.

This always enables remote-code execution via the auto_map field in the config. Since dflash.py is already copied locally in the same script, instantiate the class directly from that file instead of going through AutoConfig/AutoModel(..., trust_remote_code=True), or at minimum make the flag caller-controlled and default it off.

Possible fix
 import os, sys, torch
+import importlib.util
+from pathlib import Path
 sys.path.insert(0, os.environ.get("OUT", "/scratchspace/dflash_draft"))
-from transformers import AutoConfig, AutoModel
+from transformers import Qwen3Config
 
 out = os.environ.get("OUT", "/scratchspace/dflash_draft")
 print(f"Initializing random DFlash draft model from config: {out}/config.json")
-config = AutoConfig.from_pretrained(out, trust_remote_code=True)
-model = AutoModel.from_config(config, trust_remote_code=True).to(torch.bfloat16)
+spec = importlib.util.spec_from_file_location("dflash", Path(out) / "dflash.py")
+module = importlib.util.module_from_spec(spec)
+assert spec.loader is not None
+spec.loader.exec_module(module)
+config = Qwen3Config.from_pretrained(out)
+model = module.DFlashDraftModel(config).to(torch.bfloat16)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/create_dflash_draft.sh`
around lines 25 - 33, The script currently calls AutoConfig.from_pretrained and
AutoModel.from_config with trust_remote_code=True which triggers remote-code
execution via the auto_map in the config; instead, import and instantiate the
local DFlash model class directly (the class defined in dflash.py that was
copied earlier) rather than using AutoConfig/AutoModel, or add a
caller-controlled flag to disable trust_remote_code and default it to false;
specifically replace the AutoConfig/AutoModel usage with a direct import of the
local DFlash class (referencing dflash.py and the class name used there) and
call its constructor with the config loaded without invoking trust_remote_code,
or expose a CLI/ENV variable to gate the trust_remote_code boolean when using
AutoModel.

Comment on lines +4 to +15
# DFlash draft model architecture for DeepSeek-V4-Flash.
#
# This is a 4-layer Qwen3-based scaffold used as the DFlash speculative decoding
# draft model. The architecture is adapted from z-lab's DFlash reference implementation.
#
# Each decoder layer performs cross-attention over the target model's hidden states
# (aux outputs from selected target layers) alongside the draft model's own hidden states.
# The `fc` + `hidden_norm` project the concatenated target hidden states down to hidden_size
# before cross-attention.
#
# This file is bundled with the smoke test config so that a randomly-initialized draft
# checkpoint can be created at test time without requiring a pre-trained checkpoint.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add explicit third-party provenance for the z-lab-derived scaffold.

The header says this file is adapted from z-lab's DFlash reference implementation, but the file only carries the NVIDIA Apache header. If any implementation here was copied or closely adapted, please add the upstream repo + commit and the original license/SPDX details in this file so the example stays license-compliant.

As per coding guidelines, **/*.{py,cpp,hpp,h,cc,cxx,c,ts,tsx,js,jsx}: When copying code from third-party sources, include a reference link with commit hash to the source, the original repository's Copyright/License, and the NVIDIA Apache 2.0 Copyright/License header in the file.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/dflash_draft/dflash.py`
around lines 4 - 15, The file-level header lacks explicit third-party provenance
for the z-lab-derived scaffold; add a clear provenance block at the top of
dflash.py that states the upstream repository URL and exact commit hash used,
the original project's copyright and license (including SPDX identifier), and a
short excerpt or link to the original license text, while keeping the existing
NVIDIA Apache-2.0 header; ensure the block references "z-lab DFlash reference
implementation" and the original source commit so the file is license-compliant
and traceable.

Comment on lines +71 to +75
import pathlib
import re
import sys

VLLM = pathlib.Path("/usr/local/lib/python3.12/dist-packages/vllm")

⚠️ Potential issue | 🟠 Major | ⚡ Quick win


Resolve the vLLM package root dynamically.

Hardcoding /usr/local/lib/python3.12/dist-packages/vllm ties the script to a single Python version and installation layout. On the next base-image update or non-standard vLLM installation, this path will fail silently or not be found. Derive the path from the imported module instead.

Suggested fix
 import pathlib
 import re
 import sys
+import vllm
 
-VLLM = pathlib.Path("/usr/local/lib/python3.12/dist-packages/vllm")
+VLLM = pathlib.Path(vllm.__file__).resolve().parent
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 71 - 75, The hardcoded VLLM path should be derived from the
installed vllm package instead of a fixed string; replace the literal VLLM =
pathlib.Path("/usr/local/.../vllm") with logic that imports the vllm module and
sets VLLM = pathlib.Path(vllm.__file__).parent (or the appropriate package
parent) and handle ImportError to surface a clear message if vllm is not
installed; update any references to VLLM accordingly (look for the VLLM symbol
in this file).

Comment on lines +564 to +576
" if topk_indices is None: # _p12_c128a_prefill_fallback_patch\n"
" # c128a_prefill_topk_indices unavailable; degrade to SWA-only.\n"
" assert self.topk_indices_buffer is not None\n"
" topk_indices = self.topk_indices_buffer[num_decode_tokens:]\n"
" swa_only = True\n"
" top_k = 0\n"
" N = 0\n"
" else:\n"
" top_k = topk_indices.shape[-1]\n"
" # Compressed region must fit the full compressed pool (seq_len //\n"
" # compress_ratio), not just top_k. top_k bounds how many indices\n"
" # the indexer selects, not the pool size it indexes into.\n"
" N = (self.max_model_len + self.compress_ratio - 1) // self.compress_ratio"

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the SWA-only fallback behind an explicit smoke-test opt-in.

When c128a_prefill_topk_indices is missing, this silently changes attention semantics and still lets the run continue. That hides real metadata bugs in any non-smoke-test usage. Please require an explicit env flag and emit a one-time warning or error otherwise.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/patch_vllm_dflash.py`
around lines 564 - 576, When topk_indices is None (the
c128a_prefill_topk_indices fallback) don't silently switch to SWA-only; instead
require an explicit opt-in via an environment flag (e.g., something like
DEEPSEEK_ALLOW_SWA_FALLBACK) before setting swa_only=True and using
self.topk_indices_buffer; if the env flag is absent raise or log a one-time
error/warning and abort the run. Update the branch that currently handles
topk_indices is None (references: topk_indices, topk_indices_buffer, swa_only,
num_decode_tokens) to check the env flag first, emit the one-time message (or
raise), and only then assign topk_indices =
self.topk_indices_buffer[num_decode_tokens:], set swa_only and top_k/N
accordingly. Ensure the message clearly names c128a_prefill_topk_indices and the
env flag so users can opt into smoke-test behavior.

Comment on lines +18 to +29
# Key config choices:
# block_size=256 — required: SWA layers use window_size=256 so the page constraint
# max(sm_page_sizes) ≤ max(all_page_sizes) forces block_size ≥ 256
# gpu_memory_utilization=0.85 — leaves ~4 GB headroom per GPU for Triton JIT compilation
# of the DeepSeek compressor kernel on first inference
# enforce_eager — disables CUDA graph capture; avoids NoneType tensor errors
# in deepseek_v4_attention CUDA graph execution with DFlash patches
# max_num_batched_tokens=32768 — vLLM's default; required so spec-decoding token budget
# (max_num_batched_tokens minus per-request draft overhead)
# stays positive. Setting this too low (e.g. 4096) causes
# "max_num_scheduled_tokens is set to negative value" error.
#

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Keep the tuning notes in sync with the actual task settings.

The comments describe gpu_memory_utilization=0.85 and max_num_batched_tokens=32768, but task_1 sets GPU_MEM_UTIL=0.80 and MAX_BATCHED_TOKENS=16384. Please update either the notes or the values so operators debug the right configuration.

Also applies to: 69-71

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@tools/launcher/examples/deepseek-ai/DeepSeek-V4-Flash/vllm_dflash_smoke_test_cw_dfw.yaml`
around lines 18 - 29, The tuning notes for gpu_memory_utilization and
max_num_batched_tokens are out of sync with task_1 settings: update either the
explanatory comments (gpu_memory_utilization=0.85, max_num_batched_tokens=32768)
to reflect the actual task variables GPU_MEM_UTIL=0.80 and
MAX_BATCHED_TOKENS=16384, or change task_1's GPU_MEM_UTIL and MAX_BATCHED_TOKENS
to match the documented values; ensure consistency for the same settings
referenced elsewhere (also update the repeated notes around the block mentioned
at lines 69-71) so gpu_memory_utilization / GPU_MEM_UTIL and
max_num_batched_tokens / MAX_BATCHED_TOKENS are identical across comments and
task_1.
