feat: data plane transfer queue integration #2439
Open
ZhiyuLi-Nvidia wants to merge 162 commits into
When ``calibrate_qkv_fp8_scales`` is invoked after a training step (``grpo_sync.py``:863), it reads calibration data from the data-plane via ``policy.read_from_dataplane(meta, select_fields=_calib_fields, ...)``. ``_calib_fields`` was built as ``meta.fields - DP_CALIB_EXCLUDED_FIELDS``, but the train partition's ``meta.fields`` carries the ``decompose_message_log`` wire payload (``turn_lengths``, ``turn_roles``, ``turn_contents``) alongside the model-input columns. That bulk metadata then rides into the legacy ``get_microbatch_iterator`` → ``get_and_validate_seqlen`` path, which asserts every 2D tensor's dim 1 matches the model's seq_dim (8192). ``turn_lengths`` has shape ``(B, max_turns≈3)`` → AssertionError, recipe crashes.

Wire still carries these fields (the driver-side reconstruct path needs them); we just narrow what FP8 calibration asks for. Add ``MESSAGE_LOG_BULK_FIELDS`` to ``DP_CALIB_EXCLUDED_FIELDS`` so the filter at the calibration request site automatically drops them.

Also adds ``tilelang`` to base deps as the workaround mamba-ssm requires on Hopper with Triton >= 3.4.0 (per upstream state-spaces/mamba#640). qwen3.5-9b megatron, qwen3.5-35ba3b megatron-ep16, and any other gated-chunk mamba recipe crash with ``RuntimeError: ... Please install tilelang`` without it. uv.lock regenerated in-container (uv 0.11.6, 443 packages, +tilelang).

Surfaced by extras sweep on 7ffb1c5db:
- 11920261 grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron (FP8 calib)
- 11920253 grpo-qwen3.5-9b-1n8g-megatron (tilelang)
- 11920255 grpo-qwen3.5-35ba3b-2n8g-megatron-ep16 (tilelang)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…ELDS Architecture-invariant test that prevents the silent regression we just hit on grpo-qwen3-8b-base-1n8g-fp8-kvcache-megatron (job 11920261): a new wire field landed in ``meta.fields`` but the FP8 calibration's blacklist (``DP_CALIB_EXCLUDED_FIELDS``) wasn't updated, so calibration silently requested the bulk-shape field and crashed in ``get_and_validate_seqlen`` (which assumes all 2D tensors are ``(B, seq_len)``). Pinning this membership: anyone who later adds another bulk-metadata field to the wire (e.g., extra decompose payload) must either match the per-token shape contract or extend ``DP_CALIB_EXCLUDED_FIELDS``, or this test fails in CI before the recipe crashes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…xtures

Slurm jobid 11922031:
  KeyError: 'claim_meta_poll_interval_s'
  tests/unit/data_plane/test_tq_lifecycle.py:65 in TQDataPlaneClient.__init__ (adapters/transfer_queue.py:413)

Same shape as the chaos-smoke fix in f593316bc: two fixtures in ``test_tq_lifecycle.py`` built dicts missing 3 required DataPlaneConfig keys (``claim_meta_poll_interval_s``, ``global_segment_size``, ``local_buffer_size``). Add them with the same CI-sized values used elsewhere in the suite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
The legacy sync path returns FP8 scales from the calibration worker
as a Python dict via Ray, then re-broadcasts to vLLM workers — keeps
the driver on the critical path and doesn't compose with the async
proposal (multiple in-flight calibrations need a shared transport,
not driver-mediated dict passing).
Add a thin TQ-backed transport for scales:
- ``nemo_rl/data_plane/kv_scales.py``:
- ``pack_kv_scales`` / ``unpack_kv_scales`` — dict ↔ fixed-shape
tensors (``q_scales``, ``k_scales``, ``v_scales``).
- ``put_kv_scales`` / ``get_kv_scales`` — register a single-sample
``"kv_scales"`` partition, write/read via the existing
``DataPlaneClient`` put_samples/get_samples primitives.
- Adapter-agnostic (works on NoOp + TransferQueue alike).
- ``grpo_sync.py`` (sync path, calibration site):
After ``calibrate_qkv_fp8_scales`` returns the dict, immediately
round-trip it through TQ via put/get. Legacy ``refit_policy_generation``
still consumes the dict — the round-trip just validates the
transport works for scales. Next commit will move the read site
out of the driver into the vLLM refit worker so the dict path
drops out entirely.
- ``tests/unit/data_plane/test_kv_scales.py``: pack/unpack identity,
gap handling, empty case, full put→get round-trip via NoOp adapter
parametrized on n_layers ∈ {1, 8, 24}, partition-id passthrough.
Motivated by feedback on PR #2439 / data-plane async proposal: scales
belong on the wire, not in driver-side Python dicts.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
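For readers unfamiliar with the dict ↔ fixed-shape packing described in this commit, here is a minimal sketch of the idea. It is not the code in ``nemo_rl/data_plane/kv_scales.py``: the per-layer dict layout, the NaN gap marker, and the use of plain Python lists instead of tensors are all assumptions made for illustration. Only the function names and the ``q_scales`` / ``k_scales`` / ``v_scales`` field names come from the commit message.

```python
import math

# ASSUMPTION: scales arrive as {layer_idx: (q_scale, k_scale, v_scale)}.
# The real dict layout is not shown in the PR text.
def pack_kv_scales(scales, n_layers):
    """Flatten a per-layer dict into three fixed-length lists; gaps become NaN."""
    q = [math.nan] * n_layers
    k = [math.nan] * n_layers
    v = [math.nan] * n_layers
    for layer, (qs, ks, vs) in scales.items():
        q[layer], k[layer], v[layer] = float(qs), float(ks), float(vs)
    return {"q_scales": q, "k_scales": k, "v_scales": v}


def unpack_kv_scales(packed):
    """Inverse of pack_kv_scales; layers holding the NaN gap marker are skipped."""
    out = {}
    columns = zip(packed["q_scales"], packed["k_scales"], packed["v_scales"])
    for layer, (qs, ks, vs) in enumerate(columns):
        if not (math.isnan(qs) or math.isnan(ks) or math.isnan(vs)):
            out[layer] = (qs, ks, vs)
    return out


# Layer 1 is intentionally missing to exercise gap handling.
scales = {0: (0.5, 0.25, 1.0), 2: (0.125, 2.0, 4.0)}
packed = pack_kv_scales(scales, n_layers=3)
assert unpack_kv_scales(packed) == scales
```

A fixed shape per field is what lets the scales travel through the same put/get primitives as any other single-sample partition; the gap marker is one possible way to keep missing layers distinguishable after the round trip.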
…cut)" This reverts commit bee3a62f90bc15798b73578820da9f9c84715535. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
The existing ``DP_CALIB_EXCLUDED_FIELDS`` negative-list shape silently broke when ``MESSAGE_LOG_BULK_FIELDS`` (wire-only object arrays) were added to the train wire: ``calibrate_qkv_fp8_scales`` routes through get_microbatch_iterator which only handles seq-dim tensors, so any new non-tensor wire field crashes calib until someone augments the exclude list. Mirror the ``LP_SEED_FIELDS`` pattern instead: name the fields calibration actually needs.

Changes:
* schema.py: replace ``_DP_CALIB_INPUT_FIELDS`` (private) + ``DP_CALIB_EXCLUDED_FIELDS`` (derived negative) with ``DP_CALIB_INPUT_FIELDS = (INPUT_IDS, INPUT_LENGTHS)``. Same shape as ``LP_SEED_FIELDS``: a positive tuple of what the consumer fetches. Drops the cross-layer ``llm_message_utils`` import.
* grpo_sync.py: ``_calib_fields = [f for f in meta.fields if f in DP_CALIB_INPUT_FIELDS]``.

Trade-off: drops the implicit multimodal-extras pass-through. Today's GRPO recipes are text-only; multimodal calibration can re-introduce extras via a meta-side marker (e.g. ``meta.extra_info["multimodal_calib_fields"]``) in a follow-up.

Also remove ``tests/unit/data_plane/test_tq_chaos_smoke.py``, which was an untracked working-tree scratch that got pulled in during the test consolidation; not load-bearing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
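The difference between the negative-list and positive-list filters can be seen in a small sketch. The string values assigned to ``INPUT_IDS`` and ``INPUT_LENGTHS``, the ``token_mask`` wire field, and the ``old_excluded`` set are hypothetical stand-ins; only the list comprehension and the ``turn_*`` field names are taken from the commit messages in this PR.

```python
# ASSUMPTION: string values of the schema constants are illustrative only.
INPUT_IDS, INPUT_LENGTHS = "input_ids", "input_lengths"
MESSAGE_LOG_BULK_FIELDS = ("turn_lengths", "turn_roles", "turn_contents")
DP_CALIB_INPUT_FIELDS = (INPUT_IDS, INPUT_LENGTHS)


def select_negative(fields, excluded):
    """Old shape: request everything that is not explicitly excluded."""
    return [f for f in fields if f not in excluded]


def select_positive(fields, wanted):
    """New shape: request only the fields the consumer names."""
    return [f for f in fields if f in wanted]


# Hypothetical wire contents after the bulk message-log fields were added.
wire_fields = [INPUT_IDS, INPUT_LENGTHS, "token_mask", *MESSAGE_LOG_BULK_FIELDS]
# Hypothetical exclude list written before the bulk fields existed.
old_excluded = {"token_mask"}

leaky = select_negative(wire_fields, old_excluded)
safe = select_positive(wire_fields, DP_CALIB_INPUT_FIELDS)

print(leaky)  # bulk fields leak through the stale exclude list
print(safe)   # only the named calibration inputs remain
```

With the positive list, a new wire field is ignored by calibration unless someone opts it in, which is the property the commit relies on.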
…edupe
Squashes 6 earlier commits into one logical change covering all
realistic-shape test infrastructure under ``tests/unit/data_plane/``.
* New helper module ``_rollout_shapes.py``:
- ``make_rollout_batch(n, max_seqlen, multimodal=False, *_dtype, seed)``
mints data with the shape + dtypes ``SyncRolloutActor.rollout_to_tq``
actually writes (int64 ids, int32 masks, bf16 logprobs, optional
multimodal extras as flat top-level fields).
- ``make_realistic_tags(n, zero_std_fraction, seed)`` mirrors GRPO
driver tag-stamping (std/total_reward/prompt_id/weight_version)
with a controllable zero-std fraction for dynamic-sampling tests.
- ``make_multi_turn_message_log(n, turns_per_sample, seed)`` builds
jagged-turn-count message logs for decompose/reconstruct round-trips.
- Shared cross-file helpers ``keys_from_uids``,
``register_train_partition``, ``mooncake_available`` (deduped from
three test files that each defined their own copy).
* Realistic-shape coverage added to 9 test files: codec_jagged
(dtype parametrize), codec_mooncake (bf16 per-token), codec_wire_stripped
(NonTensorStack of varied turn roles), correctness (kv_first_write
round-trip with mixed dtypes), kvbatchmeta (driver tags), message_log
decompose (jagged + multi-turn round-trip), observability (mixed-
dtype put_bytes), preshard_extras (VLM ``pixel_values`` round-trip),
sync_one_hop (full 7-stage TQ lifecycle).
* ``test_full_sync_step_lifecycle_on_realistic_batch`` walks the
production ``grpo_train_sync`` per-step flow end to end on a realistic
batch — register → kv_first_write → tag → worker delta-writes →
driver delta-write → full read → ``finish_step`` clear — asserting
every field's dtype survives the pipeline.
* Cross-file dedupe: ``_keys_from_uids`` / ``_setup`` /
``_setup_partition`` / ``_mooncake_available`` (defined identically
in 3+ test files) collapsed into single canonical implementations in
the helper module.
* Module-level imports for all helpers; module-level ``pytest``;
``_PARTITION = "train"`` const in the lifecycle test (was repeated 7×).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Regenerate via ``uv lock`` against the current ``pyproject.toml``. The diff is the resolver-marker expansion — each platform/extra marker now splits by the ``extra-7-nemo-rl-vllm`` extra dimension. Legitimate uv-driven refresh, not drift. (Note: the tilelang comment compression in the working tree was reverted by ``uv lock``'s pyproject.toml canonicalization; only the lockfile change landed here.) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
CI pre-commit ran ``ruff check --select I --fix`` (per ``.pre-commit-config.yaml``'s second ruff hook) and flagged four files — local ``._rollout_shapes`` imports ordered before third-party imports, and a stray blank line after ``from __future__ import annotations``. ``ruff check`` (the first hook) doesn't enable isort rules by default, which is why the earlier local lint pass missed these. Reproduced + fixed locally via the exact command CI uses. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
V1 inherited TQWorkerMixin._is_writeback_leader which returns True for every rank when CP=1, letting all TP ranks race on Mooncake upserts and crashing mooncake_cpu with -601 ILLEGAL_CLIENT. V2 already gates on (cp_local_rank, tp_local_rank) == (0, 0); V1 now mirrors the same override so TP>1 DTensor recipes (deepscaler-1.5b-16K/24K, dapo-qwen2.5-7b, gemma3-27b-actckpt-long) stop multi-writing the same prev_logprobs keys. Verified against the original failing recipe (deepscaler-1.5b-16K, TP=2): Step 1/20 + Step 2/20 completed cleanly with no -601 errors after the override was added. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
CI snapshot test test_reference_configs_up_to_date flagged two stale keys in tests/unit/reference_configs/grpo_math_1B.yaml:

  global_segment_size: real=549755813888 (512 GiB), reference=8589934592 (8 GiB)
  local_buffer_size: real=68719476736 (64 GiB), reference=1073741824 (1 GiB)

Bring the snapshot in line with examples/configs/grpo_math_1B.yaml; no behavior change.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
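The byte counts quoted in this commit can be checked against their GiB annotations with a few lines of arithmetic. The dictionary below just restates the four numbers from the message; the variable names are illustrative.

```python
GIB = 1 << 30  # bytes per GiB

# (real, reference) byte values as quoted in the commit message above.
sizes = {
    "global_segment_size": (549755813888, 8589934592),
    "local_buffer_size": (68719476736, 1073741824),
}

in_gib = {key: (real // GIB, ref // GIB) for key, (real, ref) in sizes.items()}
for key, (real_gib, ref_gib) in in_gib.items():
    print(f"{key}: real={real_gib} GiB, reference={ref_gib} GiB")
```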
…ests

Drop 9 source-grep tests, 1 duplicate import-smoke, 1 xfail-strict TODO, and the FP8-calib regression test (tautological under the positive-list calib filter: ``DP_CALIB_INPUT_FIELDS ∩ MESSAGE_LOG_BULK_FIELDS = ∅`` by definition, so the leak the test guarded against is impossible by construction).

Keep only:
* ``test_run_grpo_dispatches_both_trainers``, behavioral: imports and calls ``_select_trainer`` directly; verifies dispatch to grpo_train (data_plane absent) and grpo_train_sync (data_plane.enabled=True).
* ``test_data_plane_client_abc_method_present``, hasattr on the live class (not a source-grep); parametrized over the 8 DataPlaneClient ABC methods that every adapter must implement.

376 → 73 lines. 9 collected (was 18).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…itHub The cu13 variant is now published on PyPI as a separate distribution name (mooncake-transfer-engine-cuda13). Switch from the direct GitHub release URL to a plain PyPI version pin. The wheel is byte-identical (verified sha256: a96794f4d3c693e6e71ad85ef578a429ec69ab36e0c2f9b45b200d37e45d3cc0, 44,756,026 bytes), so this is a pure CDN switch — no behavioral change. Eliminates a recurring github.com fetch-timeout failure mode on compute nodes during NRL_FORCE_REBUILD_VENVS=true. PyPI (Fastly) is far more reliable than github releases under concurrent fetches from a Slurm batch. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
In grpo_train_sync, the driver previously consumed get_logprobs_from_meta / get_reference_policy_logprobs_from_meta via their Ray-returned BatchedDataDict — getting the full (B, S) per-token tensor through Ray's plasma store. That same tensor was also written back to TQ by the worker leader (for train_from_meta to fetch later), so every step paid two transfers for the same (B, S) per-token data. Drop the Ray-side consumption: workers still write to TQ via _write_back_result_field, and the driver now reads prev_logprobs / reference_policy_logprobs from TQ alongside the existing batched read for generation_logprobs / token_mask. One round-trip, one materialization point. Expected effect: shorter Ray scheduler queue + earlier plasma cleanup right before training_prep, which previously inherited the back-pressure of large outstanding plasma references. Targets the +13.5% on policy_and_reference_logprobs and the +67% on training_prep observed in the 32n8g DSV3 perf comparison. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Following d1bfe86c3, the driver-side fix alone was insufficient: even though grpo_train_sync ignored the BatchedDataDict returned from get_logprobs_from_meta / get_reference_policy_logprobs_from_meta, the underlying _logprob_dispatch still ran ray.get() on the worker futures which materialized the full (B, S) per-token tensor through Ray's plasma store before the aggregate_fn ran. Empirically the per-step regression in 11973965 stayed at ~125-128 s, identical to the unpatched DP-warm baseline. This patch eliminates the Ray transfer at the source: workers return None from get_logprobs_presharded / get_reference_policy_logprobs_presharded once the per-token tensor has been committed to TQ via the existing _write_back_result_field leader path. Aggregators handle all-None results by returning None; _logprob_dispatch propagates None up to the caller. grpo_train_sync (already patched in d1bfe86c3) reads the tensor from TQ instead. Wire cost: ~6 MB per step (B=512 × S~1536 × fp32 × 2 fields) and matching plasma references freed sooner — targets the +13.5 % regression on policy_and_reference_logprobs and the +67 % on training_prep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
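The "~6 MB per step" figure follows directly from the shapes quoted in the commit. The sketch below reproduces the arithmetic; ``S`` is approximate in the original ("S~1536"), so the result is an estimate rather than a measured value.

```python
B = 512             # samples per step, from the commit message
S = 1536            # approximate sequence length ("S~1536")
BYTES_PER_FP32 = 4
N_FIELDS = 2        # prev_logprobs and reference_policy_logprobs

total_bytes = B * S * BYTES_PER_FP32 * N_FIELDS
total_mib = total_bytes / (1 << 20)
print(f"{total_bytes} bytes = {total_mib:.1f} MiB per step")
```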
…eturn None

Following 43f55293f the worker entry points get_logprobs_presharded / get_reference_policy_logprobs_presharded always return None; the per-token tensor is committed to TQ via _write_back_result_field. The accompanying _aggregate_logprob_results / _aggregate_reference_logprob_results helpers always saw an all-None list and returned None, so the aggregate_fn dispatch was dead code paying a parameter-and-callback cost.

Drop both helpers. Simplify _logprob_dispatch:
* remove aggregate_fn parameter
* drop the unused unsorted_indices result (there is no result to reorder)
* call get_all_worker_results purely for synchronisation

get_logprobs_from_meta / get_reference_policy_logprobs_from_meta now return None explicitly; their return type is honest at the type-checker level.

Also worker_mixin: drop the explicit ``return None`` (implicit), add ``del result`` after _write_back_result_field so the BatchedDataDict holding the per-token tensor is released before the worker idles waiting for the next dispatch.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…th for writeback-leader

Background: previously every TQ-aware policy worker class had to override _is_writeback_leader() and re-derive its (tp, cp, pp) coords from torch.distributed / device_mesh / parallel_state. The default in TQWorkerMixin was deliberately wrong (always-True at CP=1) so subclasses were forced to override; missing or mismatched overrides produced silent duplicate writes to Mooncake (the -601 ILLEGAL_CLIENT bug). That's distributed state duplicated across 4 implementations to mirror information Ray's dispatcher already has via ``sharding_annotations.get_worker_coords(worker_idx)``.

This patch makes Ray's worker_coords the single source of truth:

1. TQWorkerMixin grows ``set_sharding_coords(coords: dict)``, a setter the worker-group calls once per actor right after construction. Stored on ``self._sharding_coords``.
2. ``RayWorkerGroup._create_workers_from_bundle_indices`` pushes coords into every worker that exposes the setter, immediately after the workers list is populated. Workers without the method are skipped.
3. ``TQWorkerMixin._is_writeback_leader()`` is now a 5-line reader of ``self._sharding_coords`` and matches Ray's own ``output_is_replicated`` semantics: (tp, cp, pp) all coord-0.
4. Subclass overrides in DTensor V1 / DTensor V2 / Megatron are deleted, so there is no more "subclass must override" footgun. The V1-only override that patched the -601 bug (commit ecd8492) is also gone; the base now handles every worker class correctly.

Bug class extinct: it is structurally impossible for a new TQ-aware worker class to forget the leader-rank logic. If sharding_coords are present, the gating works; if not, the default (single-worker, all True) is safe.

Net diff: +43 / -51. No public API change. No per-call kwarg injection.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…e of truth for writeback-leader" This reverts commit d7cde02e7fd222d1cb8ba9df035c9f1ba7a54704. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
… -601 duplicate-write

Root cause of the -601 ILLEGAL_CLIENT crash: _get_replica_group() returned None for CP=1, so _is_replica_leader() was always True for every TP sibling, causing all siblings to write to Mooncake concurrently on the same key.

Changes:
- Add REPLICATED_AXES constant and NamedSharding.is_axis_zero(coords, axes) as the single shared predicate for leader-rank gating (driver-side and worker-side).
- Replace _is_writeback_leader() with _local_coords() abstract method; workers feed their TP/CP/PP local ranks and _is_replica_leader() calls is_axis_zero.
- Drop the CP=1 early-return-None guard in _get_replica_group() on all workers; replica_group.size() > 1 in _fetch() controls the broadcast-vs-independent path.
- Thread is_leader through _broadcast_batched_data_dict() instead of re-deriving it from get_rank() == src inside the helper.
- Add grpo_dp_simple.sh and grpo_dp_mooncake.sh functional tests; wire into L1.
- Add test_writeback_pipeline_e2e.py unit test pinning the non-leader no-write contract.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
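The leader-gating predicate named in this commit can be pictured with a standalone sketch. This is not the ``NamedSharding.is_axis_zero`` implementation: the axis names in ``REPLICATED_AXES``, the dict-based coordinate format, and the default of 0 for a missing axis are assumptions chosen to match the TP/CP/PP wording of the message.

```python
# ASSUMPTION: the replicated axes are tensor, context, and pipeline parallel.
REPLICATED_AXES = ("tp", "cp", "pp")


def is_axis_zero(coords, axes=REPLICATED_AXES):
    """True when the worker sits at local rank 0 on every replicated axis.

    ASSUMPTION: an axis absent from `coords` is treated as size 1 (rank 0).
    """
    return all(coords.get(axis, 0) == 0 for axis in axes)


# A toy 2 (data-parallel) x 2 (tensor-parallel) layout with CP=1 and PP=1.
workers = [
    {"dp": dp, "tp": tp, "cp": 0, "pp": 0}
    for dp in range(2)
    for tp in range(2)
]

# Exactly one writer per data-parallel replica should pass the gate.
leaders = [w for w in workers if is_axis_zero(w)]
print(leaders)
```

In this layout the CP axis has size 1, yet the TP siblings are still filtered out, which is the case the old CP=1 early return missed.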
Drop unused LogprobOutputSpec / ReferenceLogprobOutputSpec imports in tq_policy.py (F401) and collapse a ternary in grpo_sync.py to satisfy ruff format. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Regenerated via .claude/uv_regen.sh (pinned uv 0.11.6, fresh lock from current pyproject.toml). Resolved 443 packages; `uv sync --locked --no-install-project --dry-run` passes. Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
yuki-97
previously approved these changes
May 22, 2026
Contributor
yuki-97
left a comment
thanks @ZhiyuLi-Nvidia for the great effort! LGTM from my side.
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
…romote

`_promote_1d_leaves` and `_from_wire` iterated `td.keys(include_nested=True, leaves_only=True)`, which silently excludes non-tensor leaves (NonTensorData / NonTensorStack). Object fields like `content` and `MESSAGE_LOG_BULK_FIELDS` were dropped from the rebuilt TensorDict on the mooncake_cpu put / get path, surfacing later as a `KeyError`.

Switch to top-level `td.keys()` so non-tensor leaves are preserved. Tighten the post-rebuild assertion to use the same enumeration so it actually detects the silent drop class it was meant to guard against.

Update `test_object_and_tensor_mixed_round_trip_backends` to mirror the e2e GRPO `kv_first_write` flow (tensor-only `DP_TRAIN_FIELDS` registration, production-shape `bulk_batch` with `np.ndarray(dtype=object)` content, mixed read via `read_columns`). Add `test_promote_1d_leaves_object_array_roundtrip` to pin the helper invariant with the production TD shape (1D + 2D tensor + object array).

Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
Signed-off-by: Zhiyu Li <zhiyul@NVIDIA.com>
|
LGTM |
What does this PR do?
Summary
driver only handles per-sample slices and KVBatchMeta.
Details in https://github.com/NVIDIA-NeMo/RL/blob/zhiyul/data_plane_plan/nemo_rl/data_plane/README.md
Scope
(read_columns / write_columns).
rollout 1-hop put → meta-driven logprob/train → kv_clear.
Test
https://wandb.ai/nvidia/nemorl-dataplane-zhiyul?nw=nwuserzhiyul
Usage
# Add a code snippet demonstrating how to use this
Before your PR is "Ready for review"
Pre checks:
Additional Information