Use DataDesigner native resume for retrieval SDG by shan-nvidia · Pull Request #51 · NVIDIA-NeMo/DataDesignerPlugins

shan-nvidia · 2026-05-28T13:32:29Z

What

Updates data-designer-retrieval-sdg to use DataDesigner native resumable generation instead of the plugin's manual per-batch restart loop.

The generate command now:

Requires data-designer>=0.6.1.
Calls DataDesigner.create(...) once across the full seed range.
Adds --resume/-r {never,always,if_possible} and passes the selected ResumeMode.
Adds stable --dataset-name support.
Replaces the old generation batch controls with --buffer-size.
Exports one JSONL file named from the resolved DataDesigner dataset name.

The converter keeps backward compatibility for legacy generated_batch*.json outputs while also accepting .jsonl, .json, and .parquet files.

Why

DataDesigner 0.6.x owns interrupted-run checkpointing now. Keeping plugin-level --batch-size, --start-batch-index, and --end-batch-index would leave two competing restart systems and make resume behavior harder to reason about.

This PR makes DataDesigner artifacts the source of truth for resume, while --output-dir remains the downstream exported JSONL location.

Usage

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume if_possible \
  --num-pairs 7

Resume an interrupted run with the same artifact path, dataset name, config, and buffer size:

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume always

Convert the exported JSONL:

data-designer-retrieval-sdg convert ./generated_output/my_retrieval_run.jsonl \
  --corpus-id my_corpus

How

The CLI builds the QA pipeline once using the full discovered seed range and calls create(..., num_records=total_records, dataset_name=..., resume=...).

dd.RunConfig(disable_early_shutdown=True, buffer_size=...) controls DataDesigner checkpoint granularity. The old manual batch JSON writer was removed from generate, but the converter still recognizes legacy batch JSON for existing outputs.

The embedding-dedup scheduler test was updated for DataDesigner 0.6.1's model scheduling metadata API, replacing the old is_llm_bound assertion with get_scheduling_metadata() coverage.

The workspace now declares pytest explicitly in the root dev dependency group so make test does not depend on a transitive test dependency from another tool.

Validation

Local checks:

make sync
make lint
make test
make test-plugin PLUGIN=data-designer-retrieval-sdg (70 passed)
make validate
make check
git diff --check

Real resume smoke against DataDesigner 0.6.1:

Input: /Users/sthan/workspace/retriever-sdg-v3/examples/nv_pp_random
Command used --num-files 2 --buffer-size 1 --resume never
Interrupted after completed parquet-files/batch_00000.parquet and partial tmp-partial-parquet-files/batch_00001.parquet
Resumed with --resume always
DataDesigner logged: Resuming from batch 2 of 2 (1 records already generated).
Final artifacts contained batch_00000.parquet and batch_00001.parquet
Final JSONL contained 2 lines
Final seed order: build-nvidia/nvidia/corrdiff.txt, then dli/course-v1:DLI+C-FX-01+V3.txt

eric-tramel

Inline review comments for the issues from the review-nuke pass.

eric-tramel · 2026-06-02T21:49:27Z

+        start_index=0,
+        end_index=total_records - 1,
+        **pipeline_kwargs,
+    )


This needs one more layer of validation before handing the name to DataDesigner. The subtle case is not a normal separator like /, which DataDesigner already rejects; it is a path component such as ... In DataDesigner 0.6.1, artifact_path / ".." can resolve to the parent directory when --resume always or --resume if_possible sees that parent already exists, so a run name can escape the intended artifact root and write metadata/parquet artifacts outside it. A small helper that rejects ., .., separators/control chars, and verifies (artifact_path / name).resolve() stays under artifact_path.resolve() would make the CLI boundary explicit.

eric-tramel · 2026-06-02T21:49:27Z

+
+
+def _load_parquet_records(input_file: Path) -> list[dict]:
+    """Load records from a parquet file exported by DataDesigner."""


This loader introduces a pandas/pyarrow shape that JSON inputs never hit. When a row has multiple nested chunks, this to_dict(orient="records") path returns chunks as a NumPy array. NumPy arrays with more than one element intentionally do not have a truth value, so the later if not chunks check raises ValueError before normalization gets a chance to call tolist(). Either normalize array-like nested values here, or move that normalization before the downstream emptiness check and use an explicit length check such as len(chunks) == 0.

eric-tramel · 2026-06-02T21:49:27Z

-        input_basename = args.input_dir.name
-        dataset_name = f"{input_basename}_batch{batch_idx}_{start_idx}_{end_idx}"
-        result = data_designer.create(config_builder, num_records=num_in_batch, dataset_name=dataset_name)
-        generated_df = result.load_dataset()


This changes the memory shape of the command in a way --buffer-size does not fully control. DataDesigner 0.6.1 writes checkpoint batches during generation, but create() still loads the completed dataset back into a full DataFrame for profiling before it returns. That means the streaming result.export() helps export memory, but peak memory has already grown to the full generated dataset. The old loop bounded that peak by batch_size. For large retrieval SDG jobs, either keep a chunked path for large runs or wait for/use a DataDesigner create mode that can skip/profile-stream without materializing everything.

eric-tramel · 2026-06-02T21:49:27Z

 [dependency-groups]
 dev = [
    "dumb-pypi>=1.15.0",
+    "pytest>=8.0.0",


This PRs behavioral scope is retrieval SDG native resume plus the direct DataDesigner bump, but this block also removes two workspace-wide resolver constraints. Even if the current lock still resolves python-multipart==0.0.29 and urllib3==2.7.0, removing the constraints changes future resolver policy for every plugin in the workspace, which makes the PR harder to reason about and harder to roll back. Please restore these constraints here and in uv.lock, or split the constraint-policy cleanup into its own PR with that motivation called out.

eric-tramel · 2026-06-02T21:49:27Z

+        self.model_providers = model_providers
+        self.run_config = None
+        self.create_calls: list[dict[str, object]] = []
+        self.result = FakeCreateResult(FakeArtifactStorage(artifact_path / "my_run", "my_run"))


This fake currently uses the same requested dataset name and resolved dataset name (my_run). Because those are identical, the test would still pass if the production code accidentally exported args.dataset_name instead of result.artifact_storage.resolved_dataset_name. The important behavior here is exactly that DataDesigner may rename the dataset when avoiding collisions, so make the fake resolved name different, for example my_run_resolved, and assert the JSONL export uses that resolved value.

eric-tramel · 2026-06-02T21:49:27Z

+    monkeypatch.setattr(cli, "build_model_providers", fake_build_model_providers)
+    monkeypatch.setattr(cli, "build_qa_generation_pipeline", fake_build_qa_generation_pipeline)
+
+    cli._run_generate(_generate_args(tmp_path))


This test bypasses the public CLI parser by constructing a large private argparse.Namespace. That makes the test less useful for the new public flags, because a typo or default regression in --dataset-name, --buffer-size, or --resume would not be exercised here. Prefer driving this through cli.main() with sys.argv and only the flags relevant to the behavior under test; keep the DataDesigner/model fakes, but let argparse prove the public command line maps to the expected RunConfig, dataset name, and ResumeMode.

shan-nvidia added 3 commits May 19, 2026 16:48

Use DataDesigner native resume for retrieval SDG

211474c

Update retrieval SDG to DataDesigner 0.6.1

f52a1cb

Declare pytest workspace test dependency

007bc97

shan-nvidia marked this pull request as ready for review June 2, 2026 20:58

shan-nvidia requested review from a team and oliverholworthy as code owners June 2, 2026 20:58

eric-tramel reviewed Jun 2, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use DataDesigner native resume for retrieval SDG#51

Use DataDesigner native resume for retrieval SDG#51
shan-nvidia wants to merge 3 commits into
mainfrom
codex/sthan/retrieval-sdg-native-resume

shan-nvidia commented May 28, 2026 •

edited

Loading

Uh oh!

eric-tramel left a comment

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

eric-tramel Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants



		def _load_parquet_records(input_file: Path) -> list[dict]:
		"""Load records from a parquet file exported by DataDesigner."""

Conversation

shan-nvidia commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Why

Usage

How

Validation

Uh oh!

eric-tramel left a comment

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

eric-tramel Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

shan-nvidia commented May 28, 2026 •

edited

Loading