Skip to content

Use DataDesigner native resume for retrieval SDG#51

Open
shan-nvidia wants to merge 3 commits into
mainfrom
codex/sthan/retrieval-sdg-native-resume
Open

Use DataDesigner native resume for retrieval SDG#51
shan-nvidia wants to merge 3 commits into
mainfrom
codex/sthan/retrieval-sdg-native-resume

Conversation

@shan-nvidia
Copy link
Copy Markdown
Collaborator

@shan-nvidia shan-nvidia commented May 28, 2026

What

Updates data-designer-retrieval-sdg to use DataDesigner native resumable generation instead of the plugin's manual per-batch restart loop.

The generate command now:

  • Requires data-designer>=0.6.1.
  • Calls DataDesigner.create(...) once across the full seed range.
  • Adds --resume/-r {never,always,if_possible} and passes the selected ResumeMode.
  • Adds stable --dataset-name support.
  • Replaces the old generation batch controls with --buffer-size.
  • Exports one JSONL file named from the resolved DataDesigner dataset name.

The converter keeps backward compatibility for legacy generated_batch*.json outputs while also accepting .jsonl, .json, and .parquet files.

Why

DataDesigner 0.6.x owns interrupted-run checkpointing now. Keeping plugin-level --batch-size, --start-batch-index, and --end-batch-index would leave two competing restart systems and make resume behavior harder to reason about.

This PR makes DataDesigner artifacts the source of truth for resume, while --output-dir remains the downstream exported JSONL location.

Usage

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume if_possible \
  --num-pairs 7

Resume an interrupted run with the same artifact path, dataset name, config, and buffer size:

data-designer-retrieval-sdg generate \
  --input-dir ./my_documents \
  --output-dir ./generated_output \
  --artifact-path ./artifacts \
  --dataset-name my_retrieval_run \
  --buffer-size 200 \
  --resume always

Convert the exported JSONL:

data-designer-retrieval-sdg convert ./generated_output/my_retrieval_run.jsonl \
  --corpus-id my_corpus

How

The CLI builds the QA pipeline once using the full discovered seed range and calls create(..., num_records=total_records, dataset_name=..., resume=...).

dd.RunConfig(disable_early_shutdown=True, buffer_size=...) controls DataDesigner checkpoint granularity. The old manual batch JSON writer was removed from generate, but the converter still recognizes legacy batch JSON for existing outputs.

The embedding-dedup scheduler test was updated for DataDesigner 0.6.1's model scheduling metadata API, replacing the old is_llm_bound assertion with get_scheduling_metadata() coverage.

The workspace now declares pytest explicitly in the root dev dependency group so make test does not depend on a transitive test dependency from another tool.

Validation

Local checks:

  • make sync
  • make lint
  • make test
  • make test-plugin PLUGIN=data-designer-retrieval-sdg (70 passed)
  • make validate
  • make check
  • git diff --check

Real resume smoke against DataDesigner 0.6.1:

  • Input: /Users/sthan/workspace/retriever-sdg-v3/examples/nv_pp_random
  • Command used --num-files 2 --buffer-size 1 --resume never
  • Interrupted after completed parquet-files/batch_00000.parquet and partial tmp-partial-parquet-files/batch_00001.parquet
  • Resumed with --resume always
  • DataDesigner logged: Resuming from batch 2 of 2 (1 records already generated).
  • Final artifacts contained batch_00000.parquet and batch_00001.parquet
  • Final JSONL contained 2 lines
  • Final seed order: build-nvidia/nvidia/corrdiff.txt, then dli/course-v1:DLI+C-FX-01+V3.txt

@shan-nvidia shan-nvidia marked this pull request as ready for review June 2, 2026 20:58
@shan-nvidia shan-nvidia requested review from a team and oliverholworthy as code owners June 2, 2026 20:58
Copy link
Copy Markdown
Contributor

@eric-tramel eric-tramel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inline review comments for the issues from the review-nuke pass.

start_index=0,
end_index=total_records - 1,
**pipeline_kwargs,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs one more layer of validation before handing the name to DataDesigner. The subtle case is not a normal separator like /, which DataDesigner already rejects; it is a path component such as ... In DataDesigner 0.6.1, artifact_path / ".." can resolve to the parent directory when --resume always or --resume if_possible sees that parent already exists, so a run name can escape the intended artifact root and write metadata/parquet artifacts outside it. A small helper that rejects ., .., separators/control chars, and verifies (artifact_path / name).resolve() stays under artifact_path.resolve() would make the CLI boundary explicit.



def _load_parquet_records(input_file: Path) -> list[dict]:
"""Load records from a parquet file exported by DataDesigner."""
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loader introduces a pandas/pyarrow shape that JSON inputs never hit. When a row has multiple nested chunks, this to_dict(orient="records") path returns chunks as a NumPy array. NumPy arrays with more than one element intentionally do not have a truth value, so the later if not chunks check raises ValueError before normalization gets a chance to call tolist(). Either normalize array-like nested values here, or move that normalization before the downstream emptiness check and use an explicit length check such as len(chunks) == 0.

input_basename = args.input_dir.name
dataset_name = f"{input_basename}_batch{batch_idx}_{start_idx}_{end_idx}"
result = data_designer.create(config_builder, num_records=num_in_batch, dataset_name=dataset_name)
generated_df = result.load_dataset()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This changes the memory shape of the command in a way --buffer-size does not fully control. DataDesigner 0.6.1 writes checkpoint batches during generation, but create() still loads the completed dataset back into a full DataFrame for profiling before it returns. That means the streaming result.export() helps export memory, but peak memory has already grown to the full generated dataset. The old loop bounded that peak by batch_size. For large retrieval SDG jobs, either keep a chunked path for large runs or wait for/use a DataDesigner create mode that can skip/profile-stream without materializing everything.

Comment thread pyproject.toml
[dependency-groups]
dev = [
"dumb-pypi>=1.15.0",
"pytest>=8.0.0",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PRs behavioral scope is retrieval SDG native resume plus the direct DataDesigner bump, but this block also removes two workspace-wide resolver constraints. Even if the current lock still resolves python-multipart==0.0.29 and urllib3==2.7.0, removing the constraints changes future resolver policy for every plugin in the workspace, which makes the PR harder to reason about and harder to roll back. Please restore these constraints here and in uv.lock, or split the constraint-policy cleanup into its own PR with that motivation called out.

self.model_providers = model_providers
self.run_config = None
self.create_calls: list[dict[str, object]] = []
self.result = FakeCreateResult(FakeArtifactStorage(artifact_path / "my_run", "my_run"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fake currently uses the same requested dataset name and resolved dataset name (my_run). Because those are identical, the test would still pass if the production code accidentally exported args.dataset_name instead of result.artifact_storage.resolved_dataset_name. The important behavior here is exactly that DataDesigner may rename the dataset when avoiding collisions, so make the fake resolved name different, for example my_run_resolved, and assert the JSONL export uses that resolved value.

monkeypatch.setattr(cli, "build_model_providers", fake_build_model_providers)
monkeypatch.setattr(cli, "build_qa_generation_pipeline", fake_build_qa_generation_pipeline)

cli._run_generate(_generate_args(tmp_path))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test bypasses the public CLI parser by constructing a large private argparse.Namespace. That makes the test less useful for the new public flags, because a typo or default regression in --dataset-name, --buffer-size, or --resume would not be exercised here. Prefer driving this through cli.main() with sys.argv and only the flags relevant to the behavior under test; keep the DataDesigner/model fakes, but let argparse prove the public command line maps to the expected RunConfig, dataset name, and ResumeMode.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants