Use DataDesigner native resume for retrieval SDG#51
Conversation
eric-tramel
left a comment
There was a problem hiding this comment.
Inline review comments for the issues from the review-nuke pass.
| start_index=0, | ||
| end_index=total_records - 1, | ||
| **pipeline_kwargs, | ||
| ) |
There was a problem hiding this comment.
This needs one more layer of validation before handing the name to DataDesigner. The subtle case is not a normal separator like /, which DataDesigner already rejects; it is a path component such as ... In DataDesigner 0.6.1, artifact_path / ".." can resolve to the parent directory when --resume always or --resume if_possible sees that parent already exists, so a run name can escape the intended artifact root and write metadata/parquet artifacts outside it. A small helper that rejects ., .., separators/control chars, and verifies (artifact_path / name).resolve() stays under artifact_path.resolve() would make the CLI boundary explicit.
|
|
||
|
|
||
| def _load_parquet_records(input_file: Path) -> list[dict]: | ||
| """Load records from a parquet file exported by DataDesigner.""" |
There was a problem hiding this comment.
This loader introduces a pandas/pyarrow shape that JSON inputs never hit. When a row has multiple nested chunks, this to_dict(orient="records") path returns chunks as a NumPy array. NumPy arrays with more than one element intentionally do not have a truth value, so the later if not chunks check raises ValueError before normalization gets a chance to call tolist(). Either normalize array-like nested values here, or move that normalization before the downstream emptiness check and use an explicit length check such as len(chunks) == 0.
| input_basename = args.input_dir.name | ||
| dataset_name = f"{input_basename}_batch{batch_idx}_{start_idx}_{end_idx}" | ||
| result = data_designer.create(config_builder, num_records=num_in_batch, dataset_name=dataset_name) | ||
| generated_df = result.load_dataset() |
There was a problem hiding this comment.
This changes the memory shape of the command in a way --buffer-size does not fully control. DataDesigner 0.6.1 writes checkpoint batches during generation, but create() still loads the completed dataset back into a full DataFrame for profiling before it returns. That means the streaming result.export() helps export memory, but peak memory has already grown to the full generated dataset. The old loop bounded that peak by batch_size. For large retrieval SDG jobs, either keep a chunked path for large runs or wait for/use a DataDesigner create mode that can skip/profile-stream without materializing everything.
| [dependency-groups] | ||
| dev = [ | ||
| "dumb-pypi>=1.15.0", | ||
| "pytest>=8.0.0", |
There was a problem hiding this comment.
This PRs behavioral scope is retrieval SDG native resume plus the direct DataDesigner bump, but this block also removes two workspace-wide resolver constraints. Even if the current lock still resolves python-multipart==0.0.29 and urllib3==2.7.0, removing the constraints changes future resolver policy for every plugin in the workspace, which makes the PR harder to reason about and harder to roll back. Please restore these constraints here and in uv.lock, or split the constraint-policy cleanup into its own PR with that motivation called out.
| self.model_providers = model_providers | ||
| self.run_config = None | ||
| self.create_calls: list[dict[str, object]] = [] | ||
| self.result = FakeCreateResult(FakeArtifactStorage(artifact_path / "my_run", "my_run")) |
There was a problem hiding this comment.
This fake currently uses the same requested dataset name and resolved dataset name (my_run). Because those are identical, the test would still pass if the production code accidentally exported args.dataset_name instead of result.artifact_storage.resolved_dataset_name. The important behavior here is exactly that DataDesigner may rename the dataset when avoiding collisions, so make the fake resolved name different, for example my_run_resolved, and assert the JSONL export uses that resolved value.
| monkeypatch.setattr(cli, "build_model_providers", fake_build_model_providers) | ||
| monkeypatch.setattr(cli, "build_qa_generation_pipeline", fake_build_qa_generation_pipeline) | ||
|
|
||
| cli._run_generate(_generate_args(tmp_path)) |
There was a problem hiding this comment.
This test bypasses the public CLI parser by constructing a large private argparse.Namespace. That makes the test less useful for the new public flags, because a typo or default regression in --dataset-name, --buffer-size, or --resume would not be exercised here. Prefer driving this through cli.main() with sys.argv and only the flags relevant to the behavior under test; keep the DataDesigner/model fakes, but let argparse prove the public command line maps to the expected RunConfig, dataset name, and ResumeMode.
What
Updates
data-designer-retrieval-sdgto use DataDesigner native resumable generation instead of the plugin's manual per-batch restart loop.The
generatecommand now:data-designer>=0.6.1.DataDesigner.create(...)once across the full seed range.--resume/-r {never,always,if_possible}and passes the selectedResumeMode.--dataset-namesupport.--buffer-size.The converter keeps backward compatibility for legacy
generated_batch*.jsonoutputs while also accepting.jsonl,.json, and.parquetfiles.Why
DataDesigner 0.6.x owns interrupted-run checkpointing now. Keeping plugin-level
--batch-size,--start-batch-index, and--end-batch-indexwould leave two competing restart systems and make resume behavior harder to reason about.This PR makes DataDesigner artifacts the source of truth for resume, while
--output-dirremains the downstream exported JSONL location.Usage
Resume an interrupted run with the same artifact path, dataset name, config, and buffer size:
Convert the exported JSONL:
How
The CLI builds the QA pipeline once using the full discovered seed range and calls
create(..., num_records=total_records, dataset_name=..., resume=...).dd.RunConfig(disable_early_shutdown=True, buffer_size=...)controls DataDesigner checkpoint granularity. The old manual batch JSON writer was removed fromgenerate, but the converter still recognizes legacy batch JSON for existing outputs.The embedding-dedup scheduler test was updated for DataDesigner 0.6.1's model scheduling metadata API, replacing the old
is_llm_boundassertion withget_scheduling_metadata()coverage.The workspace now declares
pytestexplicitly in the root dev dependency group somake testdoes not depend on a transitive test dependency from another tool.Validation
Local checks:
make syncmake lintmake testmake test-plugin PLUGIN=data-designer-retrieval-sdg(70 passed)make validatemake checkgit diff --checkReal resume smoke against DataDesigner 0.6.1:
/Users/sthan/workspace/retriever-sdg-v3/examples/nv_pp_random--num-files 2 --buffer-size 1 --resume neverparquet-files/batch_00000.parquetand partialtmp-partial-parquet-files/batch_00001.parquet--resume alwaysResuming from batch 2 of 2 (1 records already generated).batch_00000.parquetandbatch_00001.parquetbuild-nvidia/nvidia/corrdiff.txt, thendli/course-v1:DLI+C-FX-01+V3.txt