Skip to content

fix: fix preserving dataset merge#2515

Merged
terrykong merged 2 commits into
mainfrom
yukih/pr-2218
May 28, 2026
Merged

fix: fix preserving dataset merge#2515
terrykong merged 2 commits into
mainfrom
yukih/pr-2218

Conversation

@yuki-97
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 commented May 17, 2026

Continues #2218, thanks for the contribution @Bungmint.

Issue

Closes #2116.

Summary

  • Replace concatenate_datasets with a new merge_datasets helper throughout the SFT data pipeline (run_sft.py, nemo_rl/data/utils.py).
  • merge_datasets falls back to torch.utils.data.ConcatDataset when any dataset is not a HF Dataset (e.g. PreservingDataset), avoiding the schema validation crash from concatenate_datasets.

Test plan

  • tests/unit/data/datasets/test_preserving_dataset.py passes (new tests cover HF-only, PreservingDataset-only, and mixed merge cases).

@yuki-97 yuki-97 requested review from a team as code owners May 17, 2026 13:24
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yuki-97
Copy link
Copy Markdown
Contributor Author

yuki-97 commented May 17, 2026

/ok to test 8ad3cc6

@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 17, 2026
@yuki-97
Copy link
Copy Markdown
Contributor Author

yuki-97 commented May 17, 2026

/ok to test 8ad3cc6

Bungmint and others added 2 commits May 27, 2026 06:07
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
@yuki-97
Copy link
Copy Markdown
Contributor Author

yuki-97 commented May 27, 2026

/ok to test 4d376b7

@terrykong terrykong merged commit 5494d14 into main May 28, 2026
37 checks passed
@terrykong terrykong deleted the yukih/pr-2218 branch May 28, 2026 06:55
vigneshwaran pushed a commit to vigneshwaran/RL that referenced this pull request May 28, 2026
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Co-authored-by: Youngmin Park <57158717+Bungmint@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Training dataload fails when using preserving dataset

3 participants