Support --num-workers for dataset parallel loading #2048

Open
demouo wants to merge 3 commits into THUDM:main from demouo:support_num_workers

demouo commented Jun 10, 2026

Contributor

Summary

  1. Motivation: see issue #2037, "[Question] How to load 30+ images faster for vlm?".
  2. Reuse the --num-workers option already defined for Megatron training.
  3. Add clearer logging for dataset loading, which reports the dataset length, whether the data is multimodal, and the time taken.
  4. The logic for building each Sample is unchanged.

Modified

  • slime/rollout/data_source.py: Pass num_workers to Dataset.
  • slime/utils/data.py: Add parallel loading logic.
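
The parallel loading described above could look roughly like the sketch below. It is a minimal illustration and not the code from this PR: `build_sample` and `load_rows` are hypothetical names, and `build_sample` stands in for whatever per-row work (for example reading and decoding images) dominates the load time. `ThreadPoolExecutor.map` returns results in input order, so the output order matches the serial loop.

```python
from concurrent.futures import ThreadPoolExecutor


def build_sample(row):
    # Hypothetical stand-in for the per-row work (e.g. reading and decoding images).
    return {"prompt": row["prompt"], "n_images": len(row.get("images", []))}


def load_rows(rows, num_workers=1):
    # num_workers <= 1 keeps the serial for-loop behaviour.
    if num_workers <= 1:
        return [build_sample(row) for row in rows]
    # executor.map yields results in the same order as `rows`.
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        return list(executor.map(build_sample, rows))


rows = [{"prompt": f"q{i}", "images": ["a.png"] * i} for i in range(5)]
samples = load_rows(rows, num_workers=4)
```

Threads tend to help most when the per-row work is I/O-bound or releases the GIL; whether that applies depends on the dataset.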

Results

For big datasets:
The old for-loop method is equivalent to num_workers=1 and loads at about 1 row per second:

(RolloutManager pid=1111890) [2026-06-09 18:03:14] data.py:220 - Read 100 rows from xxxx/train.parquet
(RolloutManager pid=1111890) [2026-06-09 18:05:05] data.py:300 - Dataset loaded: 100 samples, multimodal=True, 110.2s total (0.91 rows/s)

The new method uses ThreadPoolExecutor with num_workers=64 and loads at about 28 rows per second, roughly 31x faster (110.2s vs 3.5s):

(RolloutManager pid=1079380) [2026-06-09 18:00:19] data.py:220 - Read 100 rows from xxxx/train.parquet
(RolloutManager pid=1079380) [2026-06-09 18:00:19] data.py:270 - Loading dataset with 64 workers (100 rows) ...
(RolloutManager pid=1079380) [2026-06-09 18:00:22] data.py:285 -   loading: 100/100 rows done, 3.5s elapsed, 28.85 rows/s
(RolloutManager pid=1079380) [2026-06-09 18:00:22] data.py:300 - Dataset loaded: 100 samples, multimodal=True, 3.5s total (28.81 rows/s)
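
As a quick check of the figures in the two logs, the throughput and speedup can be recomputed from the row count and the reported totals. This is only arithmetic on the logged values; `rows_per_second` is a hypothetical helper name. The logged elapsed times are rounded to one decimal, so the recomputed rates can differ slightly from the logged ones.

```python
def rows_per_second(rows, seconds):
    # Throughput as reported in the "Dataset loaded" log line.
    return rows / seconds


# Values taken from the two logs above (100 rows each).
serial = rows_per_second(100, 110.2)
parallel = rows_per_second(100, 3.5)
speedup = 110.2 / 3.5

print(f"serial:   {serial:.2f} rows/s")
print(f"parallel: {parallel:.2f} rows/s")
print(f"speedup:  {speedup:.1f}x")
```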

Other

Loading progress is logged every 1000 rows and once more when the final row completes:

if done % 1000 == 0 or done == len(rows):
    elapsed = time.time() - t0
    logger.info(
        "  loading: %d/%d rows done, %.1fs elapsed, %.2f rows/s",
        done,
        len(rows),
        elapsed,
        done / elapsed,
    )
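
For context, the progress check above would sit inside a loop that counts completed rows. The sketch below shows one way that loop might look with `as_completed`; it is an assumption about the surrounding structure, not the PR's actual code. `load_with_progress`, `build_sample`, and `log_interval` are hypothetical names, and the small floor on `elapsed` is an added guard against dividing by zero.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

logger = logging.getLogger(__name__)


def load_with_progress(rows, build_sample, num_workers=4, log_interval=1000):
    results = [None] * len(rows)
    t0 = time.time()
    done = 0
    with ThreadPoolExecutor(max_workers=num_workers) as executor:
        # Map each future back to its row index so output order matches input order.
        futures = {executor.submit(build_sample, row): i for i, row in enumerate(rows)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
            done += 1
            # Log at every interval and once more when the last row finishes.
            if done % log_interval == 0 or done == len(rows):
                elapsed = max(time.time() - t0, 1e-9)  # avoid dividing by zero
                logger.info(
                    "  loading: %d/%d rows done, %.1fs elapsed, %.2f rows/s",
                    done,
                    len(rows),
                    elapsed,
                    done / elapsed,
                )
    return results


squares = load_with_progress(list(range(10)), lambda x: x * x, log_interval=5)
```

Futures complete in arbitrary order, so the results are written back by index rather than appended.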

Usage

--num-workers 64
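
To illustrate how such a flag value might be parsed and sanitized before reaching the dataset, here is a small standalone sketch. The PR reuses the option that Megatron already defines, so this parser exists only for illustration; `parse_num_workers` and the default of 1 are assumptions.

```python
import argparse


def parse_num_workers(argv):
    # Hypothetical standalone parser; the PR relies on Megatron's own argument.
    parser = argparse.ArgumentParser()
    parser.add_argument("--num-workers", type=int, default=1)
    args = parser.parse_args(argv)
    # Clamp to at least one worker so a value of 0 falls back to serial loading.
    return max(1, args.num_workers)


num_workers = parse_num_workers(["--num-workers", "64"])
```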

demouo commented Jun 15, 2026

Contributor Author

More details: with --num-workers 128, loading 3000 rows takes 131.9s (22.74 rows/s):

(RolloutManager pid=2713487) [2026-06-15 15:17:01] data.py:223 - Read 3000 rows from xxx/train.parquet
(RolloutManager pid=2713487) [2026-06-15 15:17:01] data.py:273 - Loading dataset with 128 workers (3000 rows)...
(RolloutManager pid=2713487) [2026-06-15 15:17:37] data.py:288 -   loading: 1000/3000 rows done, 36.0s elapsed, 27.79 rows/s
(RolloutManager pid=2713487) [2026-06-15 15:18:22] data.py:288 -   loading: 2000/3000 rows done, 81.0s elapsed, 24.68 rows/s
(RolloutManager pid=2713487) [2026-06-15 15:19:13] data.py:288 -   loading: 3000/3000 rows done, 131.9s elapsed, 22.74 rows/s
(RolloutManager pid=2713487) [2026-06-15 15:19:13] data.py:303 - Dataset loaded: 3000 samples, multimodal=True, 131.9s total (22.74 rows/s)
