
fix: disable tokenizers parallelism to prevent dataset map deadlock #481

Open

Bias92 wants to merge 2 commits into sgl-project:main from Bias92:fix/422-tokenizers-parallelism-deadlock

Conversation

@Bias92

@Bias92 Bias92 commented Mar 2, 2026

Summary

Root Cause

HuggingFace's fast tokenizers use internal Rust-based parallelism. When datasets.map(num_proc=N) forks the Python process, the Rust threads in each forked worker deadlock against each other. This is a well-known issue documented by HuggingFace.

The deadlock becomes more likely with longer sequences (max_length > 32768) because each tokenization call takes longer, increasing the window for the race condition. This explains why:

  • Small max_length values work fine (tokenization completes before the deadlock can occur)
  • The sglang backend is unaffected (different tokenization path)
  • Setting num_proc=0 or 1 works (no forking → no deadlock)
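The fork-vs-threads hazard behind these symptoms can be sketched in pure Python, using a plain lock as a stand-in for the tokenizers' internal Rust mutex. This is an illustrative reproduction of the deadlock shape, not the library's actual internals; it is POSIX-only (os.fork) and the timings are arbitrary:

```python
import os
import threading
import time

lock = threading.Lock()

def worker():
    with lock:
        time.sleep(2.0)  # hold the lock across the fork, like a busy tokenizer thread

t = threading.Thread(target=worker)
t.start()
time.sleep(0.2)  # make sure the thread owns the lock when we fork

pid = os.fork()
if pid == 0:
    # Child process: the owning thread was not copied by fork, so nothing
    # in this process will ever release the lock. With no timeout this
    # would hang forever -- the deadlock shape; we time out to keep the
    # demo finite and signal the failure via the exit code.
    acquired = lock.acquire(timeout=0.5)
    os._exit(0 if acquired else 42)

_, status = os.waitpid(pid, 0)
child_code = os.waitstatus_to_exitcode(status)
print("child exit code:", child_code)  # 42: the lock could never be acquired
t.join()
```

Longer tokenization calls widen the window in which a fork lands while the mutex is held, which is why large max_length values make the hang far more likely.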

Fix

Set TOKENIZERS_PARALLELISM=false before calling dataset.map() when num_proc > 1. This disables the Rust-level thread pool inside each forked worker, which is the fix recommended by HuggingFace. The per-worker loss of parallelism is compensated by the Python-level multiprocessing across workers.
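A minimal sketch of the guard. The helper name and the map call are illustrative, not the PR's actual code (the real change is applied inline in specforge/data/preprocessing.py):

```python
import os

def disable_tokenizer_thread_pool(num_proc: int) -> None:
    """Hypothetical helper: turn off the HF tokenizers' Rust-level thread
    pool before datasets.map() forks workers. With num_proc <= 1 nothing
    is set, so single-process behavior is unchanged (matching the test
    plan: the env var is only touched when forking actually happens).
    """
    if num_proc > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Usage sketch (the map call itself is illustrative):
# disable_tokenizer_thread_pool(num_proc)
# dataset = dataset.map(tokenize_fn, num_proc=num_proc)
```

Setting the variable before the map call matters: each forked worker inherits the parent's environment, so the Rust thread pool stays disabled in every child.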

Test plan

  • Run with --build-dataset-num-proc 8 --max-length 32769 --target-model-backend hf — should no longer deadlock
  • Run with --build-dataset-num-proc 1 — behavior unchanged (env var not set)
  • Verify dataset preprocessing output is identical with and without the fix

…gl-project#422)

HuggingFace tokenizers use internal Rust parallelism that deadlocks
when combined with Python multiprocessing (datasets.map with num_proc>1).
This is especially likely with long sequences (max_length>32768) where
tokenization takes longer. Setting TOKENIZERS_PARALLELISM=false before
the map call prevents the deadlock.

Also addresses sgl-project#349.

@jiapingW
Collaborator

jiapingW commented Mar 5, 2026

LGTM. Can you update the PR with the results of your test plan?

@Bias92
Author

Bias92 commented Mar 5, 2026

Thank you for the review! Due to academic commitments, I ran the test plan on a MacBook M4 Pro instead of my RTX 4060 Ti desktop. This is fine because the deadlock is a CPU-side forking issue and does not require a GPU.

Test Plan Results

Test 1: num_proc=8, max_length=32769 — no deadlock

[Fix applied] TOKENIZERS_PARALLELISM=false
Running datasets.map() with num_proc=8, max_length=32769...
Map (num_proc=8): 100%|██████████████████| 50/50 [00:00<00:00, 82.81 examples/s]
SUCCESS: Processed 50 samples without deadlock

Test 2: num_proc=1 — behavior unchanged (env var not set)

TOKENIZERS_PARALLELISM=NOT SET (expected: NOT SET)
Running datasets.map() with num_proc=1...
Map (num_proc=1): 100%|██████████████████| 50/50 [00:02<00:00, 22.13 examples/s]
SUCCESS: Processed 50 samples

Test 3: Output identical with and without fix

Map (num_proc=1): 100%|██████████████████| 50/50 [00:02<00:00, 22.50 examples/s]
Map (num_proc=8): 100%|██████████████████| 50/50 [00:00<00:00, 88.66 examples/s]
SUCCESS: All 50 samples produce identical output with and without fix

@Bias92
Author

Bias92 commented Mar 5, 2026

The lint failure is in scripts/train_dflash.py (isort formatting), which is unrelated to my change in specforge/data/preprocessing.py; it appears to be a pre-existing issue on main.

@jiapingW
Collaborator

jiapingW commented Mar 5, 2026

Please fix lint.

@Bias92
Author

Bias92 commented Mar 5, 2026

Please fix lint.

Got it. I will do that in 5 hours.

@Bias92 Bias92 requested a review from FlamingoPg as a code owner March 5, 2026 13:27
@Bias92
Author

Bias92 commented Mar 5, 2026

Fixed the lint issue in 6d09589. Waiting for workflow approval to confirm CI passes.



Development

Successfully merging this pull request may close these issues.

[Bug] BUILD_DATASET_NUM_PROC>1 and the data length is greater than 32768 will block mapping.
