
fix: disable tokenizers parallelism to prevent dataset map deadlock #481

Open

Bias92 wants to merge 2 commits into sgl-project:main from Bias92:fix/422-tokenizers-parallelism-deadlock

Conversation

@Bias92

@Bias92 Bias92 commented Mar 2, 2026

Summary

Root Cause

HuggingFace's fast tokenizers use internal Rust-based parallelism. When datasets.map(num_proc=N) forks the Python process, the Rust threads in each forked worker deadlock against each other. This is a well-known issue documented by HuggingFace.

The deadlock becomes more likely with longer sequences (max_length > 32768) because each tokenization call takes longer, increasing the window for the race condition. This explains why:

  • Small max_length values work fine (tokenization completes before the deadlock can occur)
  • The sglang backend is unaffected (different tokenization path)
  • Setting num_proc=0 or 1 works (no forking → no deadlock)
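The fork-vs-threads hazard behind these symptoms can be sketched in pure Python, using a plain lock as a stand-in for the tokenizers' internal Rust mutex. This is an illustrative reproduction of the deadlock shape, not the library's actual internals; it is POSIX-only (os.fork) and the timings are arbitrary:

```python
import os
import threading
import time

lock = threading.Lock()

def worker():
    with lock:
        time.sleep(2.0)  # hold the lock across the fork, like a busy tokenizer thread

t = threading.Thread(target=worker)
t.start()
time.sleep(0.2)  # make sure the thread owns the lock when we fork

pid = os.fork()
if pid == 0:
    # Child process: the owning thread was not copied by fork, so nothing
    # in this process will ever release the lock. With no timeout this
    # would hang forever -- the deadlock shape; we time out to keep the
    # demo finite and signal the failure via the exit code.
    acquired = lock.acquire(timeout=0.5)
    os._exit(0 if acquired else 42)

_, status = os.waitpid(pid, 0)
child_code = os.waitstatus_to_exitcode(status)
print("child exit code:", child_code)  # 42: the lock could never be acquired
t.join()
```

Longer tokenization calls widen the window in which a fork lands while the mutex is held, which is why large max_length values make the hang far more likely.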

Fix

Set TOKENIZERS_PARALLELISM=false before calling dataset.map() when num_proc > 1. This disables the Rust-level thread pool inside each forked worker, which is the fix recommended by HuggingFace. The per-worker loss of parallelism is compensated by the Python-level multiprocessing across workers.
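A minimal sketch of the guard. The helper name and the map call are illustrative, not the PR's actual code (the real change is applied inline in specforge/data/preprocessing.py):

```python
import os

def disable_tokenizer_thread_pool(num_proc: int) -> None:
    """Hypothetical helper: turn off the HF tokenizers' Rust-level thread
    pool before datasets.map() forks workers. With num_proc <= 1 nothing
    is set, so single-process behavior is unchanged (matching the test
    plan: the env var is only touched when forking actually happens).
    """
    if num_proc > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Usage sketch (the map call itself is illustrative):
# disable_tokenizer_thread_pool(num_proc)
# dataset = dataset.map(tokenize_fn, num_proc=num_proc)
```

Setting the variable before the map call matters: each forked worker inherits the parent's environment, so the Rust thread pool stays disabled in every child.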

Test plan

  • Run with --build-dataset-num-proc 8 --max-length 32769 --target-model-backend hf — should no longer deadlock
  • Run with --build-dataset-num-proc 1 — behavior unchanged (env var not set)
  • Verify dataset preprocessing output is identical with and without the fix

…gl-project#422)

HuggingFace tokenizers use internal Rust parallelism that deadlocks
when combined with Python multiprocessing (datasets.map with num_proc>1).
This is especially likely with long sequences (max_length>32768) where
tokenization takes longer. Setting TOKENIZERS_PARALLELISM=false before
the map call prevents the deadlock.

Also addresses sgl-project#349.

@jiapingW
Collaborator

jiapingW commented Mar 5, 2026

LGTM. Can you update the PR with the results of your test plan?

@Bias92
Author

Bias92 commented Mar 5, 2026

Thank you for the review! Due to academic commitments, I ran the test plan on a MacBook M4 Pro instead of my RTX 4060 Ti desktop. This is fine because the deadlock is a CPU-side forking issue and does not require a GPU.

Test Plan Results

Test 1: num_proc=8, max_length=32769 — no deadlock

[Fix applied] TOKENIZERS_PARALLELISM=false
Running datasets.map() with num_proc=8, max_length=32769...
Map (num_proc=8): 100%|██████████████████| 50/50 [00:00<00:00, 82.81 examples/s]
SUCCESS: Processed 50 samples without deadlock

Test 2: num_proc=1 — behavior unchanged (env var not set)

TOKENIZERS_PARALLELISM=NOT SET (expected: NOT SET)
Running datasets.map() with num_proc=1...
Map (num_proc=1): 100%|██████████████████| 50/50 [00:02<00:00, 22.13 examples/s]
SUCCESS: Processed 50 samples

Test 3: Output identical with and without fix

Map (num_proc=1): 100%|██████████████████| 50/50 [00:02<00:00, 22.50 examples/s]
Map (num_proc=8): 100%|██████████████████| 50/50 [00:00<00:00, 88.66 examples/s]
SUCCESS: All 50 samples produce identical output with and without fix

@Bias92
Author

Bias92 commented Mar 5, 2026

The lint failure is in scripts/train_dflash.py (isort formatting), which is unrelated to my change in specforge/data/preprocessing.py; it appears to be a pre-existing issue on main.

@jiapingW
Collaborator

jiapingW commented Mar 5, 2026

Please fix lint.

@Bias92
Author

Bias92 commented Mar 5, 2026

Please fix lint.

Got it. I will do that in 5 hours.

@Bias92 Bias92 requested a review from FlamingoPg as a code owner March 5, 2026 13:27
@Bias92
Author

Bias92 commented Mar 5, 2026

Fixed the lint issue in 6d09589. Waiting for workflow approval to confirm CI passes.



Development

Successfully merging this pull request may close these issues.

[Bug] BUILD_DATASET_NUM_PROC>1 and the data length is greater than 32768 will block mapping.
