fix: disable tokenizers parallelism to prevent dataset map deadlock#481
Bias92 wants to merge 2 commits into sgl-project:main
Conversation
…gl-project#422) HuggingFace tokenizers use internal Rust parallelism that deadlocks when combined with Python multiprocessing (datasets.map with num_proc>1). This is especially likely with long sequences (max_length>32768) where tokenization takes longer. Setting TOKENIZERS_PARALLELISM=false before the map call prevents the deadlock. Also addresses sgl-project#349.
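A minimal sketch of the guard this commit describes, assuming a hypothetical helper name (the real change lives in the dataset-building code):

```python
import os

def maybe_disable_tokenizer_parallelism(num_proc: int) -> None:
    """Set TOKENIZERS_PARALLELISM=false before datasets.map forks
    worker processes, so the tokenizers' internal Rust threads cannot
    deadlock in the forked children. With num_proc <= 1 nothing is set."""
    if num_proc > 1:
        # Must happen before the fork / before tokenization starts in workers.
        os.environ["TOKENIZERS_PARALLELISM"] = "false"

maybe_disable_tokenizer_parallelism(num_proc=8)
print(os.environ.get("TOKENIZERS_PARALLELISM"))  # prints: false
```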
LGTM. Can you update the results of your test plan?
Thank you for the review! Due to academic commitments, I ran the test plan on a MacBook M4 Pro instead of my RTX 4060 Ti desktop. This is fine since the deadlock is a CPU-side forking issue and doesn't require a GPU.

Test Plan Results

Test 1:
Test 2:
Test 3: Output identical with and without fix
The lint failure is in |
Please fix lint. |
Got it. I will do that in 5 hours. |
Fixed the lint issue in |
Summary
`dataset.map()` deadlocks when `BUILD_DATASET_NUM_PROC > 1` with long sequences (`max_length > 32768`)

Root Cause
HuggingFace's fast tokenizers use internal Rust-based parallelism. When `datasets.map(num_proc=N)` forks the Python process, the Rust threads in each forked worker deadlock against each other. This is a well-known issue documented by HuggingFace.

The deadlock becomes more likely with longer sequences (`max_length > 32768`) because each tokenization call takes longer, widening the window for the race condition. This explains why:

- smaller `max_length` values work fine (tokenization completes before the deadlock can occur)
- `num_proc=0` or `1` works (no forking → no deadlock)

Fix
Set `TOKENIZERS_PARALLELISM=false` before calling `dataset.map()` when `num_proc > 1`. This disables the Rust-level parallelism inside each forked worker, which is the recommended fix from HuggingFace. The per-worker loss of parallelism is compensated by the Python-level multiprocessing across workers.

Test plan
- `--build-dataset-num-proc 8 --max-length 32769 --target-model-backend hf` — should no longer deadlock
- `--build-dataset-num-proc 1` — behavior unchanged (env var not set)
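The two test-plan cases above can be checked with a small self-contained sketch (re-stating the hypothetical guard so the snippet runs on its own):

```python
import os

def maybe_disable_tokenizer_parallelism(num_proc: int) -> None:
    # Hypothetical guard mirroring the PR: only set the env var
    # when datasets.map would actually fork worker processes.
    if num_proc > 1:
        os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Case 1: num_proc=1, behavior unchanged, env var stays unset.
os.environ.pop("TOKENIZERS_PARALLELISM", None)
maybe_disable_tokenizer_parallelism(num_proc=1)
print("TOKENIZERS_PARALLELISM" in os.environ)  # prints: False

# Case 2: num_proc=8, Rust-level parallelism disabled before forking.
maybe_disable_tokenizer_parallelism(num_proc=8)
print(os.environ["TOKENIZERS_PARALLELISM"])  # prints: false
```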