Centralize multimodal token expansion via get_text_with_replacements#46404
ArthurZucker wants to merge 8 commits into
Replace the "<|placeholder|>"/"<placeholder>" sentinel dance with a single re.sub pass in the image/video/audio token expansion of several processors. The old idiom expanded each placeholder into N copies of itself via a temporary sentinel (to avoid the while-loop re-matching the freshly inserted tokens), then did a second full pass over the now-huge string to convert the sentinel back. re.sub does not rescan replacement text, so a single pass with a closure that consumes the per-image counts in order is equivalent and avoids both the sentinel and the second pass. Output is byte-for-byte identical; ~18x faster on the expansion step. Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image, llava_next (+ granite4_vision via modular), llava_onevision, granite_speech.
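The equivalence claimed above can be checked with a minimal, standalone sketch. It is not the processor code: `IMAGE_TOKEN`, `expand_with_sentinel`, `expand_single_pass`, and the plain list of per-image counts are illustrative stand-ins for `self.image_token` and the `image_grid_thw`-derived counts.

```python
import re

IMAGE_TOKEN = "<image>"        # stand-in for self.image_token
SENTINEL = "<|placeholder|>"   # temporary token used by the old idiom


def expand_with_sentinel(text, counts):
    """Old idiom: expand into sentinels one at a time, then convert back."""
    index = 0
    while IMAGE_TOKEN in text:
        # count=1 replaces only the first remaining placeholder
        text = text.replace(IMAGE_TOKEN, SENTINEL * counts[index], 1)
        index += 1
    # second full pass over the expanded string
    return text.replace(SENTINEL, IMAGE_TOKEN)


def expand_single_pass(text, counts):
    """New idiom: one re.sub pass; replacement text is not rescanned."""
    remaining = iter(counts)

    def expand(_match):
        # consume the per-image counts in encounter order
        return IMAGE_TOKEN * next(remaining)

    return re.sub(re.escape(IMAGE_TOKEN), expand, text)


prompt = "a <image> b <image> c"
counts = [3, 1]
assert expand_with_sentinel(prompt, counts) == expand_single_pass(prompt, counts)
```

Both functions assume `counts` has at least one entry per placeholder; the sketch does not handle a shortfall.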
vasqu left a comment
Adds a regression test that puts several `<image>` placeholders in a single prompt and asserts each is expanded to the right number of tokens in encounter order (via contiguous run lengths). Guards the single-pass re.sub expansion.
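A rough sketch of the run-length check such a test could use, assuming the prompt has already been tokenized to ids. `IMAGE_TOKEN_ID`, `image_run_lengths`, and the sample ids are hypothetical; the actual test in the PR may be written differently.

```python
from itertools import groupby

IMAGE_TOKEN_ID = 9  # hypothetical id of the image token


def image_run_lengths(token_ids, image_token_id=IMAGE_TOKEN_ID):
    """Lengths of contiguous runs of the image token, in encounter order."""
    return [
        sum(1 for _ in run)
        for value, run in groupby(token_ids)
        if value == image_token_id
    ]


# Three placeholders separated by text tokens, expanded to 3, 1 and 2 tokens.
token_ids = [1, 9, 9, 9, 2, 9, 3, 9, 9]
expected_counts = [3, 1, 2]
assert image_run_lengths(token_ids) == expected_counts
```

Two placeholders with no text between them would merge into a single run, so a test built this way needs at least one non-image token between placeholders.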
    for i in range(len(texts_doc)):
        while self.image_token in texts_doc[i]:
            texts_doc[i] = texts_doc[i].replace(
                self.image_token, "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
            )
            index += 1
        texts_doc[i] = texts_doc[i].replace("<|placeholder|>", self.image_token)

    def expand(_match):
        nonlocal index
        num_image_tokens = image_grid_thw[index].prod() // merge_length
        index += 1
        return self.image_token * num_image_tokens

    pattern = re.escape(self.image_token)
why not use the new API? 😄 It replaces once, directly by the correct expansion
ah i see Anton also commented above
ahh haha get_text_with_replacements
Migrate the image/video/audio token expansion of these processors to the shared ProcessorMixin helper introduced in #45493 instead of bespoke logic: each processor now defines `replace_image_token`/`replace_video_token`/ `replace_audio_token` and routes expansion through `get_text_with_replacements`. Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image, llava_next (+ granite4_vision via modular), llava_onevision, granite_speech. Note: `get_text_with_replacements` mutates the text list in place, so callers that must not edit the caller's input pass a copy.
`get_text_with_replacements` built the `batch_replacement_offsets` metadata (span/new_span dicts) on every call, even though it is only returned when `return_text_replacement_offsets=True`. Gate that work behind a new `return_replacement_offsets` argument (threaded from the generic `__call__`), so the common path skips it. Expanded text is unchanged; offset lists are empty when not requested. ~40% faster on the text-replacement step.
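The gating idea can be illustrated with a toy function. This is only a sketch under assumptions: `expand_tokens` and its `return_offsets` flag are hypothetical and do not reproduce the real `get_text_with_replacements` signature; only the `span`/`new_span` dict keys are taken from the description above.

```python
import re


def expand_tokens(text, token, replacements, return_offsets=False):
    """Replace each `token` occurrence with the matching replacement string.

    Offset metadata is only built when `return_offsets` is True.
    """
    pieces, offsets = [], []
    cursor = 0      # position in the original text
    new_cursor = 0  # position in the expanded text
    matches = re.finditer(re.escape(token), text)
    for match, replacement in zip(matches, replacements):
        prefix = text[cursor:match.start()]
        pieces.append(prefix)
        new_cursor += len(prefix)
        if return_offsets:  # skipped entirely on the common path
            offsets.append(
                {
                    "span": (match.start(), match.end()),
                    "new_span": (new_cursor, new_cursor + len(replacement)),
                }
            )
        pieces.append(replacement)
        new_cursor += len(replacement)
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), offsets


text, offsets = expand_tokens("a <image> b", "<image>", ["<image><image>"])
assert text == "a <image><image> b" and offsets == []
```

With `return_offsets=True` the expanded text is the same and `offsets` holds one dict per replaced placeholder.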
`get_text_with_replacements` expands placeholder tokens in place, so processors must run it on a copy (the default `__call__` does via `prepare_inputs_layout`). Add a ProcessorTesterMixin test that snapshots the input `text` list and asserts it is unchanged after a call, guarding against custom `__call__`s that forward the user's list directly. Skipped for ColQwen2/ColModernVBert (text XOR images).
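A minimal sketch of the snapshot-and-compare check described above. `expand_in_place` and `process` are hypothetical stand-ins for an in-place helper and a processor `__call__`; they are not the library's API.

```python
import copy


def expand_in_place(texts, token, counts):
    """Toy helper that, like the real one is described to, mutates `texts`."""
    remaining = iter(counts)
    for i, text in enumerate(texts):
        parts = text.split(token)
        expanded = parts[0]
        for part in parts[1:]:
            # one count is consumed per placeholder, in encounter order
            expanded += token * next(remaining) + part
        texts[i] = expanded
    return texts


def process(texts, token, counts):
    """Stand-in for a processor call: passes a copy, not the caller's list."""
    return expand_in_place(list(texts), token, counts)


user_text = ["a <image> b"]
snapshot = copy.deepcopy(user_text)
result = process(user_text, "<image>", [2])
assert user_text == snapshot  # the caller's list is untouched
```

Dropping the `list(texts)` copy in `process` makes the final assertion fail, which is the failure mode the test is meant to catch.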
Replace the earlier text-mutation test with one that spies on `get_text_with_replacements` and asserts the processor routes multimodal token expansion through it (the #45493 API) rather than bespoke logic. Skip it for processors that legitimately don't expand tokens: ColQwen2/ColModernVBert (text XOR images) and Mllama (one image token per image + cross-attention).
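One way such a spy could look, using `unittest.mock.patch.object` with `wraps` so the real method still runs. `ToyProcessor` and `routes_through_helper` are hypothetical; the toy helper only mimics the call shape used in this PR (a list of texts in, a `(texts, offsets)` tuple out).

```python
import re
from unittest import mock


class ToyProcessor:
    """Hypothetical processor; not a real transformers class."""

    image_token = "<image>"

    def get_text_with_replacements(self, text, images_replacements):
        remaining = iter(images_replacements)
        pattern = re.escape(self.image_token)
        expanded = [re.sub(pattern, lambda _m: next(remaining), t) for t in text]
        return expanded, []

    def __call__(self, text, tokens_per_image):
        replacements = [self.image_token * n for n in tokens_per_image]
        expanded, _ = self.get_text_with_replacements(
            list(text), images_replacements=replacements
        )
        return expanded


def routes_through_helper(processor, *args, **kwargs):
    """Return True if calling the processor invoked the shared helper."""
    with mock.patch.object(
        processor,
        "get_text_with_replacements",
        wraps=processor.get_text_with_replacements,  # keep real behavior
    ) as spy:
        processor(*args, **kwargs)
    return spy.called


assert routes_through_helper(ToyProcessor(), ["a <image> b"], [2])
```

A processor that expands tokens with its own `.replace` loop would leave `spy.called` as `False`, which is what the skip list has to account for.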
Route image/video/audio placeholder-token expansion through the shared ProcessorMixin.get_text_with_replacements helper (#45493) for the processors that still used bespoke `.replace`/`re.sub`/sentinel logic, each now defining replace_image_token/replace_video_token/replace_audio_token (or building the replacement list inline where the expansion is context-dependent): pixtral, internvl, qianfan_ocr, lighton_ocr, qwen2_audio, csm, video_llava, smolvlm, higgs_audio_v2, emu3, cohere2_vision, kosmos2, florence2, deepseek_vl, deepseek_vl_hybrid, gemma3, minicpmv4_6, paddleocr_vl, phi4_multimodal, qwen2_5_omni, qwen3_omni_moe. Output is byte-for-byte identical (validated against each model's existing token-count tests, or byte-equivalence harnesses where tests are gated by torchaudio/unreleased checkpoints). Mllama is intentionally left as-is (one image token per image + cross-attention, nothing to expand). gemma3n is deferred: its Hub chat template emits soft tokens, so it needs a soft-token anchored migration rather than a boi/eoi one.
[For maintainers] Suggested jobs to run (before merge) run-slow: cohere2_vision, colmodernvbert, colqwen2, csm, deepseek_ocr2, deepseek_vl, deepseek_vl_hybrid, emu3, ernie4_5_vl_moe, florence2, gemma3, glm_image, granite4_vision, granite_speech, higgs_audio_v2, internvl
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46404&sha=c2c78f
    num_images = len(image_inputs["num_patches"])
    images_replacements = [self.replace_image_token(image_inputs, i) for i in range(num_images)]
    image_inputs.pop("num_patches")
    text, _ = self.get_text_with_replacements(list(text), images_replacements=images_replacements)
imo since we already touched it, we need to update everything and call self._process_images. Maybe we won't need a custom __call__ method at all, the processing looks quite simple
If you're asking Claude to do it, you can give it these docs as instructions:
https://huggingface.co/docs/transformers/multimodal_processing
Routes image/video/audio placeholder expansion through the shared `ProcessorMixin.get_text_with_replacements` helper (#45493) for ~25 processors that still rolled their own `.replace`/`re.sub`/sentinel logic. Each now defines `replace_image_token`/`replace_video_token`/`replace_audio_token` (or builds the replacement list inline where expansion is context-dependent). Output is byte-for-byte identical.

Also
- `get_text_with_replacements` now skips building `batch_replacement_offsets` unless `return_text_replacement_offsets=True` (~40% faster on the text step; feature unchanged when requested).
- `ProcessorTesterMixin.test_processor_expands_tokens_via_get_text_with_replacements` enforces the contract; skip-listed for ColQwen2/ColModernVBert (text XOR images) and Mllama (one token/image + cross-attention).

Notes
- `gemma3n` deferred (its Hub template emits soft tokens → needs a soft-token-anchored migration, not boi/eoi).
- `replace_*_token` should be reviewed.
- Suggest squash-on-merge (history evolved in place).