Centralize multimodal token expansion via get_text_with_replacements by ArthurZucker · Pull Request #46404 · huggingface/transformers

ArthurZucker · 2026-06-04T08:11:26Z

Routes image/video/audio placeholder expansion through the shared ProcessorMixin.get_text_with_replacements helper (#45493) for ~25 processors that still rolled their own .replace/re.sub/sentinel logic. Each now defines replace_image_token/replace_video_token/replace_audio_token (or builds the replacement list inline where expansion is context-dependent). Output is byte-for-byte identical.

Also

Perf: get_text_with_replacements now skips building batch_replacement_offsets unless return_text_replacement_offsets=True (~40% faster on the text step; feature unchanged when requested).
Test: new ProcessorTesterMixin.test_processor_expands_tokens_via_get_text_with_replacements enforces the contract; skip-listed for ColQwen2/ColModernVBert (text XOR images) and Mllama (one token/image + cross-attention).

Notes

Validated against each model's existing token-count tests; byte-equivalence harnesses where tests are gated (torchaudio / unreleased checkpoints). CI runs the rest.
gemma3n deferred (its Hub template emits soft tokens → needs a soft-token-anchored migration, not boi/eoi).
AI assistance was used; every migrated replace_*_token should be reviewed. Suggest squash-on-merge (history evolved in place).

Replace the "<|placeholder|>"/"<placeholder>" sentinel dance with a single re.sub pass in the image/video/audio token expansion of several processors. The old idiom expanded each placeholder into N copies of itself via a temporary sentinel (to avoid the while-loop re-matching the freshly inserted tokens), then did a second full pass over the now-huge string to convert the sentinel back. re.sub does not rescan replacement text, so a single pass with a closure that consumes the per-image counts in order is equivalent and avoids both the sentinel and the second pass. Output is byte-for-byte identical; ~18x faster on the expansion step. Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image, llava_next (+ granite4_vision via modular), llava_onevision, granite_speech.
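The equivalence described in this message can be checked on a toy string. This is a minimal sketch, assuming a hypothetical `<image>` token and made-up per-image counts; `expand_with_sentinel` and `expand_single_pass` are illustrative names, not functions from the repository.

```python
import re

IMAGE_TOKEN = "<image>"        # hypothetical placeholder token
SENTINEL = "<|placeholder|>"   # temporary sentinel used by the old idiom


def expand_with_sentinel(text, counts):
    """Old idiom: expand via a sentinel, then convert the sentinel back."""
    index = 0
    while IMAGE_TOKEN in text:
        # Replace only the first remaining placeholder; the sentinel keeps
        # the loop from re-matching the tokens it just inserted.
        text = text.replace(IMAGE_TOKEN, SENTINEL * counts[index], 1)
        index += 1
    # Second full pass over the already expanded string.
    return text.replace(SENTINEL, IMAGE_TOKEN)


def expand_single_pass(text, counts):
    """New idiom: one re.sub pass; replacement text is not rescanned."""
    remaining = iter(counts)

    def expand(_match):
        # Consume the per-image counts in encounter order.
        return IMAGE_TOKEN * next(remaining)

    return re.sub(re.escape(IMAGE_TOKEN), expand, text)


prompt = "a <image> b <image> c"
counts = [3, 2]
old = expand_with_sentinel(prompt, counts)
new = expand_single_pass(prompt, counts)
assert old == new
```

Both functions assume the list of counts has one entry per placeholder; the single-pass version raises `StopIteration` if the prompt contains more placeholders than counts.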

HuggingFaceDocBuilderDev · 2026-06-04T08:23:54Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu

#45493 refactored the replacement logic in general; I think these models just weren't covered by those refactors yet. If the same fix can be built on #45493, I'd rather do it there. The relevant function is then

transformers/src/transformers/processing_utils.py

Line 802 in b07d99b

def get_text_with_replacements(

Adds a regression test that puts several `<image>` placeholders in a single prompt and asserts each is expanded to the right number of tokens in encounter order (via contiguous run lengths). Guards the single-pass re.sub expansion.
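The "contiguous run lengths" check can be sketched with `itertools.groupby`. The token id and the toy `input_ids` below are assumptions for illustration, and `image_run_lengths` is a hypothetical helper, not the test added in this PR.

```python
from itertools import groupby

IMAGE_TOKEN_ID = 7  # hypothetical id of the image token


def image_run_lengths(input_ids, image_token_id=IMAGE_TOKEN_ID):
    """Return the length of each contiguous run of image tokens, in order."""
    return [
        sum(1 for _ in group)
        for token_id, group in groupby(input_ids)
        if token_id == image_token_id
    ]


# Toy ids: three image tokens, then one, then two, separated by text tokens.
input_ids = [1, 7, 7, 7, 2, 7, 3, 4, 7, 7]
expected_counts = [3, 1, 2]
assert image_run_lengths(input_ids) == expected_counts
```

The check only distinguishes images when at least one non-image token sits between consecutive placeholders; two adjacent placeholders would merge into a single run.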

zucchini-nlp · 2026-06-04T10:19:13Z

-                for i in range(len(texts_doc)):
-                    while self.image_token in texts_doc[i]:
-                        texts_doc[i] = texts_doc[i].replace(
-                            self.image_token, "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
-                        )
-                        index += 1
-                    texts_doc[i] = texts_doc[i].replace("<|placeholder|>", self.image_token)
+
+                def expand(_match):
+                    nonlocal index
+                    num_image_tokens = image_grid_thw[index].prod() // merge_length
+                    index += 1
+                    return self.image_token * num_image_tokens
+
+                pattern = re.escape(self.image_token)


why not use the new API? 😄 It replaces once, directly by the correct expansion

ah i see Anton also commented above

which one? 🤣

ahh haha get_text_with_replacements

Regenerate granite4_vision processor from modular (blank line before nested def)

Migrate the image/video/audio token expansion of these processors to the shared ProcessorMixin helper introduced in #45493 instead of bespoke logic: each processor now defines `replace_image_token`/`replace_video_token`/ `replace_audio_token` and routes expansion through `get_text_with_replacements`. Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image, llava_next (+ granite4_vision via modular), llava_onevision, granite_speech. Note: `get_text_with_replacements` mutates the text list in place, so callers that must not edit the caller's input pass a copy.
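The shape of the migrated processors can be sketched as follows. `ToyProcessor` and its stand-in `get_text_with_replacements` are hypothetical and only mimic the in-place behavior described in this message; the real helper lives in `processing_utils.py` and its signature and return value may differ.

```python
import re


class ToyProcessor:
    """Hypothetical stand-in; not the transformers ProcessorMixin."""

    image_token = "<image>"

    def replace_image_token(self, image_inputs, image_idx):
        # Per-image expansion: repeat the token once per patch of that image.
        return self.image_token * image_inputs["num_patches"][image_idx]

    def get_text_with_replacements(self, text, images_replacements):
        # Stand-in that, like the helper described above, edits `text` in place.
        replacements = iter(images_replacements)
        pattern = re.escape(self.image_token)
        for i, sample in enumerate(text):
            text[i] = re.sub(pattern, lambda _m: next(replacements), sample)
        return text

    def __call__(self, text, image_inputs):
        num_images = len(image_inputs["num_patches"])
        replacements = [self.replace_image_token(image_inputs, i) for i in range(num_images)]
        # Pass a copy so the caller's list is left untouched.
        return self.get_text_with_replacements(list(text), replacements)


user_text = ["<image> hi", "<image> and <image>"]
expanded = ToyProcessor()(user_text, {"num_patches": [2, 1, 3]})
```

Replacements are consumed in encounter order across the whole batch, so the second sample receives the second and third entries of `num_patches`.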

`get_text_with_replacements` built the `batch_replacement_offsets` metadata (span/new_span dicts) on every call, even though it is only returned when `return_text_replacement_offsets=True`. Gate that work behind a new `return_replacement_offsets` argument (threaded from the generic `__call__`), so the common path skips it. Expanded text is unchanged; offset lists are empty when not requested. ~40% faster on the text-replacement step.
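The gating idea can be illustrated on a single string. This is a sketch under stated assumptions: `expand_placeholders` and its `return_offsets` flag are hypothetical names, and the `span`/`new_span` dict layout is guessed from the description above rather than copied from the library.

```python
import re


def expand_placeholders(text, token, replacements, return_offsets=False):
    """Hypothetical sketch: expand `token` and optionally record span metadata."""
    remaining = iter(replacements)
    offsets = []
    pieces = []
    cursor = 0      # position in the original text
    new_cursor = 0  # position in the expanded text
    for match in re.finditer(re.escape(token), text):
        replacement = next(remaining)
        pieces.append(text[cursor:match.start()])
        new_cursor += match.start() - cursor
        if return_offsets:
            # Only build the metadata dicts when the caller asked for them.
            offsets.append(
                {
                    "span": (match.start(), match.end()),
                    "new_span": (new_cursor, new_cursor + len(replacement)),
                }
            )
        pieces.append(replacement)
        new_cursor += len(replacement)
        cursor = match.end()
    pieces.append(text[cursor:])
    return "".join(pieces), offsets


expanded, offsets = expand_placeholders(
    "a <image> b", "<image>", ["<image><image>"], return_offsets=True
)
fast_expanded, no_offsets = expand_placeholders("a <image> b", "<image>", ["<image><image>"])
```

The expanded text is the same on both paths; only the offsets list differs, and it stays empty on the default path.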

`get_text_with_replacements` expands placeholder tokens in place, so processors must run it on a copy (the default `__call__` does via `prepare_inputs_layout`). Add a ProcessorTesterMixin test that snapshots the input `text` list and asserts it is unchanged after a call, guarding against custom `__call__`s that forward the user's list directly. Skipped for ColQwen2/ColModernVBert (text XOR images).
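A snapshot check of this kind might look like the sketch below. The two expansion functions are hypothetical stand-ins for a custom `__call__` that forwards the user's list and one that copies it first; `leaves_input_unchanged` is an illustrative helper, not the mixin test itself.

```python
import copy


def expands_in_place(text):
    # Hypothetical buggy path: edits the caller's list directly.
    for i, sample in enumerate(text):
        text[i] = sample.replace("<image>", "<image><image>")
    return text


def expands_on_copy(text):
    # Hypothetical safe path: works on a shallow copy of the list.
    return expands_in_place(list(text))


def leaves_input_unchanged(call, text):
    """Snapshot `text`, run `call`, and report whether `text` still matches."""
    snapshot = copy.deepcopy(text)
    call(text)
    return text == snapshot


safe = leaves_input_unchanged(expands_on_copy, ["<image> hi"])
unsafe = leaves_input_unchanged(expands_in_place, ["<image> hi"])
```

A shallow copy is enough here because the list holds immutable strings; only the list slots are reassigned.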

Replace the earlier text-mutation test with one that spies on `get_text_with_replacements` and asserts the processor routes multimodal token expansion through it (the #45493 API) rather than bespoke logic. Skip it for processors that legitimately don't expand tokens: ColQwen2/ColModernVBert (text XOR images) and Mllama (one image token per image + cross-attention).
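Spying on a method without changing its behavior can be done with `unittest.mock.patch.object` and `wraps`. The sketch below uses a hypothetical `ToyProcessor`, not a real transformers processor, and its helper signature is an assumption.

```python
from unittest import mock


class ToyProcessor:
    """Hypothetical stand-in; not the transformers ProcessorMixin."""

    def get_text_with_replacements(self, text, images_replacements):
        # Toy helper: expands one placeholder per sample.
        replacements = iter(images_replacements)
        return [sample.replace("<image>", next(replacements), 1) for sample in text]

    def __call__(self, text, num_tokens):
        replacements = ["<image>" * n for n in num_tokens]
        return self.get_text_with_replacements(list(text), replacements)


processor = ToyProcessor()
# `wraps` keeps the real behavior while recording the call.
with mock.patch.object(
    processor,
    "get_text_with_replacements",
    wraps=processor.get_text_with_replacements,
) as spy:
    output = processor(["<image> hi"], num_tokens=[3])

spy.assert_called_once()
```

A processor that expanded tokens with its own `.replace` loop would still produce the right text, but the spy would record zero calls, which is the property such a test is meant to catch.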

Route image/video/audio placeholder-token expansion through the shared ProcessorMixin.get_text_with_replacements helper (#45493) for the processors that still used bespoke `.replace`/`re.sub`/sentinel logic, each now defining replace_image_token/replace_video_token/replace_audio_token (or building the replacement list inline where the expansion is context-dependent): pixtral, internvl, qianfan_ocr, lighton_ocr, qwen2_audio, csm, video_llava, smolvlm, higgs_audio_v2, emu3, cohere2_vision, kosmos2, florence2, deepseek_vl, deepseek_vl_hybrid, gemma3, minicpmv4_6, paddleocr_vl, phi4_multimodal, qwen2_5_omni, qwen3_omni_moe. Output is byte-for-byte identical (validated against each model's existing token-count tests, or byte-equivalence harnesses where tests are gated by torchaudio/unreleased checkpoints). Mllama is intentionally left as-is (one image token per image + cross-attention, nothing to expand). gemma3n is deferred: its Hub chat template emits soft tokens, so it needs a soft-token anchored migration rather than a boi/eoi one.

github-actions · 2026-06-04T13:35:49Z

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_vision, colmodernvbert, colqwen2, csm, deepseek_ocr2, deepseek_vl, deepseek_vl_hybrid, emu3, ernie4_5_vl_moe, florence2, gemma3, glm_image, granite4_vision, granite_speech, higgs_audio_v2, internvl

github-actions · 2026-06-04T13:43:35Z

CI Dashboard: View test results in Grafana

github-actions · 2026-06-04T13:48:47Z

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46404&sha=c2c78f

zucchini-nlp · 2026-06-04T14:25:40Z

+            num_images = len(image_inputs["num_patches"])
+            images_replacements = [self.replace_image_token(image_inputs, i) for i in range(num_images)]
+            image_inputs.pop("num_patches")
+            text, _ = self.get_text_with_replacements(list(text), images_replacements=images_replacements)


imo since we already touched it, we need to update everything and call self._process_images. Maybe we won't need a custom __call__ method at all; the processing looks quite simple

If you're asking Claude to do it, you can use these docs as instructions for it

https://huggingface.co/docs/transformers/multimodal_processing

ArthurZucker requested a review from molbap June 4, 2026 08:13

vasqu reviewed Jun 4, 2026


zucchini-nlp reviewed Jun 4, 2026


ArthurZucker added 6 commits June 4, 2026 13:25

Regenerate granite4_vision processor from modular (blank line before nested def)

ce41c18

ArthurZucker changed the title ~~Faster image/audio token expansion in processors (single-pass re.sub)~~ Centralize multimodal token expansion via get_text_with_replacements Jun 4, 2026

zucchini-nlp reviewed Jun 4, 2026


ArthurZucker wants to merge 8 commits into main from faster-expand-image-tokens


4 participants
