
Centralize multimodal token expansion via get_text_with_replacements#46404

Open
ArthurZucker wants to merge 8 commits into main from faster-expand-image-tokens

ArthurZucker (Collaborator) commented Jun 4, 2026

Routes image/video/audio placeholder expansion through the shared ProcessorMixin.get_text_with_replacements helper (#45493) for ~25 processors that still rolled their own .replace/re.sub/sentinel logic. Each now defines replace_image_token/replace_video_token/replace_audio_token (or builds the replacement list inline where expansion is context-dependent). Output is byte-for-byte identical.
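
As a rough illustration of the pattern described above, here is a toy processor, not the real `ProcessorMixin`. `ToyProcessor`, `expand_in_order`, and the `num_patches` layout are assumptions made for this sketch; only the method name `replace_image_token` and the idea of handing the helper one replacement per image come from this PR.

```python
class ToyProcessor:
    """Hypothetical stand-in for a processor; not the transformers API."""

    image_token = "<image>"

    def replace_image_token(self, image_inputs, image_idx):
        # One replacement string per image: the placeholder repeated once
        # per patch of that image (assumed layout of image_inputs).
        return self.image_token * image_inputs["num_patches"][image_idx]

    def expand_in_order(self, text, replacements):
        # Hypothetical helper: the n-th placeholder gets the n-th replacement.
        pieces = text.split(self.image_token)
        if len(pieces) - 1 != len(replacements):
            raise ValueError("placeholder count does not match replacement count")
        expanded = pieces[0]
        for replacement, piece in zip(replacements, pieces[1:]):
            expanded += replacement + piece
        return expanded


processor = ToyProcessor()
image_inputs = {"num_patches": [2, 3]}
replacements = [processor.replace_image_token(image_inputs, i) for i in range(2)]
expanded = processor.expand_in_order("a <image> b <image>", replacements)
```

Keeping the per-image arithmetic in `replace_image_token` and the string surgery in one shared helper is the split this PR aims for; the real helper may handle more cases than this sketch does.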

Also

  • Perf: get_text_with_replacements now skips building batch_replacement_offsets unless return_text_replacement_offsets=True (~40% faster on the text step; feature unchanged when requested).
  • Test: new ProcessorTesterMixin.test_processor_expands_tokens_via_get_text_with_replacements enforces the contract; skip-listed for ColQwen2/ColModernVBert (text XOR images) and Mllama (one token/image + cross-attention).

Notes

  • Validated against each model's existing token-count tests; byte-equivalence harnesses where tests are gated (torchaudio / unreleased checkpoints). CI runs the rest.
  • gemma3n deferred (its Hub template emits soft tokens → needs a soft-token-anchored migration, not boi/eoi).
  • AI assistance was used; every migrated replace_*_token should be reviewed. Suggest squash-on-merge (history evolved in place).

Replace the "<|placeholder|>"/"<placeholder>" sentinel dance with a single
re.sub pass in the image/video/audio token expansion of several processors.

The old idiom expanded each placeholder into N copies of itself via a temporary
sentinel (to avoid the while-loop re-matching the freshly inserted tokens), then
did a second full pass over the now-huge string to convert the sentinel back.
re.sub does not rescan replacement text, so a single pass with a closure that
consumes the per-image counts in order is equivalent and avoids both the sentinel
and the second pass.

Output is byte-for-byte identical; ~18x faster on the expansion step.
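
A minimal, self-contained comparison of the two idioms, assuming plain strings and a list of per-image counts. The function names and the module-level constants are made up for this sketch; the processors themselves read the counts from tensors such as `image_grid_thw`.

```python
import re

IMAGE_TOKEN = "<image>"
SENTINEL = "<|placeholder|>"


def expand_with_sentinel(text, counts):
    # Old idiom: expand one placeholder at a time into sentinel copies so the
    # while-loop does not re-match freshly inserted tokens, then swap back.
    index = 0
    while IMAGE_TOKEN in text:
        text = text.replace(IMAGE_TOKEN, SENTINEL * counts[index], 1)
        index += 1
    return text.replace(SENTINEL, IMAGE_TOKEN)


def expand_single_pass(text, counts):
    # Single pass: re.sub does not rescan the replacement text, so a closure
    # that hands out the counts in encounter order is enough.
    remaining = iter(counts)
    return re.sub(
        re.escape(IMAGE_TOKEN),
        lambda _match: IMAGE_TOKEN * next(remaining),
        text,
    )


prompt = "first <image> then <image> done"
counts = [3, 1]
old = expand_with_sentinel(prompt, counts)
new = expand_single_pass(prompt, counts)
```

Both calls produce the same string here; the single-pass version simply avoids the intermediate sentinel string and the second scan over it.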

Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image,
llava_next (+ granite4_vision via modular), llava_onevision, granite_speech.

ArthurZucker requested a review from molbap June 4, 2026 08:13
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

vasqu (Contributor) left a comment

#45493 refactored the logic on replacements in general, I think these models just weren't covered yet (in those refactors). If the same fix could happen based on #45493 then I'd rather do it there. The relevant function is

def get_text_with_replacements(
then

Adds a regression test that puts several `<image>` placeholders in a single
prompt and asserts each is expanded to the right number of tokens in encounter
order (via contiguous run lengths). Guards the single-pass re.sub expansion.
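
The run-length check can be sketched as below. This is not the test added in the commit; `placeholder_run_lengths` and the toy token ids are assumptions for illustration, and the approach presumes consecutive placeholders are separated by at least one other token (adjacent runs would merge).

```python
from itertools import groupby


def placeholder_run_lengths(token_ids, image_token_id):
    # Length of each contiguous run of the image token, in encounter order.
    return [
        sum(1 for _ in run)
        for token, run in groupby(token_ids)
        if token == image_token_id
    ]


# Toy ids: 9 stands for the image token, everything else is ordinary text.
IMAGE_ID = 9
token_ids = [1, 9, 9, 9, 2, 9, 3, 9, 9]
runs = placeholder_run_lengths(token_ids, IMAGE_ID)
```

A test would then compare `runs` against the expected per-image token counts, which fails if expansions are applied out of order.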

Comment on lines +118 to +126
for i in range(len(texts_doc)):
    while self.image_token in texts_doc[i]:
        texts_doc[i] = texts_doc[i].replace(
            self.image_token, "<|placeholder|>" * (image_grid_thw[index].prod() // merge_length), 1
        )
        index += 1
    texts_doc[i] = texts_doc[i].replace("<|placeholder|>", self.image_token)

def expand(_match):
    nonlocal index
    num_image_tokens = image_grid_thw[index].prod() // merge_length
    index += 1
    return self.image_token * num_image_tokens

pattern = re.escape(self.image_token)
Member

why not use the new API? 😄 It replaces once, directly by the correct expansion

Member

ah i see Anton also commented above

Collaborator Author

which one? 🤣

Collaborator Author

ahh haha get_text_with_replacements

Migrate the image/video/audio token expansion of these processors to the
shared ProcessorMixin helper introduced in #45493 instead of bespoke logic:
each processor now defines `replace_image_token`/`replace_video_token`/
`replace_audio_token` and routes expansion through `get_text_with_replacements`.

Models: deepseek_ocr2, colqwen2, ernie4_5_vl_moe (image+video), glm_image,
llava_next (+ granite4_vision via modular), llava_onevision, granite_speech.

Note: `get_text_with_replacements` mutates the text list in place, so callers
that must not edit the caller's input pass a copy.
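
To make the in-place note concrete, here is a small sketch with a hypothetical stand-in, `expand_in_place`, that overwrites the entries of the list it receives. It is not the real helper; it only mimics the mutation behaviour described in the note.

```python
def expand_in_place(texts, token, count):
    # Hypothetical stand-in: overwrites each entry of the list it is given.
    for i, text in enumerate(texts):
        texts[i] = text.replace(token, token * count)
    return texts


user_texts = ["look: <image>", "and <image> again"]
snapshot = list(user_texts)

# list(...) hands the helper a shallow copy. Strings are immutable, so the
# caller's list keeps its original entries.
result = expand_in_place(list(user_texts), "<image>", 2)
```

Forwarding `user_texts` directly instead of `list(user_texts)` would leave the caller holding the expanded strings.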

`get_text_with_replacements` built the `batch_replacement_offsets` metadata
(span/new_span dicts) on every call, even though it is only returned when
`return_text_replacement_offsets=True`. Gate that work behind a new
`return_replacement_offsets` argument (threaded from the generic `__call__`),
so the common path skips it. Expanded text is unchanged; offset lists are
empty when not requested. ~40% faster on the text-replacement step.
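
The gating idea, sketched on a toy function. `expand_tokens` and its `return_offsets` flag are hypothetical, and the exact shape of the real offset entries is an assumption; only the `span`/`new_span` key names are taken from the commit message.

```python
import re


def expand_tokens(text, token, counts, return_offsets=False):
    # Hypothetical helper: offset bookkeeping only runs when requested.
    remaining = iter(counts)
    offsets = []
    shift = 0  # how much longer the output is than the input so far

    def substitute(match):
        nonlocal shift
        replacement = token * next(remaining)
        if return_offsets:
            start = match.start() + shift
            offsets.append(
                {"span": match.span(), "new_span": (start, start + len(replacement))}
            )
            shift += len(replacement) - len(match.group(0))
        return replacement

    return re.sub(re.escape(token), substitute, text), offsets


fast_text, fast_offsets = expand_tokens("a <image> b <image>", "<image>", [2, 1])
full_text, full_offsets = expand_tokens(
    "a <image> b <image>", "<image>", [2, 1], return_offsets=True
)
```

The expanded text is the same either way; the common path just skips building the per-match dicts.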

`get_text_with_replacements` expands placeholder tokens in place, so processors
must run it on a copy (the default `__call__` does via `prepare_inputs_layout`).
Add a ProcessorTesterMixin test that snapshots the input `text` list and asserts
it is unchanged after a call, guarding against custom `__call__`s that forward
the user's list directly. Skipped for ColQwen2/ColModernVBert (text XOR images).

Replace the earlier text-mutation test with one that spies on
`get_text_with_replacements` and asserts the processor routes multimodal token
expansion through it (the #45493 API) rather than bespoke logic. Skip it for
processors that legitimately don't expand tokens: ColQwen2/ColModernVBert
(text XOR images) and Mllama (one image token per image + cross-attention).
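
One way such a spy can be written with the standard library, shown on a toy class. `ToyProcessor` and its method bodies are hypothetical, and the real helper's signature and return value may differ; the point is only the `wraps=` pattern, which records calls while still running the wrapped method.

```python
from unittest import mock


class ToyProcessor:
    """Hypothetical processor used only to demonstrate the spy pattern."""

    def get_text_with_replacements(self, text, images_replacements=None):
        # Toy body: returns the text untouched plus an empty offsets list.
        return text, []

    def __call__(self, text):
        expanded, _ = self.get_text_with_replacements(
            list(text), images_replacements=[]
        )
        return expanded


processor = ToyProcessor()
with mock.patch.object(
    processor,
    "get_text_with_replacements",
    wraps=processor.get_text_with_replacements,
) as spy:
    output = processor(["hello <image>"])

call_count = spy.call_count
```

A processor that expanded tokens with its own `.replace` logic would leave `call_count` at zero, which is the condition the mixin test is meant to catch.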

Route image/video/audio placeholder-token expansion through the shared
ProcessorMixin.get_text_with_replacements helper (#45493) for the processors
that still used bespoke `.replace`/`re.sub`/sentinel logic, each now defining
replace_image_token/replace_video_token/replace_audio_token (or building the
replacement list inline where the expansion is context-dependent):

pixtral, internvl, qianfan_ocr, lighton_ocr, qwen2_audio, csm, video_llava,
smolvlm, higgs_audio_v2, emu3, cohere2_vision, kosmos2, florence2, deepseek_vl,
deepseek_vl_hybrid, gemma3, minicpmv4_6, paddleocr_vl, phi4_multimodal,
qwen2_5_omni, qwen3_omni_moe.

Output is byte-for-byte identical (validated against each model's existing
token-count tests, or byte-equivalence harnesses where tests are gated by
torchaudio/unreleased checkpoints). Mllama is intentionally left as-is (one
image token per image + cross-attention, nothing to expand). gemma3n is
deferred: its Hub chat template emits soft tokens, so it needs a soft-token
anchored migration rather than a boi/eoi one.

ArthurZucker changed the title from "Faster image/audio token expansion in processors (single-pass re.sub)" to "Centralize multimodal token expansion via get_text_with_replacements" Jun 4, 2026

github-actions (Bot, Contributor) commented Jun 4, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: cohere2_vision, colmodernvbert, colqwen2, csm, deepseek_ocr2, deepseek_vl, deepseek_vl_hybrid, emu3, ernie4_5_vl_moe, florence2, gemma3, glm_image, granite4_vision, granite_speech, higgs_audio_v2, internvl

github-actions (Bot, Contributor) commented Jun 4, 2026

CI Dashboard: View test results in Grafana

github-actions (Bot, Contributor) commented Jun 4, 2026

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=46404&sha=c2c78f

Comment on lines +105 to +108
num_images = len(image_inputs["num_patches"])
images_replacements = [self.replace_image_token(image_inputs, i) for i in range(num_images)]
image_inputs.pop("num_patches")
text, _ = self.get_text_with_replacements(list(text), images_replacements=images_replacements)

zucchini-nlp (Member), Jun 4, 2026

imo since we already touched it, we need to update everything and call self._process_images. Maybe we won't need a custom __call__ method at all, the processing looks quite simple

If you're asking Claude to do it, you can use these docs as instructions for it

https://huggingface.co/docs/transformers/multimodal_processing


4 participants