fix: parse multimodal tool messages #4680

Open
CUHKSZzxy wants to merge 2 commits into InternLM:main from CUHKSZzxy:fix/mm_parsing

@CUHKSZzxy

Collaborator

Summary

  • Parse multimodal content in tool messages instead of only user messages.
  • Preserve tool-message metadata such as tool_call_id while converting multimodal parts.
  • Add regression coverage for a tool-result image payload and multimodal input detection.
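The behavior summarized above can be sketched roughly as follows. This is a minimal illustration, not LMDeploy's actual implementation: `PARSED_ROLES`, `parse_messages`, and the `load_image` callback are hypothetical names, and the output part layout (`type`, `data`, `detail`) is assumed from the regression-test assertions quoted later in this thread.

```python
from typing import Any, Callable

# Hypothetical: roles whose list-style content is scanned for multimodal parts.
PARSED_ROLES = frozenset({'user', 'tool'})


def parse_messages(messages: list[dict[str, Any]],
                   load_image: Callable[[str], Any]) -> list[dict[str, Any]]:
    """Convert image_url parts in user and tool messages, keeping other keys."""
    parsed = []
    for message in messages:
        content = message.get('content')
        if message.get('role') not in PARSED_ROLES or not isinstance(content, list):
            parsed.append(message)  # plain-text or other-role messages pass through
            continue
        new_content = []
        for part in content:
            if part.get('type') == 'image_url':
                image_url = part['image_url']
                new_content.append({
                    'type': 'image',
                    'data': load_image(image_url['url']),
                    'detail': image_url.get('detail', 'auto'),
                })
            else:
                new_content.append(part)
        # Spread the original message first so metadata such as
        # tool_call_id survives; only 'content' is replaced.
        parsed.append({**message, 'content': new_content})
    return parsed


image_data_url = 'data:image/png;base64,AAAA'
messages = [
    {'role': 'user', 'content': 'Take a screenshot.'},
    {'role': 'assistant', 'content': None,
     'tool_calls': [{'id': 'call_1', 'type': 'function',
                     'function': {'name': 'screenshot', 'arguments': '{}'}}]},
    {'role': 'tool', 'tool_call_id': 'call_1',
     'content': [{'type': 'text', 'text': 'Screenshot captured.'},
                 {'type': 'image_url', 'image_url': {'url': image_data_url}}]},
]
parsed = parse_messages(messages, load_image=lambda url: f'loaded:{url}')
```

Building the rewritten message with `{**message, 'content': new_content}` is one simple way to keep `tool_call_id` attached to the tool result; the real processor may do this differently.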

Validation

  • Diff whitespace check passed.
  • Syntax compilation passed for touched Python files.
  • Replayed the original multimodal tool-result chat-completions payload against a real Qwen3.5 LMDeploy server and confirmed it completes normally instead of returning a prompt-processing error.

Assistance

Assisted with Codex + GPT-5.5 xHigh Fast

Copilot AI review requested due to automatic review settings June 16, 2026 04:00

Copilot AI left a comment

Contributor

Pull request overview

This PR updates LMDeploy’s multimodal message preprocessing so that multimodal tool messages (e.g., tool results containing image parts) are parsed the same way as multimodal user messages, while preserving tool-message metadata needed for tool-call correlation.

Changes:

  • Extend MultimodalProcessor._parse_multimodal_item to parse multimodal tool messages (not just user) and preserve message metadata when rewriting content.
  • Refactor multimodal-input detection to use a shared MULTIMODAL_TYPES constant and broaden detection to include tool messages.
  • Add regression tests for tool-result image parsing and tool-role multimodal detection.
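The detection change listed above might look something like the sketch below. Only the name `MULTIMODAL_TYPES` comes from the review; its contents here, along with `MULTIMODAL_ROLES` and `has_multimodal_input`, are assumptions made for illustration.

```python
from typing import Any

# Assumed contents; the real constant in multimodal.py may list other types.
MULTIMODAL_TYPES = frozenset({'image_url', 'video_url'})
# Hypothetical: roles that may carry multimodal parts after this PR.
MULTIMODAL_ROLES = frozenset({'user', 'tool'})


def has_multimodal_input(messages: list[dict[str, Any]]) -> bool:
    """Return True if any user or tool message contains a multimodal part."""
    for message in messages:
        if message.get('role') not in MULTIMODAL_ROLES:
            continue
        content = message.get('content')
        if not isinstance(content, list):
            continue  # plain string content cannot hold multimodal parts
        if any(isinstance(part, dict) and part.get('type') in MULTIMODAL_TYPES
               for part in content):
            return True
    return False


tool_only = [{'role': 'tool', 'tool_call_id': 'call_1',
              'content': [{'type': 'image_url',
                           'image_url': {'url': 'data:image/png;base64,AAAA'}}]}]
text_only = [{'role': 'user', 'content': 'hello'},
             {'role': 'tool', 'tool_call_id': 'call_1', 'content': 'done'}]
```

Keeping the supported types in one shared constant means the parser and the detector cannot drift apart when a new modality is added.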

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File | Description
--- | ---
lmdeploy/serve/processors/multimodal.py | Parses multimodal parts in tool messages and centralizes supported multimodal type detection.
tests/test_lmdeploy/test_content_merge.py | Adds regression coverage for tool-result image payload parsing and tool-role multimodal input detection.

Comment thread tests/test_lmdeploy/test_content_merge.py
Comment on lines +395 to +397
assert parsed[2]['content'][1]['type'] == Modality.IMAGE
assert parsed[2]['content'][1]['data'] == f'loaded:{image_data_url}'
assert parsed[2]['content'][1]['detail'] == 'auto'

Collaborator Author

This part does not apply to LMDeploy. Modality overrides __eq__ to compare against strings, so both "image" == Modality.IMAGE and Modality.IMAGE == "image" evaluate to True. The parser does store modality.value, but the assertion against Modality.IMAGE is still valid in this repo. The only change I made was for the reasonable test-surface issue: I removed the heavyweight VisionModel import/collector assertion.
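For readers unfamiliar with the pattern, here is a minimal sketch of an enum whose `__eq__` accepts strings. It is not LMDeploy's actual `Modality` class; the members and method bodies are assumptions that only illustrate why the comparison works in both directions.

```python
from enum import Enum


class Modality(Enum):
    """Illustrative stand-in, not the real LMDeploy class."""
    IMAGE = 'image'
    VIDEO = 'video'

    def __eq__(self, other):
        if isinstance(other, str):
            return self.value == other  # compare against raw strings
        if isinstance(other, Modality):
            return self is other
        return NotImplemented

    def __hash__(self):
        # Defining __eq__ would otherwise make members unhashable.
        return hash(self.value)


stored_type = Modality.IMAGE.value  # the parser stores the plain string
```

When the string is on the left, `str.__eq__` returns `NotImplemented` for a non-string operand, so Python falls back to the reflected `Modality.__eq__`, which is why `"image" == Modality.IMAGE` also holds.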

Comment thread tests/test_lmdeploy/test_content_merge.py Outdated

3 participants