Preserve VLM multipart content order #4274

Open
junruizh2021 wants to merge 1 commit into openvinotoolkit:main from junruizh2021:feature/vlm-preserve-multipart-content-order

@junruizh2021

🛠 Summary

This change fixes multipart content ordering for OpenAI Chat Completions in VLM requests.

Previously, text parts were flattened together and images were stored separately, then the VLM servable prepended generated <ov_genai_image_N> tags to each chat turn. That could change the intended prompt order when text and images were interleaved.

Now:

  • OpenAIChatCompletionsHandler::parseMessages() preserves the original multipart order.
  • Text parts are appended as-is.
  • image_url parts are decoded into imageHistory.
  • Each image part is replaced in-place with a <ov_genai_image_N>\n placeholder.
  • User-provided reserved image tags are rejected.
  • VLM servables no longer prepend image tags for Chat Completions; they only forward image tensors.
  • Responses endpoint behavior is preserved.
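
The ordering rule in the list above can be sketched as follows. This is a minimal illustration, not the code from this PR: `ContentPart`, `Message`, `ParsedChat`, and `flattenInOrder` are hypothetical stand-ins for the RapidJSON-based parsing in `parseMessages()`, and the image payload is kept as a plain string instead of a decoded tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for the parsed request (not the OVMS types).
struct ContentPart {
    enum class Kind { Text, ImageUrl } kind;
    std::string payload;  // the text, or the image URL / data URI
};
using Message = std::vector<ContentPart>;

struct ParsedChat {
    std::vector<std::string> contents;  // one flattened content string per message
    std::vector<std::pair<std::size_t, std::string>> imageHistory;  // (chat turn index, image)
};

// Walk the parts in request order. Text is appended as-is; each image part is
// replaced in place by <ov_genai_image_N>\n, where N counts images across the
// whole conversation rather than per message.
ParsedChat flattenInOrder(const std::vector<Message>& messages) {
    ParsedChat out;
    std::size_t imageIndex = 0;
    for (std::size_t turn = 0; turn < messages.size(); ++turn) {
        std::string content;
        for (const ContentPart& part : messages[turn]) {
            if (part.kind == ContentPart::Kind::Text) {
                content += part.payload;
            } else {
                content += "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
                out.imageHistory.emplace_back(turn, part.payload);
            }
        }
        out.contents.push_back(std::move(content));
    }
    return out;
}
```

Under these assumptions, a message with the parts text, image, text keeps its placeholder between the two text fragments, and an image in a second message continues the numbering from the first.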

Tests were updated to cover processed JSON placeholders, interleaved text/image ordering, multi-message image indexes, and rejection of user-supplied reserved tags.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

Copilot AI review requested due to automatic review settings June 5, 2026 06:17

Copilot AI left a comment

Contributor

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates OpenAI message parsing and VLM input preparation to preserve image/text multipart ordering via <ov_genai_image_N> placeholders, and adds validation to reject user-supplied reserved image tags.

Changes:

  • Update message parsing to inject <ov_genai_image_N>\n placeholders for image_url parts and validate user text does not contain reserved image tags.
  • Adjust VLM servable input preparation to treat CHAT_COMPLETIONS and RESPONSES endpoints differently when handling placeholders.
  • Update/extend unit tests to reflect placeholder insertion and reserved-tag validation failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/test/http_openai_handler_test.cpp | Updates expectations for placeholder-injected content and adds tests for reserved-tag rejection. |
| src/llm/visual_language_model/legacy/servable.cpp | Changes how image placeholders are checked/inserted into chat history depending on endpoint. |
| src/llm/visual_language_model/continuous_batching/servable.cpp | Mirrors legacy servable changes for continuous batching endpoint handling. |
| src/llm/apis/openai_completions.cpp | Injects image placeholders during parsing and rejects reserved-tag occurrences in user text/content. |

Comment on lines +280 to 308
+    if (executionContext->endpoint == Endpoint::RESPONSES) {
+        for (size_t i = 0; i < chatHistory.size(); i++) {
+            const auto& message = chatHistory[i];
+            if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) {
+                return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+            }
+        }
+    }

     const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory();
-    size_t imageIndex = 0;
-    std::unordered_map<size_t, std::string> imageTags;
-    for (const auto& image : imageHistory) {
-        const auto& [chatTurnIndex, imageTensor] = image;
-        std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
-        imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
-        vlmExecutionContext->inputImages.push_back(imageTensor);
-    }
-    for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
-        std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
-        chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+    if (executionContext->endpoint == Endpoint::RESPONSES) {
+        size_t imageIndex = 0;
+        std::unordered_map<size_t, std::string> imageTags;
+        for (const auto& image : imageHistory) {
+            const auto& [chatTurnIndex, imageTensor] = image;
+            std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
+            imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
+            vlmExecutionContext->inputImages.push_back(imageTensor);
+        }
+        for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
+            std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
+            chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+        }
+    } else {
+        for (const auto& image : imageHistory) {
+            const auto& [chatTurnIndex, imageTensor] = image;
+            (void)chatTurnIndex;
+            vlmExecutionContext->inputImages.push_back(imageTensor);
+        }
     }

Author

Thanks for the review. I checked the parser ownership here: Endpoint::RESPONSES does not go through OpenAIChatCompletionsHandler::parseMessages(). It is parsed by openai_responses.cpp, where image content is still stored only in imageHistory and placeholders are not injected into message content. So the Responses servable path should keep owning placeholder insertion. I added comments in both VLM servables to make this contract explicit.
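
The contract described in this reply can be sketched as below. It is a simplified illustration under stated assumptions: images are represented by `int` handles instead of tensors, chat history is a plain vector of strings, and `prepareImages` is a hypothetical name. `std::map` is used here (the diff uses `std::unordered_map`) only to make the iteration order deterministic.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins: (chat turn index, image handle). The real ImageHistory holds tensors.
using ImageHistory = std::vector<std::pair<std::size_t, int>>;
enum class Endpoint { CHAT_COMPLETIONS, RESPONSES };

// Returns the images to forward to the model. Chat history is edited only on
// the Responses path, whose parser leaves message content free of placeholders.
std::vector<int> prepareImages(Endpoint endpoint,
                               const ImageHistory& imageHistory,
                               std::vector<std::string>& chatHistory) {
    std::vector<int> inputImages;
    std::map<std::size_t, std::string> imageTags;  // tags grouped per chat turn
    std::size_t imageIndex = 0;
    for (const auto& [chatTurnIndex, image] : imageHistory) {
        if (endpoint == Endpoint::RESPONSES) {
            imageTags[chatTurnIndex] += "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
        }
        inputImages.push_back(image);  // both paths forward every image
    }
    // Empty for Chat Completions, so already-placed placeholders are not duplicated.
    for (const auto& [chatTurnIndex, tags] : imageTags) {
        chatHistory.at(chatTurnIndex) = tags + chatHistory.at(chatTurnIndex);
    }
    return inputImages;
}
```

With this split, a Chat Completions message that already contains `<ov_genai_image_0>\n` in the position chosen by the parser passes through unchanged, while a Responses message gets its tags prepended as before.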

Comment on lines +76 to 104
+    if (executionContext->endpoint == Endpoint::RESPONSES) {
+        for (size_t i = 0; i < chatHistory.size(); i++) {
+            const auto& message = chatHistory[i];
+            if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) {
+                return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+            }
+        }
+    }

     const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory();
-    size_t imageIndex = 0;
-    std::unordered_map<size_t, std::string> imageTags;
-    for (const auto& image : imageHistory) {
-        const auto& [chatTurnIndex, imageTensor] = image;
-        std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
-        imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
-        vlmExecutionContext->inputImages.push_back(imageTensor);
-    }
-
-    for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
-        std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
-        chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+    if (executionContext->endpoint == Endpoint::RESPONSES) {
+        size_t imageIndex = 0;
+        std::unordered_map<size_t, std::string> imageTags;
+        for (const auto& image : imageHistory) {
+            const auto& [chatTurnIndex, imageTensor] = image;
+            std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
+            imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
+            vlmExecutionContext->inputImages.push_back(imageTensor);
+        }
+        for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
+            std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
+            chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+        }
+    } else {
+        for (const auto& image : imageHistory) {
+            const auto& [chatTurnIndex, imageTensor] = image;
+            (void)chatTurnIndex;
+            vlmExecutionContext->inputImages.push_back(imageTensor);
+        }
     }
Comment thread src/llm/apis/openai_completions.cpp Outdated
Comment on lines +49 to +51
+static bool containsReservedImageTag(const std::string& text) {
+    return text.find("<ov_genai_image_") != std::string::npos;
+}

Author

Good point. I updated containsReservedImageTag to take std::string_view and now pass RapidJSON string data with GetString() plus GetStringLength(), so the check no longer creates temporary std::string objects on the parsing path.
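
A minimal sketch of the `std::string_view` variant described in this reply, assuming the function keeps its name and the same prefix match; the constant name `kReservedImageTagPrefix` is made up for illustration. At a RapidJSON call site the view would presumably be built from `GetString()` and `GetStringLength()`, which avoids both the temporary `std::string` and a `strlen` pass.

```cpp
#include <cassert>
#include <string_view>

// Prefix shared by every generated placeholder. Matching on the prefix alone
// also rejects malformed variants such as "<ov_genai_image_x>".
constexpr std::string_view kReservedImageTagPrefix = "<ov_genai_image_";

// Takes a non-owning view, so callers can pass a pointer and length without copying.
bool containsReservedImageTag(std::string_view text) {
    return text.find(kReservedImageTagPrefix) != std::string_view::npos;
}
```

Because the view carries an explicit length, the check only inspects the bytes the caller hands over, which matters for JSON strings that contain embedded null characters.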

Comment thread src/llm/apis/openai_completions.cpp Outdated
Comment on lines +183 to +185
+if (memberName == "content" && containsReservedImageTag(member->value.GetString())) {
+    return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+}
Comment thread src/llm/apis/openai_completions.cpp Outdated
Comment on lines 225 to 227
if (containsReservedImageTag(entry["text"].GetString())) {
    return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
}
}

-TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsConcatenatesWithNewline) {
+TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsPreservesTextParts) {

Author

Agreed. The previous update made the test name misleading and could merge adjacent text parts without a delimiter. I changed the implementation and test expectation so consecutive text parts are still separated with \n, while image parts are replaced in-place with <ov_genai_image_N>\n placeholders to preserve multipart ordering.
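
One way to read the revised delimiter rule in this reply is sketched below. The exact behavior at text/image boundaries is an assumption: here a `\n` is inserted only between two adjacent text parts, and the placeholder's own trailing `\n` separates an image from whatever follows. `Part` and `joinParts` are hypothetical names, not identifiers from the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified content part.
struct Part {
    bool isImage;      // true for an image_url part
    std::string text;  // used only when isImage is false
};

// Assumed rule: "\n" goes between two adjacent text parts only; an image part
// becomes <ov_genai_image_N>\n at the position where it appeared.
std::string joinParts(const std::vector<Part>& parts) {
    std::string content;
    std::size_t imageIndex = 0;
    bool previousWasText = false;
    for (const Part& part : parts) {
        if (part.isImage) {
            content += "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
            previousWasText = false;
        } else {
            if (previousWasText) {
                content += "\n";  // keep adjacent text parts from merging
            }
            content += part.text;
            previousWasText = true;
        }
    }
    return content;
}
```

Under this reading, the two text parts from the outdated expectation further down would be joined with a newline between them instead of being merged into a single run.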

Comment thread src/test/http_openai_handler_test.cpp Outdated
Comment on lines +3215 to +3216
EXPECT_EQ(chatHistory[0]["content"], "First part.Second part.");
EXPECT_EQ(apiHandler->getProcessedJson(), R"({"model":"llama","messages":[{"role":"user","content":"First part.Second part."}]})");
@junruizh2021 force-pushed the feature/vlm-preserve-multipart-content-order branch from 538ddcc to ba1ae1b on June 5, 2026 06:33