Preserve VLM multipart content order by junruizh2021 · Pull Request #4274 · openvinotoolkit/model_server

junruizh2021 · 2026-06-05T06:17:30Z

🛠 Summary

This change fixes multipart content ordering for OpenAI Chat Completions in VLM requests.

Previously, text parts were flattened together and images were stored separately, then the VLM servable prepended generated <ov_genai_image_N> tags to each chat turn. That could change the intended prompt order when text and images were interleaved.

Now:

OpenAIChatCompletionsHandler::parseMessages() preserves the original multipart order.
Text parts are appended as-is.
image_url parts are decoded into imageHistory.
Each image part is replaced in-place with a <ov_genai_image_N>\n placeholder.
User-provided reserved image tags are rejected.
VLM servables no longer prepend image tags for Chat Completions; they only forward image tensors.
Responses endpoint behavior is preserved.
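The parsing behavior listed above can be sketched roughly as follows. This is a minimal illustration, not the handler's actual code: `ContentPart`, `FlattenResult`, and `flattenParts` are hypothetical names, images are represented as plain strings instead of tensors, and the real parser works on RapidJSON values. It requires C++14 or later.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for one entry of a multipart "content" array.
struct ContentPart {
    bool isImage;         // true for an image_url part, false for a text part
    std::string payload;  // text for text parts, stand-in image data for image parts
};

// Hypothetical result type for one flattened message.
struct FlattenResult {
    bool ok = true;                                      // false if a text part carried a reserved tag
    std::string content;                                 // flattened content with placeholders in place
    std::vector<std::pair<size_t, std::string>> images;  // (chat turn index, image), similar to imageHistory
};

// Flattens the parts of one message in their original order.
// nextImageIndex is shared across messages so placeholder numbers keep increasing.
FlattenResult flattenParts(const std::vector<ContentPart>& parts,
                           size_t chatTurnIndex,
                           size_t& nextImageIndex) {
    FlattenResult result;
    for (const auto& part : parts) {
        if (part.isImage) {
            // Record the image and leave a numbered placeholder where it appeared.
            result.images.emplace_back(chatTurnIndex, part.payload);
            result.content += "<ov_genai_image_" + std::to_string(nextImageIndex++) + ">\n";
        } else {
            // Reject user text that already contains a reserved tag.
            if (part.payload.find("<ov_genai_image_") != std::string::npos) {
                result.ok = false;
                result.content.clear();
                result.images.clear();
                return result;
            }
            result.content += part.payload;  // text is appended as-is
        }
    }
    return result;
}
```

With parts `text, image, text, image`, the placeholders end up between the text fragments instead of being grouped at the start of the turn, which is the ordering property this change is after.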

Tests were updated to cover processed JSON placeholders, interleaved text/image ordering, multi-message image indexes, and rejection of user-supplied reserved tags.

🧪 Checklist

Unit tests added.
Documentation updated.
Change follows security best practices.

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates OpenAI message parsing and VLM input preparation to preserve image/text multipart ordering via <ov_genai_image_N> placeholders, and adds validation to reject user-supplied reserved image tags.

Changes:

Update message parsing to inject <ov_genai_image_N>\n placeholders for image_url parts and validate user text does not contain reserved image tags.
Adjust VLM servable input preparation to treat CHAT_COMPLETIONS and RESPONSES endpoints differently when handling placeholders.
Update/extend unit tests to reflect placeholder insertion and reserved-tag validation failures.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.

File	Description
src/test/http_openai_handler_test.cpp	Updates expectations for placeholder-injected content and adds tests for reserved-tag rejection.
src/llm/visual_language_model/legacy/servable.cpp	Changes how image placeholders are checked/inserted into chat history depending on endpoint.
src/llm/visual_language_model/continuous_batching/servable.cpp	Mirrors legacy servable changes for continuous batching endpoint handling.
src/llm/apis/openai_completions.cpp	Injects image placeholders during parsing and rejects reserved-tag occurrences in user text/content.

junruizh2021 · 2026-06-05T06:36:30Z

+        if (executionContext->endpoint == Endpoint::RESPONSES) {
+            for (size_t i = 0; i < chatHistory.size(); i++) {
+                const auto& message = chatHistory[i];
+                if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) {
+                    return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+                }
            }
        }

-        const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory();
-        size_t imageIndex = 0;
-        std::unordered_map<size_t, std::string> imageTags;
-        for (const auto& image : imageHistory) {
-            const auto& [chatTurnIndex, imageTensor] = image;
-            std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
-            imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
-            vlmExecutionContext->inputImages.push_back(imageTensor);
-        }
-        for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
-            std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
-            chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+        if (executionContext->endpoint == Endpoint::RESPONSES) {
+            size_t imageIndex = 0;
+            std::unordered_map<size_t, std::string> imageTags;
+            for (const auto& image : imageHistory) {
+                const auto& [chatTurnIndex, imageTensor] = image;
+                std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
+                imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
+                vlmExecutionContext->inputImages.push_back(imageTensor);
+            }
+            for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
+                std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
+                chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+            }
+        } else {
+            for (const auto& image : imageHistory) {
+                const auto& [chatTurnIndex, imageTensor] = image;
+                (void)chatTurnIndex;
+                vlmExecutionContext->inputImages.push_back(imageTensor);
+            }
        }


Thanks for the review. I checked the parser ownership here: Endpoint::RESPONSES does not go through OpenAIChatCompletionsHandler::parseMessages(). It is parsed by openai_responses.cpp, where image content is still stored only in imageHistory and placeholders are not injected into message content. So the Responses servable path should keep owning placeholder insertion. I added comments in both VLM servables to make this contract explicit.
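The contract described in this reply can be sketched in isolation. The snippet below is a simplified illustration under stated assumptions: `prepareInputs` is a hypothetical function, messages are plain strings rather than JSON objects, images are opaque strings rather than tensors, and C++17 is assumed for structured bindings. It mirrors the shape of the diff hunks and is not the servable code itself.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

enum class Endpoint { CHAT_COMPLETIONS, RESPONSES };

// Hypothetical simplified image history: (chat turn index, image) pairs.
using ImageHistory = std::vector<std::pair<size_t, std::string>>;

// Returns the images to pass to the model. For RESPONSES the servable still owns
// placeholder insertion, so tags are prepended to the matching chat turn.
// For CHAT_COMPLETIONS the parser already placed them, so content is left untouched.
std::vector<std::string> prepareInputs(Endpoint endpoint,
                                       std::vector<std::string>& chatHistory,
                                       const ImageHistory& imageHistory) {
    std::vector<std::string> inputImages;
    if (endpoint == Endpoint::RESPONSES) {
        size_t imageIndex = 0;
        std::map<size_t, std::string> imageTags;  // tags grouped per chat turn
        for (const auto& [chatTurnIndex, image] : imageHistory) {
            imageTags[chatTurnIndex] += "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
            inputImages.push_back(image);
        }
        for (const auto& [chatTurnIndex, tags] : imageTags) {
            chatHistory.at(chatTurnIndex) = tags + chatHistory.at(chatTurnIndex);
        }
    } else {
        // Placeholders are already in the message content; only forward the images.
        for (const auto& entry : imageHistory) {
            inputImages.push_back(entry.second);
        }
    }
    return inputImages;
}
```

Keeping the two branches separate means each endpoint has exactly one owner of placeholder insertion, so a Chat Completions message cannot end up with the same tag added twice.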

+        if (executionContext->endpoint == Endpoint::RESPONSES) {
+            for (size_t i = 0; i < chatHistory.size(); i++) {
+                const auto& message = chatHistory[i];
+                if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) {
+                    return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+                }
            }
        }

-        const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory();
-        size_t imageIndex = 0;
-        std::unordered_map<size_t, std::string> imageTags;
-        for (const auto& image : imageHistory) {
-            const auto& [chatTurnIndex, imageTensor] = image;
-            std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
-            imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
-            vlmExecutionContext->inputImages.push_back(imageTensor);
-        }
-
-        for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
-            std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
-            chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+        if (executionContext->endpoint == Endpoint::RESPONSES) {
+            size_t imageIndex = 0;
+            std::unordered_map<size_t, std::string> imageTags;
+            for (const auto& image : imageHistory) {
+                const auto& [chatTurnIndex, imageTensor] = image;
+                std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
+                imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag;
+                vlmExecutionContext->inputImages.push_back(imageTensor);
+            }
+            for (const auto& [chatTurnIndex, imageTagString] : imageTags) {
+                std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or("");
+                chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent;
+            }
+        } else {
+            for (const auto& image : imageHistory) {
+                const auto& [chatTurnIndex, imageTensor] = image;
+                (void)chatTurnIndex;
+                vlmExecutionContext->inputImages.push_back(imageTensor);
+            }
        }


junruizh2021 · 2026-06-05T06:36:35Z

+static bool containsReservedImageTag(const std::string& text) {
+    return text.find("<ov_genai_image_") != std::string::npos;
+}


Good point. I updated containsReservedImageTag to take std::string_view and now pass RapidJSON string data with GetString() plus GetStringLength(), so the check no longer creates temporary std::string objects on the parsing path.
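A sketch of what the std::string_view variant might look like is shown below. The updated code is not included in this thread, so the exact signature is an assumption, and `hasReservedTagInBuffer` is a hypothetical helper that stands in for a call site passing a pointer and length such as those returned by RapidJSON's GetString() and GetStringLength().

```cpp
#include <cstddef>
#include <string_view>

// Assumed shape of the updated check: a string_view only refers to the
// caller's bytes, so no temporary std::string is allocated.
inline bool containsReservedImageTag(std::string_view text) {
    return text.find("<ov_genai_image_") != std::string_view::npos;
}

// Hypothetical wrapper for a (pointer, length) pair. Passing the explicit
// length means the whole buffer is scanned even if it contains embedded
// '\0' characters, which a plain const char* conversion would stop at.
inline bool hasReservedTagInBuffer(const char* data, std::size_t length) {
    return containsReservedImageTag(std::string_view(data, length));
}
```

Passing the length explicitly also avoids a strlen pass over the buffer before the search.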

+                if (memberName == "content" && containsReservedImageTag(member->value.GetString())) {
+                    return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
+                }


+                        if (containsReservedImageTag(entry["text"].GetString())) {
+                            return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag");
                        }


junruizh2021 · 2026-06-05T06:36:41Z

 }

-TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsConcatenatesWithNewline) {
+TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsPreservesTextParts) {


Agreed. The previous update made the test name misleading and could merge adjacent text parts without a delimiter. I changed the implementation and test expectation so consecutive text parts are still separated with \n, while image parts are replaced in-place with <ov_genai_image_N>\n placeholders to preserve multipart ordering.
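One possible reading of the rule in this reply is sketched below: a "\n" is emitted only between two adjacent text parts, and each image part becomes a numbered placeholder at its original position. `Part` and `joinParts` are hypothetical names, and whether a text part next to an image also receives a separator is not stated in the thread, so this sketch assumes it does not.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for one multipart content entry.
struct Part {
    bool isImage;      // true for an image_url part
    std::string text;  // used only when isImage is false
};

// Joins parts in order: "\n" between consecutive text parts,
// "<ov_genai_image_N>\n" in place of each image part.
std::string joinParts(const std::vector<Part>& parts) {
    std::string content;
    size_t imageIndex = 0;
    bool previousWasText = false;
    for (const auto& part : parts) {
        if (part.isImage) {
            content += "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n";
            previousWasText = false;
        } else {
            if (previousWasText) {
                content += "\n";  // delimiter only between adjacent text parts
            }
            content += part.text;
            previousWasText = true;
        }
    }
    return content;
}
```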

+    EXPECT_EQ(chatHistory[0]["content"], "First part.Second part.");
+    EXPECT_EQ(apiHandler->getProcessedJson(), R"({"model":"llama","messages":[{"role":"user","content":"First part.Second part."}]})");


Copilot AI review requested due to automatic review settings June 5, 2026 06:17

Copilot AI reviewed Jun 5, 2026


Preserve VLM multipart content order

ba1ae1b

junruizh2021 force-pushed the feature/vlm-preserve-multipart-content-order branch from 538ddcc to ba1ae1b Compare June 5, 2026 06:33

junruizh2021 wants to merge 1 commit into openvinotoolkit:main from junruizh2021:feature/vlm-preserve-multipart-content-order
