Preserve VLM multipart content order#4274
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Updates OpenAI message parsing and VLM input preparation to preserve image/text multipart ordering via <ov_genai_image_N> placeholders, and adds validation to reject user-supplied reserved image tags.
Changes:
- Update message parsing to inject
<ov_genai_image_N>\nplaceholders forimage_urlparts and validate user text does not contain reserved image tags. - Adjust VLM servable input preparation to treat
CHAT_COMPLETIONSandRESPONSESendpoints differently when handling placeholders. - Update/extend unit tests to reflect placeholder insertion and reserved-tag validation failures.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| src/test/http_openai_handler_test.cpp | Updates expectations for placeholder-injected content and adds tests for reserved-tag rejection. |
| src/llm/visual_language_model/legacy/servable.cpp | Changes how image placeholders are checked/inserted into chat history depending on endpoint. |
| src/llm/visual_language_model/continuous_batching/servable.cpp | Mirrors legacy servable changes for continuous batching endpoint handling. |
| src/llm/apis/openai_completions.cpp | Injects image placeholders during parsing and rejects reserved-tag occurrences in user text/content. |
| if (executionContext->endpoint == Endpoint::RESPONSES) { | ||
| for (size_t i = 0; i < chatHistory.size(); i++) { | ||
| const auto& message = chatHistory[i]; | ||
| if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) { | ||
| return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag"); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory(); | ||
| size_t imageIndex = 0; | ||
| std::unordered_map<size_t, std::string> imageTags; | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n"; | ||
| imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
| for (const auto& [chatTurnIndex, imageTagString] : imageTags) { | ||
| std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or(""); | ||
| chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent; | ||
| if (executionContext->endpoint == Endpoint::RESPONSES) { | ||
| size_t imageIndex = 0; | ||
| std::unordered_map<size_t, std::string> imageTags; | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n"; | ||
| imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
| for (const auto& [chatTurnIndex, imageTagString] : imageTags) { | ||
| std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or(""); | ||
| chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent; | ||
| } | ||
| } else { | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| (void)chatTurnIndex; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
| } |
There was a problem hiding this comment.
Thanks for the review. I checked the parser ownership here: Endpoint::RESPONSES does not go through OpenAIChatCompletionsHandler::parseMessages(). It is parsed by openai_responses.cpp, where image content is still stored only in imageHistory and placeholders are not injected into message content. So the Responses servable path should keep owning placeholder insertion. I added comments in both VLM servables to make this contract explicit.
| if (executionContext->endpoint == Endpoint::RESPONSES) { | ||
| for (size_t i = 0; i < chatHistory.size(); i++) { | ||
| const auto& message = chatHistory[i]; | ||
| if (message["content"].as_string().value_or("").find("<ov_genai_image_") != std::string::npos) { | ||
| return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag"); | ||
| } | ||
| } | ||
| } | ||
|
|
||
| const ImageHistory& imageHistory = vlmExecutionContext->apiHandler->getImageHistory(); | ||
| size_t imageIndex = 0; | ||
| std::unordered_map<size_t, std::string> imageTags; | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n"; | ||
| imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
|
|
||
| for (const auto& [chatTurnIndex, imageTagString] : imageTags) { | ||
| std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or(""); | ||
| chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent; | ||
| if (executionContext->endpoint == Endpoint::RESPONSES) { | ||
| size_t imageIndex = 0; | ||
| std::unordered_map<size_t, std::string> imageTags; | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| std::string imageTag = "<ov_genai_image_" + std::to_string(imageIndex++) + ">\n"; | ||
| imageTags[chatTurnIndex] = imageTags[chatTurnIndex] + imageTag; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
| for (const auto& [chatTurnIndex, imageTagString] : imageTags) { | ||
| std::string messageContent = chatHistory[chatTurnIndex]["content"].as_string().value_or(""); | ||
| chatHistory[chatTurnIndex]["content"] = imageTagString + messageContent; | ||
| } | ||
| } else { | ||
| for (const auto& image : imageHistory) { | ||
| const auto& [chatTurnIndex, imageTensor] = image; | ||
| (void)chatTurnIndex; | ||
| vlmExecutionContext->inputImages.push_back(imageTensor); | ||
| } | ||
| } |
| static bool containsReservedImageTag(const std::string& text) { | ||
| return text.find("<ov_genai_image_") != std::string::npos; | ||
| } |
There was a problem hiding this comment.
Good point. I updated containsReservedImageTag to take std::string_view and now pass RapidJSON string data with GetString() plus GetStringLength(), so the check no longer creates temporary std::string objects on the parsing path.
| if (memberName == "content" && containsReservedImageTag(member->value.GetString())) { | ||
| return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag"); | ||
| } |
| if (containsReservedImageTag(entry["text"].GetString())) { | ||
| return absl::InvalidArgumentError("Message contains restricted <ov_genai_image> tag"); | ||
| } |
| } | ||
|
|
||
| TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsConcatenatesWithNewline) { | ||
| TEST_F(HttpOpenAIHandlerParsingTest, ParsingMessagesMultipleTextItemsPreservesTextParts) { |
There was a problem hiding this comment.
Agreed. The previous update made the test name misleading and could merge adjacent text parts without a delimiter. I changed the implementation and test expectation so consecutive text parts are still separated with \n, while image parts are replaced in-place with <ov_genai_image_N>\n placeholders to preserve multipart ordering.
| EXPECT_EQ(chatHistory[0]["content"], "First part.Second part."); | ||
| EXPECT_EQ(apiHandler->getProcessedJson(), R"({"model":"llama","messages":[{"role":"user","content":"First part.Second part."}]})"); |
538ddcc to
ba1ae1b
Compare
🛠 Summary
This change fixes multipart content ordering for OpenAI Chat Completions in VLM requests.
Previously, text parts were flattened together and images were stored separately, then the VLM servable prepended generated
<ov_genai_image_N>tags to each chat turn. That could change the intended prompt order when text and images were interleaved.Now:
OpenAIChatCompletionsHandler::parseMessages()preserves the original multipart order.image_urlparts are decoded intoimageHistory.<ov_genai_image_N>\nplaceholder.Tests were updated to cover processed JSON placeholders, interleaved text/image ordering, multi-message image indexes, and rejection of user-supplied reserved tags.
🧪 Checklist
``