Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/test/http_openai_handler_test.cpp` | Adds unit tests validating Responses parsing and unary response output structure. |
| `src/llm/servable.cpp` | Routes `/v3/responses` to the new endpoint and prepares inputs for it. |
| `src/llm/apis/openai_completions.hpp` | Adds `Endpoint::RESPONSES` and declares `parseResponsesPart()`. |
| `src/llm/apis/openai_completions.cpp` | Implements Responses parsing and a new unary serialization path for the Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens`/`max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.
```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.
```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
```cpp
// logprobs: bool; optional - defaults to false
it = doc.FindMember("logprobs");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsBool())
        return absl::InvalidArgumentError("logprobs accepts values true or false");
    request.logprobschat = it->value.GetBool();
}
if (request.logprobschat && request.stream) {
    return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
}
```
For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
It's probably an old comment that has already been fixed, right? @michalkulakowski
src/llm/apis/openai_completions.cpp (Outdated)
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
@michalkulakowski can you protect against such corner cases?
src/llm/apis/openai_completions.cpp (Outdated)
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
```markdown
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
```
Suggested change:
```markdown
:::{dropdown} **Unary call with cURL using Responses API**
**Note**: Using URLs in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).
```
```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
```
Can you break down the command into multiple lines to showcase what is actually being sent?
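For example, the same request with the body broken out line by line (field names and values copied verbatim from the one-liner above):

```shell
curl http://localhost:8000/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-2B",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Describe what is on the picture."
          },
          {
            "type": "input_image",
            "image_url": "http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg"
          }
        ]
      }
    ],
    "max_output_tokens": 100
  }'
```

Using single quotes around the body also removes the backslash-escaping, which makes the text and image parts of the `input` array much easier to read.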
```markdown
:::
```
```markdown
### Unary calls to responses endpoint using cURL
```
Suggested change:
```markdown
### Unary calls via Responses API using cURL
```
```json
"input_tokens": 27,
"input_tokens_details": { "cached_tokens": 0 },
"output_tokens": 30,
"output_tokens_details": { "reasoning_tokens": 0 },
```
Is `reasoning_tokens` supported when the model actually reasons?
I also cannot see any example showing how the message looks when there is reasoning.
```markdown
:::
```
```markdown
:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
```
Suggested change:
```markdown
:::{dropdown} **Streaming request with OpenAI client via Responses API**
```
```markdown
## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can also be used to generate text at high concurrency in an environment shared by multiple clients.
```
Why is it being added in responses api PR? Is it related?
```cpp
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
```
Please add a note here that it is not supported.
```cpp
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
```
Can we support it in this release?
```cpp
if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
```
`request.prompt` must have a value if you set it one line earlier.
```cpp
    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
```
Can you include what the content type is? That way we can easily see in the logs which unsupported content type the user sent.
```cpp
if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
```
Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with subsequent steps working only on the request? Why do we parse it if we don't use it later?
```cpp
auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
```
`functionObj` is never used outside this scope, so you don't need to work with pointers here, unlike `parametersValue`, where it is required.
```cpp
parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
```
Suggested change:
```cpp
request.toolNameSchemaMap[functionName] = std::move(schemaReprs);
```
```cpp
if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
```
Suggested change:
```cpp
return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");
```
```cpp
if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
```
We just recently switched to using PerfMetrics @mzegla.
Why can't we use it here?
```cpp
events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
```
This might not be the right place for this event to be streamed; please verify against other serving implementations in case the OpenAI spec is not specific enough.