
Mkulakow/responses api #4039

Open
michalkulakowski wants to merge 11 commits into main from mkulakow/responses_api

Conversation

@michalkulakowski
Collaborator

🛠 Summary

JIRA/Issue if applicable.
Describe the changes.

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.

Copilot AI review requested due to automatic review settings March 4, 2026 14:50
Contributor

Copilot AI left a comment


Pull request overview

This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.

Changes:

  • Introduces a new Endpoint::RESPONSES and routes /v3/responses + /v3/v1/responses to it.
  • Adds Responses-specific request parsing (input, max_output_tokens, conflict check with max_completion_tokens) and unary response JSON serialization (object: "response", output: [...]).
  • Extends unit tests to cover Responses parsing and unary response structure.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/test/http_openai_handler_test.cpp | Adds unit tests validating Responses parsing and unary response output structure. |
| src/llm/servable.cpp | Routes /v3/responses to the new endpoint and prepares inputs for it. |
| src/llm/apis/openai_completions.hpp | Adds Endpoint::RESPONSES and declares parseResponsesPart(). |
| src/llm/apis/openai_completions.cpp | Implements Responses parsing and a new unary serialization path for Responses output format. |
Comments suppressed due to low confidence (2)

src/llm/apis/openai_completions.cpp:826

  • For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are max_output_tokens / max_completion_tokens. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.
```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
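Where the comment suggests an endpoint-specific message, a minimal sketch could pick the reported field name by endpoint. The `Endpoint` enum mirrors the PR; the helper itself is hypothetical, not the actual OVMS code:

```cpp
#include <string>

// Hypothetical helper: choose the field name to report in the validation
// error based on the endpoint. Endpoint mirrors the PR's enum; the helper
// is an illustration only.
enum class Endpoint { CHAT_COMPLETIONS,
    COMPLETIONS,
    RESPONSES };

std::string maxTokensFieldName(Endpoint endpoint) {
    // The Responses endpoint accepts max_output_tokens (and
    // max_completion_tokens), so its error should name those fields.
    return endpoint == Endpoint::RESPONSES ? "max_output_tokens" : "max_tokens";
}
```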

src/llm/servable.cpp:76

  • The invalid-endpoint error message lists /v3/tokenize, but TokenizeParser::isTokenizeEndpoint() accepts any URI that ends with tokenize (including /v3/v1/tokenize). Consider listing both /v3/tokenize and /v3/v1/tokenize (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.
```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
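The suffix rule described in the comment can be sketched as follows; the function name matches the one the comment references, but the body is an assumed reconstruction, not the actual implementation:

```cpp
#include <string>

// Assumed reconstruction of the suffix-based rule the comment describes:
// any URI ending in "tokenize" is accepted, which is why both
// /v3/tokenize and /v3/v1/tokenize match.
bool isTokenizeEndpoint(const std::string& uri) {
    static const std::string suffix = "tokenize";
    return uri.size() >= suffix.size() &&
           uri.compare(uri.size() - suffix.size(), suffix.size(), suffix) == 0;
}
```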

Comment on lines +773 to +782
```cpp
// logprobs: bool; optional - defaults to false
it = doc.FindMember("logprobs");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsBool())
        return absl::InvalidArgumentError("logprobs accepts values true or false");
    request.logprobschat = it->value.GetBool();
}
if (request.logprobschat && request.stream) {
    return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
}
```

Copilot AI Mar 4, 2026


For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
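The first option Copilot suggests (rejecting streaming at parse time) could look like the sketch below. The types and names here are illustrative assumptions, not the actual OVMS code:

```cpp
#include <string>

// Hypothetical sketch: reject stream=true for the Responses endpoint at
// parse time until streaming serialization exists, instead of emitting
// chat/completions-shaped chunks.
enum class Endpoint { CHAT_COMPLETIONS,
    COMPLETIONS,
    RESPONSES };

struct ParseStatus {
    bool ok;
    std::string message;
};

ParseStatus checkResponsesStreaming(Endpoint endpoint, bool stream) {
    if (endpoint == Endpoint::RESPONSES && stream) {
        // Fail fast with a clear error rather than producing malformed output.
        return {false, "stream is not supported for the Responses endpoint"};
    }
    return {true, ""};
}
```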

Collaborator


This is probably an old comment that has already been fixed, right? @michalkulakowski

Comment on lines +784 to +821
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```

Copilot AI Mar 4, 2026


OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
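The clamp-and-validate option could be sketched as below; the helper name is hypothetical and the `std::optional<int>` parameter stands in for `request.maxTokens`:

```cpp
#include <cstdint>
#include <limits>
#include <optional>

// Hypothetical sketch of the clamp-and-validate option: refuse values that
// do not fit the signed int behind request.maxTokens instead of letting
// the uint32_t -> int assignment overflow.
bool assignMaxTokens(uint32_t requested, std::optional<int>& maxTokens) {
    if (requested > static_cast<uint32_t>(std::numeric_limits<int>::max())) {
        return false;  // caller would return an InvalidArgumentError here
    }
    maxTokens = static_cast<int>(requested);
    return true;
}
```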

Comment on lines +137 to +191
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```

Copilot AI Mar 4, 2026


serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
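The truncation Copilot warns about can be illustrated without the JSON writer at all: a `size_t` token count above `INT_MAX` survives a 64-bit unsigned write (what `writer.Uint64` would emit) but is mangled by a 32-bit signed one. This is an illustration, not OVMS code:

```cpp
#include <cstddef>
#include <cstdint>

// Lossy path: what effectively happens when a size_t count above INT_MAX
// is funneled through a 32-bit signed write (implementation-defined
// pre-C++20; wraps negative on typical two's-complement platforms).
int32_t writeAsInt32(std::size_t v) { return static_cast<int32_t>(v); }

// Safe path: a 64-bit unsigned write preserves the full validated range.
uint64_t writeAsUint64(std::size_t v) { return static_cast<uint64_t>(v); }
```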

Collaborator


@michalkulakowski can you protect against such corner cases?

Comment on lines +106 to +108
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```


Copilot AI Mar 4, 2026


responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
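One collision-resistant scheme along the lines Copilot suggests combines a nanosecond timestamp with a per-process atomic counter. Only the "resp-" prefix comes from the PR; the rest of this sketch is an assumption:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <string>

// Hypothetical sketch of a collision-resistant response ID: nanosecond
// timestamp plus a per-process atomic counter, so two requests in the
// same second (or even the same nanosecond) still get distinct IDs.
std::string makeResponseId() {
    static std::atomic<uint64_t> counter{0};
    const auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                        std::chrono::steady_clock::now().time_since_epoch())
                        .count();
    return "resp-" + std::to_string(ns) + "-" + std::to_string(counter.fetch_add(1));
}
```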

@michalkulakowski force-pushed the mkulakow/responses_api branch 2 times, most recently from ea72634 to 299e2be on March 5, 2026 17:42
Comment on lines +122 to +123
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
Collaborator


Suggested change
```diff
-:::{dropdown} **Unary call with curl using responses endpoint**
-**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
+:::{dropdown} **Unary call with cURL using the Responses API**
+**Note**: Using URLs in the request requires the `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).
```

```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
```
Collaborator


Can you break down the command into multiple lines to showcase what is actually being sent?

:::


### Unary calls to responses endpoint using cURL
Collaborator


Suggested change
```diff
-### Unary calls to responses endpoint using cURL
+### Unary calls via Responses API using cURL
```

```json
"input_tokens": 27,
"input_tokens_details": { "cached_tokens": 0 },
"output_tokens": 30,
"output_tokens_details": { "reasoning_tokens": 0 },
```
Collaborator


Is reasoning_tokens supported when the model actually reasons?
I also cannot see any example showing what the message looks like when there is reasoning.


:::

:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
```diff
-:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
+:::{dropdown} **Streaming request with OpenAI client via Responses API**
```


## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can also generate text under high concurrency in an environment shared by multiple clients.
Collaborator


Why is this being added in the Responses API PR? Is it related?

```cpp
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
```
Collaborator


Please add a note here that it is not supported.

```cpp
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
```
Collaborator


Can we support it in this release?


```cpp
if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
```
Collaborator


request.prompt must have a value here since you set it one line earlier, so the has_value() check is redundant.

```cpp
    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
```
Collaborator


Can you include the actual content type in the error message? That way we can easily see in the logs what unsupported content type the user sent.
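A sketch of the more informative error the reviewer asks for, echoing the offending content type in the message. The helper name is hypothetical, and the list of supported types is taken from the PR's examples (input_text, input_image) and may be incomplete:

```cpp
#include <string>

// Hypothetical helper: embed the client's content type in the error so
// the logs show exactly what was rejected. Supported-type list assumed
// from the PR's curl example.
std::string unsupportedContentTypeError(const std::string& contentType) {
    return "Unsupported content type: " + contentType +
           "; expected input_text or input_image";
}
```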

```cpp
if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
```
Collaborator


Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with subsequent steps working only on the request? Why do we parse it if we don't use it later?

```cpp
auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
```
Collaborator


functionObj is never used outside this scope, so you don't need to work with pointers here, unlike parametersValue where it is required.

```cpp
parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
```
Collaborator


Suggested change
```diff
-request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
+request.toolNameSchemaMap[functionName] = std::move(schemaReprs);
```

```cpp
if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
```
Collaborator


Suggested change
```diff
-return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
+return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");
```

```cpp
if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
```
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We just recently switched to using PerfMetrics, @mzegla. Why can't we use it here?

```cpp
events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
```
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be the right place for this event to be streamed; please verify against other serving solutions if the OpenAI spec is not specific enough.

Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and all other events
