Conversation
Pull request overview
This PR adds initial support for the OpenAI Responses API to OVMS LLM serving, wiring a new /v3/responses endpoint through request parsing and unary response serialization, with accompanying unit tests.
Changes:
- Introduces a new `Endpoint::RESPONSES` and routes `/v3/responses` + `/v3/v1/responses` to it.
- Adds Responses-specific request parsing (`input`, `max_output_tokens`, conflict check with `max_completion_tokens`) and unary response JSON serialization (`object: "response"`, `output: [...]`).
- Extends unit tests to cover Responses parsing and unary response structure.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/test/http_openai_handler_test.cpp` | Adds unit tests validating Responses parsing and unary response output structure. |
| `src/llm/servable.cpp` | Routes `/v3/responses` to the new endpoint and prepares inputs for it. |
| `src/llm/apis/openai_completions.hpp` | Adds `Endpoint::RESPONSES` and declares `parseResponsesPart()`. |
| `src/llm/apis/openai_completions.cpp` | Implements Responses parsing and a new unary serialization path for the Responses output format. |
Comments suppressed due to low confidence (2)
src/llm/apis/openai_completions.cpp:826
- For RESPONSES requests, the validation error message says "max_tokens value should be greater than 0", but the endpoint-specific fields are `max_output_tokens`/`max_completion_tokens`. Updating the message to reference the actual accepted field(s) will make it clearer for API consumers.
```cpp
// specific part of max_tokens validation
if (request.maxTokens == 0) {
    return absl::InvalidArgumentError("max_tokens value should be greater than 0");
}
```
src/llm/servable.cpp:76
- The invalid-endpoint error message lists `/v3/tokenize`, but `TokenizeParser::isTokenizeEndpoint()` accepts any URI that ends with `tokenize` (including `/v3/v1/tokenize`). Consider listing both `/v3/tokenize` and `/v3/v1/tokenize` (consistent with the other endpoints) or wording it as a suffix-based rule to avoid misleading users.
```cpp
return absl::InvalidArgumentError("Wrong endpoint. Allowed endpoints: /v3/chat/completions, /v3/completions, /v3/responses, /v3/tokenize");
```
```cpp
// logprobs: bool; optional - defaults to false
it = doc.FindMember("logprobs");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsBool())
        return absl::InvalidArgumentError("logprobs accepts values true or false");
    request.logprobschat = it->value.GetBool();
}
if (request.logprobschat && request.stream) {
    return absl::InvalidArgumentError("logprobs are not supported in streaming mode.");
}
```
For the RESPONSES endpoint, stream can currently be enabled via parseCommonPart(), but serializeStreamingChunk() / serializeStreamingUsageChunk() do not handle Endpoint::RESPONSES (they only emit chat/completions-specific JSON). This will produce malformed streaming output for /v3/responses requests. Either explicitly reject streaming for RESPONSES in parseResponsesPart() (return InvalidArgument when request.stream == true) or implement proper Responses streaming serialization before enabling stream for this endpoint.
It's probably an old comment that has already been fixed, right? @michalkulakowski
src/llm/apis/openai_completions.cpp (Outdated)
```cpp
std::optional<uint32_t> maxCompletionTokens;
std::optional<uint32_t> maxOutputTokens;

// max_completion_tokens: uint; optional
it = doc.FindMember("max_completion_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_completion_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_completion_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_completion_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxCompletionTokens = it->value.GetUint();
}

// max_output_tokens: uint; optional
// OpenAI Responses API uses this field for output token limit.
it = doc.FindMember("max_output_tokens");
if (it != doc.MemberEnd() && !it->value.IsNull()) {
    if (!it->value.IsUint()) {
        if (it->value.IsUint64())
            return absl::InvalidArgumentError("max_output_tokens value can't be greater than 4294967295");
        return absl::InvalidArgumentError("max_output_tokens is not an unsigned integer");
    }
    if (maxTokensLimit.has_value() && it->value.GetUint() > maxTokensLimit.value())
        return absl::InvalidArgumentError(absl::StrCat("max_output_tokens exceeds limit provided in graph config: ", maxTokensLimit.value()));
    maxOutputTokens = it->value.GetUint();
}

if (maxCompletionTokens.has_value() && maxOutputTokens.has_value() && maxCompletionTokens.value() != maxOutputTokens.value()) {
    return absl::InvalidArgumentError("max_output_tokens and max_completion_tokens must match when both are provided");
}
if (maxOutputTokens.has_value()) {
    request.maxTokens = maxOutputTokens.value();
} else if (maxCompletionTokens.has_value()) {
    request.maxTokens = maxCompletionTokens.value();
}
```
OpenAIChatCompletionsRequest::maxTokens is std::optional<int>, but in RESPONSES parsing you accept max_output_tokens/max_completion_tokens as uint32_t and then assign into request.maxTokens. Values > INT_MAX will overflow / become implementation-defined. Consider changing maxTokens to an unsigned/wider type (e.g., uint32_t/uint64_t) or clamp + validate against std::numeric_limits<int>::max() before assigning.
```cpp
if (request.maxTokens.has_value()) {
    writer.String("max_output_tokens");
    writer.Int(request.maxTokens.value());
}

writer.String("output");
writer.StartArray();
int outputIndex = 0;
for (const auto& parsedOutput : parsedOutputs) {
    const std::string outputId = "msg-" + std::to_string(outputIndex++);

    writer.StartObject();
    writer.String("id");
    writer.String(outputId.c_str());
    writer.String("type");
    writer.String("message");
    writer.String("role");
    writer.String("assistant");
    writer.String("status");
    writer.String("completed");
    writer.String("content");
    writer.StartArray();
    writer.StartObject();
    writer.String("type");
    writer.String("output_text");
    writer.String("text");
    writer.String(parsedOutput.content.c_str());
    writer.String("annotations");
    writer.StartArray();
    writer.EndArray();
    writer.EndObject();
    writer.EndArray();
    writer.EndObject();
}
writer.EndArray();

writer.String("usage");
writer.StartObject();
writer.String("input_tokens");
writer.Int(usage.promptTokens);
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Int(0);
writer.EndObject();
writer.String("output_tokens");
writer.Int(usage.completionTokens);
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Int(0);
writer.EndObject();
writer.String("total_tokens");
writer.Int(usage.calculateTotalTokens());
writer.EndObject();
```
serializeResponsesUnaryResponse() writes numeric fields with writer.Int(...) (e.g., max_output_tokens, input_tokens, output_tokens, total_tokens). Several of these values come from size_t or may exceed signed 32-bit range, which can overflow and produce incorrect JSON. Prefer Int64/Uint64 (or Uint) for token counters/limits to match the validated range and the underlying types.
@michalkulakowski can you protect against such corner cases?
src/llm/apis/openai_completions.cpp (Outdated)
```cpp
const auto createdAt = std::chrono::duration_cast<std::chrono::seconds>(created.time_since_epoch()).count();
const std::string responseId = "resp-" + std::to_string(createdAt);
```
responseId is derived only from createdAt in seconds ("resp-" + std::to_string(createdAt)), which can easily collide under concurrent load (multiple requests within the same second). Use a higher-resolution timestamp and/or add a per-process counter/UUID component to keep response IDs unique.
```markdown
:::{dropdown} **Unary call with curl using responses endpoint**
**Note**: using urls in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md)
```
Suggested change:
```markdown
:::{dropdown} **Unary call with cURL using Responses API**
**Note**: Using URLs in request requires `--allowed_media_domains` parameter described [here](../../../docs/parameters.md).
```
```bash
curl http://localhost:8000/v3/responses -H "Content-Type: application/json" -d "{ \"model\": \"OpenGVLab/InternVL2-2B\", \"input\":[{\"role\": \"user\", \"content\": [{\"type\": \"input_text\", \"text\": \"Describe what is on the picture.\"},{\"type\": \"input_image\", \"image_url\": \"http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg\"}]}], \"max_output_tokens\": 100}"
```
Can you break down the command into multiple lines to showcase what is actually being sent?
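For example, the same request with the body broken out line by line (field names and values copied verbatim from the one-liner above):

```shell
curl http://localhost:8000/v3/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "OpenGVLab/InternVL2-2B",
    "input": [
      {
        "role": "user",
        "content": [
          {
            "type": "input_text",
            "text": "Describe what is on the picture."
          },
          {
            "type": "input_image",
            "image_url": "http://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/static/images/zebra.jpeg"
          }
        ]
      }
    ],
    "max_output_tokens": 100
  }'
```

Using single quotes around the body also removes the backslash-escaping, which makes the text and image parts of the `input` array much easier to read.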
```markdown
:::
```
```markdown
### Unary calls to responses endpoint using cURL
```
Suggested change:
```markdown
### Unary calls via Responses API using cURL
```
```json
"input_tokens": 27,
"input_tokens_details": { "cached_tokens": 0 },
"output_tokens": 30,
"output_tokens_details": { "reasoning_tokens": 0 },
```
Is `reasoning_tokens` supported when the model actually reasons?
I also cannot see any example showing how the message looks when there is reasoning.
```markdown
:::
```
```markdown
:::{dropdown} **Streaming request with OpenAI client using responses endpoint**
```
Suggested change:
```markdown
:::{dropdown} **Streaming request with OpenAI client via Responses API**
```
```markdown
## Benchmarking text generation with high concurrency

OpenVINO Model Server employs efficient parallelization for text generation. It can also be used to generate text at high concurrency in an environment shared by multiple clients.
```
Why is it being added in responses api PR? Is it related?
```cpp
writer.String("input_tokens_details");
writer.StartObject();
writer.String("cached_tokens");
writer.Uint64(0);
```
Please add a note here that it is not supported.
```cpp
writer.String("output_tokens_details");
writer.StartObject();
writer.String("reasoning_tokens");
writer.Uint64(0);
```
Can we support it in this release?
```cpp
if (inputIt->value.IsString()) {
    request.prompt = inputIt->value.GetString();
    if (!request.prompt.has_value() || request.prompt.value().empty()) {
```
`request.prompt` must have a value if you set it one line earlier.
```cpp
    }
    request.imageHistory.push_back({i, tensor});
} else {
    return absl::InvalidArgumentError("Unsupported content type");
```
Can you include what the content type is? That way we can easily see in the logs which unsupported content type the user sent.
```cpp
if (existingMessages != doc.MemberEnd()) {
    existingMessages->value = messages;
} else {
    doc.AddMember("messages", messages, allocator);
```
Why is the document being edited? If I understand correctly, we parse the doc into our request format. Why do we need to edit the doc? Shouldn't it be left untouched, with subsequent steps working only on the request? Why do we parse it if we don't use it later?
```cpp
auto& obj = it->value.GetArray()[i];
if (!obj.IsObject())
    return absl::InvalidArgumentError("Tool is not a JSON object");
rapidjson::Value* functionObj = nullptr;
```
`functionObj` is never used outside this scope, so you don't need to work with pointers here, unlike `parametersValue`, where it is required.
```cpp
parametersValue->Accept(writer);
std::string parametersStr = buffer.GetString();
ToolSchemaWrapper schemaReprs{parametersValue, std::move(parametersStr)};
request.toolNameSchemaMap[functionNameCStr] = std::move(schemaReprs);
```
Suggested change:
```cpp
request.toolNameSchemaMap[functionName] = std::move(schemaReprs);
```
```cpp
if (it->value.GetUint() == 0)
    return absl::InvalidArgumentError("n value should be greater than 0");
if (endpoint == Endpoint::RESPONSES && request.stream && it->value.GetUint() > 1)
    return absl::InvalidArgumentError("n greater than 1 is not supported for responses streaming");
```
Suggested change:
```cpp
return absl::InvalidArgumentError("n greater than 1 is not supported for Responses API streaming");
```
```cpp
if (endpoint == Endpoint::RESPONSES) {
    std::vector<ParsedOutput> parsedOutputs;
    for (const auto& tokens : results.tokens) {
        updateUsage(usage, tokens, request.echo);
```
We just recently switched to using PerfMetrics @mzegla.
Why can't we use it here?
```cpp
events.emplace_back(serializeResponsesEvent([this, &responseId, createdAt](Writer<StringBuffer>& writer) {
    writer.StartObject();
    writer.String("type");
    writer.String("response.created");
```
This might not be the right place for this event to be streamed; please verify against other serving implementations in case the OpenAI spec is not specific enough.