```csharp
/// <summary>Represents a real-time client.</summary>
/// <remarks>This interface provides methods to create and manage real-time sessions.</remarks>
[Experimental("MEAI001")]
public interface IRealtimeClient : IDisposable
```
It seems very intentional that Chat is omitted from the name. Is this consistent with the way the models/providers are exposing real-time support?
Right, I don't think the Chat term is used much with the realtime models. Here is a snippet from one of the Python implementations:
```python
from openai_realtime import RealtimeClient

client = RealtimeClient(api_key="...", model="gpt-realtime")
client.send_text("Hello")
for event in client:
    print(event)
```

```csharp
/// <summary>
/// Gets or sets the type of audio. For example, "audio/pcm".
/// </summary>
public string Type { get; set; }
```
Just wondering if the MediaType is broader and not scoped to audio formats?
This all looks very different from the existing chat client APIs; is it necessarily so? I was imagining there would be an onramp from the existing APIs to real-time. As it is, it seems a complete rewrite?
Yes, it is different because the chat client is designed around a traditional request–response model. As mentioned in the description, realtime models operate over bidirectional streaming, allowing clients to send input and receive output at the same time. This enables the server to start generating results while it is still processing incoming data. For example, realtime models support Voice Activity Detection (VAD), which can determine when to begin responding even before the entire audio stream is received. Think of it as a natural voice conversation: you can interrupt the model while it is speaking, and the server can start responding as soon as it detects a brief pause in your speech. If you have a better approach for supporting realtime models, I’d be happy to explore it.
I'm just trying to think about how this fits into existing samples/templates and how we teach people about it. It sounds like it's not a "grow into" type of technology but instead an "early fork". If an early fork, then I'd still expect most of the different tasks we already have to be possible in real-time. I wonder if it would help to look at a table of the tasks, which types are involved for Chat and which types are involved for Realtime. In some cases maybe we can use similar or same types (Options, Content?), but in others we might need new types. I would expect Realtime to be a superset of functionality in most cases. Is it ever possible to derive from those existing types? Additionally, we have infrastructure around IChatClient, like function calling; how does that work in Realtime, and can we reuse the infrastructure we already have?
That’s exactly the approach I followed in the proposal. I reused existing types wherever they naturally fit, and the proposal also defines realtime-specific types where no existing type applies. I believe the use of existing types here is appropriate.
Can you have a look at how function calling works today in https://github.com/dotnet/extensions/blob/e7fac9d9885b12ea2aacf75875802cc4571ee2ca/src/Libraries/Microsoft.Extensions.AI/ChatCompletion/FunctionInvokingChatClient.cs? We have these and other infrastructure built up around IChatClient.
I’ll look into that. My initial thought is that we shouldn’t need a parallel infrastructure; function calls could be handled like any other conversation item (similar to text, audio, etc.). The client would initiate a function call through a client message using FunctionCallContent, and the model would return the results in a server message. That said, I’m not an expert yet, so I may be overlooking something. I’ll spend more time reviewing this area.
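As a pseudocode-style C# sketch of that idea (the session/channel shapes and message type names are assumptions for illustration, not final API; only FunctionCallContent and FunctionResultContent are existing Microsoft.Extensions.AI types):

```csharp
// Hypothetical sketch: function calling modeled as an ordinary conversation
// item rather than a parallel infrastructure. RealtimeServerMessage.Content,
// RealtimeClientConversationItemCreateMessage, and InvokeFunctionAsync are
// illustrative names, not settled API surface.
await foreach (RealtimeServerMessage message in serverMessages)
{
    if (message.Content is FunctionCallContent call)
    {
        // Execute the requested function locally (helper is hypothetical).
        object? result = await InvokeFunctionAsync(call.Name, call.Arguments);

        // Hand the result back to the model as a regular client message,
        // reusing the existing FunctionResultContent type.
        await clientMessageChannel.Writer.WriteAsync(
            new RealtimeClientConversationItemCreateMessage
            {
                Content = [new FunctionResultContent(call.CallId, result)],
            });
    }
}
```

If the flow works like this, much of the existing FunctionInvokingChatClient logic (invoking AIFunction instances and marshaling results) could plausibly be shared rather than rebuilt.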
Here are the open issues currently being tracked in the proposal:
I’ll continue adding items to the list as we explore the proposal in more depth.
stephentoub left a comment:
Have you tried implementing this on multiple providers?
```csharp
/// <summary>Gets a value indicating whether the session is currently connected.</summary>
/// <returns><see langword="true"/> if the session is connected; otherwise, <see langword="false"/>.</returns>
bool IsConnected { get; }
```
Why is this needed? How does it get used?
I've removed it. Initially, I thought it might be useful for a session object to expose its connection status, but I've removed it for now until we find a need for it.
```csharp
/// <param name="updates">The sequence of real-time messages to send.</param>
/// <param name="cancellationToken">A token to cancel the operation.</param>
/// <returns>The response messages generated by the session.</returns>
IAsyncEnumerable<RealtimeServerMessage> GetStreamingResponseAsync(
```
Can I call this multiple times on the same IRealtimeSession instance?
No, this method cannot be called concurrently on the same session instance; the provider's implementation should throw an exception if concurrent calls are attempted. Calling it sequentially should be fine, though I don't anticipate that being a common use case. I have added a remark to the docs covering this.
```csharp
/// The log of the model’s confidence in generating a token. Higher values mean the token was more likely according to the model.
/// </summary>
[Experimental("MEAI001")]
public class LogProbability
```
Is this something that developers need in the 90% case? Is this generic across all providers?
This is typically used by AI engineers for guardrails and testing. It helps with confidence scoring or probability distribution between tokens. I’m going to remove it from the proposal for now until we identify a need for it, at which point we can reintroduce it.
By the way, Log Probability is supported by other providers, such as Gemini Flash models. However, there are differences in the supported fields between providers. For example, OpenAI uses bytes to generate the result, whereas Gemini uses token IDs instead.
```csharp
/// Represents a reusable prompt that you can use in requests, rather than specifying the content of prompts in code.
/// </summary>
[Experimental("MEAI001")]
public class PromptTemplate
```
This doesn't seem specific to real-time. We should think through whether we need a representation for this that applies to IChatClient as well, for example. Or if it's actually needed at all... with IChatClient, devs that need this from the underlying provider can break glass, using RawRepresentationFactory or similar.
I removed it for now. We can bring it back later if we need to.
```csharp
/// This is used to identify the purpose of the message being sent to the model.
/// </summary>
[Experimental("MEAI001")]
public enum RealtimeClientMessageType
```
How does this relate to the concrete subtypes like RealtimeClientInputAudioBufferAppendMessage?
I had initially intended for the same subtype to be usable for multiple events, but that is no longer the case, so I have removed RealtimeClientMessageType.
```csharp
/// <summary>
/// Gets or sets the tool choice mode for the response.
/// </summary>
/// <remarks>
/// If FunctionToolName or McpToolName is specified, this value will be ignored.
/// </remarks>
public ToolChoiceMode? ToolChoiceMode { get; set; }

/// <summary>
/// Gets or sets the name of the function tool to use for the response.
/// </summary>
/// <remarks>
/// If specified, the ToolChoiceMode, McpToolName, and McpToolServerLabel values will be ignored.
/// </remarks>
public string? FunctionToolName { get; set; }

/// <summary>
/// Gets or sets the name of the MCP tool to use for the response.
/// </summary>
/// <remarks>
/// If specified, the MCP tool server label will also be required.
/// </remarks>
public string? McpToolName { get; set; }

/// <summary>
/// Gets or sets the label of the MCP tool server to use for the response.
/// </summary>
/// <remarks>
/// If specified, the MCP tool name will also be required.
/// </remarks>
public string? McpToolServerLabel { get; set; }
```
Why are these needed? Can't these be modeled using the same AITool-derived types we already have?
```csharp
/// <summary>
/// Gets or sets the content of the conversation item.
/// </summary>
public AIContent Content { get; set; }
```
You are correct, this should be an array of contents. I'll fix that.
```csharp
/// Used with the <see cref="RealtimeServerMessageType.Error"/>.
/// </remarks>
[Experimental("MEAI001")]
public class RealtimeServerErrorMessage : RealtimeServerMessage
```
Could this just be ErrorContent as part of another message?
I prefer to keep this as a separate error message because it includes additional properties, such as ErrorEventId and Parameter, which cannot necessarily be added to other messages or to the ErrorContent itself. Please let me know if you still feel differently.
```csharp
/// This property is used only when having audio and text tokens. Otherwise InputTokenCount is sufficient.
/// </remarks>
[Experimental("MEAI001")]
public long? InputTextTokenCount { get; set; }
```
These are important enough to model like this rather than as part of AdditionalCounts?
@stephentoub I'm not sure where we draw the line between adding a new property and just using AdditionalCounts. I see we recently added CachedInputTokenCount; why was that made a new property rather than part of AdditionalCounts? I'm fine either way, but it would be good to have clear guidance on when to use AdditionalCounts and when to add a new property.
Commonality across providers. Individual providers have been putting cached token information into AdditionalCounts for a while, but now that most providers have this notion and expose it, we elevated it to a public property.
The new properties appear to be supported by multiple providers. The difference is that in Gemini Flash models the value is derived from the modality. Would that be enough to introduce the properties, or should we just add them to AdditionalCounts for now and consider exposing them later if we want to?
OpenAI:

```json
"usage": {
  "input_token_details": {
    "text_tokens": 100,
    "audio_tokens": 400
  }
}
```

Gemini:

```json
"usage_metadata": {
  "prompt_tokens_details": [
    { "modality": "TEXT", "token_count": 100 },
    { "modality": "AUDIO", "token_count": 400 }
  ]
}
```

```xml
<ForceLatestDotnetVersions>true</ForceLatestDotnetVersions>
<MinCodeCoverage>n/a</MinCodeCoverage>
<MinMutationScore>n/a</MinMutationScore>
<NoWarn>$(NoWarn);MEAI001</NoWarn> <!-- Added to suppress MEAI001 warnings thrown because of experimental usage added to UsageDetails type -->
```
I assume this is because of the source generator? Everyone else would get those same warnings as well. The experimental properties on serializable types should be marked as [JsonIgnore] as long as they're experimental.
Thanks. Using [JsonIgnore] fixed the issue.
My plan is to implement this for the Gemini Flash 2.5 model. I'll update the proposal according to the findings from this implementation.
Closing this as we opened the official PR at dotnet#7285.
Realtime Client Abstraction Proposal
This proposal outlines a unified abstraction for Realtime model clients within the `Microsoft.Extensions.AI.Abstractions` library. The goal is to provide a consistent and provider-agnostic interface for interacting with Realtime AI systems, making it easier for developers to integrate, use, and switch between different model implementations.

Realtime models typically operate over bidirectional streaming, enabling clients to send input and receive output concurrently. This allows the model server to begin generating results while still processing incoming data.

Interacting with realtime models usually involves creating a session or connection through which input and output are exchanged as streams. Sessions may maintain state across multiple interactions, enabling richer, more context-aware responses.

Realtime models can accept various types of input, such as text, audio, or images, and typically produce output in the form of text or audio.
Proposed Interface
IRealtimeClient

Defined in `IRealtimeClient.cs`. This is the primary interface for interacting with realtime models. Applications use it to create sessions and manage realtime connections. Below is an example of how application code might look:
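As a hedged sketch of what this could look like (the original example was not preserved here; `GetProviderRealtimeClient`, `CreateSessionAsync`, and the `RealtimeSessionOptions` members shown are assumptions, not settled API surface):

```csharp
// Hypothetical usage sketch of the proposed IRealtimeClient.
using IRealtimeClient client = GetProviderRealtimeClient(); // provider-specific factory (hypothetical)

// Create a stateful session to stream input to and output from the model.
await using IRealtimeSession session = await client.CreateSessionAsync(
    new RealtimeSessionOptions
    {
        Instructions = "You are a helpful voice assistant.",
    },
    cancellationToken);
```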
IRealtimeSession

Defined in `IRealtimeSession.cs`. After creating an `IRealtimeClient`, you can use it to create an instance of `IRealtimeSession`, which represents an individual session with the realtime model. A session enables sending input and receiving output as streams. The application can send input and receive output through the session using a mechanism similar to the example shown below:
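One possible shape, sketched under the assumption that client messages are produced through a `System.Threading.Channels` channel while server messages are consumed concurrently (the exact mechanism and names are illustrative):

```csharp
// Hypothetical sketch: input flows through a Channel while output is
// consumed from GetStreamingResponseAsync at the same time.
Channel<RealtimeClientMessage> clientMessageChannel =
    Channel.CreateUnbounded<RealtimeClientMessage>();

// Producer: elsewhere in the app, write text/audio messages to
// clientMessageChannel.Writer as they become available.

// Consumer: server output starts arriving while input is still streaming.
await foreach (RealtimeServerMessage message in
    session.GetStreamingResponseAsync(
        clientMessageChannel.Reader.ReadAllAsync(cancellationToken),
        cancellationToken))
{
    Console.WriteLine(message);
}
```

This split mirrors the bidirectional nature of realtime models: the consumer loop can observe partial results (or interruptions) before the producer has finished sending input.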
The application sends messages to the model by creating instances of `RealtimeClientMessage` and writing them to the `clientMessageChannel`. `RealtimeClientMessage` is a base type with specialized derived types representing the different message categories supported by realtime models.

Similarly, the application receives `RealtimeServerMessage` instances from the model. This is also a base type with multiple derived message types representing messages the server may emit.

The abstraction defines the following common client and server message types, with the expectation that more can be added over time.
Client Message Types
Server Message Types
Here are examples of how to send messages to the model:
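A sketch of sending (the original examples were not preserved here; `RealtimeClientInputAudioBufferAppendMessage` comes from the review thread, while `RealtimeClientConversationItemCreateMessage`, `AudioData`, and the channel usage are assumptions for illustration):

```csharp
// Hypothetical: append a chunk of PCM audio to the input buffer.
await clientMessageChannel.Writer.WriteAsync(
    new RealtimeClientInputAudioBufferAppendMessage { AudioData = pcmChunk });

// Hypothetical: add a text conversation item using the existing
// TextContent type from Microsoft.Extensions.AI.
await clientMessageChannel.Writer.WriteAsync(
    new RealtimeClientConversationItemCreateMessage
    {
        Content = [new TextContent("What is the weather like today?")],
    });
```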
Here are examples of receiving messages from the model:
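A sketch of receiving (the original examples were not preserved here; `RealtimeServerErrorMessage` and `RawRepresentation` appear in the proposal, while the `Message` property and the dispatch shape are assumptions):

```csharp
// Hypothetical: dispatch on the derived server message types.
await foreach (RealtimeServerMessage message in serverMessages)
{
    switch (message)
    {
        case RealtimeServerErrorMessage error:
            // Error details (e.g. a Message property) are assumed here.
            Console.Error.WriteLine($"Server error: {error.Message}");
            break;
        default:
            // Fall back to the provider's raw payload when no typed
            // handling applies.
            Console.WriteLine(message.RawRepresentation);
            break;
    }
}
```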
Important Notes
- Session settings can be updated through the `UpdateAsync` method on the `IRealtimeSession` interface. The `RealtimeSessionOptions` type represents the updatable settings, and providers may extend this type to include provider-specific options.
- Applications can send `RealtimeClientMessage` instances that contain raw data for scenarios not covered by the predefined message types. The `RealtimeClientMessage.RawRepresentation` property supports this behavior.
- Providers can emit `RealtimeServerMessage` instances with raw data for scenarios outside the predefined message types. Applications can access this through the `RealtimeServerMessage.RawRepresentation` property.
- The proposal includes several supporting types: `LogProbability`, `NoiseReductionOptions`, `PromptTemplate`, `RealtimeAudioFormat`, `SemanticVoiceActivityDetection`, `ServerVoiceActivityDetection`, `ToolChoiceMode`, and `TranscriptionOptions`. If we choose to keep some or all of these, they may need to be moved into more appropriate folders. For now, they are grouped together to simplify review.
- The proposal defines `RealtimeClientMessageType` and `RealtimeServerMessageType` enumerations to categorize the different message types. I am on the fence about including these; they may not be necessary.
- A sample implementation is provided in the `OpenAIRealtimeClient` file. This is not part of the proposal but serves to validate the design. The implementation is not fully complete but supports the major scenarios. A demo application is also provided to illustrate practical usage.
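To illustrate the first of those notes, a minimal sketch of updating a session mid-stream (assuming the `UpdateAsync`/`RealtimeSessionOptions` shape described above; the specific option members are illustrative):

```csharp
// Hypothetical: change session behavior without tearing down the
// underlying realtime connection.
await session.UpdateAsync(
    new RealtimeSessionOptions
    {
        Instructions = "From now on, respond only in French.",
    },
    cancellationToken);
```

Providers that extend `RealtimeSessionOptions` with provider-specific settings would accept their derived options type through the same call.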