OpenAI chat completions streaming does not aggregate audio delta chunks (GPT-4o audio modality) #1701

@braintrust-bot

Description

Summary

OpenAI's GPT-4o models support audio input/output in chat completions. When streaming with audio output (modalities: ["text", "audio"]), the API sends delta.audio chunks containing audio.id, audio.transcript, audio.data (base64), and audio.expires_at. The current aggregateChatCompletionChunks function only handles delta.role, delta.content, delta.tool_calls, and delta.finish_reason — it does not aggregate delta.audio at all.

This means the audio transcript (the most useful field for observability) is lost in streaming responses. Non-streaming responses capture the full response object, so audio data is preserved there. Audio token metrics (prompt_audio_tokens, completion_audio_tokens) are already correctly extracted from usage data.

What is missing

  • Streaming aggregation (js/src/instrumentation/plugins/openai-plugin.ts, aggregateChatCompletionChunks ~lines 389-466): Handles delta.role, delta.content, delta.tool_calls, and delta.finish_reason. No branch for delta.audio. Audio transcript chunks are silently dropped.
  • Vendor SDK types (js/src/vendor-sdk-types/openai-common.ts): OpenAIChatDelta interface defines role, content, tool_calls, finish_reason and a catch-all [key: string]: unknown. No explicit audio field.
  • E2E tests: No scenario tests audio modality in chat completions.
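The missing type could be filled in along these lines; this is an illustrative sketch, not the vendored source — the audio sub-type name is invented here, and the sub-fields mirror the delta fields described in this issue:

```typescript
// Illustrative extension of the vendored delta type; the audio sub-fields
// mirror the streaming delta fields described in this issue.
interface OpenAIChatAudioDelta {
  id?: string;         // identifier for the audio response
  transcript?: string; // incremental transcript text
  data?: string;       // base64-encoded audio chunk
  expires_at?: number; // Unix timestamp
}

interface OpenAIChatDelta {
  role?: string;
  content?: string | null;
  tool_calls?: unknown[];
  audio?: OpenAIChatAudioDelta; // the missing explicit field
  [key: string]: unknown;       // catch-all retained from the real type
}

// An audio delta now type-checks against the explicit field instead of
// falling through the index signature:
const delta: OpenAIChatDelta = { audio: { transcript: "Hi" } };
console.log(delta.audio?.transcript); // "Hi"
```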

Expected behavior

The streaming aggregation should at minimum concatenate delta.audio.transcript chunks so the final span output includes the model's audio transcript. Storing the full base64 audio.data in spans would be impractical (very large), but the transcript text is compact and essential for observability.
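The concatenation step could be sketched as below. The accumulator shape and function name are my own invention for illustration, not the plugin's actual internals; the point is that transcript chunks are appended while the base64 `data` is deliberately dropped:

```typescript
// Hypothetical accumulator for audio deltas; the real
// aggregateChatCompletionChunks internals may differ.
interface AudioDelta {
  id?: string;
  transcript?: string;
  data?: string;
  expires_at?: number;
}

interface AggregatedAudio {
  id?: string;
  transcript: string;
  expires_at?: number;
}

function aggregateAudioDelta(
  acc: AggregatedAudio | undefined,
  delta: AudioDelta | undefined,
): AggregatedAudio | undefined {
  if (!delta) return acc;
  const out: AggregatedAudio = acc ?? { transcript: "" };
  if (delta.id !== undefined) out.id = delta.id;
  if (delta.expires_at !== undefined) out.expires_at = delta.expires_at;
  // Concatenate transcript chunks; deliberately ignore base64 `data`
  // to keep spans small.
  if (delta.transcript !== undefined) out.transcript += delta.transcript;
  return out;
}

// Usage: folding three streamed deltas into one transcript.
let audio: AggregatedAudio | undefined;
for (const d of [
  { id: "audio_abc", transcript: "Hel" },
  { transcript: "lo" },
  { expires_at: 1735689600 },
]) {
  audio = aggregateAudioDelta(audio, d);
}
console.log(audio?.transcript); // "Hello"
```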

Upstream reference

  • OpenAI audio output in chat completions: https://platform.openai.com/docs/guides/audio
  • Streaming audio delta fields: audio.id, audio.data (base64 chunks), audio.transcript (text chunks), audio.expires_at
  • Supported on gpt-4o-audio-preview and gpt-4o-mini-audio-preview models.
  • Audio token metrics are documented in the usage object as prompt_tokens_details.audio_tokens and completion_tokens_details.audio_tokens.
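For reference, a chat completion `usage` payload with audio follows the documented shape; token counts here are invented for illustration:

```typescript
// Sample usage object shaped per the OpenAI docs cited above
// (values are invented).
const usage = {
  prompt_tokens: 50,
  completion_tokens: 80,
  total_tokens: 130,
  prompt_tokens_details: { audio_tokens: 20, cached_tokens: 0 },
  completion_tokens_details: { audio_tokens: 60, reasoning_tokens: 0 },
};
console.log(usage.prompt_tokens_details.audio_tokens); // 20
```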

Braintrust docs status

The Braintrust OpenAI integration page documents chat completions but does not mention the audio modality.

What already works

Audio token metrics are properly extracted. In js/src/openai-utils.ts, the extractOpenAIMetrics function maps input_tokens_details.audio_tokens → prompt_audio_tokens and output_tokens_details.audio_tokens → completion_audio_tokens. This is confirmed by unit tests in openai-plugin.test.ts.
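That mapping can be sketched as follows. The field names mirror this issue's description of extractOpenAIMetrics, not verified source, and the real function handles many more fields:

```typescript
// Sketch of the audio token mapping described above; UsageLike and
// extractAudioTokenMetrics are illustrative names, not the real API.
interface UsageLike {
  input_tokens_details?: { audio_tokens?: number };
  output_tokens_details?: { audio_tokens?: number };
}

function extractAudioTokenMetrics(usage: UsageLike): Record<string, number> {
  const metrics: Record<string, number> = {};
  const promptAudio = usage.input_tokens_details?.audio_tokens;
  const completionAudio = usage.output_tokens_details?.audio_tokens;
  if (promptAudio !== undefined) metrics.prompt_audio_tokens = promptAudio;
  if (completionAudio !== undefined) {
    metrics.completion_audio_tokens = completionAudio;
  }
  return metrics;
}

const metrics = extractAudioTokenMetrics({
  input_tokens_details: { audio_tokens: 12 },
  output_tokens_details: { audio_tokens: 34 },
});
console.log(metrics.prompt_audio_tokens, metrics.completion_audio_tokens); // 12 34
```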

Local files inspected

  • js/src/instrumentation/plugins/openai-plugin.ts — aggregateChatCompletionChunks function
  • js/src/vendor-sdk-types/openai-common.ts — OpenAIChatDelta interface
  • js/src/openai-utils.ts — extractOpenAIMetrics (audio token metrics)
  • js/src/wrappers/oai.ts — wrapper proxy
  • e2e/scenarios/openai-instrumentation/scenario.impl.mjs — e2e test scenarios

Labels

bot-automation — Issues generated by an agent automation
