TanStack · 8times4 · May 27, 2026 · May 28, 2026
diff --git a/.changeset/openai-transcription-diarization.md b/.changeset/openai-transcription-diarization.md
@@ -0,0 +1,7 @@
+---
+'@tanstack/ai': minor
+'@tanstack/ai-client': minor
+'@tanstack/ai-openai': minor
+---
+
+Add OpenAI transcription diarization support with `diarized_json` output, speaker-labeled segments, diarization model validation, chunking strategy options, and docs.
diff --git a/docs/adapters/openai.md b/docs/adapters/openai.md
@@ -294,17 +294,43 @@ console.log(result.text); // Transcribed text
 const result = await generateTranscription({
   adapter: openaiTranscription("whisper-1"),
   audio: audioFile,
+  responseFormat: "verbose_json",
+  prompt: "Technical terms: API, SDK",
   modelOptions: {
-    response_format: "verbose_json", // Get timestamps
     temperature: 0,
-    prompt: "Technical terms: API, SDK",
+    timestamp_granularities: ["word", "segment"],
   },
 });
 
 // Access segments with timestamps
 console.log(result.segments);
 ```
 
+### Speaker Diarization
+
+Use `gpt-4o-transcribe-diarize` for speaker-labeled transcripts:
+
+```typescript
+const result = await generateTranscription({
+  adapter: openaiTranscription("gpt-4o-transcribe-diarize"),
+  audio: meetingAudioFile,
+  modelOptions: {
+    chunking_strategy: "auto",
+    known_speaker_names: ["agent", "customer"],
+    known_speaker_references: [
+      "data:audio/wav;base64,...",
+      "data:audio/wav;base64,...",
+    ],
+  },
+});
+
+for (const segment of result.segments ?? []) {
+  console.log(segment.speaker, segment.start, segment.end, segment.text);
+}
+```
+
+`gpt-4o-transcribe-diarize` defaults to `responseFormat: "diarized_json"` and `chunking_strategy: "auto"`. OpenAI does not support `prompt`, `include`, or `timestamp_granularities` with diarized transcription.
+
 ## Environment Variables
 
 Set your API key in environment variables:
@@ -353,7 +379,7 @@ Creates an OpenAI text-to-speech adapter.
 
 ### `openaiTranscription(model, config?)` / `createOpenaiTranscription(model, apiKey, config?)`
 
-Creates an OpenAI transcription adapter (Whisper).
+Creates an OpenAI transcription adapter for Whisper, GPT-4o transcription, and GPT-4o diarized transcription models.
 
 ### `openaiVideo(model, config?)` / `createOpenaiVideo(model, apiKey, config?)`
 

diff --git a/docs/comparison/vercel-ai-sdk.md b/docs/comparison/vercel-ai-sdk.md
@@ -389,7 +389,7 @@ const result = await generateSpeech({
 })
 ```
 
-**Transcription** - `generateTranscription()` supports 5 output formats (json, text, srt, verbose_json, vtt), word-level timestamps with confidence scores, and speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.
+**Transcription** - `generateTranscription()` supports 6 output formats (json, text, srt, verbose_json, vtt, diarized_json), word-level timestamps with confidence scores, and speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.
 
 ```ts
 import { generateTranscription } from '@tanstack/ai'

diff --git a/docs/media/generation-hooks.md b/docs/media/generation-hooks.md
@@ -214,7 +214,7 @@ The `generate` function accepts a `TranscriptionGenerateInput`:
 | `audio` | `string \| File \| Blob` | Audio data -- base64 string, File, or Blob (required) |
 | `language` | `string` | Language in ISO-639-1 format (e.g., `"en"`) |
 | `prompt` | `string` | Optional prompt to guide the transcription |
-| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt'` | Output format |
+| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt' \| 'diarized_json'` | Output format |
 | `modelOptions` | `Record<string, any>` | Model-specific options |
 
 ## useSummarize

diff --git a/docs/media/transcription.md b/docs/media/transcription.md
@@ -2,7 +2,7 @@
 title: Transcription
 id: transcription
 order: 4
-description: "Transcribe audio to text with OpenAI Whisper and GPT-4o-transcribe via TanStack AI's generateTranscription() API."
+description: "Transcribe audio to text with OpenAI Whisper and GPT-4o transcription models, including speaker diarization, via TanStack AI's generateTranscription() API."
 keywords:
   - tanstack ai
   - transcription
@@ -22,7 +22,7 @@ TanStack AI provides support for audio transcription (speech-to-text) through de
 Audio transcription is handled by transcription adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI.
 
 Currently supported:
-- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe
+- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe, GPT-4o-transcribe-diarize
 - **fal.ai**: Whisper, Wizper, speech-to-text turbo, ElevenLabs speech-to-text
 
 ## Basic Usage
@@ -107,6 +107,8 @@ for (const segment of result.segments ?? []) {
 |--------|------|-------------|
 | `audio` | `File \| string` | Audio data (File object or base64 string) - required |
 | `language` | `string` | Language code (e.g., "en", "es", "fr") |
+| `prompt` | `string` | Optional prompt to guide transcription style or terms. Not supported with `gpt-4o-transcribe-diarize`. |
+| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt' \| 'diarized_json'` | Output format |
 
 ### Supported Languages
 
@@ -135,20 +137,23 @@ Whisper supports many languages. Common codes include:
 const result = await generateTranscription({
   adapter: openaiTranscription('whisper-1'),
   audio: audioFile,
+  responseFormat: 'verbose_json',
+  prompt: 'Technical terms: API, SDK, CLI',
   modelOptions: {
-    response_format: 'verbose_json', // Get detailed output with timestamps
     temperature: 0, // Lower = more deterministic
-    prompt: 'Technical terms: API, SDK, CLI', // Guide transcription
+    timestamp_granularities: ['word', 'segment'],
   },
 })
 ```
 
 | Option | Type | Description |
 |--------|------|-------------|
-| `response_format` | `string` | Output format: "json", "text", "srt", "verbose_json", "vtt" |
 | `temperature` | `number` | Sampling temperature (0 to 1) |
-| `prompt` | `string` | Optional text to guide transcription style |
-| `include` | `string[]` | Timestamp granularity: ["word"], ["segment"], or both |
+| `include` | `string[]` | Additional response data such as `logprobs`; only available with `json` responses on supported GPT-4o transcription models |
+| `timestamp_granularities` | `('word' \| 'segment')[]` | Timestamp detail for `whisper-1` with `responseFormat: 'verbose_json'` |
+| `chunking_strategy` | `'auto' \| { type: 'server_vad', ... } \| null` | Audio chunking strategy for `gpt-4o-transcribe-diarize`; required by OpenAI for diarization inputs longer than 30 seconds |
+| `known_speaker_names` | `string[]` | Up to four speaker labels for diarization |
+| `known_speaker_references` | `string[]` | Audio samples as data URLs, 2 to 10 seconds each, corresponding to the entries in `known_speaker_names` |
 
 ### Response Formats
 
@@ -159,6 +164,32 @@ const result = await generateTranscription({
 | `srt` | SubRip subtitle format |
 | `verbose_json` | Detailed JSON with timestamps and segments |
 | `vtt` | WebVTT subtitle format |
+| `diarized_json` | JSON with speaker-labeled segments. Only supported by `gpt-4o-transcribe-diarize`. |
+
+### Speaker Diarization
+
+Use `gpt-4o-transcribe-diarize` when you need speaker labels. TanStack AI defaults this model to `responseFormat: 'diarized_json'` and sends `chunking_strategy: 'auto'` unless you provide a chunking strategy yourself.
+
+```typescript
+const result = await generateTranscription({
+  adapter: openaiTranscription('gpt-4o-transcribe-diarize'),
+  audio: meetingAudioFile,
+  modelOptions: {
+    chunking_strategy: 'auto',
+    known_speaker_names: ['agent', 'customer'],
+    known_speaker_references: [
+      'data:audio/wav;base64,...',
+      'data:audio/wav;base64,...',
+    ],
+  },
+})
+
+for (const segment of result.segments ?? []) {
+  console.log(segment.speaker, segment.start, segment.end, segment.text)
+}
+```
+
+OpenAI accepts up to four known speaker references. The diarization model does not support `prompt`, `include`, or `timestamp_granularities`; the adapter rejects those combinations before making the API request.
 
 ## Response Format
 
@@ -172,15 +203,17 @@ interface TranscriptionResult {
   language?: string    // Detected/specified language
   duration?: number    // Audio duration in seconds
   segments?: Array<{   // Timestamped segments
+    id: number
     start: number      // Start time in seconds
     end: number        // End time in seconds
     text: string       // Segment text
-    words?: Array<{    // Word-level timestamps
-      word: string
-      start: number
-      end: number
-      confidence?: number
-    }>
+    confidence?: number
+    speaker?: string   // Present for diarized output
+  }>
+  words?: Array<{      // Word-level timestamps
+    word: string
+    start: number
+    end: number
   }>
 }
 ```
@@ -208,9 +241,9 @@ async function transcribeAudio(filepath: string) {
     adapter: openaiTranscription('whisper-1'),
     audio: audioFile,
     language: 'en',
+    responseFormat: 'verbose_json',
     modelOptions: {
-      response_format: 'verbose_json',
-      include: ['segment', 'word'],
+      timestamp_granularities: ['word', 'segment'],
     },
   })
 
@@ -540,5 +573,6 @@ const adapter = createOpenaiTranscription('your-openai-api-key')
 
 5. **Prompting**: Use the `prompt` option to provide context or expected vocabulary (e.g., technical terms, names).
 
-6. **Timestamps**: Request `verbose_json` format and enable `include: ['word', 'segment']` when you need timing information for captions or synchronization.
+6. **Timestamps**: Request `responseFormat: 'verbose_json'` and set `modelOptions.timestamp_granularities` when you need timing information for captions or synchronization.
 
+7. **Diarization**: Use `gpt-4o-transcribe-diarize` with `diarized_json` output for multi-speaker audio. Keep `chunking_strategy: 'auto'` unless you need custom VAD tuning.
diff --git a/docs/reference/interfaces/TranscriptionOptions.md b/docs/reference/interfaces/TranscriptionOptions.md
@@ -95,7 +95,7 @@ An optional prompt to guide the transcription
 ### responseFormat?
 
 ```ts
-optional responseFormat: "text" | "json" | "srt" | "verbose_json" | "vtt";
+optional responseFormat: "text" | "json" | "srt" | "verbose_json" | "vtt" | "diarized_json";
 ```
 
 Defined in: [packages/ai/src/types.ts:1693](https://github.com/TanStack/ai/blob/main/packages/ai/src/types.ts#L1693)

diff --git a/packages/ai-client/src/generation-types.ts b/packages/ai-client/src/generation-types.ts
@@ -265,7 +265,13 @@ export interface TranscriptionGenerateInput {
   /** An optional prompt to guide the transcription */
   prompt?: string
   /** The format of the transcription output */
-  responseFormat?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt'
+  responseFormat?:
+    | 'json'
+    | 'text'
+    | 'srt'
+    | 'verbose_json'
+    | 'vtt'
+    | 'diarized_json'
   /** Model-specific options */
   modelOptions?: Record<string, any>
 }
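
The docs in this change say the adapter rejects `prompt`, `include`, and `timestamp_granularities` for `gpt-4o-transcribe-diarize` before making the API request, and that `diarized_json` is only supported by that model. A minimal sketch of what such a pre-flight check might look like follows. The names `checkDiarizationInput`, `TranscriptionInput`, and `DIARIZE_MODEL` are hypothetical and are not taken from the adapter's source. The limit of four speakers comes from the docs above.

```typescript
// Hypothetical sketch; not the adapter's actual implementation.
type ResponseFormat =
  | 'json'
  | 'text'
  | 'srt'
  | 'verbose_json'
  | 'vtt'
  | 'diarized_json'

// Assumed input shape, loosely mirroring the options documented in this diff.
interface TranscriptionInput {
  model: string
  prompt?: string
  responseFormat?: ResponseFormat
  modelOptions?: {
    include?: string[]
    timestamp_granularities?: Array<'word' | 'segment'>
    chunking_strategy?: unknown
    known_speaker_names?: string[]
    known_speaker_references?: string[]
  }
}

const DIARIZE_MODEL = 'gpt-4o-transcribe-diarize'

// Returns a list of human-readable problems; an empty list means the input looks valid.
function checkDiarizationInput(input: TranscriptionInput): string[] {
  const errors: string[] = []
  const opts = input.modelOptions ?? {}

  if (input.model !== DIARIZE_MODEL) {
    // diarized_json is documented as supported only by the diarization model.
    if (input.responseFormat === 'diarized_json') {
      errors.push(`responseFormat "diarized_json" requires ${DIARIZE_MODEL}`)
    }
    return errors
  }

  // Options the docs list as unsupported with diarized transcription.
  if (input.prompt !== undefined) {
    errors.push('prompt is not supported with diarized transcription')
  }
  if (opts.include !== undefined) {
    errors.push('include is not supported with diarized transcription')
  }
  if (opts.timestamp_granularities !== undefined) {
    errors.push(
      'timestamp_granularities is not supported with diarized transcription',
    )
  }

  // The docs mention a limit of four known speakers.
  if ((opts.known_speaker_names ?? []).length > 4) {
    errors.push('known_speaker_names accepts at most four entries')
  }
  if ((opts.known_speaker_references ?? []).length > 4) {
    errors.push('known_speaker_references accepts at most four entries')
  }

  return errors
}
```

Collecting every problem in a list, rather than throwing on the first one, lets a caller report all invalid options at once. A real adapter may well choose to throw instead.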