7 changes: 7 additions & 0 deletions .changeset/openai-transcription-diarization.md
@@ -0,0 +1,7 @@
---
'@tanstack/ai': minor
'@tanstack/ai-client': minor
'@tanstack/ai-openai': minor
---

Add OpenAI transcription diarization support with `diarized_json` output, speaker-labeled segments, diarization model validation, chunking strategy options, and docs.
32 changes: 29 additions & 3 deletions docs/adapters/openai.md
@@ -294,17 +294,43 @@ console.log(result.text); // Transcribed text
const result = await generateTranscription({
adapter: openaiTranscription("whisper-1"),
audio: audioFile,
responseFormat: "verbose_json",
prompt: "Technical terms: API, SDK",
modelOptions: {
response_format: "verbose_json", // Get timestamps
temperature: 0,
prompt: "Technical terms: API, SDK",
timestamp_granularities: ["word", "segment"],
},
});

// Access segments with timestamps
console.log(result.segments);
```

### Speaker Diarization

Use `gpt-4o-transcribe-diarize` for speaker-labeled transcripts:

```typescript
const result = await generateTranscription({
adapter: openaiTranscription("gpt-4o-transcribe-diarize"),
audio: meetingAudioFile,
modelOptions: {
chunking_strategy: "auto",
known_speaker_names: ["agent", "customer"],
known_speaker_references: [
"data:audio/wav;base64,...",
"data:audio/wav;base64,...",
],
},
});

for (const segment of result.segments ?? []) {
console.log(segment.speaker, segment.start, segment.end, segment.text);
}
```

`gpt-4o-transcribe-diarize` defaults to `responseFormat: "diarized_json"` and `chunking_strategy: "auto"`. OpenAI does not support `prompt`, `include`, or `timestamp_granularities` with diarized transcription.
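
For readable transcripts, consecutive segments from the same speaker can be merged into turns. The following is a minimal sketch and not part of the adapter: `DiarizedSegment`, `SpeakerTurn`, and `toSpeakerTurns` are hypothetical names, and the segment shape is assumed from the fields logged in the loop above.

```typescript
// Hypothetical segment shape, assumed from the fields logged above.
interface DiarizedSegment {
  speaker?: string;
  start: number;
  end: number;
  text: string;
}

// Hypothetical result shape: one entry per uninterrupted stretch of speech.
interface SpeakerTurn {
  speaker: string;
  start: number;
  end: number;
  text: string;
}

// Merge adjacent segments that share a speaker label into a single turn.
function toSpeakerTurns(segments: DiarizedSegment[]): SpeakerTurn[] {
  const turns: SpeakerTurn[] = [];
  for (const segment of segments) {
    // Segments without a label are grouped under a placeholder name.
    const speaker = segment.speaker ?? "unknown";
    const text = segment.text.trim();
    const last = turns[turns.length - 1];
    if (last && last.speaker === speaker) {
      // Same speaker keeps talking: extend the current turn.
      last.end = segment.end;
      last.text = `${last.text} ${text}`;
    } else {
      turns.push({ speaker, start: segment.start, end: segment.end, text });
    }
  }
  return turns;
}
```

Passing `result.segments ?? []` to `toSpeakerTurns` would yield one entry per speaker change, assuming the segments arrive in chronological order.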

## Environment Variables

Set your API key in environment variables:
@@ -353,7 +379,7 @@ Creates an OpenAI text-to-speech adapter.

### `openaiTranscription(model, config?)` / `createOpenaiTranscription(model, apiKey, config?)`

Creates an OpenAI transcription adapter (Whisper).
Creates an OpenAI transcription adapter for Whisper, GPT-4o transcription, and GPT-4o diarized transcription models.

### `openaiVideo(model, config?)` / `createOpenaiVideo(model, apiKey, config?)`

2 changes: 1 addition & 1 deletion docs/comparison/vercel-ai-sdk.md
@@ -389,7 +389,7 @@ const result = await generateSpeech({
})
```

**Transcription** - `generateTranscription()` supports 5 output formats (json, text, srt, verbose_json, vtt), word-level timestamps with confidence scores, and speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.
**Transcription** - `generateTranscription()` supports 6 output formats (json, text, srt, verbose_json, vtt, diarized_json), word-level timestamps with confidence scores, and speaker diarization via OpenAI's `gpt-4o-transcribe-diarize` model.

```ts
import { generateTranscription } from '@tanstack/ai'
2 changes: 1 addition & 1 deletion docs/media/generation-hooks.md
@@ -214,7 +214,7 @@ The `generate` function accepts a `TranscriptionGenerateInput`:
| `audio` | `string \| File \| Blob` | Audio data -- base64 string, File, or Blob (required) |
| `language` | `string` | Language in ISO-639-1 format (e.g., `"en"`) |
| `prompt` | `string` | Optional prompt to guide the transcription |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt'` | Output format |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt' \| 'diarized_json'` | Output format |
| `modelOptions` | `Record<string, any>` | Model-specific options |

## useSummarize
66 changes: 50 additions & 16 deletions docs/media/transcription.md
@@ -2,7 +2,7 @@
title: Transcription
id: transcription
order: 4
description: "Transcribe audio to text with OpenAI Whisper and GPT-4o-transcribe via TanStack AI's generateTranscription() API."
description: "Transcribe audio to text with OpenAI Whisper and GPT-4o transcription models, including speaker diarization, via TanStack AI's generateTranscription() API."
keywords:
- tanstack ai
- transcription
@@ -22,7 +22,7 @@ TanStack AI provides support for audio transcription (speech-to-text) through de
Audio transcription is handled by transcription adapters that follow the same tree-shakeable architecture as other adapters in TanStack AI.

Currently supported:
- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe
- **OpenAI**: Whisper-1, GPT-4o-transcribe, GPT-4o-mini-transcribe, GPT-4o-transcribe-diarize
- **fal.ai**: Whisper, Wizper, speech-to-text turbo, ElevenLabs speech-to-text

## Basic Usage
@@ -107,6 +107,8 @@ for (const segment of result.segments ?? []) {
|--------|------|-------------|
| `audio` | `File \| string` | Audio data (File object or base64 string) - required |
| `language` | `string` | Language code (e.g., "en", "es", "fr") |
| `prompt` | `string` | Optional prompt to guide transcription style or terms. Not supported with `gpt-4o-transcribe-diarize`. |
| `responseFormat` | `'json' \| 'text' \| 'srt' \| 'verbose_json' \| 'vtt' \| 'diarized_json'` | Output format |

### Supported Languages

@@ -135,20 +137,23 @@ Whisper supports many languages. Common codes include:
const result = await generateTranscription({
adapter: openaiTranscription('whisper-1'),
audio: audioFile,
responseFormat: 'verbose_json',
prompt: 'Technical terms: API, SDK, CLI',
modelOptions: {
response_format: 'verbose_json', // Get detailed output with timestamps
temperature: 0, // Lower = more deterministic
prompt: 'Technical terms: API, SDK, CLI', // Guide transcription
timestamp_granularities: ['word', 'segment'],
},
})
```

| Option | Type | Description |
|--------|------|-------------|
| `response_format` | `string` | Output format: "json", "text", "srt", "verbose_json", "vtt" |
| `temperature` | `number` | Sampling temperature (0 to 1) |
| `prompt` | `string` | Optional text to guide transcription style |
| `include` | `string[]` | Timestamp granularity: ["word"], ["segment"], or both |
| `include` | `string[]` | Additional response data such as `logprobs`; only available with `json` responses on supported GPT-4o transcription models |
| `timestamp_granularities` | `('word' \| 'segment')[]` | Timestamp detail for `whisper-1` with `responseFormat: 'verbose_json'` |
| `chunking_strategy` | `'auto' \| { type: 'server_vad', ... } \| null` | Audio chunking strategy for `gpt-4o-transcribe-diarize`; required by OpenAI for diarization inputs longer than 30 seconds |
| `known_speaker_names` | `string[]` | Up to four speaker labels for diarization |
| `known_speaker_references` | `string[]` | 2-10 second data URL audio samples matching `known_speaker_names` |
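
The two speaker options are parallel arrays, so it can help to build them from one list. The sketch below is an illustration rather than part of the adapter: `KnownSpeaker`, `toDataUrl`, and `buildKnownSpeakerOptions` are hypothetical names, the samples are assumed to be WAV bytes already loaded in memory, and the global `btoa` is assumed to be available (browsers and recent Node.js). It does not check the 2-10 second sample length.

```typescript
// Hypothetical input shape: one label plus one short audio sample per speaker.
interface KnownSpeaker {
  name: string
  wavBytes: Uint8Array
}

// Encode raw bytes as a base64 data URL.
function toDataUrl(bytes: Uint8Array, mimeType = 'audio/wav'): string {
  let binary = ''
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i])
  }
  return `data:${mimeType};base64,${btoa(binary)}`
}

// Build matching name and reference arrays, enforcing the four-speaker limit.
function buildKnownSpeakerOptions(speakers: KnownSpeaker[]) {
  if (speakers.length > 4) {
    throw new Error('At most four known speakers are supported')
  }
  return {
    known_speaker_names: speakers.map((s) => s.name),
    known_speaker_references: speakers.map((s) => toDataUrl(s.wavBytes)),
  }
}
```

The returned object could be spread into `modelOptions`, which keeps each name aligned with its reference sample by construction.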

### Response Formats

@@ -159,6 +164,32 @@ const result = await generateTranscription({
| `srt` | SubRip subtitle format |
| `verbose_json` | Detailed JSON with timestamps and segments |
| `vtt` | WebVTT subtitle format |
| `diarized_json` | JSON with speaker-labeled segments. Only supported by `gpt-4o-transcribe-diarize`. |

### Speaker Diarization

Use `gpt-4o-transcribe-diarize` when you need speaker labels. TanStack AI defaults this model to `responseFormat: 'diarized_json'` and sends `chunking_strategy: 'auto'` unless you provide a chunking strategy yourself.

```typescript
const result = await generateTranscription({
adapter: openaiTranscription('gpt-4o-transcribe-diarize'),
audio: meetingAudioFile,
modelOptions: {
chunking_strategy: 'auto',
known_speaker_names: ['agent', 'customer'],
known_speaker_references: [
'data:audio/wav;base64,...',
'data:audio/wav;base64,...',
],
},
})

for (const segment of result.segments ?? []) {
console.log(segment.speaker, segment.start, segment.end, segment.text)
}
```

OpenAI accepts up to four known speaker references. The diarization model does not support `prompt`, `include`, or `timestamp_granularities`; the adapter rejects those combinations before making the API request.
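
The defaults and rejections described above could look roughly like the following. This is a sketch under stated assumptions, not the adapter's actual implementation: `TranscriptionRequestSketch` and `prepareDiarizationRequest` are hypothetical names, and only the fields relevant to the check are modeled.

```typescript
// Hypothetical request shape; only the fields relevant to the check are modeled.
interface TranscriptionRequestSketch {
  model: string
  prompt?: string
  responseFormat?: string
  modelOptions?: {
    include?: string[]
    timestamp_granularities?: string[]
    chunking_strategy?: unknown
  }
}

const DIARIZE_MODEL = 'gpt-4o-transcribe-diarize'

// Reject unsupported options, then fill in the diarization defaults.
function prepareDiarizationRequest(
  request: TranscriptionRequestSketch,
): TranscriptionRequestSketch {
  // Other models pass through untouched in this sketch.
  if (request.model !== DIARIZE_MODEL) return request

  const options = request.modelOptions ?? {}
  const unsupported: string[] = []
  if (request.prompt !== undefined) unsupported.push('prompt')
  if (options.include !== undefined) unsupported.push('include')
  if (options.timestamp_granularities !== undefined) {
    unsupported.push('timestamp_granularities')
  }
  if (unsupported.length > 0) {
    throw new Error(`${DIARIZE_MODEL} does not support: ${unsupported.join(', ')}`)
  }

  return {
    ...request,
    responseFormat: request.responseFormat ?? 'diarized_json',
    modelOptions: {
      ...options,
      // Only default when the caller left the strategy unset.
      chunking_strategy:
        options.chunking_strategy === undefined ? 'auto' : options.chunking_strategy,
    },
  }
}
```

Failing before the network call gives callers a local, descriptive error instead of an API rejection after the audio upload.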

## Response Format

@@ -172,15 +203,17 @@ interface TranscriptionResult {
language?: string // Detected/specified language
duration?: number // Audio duration in seconds
segments?: Array<{ // Timestamped segments
id: number
start: number // Start time in seconds
end: number // End time in seconds
text: string // Segment text
words?: Array<{ // Word-level timestamps
word: string
start: number
end: number
confidence?: number
}>
confidence?: number
speaker?: string // Present for diarized output
}>
words?: Array<{ // Word-level timestamps
word: string
start: number
end: number
}>
}
```
@@ -208,9 +241,9 @@ async function transcribeAudio(filepath: string) {
adapter: openaiTranscription('whisper-1'),
audio: audioFile,
language: 'en',
responseFormat: 'verbose_json',
modelOptions: {
response_format: 'verbose_json',
include: ['segment', 'word'],
timestamp_granularities: ['word', 'segment'],
},
})

@@ -540,5 +573,6 @@ const adapter = createOpenaiTranscription('your-openai-api-key')

5. **Prompting**: Use the `prompt` option to provide context or expected vocabulary (e.g., technical terms, names).

6. **Timestamps**: Request `verbose_json` format and enable `include: ['word', 'segment']` when you need timing information for captions or synchronization.
6. **Timestamps**: Request `responseFormat: 'verbose_json'` and set `modelOptions.timestamp_granularities` when you need timing information for captions or synchronization.

7. **Diarization**: Use `gpt-4o-transcribe-diarize` with `diarized_json` output for multi-speaker audio. Keep `chunking_strategy: 'auto'` unless you need custom VAD tuning.
2 changes: 1 addition & 1 deletion docs/reference/interfaces/TranscriptionOptions.md
@@ -95,7 +95,7 @@ An optional prompt to guide the transcription
### responseFormat?

```ts
optional responseFormat: "text" | "json" | "srt" | "verbose_json" | "vtt";
optional responseFormat: "text" | "json" | "srt" | "verbose_json" | "vtt" | "diarized_json";
```

Defined in: [packages/ai/src/types.ts:1693](https://github.com/TanStack/ai/blob/main/packages/ai/src/types.ts#L1693)
8 changes: 7 additions & 1 deletion packages/ai-client/src/generation-types.ts
@@ -265,7 +265,13 @@ export interface TranscriptionGenerateInput {
/** An optional prompt to guide the transcription */
prompt?: string
/** The format of the transcription output */
responseFormat?: 'json' | 'text' | 'srt' | 'verbose_json' | 'vtt'
responseFormat?:
| 'json'
| 'text'
| 'srt'
| 'verbose_json'
| 'vtt'
| 'diarized_json'
/** Model-specific options */
modelOptions?: Record<string, any>
}