Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
40 changes: 40 additions & 0 deletions Documentation/Benchmarks.md
Original file line number Diff line number Diff line change
Expand Up @@ -745,6 +745,46 @@ AVERAGE 31.7 21.5 0.5 9.7 - 126.7
======================================================================
```

### Offline throughput (M5 Pro)

A single fused offline graph (`mel[1,128,3072] → speaker_preds`, 30.72 s window) exported via the
NeMo offline path — one CoreML call per window, no streaming state. ComputeUnit.ALL, median of 120
runs after 12 warmup:

| variant | model-exec (mel → preds) | RTFx |
|---|---:|---:|
| **fp16** | **10.65 ms** | **2884×** |
| 6-bit palettized | 10.93 ms | 2809× |

End-to-end incl. mel (fp16): 12.49 ms · 2459×. One fused GPU graph — no per-call dispatch or
ANE→GPU handoff. Numerical parity vs the PyTorch reference: 100% speaker-argmax agreement (fp16).

### Offline diarizer (whole-file, Swift)

`OfflineSortformerDiarizer` runs the fused offline model end-to-end: mel extraction → fused graph
per 30.72 s window (no streaming state) → timeline. Run with `fluidaudio sortformer <file> --offline`
(`--palettized` for the 6-bit set). Models: `FluidInference/diar-streaming-sortformer-coreml`
`v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc`; NeMo-reference parity = 100% speaker-argmax
(fp16), 96.4% (6-bit palettized).

**Throughput** is excellent — ~1000–1400× RTFx (one fused call per window, no per-chunk state). Voice
detection matches streaming exactly (identical Miss/FA on AMI).

**Quality caveat — not for long multi-speaker audio.** Each 30.72 s window diarizes independently
with no speaker cache, so on long meetings with several speakers it produces large speaker confusion.
AMI-SDM (collar 0.25, full test set), same harness:

| | DER | Miss | FA | Speaker confusion | RTFx |
|---|---|---|---|---|---|
| Streaming (`highContextV2_1`) | **26.4%** | 23.4 | 0.6 | **2.4** | 835× |
| Offline (whole-file) | 56.7% | 23.4 | 0.6 | **32.7** | 1418× |

Detection is identical; the entire gap is speaker confusion the streaming `spkcache` avoids by
construction (accumulating speaker profiles across the whole history). Cross-window re-stitching does
not recover it — the confusion is generated *within* each window — so **use offline for short clips
(≤ ~30 s), few-speaker audio, or throughput-bound batch jobs, and use the streaming variants for
accurate long-form multi-speaker diarization.**

## LS-EEND Streaming Diarization
A research prototype from Westlake University for streaming speaker diarization.

Expand Down
89 changes: 83 additions & 6 deletions Documentation/Diarization/Sortformer.md
Original file line number Diff line number Diff line change
Expand Up @@ -415,15 +415,92 @@ let config = DiarizerTimelineConfig(

## Model Variants

Three CoreML models are available on HuggingFace:
CoreML models live on HuggingFace under
[FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml),
in `v3/fp16/` (default) and `v3/palettized/` (see [Precision](#precision-fp16-vs-palettized)).
The `v3/` set is the BNNS-fixed rebuild — the older root-level models hit a
"tensor as both input and output" graph-compile crash on newer BNNS ([#726](https://github.com/FluidInference/FluidAudio/issues/726)).

| Variant | File | Config |
|---------|------|--------|
| Default | `Sortformer.mlmodelc` | `SortformerConfig.default` |
| Balanced | `SortformerNvidiaLow.mlmodelc` | `SortformerConfig.balancedV2_1` |
| High Context | `SortformerNvidiaHigh.mlmodelc` | `SortformerConfig.highContextV2_1` |
| Variant | Config | File (under `v3/<precision>/`) | Output latency |
|---------|--------|-------------------------------|----------------|
| Fast (v2.1) | `.fastV2_1` | `Sortformer_v2.1.mlmodelc` | ~1.04 s |
| Balanced (v2.1) | `.balancedV2_1` | `SortformerNvidiaLow_v2.1.mlmodelc` | ~1.5 s |
| High Context (v2.1) | `.highContextV2_1` | `SortformerNvidiaHigh_v2.1.mlmodelc` | ~3.5 s |
| Efficient (v2.1) | `.efficientV2_1` | `SortformerEfficient_v2.1.mlmodelc` | ~2.0 s (highest throughput) |

(The `v2` weight variants — `.fastV2`, `.balancedV2`, `.highContextV2` — ship alongside each `v2.1`.)

**Important:** Each model has baked-in static shapes. You must use the matching configuration.
The diarizer logs a loud config-mismatch error at `initialize()` if the `SortformerConfig` does
not match the streaming parameters embedded in the model (issue [#726](https://github.com/FluidInference/FluidAudio/issues/726)).

### Precision: fp16 vs palettized

Each variant is built at two weight precisions, selected via `SortformerConfig.precision`:

| Precision | Head weights | `highContextV2_1` RAM | DER impact | When |
|-----------|--------------|----------------------|------------|------|
| `.fp16` (default) | full | ~2.4 GB | baseline | Best accuracy; Apple Silicon Macs, recent iPhones/iPads |
| `.palettized` | 6-bit k-means LUT | ~330 MB | +0.9 pp avg (streaming); larger on high-context | RAM-constrained / older devices |

```swift
var config = SortformerConfig.highContextV2_1
config.precision = .palettized // ~2.4 GB -> ~330 MB
```

Palettization is **opt-in**, not the default, because 6-bit perturbs the embeddings and the
streaming speaker-cache cascades that drift over time (worse on the high-context variant). For
offline/batch or RAM-limited devices it's a good trade; for best streaming DER keep `.fp16`.

**Old-device compute units.** The ~2.4 GB fp16 high-context head triggers a multi-minute ANE
program-compile hang on RAM-constrained devices (A14, ~4 GB). `recommendedComputeUnits(for:)`
auto-falls-back those variants to `.cpuOnly` on <8 GB devices; everything else (including the
~330 MB palettized high-context head, which loads fine on ANE) keeps `.all`. Pass `computeUnits:`
explicitly to override. On A14 the recommended path is `precision = .palettized`.

### Benchmarks

Streaming DER/RTFx and offline-throughput numbers live in
[Documentation/Benchmarks.md](../Benchmarks.md#sortformer-streaming-diarization).

## Offline (whole-file) mode

When the entire audio is available up front, `OfflineSortformerDiarizer` runs the **fused offline
model** — a single graph `mel -> speaker_preds` over a fixed 30.72 s window (3072 mel → 384 output
frames) with **no streaming state** (no spkcache/FIFO threaded across calls). One CoreML call per
window makes it the fastest path for batch diarization (~2880× RTFx model-exec on M5 Pro; see
[Benchmarks](#benchmarks)). The model ships at both precisions:
`v3/fp16/SortformerOffline_v2.1.mlmodelc` and `v3/palettized/SortformerOffline_v2.1.mlmodelc`.

This differs from `SortformerDiarizer.processComplete(...)`, which runs the *streaming* model over
all chunks (threading speaker-cache state). Use the offline diarizer when you have the whole file and
want maximum throughput on short or few-speaker audio.

**Scope — short clips / few speakers / throughput.** Each 30.72 s window is diarized independently
with no speaker cache, so long multi-speaker audio accumulates large speaker confusion: on AMI-SDM
the offline path scores ~56% DER vs ~26% for the streaming `highContextV2_1` (voice detection is
identical — the gap is entirely speaker confusion the `spkcache` prevents). Cross-window re-stitching
can't recover it because the confusion is generated within each window. **For accurate long-form
multi-speaker diarization use the streaming variants; reach for offline for ≤ ~30 s clips,
few-speaker audio, or throughput-bound batch jobs.** Longer inputs are tiled into 30.72 s windows
(`overlapOutputFrames` controls the overlap) with activity-based stitching across boundaries.

```swift
let diarizer = OfflineSortformerDiarizer(config: .offlineV2_1)
try await diarizer.initializeFromHuggingFace() // or initialize(modelPath:)

let timeline = try diarizer.processComplete(audioSamples, sourceSampleRate: 16_000)
// Or load + resample a file directly:
let fileTimeline = try diarizer.processComplete(audioFileURL: audioURL)

for (index, speaker) in timeline.speakers {
for segment in speaker.finalizedSegments {
print("Speaker \(index): \(segment.startTime)s - \(segment.endTime)s")
}
}
```

CLI: `fluidaudio sortformer audio.wav --offline` (add `--palettized` for the 6-bit set).

## Usage Examples

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -123,9 +123,10 @@ struct KMeansClustering {
best = result
}
}
return best ?? clusterWithCentroids(
embeddings: embeddings, numClusters: numClusters,
maxIterations: maxIterations, seed: baseSeed)
return best
?? clusterWithCentroids(
embeddings: embeddings, numClusters: numClusters,
maxIterations: maxIterations, seed: baseSeed)
}

private static func normalizeEmbeddings(_ embeddings: [[Double]]) -> [[Double]] {
Expand Down
Loading
Loading