FluidInference · Alex-Wengg · Jun 21, 2026 · Jun 23, 2026 · Jun 24, 2026 · Jun 24, 2026
diff --git a/Documentation/Benchmarks.md b/Documentation/Benchmarks.md
@@ -745,6 +745,46 @@ AVERAGE          31.7     21.5      0.5      9.7         -     126.7
 ======================================================================
 ```
 
+### Offline throughput (M5 Pro)
+
+A single fused offline graph (`mel[1,128,3072] → speaker_preds`, 30.72 s window) exported via the
+NeMo offline path — one CoreML call per window, no streaming state. ComputeUnit.ALL, median of 120
+runs after 12 warmup runs:
+
+| variant | model-exec (mel → preds) | RTFx |
+|---|---:|---:|
+| **fp16** | **10.65 ms** | **2884×** |
+| 6-bit palettized | 10.93 ms | 2809× |
+
+End-to-end incl. mel (fp16): 12.49 ms · 2459×. One fused GPU graph — no per-call dispatch or
+ANE→GPU handoff. Numerical parity vs the PyTorch reference: 100% speaker-argmax agreement (fp16).
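+
+The RTFx figures above follow from dividing the window length by the measured latency. A minimal
+sketch of that arithmetic, assuming RTFx means seconds of audio per second of compute (the helper
+name `rtfx` is hypothetical, not part of FluidAudio):
+
+```swift
+import Foundation
+
+/// Real-time factor: seconds of audio processed per second of compute.
+/// Hypothetical helper for illustration only.
+func rtfx(audioSeconds: Double, latencyMilliseconds: Double) -> Double {
+    precondition(latencyMilliseconds > 0, "latency must be positive")
+    return audioSeconds / (latencyMilliseconds / 1000.0)
+}
+
+let windowSeconds = 30.72  // one fused offline window
+let fp16ModelExec = rtfx(audioSeconds: windowSeconds, latencyMilliseconds: 10.65)
+let fp16EndToEnd = rtfx(audioSeconds: windowSeconds, latencyMilliseconds: 12.49)
+
+print("fp16 model-exec RTFx:", fp16ModelExec)
+print("fp16 end-to-end RTFx:", fp16EndToEnd)
+```
+
+Both values agree with the table and the end-to-end figure to within rounding of the reported
+latencies.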
+
+### Offline diarizer (whole-file, Swift)
+
+`OfflineSortformerDiarizer` runs the fused offline model end-to-end: mel extraction → fused graph
+per 30.72 s window (no streaming state) → timeline. Run with `fluidaudio sortformer <file> --offline`
+(`--palettized` for the 6-bit set). Models: `FluidInference/diar-streaming-sortformer-coreml`
+`v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc`; NeMo-reference parity = 100% speaker-argmax
+(fp16), 96.4% (6-bit palettized).
+
+**Throughput:** ~1000–1400× RTFx (one fused call per window, no per-chunk state). Voice
+detection matches streaming exactly (identical Miss/FA on AMI).
+
+**Quality caveat — not for long multi-speaker audio.** Each 30.72 s window diarizes independently
+with no speaker cache, so on long meetings with several speakers it produces large speaker confusion.
+AMI-SDM (collar 0.25, full test set), same harness:
+
+| | DER (%) | Miss (%) | FA (%) | Speaker confusion (%) | RTFx |
+|---|---:|---:|---:|---:|---:|
+| Streaming (`highContextV2_1`) | **26.4** | 23.4 | 0.6 | **2.4** | 835× |
+| Offline (whole-file) | 56.7 | 23.4 | 0.6 | **32.7** | 1418× |
+
+Detection is identical; the entire gap is speaker confusion the streaming `spkcache` avoids by
+construction (accumulating speaker profiles across the whole history). Cross-window re-stitching does
+not recover it — the confusion is generated *within* each window — so **use offline for short clips
+(≤ ~30 s), few-speaker audio, or throughput-bound batch jobs, and use the streaming variants for
+accurate long-form multi-speaker diarization.**
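+
+The claim that the whole gap is speaker confusion can be checked from the table, since DER is the
+sum of miss, false alarm, and speaker confusion. A small sketch of that check (the `DERBreakdown`
+type is hypothetical and exists only for this illustration):
+
+```swift
+/// DER components in percentage points. Hypothetical type for illustration.
+struct DERBreakdown {
+    let miss: Double
+    let falseAlarm: Double
+    let confusion: Double
+
+    /// DER is the sum of the three error components.
+    var der: Double { miss + falseAlarm + confusion }
+}
+
+// Values copied from the AMI-SDM table above.
+let streaming = DERBreakdown(miss: 23.4, falseAlarm: 0.6, confusion: 2.4)
+let offline = DERBreakdown(miss: 23.4, falseAlarm: 0.6, confusion: 32.7)
+
+let derGap = offline.der - streaming.der
+let confusionGap = offline.confusion - streaming.confusion
+print("DER gap:", derGap, "pp; confusion gap:", confusionGap, "pp")
+```
+
+The two gaps are equal, so none of the DER difference comes from detection errors.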
+
 ## LS-EEND Streaming Diarization
 A research prototype from Westlake University for streaming speaker diarization.
 

diff --git a/Documentation/Diarization/Sortformer.md b/Documentation/Diarization/Sortformer.md
@@ -415,15 +415,92 @@ let config = DiarizerTimelineConfig(
 
 ## Model Variants
 
-Three CoreML models are available on HuggingFace:
+CoreML models live on HuggingFace under
+[FluidInference/diar-streaming-sortformer-coreml](https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml),
+in `v3/fp16/` (default) and `v3/palettized/` (see [Precision](#precision-fp16-vs-palettized)).
+The `v3/` set is the BNNS-fixed rebuild — the older root-level models hit a
+"tensor as both input and output" graph-compile crash on newer BNNS ([#726](https://github.com/FluidInference/FluidAudio/issues/726)).
 
-| Variant | File | Config |
-|---------|------|--------|
-| Default | `Sortformer.mlmodelc` | `SortformerConfig.default` |
-| Balanced | `SortformerNvidiaLow.mlmodelc` | `SortformerConfig.balancedV2_1` |
-| High Context | `SortformerNvidiaHigh.mlmodelc` | `SortformerConfig.highContextV2_1` |
+| Variant | Config | File (under `v3/<precision>/`) | Output latency |
+|---------|--------|-------------------------------|----------------|
+| Fast (v2.1) | `.fastV2_1` | `Sortformer_v2.1.mlmodelc` | ~1.04 s |
+| Balanced (v2.1) | `.balancedV2_1` | `SortformerNvidiaLow_v2.1.mlmodelc` | ~1.5 s |
+| High Context (v2.1) | `.highContextV2_1` | `SortformerNvidiaHigh_v2.1.mlmodelc` | ~3.5 s |
+| Efficient (v2.1) | `.efficientV2_1` | `SortformerEfficient_v2.1.mlmodelc` | ~2.0 s (highest throughput) |
+
+(The `v2` weight variants — `.fastV2`, `.balancedV2`, `.highContextV2` — ship alongside each `v2.1`.)
 
 **Important:** Each model has baked-in static shapes. You must use the matching configuration.
+The diarizer logs a loud config-mismatch error at `initialize()` if the `SortformerConfig` does
+not match the streaming parameters embedded in the model (issue [#726](https://github.com/FluidInference/FluidAudio/issues/726)).
+
+### Precision: fp16 vs palettized
+
+Each variant is built at two weight precisions, selected via `SortformerConfig.precision`:
+
+| Precision | Head weights | `highContextV2_1` RAM | DER impact | When |
+|-----------|--------------|----------------------|------------|------|
+| `.fp16` (default) | full | ~2.4 GB | baseline | Best accuracy; Apple Silicon Macs, recent iPhones/iPads |
+| `.palettized` | 6-bit k-means LUT | ~330 MB | +0.9 pp avg (streaming); larger on high-context | RAM-constrained / older devices |
+
+```swift
+var config = SortformerConfig.highContextV2_1
+config.precision = .palettized   // ~2.4 GB -> ~330 MB
+```
+
+Palettization is **opt-in**, not the default, because 6-bit quantization perturbs the embeddings
+and the streaming speaker cache compounds that drift over time (worse on the high-context variant).
+For offline/batch use or RAM-limited devices it's a good trade; for best streaming DER keep `.fp16`.
+
+**Old-device compute units.** The ~2.4 GB fp16 high-context head triggers a multi-minute ANE
+program-compile hang on RAM-constrained devices (A14, ~4 GB). `recommendedComputeUnits(for:)`
+automatically falls back to `.cpuOnly` for those variants on devices with <8 GB RAM; everything else
+(including the ~330 MB palettized high-context head, which loads fine on ANE) keeps `.all`. Pass
+`computeUnits:` explicitly to override. On A14 the recommended path is `precision = .palettized`.
+
+### Benchmarks
+
+Streaming DER/RTFx and offline-throughput numbers live in
+[Documentation/Benchmarks.md](../Benchmarks.md#sortformer-streaming-diarization).
+
+## Offline (whole-file) mode
+
+When the entire audio is available up front, `OfflineSortformerDiarizer` runs the **fused offline
+model** — a single graph `mel -> speaker_preds` over a fixed 30.72 s window (3072 mel → 384 output
+frames) with **no streaming state** (no spkcache/FIFO threaded across calls). One CoreML call per
+window makes it the fastest path for batch diarization (~2880× RTFx model-exec on M5 Pro; see
+[Benchmarks](#benchmarks)). The model ships at both precisions:
+`v3/fp16/SortformerOffline_v2.1.mlmodelc` and `v3/palettized/SortformerOffline_v2.1.mlmodelc`.
+
+This differs from `SortformerDiarizer.processComplete(...)`, which runs the *streaming* model over
+all chunks (threading speaker-cache state). Use the offline diarizer when you have the whole file and
+want maximum throughput on short or few-speaker audio.
+
+**Scope — short clips / few speakers / throughput.** Each 30.72 s window is diarized independently
+with no speaker cache, so long multi-speaker audio accumulates large speaker confusion: on AMI-SDM
+the offline path scores ~57% DER vs ~26% for the streaming `highContextV2_1` (voice detection is
+identical — the gap is entirely speaker confusion the `spkcache` prevents). Cross-window re-stitching
+can't recover it because the confusion is generated within each window. **For accurate long-form
+multi-speaker diarization use the streaming variants; reach for offline for ≤ ~30 s clips,
+few-speaker audio, or throughput-bound batch jobs.** Longer inputs are tiled into 30.72 s windows
+(`overlapOutputFrames` controls the overlap) with activity-based stitching across boundaries.
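+
+The tiling described above can be sketched as follows. This is an illustration under stated
+assumptions, not the library's implementation: it assumes consecutive windows advance by
+`384 - overlapOutputFrames` output frames and that the last window is padded, and the helper names
+`outputFrames(forMilliseconds:)` and `windowStarts(totalFrames:windowFrames:overlapFrames:)` are
+hypothetical. The 80 ms output frame follows from 30.72 s / 384 frames.
+
+```swift
+/// Output frame duration implied by the model shape: 30_720 ms / 384 frames = 80 ms.
+let outputFrameMilliseconds = 30_720 / 384
+
+/// Number of output frames needed to cover a duration (rounded up). Hypothetical helper.
+func outputFrames(forMilliseconds ms: Int) -> Int {
+    (ms + outputFrameMilliseconds - 1) / outputFrameMilliseconds
+}
+
+/// Start frame of each window, ASSUMING a hop of `windowFrames - overlapFrames`.
+/// Hypothetical helper; the real stitching logic may differ.
+func windowStarts(totalFrames: Int, windowFrames: Int = 384, overlapFrames: Int) -> [Int] {
+    precondition(overlapFrames >= 0 && overlapFrames < windowFrames, "invalid overlap")
+    guard totalFrames > 0 else { return [] }
+    let hop = windowFrames - overlapFrames
+    var starts: [Int] = []
+    var start = 0
+    while true {
+        starts.append(start)
+        // Stop once the current window reaches the end of the audio.
+        if start + windowFrames >= totalFrames { break }
+        start += hop
+    }
+    return starts
+}
+
+let framesForOneMinute = outputFrames(forMilliseconds: 60_000)
+let startsNoOverlap = windowStarts(totalFrames: framesForOneMinute, overlapFrames: 0)
+let startsWithOverlap = windowStarts(totalFrames: framesForOneMinute, overlapFrames: 32)
+print(framesForOneMinute, startsNoOverlap, startsWithOverlap)
+```
+
+Under these assumptions a 60 s file needs two windows without overlap and three with a 32-frame
+overlap, which is why longer files cost more fused calls as the overlap grows.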
+
+```swift
+let diarizer = OfflineSortformerDiarizer(config: .offlineV2_1)
+try await diarizer.initializeFromHuggingFace()              // or initialize(modelPath:)
+
+let timeline = try diarizer.processComplete(audioSamples, sourceSampleRate: 16_000)
+// Or load + resample a file directly:
+let fileTimeline = try diarizer.processComplete(audioFileURL: audioURL)
+
+for (index, speaker) in timeline.speakers {
+    for segment in speaker.finalizedSegments {
+        print("Speaker \(index): \(segment.startTime)s - \(segment.endTime)s")
+    }
+}
+```
+
+CLI: `fluidaudio sortformer audio.wav --offline` (add `--palettized` for the 6-bit set).
 
 ## Usage Examples
 

diff --git a/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift b/Sources/FluidAudio/Diarizer/Offline/Clustering/KMeansClustering.swift
@@ -123,9 +123,10 @@ struct KMeansClustering {
                 best = result
             }
         }
-        return best ?? clusterWithCentroids(
-            embeddings: embeddings, numClusters: numClusters,
-            maxIterations: maxIterations, seed: baseSeed)
+        return best
+            ?? clusterWithCentroids(
+                embeddings: embeddings, numClusters: numClusters,
+                maxIterations: maxIterations, seed: baseSeed)
     }
 
     private static func normalizeEmbeddings(_ embeddings: [[Double]]) -> [[Double]] {