fix(sortformer): consume BNNS-fixed v3 models + config-mismatch guard (#726)#728
Alex-Wengg wants to merge 5 commits into
…tch guard (#726)
The root-level Sortformer CoreML models hit a BNNS graph-compile crash on newer BNNS ("tensor chunk_pre_encoder_embs_out as both an input and output"). The fixed rebuild lives at v3/fp16/ in the HF repo; point ModelNames there so downloads pick up the working models.
- ModelNames.Sortformer.modelsSubdirectory = "v3/fp16" (BNNS-fixed set); v3/palettized is the 6-bit, ~2.5x-smaller set for RAM-constrained devices.
- Add efficientV2_1 variant (chunk_len=25, ~2s latency, ~4x RTFx of fast) + config preset.
- SortformerDiarizer now validates the diarizer config against the model's embedded metadata on init and logs a clear error on mismatch (a mismatch silently produced incorrect/slow diarization, #726). spkcacheUpdatePeriod excluded (host-clamped).
- CLI: `sortformer --config fast|efficient|low|high`; `sortformer-benchmark --collar`, `--onset`, `--offset` (the hardcoded collar=0 / onset=0.5 skewed reported DER).
Supertonic3 Smoke Test ✅
Runtime: 0m21s
Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.
Offline VBx Pipeline Results
Speaker Diarization Performance (VBx Batch Mode)
Optimal clustering with Hungarian algorithm for maximum accuracy
Offline VBx Pipeline Timing Breakdown
Time spent in each stage of batch diarization
Speaker Diarization Research Comparison
Offline VBx achieves competitive accuracy with batch processing
Pipeline Details:
🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 155.3s processing • Test runtime: 2m 43s • 06/24/2026, 01:41 AM EST
Parakeet EOU Benchmark Results ✅
Status: Benchmark passed
Performance Metrics
Streaming Metrics
Test runtime: 0m47s • 06/24/2026, 01:37 AM EST
RTFx = Real-Time Factor (higher is better) • Processing includes: model inference, audio preprocessing, state management, and file I/O
Speaker Diarization Benchmark Results
Speaker Diarization Performance
Evaluating "who spoke when" detection accuracy
Diarization Pipeline Timing Breakdown
Time spent in each stage of speaker diarization
Speaker Diarization Research Comparison
Research baselines typically achieve 18-30% DER on standard datasets
Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:
🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 42.6s diarization time • Test runtime: 2m 49s • 06/24/2026, 01:33 AM EST
Sortformer High-Latency Benchmark Results
ES2004a Performance (30.4s latency config)
Sortformer High-Latency • ES2004a • Runtime: 3m 26s • 2026-06-24T05:41:45.512Z
PocketTTS Smoke Test ✅
Runtime: 0m5s
Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. The CI VM lacks a physical GPU, so audio quality and performance may differ from Apple Silicon.
VAD Benchmark Results
Performance Comparison
Dataset Details
✅: Average F1-Score above 70%
ASR Benchmark Results ✅
Status: All benchmarks passed
Parakeet v3 (multilingual)
Parakeet v2 (English-optimized)
Streaming (v3)
Streaming (v2)
Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming
25 files per dataset • Test runtime: 7m33s • 06/24/2026, 01:35 AM EST
RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Expected RTFx Performance on Physical M1 Hardware:
• M1 Mac: ~28x (clean), ~25x (other)
Testing methodology follows HuggingFace Open ASR Leaderboard
🚩 requiredModels now downloads ALL 7 variants including efficientV2_1
Sortformer.requiredModels (Sources/FluidAudio/ModelNames.swift:725) returns Set(Variant.allCases.map(\.fileName)) which now includes all 7 variants. Any code path that downloads the full required set (i.e., when variant is nil) will now also attempt to download v3/fp16/SortformerEfficient_v2.1.mlmodelc. This is fine as long as that model file exists in the HuggingFace repo at https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml/tree/main/v3/fp16/. If it hasn't been uploaded yet, full-set downloads would fail.
(Refers to lines 724-727)
    private func validateConfigMatch(_ models: SortformerModels) {
        guard let embedded = models.embeddedConfig else { return }
        let current = SortformerModels.EmbeddedConfig(
            chunkLen: config.chunkLen,
            chunkLeftContext: config.chunkLeftContext,
            chunkRightContext: config.chunkRightContext,
            fifoLen: config.fifoLen,
            spkcacheLen: config.spkcacheLen
        )
        guard current != embedded else { return }
        logger.error(
            """
            Sortformer config mismatch — diarizer config does not match the loaded model. \
            This produces incorrect and much slower diarization (issue #726). \
            diarizer(chunkLen=\(current.chunkLen), leftCtx=\(current.chunkLeftContext), \
            rightCtx=\(current.chunkRightContext), fifoLen=\(current.fifoLen), \
            spkcacheLen=\(current.spkcacheLen)) \
            vs model(chunkLen=\(embedded.chunkLen), leftCtx=\(embedded.chunkLeftContext), \
            rightCtx=\(embedded.chunkRightContext), fifoLen=\(embedded.fifoLen), \
            spkcacheLen=\(embedded.spkcacheLen)). \
            Construct SortformerDiarizer with the SortformerConfig matching the model variant.
            """
        )
    }
🚩 validateConfigMatch only warns, never fails — silent mismatch possible in production
The validateConfigMatch method at Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift:125-148 logs an error but does not throw or prevent initialization when a config mismatch is detected. This is likely intentional (backward compatibility, older models without metadata), but it means a misconfigured diarizer will silently produce incorrect results in production where log output may not be monitored. If the embedded metadata is present AND mismatched, this is almost certainly a programmer error. Consider whether this should throw in a future iteration.
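A minimal sketch of what a throwing variant could look like, should a later iteration go that way. It is not the PR's code: `EmbeddedConfig` here is a local stand-in that only mirrors the field names from the snippet above, and `ConfigMismatchError` and `validateStrict` are hypothetical names. The numeric values in the usage lines are placeholders, apart from `chunkLen: 25`, which is the `efficientV2_1` chunk length quoted in this PR.

```swift
// Local stand-in for SortformerModels.EmbeddedConfig (field names taken from the PR snippet).
struct EmbeddedConfig: Equatable {
    var chunkLen: Int
    var chunkLeftContext: Int
    var chunkRightContext: Int
    var fifoLen: Int
    var spkcacheLen: Int
}

// Hypothetical error type; not part of FluidAudio.
struct ConfigMismatchError: Error, Equatable {
    let expected: EmbeddedConfig  // what the model metadata says
    let actual: EmbeddedConfig    // what the diarizer was configured with
}

/// Returns normally when the model carries no metadata (older models) or when
/// both configs agree; throws only when metadata is present and differs.
func validateStrict(current: EmbeddedConfig, embedded: EmbeddedConfig?) throws {
    guard let embedded = embedded else { return }
    guard current != embedded else { return }
    throw ConfigMismatchError(expected: embedded, actual: current)
}

// Usage with placeholder values (only chunkLen = 25 comes from the PR text).
let modelConfig = EmbeddedConfig(
    chunkLen: 25, chunkLeftContext: 1, chunkRightContext: 1, fifoLen: 40, spkcacheLen: 188)
var wrongConfig = modelConfig
wrongConfig.chunkLen = 6

do {
    try validateStrict(current: wrongConfig, embedded: modelConfig)
} catch let error as ConfigMismatchError {
    print("mismatch: diarizer chunkLen=\(error.actual.chunkLen), model chunkLen=\(error.expected.chunkLen)")
} catch {
    print("unexpected error: \(error)")
}
```

Keeping the early return for missing metadata preserves the backward-compatibility behaviour the comment guesses at, while turning a present-and-mismatched config into a hard failure.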
…t fallback (#726)
Addresses two follow-ups from the #726 reporter:
- modelsSubdirectory was a let constant, so switching to the smaller palettized models (2.4GB -> 330MB RAM) required editing source. Add ModelNames.Sortformer.ModelPrecision and a mutable SortformerConfig.precision so callers select fp16 (default) vs palettized at runtime; bundle(for:) honors it. CLI: --palettized.
- The ~2.4GB fp16 high-context head triggers a multi-minute ANE compile hang on RAM-constrained devices (A14). Add recommendedComputeUnits(for:): large fp16 high-context variants on <8GB devices load with .cpuOnly, everything else (incl. the ANE-friendly palettized head) keeps .all. Wired through load/loadFromHuggingFace/initialize; computeUnits remains overridable.
Also fixes load() ignoring its compute-unit argument.
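The fallback rule described above can be sketched as a small pure function. This is an illustration under stated assumptions, not the library's implementation: `Precision`, `Units`, and `recommendedUnits` are hypothetical local names (`Units` stands in for CoreML's `MLComputeUnits` so the snippet runs without CoreML), and reading "<8GB" as "less than 8 GiB of physical memory" is a guess about how the threshold is measured.

```swift
import Foundation

enum Precision { case fp16, palettized }   // mirrors the two precisions named in the commit
enum Units { case all, cpuOnly }           // stand-in for MLComputeUnits.all / .cpuOnly

/// Route only the large fp16 high-context head to CPU on low-RAM devices;
/// everything else (including the palettized high-context head) keeps `.all`.
func recommendedUnits(
    isHighContext: Bool,
    precision: Precision,
    physicalMemoryBytes: UInt64
) -> Units {
    let eightGiB: UInt64 = 8 * 1024 * 1024 * 1024  // assumed threshold
    let isLargeHead = isHighContext && precision == .fp16
    return (isLargeHead && physicalMemoryBytes < eightGiB) ? .cpuOnly : .all
}

// Usage: query the current device's RAM through Foundation.
let ram = ProcessInfo.processInfo.physicalMemory
let units = recommendedUnits(isHighContext: true, precision: .fp16, physicalMemoryBytes: ram)
print("recommended compute units: \(units)")
```

Because the decision depends only on variant, precision, and RAM, it stays easy to unit-test and an explicit caller-supplied compute-unit value can still override it.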
… benchmark (#726)
Refresh the Model Variants table for the v3 model set (fp16/palettized paths, efficientV2_1, config-mismatch note). Document precision selection (RAM/DER trade), the A14 compute-unit auto-fallback, and the offline head-to-head vs Argmax: FluidAudio's fused offline graph is 1.3-1.4x faster (10.65ms/2884x vs 14.57ms/2108x encoder-only on M5 Pro), with the >10x Argmax claim explained as a streaming-vs-offline mismatch.
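The RTFx figures quoted in this commit can be reproduced from the per-window latencies, assuming they refer to the 30.72 s window used for the offline model elsewhere in this PR and using the definition given in the CI comments (audio duration ÷ processing time). The function name `rtfx` is illustrative only.

```swift
/// RTFx = audio duration / processing time (higher is better).
func rtfx(audioSeconds: Double, processingMilliseconds: Double) -> Double {
    audioSeconds / (processingMilliseconds / 1000.0)
}

let windowSeconds = 30.72  // assumed: the offline window length quoted in this PR

let fluidRTFx = rtfx(audioSeconds: windowSeconds, processingMilliseconds: 10.65)   // about 2884.5
let argmaxRTFx = rtfx(audioSeconds: windowSeconds, processingMilliseconds: 14.57)  // about 2108.4
let speedup = fluidRTFx / argmaxRTFx                                               // about 1.37

print(Int(fluidRTFx), Int(argmaxRTFx), speedup)
```

Truncated to integers, the two values match the 2884x and 2108x in the commit message, and their ratio falls inside the stated 1.3-1.4x range.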
…stitching)
Add OfflineSortformerDiarizer backed by the fused offline Sortformer model (mel -> speaker_preds, 30.72s window, no streaming state): one CoreML call per window, the fastest batch path (1.3-1.4x faster than Argmax offline, #726).
- OfflineSortformerConfig / OfflineSortformerModels.runOffline (2-input graph, distinct from the streaming 6-input runMainModel)
- Long audio tiled into overlapping windows; SortformerSpeakerStitcher recovers the cross-window speaker permutation (brute-force 4! over the overlap) so IDs stay globally consistent
- ModelNames.Sortformer.offlineBundle(precision:) -> v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc
- CLI: sortformer --offline [--palettized]
- Tests for the stitcher, config, and bundle paths
Validated end-to-end: 288.6s audio -> 13 windows in 1.02s (281.9x RTFx), consistent speaker IDs across all window boundaries.
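The brute-force permutation recovery mentioned above can be sketched as follows. This is not `SortformerSpeakerStitcher` itself: the function names are hypothetical, and the cost function (sum of squared differences between the two windows' speaker probabilities on the shared overlap frames) is an assumption, since the commit only states that all 4! permutations are tried over the overlap.

```swift
/// All permutations of 0..<n; n = 4 yields the 24 candidates of the brute-force search.
func permutations(_ n: Int) -> [[Int]] {
    if n == 0 { return [[]] }
    var result: [[Int]] = []
    for shorter in permutations(n - 1) {
        // Insert the new element at every position of each shorter permutation.
        for index in 0...shorter.count {
            var candidate = shorter
            candidate.insert(n - 1, at: index)
            result.append(candidate)
        }
    }
    return result
}

/// `prev[t][s]` and `next[t][s]` are speaker probabilities for the same overlap
/// frames as seen by two adjacent windows. Returns `perm` such that speaker
/// `perm[s]` in the next window corresponds to speaker `s` in the previous one.
func bestPermutation(prev: [[Float]], next: [[Float]]) -> [Int] {
    let speakers = prev.first?.count ?? 0
    var best = Array(0..<speakers)
    var bestCost = Float.infinity
    for perm in permutations(speakers) {
        var cost: Float = 0
        for (p, n) in zip(prev, next) {
            for s in 0..<speakers {
                let diff = p[s] - n[perm[s]]
                cost += diff * diff  // assumed cost: squared difference
            }
        }
        if cost < bestCost {
            bestCost = cost
            best = perm
        }
    }
    return best
}

// Usage: the next window sees the same speakers with their columns shuffled.
let prevOverlap: [[Float]] = [[0.9, 0.1, 0.0, 0.3], [0.8, 0.2, 0.05, 0.4]]
let nextOverlap: [[Float]] = [[0.1, 0.3, 0.9, 0.0], [0.2, 0.4, 0.8, 0.05]]
let mapping = bestPermutation(prev: prevOverlap, next: nextOverlap)
print(mapping)  // [2, 0, 3, 1]
```

With only four speaker slots the exhaustive search costs 24 evaluations per boundary, so an assignment solver such as the Hungarian algorithm would add complexity without a practical benefit at this size.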
Document OfflineSortformerDiarizer whole-file throughput (fused model + speaker stitching) alongside the existing offline-vs-Argmax model-exec numbers.
Summary
Fixes the Sortformer BNNS graph-compile crash from #726 and hardens the config + device-compat path.
The root-level Sortformer CoreML models had `chunk_pre_encoder_embs_out` as both a graph input and output in the head submodel, which the macOS 26 / newer BNNS compiler rejects.
Root cause was a conversion-toolchain artifact (torch 2.9.x folded the identity that kept input/output distinct). The models were rebuilt clean (torch 2.7 + coremltools 9.0), verified no-alias + ANE-loadable + numerically matched to the PyTorch reference (100% speaker-argmax parity), and uploaded to `FluidInference/diar-streaming-sortformer-coreml/v3/`.
Changes
- `v3/fp16` set.
- `modelsSubdirectory` was a `let`: `ModelNames.Sortformer.ModelPrecision { .fp16, .palettized }` + mutable `SortformerConfig.precision` (default `.fp16`); `bundle(for:)` honors it. Flip to the 6-bit, ~2.5× smaller set without editing source: `var c = .highContextV2_1; c.precision = .palettized`, or `sortformer --palettized`.
- `.all`: `SortformerModels.recommendedComputeUnits(for:)` routes the large fp16 high-context variants to `.cpuOnly` on <8 GB devices, while the ~330 MB palettized high-context head (loads fine on ANE) keeps `.all`. Threaded through `load`/`loadFromHuggingFace`/`initialize` (all default to auto, still overridable). Also fixes `load()` ignoring its compute-unit argument.
- `efficientV2_1` variant: chunk_len=25, ~2 s output latency, ~4× the RTFx of the default streaming config at near-identical per-inference cost.
- `SortformerDiarizer` validates the diarizer `SortformerConfig` against the model's embedded metadata (chunk_len/contexts/fifo_len/spkcache_len) on init and logs a clear error on mismatch. `spkcacheUpdatePeriod` is excluded since the host clamps it.
- `sortformer --config fast|efficient|low|high [--palettized]`; `sortformer-benchmark --collar/--onset/--offset`.
Benchmarks
RAM (`highContextV2_1`): fp16 ~2.4 GB → palettized ~330 MB (reporter-confirmed).
Offline throughput vs Argmax (M5 Pro, ComputeUnit.ALL, 30.72 s windows, median of 120). Argmax's Sortformer is an offline batch model (no streaming state); we exported ours as a single fused offline graph and benchmarked it head-to-head against their 3-model chain:
FluidAudio is 1.3–1.4× faster offline: one fused GPU graph vs their ANE→GPU split. The ">10× faster" sometimes cited for Argmax compares their offline model against our `highContextV2_1` streaming config (the slowest/largest variant), which is an apples-to-oranges comparison. For low-latency streaming throughput use `.efficientV2_1` (~215× RTFx); Argmax ships no streaming Sortformer. Repro script: mobius#73 `offline_argmax_bench.py`. Full writeup: `Documentation/Diarization/Sortformer.md#benchmarks`.
Validation