fix(sortformer): consume BNNS-fixed v3 models + config-mismatch guard (#726) #728

Open: Alex-Wengg wants to merge 5 commits into main from fix/sortformer-bnns-crash-726

@Alex-Wengg Alex-Wengg commented Jun 21, 2026

Summary

Fixes the Sortformer BNNS graph-compile crash from #726 and hardens the config + device-compat path.

The root-level Sortformer CoreML models had chunk_pre_encoder_embs_out as both a graph input and output in the head submodel, which the macOS 26 / newer BNNS compiler rejects:

BNNS Graph Compile: Function main has tensor chunk_pre_encoder_embs_out as both an input and output.

Root cause was a conversion-toolchain artifact (torch 2.9.x folded the identity that kept input/output distinct). The models were rebuilt clean (torch 2.7 + coremltools 9.0), verified no-alias + ANE-loadable + numerically matched to the PyTorch reference (100% speaker-argmax parity), and uploaded to FluidInference/diar-streaming-sortformer-coreml/v3/.

Changes

  • Point downloads at the fixed models: default v3/fp16 set.
  • Runtime precision selection (addresses reporter feedback: modelsSubdirectory was a let): adds ModelNames.Sortformer.ModelPrecision { .fp16, .palettized } and a mutable SortformerConfig.precision (default .fp16), which bundle(for:) honors. To switch to the 6-bit, ~2.5× smaller set without editing source, set it in code (var c = SortformerConfig.highContextV2_1; c.precision = .palettized) or pass sortformer --palettized.
  • A14 compute-unit auto-fallback (addresses reporter feedback — large fp16 high-context hangs for minutes on .all): SortformerModels.recommendedComputeUnits(for:) routes the large fp16 high-context variants to .cpuOnly on <8 GB devices, while the ~330 MB palettized high-context head (loads fine on ANE) keeps .all. Threaded through load/loadFromHuggingFace/initialize (all default to auto, still overridable). Also fixes load() ignoring its compute-unit argument.
  • efficientV2_1 variant: chunk_len=25, ~2 s output latency, ~4× the RTFx of the default streaming config at near-identical per-inference cost.
  • Config-mismatch guard: SortformerDiarizer validates the diarizer SortformerConfig against the model's embedded metadata (chunk_len/contexts/fifo_len/spkcache_len) on init and logs a clear error on mismatch. spkcacheUpdatePeriod is excluded since the host clamps it.
  • CLI ergonomics: sortformer --config fast|efficient|low|high [--palettized]; sortformer-benchmark --collar/--onset/--offset.
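
The precision switch and the compute-unit auto-fallback from the list above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the FluidAudio implementation: `Precision`, `ComputeUnits`, `VariantInfo`, and `recommendedUnits(for:physicalMemoryBytes:)` are hypothetical stand-ins, and only the 8 GB threshold and the fp16/high-context condition come from the description.

```swift
import Foundation

// Hypothetical stand-ins for illustration; the real types live in FluidAudio / Core ML.
enum Precision { case fp16, palettized }
enum ComputeUnits { case all, cpuOnly }

struct VariantInfo {
    var isHighContext: Bool
    var precision: Precision = .fp16  // mutable, so callers can flip it at runtime
}

/// Routes large fp16 high-context variants to the CPU on devices with less than
/// 8 GB of RAM; everything else keeps all compute units.
func recommendedUnits(for variant: VariantInfo, physicalMemoryBytes: UInt64) -> ComputeUnits {
    let eightGB: UInt64 = 8 * 1024 * 1024 * 1024
    let isLargeFP16 = variant.isHighContext && variant.precision == .fp16
    return (isLargeFP16 && physicalMemoryBytes < eightGB) ? .cpuOnly : .all
}

var variant = VariantInfo(isHighContext: true)
let sixGB: UInt64 = 6 * 1024 * 1024 * 1024
print(recommendedUnits(for: variant, physicalMemoryBytes: sixGB))  // cpuOnly

variant.precision = .palettized
print(recommendedUnits(for: variant, physicalMemoryBytes: sixGB))  // all
```

On a real device the memory figure could come from `ProcessInfo.processInfo.physicalMemory`; whether the PR reads it that way is not stated here.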

Benchmarks

RAM (highContextV2_1): fp16 ~2.4 GB → palettized ~330 MB (reporter-confirmed).

Offline throughput vs Argmax (M5 Pro, ComputeUnit.ALL, 30.72 s windows, median of 120). Argmax's Sortformer is an offline batch model (no streaming state), so we exported ours as a single fused offline graph and benchmarked it head-to-head against their 3-model chain:

| | model-exec (mel → preds) | end-to-end (incl. mel) |
| --- | --- | --- |
| Argmax (3 calls) | 14.57 ms · 2108× | 16.41 ms · 1872× |
| FluidAudio (fused) | 10.65 ms · 2884× | 12.49 ms · 2459× |

FluidAudio is 1.3–1.4× faster offline — one fused GPU graph vs their ANE→GPU split. The ">10× faster" sometimes cited for Argmax compares their offline model against our highContextV2_1 streaming config (slowest/largest variant) — apples-to-oranges. For low-latency streaming throughput use .efficientV2_1 (~215× RTFx); Argmax ships no streaming Sortformer. Repro script: mobius#73 offline_argmax_bench.py. Full writeup: Documentation/Diarization/Sortformer.md#benchmarks.
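
The RTFx figures and the 1.3–1.4× ratio can be reproduced from the per-window latencies and the 30.72 s window length. A small sketch of that arithmetic, where the helper name `rtfx` is an assumption made for illustration:

```swift
import Foundation

/// RTFx = audio duration divided by processing time.
func rtfx(audioSeconds: Double, processingMilliseconds: Double) -> Double {
    audioSeconds / (processingMilliseconds / 1000.0)
}

let window = 30.72  // seconds of audio per window, from the benchmark setup

let argmaxExec = rtfx(audioSeconds: window, processingMilliseconds: 14.57)
let fluidExec = rtfx(audioSeconds: window, processingMilliseconds: 10.65)
let argmaxE2E = rtfx(audioSeconds: window, processingMilliseconds: 16.41)
let fluidE2E = rtfx(audioSeconds: window, processingMilliseconds: 12.49)

// Speedup is the ratio of latencies (equivalently, the ratio of RTFx values).
let execSpeedup = 14.57 / 10.65
let e2eSpeedup = 16.41 / 12.49

print(String(format: "model-exec RTFx: Argmax %.1f, FluidAudio %.1f", argmaxExec, fluidExec))
print(String(format: "end-to-end RTFx: Argmax %.1f, FluidAudio %.1f", argmaxE2E, fluidE2E))
print(String(format: "speedup: model-exec %.2fx, end-to-end %.2fx", execSpeedup, e2eSpeedup))
```

The computed values agree with the reported figures to within rounding, and the two speedups land at roughly 1.37× and 1.31×, which is where the 1.3–1.4× range comes from.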

Validation

  • All CI green (build/test macOS + iOS, swift-format, Sortformer benchmark).
  • Numerical parity vs NeMo PyTorch reference = 100% speaker-argmax agreement across all variants.
  • Full AMI-SDM DER (forced-alignment GT, collar 0.25): highContext ~26.5%, default streaming ~29.0%. 6-bit palettization costs +0.9 pp on average for the streaming configs and more on high-context, which is why fp16 stays the default.

…tch guard (#726)

The root-level Sortformer CoreML models hit a BNNS graph-compile crash on newer
BNNS ("tensor chunk_pre_encoder_embs_out as both an input and output"). The fixed
rebuild lives at v3/fp16/ in the HF repo; point ModelNames there so downloads pick
up the working models.

- ModelNames.Sortformer.modelsSubdirectory = "v3/fp16" (BNNS-fixed set); v3/palettized
  is the 6-bit, ~2.5x-smaller set for RAM-constrained devices.
- Add efficientV2_1 variant (chunk_len=25, ~2s latency, ~4x RTFx of fast) + config preset.
- SortformerDiarizer now validates the diarizer config against the model's embedded
  metadata on init and logs a clear error on mismatch (a mismatch silently produced
  incorrect/slow diarization — #726). spkcacheUpdatePeriod excluded (host-clamped).
- CLI: `sortformer --config fast|efficient|low|high`; `sortformer-benchmark --collar`,
  `--onset`, `--offset` (the hardcoded collar=0 / onset=0.5 skewed reported DER).

github-actions Bot commented Jun 21, 2026

Supertonic3 Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download (incl. VectorEstimatorVariants/ int4 buckets) | |
| Model load | |
| Synthesis pipeline (--ve-variant int4) | |
| Output WAV | ✅ (364.7 KB) |

Runtime: 0m21s

Note: CI VMs lack a physical Neural Engine; the ANE-bucketed VectorEstimator falls back to CPU here. This validates download + variant resolution + synthesis, not ANE residency/perf.

github-actions Bot commented Jun 21, 2026

Offline VBx Pipeline Results

Speaker Diarization Performance (VBx Batch Mode)

Optimal clustering with Hungarian algorithm for maximum accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 10.4% | <20% | | Diarization Error Rate (lower is better) |
| RTFx | 8.49x | >1.0x | | Real-Time Factor (higher is faster) |

Offline VBx Pipeline Timing Breakdown

Time spent in each stage of batch diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 19.179 | 15.5 | Fetching diarization models |
| Model Compile | 8.220 | 6.7 | CoreML compilation |
| Audio Load | 0.146 | 0.1 | Loading audio file |
| Segmentation | 31.921 | 25.8 | VAD + speech detection |
| Embedding | 123.295 | 99.8 | Speaker embedding extraction |
| Clustering (VBx) | 0.112 | 0.1 | Hungarian algorithm + VBx clustering |
| Total | 123.549 | 100 | Full VBx pipeline |

Speaker Diarization Research Comparison

Offline VBx achieves competitive accuracy with batch processing

| Method | DER | Mode | Description |
| --- | --- | --- | --- |
| FluidAudio (Offline) | 10.4% | VBx Batch | On-device CoreML with optimal clustering |
| FluidAudio (Streaming) | 17.7% | Chunk-based | First-occurrence speaker mapping |
| Research baseline | 18-30% | Various | Standard dataset performance |

Pipeline Details:

  • Mode: Offline VBx with Hungarian algorithm for optimal speaker-to-cluster assignment
  • Segmentation: VAD-based voice activity detection
  • Embeddings: WeSpeaker-compatible speaker embeddings
  • Clustering: PowerSet with VBx refinement
  • Accuracy: Higher than streaming due to optimal post-hoc mapping

🎯 Offline VBx Test • AMI Corpus ES2004a • 1049.0s meeting audio • 155.3s processing • Test runtime: 2m 43s • 06/24/2026, 01:41 AM EST

github-actions Bot commented Jun 21, 2026

Parakeet EOU Benchmark Results ✅

Status: Benchmark passed
Chunk Size: 320ms
Files Tested: 100/100

Performance Metrics

| Metric | Value | Description |
| --- | --- | --- |
| WER (Avg) | 7.03% | Average Word Error Rate |
| WER (Med) | 4.17% | Median Word Error Rate |
| RTFx | 11.76x | Real-time factor (higher = faster) |
| Total Audio | 470.6s | Total audio duration processed |
| Total Time | 41.1s | Total processing time |

Streaming Metrics

| Metric | Value | Description |
| --- | --- | --- |
| Avg Chunk Time | 0.041s | Average chunk processing time |
| Max Chunk Time | 0.082s | Maximum chunk processing time |
| EOU Detections | 0 | Total End-of-Utterance detections |

Test runtime: 0m47s • 06/24/2026, 01:37 AM EST

RTFx = Real-Time Factor (higher is better) • Processing includes: Model inference, audio preprocessing, state management, and file I/O

github-actions Bot commented Jun 21, 2026

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% | | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | | Jaccard Error Rate |
| RTFx | 24.63x | >1.0x | | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 10.380 | 24.4 | Fetching diarization models |
| Model Compile | 4.449 | 10.4 | CoreML compilation |
| Audio Load | 0.155 | 0.4 | Loading audio file |
| Segmentation | 12.768 | 30.0 | Detecting speech regions |
| Embedding | 21.279 | 50.0 | Extracting speaker voices |
| Clustering | 8.512 | 20.0 | Grouping same speakers |
| Total | 42.601 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): runs at 150× real time (RTFx 150)
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 42.6s diarization time • Test runtime: 2m 49s • 06/24/2026, 01:33 AM EST

github-actions Bot commented Jun 21, 2026

Sortformer High-Latency Benchmark Results

ES2004a Performance (30.4s latency config)

| Metric | Value | Target | Status |
| --- | --- | --- | --- |
| DER | 30.3% | <35% | |
| Miss Rate | 28.2% | - | - |
| False Alarm | 0.9% | - | - |
| Speaker Error | 1.2% | - | - |
| RTFx | 17.7x | >1.0x | |
| Speakers | 4/4 | - | - |

Sortformer High-Latency • ES2004a • Runtime: 3m 26s • 2026-06-24T05:41:45.512Z
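
The three error components in the results above add up to the reported DER (28.2 + 0.9 + 1.2 = 30.3), consistent with the usual definition of DER as the sum of missed speech, false alarm, and speaker confusion. A small sketch of that arithmetic, where `DERBreakdown` is a hypothetical type used only for illustration:

```swift
import Foundation

// Hypothetical helper type; values are percentages taken from the results above.
struct DERBreakdown {
    let missRate: Double
    let falseAlarm: Double
    let speakerError: Double

    /// DER is the sum of the three error components.
    var der: Double { missRate + falseAlarm + speakerError }
}

let es2004a = DERBreakdown(missRate: 28.2, falseAlarm: 0.9, speakerError: 1.2)
print(String(format: "DER = %.1f%%", es2004a.der))  // DER = 30.3%
print(String(format: "share from missed speech = %.0f%%", 100 * es2004a.missRate / es2004a.der))
```

On this run almost all of the error is missed speech, so false alarms and speaker confusion contribute little to the total.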

github-actions Bot commented Jun 21, 2026

PocketTTS Smoke Test ✅

| Check | Result |
| --- | --- |
| Build | |
| Model download | |
| Model load | |
| Synthesis pipeline | |
| Output WAV | ✅ (187.5 KB) |

Runtime: 0m5s

Note: PocketTTS uses CoreML MLState (macOS 15) KV cache + Mimi streaming state. CI VM lacks physical GPU — audio quality and performance may differ from Apple Silicon.

github-actions Bot commented Jun 21, 2026

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 585.6x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 665.0x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%

github-actions Bot commented Jun 21, 2026

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 5.69x | |
| test-other | 1.19% | 0.00% | 3.72x | |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.80% | 0.00% | 5.65x | |
| test-other | 1.00% | 0.00% | 3.27x | |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.65x | Streaming real-time factor |
| Avg Chunk Time | 1.433s | Average time to process each chunk |
| Max Chunk Time | 1.989s | Maximum chunk processing time |
| First Token | 1.692s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.59x | Streaming real-time factor |
| Avg Chunk Time | 1.558s | Average time to process each chunk |
| Max Chunk Time | 1.803s | Maximum chunk processing time |
| First Token | 1.549s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 7m33s • 06/24/2026, 01:35 AM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@beta-devin-ai-integration beta-devin-ai-integration Bot left a comment

Devin Review found 2 potential issues.

🚩 requiredModels now downloads ALL 7 variants including efficientV2_1

Sortformer.requiredModels (Sources/FluidAudio/ModelNames.swift:725) returns Set(Variant.allCases.map(\.fileName)) which now includes all 7 variants. Any code path that downloads the full required set (i.e., when variant is nil) will now also attempt to download v3/fp16/SortformerEfficient_v2.1.mlmodelc. This is fine as long as that model file exists in the HuggingFace repo at https://huggingface.co/FluidInference/diar-streaming-sortformer-coreml/tree/main/v3/fp16/. If it hasn't been uploaded yet, full-set downloads would fail.

(Refers to lines 724-727)

Comment on lines +125 to +148
    private func validateConfigMatch(_ models: SortformerModels) {
        guard let embedded = models.embeddedConfig else { return }
        let current = SortformerModels.EmbeddedConfig(
            chunkLen: config.chunkLen,
            chunkLeftContext: config.chunkLeftContext,
            chunkRightContext: config.chunkRightContext,
            fifoLen: config.fifoLen,
            spkcacheLen: config.spkcacheLen
        )
        guard current != embedded else { return }
        logger.error(
            """
            Sortformer config mismatch — diarizer config does not match the loaded model. \
            This produces incorrect and much slower diarization (issue #726). \
            diarizer(chunkLen=\(current.chunkLen), leftCtx=\(current.chunkLeftContext), \
            rightCtx=\(current.chunkRightContext), fifoLen=\(current.fifoLen), \
            spkcacheLen=\(current.spkcacheLen)) \
            vs model(chunkLen=\(embedded.chunkLen), leftCtx=\(embedded.chunkLeftContext), \
            rightCtx=\(embedded.chunkRightContext), fifoLen=\(embedded.fifoLen), \
            spkcacheLen=\(embedded.spkcacheLen)). \
            Construct SortformerDiarizer with the SortformerConfig matching the model variant.
            """
        )
    }

🚩 validateConfigMatch only warns, never fails — silent mismatch possible in production

The validateConfigMatch method at Sources/FluidAudio/Diarizer/Sortformer/SortformerDiarizer.swift:125-148 logs an error but does not throw or prevent initialization when a config mismatch is detected. This is likely intentional (backward compatibility, older models without metadata), but it means a misconfigured diarizer will silently produce incorrect results in production where log output may not be monitored. If the embedded metadata is present AND mismatched, this is almost certainly a programmer error. Consider whether this should throw in a future iteration.
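
For reference, a throwing variant of the guard could look roughly like the sketch below. It is an illustration under stated assumptions, not code from this PR: `EmbeddedConfig` here is a local mirror of the five compared fields, `ConfigMismatchError` and `validate(current:embedded:)` are hypothetical names, and every numeric value except chunkLen = 25 is made up for the example.

```swift
// Hypothetical mirror of the fields compared by the guard; not the FluidAudio type.
struct EmbeddedConfig: Equatable {
    var chunkLen: Int
    var chunkLeftContext: Int
    var chunkRightContext: Int
    var fifoLen: Int
    var spkcacheLen: Int
}

// Hypothetical error carrying both sides of the mismatch for diagnostics.
struct ConfigMismatchError: Error {
    let current: EmbeddedConfig
    let embedded: EmbeddedConfig
}

/// Throws when embedded metadata is present and differs from the diarizer config.
/// Models without metadata still pass, which keeps older models loadable.
func validate(current: EmbeddedConfig, embedded: EmbeddedConfig?) throws {
    guard let embedded = embedded else { return }
    guard current == embedded else {
        throw ConfigMismatchError(current: current, embedded: embedded)
    }
}

// Illustrative values only.
let configured = EmbeddedConfig(
    chunkLen: 6, chunkLeftContext: 1, chunkRightContext: 7, fifoLen: 40, spkcacheLen: 188)
var modelMetadata = configured
modelMetadata.chunkLen = 25

do {
    try validate(current: configured, embedded: modelMetadata)
    print("configs match")
} catch let error as ConfigMismatchError {
    print("mismatch: diarizer chunkLen=\(error.current.chunkLen), model chunkLen=\(error.embedded.chunkLen)")
} catch {
    print("unexpected error: \(error)")
}
```

Throwing turns the mismatch into something callers must handle at init time, at the cost of breaking any existing caller that currently runs with a mismatched config.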

…t fallback (#726)

Addresses two follow-ups from the #726 reporter:

- modelsSubdirectory was a let constant, so switching to the smaller
  palettized models (2.4GB -> 330MB RAM) required editing source. Add
  ModelNames.Sortformer.ModelPrecision and a mutable SortformerConfig.precision
  so callers select fp16 (default) vs palettized at runtime; bundle(for:)
  honors it. CLI: --palettized.

- The ~2.4GB fp16 high-context head triggers a multi-minute ANE compile
  hang on RAM-constrained devices (A14). Add recommendedComputeUnits(for:):
  large fp16 high-context variants on <8GB devices load with .cpuOnly,
  everything else (incl. the ANE-friendly palettized head) keeps .all.
  Wired through load/loadFromHuggingFace/initialize; computeUnits remains
  overridable. Also fixes load() ignoring its compute-unit argument.

… benchmark (#726)

Refresh the Model Variants table for the v3 model set (fp16/palettized paths,
efficientV2_1, config-mismatch note). Document precision selection (RAM/DER
trade), the A14 compute-unit auto-fallback, and the offline head-to-head vs
Argmax: FluidAudio's fused offline graph is 1.3-1.4x faster (10.65ms/2884x vs
14.57ms/2108x encoder-only on M5 Pro), with the >10x Argmax claim explained as
a streaming-vs-offline mismatch.

Alex-Wengg force-pushed the fix/sortformer-bnns-crash-726 branch from a6ff757 to 32a5759 on June 24, 2026 01:12

…stitching)

Add OfflineSortformerDiarizer backed by the fused offline Sortformer model
(mel -> speaker_preds, 30.72s window, no streaming state) — one CoreML call per
window, the fastest batch path (1.3-1.4x faster than Argmax offline, #726).

- OfflineSortformerConfig / OfflineSortformerModels.runOffline (2-input graph,
  distinct from the streaming 6-input runMainModel)
- Long audio tiled into overlapping windows; SortformerSpeakerStitcher recovers
  the cross-window speaker permutation (brute-force 4! over the overlap) so IDs
  stay globally consistent
- ModelNames.Sortformer.offlineBundle(precision:) -> v3/{fp16,palettized}/SortformerOffline_v2.1.mlmodelc
- CLI: sortformer --offline [--palettized]
- Tests for the stitcher, config, and bundle paths

Validated end-to-end: 288.6s audio -> 13 windows in 1.02s (281.9x RTFx),
consistent speaker IDs across all window boundaries.

Document OfflineSortformerDiarizer whole-file throughput (fused model + speaker
stitching) alongside the existing offline-vs-Argmax model-exec numbers.
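
The cross-window stitching step mentioned above (brute-force 4! over the overlap) can be illustrated with a short sketch. The commit message does not say how candidate permutations are scored, so the squared-difference cost below is an assumption, as are the function names and the frames × speakers `[[Double]]` layout; this is not the SortformerSpeakerStitcher implementation.

```swift
/// All permutations of 0..<n (n = 4 gives 4! = 24 candidates).
func permutations(_ n: Int) -> [[Int]] {
    if n == 0 { return [[]] }
    var result: [[Int]] = []
    for perm in permutations(n - 1) {
        // Insert the new element at every possible position.
        for i in 0...perm.count {
            var p = perm
            p.insert(n - 1, at: i)
            result.append(p)
        }
    }
    return result
}

/// Returns `mapping` such that speaker `s` in the new window corresponds to speaker
/// `mapping[s]` in the previous window. Both inputs hold per-frame speaker activity
/// for the same overlapping frames. Cost: summed squared difference (an assumption).
func bestPermutation(previous: [[Double]], next: [[Double]]) -> [Int] {
    let speakers = previous.first?.count ?? 0
    var best = Array(0..<speakers)
    var bestCost = Double.infinity
    for perm in permutations(speakers) {
        var cost = 0.0
        for (prevFrame, nextFrame) in zip(previous, next) {
            for s in 0..<speakers {
                let d = prevFrame[perm[s]] - nextFrame[s]
                cost += d * d
            }
        }
        if cost < bestCost {
            bestCost = cost
            best = perm
        }
    }
    return best
}

// Toy overlap: the previous window's speaker 0 shows up as speaker 2 in the new
// window, and the previous window's speaker 1 shows up as speaker 0.
let previousOverlap: [[Double]] = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0]]
let nextOverlap: [[Double]] = [[0, 0, 1, 0], [0, 0, 1, 0], [1, 0, 0, 0]]
let mapping = bestPermutation(previous: previousOverlap, next: nextOverlap)
print(mapping[2], mapping[0])  // 0 1
```

With only four speaker slots, exhaustive search over 24 permutations is cheap, so a dedicated assignment algorithm is not needed for this step. Speakers that are silent in the overlap are ambiguous under any cost, which the sketch resolves arbitrarily.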