WAV2LEV is a model for reference-free ASR quality estimation that predicts the underlying sequence of Levenshtein edit operations (substitutions, deletions, insertions, matches) token-by-token, rather than estimating word error rate (WER) as a single scalar. WER is derived from the predicted edit sequence, yielding fine-grained, word-level error estimates that are more informative than those of direct WER estimators.
Paper: IEEE ICASSP 2026
Data: Mini-CNoiSY Corpus
WAV2LEV is a transformer decoder head built on top of Whisper large-v3. At each inference step it autoregressively predicts the next edit operation in the sequence [<start>, op₁, op₂, …, <end>], where each opᵢ ∈ {ins, sub, del, match}.
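Deriving WER from a predicted edit-operation sequence is a direct count over the operations. A minimal sketch (illustrative, not taken from the repository):

```python
def wer_from_ops(ops):
    """Derive WER from a Levenshtein edit-operation sequence.

    The reference length is the number of reference tokens consumed:
    matches + substitutions + deletions (insertions consume none).
    """
    s = ops.count("sub")
    d = ops.count("del")
    i = ops.count("ins")
    m = ops.count("match")
    ref_len = m + s + d
    return (s + d + i) / ref_len if ref_len else 0.0

# 3 matches, 1 substitution, 1 insertion over a 4-word reference
print(wer_from_ops(["match", "sub", "match", "ins", "match"]))  # → 0.5
```

Counting this way also yields a per-operation error breakdown for free, which is the fine-grained signal a scalar WER estimator cannot provide.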
The decoder receives three concatenated input streams as cross-attention memory:
| Stream | Source | Description |
|---|---|---|
| Audio features | Whisper encoder | Frame-level hidden states (1280-dim) |
| Uncertainty features | Whisper decoder logits | 13 per-token statistics (entropy, top-2 probs, margin, ratio, Gini impurity, cumulative top-3/5/10 mass, normalised exponential entropy, hypothesis token probability, NLL) |
| Text embeddings | Whisper decoder embeddings | Token-level embeddings of the hypothesis transcript (1280-dim) |
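Most of the uncertainty statistics follow directly from a softmax over the decoder logits. The sketch below covers an illustrative subset; the exact definitions and ordering used by WAV2LEV may differ:

```python
import numpy as np

def uncertainty_stats(logits):
    """Per-token uncertainty statistics from one decoder logit vector."""
    z = logits - logits.max()
    p = np.exp(z) / np.exp(z).sum()              # softmax
    top = np.sort(p)[::-1]                       # descending probabilities
    entropy = -(p * np.log(np.clip(p, 1e-12, None))).sum()
    return {
        "entropy": entropy,
        "top1_prob": top[0],
        "top2_prob": top[1],
        "margin": top[0] - top[1],
        "ratio": top[1] / top[0],
        "gini_impurity": 1.0 - (p ** 2).sum(),
        "top3_mass": top[:3].sum(),
        "top5_mass": top[:5].sum(),
        "top10_mass": top[:10].sum(),
        "norm_exp_entropy": np.exp(entropy) / len(p),
    }

stats = uncertainty_stats(np.random.randn(4096))  # toy logit vector
```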
Decoder configuration: 12 blocks · 16 attention heads · 1024-dim · GELU activations · 10% dropout
Training: token-level cross-entropy with label smoothing 0.05 · teacher forcing · AdamW (weight decay 1e-4) · linear warmup (1% of steps) · mixed precision · gradient accumulation (2 steps) · gradient clipping (max norm 1.0)
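The optimiser and schedule above can be sketched as follows; the stand-in model, learning rate, and step counts are illustrative, not taken from the repository:

```python
import torch

# Stand-in head; the real model is the 12-block transformer decoder above.
model = torch.nn.Linear(1024, 4)

total_steps = 10_000                       # illustrative
warmup_steps = int(0.01 * total_steps)     # linear warmup over 1% of steps

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.05)

# One toy optimisation step with gradient clipping.
# (Mixed precision and 2-step gradient accumulation omitted for brevity.)
x, y = torch.randn(8, 1024), torch.randint(0, 4, (8,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
```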
Mini-CNoiSY (Miniature Clean-Noisy Speech from YouTube) is a bespoke 354-hour noisy speech corpus designed to provide reliable ground-truth WER labels while covering a diverse range of noise conditions.
Construction:
- Clean speech segments sourced from YODAS2 and filtered for high-confidence transcription (agreement between manual and Whisper large-v3 labels, WER < 10%)
- Artificial noising pipeline applied to each segment:
- Room impulse response convolution (DNS5 RIRs)
- Additive background noise (SNR ~ Uniform(−5, 40) dB)
- Bandwidth limitation (16 kHz → 8 kHz → 16 kHz)
- Quantile clipping
- Codec compression (mp3 / ogg)
- Additive Gaussian white noise
- Simulated packet loss
- Background noise sourced from YouTube videos matched by camera device filename prefix; speech-containing segments removed using a VAD model
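The additive-noise stage, for example, amounts to scaling the noise to hit a target SNR before mixing. A sketch of that one stage (the actual corpus build scripts may differ):

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that mixing it into `speech` hits the target SNR."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)        # 1 s of toy audio at 16 kHz
noise = rng.standard_normal(16000)
noisy = add_noise_at_snr(speech, noise, rng.uniform(-5, 40))
```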
Corpus statistics:
| Split | Segments | Duration | Avg length | Avg WER |
|---|---|---|---|---|
| Train | 88,177 | 346.44 hr | 14.14 s | 27.88% |
| Validation | 1,003 | 3.95 hr | 14.17 s | 30.87% |
| Test | 1,003 | 3.95 hr | 14.16 s | 31.62% |
Dataset repository: https://github.com/HarveyRDonnelly/MiniCNoiSY
Evaluation on the Mini-CNoiSY test set. TER (Token Error Rate) measures sequence-level edit accuracy of the predicted Levenshtein operations.
| Model | RMSE ↓ | PCC ↑ | TER ↓ |
|---|---|---|---|
| WAV2LEV | 0.1488 | 0.8971 | 0.2972 |
| WHISP-MLP | 0.1376 | 0.9101 | — |
| Fe-WER* | 0.2333 | 0.8220 | — |
Mean predicted WER: 0.3178 ± 0.2704 · Mean true WER: 0.3162 ± 0.3313
* Fe-WER reimplemented: HuBERT large + XLM-RoBERTa large, mean pooling, 3-layer MLP.
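One plausible reading of TER is the Levenshtein distance between the predicted and reference operation sequences, normalised by the reference length; the paper's exact normalisation may differ. A sketch:

```python
def token_error_rate(pred, ref):
    """Normalised Levenshtein distance between two token sequences."""
    m, n = len(pred), len(ref)
    dp = list(range(n + 1))                  # edit distances for one row
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,           # deletion
                        dp[j - 1] + 1,       # insertion
                        prev + (pred[i - 1] != ref[j - 1]))  # (mis)match
            prev = cur
    return dp[n] / max(n, 1)

print(token_error_rate(["match", "sub", "match"],
                       ["match", "match", "match"]))  # → 0.3333333333333333
```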
```
git clone https://github.com/HarveyRDonnelly/WAV2LEV.git
cd WAV2LEV
pip install torch torchaudio transformers datasets dask pandas numpy scipy \
    scikit-learn wandb tqdm openai-whisper librosa soundfile \
    nemo_text_processing
```

Follow the instructions at https://github.com/HarveyRDonnelly/MiniCNoiSY to download the corpus, then place the parquet splits at:

```
mini_cnoisy/data/cnoisy_final_v3/train.parquet
mini_cnoisy/data/cnoisy_final_v3/validation.parquet
mini_cnoisy/data/cnoisy_final_v3/test.parquet
```

Paths are configurable in utils/config.py (CNOISY_DATASET_PATH).
WAV2LEV and WHISP-MLP use precomputed Whisper encoder states and uncertainty features to avoid re-running the encoder at every training step:
```
python main.py --precomp_embeddings
```

Embeddings are written to ./whisp_embeddings/ (configurable via EMBEDDINGS_PRECOMP_PATH in utils/config.py). The script is resumable: it skips already-processed segments.
WAV2LEV (experiment 10):

```
python main.py --exp 10 --model whisper
```

WHISP-MLP (experiment 9):

```
python main.py --exp 9 --model whisper
```

Fe-WER (experiment 8; loads raw audio, so it does not require precomputed embeddings):

```
python main.py --exp 8 --model whisper
```

All experiments log metrics and plots to Weights & Biases. Model weights are saved to mini_cnoisy/weights/ after each epoch (_recent.pt) and whenever validation RMSE improves (_best.pt).
Hyperparameters are defined in utils/config.py under the corresponding *_MODEL_CONFIG dictionaries.
```
python main.py --exp 10 --model whisper --test \
    --weights exp_10_whisper_<timestamp>_best.pt
```

Replace <timestamp> with the run timestamp printed during training (or use _recent.pt for the final checkpoint). The evaluation script reports RMSE, PCC, Spearman correlation, and Token Error Rate, and logs detailed per-segment predictions to W&B.
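For reference, the regression metrics reported above can be reproduced from per-segment WER estimates with numpy and scipy; the arrays below are toy values, not repository output:

```python
import numpy as np
from scipy import stats

pred_wer = np.array([0.10, 0.35, 0.50, 0.05])   # toy per-segment estimates
true_wer = np.array([0.12, 0.30, 0.55, 0.00])

rmse = np.sqrt(np.mean((pred_wer - true_wer) ** 2))
pcc, _ = stats.pearsonr(pred_wer, true_wer)     # Pearson correlation
scc, _ = stats.spearmanr(pred_wer, true_wer)    # Spearman correlation
```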
If you use WAV2LEV or Mini-CNoiSY in your work, please cite:
```bibtex
@INPROCEEDINGS{11462338,
  author={Donnelly, Harvey and Shi, Ken and Penn, Gerald},
  booktitle={ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  title={WAV2LEV: Predicting Levenshtein Edit Operation Sequences For Fine-Grained Estimation of Automatic Speech Recognition Error},
  year={2026},
  pages={15022-15026},
  doi={10.1109/ICASSP55912.2026.11462338}
}
```
