
VoiceChat EA STT training reproducible features#15558

Draft
ankitapasad wants to merge 2 commits into NVIDIA-NeMo:main from ankitapasad:stt_vc_ea_parity

Conversation

@ankitapasad
Collaborator

What does this PR do ?

Adds the following features to the dataset class to support VoiceChat EA STT training and fine-tuning:

  1. Correct agent EOS placement
  2. Clean up the token-ID implementation and update the user BOS ID to match EA
  3. MCQ system prompt
  4. Filler responses for ASR training data
  5. Number normalization
  6. Corresponding tests

Collection: speechlm2

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this 

PR Type:

  • New Feature
  • Bugfix

If you haven't finished some of the above items, you can still open a "Draft" PR.

Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: Ankita Pasad <apasad@nvidia.com>
…ization, clean-up token ID init, and corresponding tests

Co-authored-by: Claude <noreply@anthropic.com>

Signed-off-by: Ankita Pasad <apasad@nvidia.com>
import os

import pytest
import torch

Check notice (Code scanning / CodeQL): Unused import (test): Import of 'torch' is not used.

assert (target_tokens == eos).sum().item() == 0, "skip_eos=True should not place any EOS"

# Now collate source tokens, passing in the target channel for EOS placement
source_tokens, source_token_lens = collate_token_channel(

Check notice (Code scanning / CodeQL): Unused local variable (test): Variable source_tokens is not used.
Check notice (Code scanning / CodeQL): Unused local variable (test): Variable source_token_lens is not used.

skip_eos=True,
)

source_tokens, source_token_lens = collate_token_channel(

Check notice (Code scanning / CodeQL): Unused local variable (test): Variable source_tokens is not used.
Check notice (Code scanning / CodeQL): Unused local variable (test): Variable source_token_lens is not used.

from nemo.collections.common.tokenizers import AutoTokenizer
from nemo.collections.speechlm2.data.duplex_stt_dataset import DuplexSTTDataset
from nemo.collections.speechlm2.data.utils import get_pad_id

Check notice (Code scanning / CodeQL): Unused import (test): Import of 'get_pad_id' is not used.

train_batch = train_ds[cuts]
val_batch = val_ds[cuts]

train_targets = train_batch["audio_data"]["target_tokens"]

Check notice (Code scanning / CodeQL): Unused local variable (test): Variable train_targets is not used.

# Force aligner should be created but never called during validation
val_ds.force_aligner = MagicMock()
val_ds[cuts]

Check notice (Code scanning / CodeQL): Statement has no effect (test): This statement has no effect.

# Mock the force aligner to avoid loading wav2vec2
train_ds.force_aligner = MagicMock()
train_ds.force_aligner.batch_force_align_user_audio.side_effect = lambda cuts, **kwargs: cuts
train_ds[cuts]

Check notice (Code scanning / CodeQL): Statement has no effect (test): This statement has no effect.

- is_mcq_cut_train / is_mcq_cut_val / is_asr_cut
"""

import pytest

Check notice (Code scanning / CodeQL): Unused import (test): Import of 'pytest' is not used.
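The collation tests flagged above exercise collate_token_channel with skip_eos controlling EOS placement. A minimal pure-Python sketch of what such a collator might do is below; the signature, padding behavior, and return values are assumptions made for illustration, not the PR's actual code.

```python
def collate_token_channel(token_lists, pad_id, eos_id, skip_eos=False):
    # Hypothetical sketch of a token-channel collator: pads a batch of
    # token sequences to equal length, optionally appending EOS to each.
    # Effective lengths include one EOS slot unless skip_eos is set.
    lens = [len(t) + (0 if skip_eos else 1) for t in token_lists]
    max_len = max(lens)
    batch = []
    for toks, n in zip(token_lists, lens):
        row = list(toks)
        if not skip_eos:
            row.append(eos_id)  # EOS placed at the end of each sequence
        row.extend([pad_id] * (max_len - n))  # right-pad to max length
        batch.append(row)
    return batch, lens
```

With skip_eos=True no EOS token appears anywhere in the output, which is exactly what the test's assertion checks.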
assert tokenizer.bos is not None, "BOS support in the tokenizer is required."
assert tokenizer.eos is not None, "EOS support in the tokenizer is required."

user_bos_token = '^'
Collaborator

@kevinhu-nv Mar 30, 2026


I use the same BOS and EOS for the user and agent channels. I feel that is cleaner, and I verified that it does not impact model performance. I see you want to exactly match EA; how about we make these configurable, so one can set ^ and $? When we release the EA ckpt, we will release a config anyway to make it use ^ and $.
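Making the channel markers configurable, as suggested here, could look roughly like the sketch below. This is a hypothetical config fragment; the class and field names are assumptions, and which of ^ and $ maps to which marker is also an assumption (the diff only shows user_bos_token = '^').

```python
from dataclasses import dataclass

@dataclass
class ChannelTokenConfig:
    # Hypothetical config: defaults share one BOS/EOS pair across channels
    # (the reviewer's preference); an EA-parity config overrides the user
    # markers to '^' and '$'.
    user_bos: str = "<bos>"
    user_eos: str = "<eos>"
    agent_bos: str = "<bos>"
    agent_eos: str = "<eos>"

# Possible EA-parity override, released alongside the EA ckpt:
EA_CONFIG = ChannelTokenConfig(user_bos="^", user_eos="$")
```

This keeps the shared-token default while letting a released EA config opt into the EA-specific markers.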

Prompt selection priority:
1. Per-cut custom prompt (cut.custom['system_prompt'])
2. MCQ training cut -> THINK prompt for think-cuts, NOTHINK prompt for others
3. MCQ validation cut (when add_mcq_prompt=True) -> THINK prompt
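The priority order above can be sketched as a small helper. This is a hypothetical illustration, not the PR's code: the function name, the prompt constants, and the is_mcq/is_think keys in cut.custom are assumptions (only cut.custom['system_prompt'] and add_mcq_prompt appear in the description).

```python
THINK_PROMPT = "<think system prompt>"      # placeholder text, not the PR's actual prompt
NOTHINK_PROMPT = "<nothink system prompt>"  # placeholder text

def select_system_prompt(cut, is_train, add_mcq_prompt=False, default_prompt=None):
    # Hypothetical implementation of the three-step priority described above.
    custom = getattr(cut, "custom", None) or {}
    # 1. A per-cut custom prompt always wins.
    if custom.get("system_prompt"):
        return custom["system_prompt"]
    # 2. MCQ training cut: THINK prompt for think-cuts, NOTHINK for others.
    if is_train and custom.get("is_mcq"):
        return THINK_PROMPT if custom.get("is_think") else NOTHINK_PROMPT
    # 3. MCQ validation cut, only when add_mcq_prompt=True: THINK prompt.
    if not is_train and add_mcq_prompt and custom.get("is_mcq"):
        return THINK_PROMPT
    return default_prompt
```

The falls-through-to-default behavior for non-MCQ cuts without a custom prompt is also an assumption.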
Collaborator


Can you also add support for custom prompts? We could then easily evaluate the different demo setups we have used.

@kevinhu-nv
Collaborator

A high-level question: can you also share a training script/wandb run so we can make sure the metrics look roughly right? I think additional effort may be needed to catch up to the EA ckpt, but it is better to check at intermediate steps as well.
