diff --git a/docs/source/asr/asr_checkpoints.rst b/docs/source/asr/asr_checkpoints.rst new file mode 100644 index 000000000000..f7f7c72d6566 --- /dev/null +++ b/docs/source/asr/asr_checkpoints.rst @@ -0,0 +1,339 @@ +.. _asr-checkpoints-list: + +======================= +ASR Model Checkpoints +======================= + +This page lists all supported ASR model checkpoints released by NVIDIA NeMo. +Benchmark scores for each model can be found on its `HuggingFace model card `__. + +Glossary +-------- + +.. list-table:: + :header-rows: 1 + + * - Term + - Definition + * - **ASR** + - Automatic Speech Recognition — transcribing speech to text + * - **AST** + - Automatic Speech Translation — translating speech to text from one language to another + * - **AED** + - Attention Encoder-Decoder — autoregressive decoder using cross-attention (Canary family) + * - **CTC** + - Connectionist Temporal Classification — non-autoregressive decoder + * - **RNN-T** + - Recurrent Neural Network Transducer — autoregressive streaming-friendly decoder + * - **TDT** + - Token-and-Duration Transducer — extends RNN-T with duration prediction for faster inference + * - **Hybrid** + - Joint RNN-T + CTC model — both decoders trained together, either usable at inference + * - **PnC** + - Punctuation and Capitalization in the output + * - **Streaming** + - Real-time / cache-aware inference capability + * - **EU4** + - Multilingual: English, German, Spanish, French + * - **EU25** + - Multilingual: 25 European languages (de, en, es, fr, it, pl, pt, nl, ru, uk, be, hr, cs, bg, da, et, fi, el, hu, lv, lt, mt, ro, sk, sl, sv) + + +Canary Models (AED) +------------------- + +Multi-task encoder-decoder models supporting ASR, AST, PnC, and timestamps across multiple languages. + +.. 
list-table:: + :header-rows: 1 + + * - Model + - Decoder + - Capabilities + - Languages + * - `canary-1b-v2 `__ + - AED + - ASR, AST, PnC, timestamps + - EU25 + * - `canary-qwen-2.5b `__ + - AED + - ASR, AST, PnC, timestamps + - EU25 + * - `canary-1b-flash `__ + - AED + - ASR, AST, PnC, timestamps, fast + - EU4 + * - `canary-180m-flash `__ + - AED + - ASR, AST, PnC, timestamps, fast + - EU4 + * - `canary-1b `__ + - AED + - ASR, AST, PnC + - EU4 + + +Parakeet Models (English) +-------------------------- + +High-accuracy English ASR models with FastConformer encoder. + +.. list-table:: + :header-rows: 1 + + * - Model + - Decoder + - Capabilities + - Size + * - `parakeet-tdt-0.6b-v3 `__ + - TDT + - ASR, PnC, timestamps + - 0.6B + * - `parakeet-tdt-0.6b-v2 `__ + - TDT + - ASR, PnC, timestamps + - 0.6B + * - `parakeet-tdt-1.1b `__ + - TDT + - ASR, timestamps + - 1.1B + * - `parakeet-tdt_ctc-1.1b `__ + - Hybrid TDT+CTC + - ASR, timestamps + - 1.1B + * - `parakeet-tdt_ctc-0.6b-ja `__ + - Hybrid TDT+CTC + - ASR, timestamps, Japanese + - 0.6B + * - `parakeet-tdt_ctc-110m `__ + - Hybrid TDT+CTC + - ASR, timestamps + - 110M + * - `parakeet-rnnt-1.1b `__ + - RNN-T + - ASR, timestamps + - 1.1B + * - `parakeet-rnnt-0.6b `__ + - RNN-T + - ASR, timestamps + - 0.6B + * - `parakeet-ctc-1.1b `__ + - CTC + - ASR + - 1.1B + * - `parakeet-ctc-0.6b `__ + - CTC + - ASR + - 0.6B + * - `parakeet-rnnt-110m-da-dk `__ + - RNN-T + - ASR, Danish + - 110M + + +Streaming Models +----------------- + +Cache-aware models for real-time / low-latency inference. + +.. 
list-table:: + :header-rows: 1 + + * - Model + - Decoder + - Capabilities + - Languages + * - `nemotron-speech-streaming-en-0.6b `__ + - Hybrid + - ASR, streaming + - en + * - `multitalker-parakeet-streaming-0.6b-v1 `__ + - RNN-T + - ASR, multitalker, streaming + - en + * - `parakeet_realtime_eou_120m-v1 `__ + - RNN-T + - ASR, end-of-utterance, streaming + - en + * - `stt_en_fastconformer_hybrid_large_streaming_multi `__ + - Hybrid + - ASR, streaming, multiple look-aheads + - en + * - `stt_en_fastconformer_hybrid_medium_streaming_80ms_pc `__ + - Hybrid + - ASR, PnC, streaming + - en + * - `stt_en_fastconformer_hybrid_medium_streaming_80ms `__ + - Hybrid + - ASR, streaming + - en + * - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc `__ + - Hybrid + - ASR, PnC, streaming + - ka + * - `stt_en_fastconformer_hybrid_large_streaming_1040ms `__ + - Hybrid + - ASR, streaming + - en + + +FastConformer English Models (Non-Streaming) +---------------------------------------------- + +.. list-table:: + :header-rows: 1 + + * - Model + - Decoder + - Capabilities + - Size + * - `stt_en_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - Large + * - `stt_en_fastconformer_ctc_large `__ + - CTC + - ASR + - Large + * - `stt_en_fastconformer_ctc_xlarge `__ + - CTC + - ASR + - XLarge + * - `stt_en_fastconformer_ctc_xxlarge `__ + - CTC + - ASR + - XXLarge + * - `stt_en_fastconformer_transducer_large `__ + - RNN-T + - ASR + - Large + * - `stt_en_fastconformer_transducer_xlarge `__ + - RNN-T + - ASR + - XLarge + * - `stt_en_fastconformer_transducer_xxlarge `__ + - RNN-T + - ASR + - XXLarge + * - `stt_en_fastconformer_tdt_large `__ + - TDT + - ASR + - Large + + +FastConformer Multilingual Models +---------------------------------- + +.. 
list-table:: + :header-rows: 1 + + * - Model + - Decoder + - Capabilities + - Language + * - `stt_multilingual_fastconformer_hybrid_large_pc_blend_eu `__ + - Hybrid + - ASR, PnC + - Multilingual EU + * - `stt_de_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - de + * - `stt_es_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - es + * - `stt_es_fastconformer_hybrid_large_pc_nc `__ + - Hybrid + - ASR, PnC + - es + * - `stt_fr_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - fr + * - `stt_it_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - it + * - `stt_ru_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - ru + * - `stt_ua_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - uk + * - `stt_pl_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - pl + * - `stt_hr_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - hr + * - `stt_be_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - be + * - `stt_nl_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - nl + * - `stt_pt_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - pt + * - `stt_fa_fastconformer_hybrid_large `__ + - Hybrid + - ASR + - fa + * - `stt_ka_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - ka + * - `stt_hy_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - hy + * - `stt_ar_fastconformer_hybrid_large_pc_v1.0 `__ + - Hybrid + - ASR, PnC + - ar + * - `stt_ar_fastconformer_hybrid_large_pcd_v1.0 `__ + - Hybrid + - ASR, PnC, diacritization + - ar + * - `stt_uz_fastconformer_hybrid_large_pc `__ + - Hybrid + - ASR, PnC + - uz + * - `stt_kk_ru_fastconformer_hybrid_large `__ + - Hybrid + - ASR + - kk, ru + * - `parakeet-ctc-0.6b-Vietnamese `__ + - CTC + - ASR + - vi + + +Loading Models +-------------- + +All models can be loaded via the ``from_pretrained()`` API: + +.. 
code-block:: python
+
+   import nemo.collections.asr as nemo_asr
+
+   # From HuggingFace (prefix with nvidia/)
+   model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")
+
+   # From NGC (no prefix)
+   model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")
+
+To list all available models programmatically:
+
+.. code-block:: python
+
+   nemo_asr.models.ASRModel.list_available_models()
diff --git a/docs/source/asr/asr_customization/word_boosting.rst b/docs/source/asr/asr_customization/word_boosting.rst
index 58912539560f..62f3d2b45a47 100644
--- a/docs/source/asr/asr_customization/word_boosting.rst
+++ b/docs/source/asr/asr_customization/word_boosting.rst
@@ -146,6 +146,104 @@ You can compute the F-score for the list of context phrases directly from the de
        --key_words_file=${CONTEXT_PHRASES_LIST}
 
+.. _word_boosting_per_stream:
+
+Per-Stream Phrase Boosting
+==========================
+
+Per-stream (per-utterance) phrase boosting extends GPU-PB to allow specifying different key phrases for each audio stream or utterance in a batch.
+This is useful when different utterances require different context biasing (e.g., different speaker names, product terms, or domain vocabulary per audio).
+
+Per-stream boosting is currently supported for **greedy label-looping decoding with Transducers (RNN-T, TDT)**, including cache-aware streaming models.
+
+Manifest-based Usage
+--------------------
+
+Specify per-utterance key phrases in your manifest using the ``biasing_request`` field:
+
+.. code-block:: json
+
+   {"audio_filepath": "/data/file1.wav", "text": "ground truth", "biasing_request": {"boosting_model_cfg": {"key_phrases_list": ["one phrase"]}}}
+   {"audio_filepath": "/data/file2.wav", "text": "ground truth", "biasing_request": {"boosting_model_cfg": {"key_phrases_list": ["other phrases", "and this one"]}}}
+
+Use the streaming inference script and enable per-stream biasing with ``asr.decoding.greedy.enable_per_stream_biasing=True``:
+
+.. 
code-block:: bash + + python examples/asr/asr_streaming_inference/asr_streaming_infer.py \ + --config-path="../conf/asr_streaming_inference/" \ + --config-name=cache_aware_rnnt.yaml \ + audio_file="" \ + output_filename="result.jsonl" \ + asr.model_name="nvidia/parakeet-rnnt-1.1b" \ + asr.decoding.greedy.enable_per_stream_biasing=True + +Python API Usage +---------------- + +.. code-block:: python + + from omegaconf import open_dict + from nemo.collections.asr.models import EncDecRNNTBPEModel + from nemo.collections.asr.parts.context_biasing.biasing_multi_model import BiasingRequestItemConfig + from nemo.collections.asr.parts.context_biasing.boosting_graph_batched import BoostingTreeModelConfig + from nemo.collections.asr.parts.utils.rnnt_utils import Hypothesis + + asr_model = EncDecRNNTBPEModel.from_pretrained("nvidia/parakeet-rnnt-1.1b") + asr_model.to("cuda") + + with open_dict(asr_model.cfg.decoding): + asr_model.cfg.decoding.strategy = "greedy_batch" + asr_model.cfg.decoding.greedy.loop_labels = True + asr_model.cfg.decoding.greedy.enable_per_stream_biasing = True + asr_model.change_decoding_strategy(asr_model.cfg.decoding) + + biasing_requests = [ + BiasingRequestItemConfig( + boosting_model_cfg=BoostingTreeModelConfig(key_phrases_list=["one phrase"]), + boosting_model_alpha=2.0, + ), + None, # no biasing for this utterance + BiasingRequestItemConfig( + boosting_model_cfg=BoostingTreeModelConfig(key_phrases_list=["other phrases"]), + boosting_model_alpha=1.0, + ), + ] + + results = asr_model.transcribe( + audio=["file1.wav", "file2.wav", "file3.wav"], + partial_hypothesis=[ + Hypothesis.empty_with_biasing_cfg(biasing_cfg=req) if req else None + for req in biasing_requests + ], + return_hypotheses=True, + ) + +Caching +------- + +Building a boosting model from a phrase list has some overhead. NeMo provides caching mechanisms to speed up repeated use of the same phrases: + +.. 
list-table:: + :header-rows: 1 + :widths: 15 60 25 + + * - Strategy + - Description + - Recommended For + * - Memory + - Set ``cache_key`` on ``BiasingRequestItemConfig`` to cache compiled models in memory by a string key. + - Repeated phrase sets + * - Disk + - Set ``model_path`` on ``BoostingTreeModelConfig`` to save/load compiled models from disk. + - Persistent caching + * - Decoder + - Set ``auto_manage_multi_model=False`` and manually manage models in the decoder's multi-model. + - Advanced use cases + +With memory caching, per-stream boosting achieves near-zero overhead compared to global (shared) boosting. + + .. _word_boosting_flashlight: Flashlight-based Word Boosting diff --git a/docs/source/asr/asr_language_modeling_and_customization.rst b/docs/source/asr/asr_language_modeling_and_customization.rst index 125797a31a81..99f918006079 100644 --- a/docs/source/asr/asr_language_modeling_and_customization.rst +++ b/docs/source/asr/asr_language_modeling_and_customization.rst @@ -7,6 +7,41 @@ ASR Language Modeling and Customization NeMo supports decoding-time customization techniques such as *language modeling* and *word boosting*, which improve transcription accuracy by incorporating external knowledge or domain-specific vocabulary—without retraining the model. + +Decoder Types +------------- + +NeMo ASR models use different decoder architectures. The table below summarizes them: + +.. list-table:: + :header-rows: 1 + + * - Decoder + - Type + - Description + - Models + * - **CTC** + - Non-autoregressive + - Connectionist Temporal Classification. Fast inference, supports LM fusion and word boosting. + - Parakeet-CTC, FastConformer-CTC + * - **RNN-T** + - Autoregressive + - Recurrent Neural Network Transducer. Strong accuracy, streaming-friendly. + - Parakeet-RNNT, FastConformer-Transducer + * - **TDT** + - Autoregressive + - Token-and-Duration Transducer. Extends RNN-T with duration prediction for better timestamps. 
+ - Parakeet-TDT + * - **AED** + - Autoregressive + - Attention Encoder-Decoder. Multi-task capable (ASR + AST), prompt-based language control. + - Canary-1B, Canary-1B-V2, Canary-1B-Flash + * - **Hybrid** + - Both + - Joint RNN-T + CTC training. Use either decoder at inference time. + - FastConformer Hybrid models + + Language Modeling ----------------- @@ -69,6 +104,80 @@ NeMo provides tools for training n-gram language models that can be used for lan For details, please refer to: :ref:`ngram-utils`. +CUDA Graphs +----------- + +CUDA graphs accelerate decoding by capturing and replaying GPU operations, eliminating kernel launch overhead. +Support varies by decoder strategy: + +.. list-table:: + :header-rows: 1 + + * - Strategy + - Config Parameter + - Default + - Notes + * - ``greedy_batch`` (RNN-T, TDT) + - ``use_cuda_graph_decoder`` + - ``true`` + - Requires ``loop_labels=True`` and ``blank_as_pad=True`` + * - ``maes_batch``, ``malsd_batch`` (beam) + - ``allow_cuda_graphs`` + - ``true`` + - Batched beam search strategies + * - Non-batched ``greedy`` / ``beam`` + - N/A + - N/A + - Not supported; standard decoding used + +To disable CUDA graphs (e.g. for debugging or when preserving alignments with frame-looping): + +.. code-block:: yaml + + decoding: + greedy: + use_cuda_graph_decoder: false + beam: + allow_cuda_graphs: false + +When unsupported, NeMo falls back to standard decoding automatically. + + +Confidence Estimation +--------------------- + +NeMo supports per-frame, per-token, and per-word confidence scores during decoding. +Confidence estimation helps applications decide when to trust ASR output and when to request human review. + +.. 
code-block:: yaml + + decoding: + confidence_cfg: + preserve_frame_confidence: false + preserve_token_confidence: false + preserve_word_confidence: false + exclude_blank: true + aggregation: "mean" # mean, min, max, prod + method_cfg: + name: "entropy" # max_prob or entropy + entropy_type: "tsallis" # gibbs, tsallis, renyi + alpha: 0.33 + entropy_norm: "exp" # lin or exp + +**Confidence methods:** + +* ``max_prob``: Maximum token probability as confidence. Simple and fast. +* ``entropy``: Normalized entropy of the log-likelihood vector (default). Entropy types: + + - ``gibbs``: Standard Gibbs entropy + - ``tsallis``: Tsallis entropy (default, recommended) + - ``renyi``: Renyi entropy + +**Aggregation** combines frame-level scores into token/word scores: ``mean``, ``min``, ``max``, or ``prod``. + +For TDT models, set ``tdt_include_duration_confidence: true`` to include duration prediction confidence. + + .. toctree:: :maxdepth: 1 :hidden: diff --git a/docs/source/asr/configs.rst b/docs/source/asr/configs.rst index 88ee9a8df081..9547f6e2b07f 100644 --- a/docs/source/asr/configs.rst +++ b/docs/source/asr/configs.rst @@ -1,929 +1,100 @@ -NeMo ASR Configuration Files -============================ - -This section describes the NeMo configuration file setup that is specific to models in the ASR collection. For general information -about how to set up and run experiments that is common to all NeMo models (e.g. Experiment Manager and PyTorch Lightning trainer -parameters), see the :doc:`../core/core` section. - -The model section of the NeMo ASR configuration files generally requires information about the dataset(s) being used, the preprocessor -for audio files, parameters for any augmentation being performed, as well as the model architecture specification. The sections on -this page cover each of these in more detail. - -Example configuration files for all of the NeMo ASR scripts can be found in the -`config directory of the examples `_. - .. 
_asr-configs-dataset-configuration: +.. _asr-configs-preprocessor-configuration: +.. _asr-configs-augmentation-configurations: -Dataset Configuration ---------------------- - -Training, validation, and test parameters are specified using the ``train_ds``, ``validation_ds``, and -``test_ds`` sections in the configuration file, respectively. Depending on the task, there may be arguments specifying the sample rate -of the audio files, the vocabulary of the dataset (for character prediction), whether or not to shuffle the dataset, and so on. You may -also decide to leave fields such as the ``manifest_filepath`` blank, to be specified via the command-line at runtime. - -Any initialization parameter that is accepted for the Dataset class used in the experiment can be set in the config file. -Refer to the :ref:`Datasets ` section of the API for a list of Datasets and their respective parameters. - -An example ASR train and validation configuration should look similar to the following: - -.. code-block:: yaml - - # Specified at the beginning of the config file - labels: &labels [" ", "a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", - "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"] - - model: - train_ds: - manifest_filepath: ??? - sample_rate: 16000 - labels: *labels # Uses the labels above - batch_size: 32 - trim_silence: True - max_duration: 16.7 - shuffle: True - num_workers: 8 - pin_memory: true - # tarred datasets - is_tarred: false # If set to true, uses the tarred version of the Dataset - tarred_audio_filepaths: null # Not used if is_tarred is false - shuffle_n: 2048 # Not used if is_tarred is false - # bucketing params - bucketing_strategy: "synced_randomized" - bucketing_batch_size: null - bucketing_weights: null - - validation_ds: - manifest_filepath: ??? 
- sample_rate: 16000 - labels: *labels # Uses the labels above - batch_size: 32 - shuffle: False # No need to shuffle the validation data - num_workers: 8 - pin_memory: true - -There are two ways to test/validate on more than one manifest: - -- Specify a list in the `manifest_filepath` field. Results will be reported for each, the first one being used for overall loss / WER (specify `val_dl_idx` if you wish to change that). In this case, all manifests will share configuration parameters. -- Use the ds_item key and pass a list of config objects to it. This allows you to use differently configured datasets for validation, e.g. - -.. code-block:: yaml - - model: - validation_ds: - ds_item: - - name: dataset1 - manifest_filepath: ??? - # Config parameters for dataset1 - ... - - name: dataset2 - manifest_filepath: ??? - # Config parameters for dataset2 - ... - -By default, dataloaders are set up when the model is instantiated. However, dataloader setup can be deferred to -model's `setup()` method by setting ``defer_setup`` in the configuration. - -For example, training data setup can be deferred as follows: - -.. code-block:: yaml - - model: - train_ds: - # Configure training data as usual - ... - # Defer train dataloader setup from `__init__` to `setup` - defer_setup: true +NeMo ASR Configuration Files +============================ +This page covers ASR-specific configuration. For general NeMo setup (Experiment Manager, trainer), see :doc:`../core/core`. +Example configs: `examples/asr/conf `_. -.. _asr-configs-metric-configuration: Metric Configurations --------------------- -NeMo ASR models supports WER and BLEU metric logging during training and validation. All metrics are based on the TorchMetrics backend, allowing for distributed training without additional code. - -Word Error Rate (WER) -~~~~~~~~~~~~~~~~~~~~~ - -WER is the default metric for all ASR models and measures transcription accuracy at the word or character level. 
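WER counts the word-level substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the number of reference words; with ``use_cer: true`` the same computation runs over characters. As an illustrative, NeMo-independent sketch of what the metric computes (NeMo's own ``WER`` class is TorchMetrics-based and handles batching and distributed aggregation; the helper names below are hypothetical):

```python
from typing import List


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two token sequences (single-row DP)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cur = dp[j]
            dp[j] = min(
                dp[j] + 1,                          # deletion
                dp[j - 1] + 1,                      # insertion
                prev + (ref[i - 1] != hyp[j - 1]),  # substitution (cost 0 if equal)
            )
            prev = cur
    return dp[-1]


def word_error_rate(refs: List[str], hyps: List[str], use_cer: bool = False) -> float:
    """Corpus-level WER (or CER when use_cer=True): total edits / total reference length."""
    errors, length = 0, 0
    for ref, hyp in zip(refs, hyps):
        r = list(ref) if use_cer else ref.split()
        h = list(hyp) if use_cer else hyp.split()
        errors += edit_distance(r, h)
        length += len(r)
    return errors / length
```

Note that the corpus-level score sums edits and reference lengths over all utterances before dividing, rather than averaging per-utterance rates, which matches how a streaming/distributed metric accumulates state.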
+NeMo ASR supports WER and BLEU metrics via TorchMetrics. .. code-block:: yaml model: - use_cer: false # Set to true for Character Error Rate instead (default: false) - log_prediction: true # Whether to log a sample prediction during training (default: true) - batch_dim_index: 0 # Index of batch dimension in prediction tensors output. Set to 1 for RNNT models. + use_cer: false + log_prediction: true -BLEU Score -~~~~~~~~~~ +For BLEU (translation): set ``bleu_tokenizer`` (``13a``, ``none``, ``intl``, ``char``, ``zh``, ``ja-mecab``, ``ko-mecab``, ``flores101``, ``flores200``). -BLEU score can be used for ASR models to evaluate translation quality. NeMo's BLEU implementation is based on SacreBLEU for standardized, reproducible scoring: - -.. code-block:: yaml - - model: - bleu_tokenizer: "13a" # SacreBLEU tokenizer type (see below). (default: "13a") - n_gram: 4 # Maximum n-gram order for BLEU calculation. (default: 4) - lowercase: false # Whether to lowercase before computing BLEU. (default: False) - weights: null # Optional custom weights for n-gram orders. (default: null) - smooth: false # Whether to apply smoothing to BLEU calculation. (default: False) - check_cuts_for_bleu_tokenizers: false # Enable per-sample tokenizer selection. (See below for more details.) (default: False) - log_prediction: true # Whether to log sample predictions. (default: True) - batch_dim_index: 0 # Index of batch dimension in prediction tensors output. Set to 1 for RNNT models. (default: 0) - -BLEU score relies on TorchMetrics' SacreBLEU implementation and supports all SacreBLEU tokenization options. Valid strings may be passed to ``bleu_tokenizer`` parameter to configure base tokenizer behavior during BLEU calculation. 
Available options are: - -* ``"13a"`` - Default WMT tokenizer (mteval-v13a script compatible) -* ``"none"`` - No tokenization applied -* ``"intl"`` - International tokenization (mteval-v14 script compatible) -* ``"char"`` - Character-level tokenization (language-agnostic) -* ``"zh"`` - Chinese tokenization (separates Chinese characters, uses 13a for non-Chinese) -* ``"ja-mecab"`` - Japanese tokenization using MeCab morphological analyzer -* ``"ko-mecab"`` - Korean tokenization using MeCab-ko morphological analyzer -* ``"flores101"`` / ``"flores200"`` - SentencePiece models from Flores datasets - -**Note** Due to their unique orthographies, it is highly recommended to use ``zh``, ``ja-mecab``, or ``ko-mecab`` tokenizers for Chinese, Japanese, and Korean target evaluations, respectively. For more information on SacreBLEU tokenizers, please refer to the `SacreBLEU documentation `__. - -**Dynamic Tokenizer Selection** - -In multilingual training scenarios, it is somtimes desireable to configure the BLEU tokenizer per sample to avoid sub-optimal parsing (e.g. tokenizing Chinese characters as English words). This can be toggled with ``check_cuts_for_bleu_tokenizers: true``. When enabled with Lhotse dataloading, BLEU will check individual ``cuts`` in a batch's Lhotse ``CutSet`` for the ``bleu_tokenizer`` attribute. If found, the tokenizer will be used for that sample. If not, the default ``bleu_tokenizer`` from config will be used. - -MultiTask Metrics -~~~~~~~~~~~~~~~~~ - -Multiple metrics can be configured simultaneously using a ``MultiTaskMetric`` config. This is done by specifying in the config each desired metric as a DictConfig entry with a custom key name and ``_target_`` path, along with desired properties. All properties specified within a metric config will be passed only to the metric class. All properties specified at the top level of the config will be inherited by all submetrics. 
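Conceptually, each submetric's ``constraint`` acts as a per-sample predicate: every sample in a batch is offered to every metric, and a metric updates only on the samples whose properties match. A simplified sketch with hypothetical stand-in classes (NeMo's actual implementation evaluates constraint strings such as ``".task==transcribe"`` against Lhotse cut properties):

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Sample:
    """Stand-in for one utterance's properties (like a Lhotse cut's custom fields)."""
    task: str


@dataclass
class ConstrainedMetric:
    """Wraps metric updates behind a per-sample constraint predicate."""
    constraint: Callable[[Sample], bool]
    seen: List[Tuple[str, str]] = field(default_factory=list)

    def update(self, sample: Sample, ref: str, hyp: str) -> None:
        # Only samples matching the constraint contribute to this metric.
        if self.constraint(sample):
            self.seen.append((ref, hyp))


# Predicate analogues of constraint: ".task==transcribe" / ".task==translate"
metrics = {
    "wer": ConstrainedMetric(lambda s: s.task == "transcribe"),
    "bleu": ConstrainedMetric(lambda s: s.task == "translate"),
}

batch = [
    (Sample(task="transcribe"), "hello world", "hello world"),
    (Sample(task="translate"), "bonjour", "hello"),
    (Sample(task="transcribe"), "good morning", "good morning"),
]

# Every sample is offered to every metric; the constraints do the routing.
for sample, ref, hyp in batch:
    for metric in metrics.values():
        metric.update(sample, ref, hyp)
```

After routing, the WER metric has accumulated only the two transcription samples and the BLEU metric only the translation sample, which is exactly the filtering behavior the ``constraint`` field configures.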
+For multitask models, use ``multitask_metrics_config`` with per-metric constraints: .. code-block:: yaml model: multitask_metrics_config: - log_prediction: true metrics: wer: _target_: nemo.collections.asr.metrics.wer.WER - use_cer: true - constraint: ".task==transcribe" # Only apply WER to transcription samples + constraint: ".task==transcribe" bleu: _target_: nemo.collections.asr.metrics.bleu.BLEU - bleu_tokenizer: flores101 - lowercase: true - check_cuts_for_bleu_tokenizers: true - constraint: ".task==translate" # Only apply BLEU to translation samples - -**Metric Constraints** - -Each metric within ``MultiTaskMetric`` can be configured with an optional boolean ``constraint`` pattern that filters batch samples before metric computation. This allows validation to be limited to only applicable samples in a batch (e.g. only apply WER to transcription samples, only apply BLEU to translation samples). Constraint patterns match against property keywords in the batch's Lhotse CutSet. - -.. code-block:: yaml - - model: - multitask_metrics_config: - metrics: - pnc_wer: - _target_: nemo.collections.asr.metrics.wer.WER - constraint: ".task==transcribe and .pnc==true" - - multilingual_bleu: - _target_: nemo.collections.asr.metrics.bleu.BLEU - constraint: "(.source_lang!=.target_lang) or .task==translate" - -**Note:** MultiTaskMetric is currently only supported for AED multitask models. - + constraint: ".task==translate" -.. _asr-configs-preprocessor-configuration: - -Preprocessor Configuration --------------------------- - -If you are loading audio files for your experiment, you will likely want to use a preprocessor to convert from the -raw audio signal to features (e.g. mel-spectrogram or MFCC). The ``preprocessor`` section of the config specifies the audio -preprocessor to be used via the ``_target_`` field, as well as any initialization parameters for that preprocessor. - -An example of specifying a preprocessor is as follows: - -.. code-block:: yaml - - model: - ... 
- preprocessor: - # _target_ is the audio preprocessor module you want to use - _target_: nemo.collections.asr.modules.AudioToMelSpectrogramPreprocessor - normalize: "per_feature" - window_size: 0.02 - ... - # Other parameters for the preprocessor - -Refer to the :ref:`Audio Preprocessors ` API section for the preprocessor options, expected arguments, -and defaults. - -.. _asr-configs-augmentation-configurations: - -Augmentation Configurations ---------------------------- - -There are a few on-the-fly spectrogram augmentation options for NeMo ASR, which can be specified by the -configuration file using a ``spec_augment`` section. - -For example, there are options for `Cutout `_ and -`SpecAugment `_ available via the ``SpectrogramAugmentation`` module. - -The following example sets up both ``Cutout`` (via the ``rect_*`` parameters) and ``SpecAugment`` (via the ``freq_*`` -and ``time_*`` parameters). - -.. code-block:: yaml - - model: - ... - spec_augment: - _target_: nemo.collections.asr.modules.SpectrogramAugmentation - # Cutout parameters - rect_masks: 5 # Number of rectangles to cut from any given spectrogram - rect_freq: 50 # Max cut of size 50 along the frequency dimension - rect_time: 120 # Max cut of size 120 along the time dimension - # SpecAugment parameters - freq_masks: 2 # Cut two frequency bands - freq_width: 15 # ... of width 15 at maximum - time_masks: 5 # Cut out 10 time bands - time_width: 25 # ... of width 25 at maximum - -You can use any combination of ``Cutout``, frequency/time ``SpecAugment``, or neither of them. - -With NeMo ASR, you can also add augmentation pipelines that can be used to simulate various kinds of noise -added to audio in the channel. Augmentors in a pipeline are applied on the audio data read in the data layer. Online -augmentors can be specified in the config file using an ``augmentor`` section in ``train_ds``. 
The following example -adds an augmentation pipeline that first adds white noise to an audio sample with a probability of 0.5 and at a level -randomly picked between -50 dB and -10 dB and then passes the resultant samples through a room impulse response randomly -picked from the manifest file provided for ``impulse`` augmentation in the config file. - -.. code-block:: yaml - - model: - ... - train_ds: - ... - augmentor: - white_noise: - prob: 0.5 - min_level: -50 - max_level: -10 - impulse: - prob: 0.3 - manifest_path: /path/to/impulse_manifest.json - -Refer to the :ref:`Audio Augmentors ` API section for more details. Tokenizer Configurations ------------------------ -Some models utilize sub-word encoding via an external tokenizer instead of explicitly defining their vocabulary. - -For such models, a ``tokenizer`` section is added to the model config. ASR models currently support two types of -custom tokenizers: - -- Google Sentencepiece tokenizers (tokenizer type of ``bpe`` in the config) -- HuggingFace WordPiece tokenizers (tokenizer type of ``wpe`` in the config) -- Aggregate tokenizers ((tokenizer type of ``agg`` in the config), see below) - -In order to build custom tokenizers, refer to the ``ASR_with_Subword_Tokenization`` notebook available in the -ASR tutorials directory. - -The following example sets up a ``SentencePiece Tokenizer`` at a path specified by the user: +Models using sub-word encoding require a ``tokenizer`` section: .. code-block:: yaml model: - ... tokenizer: - dir: "" - type: "bpe" # can be "bpe" or "wpe" - -The Aggregate (``agg``) tokenizer feature makes it possible to combine tokenizers in order to train multilingual -models. The config file would look like this: - -.. code-block:: yaml - - model: - ... 
- tokenizer: - type: "agg" # aggregate tokenizer - langs: - en: - dir: "" - type: "bpe" # can be "bpe" or "wpe" - es: - dir: "" - type: "bpe" # can be "bpe" or "wpe" - -In the above config file, each language is associated with its own pre-trained tokenizer, which gets assigned -a token id range in the order the tokenizers are listed. To train a multilingual model, one needs to populate the -``lang`` field in the manifest file, allowing the routing of each sample to the correct tokenizer. At inference time, -the routing is done based on the inferred token id range. - -For models which utilize sub-word tokenization, we share the decoder module (``ConvASRDecoder``) with character tokenization models. -All parameters are shared, but for models which utilize sub-word encoding, there are minor differences when setting up the config. For -such models, the tokenizer is utilized to fill in the missing information when the model is constructed automatically. - -For example, a decoder config corresponding to a sub-word tokenization model should look similar to the following: - -.. code-block:: yaml - - model: - ... - decoder: - _target_: nemo.collections.asr.modules.ConvASRDecoder - feat_in: *enc_final - num_classes: -1 # filled with vocabulary size from tokenizer at runtime - vocabulary: [] # filled with vocabulary from tokenizer at runtime - - -On-the-fly Code Switching -------------------------- - -Nemo supports creating code-switched synthetic utterances on-the-fly during training/validation/testing. This allows you to create ASR models which -support intra-utterance code switching. If you have Nemo formatted audio data on disk (either JSON manifests or tarred audio data), you -can easily mix as many of these audio sources together as desired by adding some extra parameters to your `train_ds`, `validation_ds`, and `test_ds`. - -Please note that this allows you to mix any kind of audio sources together to create synthetic utterances which sample from all sources. 
The most -common use case for this is blending different languages together to create a multilingual code-switched model, but you can also blend -together different audio sources from the same languages (or language families), to create noise robust data, or mix fast and slow speech from the -same language. - -For multilingual code-switched models, we recommend using AggTokenizer for your Tokenizer if mixing different languages. - -The following example shows how to mix 3 different languages: English (en), German (de), and Japanese (ja) added to the `train_ds` model block, however -you can add similar logic to your `validation_ds` and `test_ds` blocks for on-the-fly code-switched validation and test data too. This example mixes -together 3 languages, but you can use as many as you want. However, be advised that the more languages you add, the higher your `min_duration` and `max_duration` -need to be set to ensure all languages are sampled into each synthetic utterance, and setting these hyperparameters higher will use more VRAM per mini-batch during -training and evaluation. - -.. 
code-block:: yaml - - model: - train_ds: - manifest_filepath: [/path/to/EN/tarred_manifest.json, /path/to/DE/tarred_manifest.json, /path/to/JA/tarred_manifest.json] - tarred_audio_filepaths: ['/path/to/EN/tars/audio__OP_0..511_CL_.tar', '/path/to/DE/tars/audio__OP_0..1023_CL_.tar', '/path/to/JA/tars/audio__OP_0..2047_CL_.tar'] - is_code_switched: true - is_tarred: true - shuffle: true - code_switched: # add this block for code-switching - min_duration: 12 # the minimum number of seconds for each synthetic code-switched utterance - max_duration: 20 # the maximum number of seconds for each synthetic code-switched utterance - min_monolingual: 0.3 # the minimum percentage of utterances which will be pure monolingual (0.3 = 30%) - probs: [0.25, 0.5, 0.25] # the probability to sample each language (matches order of `language` above) if not provided, assumes uniform distribution - force_monochannel: true # if your source data is multi-channel, then setting this to True will force the synthetic utterances to be mono-channel - sampling_scales: 0.75 # allows you to down/up sample individual languages. Can set this as an array for individual languages, or a scalar for all languages - seed: 123 # add a seed for replicability in future runs (highly useful for `validation_ds` and `test_ds`) - - -Model Architecture Configurations ---------------------------------- - -Each configuration file should describe the model architecture being used for the experiment. Models in the NeMo ASR collection need -an ``encoder`` section and a ``decoder`` section, with the ``_target_`` field specifying the module to use for each. 
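Under the hood, each ``_target_`` string is resolved to a Python class and instantiated with the remaining keys of its section as keyword arguments (the Hydra/OmegaConf instantiation pattern). A minimal, stdlib-only sketch of the idea, using ``datetime.timedelta`` as a stand-in for a NeMo module class:

```python
import importlib

def instantiate(target: str, **kwargs):
    """Resolve a dotted ``_target_`` path to a class and construct it."""
    module_name, _, class_name = target.rpartition(".")
    cls = getattr(importlib.import_module(module_name), class_name)
    return cls(**kwargs)

# Stand-in for e.g. _target_: nemo.collections.asr.modules.ConformerEncoder;
# a stdlib class is used here so the sketch runs anywhere.
td = instantiate("datetime.timedelta", seconds=90)
print(td.total_seconds())  # 90.0
```

NeMo performs this resolution itself when building a model from its config; the helper above only illustrates the mechanism.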
+ dir: "" + type: "bpe" # or "wpe" -Here is the list of the parameters in the model section which are shared among most of the ASR models: - -+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ -| **Parameter** | **Datatype** | **Description** | **Supported Values** | -+=========================+==================+===============================================================================================================+=================================+ -| :code:`log_prediction` | bool | Whether a random sample should be printed in the output at each step, along with its predicted transcript. | | -+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ -| :code:`ctc_reduction` | string | Specifies the reduction type of CTC loss. Defaults to ``mean_batch`` which would take the average over the | :code:`none`, | -| | | batch after taking the average over the length of each sample. | :code:`mean_batch` | -| | | | :code:`mean`, :code:`sum` | -+-------------------------+------------------+---------------------------------------------------------------------------------------------------------------+---------------------------------+ - -The following sections go into more detail about the specific configurations of each model architecture. - -For more information about the ASR models, refer to the :doc:`Models <./models>` section. - - -.. _asr-configs-conformer-ctc: - -Conformer-CTC -~~~~~~~~~~~~~ - -The config files for Conformer-CTC model contain character-based encoding and sub-word encoding at -``/examples/asr/conf/conformer/conformer_ctc_char.yaml`` and ``/examples/asr/conf/conformer/conformer_ctc_bpe.yaml`` -respectively. 
Some components of the configs of :ref:`Conformer-CTC ` include the following: - -* ``train_ds``, ``validation_ds``, and ``test_ds`` -* optimizer (``optim``) -* augmentation (``spec_augment``) -* ``decoder`` -* ``trainer`` -* ``exp_manager`` - -There should be a tokenizer section where you can -specify the tokenizer if you want to use sub-word encoding instead of character-based encoding. - - -The encoder section includes the details about the Conformer-CTC encoder architecture. You may find more information in the -config files and also :ref:`nemo.collections.asr.modules.ConformerEncoder `. - - -Conformer-Transducer -~~~~~~~~~~~~~~~~~~~~ - -Please refer to the model page of :ref:`Conformer-Transducer ` for more information on this model. - -.. _asr-configs-lstm-transducer-and-ctc: - -LSTM-Transducer and LSTM-CTC -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The config files for LSTM-Transducer and LSTM-CTC models can be found at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml`` and ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml`` respectively. -Most of the configs are similar to other CTC or transducer models. The main difference is the encoder part. -The encoder section includes the details about the RNN-based encoder architecture. You may find more information in the -config files and also :ref:`nemo.collections.asr.modules.RNNEncoder `. - - -InterCTC Config ---------------- - -All CTC-based models also support `InterCTC loss `_. To use it, you need to specify -two parameters as in the example below: - -.. code-block:: yaml - - model: - # ... - interctc: - loss_weights: [0.3] - apply_at_layers: [8] - -which can be used to reproduce the default setup from the paper (assuming the total number of layers is 18). -You can also specify multiple CTC losses from different layers, e.g., to get 2 losses from layers 3 and 8 with -weights 0.1 and 0.3, specify: - -.. code-block:: yaml - - model: - # ... 
- interctc: - loss_weights: [0.1, 0.3] - apply_at_layers: [3, 8] - -Note that the final-layer CTC loss weight is automatically computed to normalize -all weights to 1 (0.6 in the example above). - - -Stochastic Depth Config ------------------------ - -`Stochastic Depth `_ is a useful technique for regularizing ASR model training. -Currently it's only supported for :ref:`nemo.collections.asr.modules.ConformerEncoder `. To -use it, specify the following parameters in the encoder config file to reproduce the default setup from the paper: - -.. code-block:: yaml - - model: - # ... - encoder: - # ... - stochastic_depth_drop_prob: 0.3 - stochastic_depth_mode: linear # linear or uniform - stochastic_depth_start_layer: 1 - -See :ref:`documentation of ConformerEncoder ` for more details. Note that stochastic depth -is supported for both CTC and Transducer model variations (or any other kind of model/loss that's using -Conformer as the encoder). +For multilingual models, use aggregate tokenizers (``type: "agg"``) with per-language sub-tokenizers. Transducer Configurations ------------------------- -All CTC-based ASR model configs can be modified to support Transducer loss training. Below, we discuss the modifications required in the config to enable Transducer training. All modifications are made to the ``model`` config. - -Model Defaults -~~~~~~~~~~~~~~ - -It is a subsection of the model config, represented as ``model.model_defaults``, containing the default values shared across the entire model. - -There are three values that are primary components of a transducer model. They are: - -* ``enc_hidden``: The hidden dimension of the final layer of the Encoder network. -* ``pred_hidden``: The hidden dimension of the final layer of the Prediction network. -* ``joint_hidden``: The hidden dimension of the intermediate layer of the Joint network. 
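Values such as ``${model.model_defaults.pred_hidden}`` elsewhere in the config are OmegaConf interpolations that resolve against this subsection. A toy, stdlib-only illustration of the lookup (real configs are resolved by OmegaConf itself, not by a helper like this):

```python
import re

config = {
    "model": {
        "model_defaults": {"enc_hidden": 256, "pred_hidden": 256, "joint_hidden": 256},
        "decoder": {"prednet": {"pred_hidden": "${model.model_defaults.pred_hidden}"}},
    }
}

def resolve(value, root):
    """Replace a ``${dotted.path}`` reference with the value found at that path."""
    match = re.fullmatch(r"\$\{([\w.]+)\}", str(value))
    if not match:
        return value  # plain values pass through unchanged
    node = root
    for key in match.group(1).split("."):
        node = node[key]
    return node

pred_hidden = resolve(config["model"]["decoder"]["prednet"]["pred_hidden"], config)
print(pred_hidden)  # 256
```

Because all three hidden sizes live in one place, changing ``model_defaults`` once updates every module that interpolates them.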
- -One can access these values inside the config by using OmegaConf interpolation as follows: +CTC configs can be extended for Transducer training by adding prediction network, joint network, decoding, and loss sections. .. code-block:: yaml model: - ... model_defaults: enc_hidden: 256 pred_hidden: 256 joint_hidden: 256 - ... + decoder: - ... + _target_: nemo.collections.asr.modules.RNNTDecoder + blank_as_pad: true prednet: pred_hidden: ${model.model_defaults.pred_hidden} - -Acoustic Encoder Model -~~~~~~~~~~~~~~~~~~~~~~ - -The transducer model is composed of three models. One of these models is the Acoustic (encoder) model. Any CTC Acoustic model config can be dropped into this section of the transducer config. - -The only condition that needs to be met is that **the final layer of the acoustic model must have the hidden dimension defined in ``model_defaults.enc_hidden``**. - -Decoder / Prediction Model -~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Prediction model is generally an autoregressive, causal model that consumes text tokens and returns embeddings that will be used by the Joint model. The base config for an LSTM-based Prediction network can be found in the ``decoder`` section of Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. - -**This config can be copy-pasted into any custom transducer model with no modification.** - -Let us discuss some of the important arguments: - -* ``blank_as_pad``: In ordinary transducer models, the embedding matrix does not acknowledge the ``Transducer Blank`` token (similar to CTC Blank). However, this causes the autoregressive loop to be more complicated and less efficient. Instead, this flag, which is set by default, will add the ``Transducer Blank`` token to the embedding matrix - and use it as a pad value (zeros tensor). This enables more efficient inference without harming training. 
For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. - -* ``prednet.pred_hidden``: The hidden dimension of the LSTM and the output dimension of the Prediction network. - -.. code-block:: yaml - - decoder: - _target_: nemo.collections.asr.modules.RNNTDecoder - normalization_mode: null - random_state_sampling: false - blank_as_pad: true - - prednet: - pred_hidden: ${model.model_defaults.pred_hidden} - pred_rnn_layers: 1 - t_max: null - dropout: 0.0 - -Joint Model -~~~~~~~~~~~ - -The Joint model is a simple feed-forward Multi-Layer Perceptron network. This MLP accepts the output of the Acoustic and Prediction models and computes a joint probability distribution over the entire vocabulary space. The base config for the Joint network can be found in the ``joint`` section of Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. - -**This config can be copy-pasted into any custom transducer model with no modification.** - -The Joint model config has several essential components which we discuss below: - -* ``log_softmax``: Due to the cost of computing softmax on such large tensors, the Numba CUDA implementation of RNNT loss will implicitly compute the log softmax when called (so its inputs should be logits). The CPU version of the loss doesn't face such memory issues, so it requires log-probabilities instead. Since the behaviour differs between CPU and GPU, the ``None`` value will automatically switch behaviour depending on whether the input tensor is on a CPU or GPU device. - -* ``preserve_memory``: This flag will call ``torch.cuda.empty_cache()`` at certain critical sections when computing the Joint tensor. While this operation might allow us to preserve some memory, the empty_cache() operation is tremendously slow and will slow down training by an order of magnitude or more. It is available to use but not recommended. 
- -* ``fuse_loss_wer``: This flag performs "batch splitting" and then "fused loss + metric" calculation. It will be discussed in detail in the next tutorial that will train a Transducer model. - -* ``fused_batch_size``: When the above flag is set to True, the model will have two distinct "batch sizes". The batch size provided in the three data loader configs (``model.*_ds.batch_size``) will now be the ``Acoustic model`` batch size, whereas the ``fused_batch_size`` will be the batch size of the ``Prediction model``, the ``Joint model``, the ``transducer loss`` module and the ``decoding`` module. - -* ``jointnet.joint_hidden``: The hidden intermediate dimension of the joint network. - -.. code-block:: yaml - - joint: - _target_: nemo.collections.asr.modules.RNNTJoint - log_softmax: null # sets it according to cpu/gpu device - - # fused mode - fuse_loss_wer: false - fused_batch_size: 16 - - jointnet: - joint_hidden: ${model.model_defaults.joint_hidden} - activation: "relu" - dropout: 0.0 - -Sampled Softmax Joint Model -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -There are some situations where a Transducer model must be trained with a large vocabulary - such as for multilingual models with a large -number of languages. In this setting, the memory cost of training the Transducer network with the full vocabulary can be -prohibitive. - -For such cases, one can utilize the ``SampledRNNTJoint`` module instead of the usual ``RNNTJoint`` module, in order -to compute the loss using a sampled subset of the vocabulary rather than the full vocabulary. - -It adds only one additional parameter: - -* ``n_samples``: Specifies the minimum number of tokens to sample from the vocabulary space, - excluding the RNNT blank token. If a given value is larger than the entire vocabulary size, - then the full vocabulary will be used. - -The only difference in config required is to replace ``nemo.collections.asr.modules.RNNTJoint`` with ``nemo.collections.asr.modules.SampledRNNTJoint`` - -.. 
code-block:: yaml - - joint: - _target_: nemo.collections.asr.modules.SampledRNNTJoint - n_samples: 500 - ... # All other arguments from RNNTJoint can be used after this. - - -Effect of Batch Splitting / Fused Batch step -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The following section explains why memory is an issue when training Transducer models and how NeMo tackles the issue with its Fused Batch step. The material can be read for a thorough understanding; otherwise, it can be skipped. You can also follow these steps in the "ASR_with_Transducers" tutorial. - -**Diving deeper into the memory costs of Transducer Joint** - -One of the significant limitations of Transducers is the exorbitant memory cost of computing the Joint module. The Joint module is comprised of two steps. - -1) Projecting the Acoustic and Transcription feature dimensions to some standard hidden dimension (specified by model.model_defaults.joint_hidden) - -2) Projecting this intermediate hidden dimension to the final vocabulary space to obtain the transcription. - -Take the following example. - -BS=32 ; T (after 2x stride) = 800, U (with character encoding) = 400-450 tokens, Vocabulary size V = 28 (26 alphabet chars, space and apostrophe). Let the hidden dimension of the Joint model be 640 (most Google Transducer papers use a hidden dimension of 640). - -* :math:`Memory \, (Hidden, \, gb) = 32 \times 800 \times 450 \times 640 \times 4 = 29.49` gigabytes (4 bytes per float). - -* :math:`Memory \, (Joint, \, gb) = 32 \times 800 \times 450 \times 28 \times 4 = 1.290` gigabytes (4 bytes per float) - -**NOTE**: This is just for the forward pass! We need to double this memory to store gradients! This much memory is also just for the Joint model **alone**. Far more memory is required for the Prediction model as well as the large Acoustic model itself and its gradients! - -Even with mixed precision, that's :math:`\sim 30` GB of GPU RAM for just 1 part of the network + its gradients. 
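The arithmetic above is straightforward to reproduce. A small sketch that recomputes both figures from the example (4 bytes per float, decimal gigabytes):

```python
def joint_memory_gb(batch, T, U, dim, bytes_per_float=4):
    """Forward-pass activation size of a (B, T, U, D) joint tensor, in decimal GB."""
    return batch * T * U * dim * bytes_per_float / 1e9

# Values from the example: BS=32, T=800, U=450, joint hidden 640, vocab 28
hidden_gb = joint_memory_gb(32, 800, 450, 640)
vocab_gb = joint_memory_gb(32, 800, 450, 28)
print(round(hidden_gb, 2), round(vocab_gb, 2))  # 29.49 1.29
```

Plugging in a smaller ``fused_batch_size`` for ``batch`` shows directly why batch splitting shrinks the joint tensor.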
- -Effect of Fused Batch Step -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The fundamental problem is that the joint tensor grows in size when ``[T x U]`` grows in size. This growth in memory cost is due to many reasons - either by model construction (downsampling) or the choice of dataset preprocessing (character tokenization vs. sub-word tokenization). - -Another dimension that NeMo can control is **batch**. Due to how we batch our samples, small and large samples all get clumped together into a single batch. So even though the individual samples are not all as long as the maximum length of T and U in that batch, when a batch of such samples is constructed, it will consume a significant amount of memory for the sake of compute efficiency. - -So, as is always the case, we **trade off compute speed for memory savings**. - -The fused operation goes as follows: - -1) Forward the entire acoustic model in a single pass. (Use global batch size here for acoustic model - found in ``model.*_ds.batch_size``) - -2) Split the Acoustic Model's logits by ``fused_batch_size`` and loop over these sub-batches. - -3) Construct a sub-batch of the same ``fused_batch_size`` for the Prediction model. Now the target sequence length is :math:`U_{sub-batch} < U`. - -4) Feed this :math:`U_{sub-batch}` into the Joint model, along with a sub-batch from the Acoustic model (with :math:`T_{sub-batch} < T`). Remember, we only have to slice off a part of the acoustic model here since we have the full batch of samples :math:`(B, T, D)` from the acoustic model. - -5) Performing steps (3) and (4) yields :math:`T_{sub-batch}` and :math:`U_{sub-batch}`. Perform sub-batch joint step - costing an intermediate :math:`(B, T_{sub-batch}, U_{sub-batch}, V)` in memory. - -6) Compute loss on sub-batch and preserve in a list to be later concatenated. - -7) Compute sub-batch metrics (such as Character / Word Error Rate) using the above Joint tensor and sub-batch of ground truth labels. 
Preserve the scores to be averaged across the entire batch later. - -8) Delete the sub-batch joint matrix :math:`(B, T_{sub-batch}, U_{sub-batch}, V)`. Only gradients from .backward() are preserved now in the computation graph. - -9) Repeat steps (3) - (8) until all sub-batches are consumed. - -10) Cleanup step. Compute full batch WER and log. Concatenate loss list and pass to PTL to compute the equivalent of the original (full batch) Joint step. Delete ancillary objects necessary for sub-batching. - -Transducer Decoding -~~~~~~~~~~~~~~~~~~~ - -Models which have been trained with CTC can transcribe text simply by performing a regular argmax over the output of their decoder. For transducer-based models, the three networks must operate in a synchronized manner in order to transcribe the acoustic features. The base config for the Transducer decoding step can be found in the ``decoding`` section of Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. - -**This config can be copy-pasted into any custom transducer model with no modification.** - -The most important component at the top level is the ``strategy``. It can take one of many values: - -* ``greedy``: This is sample-level greedy decoding. It is generally exceptionally slow as each sample in the batch will be decoded independently. For publications, this should be used alongside batch size of 1 for exact results. - -* ``greedy_batch``: This is the general default and should nearly match the ``greedy`` decoding scores (if the acoustic features are not affected by feature mixing in batch mode). Even for small batch sizes, this strategy is significantly faster than ``greedy``. - -* ``beam``: Runs beam search with the implicit language model of the Prediction model. It will generally be quite slow, and might need some tuning of the beam size to get better transcriptions. - -* ``tsd``: Time synchronous decoding. 
Please refer to the paper: `Alignment-Length Synchronous Decoding for RNN Transducer `_ for details on the algorithm implemented. Time synchronous decoding (TSD) execution time grows by the factor T * max_symmetric_expansions. For longer sequences, T is greater and can therefore take a long time for beams to obtain good results. TSD also requires more memory to execute. - -* ``alsd``: Alignment-length synchronous decoding. Please refer to the paper: `Alignment-Length Synchronous Decoding for RNN Transducer `_ for details on the algorithm implemented. Alignment-length synchronous decoding (ALSD) execution time is faster than TSD, with a growth factor of T + U_max, where U_max is the maximum target length expected during execution. Generally, T + U_max < T * max_symmetric_expansions. However, ALSD beams are non-unique. Therefore, it is required to use larger beam sizes to achieve the same (or close to the same) decoding accuracy as TSD. For a given decoding accuracy, it is possible to attain faster decoding via ALSD than TSD. - -* ``maes``: Modified Adaptive Expansion Search Decoding. Please refer to the paper `Accelerating RNN Transducer Inference via Adaptive Expansion Search `_. Modified Adaptive Expansion Search (mAES) execution time is adaptive w.r.t. the number of expansions (for tokens) required per timestep. The number of expansions can usually be constrained to 1 or 2, and in most cases 2 is sufficient. This beam search technique can possibly obtain superior WER while sacrificing some evaluation time. - -.. code-block:: yaml - - decoding: - strategy: "greedy_batch" - - # preserve decoding alignments - preserve_alignments: false - - # Overrides the fused batch size after training. 
- # Setting it to -1 will process the whole batch at once when combined with `greedy_batch` decoding strategy - fused_batch_size: -1 - - # greedy strategy config - greedy: - max_symbols: 10 - - # beam strategy config - beam: - beam_size: 2 - score_norm: true - softmax_temperature: 1.0 # scale the logits by some temperature prior to softmax - tsd_max_sym_exp: 10 # for Time Synchronous Decoding, int > 0 - alsd_max_target_len: 5.0 # for Alignment-Length Synchronous Decoding, float > 1.0 - maes_num_steps: 2 # for modified Adaptive Expansion Search, int > 0 - maes_prefix_alpha: 1 # for modified Adaptive Expansion Search, int > 0 - maes_expansion_beta: 2 # for modified Adaptive Expansion Search, int >= 0 - maes_expansion_gamma: 2.3 # for modified Adaptive Expansion Search, float >= 0 - -Transducer Loss -~~~~~~~~~~~~~~~ - -This section configures the type of Transducer loss itself, along with possible sub-sections. By default, an optimized implementation of Transducer loss will be used which depends on Numba for CUDA acceleration. The base config for the Transducer loss section can be found in the ``loss`` section of Transducer architectures. For further information refer to the ``Intro to Transducers`` tutorial in the ASR tutorial section. - -**This config can be copy-pasted into any custom transducer model with no modification.** - -The loss config is based on a resolver pattern and can be used as follows: - -1) ``loss_name``: ``default`` is generally a good option. It will select one of the available resolved losses and match the kwargs from the sub-config passed via the explicit ``{loss_name}_kwargs`` sub-config. - -2) ``{loss_name}_kwargs``: This sub-config is passed to the resolved loss above and can be used to configure the resolved loss. - - -.. 
code-block:: yaml - - loss: - loss_name: "default" - warprnnt_numba_kwargs: - fastemit_lambda: 0.0 - -FastEmit Regularization -^^^^^^^^^^^^^^^^^^^^^^^ - -FastEmit Regularization is supported for the default Numba-based WarpRNNT loss. This recently proposed regularization approach, `FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization `_, allows near-direct control over the latency of transducer models. - -Refer to the above paper for results and recommendations of ``fastemit_lambda``. - - -.. _Hybrid-Transducer-CTC-Prompt_model__Config: - -Hybrid-Transducer-CTC with Prompt Conditioning Configuration ------------------------------------------------------------- - -The :ref:`Hybrid-Transducer-CTC model with prompt conditioning ` -(``EncDecHybridRNNTCTCBPEModelWithPrompt``) extends the base hybrid model to support prompt-based multilingual ASR/AST. - -**Key Configuration Parameters:** - -The model introduces several prompt-specific configuration parameters in the ``model_defaults`` section: - -.. code-block:: yaml - - model: - model_defaults: - # Prompt Feature Configuration - initialize_prompt_feature: true # Enable prompt conditioning - num_prompts: 128 # Number of supported prompt categories - prompt_dictionary: { # Mapping from identifiers to prompt indices - # Language prompts (0-99) - 'en-US': 0, - 'de-DE': 1, - 'fr-FR': 2, - 'es-ES': 3, - # Task/domain prompts (100-127) - 'pnc': 100, # Punctuation mode - 'no_pnc': 101, # No punctuation mode - } - -**Dataset Configuration:** - -The model requires training data with prompt annotations when using Lhotse datasets: - -.. 
code-block:: yaml - - model: - train_ds: - use_lhotse: true - initialize_prompt_feature: true - prompt_field: "target_lang" # Field name for prompt extraction - prompt_dictionary: ${model.model_defaults.prompt_dictionary} - num_prompts: ${model.model_defaults.num_prompts} - - validation_ds: - use_lhotse: true - initialize_prompt_feature: true - prompt_field: "target_lang" - prompt_dictionary: ${model.model_defaults.prompt_dictionary} - num_prompts: ${model.model_defaults.num_prompts} - -**Manifest Format:** - -Training manifests should include prompt information: - -.. code-block:: json - - { - "audio_filepath": "/path/to/audio.wav", - "text": "transcription text", - "duration": 10.5, - "target_lang": "en-US" - } - -**Example Configuration:** - -A complete example configuration can be found at: -``/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml`` - -**Training Command:** - -.. code-block:: bash - - python /examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py \ - --config-path=/examples/asr/conf/fastconformer/hybrid_transducer_ctc/ \ - --config-name=fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml \ - model.train_ds.manifest_filepath= \ - model.validation_ds.manifest_filepath= \ - model.tokenizer.dir= \ - model.test_ds.manifest_filepath= - -Fine-tuning Configurations --------------------------- - -All ASR scripts support easy fine-tuning by partially/fully loading the pretrained weights from a checkpoint into the **currently instantiated model**. Note that the currently instantiated model should have parameters that match the pre-trained checkpoint (such that weights may load properly). In order to directly fine-tune a pre-existing checkpoint, please follow the tutorial `ASR Language Fine-tuning. 
`_ - -Models can be fine-tuned in two ways: -* By updating or retaining the current tokenizer alone -* By updating the model architecture and tokenizer - -Fine-tuning by updating or retaining current tokenizer -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -In this case, the model architecture is not updated. The model is initialized with the pre-trained weights in -two ways: - -1) Providing a path to a NeMo model (via ``init_from_nemo_model``) -2) Providing a name of a pretrained NeMo model (which will be downloaded via the cloud) (via ``init_from_pretrained_model``) - -Then users can use the existing tokenizer or update the tokenizer with new vocabulary. This is useful when users don't want to update the model architecture -but want to update the tokenizer with new vocabulary. - -The same script can be used to finetune CTC, RNNT, or Hybrid models as well. - -The /examples/asr/speech_to_text_finetune.py script supports this type of fine-tuning with the following arguments: - -.. code-block:: sh - - python examples/asr/speech_to_text_finetune.py \ - --config-path= \ - --config-name=) \ - model.train_ds.manifest_filepath="" \ - model.validation_ds.manifest_filepath="" \ - model.tokenizer.update_tokenizer= \ # True to update tokenizer, False to retain existing tokenizer - model.tokenizer.dir= \ # Path to tokenizer dir when update_tokenizer=True - model.tokenizer.type= \ # tokenizer type when update_tokenizer=True - trainer.devices=-1 \ - trainer.accelerator='gpu' \ - trainer.max_epochs=50 \ - +init_from_nemo_model="" (or +init_from_pretrained_model="") - - -Refer to /examples/asr/conf/asr_finetune/speech_to_text_finetune.yaml for more details. - -Finetune ASR Models using HuggingFace Datasets -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Users can utilize HuggingFace Datasets for finetuning NeMo ASR models. 
The following config file can be used for this purpose: -`/examples/asr/conf/asr_finetune/speech_to_text_hf_finetune.yaml` - -As mentioned earlier, users can update the tokenizer or use an existing one based on their requirements. If users want to create a new tokenizer -from HuggingFace Datasets, they can use the following script: -`/scripts/tokenizers/get_hf_text_data.py` - -Fine-tuning by changing model architecture and tokenizer -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -If users want to update the model architecture as well, they can use the task-specific training scripts, as shown below. - -Pre-trained weights can be provided in multiple ways: - -1) Providing a path to a NeMo model (via ``init_from_nemo_model``) -2) Providing a name of a pretrained NeMo model (which will be downloaded via the cloud) (via ``init_from_pretrained_model``) -3) Providing a path to a Pytorch Lightning checkpoint file (via ``init_from_ptl_ckpt``) - -There are multiple ASR subtasks inside the ``examples/asr/`` directory; you can substitute the ```` tag below. - -.. code-block:: sh - - python examples/asr//script_to_.py \ - --config-path= \ - --config-name=) \ - model.train_ds.manifest_filepath="" \ - model.validation_ds.manifest_filepath="" \ - trainer.devices=-1 \ - trainer.accelerator='gpu' \ - trainer.max_epochs=50 \ - +init_from_nemo_model="" # (or +init_from_pretrained_model, +init_from_ptl_ckpt ) - -To reinitialize part of the model, making it different from the pretrained model, users can specify the modules to include or exclude through the config: - -.. code-block:: yaml - - init_from_nemo_model: "" - asr_model: - include: ["preprocessor","encoder"] - exclude: ["decoder"] - -Fine-tuning Execution Flow Diagram ----------------------------------- - -When preparing your own training or fine-tuning scripts, please follow the execution flow diagram order for correct inference. 
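The ``include``/``exclude`` lists shown above act as substring filters over parameter names when copying pretrained weights into the new model. A toy sketch of that filtering logic, with plain dicts standing in for PyTorch state dicts (the parameter names are hypothetical):

```python
def filter_state_dict(pretrained, include, exclude):
    """Keep weights whose names contain any `include` substring
    and none of the `exclude` substrings."""
    return {
        name: weight
        for name, weight in pretrained.items()
        if any(tag in name for tag in include)
        and not any(tag in name for tag in exclude)
    }

# Hypothetical parameter names, standing in for a real checkpoint's state dict
pretrained = {
    "preprocessor.featurizer.window": 1,
    "encoder.layers.0.self_attn.weight": 2,
    "decoder.decoder_layers.0.weight": 3,
}
kept = filter_state_dict(pretrained, include=["preprocessor", "encoder"], exclude=["decoder"])
print(sorted(kept))  # ['encoder.layers.0.self_attn.weight', 'preprocessor.featurizer.window']
```

The actual weight copying is handled inside NeMo's checkpoint-restoration utilities; this sketch only illustrates how the include/exclude selection behaves.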
- -Depending on the type of model, there may be extra steps that must be performed - -* CTC Models - `Examples directory for CTC Models `_ -* RNN Transducer Models - `Examples directory for Transducer Models `_ + pred_rnn_layers: 1 + + joint: + _target_: nemo.collections.asr.modules.RNNTJoint + log_softmax: null + fuse_loss_wer: false + fused_batch_size: 16 + jointnet: + joint_hidden: ${model.model_defaults.joint_hidden} + activation: "relu" + + decoding: + strategy: "greedy_batch" # greedy, greedy_batch, beam, tsd, alsd, maes + greedy: + max_symbols: 10 + beam: + beam_size: 2 + score_norm: true + + loss: + loss_name: "default" + warprnnt_numba_kwargs: + fastemit_lambda: 0.0 + +For large vocabularies, use ``SampledRNNTJoint`` with ``n_samples`` to reduce memory. +`FastEmit `_ regularization controls transducer latency via ``fastemit_lambda``. + +For decoding customization (confidence scores, CUDA graphs, language models, word boosting), see :doc:`ASR Language Modeling and Customization <./asr_language_modeling_and_customization>`. diff --git a/docs/source/asr/datasets.rst b/docs/source/asr/datasets.rst index edf4205fd78a..4d4783cd4af6 100644 --- a/docs/source/asr/datasets.rst +++ b/docs/source/asr/datasets.rst @@ -1,1222 +1,59 @@ Datasets ======== -NeMo has scripts to convert several common ASR datasets into the format expected by the ``nemo_asr`` collection. You can get started -with those datasets by following the instructions to run those scripts in the section appropriate to each dataset below. - -If the user has their own data and wants to preprocess it for use with NeMo ASR models, refer to the `Preparing Custom ASR Data`_ section. - -If the user already has a dataset that they want to convert to a tarred format, refer to the :ref:`Tarred Datasets ` section. - -.. _LibriSpeech_dataset: - -LibriSpeech ----------- - -Run the following scripts to download the LibriSpeech data and convert it into the format expected by `nemo_asr`. 
At least 250GB free -space is required. - -.. code-block:: bash - - # install sox - sudo apt-get install sox - mkdir data - python get_librispeech_data.py --data_root=data --data_set=ALL - -After this, the ``data`` folder should contain wav files and ``.json`` manifests for the NeMo ASR data layer. - -Each line is a training example. ``audio_filepath`` contains the path to the wav file, ``duration`` is the duration in seconds, and ``text`` is the transcript: - -.. code-block:: json - - {"audio_filepath": "/1355-39947-0000.wav", "duration": 11.3, "text": "psychotherapy and the community both the physician and the patient find their place in the community the life interests of which are superior to the interests of the individual"} - {"audio_filepath": "/1355-39947-0001.wav", "duration": 15.905, "text": "it is an unavoidable question how far from the higher point of view of the social mind the psychotherapeutic efforts should be encouraged or suppressed are there any conditions which suggest suspicion of or direct opposition to such curative work"} - -Fisher English Training Speech ------------------------------- - -Run these scripts to convert the Fisher English Training Speech data into a format expected by the ``nemo_asr`` collection. - -In brief, the following scripts convert the ``.sph`` files to ``.wav``, slice those files into smaller audio samples, match the -smaller slices with their corresponding transcripts, and split the resulting audio segments into train, validation, and test sets -(with one manifest each). - -.. note:: - - 106 GB of space is required to run the ``.wav`` conversion - - an additional 105 GB is required for the slicing and matching - - ``sph2pipe`` is required in order to run the ``.wav`` conversion - -**Instructions** - -The following scripts assume that you already have the Fisher dataset from the Linguistic Data Consortium, with a directory structure -that looks similar to the following: - -.. 
code-block:: bash - - FisherEnglishTrainingSpeech/ - ├── LDC2004S13-Part1 - │   ├── fe_03_p1_transcripts - │   ├── fisher_eng_tr_sp_d1 - │   ├── fisher_eng_tr_sp_d2 - │   ├── fisher_eng_tr_sp_d3 - │   └── ... - └── LDC2005S13-Part2 - ├── fe_03_p2_transcripts - ├── fe_03_p2_sph1 - ├── fe_03_p2_sph2 - ├── fe_03_p2_sph3 - └── ... - -The transcripts that will be used are located in the ``fe_03_p<1,2>_transcripts/data/trans`` directory. The audio files (``.sph``) -are located in the remaining directories in an ``audio`` subdirectory. - -#. Convert the audio files from ``.sph`` to ``.wav`` by running: - - .. code-block:: bash - - cd /scripts/dataset_processing - python fisher_audio_to_wav.py \ - --data_root= --dest_root= - - This will place the unsliced ``.wav`` files in ``/LDC200[4,5]S13-Part[1,2]/audio-wav/``. It will take several - minutes to run. - -#. Process the transcripts and slice the audio data. - - .. code-block:: bash - - python process_fisher_data.py \ - --audio_root= --transcript_root= \ - --dest_root= \ - --remove_noises - - This script splits the full dataset into train, validation, test sets, and places the audio slices in the corresponding folders - in the destination directory. One manifest is written out per set, which includes each slice's transcript, duration, and path. - - This will likely take around 20 minutes to run. Once finished, delete the 10 minute long ``.wav`` files. - -2000 HUB5 English Evaluation Speech ------------------------------------ - -Run the following script to convert the HUB5 data into a format expected by the ``nemo_asr`` collection. - -Similarly, to the Fisher dataset processing scripts, this script converts the ``.sph`` files to ``.wav``, slices the audio files and -transcripts into utterances, and combines them into segments of some minimum length (default is 10 seconds). The resulting segments -are all written out to an audio directory and the corresponding transcripts are written to a manifest JSON file. - -.. 
note:: - - 5 GB of free space is required to run this script - - ``sph2pipe`` is also required to be installed - -This script assumes you already have the 2000 HUB5 dataset from the Linguistic Data Consortium. - -Run the following command to process the 2000 HUB5 English Evaluation Speech samples: - -.. code-block:: bash - - python process_hub5_data.py \ - --data_root= \ - --dest_root= - -You can optionally include ``--min_slice_duration=`` if you would like to change the minimum audio segment duration. - -AN4 Dataset ------------ - -This is a small dataset recorded and distributed by Carnegie Mellon University. It consists of recordings of people spelling out -addresses, names, etc. - -#. `Download and extract the dataset `_ (which is labeled "NIST's Sphere audio (.sph) format (64M)". - -#. Convert the ``.sph`` files to ``.wav`` using sox, and build one training and one test manifest. - - .. code-block:: bash - - python process_an4_data.py --data_root= - -After the script finishes, the ``train_manifest.json`` and ``test_manifest.json`` can be found in the ``/an4/`` directory. - -Aishell-1 ---------- - -To download the Aishell-1 data and convert it into a format expected by ``nemo_asr``, run: - -.. code-block:: bash - - # install sox - sudo apt-get install sox - mkdir data - python get_aishell_data.py --data_root=data - -After the script finishes, the ``data`` folder should contain a ``data_aishell`` folder which contains a wav file, a transcript folder, and related ``.json`` and ``vocab.txt`` files. - -Aishell-2 ---------- - -To process the AIShell-2 dataset, in the command below, set the data folder of AIShell-2 using ``--audio_folder`` and where to push -these files using ``--dest_folder``. In order to generate files in the supported format of ``nemo_asr``, run: - -.. 
code-block:: bash - - python process_aishell2_data.py --audio_folder= --dest_folder= - -After the script finishes, the ``train.json``, ``dev.json``, ``test.json``, and ``vocab.txt`` files can be found in the ``dest_folder`` directory. +NeMo ASR models expect data as a set of audio files plus a manifest file describing each utterance. .. _section-with-manifest-format-explanation: -Preparing Custom ASR Data -------------------------- - -The ``nemo_asr`` collection expects each dataset to consist of a set of utterances in individual audio files plus -a manifest that describes the dataset, with information about one utterance per line (``.json``). -The audio files can be of any format supported by `Pydub `_, though we recommend -WAV files as they are the default and have been most thoroughly tested. - -There should be one manifest file per dataset that will be passed in, therefore, if the user wants separate training and validation -datasets, they should also have separate manifests. Otherwise, they will be loading validation data with their training data and vice -versa. +Manifest Format +--------------- -Each line of the manifest should be in the following format: +Each line of the manifest is a JSON object: .. code-block:: json {"audio_filepath": "/path/to/audio.wav", "text": "the transcription of the utterance", "duration": 23.147} -The :code:`audio_filepath` field should provide an absolute path to the ``.wav`` file corresponding to the utterance. -The :code:`text` field should contain the full transcript for the utterance, and the :code:`duration` field should -reflect the duration of the utterance in seconds. +* ``audio_filepath`` — absolute or relative path to the audio file (WAV recommended) +* ``text`` — the transcript +* ``duration`` — duration in seconds -Each entry in the manifest (describing one audio file) should be bordered by '{' and '}' and must -be contained on one line. 
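+As a quick sanity check, a manifest can be produced with a few lines of Python. This is a minimal sketch; the file paths, transcripts, and durations below are placeholders:
+
+```python
+import json
+
+# Hypothetical utterances: (audio path, transcript, duration in seconds).
+utterances = [
+    ("/data/audio/utt1.wav", "the first transcription", 2.5),
+    ("/data/audio/utt2.wav", "the second transcription", 3.1),
+]
+
+# One JSON object per line -- no enclosing list, no trailing commas.
+with open("train_manifest.json", "w") as f:
+    for path, text, duration in utterances:
+        entry = {"audio_filepath": path, "text": text, "duration": duration}
+        f.write(json.dumps(entry) + "\n")
+```
+
+Each line must be a complete JSON object on its own; the file as a whole is not a JSON array.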
The fields that describe the file should be separated by commas, and have the form :code:`"field_name": value`, -as shown above. There should be no extra lines in the manifest, i.e. there should be exactly as many lines in the manifest as -there are audio files in the dataset. - -Since the manifest specifies the path for each utterance, the audio files do not have to be located -in the same directory as the manifest, or even in any specific directory structure. - -Once there is a manifest that describes each audio file in the dataset, use the dataset by passing -in the manifest file path in the experiment config file, e.g. as ``training_ds.manifest_filepath=``. +There should be one manifest per dataset split (train, validation, test). Pass it via ``training_ds.manifest_filepath=``. .. _Tarred_Datasets: Tarred Datasets --------------- -If experiments are run on a cluster with datasets stored on a distributed file system, the user will likely -want to avoid constantly reading multiple small files and would prefer tarring their audio files. -There are tarred versions of some NeMo ASR dataset classes for this case, such as the ``TarredAudioToCharDataset`` -(corresponding to the ``AudioToCharDataset``) and the ``TarredAudioToBPEDataset`` (corresponding to the -``AudioToBPEDataset``). The tarred audio dataset classes in NeMo use `WebDataset `_. - -To use an existing tarred dataset instead of a non-tarred dataset, set ``is_tarred: true`` in -the experiment config file. Then, pass in the paths to all of the audio tarballs in ``tarred_audio_filepaths``, either as a list -of filepaths, e.g. ``['/data/shard1.tar', '/data/shard2.tar']``, or in a single brace-expandable string, e.g. -``'/data/shard_{1..64}.tar'`` or ``'/data/shard__OP_1..64_CL_'`` (recommended, see note below). - -.. note:: - For brace expansion, there may be cases where ``{x..y}`` syntax cannot be used due to shell interference. This occurs most commonly - inside SLURM scripts. 
Therefore, we provide a few equivalent replacements. Supported opening braces (equivalent to ``{``) are ``(``, - ``[``, ``<`` and the special tag ``_OP_``. Supported closing braces (equivalent to ``}``) are ``)``, ``]``, ``>`` and the special - tag ``_CL_``. For SLURM based tasks, we suggest the use of the special tags for ease of use. - -As with non-tarred datasets, the manifest file should be passed in ``manifest_filepath``. The dataloader assumes that the length -of the manifest after filtering is the correct size of the dataset for reporting training progress. - -The ``tarred_shard_strategy`` field of the config file can be set if you have multiple shards and are running an experiment with -multiple workers. It defaults to ``scatter``, which preallocates a set of shards per worker which do not change during runtime. -Note that this strategy, on specific occasions (when the number of shards is not divisible with ``world_size``), will not sample -the entire dataset. As an alternative the ``replicate`` strategy, will preallocate the entire set of shards to every worker and not -change it during runtime. The benefit of this strategy is that it allows each worker to sample data points from the entire dataset -independently of others. Note, though, that more than one worker may sample the same shard, and even sample the same data points! -As such, there is no assured guarantee that all samples in the dataset will be sampled at least once during 1 epoch. Note that -for these reasons it is not advisable to use tarred datasets as validation and test datasets. - -For more information about the individual tarred datasets and the parameters available, including shuffling options, -see the corresponding class APIs in the :ref:`Datasets ` section. - -.. warning:: - If using multiple workers, the number of shards should be divisible by the world size to ensure an even - split among workers. 
If it is not divisible, logging will give a warning but training will proceed, but likely hang at the last epoch. - In addition, if using distributed processing, each shard must have the same number of entries after filtering is - applied such that each worker ends up with the same number of files. We currently do not check for this in any dataloader, but the user's - program may hang if the shards are uneven. - -Sharded Manifests -~~~~~~~~~~~~~~~~~ -If your dataset / manifest is large, you may wish to use sharded manifest files instead of a single manifest file. The naming convention -is identical to the audio tarballs and there should be a 1:1 relationship between a sharded audio tarfile and its manifest shard; e.g. -``'/data/sharded_manifests/manifest__OP_1..64_CL_'`` in the above example. Using sharded manifests improves job startup times and -decreases memory usage, as each worker only loads manifest shards for the corresponding audio shards instead of the entire manifest. - -To enable sharded manifest filename expansion, set the ``shard_manifests`` field of the config file to true. In addition, the -``defer_setup`` flag needs to be true as well, so that the dataloader will be initialized after the DDP and its length can be collected from -the distributed workers. - -Batching strategies ---------------------- - -For training ASR models, audios with different lengths may be grouped into a batch. It would make it necessary to use paddings to make all the same length. -These extra paddings is a significant source of computation waste. - -Semi Sorted Batching ---------------------- - -Sorting samples by duration and spliting them into batches speeds up training, but can degrade the quality of the model. To avoid quality degradation and maintain some randomness in the partitioning process, we add pseudo noise to the sample length when sorting. - -It may result into training speedup of more than 40 percent with the same quality. 
To enable and use semi sorted batching add some lines in config. +For cluster training with distributed file systems, tar your audio files to avoid reading many small files. +Use ``is_tarred: true`` in the config and provide tarball paths via ``tarred_audio_filepaths``. - .. code:: +NeMo uses `WebDataset `_ for tarred data. - ++model.train_ds.use_semi_sorted_batching=true - ++model.train_ds.randomization_factor=0.1 +**Convert to tarred format:** -Semi sorted batching is supported by the following models: - - .. code:: - - nemo.collections.asr.models.EncDecCTCModel - nemo.collections.asr.models.EncDecCTCModelBPE - nemo.collections.asr.models.EncDecRNNTModel - nemo.collections.asr.models.EncDecRNNTBPEModel - nemo.collections.asr.models.EncDecHybridRNNTCTCModel - nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModel - -For more details about this algorithm, see the `paper `_ . - -.. _Bucketing_Datasets: - -Bucketing Datasets ---------------------- - -Splitting the training samples into buckets with different lengths and sampling from the same bucket for each batch would increase the computation efficiency. -It may result into training speedup of more than 2X. To enable and use the bucketing feature, you need to create the bucketing version of the dataset by using `conversion script here `_. -You may use --buckets_num to specify the number of buckets (Recommend to use 4 to 8 buckets). It creates multiple tarred datasets, one per bucket, based on the audio durations. The range of [min_duration, max_duration) is split into equal sized buckets. - -To enable the bucketing feature in the dataset section of the config files, you need to pass the multiple tarred datasets as a list of lists. -If user passes just a list of strings, then the datasets would simply get concatenated which would be different from bucketing. -Here is an example for 4 buckets and 512 shards: - -.. code:: - - python speech_to_text_bpe.py - ... 
- model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] - model.train_ds.tarred_audio_filepaths=[[PATH_TO_TARS/bucket1/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket2/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket3/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket4/audio__OP_0..511_CL_.tar]] - -When bucketing is enabled, in each epoch, first all GPUs would use the first bucket, then go to the second bucket, and so on. It guarantees that all GPUs are using the same bucket at the same time. It reduces the number of paddings in each batch and speedup the training significantly without hurting the accuracy significantly. - -There are two types of batching: - -* Fixed-size bucketing: all batches would have the same number of samples specified by train_ds.batch_size -* Adaptive-size bucketing: uses different batch sizes for each bucket. - -Adaptive-size bucketing helps to increase the GPU utilization and speedup the training. -Batches sampled from buckets with smaller audio lengths can be larger which would increase the GPU utilization and speedup the training. -You may use train_ds.bucketing_batch_size to enable the adaptive batching and specify the batch sizes for the buckets. -When bucketing_batch_size is not set, train_ds.batch_size is going to be used for all buckets (fixed-size bucketing). - -bucketing_batch_size can be set as an integer or a list of integers to explicitly specify the batch size for each bucket. -if bucketing_batch_size is set to be an integer, then linear scaling is being used to scale-up the batch sizes for batches with shorted audio size. For example, setting train_ds.bucketing_batch_size=8 for 4 buckets would use these sizes [32,24,16,8] for different buckets. -When bucketing_batch_size is set, train_ds.batch_size need to be set to 1. 
- -Training an ASR model on audios sorted based on length may affect the accuracy of the model. We introduced some strategies to mitigate it. -We support three types of bucketing strategies: - -* fixed_order: the same order of buckets are used for all epochs -* synced_randomized (default): each epoch would have a different order of buckets. Order of the buckets is shuffled every epoch. -* fully_randomized: similar to synced_randomized but each GPU has its own random order. So GPUs would not be synced. - -The parameter train_ds.bucketing_strategy can be set to specify one of these strategies. The recommended strategy is synced_randomized which gives the highest training speedup. -The fully_randomized strategy would have lower speedup than synced_randomized but may give better accuracy. - -Bucketing may improve the training speed more than 2x but may affect the final accuracy of the model slightly. Training for more epochs and using 'synced_randomized' strategy help to fill this gap. -Currently bucketing feature is just supported for tarred datasets. - - -Conversion to Tarred Datasets -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -You can easily convert your existing NeMo-compatible ASR datasets using the -`conversion script here `_. - -.. code:: bash +.. code-block:: bash - python convert_to_tarred_audio_dataset.py \ - --manifest_path= \ - --target_dir= \ - --num_shards= - --max_duration= \ - --min_duration= \ + python scripts/speech_recognition/convert_to_tarred_audio_dataset.py \ + --manifest_path= \ + --target_dir= \ + --num_shards=64 \ --shuffle --shuffle_seed=0 -This script shuffles the entries in the given manifest (if ``--shuffle`` is set, which we recommend), filter -audio files according to ``min_duration`` and ``max_duration``, and tar the remaining audio files to the directory -``--target_dir`` in ``n`` shards, along with separate manifest and metadata files. - -The files in the target directory should look similar to the following: - -.. 
code:: - - target_dir/ - ├── audio_1.tar - ├── audio_2.tar - ├── ... - ├── metadata.yaml - ├── tarred_audio_manifest.json - ├── sharded_manifests/ - ├── manifest_1.json - ├── ... - └── manifest_N.json - - -Note that file structures are flattened such that all audio files are at the top level in each tarball. This ensures that -filenames are unique in the tarred dataset and the filepaths do not contain "-sub" and forward slashes in each ``audio_filepath`` are -simply converted to underscores. For example, a manifest entry for ``/data/directory1/file.wav`` would be ``_data_directory1_file.wav`` -in the tarred dataset manifest, and ``/data/directory2/file.wav`` would be converted to ``_data_directory2_file.wav``. - -Sharded manifests are generated by default; this behavior can be toggled via the ``no_shard_manifests`` flag. - -Upsampling Datasets -------------------- - -Buckets may also be 'weighted' to allow multiple runs through a target dataset during each training epoch. This can be beneficial in cases when a dataset is composed of several component sets of unequal sizes and one desires to mitigate bias towards the larger sets through oversampling. - -Weighting is managed with the `bucketing_weights` parameter. After passing your composite tarred datasets in the format described above for bucketing, pass a list of integers (one per bucket) to indicate how many times a manifest should be read during training. - -For example, by passing `[2,1,1,3]` to the code below: - -.. code:: - - python speech_to_text_bpe.py - ... 
- model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] - model.train_ds.tarred_audio_filepaths=[[PATH_TO_TARS/bucket1/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket2/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket3/audio__OP_0..511_CL_.tar], - [PATH_TO_TARS/bucket4/audio__OP_0..511_CL_.tar]] - ... - model.train_ds.bucketing_weights=[2,1,1,3] - -NeMo will configure training so that all data in `bucket1` will be present twice in a training epoch, `bucket4` will be present three times, and that of `bucket2` and `bucket3` will occur only once each. Note that this will increase the effective amount of data present during training and thus affect training time per epoch. - -If using adaptive bucketing, note that the same batch size will be assigned to each instance of the upsampled data. That is, given the following: - -.. code:: - - python speech_to_text_bpe.py - ... - model.train_ds.manifest_filepath=[[PATH_TO_TARS/bucket1/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket2/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket3/tarred_audio_manifest.json], - [PATH_TO_TARS/bucket4/tarred_audio_manifest.json]] - ... - ... - model.train_ds.bucketing_weights=[2,1,1,3] - model.train_ds.bucketing_batch_size=[4,4,4,2] - -All instances of data from `bucket4` will still be trained with a batch size of 2 while all others would have a batch size of 4. As with standard bucketing, this requires `batch_size`` to be set to 1. -If `bucketing_batch_size` is not specified, all datasets will be passed with the same fixed batch size as specified by the `batch_size` parameter. - -It is recommended to set bucketing strategies to `fully_randomized` during multi-GPU training to prevent possible dataset bias during training. 
- - -Datasets on AIStore -------------------- - -`AIStore `_ is an open-source lightweight object storage system focused on large-scale deep learning. -AIStore is aimed to scale linearly with each added storage node, can be deployed on any Linux machine and can provide a unified namespace across multiple remote backends, such as Amazon S3, Google Cloud, and Microsoft Azure. -More details are provided in the `documentation `_ and the `repository `_ of the AIStore project. - -NeMo currently supports datasets from an AIStore bucket provider under ``ais://`` namespace. - -AIStore Setup -~~~~~~~~~~~~~ - -NeMo is currently relying on the AIStore (AIS) command-line interface (CLI) to handle the supported datasets. -The CLI is available in current NeMo Docker containers. -If necessary, the CLI can be configured using the instructions provided in `AIStore CLI `_ documentation. - -To start using the AIS CLI to access data on an AIS cluster, an endpoint needs to be configured. -The endpoint is configured by setting ``AIS_ENDPOINT`` environment variable before using the CLI - -.. code:: - - export AIS_ENDPOINT=http://hostname:port - ais --help - -In the above, ``hostname:port`` denotes the address of an AIS gateway. -For example, the address could be ``localhost:51080`` if testing using a local `minimal production-ready standalone Docker container `_. - -Dataset Setup -~~~~~~~~~~~~~ - -Currently, both tarred and non-tarred datasets are supported. -For any dataset, the corresponding manifest file is cached locally and processed as a regular manifest file. -For non-tarred datasets, the audio data is also cached locally. -For tarred datasets, shards from the AIS cluster are used by piping ``ais get`` to WebDataset. - -Tarred Dataset from AIS -^^^^^^^^^^^^^^^^^^^^^^^ - -A tarred dataset can be easily used as described in the :ref:`Tarred Datasets ` section by providing paths to manifests on an AIS cluster. 
-For example, a tarred dataset from an AIS cluster can be configured as - -.. code:: - - manifest_filepath='ais://bucket/tarred_audio_manifest.json' - tarred_audio_filepaths='ais://bucket/shard_{1..64}.tar' - -:ref:`Bucketing Datasets ` are configured in a similar way by providing paths on an AIS cluster. - -Non-tarred Dataset from AIS -^^^^^^^^^^^^^^^^^^^^^^^^^^^ - -A non-tarred dataset can be easly used by providing a manifest file path on an AIS cluster - -.. code:: - - manifest_filepath='ais://bucket/dataset_manifest.json' - -Note that it is assumed that the manifest file path contains audio file paths relative to the manifest locations. -For example the manifest file may have lines in the following format - -.. code-block:: json - - {"audio_filepath": "path/to/audio.wav", "text": "transcription of the uterance", "duration": 23.147} - -The corresponding audio file would be downloaded from ``ais://bucket/path/to/audio.wav``. - -Cache configuration -^^^^^^^^^^^^^^^^^^^ - -Manifests and audio files from non-tarred datasets will be cached locally. -Location of the cache can be configured by setting two environment variables - -- ``NEMO_DATA_STORE_CACHE_DIR``: path to a location which can be used to cache the data -- ``NEMO_DATA_STORE_CACHE_SHARED``: flag to denote whether the cache location is shared between the compute nodes - -In a multi-node environment, the cache location may or may be not shared between the nodes. -This can be configured by setting ``NEMO_DATA_STORE_CACHE_SHARED`` to ``1`` when the location is shared between the nodes or to ``0`` when each node has a separate cache. - -When a globally shared cache is available, the data should be cached only once from the global rank zero node. -When a node-specific cache is used, the data should be cached only once by each local rank zero node. 
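+After conversion, a training run can point at the resulting shards. Below is a sketch of the relevant overrides, assuming the 64-shard layout produced above; the paths and the training script name are placeholders, and ``_OP_``/``_CL_`` are NeMo's shell-safe equivalents of ``{``/``}`` brace expansion (useful inside SLURM scripts):
+
+```shell
+# Illustrative Hydra overrides for an existing NeMo training script;
+# adjust paths, script name, and shard count to your setup.
+python speech_to_text_bpe.py \
+    model.train_ds.is_tarred=true \
+    model.train_ds.tarred_audio_filepaths='/data/tarred/audio__OP_0..63_CL_.tar' \
+    model.train_ds.manifest_filepath='/data/tarred/tarred_audio_manifest.json'
+```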
-To control this behavior using `torch.distributed.barrier`, instantiation of the corresponding dataloader needs to be deferred ``ModelPT::setup``, to ensure a distributed environment has been initialized.
-This can be achieved by setting ``defer_setup`` as
-
-.. code:: shell
-
-    ++model.train_ds.defer_setup=true
-    ++model.validation_ds.defer_setup=true
-    ++model.test_ds.defer_setup=true
-
-
-Complete Example
-^^^^^^^^^^^^^^^^
+.. _Bucketing_Datasets:

-An example using an AIS cluster at ``hostname:port`` with a tarred dataset for training, a non-tarred dataset for validation and node-specific caching is given below
+Bucketing
+---------

-.. code:: shell
+Splitting training samples into duration-based buckets and drawing each batch from a single bucket reduces padding and can speed up training by more than 2x.
+Pass tarred datasets as a list of lists (one inner list per bucket) to enable bucketing; bucketing is currently supported only for tarred datasets. Use ``bucketing_batch_size`` for adaptive batch sizes per bucket.

-    export AIS_ENDPOINT=http://hostname:port \
-    && export NEMO_DATA_STORE_CACHE_DIR=/tmp \
-    && export NEMO_DATA_STORE_CACHE_SHARED=0 \
-    python speech_to_text_bpe.py \
-    ...
-    model.train_ds.manifest_filepath=ais://train_bucket/tarred_audio_manifest.json \
-    model.train_ds.tarred_audio_filepaths=ais://train_bucket/audio__OP_0..511_CL_.tar \
-    ++model.train_ds.defer_setup=true \
-    model.validation_ds.manifest_filepath=ais://validation_bucket/validation_manifest.json \
-    ++model.validation_ds.defer_setup=true
+For advanced dynamic bucketing with Lhotse, see :doc:`Lhotse Dataloading `.

 Lhotse Dataloading
 ------------------

-NeMo supports using `Lhotse`_, a speech data handling library, as a dataloading option. The key features of Lhotse used in NeMo are:
-
-* Dynamic batch sizes - Lhotse samples mini-batches to satisfy the constraint of total speech duration in a mini-batch (``batch_duration``),
-  rather than a specific number of examples (i.e., batch size).
-* Dynamic bucketing - Instead of statically pre-bucketing the data, Lhotse allocates training examples to buckets dynamically.
- This allows more rapid experimentation with bucketing settings (number of buckets, specific placement of bucket duration bins) - to minimize the amount of padding and accelerate training. -* Quadratic duration penalty - Adding a quadratic penalty to an utterance's duration allows to sample mini-batches so that the - GPU utilization is more consistent across big batches of short utterances and small batches of long utterances when using - models with quadratic time/memory complexity (such as transformer). -* Dynamic weighted data source multiplexing - An approach to combining diverse data sources (e.g. multiple domains, languages, tasks) - where each data source is treated as a separate stream with its own sampling probability. The resulting data stream is a - multiplexer that samples from each sub-stream. This approach ensures that the distribution of different sources is approximately - constant in time (i.e., stationary); in fact, each mini-batch will have roughly the same ratio of data coming from each source. - Since the multiplexing is done dynamically, it is very easy to tune the sampling weights. - -Lhotse dataloading supports the following types of inputs: - -* NeMo manifests - Regular NeMo JSON manifests. -* NeMo tarred data - Tarred NeMo JSON manifests + audio tar files; we also support combination of multiple NeMo - tarred data sources (e.g., multiple buckets of NeMo data or multiple datasets) via dynamic multiplexing. - - We support using a subset of Tarred NeMo JSON manifests along with audio tar files without disrupting the alignment between the tarred files and their corresponding manifests. - This feature is essential because large datasets often consist of numerous tar files and multiple versions of Tarred NeMo JSON manifest subsets, which may contain only a portion of the audio files due to filtering for various reasons. 
- To skip specific entries in the manifests without repeatedly copying and retarring audio files, the entries must include a ``_skipme`` key. This key should be set to ``True``, ``1``, or a reason for skipping (e.g., ``low character-rate``). - -* Lhotse CutSet manifests - Regular Lhotse CutSet manifests (typically gzipped JSONL). - See `Lhotse Cuts documentation`_ to learn more about Lhotse data formats. -* Lhotse Shar data - Lhotse Shar is a data format that also uses tar files for sequential data loading, - but is designed to be modular (i.e., easily extensible with new data sources and with new feature fields). - More details can be found here: |tutorial_shar| - -.. caution:: As of now, Lhotse is mainly supported in most ASR model configurations. We aim to gradually extend this support to other speech tasks. - -.. _Lhotse: https://github.com/lhotse-speech/lhotse -.. _Lhotse Cuts documentation: https://lhotse.readthedocs.io/en/latest/cuts.html -.. |tutorial_shar| image:: https://colab.research.google.com/assets/colab-badge.svg - :target: https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb - -Enabling Lhotse via configuration -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -.. note:: Using Lhotse with tarred datasets will make the dataloader infinite, ditching the notion of an "epoch". "Epoch" may still be logged in W&B/TensorBoard, but it will correspond to the number of executed training loops between validation loops. - -Start with an existing NeMo experiment YAML configuration. Typically, you'll only need to add a few options to enable Lhotse. -These options are:: - - # NeMo generic dataloading arguments - model.train_ds.manifest_filepath=... - model.train_ds.tarred_audio_filepaths=... 
# for tarred datasets only - model.train_ds.num_workers=4 - model.train_ds.min_duration=0.3 # optional - model.train_ds.max_duration=30.0 # optional - model.train_ds.shuffle=true # optional - - # Lhotse dataloading related arguments - ++model.train_ds.use_lhotse=True - ++model.train_ds.batch_duration=1100 - ++model.train_ds.quadratic_duration=30 - ++model.train_ds.num_buckets=30 - ++model.train_ds.num_cuts_for_bins_estimate=10000 - ++model.train_ds.bucket_buffer_size=10000 - ++model.train_ds.shuffle_buffer_size=10000 - - # PyTorch Lightning related arguments - ++trainer.use_distributed_sampler=false - ++trainer.limit_train_batches=1000 - trainer.val_check_interval=1000 - trainer.max_steps=300000 - -.. note:: The default values above are a reasonable starting point for a hybrid RNN-T + CTC ASR model on a 32GB GPU with a data distribution dominated by 15s long utterances. - -Let's briefly go over each of the Lhotse dataloading arguments: - -* ``use_lhotse`` enables Lhotse dataloading -* ``batch_duration`` is the total max duration of utterances in a mini-batch and controls the batch size; the more shorter utterances, the bigger the batch size, and vice versa. -* ``quadratic_duration`` adds a quadratically growing penalty for long utterances; useful in bucketing and transformer type of models. The value set here means utterances this long will count as if with a doubled duration. -* ``num_buckets`` is the number of buckets in the bucketing sampler. Bigger value means less padding but also less randomization. -* ``num_cuts_for_bins_estimate`` is the number of utterance we will sample before the start of the training to estimate the duration bins for buckets. Larger number results in a more accurate estimatation but also a bigger lag before starting the training. -* ``bucket_buffer_size`` is the number of utterances (data and metadata) we will hold in memory to be distributed between buckets. 
With bigger ``batch_duration``, this number may need to be increased for the dynamic bucketing sampler to work properly (typically it will emit a warning if this is too low).
-* ``shuffle_buffer_size`` is an extra number of utterances we will hold in memory to perform approximate shuffling (via reservoir-like sampling). A bigger number means more memory usage but also better randomness.
-
-The PyTorch Lightning ``trainer`` related arguments:
-
-* ``use_distributed_sampler=false`` is required because Lhotse has its own handling of distributed sampling.
-* ``val_check_interval``/``limit_train_batches`` - These are required for dataloaders with tarred/Shar datasets
-  because Lhotse makes the dataloader infinite, so we'd never go past epoch 0. This approach guarantees
-  we will never hang the training because the dataloader in some node has fewer mini-batches than the others
-  in some epochs. The value provided here will be the effective length of each "pseudo-epoch" after which we'll
-  trigger the validation loop.
-* ``max_steps`` is the total number of steps we expect to be training for. It is required for the same reason as ``limit_train_batches``; since we'd never go past epoch 0, the training would never finish.
-
-Some other Lhotse related arguments we support:
-
-* ``cuts_path`` can be provided to read data from a Lhotse CutSet manifest instead of a NeMo manifest.
-  Specifying this option will result in ``manifest_filepaths`` and ``tarred_audio_filepaths`` being ignored.
-* ``shar_path`` can be provided to read data from a Lhotse Shar manifest instead of a NeMo manifest.
-  Specifying this option will result in ``manifest_filepaths`` and ``tarred_audio_filepaths`` being ignored.
-  This argument can be a string (single Shar directory), a list of strings (Shar directories),
-  or a list of 2-item lists, where the first item is a Shar directory path, and the other is a sampling weight.
- The user can also provide a dict mapping Lhotse Shar fields to a list of shard paths with data for that field.
-  For details about the Lhotse Shar format, see: |tutorial_shar|
-* ``bucket_duration_bins`` is a list of float values (seconds) that, when provided, will skip the initial bucket bin estimation
-  and save some time. It has to have a length of ``num_buckets - 1``. An optimal value can be obtained by running the CLI:
-  ``lhotse cut estimate-bucket-bins -b $num_buckets my-cuts.jsonl.gz``
-* ``use_bucketing`` is a boolean which indicates if we want to enable/disable dynamic bucketing. By default it's enabled.
-* ``text_field`` is the name of the key in the JSON (NeMo) manifest from which we should be reading text (default="text").
-* ``lang_field`` is the name of the key in the JSON (NeMo) manifest from which we should be reading the language tag (default="lang"). This is useful when working e.g. with ``AggregateTokenizer``.
-* ``batch_size`` limits the number of examples in a mini-batch to this number, when combined with ``batch_duration``.
-  When ``batch_duration`` is not set, it acts as a static batch size.
-* ``seed`` sets a random seed for the shuffle buffer.
-
-The full and always up-to-date list of supported options can be found in the ``LhotseDataLoadingConfig`` class.
-
-.. _asr-dataset-config-format:
-
-Extended multi-dataset configuration format
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-Combining a large number of datasets and defining weights for them can be tricky.
-We offer an extended configuration format that allows you to explicitly define datasets,
-dataset groups, and their weights either inline in the experiment configuration,
-or as a path to a separate YAML file.
-
-In addition to the features above, this format introduces a special ``tags`` dict-like field.
-The keys and values in ``tags`` are automatically attached to every sampled example, which
-is very useful when combining multiple datasets with different properties.
-The dataset class which converts these examples to tensors can partition the mini-batch and apply
-different processing to each group.
-For example, you may want to construct different prompts for the model using metadata in ``tags``.
-
-.. note:: When fine-tuning a model that was trained with the ``input_cfg`` option, typically you'd only need
-   to override the following options: ``input_cfg=null`` and ``manifest_filepath=path/to/manifest.json``.
-
-Example 1. Combine two datasets with custom weights and attach custom metadata in ``tags`` to each cut:
-
-.. code-block:: yaml
-
-    input_cfg:
-      - type: nemo_tarred
-        manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
-        tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
-        weight: 0.4
-        tags:
-          lang: en
-          pnc: no
-      - type: nemo_tarred
-        manifest_filepath: /path/to/other/manifest__OP_0..512_CL_.json
-        tarred_audio_filepath: /path/to/other/tarred_audio/audio__OP_0..512_CL_.tar
-        weight: 0.6
-        tags:
-          lang: pl
-          pnc: yes
-
-Example 2. Combine multiple (4) datasets corresponding to different tasks (ASR, AST).
-Each task gets its own group and its own weight.
-Then within each task, each dataset gets its own within-group weight as well.
-The final weight is the product of the outer and inner weights:
-
-..
code-block:: yaml - - input_cfg: - - type: group - weight: 0.7 - tags: - task: asr - input_cfg: - - type: nemo_tarred - manifest_filepath: /path/to/asr1/manifest__OP_0..512_CL_.json - tarred_audio_filepath: /path/to/tarred_audio/asr1/audio__OP_0..512_CL_.tar - weight: 0.6 - tags: - source_lang: en - target_lang: en - - type: nemo_tarred - manifest_filepath: /path/to/asr2/manifest__OP_0..512_CL_.json - tarred_audio_filepath: /path/to/asr2/tarred_audio/audio__OP_0..512_CL_.tar - weight: 0.4 - tags: - source_lang: pl - target_lang: pl - - type: group - weight: 0.3 - tags: - task: ast - input_cfg: - - type: nemo_tarred - manifest_filepath: /path/to/ast1/manifest__OP_0..512_CL_.json - tarred_audio_filepath: /path/to/ast1/tarred_audio/audio__OP_0..512_CL_.tar - weight: 0.2 - tags: - source_lang: en - target_lang: pl - - type: nemo_tarred - manifest_filepath: /path/to/ast2/manifest__OP_0..512_CL_.json - tarred_audio_filepath: /path/to/ast2/tarred_audio/audio__OP_0..512_CL_.tar - weight: 0.8 - tags: - source_lang: pl - target_lang: en - -Configuring multimodal dataloading -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -Our configuration format supports specifying data sources from other modalities than just audio. -At this time, this support is extended to audio and text modalities. We provide the following parser types: - -**Raw text files.** Simple text files where each line is an individual text example. This can represent standard language modeling data. -This parser is registered under ``type: txt``. - -Data format examples:: - - # file: document_0.txt - This is a language modeling example. - Wall Street is expecting major news tomorrow. - - # file: document_1.txt - Invisible bats have stormed the city. - What an incredible event! 
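Each raw text file is consumed line-by-line, so preparing this format from one large corpus amounts to splitting it into numbered shard files. A minimal sketch of such a split (the helper and the ``document_<i>.txt`` naming are illustrative, not part of NeMo):

```python
from pathlib import Path


def shard_text_corpus(corpus_path: str, out_dir: str, num_shards: int = 2) -> list:
    """Split a line-per-example corpus into round-robin shards named document_<i>.txt."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    shards = [out / f"document_{i}.txt" for i in range(num_shards)]
    handles = [s.open("w", encoding="utf-8") for s in shards]
    try:
        with open(corpus_path, encoding="utf-8") as f:
            for i, line in enumerate(f):
                if line.strip():  # skip empty lines
                    handles[i % num_shards].write(line)
    finally:
        for h in handles:
            h.close()
    return shards
```

Splitting into many shards also matters for randomization: each dataloading worker can then read a different shard.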
- -Dataloading configuration example:: - - input_cfg: - - type: txt - paths: /path/to/document_{0..1}.txt - language: en # optional - -Python object example:: - - from nemo.collections.common.data.lhotse.text_adapters import TextExample - - example = TextExample( - text="This is a language modeling example.", - language="en", # optional - ) - -Python dataloader instantiation example:: - - from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config - - dl = get_lhotse_dataloader_from_config({ - "input_cfg": [ - {"type": "txt", "paths": "/path/to/document_{0..1}.txt", "language": "en"}, - ], - "use_multimodal_dataloading": True, - "batch_size": 4, - }, - global_rank=0, - world_size=1, - dataset=MyDatasetClass(), # converts CutSet -> dict[str, Tensor] - tokenizer=my_tokenizer, - ) - -**Raw text file pairs.** Pairs of raw text files with corresponding lines. This can represent machine translation data. -This parser is registered under ``type: txt_pair``. - -Data format examples:: - - # file: document_en_0.txt - This is a machine translation example. - Wall Street is expecting major news tomorrow. - - # file: document_pl_0.txt - To jest przykład tłumaczenia maszynowego. - Wall Street spodziewa się jutro ważnych wiadomości. 
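Because ``txt_pair`` pairs the two files line-by-line, source and target shards must contain the same number of corresponding lines; a mismatch silently corrupts the pairing. A small sanity check you could run over your shards before training (a hypothetical helper, not a NeMo API):

```python
def check_parallel_files(source_path: str, target_path: str) -> int:
    """Return the number of aligned line pairs; raise if the files are misaligned."""
    with open(source_path, encoding="utf-8") as src, open(target_path, encoding="utf-8") as tgt:
        src_lines = src.readlines()
        tgt_lines = tgt.readlines()
    if len(src_lines) != len(tgt_lines):
        raise ValueError(
            f"Line count mismatch: {len(src_lines)} source vs {len(tgt_lines)} target lines"
        )
    return len(src_lines)
```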
-
-Dataloading configuration example::
-
-    input_cfg:
-      - type: txt_pair
-        source_path: /path/to/document_en_{0..N}.txt
-        target_path: /path/to/document_pl_{0..N}.txt
-        source_language: en  # optional
-        target_language: pl  # optional
-
-Python object example::
-
-    from nemo.collections.common.data.lhotse.text_adapters import SourceTargetTextExample, TextExample
-
-    example = SourceTargetTextExample(
-        source=TextExample(
-            text="This is a machine translation example.",
-            language="en",  # optional
-        ),
-        target=TextExample(
-            text="To jest przykład tłumaczenia maszynowego.",
-            language="pl",  # optional
-        ),
-    )
-
-Python dataloader instantiation example::
-
-    from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config
-
-    dl = get_lhotse_dataloader_from_config({
-            "input_cfg": [
-                {
-                    "type": "txt_pair",
-                    "source_path": "/path/to/document_en_{0..N}.txt",
-                    "target_path": "/path/to/document_pl_{0..N}.txt",
-                    "source_language": "en",
-                    "target_language": "pl",
-                },
-            ],
-            "use_multimodal_dataloading": True,
-            "prompt_format": "t5nmt",
-            "batch_size": 4,
-        },
-        global_rank=0,
-        world_size=1,
-        dataset=MyDatasetClass(),  # converts CutSet -> dict[str, Tensor]
-        tokenizer=my_tokenizer,
-    )
-
-**NeMo multimodal conversations.** A JSON-Lines (JSONL) file that defines multi-turn conversations with mixed text and audio turns.
-This parser is registered under ``type: multimodal_conversation``.
-
-Data format examples::
-
-    # file: chat_0.jsonl
-    {"id": "conv-0", "conversations": [{"from": "user", "value": "speak to me", "type": "text"}, {"from": "assistant", "value": "/path/to/audio.wav", "duration": 17.1, "type": "audio"}]}
-    {"id": "conv-1", "conversations": [{"from": "user", "value": "speak to me", "type": "text"}, {"from": "assistant", "value": "/path/to/audio.wav", "duration": 5, "offset": 17.1, "type": "audio"}]}
-
-Dataloading configuration example::
-
-    token_equivalent_duration: 0.08
-    input_cfg:
-      - type: multimodal_conversation
-        manifest_filepath: /path/to/chat_{0..N}.jsonl
-        audio_locator_tag: "[audio]"
-
-Python object example::
-
-    from lhotse import Recording
-    from nemo.collections.common.data.lhotse.text_adapters import NeMoMultimodalConversation, TextTurn, AudioTurn
-
-    conversation = NeMoMultimodalConversation(
-        id="conv-0",
-        turns=[
-            TextTurn(value="speak to me", role="user"),
-            AudioTurn(cut=Recording.from_file("/path/to/audio.wav").to_cut(), role="assistant", audio_locator_tag="[audio]"),
-        ],
-        token_equivalent_duration=0.08,  # this value will be auto-inserted by the dataloader
-    )
-
-Python dataloader instantiation example::
-
-    from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config
-
-    dl = get_lhotse_dataloader_from_config({
-            "input_cfg": [
-                {
-                    "type": "multimodal_conversation",
-                    "manifest_filepath": "/path/to/chat_{0..N}.jsonl",
-                    "audio_locator_tag": "[audio]",
-                },
-            ],
-            "use_multimodal_dataloading": True,
-            "token_equivalent_duration": 0.08,
-            "prompt_format": "llama2",
-            "batch_size": 4,
-        },
-        global_rank=0,
-        world_size=1,
-        dataset=MyDatasetClass(),  # converts CutSet -> dict[str, Tensor]
-        tokenizer=my_tokenizer,
-    )
-
-**Dataloading and bucketing of text and multimodal data.** When dataloading text or multimodal data, pay attention to the following config options (we provide example values for convenience):
-
-* ``use_multimodal_sampling: true`` tells Lhotse to switch from
measuring audio duration to measuring token counts; required for text.
-
-* ``prompt_format: "prompt-name"`` will apply the specified PromptFormatter during data sampling to accurately reflect token counts.
-
-* ``measure_total_length: true`` customizes length measurement for decoder-only and encoder-decoder models. Decoder-only models consume a linear sequence of context + answer, so we should measure the total length (``true``). On the other hand, encoder-decoder models deal with two different sequence lengths: the input (context) sequence length for the encoder, and the output (answer) sequence length for the decoder. For such models, set this to ``false``.
-
-* ``min_tokens: 1``/``max_tokens: 4096`` filter examples based on their token count (after applying the prompt format).
-
-* ``min_tpt: 0.1``/``max_tpt: 10`` filter examples based on their output-token-per-input-token ratio. For example, ``max_tpt: 10`` means we'll filter every example that has more than 10 output tokens per 1 input token. Very useful for removing sequence length outliers that lead to OOM. Use ``estimate_token_bins.py`` to view token count distributions for calibrating this value.
-
-* (multimodal-only) ``token_equivalent_duration: 0.08`` is used to measure audio examples in a number of "tokens". For example, if we're using fbank features with a 0.01s frame shift and an acoustic model with a subsampling factor of 8, a reasonable setting could be 0.08 (which means every subsampled frame counts as one token). Calibrate this value to fit your needs.
-
-**Text/multimodal bucketing and OOMptimizer.** Analogous to bucketing for audio data, we provide two scripts to support efficient bucketing:
-
-* ``scripts/speech_llm/estimate_token_bins.py`` which estimates 1D or 2D buckets based on the input config, tokenizer, and prompt format. It also estimates the input/output token count distribution and suggested ``max_tpt`` (token-per-token) filtering values.
- -* (experimental) ``scripts/speech_llm/oomptimizer.py`` which works with SALM/BESTOW GPT/T5 models and estimates the optimal ``bucket_batch_size`` for a given model config and bucket bins value. Given the complexity of Speech LLM some configurations may not be supported yet at the time of writing (e.g., model parallelism). - -To enable bucketing, set ``batch_size: null`` and use the following options: - -* ``use_bucketing: true`` - -* ``bucket_duration_bins`` - the output of ``estimate_token_bins.py``. If ``null``, it will be estimated at the start of training at the cost of some run time (not recommended). - -* (oomptimizer-only) ``bucket_batch_size`` - the output of OOMptimizer. - -* (non-oomptimizer-only) ``batch_tokens`` is the maximum number of tokens we want to find inside a mini-batch. Similarly to ``batch_duration``, this number does consider padding tokens too, therefore enabling bucketing is recommended to maximize the ratio of real vs padding tokens. Note that it's just a heuristic for determining the optimal batch sizes for different buckets, and may be less efficient than using OOMptimizer. - -* (non-oomptimizer-only) ``quadratic_factor`` is a quadratic penalty to equalize the GPU memory usage between buckets of short and long sequence lengths for models with quadratic memory usage. It is only a heuristic and may not be as efficient as using OOMptimizer. - -**Joint dataloading of text/audio/multimodal data.** The key strength of this approach is that we can easily combine audio datasets and text datasets, -and benefit from every other technique we described in this doc, such as: dynamic data mixing, data weighting, dynamic bucketing, and so on. - -This approach is described in the `EMMeTT`_ paper. There's also a notebook tutorial called Multimodal Lhotse Dataloading. 
We construct a separate sampler (with its own batching settings) for each modality,
-and specify how the samplers should be fused together via the option ``sampler_fusion``:
-
-* ``sampler_fusion: "round_robin"`` will iterate a single sampler per step, taking turns. For example: step 0 - audio batch, step 1 - text batch, step 2 - audio batch, etc.
-
-* ``sampler_fusion: "randomized_round_robin"`` is similar, but at each step it chooses a sampler randomly using ``sampler_weights: [w0, w1]`` (the weights can be unnormalized).
-
-* ``sampler_fusion: "zip"`` will draw a mini-batch from each sampler at every step, and merge them into a single ``CutSet``. This approach combines well with multimodal gradient accumulation (run forward+backward for one modality, then the other, then the update step).
-
-.. _EMMeTT: https://arxiv.org/abs/2409.13523
-
-Example. Combine an ASR (audio-text) dataset with an MT (text-only) dataset so that mini-batches have some examples from both datasets:
-
-.. code-block:: yaml
-
-    model:
-      ...
-      train_ds:
-        multi_config: true
-        sampler_fusion: zip
-        shuffle: true
-        num_workers: 4
-
-        audio:
-          prompt_format: t5nmt
-          use_bucketing: true
-          min_duration: 0.5
-          max_duration: 30.0
-          max_tps: 12.0
-          bucket_duration_bins: [[3.16, 10], [3.16, 22], [5.18, 15], ...]
-          bucket_batch_size: [1024, 768, 832, ...]
-          input_cfg:
-            - type: nemo_tarred
-              manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
-              tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
-              weight: 0.5
-              tags:
-                context: "Translate the following to English"
-
-        text:
-          prompt_format: t5nmt
-          use_multimodal_sampling: true
-          min_tokens: 1
-          max_tokens: 256
-          min_tpt: 0.333
-          max_tpt: 3.0
-          measure_total_length: false
-          use_bucketing: true
-          bucket_duration_bins: [[10, 4], [10, 26], [15, 10], ...]
-          bucket_batch_size: [512, 128, 192, ...]
-          input_cfg:
-            - type: txt_pair
-              source_path: /path/to/en__OP_0..512_CL_.txt
-              target_path: /path/to/pl__OP_0..512_CL_.txt
-              source_language: en
-              target_language: pl
-              weight: 0.5
-              tags:
-                question: "Translate the following to Polish"
-
-.. caution:: We strongly recommend using multiple shards for text files as well, so that different nodes and dataloading workers are able to randomize the order of text iteration. Otherwise, multi-GPU training has a high risk of duplicating text examples.
-
-Pre-computing bucket duration bins
-~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
-
-We recommend pre-computing the bucket duration bins in order to accelerate the start of the training --
-otherwise, the dynamic bucketing sampler will have to spend some time estimating them before the training starts.
-The following script may be used:
-
-.. code-block:: bash
-
-    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 manifest.json
-
-    # The script's output:
-    Use the following options in your config:
-        num_buckets=30
-        bucket_duration_bins=[1.78,2.34,2.69,...
-
-For multi-dataset setups, one may provide a dataset config directly:
-
-.. code-block:: bash
-
-    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 input_cfg.yaml
-
-    # The script's output:
-    Use the following options in your config:
-        num_buckets=30
-        bucket_duration_bins=[1.91,3.02,3.56,...
-
-It's also possible to manually specify the list of data manifests (optionally together with weights):
-
-.. code-block:: bash
-
-    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 [[manifest.json,0.7],[other.json,0.3]]
-
-    # The script's output:
-    Use the following options in your config:
-        num_buckets=30
-        bucket_duration_bins=[1.91,3.02,3.56,...
-
-2D bucketing
-~~~~~~~~~~~~
-
-To achieve maximum training efficiency for some classes of models, it is necessary to stratify the sampling
-both on the input sequence lengths and the output sequence lengths.
-One such example is attention encoder-decoder models, where the overall GPU memory usage can be factorized
-into two main components: input-sequence-length bound (encoder activations) and output-sequence-length bound
-(decoder activations).
-Classical bucketing techniques only stratify on the input sequence length (e.g. duration in speech),
-which leverages the encoder effectively but leads to excessive padding on the decoder's side.
-
-To amend this, we support a 2D bucketing technique which estimates the buckets in two stages.
-The first stage is identical to 1D bucketing, i.e. we determine the input-sequence bucket bins so that
-every bin holds roughly an equal duration of audio.
-In the second stage, we use a tokenizer and optionally a prompt formatter (for prompted models) to
-estimate the total number of tokens in each duration bin, and sub-divide it into several sub-buckets,
-where each sub-bucket again holds roughly an equal number of tokens.
-
-To run 2D bucketing with 30 buckets sub-divided into 5 sub-buckets each (150 buckets total), use the following script:
-
-.. code-block:: bash
-
-    $ python scripts/speech_recognition/estimate_duration_bins_2d.py \
-        --tokenizer path/to/tokenizer.model \
-        --buckets 30 \
-        --sub-buckets 5 \
-        input_cfg.yaml
-
-    # The script's output:
-    Use the following options in your config:
-        use_bucketing=1
-        num_buckets=30
-        bucket_duration_bins=[[1.91,10],[1.91,17],[1.91,25],...
-    The max_tps setting below is optional, use it if your data has low quality long transcript outliers:
-        max_tps=[13.2,13.2,11.8,11.8,...]
-
-Note that the output in ``bucket_duration_bins`` is a nested list, where every bin specifies
-the maximum duration and the maximum number of tokens that go into the bucket.
-Passing this option to the Lhotse dataloader will automatically enable 2D bucketing.
-
-Note the presence of the ``max_tps`` (token-per-second) option.
-It is optional to include it in the dataloader configuration: if you do, we will apply an extra filter -that discards examples which have more tokens per second than the threshold value. -The threshold is determined for each bucket separately based on data distribution, and can be controlled -with the option ``--token_outlier_threshold``. -This filtering is useful primarily for noisy datasets to discard low quality examples / outliers. - -We also support aggregate tokenizers for 2D bucketing estimation: - -.. code-block:: bash - - $ python scripts/speech_recognition/estimate_duration_bins_2d.py \ - --tokenizer path/to/en/tokenizer.model path/to/pl/tokenizer1.model \ - --langs en pl \ - --buckets 30 \ - --sub-buckets 5 \ - input_cfg.yaml - -To estimate 2D buckets for a prompted model such as Canary-1B, provide prompt format name and an example prompt. -For Canary-1B, we'll also provide the special tokens tokenizer. Example: - -.. code-block:: bash - - $ python scripts/speech_recognition/estimate_duration_bins_2d.py \ - --prompt-format canary \ - --prompt "[{'role':'user','slots':{'source_lang':'en','target_lang':'de','task':'ast','pnc':'yes'}}]" \ - --tokenizer path/to/spl_tokens/tokenizer.model path/to/en/tokenizer.model path/to/de/tokenizer1.model \ - --langs spl_tokens en de \ - --buckets 30 \ - --sub-buckets 5 \ - input_cfg.yaml - -Pushing GPU utilization to the limits with bucketing and OOMptimizer -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The default approach of specifying a ``batch_duration``, ``bucket_duration_bins`` and ``quadratic_duration`` -is quite flexible, but is not maximally efficient. We observed that in practice it often leads to under-utilization -of GPU memory and compute for most buckets (especially those with shorter durations). -While it is impossible to estimate GPU memory usage up-front, we can determine it empirically with a bit of search. 
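Conceptually, this empirical search is a binary search over batch sizes, where each probe either fits in GPU memory or triggers an OOM. A simplified sketch of the idea, with a hypothetical ``try_batch`` probe standing in for an actual forward+backward step (this is an illustration, not NeMo's implementation):

```python
def find_max_batch_size(try_batch, lo: int = 1, hi: int = 1024) -> int:
    """Binary-search the largest batch size for which try_batch(size) succeeds.

    Assumes try_batch(lo) succeeds and that failures are monotonic in batch size
    (i.e., if a size OOMs, every larger size OOMs too).
    """
    best = lo
    while lo <= hi:
        mid = (lo + hi) // 2
        if try_batch(mid):   # fits in GPU memory
            best = mid
            lo = mid + 1     # try a larger batch
        else:                # simulated CUDA OOM
            hi = mid - 1
    return best
```

In practice such a search is repeated once per bucket, which is why it completes in minutes rather than requiring a full training run.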
- -OOMptimizer is an approach that given a NeMo model, optimizer, and a list of buckets (1D or 2D) -estimates the maximum possible batch size to use for each bucket. -It performs a binary search over batch sizes that succeed or lead to CUDA OOM until convergence. -We find that the resulting bucketing batch size profiles enable full GPU utilization in training, -while it only takes a couple of minutes to complete the search. - -In order to run OOMptimizer, you only need the bucketing bins (from previous sections) and a model configuration: - -.. code-block:: bash - - $ python scripts/speech_recognition/oomptimizer.py \ - --config-path fast-conformer_aed.yaml \ - --module-name nemo.collections.asr.models.EncDecMultiTaskModel \ - --buckets '[[3.975,30],[3.975,48],[4.97,37],[4.97,60],[5.851,42],[5.851,71],[6.563,46],[6.563,79],[7.32,49],[7.32,88],[8.19,54],[8.19,99],[8.88,61],[8.88,107],[9.75,66],[9.75,117],[10.55,72],[10.55,127],[11.21,76],[11.21,135],[11.87,79],[11.87,143],[12.54,82],[12.54,151],[13.08,87],[13.08,157],[13.62,91],[13.62,164],[14.16,93],[14.16,170],[14.7,96],[14.7,177],[15.19,99],[15.19,183],[15.67,101],[15.67,189],[16.13,103],[16.13,194],[16.66,105],[16.66,200],[17.2,108],[17.2,207],[17.73,111],[17.73,213],[18.2,114],[18.2,219],[18.69,117],[18.69,225],[19.15,120],[19.15,230],[19.62,123],[19.62,236],[20.264,122],[20.264,244],[32.547,173],[32.547,391],[36.587,227],[36.587,440],[40.0,253],[40.0,480]]' - - # The script's output: - - The final profile is: - 
bucket_duration_bins=[[3.975,30],[3.975,48],[4.97,37],[4.97,60],[5.851,42],[5.851,71],[6.563,46],[6.563,79],[7.32,49],[7.32,88],[8.19,54],[8.19,99],[8.88,61],[8.88,107],[9.75,66],[9.75,117],[10.55,72],[10.55,127],[11.21,76],[11.21,135],[11.87,79],[11.87,143],[12.54,82],[12.54,151],[13.08,87],[13.08,157],[13.62,91],[13.62,164],[14.16,93],[14.16,170],[14.7,96],[14.7,177],[15.19,99],[15.19,183],[15.67,101],[15.67,189],[16.13,103],[16.13,194],[16.66,105],[16.66,200],[17.2,108],[17.2,207],[17.73,111],[17.73,213],[18.2,114],[18.2,219],[18.69,117],[18.69,225],[19.15,120],[19.15,230],[19.62,123],[19.62,236],[20.264,122],[20.264,244],[32.547,173],[32.547,391],[36.587,227],[36.587,440],[40.0,253],[40.0,480]] - bucket_batch_size=[352,308,280,245,245,206,206,180,186,163,168,142,151,132,136,119,126,106,116,98,110,92,104,88,99,83,94,79,90,76,86,72,86,72,81,68,80,65,78,63,74,60,72,58,70,58,68,54,66,52,65,52,62,50,37,28,31,24,28,21] - max_tps=12.0 - max_duration=40.0 - -Use the resulting options in your training configuration (typically under namespace ``model.train_ds``) to apply the profile. - -It's also possible to run OOMptimizer using a pretrained model's name and bucket bins corresponding -to your fine-tuning data: - - $ python scripts/speech_recognition/oomptimizer.py \ - --pretrained-name nvidia/canary-1b \ - --buckets '[2.0,3.1,5.6,6.6,...]' - -Note that your training script can perform some additional actions using GPU RAM that cannot be anticipated by the OOMptimizer. -By default, we let the script use up to 90% of GPU's RAM for this estimation to account for that. -In the unlikely case you run into an OutOfMemoryError during training, you can try re-estimating the profile with the option ``--memory-fraction 0.75`` (or another value) that will further cap OOMptimizer's available GPU RAM. - -Seeds and randomness -~~~~~~~~~~~~~~~~~~~~ - -In Lhotse dataloading configuration we have two parameters controlling randomness: ``seed`` and ``shard_seed``. 
-Both of them can be set either to a fixed number, or to one of two string options: ``"randomized"`` and ``"trng"``.
-Their roles are:
-
-* ``seed`` is the base random seed, and is one of several factors used to initialize various RNGs participating in dataloading.
-
-* ``shard_seed`` controls the shard randomization strategy in distributed data parallel setups when using sharded tarred datasets.
-
-Below are the typical examples of configuration with an explanation of the expected outcome.
-
-Case 1 (default): ``seed=`` and ``shard_seed="trng"``:
-
-* The ``trng`` setting discards ``seed`` and causes the actual random seed to be drawn using the OS's true RNG. Each node/GPU/dataloading worker draws its own unique random seed when it first needs it.
-
-* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
-
-* On each training script run, the order of dataloader examples is **different**.
-
-* Since the random seed is unpredictable, the exact dataloading order is not replicable.
-
-Case 2: ``seed=`` and ``shard_seed="randomized"``:
-
-* The ``randomized`` setting uses ``seed`` along with the DDP ``rank`` and dataloading ``worker_id`` to set a unique but deterministic random seed in each dataloading process across all GPUs.
-
-* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
-
-* On each training script run, the order of dataloader examples is **identical** as long as ``seed`` is the same.
-
-* This setup guarantees 100% dataloading reproducibility.
-
-* Resuming training without changing the ``seed`` value will cause the model to train on data it has already seen. For large data setups, not managing the ``seed`` may cause the model to never be trained on a majority of the data. This is why this mode is not the default.
-
-* If you're combining DDP with model parallelism techniques (Tensor Parallel, Pipeline Parallel, etc.), you need to use ``shard_seed="randomized"``.
Using ``"trng"`` will cause different model parallel ranks to desynchronize and cause a deadlock. - -* Generally the seed can be managed by the user by providing a different value each time the training script is launched. For example, for most models the option to override would be ``model.train_ds.seed=``. If you're launching multiple tasks queued one after another on a grid system, you can generate a different random seed for each task, e.g. on most Unix systems ``RSEED=$(od -An -N4 -tu4 < /dev/urandom | tr -d ' ')`` would generate a random uint32 number that can be provided as the seed. - -Other, more exotic configurations: - -* With ``shard_seed=``, all dataloading workers will yield the same results. This is only useful for unit testing and maybe debugging. - -* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended. +NeMo supports `Lhotse `_ for advanced dataloading with dynamic batch sizes, dynamic bucketing, OOMptimizer, and multi-dataset configuration. -* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended. +See :doc:`Lhotse Dataloading ` for full documentation. diff --git a/docs/source/asr/fine_tuning.rst b/docs/source/asr/fine_tuning.rst new file mode 100644 index 000000000000..4394cf67aaaa --- /dev/null +++ b/docs/source/asr/fine_tuning.rst @@ -0,0 +1,161 @@ +.. _asr-fine-tuning: + +=========== +Fine-Tuning +=========== + +This page covers how to fine-tune pretrained ASR models in NeMo. + + +When to Fine-Tune +----------------- + +Fine-tuning is recommended when: + +* You have domain-specific data (medical, legal, call center, etc.) and want to improve accuracy on that domain. +* You need to adapt to a new accent, speaking style, or acoustic environment. 
+* You want to add support for a new language using a pretrained multilingual model. + +If you have a large, diverse dataset and want to train from scratch, see :doc:`Configuration Files <./configs>` for full training setup. + + +Fine-Tuning Script +------------------ + +Use the ``speech_to_text_finetune.py`` script: + +.. code-block:: bash + + python examples/asr/speech_to_text_finetune.py \ + --config-path= \ + --config-name= \ + model.train_ds.manifest_filepath= \ + model.validation_ds.manifest_filepath= \ + trainer.devices=1 \ + trainer.max_epochs=50 + +The script handles model initialization from a pretrained checkpoint using the ``init_from_nemo_model`` or ``init_from_pretrained_model`` config options. + + +Initialization Options +----------------------- + +NeMo supports several ways to initialize a model for fine-tuning: + +**From a pretrained model (NGC/HuggingFace):** + +.. code-block:: yaml + + init_from_pretrained_model: "nvidia/parakeet-tdt-0.6b-v2" + +**From a local .nemo checkpoint:** + +.. code-block:: yaml + + init_from_nemo_model: "/path/to/checkpoint.nemo" + +**Partial loading (selective layers):** + +You can include or exclude specific model components using ``include`` and ``exclude`` lists: + +.. code-block:: yaml + + init_from_nemo_model: "/path/to/checkpoint.nemo" + init_from_nemo_model_include: + - encoder + - preprocessor + init_from_nemo_model_exclude: + - decoder + +This is useful when changing the decoder architecture or tokenizer while keeping the pretrained encoder. + + +Tokenizer Changes +------------------ + +**Same tokenizer (same vocabulary):** + +No special handling needed — fine-tune directly. + +**New tokenizer (different vocabulary):** + +When changing the tokenizer (e.g., for a new language or domain), you need to: + +1. Provide the new tokenizer directory in the config. +2. Exclude the decoder/joint from initialization (for Transducer models) or exclude the final linear layer (for CTC models). + +.. 
code-block:: yaml + + model: + tokenizer: + dir: /path/to/new/tokenizer + type: bpe + + init_from_nemo_model: "/path/to/pretrained.nemo" + init_from_nemo_model_exclude: + - decoder + - joint + + +Fine-Tuning with HuggingFace Datasets +--------------------------------------- + +NeMo supports loading datasets directly from HuggingFace: + +.. code-block:: bash + + python examples/asr/speech_to_text_finetune_with_hf.py \ + --config-path= \ + --config-name= \ + model.train_ds.hf_data_cfg.path="mozilla-foundation/common_voice_11_0" \ + model.train_ds.hf_data_cfg.name="en" \ + model.train_ds.hf_data_cfg.split="train" \ + model.validation_ds.hf_data_cfg.path="mozilla-foundation/common_voice_11_0" \ + model.validation_ds.hf_data_cfg.name="en" \ + model.validation_ds.hf_data_cfg.split="validation" + + +Key Configuration Parameters +----------------------------- + +The most important parameters for fine-tuning: + +.. list-table:: + :header-rows: 1 + :widths: 30 70 + + * - Parameter + - Description + * - ``trainer.max_epochs`` + - Number of fine-tuning epochs (typically 50-100 for domain adaptation) + * - ``model.optim.lr`` + - Learning rate (use lower than training from scratch, e.g., 1e-4 to 1e-5) + * - ``model.train_ds.manifest_filepath`` + - Path to training manifest (NeMo JSON format) + * - ``model.train_ds.batch_size`` + - Batch size per GPU + * - ``init_from_pretrained_model`` + - NGC/HF model name to initialize from + * - ``init_from_nemo_model`` + - Local .nemo file to initialize from + +For the complete configuration reference, see :doc:`Configuration Files <./configs>`. + + +Execution Flow +-------------- + +The fine-tuning execution flow for CTC and Transducer models is documented in: + +* `CTC Fine-tuning README `_ +* `Transducer Fine-tuning README `_ + + +Tips +---- + +1. **Start with a low learning rate** — fine-tuning with too high a learning rate can destroy pretrained features. +2. **Use Lhotse dataloading** for efficient training with dynamic batching. 
See :doc:`Lhotse Dataloading `. +3. **Monitor validation WER** closely — fine-tuning can overfit quickly on small datasets. +4. **Use spec augmentation** during fine-tuning to improve robustness. +5. **For multilingual fine-tuning**, consider using ``AggregateTokenizer`` and the Hybrid model with prompt conditioning. diff --git a/docs/source/asr/inference.rst b/docs/source/asr/inference.rst new file mode 100644 index 000000000000..4ae2bbbddcf8 --- /dev/null +++ b/docs/source/asr/inference.rst @@ -0,0 +1,229 @@ +.. _asr-inference: + +========= +Inference +========= + +This page covers how to load ASR models and run inference in NeMo. + + +Loading Checkpoints +------------------- + +**From a local file:** + +.. code-block:: python + + import nemo.collections.asr as nemo_asr + model = nemo_asr.models.ASRModel.restore_from("path/to/checkpoint.nemo") + +**From HuggingFace or NGC:** + +.. code-block:: python + + # HuggingFace (prefix with nvidia/) + model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") + + # NGC (no prefix) + model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") + +.. note:: + + For resuming an unfinished training experiment, use the Experiment Manager with ``resume_if_exists=True`` instead. + + +Basic Transcription +------------------- + +**Python API:** + +.. code-block:: python + + outputs = model.transcribe(audio=["file1.wav", "file2.wav"], batch_size=4) + print(outputs[0].text) + +The ``audio`` argument accepts file paths (strings), lists of paths, numpy arrays, or PyTorch tensors. +Audio must be 16 kHz mono-channel. + +**Numpy/Tensor inputs:** + +.. code-block:: python + + import soundfile as sf + audio, sr = sf.read("audio.wav", dtype='float32') + outputs = model.transcribe([audio], batch_size=1) + +**Command line:** + +.. 
code-block:: bash + + python examples/asr/transcribe_speech.py \ + pretrained_name="nvidia/parakeet-tdt-0.6b-v2" \ + audio_dir= + +**Batch generator (for large datasets):** + +.. code-block:: python + + config = model.get_transcribe_config() + config.batch_size = 32 + for batch_outputs in model.transcribe_generator(audio_files, config): + # process each batch of results + ... + +**Alignments:** + +.. code-block:: python + + hyps = model.transcribe(audio=["file.wav"], return_hypotheses=True) + alignments = hyps[0].alignments + + +Timestamps +---------- + +Obtain word, segment, or character timestamps with Parakeet models (CTC/RNNT/TDT): + +**Simple usage:** + +.. code-block:: python + + hypotheses = model.transcribe(["audio.wav"], timestamps=True) + + for stamp in hypotheses[0].timestamp['word']: + print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}") + + for stamp in hypotheses[0].timestamp['segment']: + print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}") + +**Advanced configuration:** + +.. 
code-block:: python + + from omegaconf import open_dict + + decoding_cfg = model.cfg.decoding + with open_dict(decoding_cfg): + decoding_cfg.preserve_alignments = True + decoding_cfg.compute_timestamps = True + decoding_cfg.segment_seperators = [".", "?", "!"] + decoding_cfg.word_seperator = " " + model.change_decoding_strategy(decoding_cfg) + + hypotheses = model.transcribe(["audio.wav"], return_hypotheses=True) + timestamp_dict = hypotheses[0].timestamp + + time_stride = 8 * model.cfg.preprocessor.window_stride + for stamp in timestamp_dict['word']: + start = stamp['start_offset'] * time_stride + end = stamp['end_offset'] * time_stride + word = stamp['char'] if 'char' in stamp else stamp['word'] + print(f"{start:0.2f} - {end:0.2f} : {word}") + + +Long Audio Inference +-------------------- + +For audio longer than what fits in memory (especially with Conformer's quadratic attention): + +**Buffered / chunked inference:** + +Divide audio into overlapping chunks and merge outputs. Scripts are in +`examples/asr/asr_chunked_inference `_. + +**Local attention (recommended for Fast Conformer):** + +Switch to Longformer-style local+global attention for linear-cost inference on audio >1 hour: + +.. code-block:: python + + model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-ctc-1.1b") + model.change_attention_model( + self_attention_model="rel_pos_local_attn", + att_context_size=[128, 128] + ) + +Or via CLI: + +.. code-block:: bash + + python examples/asr/speech_to_text_eval.py \ + (...other parameters...) \ + ++model_change.conformer.self_attention_model="rel_pos_local_attn" \ + ++model_change.conformer.att_context_size=[128, 128] + +**Subsampling memory optimization:** + +For very long files where even the subsampling module runs out of memory: + +.. code-block:: python + + model.change_subsampling_conv_chunking_factor(1) # auto-chunk subsampling + + +Multi-task Inference (Canary) +----------------------------- + +Canary models require task tokens. 
Use a manifest or specify task parameters directly: + +**Via manifest:** + +.. code-block:: python + + from nemo.collections.asr.models import EncDecMultiTaskModel + + canary = EncDecMultiTaskModel.from_pretrained("nvidia/canary-1b-v2") + decode_cfg = canary.cfg.decoding + decode_cfg.beam.beam_size = 1 + canary.change_decoding_strategy(decode_cfg) + + results = canary.transcribe("manifest.json", batch_size=16) + +Manifest format: + +.. code-block:: json + + {"audio_filepath": "/path/to/audio.wav", "duration": null, "taskname": "asr", "source_lang": "en", "target_lang": "en", "pnc": "yes", "answer": "na"} + +**Via direct parameters:** + +.. code-block:: python + + results = canary.transcribe( + audio=["audio.wav"], + batch_size=4, + task="asr", + source_lang="en", + target_lang="en", + pnc=True, + ) + + +Streaming Inference +------------------- + +NeMo provides a unified streaming-first Pipeline API for real-time ASR under ``nemo.collections.asr.inference``. +It supports buffered CTC/RNNT/TDT pipelines (overlapping chunks with any offline model) and cache-aware CTC/RNNT pipelines (processes each frame once using cached activations). + +See the `Streaming ASR Pipelines tutorial `_ for a comprehensive walkthrough covering buffered and cache-aware pipelines, per-stream options, EoU detection, word timestamps, per-stream biasing, ITN, and speech translation. + +See :ref:`cache-aware streaming conformer` for model architecture details. + + +Apple MPS Support +----------------- + +Inference on Apple M-Series GPUs is supported with PyTorch 2.0+: + +.. code-block:: bash + + PYTORCH_ENABLE_MPS_FALLBACK=1 python examples/asr/speech_to_text_eval.py \ + (...other parameters...) \ + allow_mps=true + + +Execution Flow +-------------- + +When writing custom inference scripts, follow the execution flow diagram at the +`ASR examples README `_. 
diff --git a/docs/source/asr/intro.rst b/docs/source/asr/intro.rst index 25efcb2a0ced..47fac852647e 100644 --- a/docs/source/asr/intro.rst +++ b/docs/source/asr/intro.rst @@ -2,13 +2,13 @@ Automatic Speech Recognition (ASR) ================================== Automatic Speech Recognition (ASR), also known as Speech To Text (STT), refers to the problem of automatically transcribing spoken language. -You can use NeMo to transcribe speech using open-sourced pretrained models in :ref:`14+ languages `, or :doc:`train your own<./examples/kinyarwanda_asr>` ASR models. +NeMo provides open-sourced pretrained models in 25+ languages. Browse the full list in :doc:`ASR Model Checkpoints <./asr_checkpoints>`. +Quick Start +----------- -Transcribe speech with 3 lines of code ----------------------------------------- -After :ref:`installing NeMo`, you can transcribe an audio file as follows: +After :ref:`installing NeMo`, transcribe an audio file in 3 lines: .. code-block:: python @@ -16,257 +16,59 @@ After :ref:`installing NeMo`, you can transcribe an audio file as asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") transcript = asr_model.transcribe(["path/to/audio_file.wav"])[0].text -Obtain timestamps -^^^^^^^^^^^^^^^^^ +Timestamps +^^^^^^^^^^ -Obtaining char(token), word or segment timestamps is also possible with NeMo ASR Models. - -Currently, timestamps are available for Parakeet Models with all types of decoders (CTC/RNNT/TDT). Support for AED models would be added soon. - -There are two ways to obtain timestamps: -1. By using the `timestamps=True` flag in the `transcribe` method. -2. For more control over the timestamps, you can update the decoding config to mention type of timestamps (char, word, segment) and also specify the segment seperators or word seperator for segment and word level timestamps. 
- -With the `timestamps=True` flag, you can obtain timestamps for each character in the transcription as follows: +Obtain word, segment, or character timestamps with any Parakeet model (CTC/RNNT/TDT): .. code-block:: python - - # import nemo_asr and instantiate asr_model as above - import nemo.collections.asr as nemo_asr - asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") - # specify flag `timestamps=True` hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], timestamps=True) + for stamp in hypotheses[0].timestamp['word']: + print(f"{stamp['start']}s - {stamp['end']}s : {stamp['word']}") - # by default, timestamps are enabled for char, word and segment level - word_timestamps = hypotheses[0].timestamp['word'] # word level timestamps for first sample - segment_timestamps = hypotheses[0].timestamp['segment'] # segment level timestamps - char_timestamps = hypotheses[0].timestamp['char'] # char level timestamps - - for stamp in segment_timestamps: - print(f"{stamp['start']}s - {stamp['end']}s : {stamp['segment']}") - - # segment level timestamps (if model supports Punctuation and Capitalization, segment level timestamps are displayed based on punctuation otherwise complete transcription is considered as a single segment) - -For more control over the timestamps, you can update the decoding config to mention type of timestamps (char, word, segment) and also specify the segment seperators or word seperator for segment and word level timestamps as follows: - -.. 
code-block:: python - - # import nemo_asr and instantiate asr_model as above - import nemo.collections.asr as nemo_asr - asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") - - # update decoding config to preserve alignments and compute timestamps - # if necessary also update the segment seperators or word seperator for segment and word level timestamps - from omegaconf import OmegaConf, open_dict - decoding_cfg = asr_model.cfg.decoding - with open_dict(decoding_cfg): - decoding_cfg.preserve_alignments = True - decoding_cfg.compute_timestamps = True - decoding_cfg.segment_seperators = [".", "?", "!"] - decoding_cfg.word_seperator = " " - asr_model.change_decoding_strategy(decoding_cfg) - - # specify flag `return_hypotheses=True`` - hypotheses = asr_model.transcribe(["path/to/audio_file.wav"], return_hypotheses=True) - - timestamp_dict = hypotheses[0].timestamp # extract timestamps from hypothesis of first (and only) audio file - print("Hypothesis contains following timestamp information :", list(timestamp_dict.keys())) - - # For a FastConformer model, you can display the word timestamps as follows: - # 80ms is duration of a timestamp at output of the Conformer - time_stride = 8 * asr_model.cfg.preprocessor.window_stride - - word_timestamps = timestamp_dict['word'] - segment_timestamps = timestamp_dict['segment'] - - for stamp in word_timestamps: - start = stamp['start_offset'] * time_stride - end = stamp['end_offset'] * time_stride - word = stamp['char'] if 'char' in stamp else stamp['word'] - - print(f"Time : {start:0.2f} - {end:0.2f} - {word}") - - for stamp in segment_timestamps: - start = stamp['start_offset'] * time_stride - end = stamp['end_offset'] * time_stride - segment = stamp['segment'] - - print(f"Time : {start:0.2f} - {end:0.2f} - {segment}") - -Transcribe speech via command line ----------------------------------- -You can also transcribe speech via the command line using the following `script `_, for example: - -.. 
code-block:: bash - - python /examples/asr/transcribe_speech.py \ - pretrained_name="stt_en_fastconformer_transducer_large" \ - audio_dir= # path to dir containing audio files to transcribe - -The script will save all transcriptions in a JSONL file where each line corresponds to an audio file in ````. -This file will correspond to a format that NeMo commonly uses for saving model predictions, and also for storing -input data for training and evaluation. You can learn more about the format that NeMo uses for these files -(which we refer to as "manifest files") :ref:`here`. - -You can also specify the files to be transcribed inside a manifest file, and pass that in using the argument -``dataset_manifest=`` instead of ``audio_dir``. - - -Improve ASR transcriptions by incorporating a language model (LM) ------------------------------------------------------------------ +See :doc:`Inference <./inference>` for full details on timestamps, long audio, streaming, and multi-task inference. -You can often improve transcription accuracy by incorporating a language model to guide the selection of more probable words in context. -Even a simple n-gram language model can yield a noticeable improvement. -NeMo supports GPU-accelerated language model fusion for all major ASR model types, including CTC, RNN-T, TDT, and AED. -Customization is available during both greedy and beam decoding. After :ref:`training ` an n-gram LM, you can apply it using the -`speech_to_text_eval.py `_ script. - -**To configure the evaluation:** - -1. Select the pretrained model: - Use the `pretrained_name` option or provide a local path using `model_path`. - -2. Set up the N-gram language model: - Provide the path to the NGPU-LM model with `ngram_lm_model`, and set LM weight with `ngram_lm_alpha`. - -3. 
Choose the decoding strategy: - - - CTC models: `greedy_batch` or `beam_batch` - - RNN-T models: `greedy_batch`, `malsd_batch`, or `maes_batch` - - TDT models: `greedy_batch` or `malsd_batch` - - AED models: `beam` (set `beam_size=1` for greedy decoding) - -4. Run the evaluation script. - -**Example: CTC Greedy Decoding with NGPU-LM** - -.. code-block:: bash - - python examples/asr/speech_to_text_eval.py \ - pretrained_name=nvidia/parakeet-ctc-1.1b \ - amp=false \ - amp_dtype=bfloat16 \ - matmul_precision=high \ - compute_dtype=bfloat16 \ - presort_manifest=true \ - cuda=0 \ - batch_size=32 \ - dataset_manifest= \ - ctc_decoding.greedy.ngram_lm_model= \ - ctc_decoding.greedy.ngram_lm_alpha=0.2 \ - ctc_decoding.greedy.allow_cuda_graphs=True \ - ctc_decoding.strategy="greedy_batch" - -**Example: RNN-T Beam Search with NGPU-LM** - -.. code-block:: bash - - python examples/asr/speech_to_text_eval.py \ - pretrained_name=nvidia/parakeet-rnnt-1.1b \ - amp=false \ - amp_dtype=bfloat16 \ - matmul_precision=high \ - compute_dtype=bfloat16 \ - presort_manifest=true \ - cuda=0 \ - batch_size=16 \ - dataset_manifest= \ - rnnt_decoding.beam.ngram_lm_model= \ - rnnt_decoding.beam.ngram_lm_alpha=0.3 \ - rnnt_decoding.beam.beam_size=10 \ - rnnt_decoding.strategy="malsd_batch" - -See detailed documentation here: :ref:`asr_language_modeling_and_customization`. - -Use real-time transcription ---------------------------- - -It is possible to use NeMo to transcribe speech in real-time. We provide tutorial notebooks for `Cache Aware Streaming `_ and `Buffered Streaming `_. - -Try different ASR models ------------------------- - -NeMo offers a variety of open-sourced pretrained ASR models that vary by model architecture: - -* **encoder architecture** (FastConformer, Conformer, etc.), -* **decoder architecture** (Transducer, CTC & hybrid of the two), -* **size** of the model (small, medium, large, etc.). 
- -The pretrained models also vary by: - -* **language** (English, Spanish, etc., including some **multilingual** and **code-switching** models), -* whether the output text contains **punctuation & capitalization** or not. - -The NeMo ASR checkpoints can be found on `HuggingFace `_, or on `NGC `_. All models released by the NeMo team can be found on NGC, and some of those are also available on HuggingFace. - -All NeMo ASR checkpoints open-sourced by the NeMo team follow the following naming convention: -``stt_{language}_{encoder name}_{decoder name}_{model size}{_optional descriptor}``. - -You can load the checkpoints automatically using the ``ASRModel.from_pretrained()`` class method, for example: - -.. code-block:: python - - import nemo.collections.asr as nemo_asr - # model will be fetched from NGC - asr_model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") - # if model name is prepended with "nvidia/", the model will be fetched from huggingface - asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/stt_en_fastconformer_transducer_large") - # you can also load open-sourced NeMo models released by other HF users using: - # asr_model = nemo_asr.models.ASRModel.from_pretrained("/") +Key Features ------------ -See further documentation about :doc:`loading checkpoints <./results>`, a full :ref:`list ` of models and their :doc:`benchmark scores <./scores>`. +**50+ Pretrained Models** — NeMo offers open-source checkpoints across 25+ languages, available on `HuggingFace `__ and `NGC `__. Browse the full list in :doc:`All Checkpoints <./asr_checkpoints>`. -There is also more information about the ASR model architectures available in NeMo :doc:`here <./models>`. +**Timestamps** — Character, word, and segment-level timestamps are supported for all Parakeet models with CTC, RNNT, and TDT decoders. 
+**Streaming** — Real-time transcription with cache-aware streaming Conformer models, supporting configurable latency-accuracy tradeoffs. See :ref:`cache-aware streaming conformer`. -Try out NeMo ASR transcription in your browser ----------------------------------------------- -You can try out transcription with a NeMo ASR model without leaving your browser, by using the HuggingFace Space embedded below. +**Multi-task (Canary)** — The Canary model family supports ASR and speech translation (AST) across 25 European languages, with built-in punctuation and capitalization. See :doc:`Models <./models>`. -This HuggingFace Space uses `Parakeet TDT 0.6B V2 `__, the latest ASR model from NVIDIA NeMo. It sits at the top of the `HuggingFace OpenASR Leaderboard `__ at time of writing (May 2nd 2025). +**Language Modeling** — GPU-accelerated n-gram LM fusion (NGPU-LM) for CTC, RNN-T, TDT, and AED models improves transcription accuracy without retraining. See :ref:`asr_language_modeling_and_customization`. -.. raw:: html +**Word Boosting** — Bias decoding toward specific words or phrases without retraining. Supports global and per-stream (per-utterance) boosting. See :ref:`word_boosting`. - +**Multitalker** — Streaming multi-speaker ASR with speaker kernel injection handles overlapping speech in real time. See `Multitalker Parakeet `__. - +**Long Audio** — Inference on audio over 1 hour via local attention or buffered chunked processing. +**Decoder Types** — NeMo supports CTC, RNN-T, TDT, AED, and Hybrid decoders. For a comparison of decoder types, see :ref:`asr_language_modeling_and_customization`. -ASR tutorial notebooks ----------------------- -Hands-on speech recognition tutorial notebooks can be found under `the ASR tutorials folder `_. -If you are a beginner to NeMo, consider trying out the `ASR with NeMo `_ tutorial. -This and most other tutorials can be run on Google Colab by specifying the link to the notebooks' GitHub pages on Colab. 
+ASR Customization +----------------- -ASR model configuration ------------------------ -Documentation regarding the configuration files specific to the ``nemo_asr`` models can be found in the :doc:`Configuration Files <./configs>` section. +NeMo supports decoding-time customization techniques to improve accuracy without retraining, including GPU-accelerated language model fusion (NGPU-LM), neural rescoring, and word boosting (GPU-PB, per-stream, Flashlight, CTC-WS). See :ref:`asr_language_modeling_and_customization` for full documentation. -Preparing ASR datasets ----------------------- -NeMo includes preprocessing scripts for several common ASR datasets. The :doc:`Datasets <./datasets>` section contains instructions on -running those scripts. It also includes guidance for creating your own NeMo-compatible dataset, if you have your own data. -NeMo ASR Documentation ----------------------- -For more information, see additional sections in the ASR docs on the left-hand-side menu or in the list below: +Further Reading +--------------- .. toctree:: :maxdepth: 1 models + asr_checkpoints + inference + fine_tuning datasets asr_language_modeling_and_customization - results - scores configs api - all_chkpt - streaming_decoding/canary_chunked_and_streaming_decoding - examples/kinyarwanda_asr.rst diff --git a/docs/source/asr/models.rst b/docs/source/asr/models.rst index 4a6e329435e4..22efc2e60898 100644 --- a/docs/source/asr/models.rst +++ b/docs/source/asr/models.rst @@ -1,17 +1,9 @@ Models ====== -This section gives a brief overview of the models that NeMo's ASR collection currently supports. - -Each of these models can be used with the example ASR scripts (in the ``/examples/asr`` directory) by -specifying the model architecture in the config file used. Examples of config files for each model can be found in -the ``/examples/asr/conf`` directory. - -For more information about the config files and how they should be structured, refer to the :doc:`./configs` section. 
- -Pretrained checkpoints for all of these models, as well as instructions on how to load them, can be found in the :doc:`./results` -section. You can use the available checkpoints for immediate inference, or fine-tune them on your own datasets. The checkpoints section -also contains benchmark results for the available ASR models. +NeMo's ASR collection supports several model architectures. This page covers the key model families and their capabilities. +For pretrained checkpoints, see :doc:`All Checkpoints <./asr_checkpoints>`. +For config file details, see :doc:`Configuration Files <./configs>`. Spotlight Models @@ -20,499 +12,90 @@ Spotlight Models Canary ~~~~~~ -Canary is the latest family of models from NVIDIA NeMo. Canary models are encoder-decoder models with a :ref:`FastConformer Encoder ` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`. -They are multi-lingual, multi-task model, supporting automatic speech-to-text recognition (ASR) in 25 EU languages as well as translation between English and the 24 other supported languages. - -Models: - -* `Canary-1B V2 `__ model card -* `Canary-1B Flash `__ model card -* `Canary-180M Flash `__ model card -* `Canary-1B `__ model card +Canary models are encoder-decoder models with a :ref:`FastConformer Encoder ` and Transformer Decoder :cite:`asr-models-vaswani2017aayn`. +They support ASR in 25 EU languages, speech translation (AST), and punctuation/capitalization (PnC). -Spaces: +* `Canary-1B V2 `__ — Flagship: 25 languages, PnC, timestamps +* `Canary-Qwen-2.5B `__ — English only, PnC, highest accuracy +* `Canary-1B Flash `__ / `180M Flash `__ — Optimized for speed -* `Canary-1B V2 `__ -* `Canary-1B Flash `__ -* `Canary-1B `__ - -Canary models support the following decoding methods for chunked and streaming inference: - -* :ref:`Chunked Inference ` -* :ref:`Streaming Inference ` +Canary supports chunked and `streaming inference `__. 
Parakeet ~~~~~~~~ -Parakeet is the name of a family of ASR models with a :ref:`FastConformer Encoder ` and a CTC, RNN-T, or TDT decoder. - -Model checkpoints: - -* `Parakeet-TDT-0.6B V2 `__ model card +Parakeet is a family of primarily English ASR models (with some multilingual and non-English variants) built on a :ref:`FastConformer Encoder ` with CTC, RNN-T, or TDT decoders. - * this model sits top of the `HuggingFace OpenASR Leaderboard `__ at time of writing (May 2nd 2025) +* `Parakeet-TDT-0.6B V3 `__ — 25 languages, PnC, blazing fast +* `Parakeet-TDT-0.6B V2 `__ — English-only, PnC, blazing fast +* `Parakeet-TDT/CTC-110M `__ — Edge deployment +* `Nemotron-Speech-Streaming `__ — Real-time streaming +* `Multitalker-Parakeet `__ — Multi-speaker streaming -* `Parakeet-CTC-0.6B `__ and `Parakeet-CTC-1.1B `__ model cards - -* `Parakeet-RNNT-0.6B `__ and `Parakeet-RNNT-1.1B `__ model cards - -* `Parakeet-TDT-1.1B `__ model card - -HuggingFace Spaces to try out Parakeet models in your browser: - -* `Parakeet-TDT-0.6B V2 `__ space .. _Conformer_model: Conformer --------- -.. _Conformer-CTC_model: - -Conformer-CTC -~~~~~~~~~~~~~ - -Conformer-CTC is a CTC-based variant of the Conformer model introduced in :cite:`asr-models-gulati2020conformer`. Conformer-CTC has a -similar encoder as the original Conformer but uses CTC loss and decoding instead of RNNT/Transducer loss, which makes it a non-autoregressive model. -We also drop the LSTM decoder and instead use a linear decoder on the top of the encoder. This model uses the combination of -self-attention and convolution modules to achieve the best of the two approaches, the self-attention layers can learn the global -interaction while the convolutions efficiently capture the local correlations. The self-attention modules support both regular -self-attention with absolute positional encoding, and also Transformer-XL's self-attention with relative positional encodings. 
- -Here is the overall architecture of the encoder of Conformer-CTC: +The Conformer :cite:`asr-models-gulati2020conformer` combines self-attention and convolution modules. NeMo supports CTC, Transducer, and HAT variants. -.. image:: images/conformer_ctc.png - :align: center - :alt: Conformer-CTC Model - :scale: 50% - -This model supports both the sub-word level and character level encodings. You can find more details on the config files for the -Conformer-CTC models in the :ref:`Conformer-CTC configuration documentation `. The variant with sub-word encoding is a BPE-based model -which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecCTCModelBPE` class, while the -character-based variant is based on :class:`~nemo.collections.asr.models.EncDecCTCModel`. - -You may find the example config files of Conformer-CTC model with character-based encoding at -``/examples/asr/conf/conformer/conformer_ctc_char.yaml`` and -with sub-word encoding at ``/examples/asr/conf/conformer/conformer_ctc_bpe.yaml``. +* **Conformer-CTC**: Non-autoregressive, uses :class:`~nemo.collections.asr.models.EncDecCTCModelBPE` +* **Conformer-Transducer**: Autoregressive, uses :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` +* **Conformer-HAT**: Separates labels and blank predictions for better external LM integration (`paper `_) +.. _Conformer-CTC_model: .. _Conformer-Transducer_model: - -Conformer-Transducer -~~~~~~~~~~~~~~~~~~~~ - -Conformer-Transducer is the Conformer model introduced in :cite:`asr-models-gulati2020conformer` and uses RNNT/Transducer loss/decoder. -It has the same encoder as Conformer-CTC but utilizes RNNT/Transducer loss/decoder which makes it an autoregressive model. - -Most of the config file for Conformer-Transducer models are similar to Conformer-CTC except the sections related to the decoder and loss: decoder, loss, joint, decoding. 
-You may take a look at our :doc:`tutorials page <../starthere/tutorials>` on Transducer models to become familiar with their configs: -`Introduction to Transducers `_ and -`ASR with Transducers `_ -You can find more details on the config files for the Conformer-Transducer models in the :ref:`Conformer-CTC configuration documentation `. - -This model supports both the sub-word level and character level encodings. The variant with sub-word encoding is a BPE-based model -which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` class, while the -character-based variant is based on :class:`~nemo.collections.asr.models.EncDecRNNTModel`. - -You may find the example config files of Conformer-Transducer model with character-based encoding at -``/examples/asr/conf/conformer/conformer_transducer_char.yaml`` and -with sub-word encoding at ``/examples/asr/conf/conformer/conformer_transducer_bpe.yaml``. - .. _Conformer-HAT_model: -Conformer-HAT -~~~~~~~~~~~~~ - -Conformer HAT (Hybrid Autoregressive Transducer) model (do not confuse it with Hybrid-Transducer-CTC) is a modification of Conformer-Transducer model based on this previous `work `_. -The main idea is to separate labels and blank score predictions, which allows to estimate the internal LM probabilities during decoding. -When external LM is available for inference, the internal LM can be subtracted from HAT model prediction in beamsearch decoding to improve external LM efficiency. -It can be helpful in the case of text-only adaptation for new domains. - -The only difference from the standard Conformer-Transducer model (RNNT) is the use of `"HATJoint" `_ -class (instead of "RNNTJoint") for joint module. The all HAT logic is implemented in the "HATJoint" class. - -.. 
image:: images/hat.png - :align: center - :alt: HAT Model - :scale: 50% - -You may find the example config files of Conformer-HAT model with character-based encoding at -``/examples/asr/conf/conformer/hat/conformer_hat_char.yaml`` and -with sub-word encoding at ``/examples/asr/conf/conformer/hat/conformer_hat_bpe.yaml``. - -By default, the decoding for HAT model works in the same way as for Conformer-Transducer. -In the case of external ngram LM fusion you can use ``/scripts/asr_language_modeling/ngram_lm/eval_beamsearch_ngram_transducer.py``. -To enable HAT internal LM subtraction set ``hat_subtract_ilm=True`` and find more appropriate couple of ``beam_alpha`` and ``hat_ilm_weight`` values in terms of the best recognition accuracy. +Configs: ``examples/asr/conf/conformer/`` .. _Fast-Conformer: Fast-Conformer -------------- -The Fast Conformer (CTC and RNNT) models have a faster version of the Conformer encoder and differ from it as follows: - -* 8x depthwise convolutional subsampling with 256 channels -* Reduced convolutional kernel size of 9 in the conformer blocks - -The Fast Conformer encoder is about 2.4x faster than the regular Conformer encoder without a significant model quality degradation. -128 subsampling channels yield a 2.7x speedup vs baseline but model quality starts to degrade. -With local attention, inference is possible on audios >1 hrs (256 subsampling channels) / >2 hrs (128 channels). - -Fast Conformer models were trained using CosineAnnealing (instead of Noam) as the scheduler. +Fast Conformer has 8x depthwise convolutional subsampling and reduced kernel sizes, making it ~2.4x faster than standard Conformer with minimal quality loss. +Supports Longformer-style local attention for audio >1 hour. 
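The 8x subsampling also fixes the encoder's output frame rate, which is the stride used when converting timestamp offsets to seconds. A minimal sketch of that arithmetic, assuming the typical 10 ms preprocessor ``window_stride``:

```python
# Typical FastConformer preprocessor setting (assumed for this sketch)
window_stride = 0.01  # seconds per input feature frame (10 ms)
subsampling = 8       # 8x depthwise convolutional subsampling

# Duration covered by one encoder output frame
time_stride = subsampling * window_stride  # 0.08 s -> each frame spans 80 ms

# Converting a timestamp offset (measured in encoder frames) to seconds
start_offset = 25
print(round(start_offset * time_stride, 2))  # 2.0
```

This is the same ``time_stride = 8 * model.cfg.preprocessor.window_stride`` computation used in the timestamp examples in the inference docs.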
-You may find the example CTC config at -``/examples/asr/conf/fastconformer/fast-conformer_ctc_bpe.yaml`` and -the transducer config at ``/examples/asr/conf/fastconformer/fast-conformer_transducer_bpe.yaml`` - -Note that both configs are subword-based (BPE). - -You can also train these models with longformer-style attention (https://arxiv.org/abs/2004.05150) using the following configs: CTC config at -``/examples/asr/conf/fastconformer/fast-conformer-long_ctc_bpe.yaml`` and transducer config at ``/examples/asr/conf/fastconformer/fast-conformer-long_transducer_bpe.yaml`` -This allows using the model on longer audio (up to 70 minutes with Fast Conformer). Note that the Fast Conformer checkpoints -can be used with limited context attention even if trained with full context. However, if you also want to use global tokens, -which help aggregate information from outside the limited context, then training is required. - -You may find more examples under ``/examples/asr/conf/fastconformer/``. +Configs: ``examples/asr/conf/fastconformer/`` .. _cache-aware streaming conformer: Cache-aware Streaming Conformer ------------------------------- -Try real-time ASR with the `Cache-aware Streaming Conformer tutorial notebook `_. - -Buffered streaming uses overlapping chunks to make an offline ASR model usable for streaming with reasonable accuracy. However, it causes significant amount of duplication in computation due to the overlapping chunks. -Also, there is an accuracy gap between the offline model and the streaming one, as there is inconsistency between how we train the model and how we perform inference for streaming. -The Cache-aware Streaming Conformer models tackle and address these disadvantages. These streaming Conformers are trained with limited right context, making it possible to match how the model is being used in both training and inference. -They also use caching to store intermediate activations to avoid any duplication in compute. 
-The cache-aware approach is supported for both the Conformer-CTC and Conformer-Transducer and enables the model to be used very efficiently for streaming. - -Three categories of layers in Conformer have access to right tokens: -#. depthwise convolutions -#. self-attention -#. convolutions in the downsampling layers. - -Streaming Conformer models use causal convolutions or convolutions with lower right context and also self-attention with limited right context to limit the effective right context for the input. -The model trained with such limitations can be used in streaming mode and give the exact same outputs and accuracy as when the whole audio is given to the model in offline mode. -These model can use caching mechanism to store and reuse the activations during streaming inference to avoid any duplications in the computations as much as possible. - -We support the following three right context modeling techniques: - -* | Fully causal model with zero look-ahead: tokens do not see any future tokens. Convolution layers are all causal and right tokens are masked for self-attention. - | - | It gives zero latency but with limited accuracy. - | To train such a model, you need to set `model.encoder.att_context_size=[left_context,0]` and `model.encoder.conv_context_size=causal` in the config. - -* | Regular look-ahead: convolutions are able to see few future frames, and self-attention also sees the same number of future tokens. - | - | In this approach the activations for the look-ahead part are not cached, and are recalculated in the next chunks. The right context in each layer should be a small number as multiple layers would increase the effective context size and then increase the look-ahead size and latency. - | For example for a model of 17 layers with 4x downsampling and 10ms window shift, then even 2 right context in each layer means 17*2*10*4=1360ms look-ahead. Each step after the downsampling corresponds to 4*10=40ms. 
- -* | Chunk-aware look-ahead: input is split into equal chunks. Convolutions are fully causal while self-attention layers are able to see all the tokens in their corresponding chunk. - | - | For example, in a model with chunk size of 20 tokens, tokens at the first position of each chunk would see all the next 19 tokens while the last token would see zero future tokens. - | This approach is more efficient than regular look-ahead in terms of computations as the activations for most of the look-ahead part would be cached and there is close to zero duplications in the calculations. - | In terms of accuracy, this approach gives similar or even better results in term of accuracy than regular look-ahead as each token in each layer have access to more tokens on average. That is why we recommend to use this approach for streaming. Therefore we recommend to use the chunk-aware for cache-aware models. - -.. note:: Latencies are based on the assumption that the forward time of the network is zero and it just estimates the time needed after a frame would be available until it is passed through the model. - -Approaches with non-zero look-ahead can give significantly better accuracy by sacrificing latency. The latency can get controlled by the left context size. Increasing the right context would help the accuracy to a limit but would increase the computation time. - -In all modes, left context can be controlled by the number of tokens visible in self-attention and the kernel size of the convolutions. -For example, if the left context of self-attention in each layer is set to 20 tokens and there are 10 layers of Conformer, then the effective left context is 20*10=200 tokens. -Left context of self-attention for regular look-ahead can be set as any number, while it should be set as a multiple of the right context in chunk-aware look-ahead. -For convolutions, if we use a left context of 30, then there would be 30*10=300 effective left context. 
-Left context of convolutions is dependent on their kernel size while it can be any number for self-attention layers. Higher left context for self-attention means larger cache and more computations for the self-attention. -A self-attention left context of around 6 secs would give close results to unlimited left context. For a model with 4x downsampling and shift window of 10ms in the preprocessor, each token corresponds to 4*10=40ms. - -If striding approach is used for downsampling, all the convolutions in downsampling would be fully causal and don't see future tokens. - -Multiple Look-aheads -~~~~~~~~~~~~~~~~~~~~ - -We support multiple look-aheads for cahce-aware models. You may specify a list of context sizes for att_context_size. -During the training, different context sizes would be used randomly with the distribution specified by att_context_probs. -For example you may enable multiple look-aheads by setting `model.encoder.att_context_size=[[70,13],[70,6],[70,1],[70,0]]` for the training. -The first item in the list would be the default during test/validation/inference. To switch between different look-aheads, you may use the method `asr_model.encoder.set_default_att_context_size(att_context_size)` or set the att_context_size like the following when using the script `speech_transcribe.py`: - -.. code-block:: bash - - python [NEMO_GIT_FOLDER]/examples/asr/transcribe_speech.py \ - pretrained_name="stt_en_fastconformer_hybrid_large_streaming_multi" \ - audio_dir="" \ - att_context_size=[70,0] - -.. - -You may find the example config files for cache-aware streaming FastConformer models at -``/examples/asr/conf/fastconformer/cache_aware_streaming/conformer_transducer_bpe_streaming.yaml`` for Transducer variant and -at ``/examples/asr/conf/conformer/cache_aware_streaming/conformer_ctc_bpe.yaml`` for CTC variant. It is recommended to use FastConformer as they are more than 2X faster in both training and inference than regular Conformer. 
-The hybrid versions of FastConformer can be found here: ``/examples/asr/conf/conformer/hybrid_cache_aware_streaming/`` - -Examples for regular Conformer can be found at -``/examples/asr/conf/conformer/cache_aware_streaming/conformer_transducer_bpe_streaming.yaml`` for Transducer variant and -at ``/examples/asr/conf/conformer/cache_aware_streaming/conformer_ctc_bpe.yaml`` for CTC variant. - -To simulate cache-aware streaming, you may use the script at ``/examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py``. It can simulate streaming in single stream or multi-stream mode (in batches) for an ASR model. -This script can be used for models trained offline with full-context but the accuracy would not be great unless the chunk size is large enough which would result in high latency. -It is recommended to train a model in streaming model with limited context for this script. More info can be found in the script. - -Note cache-aware streaming models are being exported without caching support by default. -To include caching support, `model.set_export_config({'cache_support' : 'True'})` should be called before export. -Or, if ``/scripts/export.py`` is being used: -`python export.py cache_aware_conformer.nemo cache_aware_conformer.onnx --export-config cache_support=True` - - -Multitalker Cache-aware Streaming FastConformer ------------------------------------------------ - -This model is a streaming multitalker ASR model based on the :ref:`Cache-aware Streaming FastConformer ` architecture. The model only takes the speaker diarization outputs as external information and eliminates the need for explicit speaker queries or enrollment audio :cite:`asr-models-wang25y_interspeech`. Unlike conventional target-speaker ASR approaches that require speaker embeddings, this model dynamically adapts to individual speakers through speaker-wise speech activity prediction. 
- - -Self-Speaker Adaptation Technique -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The key innovation involves injecting learnable speaker kernels into the pre-encode layer of the FastConformer encoder :cite:`asr-models-rekesh2023fastconformer`. These speaker kernels are generated via speaker supervision activations, enabling instantaneous adaptation to target speakers. This approach leverages the inherent tendency of streaming ASR systems to prioritize specific speakers, repurposing this mechanism to achieve robust speaker-focused recognition. - -The model architecture requires deploying one model instance per speaker, meaning the number of model instances matches the number of speakers in the conversation. While this necessitates additional computational resources, it achieves state-of-the-art performance in handling fully overlapped speech in both offline and streaming scenarios. - -This self-speaker adaptation approach offers several advantages over traditional multitalker ASR methods: - -* | No Speaker Enrollment: Unlike target-speaker ASR systems that require pre-enrollment audio or speaker embeddings, this model only needs speaker activity information from diarization -* | Handles Severe Overlap: Each instance focuses on a single speaker, enabling accurate transcription even during fully overlapped speech -* | Streaming Capable: Designed for real-time streaming scenarios with configurable latency-accuracy tradeoffs -* | Leverages Single-Speaker Models: Can be fine-tuned from strong pre-trained single-speaker ASR models, and single speaker ASR performance is also preserved - -Speaker Kernel Injection -~~~~~~~~~~~~~~~~~~~~~~~~~ - -The streaming multitalker Parakeet model employs a speaker kernel injection mechanism at some layers of the FastConformer encoder. The learnable speaker kernels are injected into selected encoder layers, enabling the model to dynamically adapt to specific speakers. 
+Streaming models trained with limited right context for real-time inference with caching to avoid duplicate computation. Supports three modes: fully causal, regular look-ahead, and chunk-aware look-ahead (recommended). -The speaker kernels are generated through speaker supervision activations that detect speech activity for each target speaker. This enables the encoder states to become more responsive to the targeted speaker's speech characteristics, even during periods of fully overlapped speech. +* `Tutorial notebook `_ +* Simulation script: ``examples/asr/asr_cache_aware_streaming/speech_to_text_cache_aware_streaming_infer.py`` +* Supports multiple look-aheads with ``att_context_size`` lists -Multi-Instance Architecture -~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +Configs: ``examples/asr/conf/fastconformer/cache_aware_streaming/`` -The model is based on the Parakeet architecture and consists of a NeMo Encoder for Speech Tasks (NEST) :cite:`asr-models-huang2025nest` which is based on FastConformer :cite:`asr-models-rekesh2023fastconformer` encoder. The key architectural park2024nestinnovation is the **multi-instance approach**, where one model instance is deployed per speaker. -Each model instance has the following characteristics: - -* | Receives the identical speaker-mixed audio input. -* | Injects speaker-specific kernels at the pre-encode layer. -* | Produces transcription output specific to its target speaker. -* | Operates independently and can run in parallel with other instances. - -This architecture enables the model to handle severe speech overlap by having each instance focus exclusively on one speaker, eliminating the permutation problem that affects other multitalker ASR approaches. -Find more details in the :cite:`asr-models-wang25y_interspeech` paper. - -The real-time multitalker ASR model is built on RNNT model structure. 
:class:`~nemo.collections.asr.models.EncDecMultiTalkerRNNTBPEModel` class inherits from :class:`~nemo.collections.asr.models.EncDecRNNTBPEModel` class and speaker kernel :class:`~nemo.collections.asr.parts.mixins.SpeakerKernel` class. - -Try real-time multitalker ASR with the tutorial notebook: `Streaming Multitalker ASR tutorial notebook `_. - -You can simulate the streaming audio stream and streaming multitalker ASR with the script: -``/examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py`` - - -.. note:: - Many ASR pipelines expect **16 kHz, mono-channel WAV** input. - If your audio is mp3/m4a or has a different sample rate/channel count, convert it first: - - .. code-block:: bash - - ffmpeg -i input.mp3 -ac 1 -ar 16000 -y output.wav - -For an individual audio file: - -.. code-block:: bash - - python /examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \ - asr_model="/path/to/multitalker-parakeet-streaming-0.6b-v1.nemo" \ - diar_model="/path/to/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \ - audio_file="/path/to/your/example.wav" \ - output_path="/path/to/your/example_output.json" - -If you want to simulate the system on multiple files, use NeMo manifest: - -.. code-block:: bash - - python /examples/asr/asr_cache_aware_streaming/speech_to_text_multitalker_streaming_infer.py \ - asr_model="/path/to/multitalker-parakeet-streaming-0.6b-v1.nemo" \ - diar_model="/path/to/nvidia/diar_streaming_sortformer_4spk-v2.nemo" \ - manifest_file="/path/to/your/example_manifest.json" \ - output_path="/path/to/your/example_output.json" - - -Download model checkpoint and more details can be found on Huggingface model card: `Multitalker Parakeet (Cache-aware FastConformer) Streaming `__. +Multitalker Streaming +--------------------- +Streaming multi-speaker ASR based on cache-aware FastConformer with speaker kernel injection :cite:`asr-models-wang25y_interspeech`. 
Deploys one model instance per speaker for robust transcription of overlapped speech. +* `Model card `__ +* `Tutorial `_ .. _Hybrid-Transducer_CTC_model: Hybrid-Transducer-CTC ---------------------- - -Hybrid RNNT-CTC models is a group of models with both the RNNT and CTC decoders. Training a unified model would speedup the convergence for the CTC models and would enable -the user to use a single model which works as both a CTC and RNNT model. This category can be used with any of the ASR models. -Hybrid models uses two decoders of CTC and RNNT on the top of the encoder. The default decoding strategy after the training is done is RNNT. -User may use the ``asr_model.change_decoding_strategy(decoder_type='ctc' or 'rnnt')`` to change the default decoding. - -The variant with sub-word encoding is a BPE-based model -which can be instantiated using the :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModel` class, while the -character-based variant is based on :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCModel`. - -You may use the example scripts under ``/examples/asr/asr_hybrid_transducer_ctc`` for both the char-based encoding and sub-word encoding. -These examples can be used to train any Hybrid ASR model like Conformer. +---------------------- -You may find the example config files of Conformer variant of such hybrid models with character-based encoding at -``/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_char.yaml`` and -with sub-word encoding at ``/examples/asr/conf/conformer/hybrid_transducer_ctc/conformer_hybrid_transducer_ctc_bpe.yaml``. +Models with both RNN-T and CTC decoders trained jointly. Switch at inference time via ``asr_model.change_decoding_strategy(decoder_type='ctc' or 'rnnt')``. 
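The one-encoder/two-decoders idea can be sketched in plain Python. This is a toy illustration of the decoder-switching mechanism, not the NeMo implementation (the real classes are listed below); only the ``change_decoding_strategy`` name and the RNN-T default mirror the documented behavior:

```python
# Toy sketch of a hybrid RNN-T + CTC model: the shared encoder runs once,
# and the decoding path is selected at inference time.

class HybridModel:
    def __init__(self):
        self.decoder_type = "rnnt"  # RNN-T is the default decoding strategy

    def change_decoding_strategy(self, decoder_type: str) -> None:
        if decoder_type not in ("ctc", "rnnt"):
            raise ValueError(f"unknown decoder type: {decoder_type}")
        self.decoder_type = decoder_type

    def encode(self, audio_frames):
        # Stand-in for the shared (Fast)Conformer encoder.
        return [f * 2 for f in audio_frames]

    def transcribe(self, audio_frames):
        enc = self.encode(audio_frames)  # encoder cost is paid once per utterance
        if self.decoder_type == "ctc":
            return f"ctc-decode({len(enc)} frames)"
        return f"rnnt-decode({len(enc)} frames)"

model = HybridModel()
print(model.transcribe([0.1, 0.2]))   # rnnt-decode(2 frames)
model.change_decoding_strategy("ctc")
print(model.transcribe([0.1, 0.2]))   # ctc-decode(2 frames)
```

Because both decoders share the encoder, training the joint model speeds up CTC convergence while leaving you a single checkpoint that serves either decoding mode.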
-Similar example configs for FastConformer variants of Hybrid models can be found here: -``/examples/asr/conf/fastconformer/hybrid_transducer_ctc/`` -``/examples/asr/conf/fastconformer/hybrid_cache_aware_streaming/`` - -Note Hybrid models are being exported as RNNT (encoder and decoder+joint parts) by default. -To export as CTC (single encoder+decoder graph), `model.set_export_config({'decoder_type' : 'ctc'})` should be called before export. -Or, if ``/scripts/export.py`` is being used: -`python export.py hybrid_transducer.nemo hybrid_transducer.onnx --export-config decoder_type=ctc` +* :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModel` (BPE) / :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCModel` (char) +* Configs: ``examples/asr/conf/fastconformer/hybrid_transducer_ctc/`` .. _Hybrid-Transducer-CTC-Prompt_model: -Hybrid-Transducer-CTC with Prompt Conditioning -~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - -The Hybrid RNNT-CTC model with prompt conditioning (``EncDecHybridRNNTCTCBPEModelWithPrompt``) extends the base Hybrid-Transducer-CTC model -to support multi-language and multi-domain ASR through prompt conditioning. This model leverages prompts to guide the transcription process, -enabling language-specific or domain-specific transcription from a single unified model. - -Key features of this model include: - -* **Prompt Feature**: Uses learnable prompt embeddings that are concatenated with acoustic features to guide transcription -* **Multi-language Support**: Can transcribe audio in multiple languages based on language prompts -* **Offline and Buffered Streaming Inference Support**: Can be used in offline and buffered streaming mode - -The model can be instantiated using the :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModelWithPrompt` class. 
- -Architecture Overview -^^^^^^^^^^^^^^^^^^^^^ - -The model architecture builds upon the standard Hybrid-Transducer-CTC model by incorporating prompt information directly into the decoder through a concatenation-based approach. This design enables scalable multilingual ASR/AST capabilities. - -**Core Components:** - -1. **Prompt Supervision Source**: Prompt label information extracted from training and inference manifests or provided as an input -2. **Prompt Vector Representation**: One-hot binary vectors where one element is 1 (prompt) and all others are 0 -3. **Concatenation-Based Prompt Encoding**: Direct combination of prompt vectors with acoustic features - -**Detailed Architecture:** - -**Prompt Vector Design:** -- **Dimensionality**: Default to 128-dimensional vectors for scalability (supports current target language and future prompts) without the need to change the architecture -- **Representation**: Binary one-hot encoding where each position represents a prompt ID -- **Expansion**: Prompt vectors are expanded at each time step to match acoustic feature temporal dimensions - -**Concatenation Method Implementation:** -The model adopts a concatenation approach where language vectors and ASR acoustic features are directly combined: - -1. **Feature Stacking**: Language vectors and encoded acoustic features are stacked along the feature dimension -2. 
**Projection**: The concatenated representation passes through a projection network (``prompt_kernel``) - - -**Inference Capabilities:** - -The model supports both offline and buffered streaming inference modes: - -- **Offline Mode**: Full context processing for maximum accuracy -- **Buffered Streaming**: Real-time multilingual speech-to-text processing with language-aware decoding - - -Configuration -^^^^^^^^^^^^^ - -The model supports several prompt-specific configuration parameters: - -* ``initialize_prompt_feature``: Boolean flag to enable prompt conditioning -* ``num_prompts``: Number of supported prompt categories (default: 128) -* ``prompt_dictionary``: Mapping from language/domain identifiers to prompt indices -* ``prompt_field``: Field name used for prompt extraction from manifest files - -Example config files for this model can be found at: -``/examples/asr/conf/fastconformer/hybrid_transducer_ctc/fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml`` - -Training -^^^^^^^^ - -To train the Hybrid-Transducer-CTC model with prompt feature, use the training script: - -``/examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py`` - -Example training command: - -.. code-block:: bash - - python /examples/asr/asr_hybrid_transducer_ctc/speech_to_text_hybrid_rnnt_ctc_bpe_prompt.py \ - --config-path=/examples/asr/conf/fastconformer/hybrid_transducer_ctc/ \ - --config-name=fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml \ - model.train_ds.manifest_filepath= \ - model.validation_ds.manifest_filepath= \ - model.tokenizer.dir= \ - model.test_ds.manifest_filepath= - -Usage Examples -^^^^^^^^^^^^^^ - -**Basic Transcription with Language Prompts:** - -.. 
code-block:: python - - # Load the model - asr_model = nemo_asr.models.EncDecHybridRNNTCTCBPEModelWithPrompt.restore_from("path/to/model.nemo") - - # Transcribe with specific target language - transcriptions = asr_model.transcribe( - paths2audio_files=["audio1.wav", "audio2.wav"], - target_lang="en-US", # Specify target language - ) - - -Training Data Requirements -^^^^^^^^^^^^^^^^^^^^^^^^^^ - -The model requires training data with prompt annotations. The recommended dataset format uses Lhotse with the -:class:`~nemo.collections.asr.data.audio_to_text_lhotse_prompt.LhotseSpeechToTextBpeDatasetWithPrompt` dataset class. - -Manifest files should include prompt information: - -.. code-block:: json - - { - "audio_filepath": "path/to/audio.wav", - "text": "transcription text", - "duration": 10.5, - "target_lang": "en-US" - } - - -.. _LSTM-Transducer_model: - -LSTM-Transducer ---------------- - -LSTM-Transducer is a model which uses RNNs (eg. LSTM) in the encoder. The architecture of this model is followed from suggestions in :cite:`asr-models-he2019streaming`. -It uses RNNT/Transducer loss/decoder. The encoder consists of RNN layers (LSTM as default) with lower projection size to increase the efficiency. -Layer norm is added between the layers to stabilize the training. -It can be trained/used in unidirectional or bidirectional mode. The unidirectional mode is fully causal and can be used easily for simple and efficient streaming. However the accuracy of this model is generally lower than other models like Conformer. - -This model supports both the sub-word level and character level encodings. You may find the example config file of RNNT model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_transducer_bpe.yaml``. -You can find more details on the config files for the RNNT models at :ref:`LSTM-Transducer `. - -.. 
_LSTM-CTC_model: - -LSTM-CTC --------- - -LSTM-CTC model is a CTC-variant of the LSTM-Transducer model which uses CTC loss/decoding instead of Transducer. -You may find the example config file of LSTM-CTC model with wordpiece encoding at ``/examples/asr/conf/lstm/lstm_ctc_bpe.yaml``. +**With Prompt Conditioning:** Extends Hybrid models with learnable prompt embeddings for multilingual/multi-domain ASR via :class:`~nemo.collections.asr.models.EncDecHybridRNNTCTCBPEModelWithPrompt`. Config: ``fastconformer_hybrid_transducer_ctc_bpe_prompt.yaml`` References diff --git a/docs/source/asr/results.rst b/docs/source/asr/results.rst index 294d5ea891d0..84851f2799b6 100644 --- a/docs/source/asr/results.rst +++ b/docs/source/asr/results.rst @@ -259,7 +259,7 @@ Automatic Speech Recognition Models Speech Recognition ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ -Below is a list of the high quality ASR models available in NeMo for specific languages, all ASR models can be found in :doc:`All checkpoints <./all_chkpt>`. +Below is a list of the high quality ASR models available in NeMo for specific languages. All ASR models can be found in :doc:`ASR Model Checkpoints <./asr_checkpoints>`. Multilingual Multitask ^^^^^^^^^^^^^^^^^^^^^^ diff --git a/docs/source/collections.rst b/docs/source/collections.rst index 31e5b9b64ac1..13328b4cd996 100644 --- a/docs/source/collections.rst +++ b/docs/source/collections.rst @@ -9,6 +9,7 @@ Documentation for the individual collections :titlesonly: asr/intro + dataloaders asr/speech_classification/intro asr/speaker_recognition/intro asr/speaker_diarization/intro diff --git a/docs/source/dataloaders.rst b/docs/source/dataloaders.rst new file mode 100644 index 000000000000..cd11f1f7985a --- /dev/null +++ b/docs/source/dataloaders.rst @@ -0,0 +1,687 @@ +.. _lhotse-dataloading: + +================== +Lhotse Dataloading +================== + +NeMo supports using `Lhotse`_, a speech data handling library, as a dataloading option. 
The key features of Lhotse used in NeMo are:
+
+* Dynamic batch sizes
+    Lhotse samples mini-batches to satisfy the constraint of total speech duration in a mini-batch (``batch_duration``),
+    rather than a specific number of examples (i.e., batch size).
+* Dynamic bucketing
+    Instead of statically pre-bucketing the data, Lhotse allocates training examples to buckets dynamically.
+    This allows more rapid experimentation with bucketing settings (number of buckets, specific placement of bucket duration bins)
+    to minimize the amount of padding and accelerate training.
+* Quadratic duration penalty
+    Adding a quadratic penalty to an utterance's duration allows sampling mini-batches so that
+    GPU utilization is more consistent across big batches of short utterances and small batches of long utterances when using
+    models with quadratic time/memory complexity (such as transformers).
+* Dynamic weighted data source multiplexing
+    An approach to combining diverse data sources (e.g., multiple domains, languages, or tasks)
+    where each data source is treated as a separate stream with its own sampling probability. The resulting data stream is a
+    multiplexer that samples from each sub-stream. This approach ensures that the distribution of different sources is approximately
+    constant in time (i.e., stationary); in fact, each mini-batch will have roughly the same ratio of data coming from each source.
+    Since the multiplexing is done dynamically, it is very easy to tune the sampling weights.
+
+Lhotse dataloading supports the following types of inputs:
+
+* NeMo manifests
+    Regular NeMo JSON manifests.
+* NeMo tarred data
+    Tarred NeMo JSON manifests + audio tar files; we also support combining multiple NeMo
+    tarred data sources (e.g., multiple buckets of NeMo data or multiple datasets) via dynamic multiplexing.
+
+    We support using a subset of Tarred NeMo JSON manifests along with audio tar files without disrupting the alignment between the tarred files and their corresponding manifests.
+    This feature is essential because large datasets often consist of numerous tar files and multiple versions of Tarred NeMo JSON manifest subsets, which may contain only a portion of the audio files due to filtering for various reasons.
+    To skip specific entries in the manifests without repeatedly copying and retarring audio files, the entries must include a ``_skipme`` key. This key should be set to ``True``, ``1``, or a reason for skipping (e.g., ``low character-rate``).
+
+* Lhotse CutSet manifests
+    Regular Lhotse CutSet manifests (typically gzipped JSONL).
+    See `Lhotse Cuts documentation`_ to learn more about Lhotse data formats.
+* Lhotse Shar data
+    Lhotse Shar is a data format that also uses tar files for sequential data loading,
+    but is designed to be modular (i.e., easily extensible with new data sources and new feature fields).
+    More details can be found here: |tutorial_shar|
+
+.. caution:: At present, Lhotse is supported in most ASR model configurations; we aim to gradually extend this support to other speech tasks.
+
+.. _Lhotse: https://github.com/lhotse-speech/lhotse
+.. _Lhotse Cuts documentation: https://lhotse.readthedocs.io/en/latest/cuts.html
+.. |tutorial_shar| image:: https://colab.research.google.com/assets/colab-badge.svg
+    :target: https://colab.research.google.com/github/lhotse-speech/lhotse/blob/master/examples/04-lhotse-shar.ipynb
+
+Enabling Lhotse via configuration
+----------------------------------
+
+.. note:: Using Lhotse with tarred datasets will make the dataloader infinite, ditching the notion of an "epoch". "Epoch" may still be logged in W&B/TensorBoard, but it will correspond to the number of executed training loops between validation loops.
+
+Start with an existing NeMo experiment YAML configuration. 
Typically, you'll only need to add a few options to enable Lhotse.
+These options are::
+
+    # NeMo generic dataloading arguments
+    model.train_ds.manifest_filepath=...
+    model.train_ds.tarred_audio_filepaths=... # for tarred datasets only
+    model.train_ds.num_workers=4
+    model.train_ds.min_duration=0.3 # optional
+    model.train_ds.max_duration=30.0 # optional
+    model.train_ds.shuffle=true # optional
+
+    # Lhotse dataloading related arguments
+    ++model.train_ds.use_lhotse=True
+    ++model.train_ds.batch_duration=1100
+    ++model.train_ds.quadratic_duration=30
+    ++model.train_ds.num_buckets=30
+    ++model.train_ds.num_cuts_for_bins_estimate=10000
+    ++model.train_ds.bucket_buffer_size=10000
+    ++model.train_ds.shuffle_buffer_size=10000
+
+    # PyTorch Lightning related arguments
+    ++trainer.use_distributed_sampler=false
+    ++trainer.limit_train_batches=1000
+    trainer.val_check_interval=1000
+    trainer.max_steps=300000
+
+.. note:: The default values above are a reasonable starting point for a hybrid RNN-T + CTC ASR model on a 32GB GPU with a data distribution dominated by 15s long utterances.
+
+Let's briefly go over each of the Lhotse dataloading arguments:
+
+* ``use_lhotse`` enables Lhotse dataloading.
+* ``batch_duration`` is the total max duration of utterances in a mini-batch and controls the batch size; the shorter the utterances, the bigger the batch size, and vice versa.
+* ``quadratic_duration`` adds a quadratically growing penalty for long utterances; useful for bucketing and transformer-type models. Utterances of the duration set here will count as if they were twice as long.
+* ``num_buckets`` is the number of buckets in the bucketing sampler. A bigger value means less padding but also less randomization.
+* ``num_cuts_for_bins_estimate`` is the number of utterances we will sample before the start of the training to estimate the duration bins for buckets.
A larger number results in a more accurate estimation but also a bigger lag before starting the training.
+* ``bucket_buffer_size`` is the number of utterances (data and metadata) we will hold in memory to be distributed between buckets. With a bigger ``batch_duration``, this number may need to be increased for the dynamic bucketing sampler to work properly (typically it will emit a warning if this is too low).
+* ``shuffle_buffer_size`` is an extra number of utterances we will hold in memory to perform approximate shuffling (via reservoir-like sampling). A bigger number means more memory usage but also better randomness.
+
+The PyTorch Lightning ``trainer`` related arguments:
+
+* ``use_distributed_sampler=false`` is required because Lhotse has its own handling of distributed sampling.
+* ``val_check_interval``/``limit_train_batches``
+    These are required for dataloaders with tarred/Shar datasets
+    because Lhotse makes the dataloader infinite, so we'd never go past epoch 0. This approach guarantees
+    we will never hang the training because the dataloader in some node has fewer mini-batches than the others
+    in some epochs. The value provided here will be the effective length of each "pseudo-epoch" after which we'll
+    trigger the validation loop.
+* ``max_steps`` is the total number of steps we expect to be training for. It is required for the same reason as ``limit_train_batches``; since we'd never go past epoch 0, the training would have never finished.
+
+Some other Lhotse-related arguments we support:
+
+* ``cuts_path`` can be provided to read data from a Lhotse CutSet manifest instead of a NeMo manifest.
+    Specifying this option will result in ``manifest_filepaths`` and ``tarred_audio_filepaths`` being ignored.
+* ``shar_path``
+    Can be provided to read data from a Lhotse Shar manifest instead of a NeMo manifest.
+    Specifying this option will result in ``manifest_filepaths`` and ``tarred_audio_filepaths`` being ignored.
+  This argument can be a string (single Shar directory), a list of strings (Shar directories),
+  or a list of 2-item lists, where the first item is a Shar directory path, and the other is a sampling weight.
+  The user can also provide a dict mapping Lhotse Shar fields to a list of shard paths with data for that field.
+  For details about the Lhotse Shar format, see: |tutorial_shar|
+* ``bucket_duration_bins``
+  Duration bins are a list of float values (seconds) that, when provided, will skip the initial bucket bin estimation
+  and save some time. The list must have a length of ``num_buckets - 1``. An optimal value can be obtained by running the CLI:
+  ``lhotse cut estimate-bucket-bins -b $num_buckets my-cuts.jsonl.gz``
+* ``use_bucketing`` is a boolean which indicates if we want to enable/disable dynamic bucketing. By default it's enabled.
+* ``text_field`` is the name of the key in the JSON (NeMo) manifest from which we should be reading text (default="text").
+* ``lang_field`` is the name of the key in the JSON (NeMo) manifest from which we should be reading the language tag (default="lang"). This is useful when working e.g. with ``AggregateTokenizer``.
+* ``batch_size``
+  Limits the number of examples in a mini-batch to this number, when combined with ``batch_duration``.
+  When ``batch_duration`` is not set, it acts as a static batch size.
+* ``seed`` sets a random seed for the shuffle buffer.
+
+The full and always up-to-date list of supported options can be found in the ``LhotseDataLoadingConfig`` class.
+
+.. _asr-dataset-config-format:
+
+Extended multi-dataset configuration format
+--------------------------------------------
+
+Combining a large number of datasets and defining weights for them can be tricky.
+We offer an extended configuration format that allows you to explicitly define datasets,
+dataset groups, and their weights either inline in the experiment configuration,
+or as a path to a separate YAML file.
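For intuition, the weights in this format are relative sampling weights, and when datasets are nested inside groups (as in the examples below), the effective weight of a dataset is the product of the group weight and the dataset weight. A minimal Python sketch of this bookkeeping, using the weights from the grouped example below (illustrative only; the actual weighted sampling is handled internally by Lhotse):

```python
# Illustrative only: Lhotse performs the actual weighted sampling internally.
# Effective weight of a nested dataset = group weight * within-group weight.
groups = {
    "asr": {"weight": 0.7, "datasets": {"asr1": 0.6, "asr2": 0.4}},
    "ast": {"weight": 0.3, "datasets": {"ast1": 0.2, "ast2": 0.8}},
}
effective = {
    name: group["weight"] * w
    for group in groups.values()
    for name, w in group["datasets"].items()
}
# Normalizing yields the probability of drawing the next example from each dataset;
# weights don't need to sum to 1 in the config.
total = sum(effective.values())
probs = {name: w / total for name, w in effective.items()}
print(probs)  # e.g. asr1 ends up with ~0.42, ast1 with ~0.06
```

Note that because the weights are relative, scaling all of them by the same constant leaves the sampling distribution unchanged.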
+
+In addition to the features above, this format introduces a special ``tags`` dict-like field.
+The keys and values in ``tags`` are automatically attached to every sampled example, which
+is very useful when combining multiple datasets with different properties.
+The dataset class which converts these examples to tensors can partition the mini-batch and apply
+different processing to each group.
+For example, you may want to construct different prompts for the model using metadata in ``tags``.
+
+.. note:: When fine-tuning a model that was trained with the ``input_cfg`` option, typically you'd only need
+   to override the following options: ``input_cfg=null`` and ``manifest_filepath=path/to/manifest.json``.
+
+Example 1. Combine two datasets with custom weights and attach custom metadata in ``tags`` to each cut:
+
+.. code-block:: yaml
+
+    input_cfg:
+      - type: nemo_tarred
+        manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
+        tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
+        weight: 0.4
+        tags:
+          lang: en
+          pnc: "no"
+      - type: nemo_tarred
+        manifest_filepath: /path/to/other/manifest__OP_0..512_CL_.json
+        tarred_audio_filepath: /path/to/other/tarred_audio/audio__OP_0..512_CL_.tar
+        weight: 0.6
+        tags:
+          lang: pl
+          pnc: "yes"
+
+Example 2. Combine four datasets corresponding to different tasks (ASR, AST).
+Each task gets its own group and its own weight.
+Then within each task, each dataset gets its own within-group weight as well.
+The final weight is the product of the outer and inner weights:
+
+..
code-block:: yaml
+
+    input_cfg:
+      - type: group
+        weight: 0.7
+        tags:
+          task: asr
+        input_cfg:
+          - type: nemo_tarred
+            manifest_filepath: /path/to/asr1/manifest__OP_0..512_CL_.json
+            tarred_audio_filepath: /path/to/tarred_audio/asr1/audio__OP_0..512_CL_.tar
+            weight: 0.6
+            tags:
+              source_lang: en
+              target_lang: en
+          - type: nemo_tarred
+            manifest_filepath: /path/to/asr2/manifest__OP_0..512_CL_.json
+            tarred_audio_filepath: /path/to/asr2/tarred_audio/audio__OP_0..512_CL_.tar
+            weight: 0.4
+            tags:
+              source_lang: pl
+              target_lang: pl
+      - type: group
+        weight: 0.3
+        tags:
+          task: ast
+        input_cfg:
+          - type: nemo_tarred
+            manifest_filepath: /path/to/ast1/manifest__OP_0..512_CL_.json
+            tarred_audio_filepath: /path/to/ast1/tarred_audio/audio__OP_0..512_CL_.tar
+            weight: 0.2
+            tags:
+              source_lang: en
+              target_lang: pl
+          - type: nemo_tarred
+            manifest_filepath: /path/to/ast2/manifest__OP_0..512_CL_.json
+            tarred_audio_filepath: /path/to/ast2/tarred_audio/audio__OP_0..512_CL_.tar
+            weight: 0.8
+            tags:
+              source_lang: pl
+              target_lang: en
+
+Configuring multimodal dataloading
+-----------------------------------
+
+Our configuration format supports specifying data sources from modalities other than audio.
+At this time, audio and text modalities are supported. We provide the following parser types:
+
+**Raw text files.** Simple text files where each line is an individual text example. This can represent standard language modeling data.
+This parser is registered under ``type: txt``.
+
+Data format examples::
+
+    # file: document_0.txt
+    This is a language modeling example.
+    Wall Street is expecting major news tomorrow.
+
+    # file: document_1.txt
+    Invisible bats have stormed the city.
+    What an incredible event!
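Conceptually, the ``txt`` parser turns every line of such a file into one standalone text example. A rough sketch of the idea (the ``Example`` class and ``parse_txt_lines`` helper are hypothetical illustrations, not the actual NeMo implementation, which additionally handles sharding and lazy iteration):

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical illustration of the ``txt`` parser concept (not NeMo's code):
# every non-empty line of a text file becomes one example.
@dataclass
class Example:
    text: str
    language: Optional[str] = None

def parse_txt_lines(lines: List[str], language: Optional[str] = None) -> List[Example]:
    return [Example(text=line.strip(), language=language) for line in lines if line.strip()]

examples = parse_txt_lines(
    [
        "This is a language modeling example.\n",
        "Wall Street is expecting major news tomorrow.\n",
    ],
    language="en",
)
print(len(examples))  # 2
```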
+ +Dataloading configuration example:: + + input_cfg: + - type: txt + paths: /path/to/document_{0..1}.txt + language: en # optional + +Python object example:: + + from nemo.collections.common.data.lhotse.text_adapters import TextExample + + example = TextExample( + text="This is a language modeling example.", + language="en", # optional + ) + +Python dataloader instantiation example:: + + from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config + + dl = get_lhotse_dataloader_from_config({ + "input_cfg": [ + {"type": "txt", "paths": "/path/to/document_{0..1}.txt", "language": "en"}, + ], + "use_multimodal_dataloading": True, + "batch_size": 4, + }, + global_rank=0, + world_size=1, + dataset=MyDatasetClass(), # converts CutSet -> dict[str, Tensor] + tokenizer=my_tokenizer, + ) + +**Raw text file pairs.** Pairs of raw text files with corresponding lines. This can represent machine translation data. +This parser is registered under ``type: txt_pair``. + +Data format examples:: + + # file: document_en_0.txt + This is a machine translation example. + Wall Street is expecting major news tomorrow. + + # file: document_pl_0.txt + To jest przykład tłumaczenia maszynowego. + Wall Street spodziewa się jutro ważnych wiadomości. 
+
+Dataloading configuration example::
+
+    input_cfg:
+      - type: txt_pair
+        source_path: /path/to/document_en_{0..N}.txt
+        target_path: /path/to/document_pl_{0..N}.txt
+        source_language: en  # optional
+        target_language: pl  # optional
+
+Python object example::
+
+    from nemo.collections.common.data.lhotse.text_adapters import SourceTargetTextExample, TextExample
+
+    example = SourceTargetTextExample(
+        source=TextExample(
+            text="This is a machine translation example.",
+            language="en",  # optional
+        ),
+        target=TextExample(
+            text="To jest przykład tłumaczenia maszynowego.",
+            language="pl",  # optional
+        ),
+    )
+
+Python dataloader instantiation example::
+
+    from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config
+
+    dl = get_lhotse_dataloader_from_config({
+            "input_cfg": [
+                {
+                    "type": "txt_pair",
+                    "source_path": "/path/to/document_en_{0..N}.txt",
+                    "target_path": "/path/to/document_pl_{0..N}.txt",
+                    "source_language": "en",
+                    "target_language": "pl",
+                },
+            ],
+            "use_multimodal_dataloading": True,
+            "prompt_format": "t5nmt",
+            "batch_size": 4,
+        },
+        global_rank=0,
+        world_size=1,
+        dataset=MyDatasetClass(),  # converts CutSet -> dict[str, Tensor]
+        tokenizer=my_tokenizer,
+    )
+
+**NeMo multimodal conversations.** A JSON-Lines (JSONL) file that defines multi-turn conversations with mixed text and audio turns.
+This parser is registered under ``type: multimodal_conversation``.
+
+Data format examples::
+
+    # file: chat_0.jsonl
+    {"id": "conv-0", "conversations": [{"from": "user", "value": "speak to me", "type": "text"}, {"from": "assistant", "value": "/path/to/audio.wav", "duration": 17.1, "type": "audio"}]}
+    {"id": "conv-1", "conversations": [{"from": "user", "value": "speak to me", "type": "text"}, {"from": "assistant", "value": "/path/to/audio.wav", "duration": 5, "offset": 17.1, "type": "audio"}]}
+
+Dataloading configuration example::
+
+    token_equivalent_duration: 0.08
+    input_cfg:
+      - type: multimodal_conversation
+        manifest_filepath: /path/to/chat_{0..N}.jsonl
+        audio_locator_tag: "[audio]"
+
+Python object example::
+
+    from lhotse import Recording
+    from nemo.collections.common.data.lhotse.text_adapters import NeMoMultimodalConversation, TextTurn, AudioTurn
+
+    conversation = NeMoMultimodalConversation(
+        id="conv-0",
+        turns=[
+            TextTurn(value="speak to me", role="user"),
+            AudioTurn(cut=Recording.from_file("/path/to/audio.wav").to_cut(), role="assistant", audio_locator_tag="[audio]"),
+        ],
+        token_equivalent_duration=0.08,  # this value will be auto-inserted by the dataloader
+    )
+
+Python dataloader instantiation example::
+
+    from nemo.collections.common.data.lhotse.dataloader import get_lhotse_dataloader_from_config
+
+    dl = get_lhotse_dataloader_from_config({
+            "input_cfg": [
+                {
+                    "type": "multimodal_conversation",
+                    "manifest_filepath": "/path/to/chat_{0..N}.jsonl",
+                    "audio_locator_tag": "[audio]",
+                },
+            ],
+            "use_multimodal_dataloading": True,
+            "token_equivalent_duration": 0.08,
+            "prompt_format": "llama2",
+            "batch_size": 4,
+        },
+        global_rank=0,
+        world_size=1,
+        dataset=MyDatasetClass(),  # converts CutSet -> dict[str, Tensor]
+        tokenizer=my_tokenizer,
+    )
+
+**Dataloading and bucketing of text and multimodal data.** When dataloading text or multimodal data, pay attention to the following config options (we provide example values for convenience):
+
+* ``use_multimodal_sampling: true`` tells Lhotse to switch from
measuring audio duration to measuring token counts; required for text.
+
+* ``prompt_format: "prompt-name"`` will apply the specified PromptFormatter during data sampling to accurately reflect each example's token counts.
+
+* ``measure_total_length: true`` customizes length measurement for decoder-only and encoder-decoder models. Decoder-only models consume a linear sequence of context + answer, so we should measure the total length (``true``). On the other hand, encoder-decoder models deal with two different sequence lengths: the input (context) sequence length for the encoder, and the output (answer) sequence length for the decoder. For such models, set this to ``false``.
+
+* ``min_tokens: 1``/``max_tokens: 4096`` filter examples based on their token count (after applying the prompt format).
+
+* ``min_tpt: 0.1``/``max_tpt: 10`` filter examples based on their output-token-per-input-token ratio. For example, ``max_tpt: 10`` means we'll filter every example that has more than 10 output tokens per 1 input token. Very useful for removing sequence length outliers that lead to OOM. Use ``estimate_token_bins.py`` to view token count distributions for calibrating this value.
+
+* (multimodal-only) ``token_equivalent_duration: 0.08`` makes it possible to measure the length of audio examples in "tokens". For example, if we're using fbank features with a 0.01s frame shift and an acoustic model with a subsampling factor of 8, then a reasonable setting could be 0.08 (which means every subsampled frame counts as one token). Calibrate this value to fit your needs.
+
+**Text/multimodal bucketing and OOMptimizer.** Analogous to bucketing for audio data, we provide two scripts to support efficient bucketing:
+
+* ``scripts/speech_llm/estimate_token_bins.py`` which estimates 1D or 2D buckets based on the input config, tokenizer, and prompt format. It also estimates the input/output token count distribution and suggested ``max_tpt`` (token-per-token) filtering values.
+
+* (experimental) ``scripts/speech_llm/oomptimizer.py`` which works with SALM/BESTOW GPT/T5 models and estimates the optimal ``bucket_batch_size`` for a given model config and bucket bins value. Given the complexity of Speech LLMs, some configurations may not be supported yet at the time of writing (e.g., model parallelism).
+
+To enable bucketing, set ``batch_size: null`` and use the following options:
+
+* ``use_bucketing: true``
+
+* ``bucket_duration_bins`` - the output of ``estimate_token_bins.py``. If ``null``, it will be estimated at the start of training at the cost of some run time (not recommended).
+
+* (oomptimizer-only) ``bucket_batch_size`` - the output of OOMptimizer.
+
+* (non-oomptimizer-only) ``batch_tokens`` is the maximum number of tokens we want to find inside a mini-batch. Similarly to ``batch_duration``, this number considers padding tokens too; therefore, enabling bucketing is recommended to maximize the ratio of real vs padding tokens. Note that it's just a heuristic for determining the optimal batch sizes for different buckets, and may be less efficient than using OOMptimizer.
+
+* (non-oomptimizer-only) ``quadratic_factor`` is a quadratic penalty to equalize the GPU memory usage between buckets of short and long sequence lengths for models with quadratic memory usage. It is only a heuristic and may not be as efficient as using OOMptimizer.
+
+**Joint dataloading of text/audio/multimodal data.** The key strength of this approach is that we can easily combine audio datasets and text datasets,
+and benefit from every other technique we described in this doc, such as dynamic data mixing, data weighting, dynamic bucketing, and so on.
+
+This approach is described in the `EMMeTT`_ paper. There's also a notebook tutorial called Multimodal Lhotse Dataloading.
We construct a separate sampler (with its own batching settings) for each modality,
+and specify how the samplers should be fused together via the option ``sampler_fusion``:
+
+* ``sampler_fusion: "round_robin"`` will iterate a single sampler per step, taking turns. For example: step 0 - audio batch, step 1 - text batch, step 2 - audio batch, etc.
+
+* ``sampler_fusion: "randomized_round_robin"`` is similar, but at each step chooses a sampler randomly using ``sampler_weights: [w0, w1]`` (weights can be unnormalized).
+
+* ``sampler_fusion: "zip"`` will draw a mini-batch from each sampler at every step, and merge them into a single ``CutSet``. This approach combines well with multimodal gradient accumulation (run forward+backward for one modality, then the other, then the update step).
+
+.. _EMMeTT: https://arxiv.org/abs/2409.13523
+
+Example. Combine an ASR (audio-text) dataset with an MT (text-only) dataset so that mini-batches have some examples from both datasets:
+
+.. code-block:: yaml
+
+    model:
+      ...
+      train_ds:
+        multi_config: true
+        sampler_fusion: zip
+        shuffle: true
+        num_workers: 4
+
+        audio:
+          prompt_format: t5nmt
+          use_bucketing: true
+          min_duration: 0.5
+          max_duration: 30.0
+          max_tps: 12.0
+          bucket_duration_bins: [[3.16, 10], [3.16, 22], [5.18, 15], ...]
+          bucket_batch_size: [1024, 768, 832, ...]
+          input_cfg:
+            - type: nemo_tarred
+              manifest_filepath: /path/to/manifest__OP_0..512_CL_.json
+              tarred_audio_filepath: /path/to/tarred_audio/audio__OP_0..512_CL_.tar
+              weight: 0.5
+              tags:
+                context: "Translate the following to English"
+
+        text:
+          prompt_format: t5nmt
+          use_multimodal_sampling: true
+          min_tokens: 1
+          max_tokens: 256
+          min_tpt: 0.333
+          max_tpt: 3.0
+          measure_total_length: false
+          use_bucketing: true
+          bucket_duration_bins: [[10, 4], [10, 26], [15, 10], ...]
+          bucket_batch_size: [512, 128, 192, ...]
+          input_cfg:
+            - type: txt_pair
+              source_path: /path/to/en__OP_0..512_CL_.txt
+              target_path: /path/to/pl__OP_0..512_CL_.txt
+              source_language: en
+              target_language: pl
+              weight: 0.5
+              tags:
+                question: "Translate the following to Polish"
+
+.. caution:: We strongly recommend using multiple shards for text files as well, so that different nodes and dataloading workers are able to randomize the order of text iteration. Otherwise, multi-GPU training has a high risk of duplicating text examples.
+
+Pre-computing bucket duration bins
+------------------------------------
+
+We recommend pre-computing the bucket duration bins in order to accelerate the start of the training -- otherwise, the dynamic bucketing sampler will have to spend some time estimating them before the training starts.
+The following script may be used:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 manifest.json
+
+    # The script's output:
+    Use the following options in your config:
+    num_buckets=30
+    bucket_duration_bins=[1.78,2.34,2.69,...
+
+
+For multi-dataset setups, one may provide a dataset config directly:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 input_cfg.yaml
+
+    # The script's output:
+    Use the following options in your config:
+    num_buckets=30
+    bucket_duration_bins=[1.91,3.02,3.56,...
+
+
+It's also possible to manually specify the list of data manifests (optionally together with weights):
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins.py -b 30 [[manifest.json,0.7],[other.json,0.3]]
+
+    # The script's output:
+    Use the following options in your config:
+    num_buckets=30
+    bucket_duration_bins=[1.91,3.02,3.56,...
+
+
+2D bucketing
+-------------
+
+To achieve maximum training efficiency for some classes of models, it is necessary to stratify the sampling
+both on the input sequence lengths and the output sequence lengths.
+One such example is attention encoder-decoder models, where the overall GPU memory usage can be factorized
+into two main components: input-sequence-length bound (encoder activations) and output-sequence-length bound
+(decoder activations).
+Classical bucketing techniques only stratify on the input sequence length (e.g. duration in speech),
+which utilizes the encoder effectively but leads to excessive padding on the decoder's side.
+
+To address this, we support a 2D bucketing technique which estimates the buckets in two stages.
+The first stage is identical to 1D bucketing, i.e. we determine the input-sequence bucket bins so that
+every bin holds roughly an equal duration of audio.
+In the second stage, we use a tokenizer and optionally a prompt formatter (for prompted models) to
+estimate the total number of tokens in each duration bin, and sub-divide it into several sub-buckets,
+where each sub-bucket again holds roughly an equal number of tokens.
+
+To run 2D bucketing with 30 buckets sub-divided into 5 sub-buckets each (150 buckets total), use the following script:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins_2d.py \
+        --tokenizer path/to/tokenizer.model \
+        --buckets 30 \
+        --sub-buckets 5 \
+        input_cfg.yaml
+
+    # The script's output:
+    Use the following options in your config:
+    use_bucketing=1
+    num_buckets=30
+    bucket_duration_bins=[[1.91,10],[1.91,17],[1.91,25],...
+    The max_tps setting below is optional, use it if your data has low quality long transcript outliers:
+    max_tps=[13.2,13.2,11.8,11.8,...]
+
+Note that the output in ``bucket_duration_bins`` is a nested list, where every bin specifies
+the maximum duration and the maximum number of tokens that go into the bucket.
+Passing this option to the Lhotse dataloader will automatically enable 2D bucketing.
+
+Note the presence of the ``max_tps`` (token-per-second) option.
+It is optional to include it in the dataloader configuration: if you do, we will apply an extra filter
+that discards examples which have more tokens per second than the threshold value.
+The threshold is determined for each bucket separately based on the data distribution, and can be controlled
+with the option ``--token_outlier_threshold``.
+This filtering is useful primarily for noisy datasets to discard low quality examples / outliers.
+
+We also support aggregate tokenizers for 2D bucketing estimation:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins_2d.py \
+        --tokenizer path/to/en/tokenizer.model path/to/pl/tokenizer1.model \
+        --langs en pl \
+        --buckets 30 \
+        --sub-buckets 5 \
+        input_cfg.yaml
+
+To estimate 2D buckets for a prompted model such as Canary-1B, provide the prompt format name and an example prompt.
+For Canary-1B, we'll also provide the special tokens tokenizer. Example:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/estimate_duration_bins_2d.py \
+        --prompt-format canary \
+        --prompt "[{'role':'user','slots':{'source_lang':'en','target_lang':'de','task':'ast','pnc':'yes'}}]" \
+        --tokenizer path/to/spl_tokens/tokenizer.model path/to/en/tokenizer.model path/to/de/tokenizer1.model \
+        --langs spl_tokens en de \
+        --buckets 30 \
+        --sub-buckets 5 \
+        input_cfg.yaml
+
+OOMptimizer
+------------
+
+The default approach of specifying a ``batch_duration``, ``bucket_duration_bins`` and ``quadratic_duration``
+is quite flexible, but not maximally efficient. We observed that in practice it often leads to under-utilization
+of GPU memory and compute for most buckets (especially those with shorter durations).
+While it is impossible to estimate GPU memory usage up-front, we can determine it empirically with a bit of search.
+
+OOMptimizer is an approach that, given a NeMo model, optimizer, and a list of buckets (1D or 2D),
+estimates the maximum possible batch size to use for each bucket.
+It performs a binary search over batch sizes that succeed or lead to CUDA OOM until convergence.
+We find that the resulting bucketing batch size profiles enable full GPU utilization in training,
+while it only takes a couple of minutes to complete the search.
+
+In order to run OOMptimizer, you only need the bucketing bins (from the previous sections) and a model configuration:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/oomptimizer.py \
+        --config-path fast-conformer_aed.yaml \
+        --module-name nemo.collections.asr.models.EncDecMultiTaskModel \
+        --buckets '[[3.975,30],[3.975,48],[4.97,37],...]'
+
+    # The script's output:
+
+    The final profile is:
+    bucket_duration_bins=[[3.975,30],[3.975,48],...]
+    bucket_batch_size=[352,308,280,...]
+    max_tps=12.0
+    max_duration=40.0
+
+Use the resulting options in your training configuration (typically under the namespace ``model.train_ds``) to apply the profile.
+
+It's also possible to run OOMptimizer using a pretrained model's name and bucket bins corresponding
+to your fine-tuning data:
+
+.. code-block:: bash
+
+    $ python scripts/speech_recognition/oomptimizer.py \
+        --pretrained-name nvidia/canary-1b \
+        --buckets '[2.0,3.1,5.6,6.6,...]'
+
+Note that your training script can perform some additional actions using GPU RAM that cannot be anticipated by the OOMptimizer.
+By default, we let the script use up to 90% of the GPU's RAM for this estimation to account for that.
+In the unlikely case you run into an OutOfMemoryError during training, you can try re-estimating the profile with the option ``--memory-fraction 0.75`` (or another value) that will further cap OOMptimizer's available GPU RAM.
+
+Seeds and randomness
+---------------------
+
+In the Lhotse dataloading configuration, we have two parameters controlling randomness: ``seed`` and ``shard_seed``.
+Both of them can either be set to a fixed number, or to one of two string options: ``"randomized"`` and ``"trng"``.
+Their roles are:
+
+* ``seed`` is the base random seed, and is one of several factors used to initialize various RNGs participating in dataloading.
+
+* ``shard_seed`` controls the shard randomization strategy in distributed data parallel setups when using sharded tarred datasets.
+
+Below are the typical examples of configuration with an explanation of the expected outcome.
+
+Case 1 (default): ``seed=`` and ``shard_seed="trng"``:
+
+* The ``trng`` setting discards ``seed`` and causes the actual random seed to be drawn using the OS's true RNG. Each node/GPU/dataloading worker draws its own unique random seed when it first needs it.
+
+* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
+
+* On each training script run, the order of dataloader examples is **different**.
+
+* Since the random seed is unpredictable, the exact dataloading order is not replicable.
+
+Case 2: ``seed=`` and ``shard_seed="randomized"``:
+
+* The ``randomized`` setting uses ``seed`` along with the DDP ``rank`` and the dataloading ``worker_id`` to set a unique but deterministic random seed in each dataloading process across all GPUs.
+
+* Each node/GPU/dataloading worker yields data in a different order (no mini-batch duplication).
+
+* On each training script run, the order of dataloader examples is **identical** as long as ``seed`` is the same.
+
+* This setup guarantees 100% dataloading reproducibility.
+
+* Resuming training without changing the ``seed`` value will cause the model to train on data it has already seen. For large data setups, not managing the ``seed`` may cause the model to never be trained on a majority of the data. This is why this mode is not the default.
+
+* If you're combining DDP with model parallelism techniques (Tensor Parallel, Pipeline Parallel, etc.), you need to use ``shard_seed="randomized"``. Using ``"trng"`` will cause different model parallel ranks to desynchronize and cause a deadlock.
+ +* Generally the seed can be managed by the user by providing a different value each time the training script is launched. For example, for most models the option to override would be ``model.train_ds.seed=``. If you're launching multiple tasks queued one after another on a grid system, you can generate a different random seed for each task, e.g. on most Unix systems ``RSEED=$(od -An -N4 -tu4 < /dev/urandom | tr -d ' ')`` would generate a random uint32 number that can be provided as the seed. + +Other, more exotic configurations: + +* With ``shard_seed=``, all dataloading workers will yield the same results. This is only useful for unit testing and maybe debugging. + +* With ``seed="trng"``, the base random seed itself will be drawn using a TRNG. It will be different on each GPU training process. This setting is not recommended. + +* With ``seed="randomized"``, the base random seed is set to Python's global RNG seed. It might be different on each GPU training process. This setting is not recommended. diff --git a/docs/source/tools/nemo_forced_aligner.rst b/docs/source/tools/nemo_forced_aligner.rst index 1eb79825a5d0..d8f89c70447f 100644 --- a/docs/source/tools/nemo_forced_aligner.rst +++ b/docs/source/tools/nemo_forced_aligner.rst @@ -6,7 +6,7 @@ NFA is hosted here: https://github.com/NVIDIA/NeMo/tree/main/tools/nemo_forced_a NFA is a tool for generating token-, word- and segment-level timestamps of speech in audio using NeMo's CTC-based Automatic Speech Recognition models. You can provide your own reference text, or use ASR-generated transcription. -You can use NeMo's ASR Model checkpoints out of the box in :ref:`14+ languages `, or train your own model. +You can use NeMo's ASR Model checkpoints out of the box in 14+ languages (see :doc:`ASR Model Checkpoints `), or train your own model. NFA can be used on long audio files of 1+ hours duration (subject to your hardware and the ASR model used). Demos & Tutorials