339 changes: 339 additions & 0 deletions docs/source/asr/asr_checkpoints.rst
@@ -0,0 +1,339 @@
.. _asr-checkpoints-list:

=======================
ASR Model Checkpoints
=======================

This page lists all supported ASR model checkpoints released by NVIDIA NeMo.
Benchmark scores for each model are published on its Hugging Face model card, reachable from the `NVIDIA organization page <https://huggingface.co/nvidia>`__.

Glossary
--------

.. list-table::
:header-rows: 1

* - Term
- Definition
* - **ASR**
- Automatic Speech Recognition — transcribing speech to text
* - **AST**
- Automatic Speech Translation — translating speech to text from one language to another
* - **AED**
- Attention Encoder-Decoder — autoregressive decoder using cross-attention (Canary family)
* - **CTC**
- Connectionist Temporal Classification — non-autoregressive decoder
* - **RNN-T**
- Recurrent Neural Network Transducer — autoregressive streaming-friendly decoder
* - **TDT**
- Token-and-Duration Transducer — extends RNN-T with duration prediction for faster inference
* - **Hybrid**
- Joint RNN-T + CTC model — both decoders trained together, either usable at inference
* - **PnC**
- Punctuation and Capitalization in the output
* - **Streaming**
- Real-time / cache-aware inference capability
* - **EU4**
- Multilingual: English, German, Spanish, French
* - **EU25**
- Multilingual: 25 European languages (de, en, es, fr, it, pl, pt, nl, ru, uk, hr, cs, bg, da, et, fi, el, hu, lv, lt, mt, ro, sk, sl, sv)
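
The language columns on this page use two-letter ISO 639-1 codes. As a reading aid, the sketch below expands those codes into English language names. ``LANGUAGE_NAMES`` and ``expand_codes`` are hypothetical helpers written for illustration, not part of NeMo, and the mapping only covers codes that appear in the tables on this page.

```python
# Hypothetical reading aid (not a NeMo API): expand the ISO 639-1 codes
# used in the tables on this page into English language names.
LANGUAGE_NAMES = {
    "ar": "Arabic", "be": "Belarusian", "bg": "Bulgarian", "cs": "Czech",
    "da": "Danish", "de": "German", "el": "Greek", "en": "English",
    "es": "Spanish", "et": "Estonian", "fa": "Persian", "fi": "Finnish",
    "fr": "French", "hr": "Croatian", "hu": "Hungarian", "hy": "Armenian",
    "it": "Italian", "ja": "Japanese", "ka": "Georgian", "kk": "Kazakh",
    "lt": "Lithuanian", "lv": "Latvian", "mt": "Maltese", "nl": "Dutch",
    "pl": "Polish", "pt": "Portuguese", "ro": "Romanian", "ru": "Russian",
    "sk": "Slovak", "sl": "Slovenian", "sv": "Swedish", "uk": "Ukrainian",
    "uz": "Uzbek", "vi": "Vietnamese",
}


def expand_codes(codes):
    """Turn a comma-separated cell such as "kk, ru" into a list of names."""
    names = []
    for code in codes.split(","):
        code = code.strip().lower()
        # Unknown codes are passed through unchanged rather than guessed.
        names.append(LANGUAGE_NAMES.get(code, code))
    return names
```

For example, ``expand_codes("kk, ru")`` returns ``["Kazakh", "Russian"]``.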


Canary Models (AED)
-------------------

Multi-task encoder-decoder models supporting ASR, AST, and PnC across multiple languages; most checkpoints also produce timestamps.

.. list-table::
:header-rows: 1

* - Model

iirc some of these didn't really prioritize PnC no?

- Decoder
- Capabilities
- Languages
* - `canary-1b-v2 <https://huggingface.co/nvidia/canary-1b-v2>`__
- AED
- ASR, AST, PnC, timestamps
- EU25
* - `canary-qwen-2.5b <https://huggingface.co/nvidia/canary-qwen-2.5b>`__
- AED
- ASR, AST, PnC, timestamps
- EU25
* - `canary-1b-flash <https://huggingface.co/nvidia/canary-1b-flash>`__
- AED
- ASR, AST, PnC, timestamps, fast
- EU4
* - `canary-180m-flash <https://huggingface.co/nvidia/canary-180m-flash>`__
- AED
- ASR, AST, PnC, timestamps, fast
- EU4
* - `canary-1b <https://huggingface.co/nvidia/canary-1b>`__
- AED
- ASR, AST, PnC
- EU4


Parakeet Models (English)
--------------------------

High-accuracy ASR models with a FastConformer encoder. Most are English; the Capabilities column names the language for the Japanese and Danish checkpoints.

.. list-table::
:header-rows: 1

* - Model
- Decoder
- Capabilities
- Size
* - `parakeet-tdt-0.6b-v3 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3>`__
- TDT
- ASR, PnC, timestamps
- 0.6B
* - `parakeet-tdt-0.6b-v2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`__
- TDT
- ASR, PnC, timestamps
- 0.6B
* - `parakeet-tdt-1.1b <https://huggingface.co/nvidia/parakeet-tdt-1.1b>`__
- TDT
- ASR, timestamps
- 1.1B
* - `parakeet-tdt_ctc-1.1b <https://huggingface.co/nvidia/parakeet-tdt_ctc-1.1b>`__
- Hybrid TDT+CTC
- ASR, timestamps
- 1.1B
* - `parakeet-tdt_ctc-0.6b-ja <https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja>`__
- Hybrid TDT+CTC
- ASR, timestamps, Japanese
- 0.6B
* - `parakeet-tdt_ctc-110m <https://huggingface.co/nvidia/parakeet-tdt_ctc-110m>`__
- Hybrid TDT+CTC
- ASR, timestamps
- 110M
* - `parakeet-rnnt-1.1b <https://huggingface.co/nvidia/parakeet-rnnt-1.1b>`__
- RNN-T
- ASR, timestamps
- 1.1B
* - `parakeet-rnnt-0.6b <https://huggingface.co/nvidia/parakeet-rnnt-0.6b>`__
- RNN-T
- ASR, timestamps
- 0.6B
* - `parakeet-ctc-1.1b <https://huggingface.co/nvidia/parakeet-ctc-1.1b>`__
- CTC
- ASR
- 1.1B
* - `parakeet-ctc-0.6b <https://huggingface.co/nvidia/parakeet-ctc-0.6b>`__
- CTC
- ASR
- 0.6B
* - `parakeet-rnnt-110m-da-dk <https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk>`__
- RNN-T
- ASR, Danish
- 110M


Streaming Models
-----------------

Cache-aware models for real-time / low-latency inference.

.. list-table::
:header-rows: 1

* - Model
- Decoder
- Capabilities
- Languages
* - `nemotron-speech-streaming-en-0.6b <https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b>`__
- Hybrid
- ASR, streaming
- en

Either extend the ISO code into full language name, or add another glossary at the end - we have to assume less technical people will be reading this too.

It may be more economical to just list the architecture and configure a list of supported language models, or maybe a matrix?

* - `multitalker-parakeet-streaming-0.6b-v1 <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`__
- RNN-T
- ASR, multitalker, streaming
- en
* - `parakeet_realtime_eou_120m-v1 <https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1>`__
- RNN-T
- ASR, end-of-utterance, streaming
- en
* - `stt_en_fastconformer_hybrid_large_streaming_multi <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi>`__
- Hybrid
- ASR, streaming, multiple look-aheads
- en
* - `stt_en_fastconformer_hybrid_medium_streaming_80ms_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms_pc>`__
- Hybrid
- ASR, PnC, streaming
- en
* - `stt_en_fastconformer_hybrid_medium_streaming_80ms <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms>`__
- Hybrid
- ASR, streaming
- en
* - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc>`__
- Hybrid
- ASR, PnC, streaming
- ka

Yeah on Piotr's above point, few know the Georgian language code offhand.

* - `stt_en_fastconformer_hybrid_large_streaming_1040ms <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_1040ms>`__
- Hybrid
- ASR, streaming
- en


FastConformer English Models (Non-Streaming)
----------------------------------------------

.. list-table::
:header-rows: 1

* - Model
- Decoder
- Capabilities
- Size
* - `stt_en_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- Large
* - `stt_en_fastconformer_ctc_large <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_large>`__
- CTC
- ASR
- Large
* - `stt_en_fastconformer_ctc_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xlarge>`__
- CTC
- ASR
- XLarge
* - `stt_en_fastconformer_ctc_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xxlarge>`__
- CTC
- ASR
- XXLarge
* - `stt_en_fastconformer_transducer_large <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_large>`__
- RNN-T
- ASR
- Large
* - `stt_en_fastconformer_transducer_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xlarge>`__
- RNN-T
- ASR
- XLarge
* - `stt_en_fastconformer_transducer_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xxlarge>`__
- RNN-T
- ASR
- XXLarge
* - `stt_en_fastconformer_tdt_large <https://huggingface.co/nvidia/stt_en_fastconformer_tdt_large>`__
- TDT
- ASR
- Large


FastConformer Multilingual Models
----------------------------------

.. list-table::
:header-rows: 1

* - Model

I'd move all fastconformers underneath parakeet. This'll just lead to confusion.

I think it's OK, the concept here is that fastconformer are the older models and parakeet are the newer models.

ehhh, i think our branding efforts are causing confusion, especially now Nemotron Speech is a thing. In the technical docs there should be clear understanding that these are the same architectures. The naming aspect can be left up to marketing but for devs it should be clear that FastConformer and Parakeet are largely equivalent.

- Decoder
- Capabilities
- Languages
* - `stt_multilingual_fastconformer_hybrid_large_pc_blend_eu <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu>`__
- Hybrid
- ASR, PnC
- Multilingual EU
* - `stt_de_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_de_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- de
* - `stt_es_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- es
* - `stt_es_fastconformer_hybrid_large_pc_nc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc_nc>`__
- Hybrid
- ASR, PnC
- es
* - `stt_fr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- fr
* - `stt_it_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_it_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- it
* - `stt_ru_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ru_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- ru
* - `stt_ua_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ua_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- uk
* - `stt_pl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pl_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- pl
* - `stt_hr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hr_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- hr
* - `stt_be_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_be_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- be
* - `stt_nl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_nl_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- nl
* - `stt_pt_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pt_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- pt
* - `stt_fa_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_fa_fastconformer_hybrid_large>`__
- Hybrid
- ASR
- fa
* - `stt_ka_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- ka
* - `stt_hy_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hy_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- hy
* - `stt_ar_fastconformer_hybrid_large_pc_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0>`__
- Hybrid
- ASR, PnC
- ar
* - `stt_ar_fastconformer_hybrid_large_pcd_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0>`__
- Hybrid
- ASR, PnC, diacritization
- ar
* - `stt_uz_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc>`__
- Hybrid
- ASR, PnC
- uz
* - `stt_kk_ru_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_kk_ru_fastconformer_hybrid_large>`__
- Hybrid
- ASR
- kk, ru
* - `parakeet-ctc-0.6b-Vietnamese <https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese>`__

This should go under Parakeet section above

- CTC
- ASR
- vi


Loading Models
--------------

All models can be loaded via the ``from_pretrained()`` API:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    # From Hugging Face (prefix with nvidia/)
    model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

    # From NGC (no prefix)
    model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large")

To list all available models programmatically:

.. code-block:: python

    import nemo.collections.asr as nemo_asr

    print(nemo_asr.models.ASRModel.list_available_models())
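
The Hugging Face and NGC identifiers in the loading example differ only in the ``nvidia/`` prefix. The sketch below encodes that naming convention so a script can report where a given identifier is expected to resolve. ``checkpoint_source`` is a hypothetical helper, not a NeMo function, and it assumes the prefix is the only distinguishing signal.

```python
# Hypothetical helper (not a NeMo API): classify a checkpoint identifier
# using the naming convention from the loading example.
def checkpoint_source(name):
    """Return "huggingface" for "nvidia/..." identifiers, otherwise "ngc"."""
    return "huggingface" if name.startswith("nvidia/") else "ngc"


for name in ("nvidia/parakeet-tdt-0.6b-v2", "stt_en_fastconformer_transducer_large"):
    print(name, "->", checkpoint_source(name))
```

This only inspects the string; it does not check that the checkpoint actually exists at either location.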