Refactored all ASR collections documentation #15542
base: main
Changes from all commits
55b65a4
df3e3bc
15bed94
15941f2
63144cb
49319f9
0db40f2
63fed73
a66642e
034ca21
9ae4b21
ccf8365
8769deb
edc5841
6e0e501
894bfdb
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,339 @@ | ||
| .. _asr-checkpoints-list: | ||
|
|
||
| ======================= | ||
| ASR Model Checkpoints | ||
| ======================= | ||
|
|
||
| This page lists all supported ASR model checkpoints released by NVIDIA NeMo. | ||
| Benchmark scores for each model can be found on its `HuggingFace model card <https://huggingface.co/nvidia>`__. | ||
|
|
||
| Glossary | ||
| -------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Term | ||
| - Definition | ||
| * - **ASR** | ||
| - Automatic Speech Recognition — transcribing speech to text | ||
| * - **AST** | ||
| - Automatic Speech Translation — translating speech to text from one language to another | ||
| * - **AED** | ||
| - Attention Encoder-Decoder — autoregressive decoder using cross-attention (Canary family) | ||
| * - **CTC** | ||
| - Connectionist Temporal Classification — non-autoregressive decoder | ||
| * - **RNN-T** | ||
| - Recurrent Neural Network Transducer — autoregressive streaming-friendly decoder | ||
| * - **TDT** | ||
| - Token-and-Duration Transducer — extends RNN-T with duration prediction for faster inference | ||
| * - **Hybrid** | ||
| - Joint RNN-T + CTC model — both decoders trained together, either usable at inference | ||
| * - **PnC** | ||
| - Punctuation and Capitalization in the output | ||
| * - **Streaming** | ||
| - Real-time / cache-aware inference capability | ||
| * - **EU4** | ||
| - Multilingual: English, German, Spanish, French | ||
| * - **EU25** | ||
| - Multilingual: 25 European languages (de, en, es, fr, it, pl, pt, nl, ru, uk, hr, cs, bg, da, et, fi, el, hu, lv, lt, mt, ro, sk, sl, sv) | ||
|
|
||
|
|
||
| Canary Models (AED) | ||
| ------------------- | ||
|
|
||
| Multi-task encoder-decoder models supporting ASR, AST, PnC, and timestamps across multiple languages. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Languages | ||
| * - `canary-1b-v2 <https://huggingface.co/nvidia/canary-1b-v2>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps | ||
| - EU25 | ||
| * - `canary-qwen-2.5b <https://huggingface.co/nvidia/canary-qwen-2.5b>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps | ||
| - EU25 | ||
| * - `canary-1b-flash <https://huggingface.co/nvidia/canary-1b-flash>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps, fast | ||
| - EU4 | ||
| * - `canary-180m-flash <https://huggingface.co/nvidia/canary-180m-flash>`__ | ||
| - AED | ||
| - ASR, AST, PnC, timestamps, fast | ||
| - EU4 | ||
| * - `canary-1b <https://huggingface.co/nvidia/canary-1b>`__ | ||
| - AED | ||
| - ASR, AST, PnC | ||
| - EU4 | ||
|
|
||
|
|
||
| Parakeet Models (English) | ||
| -------------------------- | ||
|
|
||
| High-accuracy ASR models with a FastConformer encoder. Most are English; the Japanese and Danish checkpoints are marked under Capabilities. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Size | ||
| * - `parakeet-tdt-0.6b-v3 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v3>`__ | ||
| - TDT | ||
| - ASR, PnC, timestamps, EU25 | ||
| - 0.6B | ||
| * - `parakeet-tdt-0.6b-v2 <https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2>`__ | ||
| - TDT | ||
| - ASR, PnC, timestamps | ||
| - 0.6B | ||
| * - `parakeet-tdt-1.1b <https://huggingface.co/nvidia/parakeet-tdt-1.1b>`__ | ||
| - TDT | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-tdt_ctc-1.1b <https://huggingface.co/nvidia/parakeet-tdt_ctc-1.1b>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-tdt_ctc-0.6b-ja <https://huggingface.co/nvidia/parakeet-tdt_ctc-0.6b-ja>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps, Japanese | ||
| - 0.6B | ||
| * - `parakeet-tdt_ctc-110m <https://huggingface.co/nvidia/parakeet-tdt_ctc-110m>`__ | ||
| - Hybrid TDT+CTC | ||
| - ASR, timestamps | ||
| - 110M | ||
| * - `parakeet-rnnt-1.1b <https://huggingface.co/nvidia/parakeet-rnnt-1.1b>`__ | ||
| - RNN-T | ||
| - ASR, timestamps | ||
| - 1.1B | ||
| * - `parakeet-rnnt-0.6b <https://huggingface.co/nvidia/parakeet-rnnt-0.6b>`__ | ||
| - RNN-T | ||
| - ASR, timestamps | ||
| - 0.6B | ||
| * - `parakeet-ctc-1.1b <https://huggingface.co/nvidia/parakeet-ctc-1.1b>`__ | ||
| - CTC | ||
| - ASR | ||
| - 1.1B | ||
| * - `parakeet-ctc-0.6b <https://huggingface.co/nvidia/parakeet-ctc-0.6b>`__ | ||
| - CTC | ||
| - ASR | ||
| - 0.6B | ||
| * - `parakeet-rnnt-110m-da-dk <https://huggingface.co/nvidia/parakeet-rnnt-110m-da-dk>`__ | ||
| - RNN-T | ||
| - ASR, Danish | ||
| - 110M | ||
|
|
||
|
|
||
| Streaming Models | ||
| ----------------- | ||
|
|
||
| Cache-aware models for real-time / low-latency inference. | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Languages | ||
| * - `nemotron-speech-streaming-en-0.6b <https://huggingface.co/nvidia/nemotron-speech-streaming-en-0.6b>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - en | ||
|
Collaborator
Either extend the ISO code into the full language name, or add another glossary at the end. We have to assume less technical people will be reading this too.
Collaborator
It may be more economical to just list the architecture and configure a list of supported language models, or maybe a matrix? |
||
| * - `multitalker-parakeet-streaming-0.6b-v1 <https://huggingface.co/nvidia/multitalker-parakeet-streaming-0.6b-v1>`__ | ||
| - RNN-T | ||
| - ASR, multitalker, streaming | ||
| - en | ||
| * - `parakeet_realtime_eou_120m-v1 <https://huggingface.co/nvidia/parakeet_realtime_eou_120m-v1>`__ | ||
| - RNN-T | ||
| - ASR, end-of-utterance, streaming | ||
| - en | ||
| * - `stt_en_fastconformer_hybrid_large_streaming_multi <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi>`__ | ||
| - Hybrid | ||
| - ASR, streaming, multiple look-aheads | ||
| - en | ||
| * - `stt_en_fastconformer_hybrid_medium_streaming_80ms_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC, streaming | ||
| - en | ||
| * - `stt_en_fastconformer_hybrid_medium_streaming_80ms <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_medium_streaming_80ms>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - en | ||
| * - `stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_transducer_ctc_large_streaming_80ms_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC, streaming | ||
| - ka | ||
|
Collaborator
Yeah, on Piotr's point above, few know the Georgian language code offhand. |
||
| * - `stt_en_fastconformer_hybrid_large_streaming_1040ms <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_fastconformer_hybrid_large_streaming_1040ms>`__ | ||
| - Hybrid | ||
| - ASR, streaming | ||
| - en | ||
|
|
||
|
|
||
| FastConformer English Models (Non-Streaming) | ||
| ---------------------------------------------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
| - Decoder | ||
| - Capabilities | ||
| - Size | ||
| * - `stt_en_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - Large | ||
| * - `stt_en_fastconformer_ctc_large <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_large>`__ | ||
| - CTC | ||
| - ASR | ||
| - Large | ||
| * - `stt_en_fastconformer_ctc_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xlarge>`__ | ||
| - CTC | ||
| - ASR | ||
| - XLarge | ||
| * - `stt_en_fastconformer_ctc_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_ctc_xxlarge>`__ | ||
| - CTC | ||
| - ASR | ||
| - XXLarge | ||
| * - `stt_en_fastconformer_transducer_large <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_large>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - Large | ||
| * - `stt_en_fastconformer_transducer_xlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xlarge>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - XLarge | ||
| * - `stt_en_fastconformer_transducer_xxlarge <https://huggingface.co/nvidia/stt_en_fastconformer_transducer_xxlarge>`__ | ||
| - RNN-T | ||
| - ASR | ||
| - XXLarge | ||
| * - `stt_en_fastconformer_tdt_large <https://huggingface.co/nvidia/stt_en_fastconformer_tdt_large>`__ | ||
| - TDT | ||
| - ASR | ||
| - Large | ||
|
|
||
|
|
||
| FastConformer Multilingual Models | ||
| ---------------------------------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
|
|
||
| * - Model | ||
|
Collaborator
I'd move all FastConformers underneath Parakeet. This'll just lead to confusion.
Collaborator
I think it's OK, the concept here is that
Collaborator
Ehh, I think our branding efforts are causing confusion, especially now that Nemotron Speech is a thing. In the technical docs there should be a clear understanding that these are the same architectures. The naming aspect can be left up to marketing, but for devs it should be clear that FastConformer and Parakeet are largely equivalent. |
||
| - Decoder | ||
| - Capabilities | ||
| - Language | ||
| * - `stt_multilingual_fastconformer_hybrid_large_pc_blend_eu <https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_multilingual_fastconformer_hybrid_large_pc_blend_eu>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - Multilingual EU | ||
| * - `stt_de_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_de_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - de | ||
| * - `stt_es_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - es | ||
| * - `stt_es_fastconformer_hybrid_large_pc_nc <https://huggingface.co/nvidia/stt_es_fastconformer_hybrid_large_pc_nc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - es | ||
| * - `stt_fr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_fr_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - fr | ||
| * - `stt_it_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_it_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - it | ||
| * - `stt_ru_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ru_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - ru | ||
| * - `stt_ua_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ua_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - uk | ||
| * - `stt_pl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pl_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - pl | ||
| * - `stt_hr_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hr_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - hr | ||
| * - `stt_be_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_be_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - be | ||
| * - `stt_nl_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_nl_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - nl | ||
| * - `stt_pt_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_pt_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - pt | ||
| * - `stt_fa_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_fa_fastconformer_hybrid_large>`__ | ||
| - Hybrid | ||
| - ASR | ||
| - fa | ||
| * - `stt_ka_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_ka_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - ka | ||
| * - `stt_hy_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_hy_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - hy | ||
| * - `stt_ar_fastconformer_hybrid_large_pc_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pc_v1.0>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - ar | ||
| * - `stt_ar_fastconformer_hybrid_large_pcd_v1.0 <https://huggingface.co/nvidia/stt_ar_fastconformer_hybrid_large_pcd_v1.0>`__ | ||
| - Hybrid | ||
| - ASR, PnC, diacritization | ||
| - ar | ||
| * - `stt_uz_fastconformer_hybrid_large_pc <https://huggingface.co/nvidia/stt_uz_fastconformer_hybrid_large_pc>`__ | ||
| - Hybrid | ||
| - ASR, PnC | ||
| - uz | ||
| * - `stt_kk_ru_fastconformer_hybrid_large <https://huggingface.co/nvidia/stt_kk_ru_fastconformer_hybrid_large>`__ | ||
| - Hybrid | ||
| - ASR | ||
| - kk, ru | ||
| * - `parakeet-ctc-0.6b-Vietnamese <https://huggingface.co/nvidia/parakeet-ctc-0.6b-Vietnamese>`__ | ||
|
Collaborator
This should go under the Parakeet section above. |
||
| - CTC | ||
| - ASR | ||
| - vi | ||
|
|
||
|
|
||
| Loading Models | ||
| -------------- | ||
|
|
||
| All models can be loaded via the ``from_pretrained()`` API: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import nemo.collections.asr as nemo_asr | ||
|
|
||
| # From HuggingFace (prefix with nvidia/) | ||
| model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2") | ||
|
|
||
| # From NGC (no prefix) | ||
| model = nemo_asr.models.ASRModel.from_pretrained("stt_en_fastconformer_transducer_large") | ||
|
|
||
| To list all available models programmatically: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import nemo.collections.asr as nemo_asr | ||
| nemo_asr.models.ASRModel.list_available_models() | ||
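The naming rule in the "Loading Models" snippet (Hugging Face checkpoints take the ``nvidia/`` prefix, NGC checkpoints are referenced by bare name) could be sketched as a small helper. This is only an illustration: ``pretrained_id`` and its ``source`` argument are hypothetical names, not part of NeMo, and the function only builds the string that would be passed to ``from_pretrained()``.

```python
HF_PREFIX = "nvidia/"


def pretrained_id(name: str, source: str = "hf") -> str:
    """Build the identifier string for from_pretrained().

    Hypothetical helper (not a NeMo API). It assumes the convention shown
    in the docs: "nvidia/<name>" for Hugging Face, bare "<name>" for NGC.
    """
    # Normalize first so the helper accepts names with or without the prefix.
    bare = name[len(HF_PREFIX):] if name.startswith(HF_PREFIX) else name
    if source == "hf":
        return HF_PREFIX + bare
    if source == "ngc":
        return bare
    raise ValueError(f"unknown source: {source!r}")


if __name__ == "__main__":
    print(pretrained_id("parakeet-tdt-0.6b-v2"))
    print(pretrained_id("stt_en_fastconformer_transducer_large", source="ngc"))
```

The returned string would then be handed to ``nemo_asr.models.ASRModel.from_pretrained(...)`` exactly as in the snippet above; the helper itself does not import or call NeMo.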
iirc some of these didn't really prioritize PnC no?