Pipeline for harvesting, validating, and training Vietnamese ASR models with NVIDIA NeMo.
This repo focuses on data preparation, validation, and training. Serving an ASR model in production is a separate step. This separation follows the architecture defined in NVIDIA’s NIM Microservices coursework.
- collects Vietnamese speech data from YouTube
- prepares clean audio + transcripts for training
- generates NeMo-ready manifests
- benchmarks inference and validates the pipeline before fine-tuning
- NVIDIA NeMo for ASR training
- PyTorch for model execution
- yt-dlp + ffmpeg for audio collection and conversion
- pytest + GitHub Actions for validation
```bash
git clone https://github.com/wheevu/nemo-vietnamese-asr
cd nemo-vietnamese-asr
pip install -r requirements.txt
python -m src.yt_harvester --bulk links.txt --workers 4
python prepare_data.py --seed 42
```

Run the notebook in Colab for GPU training: NVIDIA_NeMo_ASR.ipynb
MIT
Deep dive
This project uses a local-to-cloud workflow: development happens on a standard laptop (Mac/Windows), and cloud GPUs are paid for only when actual training is required.
- Local Data Prep (CPU): The "heavy lifting" of downloading and processing audio happens locally.
- Cloud Training (GPU): The clean data is uploaded to Google Colab to run the actual NeMo training.
- Strict Validation: Manual transcripts are prioritized over auto-generated ones so the model learns from high-quality data.
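The "manual first" rule can be sketched as a small selector. The function and track format below are hypothetical stand-ins, not the harvester's real internals:

```python
def pick_transcript(tracks):
    """Prefer a human-written transcript; fall back to auto-generated captions.

    `tracks` is a list of dicts like {"kind": "manual" | "auto", "lang": "vi"}
    (a made-up shape for illustration). Returns the chosen track, or None.
    """
    manual = [t for t in tracks if t.get("kind") == "manual"]
    auto = [t for t in tracks if t.get("kind") == "auto"]
    if manual:
        return manual[0]   # highest-quality source wins
    if auto:
        return auto[0]     # fallback: better than nothing
    return None            # no usable transcript -> skip this video


# Example: only auto captions exist, so we fall back to them
tracks = [{"kind": "auto", "lang": "vi"}]
print(pick_transcript(tracks)["kind"])  # auto
```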
- Goal: Build a clean dataset and a training pipeline.
- Non-Goal: No live API endpoint or web app; serving is out of scope.
- Constraint: Designed to prevent "Out of Memory" (OOM) errors on T4 GPUs by chopping audio into 30-second chunks.
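The 30-second chunking constraint boils down to boundary arithmetic over a clip's sample count. A minimal illustration (the repo's real chunking code may differ):

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000, chunk_sec: int = 30):
    """Return (start, end) sample indices for fixed-length chunks.

    Splitting long clips means no single forward pass has to hold the
    whole file in GPU memory, which avoids OOM on a 16 GB T4.
    """
    step = sample_rate * chunk_sec  # samples per 30 s chunk
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]


# 70 s of 16 kHz audio -> two full 30 s chunks plus a 10 s remainder
print(chunk_bounds(70 * 16_000))
```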
```
nemo-vietnamese-asr/
├── src/
│   └── yt_harvester/        # The Tool: Downloads & Cleans Data
│       ├── __main__.py      # Entry point
│       ├── downloader.py    # Logic to fetch YouTube video/audio
│       └── processor.py     # Logic to analyze text (sentiment)
├── audio/                   # Output: Clean 16kHz WAV files
├── transcripts/             # Output: Clean text files
├── prepare_data.py          # Script: Generates NeMo manifest files
├── benchmark.py             # Script: Tests model speed (FPS/WER)
├── NVIDIA_NeMo_ASR.ipynb    # Notebook: Run this in Google Colab
└── tests/                   # Quality Assurance (QA)
```

The tool inside src/yt_harvester (reused legacy code) turns messy YouTube links into a clean dataset.
- Best Transcript First: It looks for a human-written transcript. If none exists, it falls back to auto-generated captions.
- Audio Formatting: It automatically converts audio to 16kHz Mono WAV, which is the standard required by NeMo models.
- Smart Skipping: If run twice, it skips files that were already downloaded to save time.
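The 16 kHz mono conversion and skip-if-exists behavior can be sketched like this. The helper names are hypothetical; the real logic lives in downloader.py and processor.py:

```python
from pathlib import Path


def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that resamples to 16 kHz mono 16-bit WAV."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000",        # 16 kHz sample rate (what NeMo models expect)
        "-ac", "1",            # mono
        "-sample_fmt", "s16",  # 16-bit PCM
        dst,
    ]


def convert(src: str, dst: str) -> bool:
    """Skip files that were already converted; return True if work was needed."""
    if Path(dst).exists():
        return False  # smart skipping: don't redo finished work
    # subprocess.run(ffmpeg_cmd(src, dst), check=True)  # uncomment to actually convert
    return True


print(ffmpeg_cmd("clip.webm", "audio/clip.wav"))
```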
```bash
# 1. Process a single video
python -m src.yt_harvester "https://www.youtube.com/watch?v=VIDEO_ID"

# 2. Process a list of videos
python -m src.yt_harvester --bulk links.txt --workers 4

# 3. Create the training files (manifests)
python prepare_data.py --seed 42
```

Before sending data to the GPU, this script checks the work. It removes empty files, normalizes text formatting (lowercasing), and splits the data into Train/Validation/Test sets (80/10/10).
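The manifest format and the deterministic split can be sketched as follows. NeMo manifests are JSON-lines with `audio_filepath`, `duration`, and `text` fields; the split helper here is a simplified stand-in for what prepare_data.py does:

```python
import json
import random


def manifest_line(wav_path: str, duration: float, text: str) -> str:
    """One JSON-lines entry in the shape NeMo's dataloaders read."""
    return json.dumps(
        {"audio_filepath": wav_path, "duration": duration, "text": text.lower().strip()},
        ensure_ascii=False,  # keep Vietnamese diacritics readable in the file
    )


def split_80_10_10(items: list, seed: int = 42):
    """Deterministic train/val/test split, reproducible via --seed."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train, val, test = split_80_10_10(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```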
Open NVIDIA_NeMo_ASR.ipynb in Google Colab to handle the GPU work.
- Load Data: Unzips the dataset directly to the Colab disk for speed.
- Load Model: Downloads the `stt_en_conformer_ctc_large` model from NVIDIA.
- Evaluate: Runs the model on the Vietnamese data.
- Note: Since I am using an English model on Vietnamese audio without fine-tuning, the accuracy will be low (high WER). This proves the pipeline works before spending hours fine-tuning.
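WER (Word Error Rate) is the word-level edit distance between reference and hypothesis, divided by the reference length. The repo uses jiwer for this; a minimal self-contained version, for intuition only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic programming over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / max(len(ref), 1)


# Phonetic guesses share no exact words with the reference, so WER is maximal:
print(wer("xin chào", "sin chow"))  # 1.0
```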
I performed a "Zero-shot" test (running the English model on Vietnamese audio).
- Result: The model attempts to map Vietnamese sounds to English words phonetically.
| Original Vietnamese Audio | Model Transcription (English Phonetics) | Analysis |
|---|---|---|
| "Giang Ơi Radio" | "the radio" | Recognized the English loanword |
| "Chào bạn" | "ta bak" | Acoustic approximation (sounds similar) |
Conclusion: The pipeline successfully feeds audio to the model. The next logical step is Transfer Learning: freezing the model's "ear" (Encoder) and retraining its "brain" (Decoder) to understand Vietnamese text.
To make the model run faster on smaller GPUs (like the free Colab T4), I use Quantization (following “Quantization Fundamentals with Hugging Face”). This reduces the precision of the math inside the model (from 32-bit floating point to 16-bit) to save memory and increase speed.
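The memory saving is exactly the bit width: a 16-bit float takes half the bytes of a 32-bit one, as a quick stdlib check shows. (This only illustrates the arithmetic; the notebook's actual precision switch goes through the model's PyTorch-level casting.)

```python
import struct

# Bytes per value at each precision ("f" = IEEE float32, "e" = IEEE float16)
print(struct.calcsize("f"))  # 4
print(struct.calcsize("e"))  # 2

# Storing one million weights:
n = 1_000_000
print(n * struct.calcsize("f"))  # 4000000 bytes in float32
print(n * struct.calcsize("e"))  # 2000000 bytes in float16
```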
I wrote benchmark.py to measure exactly how much faster the optimized model is.
| Precision | Speed (Latency) | VRAM Usage | Notes |
|---|---|---|---|
| float32 (Standard) | 151 ms/file | ~731 MB | Baseline speed. |
| float16 (Optimized) | 89 ms/file | ~888 MB | ~40% Faster. Recommended for T4. |
| int8 | N/A | ~166 MB | Currently incompatible with this model type. |
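The latency column boils down to timing repeated transcriptions and averaging. A minimal harness of the idea, using a stand-in workload instead of the real model call:

```python
import time


def avg_latency_ms(fn, n_runs: int = 20, warmup: int = 3) -> float:
    """Average wall-clock latency of `fn` in milliseconds.

    Warm-up runs are discarded so one-time costs (CUDA kernel
    compilation, caches) don't skew the mean.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0


# Stand-in for transcribing one file with the loaded model:
workload = lambda: sum(i * i for i in range(10_000))
print(f"{avg_latency_ms(workload):.2f} ms/file")
```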
To run this benchmark yourself in Colab:
```bash
python benchmark.py --model stt_en_conformer_ctc_large --manifest val_manifest.json
```

Bad data ruins training runs. I included a professional test suite to catch errors before they crash the training script (following "Testing Machine Learning Systems: Code, Data and Models" by Made With ML).
- Text Processing: Does the code handle YouTube URLs correctly? Does it preserve Vietnamese diacritics (accents)?
- Data Integrity: Are the audio files actually 16kHz mono? Do the JSON manifests point to real files?
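A representative check from the suite: diacritics must survive text cleaning. The `clean_caption_line` below is a hypothetical stand-in for the repo's real helper:

```python
def clean_caption_line(line: str) -> str:
    """Lowercase and trim a caption line without touching Vietnamese diacritics."""
    return line.strip().lower()


def test_vietnamese_diacritics_preserved():
    # "Chào bạn" must keep its tone marks after cleaning.
    assert clean_caption_line("  Chào bạn  ") == "chào bạn"


def test_whitespace_stripped():
    assert clean_caption_line("Xin chào\n") == "xin chào"


# pytest collects test_* functions automatically; here we call them directly.
test_vietnamese_diacritics_preserved()
test_whitespace_stripped()
```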
# Run all tests
pytest tests/ -vtests/test_data_integrity.py::TestAudioFormatCompliance::test_audio_sample_rate_is_16khz PASSED
tests/test_text_processing.py::TestCleanCaptionLines::test_vietnamese_diacritics_preserved PASSED
I use GitHub Actions to automatically run these tests every time code is pushed. This ensures that a code change doesn't accidentally break the data processing pipeline.
Pipeline: Checkout Code -> Install Audio Libs -> Run pytest.
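A minimal workflow matching that pipeline might look like this (a sketch; the actual file under .github/workflows may differ):

```yaml
name: tests
on: [push, pull_request]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install audio libs
        run: |
          sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
          pip install -r requirements.txt
      - name: Run pytest
        run: pytest tests/ -v
```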
- Local: `yt-dlp` (downloading), `ffmpeg` (audio conversion), `textblob` (analysis), `soundfile`.
- Cloud: `nemo_toolkit[all]`, `pytorch-lightning`, `jiwer` (error-rate calculation).
- Testing: `pytest`, `pytest-cov`.
Now that the pipeline is validated, the next steps for high-accuracy Vietnamese ASR are:
- Select Model: Switch to `stt_en_conformer_ctc_small` for faster training.
- Fine-Tune: Freeze the Encoder, retrain the Decoder on the Vietnamese corpus.
- Tokenizer: Replace the English tokenizer with a Vietnamese character-based tokenizer.
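Building a character-based vocabulary from the corpus is straightforward: collect every character the transcripts use. This is a sketch of the idea only; wiring the list into NeMo's vocabulary swap is a separate step:

```python
def char_vocabulary(transcripts: list[str]) -> list[str]:
    """Collect the sorted set of characters seen in the corpus, diacritics
    included — the label set for a character-based CTC tokenizer."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    return sorted(chars)


vocab = char_vocabulary(["Chào bạn", "Giang Ơi Radio"])
print(vocab)  # sorted characters, Vietnamese diacritics included
```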
