Pipeline for harvesting, validating, and training Vietnamese ASR models with NVIDIA NeMo.
This repo focuses on data preparation, validation, and training. Serving an ASR model in production is a separate step. This separation follows the architecture defined in NVIDIA’s NIM Microservices coursework.
- collects Vietnamese speech data from YouTube
- prepares clean audio + transcripts for training
- generates NeMo-ready manifests
- benchmarks inference and validates the pipeline before fine-tuning
- NVIDIA NeMo for ASR training
- PyTorch for model execution
- yt-dlp + ffmpeg for audio collection and conversion
- pytest + GitHub Actions for validation
```bash
git clone https://github.com/wheevu/nemo-vietnamese-asr
cd nemo-vietnamese-asr
pip install -r requirements.txt
python -m src.yt_harvester --bulk links.txt --workers 4
python prepare_data.py --seed 42
```

Run the notebook in Colab for GPU training: NVIDIA_NeMo_ASR.ipynb
MIT
Deep dive
This project uses a local-to-cloud workflow: development happens on a standard laptop (Mac/Windows), and cloud GPUs are paid for only when actual training is required.
- Local Data Prep (CPU): The "heavy lifting" of downloading and processing audio happens locally.
- Cloud Training (GPU): The clean data is uploaded to Google Colab to run the actual NeMo training.
- Strict Validation: Manual transcripts are prioritized over auto-generated ones so the model learns from high-quality data.
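The "manual first" rule can be sketched as a small selector. The function and track format below are hypothetical stand-ins, not the harvester's real internals:

```python
def pick_transcript(tracks):
    """Prefer a human-written transcript; fall back to auto-generated captions.

    `tracks` is a list of dicts like {"kind": "manual" | "auto", "lang": "vi"}
    (a made-up shape for illustration). Returns the chosen track, or None.
    """
    manual = [t for t in tracks if t.get("kind") == "manual"]
    auto = [t for t in tracks if t.get("kind") == "auto"]
    if manual:
        return manual[0]   # highest-quality source wins
    if auto:
        return auto[0]     # fallback: better than nothing
    return None            # no usable transcript -> skip this video


# Example: only auto captions exist, so we fall back to them
tracks = [{"kind": "auto", "lang": "vi"}]
print(pick_transcript(tracks)["kind"])  # auto
```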
- Goal: Build a clean dataset and a training pipeline.
- Non-Goal: No live API endpoint or web app; serving is out of scope.
- Constraint: Designed to prevent "Out of Memory" (OOM) errors on T4 GPUs by chopping audio into 30-second chunks.
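The 30-second chunking constraint boils down to boundary arithmetic over a clip's sample count. A minimal illustration (the repo's real chunking code may differ):

```python
def chunk_bounds(num_samples: int, sample_rate: int = 16_000, chunk_sec: int = 30):
    """Return (start, end) sample indices for fixed-length chunks.

    Splitting long clips means no single forward pass has to hold the
    whole file in GPU memory, which avoids OOM on a 16 GB T4.
    """
    step = sample_rate * chunk_sec  # samples per 30 s chunk
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]


# 70 s of 16 kHz audio -> two full 30 s chunks plus a 10 s remainder
print(chunk_bounds(70 * 16_000))
```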
```
nemo-vietnamese-asr/
├── src/
│   └── yt_harvester/        # The Tool: Downloads & Cleans Data
│       ├── __main__.py      # Entry point
│       ├── downloader.py    # Logic to fetch YouTube video/audio
│       └── processor.py     # Logic to analyze text (sentiment)
├── audio/                   # Output: Clean 16kHz WAV files
├── transcripts/             # Output: Clean text files
├── prepare_data.py          # Script: Generates NeMo manifest files
├── benchmark.py             # Script: Tests model speed (FPS/WER)
├── NVIDIA_NeMo_ASR.ipynb    # Notebook: Run this in Google Colab
└── tests/                   # Quality Assurance (QA)
```

The tool inside src/yt_harvester (reused legacy code) turns messy YouTube links into a clean dataset.
- Best Transcript First: It looks for a human-written transcript. If none exists, it falls back to auto-generated captions.
- Audio Formatting: It automatically converts audio to 16kHz Mono WAV, which is the standard required by NeMo models.
- Smart Skipping: If run twice, it skips files that were already downloaded to save time.
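The 16 kHz mono conversion and skip-if-exists behavior can be sketched like this. The helper names are hypothetical; the real logic lives in downloader.py and processor.py:

```python
from pathlib import Path


def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    """Build the ffmpeg command that resamples to 16 kHz mono 16-bit WAV."""
    return [
        "ffmpeg", "-y", "-i", src,
        "-ar", "16000",        # 16 kHz sample rate (what NeMo models expect)
        "-ac", "1",            # mono
        "-sample_fmt", "s16",  # 16-bit PCM
        dst,
    ]


def convert(src: str, dst: str) -> bool:
    """Skip files that were already converted; return True if work was needed."""
    if Path(dst).exists():
        return False  # smart skipping: don't redo finished work
    # subprocess.run(ffmpeg_cmd(src, dst), check=True)  # uncomment to actually convert
    return True


print(ffmpeg_cmd("clip.webm", "audio/clip.wav"))
```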
```bash
# 1. Process a single video
python -m src.yt_harvester "https://www.youtube.com/watch?v=VIDEO_ID"

# 2. Process a list of videos
python -m src.yt_harvester --bulk links.txt --workers 4

# 3. Create the training files (manifests)
python prepare_data.py --seed 42
```

Before sending data to the GPU, this script checks the work. It removes empty files, normalizes text formatting (lowercasing), and splits the data into Train/Validation/Test sets (80/10/10).
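The manifest format and the deterministic split can be sketched as follows. NeMo manifests are JSON-lines with `audio_filepath`, `duration`, and `text` fields; the split helper here is a simplified stand-in for what prepare_data.py does:

```python
import json
import random


def manifest_line(wav_path: str, duration: float, text: str) -> str:
    """One JSON-lines entry in the shape NeMo's dataloaders read."""
    return json.dumps(
        {"audio_filepath": wav_path, "duration": duration, "text": text.lower().strip()},
        ensure_ascii=False,  # keep Vietnamese diacritics readable in the file
    )


def split_80_10_10(items: list, seed: int = 42):
    """Deterministic train/val/test split, reproducible via --seed."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * 0.8)
    n_val = int(len(shuffled) * 0.1)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


train, val, test = split_80_10_10(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```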
Open NVIDIA_NeMo_ASR.ipynb in Google Colab to handle the GPU work.
- Load Data: Unzips the dataset directly to the Colab disk for speed.
- Load Model: Downloads the `stt_en_conformer_ctc_large` model from NVIDIA.
- Evaluate: Runs the model on the Vietnamese data.
- Note: Since I am using an English model on Vietnamese audio without fine-tuning, the accuracy will be low (high WER). This proves the pipeline works before spending hours fine-tuning.
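WER (Word Error Rate) is the word-level edit distance between reference and hypothesis, divided by the reference length. The repo uses jiwer for this; a minimal self-contained version, for intuition only:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein dynamic programming over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution / match
    return d[-1][-1] / max(len(ref), 1)


# Phonetic guesses share no exact words with the reference, so WER is maximal:
print(wer("xin chào", "sin chow"))  # 1.0
```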
I performed a "Zero-shot" test (running the English model on Vietnamese audio).
- Result: The model attempts to map Vietnamese sounds to English words phonetically.
| Original Vietnamese Audio | Model Transcription (English Phonetics) | Analysis |
|---|---|---|
| "Giang Ơi Radio" | "the radio" | Recognized the English loanword |
| "Chào bạn" | "ta bak" | Acoustic approximation (sounds similar) |
Conclusion: The pipeline successfully feeds audio to the model. The next logical step is Transfer Learning: freezing the model's "ear" (Encoder) and retraining its "brain" (Decoder) to understand Vietnamese text.
To make the model run faster on smaller GPUs (like the free Colab T4), I use Quantization (following “Quantization Fundamentals with Hugging Face”). This reduces the precision of the math inside the model (from 32-bit floating point to 16-bit) to save memory and increase speed.
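The memory saving is exactly the bit width: a 16-bit float takes half the bytes of a 32-bit one, as a quick stdlib check shows. (This only illustrates the arithmetic; the notebook's actual precision switch goes through the model's PyTorch-level casting.)

```python
import struct

# Bytes per value at each precision ("f" = IEEE float32, "e" = IEEE float16)
print(struct.calcsize("f"))  # 4
print(struct.calcsize("e"))  # 2

# Storing one million weights:
n = 1_000_000
print(n * struct.calcsize("f"))  # 4000000 bytes in float32
print(n * struct.calcsize("e"))  # 2000000 bytes in float16
```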
I wrote benchmark.py to measure exactly how much faster the optimized model is.
| Precision | Speed (Latency) | VRAM Usage | Notes |
|---|---|---|---|
| float32 (Standard) | 151 ms/file | ~731 MB | Baseline speed. |
| float16 (Optimized) | 89 ms/file | ~888 MB | ~40% Faster. Recommended for T4. |
| int8 | N/A | ~166 MB | Currently incompatible with this model type. |
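The latency column boils down to timing repeated transcriptions and averaging. A minimal harness of the idea, using a stand-in workload instead of the real model call:

```python
import time


def avg_latency_ms(fn, n_runs: int = 20, warmup: int = 3) -> float:
    """Average wall-clock latency of `fn` in milliseconds.

    Warm-up runs are discarded so one-time costs (CUDA kernel
    compilation, caches) don't skew the mean.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - start) / n_runs * 1000.0


# Stand-in for transcribing one file with the loaded model:
workload = lambda: sum(i * i for i in range(10_000))
print(f"{avg_latency_ms(workload):.2f} ms/file")
```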
To run this benchmark yourself in Colab:
```bash
python benchmark.py --model stt_en_conformer_ctc_large --manifest val_manifest.json
```

Bad data ruins training runs. I included a professional test suite to catch errors before they crash the training script (following "Testing Machine Learning Systems: Code, Data and Models" by Made With ML).
- Text Processing: Does the code handle YouTube URLs correctly? Does it preserve Vietnamese diacritics (accents)?
- Data Integrity: Are the audio files actually 16kHz mono? Do the JSON manifests point to real files?
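A representative check from the suite: diacritics must survive text cleaning. The `clean_caption_line` below is a hypothetical stand-in for the repo's real helper:

```python
def clean_caption_line(line: str) -> str:
    """Lowercase and trim a caption line without touching Vietnamese diacritics."""
    return line.strip().lower()


def test_vietnamese_diacritics_preserved():
    # "Chào bạn" must keep its tone marks after cleaning.
    assert clean_caption_line("  Chào bạn  ") == "chào bạn"


def test_whitespace_stripped():
    assert clean_caption_line("Xin chào\n") == "xin chào"


# pytest collects test_* functions automatically; here we call them directly.
test_vietnamese_diacritics_preserved()
test_whitespace_stripped()
```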
# Run all tests
pytest tests/ -vtests/test_data_integrity.py::TestAudioFormatCompliance::test_audio_sample_rate_is_16khz PASSED
tests/test_text_processing.py::TestCleanCaptionLines::test_vietnamese_diacritics_preserved PASSED
I use GitHub Actions to automatically run these tests every time code is pushed. This ensures that a code change doesn't accidentally break the data processing pipeline.
Pipeline: Checkout Code -> Install Audio Libs -> Run pytest.
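A minimal workflow matching that pipeline might look like this (a sketch; the actual file under .github/workflows may differ):

```yaml
name: tests
on: [push, pull_request]
jobs:
  pytest:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.10"
      - name: Install audio libs
        run: |
          sudo apt-get update && sudo apt-get install -y ffmpeg libsndfile1
          pip install -r requirements.txt
      - name: Run pytest
        run: pytest tests/ -v
```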
- Local: `yt-dlp` (downloading), `ffmpeg` (audio conversion), `textblob` (analysis), `soundfile`.
- Cloud: `nemo_toolkit[all]`, `pytorch-lightning`, `jiwer` (error-rate calculation).
- Testing: `pytest`, `pytest-cov`.
Now that the pipeline is validated, the next steps for high-accuracy Vietnamese ASR are:
- Select Model: Switch to `stt_en_conformer_ctc_small` for faster training.
- Fine-Tune: Freeze the Encoder, retrain the Decoder on the Vietnamese corpus.
- Tokenizer: Replace the English tokenizer with a Vietnamese character-based tokenizer.
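Building a character-based vocabulary from the corpus is straightforward: collect every character the transcripts use. This is a sketch of the idea only; wiring the list into NeMo's vocabulary swap is a separate step:

```python
def char_vocabulary(transcripts: list[str]) -> list[str]:
    """Collect the sorted set of characters seen in the corpus, diacritics
    included — the label set for a character-based CTC tokenizer."""
    chars = set()
    for text in transcripts:
        chars.update(text.lower())
    return sorted(chars)


vocab = char_vocabulary(["Chào bạn", "Giang Ơi Radio"])
print(vocab)  # sorted characters, Vietnamese diacritics included
```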
