Model | Demo (Huggingface Space) | Blog (Thai)
In this project I used two datasets, SCB-1M and OPUS, both of which can be downloaded from the thai2nmt project. The data is cleaned using the scripts in the `preprocess_data` folder. For the test set I used IWSLT 2015.
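Below is a minimal sketch of the loading and splitting step. It assumes the SCB-1M corpus is available on the Hugging Face Hub as `scb_mt_enth_2020` (the thai2nmt release above is the canonical source); the 80:20 ratio mirrors the note under the results table.

```python
from datasets import load_dataset

# Minimal sketch: load the SCB-1M parallel corpus and carve out a
# validation split, mirroring the 80:20 ratio described in the note
# under the results table. The Hub dataset name is an assumption;
# the canonical download is the thai2nmt project linked above.
raw = load_dataset("scb_mt_enth_2020", "enth", split="train")
splits = raw.train_test_split(test_size=0.2, seed=42)
train_set, valid_set = splits["train"], splits["test"]
print(train_set[0])  # e.g. {'translation': {'en': ..., 'th': ...}}
```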
I experimented with two models: mT5 and No Language Left Behind (NLLB). The models are evaluated with sacreBLEU (a minimal scoring sketch follows), and the results are shown in the table below.
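A minimal sketch of the scoring step with the `sacrebleu` library; the sentences here are hypothetical placeholders, not project data.

```python
import sacrebleu

# Corpus-level BLEU: `hypotheses` are the model's translations,
# `references` is a list of reference streams (one per reference set),
# each aligned with the hypotheses.
hypotheses = ["the cat sits on the mat"]   # hypothetical example output
references = [["the cat sat on the mat"]]  # one reference stream

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```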
| Model | Fine-tuning Method | Dataset | Epochs | Validation BLEU | Test BLEU |
|---|---|---|---|---|---|
| mT5-Small | full finetuning | SCB-1M | 5 | 13.15 | 12.14 |
| NLLB-600M | full finetuning | SCB-1M | 3 | 30.74 | 21.71 |
| NLLB-600M | full finetuning | SCB-1M+OPUS | 3 | 28.86 | 27.37 |
| NLLB-600M | LoRA | SCB-1M+OPUS | 9 | 24.00 | 24.23 |
Note: The validation set is sampled from the full dataset, using an 80:20 train/validation split for SCB-1M and an 85:15 split for SCB-1M+OPUS.
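For the LoRA row above, a minimal sketch of the adapter setup with the `peft` library is shown below. The rank, alpha, dropout, and target modules are illustrative assumptions, not the exact hyperparameters used in this project.

```python
from transformers import AutoModelForSeq2SeqLM
from peft import LoraConfig, get_peft_model, TaskType

# Load the NLLB-600M checkpoint and wrap it with LoRA adapters so that
# only the low-rank adapter weights are trained during fine-tuning.
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")
lora_config = LoraConfig(
    task_type=TaskType.SEQ_2_SEQ_LM,
    r=16,                                 # assumed rank, for illustration
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in NLLB
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```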