eng-kor-transformer

Inspired by K-pop (I'm a ONCE 🍭) and the emergence of ChatGPT, I decided to build a Transformer from scratch for English-Korean machine translation, following the "Attention Is All You Need" paper.
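At the heart of that paper is scaled dot-product attention. A minimal PyTorch sketch of the formula, as an illustration rather than this repository's actual module:

import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # query-key similarities
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float('-inf'))  # hide padding/future tokens
    weights = torch.softmax(scores, dim=-1)  # attention distribution over keys
    return weights @ v  # weighted sum of values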

See the live demo here.

Dataset

Ideally, a clean English–Korean parallel corpus can be obtained from AI Hub. However, account creation is restricted to users within Korea. As a result, two alternative datasets were considered:

  1. A small but clean dataset from Huffon/pytorch-transformer-kor-eng, which is the one used in this project.
  2. A large but noisy dataset from OPUS-Corpora.

The sizes of the training and testing datasets are as follows:

  • Training: 92,000 sentences
  • Testing: 11,500 sentences

Requirements

pyyaml==6.0.3
sentencepiece==0.2.0
sacrebleu==2.5.1
torch==2.6.0

Tokenizer

This project uses the SentencePiece unigram model, which selects the subword segmentation with the highest likelihood.

To build your own tokenizer, run .\02_src\build_tokenizer.py. The tokenizer parameters can be modified in 02_src\config.yaml.

You might need to change the input file names; the structure of these files should be similar to those in 01_data\sentencepiece.

build_tokenizer_model(
    Language.KOR, 
    os.path.join(SENTENCE_PIECE_DATA, "{}.txt".format('korean')) # IMPORTANT: Replace this filename with yours
)

Sentences are segmented into subword units defined in the .vocab file, while the trained tokenizer itself is stored in the .model file.
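Under the hood this corresponds to the standard SentencePiece training API. A minimal sketch of what build_tokenizer_model does, with an assumed vocabulary size (the real value lives in config.yaml):

import sentencepiece as spm

# Hypothetical stand-in for build_tokenizer_model: trains a unigram model on
# one-sentence-per-line text and writes korean_unigram.model / .vocab.
spm.SentencePieceTrainer.train(
    input='01_data/sentencepiece/korean.txt',  # replace with your corpus
    model_prefix='korean_unigram',
    model_type='unigram',
    vocab_size=8000,  # assumption; take the actual value from config.yaml
)

sp = spm.SentencePieceProcessor(model_file='korean_unigram.model')
print(sp.encode('저는 학생입니다.', out_type=str))  # subword pieces from the .vocab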

Training, testing & translating

Like many other AI enthusiasts, I did not have access to a local GPU, and Google Colab limits sessions to several hours. So I trained my models on Kaggle, which provides 30 hours of free GPU time per week.

All scripts related to training/testing can be found in 02_src\kor2eng-v1.ipynb.

To run this notebook, update the input and output file paths.

SENTENCE_PIECE_DATA = '/kaggle/input/korean-sentencepiece/sentencepiece' # IMPORTANT: Tokenizers & data directory

# IMPORTANT: Change tokenizer's filename 
ENG_MODEL_PATH = os.path.join(SENTENCE_PIECE_DATA,'english_unigram.model') 
KOR_MODEL_PATH = os.path.join(SENTENCE_PIECE_DATA,'korean_unigram.model')

# IMPORTANT: Change training/testing data filename
train_korean  = os.path.join(SENTENCE_PIECE_DATA, 'korean.txt') 
train_english = os.path.join(SENTENCE_PIECE_DATA, 'english.txt')
test_korean   = os.path.join(SENTENCE_PIECE_DATA, 'test_korean.txt')
test_english  = os.path.join(SENTENCE_PIECE_DATA, 'test_english.txt')
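The data files are plain text with one sentence per line, aligned across languages. A minimal sketch of how they can be tokenized into ID sequences (the helper name here is illustrative, not the notebook's actual code):

import sentencepiece as spm

kor_sp = spm.SentencePieceProcessor(model_file=KOR_MODEL_PATH)
eng_sp = spm.SentencePieceProcessor(model_file=ENG_MODEL_PATH)

def load_pairs(src_path, trg_path, src_sp, trg_sp):
    # Source and target files are line-aligned: line i of each file is one pair.
    with open(src_path, encoding='utf-8') as f_src, \
         open(trg_path, encoding='utf-8') as f_trg:
        return [(src_sp.encode(s.strip()), trg_sp.encode(t.strip()))
                for s, t in zip(f_src, f_trg)]

train_pairs = load_pairs(train_korean, train_english, kor_sp, eng_sp)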

Results

Training Loss Curve

Although the convergence rates for both directions are similar, the lowest validation loss for KOR → ENG is 3.3341, compared with 4.5607 for ENG → KOR, which is consistent with the BLEU scores in the table below.

Direction      BLEU
ENG → KOR       9.32
KOR → ENG      24.03
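These scores were presumably computed with sacrebleu (pinned in the requirements above); a minimal usage sketch with made-up sentences:

import sacrebleu

# One detokenized hypothesis per test segment, and one list of
# reference sentences per reference set.
hypotheses = ['Tomorrow, I am going to go home.']
references = [['I am going home tomorrow.']]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))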

It can be seen that the BLEU score for KOR → ENG (24.03) is much higher than for ENG → KOR (9.32), likely due to linguistic asymmetry. In particular, when translating from English to Korean, the model must infer honorifics that are explicitly encoded in Korean but largely absent in English, which lowers the BLEU score. For example:

  • 저는 학생입니다. (formal)
  • 저는 학생이에요. (polite, less formal)

Both mean "I am a student" but differ in their level of politeness. Below are sample translations from my trained models:

1. ENG → KOR

SRC: I am going home tomorrow.
TRG: 내일 집에 갈 거예요. ("I will go home tomorrow.")

2. KOR → ENG

SRC: 저는 내일 집에 갑니다. ("I am going home tomorrow.")
TRG: Tomorrow, I am going to go home.

References

  • Vaswani, A., et al. (2017). "Attention Is All You Need." arXiv:1706.03762.
  • Huffon/pytorch-transformer-kor-eng: source of the parallel corpus used in this project.
