A Transformer-based machine translation project comparing:
- a custom Transformer implemented in PyTorch (with hyperparameter experiments), and
- a pre-trained T5 translation pipeline (Hugging Face) as a strong baseline.
Evaluation uses BERTScore and METEOR to measure translation quality and semantic similarity.
- Built a custom Transformer and analyzed how hyperparameters affect learning + translation quality.
- Used the XLM-RoBERTa tokenizer for tokenization + padding in the custom model pipeline.
- Benchmarked against pre-trained T5 using the task prefix `translate English to French:` and beam search decoding.
- Reported performance using Precision / Recall / F1 (BERTScore) + METEOR.
Goal: learn the Transformer architecture deeply and evaluate the effect of changing hyperparameters.
Pipeline:
- Create Train/Val/Test splits
- Clean + tokenize source/target text with XLM-RoBERTa tokenizer
- Train on training set and validate using BERTScore + METEOR
- Tune hyperparameters (embedding size, number of heads, batch size, dropout)
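The hyperparameters tuned above (embedding size, number of heads, dropout) map directly onto `torch.nn.Transformer` arguments. A minimal sketch, assuming a standard PyTorch seq2seq setup; the vocabulary size, layer count, and defaults here are illustrative, not the report's exact configuration:

```python
import torch
import torch.nn as nn

class Seq2SeqTransformer(nn.Module):
    """Toy encoder-decoder Transformer exposing the tuned hyperparameters."""

    def __init__(self, vocab_size=32000, d_model=512, nhead=8,
                 num_layers=3, dropout=0.1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            dropout=dropout, batch_first=True)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, tgt_ids):
        src = self.embed(src_ids)
        tgt = self.embed(tgt_ids)
        # Causal mask: each target position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        hidden = self.transformer(src, tgt, tgt_mask=mask)
        return self.out(hidden)  # (batch, tgt_len, vocab) logits

model = Seq2SeqTransformer(d_model=128, nhead=4)
src = torch.randint(0, 32000, (2, 10))
tgt = torch.randint(0, 32000, (2, 8))
logits = model(src, tgt)
print(logits.shape)  # torch.Size([2, 8, 32000])
```

Changing `d_model`, `nhead`, or `dropout` here corresponds to the embedding-size, head-count, and dropout studies below.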
Default run behavior & example translation
- Training loss dropped from 5.536 → 2.991 across 10 epochs, and validation loss from 5.388 → 3.829.
- Example translation (test):
English: “Hello how are you today”
French: “comment êtes vous aujourd’hui”
with BERTScore P=0.8581, R=0.8435, F1=0.8506 and METEOR=0.2978.
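The loss trajectory above comes from standard teacher-forced cross-entropy training. A minimal sketch of that loop, with a tiny embedding+linear stand-in and random toy data in place of the full Transformer and real corpus:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = 50
# Toy stand-in model; the real run trained the custom Transformer.
model = nn.Sequential(nn.Embedding(vocab, 32), nn.Linear(32, vocab))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

src = torch.randint(0, vocab, (8, 12))   # (batch, seq)
tgt = torch.randint(0, vocab, (8, 12))   # target token ids

losses = []
for epoch in range(10):
    opt.zero_grad()
    logits = model(src)                  # (batch, seq, vocab)
    # Flatten so each position contributes one classification loss term.
    loss = loss_fn(logits.reshape(-1, vocab), tgt.reshape(-1))
    loss.backward()
    opt.step()
    losses.append(loss.item())

print(f"epoch 1 loss {losses[0]:.3f} -> epoch 10 loss {losses[-1]:.3f}")
```

As in the reported run, the training loss falls across the 10 epochs; validation loss (not shown in this toy loop) is computed the same way on held-out pairs.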
Goal: use a strong pretrained Transformer for higher quality translations and compare against the custom model.
Method:
- Load T5 + tokenizer
- Encode inputs with the prefix `translate English to French:`
- Decode using beam search for better translation quality
- Evaluate on test batches with BERTScore + METEOR
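Beam search keeps the `beam_size` highest-scoring partial hypotheses at each step instead of committing greedily to the top token. A generic sketch independent of T5 (the `step_fn` interface, token ids, and toy scorer are hypothetical; with T5 this is handled internally by `generate(num_beams=...)`):

```python
import torch

def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=10):
    """step_fn maps a prefix (list of ids) to log-probs over the next token."""
    beams = [([bos_id], 0.0)]  # (sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq[-1] == eos_id:          # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            logp = step_fn(seq)            # shape: (vocab,)
            topv, topi = torch.topk(logp, beam_size)
            for v, i in zip(topv.tolist(), topi.tolist()):
                candidates.append((seq + [i], score + v))
        # Keep only the beam_size best-scoring hypotheses.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(seq[-1] == eos_id for seq, _ in beams):
            break
    return beams[0][0]

# Toy scorer: strongly prefers token 2 until length 3, then EOS (id 1).
def toy_step(seq):
    logits = torch.full((5,), -5.0)
    logits[2 if len(seq) < 3 else 1] = 0.0
    return torch.log_softmax(logits, dim=0)

print(beam_search(toy_step, bos_id=0, eos_id=1))  # [0, 2, 2, 1]
```

This is also the decoding upgrade suggested later for the custom model, which used greedy decoding.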
| Model | Precision | Recall | F1 | METEOR |
|---|---|---|---|---|
| Custom Transformer | 0.8581 | 0.8435 | 0.8506 | 0.2978 |
| T5 Pre-trained | 0.8960 | 0.8987 | 0.8972 | 0.5160 |
T5 outperforms the custom Transformer on every metric in the report's comparison.
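As a quick sanity check on the table: BERTScore's F1 is the harmonic mean of precision and recall. It is computed per example and then averaged, so applying the formula to the averaged P/R only agrees with the reported F1 up to rounding:

```python
def bertscore_f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r)

# Applied to the table's averaged P/R figures:
print(round(bertscore_f1(0.8581, 0.8435), 4))  # 0.8507 (table reports 0.8506)
print(round(bertscore_f1(0.8960, 0.8987), 4))  # 0.8973 (table reports 0.8972)
```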
Experiments below were run, as reported, with epochs=10 and a dataset size of 7000 for the tuning studies.
Metrics vs embedding size
- 512: Precision 0.8048 / Recall 0.8070 / F1 0.8057 / METEOR 0.1636
- 256: Precision 0.8079 / Recall 0.7998 / F1 0.8036 / METEOR 0.1327
- 128: Precision 0.7997 / Recall 0.7907 / F1 0.7950 / METEOR 0.0913
Training time + loss vs embedding size
- 512: Train 4.204 / Val 5.682 / Time 1:26:29
- 256: Train 5.042 / Val 6.181 / Time 41:08
- 128: Train 5.646 / Val 6.399 / Time 32:10

Observation: Larger embeddings improved learning and translation quality but required more compute.
| Heads | Training Loss | Validation Loss |
|---|---|---|
| 8 | 4.204 | 5.682 |
| 4 | 4.208 | 5.661 |
| 2 | 4.264 | 5.695 |
| 1 | 4.332 | 5.512 |
More heads reduced training loss slightly, but validation loss did not consistently improve, suggesting limited generalization gains.
| Batch | Training Loss | Validation Loss |
|---|---|---|
| 16 | 3.808 | 5.509 |
| 32 | 4.204 | 5.682 |
| 64 | 4.687 | 5.830 |
The smallest batch size (16) produced the best losses; larger batches increased both training and validation loss.
| Dropout | Training Loss | Validation Loss |
|---|---|---|
| 0.1 | 4.204 | 5.682 |
| 0.01 | 3.721 | 5.620 |
| 0.001 | 3.647 | 5.637 |
| 0.0001 | 3.670 | 5.650 |
Report conclusion: dropout=0.01 offered the best balance (0.1 underfit; 0.0001 showed an overfitting tendency).
- Missed parts of the input (e.g., did not capture “Hello” in translation in one observation).
- Good BERTScore, but the relatively low METEOR indicates gaps in fluency and word choice.
- Larger embeddings/heads help but increase compute requirements.
- Sometimes produces empty translations, which affects metrics (BERTScore can become 0 for those cases).
- Testing was slow: ~1 hour 43 minutes for 269 samples.
- Improve custom Transformer decoding by using beam search instead of greedy.
- Try alternative tokenizers (e.g., T5 tokenizer) to improve translation quality.
- Speed up evaluation for large datasets; reduce empty outputs in T5 testing.