Self-contained implementation of a Transformer for neural machine translation in TensorFlow 2. The project covers tokenization, positional encodings, masking (padding & look-ahead), multi-head self-attention, encoder–decoder stacks, training, and inference/decoding.
- Tokenization & vocab (with tensorflow-text) and positional encodings.
- Masks: padding mask for loss/attention; look-ahead mask for the decoder (both helpers are sketched after this list).
- Transformer blocks: scaled dot-product attention, multi-head attention, FFN, residuals + layer norm.
- Training loop with cross-entropy + accuracy; masked loss to ignore padding.
- Inference (greedy by default; extendable to beam search).
- Reproducibility: seeds set in the notebook; notes on deterministic decoding.
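For the positional encodings and masks above, here is a minimal sketch of the helpers, following the standard sinusoidal formulation; the function names are illustrative (not necessarily the notebook's) and token id 0 is assumed to be padding:

```python
import numpy as np
import tensorflow as tf

def positional_encoding(length, depth):
    # Sinusoidal encodings: sines in the first half of the channels,
    # cosines in the second half (assumes `depth` is even).
    half = depth // 2
    positions = np.arange(length)[:, np.newaxis]      # (length, 1)
    freqs = np.arange(half)[np.newaxis, :] / half     # (1, half)
    angle_rads = positions / (10000 ** freqs)         # (length, half)
    pos_encoding = np.concatenate([np.sin(angle_rads), np.cos(angle_rads)], axis=-1)
    return tf.cast(pos_encoding, dtype=tf.float32)    # (length, depth)

def padding_mask(seq):
    # 1.0 at padding positions (token id 0), shaped to broadcast over
    # attention logits of shape (batch, heads, seq_q, seq_k).
    mask = tf.cast(tf.math.equal(seq, 0), tf.float32)
    return mask[:, tf.newaxis, tf.newaxis, :]         # (batch, 1, 1, seq_len)

def look_ahead_mask(size):
    # Strict upper triangle: position i may not attend to positions j > i.
    return 1 - tf.linalg.band_part(tf.ones((size, size)), -1, 0)
```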
- How a Transformer uses self-attention, multi-head attention, and positional encodings to model sequences (scaled dot-product attention is sketched after this list).
- Why positional encodings are needed (attention is permutation-invariant).
- How masks (padding & look-ahead) affect attention and the loss (a masked-loss sketch also follows this list).
- Encoder–decoder structure; teacher forcing during training vs. autoregressive decoding at inference.
- Practical setup (tokenization, vocabularies, training loop, decoding).
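To make the attention mechanics concrete, here is a sketch of scaled dot-product attention; it assumes the masking convention above, where 1 marks positions to hide:

```python
import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    logits = tf.matmul(q, k, transpose_b=True)   # (..., seq_q, seq_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    logits = logits / tf.math.sqrt(dk)           # scale to stabilize gradients
    if mask is not None:
        logits += mask * -1e9                    # masked positions -> ~0 weight
    weights = tf.nn.softmax(logits, axis=-1)     # rows sum to 1 over seq_k
    return tf.matmul(weights, v), weights
```

Multi-head attention runs this per head on projected slices of `d_model` and concatenates the results.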
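The masked loss works the same way: per-token cross-entropy is computed with no reduction, then zeroed wherever the target token is padding (again assuming padding id 0):

```python
import tensorflow as tf

loss_object = tf.keras.losses.SparseCategoricalCrossentropy(
    from_logits=True, reduction='none')  # keep per-token losses

def masked_loss(real, pred):
    # Average cross-entropy over non-padding target tokens only.
    mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)
    loss = loss_object(real, pred) * mask
    return tf.reduce_sum(loss) / tf.reduce_sum(mask)

def masked_accuracy(real, pred):
    # Token-level accuracy, also ignoring padding positions.
    pred_ids = tf.cast(tf.argmax(pred, axis=-1), real.dtype)
    match = tf.cast(tf.equal(real, pred_ids), tf.float32)
    mask = tf.cast(tf.math.not_equal(real, 0), tf.float32)
    return tf.reduce_sum(match * mask) / tf.reduce_sum(mask)
```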
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
jupyter lab transformer-nmt.ipynb

Pinned for stability/performance with this implementation:
tensorflow==2.14.0
tensorflow-text==2.14.0
tensorflow-datasets
numpy
matplotlib
jupyterlab
- Greedy decoding is deterministic with fixed weights and dropout disabled. To ensure repeatable translations, set seeds and avoid sampling at inference; a minimal greedy loop is sketched below.
- Swap tokenizers/vocabs and the final projection size to use a different language pair; the core architecture stays the same.
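A minimal greedy loop, assuming the model is callable as `transformer([encoder_input, decoder_input], training=False)` and returns logits of shape `(batch, target_len, vocab)`; `greedy_decode`, `start_id`, and `end_id` are illustrative names, not necessarily the notebook's:

```python
import tensorflow as tf

def greedy_decode(transformer, encoder_input, start_id, end_id, max_len=60):
    # Autoregressive decoding: feed the growing output back in and
    # take the argmax of the last step's logits each iteration.
    output = tf.constant([[start_id]], dtype=tf.int64)       # (1, 1)
    for _ in range(max_len):
        logits = transformer([encoder_input, output], training=False)
        next_id = tf.argmax(logits[:, -1:, :], axis=-1)      # (1, 1)
        output = tf.concat([output, next_id], axis=-1)
        if next_id[0, 0] == end_id:                          # stop at end token
            break
    return output
```

Beam search would replace the single argmax with the top-k continuations per step, keeping the k best partial hypotheses.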
MIT — see LICENSE.