The seq2seq model has a fundamental generation problem:
- Training loss DOES decrease properly (0.7 → 0.0022)
- Weights ARE updating correctly via backprop
- Encoder IS working (learned representations)
- BUT decoder generation gets stuck in loops repeating one word
During training: the decoder learns from full answer sequences (teacher forcing).
During inference: the decoder must predict one token at a time without ground truth.
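The train/inference mismatch can be sketched with a toy decoder. This is not the actual model: the `STEP` table stands in for learned next-token distributions, and the token names are made up. The point is that teacher forcing always feeds gold tokens, while free-running decoding feeds back its own greedy picks.

```python
# Hypothetical next-token distributions from an overfit toy decoder.
# Unknown tokens fall through to "<eos>".
STEP = {
    "<sos>":   {"a": 0.6, "chatbot": 0.4},
    "a":       {"chatbot": 0.9, "is": 0.1},
    "chatbot": {"chatbot": 0.7, "<eos>": 0.3},  # the loop lives here
}

def greedy_step(prev_token):
    """Pick the highest-probability next token (greedy argmax)."""
    dist = STEP.get(prev_token, {"<eos>": 1.0})
    return max(dist, key=dist.get)

def teacher_forced(gold):
    """Training-time view: inputs are always the gold tokens."""
    inputs = ["<sos>"] + gold[:-1]
    return [greedy_step(tok) for tok in inputs]

def free_running(max_len=6):
    """Inference-time view: the decoder consumes its own predictions."""
    out, tok = [], "<sos>"
    for _ in range(max_len):
        tok = greedy_step(tok)
        if tok == "<eos>":
            break
        out.append(tok)
    return out

# free_running() -> ['a', 'chatbot', 'chatbot', 'chatbot', 'chatbot', 'chatbot']
```

Under teacher forcing the bad `"chatbot" -> "chatbot"` transition costs one wrong prediction per sequence; in free-running mode the same transition captures the decoder forever.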
This causes an exposure bias problem:
- Model was trained on correct answers
- During inference, it only sees its own predictions
- If it predicts wrongly once, error compounds
- Collapses into a degenerate repetition loop (emitting "chatbot" forever)
With only 16 examples:
- Model quickly memorizes training data
- Decoder learns "safe" words that appear frequently
- No diversity to learn proper generation
- Overfits to repeating patterns
- Scheduled Sampling - gradually expose the model to its own predictions during training
- Beam Search - keep multiple hypotheses at decode time instead of a single greedy argmax
- Attention Mechanism - let the decoder attend over encoder outputs at every step
- More training data - 1000+ examples so the decoder sees real diversity
- Pre-trained Models - fine-tune GPT- or BERT-style models instead of training from scratch
- Retrieval + Ranking - find similar Q&A pairs and rank candidate responses
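Scheduled Sampling, the first fix above, can be sketched as follows. This is a minimal illustration, not the project's code: `model_predict` stands in for the trained decoder step, and the linear ramp is one common schedule among several (inverse sigmoid and exponential decay are the others from the original paper).

```python
import random

def scheduled_sampling_inputs(gold, epoch, total_epochs, model_predict, rng=random):
    """Build decoder inputs for one sequence, mixing gold tokens with the
    model's own predictions. Early epochs behave like teacher forcing;
    late epochs behave like free-running inference."""
    p_model = epoch / total_epochs  # 0.0 -> all gold, 1.0 -> all model
    inputs, prev = [], "<sos>"
    for gold_tok in gold:
        inputs.append(prev)
        pred = model_predict(prev)
        # With probability p_model, feed the model's own prediction next.
        prev = pred if rng.random() < p_model else gold_tok
    return inputs
```

At `epoch=0` this reduces exactly to teacher forcing; at `epoch=total_epochs` the decoder trains on its own outputs, so it gets to see (and recover from) its own mistakes before inference.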
This demonstrates a REAL neural chatbot implementation, but with the realistic limitations of a tiny dataset and a naive decoding loop. The architecture is correct, but it needs:
- Better training strategy
- Better decoding algorithm
- More/better data