Training defaults to windowed proteins with overlap, while evaluation defaults to full proteins.
Evidence:
- src/pepseqpred/core/data/proteindataset.py: _iter_windows supports window_size and stride.
- scripts/hpc/trainffnn.sh: WINDOW_SIZE=1000, STRIDE=900.
- src/pepseqpred/apps/evaluate_ffnn_cli.py: evaluation constructs datasets with window_size=None and pad_last_window=False.
Why this can hurt:
- With WINDOW_SIZE=1000 and STRIDE=900, consecutive windows share 100 residues, so those residues appear twice during training and validation.
- Validation threshold selection is based on windowed validation arrays, but final evaluation is on full proteins.
- Overlap duplicates can overweight boundary regions or long proteins.
- If there are very long multi-pathogen proteins, length and overlap can distort training and metrics.
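To make the duplication concrete, here is a minimal sketch that counts how many residues a single protein yields under a given window_size and stride. The helper names window_spans and duplication_stats are hypothetical, and the windowing rule is an assumption (last window truncated at the sequence end, no padding); the real _iter_windows may handle the tail differently.

```python
from typing import Dict, Iterator, Tuple


def window_spans(length: int, window_size: int, stride: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, end) spans, end exclusive, covering `length` residues.

    Hypothetical stand-in for _iter_windows. Assumption: the final window
    is truncated at the sequence end rather than padded.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    start = 0
    while True:
        end = min(start + window_size, length)
        yield start, end
        if end >= length:
            break
        start += stride


def duplication_stats(length: int, window_size: int = 1000, stride: int = 900) -> Dict[str, int]:
    """Compare unique residues with residues actually yielded by windowing."""
    spans = list(window_spans(length, window_size, stride))
    yielded = sum(end - start for start, end in spans)
    return {
        "windows": len(spans),
        "unique_residues": length,
        "yielded_residues": yielded,
        "duplicated_residues": yielded - length,
    }


if __name__ == "__main__":
    # Defaults from trainffnn.sh: 100 residues of overlap between windows.
    print(duplication_stats(2500))
    # Non-overlapping ablation: stride == window_size.
    print(duplication_stats(2500, window_size=1000, stride=1000))
```

Under these assumptions a 2500-residue protein yields 3 windows and 2700 residues with the defaults, so 200 residues are counted twice. With stride == window_size the same protein yields exactly 2500 residues, and a protein of 1000 residues or fewer yields a single window with no duplication either way.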
Planning direction:
- Run an ablation with --window-size 0 where feasible.
- Run an ablation with non-overlapping windows, for example stride == window_size.
- Add metrics that count unique proteins/residues separately from yielded training residues.
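The unique-versus-yielded metric could be sketched roughly as follows. This is an illustration only: residue_count_metrics is a hypothetical helper, and it assumes each yielded window can be described as a (protein_id, start, end) tuple with end exclusive, which may not match how the dataset actually exposes its windows.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def residue_count_metrics(windows: Iterable[Tuple[str, int, int]]) -> Dict[str, int]:
    """Count unique proteins/residues separately from yielded residues.

    Assumption: each window is (protein_id, start, end), end exclusive.
    """
    seen: Dict[str, Set[int]] = defaultdict(set)  # positions covered per protein
    yielded = 0
    n_windows = 0
    for protein_id, start, end in windows:
        n_windows += 1
        yielded += end - start
        seen[protein_id].update(range(start, end))
    unique = sum(len(positions) for positions in seen.values())
    return {
        "unique_proteins": len(seen),
        "windows": n_windows,
        "unique_residues": unique,
        "yielded_residues": yielded,
    }


if __name__ == "__main__":
    example = [
        ("P1", 0, 1000),
        ("P1", 900, 1900),
        ("P1", 1800, 2500),
        ("P2", 0, 300),
    ]
    print(residue_count_metrics(example))
```

For the example windows this reports 2 unique proteins, 4 windows, 2800 unique residues, and 3000 yielded residues. Storing a position set per protein is simple but memory-hungry; for large datasets, tracking merged intervals per protein would be a lighter alternative.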