The default split type keeps families/groups intact, which is good for leakage control. However, folds are assigned primarily by group size, not by positive/negative support or source balance.
Evidence:
- In src/pepseqpred/core/train/split.py, split_ids_grouped and build_grouped_kfold_splits keep groups intact.
- Grouped k-fold assigns larger groups first to balance fold size.
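For reference, a minimal sketch of the size-only greedy assignment described above. This is not the code in split.py. The function name assign_groups_by_size and the input shape (protein ID mapped to group ID) are assumptions made for illustration.

```python
from collections import defaultdict


def assign_groups_by_size(group_of, n_folds):
    """Greedy grouped k-fold: largest groups first, each into the smallest fold.

    group_of maps protein ID -> group/family ID (hypothetical input shape).
    Returns a dict mapping fold index -> sorted list of protein IDs.
    """
    members = defaultdict(list)
    for protein_id, group_id in group_of.items():
        members[group_id].append(protein_id)

    folds = {i: [] for i in range(n_folds)}
    # Largest groups first; ties broken by group ID so the result is deterministic.
    for group_id in sorted(members, key=lambda g: (-len(members[g]), g)):
        # Pick the fold with the fewest proteins so far (lowest index on ties).
        target = min(folds, key=lambda i: (len(folds[i]), i))
        folds[target].extend(members[group_id])
    return {i: sorted(ids) for i, ids in folds.items()}


group_of = {
    "p1": "famA", "p2": "famA", "p3": "famA",
    "p4": "famB", "p5": "famB",
    "p6": "famC",
}
folds = assign_groups_by_size(group_of, n_folds=2)
print(folds)
```

Note that nothing in this procedure looks at labels or sources. Fold sizes come out even, but positive rate per fold is left to chance.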
Why this can hurt:
- Folds can have very different positive rates.
- Some validation folds may contain source/pathogen groups not meaningfully represented in training.
- Metrics can look much worse than expected if the split is actually a difficult cross-family generalization test.
Planning direction:
- Add split reports with per-fold:
  - protein count
  - valid residue count
  - positive residue count
  - negative residue count
  - positive rate
  - source/pathogen/family counts
- Consider grouped stratified splitting where possible.
- Preserve leakage safety, but balance label support more deliberately.
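A possible starting point for both items, sketched under stated assumptions. The names residue_counts, fold_report, and assign_groups_label_aware are hypothetical and do not exist in the repo. Labels are assumed to be per-residue lists with 1 = positive, 0 = negative, and None = masked. The assignment is a simple greedy heuristic, not a full stratified solver: groups stay whole, and each group goes to the fold where it adds least to the sum of squared positive and negative shares.

```python
from collections import Counter


def residue_counts(protein_ids, labels):
    """Return (positive, negative) residue counts; None marks a masked residue."""
    pos = sum(lab == 1 for pid in protein_ids for lab in labels[pid])
    neg = sum(lab == 0 for pid in protein_ids for lab in labels[pid])
    return pos, neg


def fold_report(protein_ids, labels, source_of):
    """Per-fold summary with the fields listed in the planning notes."""
    pos, neg = residue_counts(protein_ids, labels)
    valid = pos + neg
    return {
        "proteins": len(protein_ids),
        "valid_residues": valid,
        "positive_residues": pos,
        "negative_residues": neg,
        "positive_rate": pos / valid if valid else float("nan"),
        "source_counts": dict(Counter(source_of[pid] for pid in protein_ids)),
    }


def assign_groups_label_aware(groups, labels, n_folds):
    """Greedy grouped assignment that keeps groups whole while balancing labels."""
    stats = {g: residue_counts(pids, labels) for g, pids in groups.items()}
    total_pos = sum(p for p, _ in stats.values()) or 1
    total_neg = sum(n for _, n in stats.values()) or 1
    fold_pos, fold_neg = [0] * n_folds, [0] * n_folds
    folds = {i: [] for i in range(n_folds)}

    # Groups with the most valid residues go first. Each goes to the fold where
    # it adds least to the sum of squared positive and negative shares.
    for g in sorted(groups, key=lambda g: (-sum(stats[g]), g)):
        p, n = stats[g]
        target = min(
            range(n_folds),
            key=lambda i: (fold_pos[i] * p / total_pos**2
                           + fold_neg[i] * n / total_neg**2, i),
        )
        folds[target].extend(groups[g])
        fold_pos[target] += p
        fold_neg[target] += n
    return {i: sorted(ids) for i, ids in folds.items()}


# Toy data (made up for illustration only).
labels = {
    "a1": [1, 1, 1, 1, 0, 0],
    "b1": [1, 1, 1, 0, 0],
    "c1": [0, 0, 0, 0, None],
    "d1": [0, 0, 0, 1],
}
groups = {"famA": ["a1"], "famB": ["b1"], "famC": ["c1"], "famD": ["d1"]}
source_of = {"a1": "virusX", "b1": "virusX", "c1": "bacteriumY", "d1": "bacteriumY"}

folds = assign_groups_label_aware(groups, labels, n_folds=2)
reports = {i: fold_report(ids, labels, source_of) for i, ids in folds.items()}
for i, report in reports.items():
    print(i, report)
```

On this toy data each fold receives four positive residues and a mix of both sources. The same fold_report can be run on the existing size-based folds first, to check how uneven they actually are before changing the splitter.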