Grouped splits prevent family leakage but are not label-stratified

The default split type keeps families/groups intact, which is good for leakage control. However, folds are assigned primarily by group size, not by positive/negative support or source balance.

Evidence:

- `src/pepseqpred/core/train/split.py`
  - `split_ids_grouped` and `build_grouped_kfold_splits` keep groups intact.
  - Grouped k-fold assigns larger groups first to balance fold size.
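For reference, a minimal sketch of what "larger groups first" assignment typically looks like. This is not the code in `split.py`; the function name `greedy_grouped_folds` and the `group_of` input shape are hypothetical, chosen only to illustrate that label counts never enter the decision.

```python
from collections import defaultdict


def greedy_grouped_folds(group_of, n_folds):
    """Assign whole groups to folds: largest group first, into the smallest fold.

    group_of: dict mapping protein ID -> group/family ID (hypothetical shape).
    Returns a dict mapping protein ID -> fold index.
    """
    members = defaultdict(list)
    for protein_id, group_id in group_of.items():
        members[group_id].append(protein_id)

    # Largest groups first; ties broken by group ID so the result is deterministic.
    ordered = sorted(members.items(), key=lambda kv: (-len(kv[1]), str(kv[0])))

    fold_sizes = [0] * n_folds
    fold_of = {}
    for group_id, proteins in ordered:
        # Pick the currently smallest fold (lowest index wins ties).
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        for protein_id in proteins:
            fold_of[protein_id] = target
        fold_sizes[target] += len(proteins)
    return fold_of


group_of = {
    "p1": "famA", "p2": "famA", "p3": "famA",
    "p4": "famB", "p5": "famB",
    "p6": "famC",
}
fold_of = greedy_grouped_folds(group_of, n_folds=2)
```

In this toy input both folds end up with three proteins and no family is split, but nothing in the procedure looks at labels or sources.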

Why this can hurt:

- Positive rates can differ widely between folds, because fold assignment does not account for label counts.
- Some validation folds may contain source/pathogen groups not meaningfully represented in training.
- Metrics can look much worse than expected if the split is actually a difficult cross-family generalization test.
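The second point can be checked directly. A small sketch, assuming a per-protein source label is available: for each fold used as validation, list the sources that never appear in the corresponding training folds. The names `unseen_sources_per_fold`, `fold_of`, and `source_of` are hypothetical and the data is a toy example.

```python
def unseen_sources_per_fold(fold_of, source_of, n_folds):
    """For each validation fold, list sources that are absent from its training folds.

    fold_of: dict protein ID -> fold index.
    source_of: dict protein ID -> source/pathogen label.
    """
    result = {}
    for val_fold in range(n_folds):
        train_sources = {source_of[p] for p, f in fold_of.items() if f != val_fold}
        val_sources = {source_of[p] for p, f in fold_of.items() if f == val_fold}
        result[val_fold] = sorted(val_sources - train_sources)
    return result


# Toy data: the only virus_x protein lands in fold 0.
fold_of = {"p1": 0, "p2": 0, "p3": 1, "p4": 1}
source_of = {
    "p1": "virus_x",
    "p2": "bacterium_y",
    "p3": "bacterium_y",
    "p4": "bacterium_y",
}
unseen = unseen_sources_per_fold(fold_of, source_of, n_folds=2)
```

Here fold 0 is validated on a source that training never saw. Complete absence is the strictest case; a count threshold would cover "not meaningfully represented" more generally.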

Planning direction:

- Add split reports with per-fold:
  - protein count
  - valid residue count
  - positive residue count
  - negative residue count
  - positive rate
  - source/pathogen/family counts
- Consider grouped stratified splitting where possible.
- Preserve leakage safety, but balance label support more deliberately.
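A rough sketch of how the report and a label-aware grouped assignment could fit together. Everything here is an assumption for illustration: the record fields (`group`, `source`, `pos`, `neg`), the function names, and the cost function are hypothetical and not taken from `split.py`. It also assumes every valid residue is labeled either positive or negative, so the valid residue count is `pos + neg`.

```python
from collections import Counter, defaultdict


def assign_groups_label_aware(proteins, n_folds):
    """Greedy grouped assignment that balances positive and negative residues.

    Groups are never split, so leakage safety is preserved.
    Returns a dict mapping group ID -> fold index.
    """
    counts = defaultdict(lambda: [0, 0])  # group ID -> [positives, negatives]
    for rec in proteins:
        counts[rec["group"]][0] += rec["pos"]
        counts[rec["group"]][1] += rec["neg"]

    target_pos = sum(c[0] for c in counts.values()) / n_folds
    target_neg = sum(c[1] for c in counts.values()) / n_folds
    fold_pos = [0] * n_folds
    fold_neg = [0] * n_folds

    def imbalance(pos, neg):
        # Sum of squared relative deviations from the ideal per-fold counts.
        return sum(
            ((p - target_pos) / max(target_pos, 1)) ** 2
            + ((n - target_neg) / max(target_neg, 1)) ** 2
            for p, n in zip(pos, neg)
        )

    # Largest groups first; ties broken by positive count, then group ID.
    ordered = sorted(
        counts.items(),
        key=lambda kv: (-(kv[1][0] + kv[1][1]), -kv[1][0], str(kv[0])),
    )

    fold_of_group = {}
    for group_id, (pos, neg) in ordered:
        def cost_if_placed(f):
            trial_pos = fold_pos[:f] + [fold_pos[f] + pos] + fold_pos[f + 1:]
            trial_neg = fold_neg[:f] + [fold_neg[f] + neg] + fold_neg[f + 1:]
            return imbalance(trial_pos, trial_neg)

        best = min(range(n_folds), key=cost_if_placed)
        fold_of_group[group_id] = best
        fold_pos[best] += pos
        fold_neg[best] += neg
    return fold_of_group


def fold_report(proteins, fold_of_group, n_folds):
    """Build one report row per fold with the counts listed in the plan above."""
    rows = []
    for f in range(n_folds):
        members = [r for r in proteins if fold_of_group[r["group"]] == f]
        pos = sum(r["pos"] for r in members)
        neg = sum(r["neg"] for r in members)
        valid = pos + neg  # assumption: valid residues are exactly pos + neg
        rows.append({
            "fold": f,
            "proteins": len(members),
            "valid_residues": valid,
            "positive_residues": pos,
            "negative_residues": neg,
            "positive_rate": pos / valid if valid else float("nan"),
            "source_counts": dict(Counter(r["source"] for r in members)),
        })
    return rows


# Toy data: four single-protein families with equal size but different label mix.
proteins = [
    {"id": "p1", "group": "famA", "source": "virus_x", "pos": 30, "neg": 70},
    {"id": "p2", "group": "famB", "source": "bacterium_y", "pos": 5, "neg": 95},
    {"id": "p3", "group": "famC", "source": "virus_x", "pos": 25, "neg": 75},
    {"id": "p4", "group": "famD", "source": "bacterium_y", "pos": 10, "neg": 90},
]
fold_of_group = assign_groups_label_aware(proteins, n_folds=2)
report = fold_report(proteins, fold_of_group, n_folds=2)
```

The greedy cost is only one possible heuristic. If scikit-learn is an acceptable dependency, `StratifiedGroupKFold` is an existing alternative, though it needs a single class label per sample, so per-residue labels would first have to be summarized per protein (for example by binning the positive rate).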


Grouped splits prevent family leakage but are not label-stratified #83
