Grouped splits prevent family leakage but are not label-stratified #83

@jeffreyHoelzel

Description

The default split type keeps families/groups intact, which is good for leakage control. However, folds are assigned primarily by group size, not by positive/negative support or source balance.

Evidence:

  • src/pepseqpred/core/train/split.py
    • split_ids_grouped and build_grouped_kfold_splits keep groups intact.
    • Grouped k-fold assigns larger groups first to balance fold size.

Why this can hurt:

  • Folds can have very different positive rates.
  • Some validation folds may contain source/pathogen groups not meaningfully represented in training.
  • Metrics can look much worse than expected if the split is actually a difficult cross-family generalization test.
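To make the concern concrete, here is a minimal sketch of a largest-first, size-balancing group assignment. It is not the code in `split.py`: the function name `assign_groups_by_size` and the family statistics are hypothetical, chosen only to show how folds can end up similar in size while differing widely in positive rate.

```python
def assign_groups_by_size(group_sizes, n_folds):
    """Greedy sketch: largest group first, always into the currently smallest fold."""
    fold_sizes = [0] * n_folds
    fold_of_group = {}
    # Sort by size (descending); tie-break on the group name for determinism.
    for group, size in sorted(group_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        fold_of_group[group] = target
        fold_sizes[target] += size
    return fold_of_group, fold_sizes


# Hypothetical per-family (valid residue count, positive residue count).
stats = {
    "famA": (5000, 100),
    "famB": (3000, 900),
    "famC": (2000, 600),
    "famD": (2000, 40),
    "famE": (1000, 300),
}

sizes = {group: n_valid for group, (n_valid, _) in stats.items()}
fold_of_group, fold_sizes = assign_groups_by_size(sizes, n_folds=2)

# Labels were never consulted above, so positives land wherever their group went.
fold_pos = [0, 0]
for group, (_, n_pos) in stats.items():
    fold_pos[fold_of_group[group]] += n_pos
pos_rates = [pos / size for pos, size in zip(fold_pos, fold_sizes)]

print(fold_sizes)  # [7000, 6000]
print(pos_rates)   # [0.02, 0.3]
```

In this toy case the folds differ by about 15% in size but by a factor of 15 in positive rate, which is the kind of gap a per-fold report would surface.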

Planning direction:

  • Add split reports with per-fold:
    • protein count
    • valid residue count
    • positive residue count
    • negative residue count
    • positive rate
    • source/pathogen/family counts
  • Consider grouped stratified splitting where possible.
  • Preserve leakage safety, but balance label support more deliberately.
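One possible shape for the planning items above, as a sketch under stated assumptions: protein records are plain dicts with hypothetical keys (`group`, `source`, `n_valid`, `n_pos`), and `assign_groups_label_aware` and `fold_report` are illustrative names, not existing functions in the repository. The assignment is a simple greedy heuristic that keeps every group in a single fold and penalizes folds that are already heavy in either residues or positives. scikit-learn's `StratifiedGroupKFold` is an off-the-shelf alternative, but it stratifies on one class label per sample, so residue-level labels would first need to be summarized per protein.

```python
from collections import Counter, defaultdict


def assign_groups_label_aware(group_stats, n_folds):
    """Greedy heuristic: groups stay whole; balance residues and positives jointly."""
    target_n = sum(n for n, _ in group_stats.values()) / n_folds
    target_pos = sum(p for _, p in group_stats.values()) / n_folds
    fold_n, fold_pos = [0] * n_folds, [0] * n_folds
    fold_of_group = {}

    def cost(f, n, p):
        # Squared load relative to the per-fold targets, after adding the group.
        return ((fold_n[f] + n) / (target_n or 1)) ** 2 + (
            (fold_pos[f] + p) / (target_pos or 1)
        ) ** 2

    # Place positive-rich groups first, since they are the hardest to balance.
    order = sorted(group_stats.items(), key=lambda kv: (-kv[1][1], -kv[1][0], kv[0]))
    for group, (n, p) in order:
        best = min(range(n_folds), key=lambda f: cost(f, n, p))
        fold_of_group[group] = best
        fold_n[best] += n
        fold_pos[best] += p
    return fold_of_group


def fold_report(proteins, fold_of_group, n_folds):
    """Per-fold protein, residue, label, and source counts."""
    report = [
        {"proteins": 0, "valid": 0, "pos": 0, "sources": Counter()}
        for _ in range(n_folds)
    ]
    for p in proteins:
        row = report[fold_of_group[p["group"]]]
        row["proteins"] += 1
        row["valid"] += p["n_valid"]
        row["pos"] += p["n_pos"]
        row["sources"][p["source"]] += 1
    for row in report:
        row["neg"] = row["valid"] - row["pos"]
        row["pos_rate"] = row["pos"] / row["valid"] if row["valid"] else float("nan")
    return report


# Hypothetical records; famA has two proteins that must stay together.
proteins = [
    {"id": "p1", "group": "famA", "source": "virus", "n_valid": 3000, "n_pos": 60},
    {"id": "p2", "group": "famA", "source": "virus", "n_valid": 2000, "n_pos": 40},
    {"id": "p3", "group": "famB", "source": "bacteria", "n_valid": 3000, "n_pos": 900},
    {"id": "p4", "group": "famC", "source": "virus", "n_valid": 2000, "n_pos": 600},
    {"id": "p5", "group": "famD", "source": "bacteria", "n_valid": 2000, "n_pos": 40},
    {"id": "p6", "group": "famE", "source": "virus", "n_valid": 1000, "n_pos": 300},
]

# Aggregate residue and positive counts per group before assigning folds.
group_stats = defaultdict(lambda: [0, 0])
for p in proteins:
    group_stats[p["group"]][0] += p["n_valid"]
    group_stats[p["group"]][1] += p["n_pos"]

fold_of_group = assign_groups_label_aware(group_stats, n_folds=2)
report = fold_report(proteins, fold_of_group, n_folds=2)
for i, row in enumerate(report):
    print(i, row["proteins"], row["valid"], row["pos"], row["neg"],
          f"{row['pos_rate']:.3f}", dict(row["sources"]))
```

On this toy data the two folds come out at positive rates of 0.125 and 0.188. The trade-off is visible in the report: label support is closer, but fold sizes are less even (8000 vs. 5000 valid residues), so the relative weight of the two cost terms would likely need tuning on real data.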
