
Label/objective mismatch for residue-level prediction #77

@jeffreyHoelzel

Current label generation expands peptide labels across every residue in a peptide alignment window. If a peptide is reactive, every residue in that peptide window becomes a positive residue.

Evidence:

  • src/pepseqpred/core/labels/builder.py
    • _build_labels_for_protein sets def_mask[start:stop] = True for each definite-epitope peptide.
    • The FFNN is then trained with a residue-level BCE loss on those expanded labels.
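
The expansion described above can be sketched roughly as follows. This is a minimal illustration and not the actual code in builder.py: the function name `expand_peptide_labels`, the `(start, stop, is_reactive)` tuple layout, and the half-open interval convention are all assumptions, and the toy coordinates are made up.

```python
import numpy as np


def expand_peptide_labels(protein_len, peptides):
    """Hypothetical stand-in for the expansion step in _build_labels_for_protein.

    peptides: iterable of (start, stop, is_reactive) with half-open [start, stop)
    residue coordinates. Every residue inside a reactive peptide becomes positive.
    """
    def_mask = np.zeros(protein_len, dtype=bool)
    for start, stop, is_reactive in peptides:
        if is_reactive:
            # The whole alignment window is marked positive, not just the epitope.
            def_mask[start:stop] = True
    return def_mask


# Toy protein of 60 residues tiled by three overlapping 30-residue peptides;
# only the middle peptide is reactive (made-up example data).
peptides = [(0, 30, False), (15, 45, True), (30, 60, False)]
mask = expand_peptide_labels(60, peptides)
print(int(mask.sum()), "of", mask.size, "residues labelled positive")
```

In this sketch, non-reactive peptides that overlap a reactive one do not clear any positives; whether the real builder behaves that way is not stated here.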

Why this can hurt:

  • A reactive peptide means "the peptide contains an epitope signal", not necessarily "every residue in this peptide is part of an epitope".
  • This creates dense false-positive residue labels within positive peptides.
  • The model is optimized for residue-wise correctness under noisy residue labels, while downstream use may care about peptide/protein regions or sparse true epitopes.
  • The issue can compound in multi-pathogen data if peptide lengths, overlap density, or labeling criteria differ by source.
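
To make the false-positive concern concrete, here is a small back-of-the-envelope calculation. The peptide and epitope lengths are hypothetical numbers chosen only for illustration, not values measured from our data, and the helper name is made up.

```python
def false_positive_fraction(peptide_len, epitope_len):
    """Fraction of expanded positive labels that fall outside the true epitope.

    Assumes a single contiguous epitope of epitope_len residues lying fully
    inside a reactive peptide of peptide_len residues.
    """
    if not 0 < epitope_len <= peptide_len:
        raise ValueError("epitope_len must be in (0, peptide_len]")
    return (peptide_len - epitope_len) / peptide_len


# Hypothetical example: an 8-residue epitope inside a 30-residue peptide.
frac = false_positive_fraction(peptide_len=30, epitope_len=8)
print(f"{frac:.2f} of the positive residue labels would be wrong")
```

Under these assumed lengths, most positive residue labels in a reactive peptide would be noise, which is the density effect the bullets above describe.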

Planning direction:

  • Add diagnostics comparing peptide-level labels to residue-level expansion density.
  • Consider a multiple-instance learning objective, peptide-window objective, or soft/weak residue labels.
  • If residue labels remain, consider down-weighting positive residues within reactive peptide spans or using boundary-aware smoothing.
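
The directions above could be prototyped along these lines. This is a NumPy-only sketch under stated assumptions, not a proposal for the final implementation: all function names, the `(start, stop, is_reactive)` tuple layout, the noisy-OR pooling choice, and the `pos_weight=0.3` default are hypothetical, and the toy probabilities are made up.

```python
import numpy as np


def expansion_density(def_mask, peptides):
    """Diagnostic: fraction of positive residues vs. fraction of reactive peptides."""
    residue_pos = float(np.mean(def_mask))
    peptide_pos = float(np.mean([bool(r) for _, _, r in peptides]))
    return residue_pos, peptide_pos


def noisy_or_bag_loss(residue_probs, peptides, eps=1e-7):
    """MIL-style loss: a peptide (bag) is positive if at least one residue is.

    The bag probability uses noisy-OR pooling, 1 - prod(1 - p_i), and the
    loss is BCE against the peptide-level label.
    """
    losses = []
    for start, stop, is_reactive in peptides:
        p = np.clip(residue_probs[start:stop], eps, 1 - eps)
        bag_prob = np.clip(1.0 - np.prod(1.0 - p), eps, 1 - eps)
        y = 1.0 if is_reactive else 0.0
        losses.append(-(y * np.log(bag_prob) + (1 - y) * np.log(1 - bag_prob)))
    return float(np.mean(losses))


def weighted_residue_bce(residue_probs, def_mask, pos_weight=0.3, eps=1e-7):
    """Residue-level BCE with positives inside reactive spans down-weighted."""
    p = np.clip(residue_probs, eps, 1 - eps)
    y = def_mask.astype(float)
    per_residue = -(pos_weight * y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(per_residue.mean())


# Toy data: three non-overlapping 30-residue peptides, the middle one reactive.
peptides = [(0, 30, False), (30, 60, True), (60, 90, False)]
def_mask = np.zeros(90, dtype=bool)
def_mask[30:60] = True

# A sparse prediction: low background, one short high-probability stretch.
sparse_probs = np.full(90, 0.01)
sparse_probs[40:46] = 0.9
flat_probs = np.full(90, 0.01)  # predicts "no epitope" everywhere

residue_density, peptide_density = expansion_density(def_mask, peptides)
sparse_loss = noisy_or_bag_loss(sparse_probs, peptides)
flat_loss = noisy_or_bag_loss(flat_probs, peptides)
soft_bce = weighted_residue_bce(sparse_probs, def_mask, pos_weight=0.3)
hard_bce = weighted_residue_bce(sparse_probs, def_mask, pos_weight=1.0)
```

The bag-level loss only asks that a reactive peptide contain some high-probability residue, so a sparse prediction is not penalized for leaving the rest of the window low. Noisy-OR pooling is one option among several (max or log-sum-exp pooling are common alternatives), and overlapping reactive and non-reactive windows would need extra care that this sketch does not address.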

Disclaimer: We may not be able to resolve this with the data we have, but this issue outlines some possible directions to consider.
