Label/objective mismatch for residue-level prediction

Current label generation expands peptide labels across every residue in a peptide alignment window. If a peptide is reactive, every residue in that peptide window becomes a positive residue.

Evidence:

- `src/pepseqpred/core/labels/builder.py`
  - `_build_labels_for_protein` marks `def_mask[start:stop] = True` for definite epitope peptides.
  - The FFNN is then trained with a residue-level BCE loss on those expanded residue labels.
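
To make the expansion concrete, here is a minimal sketch of the behavior described above. It is not the code in `builder.py`: the function name `expand_peptide_labels`, the `(start, stop, is_reactive)` tuple layout, and the half-open window convention are assumptions for illustration only.

```python
import numpy as np

def expand_peptide_labels(protein_len, peptides):
    """Expand peptide-level labels to residue-level labels.

    peptides: iterable of (start, stop, is_reactive) with a half-open
    window [start, stop). This layout is a hypothetical stand-in for
    whatever the real builder consumes.
    """
    def_mask = np.zeros(protein_len, dtype=bool)
    for start, stop, is_reactive in peptides:
        if is_reactive:
            # Every residue in a reactive window becomes positive,
            # regardless of where the epitope actually sits.
            def_mask[start:stop] = True
    return def_mask

# One non-reactive and one reactive 10-mer on a 20-residue protein.
mask = expand_peptide_labels(20, [(0, 10, False), (5, 15, True)])
print(mask.astype(int))
```

In this toy case residues 5 through 14 are all labeled positive, including residues 5 through 9, which also belong to the non-reactive peptide.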

Why this can hurt:

- A reactive peptide means "the peptide contains an epitope signal", not necessarily "every residue in this peptide is part of an epitope".
- The expansion therefore likely produces dense false-positive residue labels inside reactive peptides.
- The model is optimized for residue-wise correctness under noisy residue labels, while downstream use may care about peptide/protein regions or sparse true epitopes.
- The issue can compound in multi-pathogen data if peptide lengths, overlap density, or labeling criteria differ by source.
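
One way to gauge how much the expansion inflates positives is to compare the fraction of reactive peptides with the fraction of covered residues that end up positive. The sketch below assumes the same hypothetical `(start, stop, is_reactive)` layout with half-open windows; `expansion_density` is an illustrative name, not an existing function in the repository. Running it separately per data source would show whether the inflation differs between pathogens.

```python
import numpy as np

def expansion_density(protein_len, peptides):
    """Return (peptide_positive_rate, residue_positive_rate) for one protein.

    peptides: list of (start, stop, is_reactive), half-open windows
    (hypothetical layout, for illustration).
    """
    covered = np.zeros(protein_len, dtype=bool)
    positive = np.zeros(protein_len, dtype=bool)
    n_reactive = 0
    for start, stop, is_reactive in peptides:
        covered[start:stop] = True
        if is_reactive:
            positive[start:stop] = True
            n_reactive += 1
    peptide_rate = n_reactive / len(peptides) if peptides else 0.0
    n_covered = int(covered.sum())
    # Only residues covered by at least one peptide count in the denominator.
    residue_rate = float(positive.sum()) / n_covered if n_covered else 0.0
    return peptide_rate, residue_rate

# Four peptides on a 30-residue protein, one of them reactive.
peptide_rate, residue_rate = expansion_density(
    30, [(0, 10, False), (5, 15, True), (10, 20, False), (20, 30, False)]
)
print(peptide_rate, residue_rate)
```

Here one reactive peptide out of four (25%) labels 10 of 30 covered residues (about 33%) as positive, because its window is counted in full even where it overlaps non-reactive peptides.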

Planning direction:

- Add diagnostics comparing peptide-level labels to residue-level expansion density.
- Consider a multiple-instance learning objective, peptide-window objective, or soft/weak residue labels.
- If residue labels remain, consider down-weighting positive residues within reactive peptide spans or using boundary-aware smoothing.
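
As a rough illustration of the multiple-instance option, the sketch below scores a peptide with noisy-OR pooling over its residue probabilities and applies BCE at the peptide level. This is one possible MIL formulation, not a decision for this project; the function name and the choice of noisy-OR (rather than, say, max pooling) are assumptions. It uses NumPy only to show the arithmetic and would need a differentiable rewrite in the training framework.

```python
import numpy as np

def noisy_or_peptide_loss(residue_probs, peptide_label, eps=1e-7):
    """Peptide-level BCE under noisy-OR pooling of residue probabilities.

    P(peptide reactive) = 1 - prod_i (1 - p_i), so a single confident
    residue is enough to explain a reactive peptide.
    """
    p = np.clip(np.asarray(residue_probs, dtype=float), eps, 1.0 - eps)
    # log P(no residue in the peptide is an epitope residue)
    log_none = float(np.sum(np.log1p(-p)))
    if peptide_label:
        p_peptide = np.clip(1.0 - np.exp(log_none), eps, 1.0 - eps)
        return float(-np.log(p_peptide))
    return -log_none

# A reactive peptide explained by one confident residue vs. none.
loss_sparse = noisy_or_peptide_loss([0.01, 0.01, 0.95, 0.01], True)
loss_flat = noisy_or_peptide_loss([0.01, 0.01, 0.01, 0.01], True)
# A non-reactive peptide with uniformly low residue scores.
loss_negative = noisy_or_peptide_loss([0.01, 0.01, 0.01, 0.01], False)
print(loss_sparse, loss_flat, loss_negative)
```

Unlike residue-wise BCE on expanded labels, this loss does not penalize low scores on the other residues of a reactive peptide. One caveat: noisy-OR tends to push long peptides toward positive, so the pooling function would need to be checked against the peptide lengths in the data.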

**Disclaimer:** We may not be able to resolve this with the data we have, but this issue outlines some possible directions to consider.

Label/objective mismatch for residue-level prediction #77
