Current label generation expands peptide labels across every residue in a peptide alignment window. If a peptide is reactive, every residue in that peptide window becomes a positive residue.
Evidence:
- src/pepseqpred/core/labels/builder.py: _build_labels_for_protein marks def_mask[start:stop] = True for definite epitope peptides.
- The FFNN is then trained with residue-level BCE (binary cross-entropy) on those expanded residue labels.
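To make the expansion concrete, here is a minimal sketch of the behavior described above. It is not the code in builder.py: the function name expand_peptide_labels, the half-open (start, stop) span convention, and the toy sizes are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def expand_peptide_labels(
    protein_len: int, reactive_spans: List[Tuple[int, int]]
) -> List[bool]:
    """Mark every residue inside any reactive peptide span as positive.

    Hypothetical stand-in for the def_mask[start:stop] = True pattern.
    Spans are assumed to be half-open [start, stop) residue indices.
    """
    mask = [False] * protein_len
    for start, stop in reactive_spans:
        # Clip spans to the protein so out-of-range alignments are ignored.
        start, stop = max(0, start), min(protein_len, stop)
        for i in range(start, stop):
            mask[i] = True
    return mask


# Toy protein of 50 residues with two overlapping 30-residue reactive peptides.
mask = expand_peptide_labels(50, [(0, 30), (15, 45)])
positive_fraction = sum(mask) / len(mask)
print(positive_fraction)  # 0.9
```

In this toy case 45 of 50 residues end up positive, even though the true epitope inside each peptide could be much shorter than the peptide itself.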
Why this can hurt:
- A reactive peptide means "the peptide contains an epitope signal", not necessarily "every residue in this peptide is epitope".
- This creates dense false-positive residue labels within positive peptides.
- The model is optimized for residue-wise correctness under noisy residue labels, while downstream use may care about peptide/protein regions or sparse true epitopes.
- The issue can compound in multi-pathogen data if peptide lengths, overlap density, or labeling criteria differ by source.
Planning direction:
- Add diagnostics comparing peptide-level labels to residue-level expansion density.
- Consider a multiple-instance learning objective, peptide-window objective, or soft/weak residue labels.
- If residue labels remain, consider down-weighting positive residues within reactive peptide spans or using boundary-aware smoothing.
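Two of the directions above can be sketched in a few lines. This is a rough illustration under stated assumptions, not a proposed implementation: all function names are hypothetical, spans are assumed to be half-open (start, stop) indices, and noisy-OR pooling is only one of several possible multiple-instance objectives. A real version would be written in whatever framework the FFNN training loop uses.

```python
import math
from typing import Sequence, Tuple

Span = Tuple[int, int]  # assumed half-open [start, stop) residue indices


def expansion_density(protein_len: int, reactive_spans: Sequence[Span]) -> float:
    """Diagnostic: fraction of residues labeled positive after span expansion."""
    covered = set()
    for start, stop in reactive_spans:
        covered.update(range(max(0, start), min(protein_len, stop)))
    return len(covered) / protein_len if protein_len else 0.0


def expanded_residue_bce(residue_probs: Sequence[float], span: Span) -> float:
    """Mean residue-wise BCE when every residue in the span is labeled positive."""
    start, stop = span
    probs = residue_probs[start:stop]
    return -sum(math.log(p) for p in probs) / len(probs)


def noisy_or_peptide_loss(
    residue_probs: Sequence[float], span: Span, peptide_label: int, eps: float = 1e-7
) -> float:
    """BCE on a peptide-level probability pooled from residue probabilities.

    Noisy-OR pooling: P(peptide reactive) = 1 - prod(1 - p_i) over the span,
    so one confident residue is enough to explain a reactive peptide.
    """
    start, stop = span
    log_none = sum(math.log(max(1.0 - p, eps)) for p in residue_probs[start:stop])
    p_peptide = min(max(1.0 - math.exp(log_none), eps), 1.0 - eps)
    if peptide_label == 1:
        return -math.log(p_peptide)
    return -math.log(1.0 - p_peptide)


# A model that predicts one sharp epitope residue inside a 30-residue peptide.
probs = [0.05] * 30
probs[12] = 0.95
span = (0, 30)

density = expansion_density(50, [(0, 30), (15, 45)])
residue_loss = expanded_residue_bce(probs, span)       # penalizes the sparse prediction
peptide_loss = noisy_or_peptide_loss(probs, span, 1)   # accepts the sparse prediction
print(density, residue_loss, peptide_loss)
```

The comparison shows the mismatch: under expanded labels a sparse, possibly correct prediction is penalized heavily, while the peptide-level objective treats it as consistent with a reactive peptide. The density diagnostic can be reported per protein or per data source to check whether expansion differs across pathogens.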
Disclaimer: We may not be able to resolve this with the data we have, but this issue outlines some possible directions to consider.