The embedding pipeline appends the original sequence length as an additional scalar feature to every residue embedding, similar to BepiPred-3.0.
Evidence:
- src/pepseqpred/core/embeddings/esm2.py: append_seq_len appends float(seq_len) as a column.
- Both embedding generation and prediction embedding paths use this behavior.
Why this can hurt:
- Raw length can be orders of magnitude larger than normalized embedding features.
- It can become a source/pathogen shortcut if sequence length correlates with dataset or label source.
- Multi-pathogen training is particularly susceptible to spurious source-specific features.
- The feature worked for BepiPred-3.0 (BP3), but it may be hurting performance in this setting.
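
To make the scale concern concrete, here is a minimal sketch of the mismatch. It is not the code in esm2.py: append_raw_length is a hypothetical stand-in that only mirrors the described behavior (appending float(seq_len) as a column), and the embedding values and the length of 850 are made-up toy numbers chosen for illustration.

```python
import statistics


def append_raw_length(rows, seq_len):
    """Hypothetical stand-in: append float(seq_len) as an extra column to every row."""
    return [row + [float(seq_len)] for row in rows]


# Toy per-residue embedding: 3 residues x 4 features, values roughly unit-scale.
# These numbers are illustrative only, not real ESM-2 output.
toy_rows = [
    [0.12, -0.40, 0.95, -0.07],
    [-0.33, 0.21, 0.05, 0.60],
    [0.48, -0.15, -0.72, 0.09],
]
seq_len = 850  # assumed protein length for the example

augmented = append_raw_length(toy_rows, seq_len)

# Compare the typical magnitude of an embedding value with the length column.
embed_scale = statistics.mean(abs(v) for row in toy_rows for v in row)
length_scale = augmented[0][-1]
ratio = length_scale / embed_scale

print(f"mean |embedding value| = {embed_scale:.3f}")
print(f"length column value    = {length_scale:.1f}")
print(f"ratio                  = {ratio:.0f}x")
```

Running the same comparison on real embeddings and real sequence lengths would show how large the gap is in practice.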
Planning direction:
- Run an ablation without the length feature.
- If length is retained, normalize it, bucket it, or pass it through a controlled transform.
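
The options above can be sketched as a single switch, which also makes the ablation a one-argument change. Everything here is an assumption for illustration: the function names, the mode names, the max_len cap of 5000, and the bucket edges are hypothetical and are not taken from the repository.

```python
import bisect
import math


def length_features(seq_len, mode="log", max_len=5000, bucket_edges=(100, 300, 1000)):
    """Return the extra per-residue feature(s) derived from sequence length.

    mode="none"   -> no length feature (the ablation)
    mode="raw"    -> the unscaled length, as described for the current pipeline
    mode="log"    -> log1p(length), clipped at max_len and scaled into [0, 1]
    mode="bucket" -> one-hot vector over length buckets defined by bucket_edges
    """
    if seq_len < 0:
        raise ValueError("seq_len must be non-negative")
    if mode == "none":
        return []
    if mode == "raw":
        return [float(seq_len)]
    if mode == "log":
        clipped = min(seq_len, max_len)
        return [math.log1p(clipped) / math.log1p(max_len)]
    if mode == "bucket":
        # bisect_right places a length equal to an edge into the upper bucket.
        idx = bisect.bisect_right(bucket_edges, seq_len)
        one_hot = [0.0] * (len(bucket_edges) + 1)
        one_hot[idx] = 1.0
        return one_hot
    raise ValueError(f"unknown mode: {mode!r}")


def append_length_features(rows, seq_len, **kwargs):
    """Append the chosen length feature(s) to every per-residue row."""
    extra = length_features(seq_len, **kwargs)
    return [list(row) + extra for row in rows]


# Usage: the same toy rows under each mode.
rows = [[0.1, 0.2], [0.3, 0.4]]
for mode in ("none", "raw", "log", "bucket"):
    print(mode, append_length_features(rows, 850, mode=mode)[0])
```

Whichever transform is chosen, the same mode and parameters would need to be used in both the embedding generation path and the prediction path, since both currently append the feature.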