Raw protein sequence length is appended as an unnormalized feature #85

@jeffreyHoelzel

Description

The embedding pipeline appends the original sequence length as an additional scalar feature to every residue embedding, similar to BepiPred-3.0.

Evidence:

  • src/pepseqpred/core/embeddings/esm2.py
    • append_seq_len appends float(seq_len) as a column.
    • Both embedding generation and prediction embedding paths use this behavior.
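To illustrate the behavior described above, here is a minimal sketch of what appending the raw length looks like. It is not the actual `append_seq_len` from `esm2.py`: the function name `append_seq_len_sketch`, the list-of-rows representation, and the toy values are assumptions made for illustration only.

```python
from typing import List


def append_seq_len_sketch(residue_embeddings: List[List[float]]) -> List[List[float]]:
    """Hypothetical stand-in: append the raw sequence length as one extra column.

    Assumes one row per residue, so the number of rows is the sequence length.
    """
    seq_len = len(residue_embeddings)
    return [row + [float(seq_len)] for row in residue_embeddings]


# Toy input: 3 residues with 2 embedding dimensions each (made-up values).
toy = [[0.12, -0.40], [0.05, 0.33], [-0.21, 0.08]]
augmented = append_seq_len_sketch(toy)
```

Every row gains the same unscaled value in its last column, so for a protein of several hundred residues that column would sit far outside the range of the other features.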

Why this can hurt:

  • Raw length can be orders of magnitude larger than normalized embedding features.
  • It can become a shortcut for identifying the source or pathogen if sequence length correlates with dataset or label source.
  • Multi-pathogen training is particularly susceptible to spurious source-specific features.
  • The feature worked in BepiPred-3.0 (BP3), but it may be hurting performance here.

Planning direction:

  • Run an ablation without the length feature.
  • If length is retained, normalize it, bucket it, or pass it through a controlled transform.
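The options above could be sketched roughly as follows. This is only one possible shape, not a proposal for the final API: `MAX_LEN`, `BUCKET_EDGES`, the function names, and the `mode` switch are all hypothetical, and the cap and bucket edges would need to be chosen from the actual length distribution of the training data.

```python
import math
from bisect import bisect_right
from typing import List, Sequence

MAX_LEN = 1024                        # hypothetical cap on sequence length
BUCKET_EDGES = (100, 250, 500, 1000)  # hypothetical bucket boundaries


def log_scaled_length(seq_len: int, max_len: int = MAX_LEN) -> float:
    """Clip the length to [0, max_len] and map it to [0, 1] on a log scale."""
    clipped = min(max(seq_len, 0), max_len)
    return math.log1p(clipped) / math.log1p(max_len)


def length_bucket_one_hot(seq_len: int, edges: Sequence[int] = BUCKET_EDGES) -> List[float]:
    """One-hot encode the length bucket; a length equal to an edge goes to the upper bucket."""
    idx = bisect_right(edges, seq_len)
    return [1.0 if i == idx else 0.0 for i in range(len(edges) + 1)]


def length_features(seq_len: int, mode: str = "log") -> List[float]:
    """Return the extra per-residue columns for the chosen mode.

    "none" yields no columns, which corresponds to the ablation run.
    """
    if mode == "none":
        return []
    if mode == "log":
        return [log_scaled_length(seq_len)]
    if mode == "bucket":
        return length_bucket_one_hot(seq_len)
    raise ValueError(f"unknown length feature mode: {mode!r}")
```

Keeping the choice behind a single switch would let the ablation and the transformed variants run through the same embedding and prediction paths, so the comparison changes only the length feature.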
