Raw protein sequence length is appended as an unnormalized feature #85

The embedding pipeline appends the original sequence length as an additional scalar feature to every residue embedding, similar to what BepiPred-3.0 does.

Evidence:

- `src/pepseqpred/core/embeddings/esm2.py`
  - `append_seq_len` appends `float(seq_len)` as a column.
  - Both embedding generation and prediction embedding paths use this behavior.
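For illustration, the behavior described above can be sketched roughly as follows. This is not the code in `esm2.py`: the function name `append_seq_len_sketch` and the list-of-lists representation are assumptions made for a self-contained example, and the real implementation presumably operates on tensors.

```python
from typing import List


def append_seq_len_sketch(residue_embeddings: List[List[float]]) -> List[List[float]]:
    """Hypothetical stand-in for `append_seq_len`.

    Appends the raw sequence length (number of residue rows) as one extra
    column on every residue embedding, with no scaling applied.
    """
    seq_len = len(residue_embeddings)
    return [row + [float(seq_len)] for row in residue_embeddings]


# Toy input: 3 residues with 2-dimensional embeddings (made-up values).
toy = [[0.12, -0.40], [0.05, 0.31], [-0.22, 0.08]]
augmented = append_seq_len_sketch(toy)
```

Every row gains the same constant, so the extra column carries no per-residue information, only a per-sequence one.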

Why this can hurt:

- Raw length (hundreds to thousands of residues) can be orders of magnitude larger than normalized embedding features, so the single length column can dominate the input scale.
- It can become a source/pathogen shortcut if sequence length correlates with dataset or label source.
- Multi-pathogen training is particularly susceptible to spurious source-specific features.
- It worked for BepiPred-3.0 (BP3) but may actually be hurting us here.
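To give a feel for the scale mismatch, here is a back-of-the-envelope sketch. All numbers are assumptions chosen for illustration, not measurements from this repository: an embedding width of 1280, a typical per-feature magnitude of 0.1, and a sequence of 1500 residues.

```python
import math

# Illustrative assumptions, not values measured on this project's data.
ASSUMED_EMBED_DIM = 1280          # assumed embedding width
ASSUMED_FEATURE_MAGNITUDE = 0.1   # assumed typical |value| of an embedding feature
ASSUMED_SEQ_LEN = 1500            # assumed protein length in residues

# Squared L2 norm contributed by the embedding part of one residue row.
embedding_sq_norm = ASSUMED_EMBED_DIM * ASSUMED_FEATURE_MAGNITUDE ** 2
# Squared contribution of the single raw-length column.
length_sq = float(ASSUMED_SEQ_LEN) ** 2

# Fraction of the row's squared norm that comes from the length column alone.
length_share = length_sq / (length_sq + embedding_sq_norm)

print(f"embedding L2 norm: {math.sqrt(embedding_sq_norm):.2f}")
print(f"share of squared norm from length column: {length_share:.6f}")
```

Under these assumptions nearly all of the row's norm comes from one column, which is the kind of imbalance the ablation should probe.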

Planning direction:

- Run an ablation without the length feature.
- If length is retained, normalize it, bucket it, or pass it through a controlled transform.
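A minimal sketch of the options above, assuming a hypothetical `length_feature` helper with a `mode` switch. The names, the maximum length of 5000, and the bucket edges are all placeholders to be tuned on the actual data, not existing project settings.

```python
import math
from bisect import bisect_right
from typing import Optional, Sequence

# Placeholder constants; both would need to be chosen from the real length distribution.
ASSUMED_MAX_LEN = 5000
ASSUMED_BUCKET_EDGES = (100, 250, 500, 1000, 2000)


def log_scaled_length(seq_len: int, max_len: int = ASSUMED_MAX_LEN) -> float:
    """Map length into [0, 1] with a log transform, clipping at max_len."""
    return min(math.log1p(seq_len) / math.log1p(max_len), 1.0)


def bucket_length(seq_len: int, edges: Sequence[int] = ASSUMED_BUCKET_EDGES) -> int:
    """Return the index of the length bucket (0 .. len(edges))."""
    return bisect_right(edges, seq_len)


def length_feature(seq_len: int, mode: str) -> Optional[float]:
    """Hypothetical switch covering the ablation and the transforms.

    "none" drops the feature (ablation), "raw" reproduces the current
    behavior, "log" and "bucket" are the controlled alternatives.
    """
    if mode == "none":
        return None
    if mode == "raw":
        return float(seq_len)
    if mode == "log":
        return log_scaled_length(seq_len)
    if mode == "bucket":
        return float(bucket_length(seq_len))
    raise ValueError(f"unknown mode: {mode!r}")
```

With a single switch like this, the ablation and each transform can be compared under otherwise identical training settings. The bucket index is ordinal here; a one-hot encoding of the bucket may be a better fit if the model should not assume an ordering.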
