Training uses overlapping windows while evaluation uses full proteins #80

@jeffreyHoelzel

Description

Training defaults to windowed proteins with overlap, while evaluation defaults to full proteins.

Evidence:

  • src/pepseqpred/core/data/proteindataset.py
    • _iter_windows supports window_size and stride.
  • scripts/hpc/trainffnn.sh
    • WINDOW_SIZE=1000
    • STRIDE=900
  • src/pepseqpred/apps/evaluate_ffnn_cli.py
    • Evaluation constructs datasets with window_size=None and pad_last_window=False.
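
To make the mismatch concrete, here is a minimal sketch of how window bounds come out under the two configurations. `iter_window_bounds` is a hypothetical stand-in written for this issue, not the repository's `_iter_windows`. It assumes half-open bounds, a truncated (unpadded) last window, and that `None` or `0` means "no windowing".

```python
from typing import Iterator, Optional, Tuple


def iter_window_bounds(
    length: int,
    window_size: Optional[int],
    stride: Optional[int] = None,
) -> Iterator[Tuple[int, int]]:
    """Yield half-open (start, end) bounds over a protein of `length` residues.

    Hypothetical illustration only; the real _iter_windows may differ
    (for example, it may pad the last window).
    """
    # Assumption: None or 0 disables windowing and yields the full protein.
    if not window_size or length <= window_size:
        yield (0, length)
        return
    step = stride or window_size  # default to non-overlapping windows
    start = 0
    while True:
        end = min(start + window_size, length)
        yield (start, end)
        if end == length:
            break
        start += step


# Training-style settings from scripts/hpc/trainffnn.sh
train_windows = list(iter_window_bounds(2500, window_size=1000, stride=900))
# Evaluation-style settings (window_size=None)
eval_windows = list(iter_window_bounds(2500, window_size=None))
```

For a 2500-residue protein, the training-style call yields three windows that overlap by 100 residues each, while the evaluation-style call yields the protein once.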

Why this can hurt:

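  • Residues in overlap regions are yielded more than once during training and validation. With WINDOW_SIZE=1000 and STRIDE=900, adjacent windows share 100 residues.
  • The validation threshold is selected on windowed validation arrays, but final evaluation runs on full proteins, so the threshold is tuned on a different residue distribution than the one it is applied to.
  • Duplicated overlap residues can overweight window-boundary regions and long proteins, since proteins that fit in a single window contribute no duplicates.
  • If there are very long multi-pathogen proteins, their length and the resulting overlap duplicates can distort both training and the reported metrics.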
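
The duplication can be counted per residue position. The sketch below is an illustration under assumptions, not code from the repository: `coverage_counts` is a hypothetical helper, and it assumes the last window is truncated rather than padded.

```python
from collections import Counter
from typing import List


def coverage_counts(length: int, window_size: int, stride: int) -> List[int]:
    """Return how many windows cover each residue position.

    Hypothetical helper for illustration; assumes the last window is
    truncated at the protein end rather than padded.
    """
    counts = [0] * length
    start = 0
    while True:
        end = min(start + window_size, length)
        for i in range(start, end):
            counts[i] += 1
        if end == length:
            break
        start += stride
    return counts


counts = coverage_counts(2500, window_size=1000, stride=900)
histogram = Counter(counts)  # Counter({1: 2300, 2: 200})
yielded = sum(counts)        # 2700 residues yielded for 2500 unique ones
```

In this example, 200 of the 2500 residues are seen twice, all of them within 100 residues of a window boundary. A protein shorter than one window has every position covered exactly once, which is why the extra weight falls only on long proteins.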

Planning direction:

  • Run an ablation with --window-size 0 where feasible.
  • Run an ablation with non-overlapping windows, for example stride == window_size.
  • Add metrics that count unique proteins/residues separately from yielded training residues.
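
One possible shape for the proposed counting metrics is sketched below. Everything here is hypothetical (`ResidueCounts`, `count_residues`, and the input of protein lengths keyed by ID), and it again assumes unpadded last windows and that `None` or `0` disables windowing.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ResidueCounts:
    unique_proteins: int
    unique_residues: int
    yielded_windows: int
    yielded_residues: int

    @property
    def duplication_factor(self) -> float:
        """Yielded residues per unique residue (1.0 means no duplication)."""
        if self.unique_residues == 0:
            return 0.0
        return self.yielded_residues / self.unique_residues


def count_residues(
    lengths: Dict[str, int],
    window_size: Optional[int],
    stride: Optional[int] = None,
) -> ResidueCounts:
    """Count unique vs. yielded residues for a set of protein lengths."""
    windows = 0
    yielded = 0
    for length in lengths.values():
        # Assumption: None or 0 means the full protein is yielded once.
        if not window_size or length <= window_size:
            windows += 1
            yielded += length
            continue
        step = stride or window_size
        start = 0
        while True:
            end = min(start + window_size, length)
            windows += 1
            yielded += end - start
            if end == length:
                break
            start += step
    return ResidueCounts(len(lengths), sum(lengths.values()), windows, yielded)


lengths = {"protein_a": 500, "protein_b": 2500}  # made-up lengths
overlapping = count_residues(lengths, window_size=1000, stride=900)
non_overlapping = count_residues(lengths, window_size=1000, stride=1000)
```

Reporting `duplication_factor` next to the usual metrics for each ablation would show how much of the yielded training data consists of repeated residues.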

Labels: bug
