Positive class weight is stale or fold-inappropriate #81

@jeffreyHoelzel

Description

The HPC scripts hard-code a default positive class weight. When the weight is auto-computed instead, it is derived from all label shards rather than from the current training split.

Evidence:

  • scripts/hpc/trainffnn.sh
    • POS_WEIGHT="${POS_WEIGHT:-13.18999647945325}"
    • The script always passes --pos-weight "$POS_WEIGHT".
  • scripts/hpc/trainffnnoptuna.sh
    • Same hard-coded default positive weight.
  • src/pepseqpred/apps/train_ffnn_cli.py
    • If --pos-weight is absent, pos_weight_from_label_shards(label_shards) uses all provided label shard totals.
  • docs/pv1_cwp_bkp_merge_split_and_pos_weight.md
    • Already notes that automatic train-time pos_weight uses shard-level totals, not train-only IDs.

Why this can hurt:

  • Multi-pathogen data can have very different positive rates than the dataset used to derive the hard-coded value.
  • K-fold training may have materially different positive rates per fold.
  • Computing the training class weight from labels that include the validation split is also a mild form of leakage (a selection mismatch), even though it is not target leakage in the usual model-fitting sense.
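
To make the mismatch concrete, here is a small sketch. It assumes the weight follows the common negatives-over-positives convention (the issue does not state the exact formula used by the repo), and the per-fold counts are hypothetical, chosen only for illustration. Under that assumption, the hard-coded default implies a positive rate of roughly 7%, so any fold or dataset with a noticeably different rate gets a mis-scaled weight.

```python
# Assumption: pos_weight = n_neg / n_pos (the usual convention for a
# BCE-style positive class weight). The repo's formula may differ.

HARD_CODED_POS_WEIGHT = 13.18999647945325  # default from scripts/hpc/trainffnn.sh


def pos_weight_from_counts(n_pos: int, n_neg: int) -> float:
    """Return n_neg / n_pos for a set of residue label counts."""
    if n_pos <= 0:
        raise ValueError("need at least one positive residue")
    return n_neg / n_pos


def implied_positive_rate(pos_weight: float) -> float:
    """Invert w = n_neg / n_pos to get the positive fraction 1 / (1 + w)."""
    return 1.0 / (1.0 + pos_weight)


if __name__ == "__main__":
    rate = implied_positive_rate(HARD_CODED_POS_WEIGHT)
    print(f"hard-coded weight implies positive rate ~ {rate:.4f}")

    # Hypothetical (n_pos, n_neg) residue counts for two training folds.
    hypothetical_folds = {"fold0": (700, 9300), "fold1": (400, 9600)}
    for name, (n_pos, n_neg) in hypothetical_folds.items():
        w = pos_weight_from_counts(n_pos, n_neg)
        print(f"{name}: pos_weight={w:.2f} vs hard-coded {HARD_CODED_POS_WEIGHT:.2f}")
```

In this made-up example, one fold lands close to the hard-coded value while the other would need a weight almost twice as large, which is the kind of drift the bullets above describe.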

Planning direction:

  • Compute pos_weight from the current run's training IDs only.
  • Log per-run and per-fold positive/negative residue counts before training.
  • Stop hard-coding old positive weights in HPC defaults unless explicitly requested.
