HPC scripts hard-code a positive class weight. When auto-computed, the weight is computed over all label shards, not the current training split.
Evidence:
scripts/hpc/trainffnn.sh
POS_WEIGHT="${POS_WEIGHT:-13.18999647945325}"
- The script always passes --pos-weight "$POS_WEIGHT".
scripts/hpc/trainffnnoptuna.sh
- Same hard-coded default positive weight.
src/pepseqpred/apps/train_ffnn_cli.py
- If --pos-weight is absent, pos_weight_from_label_shards(label_shards) uses all provided label shard totals.
docs/pv1_cwp_bkp_merge_split_and_pos_weight.md
- Already notes that automatic train-time pos_weight uses shard-level totals, not train-only IDs.
Why this can hurt:
- Multi-pathogen data can have positive rates very different from those of the dataset used to derive the hard-coded value, so the fixed weight can over- or under-weight positives.
- K-fold training may have materially different positive rates per fold.
- Using validation labels to compute the training class weight is also a mild leakage/selection mismatch, even if it is not target leakage in the usual model-fitting sense.
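To make the per-fold point concrete, here is a minimal sketch. It assumes the weight follows the common negatives-over-positives convention (the one typically paired with a binary cross-entropy loss); the source does not state the formula. The fold names and residue counts are hypothetical, chosen only to show how far one fixed value can sit from each fold's own ratio.

```python
def pos_weight(pos: int, neg: int) -> float:
    """Negative-to-positive residue ratio (assumed convention for the class weight)."""
    if pos <= 0:
        raise ValueError("cannot derive a positive weight without positive residues")
    return neg / pos


# Hypothetical (positive, negative) residue counts per fold; not project data.
fold_counts = {
    "fold0": (1000, 13000),
    "fold1": (500, 13500),
    "fold2": (2000, 12000),
}

fold_weights = {name: pos_weight(p, n) for name, (p, n) in fold_counts.items()}

for name, weight in fold_weights.items():
    print(f"{name}: pos_weight={weight:.2f}")
```

Under these made-up counts a single hard-coded value near 13 would roughly match the first fold, halve the weight the second fold calls for, and double the weight the third fold calls for.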
Planning direction:
- Compute
pos_weight from the current run's training IDs only.
- Log per-run and per-fold positive/negative residue counts before training.
- Stop hard-coding old positive weights in HPC defaults unless explicitly requested.
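The first two planning items could look roughly like the sketch below. It is not the project's implementation: count_residues and pos_weight_from_train_ids are hypothetical helpers, the in-memory mapping from sequence ID to per-residue 0/1 labels stands in for whatever the label shards actually hold, and the negatives-over-positives formula is an assumption. Labels other than 0 or 1 (for example masked residues) are ignored here, which is also an assumption.

```python
import logging
from typing import Iterable, Mapping, Optional, Sequence, Tuple

logger = logging.getLogger(__name__)


def count_residues(
    labels_by_id: Mapping[str, Sequence[int]],
    ids: Iterable[str],
) -> Tuple[int, int]:
    """Count positive (1) and negative (0) residues for the given IDs only."""
    pos = neg = 0
    for seq_id in ids:
        labels = labels_by_id[seq_id]
        pos += sum(1 for y in labels if y == 1)
        neg += sum(1 for y in labels if y == 0)
    return pos, neg


def pos_weight_from_train_ids(
    labels_by_id: Mapping[str, Sequence[int]],
    train_ids: Iterable[str],
    fold: Optional[int] = None,
) -> float:
    """Hypothetical helper: derive the weight from training IDs, never validation IDs."""
    pos, neg = count_residues(labels_by_id, train_ids)
    # Log the counts before training so each run and fold is auditable.
    logger.info("fold=%s train residues: pos=%d neg=%d", fold, pos, neg)
    if pos == 0:
        raise ValueError("training split contains no positive residues")
    return neg / pos


# Toy labels: the "val" sequence must not influence the training weight.
labels = {"a": [1, 0, 0, 0], "b": [0, 0, 1, 1], "val": [1, 1, 1, 1]}
train_weight = pos_weight_from_train_ids(labels, ["a", "b"], fold=0)
```

Calling the helper once per fold with that fold's training IDs gives each fold its own weight and leaves a log line that can be compared against any value passed explicitly via --pos-weight.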