Positive class weight is stale or fold-inappropriate

The HPC scripts hard-code a default positive class weight. When the weight is auto-computed instead, it is derived from all label shards rather than from the current training split.

Evidence:

- `scripts/hpc/trainffnn.sh`
  - `POS_WEIGHT="${POS_WEIGHT:-13.18999647945325}"`
  - The script always passes `--pos-weight "$POS_WEIGHT"`.
- `scripts/hpc/trainffnnoptuna.sh`
  - Same hard-coded default positive weight.
- `src/pepseqpred/apps/train_ffnn_cli.py`
  - If `--pos-weight` is absent, `pos_weight_from_label_shards(label_shards)` uses all provided label shard totals.
- `docs/pv1_cwp_bkp_merge_split_and_pos_weight.md`
  - Already notes that automatic train-time `pos_weight` uses shard-level totals, not train-only IDs.

Why this can hurt:

- Multi-pathogen data can have a very different positive rate from the dataset used to derive the hard-coded value.
- K-fold training may have materially different positive rates per fold.
- Using validation labels to compute the training class weight is also a mild leakage/selection mismatch, even if it is not target leakage in the usual model-fitting sense.
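
To give a sense of scale, here is a minimal sketch that assumes the weight follows the common negatives-over-positives convention (`pos_weight = n_neg / n_pos`); this issue has not confirmed that the hard-coded value was derived this way. Under that assumption, the default corresponds to a positive residue rate of roughly 7%, so a split with a noticeably lower or higher rate would call for a different weight. Both helper names are hypothetical.

```python
# Assumption: pos_weight = n_neg / n_pos (negatives over positives).
HARD_CODED_POS_WEIGHT = 13.18999647945325  # default from scripts/hpc/trainffnn.sh


def implied_positive_rate(pos_weight: float) -> float:
    """Positive fraction implied by a neg/pos weight: 1 / (1 + w)."""
    if pos_weight < 0:
        raise ValueError("pos_weight must be non-negative")
    return 1.0 / (1.0 + pos_weight)


def pos_weight_for_rate(positive_rate: float) -> float:
    """Inverse mapping: the neg/pos weight matching a given positive fraction."""
    if not 0.0 < positive_rate < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return (1.0 - positive_rate) / positive_rate


if __name__ == "__main__":
    rate = implied_positive_rate(HARD_CODED_POS_WEIGHT)
    print(f"implied positive rate of hard-coded default: {rate:.4f}")
    # Illustrative rates only, not measurements from this repository.
    for example_rate in (0.03, 0.07, 0.12):
        print(f"rate={example_rate:.2f} -> pos_weight={pos_weight_for_rate(example_rate):.2f}")
```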

Planning direction:

- Compute `pos_weight` from the current run's training IDs only.
- Log per-run and per-fold positive/negative residue counts before training.
- Stop hard-coding old positive weights in HPC defaults unless explicitly requested.
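
The first two items could be sketched as follows. This is not the existing `pos_weight_from_label_shards` implementation: the function names and the `labels_by_id` mapping (protein ID to per-residue 0/1 labels) are hypothetical, and the real shard format may differ. The weight again assumes the negatives-over-positives convention.

```python
import logging
from typing import Iterable, Mapping, Sequence, Tuple

logger = logging.getLogger(__name__)


def count_residues(
    labels_by_id: Mapping[str, Sequence[int]], ids: Iterable[str]
) -> Tuple[int, int]:
    """Return (n_pos, n_neg) residue counts over the given IDs only."""
    n_pos = 0
    n_neg = 0
    for protein_id in ids:
        labels = labels_by_id[protein_id]
        positives = sum(1 for label in labels if label == 1)
        n_pos += positives
        n_neg += len(labels) - positives
    return n_pos, n_neg


def pos_weight_from_train_ids(
    labels_by_id: Mapping[str, Sequence[int]],
    train_ids: Iterable[str],
    fold: int = 0,
) -> float:
    """Compute n_neg / n_pos from training IDs and log the counts first."""
    n_pos, n_neg = count_residues(labels_by_id, train_ids)
    logger.info("fold=%d train residues: pos=%d neg=%d", fold, n_pos, n_neg)
    if n_pos == 0:
        raise ValueError(f"fold {fold}: no positive residues in the training split")
    return n_neg / n_pos


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # Toy labels for illustration only.
    toy_labels = {"a": [0, 0, 1], "b": [0, 0, 0, 0], "c": [1, 1, 0]}
    print(pos_weight_from_train_ids(toy_labels, ["a", "b"], fold=0))
```

Raising on a split with zero positives, rather than silently falling back to a default, keeps a degenerate fold from reusing a stale weight.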
