Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
72 changes: 72 additions & 0 deletions BUG_REVIEW.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,72 @@
# msgf-rust bug review (2026-05-23)

Branch: `review/bug-hunt` (from `master` @ 18360a3d)

Systematic review of the Rust MS-GF+ port: static analysis of critical paths,
full `cargo test --release --workspace`, and targeted code reading.

## Fixed in this branch

| ID | Severity | Location | Issue | Fix |
|---|---|---|---|---|
| B1 | **Critical** | `msgf-rust.rs` `send_chunks` | Bench cap (`--max-spectra N`) truncated the final partial chunk to zero when `total == N` (e.g. N=100 with chunk size 5000 → empty output). | Removed erroneous tail `truncate` block; loop already stops at cap. |
| B2 | **High** | `msgf-rust.rs` param routing | Activation auto-detect was gated on `instrument == low-res`, so `--fragmentation auto --instrument QExactive` on mzML skipped peek and resolved to CID params for HCD data. | Gate auto-route on `fragmentation == auto` + mzML extension only. |
| B3 | **High** | `msgf-rust.rs` TSV write | `write_tsv(..., is_mgf=true)` always emitted MGF layout (extra `Title` column) even for mzML inputs. | Pass `!is_mzml`. |
| B4 | **High** | `match_engine.rs` GF | SpecE GF graph used `start_offset == 0` for protein N-term instead of `cand.is_protein_n_term`, breaking Met-cleaved N-termini at offset 1. | Use `cand.is_protein_n_term` / `is_protein_c_term`. |
| B5 | **Medium** | `tsv.rs` | `IsotopeError` column hardcoded to 0 while PIN writes `psm.isotope_offset`. | Thread isotope offset from PSM. |
| B6 | **Medium** | `msgf-rust.rs` CLI | Inverted `--charge-min/--charge-max` or isotope ranges produced empty ranges with no error. | Validate at startup and return clear error. |
| B7 | **High** | `match_engine.rs` dedup | Dedup used bare sequence + pin score; merged mod variants incorrectly. | Mod-aware pepSeq key + `rank_score`. |
| B8 | **Medium** | `match_engine.rs` dedup | HashMap survivor order was nondeterministic. | `BTreeMap` + best-`rank_score` survivor rule. |

## Open — not fixed (documented for follow-up)

| ID | Severity | Location | Issue |
|---|---|---|---|
| B9 | **Low** | `sa_walk.rs` | Test-only SA walk helper does not enforce `max_missed_cleavages`; production search uses `candidate_gen::enumerate_candidates`, which does. |
| B10 | **High** | `mzml.rs` `Iterator::next` | First per-spectrum parse error sets `done=true` and aborts the entire file; remaining spectra are silently skipped. |
| B11 | **Low** | `sa_walk.rs` Met pass | Dedupes Met-cleaved peptides on residue bytes only, collapsing distinct C-terminal contexts. |

## Known test failures (pre-existing, CI-skipped)

These fail on `master` without the 7 CI skip flags; tracked as parity/min_peaks regressions:

- `match_engine_smoke::known_peptide_appears_in_top_n`
- `match_engine_smoke::charge_missing_spectrum_uses_per_charge_scored_spec`
- `match_engine_smoke::spectrum_without_charge_tries_charge_range`
- Maven fixture loads, thread-determinism test (see `.github/workflows/ci.yml`)

## Verification

```bash
cargo test --release --workspace -- \
--skip charge_missing_spectrum_uses_per_charge_scored_spec \
--skip spectrum_without_charge_tries_charge_range \
--skip known_peptide_appears_in_top_n \
--skip read_bsa_canno_text_format \
--skip read_tryp_pig_bov_revcat_csarr_cnlcp \
--skip tryp_pig_bov_revcat_full_set_loads \
--skip match_spectra_output_invariant_across_thread_counts
```

## Performance (dedup pass)

- PepSeq dedup keys use integer mod units + `Arc` cache per candidate (avoids repeated string formatting).
- Per-charge `TopNQueue` map uses `FxHashMap<u8, _>` (typically 1–3 charges per spectrum).

## Documentation review (2026-05-24)

Fixes applied on this branch:

| Issue | Location | Fix |
|---|---|---|
| PIN column count said "28" | `README.md` | Corrected to 36 (default charge 2–3) + EdgeScore note |
| Auto-detect described "first spectrum" only | `README.md` | First 64 MS2 histogram; `--instrument` does not gate peek |
| Auto-detect required `--instrument low-res` | `DOCS.md` §4 | Matches code: only `--fragmentation auto` + mzML |
| TSV `IsotopeError` documented as always 0 | `DOCS.md` §3b | Updated after B5 fix |
| Broken `known-divergences.md` links | `README.md`, `DOCS.md` §8d | Legacy file removed in iter39; point to §8d / tests |
| Inverted charge/isotope ranges undocumented | `DOCS.md` §1 | Startup validation documented |

**Still stale (not fixed here):**

- `benchmark/ci/README.md` — references Java Maven workflow; no Rust benchmark workflow in `.github/workflows/` yet.
- `.claude/CLAUDE.md` — Java-tree context; accurate on `java-legacy` branch only.
24 changes: 13 additions & 11 deletions DOCS.md
Original file line number Diff line number Diff line change
Expand Up @@ -37,15 +37,15 @@ All flags use kebab-case long options (`--flag-name`). Several flags also accept
| Flag | Type | Default | Description | Legacy form |
|---|---|---|---|---|
| `--precursor-tol-ppm` | f64 | `20.0` | Symmetric precursor mass tolerance in parts per million. | Java `-t 20ppm` |
| `--charge-min` | u8 | `2` | Minimum precursor charge to try when the spectrum record does not specify charge. | *(no direct Java flag; set via param file in Java)* |
| `--charge-max` | u8 | `3` | Maximum precursor charge to try when charge is missing from the spectrum. | *(same)* |
| `--charge-min` | u8 | `2` | Minimum precursor charge to try when the spectrum record does not specify charge. Must be ≤ `--charge-max` (inverted ranges are rejected at startup). | *(no direct Java flag; set via param file in Java)* |
| `--charge-max` | u8 | `3` | Maximum precursor charge to try when charge is missing from the spectrum. Must be ≥ `--charge-min`. | *(same)* |
| `--enzyme-specificity` | enum | `fully` | Enzymatic cleavage enforcement at peptide termini (Number of Tolerable Termini). `fully`: both termini must be cleavage sites (Java `-ntt 2`). `semi`: at least one terminus (Java `-ntt 1`). `non-specific`: neither required (Java `-ntt 0`). | `--ntt` alias; numeric `0`/`1`/`2` |
| `--max-missed-cleavages` | u32 | `1` | Maximum missed enzymatic cleavages allowed per candidate peptide. | Java `-maxMissedCleavages 1` |
| `--min-length` | u32 | `6` | Minimum peptide length in residues (excluding flanking context). | Java `-minLength 6` |
| `--max-length` | u32 | `40` | Maximum peptide length in residues. | Java `-maxLength 40` |
| `--top-n` | u32 | `10` | Maximum PSMs retained per spectrum (ranked by SpecEValue). | Java `-n 10` |
| `--isotope-error-min` | i8 | `-1` | Minimum isotope error offset to evaluate during precursor matching. | Java `-ti -1,2` (first value) |
| `--isotope-error-max` | i8 | `2` | Maximum isotope error offset to evaluate. | Java `-ti -1,2` (second value) |
| `--isotope-error-min` | i8 | `-1` | Minimum isotope error offset to evaluate during precursor matching. Must be ≤ `--isotope-error-max`. | Java `-ti -1,2` (first value) |
| `--isotope-error-max` | i8 | `2` | Maximum isotope error offset to evaluate. Must be ≥ `--isotope-error-min`. | Java `-ti -1,2` (second value) |
| `--min-peaks` | u32 | `10` | Minimum number of MS2 peaks required to score a spectrum; spectra below this threshold are skipped. | Java `-minNumPeaks 10` |

### Modifications
Expand Down Expand Up @@ -165,7 +165,7 @@ msgf-rust writes Percolator `.pin` (always) and optionally `.tsv`. Implementatio

### 3a. PIN columns

Tab-separated, one header row, one row per PSM. Rows are sorted best-first within each spectrum (lowest SpecEValue first). Charge one-hot columns are emitted for every integer charge in `[--charge-min, --charge-max]`; the table below uses the default range 2–3 (`charge2`, `charge3`).
Tab-separated, one header row, one row per PSM. Rows are sorted best-first within each spectrum (lowest SpecEValue first). With default `--charge-min 2 --charge-max 3`, the header has **36 columns**: 35 Java-parity fields plus Rust-only `EdgeScore` (before `Peptide`). Additional charge states add one `chargeN` column each.

| Column | Type | Description |
|---|---|---|
Expand Down Expand Up @@ -220,7 +220,7 @@ Tab-separated human-readable report. The `Title` column appears **only for MGF**
| `Title` | string | MGF `TITLE=` field. |
| `FragMethod` | string | Activation method name (`HCD`, `CID`, …) or `UNKNOWN`. |
| `Precursor` | float | Precursor m/z (4 decimal places). |
| `IsotopeError` | int | Always `0` in current release (winning offset not threaded to TSV). |
| `IsotopeError` | int | Winning isotope offset (same value as PIN `isotope_error`). |
| `PrecursorError(ppm)` | float | Mass error in ppm when tolerance is ppm mode; column named `PrecursorError(Da)` in Da mode. |
| `Charge` | int | Assigned precursor charge. |
| `Peptide` | string | Annotated peptide sequence with modifications. |
Expand All @@ -242,14 +242,16 @@ Use **PIN** when the goal is FDR calibration or rescoring: Percolator, MS²Resco

## 4. Auto-detection

For mzML inputs, when `--fragmentation auto`, `--instrument low-res`, and `--protocol auto` (the CLI defaults), msgf-rust peeks the input file before loading the full dataset:
For **mzML** inputs when `--fragmentation auto` (the default), msgf-rust peeks the input file before loading the full dataset:

1. **Activation method** — histogram of `<activation>` cvParams across the first 64 MS2 spectra; dominant method wins. Mixed methods trigger an stderr warning but the dominant method is still used file-wide.
2. **Instrument class** — scans `<instrumentConfiguration>` / analyzer cvParams via `input::detect_instrument_type`; dominant analyzer among MS2 spectra wins. `None` → `low-res` (Java `LOW_RESOLUTION_LTQ` default).

MGF files carry no activation or instrument metadata → auto-detect returns `None` → bundled default `HCD_QExactive_Tryp.param` unless explicit `--fragmentation` / `--instrument` flags override.
The CLI `--instrument` flag does **not** gate this path — only `--fragmentation auto` + mzML extension does. When peek succeeds, instrument is taken from the file; `--protocol` from the CLI is still used to pick protocol-suffixed `.param` files (e.g. `_TMT`).

Explicit `--fragmentation` (non-auto) or non-default `--instrument` disables the activation peek and uses flag-based resolution directly (§1).
MGF files carry no activation or instrument metadata → auto-detect returns `None` → bundled default `HCD_QExactive_Tryp.param` unless explicit `--fragmentation` / `--instrument` flags override via `resolve_bundled_param`.

Non-auto `--fragmentation` (e.g. `HCD`, `3`) disables the activation peek and uses flag-based resolution directly (§1), including `--instrument` and `--protocol` from the CLI.

### Activation CV mapping (mzML `<activation>` cvParam accession → method)

Expand Down Expand Up @@ -476,9 +478,9 @@ On PSMs where Java and Rust agree on scan + top-1 peptide (the "agreement bucket

| Feature | Divergence | Status |
|---|---|---|
| `lnEValue` | ~4 orders of magnitude mean shift (Rust more confident) | Deferred — `num_distinct` semantics differ (`known-divergences.md` #2) |
| `lnEValue` | ~4 orders of magnitude mean shift (Rust more confident) | Deferred — `num_distinct` semantics differ (see item #2 below) |
| `MeanRelErrorTop7` / `MeanErrorTop7` / `StdevRelErrorTop7` | >1% relative difference on ~99% of agreement-bucket PSMs | Deferred — error-stat normalization differs |
| BSA charge-3 SpecEValue (BSA.fasta + test.mgf fixture) | 1–4 OOM gap depending on deconvolution iteration | Known — deconvolution implementation divergence (`known-divergences.md` #3); kept as dev-branch smoke gate |
| BSA charge-3 SpecEValue (BSA.fasta + test.mgf fixture) | 1–4 OOM gap depending on deconvolution iteration | Known — deconvolution implementation divergence; kept as dev-branch smoke gate (`gf_java_parity` tests) |

Percolator's learned weights absorb these distribution shifts; rescored FDR results remain competitive or better than Java.

Expand Down
6 changes: 3 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -66,7 +66,7 @@ msgf-rust \

This runs a tryptic search at 20 ppm precursor tolerance with the bundled HCD_QExactive_Tryp scoring model, writes Percolator-format PSMs to `out.pin`, and prints per-phase timings to stderr. Feed `out.pin` directly into Percolator (Docker or native) to compute q-values.

A row in `out.pin` is one peptide–spectrum match with 28 columns: `SpecId`, `Label`, `ScanNr`, charge one-hot encoding, then features like `RawScore`, `lnSpecEValue`, `DeNovoScore`, ion-current ratios, peptide-length stats, etc. Full column reference: `DOCS.md` §3a.
A row in `out.pin` is one peptide–spectrum match. With the default charge range (2–3), each row has **36 tab-separated columns**: 35 Java-parity Percolator features plus Rust-only `EdgeScore` (inserted before `Peptide`). Charge one-hot columns scale with `[--charge-min, --charge-max]`. Full column reference: `DOCS.md` §3a.

## Common workflows

Expand Down Expand Up @@ -131,11 +131,11 @@ Run `msgf-rust --help` for the auto-generated help with full descriptions.

## Auto-detection

For mzML inputs, msgf-rust reads the activation block of the first MS2 spectrum and selects a bundled `.param` file accordingly. The detection covers HCD/CID/ETD/UVPD activation and LowRes/HighRes/TOF/QExactive instrument classes (via mzML CV params). The bundled model is then resolved from `(fragmentation, instrument, protocol)`. MGF files have no activation metadata, so they go through the CLI defaults (which can be overridden with explicit `--fragmentation` / `--instrument` flags). Full resolution table: `DOCS.md` §4.
For mzML inputs with `--fragmentation auto` (the default), msgf-rust peeks the first 64 MS2 spectra, histograms activation methods and analyzer types, and selects a bundled `.param` file from the dominant values. The `--instrument` CLI flag is **not** required for this path — instrument class is read from the mzML when possible. `--protocol` from the CLI is still applied when resolving the bundled model. MGF files have no activation metadata, so they use flag-based resolution (defaulting to `HCD_QExactive_Tryp.param`). Full resolution table: `DOCS.md` §4.

## Parity vs Java MS-GF+

PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from the deconvolution-implementation difference (`known-divergences.md` item #3, kept on the development branch). None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.
PIN output columns are bit-exact with Java MS-GF+ on the agreement bucket (same scan + same top-1 peptide) for most features. Three residual divergences exist as deferred research: `lnEValue` (num_distinct semantics), `MeanRelErrorTop7` (error-stat normalization), and the BSA charge-3 SEV gap from deconvolution-implementation differences. None gate cutover; aggregate 1% FDR PSM counts beat Java on all three benchmark datasets. Full detail: `DOCS.md` §8d.

## Citation

Expand Down
5 changes: 5 additions & 0 deletions benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,11 @@ Only the **CI benchmark scaffold** is committed under this directory; heavy
local-only harnesses and artifacts (`data/`, `results/`, prebuilt JARs, etc.)
are intentionally gitignored and not distributed with the fork.

The PXD001819 CI scripts under `ci/` were written for the Java MS-GF+ build
(`mvn`, mzIdentML output). The Rust binary uses PIN/TSV and a separate CI
workflow (`.github/workflows/ci.yml`); Rust benchmark automation is not yet
wired to this scaffold.

## Datasets

| Dataset | PXD | Instrument | Type | Spectra Source | FASTA / SDRF |
Expand Down
4 changes: 3 additions & 1 deletion benchmark/ci/README.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
# CI benchmark (PXD001819)

- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`)
> **Note:** This scaffold targets the Java MS-GF+ tree (`mvn`, mzIdentML metrics). The Rust port (`msgf-rust`) uses `.github/workflows/ci.yml` for tests but does not yet wire this benchmark harness. See [`benchmark/README.md`](../README.md) for scope.

- **Workflow:** `.github/workflows/benchmark-pxd001819.yml` (`workflow_dispatch`) — Java branch only
- **Run locally:** `mvn -B package -DskipTests && bash benchmark/ci/PXD001819/run_ci.sh`
- **Compare:** `python3 benchmark/ci/PXD001819/compare_metrics.py benchmark/results/PXD001819/ci/ci_metrics.txt benchmark/ci/PXD001819/baseline.tsv`
- **Self-test:** `python3 -m unittest benchmark.ci.PXD001819.test_compare_metrics`
Expand Down
Loading
Loading