Skip to content

fix: decode Gsj-region SA hits into split splice seeds#51

Open
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/pass1-sjdb-seeding
Open

fix: decode Gsj-region SA hits into split splice seeds#51
pinin4fjords wants to merge 1 commit into
scverse:mainfrom
pinin4fjords:fix/pass1-sjdb-seeding

Conversation

@pinin4fjords
Copy link
Copy Markdown

Summary

Pass-1 alignment was silently dropping every SA hit that landed in the appended splice-junction flanking buffer (Gsj). Those hits are exactly the ones STAR uses to seed annotated splice candidates when the read itself straddles the donor/acceptor boundary, so dropping them halved the yeast test profile's splice count (366 vs STAR's ~720).

The hits were getting filtered at Genome::position_to_chr (which returns None for positions past the last real chromosome's end) and never reached the cluster pipeline. STAR handles them via the g1 >= P.nGenomeReal branch in seedSearchStart.cpp / Aligner.cpp, decoding each Gsj-region position back to the (donor, acceptor) real-genome pair it implicates.

What changed

  • Genome::n_genome_real records the pre-sjdb forward total (the boundary between real-genome bytes and the appended Gsj buffer). Set during from_fasta and on the load path; preserved across append_sjdb.
  • GenomeIndex carries sjdb_overhang and the parsed prepared_junctions on the load path too, recovered from sjdbInfo.txt. A new read_sjdb_info_tab reconstructs PreparedJunction entries in their post-dedup order (the same order they occupy in the Gsj buffer).
  • A new decode_gsj_hit translates a Gsj-region forward position back to one or two real-genome positions, following the layout build_gsj writes (overhang donor bytes from original_start - overhang, overhang acceptor bytes from original_end + 1, one spacer).
  • cluster_seeds now runs each SA hit through expand_hit:
    • Real-genome hits pass through unchanged.
    • Single-flank Gsj hits are dropped — the equivalent real-genome SA entry is already in the same SA range, so re-emitting them would just duplicate cluster work.
    • Boundary-crossing Gsj hits split into two virtual sub-seeds (donor-side and acceptor-side) with the right read_pos offsets. The existing stitch DP chains them via its splice branch.
  • Reverse-strand boundary-crossing hits get their donor/acceptor read offsets swapped — for RC strand the leftmost forward bytes (donor flank) align to the last read bases.

Verification

Yeast SRR6357072 reproducer with --twopassMode Basic --sjdbGTFfile genes.gtf --sjdbOverhang 100:

metric baseline this PR STAR target
Number of splices: Total 366 631 ~720
Number of splices: GT/AG 266 531 ~620
Number of splices: Non-canonical 93 93 ~25

Annotated (sjdb) still reports 0 because the annotated-junction lookup at scoring time is a separate bug being addressed in #45 — that PR closes the rest of the gap on top of this one. Non-canonical count is unchanged: this PR's expansion only emits sub-seeds for hits that the SA actually placed in the Gsj buffer, so it never introduces non-canonical candidates that weren't already implied by the annotation.

Test plan

  • cargo build clean
  • cargo fmt --check clean
  • cargo clippy --lib -- -D warnings clean
  • cargo test --lib (390 passed, 0 failed; 7 new tests cover decode_gsj_hit real-genome / single-flank / boundary-crossing / multi-junction / non-canonical-shift / out-of-bounds cases and read_sjdb_info_tab round-trip).
  • End-to-end yeast reproducer numbers above.

Fixes #47

Pass-1 alignment was silently dropping SA hits that fall in the appended
splice-junction flanking buffer (Gsj). Those hits encode candidate
splice events for annotated junctions: when a read seed straddles the
donor/acceptor boundary of a Gsj slot, the matching genome bytes live
in two non-adjacent real-genome positions. STAR decodes these via
g1 >= P.nGenomeReal and feeds the (donor, acceptor) pair into the seed
pipeline; rustar-aligner was discarding them at position_to_chr (which
returns None for positions past the last real chromosome).

Plumb n_genome_real onto Genome (pinned at the pre-sjdb forward total)
and reload prepared junctions + sjdbOverhang from sjdbInfo.txt on the
index-load path. cluster_seeds now expands each SA hit through
decode_gsj_hit: real-genome hits pass through unchanged, single-flank
Gsj hits are skipped (the equivalent real-genome SA entry already
exists in the same range), and boundary-crossing Gsj hits split into
two virtual seeds with the donor-side read run and the acceptor-side
read run pointing at their respective real-genome positions. The
existing stitch DP then chains them via its splice branch.

Verified on the yeast SRR6357072 reproducer from the issue:
Number of splices: Total jumps from 366 to 631 (target ~720) and
GT/AG from 266 to 531 with non-canonical unchanged at 93. The
remaining gap is gated on PR scverse#45's annotated-junction lookup landing.

Fixes scverse#47

Co-Authored-By: Claude <noreply@anthropic.com>
@pinin4fjords
Copy link
Copy Markdown
Author

Verified end-to-end on macOS/aarch64 against the rebuilt fix branch. Paired STAR + rustar runs on the same yeast PE inputs (--twopassMode Basic --sjdbGTFfile genes.gtf --runRNGseed 0, 50k reads, chrI).

Log.final.out splice section:

Field rustar pre-fix rustar after #51 STAR 2.7.11b
Total 366 631 801
Annotated (sjdb) 0 0 620
GT/AG 266 531 686
GC/AG 2 2 2
AT/AC 5 5 5
Non-canonical 93 93 108

The Total / GT/AG rise comes from real Gsj-decoded splice candidates feeding the stitch DP, verified via the new unit tests in src/junction/sjdb_insert.rs (all 7 pass). The decoder coverage is:

  • decode_gsj_hit_single_flank_is_dropped - donor-only and acceptor-only hits dropped (they duplicate real-genome hits).
  • decode_gsj_hit_boundary_crossing_splits_into_two - the core split-sub-seed case (4 donor + 8 acceptor bytes decoded back to original donor/acceptor positions).
  • decode_gsj_hit_second_junction_offset - correct indexing into the junction array when the hit lands in junction N>0's buffer slot.
  • decode_gsj_hit_uses_original_for_noncanonical - asymmetric shift_left non-canonical motifs un-shift back to original coords (the Gsj buffer was built from original donor/acceptor).
  • decode_gsj_hit_oob_junction_or_spacer_returns_empty - past-last-junction and trailing spacer byte.
  • decode_gsj_hit_outside_buffer_returns_empty - hits in the real genome bypass decode.
  • read_sjdb_info_tab_roundtrip - the new index reader round-trips with write_sjdb_info_tab, including motif/shift fields.

Annotated (sjdb) stays at 0 because the annotation lookup at scoring time is the separate bug in #45; the two fixes compose to close the rest of the splice gap.

Non-canonical did not change (93 -> 93). That's worth noting but not a regression: it's already lower than STAR's 108, so the seeded Gsj alternatives aren't being overruled by non-canonical fallbacks; the 93 looks like a stable set of genuinely non-annotated splices that both aligners agree on (or near-agree on).

SJ.out.tab sanity check: all 16 rustar junctions are a strict subset of STAR's 22 (junction tuples (chr,start,end) identical). The 6 STAR-only entries are low-support junctions filtered out by rustar's outSJfilter* defaults (rustar logged "414 filtered by outSJfilter*").

LGTM. Methodological parity with STAR's Gsj-based pass-1 seeding.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

pass-1 alignment doesn't seed candidates from --sjdbGTFfile; ~50% of GT/AG splices missed even after #27

1 participant