bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)#2011
Conversation
bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (#2557)bcf_sr_set_regions: opt-in fastpath for dense single-base BEDs (samtools/bcftools#2557)
142224b to
deec4fa
Compare
bcftools view -R FILE.bed.gz exhibits 300-500x slowdown vs a sequential scan when the BED contains many densely-clustered single-base regions -- a common pattern in PRS / ancestry pipelines (HGDP+1kGP 84M, AADR 1240k, PGS Catalog). A production run on 84M such regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU. Root cause: bcf_sr_set_regions(is_file=1) routes through _readers_next_region() which issues one tbx_itr_queryi() per BED entry -- 84M iterator setup/teardown cycles for an 84M-entry panel. This patch adds a sniffer in bcf_sr_set_regions(): when the BED's first 256 non-comment entries are all single-base regions, internally promote to bcf_sr_set_targets(), which streams the VCF top-to-bottom and filters via in-memory regidx. Existing well-trodden code path; no new hot-path logic. The 0/1/2 --regions-overlap semantics match --targets-overlap so the user value carries over unchanged. Measured speedup against a synthetic reproducer (single-base BED at ~10bp avg spacing, matching VCF; bcftools 1.23.1, macOS arm64): N=100K: 13.3s -> 0.19s (70x) N=1M : 156.6s -> 0.55s (285x) N=5M : 705s -> 2.76s (255x) The 84M production case extrapolates from hours to minutes. Regression test: test_bcf_sr_regions_fastpath asserts -R FILE and -T FILE produce identical output on a 300-entry single-base fixture (>256 triggers the sniffer). test-bcf-sr gains -R/-T file-mode flags to exercise bcf_sr_set_regions/set_targets is_file=1 paths. Tracks samtools/bcftools#2557.
Four follow-ups to the auto-promotion patch: 1. API state leak (normal). The fastpath called bcf_sr_set_targets() unconditionally, which populates readers->targets while leaving readers->regions NULL and explicit_regs=0 -- both observable to callers. A subsequent bcf_sr_set_targets() then fails (e.g. for the -R FILE -T OTHER workflow); downstream consumers inspecting explicit_regs/regions saw "no regions set" even though set_regions succeeded. Whether identical caller code worked depended on the BED contents (the sniffer's content-dependent dispatch). Fix: gate the fastpath behind a new BCF_SR_AUTO_TARGETS_FROM_REGIONS opt. Default off, so existing callers keep the documented bcf_sr_set_regions() semantics. Callers that opt in accept the incompatibility with bcf_sr_set_targets() -- documented at the enum. 2. Sparse-BED regression (normal). The lone 256-line count threshold accepted 256 SNPs scattered genome-wide, which (compounded with samtools#1's explicit_regs=0) flipped ~256 cheap tabix seeks into full per-chrom streaming scans on whole-genome BCFs. Sniffer now also requires average intra-chromosome inter-entry distance below 10 kbp; 84M/1240k panels pass, scattered 256-entry BEDs fail. 3. FIFO consume-prefix (nit). The sniffer's hts_open + read + close permanently drained the first 256 lines from unseekable inputs. Sniffer now gates on hisremote() / '-' / stat()+S_ISREG() before opening, so FIFOs, stdin and char devices skip the sniff entirely. URLs are also skipped to avoid an extra round-trip. 4. Tautological regression test (nit). The prior test diffed -R against -T, but the fastpath made -R literally call bcf_sr_set_targets() -- the same function -T invokes -- so a defect in set_targets would shift both sides identically and the diff would still pass. Test now diffs three genuinely different code paths: -R slow path (per-region tbx_itr_queryi), -R fastpath (sniff + set_targets via opt), and -T direct. test-bcf-sr gains --auto-targets-from-regions to drive the new opt. Tracks samtools/bcftools#2557.
Matches the style established by 6e5a94b ("Log error when hts_close() fails in synced reader cleanup"), which is the most recent contribution to this file. Our sniffer is read-only and local-files-only (we gate on S_ISREG before opening), so close errors are exceedingly rare in practice -- but the cost is two lines and the precedent is in this same file.
e32e635 to
010999b
Compare
|
It looks like this is trying to make I'm currently having a good look at the synced reader for other reasons, so I'll see if I can work out how hard it would be to add multi-region support while I'm in there. Incidentally, we've recently added some policy around developer sign-off and the use of AI coding tools. Please see CONTRIBUTING.md for details. |
|
Closing in favor of Rob's planned multi region work on the synced reader, which is the better long term fix. The reproducer and benchmark numbers are preserved at samtools/bcftools#2557 if useful as a test case. Happy to test the multi region implementation against the same workload once something lands. Closing samtools/bcftools#2561 as well since it is just the bcftools side wiring for the opt added here. Re. CONTRIBUTING.md sign off and AI policy, noted for any future PR. |
Summary
bcf_sr_set_regions(file=1)runs onetbx_itr_queryi()per BED entry, which is 300-500× slower than a sequential scan at SNP-panel sizes — samtools/bcftools#2557 was a production case where 84M single-base regions against a 23 GB VCF did not complete in 11+ hours of 100% CPU. The streaming-targets path already handles this workload at near-baseline speed; this PR adds auto-routing to it.Design
New
BCF_SR_AUTO_TARGETS_FROM_REGIONSopt (end ofbcf_sr_opt_t, ABI-safe; default off). When set,bcf_sr_set_regions()runs a sniffer that accepts only when ALL hold:bcf_sr_regions_init()reopens).END − START == 1).On accept,
regions_overlapis copied totargets_overlap(value semantics are identical) andbcf_sr_set_targets()takes over. On reject (sparse / wide / malformed / too small / non-local), the existing seek-per-region path runs unchanged.The opt-in gate is the safety mechanism: successful promotion populates
readers->targets, so a subsequentbcf_sr_set_targets()fails — callers must explicitly accept that trade-off.Measurements
bcftools 1.23.1 + this PR; single-base BED at 10 bp avg spacing, matching VCF; macOS arm64:
Production (84M entries × 23 GB VCF) extrapolates from hours to minutes.
Test plan
New
test_bcf_sr_regions_fastpath(test/test.pl) diffs three genuinely different code paths producing identical output on a 300-entry single-base fixture:-R FILEslow path (opt off)-R FILE --auto-targets-from-regionsfastpath (opt + sniffer)-T FILEexisting streaming-targets pathtest-bcf-srgains matching-R/-Tfile-mode flags and the--auto-targets-from-regionsflag. All 359 existing htslib tests pass.Notes
The matching bcftools-side wiring is at samtools/bcftools#2561; without it,
bcftools view -R FILEkeeps the slow path. This PR is strictly the htslib half.Tracks samtools/bcftools#2557.