Add tbx and bcf multi-region iterators #2030

Open
daviesrob wants to merge 4 commits into samtools:develop from daviesrob:tbx-multi-itr

daviesrob (Member) commented Jun 4, 2026

Adds bcf_itr_regarray() and tbx_itr_regarray() interfaces to make multi-region iterators for tabix and BCF files, along with the infrastructure needed to make them work. Also adds a -u, --unique option to tabix that uses them, and updates the tabix tests to exercise the new options (along with the -R and -T ones, which previously escaped being tested). This is mostly simple, as much of the work needed has already been done for BAM and can be reused. The only minor complication is getting the tbx_t struct to the tbx_readrec() function in multi mode. This is done by wrapping its pointer, along with the pointer to the output buffer, in another struct so that both can be passed in together via the existing API.
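
For readers unfamiliar with the wrapping trick, here is a minimal sketch of the general pattern in plain C. It is not the code from this PR and does not use the htslib API: every name in it (demo_index, demo_record, demo_multi_args, demo_readrec, demo_multi_readrec, demo_iterate) is hypothetical and only stands in for the roles played by tbx_t, the output buffer, tbx_readrec() and the multi-region iterator. The assumption is simply that the driver offers a callback with a single opaque pointer, so two pointers have to travel inside one struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an index handle (the role tbx_t plays). */
typedef struct {
    char delimiter;              /* field separator used when parsing */
} demo_index;

/* Hypothetical stand-in for the output buffer. */
typedef struct {
    char text[64];
    int n_fields;
} demo_record;

/* Fixed callback type: the driver can pass only one opaque pointer. */
typedef int demo_readrec_fn(const char *line, void *data);

/* Wrapper that bundles both pointers into the single void * slot. */
typedef struct {
    demo_index *idx;
    demo_record *out;
} demo_multi_args;

/* The reader that does the work needs both the index and the buffer. */
static int demo_readrec(const char *line, demo_index *idx, demo_record *out)
{
    size_t len = strlen(line);
    if (len >= sizeof(out->text)) return -1;   /* record too long */
    memcpy(out->text, line, len + 1);
    out->n_fields = 1;
    for (size_t i = 0; i < len; i++)
        if (line[i] == idx->delimiter) out->n_fields++;
    return 0;
}

/* Adapter matching the fixed callback type: unwrap, then forward. */
static int demo_multi_readrec(const char *line, void *data)
{
    demo_multi_args *args = data;
    assert(args != NULL && args->idx != NULL && args->out != NULL);
    return demo_readrec(line, args->idx, args->out);
}

/* Generic driver that knows nothing about demo_index or demo_record.
 * Returns the number of lines the callback parsed successfully. */
static int demo_iterate(const char **lines, int n,
                        demo_readrec_fn *readrec, void *data)
{
    int ok = 0;
    for (int i = 0; i < n; i++)
        if (readrec(lines[i], data) == 0) ok++;
    return ok;
}
```

A caller would fill in a demo_multi_args with the addresses of its index and record, then pass demo_multi_readrec and the address of that struct to demo_iterate(). The driver's signature stays unchanged, which is the point of the approach described above.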

The internal bam_pseek() and bam_ptell() functions are renamed to bgzf_pseek() and bgzf_ptell() as the same code needs to be shared by the new iterator functions. They are also moved to bgzf.c.

Some lines that attempted to set threading on the tabix output file when searching BCF files (it outputs VCF) have been removed as they always failed because the output file type was not known when attempting to set the pool up. The output is always uncompressed at present, so threading isn't useful at the moment anyway.

Closes #1997
Closes #2022

daviesrob added 4 commits June 3, 2026 15:12
Identical functions are needed to add multi-iterator support
for BCF and tabix, so rename them to be more generic, move them
to bgzf.c and add declarations to the bgzf_internal.h header.

Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Add interface bcf_itr_regarray() to make multi-region iterators
over BCF files (analogous to sam_itr_regarray() on SAM etc.)
along with the infrastructure needed to make it work.

Add a `-u, --unique` option to tabix to enable use of the new
iterator (albeit only on BCF for now).

Some lines to enable threading on the tabix output for BCF have
been removed, as they seemingly never worked...

Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Add interface tbx_itr_regarray() to make multi-region iterators
over files indexed by tabix, along with the infrastructure
to make it work.

There is some minor complication because the hts_itr_multi_next()
function lacks a native way to pass the tbx_t struct to the
tbx_readrec() function used to read and parse records.  This
is worked around by using a struct to wrap the tbx_t pointer
up along with the one to the output buffer, and then unwrapping
them in a new tbx_multi_readrec() function before passing
them on to tbx_readrec().

Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Test the new -u option.

Test tabix on bcf files - vcf_file.bcf was already there, but
only used to test bcf to vcf conversion.  It's now also used
for tabix bcf indexing tests.

Test -R and -T options.

Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>