Add tbx and bcf multi-region iterators #2030
Open
daviesrob wants to merge 4 commits into
Identical functions are needed to add multi-iterator support for BCF and tabix, so rename them to be more generic, move them to bgzf.c and add declarations to the bgzf_internal.h header. Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Add interface bcf_itr_regarray() to make multi-region iterators over BCF files (analogous to sam_itr_regarray() on SAM etc.) along with the infrastructure needed to make it work. Add a `-u, --unique` option to tabix to enable use of the new iterator (albeit only on BCF for now). Some lines to enable threading on the tabix output for BCF have been removed, as they seemingly never worked... Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Add interface tbx_itr_regarray() to make multi-region iterators over files indexed by tabix, along with the infrastructure to make it work. There is some minor complication because the hts_itr_multi_next() function lacks a native way to pass the tbx_t struct to the tbx_readrec() function used to read and parse records. This is worked around by using a struct to wrap the tbx_t pointer up along with the one to the output buffer, and then unwrapping them in a new tbx_multi_readrec() function before passing them on to tbx_readrec(). Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
Test the new -u option. Test tabix on bcf files - vcf_file.bcf was already there, but only used to test bcf to vcf conversion. It's now also used for tabix bcf indexing tests. Test -R and -T options. Signed-off-by: Rob Davies <rmd+git@sanger.ac.uk>
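As a rough illustration of why a multi-region iterator can avoid reporting a record twice, the sketch below sorts a set of requested regions and coalesces any that overlap. It is not the htslib implementation: `region_t` and `merge_regions()` are hypothetical names, and it assumes the behaviour is comparable to the existing SAM multi-region iterator.

```c
#include <stdlib.h>

/* Hypothetical half-open interval [beg, end); not an htslib type. */
typedef struct {
    long beg;
    long end;
} region_t;

/* Order regions by start, then by end. */
static int cmp_region(const void *a, const void *b)
{
    const region_t *x = a, *y = b;
    if (x->beg != y->beg) return x->beg < y->beg ? -1 : 1;
    if (x->end != y->end) return x->end < y->end ? -1 : 1;
    return 0;
}

/*
 * Sort the regions and merge, in place, any that overlap or touch.
 * Returns the number of regions left in r[].  Hypothetical helper,
 * shown only to illustrate the idea.
 */
size_t merge_regions(region_t *r, size_t n)
{
    size_t i, out = 0;

    if (n == 0) return 0;
    qsort(r, n, sizeof(*r), cmp_region);

    for (i = 1; i < n; i++) {
        if (r[i].beg <= r[out].end) {
            /* Overlaps the region being built: extend it if needed. */
            if (r[i].end > r[out].end) r[out].end = r[i].end;
        } else {
            /* Gap found: start a new output region. */
            r[++out] = r[i];
        }
    }
    return out + 1;
}
```

With the regions merged up front, a record that falls inside two overlapping requests is only ever visited through one merged region.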
Adds `bcf_itr_regarray()` and `tbx_itr_regarray()` interfaces to make multi-region iterators for tabix and BCF files, along with the infrastructure needed to make them work. Also adds a `-u, --unique` option to tabix that uses them, and updates the tabix tests to exercise the new options (along with the `-R` and `-T` ones which previously escaped being tested). This is mostly fairly simple as much of the work needed has already been done for BAM and can be reused. The only minor complication is getting the `tbx_t` struct to the `tbx_readrec()` function in multi mode, which is done by wrapping its pointer along with the output buffer one in another struct so both can be passed in together via the existing API.

The internal `bam_pseek()` and `bam_ptell()` functions are renamed to `bgzf_pseek()` and `bgzf_ptell()` as the same code needs to be shared by the new iterator functions. They are also moved to `bgzf.c`.

Some lines that attempted to set threading on the tabix output file when searching BCF files (it outputs VCF) have been removed as they always failed because the output file type was not known when attempting to set the pool up. The output is always uncompressed at present, so threading isn't useful at the moment anyway.
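The wrapping trick described above for `tbx_t` can be sketched in isolation. The code below is a minimal stand-alone illustration, not the code in this pull request: every `fake_*` name is hypothetical, and the callback signature is a simplified stand-in for the real iterator API. It only shows how two pointers can travel through a single opaque argument.

```c
#include <string.h>

/* Hypothetical stand-ins; these are NOT the htslib definitions. */
typedef struct { int meta_char; } fake_tbx_t;   /* plays the role of tbx_t */
typedef struct { char text[64]; } fake_buf_t;   /* plays the role of the output buffer */

/* Generic callback type: it only has room for one opaque record pointer. */
typedef int (*fake_readrec_fn)(const char *line, void *rec);

/* The real reader needs both the index struct and the output buffer. */
int fake_readrec(const char *line, fake_tbx_t *tbx, fake_buf_t *out)
{
    if (line[0] == tbx->meta_char) return -1;   /* reject comment lines */
    strncpy(out->text, line, sizeof(out->text) - 1);
    out->text[sizeof(out->text) - 1] = '\0';
    return 0;
}

/* Wrapper struct carrying both pointers through the single argument. */
typedef struct {
    fake_tbx_t *tbx;
    fake_buf_t *out;
} fake_wrap_t;

/* Adapter: unwrap the two pointers and forward them to the real reader. */
int fake_multi_readrec(const char *line, void *rec)
{
    fake_wrap_t *w = rec;
    return fake_readrec(line, w->tbx, w->out);
}

/* Stand-in for the generic iterator step, which knows nothing about tbx. */
int fake_multi_next(fake_readrec_fn fn, const char *line, void *rec)
{
    return fn(line, rec);
}
```

A caller would fill in a `fake_wrap_t` with the two pointers and pass its address as the record argument, so the generic iterator step never needs to change its signature.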
Closes #1997
Closes #2022