Skip to content

change to have hts wide depth filtering #2016

Open
vasudeva8 wants to merge 1 commit into
samtools:developfrom
vasudeva8:dpth5
Open

change to have hts wide depth filtering #2016
vasudeva8 wants to merge 1 commit into
samtools:developfrom
vasudeva8:dpth5

Conversation

@vasudeva8
Copy link
Copy Markdown
Contributor

This implements a caching mechanism for hts wide depth filtering.
Reads are added to a cache and on tid change/eof/a window size, the
cached reads are processed to find which reads should be used.
Any read for which pair is present in the window will also be selected.
At end of processing, selected reads are sent to user and read continues
until next process event.

The data is expected to be sorted by position.

The cache is allocated first and increased as and when required. Reads have their
order of read set in them. They are kept in order of position and based on length
when position are same. Internally it is a linked list where reads are cached.
Selected reads are moved to a selected list and others to non-selected list.
Any read selected out of turn from non-selected list, due to pair processing, are
moved to an insert list. Post processing, reads are retrieved from selected and insert
list based on their ordinal. Once all selected are transferred, the read and cache cycle
continues.

The same works with iterators as well, with new hts_itr_usecache method.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented May 28, 2026

It's working on some real world data, but often significantly over-stepping the depth. Eg overshoot.sam.gz from #1084

Region     orig         m=1000  m=100   m=10    m=1                             
104524809  1-4591       1-1045  1-156   1-67    1-67  (mpileup -d $depth)                            
                        1-4591  1-3509  1-529   1-59  (-i hts_maxdepth=$depth)                          

I then produced some synthetic data and can verify this.
The awk summary is showing "position depth" or "range depth", for brevity.

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| ./test/test_view -i hts_maxdepth=200 - | samtools depth -|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
100 - 449   	100
450         	299
451         	498
452         	697
453         	896
454 - 549   	1095
550         	896
551         	697
552         	498
553         	299
554 - 999   	100

This basically lets all reads through the cache, and it's the same depth as unfiltered data.

Compare to the mpileup equivalent:

perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=($i>=450 && $i<=454)?200:1;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}'| samtools mpileup -d 200 -| cut -f 1,2,4|awk 'BEGIN {p=1;l=0} {if ($3==l) next;if (p==$2-1) {printf("%d         \t%d\n",p,l)} else {printf("%d - %d   \t%d\n",p,$2-1,l)};p=$2;l=$3}'|awk '$1>=100 && $1<1000'
[mpileup] 1 samples in 1 input files
100 - 449   	100
450 - 549   	199
550 - 999   	100

@jkbonfield
Copy link
Copy Markdown
Contributor

I can verify on my simulated amplicon sequencing data it works.

Simulated with:

#!/usr/bin/perl -w
use strict;

my $glen = 5000; # genome length
my $tlen = 350;  # template length
my $rlen = 250;  # read length
my $dist = 300;  # distance between amplicons.  <= $tlen
my $depth = 1000; # number of templates per amplicon

print "\@SQ\tSN:1\tLN:$glen\n";
my $seq = "A" x $rlen;
my $qual = "I" x $rlen;
my $tnum = 0;
for (my $pos = 1; $pos < $glen; $pos+=$dist) {
    my $pright = $pos + $tlen - $rlen;

    for (my $i = 0; $i < $depth; $i++) {
	# Left: 99 = PAIRED,PROPER_PAIR,MREVERSE,READ1
	print "read_$tnum\t99\t1\t$pos\t60\t${rlen}M\t=\t$pright\t$tlen\t$seq\t$qual\n";

	# Right: 147 = PAIRED,PROPER_PAIR,REVERSE,READ2
	print "read_$tnum\t147\t1\t$pright\t60\t${rlen}M\t=\t$pos\t-$tlen\t$seq\t$qual\n";

	$tnum++;
    }
}

We plot the raw depth ("_orig"), mpileup -d 100, and this PR with test_view -hts_maxdepth=1 | samtools depth.

This was one scenario that broke the original depth reduction code by crashing down to depth 1 following an over-deep section. The new PR doesn't have this and the depth neatly tracks the up/down nature we saw in the unfiltered data, but at a reduced scale (so 1000 / 2000 alternating depth becomes 100 / 200 alternating depth). This is ideal.

image

@jkbonfield
Copy link
Copy Markdown
Contributor

Working on the assumption (mostly valid) that overshoot.sam.gz alignments are all just 75M, then we can produce a table of start to end regions spanned by alignments and uniq -c then to get counts of each. It's quite small:

zcat /tmp/overshoot.sam.gz|awk '!/^@/ {print $4,$4+74}'|uniq -c

     10 104524607 104524681
      1 104524611 104524685
      2 104524626 104524700
      1 104524633 104524707
      1 104524780 104524854
     58 104524783 104524857
     75 104524784 104524858
    114 104524785 104524859
    103 104524786 104524860
     29 104524787 104524861
     14 104524788 104524862
     23 104524789 104524863
     80 104524790 104524864
     49 104524791 104524865
    162 104524792 104524866
     59 104524793 104524867
    136 104524794 104524868
    125 104524795 104524869
     79 104524796 104524870
     38 104524797 104524871
     23 104524798 104524872
     10 104524799 104524873
    101 104524800 104524874
     39 104524801 104524875
    155 104524802 104524876
    104 104524803 104524877
     54 104524804 104524878
    193 104524805 104524879
    178 104524806 104524880
     61 104524807 104524881
    180 104524808 104524882
    165 104524809 104524883
    183 104524810 104524884
    151 104524811 104524885
    132 104524812 104524886
    284 104524813 104524887
     86 104524814 104524888
    131 104524815 104524889
     97 104524816 104524890
    144 104524817 104524891
    100 104524818 104524892
     55 104524819 104524893
    112 104524820 104524894
    133 104524821 104524895
     98 104524822 104524896
    195 104524823 104524897
     13 104524824 104524898
    136 104524794 104524868
    125 104524795 104524869
     79 104524796 104524870
     38 104524797 104524871
     23 104524798 104524872
     10 104524799 104524873
    101 104524800 104524874
     39 104524801 104524875
    155 104524802 104524876
    104 104524803 104524877
     54 104524804 104524878
    193 104524805 104524879
    178 104524806 104524880
     61 104524807 104524881
    180 104524808 104524882
    165 104524809 104524883
    183 104524810 104524884
    151 104524811 104524885
    132 104524812 104524886
    284 104524813 104524887
     86 104524814 104524888
    131 104524815 104524889
     97 104524816 104524890
    144 104524817 104524891
    100 104524818 104524892
     55 104524819 104524893
    112 104524820 104524894
    133 104524821 104524895
     98 104524822 104524896
    195 104524823 104524897
     13 104524824 104524898
     55 104524825 104524899
     44 104524826 104524900
     26 104524827 104524901
     15 104524828 104524902
     43 104524829 104524903
     40 104524830 104524904
     19 104524831 104524905
      2 104524832 104524906
      8 104524833 104524907
      1 104524834 104524908
      8 104524835 104524909
      2 104524836 104524910
      1 104524837 104524911
      1 104524841 104524915
      1 104524843 104524917
      4 104524844 104524918

That's frequency, start, end.

Let us assume we want to selection region 1:104524850-104524850 with a maximum depth of 100, then we get this:

./test/test_view -i hts_maxdepth=100 /tmp/overshoot.sam.gz 1:104524850-104524850|samtools depth - | grep 104524850
1	104524850	3508

So the depth is 3508 (of maximum 4592). The problem is bar the first dozen or so alignments, all of these span position 104524850. The first few lines being:

      1 104524780 104524854
     58 104524783 104524857
     75 104524784 104524858
    114 104524785 104524859
    103 104524786 104524860

By this time we're already over 300 fold depth for position 104524850, and as we know the end coordinate we could in theory stop here, but it wouldn't be a representative sub-selection as it's only records ending very close to our requested location. That's not the preferred output. I'm guessing it's continuing to add data until start coord is outside of the region requested, and each new alignment is extending the right-most base a little bit more and we don't yet know whether by not adding it we'd get a "depth crash".

However part of the reason for using a history / cache of records is that we can "undo" the additions and smooth out such peaks. This doesn't appear to have happened. There are 59 unique sets of start..end coordinates (assuming 75M) that span region 1:104524850-104524850, so obviously just picking 2 from each would be slightly over the requested 100 records. This demonstrates that, manually at least, there clearly are reads which can be selected to give a better profile.

Arguably for larger regions, we may wish to reflect the ups and downs that appear to a certain degree rather than be completely uniform, but we don't want the huge spikes. I think we need to have a better selection process than a greedy left-to-right scan.

@jkbonfield
Copy link
Copy Markdown
Contributor

jkbonfield commented Jun 3, 2026

Another easier to test example of the problem. We have 10 reads starting at every reference position, of length 100. So ignoring the first 100 and last 100 bps of the contig, everything is at depth 1000.

Asking it to filter to depth 100 does nothing at all, because every new contig coordinate hasn't yet been covered so we add another 10 records to it. We need to delay this decision. See the md5sums in these two commands, which only differ by ./test/test_view -i hts_maxdepth=100 .

@ [samtools.../htslib]; perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=10;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}' | md5sum
66e4fb9fb4fb48c2f3ce223b6731addd  -

@ [samtools.../htslib]; perl -e 'print "\@SQ\tSN:1\tLN:1100\n";for ($i=1;$i<1000;$i++) {$n=10;for ($j=0;$j<$n;$j++) {print "s_${i}_${j}\t0\t1\t$i\t60\t100M\t*\t0\t0\t","N"x100,"\t*\n"}}' | ./test/test_view -i hts_maxdepth=100 -|md5sum
66e4fb9fb4fb48c2f3ce223b6731addd  -

The cause of this is due to a strict left-to-right greedy algorithm. As every read is the same length, we always need to add new records because they always extend the right-most base covered by a little bit. As we never have more than 100 reads ending at a single base position, we never filter out with a hts_maxdepth=100 request so we end up on depth 1000 again.

An alternative is to not add anything covering position X until we have a record starting at X+1, and then look back through our history to see what records to add. However that may lead to reads starting 1-10 and ending 101-110, and then starting 101-110 and ending 201-210, etc. It's a bit "blocky" and many positions (eg 105) are covered solely by the extreme starts or extreme ends of reads. Ideally we'd have every coordinate covered a broad mix of locations with the alignments, like a view -s 0.1 would give us, but with the guarantees of not going too low.

Perhaps some sort of reproducible stochastic approach where we add records over the current maximum depth with a probability inversely based on how over-depth we are, would give a smoother profile that still partially reflects ups and down trends. but with the ability to back fill from our history cache when this leaves us at lower depth than we asked for. It would overshoot, but only by a fixed amount (depending on our probabilities), and still give us data covering all coordinates from bases in the middle of reads rather than some only having extreme start/end bases.

Copy link
Copy Markdown
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is significantly longer than I was expecting for the functionality, but I think the complexity probably comes from the desire to do read pairs properly. The old depth filtering code didn't care about read-pairs at all, but that was one of many problems that needed fixing. This is a hard nut to crack, but worth trying!

I've only done basic scans through the code so far and looked at the header files. I then got distracted by doing usability checks and that took up my main time. The core algorithm needs fixing first to filter better. I'm playing with some ideas.

Anyway, here is a fast pass of issues. The review is still in progress however.

Comment thread sam_cache.h
bam1_t *r;
struct ce_t *next, *prev;
uint64_t len;
kstring_t log;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kstring_t log isn't needed unless CACHE_DBG_LOG is enabled, and likewise the ks_initialise/ks_free calls in sam_cache.c.

Comment thread sam_cache.h
uint64_t ord; //ordinal
bam1_t *r;
struct ce_t *next, *prev;
uint64_t len;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Len comes from bam_cigar2rlen which returns hts_pos_t. It's basically the same type, but we probably ought to use that type here instead of uint64_t.

Comment thread Makefile
hfile_libcurl.o hfile_libcurl.pico: hfile_libcurl.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_hts_alloc_h) $(htslib_kstring_h) $(htslib_khash_h)
hfile_s3.o hfile_s3.pico: hfile_s3.c config.h $(hfile_internal_h) $(htslib_hts_h) $(htslib_hts_alloc_h) $(htslib_kstring_h) $(hts_time_funcs_h)
hts.o hts.pico: hts.c config.h os/lzma_stub.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h config_vars.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_alloc_h) $(htslib_hts_expr_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) $(htscodecs_htscodecs_h)
hts.o hts.pico: hts.c config.h os/lzma_stub.h $(htslib_hts_h) $(htslib_bgzf_h) $(cram_h) $(htslib_hfile_h) $(htslib_hts_endian_h) version.h config_vars.h $(hts_internal_h) $(hfile_internal_h) $(sam_internal_h) $(htslib_hts_alloc_h) $(htslib_hts_expr_h) $(htslib_hts_os_h) $(htslib_khash_h) $(htslib_kseq_h) $(htslib_ksort_h) $(htslib_tbx_h) $(htscodecs_htscodecs_h) $(sam_cache_h)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need a makefile rule for sam_cache.o

Comment thread sam_cache.h
ce_t *head_ins, *tail_ins; //inserted alignments
uint64_t selcnt, nselcnt, inscnt,rcnt;
uint64_t ord; //last ordinal
int trgr; //sts: 0 not ready 1 caching 2 wnd full 3 ready 4 end
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an enum, which will help to make all the assignments and checks of trgr in sam_cache.c more self-documenting

Comment thread htslib/hts.h
const char *fnidx;
struct sam_hdr_t *bam_header;
struct hts_filter_t *filter;
void *c; //for cache
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please pick a better name than c. Maybe cache?

It also shouldn't be void *. It should be an opaque type name, declared in a public header somewhere (or here), with the actual non-opaque version in the internal sam_cache.h. This then avoids casts elsewhere and improves code documentation / readability.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I note this is also changing the size of the struct and is therefore technically ABI breaking, however the filter element appeared in 1.12 and we didn't bump LIBHTS_SOVERSION` for that. I think the justification was probably that we have an explicit comment saying "This structure should be considered opaque by end users", which gives us some more leeway. (Albeit the size being known means we could declare an array of htsFile, but that wouldn't be possible if it was anonymous so should also come under the "opaque" declaration.)

Comment thread hts.c
// wndsz = 350;
// dpth = dpth & 0x00FF; //upto 65535
//todo check whether sorted by pos and setup only if so
if (dpth > 0)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A stylistic comment, but generally if we have inner braces used we prefer outer braces too to aid readability.

Comment thread hts.c
}
if (c && ret >= -1) {
processcache_iter(c);
if (!getfromreadcache_iter(c, s, &iter->tid, &iter->curr_beg, &iter->curr_end)) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this correct? iter claims:

 * tid       - Reference id of the target region
 * beg       - Start position of the target region
 * end       - End position of the target region

 * curr_tid   - Reference id of the current alignment
 * curr_beg   - Start position of the current alignment
 * curr_end   - End position of the current alignment

Here we're mixing current alignment coordinates with region coordinates by using tid from one and beg/end from the other. It may be correct and I haven't followed the logic, but it seems different to most of the other getfromreadcache_iter calls.

Comment thread sam_cache.c
Comment on lines +242 to +246
if (!c) { //create cache
if (!(c = calloc(1, sizeof(rc_t))))
return NULL;
fp->c = c;
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this call setupcache instead? I'm not sure what the benefit is of allocating the structure memory without setting all the other parameters up.

Comment thread sam_cache.c
c->maxdpth = maxdpth;
c->w_st = c->w_en = -1;
c->tid = -2; //start
c->selpair = kh_init(kh_pair);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing errr check on kh_init, which does a malloc and can therefore fail.

Comment thread sam.c
const hts_cram_idx_t *cidx = (const hts_cram_idx_t *) idx;
if (idx == NULL)
return hts_itr_query(NULL, tid, beg, end, sam_readrec_rest);
return hts_itr_usecache(hts_itr_query(NULL, tid, beg, end, sam_readrec_rest));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this logic.

hts_itr_usecache is in the hts.h:

static inline hts_itr_t* hts_itr_usecache(hts_itr_t *itr) { if (itr) itr->usecache = 1; return itr; }

It's static inline, so we don't have an extra function overhead. That's fine.

However it's basically saying if hts_itr_query worked and didn't return NULL, then set itr->usecache = 1. In that case the iter query function may as well just set it to 1 itself as it's permanently enabled.

Although why is it always on? Shouldn't it only be enabled if the user specified the command line option? Is it because we're passing in NULL so we don't have an index, let alone a file descriptor to check for the CLI option being enabled? (I confess the overloading in the indexing code scares and confuses me as to what type is what and when it exists! Hence why CRAM simply has the index attached to the cram_fd rather than as an additional type floating around.)

It seems to be used in getsamcache and if set (which it will be) then it enables calls to getfromreadcache_iter in hts_itr_next. This seems to imply caching is permanently enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants