Adding the sylph coverage model to yacht by rtraborn · Pull Request #141 · KoslickiLab/YACHT

rtraborn · 2025-12-08T15:23:53Z

Hi @dkoslicki and team!
I created a sylph coverage model from Shaw and Yu, 2024 and added it to yacht, in a branch I named superyacht just for fun.
This is a draft that I'm still testing, so that and other caveats still apply. A few notes:

The coverage code is in a function called cov_calc, which calculates lambda and ANI as specified by the sylph paper.
For engineering reasons, I decided to put cov_calc inside get_exclusive_hashes, given that that function provides us with the signature objects needed to make the calculations.
Because of this, I am passing the output of cov_calc, a pandas dataframe, along with hypothesis_recovery. There are probably good ways to integrate this, and I'll give this some more thought.
Also, I decided to not incorporate the output of cov_calc more deeply into hypothesis_recovery for now. I have some ideas on what might be the best approach that we could discuss if you'd like. I thought it would be best to share this new branch while I look into this more deeply.
The script internal_superyacht_test.py is just a script that I have been using to test the new branch, and this can be ignored; I'll remove it once we move towards publication.
I plan to update the way I instantiated the AdjustStatusLambda enum in a more idiomatic python way this week. It should be a relatively quick fix.
I did not incorporate the taxonomic reassignment/winner_map routine from sylph, but it's something I would like to add.
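
For readers unfamiliar with the model, here is a minimal sketch of the two quantities mentioned above. It is not the code in cov_calc. The function names `estimate_lambda_ratio` and `naive_ani` are hypothetical. The estimator relies only on the Poisson identity P(2)/P(1) = lambda/2, and the ANI conversion is the standard k-th root of the k-mer containment.

```python
from collections import Counter


def estimate_lambda_ratio(multiplicities):
    """Estimate the Poisson mean (lambda) from k-mer multiplicities.

    For a Poisson(lambda) variable, P(X=2) / P(X=1) = lambda / 2, so
    lambda is approximately 2 * N2 / N1, where N1 and N2 are the numbers
    of k-mers observed exactly once and exactly twice.
    Returns None when no k-mer has multiplicity 1 (hypothetical convention).
    """
    hist = Counter(multiplicities)
    n1, n2 = hist.get(1, 0), hist.get(2, 0)
    if n1 == 0:
        return None
    return 2.0 * n2 / n1


def naive_ani(n_shared, n_genome, k):
    """Convert k-mer containment (shared / total) to ANI via the k-th root."""
    if n_genome == 0:
        return 0.0
    return (n_shared / n_genome) ** (1.0 / k)
```

With low coverage, many genome k-mers are simply not sampled, so the naive ANI underestimates the true value; that gap is what the lambda-based correction is meant to close.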

I'm going to do more testing this week on additional datasets. Happy to discuss here or via email/video!

rtraborn · 2026-01-02T22:02:16Z

After some more testing, I just pushed some additional updates to this branch.

Fixed a typo: corrected to logger.warning in cov_calc.py
Added missing scipy.special.gamma import to utils.py
Fixed a few bugs I discovered in binary_search_lambda()
Replaced print statements with logger.info() for consistency
Consolidated duplicate constants into utils.py
Removed a few unused local variables from hypothesis_recovery_src.py

rtraborn · 2026-01-07T18:12:11Z

A small update: with my most recent commit from last night, I made the promised change to the AdjustStatusLambda enum in cov_calc to make it more idiomatic Python. I think it looks cleaner. Thanks to @standage for flagging this!

rtraborn · 2026-01-14T19:14:02Z

Hi All!

I've made some more changes over the past week. Here's an overview of my most recent updates (Part 1 of 2):

Formally adds the winner-takes-all strategy from sylph, which performs abundance calculation and k-mer reassignment. This helps prevent double-counting of k-mers by assigning shared k-mers to the taxon with the highest calculated ANI. Note that, in the interest of performance, this procedure is run only once per instance rather than twice (as in sylph). We can discuss this in the future.
This procedure tracks k-mers lost to reassignment (kmers_lost column) and filters out organisms with final_est_ani < 0.90 (90% ANI threshold).
Coverage results are now merged into the overall Excel output of yacht run. Previously the cov_calc output was passed along as a separate dataframe. I tried to do this without touching too much of the original yacht code; let me know what you think and if we need to make any tweaks.
The columns added to this new output are naive_ani, final_est_ani, final_est_cov, mean_cov, median_cov, lambda_status, ani_ci, lambda_ci, rel_abund, kmers_lost
Fixed a bug created by the refactoring described above 🙃 (MIN_ANI_THRESHOLD is now defined).
Fixed a bug that I encountered after more extensive testing: the ani_from_lambda function raised a ZeroDivisionError when lambda_val was 0 or very close to 0. It took testing on a pretty diverse metagenomic sample (more on this test later) for this to crop up, but I'm glad it did.
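
To make the winner-takes-all reassignment described in this update concrete, here is a minimal single-pass sketch. It is an illustration under assumptions, not the code in hypothesis_recovery_src.py: the function name `winner_takes_all`, the dictionary inputs, and the tie-break by organism name are all hypothetical.

```python
def winner_takes_all(genome_kmers, ani):
    """Assign each shared k-mer hash to the genome with the highest ANI.

    genome_kmers: dict mapping organism name -> set of k-mer hashes
    ani:          dict mapping organism name -> estimated ANI
    Returns (kept, lost): the k-mers each organism retains, and the number
    of k-mers each organism lost to a higher-ANI organism.
    """
    owner = {}
    # Visit organisms from highest to lowest ANI; the first claimant keeps
    # the k-mer. Ties are broken by name (an assumption, for determinism).
    for name in sorted(genome_kmers, key=lambda n: (-ani[n], n)):
        for kmer_hash in genome_kmers[name]:
            owner.setdefault(kmer_hash, name)

    kept = {name: set() for name in genome_kmers}
    for kmer_hash, name in owner.items():
        kept[name].add(kmer_hash)

    lost = {name: len(genome_kmers[name]) - len(kept[name])
            for name in genome_kmers}
    return kept, lost
```

For example, if organism A (ANI 0.99) holds hashes {1, 2, 3} and organism B (ANI 0.95) holds {3, 4}, then A keeps all three hashes and B keeps only {4}, with one k-mer recorded as lost for B.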

sonarqubecloud · 2026-01-29T20:13:35Z

Quality Gate passed

Issues
20 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

sonarqubecloud · 2026-03-13T18:16:23Z

Quality Gate failed

Failed conditions
1 Security Hotspot

See analysis details on SonarQube Cloud

rtraborn · 2026-03-16T20:02:14Z

Hi all! I made more updates to this branch (along with a decent amount of testing) as we discussed and the code is ready for your review.
Updates include the following:

Added a new CLI argument (--winner-takes-all) for the winner-map functionality, along with a two-pass winner-takes-all mode.
Added ANI capping, which prevents lambda-corrected ANI values from exceeding 1.
Added a check that skips the correction under high-coverage conditions (coverage >= 3.0), as per sylph.
Most notably, added coverage calculation for yacht via the top-level --calculate-coverage CLI flag, which replaces the user-supplied values for c. When run, a single worksheet is produced in the (*.xlsx) output instead of n worksheets.
(Minor) Added a small fix to the gzip functionality when running yacht train: replaced Python's gzip with the system gzip. This corrects an error with incomplete decompression that I encountered on Mac OS X.
This was tested on about 5 standard metagenomic datasets with no errors or taxa misidentifications. I look forward to your comments and thoughts!
[P.S. As an aside, the reason that the most recent SonarCloud scan failed appears to be because of regular yacht code that preexists this branch, so I refrained from pushing the fixes here. I can make those changes separately if desired.]
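
The three safeguards discussed in this thread (the guard against lambda near zero, the high-coverage skip, and the cap at 1.0) can be sketched together as follows. This is a sketch under assumptions rather than the code in cov_calc: the function name `adjusted_ani`, the `LAMBDA_EPS` tolerance, and the use of lambda itself as the coverage value are hypothetical. The correction divides the observed containment by 1 - exp(-lambda), the probability that a Poisson-sampled k-mer is seen at least once.

```python
import math

HIGH_COVERAGE_CUTOFF = 3.0  # from the description above: skip correction at coverage >= 3.0
LAMBDA_EPS = 1e-9           # hypothetical tolerance for "lambda is effectively zero"


def adjusted_ani(containment, lam, k):
    """Coverage-adjust a k-mer containment value and convert it to ANI.

    containment: fraction of the genome's k-mers found in the sample
    lam:         estimated Poisson mean coverage (None if unavailable)
    k:           k-mer size
    """
    if containment <= 0.0:
        return 0.0
    if lam is None or lam <= LAMBDA_EPS:
        # 1 - exp(-lam) would be ~0 here, so dividing would fail or explode;
        # fall back to the uncorrected value.
        corrected = containment
    elif lam >= HIGH_COVERAGE_CUTOFF:
        # Nearly every k-mer is sampled at high coverage; skip the correction.
        corrected = containment
    else:
        detect_prob = -math.expm1(-lam)  # 1 - exp(-lam), stable for small lam
        corrected = containment / detect_prob
    # Cap at 1.0: an ANI above 100% is not biologically meaningful.
    return min(corrected ** (1.0 / k), 1.0)
```

Using `math.expm1` avoids the loss of precision that `1 - math.exp(-lam)` suffers when lambda is small, which is exactly the regime where the correction matters most.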

R. Taylor Raborn and others added 5 commits December 5, 2025 09:53

First commit on superyacht branch, including files required to calculate effective coverage, etc according to sylph (Shaw and Yu, 2024).

00e2db0

Move files [skip ci]

21dc0cb

Removed remaining print statements throughout.

3b5a370

Moved test script [skip ci]

6f0d8ff

Corrected incorrect indent. [skip ci]

e09ec7b

dkoslicki mentioned this pull request Dec 17, 2025

Update workflow to trigger on pull requests #144

Merged

rtraborn added 2 commits December 29, 2025 22:37

Various code improvements to improve legibility and remove stray bits that aren't necessary.

50ff040

Various bug fixes and improvements to code quality. Replaced relevant print statements with logger. Moved all constants to utils.py.

4dd7972

Changed to a more typically python enum for cov_calc.

92e25cc

rtraborn and others added 5 commits January 10, 2026 17:11

Minor update to README and .gitignore.

80bedab

Added winner map functionality to hypothesis_recovery_src.py.

b1bd726

Minor update to .gitignore.

bab404f

Updated utils.py to define MIN_ANI_THRESHOLD.

b8b476a

Added parallelization to the coverage calculation; should increase speed.

1f81fcc

R. Taylor Raborn added 11 commits January 14, 2026 14:31

Cleaning up documentation.

12fcde9

Added sample_sig as a pool variable to remove big performance overheads.

94cb13a

Improvements to parallel processing and new arguments for the winner-takes-all k-mer reassignment.

1f643c1

Small tweaks to help statement for new winner-map related arguments.

ef9077d

Fixed a small typo.

45faa74

Renamed MEDIAN_ANI_THRESHOLD to avoid confusion.

6997b5c

Fixed critical bug due to umap_unordered- now matching on organism name.

3333649

Added ANI-capping utility to avoid biologically impossible ANI inflation due to low lambda.

782b2fd

Updated cov_calc to cap ANIs at 1.0; pt II of ani capping.

835f434

Removed old commented code that was replaced by the winner-map addition.

8b36fb1

Updated ANI adjustment to avoid ANI adjustment under high-coverage, update to median_ani_threshold.

7e9f9f9

Patched a minor bug relating to performance of python's gzip on Mac OS, replacing it with the system gzip

d35725d

R. Taylor Raborn added 2 commits March 13, 2026 14:10

Added two-pass mode to --winner-take-all procedure.

7f04e17

Various updates to run_YACHT, including adding yacht-level arugments --calculate-coverage and --no_two_pass.

832faa9

rtraborn wants to merge 27 commits into KoslickiLab:superyacht from rtraborn:superyacht
2 participants