Added capsid (CA) variant reporting for gag gene #131
Open
rbaldwin-bugseq wants to merge 7 commits into
- Add is_capsid_resistance() method to identify CA gene mutations with CAI drug class
- Add isCapsidResistance boolean flag to mutation output
- Capsid resistance mutations include Major and Accessory types for lenacapavir
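To illustrate the kind of check the bullets above describe, here is a minimal sketch. The PR's actual method signature and data structures are not shown in this conversation, so the function arguments and the dictionary layout below are hypothetical; only the names `is_capsid_resistance`, `isCapsidResistance`, `CA`, `CAI`, `Major`, and `Accessory` come from the commit message.

```python
# Hypothetical sketch: the real method in the PR may take different arguments.
CAPSID_GENE = "CA"
CAPSID_DRUG_CLASS = "CAI"
CAPSID_MUTATION_TYPES = {"Major", "Accessory"}


def is_capsid_resistance(gene, drug_class, mutation_type):
    """Return True for a CA mutation scored under the CAI drug class."""
    return (
        gene == CAPSID_GENE
        and drug_class == CAPSID_DRUG_CLASS
        and mutation_type in CAPSID_MUTATION_TYPES
    )


def annotate(mutation):
    """Copy a mutation record (assumed dict layout) and add the boolean flag."""
    out = dict(mutation)
    out["isCapsidResistance"] = is_capsid_resistance(
        mutation["gene"], mutation["drugClass"], mutation["type"]
    )
    return out


example = annotate(
    {"text": "M66I", "gene": "CA", "drugClass": "CAI", "type": "Major"}
)
print(example["isCapsidResistance"])
```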
- Add CA gene nucleotide coordinates (1186-1878) from Gag region
- Add gag_start (790) reference point for CA amino acid position calculations
- Rename pol_nuc_map to gene_nuc_map to support multiple gene regions
- Update create_gene_map() to handle CA's Gag reference vs Pol reference for other genes
- Add CA to min_overlap dictionary requiring 60 AA minimum coverage
- Enables processing of CA gene mutations and LEN (lenacapavir) resistance scoring

Tested: CA gene now appears in alignedGeneSequences and drugResistance sections
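The coordinate arithmetic implied by these bullets can be sketched as follows. This is not the code from the PR: the helper names are made up for illustration, and the sketch assumes the quoted coordinates are 1-based and inclusive. Only the numbers 790, 1186, and 1878 come from the commit message.

```python
# Coordinates quoted in the commit message (assumed 1-based, inclusive).
GAG_START = 790
CA_NUC_START = 1186
CA_NUC_END = 1878


def region_length_aa(nuc_start, nuc_end):
    """Number of whole codons in an inclusive nucleotide range."""
    return (nuc_end - nuc_start + 1) // 3


def aa_position(nuc, region_start):
    """1-based amino acid position of the codon containing nucleotide `nuc`,
    counted from the first codon of the region starting at `region_start`."""
    return (nuc - region_start) // 3 + 1


# CA length in amino acids.
print(region_length_aa(CA_NUC_START, CA_NUC_END))  # 231

# First CA codon, numbered within CA and within Gag.
print(aa_position(CA_NUC_START, CA_NUC_START))  # 1
print(aa_position(CA_NUC_START, GAG_START))     # 133

# Last CA codon, numbered within CA.
print(aa_position(CA_NUC_END, CA_NUC_START))    # 231
```

Under these assumptions the CA region spans 231 codons, and CA position 1 corresponds to Gag position 133.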
Documents the working CA gene support including:
- Feature overview and verified functionality
- Example JSON output format
- Unit test results showing correct mutation identification
- Next steps for cascade pipeline integration
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
CA mutations are scored using the standard HIVDB resistance system.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Thanks for contributing this PR! I am travelling between conferences right now but will read over the changes once things have settled down.
Hi @rbaldwin-bugseq! I was hoping to also help with this. Is there anything else to add to this PR other than answering this question: Why do we need +1 for pol and not gag? Do you think it would be better to investigate this after this PR is merged, and put any needed fix in a new PR? Kindly let me know when you have the time, thanks!
This was a response to a previous PR (#109). It was tested with the subtype C sequence AB254155.1 plus an artificial triple-mutant sequence carrying three lenacapavir resistance mutations in the capsid region of gag:
M66I (Methionine → Isoleucine at position 66) - Major, score: 60
Q67H (Glutamine → Histidine at position 67) - Major, score: 30
K70S (Lysine → Serine at position 70) - Major, score: 30
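For anyone reproducing this kind of test, a triple mutant like the one listed above can be built by substituting residues at the listed CA positions. The sketch below is illustrative only: the helper names are hypothetical and the input is a toy placeholder string, not the real AB254155.1 capsid sequence. The scores are the ones reported above; the plain sum ignores any combination rules the scoring system may apply.

```python
import re

# Scores as reported in this PR description.
REPORTED_SCORES = {"M66I": 60, "Q67H": 30, "K70S": 30}


def parse_mutation(text):
    """Split a mutation such as 'M66I' into (reference, position, variant)."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", text)
    if match is None:
        raise ValueError(f"unrecognized mutation: {text!r}")
    ref, pos, alt = match.groups()
    return ref, int(pos), alt


def apply_mutations(protein, mutations):
    """Return a copy of `protein` with each substitution applied (1-based)."""
    residues = list(protein)
    for text in mutations:
        ref, pos, alt = parse_mutation(text)
        if residues[pos - 1] != ref:
            raise ValueError(f"{text}: expected {ref} at position {pos}")
        residues[pos - 1] = alt
    return "".join(residues)


# Toy placeholder with M, Q, K at positions 66, 67, 70 (NOT a real CA sequence).
toy_ca = "A" * 65 + "MQAAK" + "A" * 10
mutant = apply_mutations(toy_ca, REPORTED_SCORES)

print(mutant[65:70])                  # residues 66-70 after substitution
print(sum(REPORTED_SCORES.values()))  # naive sum of the reported scores
```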
See the attached input sequence and results
ca_results.zip
A min overlap of 23 was selected based on the size of the CA protein region in the gag gene (231 aa) and the fact that resistance variants are distributed throughout the region. The IN region (288 aa) uses a 30 aa overlap, about 10% coverage, so about 10% coverage for CA seemed appropriate as well.
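The proportional reasoning above can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and the lengths and overlaps are the ones quoted in this description.

```python
def overlap_fraction(min_overlap_aa, region_length_aa):
    """Fraction of a region covered by its minimum required overlap."""
    return min_overlap_aa / region_length_aa


in_fraction = overlap_fraction(30, 288)  # IN: 30 aa of 288 aa
ca_fraction = overlap_fraction(23, 231)  # CA: 23 aa of 231 aa

print(f"IN: {in_fraction:.1%}")
print(f"CA: {ca_fraction:.1%}")

# Applying a flat 10% rule to CA and rounding gives the chosen value.
print(round(0.10 * 231))  # 23
```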
Unresolved question: the existing behavior adds +1 for the pol gene, but gag appeared to be indexed correctly without it, so I did not implement the offset for gag. Why do we need +1 for pol and not gag?