Summary
mavedb-api runs on ga4gh.vrs 2.3.x, whose multi-variant model is CisPhasedBlock (digest prefix ga4gh:CPB.). dcd_mapping is pinned to ga4gh.vrs==2.0.0a6, whose equivalent model is Haplotype (digest prefix ga4gh:HT.). As a result, the API's alleles table now holds two vocabularies for the same concept: assay-level multi-variant alleles produced by dcd_mapping are stored as type "Haplotype", while the reverse-translation cis-phased candidates added in the VRS-correctness work are stored as type "CisPhasedBlock".
This issue tracks making dcd_mapping emit CisPhasedBlock so the multi-variant vocabulary is consistent end to end. It is more of a dependency-modernization effort than a focused rename.
Problem
- A single biological multi-variant variation is represented two different ways depending on its source: Haplotype (ga4gh:HT.) from dcd_mapping vs CisPhasedBlock (ga4gh:CPB.) from the API reverse-translation job.
- Because the container type participates in the GA4GH digest, the same set of member alleles digests differently under Haplotype vs CisPhasedBlock, so identical multi-variant variations from the two sources will not deduplicate against each other in the alleles table.
- The inconsistency is a clarity/maintainability liability (consumers, queries, and debugging must understand both vocabularies), even though it is not currently a functional defect.
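The digest divergence described above can be illustrated with a toy model. This is a minimal sketch, not the real VRS serialization: it assumes only that the container's type string is part of the digested payload, as this issue states, and uses the GA4GH sha512t24u truncated digest. The function toy_identifier, the payload shape, and the second member digest are hypothetical.

```python
import base64
import hashlib
import json

# Digest prefixes named in this issue.
PREFIXES = {"Haplotype": "HT", "CisPhasedBlock": "CPB"}


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def toy_identifier(container_type: str, member_digests: list[str]) -> str:
    """Toy stand-in for VRS identifier computation (NOT the real serialization)."""
    # The container type is part of the payload, so it changes the digest.
    payload = {"type": container_type, "members": sorted(member_digests)}
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return f"ga4gh:{PREFIXES[container_type]}.{sha512t24u(blob)}"


members = [
    "5IGw5Cw0n9PmJP3E-Mv5OaFUSFN9wTmx",  # allele digest quoted in this issue
    "HYPOTHETICAL_SECOND_ALLELE_DIGEST",  # placeholder, not a real digest
]
ht_id = toy_identifier("Haplotype", members)
cpb_id = toy_identifier("CisPhasedBlock", members)

# Same members, different container type: the identifiers cannot match.
print(ht_id)
print(cpb_id)
```

Even with identical members, the two identifiers differ in both prefix and digest body, which is why a lookup keyed on the identifier treats them as unrelated rows.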
Why it is safe to defer (current state is not broken)
- dcd_mapping emits "Haplotype", and the API already accepts both "Haplotype" and "CisPhasedBlock": in the API's annotation utilities, vrs_object_from_mapped_variant builds a CisPhasedBlock from either type string. Serving and annotation work regardless of which vocabulary is stored.
- Single-Allele VRS digests are byte-identical across ga4gh.vrs 2.0.0a6 and 2.3.1 (verified: ga4gh:VA.5IGw5Cw0n9PmJP3E-Mv5OaFUSFN9wTmx). Single-variant score sets therefore need no re-mapping; only multi-variant score sets change digest (HT → CPB).
Spike findings (already investigated)
- CisPhasedBlock does not exist in ga4gh.vrs 2.0.0a6; it is a later-alpha rename, so adopting it requires bumping to ~2.3.x.
- A minimal bump (only ga4gh.vrs) is not viable: ga4gh.vrs 2.3.1 pulls a newer ga4gh.core that removed ga4gh.core.models.Gene, which dcd_mapping's pinned gene_normalizer==0.3.0.dev2 imports at module load. This raises an AttributeError/ImportError at import time.
- The bump therefore cascades: gene_normalizer must move from 0.3.0.dev2 to a current release (up to 0.11.5), and cool-seq-tool almost certainly from 0.4.0.dev3 to a current release (up to 0.17.0). These are multi-year version jumps with their own breaking API changes that ripple through dcd_mapping's transcript lookup, transcript selection, and alignment modules.
- ga4gh.vrs._internal.models is removed in 2.3.1; the public ga4gh.vrs.models path must be used instead.
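Because the private import path disappears in 2.3.1, it may help to list every remaining use before bumping. The following is a minimal sketch of such a scan; find_legacy_imports is a hypothetical helper, not part of dcd_mapping, and it does a plain text match rather than parsing Python, so it will also flag mentions in comments or strings.

```python
import re
from pathlib import Path

# The private module path that this issue says is removed in ga4gh.vrs 2.3.1.
LEGACY_IMPORT = re.compile(r"ga4gh\.vrs\._internal\.models")


def find_legacy_imports(root: Path) -> list[tuple[str, int, str]]:
    """Return (relative path, line number, stripped line) for each legacy reference."""
    hits = []
    for path in sorted(root.rglob("*.py")):
        text = path.read_text(encoding="utf-8", errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if LEGACY_IMPORT.search(line):
                hits.append((str(path.relative_to(root)), lineno, line.strip()))
    return hits
```

Calling find_legacy_imports(Path("path/to/checkout")) on a working copy gives a list that should be empty once the import migration is complete.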
Proposed behavior
dcd_mapping produces CisPhasedBlock (digest prefix ga4gh:CPB.) for all multi-variant variations in both pre-mapped and post-mapped outputs, on a modernized ga4gh.vrs / gene_normalizer / cool-seq-tool stack, with the existing mapping behavior otherwise unchanged for single-variant cases.
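To confirm which versions of the three packages an environment actually resolved to, something like the following could be run inside it. This is a sketch under assumptions: the distribution names in STACK are guesses at how the packages are published and may need adjusting to match the project's dependency specification, and installed_versions is a hypothetical helper.

```python
from importlib import metadata

# Assumed distribution names; adjust to match the project's dependency spec.
STACK = ("ga4gh.vrs", "gene-normalizer", "cool-seq-tool")


def installed_versions(names=STACK) -> dict:
    """Map each distribution name to its installed version, or None if absent."""
    report = {}
    for name in names:
        try:
            report[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            report[name] = None
    return report


print(installed_versions())
```

This only reports what is installed; it does not check that the versions are mutually compatible, which is what pip check is for.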
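Acceptance criteria
- dcd_mapping depends on a ga4gh.vrs version that provides CisPhasedBlock (~2.3.x), with gene_normalizer and cool-seq-tool bumped to mutually compatible current releases.
- pip check (or the project's equivalent dependency-consistency check) reports no conflicts in the dcd_mapping environment.
- All ga4gh.vrs._internal.models imports are replaced with ga4gh.vrs.models across the codebase and tests.
- No references to Haplotype remain in dcd_mapping production code or tests for VRS 2.x output; multi-variant results are emitted as CisPhasedBlock.
- Multi-variant output carries "type": "CisPhasedBlock" and a ga4gh:CPB. digest; single-variant output is unchanged and produces the same ga4gh:VA. digests as before the bump.
- The VRS 1.3 back-compat helper consumes CisPhasedBlock instead of Haplotype.
- The mavedb-api alleles table holds a single multi-variant vocabulary (CisPhasedBlock); confirmed by an end-to-end mapping → ingestion check for a multi-variant score set.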
Implementation notes
- Dependency cascade (resolve as a set, not one-by-one):
  - ga4gh.vrs: 2.0.0a6 → ~2.3.x
  - gene_normalizer: 0.3.0.dev2 → current (up to 0.11.5); root cause of the cascade is the removed ga4gh.core.models.Gene.
  - cool-seq-tool: 0.4.0.dev3 → current (up to 0.17.0); expect breaking API changes affecting transcript lookup, transcript selection, and alignment.
- Import migration: replace ga4gh.vrs._internal.models with ga4gh.vrs.models in the VRS mapping module, schemas, annotation module, lookup module, VRS utilities, and the VRS map tests.
- Rename Haplotype → CisPhasedBlock:
  - VRS mapping module: the multi-member construction in _construct_vrs_allele (returns a block when more than one member) and its return type annotation.
  - Schemas: the Haplotype import and the pre_mapped / post_mapped union types, plus the VRS 1.3 union members.
  - Annotation module: the Haplotype import, the member-extraction helper for ref-position computation, the post-mapped type checks, the consistency check that rejects mixed Allele/Haplotype structures, and the typed accessors for pre/post-mapped.
  - Tests: VRS map tests that assert on Haplotype.
- VRS 1.3 back-compat: rework the helper that converts the VRS 2.x multi-variant object to a VRS 1.3 Haplotype (VRSATILE variation descriptors) so it consumes a CisPhasedBlock. The VRS 1.3 output schema name stays Haplotype; only the input type changes.
- Cheaper alternative if the full cascade is not justified: convert Haplotype → CisPhasedBlock at the mavedb-api ingestion boundary (worker mapping job) so the alleles table is uniformly CisPhasedBlock without touching dcd_mapping's dependencies. The conversion logic already exists in the API annotation utilities. Tradeoff: this recomputes the stored digest for multi-variant assay-level alleles (CPB instead of HT) and leaves dcd_mapping's native output and VRS 1.3 layer still emitting Haplotype, so the source and the API's stored form still diverge, although the API's data model becomes internally consistent. Decide between this and the full migration before starting.
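For the cheaper alternative, the boundary conversion might look roughly like the sketch below. It is not the existing conversion logic in the API annotation utilities: to_cis_phased_block is a hypothetical function, and it assumes the mapper output arrives as a plain dict with "type", "members", and optional "id" / "digest" keys.

```python
from copy import deepcopy


def to_cis_phased_block(vrs_obj: dict) -> dict:
    """Return a copy of a serialized VRS object with Haplotype renamed to CisPhasedBlock.

    Assumes a plain-dict payload (hypothetical shape). Any other type, such as a
    single Allele, is returned as an unchanged copy.
    """
    converted = deepcopy(vrs_obj)
    if converted.get("type") != "Haplotype":
        return converted
    converted["type"] = "CisPhasedBlock"
    # The ga4gh:HT. identifier no longer describes the renamed container. Drop it
    # so the caller recomputes a ga4gh:CPB. identifier with the VRS library.
    converted.pop("id", None)
    converted.pop("digest", None)
    return converted
```

Dropping the stale identifier instead of rewriting its prefix is deliberate: the digest body itself changes with the container type, so it has to be recomputed rather than relabeled.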
Assumptions
- dcd_mapping still needs to produce VRS 1.3 output for existing consumers; if that requirement has been dropped, the VRS 1.3 back-compat work can be removed instead of reworked.
- Target repository is ave-dcd/dcd_mapping. Suggested labels: dependencies, vrs, tech-debt.