Migrate dcd_mapping from VRS Haplotype to CisPhasedBlock (requires dependency-stack modernization) #104

@bencap

Summary

mavedb-api runs on ga4gh.vrs 2.3.x, whose multi-variant model is CisPhasedBlock (digest prefix ga4gh:CPB.). dcd_mapping is pinned to ga4gh.vrs==2.0.0a6, whose equivalent model is Haplotype (digest prefix ga4gh:HT.). As a result, the API's alleles table now holds two vocabularies for the same concept: assay-level multi-variant alleles produced by dcd_mapping are stored as type "Haplotype", while the reverse-translation cis-phased candidates added in the VRS-correctness work are stored as type "CisPhasedBlock".

This issue tracks making dcd_mapping emit CisPhasedBlock so the multi-variant vocabulary is consistent end to end. It is more of a dependency-modernization effort than a focused rename.

Problem

  • A single biological multi-variant variation is represented two different ways depending on its source: Haplotype (ga4gh:HT.) from dcd_mapping vs CisPhasedBlock (ga4gh:CPB.) from the API reverse-translation job.
  • Because the container type participates in the GA4GH digest, the same set of member alleles digests differently under Haplotype vs CisPhasedBlock, so identical multi-variant variations from the two sources will not deduplicate against each other in the alleles table.
  • The inconsistency is a clarity/maintainability liability (consumers, queries, and debugging must understand both vocabularies), even though it is not currently a functional defect.
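
To illustrate the deduplication point, here is a minimal sketch of why the container type changes the digest. It uses the GA4GH `sha512t24u` truncated digest (first 24 bytes of SHA-512, base64url-encoded) over a simplified JSON payload. The payload shape, `toy_digest`, and the second member digest are hypothetical stand-ins, not the actual ga4gh.vrs serialization, so the resulting strings will not match real VRS digests.

```python
import base64
import hashlib
import json


def sha512t24u(blob: bytes) -> str:
    """GA4GH truncated digest: first 24 bytes of SHA-512, base64url-encoded."""
    return base64.urlsafe_b64encode(hashlib.sha512(blob).digest()[:24]).decode("ascii")


def toy_digest(container_type: str, member_digests: list[str]) -> str:
    """Digest a simplified container (assumed shape, NOT the real VRS serialization)."""
    payload = {"type": container_type, "members": sorted(member_digests)}
    blob = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return sha512t24u(blob)


# The first member is the single-Allele digest quoted in this issue;
# the second is a made-up placeholder for illustration only.
members = ["5IGw5Cw0n9PmJP3E-Mv5OaFUSFN9wTmx", "hypothetical-second-member-digest"]

ht = "ga4gh:HT." + toy_digest("Haplotype", members)
cpb = "ga4gh:CPB." + toy_digest("CisPhasedBlock", members)

# Same members, different container type: the digest bodies differ,
# so the two identifiers can never collide in the alleles table.
print(ht.split(".", 1)[1] != cpb.split(".", 1)[1])
```

Because the type string is part of the hashed payload, no amount of member-level normalization makes an HT identifier equal a CPB identifier; one side has to be re-digested.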

Why it is safe to defer (current state is not broken)

  • dcd_mapping emits "Haplotype", and the API already accepts both "Haplotype" and "CisPhasedBlock": in the API's annotation utilities, vrs_object_from_mapped_variant builds a CisPhasedBlock from either type string. Serving and annotation work regardless of which vocabulary is stored.
  • Single-Allele VRS digests are byte-identical across ga4gh.vrs 2.0.0a6 and 2.3.1 (verified: ga4gh:VA.5IGw5Cw0n9PmJP3E-Mv5OaFUSFN9wTmx). Single-variant score sets therefore need no re-mapping; only multi-variant score sets change digest (HT → CPB).

Spike findings (already investigated)

  • CisPhasedBlock does not exist in ga4gh.vrs 2.0.0a6; it is a later-alpha rename, so adopting it requires bumping to ~2.3.x.
  • A minimal bump (only ga4gh.vrs) is not viable: ga4gh.vrs 2.3.1 pulls a newer ga4gh.core that removed ga4gh.core.models.Gene, which dcd_mapping's pinned gene_normalizer==0.3.0.dev2 imports at module load. This raises an AttributeError/ImportError at import time.
  • The bump therefore cascades: gene_normalizer must move from 0.3.0.dev2 to a current release (up to 0.11.5), and cool-seq-tool almost certainly from 0.4.0.dev3 to a current release (up to 0.17.0). These are multi-year version jumps with their own breaking API changes that ripple through dcd_mapping's transcript lookup, transcript selection, and alignment modules.
  • ga4gh.vrs._internal.models is removed in 2.3.1; the public ga4gh.vrs.models path must be used instead.
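
Since the private module path goes away in 2.3.1, a quick way to size the import migration is to list every line that still references it. The sketch below is a plain text scan using only the standard library; `scan_text` and `find_in_tree` are hypothetical helper names, and the scan assumes the path appears literally in the source (it will not catch dynamically built import strings).

```python
import re
from pathlib import Path

# Literal reference to the module path that is removed in ga4gh.vrs 2.3.1.
PRIVATE_PATH = re.compile(r"\bga4gh\.vrs\._internal\.models\b")


def scan_text(text: str) -> list[int]:
    """Return 1-based line numbers that mention the private module path."""
    return [
        lineno
        for lineno, line in enumerate(text.splitlines(), start=1)
        if PRIVATE_PATH.search(line)
    ]


def find_in_tree(root: Path) -> dict[str, list[int]]:
    """Map each .py file under root to the line numbers that need migrating."""
    hits = {}
    for path in sorted(root.rglob("*.py")):
        lines = scan_text(path.read_text(encoding="utf-8"))
        if lines:
            hits[str(path)] = lines
    return hits
```

Running `find_in_tree(Path("."))` from the repository root should return an empty dict once the migration to `ga4gh.vrs.models` is complete.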

Proposed behavior

dcd_mapping produces CisPhasedBlock (digest prefix ga4gh:CPB.) for all multi-variant variations in both pre-mapped and post-mapped outputs, on a modernized ga4gh.vrs / gene_normalizer / cool-seq-tool stack, with the existing mapping behavior otherwise unchanged for single-variant cases.

Acceptance criteria

  • dcd_mapping depends on a ga4gh.vrs version that provides CisPhasedBlock (~2.3.x), with gene_normalizer and cool-seq-tool bumped to mutually compatible current releases.
  • pip check (or the project's equivalent dependency-consistency check) reports no conflicts in the dcd_mapping environment.
  • All ga4gh.vrs._internal.models imports are replaced with ga4gh.vrs.models across the codebase and tests.
  • No references to Haplotype remain in dcd_mapping production code or tests for VRS 2.x output; multi-variant results are emitted as CisPhasedBlock.
  • Multi-variant mapping output serializes with "type": "CisPhasedBlock" and a ga4gh:CPB. digest; single-variant output is unchanged and produces the same ga4gh:VA. digests as before the bump.
  • The VRS 1.3 back-compat output path produces equivalent VRS 1.3 representations derived from CisPhasedBlock instead of Haplotype.
  • The full dcd_mapping test suite passes on the upgraded stack.
  • Multi-variant score sets are re-mapped (or a re-map plan is documented) before any read cutover that depends on consistent digests.
  • mavedb-api ingestion continues to accept dcd_mapping output (already supports CisPhasedBlock); confirmed by an end-to-end mapping → ingestion check for a multi-variant score set.
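
The serialization criterion above can be expressed as a small check for use in tests. This is a sketch under assumptions: it treats the mapped object as a plain dict and assumes the prefixed identifier lives in an `id` field; `check_multi_variant_output` is a hypothetical helper, not an existing dcd_mapping function.

```python
def check_multi_variant_output(obj: dict) -> list[str]:
    """Return a list of problems; an empty list means the object passes."""
    problems = []
    if obj.get("type") != "CisPhasedBlock":
        problems.append(f"unexpected type: {obj.get('type')!r}")
    # Assumption: the GA4GH identifier is stored under "id".
    identifier = str(obj.get("id", ""))
    if not identifier.startswith("ga4gh:CPB."):
        problems.append(f"unexpected identifier prefix: {identifier!r}")
    return problems
```

Returning a list of problems rather than raising on the first failure makes it easier to report both the type and the digest prefix when a legacy Haplotype slips through.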

Implementation notes

  • Dependency cascade (resolve as a set, not one-by-one):
    • ga4gh.vrs: 2.0.0a6 → ~2.3.x
    • gene_normalizer: 0.3.0.dev2 → current (up to 0.11.5); root cause of the cascade is the removed ga4gh.core.models.Gene.
    • cool-seq-tool: 0.4.0.dev3 → current (up to 0.17.0); expect breaking API changes affecting transcript lookup, transcript selection, and alignment.
  • Import migration: replace ga4gh.vrs._internal.models with ga4gh.vrs.models in the VRS mapping module, schemas, annotation module, lookup module, VRS utilities, and the VRS map tests.
  • Rename Haplotype → CisPhasedBlock:
    • VRS mapping module: the multi-member construction in _construct_vrs_allele (returns a block when more than one member) and its return type annotation.
    • Schemas: the Haplotype import and the pre_mapped / post_mapped union types, plus the VRS 1.3 union members.
    • Annotation module: the Haplotype import, the member-extraction helper for ref-position computation, the post-mapped type checks, the consistency check that rejects mixed Allele/Haplotype structures, and the typed accessors for pre/post-mapped.
    • Tests: VRS map tests that assert on Haplotype.
  • VRS 1.3 back-compat: rework the helper that converts the VRS 2.x multi-variant object to a VRS 1.3 Haplotype (VRSATILE variation descriptors) so it consumes a CisPhasedBlock. The VRS 1.3 output schema name stays Haplotype; only the input type changes.
  • Cheaper alternative if the full cascade is not justified: convert Haplotype → CisPhasedBlock at the mavedb-api ingestion boundary (worker mapping job) so the alleles table is uniformly CisPhasedBlock without touching dcd_mapping's dependencies. The conversion logic already exists in the API annotation utilities. Tradeoff: this recomputes the stored digest for multi-variant assay-level alleles (CPB instead of HT) and leaves dcd_mapping's native output and VRS 1.3 layer still emitting Haplotype, so the source and the API's stored form still diverge, although the API's data model becomes internally consistent. Worth deciding between this and the full migration before starting.
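
For the cheaper alternative, the boundary conversion could look roughly like the sketch below. It operates on plain dicts and is not the existing API annotation code; `haplotype_to_cis_phased_block` is a hypothetical name, and the `id` / `digest` / `members` field names are assumptions about the serialized shape. The stale HT identifier is dropped rather than rewritten, because the CPB digest has to be recomputed by ga4gh.vrs.

```python
import copy


def haplotype_to_cis_phased_block(obj: dict) -> dict:
    """Return a copy of obj retyped as CisPhasedBlock; other types pass through."""
    if obj.get("type") != "Haplotype":
        return copy.deepcopy(obj)
    converted = copy.deepcopy(obj)
    converted["type"] = "CisPhasedBlock"
    # The HT identifier no longer describes this object; remove it so the
    # caller must recompute a CPB digest instead of storing a stale one.
    converted.pop("id", None)
    converted.pop("digest", None)
    return converted


legacy = {
    "type": "Haplotype",
    "id": "ga4gh:HT.hypothetical",
    "members": [{"type": "Allele", "id": "ga4gh:VA.5IGw5Cw0n9PmJP3E-Mv5OaFUSFN9wTmx"}],
}
block = haplotype_to_cis_phased_block(legacy)
```

Member Allele identifiers are left untouched, which is consistent with the finding that single-Allele digests are identical across the two ga4gh.vrs versions.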

Assumptions

  • dcd_mapping still needs to produce VRS 1.3 output for existing consumers; if that requirement has been dropped, the VRS 1.3 back-compat work can be removed instead of reworked.
  • Target repository is ave-dcd/dcd_mapping. Suggested labels: dependencies, vrs, tech-debt.

Metadata

Labels

  • app: mapper (Task implementation touches the mapper)
  • type: maintenance (Maintaining this project)
