feat: Cross-level translation for multivariant assay inputs (haplotypes + multi-AA delins) #103

@bencap

Context

Cross-level translation shipped with single-position input only (team decision,
2026-06-02): only single VRS Alleles representing a single-position change are
translated to the levels the assay did not map. Every multivariant input is
persisted at its authoritative assay level with no cross-level fill, and the API
records a cross_level_translation annotation status of skipped for it. This
issue tracks lifting that limitation.

Part of epic VariantEffect/mavedb-api#746.

When single-variant support shipped, the machinery that previously attempted some of
these cases — translate_from_protein_range, _adjacent_protein_haplotype_range,
_AA_TO_CODONS, MAX_RANGE_CODON_POSITIONS, and the non-single-allele branches in
mapping_records._build_record — was removed. This is therefore a redesign, not a
guard relaxation: _build_record currently routes every multivariant input (and
multi-AA single-Allele delins) to assay-level-only persistence.

Scope note: this is about multivariant input. A single-position change whose codon
edits land on non-adjacent bases already correctly emits a genomic g.[a;b] Haplotype
output — that ships and is not in scope here.

In-scope cases

Case 1: Adjacent protein haplotype / multi-AA delins

p.[Ala2Val;Pro3Gly] and p.Ala2_Pro3delinsVG are semantically identical — adjacent
amino acid positions occupy adjacent coding positions with no nucleotide gap. These
normalize to a single contiguous coding allele (c.4_9delinsXXXXXX), not a Haplotype,
and need reverse_translate_hgvs_p extended (or a dedicated range path) to accept range
inputs. (Previously scoped out of this issue; folded back in now that single-variant-only
support has shipped.)
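As a minimal sketch of the position arithmetic behind Case 1, assuming 1-based residue numbering and a CDS that starts at c.1, a run of adjacent residues maps to one contiguous coding range. The helper name coding_range_for_protein_range is hypothetical and is not part of translate.py.

```python
def coding_range_for_protein_range(first_aa: int, last_aa: int) -> tuple[int, int]:
    """Return the 1-based inclusive CDS range covered by residues first_aa..last_aa.

    Hypothetical helper for illustration: residue n occupies c.(3n-2) through c.(3n).
    """
    if first_aa < 1 or last_aa < first_aa:
        raise ValueError("expected 1 <= first_aa <= last_aa")
    return 3 * first_aa - 2, 3 * last_aa


# p.Ala2_Pro3delinsVG spans residues 2..3, so the coding allele spans c.4_9.
start, end = coding_range_for_protein_range(2, 3)
print(f"c.{start}_{end}")  # c.4_9
```

This covers only the range computation; choosing the replacement bases is the job of the reverse-translation path described above.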

Case 2: Non-adjacent protein haplotype

p.[Ala2Val;Gly4Asp] — positions 2 and 4 with position 3 unchanged. Coding positions
4–6 and 10–12 have a nucleotide gap at 7–9. Each member is reverse-translated
independently and the results are combined into a coding Haplotype. Unlike Case 1, the
output is a Haplotype of coding Alleles, not a flat Allele.
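One way to tell Case 1 from Case 2 is to group the changed residue positions into runs of adjacent positions: a single run implies a flat coding Allele, and more than one run implies a Haplotype. The sketch below assumes 1-based residue positions, and both function names are hypothetical. How a mixed input (for example positions 2, 3 and 5) should be assembled is not specified by this issue.

```python
def adjacent_runs(positions: list[int]) -> list[tuple[int, int]]:
    """Group residue positions into (first, last) runs of adjacent positions."""
    runs: list[tuple[int, int]] = []
    for pos in sorted(set(positions)):
        if runs and pos == runs[-1][1] + 1:
            # Extends the current run: no nucleotide gap between the two codons.
            runs[-1] = (runs[-1][0], pos)
        else:
            runs.append((pos, pos))
    return runs


def coding_span(run: tuple[int, int]) -> tuple[int, int]:
    """1-based inclusive CDS span for a run of residues (residue n is c.3n-2..c.3n)."""
    first, last = run
    return 3 * first - 2, 3 * last


# p.[Ala2Val;Pro3Gly]: one run, so a single contiguous coding allele.
print([coding_span(r) for r in adjacent_runs([2, 3])])  # [(4, 9)]
# p.[Ala2Val;Gly4Asp]: two runs with c.7_9 untouched, so a coding Haplotype.
print([coding_span(r) for r in adjacent_runs([2, 4])])  # [(4, 6), (10, 12)]
```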

Case 3: Multi-member nucleotide haplotype

A genomic or coding haplotype assay variant (e.g. g.[a;b;c] or c.[a;b;c]).
Cross-level translation requires per-member c→g or g→c (deterministic,
member-independent) followed by c→p. The c→p step is only correct when no two members
share a codon: if two variants land in the same codon, independent
AssemblyMapper.c_to_p calls give wrong results, and the combined coding sequence must be
translated as a unit instead. Shared codons are detected by comparing the members' codon
indices, (position-1)//3 for 1-based coding positions.
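The shared-codon check can be sketched as follows. It assumes each member has already been reduced to a 1-based inclusive (start, end) span of positive CDS positions; UTR and intronic offsets are out of scope for this sketch, and the function names are hypothetical.

```python
def codons_touched(start: int, end: int) -> set[int]:
    """0-based codon indices overlapped by the CDS span start..end (1-based, inclusive)."""
    return set(range((start - 1) // 3, (end - 1) // 3 + 1))


def members_share_codon(members: list[tuple[int, int]]) -> bool:
    """True if any codon is touched by more than one haplotype member."""
    seen: set[int] = set()
    for start, end in members:
        touched = codons_touched(start, end)
        if seen & touched:
            return True
        seen |= touched
    return False


print(members_share_codon([(4, 4), (6, 6)]))  # True: both fall in codon index 1
print(members_share_codon([(4, 4), (7, 7)]))  # False: codon indices 1 and 2
```

When the check returns True, per-member c→p results cannot be trusted, which is the condition the acceptance criteria use to omit the protein level.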

Open design question — bounding the fan-out

The previous implementation used an arbitrary 3-member cap. The team's direction is to
design the bound deliberately rather than ship an unexplained constant — a cap may still
be the answer, but as a conclusion, not a default. Inputs to the decision:

  • Protein haplotypes: the Cartesian product of per-position candidates grows
    multiplicatively. Per-position candidate counts are typically 1–5 (valid edits to the
    reference codon, not all codons encoding the target amino acid).
  • The database has ~3.2M variants matching p.[%] and score sets with up to 7-member
    protein haplotypes — so any bound must be justified against the realistic distribution,
    not just the common case.
  • Options to weigh: a member cap, a candidate-count/product budget, or per-case bounds.

Whatever bound is chosen, skipped inputs must be logged (INFO) and persisted at the native
assay level, and the API's cross_level_translation status should reflect the skip.
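To make the candidate-count/product option concrete, here is a rough sketch of a product budget with the INFO-level skip described above. The function name, the max_products parameter and the candidate strings are all placeholders, and the budget value is not a proposal.

```python
import logging
from itertools import product
from math import prod

logger = logging.getLogger(__name__)


def combos_within_budget(candidates_per_position, max_products):
    """Return every candidate combination, or None if the product exceeds the budget.

    candidates_per_position: one list of candidate coding edits per haplotype member.
    max_products: hypothetical budget on the size of the Cartesian product.
    """
    total = prod(len(c) for c in candidates_per_position)
    if total > max_products:
        # The caller would persist at the native assay level and mark the input skipped.
        logger.info(
            "cross-level translation skipped: %d candidate products exceed budget %d",
            total,
            max_products,
        )
        return None
    return list(product(*candidates_per_position))


# Two members with 2 and 3 candidates give 6 combinations.
print(len(combos_within_budget([["a1", "a2"], ["b1", "b2", "b3"]], max_products=64)))  # 6
# Seven members with 5 candidates each would give 5**7 = 78125 combinations.
print(combos_within_budget([["x"] * 5] * 7, max_products=64))  # None
```

A product budget bounds the work directly, while a member cap bounds it only indirectly through the per-position candidate counts; which of the two fits the realistic distribution is the open question.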

Implementation

Translator (translate.py)

  • Reintroduce a range path for Case 1 (single contiguous coding allele from a multi-AA /
    adjacent-haplotype change).
  • translate_from_protein_haplotype(members, transcript) — per-member reverse
    translation, product of candidates, Haplotype assembly per product (Case 2).
  • translate_from_nucleotide_haplotype(members, transcript) — per-member c→g / g→c lift;
    c→p skipped if any two members share a codon (Case 3).
  • Route through translate_other_levels based on whether the input is a single Allele, a
    multi-AA Allele, or a Haplotype.
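The three-way routing in the last bullet could look roughly like the sketch below. AlleleStub and HaplotypeStub are stand-in dataclasses invented for illustration, not the VRS models the codebase uses, and classify_input is a hypothetical name; the real dispatch would inspect the actual Allele and Haplotype objects.

```python
from dataclasses import dataclass
from enum import Enum


class InputKind(Enum):
    SINGLE_ALLELE = "single_allele"
    MULTI_AA_ALLELE = "multi_aa_allele"
    HAPLOTYPE = "haplotype"


@dataclass(frozen=True)
class AlleleStub:
    """Stand-in for a protein-level allele: 1-based inclusive residue span."""
    start: int
    end: int


@dataclass(frozen=True)
class HaplotypeStub:
    """Stand-in for a haplotype: an ordered tuple of member alleles."""
    members: tuple[AlleleStub, ...]


def classify_input(variant) -> InputKind:
    """Decide which translation path a protein-level input would take."""
    if isinstance(variant, HaplotypeStub):
        return InputKind.HAPLOTYPE
    if variant.end > variant.start:
        return InputKind.MULTI_AA_ALLELE
    return InputKind.SINGLE_ALLELE


print(classify_input(AlleleStub(2, 2)))  # InputKind.SINGLE_ALLELE
print(classify_input(AlleleStub(2, 3)))  # InputKind.MULTI_AA_ALLELE
print(classify_input(HaplotypeStub((AlleleStub(2, 2), AlleleStub(4, 4)))))  # InputKind.HAPLOTYPE
```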

_build_record (mapping_records.py)

  • Replace the single-position / assay_is_single_allele routing with dispatch to the new
    paths, subject to the chosen bound.
  • Set translation_attempted=True on the draft for inputs that are now translated (drives
    the API cross_level_translation status).

Acceptance criteria

  • Multi-AA delins / adjacent protein haplotypes (≤ bound) produce a single contiguous
    coding allele + genomic projection.
  • Non-adjacent protein haplotypes (≤ bound) produce coding and genomic Haplotype drafts.
  • Multi-member nucleotide haplotypes (≤ bound) produce cross-level drafts; c→p is skipped
    (protein omitted) when any two members share a codon.
  • Inputs above the bound are persisted at the native assay level only, with an INFO log and
    a skipped cross_level_translation status.
  • Integration tests cover: multi-AA delins, 2-member non-adjacent protein haplotype,
    2-member nucleotide haplotype (non-sharing codons), 2-member nucleotide haplotype
    (sharing a codon → protein skipped), and an over-bound input (native level only).
