Fix variant phase assignment with per-location approach#144
Fix variant phase assignment with per-location approach#144kathsherratt merged 4 commits intomainfrom
Conversation
… guards
Replace the global median-date hack for variant phase classification
with the correct per-location approach, plus three data quality fixes:
1. Aggregate variant percentages across data sources (ECDC TESSy/GISAID)
before finding dominant variant, avoiding spurious dominance from
small-sample sources (fixes PT anomalous Omicron in March 2021)
2. Require >50% variant share before marking a phase transition,
filtering out noise from sparse surveillance weeks
3. Enforce chronological ordering of variant phases (Alpha → Delta →
Omicron-BA.1 → ...) to prevent out-of-sequence assignments
Also:
- Fix Sys.Date() → as.Date("2023-03-17") for reproducibility
- Fix many-to-many join warning in Swiss hospital variant data
- Add library(ggplot2) to analysis-model.R for diagnostic plots
- Include PR 1 fixes in results.qmd (format: html, epi_target rename)
- Regenerate results.rds and diagnostic PDFs with updated variant phases
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update manuscript methods to describe per-location variant phase assignment with data sources and 50% threshold. Add detailed methods description to Supplement after the variant heatmap. Re-render Supplement PDF to reflect per-location variant phases. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two bugs caused incorrect variant phase assignments: 1. CH deduplication (get_variants_ch): The dedup filter discarded 26 weeks of hospital data (BA.2+) for weeks where wastewater data also existed. Fixed by keeping both sources with distinct labels and letting the aggregation step handle overlap. 2. mean() vs sum() (set_variant_phases): Sub-variants mapped to the same phase (e.g., BA.4 + BA.5 → Omicron-BA.4/5) were averaged instead of summed, so BA.4/5 never exceeded the 50% threshold for GB. Fixed with two-step aggregation: sum within each source, then average across sources. Result: All 32 locations now show correct variant phase sequences including BA.4/5. CH goes from 2 phases to 6; GB gains BA.4/5. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
3fab90c to
3be01ed
Compare
Hungary's sparse ECDC surveillance data caused Delta to backfill to the start of the timeseries, missing the Alpha phase entirely. Added manual override setting Alpha from study start and Delta from the week following 23 July 2021, based on epidemiological reporting. Source: https://abouthungary.hu/news-in-brief/delta-and-gamma-variant-identified-in-hungary Re-ran GAMM and re-rendered Supplement with updated heatmap and methods text. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Prompts used to generate this PRThis PR was generated and iteratively refined using Claude Code (claude-opus-4-6). Below are the user prompts that drove the work, in chronological order. Session 1 — Planning
(During planning, user was asked about variant fix approach and chose "Switch to per-location (Recommended)") Session 1 — PR 2 implementationThe initial per-location variant phase fix was implemented as planned: replacing the median-date hack with per-location dominant variant phases based on when each variant first exceeded 50% of sequenced samples. Commit: Session 1 — Follow-up: methods text and Supplement
(Approved the manuscript text change) Session 1 — Debugging CH/GB variant data
(When asked whether to add to PR #144 or create a new PR, user chose "Add to PR #144". Claude investigated and identified two bugs: CH deduplication discarding hospital data for BA.2+ variants, and mean() instead of sum() preventing GB from detecting BA.4/5 dominance. User approved the fix plan.) Session 2 — Rebase after PR #143 merge
(Approved force-push after clean rebase) Session 2 — Hungary variant data fix
|
|
Noting that this was initially prompted by following Plan.md. And code manually reviewed at each stage as per prompt history above. |
Summary
Sys.Date()to study end date2023-03-17for reproducibilityoutput/results.rds) and diagnostic plotsKey results change
With per-location variant phases providing better covariate adjustment, Method and CountryTargets effects remain near-zero — consistent with, and slightly strengthening, the original finding of no systematic performance difference between model structures.
Known limitation
Hungary (HU) starts with Delta phase instead of Alpha due to sparse early variant surveillance data (no weeks with Alpha >50% in ECDC). This is acceptable given the data resolution.
Test plan
classify_variant_phases()runs without errorsquarto render report/results.qmdproduces complete HTML🤖 Generated with Claude Code