From 5d4b027a1d508d3bba2dfd85f0a81078d9ebee80 Mon Sep 17 00:00:00 2001
From: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Date: Wed, 27 May 2026 11:37:49 +0200
Subject: [PATCH] Add UK pipeline alignment planning doc (#1621)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

#1621 is an umbrella tracker. Persist it in-repo as a docs page so the
queue of architectural fixes has a stable reference.

Auditing the current state:

- The three small bugs flagged in the issue body (tax_credits min_benefit
  unit, benunit_weekly_hours label, state_pension_type heuristic) have
  all been fixed since the issue was filed. Note them as done.
- The state pension residual is documented separately in #1632 (now
  covered by docs/book/programs/gov/dwp/state-pension.md), so this page
  links there rather than restating.

The remaining architectural gaps captured:

- runtime _reported consultation in would_claim_housing_benefit,
  would_claim_IS, would_claim_WTC, would_claim_CTC — needs conversion
  to input-only matching the UC/PC pattern, with stochastic assignment
  in policyengine-uk-data;
- take-up assignment in uk-data ignores reported data — needs port of
  the US assign_takeup_with_reported_anchors helper;
- second-stage QRF gap on SPI donors (analogous to
  policyengine-us-data#589) so consequential FRS-only variables
  (rent, gift_aid, pension contributions, *_reported benefits) are
  consistent with imputed income;
- residual aggregate gaps against OBR (income_support, esa_contrib,
  attendance_allowance, tax_free_childcare, state_pension).

Ends with a candidate PR sequence so the receiving repos (this one +
policyengine-uk-data) have a low-risk-first ordering.
---
 changelog.d/1621.md                           |   1 +
 .../assumptions/uk-pipeline-alignment-plan.md | 146 ++++++++++++++++++
 2 files changed, 147 insertions(+)
 create mode 100644 changelog.d/1621.md
 create mode 100644 docs/book/assumptions/uk-pipeline-alignment-plan.md

diff --git a/changelog.d/1621.md b/changelog.d/1621.md
new file mode 100644
index 000000000..f495e4ee5
--- /dev/null
+++ b/changelog.d/1621.md
@@ -0,0 +1 @@
+- Add a planning docs page at `docs/book/assumptions/uk-pipeline-alignment-plan.md` capturing the architectural gaps between the UK model/data pipeline and the US `policyengine-us` / `policyengine-us-data` patterns — remaining `_reported`-at-runtime `would_claim_*` formulas, take-up anchoring, second-stage QRF for SPI-donor FRS-only variables — and the residual benefit-aggregate gaps against OBR after the recent BASIC/NEW state-pension fixes.
diff --git a/docs/book/assumptions/uk-pipeline-alignment-plan.md b/docs/book/assumptions/uk-pipeline-alignment-plan.md
new file mode 100644
index 000000000..6faf35788
--- /dev/null
+++ b/docs/book/assumptions/uk-pipeline-alignment-plan.md
@@ -0,0 +1,146 @@
+# UK model/data pipeline alignment plan
+
+```{note}
+**Planning page** capturing the gaps between the UK model + enhanced FRS
+pipeline and the US `policyengine-us` / `policyengine-us-data` patterns,
+plus a sequence of scoped PRs that would close them. Tracked under
+[#1621](https://github.com/PolicyEngine/policyengine-uk/issues/1621).
+Most items here are **cross-repo** — they touch
+`policyengine-uk-data` rather than this repo — but the model-side
+contract is documented here so the receiving side has a stable target.
+```
+
+## Architectural gaps vs US
+
+### 1. `_reported` consulted at runtime in some `would_claim_*` formulas
+
+**US pattern**: `_reported` columns are consulted **once, in the
+data-build**, to anchor stochastic take-up flags (reported recipient →
+takeup = True with certainty; rest filled probabilistically to a target
+rate). Runtime formulas only read the pre-computed
+`takes_up_X_if_eligible` flag, so reforms that change eligibility flow
+through cleanly to take-up.
+
+**UK pattern (partial)**: some UK `would_claim_*` variables already
+match — `would_claim_uc` and `would_claim_pc` are input-only with
+`default_value = True`, populated stochastically in
+`policyengine-uk-data/datasets/frs.py`. But several still derive from
+`_reported` at runtime:
+
+- `would_claim_housing_benefit`
+- `would_claim_IS`
+- `would_claim_WTC`
+- `would_claim_CTC`
+
+The runtime pattern is `claims_all_entitled_benefits | (foo_reported > 0)`,
+which means a reform expanding eligibility only reaches existing
+`_reported > 0` claimants — non-claimants who newly qualify are
+invisible unless the user manually sets
+`claims_all_entitled_benefits = True` everywhere.
+
+**Fix**: convert each to input-only `default_value = True` (matching
+`would_claim_uc`), populate stochastically in
+`uk-data/datasets/frs.py`. Note: WTC and CTC ceased to pay new awards on
+2025-04-06 (see [tax-credits.md](../programs/gov/dwp/tax-credits.md)),
+so their fix is primarily for back-cast simulations.
+
+### 2. Take-up assignment ignores reported data
+
+Today's UK take-up assignment is a pure random draw:
+
+```python
+pe_benunit["would_claim_uc"] = generator.random(len(pe_benunit)) < universal_credit_rate
+```
+
+This discards information: respondents who reported receiving UC
+clearly took it up, and should have `would_claim_uc = True` with
+certainty.
+
+**Fix**: port `policyengine_us_data/utils/takeup.py::assign_takeup_with_reported_anchors`
+into `policyengine-uk-data` and apply it to each take-up variable where
+a `_reported` column exists. This tightens per-variable calibration
+without changing the target rates.
+
+### 3. Second-stage imputation gap on SPI donors
+
+`policyengine_uk_data/datasets/imputations/income.py::impute_income`
+trains a QRF that replaces only six income variables on SPI donor
+rows; everything else (rent, wealth, pension contributions, gift aid,
+benefits-reported flags) stays as whatever middle-income FRS donor
+happened to be sampled. So a synthetic SPI donor ending up with £2M
+imputed self-employment income still carries an unchanged middle-FRS-
+donor's savings balance and zero gift aid.
+
+The US equivalent ([`policyengine-us-data#589`](https://github.com/PolicyEngine/policyengine-us-data/pull/589))
+fixed this by adding a second-stage QRF: train on CPS with predictors =
+demographics + newly-imputed income vars, outputs = the ~60 CPS-only
+variables; for PUF-clone prediction, substitute imputed PUF incomes as
+predictors so CPS-only variables come out consistent with imputed
+income.
+
+**Fix**: add the analogous second-stage QRF in
+`policyengine_uk_data/datasets/imputations/`, training on FRS with
+predictors = demographics + 6 income vars, outputs = a wide set of
+"FRS-only" consequential variables including rent,
+`mortgage_interest_repayment`, `gift_aid`, `covenanted_payments`,
+`charitable_investment_gifts`, `other_deductions`,
+`employee_pension_contributions*`, `employer_pension_contributions`,
+total wealth components, `capital_allowances`, `deficiency_relief`,
+and the full list of `*_reported` benefits.
+
+## Specific aggregate residuals
+
+After the formula-side fixes already merged (BASIC/NEW classification
+in [PR #1618](https://github.com/PolicyEngine/policyengine-uk/pull/1618),
+new State Pension pro-rating + Protected Payment in
+[PR #1634](https://github.com/PolicyEngine/policyengine-uk/pull/1634)),
+the residual benefit-aggregate gaps against OBR are:
+
+| Variable | Model | Target | Gap | Likely cause |
+|----------|------:|------:|----:|--------------|
+| `income_support` | ~£1.07 bn | ~£0.2 bn | **+£0.9 bn** | Legacy benefit near-fully migrated; same `reported > 0` retention issue as the (pre-fix) tax credits. |
+| `esa_contrib` | ~£5.56 bn | ~£2.5 bn | **+£3.1 bn** | Large. Likely reported-based take-up plus no phase-out for migrated cases. |
+| `attendance_allowance` | ~£8.76 bn | ~£7.0 bn | +£1.8 bn | Cap-rate or take-up issue. |
+| `tax_free_childcare` | ~£0.41 bn | ~£0.9 bn | **-£0.5 bn** | Understated by about half — under-imputed eligible families. |
+| `state_pension` | ~£127.5 bn | ~£140 bn | -£12 bn | Documented in [state-pension.md](../programs/gov/dwp/state-pension.md); data-side ASP under-imputation. |
+
+Each warrants a scoped PR pair (model + data) once the architectural
+fixes above land.
+
+## Items already fixed since #1621 was filed
+
+The issue body flagged three small bugs; all are now resolved:
+
+- `tax_credits/min_benefit.yaml` unit was `currency-USD` — now `currency-GBP`.
+- `benunit_weekly_hours` label said "Average" but the formula sums; label now reads "Total weekly hours worked by adults in the benefit unit".
+- `state_pension_type` BASIC/NEW classification (the `is_SP_age` heuristic in the issue) was rewritten in PR #1618.
+
+This page reflects the *current* gap rather than the issue body, so the
+next reviewer working through the queue doesn't waste time on
+already-fixed items.
+
+## Candidate PR sequence
+
+1. **Convert remaining `would_claim_*` formulas to input-only** (model
+   side) + stochastic assignment in `policyengine-uk-data`
+   (`datasets/frs.py`). Low risk; matches the existing UC/PC pattern.
+2. **Port `assign_takeup_with_reported_anchors`** into
+   `policyengine-uk-data`. Pure data-pipeline change; tightens
+   calibration.
+3. **Add second-stage QRF for FRS-only variables on SPI donors** in
+   `policyengine-uk-data`. Biggest single ticket but directly addresses
+   the "high-income donor has zero gift aid / zero rent" failure.
+4. **Add `gift_aid` to `IMPUTATIONS`** (one-line addition in
+   `policyengine_uk_data/datasets/imputations/income.py`). Independent
+   of (3); can land immediately.
+5. **Residual benefit-aggregate follow-ups** — separate small PRs for
+   IS phase-out (analogous to the WTC/CTC reactive scheme close-out),
+   ESA contrib investigation, AA calibration, TFC under-imputation,
+   ASP data-side fix (the residual #1632 leg).
+
+## References
+
+- Tracking issue: [#1621](https://github.com/PolicyEngine/policyengine-uk/issues/1621).
+- US pattern reference: [`policyengine-us-data#589`](https://github.com/PolicyEngine/policyengine-us-data/pull/589).
+- Related state pension docs: [state-pension.md](../programs/gov/dwp/state-pension.md), [#1632](https://github.com/PolicyEngine/policyengine-uk/issues/1632).
+- Related calibration scope: [non-uk-benefit-receipt-plan.md](./non-uk-benefit-receipt-plan.md) (#842) — separate from this pipeline alignment but interacts with the calibration target side.