From 1f9dcfe8eefefc19ca40314b3ff8f49417a55249 Mon Sep 17 00:00:00 2001
From: Vahid Ahmadi <va.vahidahmadi@gmail.com>
Date: Wed, 27 May 2026 12:17:49 +0200
Subject: [PATCH] Add capital gains distribution plan (#818 #817)

#818 asks for a multivariate capital gains imputation instead of the
current per-income-decile-independent fits. #817 reports the related
failure mode of imputing gains to zero-wealth households.

The imputation lives in policyengine-uk-data, not in this repo, but
this commit records the model-facing plan here so that the cross-repo
agreement is discoverable.

The page covers:

- the current state (per-decile marginal distributions fit to HMRC
  CGT statistics binned by income),
- three concrete limitations (overfitting boundaries, no wealth
  conditioning per #817, no within-decile correlation),
- two candidate approaches: A. multivariate Gaussian KDE following
  OG-USA's bequest-transmission model; B. two-stage QRF reusing the
  existing imputation infrastructure already proposed in the #1621
  pipeline-alignment plan,
- recommendation: B (QRF) because the infrastructure exists, it
  handles correlated predictors naturally, and it solves both #818
  and #817 in one move by adding total_wealth as a predictor,
- the model-side surface (negligible, since the input variable stays
  the same; one new regression test for "no gains on zero-wealth
  households"),
- open questions on the wealth donor source (FRS vs WAS) and on
  elasticity-by-band as a follow-up.

Closes both #818 and #817 in the planning sense.
---
 changelog.d/818.md                            |   1 +
 .../capital-gains-distribution-plan.md        | 139 ++++++++++++++++++
 2 files changed, 140 insertions(+)
 create mode 100644 changelog.d/818.md
 create mode 100644 docs/book/assumptions/capital-gains-distribution-plan.md

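Note for reviewers (not part of the commit message): the regression
test proposed in the plan could look roughly like the sketch below. It
is a minimal illustration under stated assumptions, not the test that
will land. The function name and its three array arguments are
hypothetical, it uses plain NumPy rather than the policyengine_uk
simulation API, and it assumes person-level gains are mapped to
households through an integer index.

```python
import numpy as np


def zero_wealth_households_with_gains(
    person_gains, person_household_index, household_wealth
):
    """Return indices of households with zero wealth but positive gains.

    All three arguments are hypothetical stand-ins for model outputs:
    person-level capital gains, each person's household position, and
    household-level total wealth.
    """
    person_gains = np.asarray(person_gains, dtype=float)
    person_household_index = np.asarray(person_household_index, dtype=int)
    household_wealth = np.asarray(household_wealth, dtype=float)

    # Count only the positive part of each person's gains, so that one
    # member's loss cannot mask another member's gain.
    positive_gains = np.maximum(person_gains, 0.0)

    # Aggregate person-level gains to household level.
    household_gains = np.bincount(
        person_household_index,
        weights=positive_gains,
        minlength=household_wealth.size,
    )

    return np.flatnonzero((household_wealth == 0) & (household_gains > 0))


# Toy data: household 2 has zero wealth but a member with gains.
toy_gains = [0.0, 5_000.0, 0.0, 1_200.0]
toy_household_index = [0, 0, 1, 2]
toy_wealth = [250_000.0, 0.0, 0.0]

offenders = zero_wealth_households_with_gains(
    toy_gains, toy_household_index, toy_wealth
)
```

A real test would assert that the returned array is empty for the
imputed dataset; on the toy data above it flags household 2 only.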
diff --git a/changelog.d/818.md b/changelog.d/818.md
new file mode 100644
index 000000000..1e645121f
--- /dev/null
+++ b/changelog.d/818.md
@@ -0,0 +1 @@
+- Add a capital gains distribution planning docs page at `docs/book/assumptions/capital-gains-distribution-plan.md` documenting the current per-decile-independent imputation (in `policyengine-uk-data`) and proposing a two-stage QRF replacement that jointly conditions on income, wealth, age and region — closing both the #818 overfitting concern and the #817 zero-wealth-gains failure mode in one move.
diff --git a/docs/book/assumptions/capital-gains-distribution-plan.md b/docs/book/assumptions/capital-gains-distribution-plan.md
new file mode 100644
index 000000000..0d00ba6d4
--- /dev/null
+++ b/docs/book/assumptions/capital-gains-distribution-plan.md
@@ -0,0 +1,139 @@
+# Capital gains distribution plan
+
+```{note}
+**Planning page.** Tracks [#818](https://github.com/PolicyEngine/policyengine-uk/issues/818)
+and the related zero-wealth imputation question raised in
+[#817](https://github.com/PolicyEngine/policyengine-uk/issues/817).
+PolicyEngine UK's capital gains imputation lives in
+`policyengine-uk-data`, not in this repo — but the modelling
+assumptions and the calibration constraints are documented here so the
+cross-repo plan is discoverable from a single page.
+```
+
+## Current state
+
+PolicyEngine UK reads `capital_gains_before_response` as a person-level
+input variable; the runtime variables under
+[`variables/gov/hmrc/capital_gains_tax/`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax)
+then compute the elasticity-based behavioural response and the tax
+liability against the published CGT bracket structure.
+
+The input itself is produced in `policyengine-uk-data` by:
+
+1. Splitting the population into income deciles.
+2. Fitting a separate marginal distribution of capital gains within
+   each decile (against HMRC's *Capital Gains Tax statistics*).
+3. Drawing values for each FRS row from the decile-conditional
+   distribution.
+
+This is the **independent per-decile model** that #818 flagged.
+
+## Limitations of the current approach
+
+### 1. Overfitting at the boundaries
+
+Per-decile fitting means each decile's tail is fitted independently of
+its neighbours. Because HMRC's published CGT statistics bin gains by
+income, each per-decile tail contains few observations, so the fitted
+upper tails are noisy and the joint distribution of income and capital
+gains can have spurious features at decile boundaries.
+
+### 2. No conditioning on wealth
+
+#817 reported a related failure mode: capital gains are imputed to
+households with zero recorded wealth. That's structurally implausible
+(you can't realise gains on assets you don't own) and arises because
+the per-decile model conditions only on income, not on wealth holdings.
+The error runs both ways: low-income retirees with substantial wealth
+are likely to realise gains that the current model misses, while
+low-income tenants with zero wealth are incorrectly assigned gains.
+
+### 3. No covariance between income and gains beyond decile membership
+
+An additional-rate taxpayer with £200k income and a basic-rate taxpayer
+with £40k income within the same decile (after weighting / SPI)
+receive draws from the same marginal distribution. In reality gains
+remain correlated with income within a decile, which this model ignores.
+
+## Proposed approach (#818)
+
+Switch to a **multivariate model over (income, wealth, age, gender)**
+that produces capital gains as a derived dimension. Concrete options:
+
+### A. Multivariate KDE
+
+The [OG-USA bequest model](https://pslmodels.github.io/OG-USA/content/api/bequest_transmission.html)
+referenced in #818 uses a multivariate Gaussian KDE over the relevant
+conditioning variables. For UK gains the bandwidth and kernel choice
+would need calibrating against HMRC's gains-by-income-by-age cross-tabs;
+the SPI (*Survey of Personal Incomes*) extracts, where available via
+Datalab, give a richer conditioning set.
+
+Tradeoffs:
+
+- **Pro**: no per-decile boundary artefacts; the joint distribution
+  comes out smooth in all dimensions.
+- **Con**: KDE bandwidth choice is tricky in the tail; large gains
+  remain noisy unless we supplement with a tail model.
+
+### B. Two-stage QRF
+
+Train a quantile regression forest (QRF) on the SPI donor set with
+predictors = `(age, gender, region, employment_income,
+self_employment_income, savings_interest_income, dividend_income,
+total_wealth, gross_financial_wealth)` and output = annual capital
+gains. This is the same machinery already used in
+`policyengine_uk_data/datasets/imputations/income.py` and the
+proposed second-stage QRF in [#1621](https://github.com/PolicyEngine/policyengine-uk/issues/1621)
+/ [pipeline alignment plan](./uk-pipeline-alignment-plan.md).
+
+Tradeoffs:
+
+- **Pro**: the QRF handles correlated predictors naturally and
+  estimates conditional quantiles directly, including in the tail.
+  It also reuses the existing imputation infrastructure.
+- **Con**: requires a clean SPI-linked donor set with all the
+  conditioning variables (currently the QRF in `income.py` doesn't
+  output capital gains).
+
+**Recommendation**: option B. The infrastructure is in place, the
+calibration is testable against HMRC's published gains-by-income-band
+tables, and it addresses both #818 (overfitting) and #817 (implausible
+zero-wealth gains) in one move by making `total_wealth` a predictor.
+
+## What changes in this repo
+
+The model-side surface is small:
+
+- The `capital_gains_before_response` input variable stays the same.
+  All the changes are upstream in `policyengine-uk-data`.
+- A regression test in the model that asserts **no positive capital
+  gains for households with zero total wealth** would catch any
+  reintroduction of the #817 failure mode; it would sit naturally in
+  `policyengine_uk/tests/`.
+
+## Open questions
+
+- Wealth in the FRS is incomplete and noisy; the WAS (Wealth and
+  Assets Survey) is the better wealth conditioning source but is on a
+  different sampling frame. Should the QRF be trained on a WAS-linked
+  donor, or do we condition on the FRS-imputed wealth and accept the
+  noise?
+- HMRC's published CGT statistics break gains down by income, age, and
+  asset type but not by household wealth. Calibration targets will
+  need to be assembled across multiple HMRC and ONS sources.
+- Behavioural response (`capital_gains_behavioural_response` in this
+  repo) currently uses a single elasticity. A multivariate model that
+  gets the distribution right opens the door to **elasticity by
+  income / wealth band** — useful for reform analysis but adds a
+  parameter surface.
+
+## References
+
+- Issue: [#818](https://github.com/PolicyEngine/policyengine-uk/issues/818) — original "model gains jointly across income groups".
+- Related: [#817](https://github.com/PolicyEngine/policyengine-uk/issues/817) — avoid imputing CG to zero-wealth households.
+- Reference implementation: [OG-USA bequest-transmission multivariate KDE](https://pslmodels.github.io/OG-USA/content/api/bequest_transmission.html).
+- Reusable infrastructure: the QRF in [`policyengine_uk_data/datasets/imputations/income.py`](https://github.com/PolicyEngine/policyengine-uk-data) and the second-stage QRF plan in [pipeline alignment](./uk-pipeline-alignment-plan.md).
+- HMRC, [Capital Gains Tax statistics](https://www.gov.uk/government/collections/capital-gains-tax-statistics) — calibration source.
+- ONS, [Wealth and Assets Survey](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/bulletins/totalwealthingreatbritain/latest) — alternative conditioning source for wealth.
+- Existing variables: [`capital_gains_before_response`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax/capital_gains_before_response.py), [`capital_gains`](../../../policyengine_uk/variables/household/income/capital_gains.py), [`capital_gains_behavioural_response`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax/capital_gains_behavioural_response.py).