From 1f9dcfe8eefefc19ca40314b3ff8f49417a55249 Mon Sep 17 00:00:00 2001
From: Vahid Ahmadi
Date: Wed, 27 May 2026 12:17:49 +0200
Subject: [PATCH] Add capital gains distribution plan (#818 #817)

#818 asks for a multivariate capital gains imputation instead of the
current per-income-decile-independent fits. #817 reports the related
failure mode of imputing gains to zero-wealth households.

The imputation lives in policyengine-uk-data, not in this repo, but
this commit persists the model-facing plan here so the cross-repo
agreement is discoverable. The page covers:

- the current state (per-decile marginal distributions fit to HMRC CGT
  statistics binned by income),
- three concrete limitations (overfitting at decile boundaries, no
  wealth conditioning per #817, no within-decile correlation),
- two candidate approaches:
  A. multivariate Gaussian KDE following OG-USA's bequest-transmission
     model;
  B. two-stage QRF reusing the existing imputation infrastructure
     already proposed in the #1621 pipeline-alignment plan,
- recommendation: B (QRF) because the infrastructure exists, it handles
  correlated predictors naturally, and it solves both #818 and #817 in
  one move by adding total_wealth as a predictor,
- the model-side surface (negligible, since the input variable stays
  the same; one new regression test for "no gains on zero-wealth
  households"),
- open questions on the wealth donor source (FRS vs WAS) and on
  elasticity-by-band as a follow-up.

Closes both #818 and #817 in the planning sense.
---
 changelog.d/818.md                                 |   1 +
 .../capital-gains-distribution-plan.md             | 139 ++++++++++++++++++
 2 files changed, 140 insertions(+)
 create mode 100644 changelog.d/818.md
 create mode 100644 docs/book/assumptions/capital-gains-distribution-plan.md

diff --git a/changelog.d/818.md b/changelog.d/818.md
new file mode 100644
index 000000000..1e645121f
--- /dev/null
+++ b/changelog.d/818.md
@@ -0,0 +1 @@
+- Add a capital gains distribution planning docs page at `docs/book/assumptions/capital-gains-distribution-plan.md` documenting the current per-decile-independent imputation (in `policyengine-uk-data`) and proposing a two-stage QRF replacement that jointly conditions on income, wealth, age and region, closing both the #818 overfitting concern and the #817 zero-wealth-gains failure mode in one move.
diff --git a/docs/book/assumptions/capital-gains-distribution-plan.md b/docs/book/assumptions/capital-gains-distribution-plan.md
new file mode 100644
index 000000000..0d00ba6d4
--- /dev/null
+++ b/docs/book/assumptions/capital-gains-distribution-plan.md
@@ -0,0 +1,139 @@
+# Capital gains distribution plan
+
+```{note}
+**Planning page.** Tracks [#818](https://github.com/PolicyEngine/policyengine-uk/issues/818)
+and the related zero-wealth imputation question raised in
+[#817](https://github.com/PolicyEngine/policyengine-uk/issues/817).
+PolicyEngine UK's capital gains imputation lives in
+`policyengine-uk-data`, not in this repo, but the modelling
+assumptions and the calibration constraints are documented here so the
+cross-repo plan is discoverable from a single page.
+```
+
+## Current state
+
+PolicyEngine UK reads `capital_gains_before_response` as a person-level
+input variable; the runtime variables under
+[`variables/gov/hmrc/capital_gains_tax/`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax)
+then compute the elasticity-based behavioural response and the tax
+liability against the published CGT bracket structure.
+
+The input itself is produced in `policyengine-uk-data` by:
+
+1. Splitting the population into income deciles.
+2. Fitting a separate marginal distribution of capital gains within
+   each decile (against HMRC's *Capital Gains Statistics*).
+3. Drawing values for each FRS row from the decile-conditional
+   distribution.
+
+This is the **independent per-decile model** that #818 flagged.
+
+## Limitations of the current approach
+
+### 1. Overfitting at the boundaries
+
+Per-decile fitting means each decile's tail is fit independently from
+its neighbours. With HMRC's published CGT statistics binning gains by
+income, the per-decile tails contain few observations, so the fitted
+upper tails are noisy, and the joint distribution of "income + capital
+gains" can have spurious features at decile boundaries.
+
+### 2. No conditioning on wealth
+
+#817 reported a related failure mode: capital gains are imputed to
+households with zero recorded wealth. That's structurally implausible
+(you can't realise gains on assets you don't own) and arises because
+the per-decile model conditions only on income, not on wealth holdings.
+Many low-income retirees with substantial wealth realise gains that
+the current model misses; many low-income tenants with zero wealth
+incorrectly receive them.
+
+### 3. No covariance between income and gains beyond decile membership
+
+An additional-rate taxpayer with £200k income and a basic-rate taxpayer
+with £40k income within the same decile (after weighting / SPI)
+receive draws from the same marginal distribution. In reality the
+correlation between income and gains is much stronger than that.
+
+## Proposed approach (#818)
+
+Switch to a **multivariate model over (income, wealth, age, gender, region)**
+that produces capital gains as a derived dimension. Concrete options:
+
+### A. Multivariate KDE
+
+The [OG-USA bequest model](https://pslmodels.github.io/OG-USA/content/api/bequest_transmission.html)
+referenced in #818 uses a multivariate Gaussian KDE over the relevant
+conditioning variables. For UK gains the bandwidth and kernel choice
+would need calibrating against HMRC's gains-by-income-by-age cross-tabs;
+the *Survey of Personal Incomes* (SPI) extracts (where available via
+Datalab) give a richer conditioning set.
+
+Tradeoffs:
+
+- **Pro**: no per-decile boundary artefacts; the joint distribution
+  comes out smooth in all dimensions.
+- **Con**: KDE bandwidth choice is tricky in the tail; large gains
+  remain noisy unless we supplement with a tail model.
+
+### B. Two-stage QRF
+
+Train a quantile-regression-forest (QRF) on the SPI donor set with
+predictors = `(age, gender, region, employment_income,
+self_employment_income, savings_interest_income, dividend_income,
+total_wealth, gross_financial_wealth)` and output = annual capital
+gains. This is the same machinery already used in
+`policyengine_uk_data/datasets/imputations/income.py` and the
+proposed second-stage QRF in [#1621](https://github.com/PolicyEngine/policyengine-uk/issues/1621)
+/ [pipeline alignment plan](./uk-pipeline-alignment-plan.md).
+
+Tradeoffs:
+
+- **Pro**: the QRF naturally handles correlated predictors and
+  produces well-calibrated quantiles in the tail. Reuses existing
+  imputation infrastructure.
+- **Con**: requires a clean SPI-linked donor set with all the
+  conditioning variables (currently the QRF in `income.py` doesn't
+  output capital gains).
+
+**Recommendation**: option B. The infrastructure is in place, the
+calibration is testable against HMRC's published gains-by-income-band
+tables, and it addresses both #818 (overfitting) and #817
+(implausible zero-wealth gains) by making `total_wealth` a predictor.
+
+## What changes in this repo
+
+The model-side surface is small:
+
+- The `capital_gains_before_response` input variable stays the same.
+  All the changes are upstream in `policyengine-uk-data`.
+- A regression test in the model that asserts **no positive capital
+  gains for households with zero total wealth** would catch
+  reintroductions of the #817 failure mode and would fit naturally in
+  `policyengine_uk/tests/`.
+
+## Open questions
+
+- Wealth in the FRS is incomplete and noisy; the WAS (Wealth and
+  Assets Survey) is the better wealth conditioning source but is on a
+  different sampling frame. Should the QRF be trained on a WAS-linked
+  donor, or do we condition on the FRS-imputed wealth and accept the
+  noise?
+- HMRC's published CGT statistics break gains down by income, age, and
+  asset type but not by household wealth. Calibration targets will
+  need to be assembled across multiple HMRC and ONS sources.
+- Behavioural response (`capital_gains_behavioural_response` in this
+  repo) currently uses a single elasticity. A multivariate model that
+  gets the distribution right opens the door to **elasticity by
+  income / wealth band**, which is useful for reform analysis but adds
+  a parameter surface.
+
+## References
+
+- Issue: [#818](https://github.com/PolicyEngine/policyengine-uk/issues/818), the original "model gains jointly across income groups" request.
+- Related: [#817](https://github.com/PolicyEngine/policyengine-uk/issues/817), avoid imputing CG to zero-wealth households.
+- Reference implementation: [OG-USA bequest-transmission multivariate KDE](https://pslmodels.github.io/OG-USA/content/api/bequest_transmission.html).
+- Reusable infrastructure: the QRF in [`policyengine_uk_data/datasets/imputations/income.py`](https://github.com/PolicyEngine/policyengine-uk-data) and the second-stage QRF plan in [pipeline alignment](./uk-pipeline-alignment-plan.md).
+- HMRC, [Capital Gains Tax statistics](https://www.gov.uk/government/collections/capital-gains-tax-statistics), the calibration source.
+- ONS, [Wealth and Assets Survey](https://www.ons.gov.uk/peoplepopulationandcommunity/personalandhouseholdfinances/incomeandwealth/bulletins/totalwealthingreatbritain/latest), an alternative conditioning source for wealth.
+- Existing variables: [`capital_gains_before_response`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax/capital_gains_before_response.py), [`capital_gains`](../../../policyengine_uk/variables/household/income/capital_gains.py), [`capital_gains_behavioural_response`](../../../policyengine_uk/variables/gov/hmrc/capital_gains_tax/capital_gains_behavioural_response.py).
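
Note on the proposed regression test: the check described under "What changes in this repo" (no positive capital gains for households with zero total wealth) could take roughly the shape below. This is a minimal sketch in plain Python, not the repo's test harness. The helper name `zero_wealth_gain_violations` and the toy inputs are hypothetical; a real test would read `capital_gains_before_response` and `total_wealth` from a simulation over the dataset instead of hard-coded values.

```python
def zero_wealth_gain_violations(person_gains, household_wealth):
    """Return sorted ids of zero-wealth households containing positive gains.

    person_gains: iterable of (household_id, gains) pairs, one per person,
        mirroring the person-level `capital_gains_before_response` input.
    household_wealth: mapping of household_id -> total wealth.
    Both argument shapes are assumptions made for this sketch.
    """
    violations = {
        household_id
        for household_id, gains in person_gains
        # Flag any person with positive gains in a zero-wealth household.
        if gains > 0 and household_wealth[household_id] == 0
    }
    return sorted(violations)


# Toy data, made up for illustration (not survey values).
person_gains = [(1, 0.0), (1, 2_500.0), (2, 0.0), (3, 900.0), (3, 0.0)]
household_wealth = {1: 180_000.0, 2: 0.0, 3: 0.0}

violations = zero_wealth_gain_violations(person_gains, household_wealth)
print(violations)  # → [3]: household 3 has zero wealth but positive gains
```

The check is applied per person, not on household-summed gains, so a loss recorded for one household member cannot mask a positive imputed gain for another. A regression test would assert that the returned list is empty for the imputed dataset.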