This document provides the academic foundations and key implementation requirements for each estimator in diff-diff. It serves as a reference for contributors and users who want to understand the theoretical basis of the methods.
Primary source: Canonical econometrics textbooks
- Wooldridge, J.M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed. MIT Press.
- Angrist, J.D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
Key implementation requirements:
Assumption checks / warnings:
- Treatment and post indicators must be binary (0/1) with variation in both
- Warns if no treated units in pre-period or no control units in post-period
- Parallel trends assumption is untestable but can be assessed with pre-treatment data
Estimator equation (as implemented):
ATT = (Ȳ_{treated,post} - Ȳ_{treated,pre}) - (Ȳ_{control,post} - Ȳ_{control,pre})
= E[Y(1) - Y(0) | D=1]
Regression form:
Y_it = α + β₁(Treated_i) + β₂(Post_t) + τ(Treated_i × Post_t) + X'γ + ε_it
where τ is the ATT.
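The equivalence between the difference-in-means ATT and the interaction coefficient can be checked numerically; a minimal sketch on simulated data (illustrative, not the package's API):

```python
import numpy as np

# Simulated 2x2 design with a true interaction effect of 2.0
rng = np.random.default_rng(0)
treated = np.repeat([0, 0, 1, 1], 50)
post = np.tile([0, 1], 100)
y = 1.0 + 0.5 * treated + 0.3 * post + 2.0 * treated * post \
    + rng.normal(0, 0.1, 200)

# Difference of means: (treated post - treated pre) - (control post - control pre)
att_means = (y[(treated == 1) & (post == 1)].mean()
             - y[(treated == 1) & (post == 0)].mean()) \
          - (y[(treated == 0) & (post == 1)].mean()
             - y[(treated == 0) & (post == 0)].mean())

# Same quantity as the OLS coefficient on the interaction term
X = np.column_stack([np.ones_like(y), treated, post, treated * post])
att_ols = np.linalg.lstsq(X, y, rcond=None)[0][3]
```

Because the regression is saturated in the four cells, the two numbers agree to numerical precision.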
Standard errors:
- Default: HC1 heteroskedasticity-robust
- Optional: Cluster-robust (specify the `cluster` parameter)
- Optional: Wild cluster bootstrap for small numbers of clusters
Edge cases:
- Empty cells (e.g., no treated-pre observations) cause rank deficiency, handled per the `rank_deficient_action` setting:
  - With "warn" (default): emits a warning, sets NaN for affected coefficients
  - With "error": raises ValueError
  - With "silent": continues silently with NaN coefficients
- Singleton clusters (one observation): included in variance estimation; contribute to meat matrix via u_i² X_i X_i' (same formula as larger clusters with n_g=1)
- Rank-deficient design matrix (collinearity): warns and sets NA for dropped coefficients (R-style, matches `lm()`)
  - Tolerance: `1e-07` (matches R's `qr()` default), relative to the largest diagonal element of R in the QR decomposition
  - Controllable via the `rank_deficient_action` parameter: "warn" (default), "error", or "silent"
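The detection rule described above can be sketched with a pivoted QR decomposition and a relative tolerance (illustrative code using numpy/scipy, not the package's implementation):

```python
import numpy as np
from scipy.linalg import qr

def detect_rank_deficiency(X, tol=1e-07):
    # Pivoted QR; a column is flagged when |R[j, j]| falls below
    # tol * max|diag(R)| (relative tolerance, as in R's qr()).
    _, R, piv = qr(X, mode="economic", pivoting=True)
    d = np.abs(np.diag(R))
    rank = int(np.sum(d > tol * d.max()))
    # Indices flagged as dependent; which of the collinear columns is
    # dropped depends on the pivot order.
    dropped = sorted(piv[rank:])
    return rank, dropped

# Collinear design: third column is the sum of the first two.
X = np.column_stack([np.ones(8), np.arange(8.0), np.ones(8) + np.arange(8.0)])
rank, dropped = detect_rank_deficiency(X)   # rank 2, one dropped column
```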
Reference implementation(s):
- R: `fixest::feols()` with an interaction term
- Stata: `reghdfe` or manual regression with an interaction
Requirements checklist:
- Treatment and time indicators are binary 0/1 with variation
- ATT equals coefficient on interaction term
- Wild bootstrap supports Rademacher, Mammen, Webb weight distributions
- Formula interface parses `y ~ treated * post` correctly
Primary source: Event study methodology
- Freyaldenhoven, S., Hansen, C., Pérez, J.P., & Shapiro, J.M. (2021). Visualization, identification, and estimation in the linear panel event-study design. NBER Working Paper 29170.
- Wooldridge, J.M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed. MIT Press, Ch. 10, 13.
- Angrist, J.D., & Pischke, J.-S. (2009). Mostly Harmless Econometrics. Princeton University Press.
Scope: Simultaneous adoption event study. All treated units receive treatment at the same time. For staggered adoption (different units treated at different times), use CallawaySantAnna or SunAbraham instead.
Key implementation requirements:
Assumption checks / warnings:
- Treatment indicator must be binary (0/1) with variation in both groups
- Requires at least 1 pre-treatment and 1 post-treatment period
- Warns when only 1 pre-period available (≥2 needed to test parallel trends; ATT is still valid but pre-trends assessment is not possible)
- Reference period defaults to last pre-treatment period (e=-1 convention)
- Treatment indicator should be a time-invariant ever-treated indicator (D_i); warns when time-varying D_it is detected (requires the `unit` parameter)
- Warns if treatment timing varies across units when `unit` is provided (suggests CallawaySantAnna or SunAbraham instead)
- Treatment must be an absorbing state (once treated, always treated)
Estimator equation (target specification):
With unit and time fixed effects absorbed:
Y_it = α_i + γ_t + Σ_{e≠-1} δ_e × D_i × 1(t = E + e) + X'β + ε_it
where:
- α_i = unit fixed effects (absorbed)
- γ_t = time fixed effects (absorbed)
- E = common treatment time (same for all treated units)
- D_i = treatment group indicator (1=treated, 0=control)
- e = t - E = event time (relative periods to treatment)
- δ_e = treatment effect at event time e
- δ_{-1} = 0 (reference period, omitted for identification)
For simultaneous treatment, this is equivalent to interacting treatment with calendar-time indicators:
Y_it = α_i + γ_t + Σ_{t≠t_ref} δ_t × (D_i × Period_t) + X'β + ε_it
where interactions are included for ALL periods (pre and post), not just post-treatment.
Pre-treatment coefficients (e < -1) test the parallel trends assumption: under H0 of parallel trends, δ_e = 0 for all e < 0.
Post-treatment coefficients (e ≥ 0) estimate dynamic treatment effects.
Average ATT over post-treatment periods:
ATT_avg = (1/|post|) × Σ_{e≥0} δ_e
with SE computed from the sub-VCV matrix:
Var(ATT_avg) = 1'V1 / |post|²
where V is the VCV sub-matrix for post-treatment δ_e coefficients.
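A minimal numerical check of the averaging formula, using hypothetical coefficient and VCV values:

```python
import numpy as np

# Hypothetical post-treatment coefficients delta_e for e = 0, 1, 2
delta_post = np.array([1.0, 1.5, 2.0])
# Hypothetical VCV sub-matrix for those coefficients
V = np.array([[0.04, 0.01, 0.00],
              [0.01, 0.05, 0.01],
              [0.00, 0.01, 0.06]])

k = len(delta_post)
att_avg = delta_post.mean()
ones = np.ones(k)
var_avg = ones @ V @ ones / k**2   # Var(ATT_avg) = 1'V1 / |post|^2
se_avg = np.sqrt(var_avg)
```

Note that the off-diagonal covariances enter `1'V1`, which is why averaging the per-period SEs would be wrong.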
Standard errors:
- Default: HC1 heteroskedasticity-robust (same as DifferenceInDifferences base class)
- Alternative: Cluster-robust at the unit level via the `cluster` parameter (recommended for panel data)
- Optional: Wild cluster bootstrap (complex for multi-coefficient testing; requires a joint bootstrap distribution)
- Degrees of freedom adjusted for absorbed fixed effects
Edge cases:
- Reference period: omitted from the design matrix; its coefficient is zero by construction. Default is the last pre-treatment period (e=-1). Users can override via `reference_period`.
- Post-period reference: raises ValueError. A post-period reference would exclude a post-treatment period from estimation, biasing avg_att and breaking downstream inference.
- Reference period default change: a FutureWarning is emitted when `reference_period` is not explicitly specified and ≥2 pre-periods exist, noting the default changed from the first to the last pre-period (e=-1 convention, matching fixest/did).
- Never-treated units: all event-time indicators are zero; these units identify the time fixed effects and serve as the comparison group.
- Endpoint binning: distant event times (e.g., e < -K or e > K) should be binned into endpoint indicators to avoid sparse cells. This prevents imprecise estimates at extreme leads/lags.
- Unbalanced panels: only uses observations where event-time is defined. Units not observed at all event times contribute to the periods they are present for.
- Rank-deficient design matrix (collinearity): warns and sets NA for dropped coefficients (R-style, matches `lm()`)
- Average ATT (`avg_att`) is NA if any post-period effect is unidentified (R-style NA propagation)
- NaN inference for undefined statistics:
- t_stat: Uses NaN (not 0.0) when SE is non-finite or zero
- p_value and CI: Also NaN when t_stat is NaN
- avg_se: Checked for finiteness before computing avg_t_stat
- Note: Defensive enhancement matching CallawaySantAnna NaN convention
- Treatment reversal: warns if any unit transitions from treated to untreated (non-absorbing treatment violates the simultaneous adoption assumption)
- Time-varying treatment (D_it): warns when the `unit` parameter is provided and within-unit treatment variation is detected, advising users to create an ever-treated indicator. Without an ever-treated D_i, pre-period interaction coefficients are unidentified.
- Pre-test of parallel trends: joint F-test on the pre-treatment δ_e coefficients. A passing pre-test does not validate parallel trends, since such pre-tests can have low power (Roth 2022).
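The endpoint-binning rule mentioned in the edge cases reduces to clipping event times at ±K; a minimal sketch:

```python
import numpy as np

# Hypothetical event times; bin everything beyond K = 4 into endpoint bins,
# so e <= -4 maps to -4 and e >= 4 maps to 4.
K = 4
event_time = np.array([-7, -5, -3, -1, 0, 2, 4, 6, 9])
binned = np.clip(event_time, -K, K)
# binned -> [-4, -4, -3, -1, 0, 2, 4, 4, 4]
```

Indicator columns are then built from `binned` instead of the raw event times, so sparse extreme leads/lags share one coefficient per endpoint.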
Reference implementation(s):
- R: `fixest::feols(y ~ i(time, treatment, ref=ref_period) | unit + time, data, cluster=~unit)` or equivalently `feols(y ~ i(event_time, ref=-1) | unit + time, data, cluster=~unit)`
- Stata: `reghdfe y ib(-1).event_time#1.treatment, absorb(unit time) cluster(unit)`
Requirements checklist:
- Event-time indicators for ALL periods (pre and post), not just post-treatment
- Reference period coefficient is zero (normalized by omission from design matrix)
- Pre-period coefficients available for parallel trends assessment
- Default cluster-robust SE at the unit level (currently HC1; cluster-robust via the `cluster` parameter)
- Supports unit and time FE via absorption
- Endpoint binning for distant event times
- Average ATT correctly accounts for covariance between period effects
- Returns PeriodEffect objects with confidence intervals
- Supports both balanced and unbalanced panels
Primary source: Panel data econometrics
- Wooldridge, J.M. (2010). Econometric Analysis of Cross Section and Panel Data, 2nd ed. MIT Press, Chapter 10.
Key implementation requirements:
Assumption checks / warnings:
- Staggered treatment warning: If treatment timing varies across units, warns about potential bias from negative weights (Goodman-Bacon 2021, de Chaisemartin & D'Haultfœuille 2020)
- Requires sufficient within-unit and within-time variation
- Warns if any fixed effect is perfectly collinear with treatment
Estimator equation (as implemented):
Y_it = α_i + γ_t + τ(D_it) + X'β + ε_it
Estimated via within-transformation (demeaning):
Ỹ_it = τD̃_it + X̃'β + ε̃_it
where tildes denote demeaned variables.
Note: The interaction term D_i × Post_t is within-transformed (demeaned) alongside the
outcome and covariates before regression. This is required by the Frisch-Waugh-Lovell theorem:
all regressors must be projected out of the same fixed effects space as the dependent variable.
This matches the behavior of R's fixest::feols() with absorbed FE.
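The FWL equivalence described above can be verified numerically on a small balanced panel; a sketch on simulated data (illustrative, not the package's code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_units, n_times = 6, 4
unit = np.repeat(np.arange(n_units), n_times)
time = np.tile(np.arange(n_times), n_units)
d = ((unit >= 3) & (time >= 2)).astype(float)   # treated units x post periods
alpha = rng.normal(0, 1, n_units)[unit]          # unit fixed effects
gamma = rng.normal(0, 1, n_times)[time]          # time fixed effects
y = alpha + gamma + 1.7 * d + rng.normal(0, 0.1, unit.size)

def demean(v):
    # Balanced-panel two-way within transformation:
    # subtract unit and time means, add back the grand mean.
    u_mean = np.array([v[unit == i].mean() for i in range(n_units)])[unit]
    t_mean = np.array([v[time == t].mean() for t in range(n_times)])[time]
    return v - u_mean - t_mean + v.mean()

# tau from the demeaned regression (treatment demeaned alongside outcome)
tau_within = demean(d) @ demean(y) / (demean(d) @ demean(d))

# Same tau from the full dummy regression
X = np.column_stack([d]
                    + [(unit == i).astype(float) for i in range(n_units)]
                    + [(time == t).astype(float) for t in range(1, n_times)])
tau_dummy = np.linalg.lstsq(X, y, rcond=None)[0][0]
```

On a balanced panel the two-way demeaning is the exact projection onto the complement of the unit and time dummies, so the two estimates agree to numerical precision.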
Standard errors:
- Default: Cluster-robust at unit level (accounts for serial correlation)
- Degrees of freedom adjusted for absorbed fixed effects:
df_adjustment = n_units + n_times - 2
Edge cases:
- Singleton units/periods are automatically dropped
- Treatment perfectly collinear with FE raises error with informative message listing dropped columns
- Covariate collinearity emits warning but estimation continues (ATT still identified)
- Rank-deficient design matrix: warns and sets NA for dropped coefficients (R-style, matches `lm()`)
- Unbalanced panels handled via proper demeaning
- Multi-period `time` parameter: only a binary (0/1) post indicator is recommended; multi-period values produce `treated × period_number` rather than `treated × post_indicator`. A UserWarning is emitted when `time` has >2 unique values, advising users to create a binary post column. Non-{0,1} binary time (e.g., {2020, 2021}) also emits a warning, though the ATT is mathematically correct: the within-transformation absorbs the scaling.
- Staggered warning limitation: requires `time` to contain actual period values (not binary 0/1) so that different cohort first-treatment times can be distinguished. With binary `time="post"`, all treated units appear to start at `time=1`, making staggering undetectable. Users with staggered designs should use `decompose()` or CallawaySantAnna directly.
Reference implementation(s):
- R: `fixest::feols(y ~ treat:post | unit + post, data, cluster = ~unit)`
- Stata: `reghdfe y treat, absorb(unit time) cluster(unit)`
Requirements checklist:
- Staggered adoption detection warning (only fires when `time` has >2 unique values; with binary `time`, staggering is undetectable)
- Multi-period time warning (fires when `time` has >2 unique values)
- Auto-clusters standard errors at the unit level
- `decompose()` method returns BaconDecompositionResults
- Within-transformation correctly handles unbalanced panels
- Non-{0,1} binary time warning (fires when time has 2 unique values not in {0,1})
- ATT invariance to time encoding (verified by test)
Primary source: Callaway, B., & Sant'Anna, P.H.C. (2021). Difference-in-Differences with multiple time periods. Journal of Econometrics, 225(2), 200-230.
Key implementation requirements:
Assumption checks / warnings:
- Requires never-treated units as the comparison group (identified by `first_treat=0` or `never_treated=True`)
- Warns if no never-treated units exist (suggests alternative comparison strategies)
- Limited pre-treatment periods reduce ability to test parallel trends
Estimator equation (as implemented):
Group-time average treatment effect:
ATT(g,t) = E[Y_t - Y_{g-1} | G_g=1] - E[Y_t - Y_{g-1} | C=1]
where G_g=1 indicates units first treated in period g, and C=1 indicates never-treated.
Note: This equation uses g-1 as the base period, which applies to post-treatment effects (t ≥ g) and base_period="universal". With base_period="varying" (default), pre-treatment effects use t-1 as base for consecutive comparisons (see Base period selection in Edge cases).
With covariates (doubly robust):
ATT(g,t) = E[((G_g - p̂_g(X))/(1-p̂_g(X))) × (Y_t - Y_{g-1} - m̂_{0,g,t}(X) + m̂_{0,g,g-1}(X))] / E[G_g]
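Without covariates, ATT(g,t) reduces to a difference of mean outcome changes between cohort g and the never-treated; a minimal sketch with hypothetical wide-format data (`first_treat=0` marks never-treated, as above):

```python
import numpy as np

# Hypothetical wide-format outcomes: Y[:, t] is the outcome in period t = 0..3
Y = np.array([
    [1.0, 1.1, 1.2, 1.3],   # never treated
    [2.0, 2.1, 2.2, 2.3],   # never treated
    [1.5, 1.6, 3.0, 3.2],   # g = 2 cohort
    [2.5, 2.6, 4.1, 4.3],   # g = 2 cohort
])
first_treat = np.array([0, 0, 2, 2])

def att_gt(Y, first_treat, g, t):
    # ATT(g, t) with never-treated controls and base period g-1:
    # E[Y_t - Y_{g-1} | G_g = 1] - E[Y_t - Y_{g-1} | C = 1]
    change = Y[:, t] - Y[:, g - 1]
    return change[first_treat == g].mean() - change[first_treat == 0].mean()

# ATT(2, 2): treated mean change (1.4 + 1.5)/2 = 1.45, control mean change 0.1
att_22 = att_gt(Y, first_treat, g=2, t=2)
```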
Aggregations:
- Simple: `ATT = Σ_{g,t} w_{g,t} × ATT(g,t)`, weighted by group size
- Event-study: `ATT(e) = Σ_g w_g × ATT(g, g+e)` for event-time e
- Group: `ATT(g) = Σ_t ATT(g,t) / T_g`, an average over post-periods
Standard errors:
- Default: Analytical (influence function-based)
- All aggregation SEs (simple, event study) include the weight influence function (WIF) adjustment, matching R's `did::aggte()`. The WIF accounts for uncertainty in estimating group-size aggregation weights. Group aggregation uses equal time weights (deterministic), so its WIF is zero.
- Bootstrap: Multiplier bootstrap with Rademacher, Mammen, or Webb weights. The bootstrap perturbs the combined influence function (standard IF + WIF) directly, not just a fixed-weight re-aggregation. This correctly propagates weight estimation uncertainty.
- Block structure preserves within-unit correlation
- Simultaneous confidence bands (`cband=True`, default): uses a sup-t bootstrap to compute a uniform critical value across event times, controlling the family-wise error rate. Matches R's `did::aggte(..., cband=TRUE)` default. Requires `n_bootstrap > 0`.
Bootstrap weight distributions:
The multiplier bootstrap uses random weights w_i with E[w]=0 and Var(w)=1:
| Weight Type | Values | Probabilities | Properties |
|---|---|---|---|
| Rademacher | ±1 | 1/2 each | Simplest; E[w³]=0 |
| Mammen | -(√5-1)/2, (√5+1)/2 | (√5+1)/(2√5), (√5-1)/(2√5) | E[w³]=1; better for skewed data |
| Webb | ±√(3/2), ±1, ±√(1/2) | 1/6 each | 6-point; recommended for few clusters |
Webb distribution details:
- Values: {-√(3/2), -1, -√(1/2), √(1/2), 1, √(3/2)} ≈ {-1.225, -1, -0.707, 0.707, 1, 1.225}
- Equal probabilities (1/6 each) giving E[w]=0, Var(w)=1
- Matches R's `did` package implementation
- Verification: Implementation matches the `fwildclusterboot` R package (C++ source), which uses identical `sqrt(1.5)`, `1`, `sqrt(0.5)` values with equal 1/6 probabilities. Some documentation shows simplified values (±1.5, ±1, ±0.5), but actual implementations use the square-root values to achieve unit variance.
- Reference: Webb, M.D. (2023). Reworking Wild Bootstrap Based Inference for Clustered Errors. Queen's Economics Department Working Paper No. 1315. (Updated from Webb 2014)
Edge cases:
- Groups with single observation: included but may have high variance
- Missing group-time cells: ATT(g,t) set to NaN
  - Note: When `balance_e` is specified, cohorts with NaN effects at the anchor horizon are excluded from the balanced panel
- Anticipation: the `anticipation` parameter shifts the reference period
  - Group aggregation includes periods t >= g - anticipation (not just t >= g)
  - Both analytical SE and bootstrap SE aggregation respect anticipation
  - Not-yet-treated + anticipation: the control mask uses `G > max(t, base_period) + anticipation` to exclude cohorts treated at either the evaluation period or the base period. This prevents control contamination when `base_period="universal"` and the base period is later than the evaluation period (e.g., a pre-treatment ATT with a universal base)
- Rank-deficient design matrix (covariate collinearity):
  - Detection: pivoted QR decomposition with tolerance `1e-07` (R's `qr()` default)
  - Handling: warns and drops linearly dependent columns, sets NA for dropped coefficients (R-style, matches `lm()`)
  - Parameter: `rank_deficient_action` controls behavior: "warn" (default), "error", or "silent"
- Non-finite inference values:
- Analytic SE: Returns NaN to signal invalid inference (not biased via zeroing)
- Bootstrap: Drops non-finite samples, warns, and adjusts the p-value floor accordingly. SE, CI, and p-value are all NaN if the original point estimate is non-finite, or if the SE is non-finite or zero (e.g., n_valid=1 with ddof=1, or identical samples)
- Threshold: Returns NaN if <50% of bootstrap samples are valid
- Per-effect t_stat: Uses NaN (not 0.0) when SE is non-finite or zero (consistent with overall_t_stat)
- Note: This is a defensive enhancement over reference implementations (R's `did::att_gt`, Stata's `csdid`), which may error or produce unhandled inf/nan in edge cases without informative warnings
- No post-treatment effects (all treatment occurs after data ends):
- Overall ATT set to NaN (no post-treatment periods to aggregate)
- All overall inference fields (SE, t-stat, p-value, CI) also set to NaN
- Warning emitted: "No post-treatment effects for aggregation"
- Individual pre-treatment ATT(g,t) are computed (for parallel trends assessment)
- Bootstrap runs for per-effect SEs even without post-treatment; only overall statistics are NaN
- Principle: NaN propagates consistently through overall inference fields; pre-treatment effects get full bootstrap inference
- Aggregated t_stat (event-study, group-level):
- Uses NaN when SE is non-finite or zero (matches per-effect and overall t_stat behavior)
- Previous behavior (0.0 default) was inconsistent and misleading
- Base period selection (`base_period` parameter):
  - "varying" (default): pre-treatment effects use t-1 as the base (consecutive comparisons); post-treatment effects use long differences from g-1-anticipation. The parallel trends assumption for treatment effects is about long-run trends, but these pre-treatment tests only check consecutive periods.
  - "universal": ALL effects (pre and post) are long differences from g-1-anticipation. Pre-treatment coefficients test cumulative divergence from the base period.
  - Both produce identical post-treatment ATT(g,t); the two settings differ only pre-treatment
  - `anticipation` shifts the base period to g-1-anticipation, which moves it further from treatment and strengthens the parallel trends assumption
  - Matches the R `did::att_gt()` base_period parameter
  - Event study output: with "universal", the output includes the reference period (e=-1-anticipation) with effect=0, se=NaN, conf_int=(NaN, NaN). Inference fields are NaN since this is a normalization constraint, not an estimated effect. Only added when real effects exist.
- Base period interaction with Sun-Abraham comparison:
  - CS with `base_period="varying"` produces different pre-treatment estimates than SA
  - This is expected: CS uses consecutive comparisons, SA uses a fixed reference (e=-1-anticipation)
  - Use `base_period="universal"` for methodologically comparable pre-treatment effects
  - Post-treatment effects match regardless of the base_period setting
- Propensity score estimation:
  - Algorithm: IRLS (Fisher scoring), matching R's `glm(family=binomial)` default; this is consistent with R's `did::att_gt()`, which uses `glm(family=binomial)` internally
  - Near-separation detection: warns when predicted probabilities are within 1e-5 of 0 or 1, or when IRLS fails to converge
  - Trimming: propensity scores are clipped to `[pscore_trim, 1-pscore_trim]` (default 0.01) before weight computation; a warning is emitted when scores are trimmed
  - Fallback: if IRLS fails entirely (LinAlgError/ValueError), falls back to the unconditional propensity score with a warning. Exception: when `rank_deficient_action="error"`, the error is re-raised instead of falling back.
- Control group with `control_group="not_yet_treated"`:
  - Always excludes cohort g from controls when computing ATT(g,t)
  - This applies to both pre-treatment (t < g) and post-treatment (t >= g) periods
  - For pre-treatment periods: even though cohort g has not been treated yet at time t, its units are the treated group for this ATT(g,t) and cannot serve as their own controls
  - Control mask: `never_treated OR (first_treat > max(t, base_period) + anticipation AND first_treat != g)`
  - The `max(t, base_period)` ensures controls are untreated at both the evaluation period and the base period, preventing contamination when `base_period="universal"` uses a base period later than t (matching R's `did::att_gt()`)
  - Does not require never-treated units: when all units are eventually treated, not-yet-treated cohorts serve as controls for each other (requires ≥2 cohorts)
- Note: CallawaySantAnna survey support: weights, strata, PSU, and FPC are all supported.
  - Analytical (`n_bootstrap=0`): aggregated SEs use design-based variance via `compute_survey_if_variance()`
  - Bootstrap (`n_bootstrap>0`): PSU-level multiplier weights replace analytical SEs for aggregated quantities
  - The regression method supports covariates; IPW/DR support the no-covariate case only (covariates + IPW/DR raises NotImplementedError, since the DRDID nuisance IF is not yet implemented)
  - Survey weights compose with IPW weights multiplicatively; the WIF in aggregation matches R's `did::wif()` formula
  - Per-unit survey weights are extracted via `groupby(unit).first()` from the panel-normalized pweight array. On unbalanced panels the pweight normalization (`w * n_obs / sum(w)`) preserves relative unit weights, since all IF/WIF formulas use weight ratios (`sw_i / sum(sw)`) in which the normalization constant cancels. Scale-invariance tests pass on both balanced and unbalanced panels.
- Note (deviation from R): CallawaySantAnna survey reg+covariates per-cell SE uses a conservative plug-in IF based on WLS residuals. The treated IF is `inf_treated_i = (sw_i/sum(sw_treated)) * (resid_i - ATT)` (normalized by the treated weight sum, matching the unweighted `(resid-ATT)/n_t`). The control IF is `inf_control_i = -(sw_i/sum(sw_control)) * wls_resid_i` (normalized by the control weight sum, matching the unweighted `-resid/n_c`). The SE is computed as `sqrt(sum(sw_t_norm * (resid_t - ATT)^2) + sum(sw_c_norm * resid_c^2))`, the weighted analogue of the unweighted `sqrt(var_t/n_t + var_c/n_c)`. This omits the semiparametrically efficient nuisance correction from DRDID's `reg_did_panel`: WLS residuals are orthogonal to the weighted design matrix by construction, so the first-order IF term is asymptotically valid but may be conservative. SEs pass weight-scale-invariance tests. The efficient DRDID correction is deferred to future work.
- Note (deviation from R): Per-cell ATT(g,t) SEs under survey weights use influence-function-based variance (matching R's `did::att_gt` analytical SE path) rather than full Taylor-series linearization. When strata/PSU/FPC are present, analytical aggregated SEs (`n_bootstrap=0`) use `compute_survey_if_variance()` on the combined IF/WIF; bootstrap aggregated SEs (`n_bootstrap>0`) use PSU-level multiplier weights.
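The IRLS (Fisher scoring) fit and the trimming step described in the edge cases can be sketched as follows (illustrative code, not the package's implementation):

```python
import numpy as np

def irls_logit(X, y, max_iter=25, tol=1e-8):
    """Fisher scoring (IRLS) for logistic regression, as in R's glm(family=binomial)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        # Fisher scoring step: solve (X'WX) step = X'(y - p)
        step = np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta, p

# Intercept-only fit recovers the log-odds of the sample treatment share.
X = np.ones((10, 1))
y = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0], dtype=float)
beta, p = irls_logit(X, y)          # beta[0] converges to log(3/7)

# Trimming before weight computation (pscore_trim default 0.01)
p_trimmed = np.clip(p, 0.01, 1 - 0.01)
```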
Reference implementation(s):
- R: `did::att_gt()` (Callaway & Sant'Anna's official package)
- Stata: `csdid`
Requirements checklist:
- Requires never-treated units when `control_group="never_treated"` (default); not required for `"not_yet_treated"`
- Bootstrap weights support Rademacher, Mammen, Webb distributions
- Aggregations: simple, event_study, group all implemented
- Doubly robust estimation when covariates provided
- Multiplier bootstrap preserves panel structure
Primary Source: Callaway, Goodman-Bacon & Sant'Anna (2024), "Difference-in-Differences with a Continuous Treatment," NBER Working Paper 32117.
R Reference: contdid v0.1.0 (CRAN).
Two levels of parallel trends (following CGBS 2024, Assumptions 1-2):
Parallel Trends (PT): for all doses d in D_+,
E[Y_t(0) - Y_{t-1}(0) | D = d] = E[Y_t(0) - Y_{t-1}(0) | D = 0].
Untreated potential outcome paths are the same across all dose groups and the
untreated group. Stronger than binary PT because it conditions on specific dose values.
Identifies: ATT(d|d), ATT^{loc}. Does NOT identify ATT(d), ACRT, or cross-dose comparisons.
Strong Parallel Trends (SPT): additionally, for all d in D,
E[Y_t(d) - Y_{t-1}(0) | D > 0] = E[Y_t(d) - Y_{t-1}(0) | D = d].
No selection into dose groups on the basis of treatment effects.
Implies ATT(d|d) = ATT(d) for all d.
Additionally identifies: ATT(d), ACRT(d), ACRT^{glob}, and cross-dose comparisons.
See docs/methodology/continuous-did.md Section 4 for full details.
Target parameters:
- `ATT(d|d) = E[Y_t(d) - Y_t(0) | D = d]`: effect of dose d on units who received dose d (PT)
- `ATT(d) = E[Y_t(d) - Y_t(0) | D > 0]`: dose-response curve (SPT required)
- `ACRT(d) = dATT(d)/dd`: average causal response / marginal effect (SPT required)
- `ATT^{loc} = E[ATT(D|D) | D > 0] = E[Delta Y | D > 0] - E[Delta Y | D = 0]`: binarized ATT (PT); equals `ATT^{glob}` under SPT
- `ATT^{glob} = E[ATT(D) | D > 0]`: global average dose-response level (SPT required)
- `ACRT^{glob} = E[ACRT(D_i) | D > 0]`: plug-in average marginal effect (SPT required)
Estimation via B-spline OLS:
- Compute `Delta_tilde_Y = (Y_t - Y_{t-1})_treated - mean((Y_t - Y_{t-1})_control)`
- Build the B-spline basis `Psi(D_i)` from treated doses
- OLS: `beta = (Psi'Psi)^{-1} Psi' Delta_tilde_Y`
- `ATT(d) = Psi(d)' beta`, `ACRT(d) = dPsi(d)/dd' beta`
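A minimal end-to-end sketch of these steps, substituting a cubic power basis (no intercept, so ATT(0) = 0) for the B-spline purely for brevity; the basis choice, data, and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# About half the units untreated (dose 0), the rest with doses in [0.2, 1].
dose = np.where(rng.random(n) < 0.5, 0.0, rng.uniform(0.2, 1.0, n))
dy = 2.0 * dose + rng.normal(0, 0.01, n)     # outcome change; true ATT(d) = 2d

treated = dose > 0
# Step 1: center the treated outcome changes by the control mean
dy_tilde = dy[treated] - dy[~treated].mean()
# Step 2: basis Psi(D_i) on treated doses (power basis standing in for B-spline)
d_t = dose[treated]
Psi = np.column_stack([d_t, d_t**2, d_t**3])
# Step 3: OLS
beta = np.linalg.lstsq(Psi, dy_tilde, rcond=None)[0]

def att(d):    # ATT(d) = Psi(d)' beta
    return np.array([d, d**2, d**3]) @ beta

def acrt(d):   # ACRT(d) = dPsi(d)/dd ' beta
    return np.array([1.0, 2 * d, 3 * d**2]) @ beta
```

With a linear dose response, `att(0.5)` should be near 1.0 and `acrt(0.5)` near 2.0.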
- No untreated group: Remark 3.1 (lowest-dose-as-control) not implemented; requires P(D=0) > 0.
- Discrete treatment: Detect integer-valued dose and warn; saturated regression deferred.
- All-same dose: B-spline basis collapses; ACRT(d) = 0 everywhere.
- Rank deficiency: When n_treated <= n_basis, cell is skipped.
- Balanced panel required: matches R `contdid` v0.1.0.
- Anticipation + not-yet-treated: the control mask uses `G > t + anticipation` (not just `G > t`) to exclude cohorts in the anticipation window from not-yet-treated controls. When `anticipation=0` (default), behavior is unchanged.
- Boundary knots: knots are built once from all treated doses (global, not per-cell) to ensure a common basis across (g,t) cells for aggregation. The evaluation grid is clamped to the training-dose boundary knots (`range(dose)`). R's `contdid` v0.1.0 has an inconsistency where `splines2::bSpline(dvals)` uses `range(dvals)` instead of `range(dose)`, which can produce extrapolation artifacts at dose grid extremes. Our approach avoids extrapolation and is methodologically sound.
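The anticipation-aware not-yet-treated mask can be sketched as (hypothetical data; `G` holds the first-treatment period with 0 meaning never treated):

```python
import numpy as np

def not_yet_treated_mask(G, t, anticipation=0):
    # Controls for period t: never-treated (G == 0) or cohorts first treated
    # strictly after t + anticipation, keeping anticipation-window cohorts out.
    return (G == 0) | (G > t + anticipation)

G = np.array([0, 3, 4, 5])   # first-treatment period; 0 = never treated

mask_no_ant = not_yet_treated_mask(G, t=3)                  # [T, F, T, T]
mask_ant = not_yet_treated_mask(G, t=3, anticipation=1)     # cohort 4 excluded
```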
- B-spline basis construction matching R's `splines2::bSpline` (global knots from all treated doses; boundary knots use the training-dose range; see the deviation note above)
- Multi-period (g,t) cell iteration with base period selection
- Dose-response and event-study aggregation with group-proportional weights (n_treated/n_total per group, divided among post-treatment cells; R `ptetools` convention)
ptetoolsconvention) - Multiplier bootstrap for inference
- Analytical SEs via influence functions
- Equation verification tests (linear, quadratic, multi-period)
- Covariate support (deferred, matching R v0.1.0)
- Discrete treatment saturated regression
- Lowest-dose-as-control (Remark 3.1)
- Survey design support (Phase 3): weighted B-spline OLS, TSL on influence functions; bootstrap + survey supported (Phase 6)
- Note: ContinuousDiD bootstrap with survey weights is supported (Phase 6) via PSU-level multiplier weights
Primary source: Chen, X., Sant'Anna, P. H. C., & Xie, H. (2025). Efficient Difference-in-Differences and Event Study Estimators.
Key implementation requirements:
Assumption checks / warnings:
- Random Sampling (Assumption S): data is a random sample of `(Y_1, ..., Y_T, X', G)'`
- Overlap (Assumption O): for each group g, the generalized propensity score `E[G_g | X]` must lie in `(0, 1)` a.s. Near-zero propensity scores cause the ratio `p_g(X)/p_{g'}(X)` to explode; warn on finite-sample instability
- No-anticipation (Assumption NA): for all treated groups g and pre-treatment periods t < g: `E[Y_t(g) | G=g, X] = E[Y_t(infinity) | G=g, X]` a.s.
- Parallel Trends, two variants:
  - PT-Post (weaker): PT holds only in post-treatment periods, the comparison group is never-treated only, and the baseline is period g-1 only. The estimator is just-identified and reduces to standard single-baseline DiD (Corollary 3.2)
  - PT-All (stronger): PT holds for all groups and all periods. Enables using any not-yet-treated cohort and any pre-treatment period as a baseline. The model is overidentified (Lemma 2.1); the paper derives optimal combination weights
- Absorbing treatment: binary treatment must be irreversible (once treated, stays treated)
- Balanced panel: a short balanced panel is required ("large-n, fixed-T" regime). Does not handle unbalanced panels or repeated cross-sections
- Warn if treatment varies within units (non-absorbing treatment)
- Warn if propensity score estimates are near boundary values
Estimator equation -- single treatment date (Equations 3.2, 3.5):
Transformed outcome (Equation 3.2):
Y_tilde_{g,t,t_pre} = (1/pi_g) * (G_g - p_g(X)/p_inf(X) * G_inf) * (Y_t - Y_{t_pre} - m_{inf,t,t_pre}(X))
Efficient ATT estimand (Equation 3.5):
ATT(g, t) = E[ (1' V*_{gt}(X)^{-1} / (1' V*_{gt}(X)^{-1} 1)) * Y_tilde_{g,t} ]
where:
- `G_g = 1{G = g}` = indicator for belonging to treatment cohort g
- `G_inf = 1{G = infinity}` = indicator for never-treated
- `pi_g = P(G = g)` = population share of cohort g
- `p_g(X) = E[G_g | X]` = generalized propensity score
- `m_{inf,t,t_pre}(X) = E[Y_t - Y_{t_pre} | G = infinity, X]` = conditional mean outcome change for the never-treated
- `V*_{gt}(X)` = `(g-1) x (g-1)` conditional covariance matrix with (j,k)-th element (Equation 3.4): `(1/p_g(X)) Cov(Y_t - Y_j, Y_t - Y_k | G=g, X) + (1/(1-p_g(X))) Cov(Y_t - Y_j, Y_t - Y_k | G=inf, X)`
Estimator equation -- staggered adoption (Equations 3.9, 3.13, 4.3, 4.4):
Generated outcome for each (g', t_pre) pair (Equation 3.9 / sample analog 4.4):
Y_hat^{att(g,t)}_{g',t_pre} = (G_g / pi_hat_g) * (Y_t - Y_1 - m_hat_{inf,t,t_pre}(X) - m_hat_{g',t_pre,1}(X))
- r_hat_{g,inf}(X) * (G_inf / pi_hat_g) * (Y_t - Y_{t_pre} - m_hat_{inf,t,t_pre}(X))
- r_hat_{g,g'}(X) * (G_{g'} / pi_hat_g) * (Y_{t_pre} - Y_1 - m_hat_{g',t_pre,1}(X))
where:
- `r_hat_{g,g'}(X) = p_g(X)/p_{g'}(X)` = estimated propensity score ratio
- `m_hat_{g',t,t_pre}(X) = E[Y_t - Y_{t_pre} | G = g', X]` = estimated conditional mean outcome change
Efficient ATT for staggered adoption (Equation 4.3):
ATT_hat_stg(g,t) = E_n[ (1' Omega_hat*_{gt}(X)^{-1}) / (1' Omega_hat*_{gt}(X)^{-1} 1) * Y_hat^{att(g,t)}_stg ]
where Omega*_{gt}(X) is the conditional covariance matrix with (j,k)-th element (Equation 3.12):
(1/p_g(X)) Cov(Y_t - Y_1, Y_t - Y_1 | G=g, X)
+ (1/p_inf(X)) Cov(Y_t - Y_{t'_j}, Y_t - Y_{t'_k} | G=inf, X)
- 1{g=g'_j}/p_g(X) * Cov(Y_t - Y_1, Y_{t'_j} - Y_1 | G=g, X)
- 1{g=g'_k}/p_g(X) * Cov(Y_t - Y_1, Y_{t'_k} - Y_1 | G=g, X)
+ 1{g_j=g'_k}/p_{g'_j}(X) * Cov(Y_{t'_j} - Y_1, Y_{t'_k} - Y_1 | G=g'_j, X)
Event study aggregation (Equations 3.8, 3.14, 4.5):
ES_hat(e) = sum_{g in G_{trt,e}} (pi_hat_g / sum_{g' in G_{trt,e}} pi_hat_{g'}) * ATT_hat_stg(g, g+e)
where G_{trt,e} = {g in G_trt : g + e <= T} and weights are cohort relative size weights.
Overall average event-study parameter (Equation 2.3):
ES_avg = (1/N_E) * sum_{e in E} ES(e)
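The cohort-share event-study aggregation above can be sketched as (hypothetical ATT(g,t) values and cohort shares):

```python
import numpy as np

def event_study(att_stg, pi_hat, e, T):
    # ES(e) = sum over cohorts g with g + e <= T of
    #         pi_g / (sum of pi over those cohorts) * ATT(g, g + e)
    cohorts = [g for g in att_stg if g + e <= T]
    total = sum(pi_hat[g] for g in cohorts)
    return sum(pi_hat[g] / total * att_stg[g][g + e] for g in cohorts)

# Hypothetical cohort ATT(g, t) values and cohort shares pi_g
att_stg = {2: {2: 1.0, 3: 1.2, 4: 1.4}, 3: {3: 0.8, 4: 1.0}}
pi_hat = {2: 0.3, 3: 0.2}

es0 = event_study(att_stg, pi_hat, e=0, T=4)   # 0.6*1.0 + 0.4*0.8 = 0.92
```

At longer horizons the cohort set shrinks (only cohorts observed at g + e contribute), which is why the weights renormalize within `G_{trt,e}`.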
With covariates / doubly robust:
The estimator is doubly robust by construction. Consistency requires correct specification of either:
- Outcome regression: `m_{g',t,t_pre}(X) = E[Y_t - Y_{t_pre} | G = g', X]`, OR
- Propensity score ratio: `r_{g,g'}(X) = p_g(X)/p_{g'}(X)`
The Neyman orthogonality property (Remark 4.2) permits modern ML estimators (random forests, lasso, ridge, neural nets, boosted trees) for nuisance parameters without loss of efficiency.
Without covariates (Section 4.1):
Estimator simplifies to closed-form expressions using only within-group sample means and sample covariances. No tuning parameters are needed. The covariance matrix Omega*_gt uses unconditional within-group covariances with pi_g replacing p_g(X).
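The closed-form efficient combination can be sketched as a GLS-weighted average with weights `w = Omega^{-1} 1 / (1' Omega^{-1} 1)` (hypothetical inputs; a sketch under the no-covariate simplification, not the paper's full estimator):

```python
import numpy as np

# Hypothetical generated-outcome means (one per baseline choice) and their
# estimated covariance matrix Omega*.
y_tilde_means = np.array([1.00, 1.10, 0.95])
Omega = np.array([[0.050, 0.020, 0.010],
                  [0.020, 0.080, 0.015],
                  [0.010, 0.015, 0.060]])

ones = np.ones(3)
Oinv1 = np.linalg.solve(Omega, ones)
w = Oinv1 / (ones @ Oinv1)         # weights sum to 1, minimize variance
att_eff = w @ y_tilde_means
var_eff = 1.0 / (ones @ Oinv1)     # variance of the efficient combination
```

Because each single-baseline estimator corresponds to a feasible weight vector (a unit vector), the efficient combination's variance is never larger than any individual diagonal entry of Omega.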
Standard errors (Theorem 4.1, Section 4):
- Default: Analytical SE computed as the square root of the summed squared EIF values divided by n: `SE_analytical = sqrt( (1/n^2) * sum_{i=1}^{n} EIF_hat_i^2 )`
- Alternative: Cluster-robust SE at the cross-sectional unit level (used in the empirical application, pages 34-35)
- Bootstrap: Nonparametric clustered bootstrap (resampling clusters with replacement); 300 replications recommended (page 23, footnote 16)
- Small-sample recommendation (Section 5.1): use cluster bootstrap SEs rather than analytical SEs when n is small (n <= 50). Analytical SEs are anticonservative at n=50 (coverage ~0.80) but perform well at n >= 200 (coverage ~0.94)
- Simultaneous confidence bands: multiplier bootstrap procedure for multiple `(g,t)` pairs (footnote 13, referencing Callaway and Sant'Anna 2021, Theorems 2-3, Algorithm 1)
- Implementation note: Phase 1 uses a multiplier bootstrap on EIF values (Rademacher/Mammen/Webb weights) rather than the nonparametric clustered bootstrap. This is asymptotically equivalent and computationally cheaper, consistent with the CallawaySantAnna implementation pattern. A clustered resampling bootstrap may be added in a future version
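As a sketch of the default analytical SE and its clustered counterpart applied to a vector of estimated EIF values (hypothetical helper names; the library's internals may differ):

```python
import numpy as np

def analytical_se(eif):
    """SE_analytical = sqrt( (1/n^2) * sum_i EIF_hat_i^2 )."""
    n = len(eif)
    return float(np.sqrt(np.sum(eif ** 2)) / n)

def cluster_se(eif, clusters):
    """Liang-Zeger style: sum the EIF within clusters, center the sums,
    and apply a G/(G-1) small-sample correction."""
    sums = np.array([eif[clusters == c].sum() for c in np.unique(clusters)])
    G, n = len(sums), len(eif)
    centered = sums - sums.mean()
    return float(np.sqrt(G / (G - 1) * np.sum(centered ** 2)) / n)

rng = np.random.default_rng(0)
eif = rng.normal(size=200)          # stand-in for estimated EIF values
cl = np.repeat(np.arange(20), 10)   # 20 clusters of 10 units each
se_iid, se_cl = analytical_se(eif), cluster_se(eif, cl)
```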
Efficient influence function for ATT(g,t) (Theorem 3.2):
EIF^{att(g,t)}_stg = (1' Omega*_{gt}(X)^{-1}) / (1' Omega*_{gt}(X)^{-1} 1) * IF^{att(g,t)}_stg
Efficient influence function for ES(e) (following Theorem 3.2, page 17):
EIF^{es(e)}_stg = sum_{g in G_{trt,e}} ( q_{g,e} * EIF^{att(g,g+e)}_stg
+ ATT(g,g+e) / (sum_{g' in G_{trt,e}} pi_{g'}) * (G_g - pi_g)
- q_{g,e} * sum_{s in G_{trt,e}} (G_s - pi_s) )
where q_{g,e} = pi_g / sum_{g' in G_{trt,e}} pi_{g'}.
Edge cases:
- Single pre-treatment period (g=2): `V*_{gt}(X)` is 1x1, efficient weights are trivially 1, and the estimator collapses to standard DiD with a single baseline
- Rank deficiency in `V*_{gt}(X)` or `Omega*_{gt}(X)`: the inverse does not exist if outcome changes are linearly dependent conditional on covariates. Detect via the matrix condition number; fall back to the pseudoinverse or the standard estimator
- Near-zero propensity scores: the ratio `p_g(X)/p_{g'}(X)` explodes. The overlap assumption (O) rules this out in the population; implement trimming or warn on finite-sample instability
- Note: When no sieve degree K succeeds for ratio estimation (basis dimension exceeds comparison group size, or all linear systems are singular), the estimator falls back to a constant ratio of 1 for all units with a UserWarning. The outcome regression adjustment remains active, so the generated outcomes (Eq 4.4) still incorporate covariate information via the m_hat terms. The DR property ensures consistency as long as the outcome regression is correctly specified.
- Note: When no sieve degree K succeeds for inverse propensity estimation (algorithm step 4), the estimator falls back to unconditional n/n_group scaling with a UserWarning, which reduces to the unconditional Omega* approximation for the affected group.
- All units eventually treated: the last cohort serves as "never-treated" by dropping time periods from the last cohort's treatment onset onward. Use `control_group="last_cohort"` to enable; the default `"never_treated"` raises ValueError if no never-treated units exist
- Negative weights: explicitly stated as harmless for bias and beneficial for precision; they arise from efficiency optimization under overidentification (Section 5.2)
- PT-Post regime (just-identified): Under PT-Post, EDiD automatically reduces to standard single-baseline estimator (Corollary 3.2). No downside to using EDiD -- it subsumes standard estimators
- Duplicate rows: duplicate `(unit, time)` entries are rejected with `ValueError`. The estimator requires exactly one observation per unit-period
- Note: The PT-All index set includes g'=∞ (never-treated) as a candidate comparison group and excludes period_1 for all g'. When g'=∞, the second and third Eq 3.9 terms telescope so all (∞, t_pre) moments produce the same 2x2 DiD value; these redundant moments are handled by Omega*'s pseudoinverse. When t_pre = period_1, the third term degenerates to E[Y_1 - Y_1 | G=g'] = 0 for any g', adding no information. Valid pairs require only t_pre < g' (pre-treatment for the comparison group), not t_pre < g. Same-group pairs (g'=g) are valid and contribute overidentifying moments (Equation 3.9).
- Note: Bootstrap aggregation uses fixed cohort-size weights for overall/event-study reaggregation, matching the CallawaySantAnna bootstrap pattern (`staggered_bootstrap.py:281` computes `bootstrap_overall = bootstrap_atts_gt[:, post_indices] @ weights`; line 297 uses the same fixed-weight pattern for the event study). The analytical path includes a WIF correction; the fixed-weight bootstrap captures the same sampling variability through per-cell EIF perturbation without re-estimating aggregation weights, consistent with both the library's CS implementation and the R `did` package.
- Overall ATT convention: the library's `overall_att` uses cohort-size-weighted averaging of post-treatment (g,t) cells, matching the CallawaySantAnna simple aggregation. This differs from the paper's ES_avg (Eq 2.3), which averages uniformly over event-time horizons. ES_avg can be computed from event study output as `mean(event_study_effects[e]["effect"] for e >= 0)`.
Algorithm (two-step semiparametric estimation, Section 4):
Step 1: Estimate nuisance parameters
1. Estimate outcome regressions `m_hat_{g',t,t_pre}(X)` using sieve regression, kernel smoothing, or ML methods (for each valid `(g', t_pre)` pair)
2. Estimate propensity score ratios `r_hat_{g,g'}(X) = p_g(X)/p_{g'}(X)` via convex minimization (Equation 4.1): `r_{g,g'}(X) = arg min_{r} E[ r(X)^2 * G_{g'} - 2*r(X)*G_g ]`; sieve estimator (Equation 4.2): `beta_hat_K = arg min_{beta_K} E_n[ G_{g'} * (psi^K(X)' beta_K)^2 - 2*G_g * (psi^K(X)' beta_K) ]`
3. Select the sieve index K via information criterion: `K_hat = arg min_K { 2*loss(K) + C_n * K / n }` where `C_n = 2` (AIC) or `C_n = log(n)` (BIC)
4. Estimate `s_hat_{g'}(X) = 1/p_{g'}(X)` via analogous convex minimization
5. Estimate the conditional covariance `Omega_hat*_{gt}(X)` using kernel smoothing with bandwidth h
Step 2: Construct efficient estimator
6. Compute generated outcomes Y_hat^{att(g,t)}_{g',t_pre} for each valid (g', t_pre) pair using Equation 4.4
7. Compute efficient weights w(X) = 1' Omega_hat*_{gt}(X)^{-1} / (1' Omega_hat*_{gt}(X)^{-1} 1)
8. Compute ATT_hat_stg(g,t) = E_n[ w(X_i) * Y_hat^{att(g,t)}_stg ] (Equation 4.3)
9. Aggregate to event-study: ES_hat(e) = sum_g (pi_hat_g / sum pi_hat) * ATT_hat_stg(g, g+e) (Equation 4.5)
10. Compute SE from sample variance of estimated EIF values
Without covariates: Steps 1-5 simplify to within-group sample means and sample covariances. No nuisance estimation or tuning needed.
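The efficient-weights step that drives both paths (steps 7-8 above) can be illustrated on a toy overidentified cell; hypothetical names, with a diagonal Omega for clarity, in which case the result is just the precision-weighted mean of the moments:

```python
import numpy as np

def efficient_combination(theta, omega):
    """Combine J overidentified DiD moment estimates theta for one (g, t) cell
    using w = 1' Omega^{-1} / (1' Omega^{-1} 1); pinv guards rank deficiency."""
    ones = np.ones(len(theta))
    oi = np.linalg.pinv(omega)
    w = ones @ oi / (ones @ oi @ ones)
    return float(w @ theta), w

theta = np.array([1.0, 1.2, 0.9])      # three 2x2 DiD moments for one (g, t)
omega = np.diag([0.04, 0.09, 0.01])    # their (here diagonal) covariance
att, w = efficient_combination(theta, omega)
```

With a diagonal covariance the weights are proportional to the inverse variances, so the most precise moment (variance 0.01) dominates.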
Reference implementation(s):
- No specific software package named in the paper for the EDiD estimator
- Estimators compared against: Callaway-Sant'Anna (`did` R package), de Chaisemartin-D'Haultfoeuille (`DIDmultiplegt` R package / `did_multiplegt` Stata), and the Borusyak-Jaravel-Spiess / Gardner / Wooldridge imputation estimators
- Empirical replication: HRS data from Dobkin et al. (2018), following the Sun and Abraham (2021) sample selection
Requirements checklist:
- Implements two-step semiparametric estimator (Equation 4.3)
- Supports both PT-Post (just-identified) and PT-All (overidentified) regimes
- Computes efficient weights from conditional covariance matrix inverse
- Doubly robust: consistent if either outcome regression or propensity score ratio is correct
- No-covariates case uses closed-form sample means/covariances (no tuning)
- With covariates: sieve-based propensity ratio estimation with AIC/BIC selection
- Kernel-smoothed conditional covariance estimation
- Analytical SE from EIF sample variance
- Cluster-robust SE option (analytical from EIF + cluster-level multiplier bootstrap)
- Event-study aggregation ES(e) with cohort-size weights
- Hausman-type pre-test for PT-All vs PT-Post (Theorem A.1)
- Each ATT(g,t) can be estimated independently (parallelizable)
- Absorbing treatment validation
- Overlap diagnostics for propensity score ratios
- Survey design support (Phase 3): survey-weighted means/covariances in Omega*, TSL on EIF scores; bootstrap+survey supported (Phase 6)
- Note: Sieve ratio estimation uses polynomial basis functions (total degree up to K) with AIC/BIC model selection. The paper describes sieve estimators generally without specifying a particular basis family; polynomial sieves are a standard choice (Section 4, Eq 4.2). Negative sieve ratio predictions are clipped to a small positive value since the population ratio p_g(X)/p_{g'}(X) is non-negative.
- Note: Kernel-smoothed conditional covariance Omega*(X) uses Gaussian kernel with Silverman's rule-of-thumb bandwidth by default. The paper specifies kernel smoothing (step 5, Section 4) without mandating a particular kernel or bandwidth selection method.
- Note: Conditional covariance Omega*(X) scales each term by per-unit sieve-estimated inverse propensities s_hat_{g'}(X) = 1/p_{g'}(X) (algorithm step 4), matching Eq 3.12. The inverse propensity estimation uses the same polynomial sieve convex minimization as the ratio estimator. Estimated s_hat values are clipped to [1, n] with a UserWarning when clipping binds, mirroring the ratio path's overlap diagnostics.
- Note: Outcome regressions m_hat_{g',t,tpre}(X) use linear OLS working models. The paper's Section 4 describes flexible nonparametric nuisance estimation (sieve regression, kernel smoothing, or ML methods). The DR property ensures consistency if either the OLS outcome model or the sieve propensity ratio is correctly specified, but the linear OLS specification does not generically guarantee attainment of the semiparametric efficiency bound unless the conditional mean is linear in the covariates.
- Note: EfficientDiD bootstrap with survey weights supported (Phase 6) via PSU-level multiplier weights
- Note: EfficientDiD covariates (DR path) with survey weights deferred — the doubly robust nuisance estimation does not yet thread survey weights through sieve/kernel steps
- Note: Cluster-robust SEs use the standard Liang-Zeger clustered sandwich estimator applied to EIF values: aggregate the EIF within clusters, center, and compute the variance with a G/(G-1) small-sample correction. The cluster bootstrap generates multiplier weights at the cluster level (all units in a cluster share the same weight). Analytical clustered SEs are the default when `cluster` is set; the cluster bootstrap is opt-in via `n_bootstrap > 0`.
- Note: The Hausman pretest operates on the post-treatment event-study vector ES(e) per Theorem A.1. Both PT-All and PT-Post fits are aggregated to ES(e) using cohort-size weights before computing the test statistic H = delta' V^{-1} delta, where delta = ES_post - ES_all and V = Cov(ES_post) - Cov(ES_all). The covariance is computed from aggregated ES(e)-level EIF values. The variance-difference matrix V is inverted via the Moore-Penrose pseudoinverse to handle finite-sample non-positive-definiteness. The effective rank of V (number of positive eigenvalues) is used as the degrees of freedom.
- Note: Last-cohort-as-control (`control_group="last_cohort"`) reclassifies the latest treatment cohort as pseudo-never-treated and drops time periods at/after that cohort's treatment start. This is distinct from CallawaySantAnna's `not_yet_treated` option, which dynamically selects not-yet-treated units per (g,t) pair.
Key implementation requirements:
Assumption checks / warnings:
- Requires never-treated units as control group
- Warns if treatment effects may be heterogeneous across cohorts (which the method handles)
- Reference period: `e = -1 - anticipation` (defaults to e = -1 when anticipation = 0)
Estimator equation (as implemented):
Saturated regression with cohort-specific effects:
Y_it = α_i + γ_t + Σ_{g∈G} Σ_{e≠-1} δ_{g,e} × 1(G_i=g) × D^e_{it} + ε_it
where G_i is unit i's cohort (first treatment period), D^e_{it} = 1(t - G_i = e).
Interaction-weighted estimator:
δ̂_e = Σ_g ŵ_{g,e} × δ̂_{g,e}
where weights ŵ_{g,e} = n_{g,e} / Σ_g n_{g,e} (sample share of cohort g at event-time e).
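A minimal sketch of the interaction-weighted aggregation (hypothetical names; point estimates only, with SEs for the aggregates obtained via the delta method as described below):

```python
import numpy as np

def iw_aggregate(delta_ge, n_ge):
    """delta_hat_e = sum_g w_{g,e} * delta_hat_{g,e},
    with w_{g,e} = n_{g,e} / sum_g n_{g,e} (cohort sample shares)."""
    out = {}
    for e in sorted({e for (_, e) in delta_ge}):
        gs = [g for (g, ee) in delta_ge if ee == e]
        w = np.array([n_ge[(g, e)] for g in gs], dtype=float)
        w /= w.sum()  # weights sum to 1 within each event time
        out[e] = float(sum(wi * delta_ge[(g, e)] for wi, g in zip(w, gs)))
    return out

# Hypothetical cohort-specific estimates at event times 0 and 1
delta = {(2, 0): 1.0, (3, 0): 2.0, (2, 1): 1.5, (3, 1): 2.5}
n = {(2, 0): 30, (3, 0): 10, (2, 1): 30, (3, 1): 10}
agg = iw_aggregate(delta, n)   # e=0 -> 0.75*1.0 + 0.25*2.0 = 1.25
```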
Standard errors:
- Default: Cluster-robust at unit level
- Delta method for aggregated coefficients
- Optional: Pairs bootstrap for robustness
Edge cases:
- Single cohort: reduces to standard event study
- Cohorts with no observations at some event-times: weighted appropriately
- Extrapolation beyond observed event-times: not estimated
- Event-time range: no artificial cap (estimates all available relative times, matching R's `fixest::sunab()`)
- No post-treatment effects: returns `(NaN, NaN)` for overall ATT/SE; all inference fields (t_stat, p_value, conf_int) propagate NaN via `np.isfinite()` guards
- `min_pre_periods`/`min_post_periods` parameters: removed (previously deprecated with `FutureWarning`; callers passing these now get `TypeError`)
- Variance fallback: when the full weight vector cannot be constructed for the overall ATT SE, uses a simplified variance (ignoring covariances between periods) with a `UserWarning`
- Rank-deficient design matrix (covariate collinearity):
  - Detection: pivoted QR decomposition with tolerance `1e-07` (R's `qr()` default)
  - Handling: warns and drops linearly dependent columns, sets NA for dropped coefficients (R-style, matches `lm()`)
  - Parameter: `rank_deficient_action` controls behavior: "warn" (default), "error", or "silent"
- NaN inference for undefined statistics:
- t_stat: Uses NaN (not 0.0) when SE is non-finite or zero
- Analytical inference: p_value and CI are also NaN when t_stat is NaN (NaN propagates through `compute_p_value` and `compute_confidence_interval`)
- Bootstrap inference: p_value and CI are computed from the bootstrap distribution. SE, CI, and p-value are all NaN if the original point estimate is non-finite, SE is non-finite or zero, or <50% of bootstrap samples are valid
- Applies to overall ATT, per-effect event study, and aggregated event study
- Note: Defensive enhancement matching CallawaySantAnna behavior; R's `fixest::sunab()` may produce Inf/NaN without warning
- Inference distribution:
  - Cohort-level p-values: t-distribution (via `LinearRegression.get_inference()`)
  - Aggregated event study and overall ATT p-values: normal distribution (via `compute_p_value()`)
  - This is asymptotically equivalent and standard for delta-method-aggregated quantities
  - Deviation from R: R's fixest uses the t-distribution at all levels; aggregated p-values may differ slightly in small samples
Reference implementation(s):
- R: `fixest::sunab()` (Laurent Bergé's implementation)
- Stata: `eventstudyinteract`
Requirements checklist:
- Never-treated units required as controls
- Interaction weights sum to 1 within each relative time period
- Reference period defaults to e=-1, coefficient normalized to zero
- Cohort-specific effects recoverable from results
- Cluster-robust SEs with delta method for aggregates
- R comparison: ATT matches within machine precision (<1e-11)
- R comparison: SE matches within 0.3% (well within 1% threshold)
- R comparison: Event study effects match perfectly (correlation 1.0)
- Survey design support (Phase 3): weighted within-transform, survey weights in LinearRegression with TSL vcov; bootstrap+survey supported (Phase 6) via Rao-Wu rescaled bootstrap
Key implementation requirements:
Assumption checks / warnings:
- Parallel trends (Assumption 1): `E[Y_it(0)] = alpha_i + beta_t` for all observations. The general form allows `E[Y_it(0)] = alpha_i + beta_t + X'_it * delta` with time-varying covariates.
- No-anticipation effects (Assumption 2): `Y_it = Y_it(0)` for all untreated observations. Adjustable via the `anticipation` parameter.
- Treatment must be absorbing: `D_it` switches from 0 to 1 and stays at 1.
D_itswitches from 0 to 1 and stays at 1. - Covariate space of treated observations must be spanned by untreated observations (rank condition). For unit/period FE case: every treated unit must have ≥1 untreated period; every post-treatment period must have ≥1 untreated unit.
- Without never-treated units, long-run effects at horizons `K_it >= H_bar` (where `H_bar = max(first_treat) - min(first_treat)`) are not identified (Proposition 5). Set to NaN with a warning.
Estimator equation (Theorem 2, as implemented):
Step 1. Estimate counterfactual model on untreated observations only (it in Omega_0):
Y_it = alpha_i + beta_t [+ X'_it * delta] + epsilon_it
Step 2. For each treated observation (it in Omega_1), impute:
Y_hat_it(0) = alpha_hat_i + beta_hat_t [+ X'_it * delta_hat]
tau_hat_it = Y_it - Y_hat_it(0)
Step 3. Aggregate:
tau_hat_w = sum_{it in Omega_1} w_it * tau_hat_it
where:
- `Omega_0 = {it : D_it = 0}` — all untreated observations (never-treated + not-yet-treated)
- `Omega_1 = {it : D_it = 1}` — all treated observations
- `w_it` = pre-specified weights (overall ATT: `w_it = 1/N_1`)
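The three steps can be illustrated on a toy panel (a sketch with hypothetical names, using explicit two-way dummies rather than the library's iterative demeaning; `lstsq`'s minimum-norm solution handles the usual dummy collinearity):

```python
import numpy as np
import pandas as pd

def imputation_att(df):
    """Step 1: fit alpha_i + beta_t by OLS on untreated obs (two-way dummies).
    Step 2: impute Y_hat(0) for treated obs. Step 3: average tau_hat (w = 1/N_1)."""
    untreated = df[df["D"] == 0]
    X = pd.get_dummies(untreated[["unit", "time"]].astype(str), dtype=float)
    coef, *_ = np.linalg.lstsq(X.to_numpy(), untreated["Y"].to_numpy(), rcond=None)
    treated = df[df["D"] == 1]
    Xt = pd.get_dummies(treated[["unit", "time"]].astype(str), dtype=float)
    Xt = Xt.reindex(columns=X.columns, fill_value=0.0)  # align dummy columns
    tau = treated["Y"].to_numpy() - Xt.to_numpy() @ coef  # Step 2 imputation
    return float(tau.mean())                              # Step 3 aggregation

# Toy panel: unit 1 treated in period 2 (true effect 2), unit 2 never treated
df = pd.DataFrame({"unit": [1, 1, 2, 2], "time": [1, 2, 1, 2],
                   "D": [0, 1, 0, 0], "Y": [1.0, 4.0, 0.0, 1.0]})
att = imputation_att(df)
```

The rank condition from the assumption checks is what makes the reindexed dummy columns well-defined here: every treated unit and every post-treatment period appears among the untreated observations.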
Common estimation targets (weighting schemes):
- Overall ATT: `w_it = 1/N_1` for all `it in Omega_1`
- Horizon-specific: `w_it = 1[K_it = h] / |Omega_{1,h}|` for `K_it = t - E_i`
- Group-specific: `w_it = 1[G_i = g] / |Omega_{1,g}|`
Standard errors (Theorem 3, Equation 7):
Conservative clustered variance estimator:
sigma_hat^2_w = sum_i ( sum_{t: it in Omega} v_it * epsilon_tilde_it )^2
Observation weights v_it:
- For treated `(i,t) in Omega_1`: `v_it = w_it` (the aggregation weight)
- For untreated `(i,t) in Omega_0` (FE-only case): `v_it = -(w_i./n_{0,i} + w_.t/n_{0,t} - w../N_0)` where `w_i.` = sum of w over treated obs of unit i, `n_{0,i}` = untreated periods for unit i, etc.
- For untreated with covariates: `v_untreated = -A_0 (A_0' A_0)^{-1} A_1' w_treated` where `A_0`, `A_1` are design matrices for untreated/treated observations.
Note on v_it derivation: The paper's Supplementary Proposition A3 provides the explicit formula for v_it^*, but was not in the extraction range for the paper review. The FE-only closed form above is reconstructed from Theorem 3's general form — it follows from the chain rule of the imputation estimator's dependence on the Step 1 OLS estimates. The covariate case uses the OLS projection matrix directly.
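Given the v_it weights and auxiliary residuals, the conservative variance itself is a one-liner; a numpy sketch (hypothetical names; `v` stacks the v_it values and `eps` the auxiliary-model residuals over all (i,t) rows):

```python
import numpy as np

def conservative_se(units, v, eps):
    """sigma_hat_w = sqrt( sum_i ( sum_t v_it * eps_tilde_it )^2 ),
    where the inner sum runs over each unit's observations (Eq. 7)."""
    prod = v * eps
    sums = np.array([prod[units == u].sum() for u in np.unique(units)])
    return float(np.sqrt(np.sum(sums ** 2)))

units = np.array([1, 1, 2, 2])          # unit ids for the stacked (i,t) rows
v = np.array([0.5, -0.5, 0.5, -0.5])    # v_it weights (treated and untreated)
eps = np.array([1.0, 0.0, -1.0, 2.0])   # auxiliary-model residuals
se = conservative_se(units, v, eps)     # sqrt(0.5**2 + (-1.5)**2)
```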
Auxiliary model residuals (Equation 8):
- Partition `Omega_1` into groups `G_g` (default: cohort × horizon)
- Compute `tau_tilde_g` for each group (weighted average within the group)
- `epsilon_tilde_it = Y_it - alpha_hat_i - beta_hat_t [- X'delta_hat] - tau_tilde_g` (treated)
- `epsilon_tilde_it = Y_it - alpha_hat_i - beta_hat_t [- X'delta_hat]` (untreated, i.e., Step 1 residuals)
The `aux_partition` parameter controls the partition: "cohort_horizon" (default, tightest SEs), "cohort" (coarser, more conservative), "horizon" (groups by relative time only).
Pre-trend test (Test 1, Equation 9):
Y_it = alpha_i + beta_t [+ X'_it * delta] + W'_it * gamma + epsilon_it
- Estimate on untreated observations only
- Test `gamma = 0` via a cluster-robust Wald F-test
- Independent of treatment effect estimation (Proposition 9)
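A simplified sketch of Test 1 (hypothetical names; homoskedastic Wald for brevity, whereas the implementation uses a cluster-robust Wald F-test):

```python
import numpy as np

def wald_pretrend(X, W, y):
    """Regress y on [X, W] (untreated obs only) and Wald-test gamma = 0
    on the W block of lead indicators."""
    Z = np.column_stack([X, W])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    sigma2 = resid @ resid / (len(y) - Z.shape[1])  # homoskedastic variance
    V = sigma2 * np.linalg.inv(Z.T @ Z)
    k = X.shape[1]
    gamma, Vg = coef[k:], V[k:, k:]
    return float(gamma @ np.linalg.solve(Vg, gamma))  # ~ chi2(q) under H0

rng = np.random.default_rng(1)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=n)])  # FE stand-ins
W = rng.normal(size=(n, 2))                            # lead indicators W_it
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)      # gamma = 0 holds
stat = wald_pretrend(X, W, y)
```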
Edge cases:
- Unbalanced panels: FE estimated via iterative alternating projection (Gauss-Seidel), equivalent to OLS with unit+time dummies. Converges in O(max_iter) passes; typically 5-20 iterations for unbalanced panels, 1-2 for balanced. One-pass demeaning is only exact for balanced panels.
- No never-treated units (Proposition 5): long-run effects at horizons `h >= H_bar` are not identified. Set to NaN with a warning listing affected horizons.
- Rank condition failure: every treated unit must have ≥1 untreated period; every post-treatment period must have ≥1 untreated unit. Behavior controlled by `rank_deficient_action`: "warn" (default), "error", or "silent". Missing FE produce NaN treatment effects for affected observations.
- Always-treated units: units with `first_treat` at or before the earliest time period have no untreated observations. A warning is emitted; these units are excluded from Step 1 OLS, but their treated observations contribute to aggregation if imputation is possible.
- NaN propagation: if all `tau_hat` values for a given horizon or group are NaN, the aggregated effect and all inference fields (SE, t-stat, p-value, CI) are set to NaN. NaN in the v*eps product (from missing FE) is zeroed for variance computation (matching R's did_imputation, which drops unimputable obs).
- NaN inference for undefined statistics: t_stat uses NaN when SE is non-finite or zero; p_value and CI are also NaN. Matches the CallawaySantAnna NaN convention.
- Pre-trend test: Uses iterative demeaning (same as Step 1 FE) for exact within-transformation on unbalanced panels. One-pass demeaning is only exact for balanced panels.
- Overall ATT variance: weights zero out non-finite tau_hat and renormalize, matching the ATT estimand (which averages only finite tau_hat). `_compute_conservative_variance` returns 0.0 for all-zero weights, so the `n_valid == 0` guard is necessary to return a NaN SE.
- `balance_e` cohort filtering: when `balance_e` is set, cohort balance is checked against the full panel (pre + post treatment) via `_build_cohort_rel_times()`, requiring observations at every relative time in `[-balance_e, max_h]`. Both analytical aggregation and bootstrap inference use the same `_compute_balanced_cohort_mask` with pre-computed cohort horizons.
- Bootstrap clustering: the multiplier bootstrap generates weights at `cluster_var` granularity (defaults to `unit` if `cluster` is not specified). An invalid cluster column raises ValueError.
- Non-constant `first_treat` within a unit: emits a `UserWarning` identifying the count and an example unit. The estimator proceeds using the first observed value per unit (via `.first()` aggregation), but results may be unreliable.
- treatment_effects DataFrame weights: the `weight` column uses `1/n_valid` for finite tau_hat and 0 for NaN tau_hat, consistent with the (unweighted) ATT estimand, or normalized survey weights `sw_i/sum(sw)` when `survey_design` is active.
- Rank-deficient covariates in variance: covariates with NaN coefficients (dropped for rank deficiency in Step 1) are excluded from the variance design matrices `A_0`/`A_1`. Only covariates with finite coefficients participate in the `v_it` projection.
- Sparse variance solver: `_compute_v_untreated_with_covariates` uses `scipy.sparse.linalg.spsolve` to solve `(A_0'A_0) z = A_1'w` without densifying the normal-equations matrix. Falls back to dense `lstsq` if the sparse solver fails.
- Note: Survey weights enter ImputationDiD via weighted iterative FE (Step 1), survey-weighted ATT aggregation (Step 3), and survey-weighted conservative variance (Theorem 3). PSU is used as the cluster variable for the Theorem 3 variance. Strata enter the survey df (n_PSU - n_strata) for t-distribution inference. FPC is not supported (raises NotImplementedError). Strata do NOT enter the variance formula itself (no stratified sandwich) — this is conservative relative to a stratified variance. Bootstrap + survey supported (Phase 6) via PSU-level multiplier weights.
- Bootstrap inference: uses a multiplier bootstrap on the Theorem 3 influence function `psi_i = sum_t v_it * epsilon_tilde_it`. Cluster-level psi sums are pre-computed for each aggregation target (overall, per-horizon, per-group), then perturbed with multiplier weights (Rademacher by default; the `bootstrap_weights` parameter selects Mammen or Webb weights, matching CallawaySantAnna). This is a library extension (not in the paper) consistent with the CallawaySantAnna/SunAbraham bootstrap patterns.
- Auxiliary residuals (Equation 8): uses the v_it-weighted tau_tilde_g formula `tau_tilde_g = sum(v_it * tau_hat_it) / sum(v_it)` within each partition group. Zero-weight groups (common in event-study SE computation) fall back to the unweighted mean.
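The iterative alternating-projection (Gauss-Seidel) demeaning referenced in the edge cases can be sketched with pandas (hypothetical names):

```python
import numpy as np
import pandas as pd

def two_way_demean(df, col="Y", unit="unit", time="time", tol=1e-12, max_iter=200):
    """Alternating projections: repeatedly sweep out unit means, then time
    means, until the residual stops changing. One pass is exact only for
    balanced panels; unbalanced panels need a few iterations."""
    r = df[col].astype(float).copy()
    for _ in range(max_iter):
        r_old = r.copy()
        r = r - r.groupby(df[unit]).transform("mean")
        r = r - r.groupby(df[time]).transform("mean")
        if (r - r_old).abs().max() < tol:
            break
    return r

# Unbalanced toy panel: units observe different subsets of periods
df = pd.DataFrame({"unit": ["A", "A", "A", "B", "B", "C", "C"],
                   "time": [1, 2, 3, 1, 2, 2, 3],
                   "Y": [1.0, 2.0, 4.0, 0.5, 1.0, 3.0, 2.5]})
r = two_way_demean(df)
```

At convergence the residual is (numerically) orthogonal to both sets of dummies, i.e., all unit-level and time-level means of `r` are ~0.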
Reference implementation(s):
- Stata: `did_imputation` (Borusyak, Jaravel, Spiess; available from SSC)
- R: `didimputation` package (Kyle Butts)
Requirements checklist:
- Step 1: OLS on untreated observations only (never-treated + not-yet-treated)
- Step 2: Impute counterfactual `Y_hat_it(0)` for treated observations
- Step 3: Aggregate with researcher-chosen weights `w_it`
- Conservative clustered variance estimator (Theorem 3, Equation 7)
- Auxiliary model for treated residuals (Equation 8) with configurable partition (`aux_partition`)
- Supports unit FE, period FE, and time-varying covariates
- Refuses to estimate unidentified estimands (Proposition 5) — sets NaN with warning
- Pre-trend test uses only untreated observations (Test 1, Equation 9)
- Supports balanced and unbalanced panels (iterative Gauss-Seidel demeaning for exact FE)
- Event study and group aggregation
Primary source: Gardner, J. (2022). Two-stage differences in differences. arXiv:2207.05943.
Key implementation requirements:
Assumption checks / warnings:
- Parallel trends (same as ImputationDiD): `E[Y_it(0)] = alpha_i + beta_t` for all observations.
- No-anticipation effects: `Y_it = Y_it(0)` for all untreated observations.
- Treatment must be absorbing: `D_it` switches from 0 to 1 and stays at 1.
- Always-treated units (treated in all periods) are excluded with a warning, since they have no untreated observations for Stage 1 FE estimation.
Estimator equation (two-stage procedure, as implemented):
Stage 1. Estimate unit + time fixed effects on untreated observations only (it in Omega_0):
Y_it = alpha_i + beta_t + epsilon_it
Compute residuals: y_tilde_it = Y_it - alpha_hat_i - beta_hat_t (for ALL observations)
Stage 2. Regress residualized outcomes on treatment indicators (on treated observations):
y_tilde_it = tau * D_it + eta_it
(or event-study specification with horizon indicators)
Point estimates are identical to ImputationDiD (Borusyak et al. 2024). The two-stage procedure is algebraically equivalent to the imputation approach: both estimate unit+time FE on untreated observations and recover treatment effects from the difference between observed and counterfactual outcomes.
Variance: GMM sandwich (Newey & McFadden 1994 Theorem 6.1):
The variance accounts for first-stage estimation error propagating into Stage 2, following the GMM framework:
V(tau_hat) = (D'D)^{-1} * Bread * (D'D)^{-1}
Bread = sum_c ( sum_{i in c} psi_i )( sum_{i in c} psi_i )'
where psi_i is the stacked influence function for unit i across all its observations, combining the Stage 2 score and the Stage 1 correction term.
Note on Equation 6 discrepancy: The paper's Equation 6 uses a per-cluster inverse (D_c'D_c)^{-1} when forming the influence function contribution. The R did2s implementation and our code use the GLOBAL inverse (D'D)^{-1} following standard GMM theory (Newey & McFadden 1994). We follow the R implementation, which is consistent with standard GMM sandwich variance estimation.
No finite-sample adjustments: The variance estimator uses the raw asymptotic sandwich without degrees-of-freedom corrections (no HC1-style n/(n-k) adjustment). This matches the R did2s implementation.
Bootstrap:
Our implementation uses multiplier bootstrap on the GMM influence function: cluster-level psi sums are pre-computed, then perturbed with multiplier weights (Rademacher by default; configurable via bootstrap_weights parameter to use Mammen or Webb weights, matching CallawaySantAnna). The R did2s package defaults to block bootstrap (resampling clusters with replacement). Both approaches are asymptotically valid; the multiplier bootstrap is computationally cheaper and consistent with the CallawaySantAnna/ImputationDiD bootstrap patterns in this library.
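The multiplier-bootstrap pattern shared across these estimators can be sketched as follows (hypothetical names; Rademacher and Mammen weights shown, Webb omitted):

```python
import numpy as np

def multiplier_bootstrap_se(psi_cluster, n, n_boot=1999, weights="rademacher", seed=0):
    """Perturb pre-computed cluster-level influence-function sums with i.i.d.
    mean-zero, unit-variance multipliers; SE = SD of the perturbed estimates."""
    rng = np.random.default_rng(seed)
    G = len(psi_cluster)
    if weights == "rademacher":
        w = rng.choice([-1.0, 1.0], size=(n_boot, G))
    elif weights == "mammen":
        phi = (np.sqrt(5) + 1) / 2              # golden-ratio two-point law
        p = (np.sqrt(5) - 1) / (2 * np.sqrt(5)) # P(w = phi); E[w]=0, E[w^2]=1
        w = np.where(rng.random((n_boot, G)) < p, phi, 1 - phi)
    else:
        raise ValueError(f"unknown weights: {weights}")
    draws = (w * psi_cluster).sum(axis=1) / n   # perturbed estimate deviations
    return float(draws.std(ddof=1))

rng = np.random.default_rng(42)
psi = rng.normal(size=30)                 # cluster-level psi sums, G = 30
se_boot = multiplier_bootstrap_se(psi, n=300)
se_analytic = float(np.sqrt(np.sum(psi ** 2)) / 300)
```

Since the multipliers have unit variance, the bootstrap SD converges to the analytical EIF-based SE as the number of replications grows.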
Edge cases:
- Always-treated units: Units treated in all observed periods have no untreated observations for Stage 1 FE estimation. These are excluded with a warning listing the affected unit IDs. Their treated observations do NOT contribute to Stage 2.
- Rank condition violations: if the Stage 1 design matrix (unit+time dummies on untreated obs) is rank-deficient, or if certain unit/time FE are unidentified (e.g., a unit with no untreated periods after excluding always-treated), the affected FE produce NaN. Behavior controlled by `rank_deficient_action`: "warn" (default), "error", or "silent".
- NaN y_tilde handling: when Stage 1 FE are unidentified for some observations, the residualized outcome `y_tilde` is NaN. These observations are zeroed out (excluded) from the Stage 2 regression and variance computation, matching the treatment of unimputable observations in ImputationDiD.
- NaN inference for undefined statistics: t_stat uses NaN when SE is non-finite or zero; p_value and CI are also NaN. Matches the CallawaySantAnna/ImputationDiD NaN convention.
- Event study aggregation: Horizon-specific effects use the same two-stage procedure with horizon indicator dummies in Stage 2. Unidentified horizons (e.g., long-run effects without never-treated units, per Proposition 5 of Borusyak et al. 2024) produce NaN.
- `balance_e` with no qualifying cohorts: if no cohorts have sufficient pre/post coverage for the requested `balance_e`, a warning is emitted and event study results contain only the reference period.
- No never-treated units (Proposition 5): when there are no never-treated units and multiple treatment cohorts, horizons `h >= h_bar` (where `h_bar = max(groups) - min(groups)`) are unidentified per Proposition 5 of Borusyak et al. (2024). These produce NaN inference with n_obs > 0 (treated observations exist but the counterfactual is unidentified) and a warning listing affected horizons. Matches ImputationDiD behavior. Proposition 5 applies to event study horizons only, not cohort aggregation — a cohort whose treated obs all fall at Prop 5 horizons naturally gets n_obs=0 in group effects because all its `y_tilde` values are NaN.
- Zero-observation horizons after filtering: when `balance_e` or NaN `y_tilde` filtering results in zero observations for some non-Prop-5 event study horizons, those horizons produce NaN for all inference fields (effect, SE, t-stat, p-value, CI) with n_obs=0.
- Zero-observation cohorts in group effects: if all treated observations for a cohort have NaN `y_tilde` (excluded from estimation), that cohort's group effect is NaN with n_obs=0.
- Note: Survey weights enter the TwoStageDiD GMM sandwich via weighted cross-products: the bread uses `(X_2' W X_2)^{-1}`, gamma_hat uses `(X_{10}' W X_{10})^{-1} (X_1' W X_2)`, and per-cluster scores are multiplied by survey weights. PSU is used as the cluster variable for the GMM variance. Strata enter the survey df (n_PSU - n_strata) for t-distribution inference. FPC is not supported (raises NotImplementedError). Strata do NOT enter the variance formula itself (no stratified sandwich) — this is conservative. Bootstrap + survey supported (Phase 6) via PSU-level multiplier weights.
Reference implementation(s):
- R: `did2s::did2s()` (Kyle Butts & John Gardner)
Requirements checklist:
- Stage 1: OLS on untreated observations only for unit+time FE
- Stage 2: Regress residualized outcomes on treatment indicators
- Point estimates match ImputationDiD
- GMM sandwich variance (Newey & McFadden 1994 Theorem 6.1)
- Global `(D'D)^{-1}` in variance (matches R `did2s`, not paper Eq. 6)
- No finite-sample adjustment (raw asymptotic sandwich)
- Always-treated units excluded with warning
- Multiplier bootstrap on GMM influence function
- Event study and overall ATT aggregation
Primary source: Wing, C., Freedman, S. M., & Hollingsworth, A. (2024). Stacked Difference-in-Differences. NBER Working Paper 32054. http://www.nber.org/papers/w32054
Key implementation requirements:
Assumption checks / warnings:
- Assumption 1 (No Anticipation): ATT(a, a+e) = 0 for all e < 0
- Assumption 2 (Common Trends): E[Y_{s,a+e}(0) - Y_{s,a-1}(0) | A_s = a] = E[Y_{s,a+e}(0) - Y_{s,a-1}(0) | A_s > a + e]
- Clean controls must exist for each sub-experiment (IC2)
- Event window must fit within observed data range (IC1)
Target parameter (Equation 2):
theta_kappa^e = sum_{a in Omega_kappa} ATT(a, a+e) * (N_a^D / N_Omega_kappa^D)
where:
- `theta_kappa^e` = trimmed aggregate ATT at event time e
- `Omega_kappa` = trimmed set of adoption events satisfying IC1 and IC2
- `N_a^D` = number of treated units in sub-experiment a
- `N_Omega_kappa^D` = total treated units across all sub-experiments in the trimmed set
Estimator equation (Equation 3 — weighted saturated event study, recommended):
Y_sae = alpha_0 + alpha_1 * D_sa + sum_{h != -1} [lambda_h * 1(e=h) + delta_h * D_sa * 1(e=h)] + U_sae
Estimated via WLS with Q-weights. The delta_h coefficients identify theta_kappa^e.
Q-weights (Section 5.3, Table 1):
Q_sa = 1 if D_sa = 1 (treated)
Q_sa = (N_a^D / N^D) / (N_a^C / N^C) if D_sa = 0 (control, aggregate weighting)
Q_sa = (Pop_a^D / Pop^D) / (N_a^C / N^C) if D_sa = 0 (control, population weighting)
Q_sa = ((N_a + N_a^C)/(N^D+N^C)) / (N_a^C/N^C) if D_sa = 0 (control, sample share weighting)
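A sketch of the aggregate-weighting Q-weights in the paper's unit-count notation (hypothetical names; per the algorithm notes, the implementation's aggregate path uses observation counts per (event_time, sub_exp) cell instead):

```python
import numpy as np

def q_weights(d, sub, n_treated, n_control):
    """Aggregate weighting (Table 1): treated rows get 1; a control row in
    sub-experiment a gets (N_a^D / N^D) / (N_a^C / N^C)."""
    ND, NC = sum(n_treated.values()), sum(n_control.values())
    q = np.ones(len(d), dtype=float)
    for i, (di, a) in enumerate(zip(d, sub)):
        if di == 0:
            q[i] = (n_treated[a] / ND) / (n_control[a] / NC)
    return q

# Two sub-experiments: a=1 has 10 treated / 20 control units, a=2 has 30 / 20
d = np.array([1, 0, 1, 0])     # treated indicator per stacked row
sub = np.array([1, 1, 2, 2])   # sub-experiment id per stacked row
q = q_weights(d, sub, n_treated={1: 10, 2: 30}, n_control={1: 20, 2: 20})
```

The effect is to reweight each sub-experiment's controls so their share matches that sub-experiment's share of treated units.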
Standard errors (Section 5.4):
- Default: Cluster-robust standard errors at the group (unit) level
- Alternative: Cluster at group x sub-experiment level
- Both approaches yield approximately correct coverage when clusters > 100 (Table 2)
- No special bootstrap procedure specified; standard cluster-robust SEs recommended
- For the post-period average: delta method or `lincom`/`marginaleffects`
Edge cases:
- All events trimmed: `len(Omega_kappa) == 0` -> ValueError suggesting a reduced kappa
- No clean controls for event a: IC2 check fails -> trim the event and warn the user
- Single cohort in trimmed set: Valid — Q-weights simplify
- Duplicate observations: Same (unit, time) appears in multiple sub-experiments -> handled by clustering at unit level
- Constant treatment share across sub-exps: Unweighted FE recovers correct estimand (special case, Section 5.5)
- Anticipation > 0: Reference period shifts to e = -1 - anticipation. Post-treatment includes anticipation periods (e >= -anticipation). Window expands by anticipation pre-periods.
- Group aggregation: Not supported — pooled stacked regression cannot produce cohort-specific effects. Use CallawaySantAnna or ImputationDiD.
Algorithm (Section 5):
- Choose kappa_pre, kappa_post event window
- Apply IC1 (window fits in data) and IC2 (clean controls exist) to get Omega_kappa
- For each a in Omega_kappa: build sub-experiment with treated (A_s = a), clean controls (A_s > a + kappa_post), time window [a - kappa_pre, a + kappa_post] (with anticipation: [a - kappa_pre - anticipation, a + kappa_post])
- Stack all sub-experiments vertically
- Compute Q-weights: aggregate weighting uses observation counts per (event_time, sub_exp), matching R reference. Population/sample_share use unit counts per sub_exp (paper notation).
- Run WLS regression of Equation 3 with Q-weights
- Extract delta_h coefficients as event-study ATTs
- Compute cluster-robust SEs at unit level
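The sub-experiment construction and stacking steps above can be sketched in plain Python (row keys 'unit', 'time', 'adopt' are assumptions, not the library's schema; adopt = first treated period, math.inf if never treated; IC1 trimming and Q-weights omitted for brevity):

```python
import math

def build_stack(rows, kappa_pre, kappa_post):
    # One sub-experiment per adoption event a: treated cohort (adopt == a)
    # plus clean controls (adopt > a + kappa_post), restricted to the
    # event window [a - kappa_pre, a + kappa_post], then stacked.
    events = sorted({r["adopt"] for r in rows if math.isfinite(r["adopt"])})
    stacked = []
    for a in events:
        for r in rows:
            treated = r["adopt"] == a
            clean = r["adopt"] > a + kappa_post      # IC2: not-yet-treated
            in_window = a - kappa_pre <= r["time"] <= a + kappa_post
            if (treated or clean) and in_window:
                stacked.append({**r, "sub_exp": a,
                                "event_time": r["time"] - a,
                                "D": int(treated)})
    return stacked
```

A unit can appear in several sub-experiments, which is why SEs are clustered at the unit level after stacking.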
IC1 (Adoption Event Window, Section 3):
IC1_a = 1[a - kappa_pre >= T_min AND a + kappa_post <= T_max]
Note: Matches R reference implementation (focalAdoptionTime - kappa_pre >= minTime).
The reference period a-1 is included in the window [a-kappa_pre, a+kappa_post] when kappa_pre >= 1.
The paper text states a stricter bound (T_min + 1) but the R code by the co-author uses T_min.
IC2 (Clean Controls Exist, Section 3):
IC2_a = 1[exists s with A_s > a + kappa_post] (not_yet_treated)
IC2_a = 1[exists s with A_s > a + kappa_post + kappa_pre] (strict)
IC2_a = 1[exists s with A_s = infinity] (never_treated)
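The IC1/IC2 predicates are one-liners; a sketch under the conventions above (adoption_times uses math.inf for never-treated units; function names are illustrative):

```python
import math

def ic1(a, kappa_pre, kappa_post, t_min, t_max):
    # IC1: the event window must fit inside the observed time range
    # (R-reference bound: T_min, not the paper's stricter T_min + 1).
    return a - kappa_pre >= t_min and a + kappa_post <= t_max

def ic2(a, adoption_times, kappa_pre, kappa_post, mode="not_yet_treated"):
    # IC2: at least one clean control must exist for event a.
    if mode == "never_treated":
        return any(math.isinf(t) for t in adoption_times)
    bound = a + kappa_post + (kappa_pre if mode == "strict" else 0)
    return any(t > bound for t in adoption_times)
```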
Reference implementation(s):
- R: https://github.com/hollina/stacked-did-weights (create_sub_exp(), compute_weights())
- No Stata or Python package; Stata estimation via standard reghdfe with a Q-weight column
Requirements checklist:
- Sub-experiment construction with treated + clean controls + time window
- IC1 and IC2 trimming with warnings
- Q-weight computation for all three weighting schemes (Table 1)
- WLS via sqrt(w) transformation
- Event study regression (Equation 3) with reference period e=-1
- Cluster-robust SEs at unit or unit x sub-exp level
- Overall ATT as average of post-treatment delta_h with delta-method SE
- Anticipation parameter support
- Never-treated encoding (0 and inf)
- Survey design support (Phase 3): Q-weights compose multiplicatively with survey weights; TSL vcov on composed weights; survey design columns propagated through sub-experiments
- Note: Survey weights compose multiplicatively with Q-weights for StackedDiD; only weight_type="pweight" (default) is supported — fweight and aweight are rejected because Q-weight composition changes weight semantics (non-integer for fweight, non-inverse-variance for aweight)
Primary source: Arkhangelsky, D., Athey, S., Hirshberg, D.A., Imbens, G.W., & Wager, S. (2021). Synthetic Difference-in-Differences. American Economic Review, 111(12), 4088-4118.
Key implementation requirements:
Assumption checks / warnings:
- Requires balanced panel (same units observed in all periods)
- Warns if pre-treatment fit is poor (high RMSE)
- Treatment must be "block" structure: all treated units treated at same time
Estimator equation (as implemented):
τ̂^sdid = Σ_t λ_t (Ȳ_{tr,t} - Σ_j ω_j Y_{j,t})
where Ȳ_{tr,t} is the mean treated outcome at time t, ω_j are unit weights, and λ_t are time weights.
Unit weights ω (Frank-Wolfe on collapsed form):
Build collapsed-form matrix Y_unit of shape (T_pre, N_co + 1), where the last column is the mean treated pre-period outcomes. Solve via Frank-Wolfe on the simplex:
min_{ω on simplex} ζ_ω² ||ω||₂² + (1/T_pre) ||A_centered ω - b_centered||₂²
where A = Y_unit[:, :N_co], b = Y_unit[:, N_co], and centering is column-wise (intercept=True).
Two-pass sparsification procedure (matches R's synthdid::sc.weight.fw + sparsify_function):
- First pass: Run Frank-Wolfe for 100 iterations (max_iter_pre_sparsify) from uniform initialization
- Sparsify: v[v <= max(v)/4] = 0; v = v / sum(v) (zero out small weights, renormalize)
- Second pass: Run Frank-Wolfe for 10000 iterations (max_iter) starting from sparsified weights
The sparsification step concentrates weights on the most important control units, improving interpretability and stability.
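The sparsification rule, including the all-zero-input fallback documented under Edge cases below, fits in a few lines (a sketch, not the library's internal _sparsify):

```python
import numpy as np

def sparsify(v):
    # Zero out weights at or below max(v)/4, then renormalize to the simplex;
    # degenerate all-zero input falls back to uniform weights.
    v = np.asarray(v, dtype=float).copy()
    if v.max() <= 0:
        return np.ones(len(v)) / len(v)
    v[v <= v.max() / 4] = 0.0
    return v / v.sum()
```

For example, sparsify([0.5, 0.3, 0.1, 0.1]) drops the two 0.1 weights (threshold 0.125) and renormalizes to [0.625, 0.375, 0, 0].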
Time weights λ (Frank-Wolfe on collapsed form):
Build collapsed-form matrix Y_time of shape (N_co, T_pre + 1), where the last column is the per-control post-period mean (averaged across post-periods for each control unit). Solve:
min_{λ on simplex} ζ_λ² ||λ||₂² + (1/N_co) ||A_centered λ - b_centered||₂²
where A = Y_time[:, :T_pre], b = Y_time[:, T_pre], and centering is column-wise.
Auto-regularization (matching R's synthdid):
noise_level = sd(first_differences of control outcomes) # pooled across units
zeta_omega = (N_treated × T_post)^(1/4) × noise_level
zeta_lambda = 1e-6 × noise_level
The noise level is computed as the standard deviation (ddof=1) of np.diff(Y_pre_control, axis=0),
which computes first-differences across time for each control unit and pools all values.
This matches R's sd(apply(Y[1:N0, 1:T0], 1, diff)).
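The auto-regularization recipe translates directly (a sketch; rows of Y_pre_control are time periods, columns are control units, as in the np.diff(axis=0) convention above):

```python
import numpy as np

def regularization(Y_pre_control, n_treated, t_post):
    # Pooled sd (ddof=1) of within-unit first differences of control outcomes.
    noise = np.diff(Y_pre_control, axis=0).std(ddof=1)
    zeta_omega = (n_treated * t_post) ** 0.25 * noise
    zeta_lambda = 1e-6 * noise
    return noise, zeta_omega, zeta_lambda
```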
Frank-Wolfe step (matches R's fw.step()):
half_grad = A' (A x - b) + η x # η = N × ζ²
i = argmin(half_grad) # vertex selection (simplex corner)
d_x = e_i - x # direction toward vertex i
step = -(half_grad · d_x) / (||A d_x||² + η ||d_x||²)
step = clip(step, 0, 1)
x_new = x + step × d_x
Convergence criterion: stop when objective decrease < min_decrease² (default min_decrease = 1e-5 × noise_level, max_iter = 10000, max_iter_pre_sparsify = 100).
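One Frank-Wolfe move can be transcribed from the update above (a sketch using the same symbols A, x, b; eta = N × ζ² is passed in precomputed):

```python
import numpy as np

def fw_step(A, x, b, eta):
    # half_grad = A'(Ax - b) + eta * x; move toward the best simplex vertex
    # e_i with an exact line search, clipped to [0, 1].
    half_grad = A.T @ (A @ x - b) + eta * x
    i = int(np.argmin(half_grad))
    d_x = -x.copy()
    d_x[i] += 1.0                       # direction e_i - x
    denom = np.sum((A @ d_x) ** 2) + eta * np.sum(d_x ** 2)
    step = 0.0 if denom <= 0 else float(-half_grad @ d_x / denom)
    return x + np.clip(step, 0.0, 1.0) * d_x
```

Because each move interpolates between the current point and a simplex vertex, iterates stay non-negative and sum to 1 by construction.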
Standard errors:
- Default: Placebo variance estimator (Algorithm 4 in paper)
  - Randomly permute control unit indices
  - Split into pseudo-controls (first N_co - N_tr) and pseudo-treated (last N_tr)
  - Re-estimate unit weights (Frank-Wolfe) on pseudo-control/pseudo-treated data
  - Re-estimate time weights (Frank-Wolfe) on pseudo-control data
  - Compute SDID estimate with re-estimated weights
  - Repeat replications times (default 200)
  - SE = sqrt((r-1)/r) × sd(placebo_estimates) where r = number of successful replications
  - This matches R's synthdid::vcov(method="placebo"), which passes update.omega=TRUE, update.lambda=TRUE via opts.
- Alternative: Bootstrap at unit level (matching R's synthdid::vcov(method="bootstrap"))
  - Resample ALL units (control + treated) with replacement
  - Identify which resampled units are control vs treated
  - Renormalize original unit weights for resampled controls: ω_boot = _sum_normalize(ω[boot_control_idx])
  - Time weights λ remain unchanged from original estimation (fixed weights)
  - Compute SDID estimate with renormalized ω and original λ
  - SE = sd(bootstrap_estimates, ddof=1)
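The placebo SE formula is a small finite-sample rescaling of the ordinary sample sd (a sketch over the successful replications only):

```python
import numpy as np

def placebo_se(placebo_estimates):
    # SE = sqrt((r - 1) / r) * sd(estimates, ddof=1), r = len(estimates)
    est = np.asarray(placebo_estimates, dtype=float)
    r = len(est)
    return float(np.sqrt((r - 1) / r) * est.std(ddof=1))
```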
Edge cases:
- Frank-Wolfe non-convergence: Returns current weights after max_iter iterations. No warning emitted; the convergence check vals[t-1] - vals[t] < min_decrease² simply does not trigger early exit, and the final iterate is returned.
- _sparsify all-zero input: If max(v) <= 0, returns uniform weights ones(len(v)) / len(v).
- Single control unit: compute_sdid_unit_weights returns [1.0] immediately (short-circuit before Frank-Wolfe).
- Zero control units: compute_sdid_unit_weights returns empty array [].
- Single pre-period: compute_time_weights returns [1.0] when n_pre <= 1 (Frank-Wolfe on a 1-element simplex is trivial).
- Bootstrap with 0 control or 0 treated in resample: Skip iteration (continue). If ALL bootstrap iterations fail, raises ValueError. If only 1 succeeds, warns and returns SE=0.0. If >5% failure rate, warns about reliability.
- Placebo with n_control <= n_treated: Warns that there are not enough control units for placebo variance estimation, returns SE=0.0 and an empty placebo effects array. The check is n_control - n_treated < 1.
- Note: Power analysis functions (simulate_power, simulate_mde, simulate_sample_size) raise ValueError for placebo variance when n_control <= n_treated. The registry path checks pre-generation using n_units * treatment_fraction; the custom-DGP path checks post-generation on the realized data (first iteration only, since treatment allocation is deterministic per n_units/treatment_fraction).
- Negative weights attempted: Frank-Wolfe operates on the simplex (non-negative, sum-to-1), so weights are always feasible by construction. The step size is clipped to [0, 1] and the move is toward a simplex vertex.
- Perfect pre-treatment fit: Regularization (ζ² ||ω||²) prevents overfitting by penalizing weight concentration.
- Single treated unit: Valid; placebo variance uses jackknife-style permutations of controls.
- Noise level with < 2 pre-periods: Returns 0.0, which makes both zeta_omega and zeta_lambda equal to 0.0 (no regularization). Note (deviation from R): min_decrease uses a 1e-5 floor when noise_level == 0 to enable Frank-Wolfe early stopping. R would use 0.0, causing FW to run all max_iter iterations; the result is equivalent since zero-noise data has no variation to optimize.
- NaN inference for undefined statistics: t_stat uses NaN when SE is zero or non-finite; p_value and CI are also NaN. Matches the CallawaySantAnna NaN convention.
- Placebo p-value floor: p_value = max(empirical_p, 1/(n_replications + 1)) to avoid reporting exactly zero.
- Varying treatment within unit: Raises ValueError. SDID requires block treatment (constant within each unit). Suggests CallawaySantAnna or ImputationDiD for staggered adoption.
- Unbalanced panel: Raises ValueError. SDID requires all units observed in all periods. Suggests balance_panel().
- Poor pre-treatment fit: Warns (UserWarning) when pre_fit_rmse > std(treated_pre_outcomes, ddof=1). Diagnostic only; estimation proceeds.
- Note: Survey support: weights, strata, PSU, and FPC are all supported. Full-design surveys use the Rao-Wu rescaled bootstrap (Phase 6); variance_method="placebo" requires weights-only (strata/PSU/FPC require bootstrap). Both sides are weighted per the WLS regression interpretation: treated-side means are survey-weighted (Frank-Wolfe target and ATT formula); control-side synthetic weights are composed with survey weights post-optimization (ω_eff = ω * w_co, renormalized). Frank-Wolfe optimization itself is unweighted — survey importance enters after trajectory-matching. Covariate residualization uses WLS with survey weights. Placebo and bootstrap SE preserve survey weights on both sides.
Reference implementation(s):
- R: synthdid::synthdid_estimate() (Arkhangelsky et al.'s official package)
- Key R functions matched: sc.weight.fw() (Frank-Wolfe), sparsify_function (sparsification), vcov.synthdid_estimate() (variance)
Requirements checklist:
- Unit weights: Frank-Wolfe on collapsed form (T_pre, N_co+1), two-pass sparsification (100 iters -> sparsify -> 10000 iters)
- Time weights: Frank-Wolfe on collapsed form (N_co, T_pre+1), last column = per-control post mean
- Unit and time weights: sum to 1, non-negative (simplex constraint)
- Auto-regularization: noise_level = sd(first_diffs), zeta_omega = (N1*T1)^0.25 * noise_level, zeta_lambda = 1e-6 * noise_level
- Sparsification: v[v <= max(v)/4] = 0; v = v/sum(v)
- Placebo SE formula: sqrt((r-1)/r) * sd(placebo_estimates)
- Placebo SE: re-estimates omega and lambda per replication (matching R's update.omega=TRUE, update.lambda=TRUE)
- Bootstrap: fixed weights (original lambda unchanged, omega renormalized for resampled controls)
- Returns both unit and time weights for interpretation
- Column centering (intercept=True) in Frank-Wolfe optimization
Primary source: Ortiz-Villavicencio, M., & Sant'Anna, P.H.C. (2025). Better Understanding Triple Differences Estimators. arXiv:2505.09942.
Key implementation requirements:
Assumption checks / warnings:
- Requires all 8 cells of the 2×2×2 design: Group(0/1) × Period(0/1) × Treatment(0/1)
- Warns if any cell has fewer than threshold observations
- Propensity score overlap required for IPW/DR methods
Estimator equation (as implemented):
Three-DiD decomposition (matching R's triplediff::ddd()):
Subgroups: 4=G1P1, 3=G1P0, 2=G0P1, 1=G0P0
DDD = DiD_3 + DiD_2 - DiD_1
where DiD_j is a pairwise DiD comparing subgroup j vs subgroup 4 (reference).
Each pairwise DiD uses the selected estimation method (DR, IPW, or RA) with
repeated cross-section implementation (panel=FALSE in R).
Regression adjustment (RA): Separate OLS per subgroup-time cell within each pairwise comparison, imputed counterfactual means.
IPW: Propensity score P(subgroup=4|X) within {j, 4} subset, Hajek normalization.
Doubly robust (DR): Combines outcome regression and IPW with efficiency correction (OR bias correction term).
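The three-DiD decomposition reduces to arithmetic once each subgroup's pre/post mean change d_j is in hand; a sketch with illustrative names (each pairwise DiD_j = d_4 - d_j, subgroup 4 = G1P1 as reference):

```python
def ddd_from_changes(d4, d3, d2, d1):
    # DDD = DiD_3 + DiD_2 - DiD_1, with DiD_j the pairwise DiD of
    # subgroup j against reference subgroup 4.
    did = {j: d4 - dj for j, dj in ((3, d3), (2, d2), (1, d1))}
    return did[3] + did[2] - did[1]

# Algebraically equal to the textbook triple difference d4 - d3 - d2 + d1:
print(ddd_from_changes(5.0, 2.0, 1.0, 0.5))  # -> 2.5
```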
Standard errors (all methods):
Individual-level (default):
SE = std(w₃·IF₃ + w₂·IF₂ - w₁·IF₁, ddof=1) / sqrt(n)
where w_j = n / n_j, n_j = |{subgroup=j}| + |{subgroup=4}|, and IF_j is the
per-observation influence function for pairwise DiD j (padded to full n with zeros).
Cluster-robust (when cluster parameter is provided):
SE = sqrt( (G/(G-1)) * (1/n²) * Σ_c ψ_c² )
where G is the number of clusters, ψ_c = Σ_{i∈c} IF_i is the sum of the combined
influence function within cluster c, and the G/(G-1) factor is the Liang-Zeger
finite-sample adjustment.
Note: IF-based SEs are inherently heteroskedasticity-robust; the robust parameter
has no additional effect.
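Both variance formulas above can be sketched directly (IF arrays are assumed already padded to length n with zeros; helper names are illustrative):

```python
import numpy as np

def combined_if_se(if3, if2, if1, n3, n2, n1, n):
    # SE = sd(w3*IF3 + w2*IF2 - w1*IF1, ddof=1) / sqrt(n), w_j = n / n_j
    combined = (n / n3) * if3 + (n / n2) * if2 - (n / n1) * if1
    return float(combined.std(ddof=1) / np.sqrt(n))

def cluster_if_se(inf, clusters, n):
    # Liang-Zeger: sqrt(G/(G-1) * (1/n^2) * sum_c psi_c^2),
    # psi_c = within-cluster sum of the combined influence function.
    psi = {}
    for v, c in zip(inf, clusters):
        psi[c] = psi.get(c, 0.0) + v
    G = len(psi)
    return float(np.sqrt(G / (G - 1) * sum(p * p for p in psi.values()) / n ** 2))
```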
Edge cases:
- Propensity scores near 0/1: trimmed at pscore_trim (default 0.01)
- Empty cells: raises ValueError with diagnostic message
- Low cell counts: warns when any cell has fewer than 10 observations
- Cluster-robust SE: requires at least 2 clusters (raises ValueError)
- Cluster IDs: must not contain NaN (raises ValueError)
- Overlap warning: emitted when >5% of observations are trimmed at pscore bounds (IPW/DR only)
- Propensity score estimation failure: falls back to the unconditional probability P(subgroup=4), sets hessian=None (skipping the PS correction in the influence function), and emits a UserWarning. Exception: when rank_deficient_action="error", the error is re-raised instead of falling back.
- Collinear covariates: detected via pivoted QR in solve_ols(), action controlled by rank_deficient_action ("warn", "error", "silent")
- Non-finite influence function values (e.g., from extreme propensity scores in IPW/DR or a near-singular design): warns and sets SE to NaN, propagated to t_stat/p_value/CI via safe_inference()
- NaN inference for undefined statistics:
- t_stat: Uses NaN (not 0.0) when SE is non-finite or zero
- p_value and CI: Also NaN when t_stat is NaN
- Note: Defensive enhancement; reference implementation behavior not yet documented
Reference implementation(s):
- R: triplediff::ddd() (v0.2.1, CRAN) — official companion package by the paper authors
Requirements checklist:
- All 8 cells (G×P×T) must have observations
- Propensity scores clipped at pscore_trim bounds
- Returns cell means for diagnostic inspection
- Supports RA, IPW, and DR estimation methods
- Three-DiD decomposition: DDD = DiD_3 + DiD_2 - DiD_1 (matching R)
- Influence function SE: std(w3·IF_3 + w2·IF_2 - w1·IF_1) / sqrt(n)
- Cluster-robust SE via Liang-Zeger variance on influence function
- ATT and SE match R within <0.001% for all methods and DGP types
- Survey design support: all methods (reg, IPW, DR) with weighted OLS/logit + TSL on combined influence functions. Weighted solve_logit() for propensity scores in IPW/DR paths.
- Note: TripleDifference survey SE: for IPW/DR, pairwise IFs incorporate survey weights via weighted Riesz representers (riesz *= weights), so the combined IF is divided by per-observation survey weights (inf / sw) before passing to compute_survey_vcov() to prevent double-weighting. For regression (RA), pairwise IFs are already on the unweighted residual scale (WLS fits use weights internally but the IF is not Riesz-multiplied), so the combined IF passes directly to TSL without de-weighting. The OLS nuisance IF corrections in DR mode use weighted cross-products normalized by the subgroup row count n (not sum(weights)).
Primary source: Athey, S., Imbens, G.W., Qu, Z., & Viviano, D. (2025). Triply Robust Panel Estimators. arXiv:2508.21536.
Key implementation requirements:
Assumption checks / warnings:
- Requires sufficient pre-treatment periods for factor estimation (at least 2)
- Warns if estimated rank seems too high/low relative to panel dimensions
- Unit weights can become degenerate if λ_unit too large
- Returns Q(λ) = ∞ if ANY LOOCV fit fails (Equation 5 compliance)
Treatment indicator (D matrix) semantics:
D must be an ABSORBING STATE indicator, not a treatment timing indicator:
- D[t, i] = 0 for all t < g_i (pre-treatment periods for unit i)
- D[t, i] = 1 for all t >= g_i (during and after treatment for unit i)
where g_i is the treatment start time for unit i.
For staggered adoption, different units have different treatment start times g_i. The D matrix naturally handles this - distances use periods where BOTH units have D=0, matching the paper's (1 - W_iu)(1 - W_ju) formula in Equation 3.
Wrong D specification: If user provides event-style D (only first treatment period has D=1), ATT will be incorrect - document this clearly.
ATT definition (Equation 1, Section 6.1):
τ̂ = (1 / Σ_i Σ_t W_{it}) Σ_{i=1}^N Σ_{t=1}^T W_{it} τ̂_{it}(λ̂)
- ATT averages over ALL cells where D_it=1 (treatment indicator)
- No separate "post_periods" concept - D matrix is the sole input for treatment timing
- Supports general assignment patterns including staggered adoption
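Equation 1 is a plain average of per-cell effects over treated cells; a minimal sketch (tau and D are (T, N) arrays, names illustrative):

```python
import numpy as np

def att_over_treated(tau, D):
    # ATT = mean of tau_it over all cells with D_it = 1.
    W = np.asarray(D, dtype=bool)
    return float(np.asarray(tau, dtype=float)[W].mean())
```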
Estimator equation (as implemented, Section 2.2):
Working model (separating unit/time FE from regularized factor component):
Y_it(0) = α_i + β_t + L_it + ε_it, E[ε_it | L] = 0
where α_i are unit fixed effects, β_t are time fixed effects, and L = UΣV' is a low-rank factor structure. The FE are estimated separately from L because L is regularized but the fixed effects are not.
Optimization (Equation 2):
(α̂, β̂, L̂) = argmin_{α,β,L} Σ_j Σ_s θ_s^{i,t} ω_j^{i,t} (1-W_js)(Y_js - α_j - β_s - L_js)² + λ_nn ||L||_*
Solved via alternating minimization. For α, β: weighted least squares (closed form). The global solver adds an intercept μ and solves for (μ, α, β, L) on control data only, extracting τ_it post-hoc as residuals (see Global section below). For L: proximal gradient with step size η = 1/(2·max(W)):
Gradient step: G = L + (W/max(W)) ⊙ (R - L)
Proximal step: L = U × soft_threshold(Σ, η·λ_nn) × V' (SVD of G = UΣV')
where R is the residual after removing fixed effects. Both the local and global solvers use FISTA/Nesterov acceleration for the inner L update (O(1/k²) convergence rate, up to 20 inner iterations per outer alternating step).
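The proximal step for L is singular-value soft-thresholding; a sketch of that single operation (acceleration and the fixed-effect updates omitted):

```python
import numpy as np

def prox_nuclear(G, thresh):
    # Nuclear-norm proximal operator: SVD of the gradient step G,
    # shrink singular values by thresh = eta * lambda_nn, reconstruct.
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    s = np.maximum(s - thresh, 0.0)
    return (U * s) @ Vt
```

Shrinking singular values toward zero is what drives L toward low rank; a large enough lambda_nn zeroes every singular value, recovering the L=0 (factor model disabled) special case.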
Per-observation weights (Equation 3):
θ_s^{i,t}(λ) = exp(-λ_time × |t - s|)
ω_j^{i,t}(λ) = exp(-λ_unit × dist^unit_{-t}(j, i))
dist^unit_{-t}(j, i) = (Σ_u 1{u≠t}(1-W_iu)(1-W_ju)(Y_iu - Y_ju)² / Σ_u 1{u≠t}(1-W_iu)(1-W_ju))^{1/2}
Note: weights are per-(i,t) observation-specific. The distance formula excludes the target period t and uses only periods where both units are untreated (W=0).
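The unit-distance formula from Eq. 3 can be sketched as follows (Y and W are (T, N) arrays, rows indexed by time; a sketch, not the library's internals):

```python
import numpy as np

def unit_distance(Y, W, i, j, t):
    # RMS outcome gap between units i and j over periods u != t
    # where BOTH units are untreated (W = 0).
    mask = (W[:, i] == 0) & (W[:, j] == 0)
    mask[t] = False                      # exclude the target period
    gaps = (Y[mask, i] - Y[mask, j]) ** 2
    return float(np.sqrt(gaps.mean()))
```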
Special cases (Section 2.2):
- λ_nn=∞, ω_j=θ_s=1 (uniform weights) → recovers DID/TWFE
- ω_j=θ_s=1, λ_nn<∞ → recovers Matrix Completion (Athey et al. 2021)
- λ_nn=∞ with specific ω_j, θ_s → recovers SC/SDID
LOOCV tuning parameter selection (Equation 5, Footnote 2):
Q(λ) = Σ_{j,s: D_js=0} [τ̂_js^loocv(λ)]²
- Score is SUM of squared pseudo-treatment effects on control observations
- Two-stage procedure (per paper's footnote 2):
- Stage 1: Univariate grid searches with extreme fixed values
- λ_time search: fix λ_unit=0, λ_nn=∞ (disabled)
- λ_nn search: fix λ_time=0 (uniform time weights), λ_unit=0
- λ_unit search: fix λ_nn=∞, λ_time=0
- Stage 2: Cycling (coordinate descent) until convergence
- "Disabled" parameter semantics (per paper Section 4.3, Table 5, Footnote 2):
  - λ_time=0: Uniform time weights (disabled), because exp(-0 × dist) = 1
  - λ_unit=0: Uniform unit weights (disabled), because exp(-0 × dist) = 1
  - λ_nn=∞: Factor model disabled (L=0), because of the infinite penalty; converted to 1e10 internally
  - Note: λ_nn=0 means NO regularization (full-rank L), which is the OPPOSITE of "disabled"
  - Validation: lambda_time_grid and lambda_unit_grid must not contain inf. A ValueError is raised if they do, guiding users to use 0.0 for uniform weights per Eq. 3.
- LOOCV failure handling (Equation 5 compliance):
- If ANY LOOCV fit fails for a parameter combination, Q(λ) = ∞
- A warning is emitted on the first failure with the observation (t, i) and λ values
- Subsequent failures for the same λ are not individually warned (early return)
- This ensures λ selection only considers fully estimable combinations
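The scoring rule of Equation 5, including the fail-fast behavior, is a short reduction (a sketch; failed fits are represented here by NaN pseudo-effects):

```python
import numpy as np

def loocv_score(pseudo_effects):
    # Q(lambda) = SUM of squared pseudo-treatment effects over control
    # cells; any failed fit poisons the whole combination with inf.
    tau = np.asarray(pseudo_effects, dtype=float)
    if not np.all(np.isfinite(tau)):
        return float("inf")
    return float(np.sum(tau ** 2))
```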
Standard errors:
- Block bootstrap preserving panel structure (Algorithm 3)
Edge cases:
- Rank selection: automatic via cross-validation, information criterion, or elbow
- Zero singular values: handled by soft-thresholding
- Extreme distances: weights regularized to prevent degeneracy
- LOOCV fit failures: returns Q(λ) = ∞ on first failure (per Equation 5 requirement that Q sums over ALL control observations where D==0); if all parameter combinations fail, falls back to defaults (1.0, 1.0, 0.1)
- λ_nn=∞ implementation: Only λ_nn uses infinity (converted to 1e10 for computation):
- λ_nn=∞ → 1e10 (large penalty → L≈0, factor model disabled)
- Conversion applied to grid values during LOOCV (including Rust backend)
- Conversion applied to selected values for point estimation
- Conversion applied to selected values for variance estimation (ensures SE matches ATT)
- Results storage: TROPResults stores the original λ_nn value (inf), while computations use 1e10. λ_time and λ_unit store their selected values directly (0.0 = uniform).
- Empty control observations: If no valid control observations exist, returns Q(λ) = ∞ with warning. A score of 0.0 would incorrectly "win" over legitimate parameters.
- Infinite LOOCV score handling: If the best LOOCV score is infinite, best_lambda is set to None, triggering the defaults fallback
- D matrix validation: Treatment indicator must be an absorbing state (monotonic non-decreasing per unit)
  - Detection: np.diff(D, axis=0) < 0 for any column indicates a violation
  - Handling: Raises ValueError with the list of violating unit IDs and remediation guidance
  - Error message includes: "convert to absorbing state: D[t, i] = 1 for all t >= first treatment period"
- Rationale: Event-style D (0→1→0) silently biases ATT; runtime validation prevents misuse
- Unbalanced panels: Missing unit-period observations are allowed. Monotonicity validation checks each unit's observed D sequence for monotonicity, which correctly catches 1→0 violations that span missing period gaps (e.g., D[2]=1, missing [3,4], D[5]=0 is detected as a violation even though the gap hides the transition in adjacent-period checks).
- n_post_periods metadata: Counts periods where D=1 is actually observed (at least one unit has D=1), not calendar periods from first treatment. In unbalanced panels where treated units are missing in some post-treatment periods, only periods with observed D=1 values are counted.
- Wrong D specification: if the user provides event-style D (only the first treatment period marked), the absorbing-state validation raises ValueError with helpful guidance
- Bootstrap minimum: n_bootstrap must be >= 2 (enforced via ValueError). TROP uses the bootstrap for all variance estimation — there is no analytical SE formula.
- Inference CI distribution: After the safe_inference() migration, the CI uses the t-distribution (df = max(1, n_treated_obs - 1)), consistent with p_value. Previously the CI used the normal distribution while p_value used the t-distribution (inconsistent). This is a minor behavioral change; CIs may be slightly wider for small n_treated_obs.
Reference implementation(s):
- Authors' replication code (forthcoming)
Requirements checklist:
- Factor matrix estimated via soft-threshold SVD
- Unit weights: exp(-λ_unit × distance) (unnormalized, matching Eq. 2)
- LOOCV implemented for tuning parameter selection
- LOOCV uses SUM of squared errors per Equation 5
- Multiple rank selection methods: cv, ic, elbow
- Returns factor loadings and scores for interpretation
- ATT averages over all D==1 cells (general assignment patterns)
- No post_periods parameter (D matrix determines treatment timing)
- D matrix semantics documented (absorbing state, not event indicator)
- Unbalanced panels supported (missing observations don't trigger false violations)
- Note: Survey support: weights, strata, PSU, and FPC are all supported via Rao-Wu rescaled bootstrap with cross-classified pseudo-strata (Phase 6). Rust backend remains pweight-only; full-design surveys fall back to the Python bootstrap path. Survey weights enter ATT aggregation only — population-weighted average of per-observation treatment effects. Model fitting (kernel weights, LOOCV, nuclear norm regularization) stays unchanged. Rust and Python bootstrap paths both support survey-weighted ATT in each iteration.
Method: method="global" in TROP estimator (method="joint" is a deprecated alias;
method="twostep" is a deprecated alias for method="local")
Approach: Computationally efficient adaptation using the (1-W) masking
principle from Eq. 2. Fits a single global model on control data, then
extracts treatment effects as post-hoc residuals. For the paper's full
per-treated-cell estimator (Algorithm 2), use method='local'.
Objective function (Equation G1):
min_{μ, α, β, L} Σ_{i,t} (1-W_{it}) × δ_{it} × (Y_{it} - μ - α_i - β_t - L_{it})² + λ_nn×||L||_*
where:
- (1-W_{it}) masks out treated observations — model is fit on control data only
- δ_{it} = δ_time(t) × δ_unit(i) are observation weights (product of time and unit weights)
- μ is the intercept
- α_i are unit fixed effects
- β_t are time fixed effects
- L_{it} is the low-rank factor component
Post-hoc treatment effect extraction:
τ̂_{it} = Y_{it} - μ̂ - α̂_i - β̂_t - L̂_{it} for all (i,t) where W_{it} = 1
ATT = mean(τ̂_{it}) over all treated observations
Treatment effects are heterogeneous per-observation values. ATT is their mean.
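The post-hoc extraction step reduces to residual arithmetic; a sketch (a simplified stand-in for _extract_posthoc_tau(); Y, W, L are (T, N) arrays, alpha indexed by unit, beta by time):

```python
import numpy as np

def posthoc_tau(Y, W, mu, alpha, beta, L):
    # tau_it = Y_it - mu - alpha_i - beta_t - L_it at treated cells only;
    # ATT is the mean over finite tau values (NaN outcomes skipped).
    fitted = mu + alpha[None, :] + beta[:, None] + L
    tau = np.where(W == 1, Y - fitted, np.nan)
    return tau, float(np.nanmean(tau))
```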
Weight computation (differs from local):
- Time weights: δ_time(t) = exp(-λ_time × |t - center|) where center = T - treated_periods/2
- Unit weights: δ_unit(i) = exp(-λ_unit × RMSE(i, treated_avg)) where RMSE is computed over pre-treatment periods comparing to average treated trajectory
- (1-W) masking applied after outer product: δ_{it} = 0 for all treated cells
Implementation approach (without CVXPY):
- Without low-rank (λ_nn = ∞): Standard weighted least squares
  - Build design matrix with unit/time dummies (no treatment indicator)
  - Solve via np.linalg.lstsq for (μ, α, β) using (1-W)-masked weights
- With low-rank (finite λ_nn): Alternating minimization
  - Alternate between:
    - Fix L, solve weighted LS for (μ, α, β)
    - Fix (μ, α, β), proximal gradient for L:
      - Lipschitz constant of ∇f is L_f = 2·max(δ)
      - Step size η = 1/L_f = 1/(2·max(δ))
      - Proximal operator: soft_threshold(gradient_step, η·λ_nn)
      - Inner solver uses FISTA/Nesterov acceleration (O(1/k²))
  - Continue until max(|L_new - L_old|) < tol
- Post-hoc: Extract τ̂_{it} = Y_{it} - μ̂ - α̂_i - β̂_t - L̂_{it} for treated cells
LOOCV parameter selection (unified with local, Equation 5): Following paper's Equation 5 and footnote 2:
Q(λ) = Σ_{j,s: D_js=0} [τ̂_js^loocv(λ)]²
where τ̂_js^loocv is the pseudo-treatment effect at control observation (j,s) with that observation excluded from fitting.
For global method, LOOCV works as follows:
- For each control observation (t, i):
- Zero out weight δ_{ti} = 0 (exclude from weighted objective)
- Fit global model on remaining data → obtain (μ̂, α̂, β̂, L̂)
- Compute pseudo-treatment: τ̂_{ti} = Y_{ti} - μ̂ - α̂_i - β̂_t - L̂_{ti}
- Score = Σ τ̂_{ti}² (sum of squared pseudo-treatment effects)
- Select λ combination that minimizes Q(λ)
Rust acceleration: The LOOCV grid search is parallelized in Rust for 5-10x speedup.
- loocv_grid_search_global() - Parallel LOOCV across all λ combinations
- bootstrap_trop_variance_global() - Parallel bootstrap variance estimation
Key differences from local method:
- Global weights (distance to treated block center) vs. per-observation weights
- Single model fit per λ combination vs. N_treated fits
- Treatment effects are post-hoc residuals from a single global model (global) vs. post-hoc residuals from per-observation models (local)
- Both use (1-W) masking (control-only fitting)
- Faster computation for large panels
Assumptions:
- Simultaneous adoption (enforced): The global method requires all treated units to receive treatment at the same time. A ValueError is raised if staggered adoption is detected (units first treated at different periods). Treatment timing is inferred once and held constant for bootstrap variance estimation. For staggered adoption designs, use method="local".
Reference: Adapted from reference implementation. See also Athey et al. (2025).
Edge Cases (treated NaN outcomes):
- Partial NaN: When some treated outcomes Y_{it} are NaN/missing:
  - _extract_posthoc_tau() (global) skips these cells; only finite τ̂ values are averaged
  - Local loop skips NaN outcomes entirely (no model fit, no tau appended)
  - n_treated_obs in results reflects the valid (finite) count, not the total D==1 count
  - df_trop = max(1, n_valid_treated - 1) uses the valid count
  - Warning issued when n_valid_treated < total treated count
- All NaN: When all treated outcomes are NaN:
- ATT = NaN, warning issued
  - n_treated_obs = 0
- Bootstrap SE with <2 successful draws: Returns se=NaN (not 0.0) when zero bootstrap iterations succeed. safe_inference() propagates NaN downstream.
Requirements checklist:
- Same LOOCV framework as local (Equation 5)
- Global weight computation using treated block center
- (1-W) masking for control-only fitting (per paper Eq. 2)
- Alternating minimization for nuclear norm penalty
- Returns ATT = mean of per-observation post-hoc τ̂_{it}
- Rust acceleration for LOOCV and bootstrap
- Note: method="twostep" renamed to method="local" and method="joint" renamed to method="global" to form a natural local/global pair. Both old names are deprecated aliases, removal planned for v3.0.
Module: diff_diff/diagnostics.py
Edge cases:
- NaN inference for undefined statistics:
  - permutation_test: t_stat is NaN when the permutation SE is zero (all permutations produce identical estimates)
  - leave_one_out_test: t_stat, p_value, CI are NaN when the LOO SE is zero (all LOO effects identical)
  - Note: Defensive enhancement matching the CallawaySantAnna NaN convention
Primary source: Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics, 225(2), 254-277.
Key implementation requirements:
Assumption checks / warnings:
- Requires variation in treatment timing (staggered adoption)
- Warns if only one treatment cohort (decomposition not meaningful)
- Uses never-treated units as controls when present; falls back to timing-only comparisons otherwise
Estimator equation (as implemented):
TWFE decomposes as:
τ̂^TWFE = Σ_k s_k × τ̂_k
where k indexes 2×2 comparisons and s_k are Bacon weights.
Three comparison types:
- Treated vs. Never-treated (if never-treated exist): τ̂_{T,U} = (Ȳ_{T,post} - Ȳ_{T,pre}) - (Ȳ_{U,post} - Ȳ_{U,pre})
- Earlier vs. Later-treated (earlier as treated, later as control pre-treatment): τ̂_{k,l} = DiD using early-treated as treatment, late-treated as control
- Later vs. Earlier-treated (problematic: uses post-treatment outcomes as control): τ̂_{l,k} = DiD using late-treated as treatment, early-treated (post) as control
Weights depend on group sizes and variance in treatment timing.
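The decomposition identity itself is one weighted sum; a sketch showing how a negative later-vs-earlier estimate pulls the TWFE coefficient down (values are illustrative):

```python
def twfe_from_decomposition(weights, estimates):
    # TWFE coefficient = sum_k s_k * tau_k, with Bacon weights s_k summing to 1.
    assert abs(sum(weights) - 1.0) < 1e-8
    return sum(s * t for s, t in zip(weights, estimates))

# Clean 2x2s of 2.0 and 1.5 plus a later-vs-earlier comparison of -1.0:
print(twfe_from_decomposition([0.5, 0.25, 0.25], [2.0, 1.5, -1.0]))  # -> 1.125
```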
Standard errors:
- Not typically computed (decomposition is exact)
- Individual 2×2 estimates can have SEs
Edge cases:
- Continuous treatment: not supported, requires binary
- Weights may be negative for later-vs-earlier comparisons
- Single treatment time: no decomposition possible
Reference implementation(s):
- R: `bacondecomp::bacon()`
- Stata: `bacondecomp`
Requirements checklist:
- Three comparison types: treated_vs_never, earlier_vs_later, later_vs_earlier
- Weights sum to approximately 1 (numerical precision)
- TWFE coefficient ≈ weighted sum of 2×2 estimates
- Visualization shows weight vs. estimate by comparison type
- Survey design support (Phase 3): weighted cell means, weighted within-transform, weighted group shares
- Note: Bacon decomposition with survey weights is diagnostic; the exact-sum guarantee holds only approximately; `weights="exact"` requires within-unit-constant survey columns (the approximate path accepts time-varying weights)
Primary source: Rambachan, A., & Roth, J. (2023). A More Credible Approach to Parallel Trends. Review of Economic Studies, 90(5), 2555-2591.
Key implementation requirements:
Assumption checks / warnings:
- Requires event-study estimates with pre-treatment coefficients
- Warns if pre-treatment coefficients suggest parallel trends violation
- M=0 corresponds to exact parallel trends assumption
Estimator equation (as implemented):
Identified set under smoothness restriction (Δ^SD, bounding second differences of the trend):
Δ^SD(M) = {δ : |(δ_{t+1} - δ_t) - (δ_t - δ_{t-1})| ≤ M for all t}
Identified set under relative magnitudes (Δ^RM):
Δ^RM(M̄) = {δ : |δ_post| ≤ M̄ × max_t |δ_t^pre|}
Bounds computed via linear programming:
[τ_L, τ_U] = [min_δ∈Δ τ(δ), max_δ∈Δ τ(δ)]
Confidence intervals:
- FLCI (Fixed-Length Confidence Interval) for smoothness
- C-LF (Conditional Least-Favorable) for relative magnitudes
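The LP step for the identified set can be sketched with `scipy.optimize.linprog`, using the second-difference form of Δ^SD from Rambachan & Roth (2023). The event-study coefficients, the period grid (-2, -1, 0), and M below are made-up inputs; this illustrates only the bounds computation, not the FLCI/C-LF inference the library layers on top.

```python
import numpy as np
from scipy.optimize import linprog

# Hypothetical event-study coefficients: two pre periods and one post period.
beta_pre = np.array([0.0, 0.0])  # pre coefficients identify delta_pre (tau_pre = 0)
beta_post = 0.5                  # post-period coefficient = tau + delta_post
M = 0.1                          # smoothness bound

# Decision variables: delta = (d_{-2}, d_{-1}, d_0).
A_eq = np.array([[1.0, 0.0, 0.0],   # pin d_{-2} to observed pre coefficient
                 [0.0, 1.0, 0.0]])  # pin d_{-1} to observed pre coefficient
b_eq = beta_pre
# Smoothness: |d_0 - 2*d_{-1} + d_{-2}| <= M, written as two inequalities.
A_ub = np.array([[1.0, -2.0, 1.0],
                 [-1.0, 2.0, -1.0]])
b_ub = np.array([M, M])
bounds = [(None, None)] * 3  # delta components are unrestricted in sign

lo = linprog(c=[0, 0, 1.0], A_ub=A_ub, b_ub=b_ub,
             A_eq=A_eq, b_eq=b_eq, bounds=bounds)   # min d_0
hi = linprog(c=[0, 0, -1.0], A_ub=A_ub, b_ub=b_ub,
             A_eq=A_eq, b_eq=b_eq, bounds=bounds)   # max d_0

tau_L = beta_post - (-hi.fun)  # subtract the largest feasible bias
tau_U = beta_post - lo.fun     # subtract the smallest feasible bias
# With flat pre-trends and M = 0.1: [tau_L, tau_U] = [0.4, 0.6]
```

Setting M = 0 collapses the feasible δ_post to a point and the interval to the usual point estimate, matching the "M=0 reduces to standard parallel trends" edge case.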
Standard errors:
- Inherits from underlying event-study estimation
- Sensitivity analysis reports bounds, not point estimates
Edge cases:
- Breakdown point: smallest M where CI includes zero
- M=0: reduces to standard parallel trends
- Negative M: not valid (constraints become infeasible)
Reference implementation(s):
- R: `HonestDiD` package (Rambachan & Roth's official package)
Requirements checklist:
- M parameter must be ≥ 0
- Breakdown point (breakdown_M) correctly identified
- Delta^SD (smoothness) and Delta^RM (relative magnitudes) both supported
- Sensitivity plot shows bounds vs. M
- FLCI and C-LF confidence intervals implemented
Primary source: Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. American Economic Review: Insights, 4(3), 305-322.
Key implementation requirements:
Assumption checks / warnings:
- Requires specification of variance-covariance matrix of pre-treatment estimates
- Warns if pre-trends test has low power (uninformative)
- Different violation types have different power properties
Estimator equation (as implemented):
Pre-trends test statistic (Wald):
W = δ̂_pre' V̂_pre^{-1} δ̂_pre ~ χ²(k)
Power function:
Power(δ_true) = P(W > χ²_{α,k} | δ = δ_true)
Minimum detectable violation (MDV):
MDV(power=0.8) = min{|δ| : Power(δ) ≥ 0.8}
Violation types:
- Linear: δ_t = c × t (linear pre-trend)
- Constant: δ_t = c (level shift)
- Last period: δ_{-1} = c, others zero
- Custom: user-specified pattern
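Under a hypothesized violation δ_true, the Wald statistic is noncentral chi-square with noncentrality λ = δ_true' V̂⁻¹ δ_true, which gives the power function directly. The sketch below uses made-up variances and a linear violation pattern; it is a textbook calculation, not the `pretrends` package's code path.

```python
import numpy as np
from scipy.stats import chi2, ncx2

# Hypothetical variance matrix for two pre-period coefficients.
V = np.diag([0.04, 0.04])
alpha = 0.05
k = V.shape[0]
crit = chi2.ppf(1 - alpha, df=k)  # rejection threshold chi2_{alpha,k}

def power(delta_true):
    """P(W > crit) when W ~ noncentral chi2(k, lambda)."""
    nc = float(delta_true @ np.linalg.solve(V, delta_true))
    return float(ncx2.sf(crit, df=k, nc=nc))

# Linear violation delta_t = c * t over pre periods t = (-2, -1).
c = 0.3
p = power(c * np.array([-2.0, -1.0]))
# power at delta = 0 equals the size alpha of the test
```

The MDV at 80% power can then be found by bisecting on c until `power(...)` crosses 0.8, mirroring the `MDV(power=0.8)` definition above.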
Standard errors:
- Power calculations are exact (no sampling variability)
- Uncertainty comes from estimated Σ
Edge cases:
- Perfect collinearity in pre-periods: test not well-defined
- Single pre-period: power calculation trivial
- Very high power: MDV approaches zero
Reference implementation(s):
- R: `pretrends` package (Roth's official package)
Requirements checklist:
- MDV = minimum detectable violation at target power level
- Violation types: linear, constant, last_period, custom all implemented
- Power curve plotting over violation magnitudes
- Integrates with HonestDiD for combined sensitivity analysis
Primary source:
- Bloom, H.S. (1995). Minimum Detectable Effects: A Simple Way to Report the Statistical Power of Experimental Designs. Evaluation Review, 19(5), 547-556. https://doi.org/10.1177/0193841X9501900504
- Burlig, F., Preonas, L., & Woerman, M. (2020). Panel Data and Experimental Design. Journal of Development Economics, 144, 102458.
Key implementation requirements:
Assumption checks / warnings:
- Requires specification of outcome variance and intraclass correlation
- Warns if power is very low (<0.5) or sample size insufficient
- Cluster randomization requires cluster-level parameters
Estimator equation (as implemented):
Minimum detectable effect (MDE):
MDE = (t_{α/2} + t_{1-κ}) × SE(τ̂)
where κ is target power (typically 0.8).
Standard error for DiD:
SE(τ̂) = σ × √(1/n_T + 1/n_C) × √(1 + ρ(m-1)) / √(1 - R²)
where:
- ρ = intraclass correlation
- m = cluster size
- R² = variance explained by covariates
Power function:
Power = Φ(|τ|/SE - z_{α/2})
Sample size for target power:
n = 2(t_{α/2} + t_{1-κ})² σ² / MDE²
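The MDE and power formulas chain together as follows; this sketch substitutes normal critical values for the t quantiles (a standard large-sample simplification) and uses made-up design inputs, so it is illustrative rather than the library's power module.

```python
import numpy as np
from scipy.stats import norm

alpha, kappa = 0.05, 0.80         # size and target power
sigma, n_T, n_C = 1.0, 100, 100   # outcome SD and group sizes (hypothetical)
rho, m, R2 = 0.0, 1, 0.0          # no clustering, no covariate adjustment

z_a = norm.ppf(1 - alpha / 2)     # ~1.96
z_k = norm.ppf(kappa)             # ~0.84

# SE(tau_hat) with design effect 1 + rho(m-1) and covariate term 1/sqrt(1-R^2)
se = sigma * np.sqrt(1/n_T + 1/n_C) * np.sqrt(1 + rho*(m - 1)) / np.sqrt(1 - R2)
mde = (z_a + z_k) * se

# Evaluating the power function at tau = MDE recovers kappa by construction.
power = norm.cdf(abs(mde) / se - z_a)
```

With a nonzero ICC the design effect `1 + rho(m-1)` inflates the SE, which is the "high ICC dramatically reduces effective sample size" edge case above.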
Standard errors:
- Analytical formulas (no estimation uncertainty in power calculation)
- Simulation-based power accounts for finite-sample and model-specific factors
Edge cases:
- Very small effects: may require infeasibly large samples
- High ICC: dramatically reduces effective sample size
- Unequal allocation: optimal is often 50-50 but depends on costs
- Note: `data_generator_kwargs` keys that overlap with registry-managed simulation inputs (`treatment_effect`, `noise_sd`, `n_units`, `n_periods`, `treatment_fraction`, `treatment_period`, `n_pre`, `n_post`) are rejected with `ValueError` to prevent silent desync between the DGP and result metadata. `n_pre` and `n_post` are derived from `treatment_period` and `n_periods` in factor-model DGPs (SyntheticDiD, TROP); the 3-way intersection check naturally scopes the rejection to those estimators only. Use the corresponding `simulate_power()` parameters directly, or pass a custom `data_generator` to override the DGP entirely.
- Note: `simulate_sample_size()` rejects `n_per_cell` in `data_generator_kwargs` for TripleDifference because `n_per_cell` is derived from `n_units` (the search variable). A fixed override would freeze the effective sample size across bisection iterations, making the search degenerate. Use `simulate_power()` with a fixed `n_per_cell` override instead, or pass a custom `data_generator`.
- Note: The simulation-based power registry (`simulate_power`, `simulate_mde`, `simulate_sample_size`) uses a single-cohort staggered DGP by default. Estimators configured with `control_group="not_yet_treated"`, `clean_control="strict"`, or `anticipation > 0` will receive a `UserWarning` because the default DGP does not match their identification strategy. Users must supply `data_generator_kwargs` (e.g., `cohort_periods=[2, 4]`, `never_treated_frac=0.0`) or a custom `data_generator` to match the estimator design.
- Note: The TripleDifference registry adapter uses `generate_ddd_data`, a fixed 2×2×2 factorial DGP (group × partition × time). The `n_periods`, `treatment_period`, and `treatment_fraction` parameters are ignored — DDD always simulates 2 periods with balanced groups. `n_units` is mapped to `n_per_cell = max(2, n_units // 8)` (effective total N = `n_per_cell × 8`), so non-multiples of 8 are rounded down and values below 16 are clamped to 16. A `UserWarning` is emitted when simulation inputs differ from the effective DDD design. When rounding occurs, all result objects (`SimulationPowerResults`, `SimulationMDEResults`, `SimulationSampleSizeResults`) set `effective_n_units` to the actual sample size used; it is `None` when no rounding occurred. `simulate_sample_size()` snaps bisection candidates to multiples of 8 so that `required_n` is always a realizable DDD sample size. Passing `n_per_cell` in `data_generator_kwargs` suppresses the effective-N rounding warning but not the warnings for ignored parameters (`n_periods`, `treatment_period`, `treatment_fraction`).
Reference implementation(s):
- R: `pwr` package (general), `DeclareDesign` (simulation-based)
- Stata: `power` command
Requirements checklist:
- MDE calculation given sample size and variance parameters
- Power calculation given effect size and sample size
- Sample size calculation given MDE and target power
- Simulation-based power for complex designs
- Cluster adjustment for clustered designs
Reference Period Normalization
Normalization only occurs when `reference_period` is explicitly specified by the user:
- Explicit `reference_period=X`: normalizes effects (subtracts the reference effect) and sets the reference SE to NaN
  - Point estimates: `effect_normalized = effect - effect_ref`
  - Reference period SE → NaN (it's now a constraint, not an estimate)
  - Other periods' SEs unchanged (uncertainty relative to the constraint)
  - CIs recomputed from normalized effects and original SEs
- Auto-inferred reference (from CallawaySantAnna results): hollow marker styling only, no normalization
  - Original effects are plotted unchanged
  - Reference period shown with a hollow marker for visual indication
  - All periods retain their original SEs and error bars
This design prevents unintended normalization when the reference period isn't a true
identifying constraint (e.g., CallawaySantAnna with `base_period="varying"`, where different
cohorts use different comparison periods).
The explicit-only normalization follows the fixest (R) convention where the omitted/reference
category is an identifying constraint with no associated uncertainty. Auto-inferred references
follow the did (R) package convention which does not normalize and reports full inference.
Rationale: When normalizing to a reference period, we're treating that period as an identifying constraint (effect ≡ 0 by definition). The variance of a constant is zero, but since it's a constraint rather than an estimated quantity, we report NaN rather than 0. Auto-inferred references may not represent true identifying constraints, so normalization should be a deliberate user choice.
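The explicit-normalization arithmetic can be sketched directly; the event-study effects, SEs, and periods below are made-up values illustrating the convention, not the plotting code itself.

```python
import numpy as np

# Hypothetical event-study estimates indexed by period.
periods = np.array([-2, -1, 0, 1])
effects = np.array([0.10, 0.05, 0.60, 0.80])
ses = np.array([0.04, 0.03, 0.05, 0.06])
ref = -1                                   # explicitly chosen reference period
i = int(np.where(periods == ref)[0][0])

effects_norm = effects - effects[i]        # subtract the reference effect
ses_norm = ses.copy()
ses_norm[i] = np.nan                       # constraint, not an estimate

z = 1.96
ci_lo = effects_norm - z * ses_norm        # reference CI becomes (NaN, NaN)
ci_hi = effects_norm + z * ses_norm
```

The reference effect is exactly zero after normalization, its CI propagates to (NaN, NaN), and every other period keeps its original SE, matching the rules listed above.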
Edge Cases:
- If `reference_period` is not in the data: no normalization applied
- If the reference effect is NaN: no normalization applied
- Reference period CI becomes (NaN, NaN) after normalization (explicit only)
- Reference period is plotted with hollow marker (both explicit and auto-inferred)
- Reference period error bars: removed for explicit, retained for auto-inferred
Reference implementation(s):
- R: `fixest::coefplot()` with the reference category shown at 0 with no CI
- R: `did::ggdid()` does not normalize; shows full inference for all periods
| Estimator | Default SE | Alternatives |
|---|---|---|
| DifferenceInDifferences | HC1 robust | Cluster-robust, wild bootstrap |
| MultiPeriodDiD | HC1 robust | Cluster-robust (via cluster param), wild bootstrap |
| TwoWayFixedEffects | Cluster at unit | Wild bootstrap |
| CallawaySantAnna | Analytical (influence fn) | Multiplier bootstrap |
| SunAbraham | Cluster-robust + delta method | Pairs bootstrap |
| ImputationDiD | Conservative clustered (Thm 3) | Multiplier bootstrap (library extension; percentile CIs and empirical p-values, consistent with CS/SA) |
| TwoStageDiD | GMM sandwich (Newey & McFadden 1994) | Multiplier bootstrap on GMM influence function |
| SyntheticDiD | Placebo variance (Alg 4) | Unit-level bootstrap (fixed weights) |
| TripleDifference | Influence function (all methods) | SE = std(IF) / sqrt(n) |
| StackedDiD | Cluster-robust (unit) | Cluster at unit × sub-experiment |
| TROP | Block bootstrap | — |
| BaconDecomposition | N/A (exact decomposition) | Individual 2×2 SEs |
| HonestDiD | Inherited from event study | FLCI, C-LF |
| PreTrendsPower | Exact (analytical) | - |
| PowerAnalysis | Exact (analytical) | Simulation-based |
| diff-diff Estimator | R Package | Function |
|---|---|---|
| DifferenceInDifferences | fixest | feols(y ~ treat:post, ...) |
| MultiPeriodDiD | fixest | feols(y ~ i(time, treat, ref=ref) \| unit + time) |
| TwoWayFixedEffects | fixest | feols(y ~ treat \| unit + time, ...) |
| CallawaySantAnna | did | att_gt() |
| SunAbraham | fixest | sunab() |
| ImputationDiD | didimputation | did_imputation() |
| TwoStageDiD | did2s | did2s() |
| ContinuousDiD | contdid | cont_did() |
| SyntheticDiD | synthdid | synthdid_estimate() |
| TripleDifference | triplediff | ddd() |
| StackedDiD | stacked-did-weights | create_sub_exp() + compute_weights() |
| TROP | - | (forthcoming) |
| BaconDecomposition | bacondecomp | bacon() |
| HonestDiD | HonestDiD | createSensitivityResults() |
| PreTrendsPower | pretrends | pretrends() |
| PowerAnalysis | pwr / DeclareDesign | pwr.t.test() / simulation |
Survey-weighted estimation allows correct population-level inference from data collected via complex survey designs (multi-stage sampling, stratification, unequal selection probabilities).
- Reference: Lumley (2004) "Analysis of Complex Survey Samples", Journal of Statistical Software 9(8). Solon, Haider, & Wooldridge (2015) "What Are We Weighting For?" Journal of Human Resources 50(2).
- WLS formula: `beta_WLS = (X'WX)^{-1} X'Wy` where `W = diag(w_i)`
- Implementation: equivalent transformation via `sqrt(w)` scaling, then standard OLS. Residuals are back-transformed to the original scale.
- Weight types: pweight (inverse selection probability), fweight (frequency/expansion), aweight (inverse variance/precision)
- Note: Weight normalization uses the `sum(w) = n` convention (DRDID/Stata), not raw weights (R `survey`). Coefficients are identical; SEs differ by a constant factor.
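The equivalence between the direct WLS solve and the sqrt(w)-scaling trick is easy to verify numerically; the data below are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)          # arbitrary positive weights

# Direct normal equations: beta = (X'WX)^{-1} X'Wy
W = np.diag(w)
beta_direct = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent route: scale rows by sqrt(w), then run ordinary least squares.
sw = np.sqrt(w)
beta_scaled = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
```

The two solutions agree to machine precision, which is why the sqrt(w) transformation lets a single OLS solver serve all weight types.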
- Reference: Binder (1983) "On the Variances of Asymptotically Normal Estimators from Complex Surveys", International Statistical Review 51(3). Lumley (2004).
- Formula: `V_TSL = (X'WX)^{-1} [sum_h V_h] (X'WX)^{-1}` with stratified PSU-level scores
- Relationship to sandwich estimator: TSL is a generalization of the Huber-White sandwich estimator that accounts for stratification and finite population correction
- Deviation from R: R `survey` defaults `lonely_psu` to "fail"; we default to "remove" with a warning, matching common applied practice
- Edge case: Singleton strata (one PSU per stratum) — handled via the `lonely_psu` parameter ("remove", "certainty", or "adjust")
- Note: For unstratified designs with a single PSU, all `lonely_psu` modes produce NaN variance. The "adjust" mode cannot center against a global mean when there is only one stratum (the single PSU is the entire sample).
- Note: Weights-only designs (no explicit PSU or strata) use implicit per-observation PSUs for the TSL meat computation, consistent with the stratified-no-PSU path. The adjustment factor is `n/(n-1)` (not HC1's `n/(n-k)`).
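A minimal sketch of the stratified TSL sandwich, assuming the per-stratum meat V_h is the usual `n_h/(n_h-1)`-scaled covariance of PSU-aggregated scores s_hj = Σ_{i in PSU j} w_i x_i u_i. The data, strata layout, and scaling convention are illustrative assumptions, not the library's `compute_survey_vcov` internals (which also handle FPC and lonely-PSU modes).

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 40, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 0.5]) + rng.normal(size=n)
w = rng.uniform(0.5, 2.0, size=n)
strata = np.repeat([0, 1], n // 2)              # 2 strata
psu = np.tile(np.repeat([0, 1, 2, 3], 5), 2)    # 4 PSUs per stratum

# WLS fit and per-observation scores s_i = w_i x_i u_i.
XtWX = X.T @ (w[:, None] * X)
beta = np.linalg.solve(XtWX, X.T @ (w * y))
u = y - X @ beta
scores = w[:, None] * X * u[:, None]

# Meat: sum over strata of centered PSU-total score covariances.
meat = np.zeros((k, k))
for h in np.unique(strata):
    mask = strata == h
    psu_ids = np.unique(psu[mask])
    s_psu = np.array([scores[mask & (psu == j)].sum(axis=0) for j in psu_ids])
    n_h = len(psu_ids)
    d = s_psu - s_psu.mean(axis=0)              # center within stratum
    meat += (n_h / (n_h - 1)) * d.T @ d

bread = np.linalg.inv(XtWX)
V_tsl = bread @ meat @ bread                    # sandwich variance
```

With one stratum and per-observation PSUs this reduces to an HC-style sandwich with an `n/(n-1)` factor, consistent with the weights-only note above.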
- Note: aweights use an unweighted meat in the sandwich estimator (no `w` in the `u^2` term). This matches the Stata convention. Rationale: aweights model known heteroskedasticity; after the WLS transformation, errors are approximately homoskedastic.
- Note: fweights affect degrees of freedom (`df = sum(w) - k`, not `n - k`). This matches the Stata convention for frequency-expanded data.
- Note: pweight HC1 meat uses score outer products (Σ s_i s_i' where s_i = w_i x_i u_i), giving w² in the meat. fweight HC1 meat uses X'diag(w u²)X (one power of w), matching frequency-expanded HC1.
- Note: fweights must be non-negative integers; fractional values are rejected by `_validate_weights()`. All-zero vectors are rejected at the solver level. This matches Stata's convention.
- Note: When `absorb` is used with a single variable in DiD/MultiPeriodDiD, all regressors (treatment, time, interactions, covariates) are within-transformed alongside the outcome per the FWL theorem. Regressors collinear with the absorbed FE (e.g., treatment after absorbing unit FE) are dropped via rank-deficiency handling. Multiple absorbed variables with survey weights are rejected (single-pass sequential demeaning is not the correct weighted FWL projection for N > 1 dimensions; iterative alternating projections are needed but not yet implemented).
- Reference: Korn & Graubard (1990) "Simultaneous Testing of Regression Coefficients with Complex Survey Data", JASA 85(409).
- Formula: `df = n_PSU - n_strata` (replaces `n - k` for t-distribution inference)
- Deviation from R: some software uses a Satterthwaite-type df approximation; we use the simpler and more common `n_PSU - n_strata` convention.
- Note: When no explicit PSU is specified (weights-only or stratified-no-PSU designs), each observation is treated as its own PSU for df purposes. Survey df becomes `n_obs - n_strata` (or `n_obs - 1` when unstratified).
- Note: When `survey_design` specifies weights only (no PSU) and `cluster=` is specified, cluster IDs are injected as effective PSUs for Taylor Series Linearization variance estimation, matching the R `survey` package convention that clusters are the primary sampling units.
Two strategies for bootstrap variance under complex survey designs:
Multiplier Bootstrap at PSU Level (CallawaySantAnna, ImputationDiD, TwoStageDiD, ContinuousDiD, EfficientDiD):
- Reference: Standard Taylor linearization bootstrap (Shao 2003, "Impact of the Bootstrap on Sample Surveys", Statistical Science 18(2))
- Formula: Generate multiplier weights independently within strata at the PSU level. Scale by `sqrt(1 - f_h)` for FPC. Perturbation: `ATT_boot[b] = ATT + w_b^T @ psi_psu` where `psi_psu` are PSU-aggregated IF sums.
- Note: With no strata/PSU/FPC, this degenerates to the standard unit-level multiplier bootstrap.
Rao-Wu Rescaled Bootstrap (SunAbraham, SyntheticDiD, TROP):
- Reference: Rao & Wu (1988) "Resampling Inference with Complex Survey Data", JASA 83(401); Rao, Wu & Yue (1992) "Some Recent Work on Resampling Methods for Complex Surveys", Survey Methodology 18(2), Section 3.
- Formula: Within each stratum h with n_h PSUs, draw `m_h` PSUs with replacement. Without FPC: `m_h = n_h - 1`. With FPC: `m_h = max(1, round((1 - f_h) * (n_h - 1)))`. Rescaled weight: `w*_i = w_i * (n_h / m_h) * r_hi` where `r_hi` = count of times PSU i is drawn.
- Note: FPC enters through the resample size `m_h`, not as a post-hoc scaling factor. When `f_h >= 1` (census stratum), observations keep their original weights (zero variance).
- Note: Bootstrap paths support `lonely_psu="remove"` and `"certainty"` only. `lonely_psu="adjust"` raises `NotImplementedError` for the survey-aware bootstrap; use analytical inference for designs requiring `adjust` semantics.
- Deviation from R: For the no-FPC case (`m_h = n_h - 1`), this matches R `survey::as.svrepdesign(type="subbootstrap")`. The FPC-adjusted resample size `m_h = round((1 - f_h) * (n_h - 1))` follows Rao, Wu & Yue (1992), Section 3.
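One Rao-Wu draw for a single no-FPC stratum can be sketched as follows; the PSU weights are made-up values, and a full implementation would loop over strata and thread the rescaled weights through re-estimation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_h = 5                                     # PSUs in the stratum
w = np.array([1.0, 2.0, 1.5, 1.0, 0.5])    # one base weight per PSU (hypothetical)
m_h = n_h - 1                               # no-FPC resample size

draw = rng.choice(n_h, size=m_h, replace=True)
r = np.bincount(draw, minlength=n_h)        # r_hi: times PSU i is drawn
w_star = w * (n_h / m_h) * r                # rescaled replicate weights

# Since E[r_i] = m_h / n_h, E[w_star_i] = w_i: the rescaling is unbiased,
# and PSUs not drawn in a replicate get weight zero rather than being dropped.
```

Undrawn PSUs receiving weight zero (instead of being removed) is what lets the replicate run through the same weighted estimator as the full sample.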
CallawaySantAnna Design-Based Aggregated SEs:
- Formula: `V_design = sum_h (1-f_h) * (n_h/(n_h-1)) * sum_j (psi_hj - psi_h_bar)^2` where `psi_hj = sum_{i in PSU j} psi_i` and `psi_i` is the combined IF (standard + WIF).
- Note: Per-(g,t) cell SEs use the simpler IF-based formula `sqrt(sum(psi^2))`, which already incorporates survey weights. Only aggregated SEs (overall, event study, group) use the full design-based variance.
TROP Cross-Classified Strata:
- Note (deviation from R): When survey strata and treatment groups both exist, TROP creates pseudo-strata as `(survey_stratum x treatment_group)` for Rao-Wu resampling. This preserves both the survey variance structure and the treatment ratio. Survey df is computed from the pseudo-strata structure.
- Note: When `survey_design.strata` is None but PSU/FPC trigger the full-design bootstrap, TROP uses treatment group (treated vs. control) as pseudo-strata for Rao-Wu resampling to preserve the treatment ratio. FPC is applied within these pseudo-strata. This matches TROP's existing treatment-stratified resampling pattern.
- Note (deviation from block bootstrap): In the Rao-Wu survey bootstrap, per-observation treatment effects tau_{it} are deterministic given (Y, D, lambda) because survey weights do not enter the kernel-weighted matrix completion. The Rao-Wu path therefore precomputes tau values once and only varies the ATT aggregation weights across draws. This is mathematically equivalent to refitting per draw and avoids redundant computation.
Alternative to TSL: re-run WLS for each replicate weight column and compute variance from the distribution of replicate estimates.
- Reference: Wolter (2007) "Introduction to Variance Estimation", 2nd ed. Rao & Wu (1988).
- Supported methods: BRR, Fay's BRR, JK1, JKn
- Formulas:
  - BRR: `V = (1/R) * sum_r (theta_r - theta)^2`
  - Fay: `V = 1/(R*(1-rho)^2) * sum_r (theta_r - theta)^2`
  - JK1: `V = (R-1)/R * sum_r (theta_r - theta)^2`
  - JKn: `V = sum_h ((n_h-1)/n_h) * sum_{r in h} (theta_r - theta)^2`
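The JK1 formula can be sanity-checked against a known identity: for the sample mean with delete-one replicates, the JK1 variance equals the classical s²/n exactly. The data are randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(size=30)
R = len(y)                                  # one delete-one replicate per obs

# Replicate estimates: the mean with observation r removed.
theta_r = np.array([np.delete(y, r).mean() for r in range(R)])
theta_bar = theta_r.mean()

# JK1: V = (R-1)/R * sum_r (theta_r - theta_bar)^2
V_jk1 = (R - 1) / R * np.sum((theta_r - theta_bar) ** 2)

# Classical variance of the mean: s^2 / n (with ddof=1)
V_classical = y.var(ddof=1) / R
```

The same replicate loop generalizes to any statistic (re-run WLS per replicate column), which is exactly the replicate-weight alternative to TSL described above.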
- IF-based replicate variance: For influence-function estimators (CS aggregation, ContinuousDiD, EfficientDiD, TripleDifference), replicate contrasts are formed via weight-ratio rescaling: `theta_r = sum((w_r/w_full) * psi)` when `combined_weights=True`, `theta_r = sum(w_r * psi)` when `combined_weights=False`.
- Survey df: QR-rank of the analysis-weight matrix minus 1, matching R's `survey::degf()`, which uses `qr(..., tol=1e-5)$rank`. For `combined_weights=True` (default), analysis weights are the raw replicate columns. For `combined_weights=False`, analysis weights are `replicate_weights * full_sample_weights`. Returns `None` (undefined) when rank <= 1, yielding NaN inference. Replaces `n_PSU - n_strata`.
- Mutual exclusion: Replicate weights cannot be combined with strata/psu/fpc (the replicates encode the design structure implicitly)
- Design parameters (matching R `svrepdesign()`):
  - `combined_weights` (default True): replicate columns include the full-sample weight. If False, replicate columns are perturbation factors multiplied by the full-sample weight before WLS.
  - `replicate_scale`: overall variance multiplier, applied multiplicatively with `replicate_rscales` when both are provided (`scale * rscales`)
  - `replicate_rscales`: per-replicate scaling factors (vector of length R). BRR and Fay ignore custom `replicate_scale`/`replicate_rscales` with a warning (fixed scaling by design); JK1/JKn allow overrides.
  - `mse` (default False, matching R's `survey::svrepdesign()`): if True, center variance on the full-sample estimate; if False, center on the mean of replicate estimates. When `replicate_rscales` contains zero entries and `mse=False`, centering excludes zero-scaled replicates, matching R's `survey::svrVar()` convention.
- Note: Replicate columns are NOT normalized — raw values are preserved to maintain correct weight ratios in the IF path.
- Note: JKn requires explicit `replicate_strata` (per-replicate stratum assignment). Auto-derivation from weight patterns is not supported.
- Note: Invalid replicate solves (singular/degenerate) are dropped with a warning. Variance is computed from valid replicates only. Fewer than 2 valid replicates returns NaN variance. The variance scaling factor (e.g., `1/R` for BRR, `(R-1)/R` for JK1) uses the original design's `R`, not the valid count — matching R's `survey` package convention, where the design structure is fixed and dropped replicates contribute zero to the sum without changing the scale. Survey df uses `n_valid - 1` for t-based inference.
- Note: Replicate-weight support matrix:
- Supported: CallawaySantAnna (reg/ipw/dr without covariates, no bootstrap), ContinuousDiD (no bootstrap), EfficientDiD (no bootstrap), TripleDifference (all methods), LinearRegression (OLS path)
- Rejected with NotImplementedError: SunAbraham, TwoWayFixedEffects (within-transformation must be recomputed per replicate), DifferenceInDifferences, MultiPeriodDiD, StackedDiD (use compute_survey_vcov directly), ImputationDiD, TwoStageDiD (custom variance), SyntheticDiD, TROP (bootstrap-based variance), BaconDecomposition (diagnostic only)
- CS/ContinuousDiD/EfficientDiD reject replicate weights + `n_bootstrap > 0` (replicate weights provide analytical variance)
- Note: When invalid replicates are dropped in `compute_replicate_vcov` (OLS path), `n_valid` is returned and used for `df_survey = n_valid - 1` in `LinearRegression.fit()`. For IF-based replicate paths, replicates essentially never fail (weighted sums cannot be singular), so `n_valid` equals `R` in practice and df propagation is not needed.
Per-coefficient design effect comparing survey variance to SRS variance.
- Reference: Kish (1965) "Survey Sampling", Wiley. Chapter 8.
- Formula: `DEFF_k = Var_survey(beta_k) / Var_SRS(beta_k)` where the SRS baseline uses the HC1 sandwich ignoring design structure
- Effective n: `n_eff_k = n / DEFF_k`
- Display: Existing weight-based DEFF is labeled "Kish DEFF (weights)"; per-coefficient DEFF is available via `compute_deff_diagnostics()` or `LinearRegression.compute_deff()` post-fit
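The weight-based Kish DEFF mentioned above has a closed form in the weights alone; the sketch below uses a made-up weight vector and is an illustration of the formula, not the library's diagnostic routine.

```python
import numpy as np

# Kish's weight-based design effect: n * sum(w^2) / (sum w)^2.
# Equals 1 for equal weights; grows with weight variability.
w = np.array([1.0, 1.0, 2.0, 4.0])   # hypothetical survey weights
n = len(w)
deff_kish = n * np.sum(w**2) / np.sum(w)**2   # 4 * 22 / 64 = 1.375
n_eff = n / deff_kish                         # effective sample size
```

This weight-only version ignores clustering and stratification, which is why the per-coefficient `DEFF_k` above (based on the full survey variance) is the more informative diagnostic.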
Domain estimation preserving full design structure.
- Reference: Lumley (2004), Section 3.4. Stata `svy: subpop`.
- Method: `SurveyDesign.subpopulation(data, mask)` zeros out weights for excluded observations while retaining the strata/PSU layout for correct variance estimation
- Note: Weight validation relaxed from "strictly positive" to "non-negative" to support zero-weight observations. Negative weights still rejected. All-zero weight vectors rejected at solver level.
- Note: Survey design df (`n_PSU - n_strata`) uses the total design structure (including zero-weight rows), matching R's `survey::degf()` convention after `subset()`. The generic HC1/classical inference paths use the positive-weight count for df adjustments, ensuring zero-weight padding is inference-invariant outside the survey vcov path. DEFF effective-n also uses the positive-weight count.
- Note: For replicate-weight designs, `subpopulation()` zeros out both full-sample and replicate weight columns for excluded observations, preserving all replicate metadata.
- Note: Defensive enhancement: ContinuousDiD and TripleDifference validate the positive-weight effective sample size before WLS cell fits. After `subpopulation()` zeroes weights, raw row counts may exceed the regression rank requirement while the weighted effective sample does not. Underidentified cells are skipped (ContinuousDiD) or fall back to weighted means (TripleDifference).
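The zero-weight masking idea can be illustrated with a weighted mean: point estimates match naive subsetting exactly, while the masked version keeps every row (and hence the full strata/PSU layout) available for variance estimation. The data are made-up values, not the `SurveyDesign` API.

```python
import numpy as np

y = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 2.0, 1.0, 2.0])
mask = np.array([True, True, False, True])   # domain of interest

# Subpopulation approach: zero out excluded weights, keep all rows.
w_sub = np.where(mask, w, 0.0)
est_masked = np.sum(w_sub * y) / np.sum(w_sub)

# Naive subsetting: drop excluded rows entirely.
est_subset = np.sum(w[mask] * y[mask]) / np.sum(w[mask])
# Identical point estimates; only the design information retained differs.
```

The difference shows up only in the variance step: the masked design still knows how many PSUs each stratum originally had, which the subsetted data has thrown away.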
The 8-step workflow in docs/llms-practitioner.txt is adapted from Baker et al. (2025)
"Difference-in-Differences Designs: A Practitioner's Guide" (arXiv:2503.13323), not a
1:1 mapping of the paper's forward-engineering framework.
- Note: The diff-diff canonical numbering is: 1-Define, 2-Assumptions, 3-Test PT, 4-Choose estimator, 5-Estimate, 6-Sensitivity, 7-Heterogeneity, 8-Robustness. Paper's numbering: 1-Define, 2-Assumptions, 3-Estimation method, 4-Uncertainty, 5-Estimate, 6-Sensitivity, 7-Heterogeneity, 8-Keep learning.
- Note: Parallel trends testing is a separate Step 3 (paper embeds it in Step 2), to ensure AI agents execute it as a distinct action.
- Note: Sources of uncertainty (paper's Step 4) is folded into Step 5 (Estimate) with an explicit cluster-count check directive (>= 50 clusters for asymptotic SEs, otherwise wild bootstrap). The 50-cluster threshold is a diff-diff convention.
- Note: Step 8 is "Robustness & Reporting" (compare estimators, report with/without covariates). Paper's Step 8 is "Keep learning." The mandatory with/without covariate comparison is a diff-diff convention.
- v1.3 (2026-03-26): Added Replicate Weight Variance, DEFF Diagnostics, and Subpopulation Analysis sections (Phase 6 completion)
- v1.2 (2026-03-24): Added Survey-Aware Bootstrap section (Phase 6)
- v1.1 (2026-03-20): Added Survey Data Support section
- v1.0 (2025-01-19): Initial registry with 12 estimators