
Methodology Review

This document tracks the review of each estimator's implementation against the Methodology Registry and academic references. The goal is to verify that implementations are correct, consistent, and well-documented.

For the methodology registry with academic foundations and key equations, see docs/methodology/REGISTRY.md.


Overview

Each estimator in diff-diff should be periodically reviewed to ensure:

  1. Correctness: Implementation matches the academic paper's equations
  2. Reference alignment: Behavior matches reference implementations (R packages, Stata commands)
  3. Edge case handling: Documented edge cases are handled correctly
  4. Standard errors: SE formulas match the documented approach

Review Status Summary

| Estimator | Module | R Reference | Status | Last Review |
| --- | --- | --- | --- | --- |
| DifferenceInDifferences | `estimators.py` | `fixest::feols()` | Complete | 2026-01-24 |
| MultiPeriodDiD | `estimators.py` | `fixest::feols()` | Complete | 2026-02-02 |
| TwoWayFixedEffects | `twfe.py` | `fixest::feols()` | Complete | 2026-02-08 |
| CallawaySantAnna | `staggered.py` | `did::att_gt()` | Complete | 2026-01-24 |
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | Complete | 2026-02-15 |
| SyntheticDiD | `synthetic_did.py` | `synthdid::synthdid_estimate()` | Complete | 2026-02-10 |
| TripleDifference | `triple_diff.py` | `triplediff::ddd()` | Complete | 2026-02-18 |
| StackedDiD | `stacked_did.py` | `stacked-did-weights` | Complete | 2026-02-19 |
| TROP | `trop.py` | (forthcoming) | Not Started | - |
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | Not Started | - |
| HonestDiD | `honest_did.py` | HonestDiD package | Not Started | - |
| PreTrendsPower | `pretrends.py` | pretrends package | Not Started | - |
| PowerAnalysis | `power.py` | pwr / DeclareDesign | Not Started | - |

Status legend:

  • Not Started: No formal review conducted
  • In Progress: Review underway
  • Complete: Review finished, implementation verified

Detailed Review Notes

Core DiD Estimators

DifferenceInDifferences

| Field | Value |
| --- | --- |
| Module | `estimators.py` |
| Primary Reference | Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | `fixest::feols()` |
| Status | Complete |
| Last Review | 2026-01-24 |

Verified Components:

  • ATT formula: Double-difference of cell means matches regression interaction coefficient
  • R comparison: ATT matches fixest::feols() within 1e-3 tolerance
  • R comparison: SE (HC1 robust) matches within 5%
  • R comparison: P-value matches within 0.01
  • R comparison: Confidence intervals overlap
  • R comparison: Cluster-robust SE matches within 10%
  • R comparison: Fixed effects (absorb) matches feols(...|unit) within 1%
  • Wild bootstrap inference (Rademacher, Mammen, Webb weights)
  • Formula interface (y ~ treated * post)
  • All REGISTRY.md edge cases tested
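The cell-mean / interaction-coefficient equivalence in the first bullet can be sketched with synthetic data (this is an illustration of the algebra, not the package's API):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
treated = rng.integers(0, 2, n)
post = rng.integers(0, 2, n)
# True ATT = 2.0
y = 1.0 + 0.5 * treated + 0.3 * post + 2.0 * treated * post + rng.normal(0, 0.1, n)

# Double-difference of the four cell means
def cell(t, p):
    return y[(treated == t) & (post == p)].mean()

att_cells = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

# Interaction coefficient from the saturated regression y ~ treated * post
X = np.column_stack([np.ones(n), treated, post, treated * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
att_ols = beta[3]

# For a saturated 2x2 design these agree to machine precision
assert np.isclose(att_cells, att_ols)
```

Because the regression is saturated, its fitted values are exactly the cell means, so the two ATT computations coincide identically rather than just approximately.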

Test Coverage:

  • 53 methodology verification tests in tests/test_methodology_did.py
  • 123 existing tests in tests/test_estimators.py
  • R benchmark tests (skip if R not available)

R Comparison Results:

  • ATT matches within 1e-3 (R JSON truncation limits precision)
  • HC1 SE matches within 5%
  • Cluster-robust SE matches within 10%
  • Fixed effects results match within 1%

Corrections Made:

  • (None - implementation verified correct)

Outstanding Concerns:

  • R comparison precision limited by JSON output truncation (4 decimal places)
  • Consider improving R script to output full precision for tighter tolerances

Edge Cases Verified:

  1. Empty cells: Produces rank deficiency warning (expected behavior)
  2. Singleton clusters: Included in variance estimation, contribute via residuals (corrected REGISTRY.md)
  3. Rank deficiency: All three modes (warn/error/silent) working
  4. Non-binary treatment/time: Raises ValueError as expected
  5. No variation in treatment/time: Raises ValueError as expected
  6. Missing values: Raises ValueError as expected

MultiPeriodDiD

| Field | Value |
| --- | --- |
| Module | `estimators.py` |
| Primary Reference | Freyaldenhoven et al. (2021), Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | `fixest::feols()` |
| Status | Complete |
| Last Review | 2026-02-02 |

Verified Components:

  • Full event-study specification: treatment × period interactions for ALL non-reference periods (pre and post)
  • Reference period coefficient is zero (normalized by omission from design matrix)
  • Default reference period is last pre-period (e=-1 convention, matches fixest/did)
  • Pre-period coefficients available for parallel trends assessment
  • Average ATT computed from post-treatment effects only, with covariance-aware SE
  • Returns PeriodEffect objects with confidence intervals for all periods
  • Supports balanced and unbalanced panels
  • NaN inference: t_stat/p_value/CI use NaN when SE is non-finite or zero
  • R-style NA propagation: avg_att is NaN if any post-period effect is unidentified
  • Rank-deficient design matrix: warns and sets NaN for dropped coefficients (R-style)
  • Staggered adoption detection warning (via unit parameter)
  • Treatment reversal detection warning
  • Time-varying D_it detection warning (advises creating ever-treated indicator)
  • Single pre-period warning (ATT valid but pre-trends assessment unavailable)
  • Post-period reference_period raises ValueError (would bias avg_att)
  • HonestDiD/PreTrendsPower integration uses interaction sub-VCV (not full regression VCV)
  • All REGISTRY.md edge cases tested
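The full event-study design described above (treatment × period interactions for every non-reference period, with e=-1 normalized by omission) can be sketched as follows. Column names here are hypothetical, not the package's internals:

```python
import numpy as np
import pandas as pd

periods = [-3, -2, -1, 0, 1, 2]
ref = -1  # default reference: last pre-period (e=-1 convention)
df = pd.DataFrame({
    "unit": np.repeat(np.arange(10), len(periods)),
    "event_time": np.tile(periods, 10),
})
df["treated"] = (df["unit"] < 5).astype(int)

# One treated x 1(event_time == e) interaction per non-reference period
for e in periods:
    if e == ref:
        continue  # omission from the design matrix normalizes e=-1 to zero
    df[f"D_x_e{e}"] = df["treated"] * (df["event_time"] == e).astype(int)

interaction_cols = [c for c in df.columns if c.startswith("D_x_e")]
```

Pre-period interaction coefficients from this design are what feed the parallel-trends assessment; the post-period ones average into `avg_att`.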

Test Coverage:

  • 50 tests across TestMultiPeriodDiD and TestMultiPeriodDiDEventStudy in tests/test_estimators.py
  • 18 new event-study specification tests added in PR #125

Corrections Made:

  • PR #125 (2026-02-02): Transformed from post-period-only estimator into full event-study specification with pre-period coefficients. Reference period default changed from first pre-period to last pre-period (e=-1 convention). HonestDiD/PreTrendsPower VCV extraction fixed to use interaction sub-VCV instead of full regression VCV.

Outstanding Concerns:

  • ~~No R comparison benchmarks yet~~ Resolved: R comparison benchmark added via benchmarks/R/benchmark_multiperiod.R using fixest::feols(outcome ~ treated * time_f | unit). Results match R exactly: ATT diff < 1e-11, SE diff 0.0%, period effects correlation 1.0. Validated at small (200 units) and 1k scales.
  • Default SE is HC1 (not cluster-robust at unit level as fixest uses). Cluster-robust available via cluster parameter but not the default.
  • Endpoint binning for distant event times not yet implemented.
  • FutureWarning for reference_period default change should eventually be removed once the transition is complete.

TwoWayFixedEffects

| Field | Value |
| --- | --- |
| Module | `twfe.py` |
| Primary Reference | Wooldridge (2010), Ch. 10 |
| R Reference | `fixest::feols()` |
| Status | Complete |
| Last Review | 2026-02-08 |

Verified Components:

  • Within-transformation algebra: y_it - ȳ_i - ȳ_t + ȳ matches hand calculation (rtol=1e-12)
  • ATT matches manual demeaned OLS (rtol=1e-10)
  • ATT matches DifferenceInDifferences on 2-period data (rtol=1e-10)
  • Covariates are also within-transformed (sum to zero within unit/time groups)
  • R comparison: ATT matches fixest::feols(y ~ treated:post | unit + post, cluster=~unit) (rtol<0.1%)
  • R comparison: Cluster-robust SE match (rtol<1%)
  • R comparison: P-value match (atol<0.01)
  • R comparison: CI bounds match (rtol<1%)
  • R comparison: ATT and SE match with covariate (same tolerances)
  • Edge case: Staggered treatment triggers UserWarning
  • Edge case: Auto-clusters at unit level (SE matches explicit cluster="unit")
  • Edge case: DF adjustment for absorbed FE matches manual solve_ols() with df_adjustment
  • Edge case: Covariate collinear with interaction raises ValueError ("cannot be identified")
  • Edge case: Covariate collinearity warns but ATT remains finite
  • Edge case: rank_deficient_action="error" raises ValueError
  • Edge case: rank_deficient_action="silent" emits no warnings
  • Edge case: Unbalanced panel produces valid results (finite ATT, positive SE)
  • Edge case: Missing unit column raises ValueError
  • Integration: decompose() returns BaconDecompositionResults
  • SE: Cluster-robust SE >= HC1 SE
  • SE: VCoV positive semi-definite
  • Wild bootstrap: Valid inference (finite SE, p-value in [0,1])
  • Wild bootstrap: All weight types (rademacher, mammen, webb) produce valid inference
  • Wild bootstrap: inference="wild_bootstrap" routes correctly
  • Params: get_params() returns all inherited parameters
  • Params: set_params() modifies attributes
  • Results: summary() contains "ATT"
  • Results: to_dict() contains att, se, t_stat, p_value, n_obs
  • Results: residuals + fitted = demeaned outcome (not raw)
  • Edge case: Multi-period time emits UserWarning advising binary post indicator
  • Edge case: Non-{0,1} binary time emits UserWarning (ATT still correct)
  • Edge case: ATT invariant to time encoding ({0,1} vs {2020,2021} produces identical results)

Key Implementation Detail: The interaction term D_i × Post_t must be within-transformed (demeaned) alongside the outcome, consistent with the Frisch-Waugh-Lovell (FWL) theorem: all regressors and the outcome must be projected out of the fixed effects space. R's fixest::feols() does this automatically when variables appear to the left of the | separator.
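The within-transformation and the demeaning of the interaction can be sketched on a balanced synthetic panel (illustrative only, not the package's API). Per FWL, regressing the demeaned outcome on the demeaned interaction recovers the TWFE ATT:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n_units, n_times = 50, 4
df = pd.DataFrame({
    "unit": np.repeat(np.arange(n_units), n_times),
    "time": np.tile(np.arange(n_times), n_units),
})
df["treated"] = (df["unit"] < 25).astype(int)
df["post"] = (df["time"] >= 2).astype(int)
df["D"] = df["treated"] * df["post"]
df["y"] = 2.0 * df["D"] + rng.normal(0, 0.1, len(df))  # true ATT = 2.0

def within(s: pd.Series) -> pd.Series:
    """Two-way within transform x_it - x̄_i - x̄_t + x̄ (balanced panel)."""
    return (s - s.groupby(df["unit"]).transform("mean")
              - s.groupby(df["time"]).transform("mean")
              + s.mean())

y_dm = within(df["y"])
d_dm = within(df["D"])  # the interaction must be demeaned too
att = float(d_dm @ y_dm) / float(d_dm @ d_dm)
```

Using the raw `df["D"]` in place of `d_dm` is exactly the bug described under Corrections Made: it only coincides with the correct estimate in 2-period panels where post equals period.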

Corrections Made:

  • Bug fix: interaction term must be within-transformed (found during review). The previous implementation used raw (un-demeaned) D_i × Post_t in the demeaned regression. This gave correct results only for 2-period panels where post == period. For multi-period panels (e.g., 4 periods with binary post), the raw interaction had incorrect correlation with demeaned Y, producing ATT approximately 1/3 of the true value. Fixed by applying the same within-transformation to the interaction term before regression. This matches R's fixest::feols() behavior. (twfe.py lines 99-113)

Outstanding Concerns:

  • Multi-period time parameter: Multi-period time values (e.g., 1,2,3,4) produce treated × period_number instead of treated × post_indicator, which is not the standard D_it treatment indicator. A UserWarning is emitted when time has >2 unique values. For binary time with non-{0,1} values (e.g., {2020, 2021}), the ATT is mathematically correct (the within-transformation absorbs the scaling), but a warning recommends 0/1 encoding for clarity. Users with multi-period data should create a binary post column.
  • Staggered treatment warning: The warning only fires when time has >2 unique values (i.e., actual period numbers). With binary time="post", all treated units appear to start treatment at time=1, making staggering undetectable. Users with staggered designs should use decompose() or CallawaySantAnna directly for proper diagnostics.

Modern Staggered Estimators

CallawaySantAnna

| Field | Value |
| --- | --- |
| Module | `staggered.py` |
| Primary Reference | Callaway & Sant'Anna (2021) |
| R Reference | `did::att_gt()` |
| Status | Complete |
| Last Review | 2026-01-24 |

Verified Components:

  • ATT(g,t) basic formula (hand-calculated exact match)
  • Doubly robust estimator
  • IPW estimator
  • Outcome regression
  • Base period selection (varying/universal)
  • Anticipation parameter handling
  • Simple/event-study/group aggregation
  • Analytical SE with weight influence function
  • Bootstrap SE (Rademacher/Mammen/Webb)
  • Control group composition (never_treated/not_yet_treated)
  • All documented edge cases from REGISTRY.md
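The hand-calculated ATT(g,t) check in the first bullet amounts to a group-vs-never-treated double difference from the base period g-1. A minimal sketch on a toy panel (hypothetical data; g=0 encodes never-treated here):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2, 3, 3, 3],
    "time": [1, 2, 3] * 3,
    "g":    [2, 2, 2, 2, 2, 2, 0, 0, 0],  # cohort first-treated period; 0 = never
    "y":    [1.0, 3.0, 4.0, 1.2, 3.1, 4.2, 1.0, 1.4, 1.6],
})

def att_gt(df, g, t):
    """Unconditional ATT(g,t) with never-treated controls and base period g-1."""
    base = g - 1
    treat = df[df["g"] == g]
    ctrl = df[df["g"] == 0]
    d_treat = (treat.loc[treat["time"] == t, "y"].mean()
               - treat.loc[treat["time"] == base, "y"].mean())
    d_ctrl = (ctrl.loc[ctrl["time"] == t, "y"].mean()
              - ctrl.loc[ctrl["time"] == base, "y"].mean())
    return d_treat - d_ctrl
```

Here `att_gt(df, 2, 2)` gives (3.05 - 1.1) - (1.4 - 1.0) = 1.55. The DR/IPW/OR estimators refine this by adjusting for covariates, but reduce to the same double difference in the unconditional case.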

Test Coverage:

  • 46 methodology verification tests in tests/test_methodology_callaway.py
  • 93 existing tests in tests/test_staggered.py
  • R benchmark tests (skip if R not available)

R Comparison Results:

  • Overall ATT matches within 20% (difference due to dynamic effects in generated data)
  • Post-treatment ATT(g,t) values match within 20%
  • Pre-treatment effects may differ due to base_period handling differences

Corrections Made:

  • (None - implementation verified correct)

Outstanding Concerns:

  • R comparison shows ~20% difference in overall ATT with generated data
    • Likely due to differences in how dynamic effects are handled in data generation
    • Individual ATT(g,t) values match closely for post-treatment periods
    • Further investigation recommended with real-world data
  • Pre-treatment ATT(g,t) may differ from R due to base_period="varying" semantics
    • Python uses t-1 as base for pre-treatment
    • R's behavior requires verification

Deviations from R's did::att_gt():

  1. NaN for invalid inference: When SE is non-finite or zero, Python returns NaN for t_stat/p_value rather than potentially erroring. This is a defensive enhancement.

Alignment with R's did::att_gt() (as of v2.1.5):

  1. Webb weights: Webb's 6-point distribution with values ±√(3/2), ±1, ±√(1/2) uses equal probabilities (1/6 each) matching R's did package. This gives E[w]=0, Var(w)=1.0, consistent with other bootstrap weight distributions.

    Verification: Our implementation matches the well-established fwildclusterboot R package (C++ source: wildboottest.cpp). The implementation uses sqrt(1.5), 1, sqrt(0.5) (and negatives) with equal 1/6 probabilities—identical to our values.

    Note on documentation discrepancy: Some documentation (e.g., fwildclusterboot vignette) describes Webb weights as "±1.5, ±1, ±0.5". This appears to be a simplification for readability. The actual implementations use ±√1.5, ±1, ±√0.5 which provides the required unit variance (Var(w) = 1.0).
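The Webb-weight moments claimed above are easy to verify directly (illustrative sketch):

```python
import numpy as np

# Webb's 6-point distribution as used by did / fwildclusterboot:
# values ±sqrt(3/2), ±1, ±sqrt(1/2), each with probability 1/6.
values = np.array([-np.sqrt(1.5), -1.0, -np.sqrt(0.5),
                   np.sqrt(0.5), 1.0, np.sqrt(1.5)])

# Exact moments: E[w] = 0 by symmetry; Var(w) = mean of squares
# = (1.5 + 1 + 0.5 + 0.5 + 1 + 1.5) / 6 = 1.0
assert np.isclose(values.mean(), 0.0)
assert np.isclose((values ** 2).mean(), 1.0)

rng = np.random.default_rng(0)
draws = rng.choice(values, size=100_000)
```

Note that the "±1.5, ±1, ±0.5" variant mentioned in some documentation would have Var(w) = (2.25 + 1 + 0.25) / 3 ≈ 1.17, breaking the unit-variance requirement, which is why the square-root values must be the ones actually implemented.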


SunAbraham

| Field | Value |
| --- | --- |
| Module | `sun_abraham.py` |
| Primary Reference | Sun & Abraham (2021) |
| R Reference | `fixest::sunab()` |
| Status | Complete |
| Last Review | 2026-02-15 |

Verified Components:

  • Saturated TWFE regression with cohort × relative-time interactions
  • Within-transformation for unit and time fixed effects
  • Interaction-weighted event study effects (δ̂_e = Σ_g ŵ_{g,e} × δ̂_{g,e})
  • IW weights match event-time sample shares (n_{g,e} / Σ_g n_{g,e})
  • Overall ATT as weighted average of post-treatment effects
  • Delta method SE for aggregated effects (Var = w' Σ w)
  • Cluster-robust SEs at unit level
  • Reference period normalized to zero (e=-1 excluded from design matrix)
  • R comparison: ATT matches fixest::sunab() within machine precision (<1e-11)
  • R comparison: SE matches within 0.3% (small scale) / 0.1% (1k scale)
  • R comparison: Event study effects correlation = 1.000000
  • R comparison: Event study max diff < 1e-11
  • Bootstrap inference (pairs bootstrap)
  • Rank deficiency handling (warn/error/silent)
  • All REGISTRY.md edge cases tested
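The interaction-weighted aggregation and delta-method SE verified above can be sketched with hypothetical cohort effects and VCV:

```python
import numpy as np

# Hypothetical cohort-specific effects at a fixed event time e
delta_ge = np.array([1.8, 2.2, 2.0])   # delta_hat_{g,e} for three cohorts
n_ge = np.array([120, 80, 100])        # observations per cohort at event time e

# IW weights: event-time sample shares n_{g,e} / sum_g n_{g,e}
w = n_ge / n_ge.sum()
delta_e = w @ delta_ge                 # aggregated event-study effect

# Delta-method SE: Var = w' Sigma w, Sigma = cohort-effect VCV (hypothetical)
Sigma = np.diag([0.04, 0.06, 0.05])
se_e = np.sqrt(w @ Sigma @ w)
```

For balanced panels the event-time shares coincide with cohort-size shares; correction 6 below matters precisely when the panel is unbalanced.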

Test Coverage:

  • 43 tests in tests/test_sun_abraham.py (36 existing + 7 methodology verification)
  • R benchmark tests via benchmarks/run_benchmarks.py --estimator sunab

R Comparison Results:

  • Overall ATT matches within machine precision (diff < 1e-11 at both scales)
  • Cluster-robust SE matches within 0.3% (well within 1% threshold)
  • Event study effects match perfectly (correlation 1.0, max diff < 1e-11)
  • Validated at small (200 units) and 1k (1000 units) scales

Corrections Made:

  1. DF adjustment for absorbed FE (sun_abraham.py, _fit_saturated_regression()): Added df_adjustment = n_units + n_times - 1 to LinearRegression.fit() to account for absorbed unit and time fixed effects in degrees of freedom. Unlike TWFE (which uses -2 plus an explicit intercept column), SunAbraham's saturated regression has no intercept, so all absorbed df must come from the adjustment. Affects t-distribution DoF for cohort-level p-values/CIs (slightly larger p-values, slightly wider CIs) but does NOT change VCV or SE values.

  2. NaN return for no post-treatment effects (sun_abraham.py, _compute_overall_att()): Changed return from (0.0, 0.0) to (np.nan, np.nan) when no post-treatment effects exist. All downstream inference fields (t_stat, p_value, conf_int) correctly propagate NaN via existing guards in fit().

  3. Deprecation warnings for unused parameters (sun_abraham.py, fit()): Added FutureWarning for min_pre_periods and min_post_periods parameters that are accepted but never used (no-op). These will be removed in a future version.

  4. Removed event-time truncation at [-20, 20] (sun_abraham.py): Removed the hardcoded cap max(min(...), -20) / min(max(...), 20) to match R's fixest::sunab() which has no such limit. All available relative times are now estimated.

  5. Warning for variance fallback path (sun_abraham.py, _compute_overall_att()): Added UserWarning when the full weight vector cannot be constructed and a simplified variance (ignoring covariances between periods) is used as fallback.

  6. IW weights use event-time sample shares (sun_abraham.py, _compute_iw_effects()): Changed IW weights from n_g / Σ_g n_g (cohort sizes) to n_{g,e} / Σ_g n_{g,e} (per-event-time observation counts) to match the REGISTRY.md formula. For balanced panels these are identical; for unbalanced panels the new formula correctly reflects actual sample composition at each event-time. Added unbalanced panel test.

  7. Normalize np.inf never-treated encoding (sun_abraham.py, fit()): first_treat=np.inf (documented as valid for never-treated) was included in treatment_groups and _rel_time via > 0 checks, producing -inf event times. Fixed by normalizing np.inf to 0 immediately after computing _never_treated. Same fix applied to staggered.py (CallawaySantAnna).

Outstanding Concerns:

  • Inference distribution: Cohort-level p-values use t-distribution (via LinearRegression.get_inference()), while aggregated event study and overall ATT p-values use normal distribution (via compute_p_value()). This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses t-distribution at all levels, so aggregated p-values may differ slightly for small samples — this is a documented deviation.

Deviations from R's fixest::sunab():

  1. NaN for no post-treatment effects: Python returns (NaN, NaN) for overall ATT/SE when no post-treatment effects exist. R would error.
  2. Normal distribution for aggregated inference: Aggregated p-values use normal distribution (asymptotically equivalent). R uses t-distribution.

StackedDiD

| Field | Value |
| --- | --- |
| Module | `stacked_did.py` |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | `stacked-did-weights` (`create_sub_exp()` + `compute_weights()`) |
| Status | Complete |
| Last Review | 2026-02-19 |

Verified Components:

  • IC1 trimming: a - kappa_pre >= T_min AND a + kappa_post <= T_max (matches R reference)
  • IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
  • Sub-experiment construction: treated + clean controls within [a - kappa_pre, a + kappa_post]
  • Q-weights aggregate: treated Q=1, control Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n) per (event_time, sub_exp) — matches R compute_weights()
  • Q-weights population: Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C) (Table 1, Row 2)
  • Q-weights sample_share: Q_a = ((N_a^D + N_a^C)/(N^D+N^C)) / (N_a^C / N^C) (Table 1, Row 3)
  • WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
  • Event study regression: Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U (Eq. 3)
  • Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
  • Delta-method SE for overall ATT: SE = sqrt(ones' @ sub_vcv @ ones) / K
  • Cluster-robust SEs at unit level (default) and unit×sub-experiment level
  • Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
  • Rank deficiency handling (warn/error/silent via solve_ols())
  • Never-treated encoding: both first_treat=0 and first_treat=inf handled
  • R comparison: ATT matches within machine precision (diff < 2.1e-11)
  • R comparison: SE matches within machine precision (diff < 4.0e-10)
  • R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
  • safe_inference() used for all inference fields
  • All REGISTRY.md edge cases tested
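The IC1 inclusion condition verified above can be sketched as a pure-Python predicate (illustrative; not the package's function names):

```python
# IC1: adoption event a is kept only if the full event window
# [a - kappa_pre, a + kappa_post] lies inside the observed panel [t_min, t_max].
def ic1_keep(a: int, kappa_pre: int, kappa_post: int, t_min: int, t_max: int) -> bool:
    return (a - kappa_pre >= t_min) and (a + kappa_post <= t_max)

# Panel observed over t = 1..8 with a 2-pre / 2-post window:
assert ic1_keep(3, 2, 2, 1, 8)       # window [1, 5] fits
assert not ic1_keep(2, 2, 2, 1, 8)   # window would start at t = 0
assert not ic1_keep(7, 2, 2, 1, 8)   # window would end at t = 9
```

Note the absence of the paper's extra `- 1` pre-period: per correction 1 below, the implementation follows the R reference's window `[a - kappa_pre, a + kappa_post]`.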

Test Coverage:

  • 72 tests in tests/test_stacked_did.py across 11 test classes:
    • TestStackedDiDBasic (8): fit, event study, group/all raises, simple aggregation, known constant effect, dynamic effects
    • TestTrimming (5): IC1 window, IC2 no-controls, trimmed groups reported, all-trimmed raises, wider window
    • TestQWeights (4): treated=1, aggregate formula, sample_share formula, positivity
    • TestCleanControl (5): not_yet_treated, strict, never_treated, missing never-treated raises
    • TestClustering (2): unit, unit_subexp
    • TestStackedData (4): accessible, required columns, event time range
    • TestEdgeCases (8): single cohort, anticipation, unbalanced panel, NaN inference, never-treated encodings
    • TestSklearnInterface (4): get_params, set_params, unknown raises, convenience function
    • TestResultsMethods (7): summary, to_dataframe, is_significant, significance_stars, repr
    • TestValidation (8): missing columns, invalid params, population required, no treated units
  • R benchmark tests via benchmarks/run_benchmarks.py --estimator stacked

R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):

| Metric | Python | R | Diff |
| --- | --- | --- | --- |
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | <1e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | <1e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | <1e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |

Corrections Made:

  1. IC1 lower bound and time window aligned with R reference (stacked_did.py, _trim_adoption_events() and _build_sub_experiment()): The paper text specifies time window [a - kappa_pre - 1, a + kappa_post] (including an extra pre-period), but the R reference implementation by co-author Hollingsworth uses [a - kappa_pre, a + kappa_post]. The extra period had no event-study dummy, altering the baseline regression. Fixed to match R: removed -1 from both IC1 check (a - kappa_pre >= T_min) and time window start. Discrepancy documented in docs/methodology/papers/wing-2024-review.md Gaps section.

  2. Q-weight computation: event-time-specific for aggregate weighting (stacked_did.py, _compute_q_weights()): Changed aggregate Q-weights from unit counts per sub-experiment to observation counts per (event_time, sub_exp), matching R reference compute_weights(). For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for varying observation density. Population/sample_share retain unit-count formulas (paper notation).

  3. Anticipation parameter: reference period and dummies (stacked_did.py, fit()): Reference period now shifts to e = -1 - anticipation. Event-time dummies cover the full window [-kappa_pre - anticipation, ..., kappa_post]. Post-treatment effects include anticipation periods. Consistent with ImputationDiD, TwoStageDiD, SunAbraham.

  4. Group aggregation removed (stacked_did.py): aggregate="group" and aggregate="all" removed. The pooled stacked regression cannot produce cohort-specific effects without cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.

  5. n_sub_experiments metadata (stacked_did.py, fit()): Now tracks actual built sub-experiments, not all events in omega_kappa. Warns if any sub-experiments are empty after data filtering.

Outstanding Concerns:

  • Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
  • Anticipation not validated against R (R reference doesn't test anticipation > 0)

Deviations from R's stacked-did-weights:

  1. NaN for invalid inference: Python returns NaN for t_stat/p_value/conf_int when SE is non-finite or zero. R would propagate through fixest::feols() error handling.
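The overall-ATT delta-method SE listed under Verified Components (SE = sqrt(ones' @ sub_vcv @ ones) / K) can be sketched with hypothetical numbers:

```python
import numpy as np

# Hypothetical post-treatment event-study effects and interaction sub-VCV
delta = np.array([2.10, 2.21, 2.52])
sub_vcv = np.array([[0.004, 0.001, 0.001],
                    [0.001, 0.005, 0.001],
                    [0.001, 0.001, 0.006]])
K = len(delta)
ones = np.ones(K)

# Overall ATT = simple average of the K post-period effects;
# its variance is ones' Sigma ones / K^2 by the delta method.
att = delta.mean()
se = np.sqrt(ones @ sub_vcv @ ones) / K
```

The off-diagonal terms matter: summing only the diagonal would understate the SE whenever the period effects are positively correlated, as they typically are.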

Advanced Estimators

SyntheticDiD

| Field | Value |
| --- | --- |
| Module | `synthetic_did.py` |
| Primary Reference | Arkhangelsky et al. (2021) |
| R Reference | `synthdid::synthdid_estimate()` |
| Status | Complete |
| Last Review | 2026-02-10 |

Corrections Made:

  1. Time weights: Frank-Wolfe on collapsed form (was heuristic inverse-distance). Replaced ad-hoc inverse-distance weighting with the Frank-Wolfe algorithm operating on the collapsed (N_co x T_pre) problem as specified in Algorithm 1 of Arkhangelsky et al. (2021), matching R's synthdid::fw.step().
  2. Unit weights: Frank-Wolfe with two-pass sparsification (was projected gradient descent with wrong penalty). Replaced projected gradient descent (which used an incorrect penalty formulation) with Frank-Wolfe optimization followed by two-pass sparsification, matching R's synthdid::sc.weight.fw() and sparsify_function().
  3. Auto-computed regularization from data noise level (was lambda_reg=0.0, zeta=1.0). Regularization parameters zeta_omega and zeta_lambda are now computed automatically from the data noise level (N_tr * sigma^2) as specified in Appendix D of Arkhangelsky et al. (2021), matching R's default behavior.
  4. Bootstrap SE uses fixed weights matching R's bootstrap_sample (was re-estimating all weights). The bootstrap variance procedure now holds unit and time weights fixed at their point estimates and only re-estimates the treatment effect, matching the approach in R's synthdid::bootstrap_sample().
  5. Default variance_method changed to "placebo" matching R's default. The R package uses placebo variance by default (synthdid_estimate returns an object whose vcov() uses the placebo method); our default now matches.
  6. Deprecated lambda_reg and zeta params; new params are zeta_omega and zeta_lambda. The old parameters had unclear semantics and did not correspond to the paper's notation. The new parameters directly match the paper and R package naming conventions. lambda_reg and zeta are deprecated with warnings and will be removed in a future release.
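For orientation, here is a generic Frank-Wolfe iteration on the unit simplex, the shape of the weight problems that corrections 1-2 switched to. This is a sketch of the algorithm family only, not `synthdid::fw.step()` (which adds regularization and sparsification):

```python
import numpy as np

def frank_wolfe_simplex(A, b, n_iter=2000):
    """Minimize 0.5 * ||A w - b||^2 over the probability simplex."""
    n = A.shape[1]
    w = np.full(n, 1.0 / n)            # start at the simplex barycenter
    for k in range(n_iter):
        grad = A.T @ (A @ w - b)       # gradient of the quadratic loss
        s = np.zeros(n)
        s[np.argmin(grad)] = 1.0       # linear minimizer over the simplex: a vertex
        gamma = 2.0 / (k + 2.0)        # standard diminishing step size
        w = (1 - gamma) * w + gamma * s
    return w

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 5))
w_true = np.array([0.5, 0.5, 0.0, 0.0, 0.0])
b = A @ w_true
w_hat = frank_wolfe_simplex(A, b)
```

Because every iterate is a convex combination of simplex vertices, feasibility (nonnegativity, weights summing to one) holds by construction at every step, with no projection needed.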

Outstanding Concerns:

  • (None)

TripleDifference

| Field | Value |
| --- | --- |
| Module | `triple_diff.py` |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) |
| R Reference | `triplediff::ddd()` (v0.2.1, CRAN) |
| Status | Complete |
| Last Review | 2026-02-18 |

Verified Components:

  • ATT matches R triplediff::ddd() for all 3 methods (DR, RA, IPW) — <0.001% relative difference
  • SE matches R triplediff::ddd() for all 3 methods — <0.001% relative difference
  • With-covariates ATT matches R — <0.001% relative difference
  • With-covariates SE matches R — <0.001% relative difference
  • Verified across all 4 DGP types from gen_dgp_2periods() (different model misspecification scenarios)
  • Influence function-based SE: SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)
  • Three-DiD decomposition: DDD = DiD_3 + DiD_2 - DiD_1 matching R's approach
  • safe_inference() used for all inference fields (t_stat, p_value, conf_int)
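The influence-function SE formula above can be sketched with hypothetical IF draws (illustrative only; the real influence functions come from the DR/RA/IPW pairwise DiD machinery):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
n_j = np.array([160, 170, 170])   # observations in each pairwise comparison

# Hypothetical per-observation influence functions for the three pairwise DiDs
if_1, if_2, if_3 = (rng.normal(0, 1, n) for _ in range(3))

# Weights w_j = n / n_j rescale each comparison's IF to the full sample
w1, w2, w3 = n / n_j

# DDD = DiD_3 + DiD_2 - DiD_1, so the combined IF takes the same signs
combined = w3 * if_3 + w2 * if_2 - w1 * if_1
se = np.std(combined, ddof=1) / np.sqrt(n)
```

The sign pattern mirrors the three-DiD decomposition: the first comparison enters negatively, matching DDD = DiD_3 + DiD_2 - DiD_1.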

Corrections Made:

  1. Complete rewrite of estimation methods (was naive cell-mean approach, now three-DiD decomposition). The original implementation computed DDD directly from 8 cell means with a naive cell-variance SE. Replaced with R's decomposition into three pairwise DiD comparisons (subgroup j vs reference subgroup 4), each using DR/IPW/RA methodology from Callaway & Sant'Anna. This fixed:
    • DR SE: was off by >100% (naive cell variance vs influence function)
    • IPW SE: was off by >200% (incorrect cell-probability-ratio weights)
    • With-covariates ATT: was off by >1000% for all methods (incorrect cell-by-cell regression)
  2. Influence function SE replaces naive cell variance for all methods: SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n) where w_j = n / n_j and IF_j is the per-observation influence function for pairwise DiD j.
  3. Propensity score estimation now runs per-pairwise-comparison (P(subgroup=4|X) within {j, 4} subset) instead of global P(G=1|X).
  4. Outcome regression now fits separate OLS per subgroup-time cell within each pairwise comparison, matching R's compute_outcome_regression_rc().

Outstanding Concerns:

  • Implementation uses panel=FALSE (repeated cross-section) mode. Panel mode (panel=TRUE) with differenced outcomes not yet implemented.

R Comparison Results (panel=FALSE, n=500 per DGP):

| DGP | Method | Covariates | ATT Diff | SE Diff |
| --- | --- | --- | --- | --- |
| 1 | DR | No | <0.001% | <0.001% |
| 1 | DR | Yes | <0.001% | <0.001% |
| 1 | REG | No | <0.001% | <0.001% |
| 1 | REG | Yes | <0.001% | <0.001% |
| 1 | IPW | No | <0.001% | <0.001% |
| 1 | IPW | Yes | <0.001% | <0.001% |
| 2-4 | All | Both | <0.001% | <0.001% |

TROP

| Field | Value |
| --- | --- |
| Module | `trop.py` |
| Primary Reference | Athey, Imbens, Qu & Viviano (2025) |
| R Reference | (forthcoming) |
| Status | Not Started |
| Last Review | - |

Corrections Made:

  • (None yet)

Outstanding Concerns:

  • (None yet)

Diagnostics & Sensitivity

BaconDecomposition

| Field | Value |
| --- | --- |
| Module | `bacon.py` |
| Primary Reference | Goodman-Bacon (2021) |
| R Reference | `bacondecomp::bacon()` |
| Status | Not Started |
| Last Review | - |

Corrections Made:

  • (None yet)

Outstanding Concerns:

  • (None yet)

HonestDiD

| Field | Value |
| --- | --- |
| Module | `honest_did.py` |
| Primary Reference | Rambachan & Roth (2023) |
| R Reference | HonestDiD package |
| Status | Not Started |
| Last Review | - |

Corrections Made:

  • (None yet)

Outstanding Concerns:

  • (None yet)

PreTrendsPower

| Field | Value |
| --- | --- |
| Module | `pretrends.py` |
| Primary Reference | Roth (2022) |
| R Reference | pretrends package |
| Status | Not Started |
| Last Review | - |

Corrections Made:

  • (None yet)

Outstanding Concerns:

  • (None yet)

PowerAnalysis

| Field | Value |
| --- | --- |
| Module | `power.py` |
| Primary Reference | Bloom (1995), Burlig et al. (2020) |
| R Reference | pwr / DeclareDesign |
| Status | Not Started |
| Last Review | - |

Corrections Made:

  • (None yet)

Outstanding Concerns:

  • (None yet)

Review Process Guidelines

Review Checklist

For each estimator, complete the following steps:

  • Read primary academic source - Review the key paper(s) cited in REGISTRY.md
  • Compare key equations - Verify implementation matches equations in REGISTRY.md
  • Run benchmark against R reference - Execute benchmarks/run_benchmarks.py --estimator <name> if available
  • Verify edge case handling - Check behavior matches REGISTRY.md documentation
  • Check standard error formula - Confirm SE computation matches reference
  • Document any deviations - Add notes explaining intentional differences with rationale

When to Update This Document

  1. After completing a review: Update status to "Complete" and add date
  2. When making corrections: Document what was fixed in the "Corrections Made" section
  3. When identifying issues: Add to "Outstanding Concerns" for future investigation
  4. When deviating from reference: Document the deviation and rationale

Deviation Documentation

When our implementation intentionally differs from the reference implementation, document:

  1. What differs: Specific behavior or formula that differs
  2. Why: Rationale (e.g., "defensive enhancement", "bug in R package", "follows updated paper")
  3. Impact: Whether results differ in practice
  4. Cross-reference: Update REGISTRY.md edge cases section

Example:

**Deviation (2025-01-15)**: CallawaySantAnna returns NaN for t_stat when SE is non-finite,
whereas R's `did::att_gt` would error. This is a defensive enhancement that provides
more graceful handling of edge cases while still signaling invalid inference to users.

Priority Order

Suggested order for reviews based on usage and complexity:

  1. High priority (most used, complex methodology):

    • CallawaySantAnna
    • SyntheticDiD
    • HonestDiD
  2. Medium priority (commonly used, simpler methodology):

    • DifferenceInDifferences
    • TwoWayFixedEffects
    • MultiPeriodDiD
    • SunAbraham
    • BaconDecomposition
  3. Lower priority (newer or less commonly used):

    • TripleDifference
    • TROP
    • PreTrendsPower
    • PowerAnalysis

Related Documents