This document tracks the progress of reviewing each estimator's implementation against the Methodology Registry and academic references. It ensures that implementations are correct, consistent, and well-documented.
For the methodology registry with academic foundations and key equations, see docs/methodology/REGISTRY.md.
Each estimator in diff-diff should be periodically reviewed to ensure:
- Correctness: Implementation matches the academic paper's equations
- Reference alignment: Behavior matches reference implementations (R packages, Stata commands)
- Edge case handling: Documented edge cases are handled correctly
- Standard errors: SE formulas match the documented approach
| Estimator | Module | R Reference | Status | Last Review |
|---|---|---|---|---|
| DifferenceInDifferences | `estimators.py` | `fixest::feols()` | Complete | 2026-01-24 |
| MultiPeriodDiD | `estimators.py` | `fixest::feols()` | Complete | 2026-02-02 |
| TwoWayFixedEffects | `twfe.py` | `fixest::feols()` | Complete | 2026-02-08 |
| CallawaySantAnna | `staggered.py` | `did::att_gt()` | Complete | 2026-01-24 |
| SunAbraham | `sun_abraham.py` | `fixest::sunab()` | Complete | 2026-02-15 |
| SyntheticDiD | `synthetic_did.py` | `synthdid::synthdid_estimate()` | Complete | 2026-02-10 |
| TripleDifference | `triple_diff.py` | `triplediff::ddd()` | Complete | 2026-02-18 |
| StackedDiD | `stacked_did.py` | `stacked-did-weights` | Complete | 2026-02-19 |
| TROP | `trop.py` | (forthcoming) | Not Started | - |
| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | Not Started | - |
| HonestDiD | `honest_did.py` | `HonestDiD` package | Not Started | - |
| PreTrendsPower | `pretrends.py` | `pretrends` package | Not Started | - |
| PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | Not Started | - |
Status legend:
- Not Started: No formal review conducted
- In Progress: Review underway
- Complete: Review finished, implementation verified
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT formula: Double-difference of cell means matches regression interaction coefficient
- R comparison: ATT matches `fixest::feols()` within 1e-3 tolerance
- R comparison: SE (HC1 robust) matches within 5%
- R comparison: P-value matches within 0.01
- R comparison: Confidence intervals overlap
- R comparison: Cluster-robust SE matches within 10%
- R comparison: Fixed effects (absorb) matches `feols(... | unit)` within 1%
- Wild bootstrap inference (Rademacher, Mammen, Webb weights)
- Formula interface (`y ~ treated * post`)
- All REGISTRY.md edge cases tested
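The first verified component above, the equivalence between the double-difference of cell means and the regression interaction coefficient, can be checked in a few lines of NumPy. This is a standalone sketch with simulated data, not the library's code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated 2x2 design: 100 units x 2 periods, half treated, true ATT = 2.0
rows = []
for d_i in np.repeat([0.0, 1.0], 50):
    for post_t in (0.0, 1.0):
        y_it = 1.0 + 0.5 * d_i + 0.3 * post_t + 2.0 * d_i * post_t + rng.normal(0, 0.1)
        rows.append((d_i, post_t, y_it))
d, post, y = (np.array(c) for c in zip(*rows))

def cell(dd, pp):
    return y[(d == dd) & (post == pp)].mean()

# Double-difference of cell means
att_cells = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))

# Saturated OLS: intercept, treated, post, interaction
X = np.column_stack([np.ones_like(y), d, post, d * post])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

assert np.isclose(att_cells, beta[3])  # identical in a saturated 2x2 model
```

The equality is exact (up to floating point) because the 2x2 model is saturated, which is why the review can verify the ATT formula against the interaction coefficient directly.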
Test Coverage:
- 53 methodology verification tests in `tests/test_methodology_did.py`
- 123 existing tests in `tests/test_estimators.py`
- R benchmark tests (skip if R not available)
R Comparison Results:
- ATT matches within 1e-3 (R JSON truncation limits precision)
- HC1 SE matches within 5%
- Cluster-robust SE matches within 10%
- Fixed effects results match within 1%
Corrections Made:
- (None - implementation verified correct)
Outstanding Concerns:
- R comparison precision limited by JSON output truncation (4 decimal places)
- Consider improving R script to output full precision for tighter tolerances
Edge Cases Verified:
- Empty cells: Produces rank deficiency warning (expected behavior)
- Singleton clusters: Included in variance estimation, contribute via residuals (corrected REGISTRY.md)
- Rank deficiency: All three modes (warn/error/silent) working
- Non-binary treatment/time: Raises ValueError as expected
- No variation in treatment/time: Raises ValueError as expected
- Missing values: Raises ValueError as expected
| Field | Value |
|---|---|
| Module | estimators.py |
| Primary Reference | Freyaldenhoven et al. (2021), Wooldridge (2010), Angrist & Pischke (2009) |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-02 |
Verified Components:
- Full event-study specification: treatment × period interactions for ALL non-reference periods (pre and post)
- Reference period coefficient is zero (normalized by omission from design matrix)
- Default reference period is last pre-period (e=-1 convention, matches fixest/did)
- Pre-period coefficients available for parallel trends assessment
- Average ATT computed from post-treatment effects only, with covariance-aware SE
- Returns PeriodEffect objects with confidence intervals for all periods
- Supports balanced and unbalanced panels
- NaN inference: t_stat/p_value/CI use NaN when SE is non-finite or zero
- R-style NA propagation: avg_att is NaN if any post-period effect is unidentified
- Rank-deficient design matrix: warns and sets NaN for dropped coefficients (R-style)
- Staggered adoption detection warning (via `unit` parameter)
- Treatment reversal detection warning
- Time-varying D_it detection warning (advises creating ever-treated indicator)
- Single pre-period warning (ATT valid but pre-trends assessment unavailable)
- Post-period reference_period raises ValueError (would bias avg_att)
- HonestDiD/PreTrendsPower integration uses interaction sub-VCV (not full regression VCV)
- All REGISTRY.md edge cases tested
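The "covariance-aware SE" for the averaged ATT can be sketched with illustrative numbers (in the estimator, the post-period effects and their sub-VCV come from the event-study regression):

```python
import numpy as np

# Illustrative post-period effect estimates and their covariance sub-matrix
# (placeholder values; the real ones come from the regression VCV).
effects = np.array([1.8, 2.1, 2.4])            # effects for e = 0, 1, 2
vcv = np.array([[0.04, 0.01, 0.00],
                [0.01, 0.05, 0.01],
                [0.00, 0.01, 0.06]])

w = np.full(len(effects), 1.0 / len(effects))  # equal weights over post periods
avg_att = w @ effects                          # = 2.1
se = np.sqrt(w @ vcv @ w)                      # covariance-aware delta-method SE

# Ignoring the positive covariances would understate the SE:
naive_se = np.sqrt(np.trace(vcv)) / len(effects)
assert se > naive_se
```

The off-diagonal terms of the sub-VCV matter: averaging correlated period effects while treating them as independent biases the SE.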
Test Coverage:
- 50 tests across `TestMultiPeriodDiD` and `TestMultiPeriodDiDEventStudy` in `tests/test_estimators.py`
- 18 new event-study specification tests added in PR #125
Corrections Made:
- PR #125 (2026-02-02): Transformed from post-period-only estimator into full event-study specification with pre-period coefficients. Reference period default changed from first pre-period to last pre-period (e=-1 convention). HonestDiD/PreTrendsPower VCV extraction fixed to use interaction sub-VCV instead of full regression VCV.
Outstanding Concerns:
- ~~No R comparison benchmarks yet~~ Resolved: R comparison benchmark added via `benchmarks/R/benchmark_multiperiod.R` using `fixest::feols(outcome ~ treated * time_f | unit)`. Results match R exactly: ATT diff < 1e-11, SE diff 0.0%, period effects correlation 1.0. Validated at small (200 units) and 1k scales.
- Default SE is HC1 (not cluster-robust at unit level as fixest uses). Cluster-robust SE is available via the `cluster` parameter but is not the default.
- Endpoint binning for distant event times not yet implemented.
- FutureWarning for reference_period default change should eventually be removed once the transition is complete.
| Field | Value |
|---|---|
| Module | twfe.py |
| Primary Reference | Wooldridge (2010), Ch. 10 |
| R Reference | fixest::feols() |
| Status | Complete |
| Last Review | 2026-02-08 |
Verified Components:
- Within-transformation algebra: `y_it - ȳ_i - ȳ_t + ȳ` matches hand calculation (rtol=1e-12)
- ATT matches manual demeaned OLS (rtol=1e-10)
- ATT matches `DifferenceInDifferences` on 2-period data (rtol=1e-10)
- Covariates are also within-transformed (sum to zero within unit/time groups)
- R comparison: ATT matches `fixest::feols(y ~ treated:post | unit + post, cluster=~unit)` (rtol<0.1%)
- R comparison: Cluster-robust SE matches (rtol<1%)
- R comparison: P-value matches (atol<0.01)
- R comparison: CI bounds match (rtol<1%)
- R comparison: ATT and SE match with covariate (same tolerances)
- Edge case: Staggered treatment triggers `UserWarning`
- Edge case: Auto-clusters at unit level (SE matches explicit `cluster="unit"`)
- Edge case: DF adjustment for absorbed FE matches manual `solve_ols()` with `df_adjustment`
- Edge case: Covariate collinear with interaction raises `ValueError("cannot be identified")`
- Edge case: Covariate collinearity warns but ATT remains finite
- Edge case: `rank_deficient_action="error"` raises `ValueError`
- Edge case: `rank_deficient_action="silent"` emits no warnings
- Edge case: Unbalanced panel produces valid results (finite ATT, positive SE)
- Edge case: Missing unit column raises `ValueError`
- Integration: `decompose()` returns `BaconDecompositionResults`
- SE: Cluster-robust SE >= HC1 SE
- SE: VCV positive semi-definite
- Wild bootstrap: Valid inference (finite SE, p-value in [0,1])
- Wild bootstrap: All weight types (rademacher, mammen, webb) produce valid inference
- Wild bootstrap: `inference="wild_bootstrap"` routes correctly
- Params: `get_params()` returns all inherited parameters
- Params: `set_params()` modifies attributes
- Results: `summary()` contains "ATT"
- Results: `to_dict()` contains att, se, t_stat, p_value, n_obs
- Results: residuals + fitted = demeaned outcome (not raw)
- Edge case: Multi-period time emits UserWarning advising binary post indicator
- Edge case: Non-{0,1} binary time emits UserWarning (ATT still correct)
- Edge case: ATT invariant to time encoding ({0,1} vs {2020,2021} produces identical results)
Key Implementation Detail:
The interaction term D_i × Post_t must be within-transformed (demeaned) alongside the outcome,
consistent with the Frisch-Waugh-Lovell (FWL) theorem: all regressors and the outcome must be
projected out of the fixed effects space. R's fixest::feols() does this automatically when
variables appear to the left of the | separator.
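The FWL requirement can be illustrated with a standalone NumPy sketch on simulated data (not the `twfe.py` code): two-way demeaning both the outcome and the interaction reproduces the dummy-variable TWFE coefficient exactly on a balanced panel.

```python
import numpy as np

rng = np.random.default_rng(1)

# Balanced panel: 40 units x 4 periods, binary post = (period >= 2), true effect 1.5
n_units, n_periods = 40, 4
unit = np.repeat(np.arange(n_units), n_periods)
period = np.tile(np.arange(n_periods), n_units)
d = (unit < 20).astype(float)             # ever-treated indicator
post = (period >= 2).astype(float)
y = 2.0 * d + 0.5 * period + 1.5 * d * post + 0.3 * rng.normal(size=unit.size)

def within(v):
    """Two-way demeaning on a balanced panel: v - unit mean - period mean + grand mean."""
    unit_mean = np.bincount(unit, weights=v)[unit] / n_periods
    period_mean = np.bincount(period, weights=v)[period] / n_units
    return v - unit_mean - period_mean + v.mean()

# FWL: the outcome AND the interaction are both within-transformed
y_t, x_t = within(y), within(d * post)
att = (x_t @ y_t) / (x_t @ x_t)

# Reference: explicit dummy-variable regression with unit and period FE
U = (unit[:, None] == np.arange(n_units)).astype(float)
P = (period[:, None] == np.arange(1, n_periods)).astype(float)
beta = np.linalg.lstsq(np.column_stack([d * post, U, P]), y, rcond=None)[0]

assert np.isclose(att, beta[0])  # demeaned-OLS slope equals the FE coefficient
```

Skipping `within(d * post)` and regressing demeaned y on the raw interaction breaks this equality on multi-period panels, which is exactly the bug described under Corrections Made.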
Corrections Made:
- Bug fix: interaction term must be within-transformed (found during review). The previous implementation used the raw (un-demeaned) `D_i × Post_t` in the demeaned regression. This gave correct results only for 2-period panels where `post == period`. For multi-period panels (e.g., 4 periods with binary `post`), the raw interaction had incorrect correlation with demeaned Y, producing an ATT approximately 1/3 of the true value. Fixed by applying the same within-transformation to the interaction term before regression, matching R's `fixest::feols()` behavior. (`twfe.py` lines 99-113)
Outstanding Concerns:
- Multi-period `time` parameter: Multi-period time values (e.g., 1,2,3,4) produce `treated × period_number` instead of `treated × post_indicator`, which is not the standard D_it treatment indicator. A `UserWarning` is emitted when `time` has >2 unique values. For binary time with non-{0,1} values (e.g., {2020, 2021}), the ATT is mathematically correct (the within-transformation absorbs the scaling), but a warning recommends 0/1 encoding for clarity. Users with multi-period data should create a binary `post` column.
- Staggered treatment warning: The warning only fires when `time` has >2 unique values (i.e., actual period numbers). With binary `time="post"`, all treated units appear to start treatment at `time=1`, making staggering undetectable. Users with staggered designs should use `decompose()` or `CallawaySantAnna` directly for proper diagnostics.
| Field | Value |
|---|---|
| Module | staggered.py |
| Primary Reference | Callaway & Sant'Anna (2021) |
| R Reference | did::att_gt() |
| Status | Complete |
| Last Review | 2026-01-24 |
Verified Components:
- ATT(g,t) basic formula (hand-calculated exact match)
- Doubly robust estimator
- IPW estimator
- Outcome regression
- Base period selection (varying/universal)
- Anticipation parameter handling
- Simple/event-study/group aggregation
- Analytical SE with weight influence function
- Bootstrap SE (Rademacher/Mammen/Webb)
- Control group composition (never_treated/not_yet_treated)
- All documented edge cases from REGISTRY.md
Test Coverage:
- 46 methodology verification tests in `tests/test_methodology_callaway.py`
- 93 existing tests in `tests/test_staggered.py`
- R benchmark tests (skip if R not available)
R Comparison Results:
- Overall ATT matches within 20% (difference due to dynamic effects in generated data)
- Post-treatment ATT(g,t) values match within 20%
- Pre-treatment effects may differ due to base_period handling differences
Corrections Made:
- (None - implementation verified correct)
Outstanding Concerns:
- R comparison shows ~20% difference in overall ATT with generated data
- Likely due to differences in how dynamic effects are handled in data generation
- Individual ATT(g,t) values match closely for post-treatment periods
- Further investigation recommended with real-world data
- Pre-treatment ATT(g,t) may differ from R due to base_period="varying" semantics
- Python uses t-1 as base for pre-treatment
- R's behavior requires verification
Deviations from R's did::att_gt():
- NaN for invalid inference: When SE is non-finite or zero, Python returns NaN for t_stat/p_value rather than potentially erroring. This is a defensive enhancement.
Alignment with R's did::att_gt() (as of v2.1.5):
- **Webb weights**: Webb's 6-point distribution with values ±√(3/2), ±1, ±√(1/2) uses equal probabilities (1/6 each), matching R's `did` package. This gives E[w]=0 and Var(w)=1.0, consistent with the other bootstrap weight distributions.
- **Verification**: Our implementation matches the well-established `fwildclusterboot` R package (C++ source: wildboottest.cpp), which uses `sqrt(1.5)`, `1`, `sqrt(0.5)` (and negatives) with equal 1/6 probabilities, identical to our values.
- **Note on documentation discrepancy**: Some documentation (e.g., the fwildclusterboot vignette) describes Webb weights as "±1.5, ±1, ±0.5". This appears to be a simplification for readability; the actual implementations use ±√1.5, ±1, ±√0.5, which provides the required unit variance (Var(w) = 1.0).
| Field | Value |
|---|---|
| Module | sun_abraham.py |
| Primary Reference | Sun & Abraham (2021) |
| R Reference | fixest::sunab() |
| Status | Complete |
| Last Review | 2026-02-15 |
Verified Components:
- Saturated TWFE regression with cohort × relative-time interactions
- Within-transformation for unit and time fixed effects
- Interaction-weighted event study effects (δ̂_e = Σ_g ŵ_{g,e} × δ̂_{g,e})
- IW weights match event-time sample shares (n_{g,e} / Σ_g n_{g,e})
- Overall ATT as weighted average of post-treatment effects
- Delta method SE for aggregated effects (Var = w' Σ w)
- Cluster-robust SEs at unit level
- Reference period normalized to zero (e=-1 excluded from design matrix)
- R comparison: ATT matches `fixest::sunab()` within machine precision (<1e-11)
- R comparison: SE matches within 0.3% (small scale) / 0.1% (1k scale)
- R comparison: Event study effects correlation = 1.000000
- R comparison: Event study max diff < 1e-11
- Bootstrap inference (pairs bootstrap)
- Rank deficiency handling (warn/error/silent)
- All REGISTRY.md edge cases tested
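The IW aggregation and its delta-method SE can be sketched numerically with illustrative cohort estimates and counts (the real δ̂_{g,e} values and sub-VCV come from the saturated regression):

```python
import numpy as np

# Illustrative cohort-level estimates at one event time e, with per-event-time
# observation counts n_{g,e} (placeholder values).
delta_ge = np.array([1.9, 2.2, 2.0])       # cohort effects at event time e
n_ge = np.array([120, 80, 50])             # observations per cohort at e
vcv_ge = np.diag([0.03, 0.05, 0.08])       # cohort-coefficient sub-VCV

w = n_ge / n_ge.sum()                      # IW weights: event-time sample shares
delta_e = w @ delta_ge                     # interaction-weighted effect at e
se_e = np.sqrt(w @ vcv_ge @ w)             # delta method: Var = w' Σ w
```

Note the weights are per-event-time shares `n_{g,e} / Σ_g n_{g,e}`, not cohort sizes; for unbalanced panels the two differ, which is the IW-weights correction listed below.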
Test Coverage:
- 43 tests in `tests/test_sun_abraham.py` (36 existing + 7 methodology verification)
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator sunab`
R Comparison Results:
- Overall ATT matches within machine precision (diff < 1e-11 at both scales)
- Cluster-robust SE matches within 0.3% (well within 1% threshold)
- Event study effects match perfectly (correlation 1.0, max diff < 1e-11)
- Validated at small (200 units) and 1k (1000 units) scales
Corrections Made:
- **DF adjustment for absorbed FE** (`sun_abraham.py`, `_fit_saturated_regression()`): Added `df_adjustment = n_units + n_times - 1` to `LinearRegression.fit()` to account for absorbed unit and time fixed effects in degrees of freedom. Unlike TWFE (which uses `-2` plus an explicit intercept column), SunAbraham's saturated regression has no intercept, so all absorbed df must come from the adjustment. Affects t-distribution DoF for cohort-level p-values/CIs (slightly larger p-values, slightly wider CIs) but does NOT change VCV or SE values.
- **NaN return for no post-treatment effects** (`sun_abraham.py`, `_compute_overall_att()`): Changed return from `(0.0, 0.0)` to `(np.nan, np.nan)` when no post-treatment effects exist. All downstream inference fields (t_stat, p_value, conf_int) correctly propagate NaN via existing guards in `fit()`.
- **Deprecation warnings for unused parameters** (`sun_abraham.py`, `fit()`): Added `FutureWarning` for the `min_pre_periods` and `min_post_periods` parameters, which are accepted but never used (no-op). These will be removed in a future version.
- **Removed event-time truncation at [-20, 20]** (`sun_abraham.py`): Removed the hardcoded cap `max(min(...), -20)` / `min(max(...), 20)` to match R's `fixest::sunab()`, which has no such limit. All available relative times are now estimated.
- **Warning for variance fallback path** (`sun_abraham.py`, `_compute_overall_att()`): Added `UserWarning` when the full weight vector cannot be constructed and a simplified variance (ignoring covariances between periods) is used as a fallback.
- **IW weights use event-time sample shares** (`sun_abraham.py`, `_compute_iw_effects()`): Changed IW weights from `n_g / Σ_g n_g` (cohort sizes) to `n_{g,e} / Σ_g n_{g,e}` (per-event-time observation counts) to match the REGISTRY.md formula. For balanced panels these are identical; for unbalanced panels the new formula correctly reflects actual sample composition at each event time. Added an unbalanced panel test.
- **Normalize `np.inf` never-treated encoding** (`sun_abraham.py`, `fit()`): `first_treat=np.inf` (documented as valid for never-treated) was included in `treatment_groups` and `_rel_time` via `> 0` checks, producing `-inf` event times. Fixed by normalizing `np.inf` to `0` immediately after computing `_never_treated`. The same fix was applied to `staggered.py` (CallawaySantAnna).
Outstanding Concerns:
- Inference distribution: Cohort-level p-values use the t-distribution (via `LinearRegression.get_inference()`), while aggregated event study and overall ATT p-values use the normal distribution (via `compute_p_value()`). This is asymptotically equivalent and standard for delta-method-aggregated quantities. R's fixest uses the t-distribution at all levels, so aggregated p-values may differ slightly for small samples; this is a documented deviation.
Deviations from R's fixest::sunab():
- NaN for no post-treatment effects: Python returns `(NaN, NaN)` for overall ATT/SE when no post-treatment effects exist. R would error.
- Normal distribution for aggregated inference: Aggregated p-values use the normal distribution (asymptotically equivalent). R uses the t-distribution.
| Field | Value |
|---|---|
| Module | stacked_did.py |
| Primary Reference | Wing, Freedman & Hollingsworth (2024), NBER WP 32054 |
| R Reference | stacked-did-weights (create_sub_exp() + compute_weights()) |
| Status | Complete |
| Last Review | 2026-02-19 |
Verified Components:
- IC1 trimming: `a - kappa_pre >= T_min AND a + kappa_post <= T_max` (matches R reference)
- IC2 trimming: Three clean control modes (not_yet_treated, strict, never_treated)
- Sub-experiment construction: treated + clean controls within `[a - kappa_pre, a + kappa_post]`
- Q-weights aggregate: treated Q=1, control `Q = (sub_treat_n/stack_treat_n) / (sub_control_n/stack_control_n)` per (event_time, sub_exp), matching R `compute_weights()`
- Q-weights population: `Q_a = (Pop_a^D / Pop^D) / (N_a^C / N^C)` (Table 1, Row 2)
- Q-weights sample_share: `Q_a = ((N_a^D + N_a^C)/(N^D + N^C)) / (N_a^C / N^C)` (Table 1, Row 3)
- WLS via sqrt(w) transformation (numerically equivalent to weighted regression)
- Event study regression: `Y = α_0 + α_1·D_sa + Σ_{h≠-1}[λ_h·1(e=h) + δ_h·D_sa·1(e=h)] + U` (Eq. 3)
- Reference period e=-1-anticipation normalized to zero (omitted from design matrix)
- Delta-method SE for overall ATT: `SE = sqrt(ones' @ sub_vcv @ ones) / K`
- Cluster-robust SEs at unit level (default) and unit×sub-experiment level
- Anticipation parameter: reference period shifts to e=-1-anticipation, post-treatment includes anticipation periods
- Rank deficiency handling (warn/error/silent via `solve_ols()`)
- Never-treated encoding: both `first_treat=0` and `first_treat=inf` handled
- R comparison: ATT matches within machine precision (diff < 2.1e-11)
- R comparison: SE matches within machine precision (diff < 4.0e-10)
- R comparison: Event study effects correlation = 1.000000, max diff < 4.5e-11
- `safe_inference()` used for all inference fields
- All REGISTRY.md edge cases tested
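A worked instance of the aggregate Q-weight formula above, with illustrative counts:

```python
# Aggregate Q-weight for control observations in one (event_time, sub_exp) cell,
# using the formula above with made-up counts.
sub_treat_n, stack_treat_n = 40, 100       # treated obs: this sub-exp / whole stack
sub_control_n, stack_control_n = 60, 300   # control obs: this sub-exp / whole stack

q_treated = 1.0                            # treated observations always get Q = 1
q_control = (sub_treat_n / stack_treat_n) / (sub_control_n / stack_control_n)

# q_control = 0.4 / 0.2 = 2.0: controls in this cell are up-weighted because the
# sub-experiment holds a larger share of the stack's treated than of its controls.
```

The weight rebalances controls so each sub-experiment's control sample mirrors the distribution of treated observations across sub-experiments.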
Test Coverage:
- 72 tests in `tests/test_stacked_did.py` across 11 test classes:
  - `TestStackedDiDBasic` (8): fit, event study, group/all raises, simple aggregation, known constant effect, dynamic effects
  - `TestTrimming` (5): IC1 window, IC2 no-controls, trimmed groups reported, all-trimmed raises, wider window
  - `TestQWeights` (4): treated=1, aggregate formula, sample_share formula, positivity
  - `TestCleanControl` (5): not_yet_treated, strict, never_treated, missing never-treated raises
  - `TestClustering` (2): unit, unit_subexp
  - `TestStackedData` (4): accessible, required columns, event time range
  - `TestEdgeCases` (8): single cohort, anticipation, unbalanced panel, NaN inference, never-treated encodings
  - `TestSklearnInterface` (4): get_params, set_params, unknown raises, convenience function
  - `TestResultsMethods` (7): summary, to_dataframe, is_significant, significance_stars, repr
  - `TestValidation` (8): missing columns, invalid params, population required, no treated units
- R benchmark tests via `benchmarks/run_benchmarks.py --estimator stacked`
R Comparison Results (200 units, 8 periods, kappa_pre=2, kappa_post=2):
| Metric | Python | R | Diff |
|---|---|---|---|
| Overall ATT | 2.277699574579 | 2.2776995746 | 2.1e-11 |
| Overall SE | 0.062045687626 | 0.062045688027 | 4.0e-10 |
| ES e=-2 ATT | 0.044517975379 | 0.044517975379 | <1e-12 |
| ES e=0 ATT | 2.104181683763 | 2.104181683800 | <1e-11 |
| ES e=1 ATT | 2.209990715130 | 2.209990715100 | <1e-11 |
| ES e=2 ATT | 2.518926324845 | 2.518926324800 | <1e-11 |
| Stacked obs | 1600 | 1600 | exact |
| Sub-experiments | 3 | 3 | exact |
Corrections Made:
- **IC1 lower bound and time window aligned with R reference** (`stacked_did.py`, `_trim_adoption_events()` and `_build_sub_experiment()`): The paper text specifies time window `[a - kappa_pre - 1, a + kappa_post]` (including an extra pre-period), but the R reference implementation by co-author Hollingsworth uses `[a - kappa_pre, a + kappa_post]`. The extra period had no event-study dummy, altering the baseline regression. Fixed to match R: removed `-1` from both the IC1 check (`a - kappa_pre >= T_min`) and the time window start. Discrepancy documented in the `docs/methodology/papers/wing-2024-review.md` Gaps section.
- **Q-weight computation: event-time-specific for aggregate weighting** (`stacked_did.py`, `_compute_q_weights()`): Changed aggregate Q-weights from unit counts per sub-experiment to observation counts per (event_time, sub_exp), matching the R reference `compute_weights()`. For balanced panels, results are unchanged. For unbalanced panels, weights now adjust for varying observation density. Population/sample_share retain unit-count formulas (paper notation).
- **Anticipation parameter: reference period and dummies** (`stacked_did.py`, `fit()`): The reference period now shifts to `e = -1 - anticipation`. Event-time dummies cover the full window `[-kappa_pre - anticipation, ..., kappa_post]`. Post-treatment effects include anticipation periods. Consistent with ImputationDiD, TwoStageDiD, and SunAbraham.
- **Group aggregation removed** (`stacked_did.py`): `aggregate="group"` and `aggregate="all"` removed. The pooled stacked regression cannot produce cohort-specific effects without cohort×event-time interactions. Use CallawaySantAnna or ImputationDiD for cohort-level estimates.
- **n_sub_experiments metadata** (`stacked_did.py`, `fit()`): Now tracks the sub-experiments actually built, not all events in omega_kappa. Warns if any sub-experiments are empty after data filtering.
Outstanding Concerns:
- Population/sample_share Q-weights use paper's unit-count formulas (no R reference to validate)
- Anticipation not validated against R (R reference doesn't test anticipation > 0)
Deviations from R's stacked-did-weights:
- NaN for invalid inference: Python returns NaN for t_stat/p_value/conf_int when SE is non-finite or zero. R would propagate through `fixest::feols()` error handling.
| Field | Value |
|---|---|
| Module | synthetic_did.py |
| Primary Reference | Arkhangelsky et al. (2021) |
| R Reference | synthdid::synthdid_estimate() |
| Status | Complete |
| Last Review | 2026-02-10 |
Corrections Made:
- **Time weights: Frank-Wolfe on collapsed form** (was heuristic inverse-distance). Replaced ad-hoc inverse-distance weighting with the Frank-Wolfe algorithm operating on the collapsed (N_co × T_pre) problem as specified in Algorithm 1 of Arkhangelsky et al. (2021), matching R's `synthdid::fw.step()`.
- **Unit weights: Frank-Wolfe with two-pass sparsification** (was projected gradient descent with a wrong penalty). Replaced projected gradient descent (which used an incorrect penalty formulation) with Frank-Wolfe optimization followed by two-pass sparsification, matching R's `synthdid::sc.weight.fw()` and `sparsify_function()`.
- **Auto-computed regularization from data noise level** (was `lambda_reg=0.0`, `zeta=1.0`). Regularization parameters `zeta_omega` and `zeta_lambda` are now computed automatically from the data noise level (N_tr * sigma^2) as specified in Appendix D of Arkhangelsky et al. (2021), matching R's default behavior.
- **Bootstrap SE uses fixed weights matching R's `bootstrap_sample`** (was re-estimating all weights). The bootstrap variance procedure now holds unit and time weights fixed at their point estimates and only re-estimates the treatment effect, matching the approach in R's `synthdid::bootstrap_sample()`.
- **Default `variance_method` changed to `"placebo"`**, matching R's default. The R package uses placebo variance by default (`synthdid_estimate` returns an object whose `vcov()` uses the placebo method); our default now matches.
- **Deprecated `lambda_reg` and `zeta` params; new params are `zeta_omega` and `zeta_lambda`.** The old parameters had unclear semantics and did not correspond to the paper's notation. The new parameters directly match the paper and R package naming conventions. `lambda_reg` and `zeta` are deprecated with warnings and will be removed in a future release.
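For intuition, here is a simplified Frank-Wolfe loop for a simplex-constrained least-squares problem. This is a sketch of the algorithmic idea only; the actual implementation follows Algorithm 1 with the ridge penalty and the collapsed-form objective:

```python
import numpy as np

def frank_wolfe_simplex(A, b, n_iter=500):
    """Minimize ||A w - b||^2 over the probability simplex (sketch only;
    the real SDID objective adds a ridge penalty zeta^2 * ||w||^2)."""
    w = np.full(A.shape[1], 1.0 / A.shape[1])     # start at the simplex center
    for _ in range(n_iter):
        grad = 2.0 * A.T @ (A @ w - b)
        s = np.zeros_like(w)
        s[np.argmin(grad)] = 1.0                  # linear minimizer: a vertex
        d = s - w
        Ad = A @ d
        denom = Ad @ Ad
        if denom == 0:
            break
        # exact line search for the quadratic objective, clipped to [0, 1]
        gamma = np.clip(-((A @ w - b) @ Ad) / denom, 0.0, 1.0)
        w = w + gamma * d                         # stays on the simplex
    return w

# Toy check: b is an exact convex combination of A's columns
A = np.eye(2)
b = np.array([0.3, 0.7])
w = frank_wolfe_simplex(A, b)                     # converges to (0.3, 0.7)
```

Each step moves toward a single vertex of the simplex, which is what makes the resulting weights naturally sparse and why the R package pairs it with a sparsification pass.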
Outstanding Concerns:
- (None)
| Field | Value |
|---|---|
| Module | triple_diff.py |
| Primary Reference | Ortiz-Villavicencio & Sant'Anna (2025) |
| R Reference | triplediff::ddd() (v0.2.1, CRAN) |
| Status | Complete |
| Last Review | 2026-02-18 |
Verified Components:
- ATT matches R `triplediff::ddd()` for all 3 methods (DR, RA, IPW): <0.001% relative difference
- SE matches R `triplediff::ddd()` for all 3 methods: <0.001% relative difference
- With-covariates ATT matches R: <0.001% relative difference
- With-covariates SE matches R: <0.001% relative difference
- Verified across all 4 DGP types from `gen_dgp_2periods()` (different model misspecification scenarios)
- Influence function-based SE: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)`
- Three-DiD decomposition: `DDD = DiD_3 + DiD_2 - DiD_1`, matching R's approach
- `safe_inference()` used for all inference fields (t_stat, p_value, conf_int)
Corrections Made:
- Complete rewrite of estimation methods (was a naive cell-mean approach, now a three-DiD decomposition). The original implementation computed DDD directly from 8 cell means with a naive cell-variance SE. Replaced with R's decomposition into three pairwise DiD comparisons (subgroup j vs reference subgroup 4), each using DR/IPW/RA methodology from Callaway & Sant'Anna. This fixed:
  - DR SE: was off by >100% (naive cell variance vs influence function)
  - IPW SE: was off by >200% (incorrect cell-probability-ratio weights)
  - With-covariates ATT: was off by >1000% for all methods (incorrect cell-by-cell regression)
- Influence function SE replaces naive cell variance for all methods: `SE = std(w3*IF_3 + w2*IF_2 - w1*IF_1, ddof=1) / sqrt(n)`, where `w_j = n / n_j` and `IF_j` is the per-observation influence function for pairwise DiD j.
- Propensity score estimation now runs per pairwise comparison (P(subgroup=4|X) within the {j, 4} subset) instead of global P(G=1|X).
- Outcome regression now fits a separate OLS per subgroup-time cell within each pairwise comparison, matching R's `compute_outcome_regression_rc()`.
Outstanding Concerns:
- Implementation uses `panel=FALSE` (repeated cross-section) mode. Panel mode (`panel=TRUE`) with differenced outcomes is not yet implemented.
R Comparison Results (panel=FALSE, n=500 per DGP):
| DGP | Method | Covariates | ATT Diff | SE Diff |
|---|---|---|---|---|
| 1 | DR | No | <0.001% | <0.001% |
| 1 | DR | Yes | <0.001% | <0.001% |
| 1 | REG | No | <0.001% | <0.001% |
| 1 | REG | Yes | <0.001% | <0.001% |
| 1 | IPW | No | <0.001% | <0.001% |
| 1 | IPW | Yes | <0.001% | <0.001% |
| 2-4 | All | Both | <0.001% | <0.001% |
| Field | Value |
|---|---|
| Module | trop.py |
| Primary Reference | Athey, Imbens, Qu & Viviano (2025) |
| R Reference | (forthcoming) |
| Status | Not Started |
| Last Review | - |
Corrections Made:
- (None yet)
Outstanding Concerns:
- (None yet)
| Field | Value |
|---|---|
| Module | bacon.py |
| Primary Reference | Goodman-Bacon (2021) |
| R Reference | bacondecomp::bacon() |
| Status | Not Started |
| Last Review | - |
Corrections Made:
- (None yet)
Outstanding Concerns:
- (None yet)
| Field | Value |
|---|---|
| Module | honest_did.py |
| Primary Reference | Rambachan & Roth (2023) |
| R Reference | HonestDiD package |
| Status | Not Started |
| Last Review | - |
Corrections Made:
- (None yet)
Outstanding Concerns:
- (None yet)
| Field | Value |
|---|---|
| Module | pretrends.py |
| Primary Reference | Roth (2022) |
| R Reference | pretrends package |
| Status | Not Started |
| Last Review | - |
Corrections Made:
- (None yet)
Outstanding Concerns:
- (None yet)
| Field | Value |
|---|---|
| Module | power.py |
| Primary Reference | Bloom (1995), Burlig et al. (2020) |
| R Reference | pwr / DeclareDesign |
| Status | Not Started |
| Last Review | - |
Corrections Made:
- (None yet)
Outstanding Concerns:
- (None yet)
For each estimator, complete the following steps:
1. **Read primary academic source** - Review the key paper(s) cited in REGISTRY.md
2. **Compare key equations** - Verify the implementation matches the equations in REGISTRY.md
3. **Run benchmark against R reference** - Execute `benchmarks/run_benchmarks.py --estimator <name>` if available
4. **Verify edge case handling** - Check behavior matches REGISTRY.md documentation
5. **Check standard error formula** - Confirm SE computation matches the reference
6. **Document any deviations** - Add notes explaining intentional differences, with rationale
- After completing a review: Update status to "Complete" and add date
- When making corrections: Document what was fixed in the "Corrections Made" section
- When identifying issues: Add to "Outstanding Concerns" for future investigation
- When deviating from reference: Document the deviation and rationale
When our implementation intentionally differs from the reference implementation, document:
- What differs: Specific behavior or formula that differs
- Why: Rationale (e.g., "defensive enhancement", "bug in R package", "follows updated paper")
- Impact: Whether results differ in practice
- Cross-reference: Update REGISTRY.md edge cases section
Example:
**Deviation (2025-01-15)**: CallawaySantAnna returns NaN for t_stat when SE is non-finite,
whereas R's `did::att_gt` would error. This is a defensive enhancement that provides
more graceful handling of edge cases while still signaling invalid inference to users.
Suggested order for reviews based on usage and complexity:
- High priority (most used, complex methodology):
  - CallawaySantAnna
  - SyntheticDiD
  - HonestDiD
- Medium priority (commonly used, simpler methodology):
  - DifferenceInDifferences
  - TwoWayFixedEffects
  - MultiPeriodDiD
  - SunAbraham
  - BaconDecomposition
- Lower priority (newer or less commonly used):
  - TripleDifference
  - TROP
  - PreTrendsPower
  - PowerAnalysis
- REGISTRY.md - Academic foundations and key equations
- ROADMAP.md - Feature roadmap
- TODO.md - Technical debt tracking
- CLAUDE.md - Development guidelines