Skip to content

Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance#240

Open
igerber wants to merge 19 commits intomainfrom
survey-phase-seven
Open

Survey Phase 7: CS IPW/DR covariates, repeated cross-sections, HonestDiD survey variance#240
igerber wants to merge 19 commits intomainfrom
survey-phase-seven

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented Mar 28, 2026

Summary

  • Phase 7a: Remove NotImplementedError gate for CallawaySantAnna IPW/DR + covariates + survey. Implement DRDID panel nuisance IF corrections (propensity score + outcome regression) for both survey-weighted and non-survey DR paths (Sant'Anna & Zhao 2020, Theorem 3.1). Extract _safe_inv() helper for matrix inversions.
  • Phase 7d: Thread survey degrees of freedom through HonestDiD for t-distribution critical values. Compute full event-study variance-covariance matrix from influence function vectors in CallawaySantAnna aggregation. Add event_study_vcov field to CallawaySantAnnaResults and survey_metadata/df_survey to HonestDiDResults.
  • Phase 7b: Add panel=False for repeated cross-section support in CallawaySantAnna. New _precompute_structures_rc(), _compute_att_gt_rc(), and three RC estimation methods (_outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc) with covariates and survey weights. Canonical index abstraction in aggregation/bootstrap mixins. RCS data generator via generate_staggered_data(panel=False).

Methodology references

  • Method name(s): Callaway-Sant'Anna (2021), Sant'Anna & Zhao (2020) DRDID panel/cross-section, Rambachan & Roth (2023) HonestDiD
  • Paper / source link(s):
    • Sant'Anna, P.H.C. & Zhao, J. (2020). "Doubly Robust Difference-in-Differences Estimators." J. Econometrics 219(1). Theorem 3.1 (panel IF corrections), Section 4 (cross-sectional DRDID).
    • Callaway, B. & Sant'Anna, P.H.C. (2021). "Difference-in-Differences with Multiple Time Periods." J. Econometrics 225(2). Section 4.1 (repeated cross-sections).
    • Rambachan, A. & Roth, J. (2023). "A More Credible Approach to Parallel Trends." Rev. Econ. Studies 90(5).
  • Intentional deviations: DR nuisance IF corrections use the same survey-weighted Hessian/score pattern as the existing IPW path. Non-survey DR path also receives IF corrections (was plug-in only). Per-cell SEs remain IF-based (not full TSL) — documented in REGISTRY.md. Event-study VCV under replicate weights falls back to diagonal (multivariate replicate VCV deferred).

Validation

  • Tests added/updated:
    • tests/test_survey_phase7a.py (22 tests): smoke, scale invariance, uniform-weight equivalence, IF correction, aggregation, bootstrap, edge cases
    • tests/test_staggered_rc.py (23 tests): all methods, covariates, survey, aggregation, bootstrap, control groups, base periods, data generator, edge cases
    • tests/test_honest_did.py (+4 tests): survey df extraction, VCV computation, bounds widening, no-survey baseline
    • tests/test_survey_phase4.py: 2 negative tests converted to positive assertions
  • Full test suite: 365 tests pass across all affected files (0 failures)

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

…DiD survey variance

Phase 7a: Remove NotImplementedError gate for IPW/DR + covariates + survey.
Add DRDID panel nuisance IF corrections (PS + OR) for both survey and
non-survey DR paths. Extract _safe_inv helper for matrix inversions.

Phase 7d: Thread survey df through HonestDiD for t-distribution critical
values. Compute full event-study VCV from influence function vectors.
Add event_study_vcov to CallawaySantAnnaResults.

Phase 7b: Add panel=False for repeated cross-section support in
CallawaySantAnna. New _precompute_structures_rc, _compute_att_gt_rc,
and three RC estimation methods (reg, ipw, dr) with covariates and
survey weights. Canonical index abstraction in aggregation/bootstrap.
RCS data generator in generate_staggered_data(panel=False).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1: the new repeated-cross-section CallawaySantAnna paths do not fully match the documented/source-method inference contract, and the new HonestDiD survey covariance plumbing diverges from the registry for replicate-weight designs.

Executive summary

  • Affected methods: Callaway-Sant'Anna (panel=False, IPW/DR, aggregation) and HonestDiD (survey-aware event-study covariance).
  • The Phase 7a panel DR survey changes look broadly aligned with the updated registry; I did not find a blocker in the new panel nuisance-correction code itself.
  • The new repeated-cross-section IPW/DR analytic SEs are still plug-in only and do not implement the cross-sectional nuisance-estimation IF corrections the registry says are supported.
  • For panel=False, unweighted simple/event-study aggregation uses time-specific treated-cell counts while the WIF code uses full-sample cohort shares, so the weighting contract is internally inconsistent.
  • The new HonestDiD covariance path for replicate-weight surveys does not implement the documented diagonal fallback; it builds a full Psi'Psi covariance and passes that to HonestDiD.
  • The new public panel parameter is not propagated into CallawaySantAnnaResults, and result summaries still label repeated-cross-section observation counts as “units”.

Methodology

Cross-check basis: docs/methodology/REGISTRY.md:L291-L319, docs/methodology/REGISTRY.md:L419-L424, and docs/methodology/REGISTRY.md:L1633-L1637.

  1. Severity: P1. diff_diff/staggered.py:L2872-L2978, diff_diff/staggered.py:L2980-L3127, docs/methodology/REGISTRY.md:L423-L424.
    Impact: the new repeated-cross-section IPW and DR analytic inference does not include nuisance-estimation IF corrections. _ipw_estimation_rc() computes SE from the plug-in IF only, and _doubly_robust_rc() likewise stops at the plug-in IF. That is a mismatch with the registry’s claim that panel=False uses Section 4 cross-sectional DRDID with per-observation influence functions, and it is notably weaker than the panel IPW/DR code in the same file, which now adds explicit PS/OR correction terms. The result is understated or otherwise incorrect SEs/CIs/p-values for covariate-adjusted RCS IPW/DR.
    Concrete fix: implement the Section 4 cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR, or explicitly document the deviation in REGISTRY.md and disable analytic inference for those branches until the correct IF is in place.

  2. Severity: P1. diff_diff/staggered_aggregation.py:L37-L152, diff_diff/staggered_aggregation.py:L289-L314, diff_diff/staggered_aggregation.py:L574-L645, diff_diff/staggered_bootstrap.py:L223-L267, diff_diff/staggered_bootstrap.py:L560-L657.
    Impact: the new unweighted panel=False aggregation uses data["n_treated"] from each (g,t) cell as the aggregation weight, but the WIF path for the same estimator computes pg from full-sample cohort counts. In panel data those coincide because cohort size is constant across t; in repeated cross-sections they generally do not. That means the point estimate, WIF correction, and bootstrap aggregation are no longer using the same weight definition. This changes the estimand/finite-sample weighting and makes the SE formula internally inconsistent with the aggregated estimator.
    Concrete fix: precompute fixed cohort masses for panel=False once from the full repeated-cross-section sample, then use those same cohort masses everywhere simple/event-study/bootstrap weights are formed.

  3. Severity: P1. diff_diff/staggered_aggregation.py:L710-L739, diff_diff/honest_did.py:L664-L669, docs/methodology/REGISTRY.md:L1637-L1637.
    Impact: the registry explicitly says replicate-weight event-study covariance should fall back to a diagonal matrix until multivariate replicate VCV is implemented, but _aggregate_event_study() currently builds a full Psi.T @ Psi matrix for all non-TSL cases, which includes replicate-weight designs. HonestDiD then consumes that full matrix whenever event_study_vcov is present. That is an undocumented methodology mismatch and can change HonestDiD bounds under replicate designs without warning.
    Concrete fix: when uses_replicate_variance is true, do not populate a full off-diagonal event_study_vcov; set it to None or an explicit diagonal-from-SEs fallback until a proper multivariate replicate covariance estimator is implemented and validated.

Code Quality

No additional findings beyond the methodology issues above.

Performance

No findings.

Maintainability

  1. Severity: P1. diff_diff/staggered.py:L258-L336, diff_diff/staggered.py:L1374-L1386, diff_diff/staggered.py:L1774-L1785, diff_diff/staggered_results.py:L63-L123, diff_diff/staggered_results.py:L157-L163.
    Impact: the new public panel parameter is not stored on CallawaySantAnnaResults, even though it changes how counts and aggregation should be interpreted. For panel=False, the fit path stores observation counts in n_treated_units / n_control_units, but the results API and summary still present them as “units”. That makes downstream use ambiguous and violates the expected propagation pattern for new public parameters.
    Concrete fix: add panel: bool (or equivalent index-space metadata) to CallawaySantAnnaResults, populate it in fit(), and switch summary labels/output semantics to observation-based wording when panel=False.

Tech Debt

No separate findings. The new TODO.md changes only resolve prior items; they do not track the P1 issues above, so those remain unmitigated.

Security

No findings.

Documentation/Tests

  1. Severity: P2. diff_diff/staggered.py:L119-L184, docs/api/staggered.rst:L15-L32, tests/test_staggered_rc.py:L48-L351, tests/test_honest_did.py:L1125-L1244.
    Impact: the public CallawaySantAnna class docstring still does not describe panel, so the generated API docs do not expose the new parameter contract. The new RCS tests mostly check “finite result” behavior, not methodology alignment, aggregation weights, or SE correctness. The HonestDiD additions test Taylor-style survey inputs, but there is no test for the new replicate-weight covariance branch, which is exactly where the implementation currently diverges from the registry.
    Concrete fix: document panel in the estimator docstring/API docs, add a reference-validation test against did::att_gt(panel=FALSE) for at least one aggregation case and one IPW/DR covariate case, and add a replicate-weight HonestDiD test that asserts diagonal fallback until multivariate replicate VCV is implemented.

Path to Approval

  1. Implement cross-sectional nuisance-estimation IF corrections for panel=False IPW/DR analytic inference, or explicitly document and temporarily disable those analytic SEs.
  2. Replace per-cell n_treated weighting with fixed cohort-mass weighting throughout panel=False simple/event-study/bootstrap aggregation.
  3. Make replicate-weight HonestDiD use the documented diagonal event-study covariance fallback instead of a full Psi.T @ Psi matrix.
  4. Propagate panel into CallawaySantAnnaResults and correct result-summary labels/semantics for repeated cross-sections.
  5. Add reference tests for panel=False methodology and a replicate-weight HonestDiD regression test.

Static review only; I was not able to do runtime spot-checks in this environment because the Python runtime here is missing required dependencies such as numpy.

… VCV, panel on results

Fix 5 findings from PR #240 CI review:
- Add cross-sectional nuisance IF corrections (PS + OR) to _ipw_estimation_rc
  and _doubly_robust_rc, matching panel path methodology
- Use fixed full-sample cohort masses for unweighted RCS aggregation weights
  (consistency with WIF group-share denominator)
- Guard replicate-weight designs from full event-study VCV (diagonal fallback)
- Add panel field to CallawaySantAnnaResults, fix summary labels for RCS
- Add panel to class docstring, replicate VCV test, RCS IF correction test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 4bf566d1f94ebca6500b5ffe806c87302aa9bb9a


PR Review

Overall assessment

⚠️ Needs changes. The highest unmitigated severity is P1. The prior HonestDiD replicate-weight fallback and panel propagation issues look resolved, but two repeated-cross-section inference problems remain.

Executive Summary

  • The remaining blockers are both in the new Callaway-Sant'Anna panel=False path.
  • Repeated-cross-section DR analytical inference is still internally inconsistent: the nuisance IF corrections reuse the post-period treated denominator for both the post and base-period components.
  • The earlier RCS weighting-contract fix is only partial: balance_e event-study aggregation and the unweighted bootstrap overall/event-study helpers still revert to per-cell n_treated counts instead of fixed cohort masses.
  • HonestDiD’s survey-variance changes now appear aligned with the updated registry, including the replicate-weight diagonal fallback.
  • The new RC tests are mostly smoke/finite-result checks and would not catch either remaining RC inference defect.

Methodology

Affected methods: Callaway-Sant'Anna repeated cross sections (panel=False, Section 4-style DRDID and aggregation/bootstrap). The HonestDiD survey-variance path looks consistent with the updated registry note.

  • Severity: P1. Impact: In panel=False DR, the point estimator correctly uses separate treated normalizers for the post and base-period pieces (sw_gt_sum vs sw_gs_sum, or n_gt vs n_gs), but the nuisance IF corrections collapse both periods onto a single normalizer = sum(sw_gt) or n_gt. That mis-scales both the PS correction and the base-period OR correction whenever the cohort-g sample size or treated weight sum differs across periods, which is the ordinary repeated-cross-section case. The resulting analytical SE/CIs/p-values are inconsistent with the estimator and with the Section 4 repeated-cross-section decomposition promised in the registry. Concrete fix: use separate normalizer_t and normalizer_s throughout M2_dr, M1_t, and M1_s, matching the denominators used in att_t_aug and att_s_aug; add a regression test with n_gt != n_gs and unequal treated weight sums. References: diff_diff/staggered.py:L3118-L3159 diff_diff/staggered.py:L3184-L3228 docs/methodology/REGISTRY.md:L423-L424

  • Severity: P1. Impact: The earlier panel=False weighting-contract bug is only partially fixed. Analytical simple/event-study aggregation now uses fixed cohort masses, but unweighted event-study with balance_e and the unweighted bootstrap helpers still fall back to cell-specific n_treated. In repeated cross-sections those cell counts vary by period, so the bootstrap SEs/CIs and balance_e event-study weights no longer correspond to the estimator/WIF denominator used elsewhere in the same results object. Concrete fix: compute one unweighted cohort-mass map from precomputed["unit_cohorts"] and use it everywhere panel=False aggregation weights are formed, including the balance_e branch in _aggregate_event_study(), overall bootstrap weights, and _prepare_event_study_aggregation(). References: diff_diff/staggered_aggregation.py:L76-L100 diff_diff/staggered_aggregation.py:L590-L610 diff_diff/staggered_aggregation.py:L621-L648 diff_diff/staggered_bootstrap.py:L229-L243 diff_diff/staggered_bootstrap.py:L561-L575 diff_diff/staggered_bootstrap.py:L614-L631

Code Quality

No findings beyond the methodology issues above.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. Neither remaining P1 issue is tracked in TODO.md:L48-L60, so TODO.md does not mitigate them.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC tests are largely finiteness/smoke checks. They do not exercise the two cases that break the current implementation: unequal treated cohort sizes across the two RC periods, and panel=False aggregation/bootstrap weight formation under balance_e or n_bootstrap>0. As written, the suite would pass with both P1 bugs still present. Concrete fix: add deterministic panel=False fixtures where cohort-g counts differ between t and s, assert the corrected DR SE against a hand-built IF/reference calculation, and add bootstrap/event-study tests that verify fixed cohort-mass weights under balance_e and n_bootstrap>0. References: tests/test_staggered_rc.py:L135-L152 tests/test_staggered_rc.py:L258-L300 tests/test_staggered_rc.py:L362-L381

Path to Approval

  1. Fix the repeated-cross-section DR nuisance IF normalization so the post-period and base-period PS/OR corrections use their own treated denominators (sw_gt_sum/sw_gs_sum or n_gt/n_gs).
  2. Finish propagating fixed cohort-mass weighting through all panel=False aggregation paths, especially balance_e event-study aggregation and the unweighted overall/event-study bootstrap helpers.
  3. Add regression tests covering unequal RC cohort counts across periods and panel=False bootstrap/event-study weight consistency.

Static review only: I could not execute the test suite here because this environment lacks pytest and scientific Python dependencies such as numpy.

…equal-count tests

- Use separate normalizer_t/normalizer_s in _doubly_robust_rc() IF corrections
  (post vs base period treated denominators must match the DR estimator)
- Precompute rcs_cohort_masses in _precompute_structures_rc() and return cohort
  mass as n_treated from _compute_att_gt_rc() — fixes all downstream consumers
  (aggregation, bootstrap, balance_e) at the source instead of per-consumer patches
- Remove now-unnecessary rcs_cohort_counts blocks from aggregation
- Add unequal cohort count test fixture and regression tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6080f927c1c76a760b7825b59d5f46032da64b1c


PR Review

Overall assessment
⚠️ Needs changes. The highest unmitigated severity is P1. The prior RC normalizer and fixed-cohort-weighting blockers appear resolved, but the new panel=False estimator formulas still do not line up with the repeated-cross-section methods this PR claims to implement.

Executive summary

  • The earlier re-review blockers look fixed: RC DR no longer reuses one denominator for both periods, and fixed cohort-mass weighting now propagates through analytical aggregation and bootstrap.
  • The remaining blocker is methodological: the new repeated-cross-section reg and dr paths do not match the DRDID / did::att_gt(panel=FALSE) estimators they are supposed to mirror.
  • _outcome_regression_rc() uses separate pre/post treated residual averages, but the reference reg_did_rc estimator pools treated weights when averaging the predicted change.
  • _doubly_robust_rc() is further off: it uses only control-group ORs and normalizes the control augmentation terms by treated-period masses, which does not match either drdid_rc or the simpler AIPW repeated-cross-section formula.
  • The added RC tests are almost entirely smoke/finite-result checks, so this kind of formula mismatch passes undetected.
  • The HonestDiD survey-df / event-study-vcov changes look consistent with the new registry note.

Methodology

  • Severity: P1. Impact: The new repeated-cross-section reg path in _outcome_regression_rc at diff_diff/staggered.py:L2795 computes ATT = mean_t(Y - m_t(X)) - mean_s(Y - m_s(X)) using separate treated averages for the post and base periods (diff_diff/staggered.py:L2843, diff_diff/staggered.py:L2859, diff_diff/staggered.py:L2869). did::att_gt(panel=FALSE, est_method="reg") dispatches to DRDID::reg_did_rc, and that estimator averages the predicted change over the treated group with pooled treated weights rather than separate pre/post treated residual means. That is a different finite-sample estimator whenever treated-sample composition differs across the two cross-sections. The registry note at docs/methodology/REGISTRY.md:L423 documents panel=False support, but not this estimator change. Concrete fix: Rework _outcome_regression_rc() and its IF to match reg_did_rc exactly, or explicitly document and rename a different RC regression estimator if that deviation is intentional. citeturn3view0turn5view1
  • Severity: P1. Impact: The new repeated-cross-section dr path in _doubly_robust_rc at diff_diff/staggered.py:L3031 does not match the cited DRDID repeated-cross-section estimators. The point estimator uses only control-group ORs (diff_diff/staggered.py:L3059) and divides the control augmentation terms by treated-period masses (diff_diff/staggered.py:L3131, diff_diff/staggered.py:L3153), with the same normalization baked into the IF corrections (diff_diff/staggered.py:L3190, diff_diff/staggered.py:L3211). But did::att_gt(panel=FALSE, est_method="dr") dispatches to DRDID::drdid_rc; that locally efficient estimator includes treated- and control-group outcome-regression pieces in both periods, and even the simpler AIPW repeated-cross-section formula normalizes each treated/control pre/post component by its own weight sum rather than the treated totals. So the current code changes both point estimates and IF-based SEs whenever reweighted control mass differs from treated mass. Concrete fix: Pick one specific Section 4 DR estimator (drdid_rc, aipw_did_rc1, or another named variant), implement its point estimator and IF end-to-end, and update the registry to name that exact estimator. citeturn3view1turn5view2turn1search4

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity: P2. Impact: In the RC path, _compute_att_gt_rc() stores the full cohort mass in n_treated so downstream aggregation can reuse existing weight plumbing (diff_diff/staggered.py:L2710), but the public results contract still documents n_treated as the number of treated observations for that group-time cell (diff_diff/staggered_results.py:L21). That silently turns a reporting field into an aggregation-weight field. Concrete fix: Keep n_treated as the actual cell count and add a separate cohort_mass / agg_weight field for RC aggregation.

Tech Debt

  • No separate findings. The P1 methodology issue above is not tracked in TODO.md:L48, so it remains unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new RC test file is almost entirely smoke coverage: it checks that fits are finite, methods run, and bootstrap returns something, but never compares panel=False reg/dr against DRDID formulas, did outputs, or even hand-built 2×2 RC reference values (tests/test_staggered_rc.py:L48, tests/test_staggered_rc.py:L160, tests/test_staggered_rc.py:L413). That means the formula mismatch above passes cleanly. Concrete fix: Add deterministic repeated-cross-section fixtures where treated covariate composition differs between pre and post, then assert reg/dr ATT and SE outputs against checked-in DRDID / R reference values or hand-verified formulas.

Path to Approval

  1. Re-implement _outcome_regression_rc() to match the repeated-cross-section reg_did_rc formula and influence function used by did::att_gt(panel=FALSE, est_method="reg").
  2. Re-implement _doubly_robust_rc() to match one named DRDID repeated-cross-section estimator exactly, including the correct normalization and nuisance-IF terms.
  3. Add RC reference-value tests that fail under the current formulas, not just finiteness/smoke checks.

…ulas

_outcome_regression_rc: Pool all treated obs for OR correction term
(was: separate per-period averages). Period-specific treated means for Y.
Matches Sant'Anna & Zhao (2020) Eq 2.2 / R reg_did_rc exactly.

_doubly_robust_rc: Fit 4 OLS models (control+treated, pre+post) for
locally efficient DR estimator (was: 2 control-only). Implements tau_1
(AIPW) + tau_2 (local efficiency adjustment) with full 11-component IF.
Matches Sant'Anna & Zhao (2020) Eq 3.3+3.4 / R drdid_rc exactly.

Add agg_weight field to group_time_effects for RCS aggregation weight
(cohort mass), separate from n_treated (per-cell display count).
Aggregation uses data.get("agg_weight", data["n_treated"]) for
backward compatibility with panel data.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: b623deeed20d79ab05bdf255819745c93140371a


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Executive Summary

  • The prior repeated-cross-section point-estimator blockers appear addressed: the new reg path now pools the treated OR correction, and the new dr path includes treated-side OR terms plus nuisance IF corrections.
  • [Newly identified] The non-survey panel=False bootstrap path still reaggregates with realized cell counts n_{g,t} instead of the fixed cohort mass agg_weight/N_g, so bootstrap overall and event-study inference target a different estimator than the analytical path.
  • That mismatch is not covered by the new methodology note for repeated cross-sections in docs/methodology/REGISTRY.md:L423, and it is not tracked in TODO.md:L48-L60 or TODO.md:L167-L170.
  • The HonestDiD survey-df / event-study-vcov changes are consistent with the new registry note; no separate methodology defect stood out there.
  • The new RC bootstrap tests are smoke-only, so this weighting regression currently slips through.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

  • Severity: P3 Impact: event_study_vcov is cached on the estimator instance and later copied into results via getattr(self, "_event_study_vcov", None). If the same CallawaySantAnna object is reused, a fit that does not compute event-study aggregation can inherit stale covariance metadata from a previous fit. References: diff_diff/staggered_aggregation.py:L714-L755, diff_diff/staggered.py:L1772-L1796. Concrete fix: reset self._event_study_vcov = None at the start of fit() and only attach it when the current fit actually computed event_study_effects.

Tech Debt

No separate findings. The P1 bootstrap-weight issue above is not currently tracked in TODO.md:L48-L60 or TODO.md:L167-L170, so it remains unmitigated.

Security

No findings.

Documentation/Tests

  • Severity: P2 Impact: The new RC bootstrap coverage only checks that bootstrap runs, including on the unequal-cohort fixture, but never asserts that bootstrap aggregation uses the fixed cohort masses introduced for panel=False. References: tests/test_staggered_rc.py:L289-L300, tests/test_staggered_rc.py:L413-L482. Concrete fix: add a deterministic unequal-cohort panel=False test that verifies bootstrap overall/event-study prep uses the same fixed cohort weights as the analytical aggregation path.

Path to Approval

  1. Update CallawaySantAnnaBootstrapMixin._run_multiplier_bootstrap() so non-survey panel=False overall ATT bootstrap weights use fixed cohort mass (agg_weight / precomputed cohort mass), not per-cell n_treated.
  2. Update _prepare_event_study_aggregation() so non-survey panel=False event-study bootstrap weights use the same fixed cohort masses.
  3. Add a regression test on the unequal-cohort RC fixture that fails if bootstrap reaggregation falls back to per-cell n_treated instead of fixed cohort mass.

- Bootstrap overall/event-study reaggregation now uses agg_weight (fixed
  cohort mass) for panel=False, matching the analytical aggregation path
- Reset self._event_study_vcov = None at start of fit() to prevent stale
  VCV from prior fit leaking into reused estimator objects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 3b405b7f9009afb4d8d90397d931b327a5c65825


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the changed tests in this environment because the default Python interpreter here does not have the project dependencies available.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: CallawaySantAnna.fit() computes and stores event_study_vcov from analytical IF vectors during event-study aggregation,diff_diff/staggered_aggregation.py:L714-L755 then, when n_bootstrap>0, overwrites event_study_effects[*]["se"], CIs, and p-values with bootstrap results while leaving that covariance matrix unchanged on the results object.diff_diff/staggered.py:L1709-L1733 diff_diff/staggered.py:L1773-L1799 HonestDiD now always prefers event_study_vcov when present,diff_diff/honest_did.py:L664-L670 so bootstrap-fit CS results silently feed analytical covariance into sensitivity analysis. That contradicts the Phase 7d intent that HonestDiD respect the same variance structure as the underlying event study.docs/methodology/REGISTRY.md:L1637-L1637 Concrete fix: when bootstrap inference is used for event-study results, either compute and store a bootstrap event-study covariance matrix from the bootstrap draws, or clear/ignore event_study_vcov so HonestDiD falls back to the bootstrap variance path instead of mixing analytical and bootstrap inference.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD tests validate analytical event_study_vcov creation and the replicate-weight diagonal fallback, but they do not cover CallawaySantAnna(..., n_bootstrap>0, aggregate="event_study"). That leaves the new covariance-source mismatch untested.tests/test_honest_did.py:L1158-L1281 Concrete fix: add a regression test that fits a bootstrap event study, runs HonestDiD.fit(), and asserts that the covariance source matches the bootstrap path (or that event_study_vcov is absent/recomputed from bootstrap draws on bootstrapped fits).

Path to Approval

  1. Make bootstrap-fit CallawaySantAnna event-study results expose a variance path that matches the reported bootstrap SEs: either store a bootstrap event-study covariance matrix, or suppress/ignore analytical event_study_vcov on bootstrapped fits.
  2. Add a targeted regression test for CallawaySantAnna(n_bootstrap>0, aggregate="event_study") -> HonestDiD.fit(...) that fails if HonestDiD consumes the analytical IF covariance instead of the bootstrap path.
  3. If the analytical-on-bootstrap behavior is intentional, document that explicitly in docs/methodology/REGISTRY.md; otherwise the current behavior remains an undocumented methodology deviation.

…dy SEs

Prevents HonestDiD from mixing analytical IF-based VCV with bootstrap SEs
on bootstrap-fit CallawaySantAnna results. When n_bootstrap>0, the
event_study_vcov is set to None so HonestDiD falls back to diagonal from
the bootstrap SEs (consistent variance path).

Add regression test: bootstrap CS → HonestDiD asserts vcov is None.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 53cfd5d5b06dd44f71f5272fe16a5e537543a327


Overall Assessment

⚠️ Needs changes. Highest unmitigated severity: P1.

Static review only: I could not execute the added tests here because pytest is not installed in this environment.

Executive Summary

Methodology

  • Severity: P3. Impact: The previous HonestDiD/bootstrap covariance-source finding looks resolved. Bootstrapped CS fits now discard analytical event_study_vcov before results are stored, so HonestDiD falls back to the bootstrap-compatible diagonal path, and that exact regression is now tested. Concrete fix: None. diff_diff/staggered.py:1323 diff_diff/staggered.py:1777 tests/test_honest_did.py:1283
  • Severity: P1. Impact: _outcome_regression_rc says it matches DRDID::reg_did_rc, but its control-side OLS estimation-effect term is divided by the treated-mass denominator twice. M1 is already normalized by sum_w_D, then inf_ct / inf_cs divide inf_cont_2_* by sum_w_D again. That shrinks the nuisance-estimation piece of the influence function, so covariate-adjusted repeated-cross-section reg fits understate per-cell analytical SEs and any bootstrap path built from those IFs. Concrete fix: Keep M1 normalized as written and remove the extra / sum_w_D on inf_cont_2_ct and inf_cont_2_cs, or re-port the reg_did_rc IF algebra directly from the reference implementation. diff_diff/staggered.py:2824 diff_diff/staggered.py:2920 diff_diff/staggered.py:2938 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423. citeturn0view0turn1view0turn1view3
  • Severity: P1. Impact: The new RCS ipw and dr nuisance corrections are also mis-scaled. _ipw_estimation_rc normalizes w_ct / w_cs to w_*_norm and then forms M2_rc with np.mean(...), adding an extra 1/n_ct or 1/n_cs; _doubly_robust_rc likewise divides its PS moment by n_all after already normalizing by sum_w_ipw_*. In the standardized RC IPW and locally efficient RC DR references, those PS moments are ratio-of-means terms, not extra sample-size-scaled means. That under-scales the PS correction and therefore understates inference for covariate-adjusted repeated-cross-section ipw and dr, including survey-weighted fits. Concrete fix: Rewrite the PS moment construction to match std_ipw_did_rc / drdid_rc exactly: summed normalized-weight moments for standardized IPW, and ratio-of-means PS moments without the extra / n_all for DR. diff_diff/staggered.py:3022 diff_diff/staggered.py:3076 diff_diff/staggered.py:3111 diff_diff/staggered.py:3371 diff_diff/staggered_bootstrap.py:357 docs/methodology/REGISTRY.md:423. citeturn2view0turn4view0turn1view3

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. These methodology defects are not mitigated by tracking: the nearby CS item in TODO.md is already marked resolved. TODO.md:48 TODO.md:57

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests do not verify the magnitude of the RCS nuisance corrections. They check finiteness, positivity, or that different methods return different SEs, which would still pass with the scaling errors above. Concrete fix: Add fixed-data regression tests for panel=False reg / ipw / dr with covariates that compare at least one ATT(g,t) SE or influence-function vector against DRDID reference output, and assert that bootstrap SEs move with the corrected IF magnitude. tests/test_staggered_rc.py:163 tests/test_staggered_rc.py:362 tests/test_survey_phase7a.py:60 tests/test_survey_phase7a.py:199

Path to Approval

  1. Fix _outcome_regression_rc so its OLS estimation-effect correction matches reg_did_rc, specifically by removing the extra treated-mass division after M1 has already been normalized.
  2. Fix _ipw_estimation_rc and _doubly_robust_rc so their PS nuisance moments use the same ratio-of-means scaling as the DRDID repeated-cross-section references.
  3. Add fixed-dataset regression tests for panel=False covariate-adjusted reg, ipw, and dr that check both analytical SEs and bootstrap SEs against reference values.
  4. Keep the HonestDiD bootstrap covariance fix and its regression test; that prior blocker looks resolved.

…s sum not mean

- _outcome_regression_rc: M1 denominator changed from sum_w_D to n_all
  (matching R colMeans convention); inf_cont_2 / sum_w_D then gives correct
  single normalization by mean_w_D * n_all = sum_w_D
- _ipw_estimation_rc: PS M2 uses np.sum/n_all instead of np.mean (which
  divided by n_ct/n_cs instead of n_all, under-scaling the correction)
- _doubly_robust_rc: PS M2 already correct (np.sum/n_all), no change

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9ff21a2859e7713ca00a4d0359f855dc272d0b7f


Overall Assessment

Needs changes — highest unmitigated severity: P1.

Static review plus source cross-check only. I could not run pytest here because pytest is not installed in this environment.

Executive Summary

  • Re-review outcome: the prior HonestDiD/bootstrap covariance finding appears fixed, but both prior repeated-cross-section inference findings remain unresolved.
  • The new repeated-cross-section reg path still under-scales the control-side OLS nuisance correction, so panel=False + covariates + estimation_method="reg" understates inference.
  • The new repeated-cross-section ipw and dr paths still under-scale the propensity-score nuisance corrections, so their analytical SEs and multiplier bootstrap remain too small.
  • REGISTRY.md and TODO.md now mark Phase 7a/7b as resolved, but these remaining variance/IF mismatches are not documented deviations and are not mitigated by tracking.
  • The new tests mostly check finiteness or coarse behavior; they still do not pin RCS covariate-adjusted IF/SE magnitudes to DRDID reference output.

Methodology

  • Severity: P3. Impact: the prior HonestDiD/bootstrap covariance blocker looks resolved. fit() now resets stale event_study_vcov state and clears the analytical event-study covariance on bootstrap fits before storing results, with regression coverage in the HonestDiD tests. Concrete fix: none. diff_diff/staggered.py:L1323, diff_diff/staggered.py:L1777, tests/test_honest_did.py:L1283

  • Severity: P1. Impact: the prior RCS reg scaling finding remains unresolved. In _outcome_regression_rc, the local port still computes M1 = sum(w_D * X) / n_all and then divides the assembled estimation-effect term by sum_w_D, which leaves an extra 1 / n_all shrinkage in the OLS nuisance correction. Relative to DRDID::reg_did_rc, that understates per-cell analytical SEs for panel=False + covariates + estimation_method="reg", and the multiplier bootstrap inherits the same understatement because it perturbs the stored IF vectors directly. Concrete fix: port reg_did_rc literally in the local phi = psi / n convention: either use M1 = sum(w_D * X) with the current / sum_w_D, or keep colMeans(...) and divide by mean_w_D instead of sum_w_D. diff_diff/staggered.py:L2806, diff_diff/staggered.py:L2921, diff_diff/staggered.py:L2939, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57. citeturn0view2

  • Severity: P1. Impact: the prior RCS ipw / dr scaling finding also remains unresolved. _ipw_estimation_rc and _doubly_robust_rc normalize the control weights first (w_ct_norm, w_ipw_* / sum_w_ipw_*) and then divide the propensity-score moments by n_all again before multiplying by asy_lin_rep_ps. In DRDID::std_ipw_did_rc and DRDID::drdid_rc, those moments are already normalized by the control-weight means; after translating to the local phi = psi / n convention, there should not be an additional / n_all. The result is understated analytical SEs and understated multiplier-bootstrap dispersion for panel=False covariate-adjusted ipw and dr, including survey-weighted fits. Concrete fix: remove the extra / n_all from the RCS PS moments and port the std_ipw_did_rc / drdid_rc nuisance terms term-for-term. diff_diff/staggered.py:L3071, diff_diff/staggered.py:L3086, diff_diff/staggered.py:L3382, diff_diff/staggered.py:L3397, diff_diff/staggered_bootstrap.py:L372, docs/methodology/REGISTRY.md:L423, TODO.md:L57. citeturn0view1turn0view0

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The unresolved SE/IF scaling issues above are not mitigated by the new tracking text because incorrect variance output is not deferrable under this review rubric. TODO.md:L57, docs/methodology/REGISTRY.md:L423

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix _outcome_regression_rc so the control-side OLS nuisance correction matches DRDID::reg_did_rc exactly; the current port still carries an extra / n_all.
  2. Fix _ipw_estimation_rc and _doubly_robust_rc so the PS nuisance moments match DRDID::std_ipw_did_rc and DRDID::drdid_rc exactly; remove the extra / n_all after control-weight normalization.
  3. Add fixed-data regression tests for RCS reg / ipw / dr with covariates that check at least one analytical ATT(g,t) SE and one multiplier-bootstrap SE against DRDID reference output, including an unequal cohort-count case.

@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

…convergence test

REGISTRY.md: Document that RCS IFs use phi=psi/n convention (SE = sqrt(sum(phi^2))),
algebraically equivalent to R's sd(psi)/sqrt(n). The 1/n_all denominator in gradient
terms is the colMeans -> phi conversion, not extra shrinkage.

Add test proving correctness: analytical SE within 20% of bootstrap SE (499 iters)
for RCS reg with covariates.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the survey-phase-seven branch from cfcc441 to c2f8fdc Compare March 29, 2026 13:26
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cfcc441d510052036e3e4adfccacf9717c1c17b6


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the test suite here because this environment does not have numpy, pandas, scipy, or pytest installed.

Executive Summary

  • Re-review outcome: the prior bootstrap/HonestDiD covariance-leak finding appears fixed.
  • The prior repeated-cross-section reg inference finding remains unresolved: the control-side OLS nuisance correction is still under-scaled in panel=False + covariates.
  • The prior repeated-cross-section ipw / dr inference finding also remains unresolved: the propensity-score nuisance corrections are still under-scaled in the new RCS paths.
  • [Newly identified] The new event_study_vcov path in HonestDiD does not subset the covariance matrix to the same finite-SE event times used for beta_hat, so dropped interior event times can silently misalign bounds.
  • The new tests are still too weak to catch the RCS SE issues: most are smoke/shape checks, and the “analytical vs bootstrap” test is not independent because bootstrap perturbs the same stored IF vectors.

Methodology

  • Severity: P1. Impact: the prior repeated-cross-section reg SE bug is still present. In _outcome_regression_rc(), the new code forms the OLS asymptotic linear representation with bread_t = (X'WX)^{-1} / bread_s = (X'WX)^{-1} at diff_diff/staggered.py#L2914, but it also keeps M1 = sum(w_D X) / n_all at diff_diff/staggered.py#L2921 and then divides again by sum_w_D at diff_diff/staggered.py#L2939. In the official DRDID::reg_did_rc implementation, the matching nuisance term is built on the unscaled R IF side with solve(X'WX / n) and M1 = colMeans(w.cont * X) before the final / mean(w.cont) step. Under this library’s pre-scaled phi convention, the / n lives in the ALR; keeping / n_all inside M1 shrinks the correction one extra time. That understates analytical SEs for panel=False + covariates + estimation_method="reg", and the multiplier bootstrap inherits the same understatement because it reuses the stored IF vectors at diff_diff/staggered_bootstrap.py#L372. Concrete fix: either rescale the OLS ALR back to the DRDID convention (inv(X'WX / n_all)) or, with the current bread, drop the / n_all from M1. (sciencedirect.com)

  • Severity: P1. Impact: the prior repeated-cross-section ipw / dr SE bug also remains present. _ipw_estimation_rc() builds asy_lin_rep_ps with H_ps_inv = (X'WX)^{-1} at diff_diff/staggered.py#L3063 and then divides the already-normalized M2_rc moments by n_all at diff_diff/staggered.py#L3081. _doubly_robust_rc() repeats the same pattern at diff_diff/staggered.py#L3366 and diff_diff/staggered.py#L3384. In the official std_ipw_did_rc and drdid_rc implementations, the corresponding M2.post / M2.pre terms are colMeans(...) / mean(...) objects with no extra /n; the local phi conversion is already handled by the Hessian scaling. As written, the PS nuisance corrections are too small, so covariate-adjusted RCS ipw and dr fits understate analytical SEs, and multiplier bootstrap dispersion is understated for the same reason. Concrete fix: remove the extra / n_all from the RCS PS moment vectors, or equivalently rescale the PS ALR to the R reference scale before multiplying by M2. (sciencedirect.com)

  • Severity: P1. Impact: [Newly identified] the new HonestDiD covariance path can silently misalign the covariance matrix when some event-study periods have n_groups > 0 but non-finite SEs. _extract_event_study_params() filters CS event-study periods down to finite-SE rel_times at diff_diff/honest_did.py#L645, but if event_study_vcov exists it returns the full matrix unchanged at diff_diff/honest_did.py#L666. _aggregate_event_study() builds that matrix over all event times with stored IF vectors at diff_diff/staggered_aggregation.py#L629 and diff_diff/staggered_aggregation.py#L716. If an interior event time is dropped, fit() ends up pairing a filtered beta_hat with an unfiltered covariance matrix and silently truncating the top-left block, which can shift original_se and the sensitivity bounds. Concrete fix: subset event_study_vcov to the surviving rel_times before returning sigma, and add a regression test with an interior dropped event time.

  • Severity: P3. Impact: the prior bootstrap/HonestDiD covariance-leak issue appears fixed. fit() now resets stale _event_study_vcov state at diff_diff/staggered.py#L1324 and clears analytical event-study covariance on bootstrap fits at diff_diff/staggered.py#L1777, with coverage in tests/test_honest_did.py#L1283. Concrete fix: none.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: TODO.md now marks Phase 7a as resolved at TODO.md#L57, and the registry note at docs/methodology/REGISTRY.md#L423 / docs/methodology/REGISTRY.md#L424 claims the RCS 1/n_all factors are just the local phi conversion. Those notes do not mitigate the two P1 SE defects above; incorrect variance output is not deferrable under this review rubric. Concrete fix: reopen or replace the resolved tracking entry once the RCS nuisance-scaling actually matches the implemented IF convention.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new Phase 7 tests still do not anchor the affected SEs to an external reference. tests/test_survey_phase7a.py is mostly smoke/scale-invariance coverage starting at tests/test_survey_phase7a.py#L56, and tests/test_staggered_rc.py calls the analytical-vs-bootstrap check “proof” at tests/test_staggered_rc.py#L413, but bootstrap reuses the same stored IF vectors at diff_diff/staggered_bootstrap.py#L357. That cannot catch a common IF scaling error. The new HonestDiD tests also only exercise fully finite event-study inputs at tests/test_honest_did.py#L1158, so the new covariance-subsetting edge case is untested. Concrete fix: add fixed-data regression tests for panel=False reg / ipw / dr with covariates that assert at least one analytical ATT(g,t) SE and one multiplier-bootstrap SE against DRDID reference output, plus an HonestDiD test with an interior event-study period dropped for se=np.nan.

Path to Approval

  1. Fix _outcome_regression_rc() so the OLS nuisance correction is on the correct scale: either use the DRDID-side n_all factor in the OLS ALR or remove the current / n_all from M1.
  2. Fix _ipw_estimation_rc() and _doubly_robust_rc() so the PS nuisance moments are not divided by n_all after they have already been normalized.
  3. Subset event_study_vcov to the same finite-SE event-time index used to build beta_hat inside _extract_event_study_params().
  4. Add fixed-data regression tests against DRDID reference output for RCS reg / ipw / dr analytical SEs and multiplier-bootstrap SEs, and add one HonestDiD regression test with an interior dropped event-study period.

Restructure _outcome_regression_rc, _ipw_estimation_rc, _doubly_robust_rc
to compute leading IF terms in R's unnormalized psi convention (using
mean_w_* = sum_w_*/n_all normalizers matching R's mean()), then convert
to library phi = psi/n_all at the boundary. Makes DRDID correspondence
explicit with R variable name comments.

Fix HonestDiD event_study_vcov subsetting: when filtering NaN-SE event
times, subset the VCV matrix to match the surviving rel_times (was using
the full unfiltered matrix, causing dimension mismatch on interior drops).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: cb3f815c22cd29f4ad338cbb8d25fe3b3deaa577


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the suite here because this environment is missing numpy, pandas, scipy, and pytest.

Executive Summary

  • The prior bootstrap/HonestDiD covariance-leak finding appears fixed via the fit-state reset in diff_diff/staggered.py:L1324 and the bootstrap-time clearing of analytical event-study covariance in diff_diff/staggered.py:L1777.
  • The prior repeated-cross-section 1/n_all scaling dispute is now explicitly documented in docs/methodology/REGISTRY.md:L423 and docs/methodology/REGISTRY.md:L424, so it is no longer a blocking methodology defect under this rubric.
  • The prior HonestDiD covariance-subsetting finding remains unresolved: _extract_event_study_params() filters beta_hat down to finite-SE event times but still takes the full event_study_vcov, so a dropped interior event time can silently misalign original_se and the reported sensitivity bounds.
  • [Newly identified] The Phase 7a panel covariate-adjusted propensity-score correction is still under-scaled in the newly supported survey ipw path and the new dr IF-correction path, so analytical SEs are understated and the multiplier bootstrap inherits the same understatement. (rdrr.io)
  • The new Phase 7 tests are still mostly smoke/invariance coverage and would not catch either issue because they do not benchmark these SE-sensitive paths against DRDID/HonestDiD reference outputs.

Methodology

  • Severity: P1. Affected method(s): HonestDiD on CallawaySantAnnaResults. Impact: _extract_event_study_params() filters event times to those with finite SEs at diff_diff/honest_did.py:L645, but if event_study_vcov exists it still assigns the full matrix at diff_diff/honest_did.py:L666. That covariance is built over the full aggregated event-study sequence in diff_diff/staggered_aggregation.py:L714, and fit() silently falls back to the top-left block when dimensions do not match at diff_diff/honest_did.py:L1158. If a middle event time is dropped, the remaining coefficients and covariance matrix can become misaligned with no warning, producing wrong original_se and wrong Rambachan-Roth bounds. Concrete fix: subset event_study_vcov to the exact surviving rel_times before returning sigma, and add a regression test where one interior event time has se=np.nan but adjacent periods remain finite.

  • Severity: P1. Affected method(s): CallawaySantAnna panel ipw with covariates + survey, and panel dr with covariates with and without survey. Impact: the PR now advertises/supports these paths in docs/methodology/REGISTRY.md:L419, tests/test_survey_phase4.py:L929, and tests/test_survey_phase7a.py:L361, but the PS nuisance gradients still use np.mean(...) over only control rows at diff_diff/staggered.py:L2052, diff_diff/staggered.py:L2314, and diff_diff/staggered.py:L2375. In the official DRDID panel implementations, weights are normalized to mean one, the PS ALR uses solve(X'WX) * n, and the corresponding IPW/DR derivatives are colMeans(w.cont * ... * X) before the final division by mean(w.cont) in the control IF. This code instead uses the unscaled inverse of X'WX, which already puts the ALR in the library’s pre-scaled phi convention, so averaging the normalized gradient again over control rows introduces an extra 1/n_c shrinkage. That understates analytical SEs for these panel covariate-adjusted IPW/DR paths, and the bootstrap path reuses the same stored IF vectors at diff_diff/staggered_bootstrap.py:L377, so the understatement carries through there too. Concrete fix: either remove the extra control-row averaging and use sums in those M2/M2_dr terms, or rescale the ALR back to the DRDID psi convention consistently before multiplying by the gradient; then add fixed-data SE regressions against DRDID for panel ipw/dr. (rdrr.io)

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The REGISTRY.md/TODO.md updates do not mitigate the two methodology defects above; incorrect variance output is not deferrable under this review rubric.

Security

No findings.

Documentation/Tests

Path to Approval

  1. Subset event_study_vcov to the same finite-SE rel_times used to build beta_hat in _extract_event_study_params(), and add a regression test with an interior dropped event time.
  2. Fix the panel PS nuisance-gradient scaling in _ipw_estimation() and _doubly_robust() by removing the extra control-row averaging or by rescaling the ALR consistently to the DRDID psi convention.
  3. Add fixed-data reference tests against official DRDID outputs for panel ipw + survey + covariates, panel dr + survey + covariates, and panel dr + covariates without survey.
  4. Add one HonestDiD regression test where a middle event-study period has se=np.nan while neighboring periods remain finite, and assert the covariance matrix is subset to the same retained periods.

…2 gradients

Resolve TODO.md merge conflict (keep our resolved items from Phase 7).

Panel IPW/DR PS nuisance gradients (M2, M2_dr): change np.mean() over
control rows to np.sum() / (n_t + n_c), matching R's colMeans() over
ALL n observations (zero for treated). This makes the panel paths
consistent with the RC paths' R-convention refactoring.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 7e127fb29e1ca2f59a14e9bb0391a1e728f8483f


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the suite here because this environment is missing numpy, so importing the package fails.

Executive Summary

  • The prior bootstrap/HonestDiD state-leak issue looks fixed: fit() now resets _event_study_vcov, and bootstrap results clear the analytical event-study covariance before storing results.
  • The panel covariate-adjusted propensity-score nuisance correction still appears under-scaled in the newly supported survey ipw path and the survey/non-survey dr paths, so the reported SEs still look too small.
  • HonestDiD’s new event-study VCV subsetting is only partial; with base_period="universal" plus any additional dropped event time, coefficients and covariance can still become misaligned.
  • The new panel=False threading, aggregation, and bootstrap plumbing looks internally coherent, and the RCS phi = psi / n convention is documented in the Methodology Registry, so I did not count that as a defect.
  • No security issues stood out. The main remaining gap is test coverage for the two SE-sensitive failure modes above.

Methodology

  • Severity: P1. Location: diff_diff/staggered.py:L2052-L2059, diff_diff/staggered.py:L2314-L2321, diff_diff/staggered.py:L2376-L2382, diff_diff/staggered_bootstrap.py:L377-L399, docs/methodology/REGISTRY.md:L419-L424. Impact: the panel PS correction still divides M2 / M2_dr by n_t + n_c even though asy_lin_rep_ps = score_ps @ H^{-1} is formed with the unnormalized Hessian H = X'WX, so that term is already on the library’s per-observation IF scale. That leaves the newly unblocked survey ipw path and the new survey/non-survey dr paths with a propensity-score correction that is smaller by another factor of 1/n, understating analytical SEs; because the bootstrap reuses the same stored IF vectors, the understatement carries through there as well. This is not covered by a REGISTRY.md deviation note for the panel estimator, so under the rubric it remains an undocumented methodology mismatch. Concrete fix: remove the extra / (n_t + n_c) from the panel PS gradients, or explicitly rescale asy_lin_rep_ps back to the DRDID/R psi convention before multiplying.

  • Severity: P1. Location: diff_diff/staggered_aggregation.py:L701-L755, diff_diff/honest_did.py:L666-L676, diff_diff/honest_did.py:L1165-L1173. Impact: the new HonestDiD subsetting logic still misses the base_period="universal" case when some other event time is also dropped. _aggregate_event_study() injects the e=-1 reference row into event_study_effects, but event_study_vcov is still indexed only by the aggregated non-reference event times. _extract_event_study_params() only subsets when vcov.shape[0] == len(all_event_times), which is false once that reference row exists. If an interior event time also has se=np.nan, beta_hat is filtered but sigma is not, and fit() can silently take a misaligned block. Concrete fix: store the exact event-time index that matches event_study_vcov and subset against that list, not against results.event_study_effects.keys().

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No separate findings. The TODO.md/REGISTRY.md updates do not mitigate the two methodology defects above because incorrect SEs and covariance alignment are non-deferrable under this rubric.

Security

No findings.

Documentation/Tests

  • Severity: P2. Location: tests/test_honest_did.py:L746-L778, tests/test_honest_did.py:L1158-L1299, tests/test_survey_phase7a.py:L56-L389. Impact: the new tests cover smoke/invariance behavior, but they still do not pin the two remaining SE-sensitive failure modes. The HonestDiD tests check that the universal-base reference period is excluded, not the harder case where universal base and an additional dropped event time coexist. The Phase 7a tests check finite outputs and “DR SE differs from reg,” but they do not benchmark panel ipw/dr SEs against a DRDID reference, and analytical-vs-bootstrap agreement cannot catch a shared IF-scaling error because bootstrap reuses the same IF objects. Concrete fix: add one universal-base HonestDiD regression test with an interior se=np.nan, plus fixed-data reference tests against DRDID for panel survey ipw, panel survey dr, and panel non-survey dr.

Path to Approval

  1. Fix the panel PS nuisance-correction scaling in _ipw_estimation() and _doubly_robust() so the PS term is on the same IF scale as the rest of the panel estimator.
  2. Make HonestDiD subset event_study_vcov by the VCV’s actual event-time ordering rather than the full event_study_effects key set.
  3. Add fixed-data reference tests against DRDID for panel survey ipw, panel survey dr, and panel non-survey dr standard errors.
  4. Add a universal-base HonestDiD regression test where one non-reference event time has se=np.nan and assert the covariance matrix is subset to the retained relative times.

…s; fix VCV index subsetting

Panel IPW/DR PS corrections: restructure to match R's std_ipw_did_panel /
drdid_panel convention: H = X'WX/n, asy_lin_rep = score @ solve(H) / n,
M2 = colMeans(). Algebraically equivalent but mirrors R source literally.

HonestDiD VCV subsetting: store event_study_vcov_index (the exact event-time
ordering matching VCV columns) so subsetting works correctly even when
universal base period injects a reference row into event_study_effects.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: eac680ee53021c9a5b1a8bf234456f925f16a15f


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the suite here because this environment is missing numpy and pytest.

Executive Summary

  • The prior panel nuisance-scaling concern looks addressed in the updated panel IPW/DR paths.
  • HonestDiD’s new event-study covariance handoff is still not fully aligned: dropped event times can leave event_study_vcov_index out of sync with the stored covariance matrix.
  • The new panel=False path changes the identifying assumptions to repeated-cross-section DRDID, but the public API and Registry still do not surface the stationary repeated-cross-section assumption clearly enough.
  • The repeated-cross-section phi = psi / n scaling choice is now explicitly documented in the Methodology Registry, so under the rubric I did not count that deviation as a blocker.
  • The added tests cover smoke paths, but they still miss the two review-sensitive cases above.

Methodology

  • Severity: P1. Impact: panel=False now routes Callaway-Sant’Anna through the repeated-cross-section DRDID estimators, but the new panel/fit() docs and Registry entry only describe “repeated cross-sections” generically. The source DRDID methods the PR says it matches are specifically documented for stationary repeated cross-sectional data, so this is a new identification-assumption surface that the API does not currently surface to users. Concrete fix: explicitly document “stationary repeated cross-sectional data” in the panel parameter docs, the fit() docstring, and docs/methodology/REGISTRY.md; emit a one-time warning when panel=False is used because the assumption is not data-checkable. Location: diff_diff/staggered.py:L184-L190, diff_diff/staggered.py:L1276-L1307, docs/methodology/REGISTRY.md:L423-L425. (psantanna.com)

  • Severity: P1. Impact: the HonestDiD covariance-alignment fix is still incomplete. _aggregate_event_study() only appends a psi column when an event time has at least one finite aggregated effect, but event_study_vcov_index is built from all sorted_periods. _extract_event_study_params() then subsets event_study_vcov using that overlong index after filtering out n_groups=0/non-finite periods. If an interior event time drops out, the index no longer matches the covariance columns, so HonestDiD can pull the wrong submatrix or index past the end of vcov. Concrete fix: build a filtered vcov_periods list in the same branch that appends _psi_vectors, store that as event_study_vcov_index, and add a regression test with an interior dropped event time. Location: diff_diff/staggered_aggregation.py:L629-L758, diff_diff/honest_did.py:L664-L683.

  • Severity: P3. Impact: the repeated-cross-section phi_i = psi_i / n convention is explicitly documented in the Registry, so under the review rubric I treated that as a documented deviation rather than a defect. Concrete fix: none required. Location: docs/methodology/REGISTRY.md:L423-L425.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. Neither blocker above is tracked in TODO.md, so they remain unmitigated.

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: the new tests do not pin the two remaining failure modes. test_event_study_vcov_computed only exercises the happy path with no dropped event times, and the new repeated-cross-section tests only check smoke behavior or “analytical vs bootstrap” on the library’s own IF objects, which cannot validate the new RC IPW/DR inference against the source methods. Concrete fix: add one HonestDiD regression test with an interior dropped event time and one or more fixed-data reference tests against official DRDID repeated-cross-section outputs for reg_did_rc, drdid_rc, and normalized IPW RC. Location: tests/test_honest_did.py:L1158-L1305, tests/test_staggered_rc.py:L360-L430, tests/test_survey_phase7a.py:L56-L295. (psantanna.com)

Path to Approval

  1. In event-study aggregation, store event_study_vcov_index from the exact event times that contributed columns to Psi, not from all sorted_periods, and add a regression test with an interior dropped event time.
  2. Surface the repeated-cross-section identification assumption explicitly wherever panel=False is introduced: panel docstring, fit() docstring, and REGISTRY.md; add a one-time runtime warning for panel=False.
  3. Add fixed-data reference tests for repeated-cross-section ipw/dr against official DRDID outputs, plus the HonestDiD covariance-alignment test above.

VCV index: Build event_study_vcov_index from event times that actually
contributed psi vectors (skipping NaN-only periods), not from all
sorted_periods. Fixes misalignment when interior event times drop out.

Stationarity: Document "stationary repeated cross-sections" in panel
parameter docstring. Emit UserWarning on panel=False noting the
stationarity assumption is not data-checkable.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9893454d3bedb6a973051436e589d0381a759d47


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the added tests here because this environment is missing both pytest and numpy.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: In _doubly_robust, both the survey and non-survey panel branches normalize the control augmentation and related PS/OR nuisance terms by treated mass (sw_t_sum / n_t) rather than by the control IPW mass. The official DRDID::drdid_panel implementation of Sant’Anna & Zhao’s panel DR estimator defines eta.cont = mean(dr.att.cont) / mean(w.cont) and scales the control IF by mean(w.cont) as well. That makes this newly unblocked panel dr path an undocumented estimator/SE deviation from the stated source whenever treated mass and control IPW mass differ. Concrete fix: either align both panel dr branches to the official w.cont normalization end-to-end, or explicitly document this as a deviation in REGISTRY.md and remove the current “Theorem 3.1 / DRDID match” claim. diff_diff/staggered.py:L2280 diff_diff/staggered.py:L2322 diff_diff/staggered.py:L2355 diff_diff/staggered.py:L2384 docs/methodology/REGISTRY.md:L419 (rdrr.io)
  • Severity: P3. Impact: The repeated-cross-section phi_i = psi_i / n convention and the HonestDiD replicate-weight diagonal fallback are now explicitly documented in the Registry, so under the review rubric I treated them as documented deviations, not defects. Concrete fix: none required. docs/methodology/REGISTRY.md:L423 docs/methodology/REGISTRY.md:L1638

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No separate findings. The P1 above is not tracked in TODO.md, so it remains unmitigated. TODO.md:L52

Security

  • No findings.

Documentation/Tests

  • Severity: P2. Impact: The new tests cover smoke behavior, scale invariance, bootstrap convergence, and some HonestDiD plumbing, but they still do not lock down the methodology-sensitive cases this PR changes: official panel drdid_panel parity for covariate-adjusted panel DR, and the prior-review regression case where an interior event time drops out of the event-study covariance handoff. Concrete fix: add fixed-data reference tests against official DRDID::drdid_panel outputs for both weighted and unweighted panel DR, plus one HonestDiD regression test with an interior dropped event time. tests/test_survey_phase7a.py:L1 tests/test_honest_did.py:L1158 (rdrr.io)

Path to Approval

  1. Update _doubly_robust so the panel DR control term uses the official w.cont normalization in both the point estimate and the PS/OR nuisance IF pieces.
  2. Add fixed-data parity tests against official DRDID::drdid_panel outputs for weighted and unweighted covariate-adjusted panel cases.
  3. Add one HonestDiD regression test where an interior event time is absent after filtering, so event_study_vcov_index subsetting is pinned going forward.

…RDID

Panel DR normalizes control augmentation by treated mass (sw_t_sum/n_t)
rather than control IPW mass (sum(w_cont)) as in DRDID::drdid_panel.
Both are asymptotically equivalent; finite-sample difference documented
as intentional deviation in REGISTRY.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 44150347e1674c785f857fb59d4604ae4fbffcda


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the affected tests here because this environment is missing both pytest and numpy.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: the PR expands HonestDiD’s CS handoff (event_study_vcov, event_study_vcov_index, survey df), but HonestDiD.fit() still does not enforce the two preconditions used by the official AGGTE wrapper: base_period="universal" and a consecutive event-time grid. In this implementation, _extract_event_study_params() drops missing / non-finite horizons, then the bounds code uses only num_pre / num_post, not the actual spacing of rel_times; that silently treats interior gaps as adjacent periods and also allows varying-base-period pseudo-ATT pretrends to enter HonestDiD as if they were common-reference event-study coefficients. Concrete fix: add an explicit CS precondition check before bounds construction that (1) rejects or at least warns on results.base_period != "universal" and (2) validates that the surviving event-time vector is consecutive around the omitted reference period; then update the new survey-HonestDiD tests accordingly. diff_diff/honest_did.py:L635-L699, diff_diff/honest_did.py:L741-L818, tests/test_honest_did.py:L1139-L1241. (rdrr.io)
  • Severity: P3. Impact: the earlier panel DR control-normalization mismatch is now explicitly documented in the Methodology Registry as a deviation from R, so it is mitigated under the stated review rubric. Concrete fix: none required for this PR. docs/methodology/REGISTRY.md:L419-L425

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD survey-variance tests currently codify the wrong contract by constructing CallawaySantAnna() with the default base_period="varying" and asserting HonestDiD succeeds, which will mask the methodology issue above instead of guarding it. Concrete fix: convert those positive cases to base_period="universal" and add explicit negative/regression coverage for base_period="varying" plus an interior-gap event-time vector. tests/test_honest_did.py:L1139-L1303. (rdrr.io)
  • Severity: P2. Impact: Phase 7a/7b tests remain mostly internal-consistency checks; they still do not lock the newly changed panel DR and repeated-cross-section estimators to fixed official DRDID reference values. Concrete fix: add fixed-data parity tests against official DRDID::drdid_panel, reg_did_rc, std_ipw_did_rc, and drdid_rc outputs for weighted and unweighted cases. tests/test_survey_phase7a.py:L1-L389, tests/test_staggered_rc.py:L1-L520.

Path to Approval

  1. Add a Callaway-Sant’Anna-specific HonestDiD guard that refuses or at minimum warns on base_period != "universal" before constructing bounds, and update the new survey-HonestDiD positive tests to fit CS with base_period="universal". diff_diff/honest_did.py:L635-L699, tests/test_honest_did.py:L1139-L1241
  2. Validate that the filtered event-time vector is consecutive before turning it into num_pre / num_post, and add a regression test with an interior missing horizon (for example [-3, -1, 0, 1]) that asserts the guard fires. diff_diff/honest_did.py:L650-L699, diff_diff/honest_did.py:L741-L818

Add UserWarning when CallawaySantAnna results with base_period != "universal"
are passed to HonestDiD (R requires universal; we warn). Document as
deviation in REGISTRY.md.

Update Phase 7d HonestDiD tests to use base_period="universal" for
methodologically valid bounds. Fix VCV diagonal test index for universal
base reference period row.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 1c35440b95c26f547ab7ad97a8d02107879e0604


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the changed tests here because this environment is missing pytest, numpy, scipy, and pandas.

Executive Summary

Methodology

  • Severity: P1 [Newly identified]. Impact: the new panel PS nuisance-correction blocks say they are implementing the R-style H/n, asy.lin.rep/n, colMeans convention, but the actual M2 / M2_dr terms use np.mean(...) on the control slice only. That divides by n_control, not by the full sample size, so the correction is inflated by roughly n_all / n_control. Because these corrected IFs feed both analytical SEs and downstream aggregation/bootstrap, this silently changes inference for the newly modified panel survey IPW path and both survey/non-survey panel DR paths. The inconsistency is especially clear because the new RCS implementations in the same PR use the full-sample sum(...)/n_all form explicitly. diff_diff/staggered.py:2051 diff_diff/staggered.py:2067 diff_diff/staggered.py:2308 diff_diff/staggered.py:2323 diff_diff/staggered.py:2368 diff_diff/staggered.py:2385 diff_diff/staggered.py:3135 diff_diff/staggered.py:3455 docs/methodology/REGISTRY.md:419 Concrete fix: change all three panel M2 sites to the full-sample denominator (sum(...)/n_all_panel, or equivalently zero-pad treated rows and take a true full-sample mean), then add reference-value tests for the affected panel IPW/DR paths.
  • Severity: P1. Impact: HonestDiD still does not validate that the surviving Callaway-Sant’Anna event-time grid is consecutive. _extract_event_study_params() filters out missing / non-finite horizons, sorts what remains, and returns only len(pre_times) / len(post_times); _construct_constraints_sd() then builds second-difference constraints from counts alone. That means an interior gap is silently treated as an adjacent period, which changes the geometry of both smoothness and relative-magnitude restrictions. The new Phase 7d changes touched exactly this path, but the missing-gap precondition from the prior review is still not enforced or documented. diff_diff/honest_did.py:659 diff_diff/honest_did.py:707 diff_diff/honest_did.py:757 diff_diff/honest_did.py:1170 Concrete fix: before returning Callaway-Sant’Anna event-study parameters, validate that the retained horizons are consecutive around the omitted reference period; if not, raise or at least warn and refuse bounds construction.
  • Severity: P3. Impact: the earlier base_period != "universal" HonestDiD mismatch is now properly documented as a deviation, so under the stated rubric it is no longer a blocker. This is separate from the still-missing consecutive-grid guard above. docs/methodology/REGISTRY.md:1640 diff_diff/honest_did.py:643 Concrete fix: none required for approval.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • Severity: P3. Impact: TODO.md currently marks the Phase 7a survey IPW/DR work as resolved, so the remaining panel SE-scaling issue is not tracked as deferrable work and still counts toward the assessment. TODO.md:59 Concrete fix: fix the blocking SE issue in code; only non-blocking residual test/doc follow-up should be deferred into TODO.md.

Security

  • No findings.

Documentation/Tests

Path to Approval

  1. Fix the panel PS nuisance-correction scaling at the three modified M2 / M2_dr sites so they use the full-sample colMeans denominator, not a control-only mean. diff_diff/staggered.py:2051 diff_diff/staggered.py:2308 diff_diff/staggered.py:2368
  2. Add panel reference-value tests that compare the changed IPW/DR + covariate paths against locked DRDID outputs for both weighted and unweighted cases. tests/test_survey_phase7a.py:56
  3. Add a Callaway-Sant’Anna-specific HonestDiD guard for non-consecutive retained event times, and add a regression test with an interior gap to prove the guard fires. diff_diff/honest_did.py:659 tests/test_honest_did.py:1125

…me guard

Panel IPW/DR M2 gradients: change np.mean(control_slice) to
np.sum(control_slice) / n_all_panel, matching R's colMeans over ALL n
observations (zero for treated rows). Previous np.mean divided by n_c
instead of n_all.

HonestDiD: warn when retained CS event-study horizons are not consecutive
(interior gaps change smoothness/RM restriction geometry). R requires
a consecutive grid.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

1 similar comment
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 867cd51ef469342b6b01381e7056bbf384a5e68d


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the Python suite here because the repo dependencies are unavailable in this environment (numpy import fails).

Executive Summary

Methodology

  • Severity: P1. Impact: HonestDiD on CallawaySantAnna results still does not correctly enforce the consecutive-horizon requirement from the source method. _extract_event_study_params() now warns whenever any adjacent retained rel_times differ by something other than 1, but a valid universal-base event study necessarily has the omitted reference-period gap between the last pre period and the first post period. So valid CS inputs are warned on, while true interior gaps inside the retained pre or post blocks still flow into _construct_constraints_sd() / _construct_constraints_rm() as if those horizons were adjacent. That means the prior methodological defect remains: real interior gaps can still distort smoothness and relative-magnitude bounds. Concrete fix: validate the pre block and post block separately (or reconstruct the expected grid around the omitted reference period), allow exactly the single omitted-reference gap, and raise/refuse bounds construction on any other gap. diff_diff/honest_did.py:L659-L729 diff_diff/honest_did.py:L772-L850
  • Severity: P3. Impact: the base_period != "universal" HonestDiD behavior is now documented in the Methodology Registry as a deliberate deviation from R, so under the review rubric it is informational rather than blocking. Concrete fix: none required. docs/methodology/REGISTRY.md:L1639-L1640 diff_diff/honest_did.py:L643-L657

Code Quality

  • Severity: P1. Impact: the new panel=False survey guard returns np.nan/np.nan with inf_info=None when a treated or control cell has zero effective survey mass, but fit() still records that (g,t) because it only checks att_gt is not None. The shared bootstrap then builds gt_pairs from group_time_effects and unconditionally accesses influence_func_info[gt]. If one zero-mass cell coexists with at least one valid cell, the new RC survey/bootstrap path can error instead of handling the empty result set cleanly. Concrete fix: either return None for zero-mass cells, or only add (g,t) entries when inf_info is not None; also make bootstrap iterate defensively over bootstrappable keys only. diff_diff/staggered.py:L2631-L2640 diff_diff/staggered.py:L1498-L1531 diff_diff/staggered_bootstrap.py:L215-L216 diff_diff/staggered_bootstrap.py:L310-L315

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. The blocking items above are P1 correctness/edge-case issues and are not TODO-mitigable under the review rubric.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the added tests still do not lock the two blocker paths. There is no regression test that distinguishes the legitimate omitted-reference gap from a true interior gap in HonestDiD, and no repeated-cross-section survey/bootstrap test with zero-weight cells. More broadly, the new Phase 7a/7b suites are still mostly smoke/invariance checks rather than fixed-value parity against reference DRDID outputs, so subtle methodological drift can still slip through. Concrete fix: add one HonestDiD regression for a valid universal-base CS event study and one for a true interior gap; add one RC survey/bootstrap regression with zero-weight subpopulation filtering; and add fixed-data parity tests against DRDID outputs for the changed panel IPW/DR paths and the new RC estimators. tests/test_honest_did.py:L1125-L1302 tests/test_survey_phase7a.py:L56-L389 tests/test_staggered_rc.py:L48-L520

Path to Approval

  1. Replace the current HonestDiD “non-consecutive horizons” warning with a real grid validator that allows the single omitted reference-period gap but rejects any true interior gap before _construct_constraints_sd() / _construct_constraints_rm() run. Add a regression test for both the valid and invalid cases. diff_diff/honest_did.py:L702-L715
  2. Make the repeated-cross-section survey/bootstrap path robust to zero-effective-mass cells by ensuring non-bootstrappable (g,t) cells never reach _run_multiplier_bootstrap(), and add a regression test with zero survey weights that drops one RC cell but leaves others valid. diff_diff/staggered.py:L2631-L2636 diff_diff/staggered_bootstrap.py:L310-L315

…strap for zero-mass cells

HonestDiD: validate pre and post blocks separately for consecutive
horizons, allowing the expected reference-period gap between them.
Only warns on true interior gaps within a block (e.g., missing -2 in
[-3,-1,0,1]).

Bootstrap: filter gt_pairs to only those with influence_func_info,
skipping zero-mass (g,t) cells that have NaN ATT but no IF record.
Prevents KeyError when RCS survey weights zero out a cell.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 9f3cab4d4df2ae2f31860a80d62ade0104305183


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not execute the changed suites here because pytest, numpy, pandas, and scipy are unavailable in this environment.

Executive Summary

Methodology

  • Severity: P1. Impact: HonestDiD still accepts invalid event-time grids. _extract_event_study_params() removes the normalized reference period, then only checks adjacency within pre_times and post_times, and only emits a warning. That means true gaps adjacent to the omitted reference period, such as missing e=-2 or missing e=0, are not caught at all when one side has length 1; the downstream bound builders then use only num_pre / num_post and implicitly treat the surviving coefficients as consecutive. The official honest_did.AGGTEobj adapter stops unless the event-time vector is consecutive around the reference period, so this is still an undocumented methodology mismatch. Concrete fix: validate the actual retained event-time labels before returning from _extract_event_study_params(), allowing only the single omitted reference period and raising on any other gap. diff_diff/honest_did.py:L659-L723 diff_diff/honest_did.py:L780-L815 diff_diff/honest_did.py:L1192-L1314 (rdrr.io)
  • Severity: P3. Impact: The warning-only behavior for base_period != "universal" is now explicitly documented in the Methodology Registry, so it is mitigated under the stated rubric. Concrete fix: none required. docs/methodology/REGISTRY.md:L1639-L1640 diff_diff/honest_did.py:L643-L657

Code Quality

No unmitigated findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new deferrable item stood out. The remaining P1 HonestDiD grid-validation issue is not TODO-mitigable under the stated rubric.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: The new HonestDiD tests cover survey df propagation and event-study VCV plumbing, but they still do not lock the remaining blocker. There is no regression that proves a valid omitted-reference grid passes while an invalid interior or reference-adjacent gap raises. Concrete fix: add one positive regression with a valid universal-base CS event study and one negative regression with a missing e=-2 or e=0 horizon. tests/test_honest_did.py:L1125-L1302 diff_diff/honest_did.py:L702-L723

Path to Approval

  1. Replace the current warning-only gap check in HonestDiD with a hard validator that reconstructs the expected consecutive event-time grid, permits only the single omitted reference period, and raises before _construct_constraints_sd() / _construct_constraints_rm() run on any other gap. diff_diff/honest_did.py:L702-L723
  2. Add targeted regressions in tests/test_honest_did.py for both a valid universal-base event-study grid and an invalid grid with a missing horizon adjacent to or inside the retained pre/post blocks. tests/test_honest_did.py:L1125-L1302

…arning)

Match R's honest_did.AGGTEobj behavior: refuse to construct bounds when
retained event-study horizons have interior gaps within pre or post blocks.
The single gap for the omitted reference period is still allowed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c52905335a91f201786f9d039d4ae93a9c1d2ed6


Overall Assessment

⚠️ Needs changes — highest unmitigated severity: P1.

Static review only. I could not run the changed suites here because pytest is not installed and the runtime is also missing numpy.

Executive Summary

Methodology

  • Severity: P1. Impact: the previous HonestDiD consecutive-grid blocker is not fully fixed. The new validation in diff_diff/honest_did.py:702 only checks gaps within pre_times and post_times. It still misses gaps adjacent to the omitted reference period, e.g. missing e=0 with post_times=[1,2,...], or missing the last pre horizon with only one retained pre coefficient left. That can feed an invalid event-time layout into the bound constructors in diff_diff/honest_did.py:1190 while the official adapter would stop. This deviation is not documented in docs/methodology/REGISTRY.md:1640, so it remains blocking. Concrete fix: validate the full retained grid around the actual omitted reference period, allowing exactly one gap at that reference and raising on any other missing horizon. (rdrr.io)
  • Severity: P3. Impact: the warning-only behavior for base_period != "universal" is now explicitly documented, so under the stated rubric it is informational rather than blocking. Concrete fix: none required. diff_diff/honest_did.py:643 docs/methodology/REGISTRY.md:1640

Code Quality

No unmitigated findings.

Performance

No findings.

Maintainability

No unmitigated findings.

Tech Debt

No new deferrable item stood out. The remaining HonestDiD issue is a methodology blocker and is not TODO-mitigable under the stated review rules.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD tests cover survey df and VCV plumbing, but they still do not lock the remaining blocker. The changed tests in tests/test_honest_did.py:1125 exercise survey variance behavior, and the older reference-period tests in tests/test_honest_did.py:746 only verify filtering of the normalization coefficient. There is still no regression asserting that a missing e=0 or missing last pre horizon next to the omitted reference period raises. Concrete fix: add one positive consecutive-grid case and two negative cases for those reference-adjacent gaps.

Path to Approval

  1. Strengthen diff_diff/honest_did.py:702 so it validates the full event-time sequence around the omitted reference period, not just interior gaps inside pre_times and post_times.
  2. Add targeted regressions in tests/test_honest_did.py:1125 for:
    1. a valid universal-base event-study grid that should pass,
    2. a grid missing e=0, and
    3. a grid missing the last pre horizon adjacent to the reference period.

…n-block

Check that post_times[0] - pre_times[-1] == 2 (exactly one gap for the
omitted reference period). Catches missing e=0 or missing boundary
horizons that the within-block check missed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: e9995ef6ef6118d904a0e4a25dac32c87e8bc313


Overall Assessment

⛔ Blocker — highest unmitigated severity: P0.

Executive Summary

  • The prior zero-mass/bootstrap issue looks addressed: multiplier bootstrap now skips (g,t) cells that never produced influence-function data. diff_diff/staggered_bootstrap.py:214
  • The new survey-df and event-study-VCV plumbing for HonestDiD is directionally right, but one new inference bug and one remaining reference-period bug still block approval.
  • P0: undefined replicate-survey degrees of freedom can still turn into finite HonestDiD confidence intervals instead of NaN.
  • P1: HonestDiD still hardcodes the pre/post split around 0, so valid CallawaySantAnna(base_period="universal", anticipation>0) event-study grids are misclassified and typically rejected.
  • Static review only; I could not run the changed tests here because the local runtime is missing numpy.

Methodology

  • Severity: P0. Impact: the new HonestDiD survey-df path drops the project’s own “undefined survey df => NaN inference” guard for replicate-weight results. ResolvedSurveyDesign.df_survey explicitly returns None when replicate rank is <= 1, and safe_inference() treats df<=0 as undefined inference; but CallawaySantAnna.fit() only converts that case to 0 for its immediate safe_inference(...) calls and does not persist the sentinel into stored results. HonestDiD then re-reads survey_metadata.df_survey, sees None, and _compute_flci() falls back to a normal critical value via _get_critical_value(alpha, None). That means a rank-deficient replicate design can produce NaN CS inference but still get finite HonestDiD bounds, which is misleading statistical output. Concrete fix: preserve an explicit undefined-df sentinel on stored CS results (or add a dedicated flag), and gate HonestDiD CI construction the same way safe_inference does so df<=0 yields NaN/explicit failure instead of a normal-based CI. diff_diff/survey.py:605 diff_diff/utils.py:152 diff_diff/staggered.py:1468 diff_diff/staggered.py:1622 diff_diff/honest_did.py:735 diff_diff/honest_did.py:1008
  • Severity: P1. Impact: HonestDiD still assumes the omitted reference is between -1 and 0, but the CS event-study path documents and emits the normalization period at e = -1 - anticipation under base_period="universal". _extract_event_study_params() nevertheless defines pre_times as t < 0, post_times as t >= 0, and validates consecutiveness using that split. For anticipation>0, the negative anticipation horizons are treated as “pre”, so valid grids are misclassified and usually rejected as having gaps; even if they passed, the RM logic would use anticipation-contaminated periods as pre-trends. Concrete fix: infer the omitted reference from the normalization entry in results.event_study_effects (or store anticipation on CallawaySantAnnaResults) and split/check periods relative to that reference, not relative to 0. diff_diff/honest_did.py:668 diff_diff/honest_did.py:702 diff_diff/staggered_aggregation.py:703 docs/methodology/REGISTRY.md:390 The HonestDiD reference implementation validates consecutiveness around the omitted reference period, so this needs to be reference-aware as well. citeturn2view0
  • Severity: P3. Impact: the panel DR control-normalizer difference from DRDID::drdid_panel remains, but it is now explicitly documented in the registry, so under the stated rubric it is informational rather than blocking. Concrete fix: none required unless exact R parity is a goal. diff_diff/staggered.py:2292 docs/methodology/REGISTRY.md:420 citeturn3view0

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No new deferrable item stood out. The remaining issues are inference/methodology defects and are not TODO-mitigable under the review rubric.

Security

No findings.

Documentation/Tests

  • Severity: P2. Impact: the new HonestDiD tests cover survey df extraction, VCV attachment, and replicate-diagonal fallback, but they still only exercise the anticipation=0 reference layout and they do not cover the undefined-replicate-df edge case that triggers the P0 above. That means both blocking issues can slip through CI. Concrete fix: add one regression with CallawaySantAnna(base_period="universal", anticipation=2, aggregate="event_study") feeding HonestDiD, and one rank-1 replicate-weight case (for example duplicate/collinear replicate columns) asserting HonestDiD returns NaN bounds or raises a clear error rather than producing finite normal-based CIs. tests/test_honest_did.py:746 tests/test_honest_did.py:1125 tests/test_honest_did.py:1243

Path to Approval

  1. Preserve undefined replicate-survey d.f. into stored CS results and make HonestDiD return NaN/explicit failure when df<=0, matching the existing safe_inference contract.
  2. Make HonestDiD derive the omitted CS reference period from the results object and split/check pre/post relative to that reference instead of hardcoding the boundary at 0.
  3. Add targeted regressions for anticipation>0 universal-base CS results and for rank-deficient replicate-weight designs with undefined survey d.f.

P0 fix: When survey_metadata has replicate_method but df_survey=None
(rank-deficient replicate design), set df_survey=0 sentinel so
_get_critical_value returns NaN. Prevents finite HonestDiD bounds
when CS inference is NaN.

P1 fix: Infer omitted reference period from n_groups=0 entry in
event_study_effects (handles anticipation>0 where ref=-1-anticipation).
Split pre/post relative to reference, not hardcoded at t<0/t>=0.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@igerber
Copy link
Copy Markdown
Owner Author

igerber commented Mar 29, 2026

/ai-review

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant