Add survey Phase 6: replicate weights, DEFF diagnostics, subpopulation analysis#238
Add survey Phase 6: replicate weights, DEFF diagnostics, subpopulation analysis#238
Conversation
…n analysis Complete the final Phase 6 survey features: - Replicate weight variance (BRR, Fay, JK1, JKn) as alternative to TSL - Per-coefficient DEFF diagnostics comparing survey vs SRS variance - Subpopulation analysis via SurveyDesign.subpopulation() Bug fixes: - EfficientDiD hausman_pretest() stale n_cl after NaN filtering - ContinuousDiD event-study anticipation filter Refactoring: - Extract _format_survey_block() helper across 11 results files - Rename DEFF display label to "Kish DEFF (weights)" Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
P1 fixes: - Implement JKn with explicit replicate_strata (per-stratum scaling) - Fix replicate IF variance scale: use weighted sums not means - Propagate replicate dispatch to ContinuousDiD, EfficientDiD, TripleDifference - Allow zero weights in solve_logit (matching solve_ols) - Preserve replicate metadata in SurveyDesign.subpopulation() P2 fixes: - Add DEFFDiagnostics and compute_deff_diagnostics to __all__ - Show replicate method/count in survey summary block - Update docs for JKn replicate_strata requirement Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 3 review fixes: - Add ResolvedSurveyDesign.subset_to_units() helper to carry replicate metadata through panel→unit collapse in ContinuousDiD and EfficientDiD - Normalize replicate weight columns to sum=n for pweight/aweight (matching full-sample normalization for scale-invariant IF variance) - Extend _validate_unit_constant_survey() to check replicate weight columns - Set n_psu=None for replicate designs in metadata (was bogus implicit count) - Warn on invalid replicate solves instead of silently dropping Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… tests Round 4 review fixes: - Fix compute_replicate_if_variance() to match compute_survey_if_variance() contract: accept psi as-is, use weight-ratio rescaling (w_r/w_full) for replicate contrasts instead of raw weight multiplication - Add positive-mass guard in solve_ols() and solve_logit(): reject all-zero weight vectors to prevent silent empty-sample fits - Narrow exception catch in compute_replicate_vcov() to LinAlgError/ValueError - Add numerical test comparing replicate IF variance to TSL IF variance on same PSU structure (ratio within [0.5, 2.0]) - Add test for all-zero weight rejection Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Document the Phase 6 survey methodology additions: - Replicate weight variance: BRR/Fay/JK1/JKn formulas, IF contract (weight-ratio rescaling matching compute_survey_if_variance), df=R-1, normalization convention, JKn replicate_strata requirement - DEFF diagnostics: per-coefficient design effect formula, effective n, opt-in computation - Subpopulation analysis: domain estimation via zero-weight preservation, replicate metadata handling, solver zero-weight guards Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 5 review fixes: - solve_logit(): validate effective weighted sample (positive-weight rows) for class support and parameter identification before IRLS - compute_replicate_if_variance(): use np.divide(where=) to avoid divide-by-zero warnings on zero full-sample weights - Add regression tests: single-class positive-weight, too-few positive-weight obs, and zero-weight replicate IF (no RuntimeWarning) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run _detect_rank_deficiency() on positive-weight rows when weights contain zeros, so rank-deficient subpopulation/domain samples are rejected even when the full padded design is full rank. Add regression test with collinear positive-weight subset. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… DEFF Round 7 review fixes: - P0: TripleDifference replicate IF path now uses raw combined IF (not TSL-deweighted) for IPW/DR methods, matching REGISTRY contract - P1: ContinuousDiD rejects replicate_weights + n_bootstrap>0 with NotImplementedError (replicate variance is analytical, not bootstrap) - P2: LinearRegression.compute_deff() handles rank-deficient models by computing SRS vcov on kept columns only, expanding with NaN - Tests: ContinuousDiD replicate+bootstrap rejection, TripleDiff replicate regression method end-to-end Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- compute_deff(): return all-NaN DEFFDiagnostics directly when all coefficients are dropped, instead of calling compute_deff_diagnostics() on a singular design - Parameterize TripleDiff replicate test over reg/ipw/dr to cover the previously fixed IPW/DR IF scale bug Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 9 review fixes: - SunAbraham: reject replicate-weight survey designs with NotImplementedError (weighted within-transformation must be recomputed per replicate, not yet implemented) - subpopulation(): validate masks for NaN before bool coercion to prevent silent inclusion of missing-valued observations - Tests: SunAbraham replicate rejection, NaN mask rejection Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… ContinuousDiD The TSL path via compute_survey_vcov(X_ones, if_vals, resolved) applies implicit score weighting (w * if) and bread normalization (1/sum(w)^2). The replicate IF path must apply equivalent score scaling before calling compute_replicate_if_variance() to produce matching SEs. Verified empirically: JK1 replicate SE now matches TSL SE within 0.3% on a toy weighted design (ratio 0.9967, previously 60x inflated). Score scaling by estimator: - EfficientDiD: psi = w * eif / sum(w) - ContinuousDiD: psi = w * if_vals (tsl_scale cancels with bread) - TripleDifference reg: psi = w * inf_func / sum(w) - TripleDifference ipw/dr: psi = inf_func / sum(w) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…t rank action Round 11 review fixes: - P0: Fix subpopulation() mask validation — remove except that swallowed its own ValueError for None masks. None values now properly rejected. - P1: EfficientDiD rejects replicate weights + n_bootstrap>0 with NotImplementedError (matching ContinuousDiD/SunAbraham pattern) - P1: solve_logit() effective-sample rank check now respects rank_deficient_action (warn/silent/error) instead of hard-erroring - P2: Update survey-roadmap.md with replicate-weight limitations (SunAbraham, ContinuousDiD/EfficientDiD bootstrap) - Tests: None mask rejection, EfficientDiD replicate+bootstrap rejection, logit rank-deficient warn vs error modes Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TripleDifference: _compute_cell_means() validates positive survey mass per cell before np.average(), raises ValueError on zero-weight cells - ContinuousDiD: _compute_dose_response_gt() checks sum(w_treated) and sum(w_control) > 0, returns NaN for cells with zero effective mass instead of crashing on weighted mean/bread division Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 13 review fixes: - P0: compute_replicate_vcov() and compute_replicate_if_variance() return NaN when fewer than 2 valid replicates remain - P1: solve_logit() now actually drops rank-deficient columns from the effective positive-weight design in warn/silent modes (was warn-only) - P1: ContinuousDiD filters NaN cells from gt_results before aggregation so one zero-mass cell doesn't poison valid aggregates Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Round 14 review fixes: - CallawaySantAnna: reject replicate-weight + n_bootstrap>0 with NotImplementedError (matching ContinuousDiD/EfficientDiD/SunAbraham) - BaconDecomposition: guard weighted np.average() calls against zero effective weight in both treated-vs-never and timing comparison cells - Test: CS replicate+bootstrap rejection Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
P1 fixes from CI AI review (PR #238): - solve_logit(): track original column count and expand returned beta back to p+1 length after effective-sample column dropping. Previously returned a shortened vector breaking the solver contract. - subpopulation(): reject string/object masks that would silently coerce non-empty strings to True, defining the wrong domain. - REGISTRY.md: add Note entries for estimator-level replicate limitations (SunAbraham rejection, CS/ContinuousDiD/EfficientDiD bootstrap rejection) Tests: assert beta length p+1 after zero-weight rank-deficient solve, assert string mask raises ValueError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I did not execute the test suite in this sandbox because the available Python environment is missing project dependencies ( |
…ocument df
CI review findings:
- Reject non-binary numeric masks in subpopulation() ({1,2} etc. coerce
to all-True via astype(bool), silently defining wrong domain)
- Fix test_survey_phase4.py: update "strictly positive" to "non-negative"
to match changed solve_logit() validation message
- Document replicate df limitation in TODO.md (df stays R-1 when invalid
replicates are dropped — marginal impact for typical R > 50)
- Add REGISTRY.md Note entries for replicate <2 valid returns NaN
- Tests: non-binary numeric mask rejection, beta length assertion
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I could not execute the test suite in this sandbox because |
CallawaySantAnna._collapse_survey_to_unit_level() now carries through replicate_weights (collapsed via groupby().first()), replicate_method, fay_rho, n_replicates, and replicate_strata. Previously these fields were dropped, causing replicate-weight fits to silently fall back to non-replicate aggregation SEs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move the replicate df limitation from TODO.md (which only silences P2/P3) to REGISTRY.md with a **Note:** label (which the CI reviewer recognizes and downgrades to P3). The df stays R-1 instead of n_valid-1 when invalid replicates are dropped — marginal impact for typical R > 50. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When invalid replicates are dropped in compute_replicate_vcov() or compute_replicate_if_variance(), update resolved.n_replicates to the valid count so df_survey = n_valid - 1 flows through to t-based inference. Previously df stayed at R-1 (original count), making p-values/CIs stale when replicates failed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
I could not execute the test suite in this sandbox because |
Stop mutating resolved.n_replicates in place — instead return (result, n_valid) tuples from compute_replicate_vcov() and compute_replicate_if_variance(). Callers unpack the tuple and LinearRegression.fit() uses n_valid-1 for survey_df. This eliminates the shared-object mutation that the CI reviewer flagged as P0 (order-dependent bugs on reused resolved designs) while properly threading the effective df through inference. Updated all 7 callers across 5 files + 4 test call sites. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Executive Summary
Methodology
Code Quality
Performance No findings. Maintainability No findings beyond the live inference-state and d.f.-propagation issues above. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Re-review result: the prior CallawaySantAnna zero-effective-mass blocker appears fixed, and the ContinuousDiD anticipation / EfficientDiD Executive Summary
Methodology
Code Quality No unmitigated findings. Performance No findings. Maintainability No findings beyond the live replicate-d.f. issue above. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
…sDiD df - Fix precomputed key: "df_survey" not "survey_df" in CS aggregation - CS: re-read df_survey from precomputed after aggregation so overall ATT inference uses updated n_valid-1 when replicate columns are dropped - ContinuousDiD: track _rep_n_valid across replicate IF calls, use min(n_valid) for df_survey in analytical SE return Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The prior zero-effective-mass fix in CallawaySantAnna, the ContinuousDiD anticipation filter fix, and the EfficientDiD Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…aths - EfficientDiD: update self._survey_df to n_valid-1 after replicate IF variance when replicates are dropped - TripleDifference: store _replicate_n_valid from compute_replicate_if_variance, use it to override survey_metadata.df_survey in fit() for inference Completes IF-based replicate df propagation across all estimators: CS (via precomputed), ContinuousDiD (via _rep_n_valid), EfficientDiD (via self._survey_df), TripleDifference (via self._replicate_n_valid). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The prior EfficientDiD Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Prevent stale replicate df from a prior fit leaking into a later non-replicate survey fit on the same estimator instance. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The earlier Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
Match R svrepdesign() conventions: - combined_weights (default True): whether replicate columns include full-sample weight or are perturbation factors - replicate_scale: override default variance scaling factor - replicate_rscales: per-replicate scaling factors (vector of R) - mse (default True): center on full-sample estimate vs replicate mean - Remove per-column sum=n normalization that broke weight ratios in IF path - Update compute_replicate_vcov() and compute_replicate_if_variance() to use new fields for scaling, centering, and weight combination - Document in REGISTRY.md, add scale invariance and validation tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The prior Static review only; I could not execute the new tests in this environment because the available Python runtime is missing the project deps. Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…its, CS collapse Carry combined_weights, replicate_scale, replicate_rscales, and mse through all three helper constructors that build derivative ResolvedSurveyDesign or SurveyDesign objects: - SurveyDesign.subpopulation() → new SurveyDesign - ResolvedSurveyDesign.subset_to_units() → new ResolvedSurveyDesign - CallawaySantAnna._collapse_survey_to_unit_level() → new ResolvedSurveyDesign Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment The prior P1 propagation issue looks fixed: replicate metadata now survives Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
When replicates are dropped, keep the design's original scaling factor (e.g., (R-1)/R for JK1) rather than recomputing with n_valid. Dropped replicates contribute zero to the variance sum but don't change the design structure. n_valid is still used for df_survey = n_valid - 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Static review only; I could not execute the added tests in this sandbox because the Python environment is missing Executive Summary
Methodology
Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Path to Approval
|
…licate designs P0 fix: resolve() was normalizing full-sample weights to sum=n but leaving replicate columns raw, so compute_replicate_if_variance()'s w_r/w_full ratio was off by W/n. Now skips normalization for replicate designs so both are on the same raw scale. Also fix JKn dropped-replicate scaling: use original per-stratum counts (not valid-only counts) for the (n_h-1)/n_h factor. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…t obs - Reject combined_weights=True designs where w_r>0 but w_full=0 (the w_r/w_full ratio is undefined; silently zeroing gives wrong SEs) - Zero out aweight TSL scores for zero-weight observations to prevent arbitrary residuals from entering the meat computation Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall assessment Executive summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
- Replicate df now uses numerical rank of replicate weight matrix (matching R's survey::degf()) instead of hardcoded R-1 - Default mse=False (matching R's survey::svrepdesign() package default) - Update REGISTRY.md to reflect both changes - WLS identifiability for zero-weight subpopulation already handled by existing zero-mass guards in ContinuousDiD/TripleDifference Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ai-review |
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ⛔ Blocker Most of the prior re-review blockers look fixed: Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
…for df Explicitly document in REGISTRY that the variance scaling factor uses the original design's R (matching R's survey package convention), while survey df uses n_valid-1 for t-based inference. This resolves the oscillating review feedback by making the design decision explicit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
SurveyDesign.subpopulation()hausman_pretest()stalen_clafter NaN filtering_format_survey_block()helper across 11 results filesResolvedSurveyDesign.subset_to_units()for panel→unit collapse with replicate metadatasolve_ols(),solve_logit(), and estimator cell meansMethodology references (required if estimator / math changes)
Validation
tests/test_survey_phase6.py(53 new tests),tests/test_survey_phase3.py,tests/test_survey_phase4.py,tests/test_survey_phase5.py(coverage gap tests),tests/test_survey.py,tests/test_efficient_did.py,tests/test_continuous_did.pySecurity / privacy
Generated with Claude Code