@MaxGhenis
Contributor

Summary

  • Add census block-level geographic assignment that replaces county-only methodology
  • Assign a census block to each household, then derive ALL geography from it for consistency
  • Look up comprehensive Census geographies from block GEOID:
    • County, tract, state (from GEOID structure)
    • CBSA/metro area (via NBER county crosswalk)
    • SLDU/SLDL (state legislative districts upper/lower)
    • Place/City FIPS (incorporated places and CDPs)
    • PUMA (Public Use Microdata Area)
    • VTD (Voting Tabulation District)
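The "from GEOID structure" derivation relies on the fixed layout of a 15-digit census block GEOID: 2 digits of state FIPS, 3 of county, 6 of tract, and 4 of block. A minimal sketch of that slicing follows; the function and field names are hypothetical and not taken from this PR, and the sample GEOID is only illustrative.

```python
from typing import NamedTuple


class BlockParts(NamedTuple):
    state_fips: str   # 2 digits
    county_fips: str  # 5 digits (state + county)
    tract_geoid: str  # 11 digits (state + county + tract)
    block_geoid: str  # full 15 digits


def split_block_geoid(block_geoid: str) -> BlockParts:
    """Slice a 15-digit block GEOID into its nested geographies.

    GEOIDs are kept as strings so that leading zeros survive.
    """
    if len(block_geoid) != 15 or not block_geoid.isdigit():
        raise ValueError(f"expected a 15-digit block GEOID, got {block_geoid!r}")
    return BlockParts(
        state_fips=block_geoid[:2],
        county_fips=block_geoid[:5],
        tract_geoid=block_geoid[:11],
        block_geoid=block_geoid,
    )


parts = split_block_geoid("060750201001000")  # illustrative GEOID
```

Because each coarser geography is a prefix of the block GEOID, state, county, and tract cannot disagree with one another once the block is fixed.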

Data files

  • block_cd_distributions.csv.gz (25MB): P(block|CD) from 2020 Census populations + 119th Congress BEF
  • block_crosswalk.csv.gz (21MB): 8.1M blocks mapped to SLDU, SLDL, Place, VTD, PUMA from Census BAFs
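As a rough illustration of what P(block|CD) means, each block's probability can be computed as its population divided by the total population of its congressional district. The sketch below assumes a simple list of `(cd_geoid, block_geoid, population)` rows; the function name, row layout, and IDs are hypothetical and may differ from how the PR builds the file.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Row = Tuple[str, str, float]  # (cd_geoid, block_geoid, population)


def block_given_cd(rows: Iterable[Row]) -> Dict[str, Dict[str, float]]:
    """Return {cd_geoid: {block_geoid: P(block | CD)}} from block populations."""
    rows = list(rows)  # iterated twice below
    totals: Dict[str, float] = defaultdict(float)
    for cd, _block, pop in rows:
        totals[cd] += pop

    probs: Dict[str, Dict[str, float]] = defaultdict(dict)
    for cd, block, pop in rows:
        if totals[cd] > 0:  # skip districts with no recorded population
            probs[cd][block] = pop / totals[cd]
    return dict(probs)


# Toy input with made-up identifiers.
toy_rows = [
    ("0101", "block_a", 300.0),
    ("0101", "block_b", 100.0),
    ("0102", "block_c", 50.0),
]
toy_probs = block_given_cd(toy_rows)
```

Within each district the probabilities sum to 1, so zero-population blocks are never drawn.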

Test plan

  • Tests for block assignment shape and determinism
  • Tests for geography lookup (county, tract, state from block GEOID)
  • Tests for CBSA lookup from county FIPS
  • Tests for SLDU/SLDL lookups
  • Tests for Place/City lookups
  • Tests for PUMA lookups
  • Tests for VTD lookups
  • Tests for integrated assignment returning all geography fields
  • Tests for County enum integration (backwards compatibility)

🤖 Generated with Claude Code

Collaborator

@baogorek left a comment

This looks very clean and well-structured. Approved, but given the scope of this PR, and the fact that tests are failing through no fault of this PR, I think it makes sense to finish #473 first. I will need a review after I make a few more changes.

@baogorek
Collaborator

For my own reference:

  PR #484 Summary: Data Enhancement, NOT Architectural Change                                                                                                 
                                                                                                                                                              
  Core Change                                                                                                                                                 
                                                                                                                                                              
  This PR improves the geographic assignment granularity within the existing local area calibration system. It's an enhancement to data quality, not a structural change.

  Before: Counties were assigned to households using county_assignment.py with P(county|CD) distributions.

  After: Census blocks (15-digit GEOIDs) are assigned using block_assignment.py with P(block|CD) distributions, and ALL other geography is derived from the block for consistency.
                                                                                                                                                              
  What the block assignment provides                                                                                                                          
                                                                                                                                                              
  From a single block GEOID, the system now derives:                                                                                                          
  - From GEOID structure directly: state, county, tract                                                                                                       
  - Via crosswalks: CBSA/metro area, SLDU, SLDL, Place/City, PUMA, VTD                                                                                        
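For the crosswalk-based fields, the lookup can be pictured as two dictionary joins: one keyed by the 5-digit county FIPS (for CBSA) and one keyed by the full block GEOID (for everything else). This is only a sketch under assumptions: the table contents are illustrative toy values, and `lookup_geography` is a hypothetical helper, though the output field names follow the variables listed in this PR.

```python
from typing import Dict, Optional

# Illustrative toy crosswalks; real tables would be loaded from the data files.
COUNTY_TO_CBSA: Dict[str, str] = {"06075": "41860"}
BLOCK_CROSSWALK: Dict[str, Dict[str, Optional[str]]] = {
    "060750201001000": {
        "sldu": "011",
        "sldl": "017",
        "place_fips": "67000",
        "puma": "07501",
        "vtd": None,  # not every block has every geography
    },
}
CROSSWALK_FIELDS = ("sldu", "sldl", "place_fips", "puma", "vtd")


def lookup_geography(block_geoid: str) -> Dict[str, Optional[str]]:
    """Derive crosswalk geographies for one block; missing values become None."""
    county_fips = block_geoid[:5]
    row = BLOCK_CROSSWALK.get(block_geoid, {})
    geography: Dict[str, Optional[str]] = {
        "block_geoid": block_geoid,
        # Counties outside any metro/micro area have no CBSA, hence .get().
        "cbsa_code": COUNTY_TO_CBSA.get(county_fips),
    }
    for field in CROSSWALK_FIELDS:
        geography[field] = row.get(field)
    return geography


known = lookup_geography("060750201001000")
unknown = lookup_geography("010010201001000")  # not in the toy crosswalk
```

Returning `None` rather than raising keeps households in rural counties or unincorporated areas from failing the lookup, which seems the safer default for a sketch like this.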
                                                                                                                                                              
  Changes to stacked_dataset_builder.py                                                                                                                       
                                                                                                                                                              
  The changes are minimal and preserve the existing flow:
  # Before: county-only assignment
  from .county_assignment import assign_counties_for_cd
  county_indices = assign_counties_for_cd(cd_geoid, n_households, seed)
  cd_sim.set_input("county", time_period, county_indices)

  # After: assign a block, then derive all geography from it
  from .block_assignment import assign_geography_for_cd
  geography = assign_geography_for_cd(cd_geoid, n_households, seed)
  cd_sim.set_input("county", time_period, geography["county_index"])  # backwards compat
  cd_sim.set_input("block_geoid", time_period, geography["block_geoid"])
  # ... and so on for the other geography variables
                                                                                                                                                              
  New data files                                                                                                                                              
                                                                                                                                                              
  - block_cd_distributions.csv.gz (26MB) - P(block|CD) from Census population                                                                                 
  - block_crosswalk.csv.gz (22MB) - Maps 8.1M blocks to other geographies                                                                                     
                                                                                                                                                              
  Unrelated changes also in this PR                                                                                                                           
                                                                                                                                                              
  - Adds partnership_se_income variable from PUF (Schedule K-1 Box 14)                                                                                        
  - Pandas 3.0 compatibility fixes                                                                                                                            
                                                                                                                                                              
  Bottom line                                                                                                                                                 
                                                                                                                                                              
  This is a data improvement: more granular geographic data within the existing CD-stacked dataset architecture. The system still builds CD-stacked datasets the same way; it just has richer geographic information.

@MaxGhenis
Contributor Author

Let's add ZIP from block too
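One way a ZIP-style lookup could work is through ZCTAs (ZIP Code Tabulation Areas, the Census Bureau's areal approximation of ZIP codes) stored as an extra column in a gzipped block crosswalk. The sketch below is an assumption-laden illustration: the column names `block_geoid` and `zcta`, the helper `load_zcta_column`, and the sample rows are all hypothetical.

```python
import csv
import gzip
import io
from typing import Dict, Optional


def load_zcta_column(gz_bytes: bytes) -> Dict[str, Optional[str]]:
    """Read a gzipped CSV into {block_geoid: zcta}, with None for blank ZCTAs."""
    with gzip.open(io.BytesIO(gz_bytes), mode="rt", newline="") as fh:
        reader = csv.DictReader(fh)
        # Values stay strings so leading zeros (e.g. "02139") are preserved.
        return {row["block_geoid"]: (row["zcta"] or None) for row in reader}


# Tiny in-memory stand-in for a crosswalk file, with made-up rows.
sample_csv = (
    "block_geoid,zcta\n"
    "060750201001000,94102\n"
    "250173521011000,02139\n"
    "060750201001001,\n"
)
zcta_by_block = load_zcta_column(gzip.compress(sample_csv.encode("utf-8")))
```

ZCTAs do not cover every block, so a blank value is mapped to `None` here rather than to a placeholder code.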

MaxGhenis and others added 7 commits January 26, 2026 14:16
- Assign census blocks using P(block|CD) from Census population data
- Look up all geography from block GEOID for consistency:
  - County, tract, state (from GEOID structure)
  - CBSA/metro area (via NBER county crosswalk)
  - SLDU/SLDL (state legislative districts)
  - Place/City FIPS (via Census BAF)
  - PUMA (via tract crosswalk)
  - VTD (voting tabulation district)
- Add block_crosswalk.csv.gz (8.1M blocks) with BAF data
- Add make_block_crosswalk.py to generate crosswalk from Census BAFs
- Add comprehensive tests for all geography lookups

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Updates stacked_dataset_builder.py to:
- Set all geography variables from block assignment (block_geoid, tract_geoid,
  cbsa_code, sldu, sldl, place_fips, vtd, puma)
- Include these variables in the saved h5 files

These variables enable granular geographic analysis at multiple levels:
state legislative districts, census tracts, metro areas, cities, etc.

Requires policyengine-us#7249 for the variable definitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Add ZCTA column to block_crosswalk.csv.gz from Census relationship file
- Add get_zcta_from_block() function and include zcta in assign_geography_for_cd()
- Save zcta to CD-stacked dataset output
- Add tests for ZCTA lookup

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@baogorek force-pushed the census-block-assignment branch from ca7e692 to 0906da9 January 26, 2026 19:48
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@MaxGhenis merged commit 3b1180f into main Jan 26, 2026
7 checks passed
@MaxGhenis deleted the census-block-assignment branch January 26, 2026 22:41
3 participants