feat: MA lobbying data pipeline and dashboard charts by nesanders · Pull Request #71 · nesanders/MAenvironmentaldata

nesanders · 2026-05-20T10:56:58Z

This PR adds a complete pipeline for collecting and analyzing Massachusetts lobbying disclosure data. The MA Secretary of State publishes semi-annual lobbying filings that document which organizations hired lobbyists, how much they paid, and which specific bills their lobbyists worked on. By pairing this with bill metadata from the MA Legislature API and semantic environmental relevance scoring via Google Gemini embeddings, we can identify which industries are spending most on environmental legislation, track whether heavily-lobbied bills are more or less likely to pass, and correlate lobbying intensity with trends in DEP enforcement and budget.

The data integrates naturally with existing AMEND analyses: lobbying spend can be plotted alongside DEP staffing and enforcement actions to surface relationships between industry influence and regulatory outcomes. All four new database tables are exposed in the AI Analysis tool's semantic context, enabling natural-language queries like "which companies spent the most lobbying against clean water bills" or "has lobbying spend on climate legislation increased since 2015." The pipeline is fully incremental — weekly CI runs exit early when no new semi-annual filings have been posted, so the added runtime cost is near-zero on most weeks.
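The early-exit behaviour described above can be sketched roughly as follows. This is a minimal illustration rather than the code in get_MA_lobbying.py: the helper names, the example URLs, and the single-column CSV layout are assumptions. Only the idea of comparing listed disclosure URLs against a stored disc_url set comes from the PR.

```python
import csv
import io


def load_seen_urls(csv_text: str) -> set[str]:
    """Read previously scraped disclosure URLs from CSV text with a disc_url column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["disc_url"] for row in reader if row.get("disc_url")}


def find_new_urls(listed_urls: list[str], seen: set[str]) -> list[str]:
    """Return listed URLs not yet scraped, preserving order and dropping repeats."""
    new, emitted = [], set()
    for url in listed_urls:
        if url not in seen and url not in emitted:
            new.append(url)
            emitted.add(url)
    return new


# Hypothetical example data; real URLs would come from the SoS search pages.
previous = "disc_url\nhttps://example.test/d/1\nhttps://example.test/d/2\n"
listed = [
    "https://example.test/d/1",
    "https://example.test/d/2",
    "https://example.test/d/3",
]

seen_urls = load_seen_urls(previous)
to_fetch = find_new_urls(listed, seen_urls)
if not to_fetch:
    print("No new filings; exiting early.")
else:
    print(f"{len(to_fetch)} new disclosure(s) to fetch")
```

When the listing contains nothing outside the stored set, `to_fetch` is empty and the run can stop after reading only the search pages.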

Summary

Scraper (get_MA_lobbying.py): scrapes the MA SoS lobbying disclosure portal using an iPad User-Agent (bypasses Incapsula WAF without Selenium). Incremental via a disc_url set stored in MA_lobbying_summary_links.csv — weekly CI exits early when no new semi-annual filings are posted, so most runs touch only 2 search pages and exit.
Legislature bills (get_MA_legislature_bills.py): fetches bill metadata (title, sponsor, committee, status, passed bool) from the MA Legislature OpenAPI for every unique (bill_number, general_court) pair in the lobbying data. JSON responses cached under MA_legislature_cache/ for incremental re-runs.
Environmental scoring (score_lobbying_bills.py): scores each bill for environmental relevance using Gemini gemini-embedding-2 cosine similarity against 20 seed phrases (threshold 0.60). Only unscored bills are embedded per run.
Dashboard charts (MA_lobbying_viz.py): 4 weekly-updated charts — annual spend trend, top 15 employers, bill intensity + pass rate, lobbying spend vs. enforcement actions. Plus 2 analysis-post charts.
Pipeline wiring: all scripts added to update-data.yml CI, assemble_db.py (4 new tables: MA_Lobbying_Employers, MA_Lobbying_Bills, MA_Lobbying_Lobbyists, MA_Legislature_Bills), validate_data.py (lobbying tables in OPTIONAL_DATASETS — CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py.
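To make the environmental scoring step concrete, here is a small sketch of cosine similarity against seed vectors with the 0.60 threshold from the summary. It is only an illustration: the toy 3-dimensional vectors stand in for real Gemini embeddings, and taking the maximum similarity over the seeds is an assumption about how the per-seed similarities are aggregated.

```python
import math

THRESHOLD = 0.60  # threshold stated in the PR summary


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def environmental_score(bill_vec, seed_vecs):
    """Highest similarity between a bill and any seed phrase (assumed aggregation)."""
    return max(cosine_similarity(bill_vec, s) for s in seed_vecs)


# Toy vectors standing in for embeddings of seed phrases and bill titles.
seed_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
bill_a = [0.9, 0.1, 0.0]  # points almost the same way as the first seed
bill_b = [0.0, 0.0, 1.0]  # orthogonal to both seeds

for name, vec in [("bill_a", bill_a), ("bill_b", bill_b)]:
    score = environmental_score(vec, seed_vectors)
    print(name, round(score, 3), score >= THRESHOLD)
```

In this toy setup `bill_a` clears the threshold and `bill_b` does not, which is the binary decision the scored table would record.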

Test plan

End-to-end test with --year 2024 --limit 10: produced 18 disclosure rows, 56 employer rows, 539 bill rows with correct entity names, compensation amounts, bill numbers, titles, and positions
Full 2024 fetch running in background — data CSVs will be committed as follow-up once complete
Run get_MA_legislature_bills.py against live API after lobbying fetch completes
Run score_lobbying_bills.py with real Google API key
Run dashboard_charts.py to verify chart generation once DB is assembled

🤖 Generated with Claude Code

Adds end-to-end pipeline for MA Secretary of State lobbying disclosures:

- get_MA_lobbying.py: scrapes SoS portal (iPad UA bypasses Incapsula WAF), incremental via disc_url set in summary_links CSV; weekly CI exits early when no new semi-annual filings are posted
- get_MA_legislature_bills.py: fetches bill metadata from MA Legislature OpenAPI for bills appearing in lobbying data; JSON cache under MA_legislature_cache/ for incremental re-runs
- score_lobbying_bills.py: scores bills for environmental relevance using Gemini embedding-2 cosine similarity against 20 seed phrases (threshold 0.60)
- MA_lobbying_viz.py: 4 dashboard charts (spend trend, top employers, bill intensity, lobbying vs enforcement) + 2 analysis-post charts
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables), validate_data.py (OPTIONAL_DATASETS so CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-20T10:58:59Z

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.
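The "hard pass" criterion can be illustrated with a short sketch using Python's built-in sqlite3 module. This is not the eval harness itself: the anti-pattern list, the table, and the function name are hypothetical, and only the three conditions (no known anti-pattern, SQL runs, rows come back) are taken from the description above.

```python
import re
import sqlite3

# Hypothetical anti-pattern list; the real eval suite defines its own.
ANTI_PATTERNS = [re.compile(r"select\s+\*", re.IGNORECASE)]


def hard_pass(conn, sql):
    """True if the SQL matches no anti-pattern, runs without error, and returns rows."""
    if any(p.search(sql) for p in ANTI_PATTERNS):
        return False
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False
    return len(rows) > 0


# Tiny in-memory stand-in for AMEND.db.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staff (year INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO staff VALUES (?, ?)", [(2005, "a"), (2005, "b"), (2006, "c")]
)

good = "SELECT year, COUNT(*) AS n FROM staff GROUP BY year ORDER BY year"
print(hard_pass(conn, good))                                        # runs, returns rows
print(hard_pass(conn, "SELECT * FROM staff"))                       # hits an anti-pattern
print(hard_pass(conn, "SELECT nope FROM staff"))                    # SQL error
print(hard_pass(conn, "SELECT year FROM staff WHERE year > 2100"))  # no rows
```

The judge score and the "fatal" flag come from a second LLM call and are not modelled here.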

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`0b4c17034694`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

…notes

- docs/data/MA_lobbying.md: new dataset page with source description, data tables (employers, bills, legislature bills), and download links
- docs/dashboard.md: add lobbying section with 4 chart includes and methodology note; add nav link
- CLAUDE.md: document Incapsula WAF bypass (iPad UA), conda run stdout buffering gotcha, correct Gemini SDK (google.genai not google.generativeai), full historical fetch timing, and REQUEST_DELAY tip for historical runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-20T16:03:46Z

✅ Semantic Eval Results

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`0b4c17034694`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

…ill rows) Full fetch of all 1,715 registrants for 2024. Historical years (2005–2023) to follow in a subsequent commit once the full fetch completes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ying viz

- MA_lobbying_viz.py: entity_name/compensation (not employer_name/total_expenditure); dual-axis charts use yAxisID='y'/'y1' + y2nd=1 per chartjs convention
- get_MA_legislature_bills.py: use /Documents/{billId} endpoint (not /Bills/); construct bill ID from chamber prefix + number; fetch history via separate DocumentHistoryActions URL; Action field (not StatusDescription) for passed
- Add initial 2024 dashboard chart outputs (3 of 4; bill intensity pending legislature data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- score_lobbying_bills.py: rewritten to embed bill titles directly from MA_lobbying_bills.csv (not legislature CSV); stores embeddings as MA_bill_embeddings.npy for clustering; incremental per run
- cluster_lobbying_bills.py: one-time k-means (default 15 clusters) on normalized embeddings + Gemini Flash labeling of each cluster; writes MA_bill_cluster_labels.csv and updates cluster_id in scored CSV
- MA_lobbying_viz.py: add Chart 5, a stacked bar of annual spend by topic cluster; gracefully skipped until cluster_lobbying_bills.py has been run
- dashboard.md: add cluster spend chart include
- requirements-ci.txt: add scikit-learn==1.8.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
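Why the clustering step runs on normalized embeddings can be seen in a toy example. The sketch below is a pure-Python stand-in for k-means (the PR itself adds scikit-learn, which is not used here), with hypothetical 2-dimensional vectors and centroids seeded from the first k points for determinism. It only illustrates that unit-normalizing makes Euclidean k-means group vectors by direction rather than magnitude.

```python
import math


def normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Third vector points the same way as the first but is much shorter.
raw = [[10.0, 1.0], [0.5, 9.0], [0.3, 0.02], [0.1, 2.0]]
unit = [normalize(v) for v in raw]

print("raw:       ", kmeans(raw, k=2))
print("normalized:", kmeans(unit, k=2))
```

On the raw vectors the short third vector is pulled into the second cluster by magnitude alone; after normalization it joins the first vector, which shares its direction.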

…ata from DB

- assemble_db.py: coerce bill_number/general_court to Int64 in MA_Lobbying_Bills, MA_Legislature_Bills, MA_Lobbying_Bills_Scored; add MA_Lobbying_Bills_Scored and MA_Bill_Cluster_Labels as DB tables so all downstream analysis reads from DB
- MA_lobbying_viz.py: remove CSV file reads; load scored bills and cluster labels from DB; remove redundant numeric coercions (now guaranteed by assemble_db.py)
- cluster_lobbying_bills.py: update to gemini-2.5-flash for cluster labeling
- score_lobbying_bills.py: differential cosine scoring with example bills
- Add dash_lobbying_bill_intensity.html and dash_lobbying_spend_by_cluster.html charts
- Update semantic context with new DB tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-20T22:19:09Z

✅ Semantic Eval Results

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`7dc233f87aac`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality as required.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

Both scripts now flush progress to disk frequently so an interrupt loses at most one disclosure (lobbying) or 50 bills (legislature) of work, rather than the entire in-progress run.

get_MA_lobbying.py:
- Load each CSV independently so a missing lobbyists file doesn't prevent resuming from employers/bills/links
- Flush all three CSVs to disk after every completed disclosure URL
- Print running totals with each flush for live progress monitoring

get_MA_legislature_bills.py:
- Append each bill to the combined DataFrame and flush every 50 bills
- Already had per-bill JSON cache; now the merged CSV is also safe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
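The flush-every-50 pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the script's code: the scripts use a DataFrame, whereas this sketch uses the csv module, the column names and the placeholder "fetch" are made up, and the write-to-temp-then-rename step is an extra safeguard the commit message does not mention.

```python
import csv
import os
import tempfile

FLUSH_EVERY = 50  # batch size mentioned for the legislature script
FIELDS = ["bill_number", "title"]  # hypothetical columns


def write_rows(path, rows):
    """Rewrite the whole CSV: write a temp file, then rename it over the target."""
    tmp = path + ".tmp"
    with open(tmp, "w", newline="", encoding="utf-8") as fh:
        writer = csv.DictWriter(fh, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp, path)  # assumption: swap the file in one step to avoid partial writes


def process(items, path, flush_every=FLUSH_EVERY):
    """Accumulate rows, flushing to disk every flush_every items and once at the end."""
    rows, flushes = [], 0
    for i, item in enumerate(items, start=1):
        rows.append({"bill_number": item, "title": f"Bill {item}"})  # stand-in for a fetch
        if i % flush_every == 0:
            write_rows(path, rows)
            flushes += 1
    write_rows(path, rows)  # final flush covers the remainder
    return flushes + 1


out_path = os.path.join(tempfile.mkdtemp(), "bills.csv")
n_flushes = process(range(1, 121), out_path)
with open(out_path, newline="", encoding="utf-8") as fh:
    saved = list(csv.DictReader(fh))
print(n_flushes, len(saved))
```

With 120 items the file is rewritten at items 50 and 100 and once more at the end, so an interrupt between flushes would lose at most 50 rows.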

- MA_lobbying.md: replace stale seed-phrase scoring description with accurate account of differential cosine similarity; add cluster summary table; add t-SNE section with lobbying_bill_tsne.html embed
- MA_lobbying_tsne.py: new script generating interactive Plotly t-SNE scatter of all lobbied bills coloured by cluster; env bills shown larger with white ring; hover shows bill title and cluster
- get_MA_lobbying.py: add exponential-backoff retry (5 attempts) on GET/POST timeouts and connection errors; remove unused existing_lobbyists

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
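A generic sketch of exponential-backoff retry with 5 attempts is shown below. It is not the scraper's implementation: the function name, the 1-second base delay, the doubling schedule, and the use of Python's built-in TimeoutError and ConnectionError (rather than an HTTP library's exceptions) are all assumptions made for illustration.

```python
import time


def retry_with_backoff(func, attempts=5, base_delay=1.0, sleep=time.sleep,
                       retry_on=(TimeoutError, ConnectionError)):
    """Call func(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))


calls = {"n": 0}
delays = []


def flaky_fetch():
    """Simulated request that times out twice before succeeding."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


# Record the delays instead of actually sleeping so the example runs instantly.
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Passing `sleep` as a parameter keeps the helper testable; in real use the default `time.sleep` would apply the waits.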

Explains the full MA lobbying pipeline (scripts 7–9 + cluster): scraping strategy (iPad UA, ASP.NET viewstate, incremental disc_url cache), modern vs. legacy HTML formats, legislature API endpoint quirks, differential cosine embedding scoring, and k-means clustering. Also covers credentials, CI pipeline order, and manual-only scripts. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-21T00:38:13Z

✅ Semantic Eval Results

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`7dc233f87aac`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality as required.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

Remove general repo overview, CI pipeline table, other scripts section, and SODA credential reference — lobbying-only content remains. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

github-actions · 2026-05-21T00:45:43Z

✅ Semantic Eval Results

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`7dc233f87aac`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern and date range filters.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query uses the correct table, groups by year, and counts employees accurately from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality and eventType, ensuring accurate results.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states.

nesanders and others added 4 commits May 20, 2026 12:05

data: add 2024 MA lobbying disclosures (1,650 employer rows, 14,822 b…

2b33f77

nesanders and others added 4 commits May 20, 2026 18:23

rename: README.md → README_lobbying.md in get_data/

8d516d2

docs: trim README_lobbying.md to lobbying pipeline only

6002fa7

nesanders wants to merge 11 commits into main from feat/ma-lobbying-data