feat: MA lobbying data pipeline and dashboard charts #71

Open: nesanders wants to merge 11 commits into main from feat/ma-lobbying-data

@nesanders (Owner) commented May 20, 2026

This PR adds a complete pipeline for collecting and analyzing Massachusetts lobbying disclosure data. The MA Secretary of State publishes semi-annual lobbying filings that document which organizations hired lobbyists, how much they paid, and which specific bills their lobbyists worked on. By pairing this with bill metadata from the MA Legislature API and semantic environmental relevance scoring via Google Gemini embeddings, we can identify which industries are spending most on environmental legislation, track whether heavily lobbied bills are more or less likely to pass, and correlate lobbying intensity with trends in DEP enforcement and budget.

The data integrates naturally with existing AMEND analyses: lobbying spend can be plotted alongside DEP staffing and enforcement actions to surface relationships between industry influence and regulatory outcomes. All four new database tables are exposed in the AI Analysis tool's semantic context, enabling natural-language queries like "which companies spent the most lobbying against clean water bills" or "has lobbying spend on climate legislation increased since 2015." The pipeline is fully incremental — weekly CI runs exit early when no new semi-annual filings have been posted, so the added runtime cost is near-zero on most weeks.

Summary

  • Scraper (get_MA_lobbying.py): scrapes the MA SoS lobbying disclosure portal using an iPad User-Agent (bypasses Incapsula WAF without Selenium). Incremental via a disc_url set stored in MA_lobbying_summary_links.csv — weekly CI exits early when no new semi-annual filings are posted, so most runs touch only 2 search pages and exit.
  • Legislature bills (get_MA_legislature_bills.py): fetches bill metadata (title, sponsor, committee, status, passed bool) from the MA Legislature OpenAPI for every unique (bill_number, general_court) pair in the lobbying data. JSON responses cached under MA_legislature_cache/ for incremental re-runs.
  • Environmental scoring (score_lobbying_bills.py): scores each bill for environmental relevance using Gemini gemini-embedding-2 cosine similarity against 20 seed phrases (threshold 0.60). Only unscored bills are embedded per run.
  • Dashboard charts (MA_lobbying_viz.py): 4 weekly-updated charts — annual spend trend, top 15 employers, bill intensity + pass rate, lobbying spend vs. enforcement actions. Plus 2 analysis-post charts.
  • Pipeline wiring: all scripts added to update-data.yml CI, assemble_db.py (4 new tables: MA_Lobbying_Employers, MA_Lobbying_Bills, MA_Lobbying_Lobbyists, MA_Legislature_Bills), validate_data.py (lobbying tables in OPTIONAL_DATASETS — CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py.
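The incremental behaviour described for the scraper can be sketched roughly as below. This is a minimal illustration, not the code in get_MA_lobbying.py: the `year` column, the example URLs, and both helper names are hypothetical, and only the `disc_url` column name comes from the description above.

```python
import csv
import io


def load_seen_urls(csv_text: str) -> set[str]:
    """Collect the disc_url values already recorded in the links CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["disc_url"] for row in reader if row.get("disc_url")}


def new_disclosures(listed_urls, seen):
    """Return URLs that have not been fetched yet, in listing order, without duplicates."""
    fresh = []
    for url in listed_urls:
        if url not in seen and url not in fresh:
            fresh.append(url)
    return fresh


# Hypothetical contents standing in for MA_lobbying_summary_links.csv.
existing_csv = (
    "disc_url,year\n"
    "https://example.invalid/disclosure/1,2024\n"
    "https://example.invalid/disclosure/2,2024\n"
)
seen = load_seen_urls(existing_csv)

# Hypothetical URLs found on the first search pages of the portal.
listed = [
    "https://example.invalid/disclosure/1",
    "https://example.invalid/disclosure/2",
]

todo = new_disclosures(listed, seen)
if not todo:
    print("No new filings; exiting early.")
```

Keeping the seen set keyed on the disclosure URL means a run only pays for the listing pages when nothing new has been posted.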

Test plan

  • End-to-end test with --year 2024 --limit 10: produced 18 disclosure rows, 56 employer rows, 539 bill rows with correct entity names, compensation amounts, bill numbers, titles, and positions
  • Full 2024 fetch running in background — data CSVs will be committed as follow-up once complete
  • Run get_MA_legislature_bills.py against live API after lobbying fetch completes
  • Run score_lobbying_bills.py with real Google API key
  • Run dashboard_charts.py to verify chart generation once DB is assembled

🤖 Generated with Claude Code

Adds end-to-end pipeline for MA Secretary of State lobbying disclosures:

- get_MA_lobbying.py: scrapes SoS portal (iPad UA bypasses Incapsula WAF),
  incremental via disc_url set in summary_links CSV — weekly CI exits early
  when no new semi-annual filings are posted
- get_MA_legislature_bills.py: fetches bill metadata from MA Legislature
  OpenAPI for bills appearing in lobbying data; JSON cache under
  MA_legislature_cache/ for incremental re-runs
- score_lobbying_bills.py: scores bills for environmental relevance using
  Gemini embedding-2 cosine similarity against 20 seed phrases (threshold 0.60)
- MA_lobbying_viz.py: 4 dashboard charts (spend trend, top employers, bill
  intensity, lobbying vs enforcement) + 2 analysis-post charts
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables),
  validate_data.py (OPTIONAL_DATASETS so CI doesn't fail before first fetch),
  generate_semantic_context.py, and dashboard_charts.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
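To make the seed-phrase scoring in this commit concrete, here is a small sketch of thresholded cosine similarity. It assumes the embeddings have already been computed, uses toy 3-dimensional vectors in place of real Gemini embeddings, and assumes the per-bill score is the maximum similarity over the seeds; the commit message does not say how the 20 similarities are combined, so that choice is a guess.

```python
import math

THRESHOLD = 0.60  # value quoted in the commit message


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def env_score(bill_vec, seed_vecs):
    """Assumed aggregation: best match against any seed phrase."""
    return max(cosine(bill_vec, seed) for seed in seed_vecs)


def is_environmental(bill_vec, seed_vecs, threshold=THRESHOLD):
    return env_score(bill_vec, seed_vecs) >= threshold


# Toy vectors standing in for seed-phrase and bill-title embeddings.
seeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
close_bill = [0.9, 0.1, 0.0]   # points almost along the first seed
far_bill = [0.0, 0.2, 1.0]     # mostly orthogonal to both seeds

print(is_environmental(close_bill, seeds))
print(is_environmental(far_bill, seeds))
```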
@github-actions (Contributor)

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

| Metric | Value |
| --- | --- |
| Hard pass rate | 10/10 (100%) |
| Fatal failures | 0 |
| Mean judge score | 5.0/5 |
| P50 judge score | 5/5 |
| Model | gpt-4o-mini |
| Semantic context hash | 0b4c17034694 |

Per-case results

| ID | Hard pass | Score | Fatal | Reason |
| --- | --- | --- | --- | --- |
| cso_top_operator | yes | 5/5 | no | The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required. |
| cso_monthly_rainfall | yes | 5/5 | no | The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately. |
| cso_by_watershed | yes | 5/5 | no | The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType. |
| enforcement_vs_budget | yes | 5/5 | no | The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters. |
| staffing_trend | yes | 5/5 | no | The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present. |
| 303d_impaired_trend | yes | 5/5 | no | The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly. |
| 303d_named_waterbody | yes | 5/5 | no | The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody. |
| cso_to_impaired | yes | 5/5 | no | Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate. |
| all_caps_boston | yes | 5/5 | no | The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE. |
| ecos_per_capita | yes | 5/5 | no | The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately. |
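For readers unfamiliar with how the headline metrics relate to the per-case rows, the sketch below shows one plausible way to aggregate them. The record fields (`hard_pass`, `score`, `fatal`) and the `summarize` helper are hypothetical and are not taken from the eval harness itself; the ten all-passing cases simply mirror the results reported above.

```python
from statistics import mean, median


def summarize(cases):
    """Aggregate per-case eval records into headline metrics."""
    if not cases:
        raise ValueError("no eval cases to summarize")
    total = len(cases)
    hard = sum(1 for c in cases if c["hard_pass"])
    return {
        "hard_pass": hard,
        "total": total,
        "hard_pass_pct": round(100 * hard / total),
        "fatal": sum(1 for c in cases if c["fatal"]),
        "mean_score": mean(c["score"] for c in cases),
        "p50_score": median(c["score"] for c in cases),  # P50 is the median
    }


# Ten hypothetical records matching the reported run: all pass, all score 5.
cases = [
    {"id": f"case_{i}", "hard_pass": True, "score": 5, "fatal": False}
    for i in range(10)
]
summary = summarize(cases)
print(f"Hard pass rate {summary['hard_pass']}/{summary['total']} "
      f"({summary['hard_pass_pct']}%)")
```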

…notes

- docs/data/MA_lobbying.md: new dataset page with source description,
  data tables (employers, bills, legislature bills), and download links
- docs/dashboard.md: add lobbying section with 4 chart includes and
  methodology note; add nav link
- CLAUDE.md: document Incapsula WAF bypass (iPad UA), conda run stdout
  buffering gotcha, correct Gemini SDK (google.genai not google.generativeai),
  full historical fetch timing, and REQUEST_DELAY tip for historical runs

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (Contributor)

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

| Metric | Value |
| --- | --- |
| Hard pass rate | 10/10 (100%) |
| Fatal failures | 0 |
| Mean judge score | 5.0/5 |
| P50 judge score | 5/5 |
| Model | gpt-4o-mini |
| Semantic context hash | 0b4c17034694 |

Per-case results

| ID | Hard pass | Score | Fatal | Reason |
| --- | --- | --- | --- | --- |
| cso_top_operator | yes | 5/5 | no | The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering. |
| cso_monthly_rainfall | yes | 5/5 | no | The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months. |
| cso_by_watershed | yes | 5/5 | no | The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType. |
| enforcement_vs_budget | yes | 5/5 | no | The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters. |
| staffing_trend | yes | 5/5 | no | The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present. |
| 303d_impaired_trend | yes | 5/5 | no | The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly. |
| 303d_named_waterbody | yes | 5/5 | no | The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody. |
| cso_to_impaired | yes | 5/5 | no | Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate. |
| all_caps_boston | yes | 5/5 | no | The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE. |
| ecos_per_capita | yes | 5/5 | no | The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately. |

nesanders and others added 4 commits May 20, 2026 12:05
…ill rows)

Full fetch of all 1,715 registrants for 2024. Historical years (2005–2023)
to follow in a subsequent commit once the full fetch completes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying viz

- MA_lobbying_viz.py: entity_name/compensation (not employer_name/total_expenditure);
  dual-axis charts use yAxisID='y'/'y1' + y2nd=1 per chartjs convention
- get_MA_legislature_bills.py: use /Documents/{billId} endpoint (not /Bills/);
  construct bill ID from chamber prefix + number; fetch history via separate
  DocumentHistoryActions URL; Action field (not StatusDescription) for passed
- Add initial 2024 dashboard chart outputs (3 of 4; bill intensity pending legislature data)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
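The bill-ID fix in this commit (chamber prefix + number, then a /Documents/{billId} lookup) might look something like the sketch below. Everything beyond the `/Documents/{billId}` fragment is an assumption: the `H`/`S` prefixes, the way the chamber is spelled in the lobbying data, the `/GeneralCourts/{n}` path segment, and both helper names are illustrative only.

```python
# Assumed mapping from a chamber label to the bill-number prefix.
CHAMBER_PREFIX = {"house": "H", "senate": "S"}


def build_bill_id(chamber: str, number: int) -> str:
    """Combine a chamber prefix and a bill number, e.g. House + 4040."""
    try:
        prefix = CHAMBER_PREFIX[chamber.strip().lower()]
    except KeyError:
        raise ValueError(f"unknown chamber: {chamber!r}") from None
    return f"{prefix}{int(number)}"


def document_path(general_court: int, bill_id: str) -> str:
    """Relative API path for one bill; the GeneralCourts segment is an assumption."""
    return f"/GeneralCourts/{int(general_court)}/Documents/{bill_id}"


bill_id = build_bill_id("House", 4040)
print(document_path(193, bill_id))
```

Keeping ID construction in one small function makes it easier to adjust if other prefixes turn out to be needed.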
- score_lobbying_bills.py: rewritten to embed bill titles directly from
  MA_lobbying_bills.csv (not legislature CSV); stores embeddings as
  MA_bill_embeddings.npy for clustering; incremental per run
- cluster_lobbying_bills.py: one-time k-means (default 15 clusters) on
  normalized embeddings + Gemini Flash labeling of each cluster; writes
  MA_bill_cluster_labels.csv and updates cluster_id in scored CSV
- MA_lobbying_viz.py: add Chart 5 — stacked bar of annual spend by topic
  cluster; gracefully skipped until cluster_lobbying_bills.py has been run
- dashboard.md: add cluster spend chart include
- requirements-ci.txt: add scikit-learn==1.8.0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
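As a rough picture of "k-means on normalized embeddings", the sketch below L2-normalizes each vector and then runs plain Lloyd's algorithm. It is written with the standard library only so it runs anywhere; the commit itself adds scikit-learn, so the actual script presumably relies on that instead. The toy 2-dimensional vectors, k=2, and the deterministic first-k initialization are all illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so distances reflect direction only."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's algorithm; the first k points seed the centroids (toy init)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy vectors standing in for bill-title embeddings.
embeddings = [[2.0, 0.1], [0.1, 3.0], [4.0, 0.0], [0.0, 1.0]]
unit = [l2_normalize(e) for e in embeddings]
labels, centroids = kmeans(unit, k=2)
print(labels)
```

On unit-length vectors, squared Euclidean distance equals 2 minus twice the cosine similarity, which is why normalizing first makes Euclidean k-means group vectors by direction.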
…ata from DB

- assemble_db.py: coerce bill_number/general_court to Int64 in MA_Lobbying_Bills,
  MA_Legislature_Bills, MA_Lobbying_Bills_Scored; add MA_Lobbying_Bills_Scored and
  MA_Bill_Cluster_Labels as DB tables so all downstream analysis reads from DB
- MA_lobbying_viz.py: remove CSV file reads; load scored bills and cluster labels
  from DB; remove redundant numeric coercions (now guaranteed by assemble_db.py)
- cluster_lobbying_bills.py: update to gemini-2.5-flash for cluster labeling
- score_lobbying_bills.py: differential cosine scoring with example bills
- Add dash_lobbying_bill_intensity.html and dash_lobbying_spend_by_cluster.html charts
- Update semantic context with new DB tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (Contributor)

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

| Metric | Value |
| --- | --- |
| Hard pass rate | 10/10 (100%) |
| Fatal failures | 0 |
| Mean judge score | 5.0/5 |
| P50 judge score | 5/5 |
| Model | gpt-4o-mini |
| Semantic context hash | 7dc233f87aac |

Per-case results

| ID | Hard pass | Score | Fatal | Reason |
| --- | --- | --- | --- | --- |
| cso_top_operator | yes | 5/5 | no | The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required. |
| cso_monthly_rainfall | yes | 5/5 | no | The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months. |
| cso_by_watershed | yes | 5/5 | no | The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType. |
| enforcement_vs_budget | yes | 5/5 | no | The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters. |
| staffing_trend | yes | 5/5 | no | The query uses the correct table, groups by year, and counts employees accurately from 2005 to present. |
| 303d_impaired_trend | yes | 5/5 | no | The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly. |
| 303d_named_waterbody | yes | 5/5 | no | The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody. |
| cso_to_impaired | yes | 5/5 | no | Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate. |
| all_caps_boston | yes | 5/5 | no | The query correctly uses UPPER() to filter the municipality as required. |
| ecos_per_capita | yes | 5/5 | no | The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states. |

nesanders and others added 4 commits May 20, 2026 18:23
Both scripts now flush progress to disk frequently so an interrupt
loses at most one disclosure (lobbying) or 50 bills (legislature)
of work, rather than the entire in-progress run.

get_MA_lobbying.py:
- Load each CSV independently so a missing lobbyists file doesn't
  prevent resuming from employers/bills/links
- Flush all three CSVs to disk after every completed disclosure URL
- Print running totals with each flush for live progress monitoring

get_MA_legislature_bills.py:
- Append each bill to the combined DataFrame and flush every 50 bills
- Already had per-bill JSON cache; now the merged CSV is also safe

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
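The "flush every 50 bills" behaviour described in this commit can be illustrated with the sketch below. It is not the code from get_MA_legislature_bills.py: the column names, the `fetch_one` callback, and the write-to-temp-then-replace step are assumptions added for the example (the commit only says the merged CSV is flushed periodically).

```python
import csv
import os
import tempfile

FLUSH_EVERY = 50  # interval quoted in the commit message
FIELDS = ["bill_id", "title"]  # hypothetical columns


def write_rows(path, rows):
    """Rewrite the CSV via a temp file so an interrupt never leaves a half-written file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, path)


def fetch_all(bill_ids, fetch_one, path, flush_every=FLUSH_EVERY):
    """Fetch each bill, flushing the accumulated rows every `flush_every` bills."""
    rows = []
    for count, bill_id in enumerate(bill_ids, start=1):
        rows.append(fetch_one(bill_id))
        if count % flush_every == 0:
            write_rows(path, rows)
    write_rows(path, rows)  # final flush for the remainder
    return rows


# Demo with a fake fetcher and a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    out_path = os.path.join(workdir, "bills.csv")
    fetched = fetch_all(
        range(1, 121),
        lambda b: {"bill_id": b, "title": f"Bill {b}"},
        out_path,
    )
    print(len(fetched), "rows written")
```

With this shape, an interrupt between flushes loses at most `flush_every` bills of work, which matches the bound stated in the commit message.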
- MA_lobbying.md: replace stale seed-phrase scoring description with
  accurate account of differential cosine similarity; add cluster
  summary table; add t-SNE section with lobbying_bill_tsne.html embed
- MA_lobbying_tsne.py: new script generating interactive Plotly t-SNE
  scatter of all lobbied bills coloured by cluster; env bills shown
  larger with white ring; hover shows bill title and cluster
- get_MA_lobbying.py: add exponential-backoff retry (5 attempts) on
  GET/POST timeouts and connection errors; remove unused existing_lobbyists

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
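A generic version of the exponential-backoff retry mentioned in this commit is sketched below. The helper name, the 1-second base delay, and the use of the built-in `TimeoutError` and `ConnectionError` are assumptions for illustration; the real scraper presumably catches its HTTP library's own exception types. Only the limit of 5 attempts comes from the commit message.

```python
import time


def with_retries(func, attempts=5, base_delay=1.0, sleep=time.sleep,
                 retry_on=(TimeoutError, ConnectionError)):
    """Call func(); on a retryable error wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return func()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Demo: a call that times out three times before succeeding.
state = {"calls": 0}


def flaky_request():
    state["calls"] += 1
    if state["calls"] <= 3:
        raise TimeoutError("simulated timeout")
    return "ok"


waits = []
result = with_retries(flaky_request, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)
```

Passing `sleep` as a parameter keeps the helper testable without real delays.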
Explains the full MA lobbying pipeline (scripts 7–9 + cluster):
scraping strategy (iPad UA, ASP.NET viewstate, incremental disc_url
cache), modern vs. legacy HTML formats, legislature API endpoint
quirks, differential cosine embedding scoring, and k-means clustering.
Also covers credentials, CI pipeline order, and manual-only scripts.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (Contributor)

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

| Metric | Value |
| --- | --- |
| Hard pass rate | 10/10 (100%) |
| Fatal failures | 0 |
| Mean judge score | 5.0/5 |
| P50 judge score | 5/5 |
| Model | gpt-4o-mini |
| Semantic context hash | 7dc233f87aac |

Per-case results

| ID | Hard pass | Score | Fatal | Reason |
| --- | --- | --- | --- | --- |
| cso_top_operator | yes | 5/5 | no | The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and includes the necessary grouping and ordering. |
| cso_monthly_rainfall | yes | 5/5 | no | The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern on aggregated months. |
| cso_by_watershed | yes | 5/5 | no | The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType. |
| enforcement_vs_budget | yes | 5/5 | no | The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters. |
| staffing_trend | yes | 5/5 | no | The query uses the correct table, groups by year, and counts employees accurately from 2005 to present. |
| 303d_impaired_trend | yes | 5/5 | no | The query correctly counts listings from EPA_303d_Impairments grouped by reportingCycle and ordered correctly. |
| 303d_named_waterbody | yes | 5/5 | no | The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody. |
| cso_to_impaired | yes | 5/5 | no | Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate. |
| all_caps_boston | yes | 5/5 | no | The query correctly uses UPPER() to filter the municipality as required. |
| ecos_per_capita | yes | 5/5 | no | The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately. |

Remove general repo overview, CI pipeline table, other scripts section,
and SODA credential reference — lobbying-only content remains.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions (Contributor)

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.

| Metric | Value |
| --- | --- |
| Hard pass rate | 10/10 (100%) |
| Fatal failures | 0 |
| Mean judge score | 5.0/5 |
| P50 judge score | 5/5 |
| Model | gpt-4o-mini |
| Semantic context hash | 7dc233f87aac |

Per-case results

| ID | Hard pass | Score | Fatal | Reason |
| --- | --- | --- | --- | --- |
| cso_top_operator | yes | 5/5 | no | The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required. |
| cso_monthly_rainfall | yes | 5/5 | no | The query correctly aggregates both CSO discharge volume and total monthly rainfall, using the appropriate join pattern and date range filters. |
| cso_by_watershed | yes | 5/5 | no | The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType. |
| enforcement_vs_budget | yes | 5/5 | no | The query correctly joins the enforcement actions and budget tables on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters. |
| staffing_trend | yes | 5/5 | no | The query uses the correct table, groups by year, and counts employees accurately from 2005 to present. |
| 303d_impaired_trend | yes | 5/5 | no | The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly. |
| 303d_named_waterbody | yes | 5/5 | no | The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody. |
| cso_to_impaired | yes | 5/5 | no | Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate. |
| all_caps_boston | yes | 5/5 | no | The query correctly uses UPPER() to filter the municipality and eventType, ensuring accurate results. |
| ecos_per_capita | yes | 5/5 | no | The query correctly uses the ECOS_budgets table, aggregates spending by state, and calculates per-capita spending for comparison across states. |
