feat: MA lobbying data pipeline and dashboard charts #71
Adds end-to-end pipeline for MA Secretary of State lobbying disclosures:
- get_MA_lobbying.py: scrapes SoS portal (iPad UA bypasses Incapsula WAF), incremental via disc_url set in summary_links CSV; weekly CI exits early when no new semi-annual filings are posted
- get_MA_legislature_bills.py: fetches bill metadata from MA Legislature OpenAPI for bills appearing in lobbying data; JSON cache under MA_legislature_cache/ for incremental re-runs
- score_lobbying_bills.py: scores bills for environmental relevance using Gemini embedding-2 cosine similarity against 20 seed phrases (threshold 0.60)
- MA_lobbying_viz.py: 4 dashboard charts (spend trend, top employers, bill intensity, lobbying vs enforcement) + 2 analysis-post charts
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables), validate_data.py (OPTIONAL_DATASETS so CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
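The seed-phrase scoring in this commit can be illustrated with a small sketch. It assumes the embeddings are already available as plain lists of floats and that a bill counts as relevant when its best match against any seed phrase reaches the 0.60 threshold. Whether score_lobbying_bills.py takes the maximum or some other aggregate over the seeds is not stated here, so treat that choice and all function names as hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def max_seed_similarity(bill_vec, seed_vecs):
    """Best similarity between one bill and any seed phrase (assumed aggregate)."""
    return max(cosine_similarity(bill_vec, seed) for seed in seed_vecs)


def is_environmental(bill_vec, seed_vecs, threshold=0.60):
    """Flag a bill when its best seed match reaches the threshold."""
    return max_seed_similarity(bill_vec, seed_vecs) >= threshold


# Toy 3-dimensional vectors standing in for real embedding output.
seed_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
close_bill = [0.9, 0.1, 0.0]   # points almost the same way as the first seed
far_bill = [0.0, 0.0, 1.0]     # orthogonal to both seeds

print(is_environmental(close_bill, seed_vecs))
print(is_environmental(far_bill, seed_vecs))
```

Real embeddings have far more dimensions, but the thresholding logic is the same.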
✅ Semantic Eval Results
Each eval sends a natural-language question plus the relevant table schemas from
…notes
- docs/data/MA_lobbying.md: new dataset page with source description, data tables (employers, bills, legislature bills), and download links
- docs/dashboard.md: add lobbying section with 4 chart includes and methodology note; add nav link
- CLAUDE.md: document Incapsula WAF bypass (iPad UA), conda run stdout buffering gotcha, correct Gemini SDK (google.genai not google.generativeai), full historical fetch timing, and REQUEST_DELAY tip for historical runs
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ill rows)
Full fetch of all 1,715 registrants for 2024. Historical years (2005–2023) to follow in a subsequent commit once the full fetch completes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ying viz
- MA_lobbying_viz.py: entity_name/compensation (not employer_name/total_expenditure);
dual-axis charts use yAxisID='y'/'y1' + y2nd=1 per chartjs convention
- get_MA_legislature_bills.py: use /Documents/{billId} endpoint (not /Bills/);
construct bill ID from chamber prefix + number; fetch history via separate
DocumentHistoryActions URL; Action field (not StatusDescription) for passed
- Add initial 2024 dashboard chart outputs (3 of 4; bill intensity pending legislature data)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
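As an illustration of the bill ID construction mentioned above, here is a minimal sketch. It assumes the lobbying data records the chamber as "House" or "Senate" and that the prefixes are "H" and "S"; the helper names are hypothetical, and only the /Documents/{billId} path segment comes from the commit message (the rest of the URL is deliberately left out).

```python
# Assumed mapping from chamber name to bill ID prefix.
CHAMBER_PREFIXES = {"house": "H", "senate": "S"}


def build_bill_id(chamber, number):
    """Join the chamber prefix and the bill number, e.g. House + 4040."""
    prefix = CHAMBER_PREFIXES[chamber.strip().lower()]
    return f"{prefix}{int(number)}"


def document_path(bill_id):
    """Path segment for the Documents endpoint (not /Bills/)."""
    return f"/Documents/{bill_id}"


print(document_path(build_bill_id("House", 4040)))
```

An unknown chamber raises KeyError here, which surfaces unexpected values instead of silently building a bad ID.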
- score_lobbying_bills.py: rewritten to embed bill titles directly from MA_lobbying_bills.csv (not legislature CSV); stores embeddings as MA_bill_embeddings.npy for clustering; incremental per run
- cluster_lobbying_bills.py: one-time k-means (default 15 clusters) on normalized embeddings + Gemini Flash labeling of each cluster; writes MA_bill_cluster_labels.csv and updates cluster_id in scored CSV
- MA_lobbying_viz.py: add Chart 5, a stacked bar of annual spend by topic cluster; gracefully skipped until cluster_lobbying_bills.py has been run
- dashboard.md: add cluster spend chart include
- requirements-ci.txt: add scikit-learn==1.8.0
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ata from DB
- assemble_db.py: coerce bill_number/general_court to Int64 in MA_Lobbying_Bills, MA_Legislature_Bills, MA_Lobbying_Bills_Scored; add MA_Lobbying_Bills_Scored and MA_Bill_Cluster_Labels as DB tables so all downstream analysis reads from DB
- MA_lobbying_viz.py: remove CSV file reads; load scored bills and cluster labels from DB; remove redundant numeric coercions (now guaranteed by assemble_db.py)
- cluster_lobbying_bills.py: update to gemini-2.5-flash for cluster labeling
- score_lobbying_bills.py: differential cosine scoring with example bills
- Add dash_lobbying_bill_intensity.html and dash_lobbying_spend_by_cluster.html charts
- Update semantic context with new DB tables
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Both scripts now flush progress to disk frequently so an interrupt loses at most one disclosure (lobbying) or 50 bills (legislature) of work, rather than the entire in-progress run.
get_MA_lobbying.py:
- Load each CSV independently so a missing lobbyists file doesn't prevent resuming from employers/bills/links
- Flush all three CSVs to disk after every completed disclosure URL
- Print running totals with each flush for live progress monitoring
get_MA_legislature_bills.py:
- Append each bill to the combined DataFrame and flush every 50 bills
- Already had per-bill JSON cache; now the merged CSV is also safe
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
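The flush-every-50 pattern could be sketched as follows, using only the standard library rather than the DataFrame the scripts use. The function names are hypothetical, and the write-to-temp-then-rename step is an extra safeguard assumed here, not something the commit describes.

```python
import csv
import os
import tempfile

FLUSH_EVERY = 50  # matches the "every 50 bills" interval above


def write_csv_atomic(path, fieldnames, rows):
    """Write all rows to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    os.replace(tmp_path, path)  # readers never see a half-written file


def process_with_checkpoints(items, fetch_one, path, fieldnames,
                             flush_every=FLUSH_EVERY):
    """Fetch each item, flushing accumulated rows every `flush_every` items."""
    rows = []
    for count, item in enumerate(items, start=1):
        rows.append(fetch_one(item))
        if count % flush_every == 0:
            write_csv_atomic(path, fieldnames, rows)
            print(f"flushed {len(rows)} rows")  # running total for monitoring
    write_csv_atomic(path, fieldnames, rows)  # final flush for the remainder
    return rows
```

With this shape an interrupt loses at most `flush_every - 1` items of work, since everything before the last checkpoint is already on disk.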
- MA_lobbying.md: replace stale seed-phrase scoring description with accurate account of differential cosine similarity; add cluster summary table; add t-SNE section with lobbying_bill_tsne.html embed
- MA_lobbying_tsne.py: new script generating interactive Plotly t-SNE scatter of all lobbied bills coloured by cluster; env bills shown larger with white ring; hover shows bill title and cluster
- get_MA_lobbying.py: add exponential-backoff retry (5 attempts) on GET/POST timeouts and connection errors; remove unused existing_lobbyists
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
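For reference, an exponential-backoff retry with 5 attempts might look like the sketch below. The function name, the 1-second base delay, and the doubling factor are assumptions; the built-in TimeoutError and ConnectionError stand in for whatever the HTTP client raises (with requests these would be requests.exceptions.Timeout and requests.exceptions.ConnectionError).

```python
import time


def retry_with_backoff(request_fn, max_attempts=5, base_delay=1.0,
                       retry_on=(TimeoutError, ConnectionError),
                       sleep=time.sleep):
    """Call request_fn, retrying on transient errors with doubling delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return request_fn()
        except retry_on:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            # Delays grow as base_delay * 1, 2, 4, 8, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting; a caller would wrap a GET or POST in a zero-argument function or lambda.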
Explains the full MA lobbying pipeline (scripts 7–9 + cluster): scraping strategy (iPad UA, ASP.NET viewstate, incremental disc_url cache), modern vs. legacy HTML formats, legislature API endpoint quirks, differential cosine embedding scoring, and k-means clustering. Also covers credentials, CI pipeline order, and manual-only scripts.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove general repo overview, CI pipeline table, other scripts section, and SODA credential reference; lobbying-only content remains.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
This PR adds a complete pipeline for collecting and analyzing Massachusetts lobbying disclosure data. The MA Secretary of State publishes semi-annual lobbying filings that document which organizations hired lobbyists, how much they paid, and which specific bills their lobbyists worked on. By pairing this with bill metadata from the MA Legislature API and semantic environmental relevance scoring via Google Gemini embeddings, we can identify which industries are spending most on environmental legislation, track whether heavily-lobbied bills are more or less likely to pass, and correlate lobbying intensity with trends in DEP enforcement and budget.
The data integrates naturally with existing AMEND analyses: lobbying spend can be plotted alongside DEP staffing and enforcement actions to surface relationships between industry influence and regulatory outcomes. All four new database tables are exposed in the AI Analysis tool's semantic context, enabling natural-language queries like "which companies spent the most lobbying against clean water bills" or "has lobbying spend on climate legislation increased since 2015." The pipeline is fully incremental — weekly CI runs exit early when no new semi-annual filings have been posted, so the added runtime cost is near-zero on most weeks.
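The incremental early-exit behaviour can be sketched roughly as below: load the set of disc_url values already recorded, compare it with the URLs listed on the search pages, and stop if nothing is new. This is a minimal illustration; the function names and the example URLs are hypothetical, and the real get_MA_lobbying.py may be organised differently.

```python
import csv
import io


def load_seen_urls(csv_text, column="disc_url"):
    """Collect the disclosure URLs already recorded in the links CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[column] for row in reader if row.get(column)}


def find_new_urls(listed_urls, seen_urls):
    """Return listed URLs not yet seen, in order, without duplicates."""
    seen = set(seen_urls)
    new = []
    for url in listed_urls:
        if url not in seen:
            seen.add(url)
            new.append(url)
    return new


# Stand-in for the contents of the summary-links CSV (made-up URLs).
SAMPLE_CSV = (
    "disc_url,year\n"
    "https://example.invalid/disc/1,2024\n"
    "https://example.invalid/disc/2,2024\n"
)

seen = load_seen_urls(SAMPLE_CSV)
listed = ["https://example.invalid/disc/2", "https://example.invalid/disc/3"]
to_fetch = find_new_urls(listed, seen)
if not to_fetch:
    print("no new filings; exiting early")
else:
    print(f"{len(to_fetch)} new disclosure(s) to fetch")
```

Because membership tests against a set are constant-time, the check stays cheap even as the links file grows across years.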
Summary
- get_MA_lobbying.py: scrapes the MA SoS lobbying disclosure portal using an iPad User-Agent (bypasses Incapsula WAF without Selenium). Incremental via a disc_url set stored in MA_lobbying_summary_links.csv; weekly CI exits early when no new semi-annual filings are posted, so most runs touch only 2 search pages and exit.
- get_MA_legislature_bills.py: fetches bill metadata (title, sponsor, committee, status, passed bool) from the MA Legislature OpenAPI for every unique (bill_number, general_court) pair in the lobbying data. JSON responses cached under MA_legislature_cache/ for incremental re-runs.
- score_lobbying_bills.py: scores each bill for environmental relevance using Gemini gemini-embedding-2 cosine similarity against 20 seed phrases (threshold 0.60). Only unscored bills are embedded per run.
- MA_lobbying_viz.py: 4 weekly-updated charts (annual spend trend, top 15 employers, bill intensity + pass rate, lobbying spend vs. enforcement actions), plus 2 analysis-post charts.
- Wires all scripts into update-data.yml CI, assemble_db.py (4 new tables: MA_Lobbying_Employers, MA_Lobbying_Bills, MA_Lobbying_Lobbyists, MA_Legislature_Bills), validate_data.py (lobbying tables in OPTIONAL_DATASETS, so CI doesn't fail before first fetch), generate_semantic_context.py, and dashboard_charts.py.
Test plan
- Run with --year 2024 --limit 10: produced 18 disclosure rows, 56 employer rows, 539 bill rows with correct entity names, compensation amounts, bill numbers, titles, and positions
- Run get_MA_legislature_bills.py against live API after lobbying fetch completes
- Run score_lobbying_bills.py with real Google API key
- Run dashboard_charts.py to verify chart generation once DB is assembled

🤖 Generated with Claude Code