27 commits
c1e708a
feat: MA lobbying data pipeline and dashboard charts
nesanders May 20, 2026
e09b6b4
docs: add MA lobbying dataset page, dashboard section, and CLAUDE.md …
nesanders May 20, 2026
2b33f77
data: add 2024 MA lobbying disclosures (1,650 employer rows, 14,822 b…
nesanders May 20, 2026
db46876
fix: correct column names, API endpoint, and dual-axis params in lobb…
nesanders May 20, 2026
1952cc9
feat: bill embeddings, topic clustering, and cluster spend chart
nesanders May 20, 2026
6724e99
fix: store lobbying bill_number as integer in DB; load all analysis d…
nesanders May 20, 2026
6e181c3
fix: make lobbying and legislature fetch scripts fully resumable
nesanders May 20, 2026
280a08b
feat: t-SNE cluster plot, embedding docs, and fetch retry/cleanup
nesanders May 20, 2026
2fd4d92
docs: add get_data/README.md covering lobbying pipeline and embeddings
nesanders May 21, 2026
8d516d2
rename: README.md → README_lobbying.md in get_data/
nesanders May 21, 2026
6002fa7
docs: trim README_lobbying.md to lobbying pipeline only
nesanders May 21, 2026
fbfd386
feat: lobbying charts, semantic context, and pipeline fixes
nesanders May 22, 2026
839e9b2
feat: MA environmental lobbying analysis — charts, post draft, and pi…
nesanders May 27, 2026
636e051
fix: use sample CSVs in MA_lobbying.md; upload large CSVs to GCS in a…
nesanders May 27, 2026
88ab4ac
feat: normalize entity/client names at DB assembly; show 10 rows on d…
nesanders May 27, 2026
2b3ef74
fix: improve env scoring — expanded non-env examples, raise threshold…
nesanders May 27, 2026
f1ad513
fix: lower env threshold to 0.06 after calibrating with expanded non-…
nesanders May 27, 2026
8ea8e9c
fix: quote legislature CSV with QUOTE_NONNUMERIC to prevent C parser …
nesanders May 27, 2026
b92971f
data: regenerate lobbying charts and samples with full historical data
nesanders May 27, 2026
ff7032b
fix: add actions:write permission to dispatch update-charts workflow
nesanders May 27, 2026
42c7d04
data: regenerate t-SNE bill embedding visualization (25,928 bills)
nesanders May 27, 2026
4ae21b3
test: add test_bill_embedding_pipeline.py for iterating on embedding …
nesanders May 27, 2026
ae99d5b
feat: strip legislative boilerplate, prepend title, expand to 3000 ch…
nesanders May 27, 2026
b53db2b
data: regenerate bill embeddings with boilerplate stripping and updat…
nesanders May 27, 2026
5abb6eb
analysis: fix MA_lobbying_viz.py - proportional env spend, 3 new posi…
nesanders May 27, 2026
804e8f7
fix: correct General Court formula (FIRST_GC_START_YEAR 2005→2003) an…
nesanders May 28, 2026
fa226d1
feat: persist k-means model for incremental cluster assignment
nesanders May 29, 2026
1 change: 1 addition & 0 deletions .claude/scheduled_tasks.lock
@@ -0,0 +1 @@
{"sessionId":"d2fc4fd3-d9ff-44a4-a608-05ddeb65f47b","pid":1140968,"procStart":"103407381","acquiredAt":1779884724489}
5 changes: 5 additions & 0 deletions .github/workflows/update-data.yml
@@ -15,6 +15,7 @@ jobs:
contents: write # to push branch
issues: write # to open failure issues
pull-requests: write # to create PR
actions: write # to dispatch update-charts.yml after merge

steps:
- uses: actions/checkout@v4
@@ -54,6 +55,10 @@ jobs:
python get_MA_precipitation.py
python get_ATTAINS_303d.py
python get_MS4_annual_reports.py --yes --skip-download
python get_MA_lobbying.py
python get_MA_legislature_bills.py
python score_lobbying_bills.py
python cluster_lobbying_bills.py --incremental

- name: Validate data
working-directory: get_data
10 changes: 10 additions & 0 deletions .gitignore
@@ -103,6 +103,16 @@ get_data/MS4_annual_reports/

# Large files
*EEADP_drinkingWater.csv
docs/data/MA_bill_embeddings.parquet
docs/data/MA_bill_embeddings.npy
get_data/MA_legislature_cache/

# MA lobbying large CSVs — stored in GCS, only samples committed
docs/data/MA_lobbying_bills.csv
docs/data/MA_lobbying_employers.csv
docs/data/MA_lobbying_summary_links.csv
docs/data/MA_lobbying_bills_scored.csv
docs/data/MA_legislature_bills.csv

get_data/backup_AMEND.db
get_data/EEADP_drinkingWater.csv
7 changes: 7 additions & 0 deletions CLAUDE.md
@@ -67,6 +67,13 @@ bash set_cors_gsutil.sh
- **EPA NPDES page changes**: EPA changed JSON format and column names around 2025; both handled with `isinstance` checks and fallback column detection.
- **EEA CSOAPI**: Requires `Referer` and `Origin` headers matching the portal URL; plain requests return HTTP 500. Pagination is 1-indexed.
- **303(d) data (biennial)**: `get_ATTAINS_303d.py` fetches from MassGIS S3-hosted shapefiles (not the ATTAINS REST API, which times out on `/assessments`). Data updates only biennially (even years); the script exits early if all known cycles are already in the cached CSV. The 2020 cycle was never published by MassGIS. The 2024/2026 cycle is in draft as of April 2026 — the script will auto-detect it when MassGIS publishes the approved shapefile. `CSO_303d_Mapping` in `assemble_db.py` is a manually curated dict (35 verified matches of 56 CSO waterbodies); update it when a new cycle is added by reviewing new assessment unit names against CSO waterBody values.
- **MA lobbying portal (Incapsula WAF)**: The SoS portal (`sec.state.ma.us/LobbyistPublicSearch/`) is protected by Incapsula WAF. A Chrome User-Agent gets a 302 redirect to a JS challenge page. An **iPad User-Agent** bypasses it entirely with plain `requests` — no Selenium needed. The working UA is `Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148`.
- **MA lobbying portal (search form)**: ASP.NET with ViewState. POST to `Default.aspx` with `drpType=L` (Lobbyist or Lobbying Entity — do NOT use `Z` which returns Client pages with different structure), `drpPageSize=20000`, `ddlYear=<year>`. POST timeout must be 120s (response is ~2MB for ~1700 results). Data goes back to 2005 (22 years).
- **MA lobbying incremental fetch**: Disclosures are filed semi-annually per employer, so weekly CI almost always exits early. The incremental state is the `disc_url` set in `MA_lobbying_summary_links.csv` — only new `CompleteDisclosure.aspx` URLs get fetched. Always re-checks current and prior year for new filers.
- **MA lobbying full historical fetch**: At `REQUEST_DELAY=1.0s`, fetching a single year (~1700 registrants) takes ~10 hours. The full 22-year history (2005–2026) will take several days. To speed up the historical fetch (frozen past years), lower `REQUEST_DELAY` to 0.5s — safe since those filings never change.
- **Running lobbying scripts**: Do NOT use `conda run` with stdout redirect (`> file`) — `conda run` buffers all output through a pipe and the log file stays empty until the process exits. Run Python directly: `/home/nes/miniconda/envs/amend_python/bin/python -u get_MA_lobbying.py` (the `-u` flag ensures unbuffered output).
- **MA lobbying Gemini SDK**: Uses `google-genai` (new SDK, `google.genai`), NOT the old `google-generativeai` package. API: `client = genai.Client(api_key=...)`, then `client.models.embed_content(model='gemini-embedding-2', contents=text, config=types.EmbedContentConfig(output_dimensionality=768))`.
- **MA lobbying General Court formula bug (unfixed)**: `get_MA_lobbying.py` uses `FIRST_GC_START_YEAR = 2005` but the 183rd General Court started in 2003, not 2005. The correct constant is `2003`. As a result, every bill's `general_court` assignment is one session too low (year 2024 → GC192 instead of GC193, etc.), and `get_MA_legislature_bills.py` fetches bill text from the wrong legislative session. Fixing this constant and re-running the full pipeline would bring the title match rate from ~2% to ~65%. The remaining ~35% of mismatches are string-normalisation differences or genuinely wrong bill numbers in the SoS portal. See `get_data/NOTES_bill_embeddings.md` for full analysis.
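
The off-by-one described in the last bullet can be reproduced with a minimal sketch of the year-to-General-Court mapping. It assumes two-year sessions and that the 183rd General Court began in 2003, as stated above. Only `FIRST_GC_START_YEAR` is taken from these notes; `FIRST_GC_NUMBER` and `general_court_for_year` are illustrative names and may not match what `get_MA_lobbying.py` actually defines.

```python
FIRST_GC_NUMBER = 183       # 183rd General Court (hypothetical constant name)
FIRST_GC_START_YEAR = 2003  # corrected value; the buggy value was 2005


def general_court_for_year(year: int,
                           first_start_year: int = FIRST_GC_START_YEAR) -> int:
    """Map a calendar year to a General Court number, assuming 2-year sessions."""
    return FIRST_GC_NUMBER + (year - first_start_year) // 2


if __name__ == "__main__":
    # With the corrected constant, 2023 and 2024 share one session.
    print(general_court_for_year(2023), general_court_for_year(2024))
    # With the old constant, 2024 lands one session too low.
    print(general_court_for_year(2024, first_start_year=2005))
```

Under these assumptions, 2024 maps to 193 with the corrected constant and to 192 with the old one, which matches the mismatch described in the bullet.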

## Running scripts

224 changes: 224 additions & 0 deletions analysis/MA_lobbying_tsne.py
@@ -0,0 +1,224 @@
"""Generate a t-SNE scatter plot of MA lobbying bill embeddings.

Visual design philosophy
─────────────────────────
MA legislative bill embeddings are semantically dense — all bills share heavy
regulatory language, so inter-cluster cosine distances are ~0.006 vs.
intra-cluster spread of ~0.53. Running t-SNE on all 25k bills produces a
featureless blob regardless of perplexity, because the structure simply doesn't
separate in 2-D.

Instead the chart shows TWO layers:

  Background (grey): stratified sample of ~120 non-environmental bills per
      cluster, rendered as tiny translucent grey dots. Provides spatial
      context for the policy landscape.

  Signal (coloured): all 329 env-relevant bills, one colour per cluster,
      drawn as large outlined dots. These are what the visitor cares about.

t-SNE is computed on the combined ~3,300 point sample (all env + background),
which runs in seconds and produces cleaner structure than the full 25k set.
Perplexity is scaled to ~√N of the subsample.

Run from the analysis/ directory:
/path/to/python -u MA_lobbying_tsne.py

Outputs:
../docs/_includes/charts/lobbying_bill_tsne.html
"""

import sys
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import normalize
import plotly.graph_objects as go

sys.path.insert(0, str(Path(__file__).parent))

GCS_PARQUET = 'gs://openamend-data/MA_bill_embeddings.parquet'
LOCAL_PARQUET = Path('../docs/data/MA_bill_embeddings.parquet')
LABELS_CSV = Path('../docs/data/MA_bill_cluster_labels.csv')
OUT_HTML = Path('../docs/_includes/charts/lobbying_bill_tsne.html')

# Non-env bills sampled per cluster for background context.
# 120 × 25 clusters ≈ 3 000 background points + ~329 env = ~3 300 total.
BG_PER_CLUSTER = 120
TSNE_ITER = 1000
RANDOM_STATE = 42

# 25-colour palette — qualitative, perceptually distinct, no cycling
PALETTE_25 = [
    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
    '#aec7e8', '#ffbb78', '#98df8a', '#ff9896', '#c5b0d5',
    '#c49c94', '#f7b6d2', '#c7c7c7', '#dbdb8d', '#9edae5',
    '#393b79', '#637939', '#8c6d31', '#843c39', '#7b4173',
]


def _load_parquet() -> pd.DataFrame:
    try:
        import gcsfs
        fs = gcsfs.GCSFileSystem()
        if fs.exists(GCS_PARQUET):
            with fs.open(GCS_PARQUET, 'rb') as f:
                df = pd.read_parquet(f)
            print(f'Loaded {len(df)} rows from {GCS_PARQUET}')
            return df
    except Exception as e:
        print(f'GCS load failed ({e}), trying local...')
    if LOCAL_PARQUET.exists():
        df = pd.read_parquet(LOCAL_PARQUET)
        print(f'Loaded {len(df)} rows from local Parquet')
        return df
    raise FileNotFoundError('No Parquet file found. Run score_lobbying_bills.py first.')


def _fmt_score(score) -> str:
    """Format a relevance score for hover text; blank when missing or NaN.

    Applying ':.3f' directly to a missing value (the '' default) raises
    ValueError, so the conversion is guarded here.
    """
    try:
        value = float(score)
    except (TypeError, ValueError):
        return ''
    return '' if np.isnan(value) else f'{value:.3f}'


def main():
    parquet_df = _load_parquet()

    # Restrict to clustered bills
    parquet_df = parquet_df[
        parquet_df['cluster_id'].notna() & (parquet_df['cluster_id'] != -1)
    ].copy()
    parquet_df['cluster_id'] = parquet_df['cluster_id'].astype(int)

    if 'is_environmental' not in parquet_df.columns:
        parquet_df['is_environmental'] = False
    parquet_df['is_environmental'] = parquet_df['is_environmental'].fillna(False).astype(bool)

    labels_df = pd.read_csv(LABELS_CSV)
    label_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['label']))
    nenv_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['n_env_bills']))

    # ── Build subsample ──────────────────────────────────────────────────────
    # Keep ALL env bills; sample BG_PER_CLUSTER non-env bills per cluster.
    env_df = parquet_df[parquet_df['is_environmental']].copy()
    non_env = parquet_df[~parquet_df['is_environmental']]

    rng = np.random.default_rng(RANDOM_STATE)
    bg_parts = []
    for cid in sorted(non_env['cluster_id'].unique()):
        sub = non_env[non_env['cluster_id'] == cid]
        n = min(BG_PER_CLUSTER, len(sub))
        bg_parts.append(sub.sample(n=n, random_state=int(rng.integers(0, 2**31))))

    bg_df = pd.concat(bg_parts, ignore_index=True)
    sample = pd.concat([env_df, bg_df], ignore_index=True)
    print(f'Subsample: {len(env_df)} env + {len(bg_df)} background = {len(sample)} total')

    # ── Embeddings ───────────────────────────────────────────────────────────
    emb = np.vstack(sample['embedding'].apply(
        lambda v: np.array(v, dtype=np.float32)
    ).values)
    emb_norm = normalize(emb, norm='l2')

    # ── t-SNE ────────────────────────────────────────────────────────────────
    # Perplexity ≈ √N is a sensible heuristic for subsampled sets.
    perplexity = max(20, min(80, int(np.sqrt(len(sample)))))
    print(f'Running t-SNE (n={len(sample)}, perplexity={perplexity}, iter={TSNE_ITER})...')
    tsne = TSNE(
        n_components=2,
        perplexity=perplexity,
        max_iter=TSNE_ITER,
        random_state=RANDOM_STATE,
        init='pca',
        learning_rate='auto',
    )
    coords = tsne.fit_transform(emb_norm)
    sample = sample.copy()
    sample['x'] = coords[:, 0]
    sample['y'] = coords[:, 1]

    # ── Build Plotly figure ──────────────────────────────────────────────────
    fig = go.Figure()

    bg = sample[~sample['is_environmental']]
    envs = sample[sample['is_environmental']]

    # Layer 1: grey background (all non-env, single trace for performance)
    fig.add_trace(go.Scatter(
        x=bg['x'], y=bg['y'],
        mode='markers',
        marker=dict(color='#aaaaaa', size=4, opacity=0.20),
        name='Non-environmental bills',
        hovertext=[
            f'<b>{row.get("bill_title", "")}</b><br>'
            f'GC {int(row["general_court"])} · {label_map.get(int(row["cluster_id"]), "")}'
            for _, row in bg.iterrows()
        ],
        hoverinfo='text',
        showlegend=True,
        legendgroup='bg',
        legendgrouptitle=dict(text='Background'),
    ))

    # Layer 2: env bills, one trace per cluster that has any env bills
    env_cluster_ids = sorted(envs['cluster_id'].unique())
    for i, cid in enumerate(env_cluster_ids):
        sub = envs[envs['cluster_id'] == cid]
        lbl = label_map.get(cid, f'Cluster {cid}')
        nenv = nenv_map.get(cid, len(sub))
        color = PALETTE_25[cid % len(PALETTE_25)]

        fig.add_trace(go.Scatter(
            x=sub['x'], y=sub['y'],
            mode='markers',
            marker=dict(
                color=color, size=11, opacity=0.92,
                line=dict(color='black', width=1.2),
            ),
            name=f'{lbl} ({nenv} env)',
            hovertext=[
                f'<b>{row.get("bill_title", "")}</b><br>'
                f'GC {int(row["general_court"])} · 🌿 environmental<br>'
                f'Cluster: {lbl}<br>'
                f'Score: {_fmt_score(row.get("env_relevance_score"))}'
                for _, row in sub.iterrows()
            ],
            hoverinfo='text',
            showlegend=True,
            legendgroup='env',
            legendgrouptitle=dict(text='Environmental bills by cluster') if i == 0 else dict(text=''),
        ))

    fig.update_layout(
        title=dict(
            text=(
                'MA Lobbying Bills — Environmental Bills in the Policy Landscape'
                '<br><sup>Coloured = environmentally-relevant bills (329) · '
                'grey = background sample (~3 000 non-env) · '
                'colour = topic cluster · hover for details</sup>'
            ),
            font=dict(size=13),
        ),
        xaxis=dict(visible=False),
        yaxis=dict(visible=False),
        legend=dict(
            font=dict(size=10),
            itemsizing='constant',
            tracegroupgap=8,
        ),
        margin=dict(l=10, r=10, t=70, b=10),
        width=880,
        height=600,
        plot_bgcolor='#f8f8f8',
        paper_bgcolor='white',
        hovermode='closest',
    )

    OUT_HTML.parent.mkdir(parents=True, exist_ok=True)
    html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True})
    OUT_HTML.write_text(
        '{% raw %}\n' + html + '\n{% endraw %}\n',
        encoding='utf-8',
    )
    print(f'Wrote {OUT_HTML}')


if __name__ == '__main__':
    main()
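
The subsampling and perplexity heuristic used in `main()` can be tried on toy data without the Parquet file. The following is a rough sketch using only the standard library instead of pandas. `clamp_perplexity` and `stratified_sample` are hypothetical helper names, not functions from this PR, and the toy rows only mimic the `cluster_id` and `is_environmental` columns.

```python
import math
import random


def clamp_perplexity(n_points: int, low: int = 20, high: int = 80) -> int:
    """Perplexity of roughly sqrt(N), clamped to [low, high] as in main()."""
    return max(low, min(high, int(math.sqrt(n_points))))


def stratified_sample(rows, per_cluster, seed=42):
    """Keep every environmental row; sample up to per_cluster others per cluster."""
    rng = random.Random(seed)
    env = [r for r in rows if r["is_environmental"]]

    # Group the non-environmental rows by cluster.
    by_cluster = {}
    for r in rows:
        if not r["is_environmental"]:
            by_cluster.setdefault(r["cluster_id"], []).append(r)

    # Draw the background sample cluster by cluster, in a fixed order.
    background = []
    for cid in sorted(by_cluster):
        pool = by_cluster[cid]
        background.extend(rng.sample(pool, min(per_cluster, len(pool))))
    return env, background


if __name__ == "__main__":
    # Toy data: 3 clusters of 200 bills each, every 50th bill environmental.
    rows = [
        {"cluster_id": c, "is_environmental": i % 50 == 0}
        for c in range(3)
        for i in range(200)
    ]
    env, bg = stratified_sample(rows, per_cluster=120)
    print(len(env), len(bg), clamp_perplexity(len(env) + len(bg)))
```

The clamp matters at both ends: a subsample of about 3,300 points gives a perplexity near 57, while the full set of roughly 25k bills would be capped at 80 and a very small sample would be raised to 20.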