nesanders · May 20, 2026
diff --git a/.claude/scheduled_tasks.lock b/.claude/scheduled_tasks.lock
@@ -0,0 +1 @@
+{"sessionId":"d2fc4fd3-d9ff-44a4-a608-05ddeb65f47b","pid":1140968,"procStart":"103407381","acquiredAt":1779884724489}
diff --git a/.github/workflows/update-data.yml b/.github/workflows/update-data.yml
@@ -15,6 +15,7 @@ jobs:
       contents: write       # to push branch
       issues: write         # to open failure issues
       pull-requests: write  # to create PR
+      actions: write        # to dispatch update-charts.yml after merge
 
     steps:
       - uses: actions/checkout@v4
@@ -54,6 +55,10 @@ jobs:
           python get_MA_precipitation.py
           python get_ATTAINS_303d.py
           python get_MS4_annual_reports.py --yes --skip-download
+          python get_MA_lobbying.py
+          python get_MA_legislature_bills.py
+          python score_lobbying_bills.py
+          python cluster_lobbying_bills.py --incremental
 
       - name: Validate data
         working-directory: get_data

diff --git a/.gitignore b/.gitignore
@@ -103,6 +103,16 @@ get_data/MS4_annual_reports/
 
 # Large files
 *EEADP_drinkingWater.csv
+docs/data/MA_bill_embeddings.parquet
+docs/data/MA_bill_embeddings.npy
+get_data/MA_legislature_cache/
+
+# MA lobbying large CSVs — stored in GCS, only samples committed
+docs/data/MA_lobbying_bills.csv
+docs/data/MA_lobbying_employers.csv
+docs/data/MA_lobbying_summary_links.csv
+docs/data/MA_lobbying_bills_scored.csv
+docs/data/MA_legislature_bills.csv
 
 get_data/backup_AMEND.db
 get_data/EEADP_drinkingWater.csv

diff --git a/CLAUDE.md b/CLAUDE.md
@@ -67,6 +67,13 @@ bash set_cors_gsutil.sh
 - **EPA NPDES page changes**: EPA changed JSON format and column names around 2025; both handled with `isinstance` checks and fallback column detection.
 - **EEA CSOAPI**: Requires `Referer` and `Origin` headers matching the portal URL; plain requests return HTTP 500.  Pagination is 1-indexed.
 - **303(d) data (biennial)**: `get_ATTAINS_303d.py` fetches from MassGIS S3-hosted shapefiles (not the ATTAINS REST API, which times out on `/assessments`). Data updates only biennially (even years); the script exits early if all known cycles are already in the cached CSV. The 2020 cycle was never published by MassGIS. The 2024/2026 cycle is in draft as of April 2026 — the script will auto-detect it when MassGIS publishes the approved shapefile. `CSO_303d_Mapping` in `assemble_db.py` is a manually curated dict (35 verified matches of 56 CSO waterbodies); update it when a new cycle is added by reviewing new assessment unit names against CSO waterBody values.
+- **MA lobbying portal (Incapsula WAF)**: The SoS portal (`sec.state.ma.us/LobbyistPublicSearch/`) is protected by Incapsula WAF. A Chrome User-Agent gets a 302 redirect to a JS challenge page. An **iPad User-Agent** bypasses it entirely with plain `requests` — no Selenium needed. The working UA is `Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148`.
+- **MA lobbying portal (search form)**: ASP.NET with ViewState. POST to `Default.aspx` with `drpType=L` (Lobbyist or Lobbying Entity — do NOT use `Z` which returns Client pages with different structure), `drpPageSize=20000`, `ddlYear=<year>`. POST timeout must be 120s (response is ~2MB for ~1700 results). Data goes back to 2005 (22 years).
+- **MA lobbying incremental fetch**: Disclosures are filed semi-annually per employer, so weekly CI almost always exits early. The incremental state is the `disc_url` set in `MA_lobbying_summary_links.csv` — only new `CompleteDisclosure.aspx` URLs get fetched. Always re-checks current and prior year for new filers.
+- **MA lobbying full historical fetch**: At `REQUEST_DELAY=1.0s`, fetching a single year (~1700 registrants) takes ~10 hours. The full 22-year history (2005–2026) will take several days. To speed up the historical fetch (frozen past years), lower `REQUEST_DELAY` to 0.5s — safe since those filings never change.
+- **Running lobbying scripts**: Do NOT use `conda run` with stdout redirect (`> file`) — `conda run` buffers all output through a pipe and the log file stays empty until the process exits. Run Python directly: `/home/nes/miniconda/envs/amend_python/bin/python -u get_MA_lobbying.py` (the `-u` flag ensures unbuffered output).
+- **MA lobbying Gemini SDK**: Uses `google-genai` (new SDK, `google.genai`), NOT the old `google-generativeai` package. API: `client = genai.Client(api_key=...)`, then `client.models.embed_content(model='gemini-embedding-2', contents=text, config=types.EmbedContentConfig(output_dimensionality=768))`.
+- **MA lobbying General Court formula bug (unfixed)**: `get_MA_lobbying.py` uses `FIRST_GC_START_YEAR = 2005` but the 183rd General Court started in 2003, not 2005. The correct constant is `2003`. As a result, every bill's `general_court` assignment is one session too low (year 2024 → GC192 instead of GC193, etc.), and `get_MA_legislature_bills.py` fetches bill text from the wrong legislative session. Fixing this constant and re-running the full pipeline would bring the title match rate from ~2% to ~65%. The remaining ~35% of mismatches are string-normalisation differences or genuinely wrong bill numbers in the SoS portal. See `get_data/NOTES_bill_embeddings.md` for full analysis.
 
 ## Running scripts
 

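The General Court note in the CLAUDE.md hunk above can be checked with a few lines of arithmetic. The sketch below assumes two-year sessions anchored at the 183rd General Court. The function name `general_court_for_year` and the constant `FIRST_GC_NUMBER` are hypothetical, and the actual code in `get_MA_lobbying.py` may be written differently.

```python
# Hypothetical sketch: map a calendar year to a General Court number,
# assuming two-year sessions anchored at the 183rd General Court.
FIRST_GC_NUMBER = 183  # assumed name; the 183rd session is the anchor in the note


def general_court_for_year(year: int, first_gc_start_year: int) -> int:
    """Return the General Court number for a calendar year."""
    return FIRST_GC_NUMBER + (year - first_gc_start_year) // 2


# With the constant as committed (2005), 2024 lands one session too low.
print(general_court_for_year(2024, 2005))  # 192
# With the corrected constant (2003), 2024 maps to the 193rd General Court.
print(general_court_for_year(2024, 2003))  # 193
```

Under these assumptions, changing only the start-year constant shifts every assignment by exactly one session, which matches the off-by-one described in the note.
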
diff --git a/analysis/MA_lobbying_tsne.py b/analysis/MA_lobbying_tsne.py
@@ -0,0 +1,224 @@
+"""Generate a t-SNE scatter plot of MA lobbying bill embeddings.
+
+Visual design philosophy
+─────────────────────────
+MA legislative bill embeddings are semantically dense — all bills share heavy
+regulatory language, so inter-cluster cosine distances are ~0.006 vs.
+intra-cluster spread of ~0.53. Running t-SNE on all 25k bills produces a
+featureless blob regardless of perplexity, because the structure simply doesn't
+separate in 2-D.
+
+Instead the chart shows TWO layers:
+
+  Background (grey)  — stratified sample of ~120 non-environmental bills per
+                        cluster, rendered as tiny translucent grey dots. Provides
+                        spatial context for the policy landscape.
+
+  Signal (coloured)  — all 329 env-relevant bills, one colour per cluster,
+                        large outlined dots. These are what the visitor cares about.
+
+t-SNE is computed on the combined ~3,300 point sample (all env + background),
+which runs in seconds and produces cleaner structure than the full 25k set.
+Perplexity is set to ~√N of the subsample, clamped to [20, 80].
+
+Run from the analysis/ directory:
+    /path/to/python -u MA_lobbying_tsne.py
+
+Outputs:
+    ../docs/_includes/charts/lobbying_bill_tsne.html
+"""
+
+import sys
+from pathlib import Path
+
+import numpy as np
+import pandas as pd
+from sklearn.manifold import TSNE
+from sklearn.preprocessing import normalize
+import plotly.graph_objects as go
+
+sys.path.insert(0, str(Path(__file__).parent))
+
+GCS_PARQUET   = 'gs://openamend-data/MA_bill_embeddings.parquet'
+LOCAL_PARQUET = Path('../docs/data/MA_bill_embeddings.parquet')
+LABELS_CSV    = Path('../docs/data/MA_bill_cluster_labels.csv')
+OUT_HTML      = Path('../docs/_includes/charts/lobbying_bill_tsne.html')
+
+# Non-env bills sampled per cluster for background context.
+# 120 × 25 clusters ≈ 3 000 background points + ~329 env = ~3 300 total.
+BG_PER_CLUSTER  = 120
+TSNE_ITER       = 1000
+RANDOM_STATE    = 42
+
+# 25-colour palette — qualitative, perceptually distinct, no cycling
+PALETTE_25 = [
+    '#1f77b4', '#ff7f0e', '#2ca02c', '#d62728', '#9467bd',
+    '#8c564b', '#e377c2', '#7f7f7f', '#bcbd22', '#17becf',
+    '#aec7e8', '#ffbb78', '#98df8a', '#ff9896', '#c5b0d5',
+    '#c49c94', '#f7b6d2', '#c7c7c7', '#dbdb8d', '#9edae5',
+    '#393b79', '#637939', '#8c6d31', '#843c39', '#7b4173',
+]
+
+
+def _load_parquet() -> pd.DataFrame:
+    try:
+        import gcsfs
+        fs = gcsfs.GCSFileSystem()
+        if fs.exists(GCS_PARQUET):
+            with fs.open(GCS_PARQUET, 'rb') as f:
+                df = pd.read_parquet(f)
+            print(f'Loaded {len(df)} rows from {GCS_PARQUET}')
+            return df
+    except Exception as e:
+        print(f'GCS load failed ({e}), trying local...')
+    if LOCAL_PARQUET.exists():
+        df = pd.read_parquet(LOCAL_PARQUET)
+        print(f'Loaded {len(df)} rows from local Parquet')
+        return df
+    raise FileNotFoundError('No Parquet file found. Run score_lobbying_bills.py first.')
+
+
+def main():
+    parquet_df = _load_parquet()
+
+    # Restrict to clustered bills
+    parquet_df = parquet_df[
+        parquet_df['cluster_id'].notna() & (parquet_df['cluster_id'] != -1)
+    ].copy()
+    parquet_df['cluster_id'] = parquet_df['cluster_id'].astype(int)
+
+    if 'is_environmental' not in parquet_df.columns:
+        parquet_df['is_environmental'] = False
+    parquet_df['is_environmental'] = parquet_df['is_environmental'].fillna(False).astype(bool)
+
+    labels_df = pd.read_csv(LABELS_CSV)
+    label_map = dict(zip(labels_df['cluster_id'].astype(int), labels_df['label']))
+    nenv_map  = dict(zip(labels_df['cluster_id'].astype(int), labels_df['n_env_bills']))
+
+    # ── Build subsample ──────────────────────────────────────────────────────
+    # Keep ALL env bills; sample BG_PER_CLUSTER non-env bills per cluster.
+    env_df  = parquet_df[parquet_df['is_environmental']].copy()
+    non_env = parquet_df[~parquet_df['is_environmental']]
+
+    rng = np.random.default_rng(RANDOM_STATE)
+    bg_parts = []
+    for cid in sorted(non_env['cluster_id'].unique()):
+        sub = non_env[non_env['cluster_id'] == cid]
+        n   = min(BG_PER_CLUSTER, len(sub))
+        bg_parts.append(sub.sample(n=n, random_state=int(rng.integers(0, 2**31))))
+
+    bg_df  = pd.concat(bg_parts, ignore_index=True)
+    sample = pd.concat([env_df, bg_df], ignore_index=True)
+    print(f'Subsample: {len(env_df)} env + {len(bg_df)} background = {len(sample)} total')
+
+    # ── Embeddings ───────────────────────────────────────────────────────────
+    emb      = np.vstack(sample['embedding'].apply(
+        lambda v: np.array(v, dtype=np.float32)
+    ).values)
+    emb_norm = normalize(emb, norm='l2')
+
+    # ── t-SNE ────────────────────────────────────────────────────────────────
+    # Perplexity ≈ √N is a sensible heuristic for subsampled sets.
+    perplexity = max(20, min(80, int(np.sqrt(len(sample)))))
+    print(f'Running t-SNE (n={len(sample)}, perplexity={perplexity}, iter={TSNE_ITER})...')
+    tsne = TSNE(
+        n_components=2,
+        perplexity=perplexity,
+        max_iter=TSNE_ITER,
+        random_state=RANDOM_STATE,
+        init='pca',
+        learning_rate='auto',
+    )
+    coords  = tsne.fit_transform(emb_norm)
+    sample  = sample.copy()
+    sample['x'] = coords[:, 0]
+    sample['y'] = coords[:, 1]
+
+    # ── Build Plotly figure ──────────────────────────────────────────────────
+    fig = go.Figure()
+
+    bg   = sample[~sample['is_environmental']]
+    envs = sample[sample['is_environmental']]
+
+    # Layer 1 — grey background (all non-env, single trace for performance)
+    fig.add_trace(go.Scatter(
+        x=bg['x'], y=bg['y'],
+        mode='markers',
+        marker=dict(color='#aaaaaa', size=4, opacity=0.20),
+        name='Non-environmental bills',
+        hovertext=[
+            f'<b>{row.get("bill_title", "")}</b><br>'
+            f'GC {int(row["general_court"])} · {label_map.get(int(row["cluster_id"]), "")}'
+            for _, row in bg.iterrows()
+        ],
+        hoverinfo='text',
+        showlegend=True,
+        legendgroup='bg',
+        legendgrouptitle=dict(text='Background'),
+    ))
+
+    # Layer 2 — env bills, one trace per cluster that has any env bills
+    env_cluster_ids = sorted(envs['cluster_id'].unique())
+    for i, cid in enumerate(env_cluster_ids):
+        sub  = envs[envs['cluster_id'] == cid]
+        lbl  = label_map.get(cid, f'Cluster {cid}')
+        nenv = nenv_map.get(cid, len(sub))
+        color = PALETTE_25[cid % len(PALETTE_25)]
+
+        fig.add_trace(go.Scatter(
+            x=sub['x'], y=sub['y'],
+            mode='markers',
+            marker=dict(
+                color=color, size=11, opacity=0.92,
+                line=dict(color='black', width=1.2),
+            ),
+            name=f'{lbl} ({nenv} env)',
+            hovertext=[
+                f'<b>{row.get("bill_title", "")}</b><br>'
+                f'GC {int(row["general_court"])} · 🌿 environmental<br>'
+                f'Cluster: {lbl}<br>'
+                f'Score: {float(row.get("env_relevance_score", float("nan"))):.3f}'
+                for _, row in sub.iterrows()
+            ],
+            hoverinfo='text',
+            showlegend=True,
+            legendgroup='env',
+            legendgrouptitle=dict(text='Environmental bills by cluster') if i == 0 else dict(text=''),
+        ))
+
+    fig.update_layout(
+        title=dict(
+            text=(
+                'MA Lobbying Bills — Environmental Bills in the Policy Landscape'
+                f'<br><sup>Coloured = environmentally-relevant bills ({len(envs)}) · '
+                f'grey = background sample ({len(bg):,} non-env) · '
+                'colour = topic cluster · hover for details</sup>'
+            ),
+            font=dict(size=13),
+        ),
+        xaxis=dict(visible=False),
+        yaxis=dict(visible=False),
+        legend=dict(
+            font=dict(size=10),
+            itemsizing='constant',
+            tracegroupgap=8,
+        ),
+        margin=dict(l=10, r=10, t=70, b=10),
+        width=880,
+        height=600,
+        plot_bgcolor='#f8f8f8',
+        paper_bgcolor='white',
+        hovermode='closest',
+    )
+
+    OUT_HTML.parent.mkdir(parents=True, exist_ok=True)
+    html = fig.to_html(full_html=False, include_plotlyjs='cdn', config={'responsive': True})
+    OUT_HTML.write_text(
+        '{% raw %}\n' + html + '\n{% endraw %}\n',
+        encoding='utf-8',
+    )
+    print(f'Wrote {OUT_HTML}')
+
+
+if __name__ == '__main__':
+    main()
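
The perplexity choice in `main()` can be read in isolation: it takes √N of the subsample and clamps the result to [20, 80]. A minimal sketch follows. The helper name `clamped_perplexity` is hypothetical and does not exist in the script.

```python
import math


def clamped_perplexity(n_points: int, lo: int = 20, hi: int = 80) -> int:
    """Return int(sqrt(n_points)) clamped to [lo, hi], mirroring the expression in main()."""
    return max(lo, min(hi, int(math.sqrt(n_points))))


# About 3,300 points (the docstring's estimate) gives a perplexity of 57.
print(clamped_perplexity(3300))  # 57
```

The clamp only matters at the extremes: a subsample under 400 points is raised to 20, and one over 6,400 points is capped at 80.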