Skip to content

fix(stage 7): federation isolation — silent-pass bug, doc + broker code together#69

Merged
hanwencheng merged 2 commits intomainfrom
fix/stage7-federation-isolation
May 7, 2026
Merged

fix(stage 7): federation isolation — silent-pass bug, doc + broker code together#69
hanwencheng merged 2 commits intomainfrom
fix/stage7-federation-isolation

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

Supersedes #67 (docs) and #68 (broker code) — consolidating into one PR because the bugs they each fix are interdependent and the federation only works when both land together.

TL;DR

docs/cloud-setup.md §4 (OIDC federation) had three independent bugs that combined to produce a silent false pass: end-to-end §4.5 returned success on both the own-prefix and foreign-prefix list calls, even though the runbook reads as if AWS is enforcing per-tenant isolation. This PR fixes all three layers.

Reproduction (current main, before this PR)

$ aws sts get-caller-identity
# → assumed-role/agentkeys-data-role/fed-proof-...

$ aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "$WALLET/"
{ "RequestCharged": null, "Prefix": "0xe5cc..." }    # success

$ aws s3api list-objects-v2 --bucket "$BUCKET" --prefix "0xdeadbeef/"
{ "RequestCharged": null, "Prefix": "0xdeadbeef/" }  # success — should be AccessDenied

4b should be AccessDenied. It isn't. The cloud is enforcing nothing.

Three bugs, in the order you'd discover them

Bug A: §3 inline policy agentkeys-data-role-inline masks the bucket policy

agentkeys-data-role-inline (from §3.2) grants the role broad s3:GetObject + s3:ListBucket on the entire bucket — necessary in the static-IAM path but fatal here: IAM is union-of-allows, so this identity-based grant overrides §4.4's bucket-policy isolation. As long as it's attached, the bucket policy's PrincipalTag scoping never matters.

The runbook moves §3 → §4 but never tells the operator to delete or rewrite this. Doc-only fix.

Bug B: §4.3 trust policy uses StringNotEquals: "" instead of Null: "false"

The original snippet:

"StringNotEquals": { "aws:RequestTag/agentkeys_user_wallet": "" }

Looks like it requires the tag to be non-empty. Doesn't. AWS IAM evaluates negated string operators on missing context keys as TRUE ("the missing key is not equal to anything"). So a JWT carrying no AWS tags claim silently bypasses the check — assume-role-with-web-identity succeeds, the session has no PrincipalTag, and the bucket policy's tag-based scoping has nothing to evaluate against.

Correct enforcement uses Null:

"Null": { "aws:RequestTag/agentkeys_user_wallet": "false" }

"the key MUST exist" — actually rejects sessions where the tag isn't set. Doc-only fix.

Bug C (show-stopper): broker doesn't emit https://aws.amazon.com/tags

crates/agentkeys-broker-server/src/handlers/oidc.rs:106-113 builds JWT claims as:

let claims = json!({
    "iss": state.config.oidc_issuer,
    "sub": format!("agentkeys:agent:{}", session.wallet),
    "aud": "sts.amazonaws.com",
    "iat": now,
    "exp": exp,
    "agentkeys_user_wallet": session.wallet,
});

agentkeys_user_wallet appears as a top-level claim only. AWS does not auto-promote arbitrary OIDC claims to session tags — it specifically requires the magic-named https://aws.amazon.com/tags claim with a principal_tags (and optionally transitive_tag_keys) shape. Without it, no PrincipalTag is set on assumed sessions, and ${aws:PrincipalTag/agentkeys_user_wallet} in the bucket policy expands to empty.

Spec: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_session-tags.html#oidc-session-tags

Code fix in mint_oidc_jwt:

"https://aws.amazon.com/tags": {
    "principal_tags": {
        "agentkeys_user_wallet": [session.wallet],
    },
    "transitive_tag_keys": ["agentkeys_user_wallet"],
}

transitive_tag_keys ensures the tag persists across role chaining.

Why all three bugs hid in plain sight

The bugs are sequenced such that each masks the next:

  • Bug C alone → session has no tag → bucket policy can't enforce → 4b should fail.
  • Bug B keeps assume-role from failing fast at STS time, which would have surfaced Bug C.
  • Bug A makes the bucket policy irrelevant — the inline policy grants broad access → 4b succeeds, masking both Bugs B and C.

Net result: end-to-end §4.5 prints what looks like success. Without explicitly testing 4b against a foreign prefix, an operator would ship Stage 7 thinking federation isolation works.

What this PR does

Commit 1 — docs(cloud-setup): fix §4 federation runbook

docs/cloud-setup.md only:

  • §4.3 trust policy: Null: "false" instead of StringNotEquals: "". Explanatory paragraph on why negated string operators don't work for tag-presence enforcement.
  • §4.4 bucket policy: split combined s3:GetObject + s3:ListBucket statement (which AWS rejects with MalformedPolicy) into AllowDaemonListOwnPrefix (ListBucket + s3:prefix StringLike) and AllowDaemonGetOwnObjects (GetObject; resource ARN itself enforces the prefix). StringLike "${tag}/*" instead of StringEquals "${tag}/" so sub-prefixes work.
  • §4.4.1 (new) strip agentkeys-data-role-inline's broad-bucket grant. Preserves any non-S3 statements (ses:SendRawEmail); falls back to delete-role-policy if there are none.
  • §4.5 split into Part A (broker host: mint JWT) and Part B (operator workstation: assume role + verify). Explicit env-var scope callout listing which workstation-side vars don't exist on the broker host. Comment headers in Part A's bash block (# === Run on your operator workstation === / # === The rest runs inside the SSH session ===). Base64url-aware JWT decoder (replaces @base64d which fails with Malformed BOM when payload base64 happens to contain - or _). Diagnostic guide mapping observed outcomes to root causes.

Diff: 1 file, +122/-15.

Commit 2 — fix(broker): emit https://aws.amazon.com/tags claim

crates/agentkeys-broker-server/src/handlers/oidc.rs:

  • mint_oidc_jwt adds the https://aws.amazon.com/tags claim with principal_tags.agentkeys_user_wallet = [session.wallet] and transitive_tag_keys = ["agentkeys_user_wallet"].
  • Discovery doc claims_supported advertises https://aws.amazon.com/tags.
  • Inline comment block above the claims construction explains why this magic-named claim is required.

crates/agentkeys-broker-server/tests/oidc_flow.rs:

  • mint_oidc_jwt_signs_claims_for_session_wallet asserts the JWT carries the AWS tags claim with the wallet under principal_tags and transitive_tag_keys populated. Bug-regression guard.
  • discovery_returns_aws_compatible_shape asserts claims_supported includes the tags claim.

cargo test -p agentkeys-broker-server → all 9 + 6 + 0 tests pass.

Test plan

  • cargo test -p agentkeys-broker-server locally — all green.
  • Both jq snippets in §4.3 + §4.4 validate via jq -n (well-formed JSON, ${aws:PrincipalTag/...} placeholders preserved).
  • After merge + broker deploy: re-run §4.5 against the live broker. Expect 4a → success (empty list, no error), 4b → AccessDenied. Decoded JWT should show https://aws.amazon.com/tags.principal_tags.agentkeys_user_wallet populated.

Deploy notes

After merge, the broker host needs a rebuild + restart for the JWT shape to actually change:

ssh agentkey@$BROKER_HOST
cd ~/agentKeys
git pull
cargo build --release -p agentkeys-broker-server
sudo cp target/release/agentkeys-broker-server /usr/local/bin/
sudo systemctl restart agentkeys-broker
sudo journalctl -u agentkeys-broker -n 20    # verify clean restart
exit

Then re-run §4.5 from the runbook.

Related

🤖 Generated with Claude Code

@hanwencheng hanwencheng merged commit c7b7f01 into main May 7, 2026
1 check passed
hanwencheng added a commit that referenced this pull request May 7, 2026
PHASE-0-CHECKPOINT.md covers Phase 0 in isolation against localhost.
This guide is the production equivalent — full Stage 7 (Phases 0 +
A.1 + A.2 + B + C-structural + D-rest + E) running on a real EC2
broker host with the AWS account from cloud-setup.md.

Sections walk an operator through:
- Two-machine layout (operator workstation vs broker host) with
  inline === ON … === banners on every command block.
- Prerequisites checklist (cloud-setup.md §0–4 done, broker host
  bootstrapped, two cast-generated test wallets).
- /healthz + /readyz + OIDC discovery + JWKS + IAM-side OIDC provider
  cross-checks (with the byte-for-byte issuer match invariant).
- SIWE wallet auth round-trip for both wallets, signing with
  cast wallet sign (no --no-hash).
- /v1/mint-oidc-jwt → AssumeRoleWithWebIdentity manual path,
  decoding the https://aws.amazon.com/tags claim.
- Cloud-enforced isolation proof (the climax): wallet A reads its
  own prefix; wallet B's prefix returns AccessDenied from S3 itself,
  not app code. Includes the diagnostic-state runbook for both
  failure modes (own-prefix denied → JWT missing tag claim;
  other-prefix succeeds → cloud-setup.md §4.4.1 not applied; this is
  the silent-pass bug PR #69 fixed at the broker layer).
- /v1/mint-aws-creds the daemon path with audit_record_id +
  anchored fields.
- Capability grants (create / list / revoke), wallet linking +
  unauthenticated recover/lookup, email-link + OAuth2/Google flows.
- Audit log inspection (sqlite plugin_mint_log columns explained).
- Phase C EVM anchor (structural-only in v0; live alloy lands in
  V0.1-FOLLOWUPS hardening).
- Prometheus metrics + Idempotency-Key (hit/miss/422 cases).
- harness/stage-7-issue-64-done.sh as the programmatic gate.
- Failure-mode walk-through: BOOT_FAIL anchor table,
  InvalidIdentityToken triage, AccessDenied-on-own-prefix,
  24h-clean-exit + Restart=always.
- 'What's intentionally not yet live' section pointing at
  V0.1-FOLLOWUPS.md so operators know which structural features
  ship as stubs (live EVM anchor, TEE signer, fail-closed grants
  default, latency histograms).

860 lines. All 6 cross-referenced files exist (verified).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant