Skip to content

docs(cloud-setup): fix §4 federation runbook — split bucket policy, enforce tag presence, drop §3 inline grant, cross-machine §4.5 guidance#67

Closed
hanwencheng wants to merge 1 commit intomainfrom
docs/fix-cloud-setup-s4.4-policy-split
Closed

docs(cloud-setup): fix §4 federation runbook — split bucket policy, enforce tag presence, drop §3 inline grant, cross-machine §4.5 guidance#67
hanwencheng wants to merge 1 commit intomainfrom
docs/fix-cloud-setup-s4.4-policy-split

Conversation

@hanwencheng
Copy link
Copy Markdown
Member

@hanwencheng hanwencheng commented May 6, 2026

Summary

Four interrelated fixes to docs/cloud-setup.md §4 (OIDC federation). End-to-end §4.5 against the deployed prod broker silently passed broad-access tests because three runbook bugs combined to mask each other. This PR addresses the doc-side bugs; broker code fix (the show-stopper) is out of scope and will be filed separately.

What this PR fixes

1. §4.4 bucket policy — split into two statements (originally MalformedPolicy)

Old: one statement combining s3:GetObject + s3:ListBucket under one s3:prefix condition. AWS rejected at policy-save time:

MalformedPolicy: Conditions do not apply to combination of actions and resources in statement

s3:prefix is a request-time condition that only applies to s3:ListBucket (the prefix filter on listings). s3:GetObject doesn't carry a prefix parameter, so combining the two under one s3:prefix condition is semantically incoherent.

Fix: split into AllowDaemonListOwnPrefix (ListBucket + s3:prefix StringLike) and AllowDaemonGetOwnObjects (GetObject; resource ARN itself enforces the prefix via ${aws:PrincipalTag/...} expansion).

Also changed StringEquals "${tag}/"StringLike "${tag}/*" so the daemon can list sub-prefixes (<wallet>/inbox/, <wallet>/sent/2026-05/), not just the root. Matches the shape already in docs/spec/ses-email-architecture.md §10.4 and wiki/tag-based-access.

2. §4.3 trust policy — Null operator instead of StringNotEquals: ""

Old:

"StringNotEquals": { "aws:RequestTag/agentkeys_user_wallet": "" }

Looks like it requires the tag to be non-empty. Doesn't. AWS IAM evaluates negated string operators on missing context keys as TRUE ("the missing key is not equal to anything"). So a JWT carrying no AWS tags claim silently bypasses the check — assume-role-with-web-identity succeeds, the session has no PrincipalTag, and the bucket policy's tag-based scoping has nothing to evaluate against.

Fix:

"Null": { "aws:RequestTag/agentkeys_user_wallet": "false" }

Null: \"false\" means "the key MUST exist (be not-null)" — actually rejects sessions where the tag isn't set.

3. New §4.4.1 — strip the §3 broad-bucket grant from agentkeys-data-role-inline

§3.2's inline policy grants s3:GetObject + s3:ListBucket on the entire bucket — necessary in the static-IAM path (no PrincipalTag to scope on) but fatal in §4: IAM is union-of-allows, so this identity-based grant overrides §4.4's bucket-policy isolation. §4.5's 4b test silently succeeds instead of correctly returning AccessDenied — federation appears to work while the cloud is enforcing nothing.

The runbook moves §3 → §4 but never tells the operator to delete or rewrite this inline policy. New section adds the cleanup commands (preserve ses:SendRawEmail; drop the broad S3 grant) and a verification step.

4. §4.5 — cross-machine guidance + base64url-aware decoder

Two doc bugs in the original §4.5 that bit me running it for real:

(a) Conflated environments. Step 1 hits 127.0.0.1:8090 (operator's localhost) and step 2 hits \$OIDC_ISSUER (prod broker). These don't share session state — the prod broker validates against its own local backend on the EC2 host. The runbook reads as if everything is local; it isn't.

Fix: split into Part A ("on the broker host — mint the JWT") and Part B ("on your operator workstation — assume role + verify"). Explicit SSH preamble; the operator copies the JWT across.

(b) @base64d doesn't handle base64url. JWT segments are base64url-encoded per RFC 7515 (-/_ instead of +//, no = padding). jq's @base64d is strict base64. Most JWT payloads happen to not contain - or _ in their base64 representation, so it works by accident — until it doesn't, with the cryptic error jq: error (at <stdin>:1): Malformed BOM (while parsing '��').

Fix: replace the decoder with one that converts base64url → base64 + adds padding before @base64d:

gsub(\"-\";\"+\") | gsub(\"_\";\"/\") | . + (\"=\" * ((4 - length % 4) % 4)) | @base64d | fromjson

5. New diagnostic guide for §4.5 intermediate states

Added a "Diagnosing intermediate states" subsection that maps observed outcomes to root causes:

  • Both 4a and 4b succeed → §4.4.1 wasn't applied (inline policy still masking)
  • Both 4a and 4b deny → broker isn't emitting https://aws.amazon.com/tags claim (separate code bug, fix incoming)

This is the bug pattern that hid Stage 7's broken federation in the original runbook.

Out of scope

The show-stopper bug — crates/agentkeys-broker-server/src/handlers/oidc.rs:106-113 doesn't emit the https://aws.amazon.com/tags claim — is a code change, not a doc fix. To be filed in a separate PR. Until that lands and is deployed, the corrected §4.5 will see both 4a and 4b deny (the documented intermediate state). The doc fixes here remove the masking layers so that bug actually surfaces.

Verification

  • Both jq snippets validate against jq -n locally (well-formed JSON output, \${aws:PrincipalTag/...} placeholders preserved).
  • Reproduced the original errors in production: MalformedPolicy from the §4.4 combined statement, Malformed BOM from @base64d on a real JWT, both list-objects-v2 calls succeeding because of the inline policy, JWT decode confirming no AWS tags claim.

Test plan

  • Apply §4.4 corrected put-bucket-policy against a fresh bucket — succeeds.
  • Apply §4.3 corrected update-assume-role-policy — succeeds.
  • Apply §4.4.1 inline-policy strip — Statement[*].Action returns only ses:SendRawEmail.
  • Run §4.5 Part A on broker host, Part B on workstation, with current broker (no tags claim): both 4a and 4b → AccessDenied.
  • After broker code fix lands + deploys: 4a → success (empty list), 4b → AccessDenied.

🤖 Generated with Claude Code

@hanwencheng hanwencheng force-pushed the docs/fix-cloud-setup-s4.4-policy-split branch from 0be90ec to afd7213 Compare May 7, 2026 06:28
@hanwencheng hanwencheng changed the title docs(cloud-setup): split §4.4 bucket policy into ListBucket + GetObject statements docs(cloud-setup): fix §4 federation runbook — split bucket policy, enforce tag presence, drop §3 inline grant, cross-machine §4.5 guidance May 7, 2026
… Null operator, §4.4.1 inline cleanup, §4.5 cross-machine guidance)
@hanwencheng
Copy link
Copy Markdown
Member Author

Consolidating with #68 into a single PR for the Stage 7 federation fix (docs + broker code together — same conceptual change).

@hanwencheng hanwencheng closed this May 7, 2026
@hanwencheng hanwencheng deleted the docs/fix-cloud-setup-s4.4-policy-split branch May 7, 2026 07:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant